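What font-based anti-scraping looks like:
Take 58同城 (58.com) as an example. The text looks normal when rendered in the browser, but whether you select the element with F12 or view the page source directly, the same text appears as garbled characters in the source. That mismatch, shown in the screenshot below, is font-based anti-scraping.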


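How font-based anti-scraping works:
The page relies on a custom font. The source contains only a code for each character; the browser looks that code up in the font to find the real glyph and then renders it on the page.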
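A minimal sketch of why the source looks garbled, using only the standard library. The entity values below are hypothetical examples of what such a page might contain: each entity is an ordinary Unicode code point, and only the custom font makes it render as a digit.

```python
import html

# Hypothetical fragment of page source: two obfuscated characters followed by "00"
source_fragment = "&#x9fa4;&#x9476;00"

# Without the custom font, the entities decode to ordinary (unrelated) characters
decoded = html.unescape(source_fragment)

# The code points are what the font's cmap table is keyed on
codepoints = [hex(ord(ch)) for ch in decoded]
print(codepoints)
```

The code points, not the characters a default font would draw for them, are what need to be mapped back to digits.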
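How to break it:
Since the text is displayed correctly, the font must already have been loaded along with the page source. Find that font in the page and load it to recover the mapping.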
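Breaking the font anti-scraping on 58同城:
The page source of 58同城 contains a font-family declaration. This is a font that 58同城 designed itself, and it is embedded as base64. The actual encoded data is the part between the blue vertical lines in the screenshot.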
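As a rough sketch, the base64 payload can be pulled out of the page source with a regular expression and decoded to raw font bytes. The sample `@font-face` rule, the MIME type in it, and the helper name `extract_font_bytes` are assumptions for illustration; the real page may format the rule differently.

```python
import base64
import re

# Capture the run of base64 characters that follows "base64," in a data URI
FONT_RE = re.compile(r"base64,([A-Za-z0-9+/=]+)")

def extract_font_bytes(page_source):
    """Return the decoded font bytes, or None if no embedded font is found."""
    match = FONT_RE.search(page_source)
    if match is None:
        return None
    return base64.b64decode(match.group(1))

# Hypothetical, heavily truncated @font-face rule (only the first 12 bytes of a font)
sample = (
    "@font-face{font-family:'fangchan-secret';"
    "src:url('data:application/font-ttf;charset=utf-8;base64,AAEAAAALAIAAAwAw')"
    " format('truetype')}"
)
font_bytes = extract_font_bytes(sample)
```

Returning `None` when nothing matches lets the caller decide what to do when a page carries no embedded font, instead of failing on a missing match object.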

1 2 3 4 5 6 7 8 9 10 11 12 |
import base64 from fontTools.ttLib import TTFont #这个引入的是一个字体文件的对象 def generate_base_font(): font_face = "AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8Z/YQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQZyHXPAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QEQwahAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOFKrQZfDzz1AAsIAAAAAADbIJVfAAAAANsglV8AAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAABwAFAAIAAwAJAAQACgABAAYACAAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAHAACVjwAAlY8AAAAFAACZPAAAmTwAAAACAACaSwAAmksAAAADAACeOgAAnjoAAAAJAACeowAAnqMAAAAEAACfZAAAn2QAAAAKAACfkgAAn5IAAAABAACfpAAAn6QAAAAGAACfpQAAn6UAAAAIAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAA
CAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAD/EwB3AAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA" # 通过base64编码的字体信息 font_face = base64.b64decode(font_face) # 解码 with open("58.ttf", "wb") as fp: # 因为TTFont对象需要传入一个字体文件,不是字符串 # 创建一个.ttf的文件并且要以二进制的形式打开(b) fp.write(font_face) |
The generated .ttf file can be inspected with FontCreator.

Once the .ttf file exists, use TTFont to dump the font to an XML file:
def generate_xml_font():
    # Dump the .ttf file to an XML description of the font
    font = TTFont("58.ttf")
    font.saveXML("58.xml")
Next, parse the XML file.


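The XML records the mapping between each glyph name and its shape. Compare it with the glyph images in FontCreator: glyph00001 corresponds to 0, and so on. The goal is to work out which character in FontCreator each glyph name corresponds to.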

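Then find the code that each name maps to. This code is the value that appears in the page source. In other words, go from the code in the page to the name mapped to it, and from the name to the glyph shape. That gives the following mappings:
name <-> digit
code <-> name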
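The code <-> name relation can be read from the XML dump with the standard library. This is a sketch only: the XML snippet is a hypothetical, cut-down imitation of the cmap section that `saveXML` writes, and `read_code_to_name` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

def read_code_to_name(xml_text):
    """Collect {code point: glyph name} from every <map> element in the dump."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for entry in root.iter("map"):
        # The code attribute is a hex string such as "0x9476"
        mapping[int(entry.get("code"), 16)] = entry.get("name")
    return mapping

# Hypothetical excerpt of a dumped font, reduced to two entries
sample_xml = """
<ttFont>
  <cmap>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x9476" name="glyph00007"/>
      <map code="0x958f" name="glyph00005"/>
    </cmap_format_4>
  </cmap>
</ttFont>
"""
code_to_name = read_code_to_name(sample_xml)
```

When the font is already open in fontTools, `font.getBestCmap()` returns the same kind of dictionary without going through XML.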
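Note, however, that the relation between name and digit is not permanent: on the first fetch glyph1 may stand for 1, and on the next fetch it may stand for 2.
So add one more layer, name <-> shape (the shape is what the digit looks like). However the font changes between fetches, the shape of a character or digit stays the same, so this layer is stable. Use the shape to find the name, use the name to find the digit, and then find the relation between the digit and the code.
In short: find the relation between shape and digit and the relation between shape and code, then use the shape to tie digit and code together.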
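The chain described above can be sketched with plain dictionaries. Everything here is illustrative: shapes are stood in for by tuples of points, and the function name and toy data are assumptions. The idea is to label a reference font by hand once, then match freshly fetched fonts against it by shape.

```python
def build_code_to_digit(base_shapes, base_digits, new_cmap, new_shapes):
    """Join code -> name -> shape (fresh font) with shape -> digit (reference font).

    base_shapes: glyph name -> shape in the reference font
    base_digits: glyph name -> digit in the reference font (labelled by hand)
    new_cmap:    code point -> glyph name in the freshly fetched font
    new_shapes:  glyph name -> shape in the freshly fetched font
    """
    shape_to_digit = {base_shapes[name]: digit for name, digit in base_digits.items()}
    code_to_digit = {}
    for code, name in new_cmap.items():
        shape = new_shapes[name]
        if shape in shape_to_digit:  # skip glyphs that are not known digits
            code_to_digit[code] = shape_to_digit[shape]
    return code_to_digit

# Toy data: the fresh font uses the same two shapes under swapped names
base_shapes = {"glyph00001": ((0, 0), (1, 1)), "glyph00002": ((0, 0), (2, 0))}
base_digits = {"glyph00001": 0, "glyph00002": 1}
new_cmap = {0x9476: "glyph00002", 0x958F: "glyph00001"}
new_shapes = {"glyph00001": ((0, 0), (2, 0)), "glyph00002": ((0, 0), (1, 1))}

code_to_digit = build_code_to_digit(base_shapes, base_digits, new_cmap, new_shapes)
```

Because the lookup is keyed on shape, it keeps working even when both the names and the codes are reshuffled between fetches.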
import re
import base64

import requests
from fontTools.ttLib import TTFont


def request_web():
    url = "https://hz.58.com/chuzu/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d100000-0004-f4f9-a075-176fdd580451&ClickID=2"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
    }
    resp = requests.get(url=url, headers=headers)
    html = resp.text
    # Extract the base64-encoded font data from the page source
    base64_code = re.search(r"charset=utf-8;base64,(.*?)\'\)", html).group(1)
    return html, base64_code


def anti_html():
    html, base64_code = request_web()
    font_face = base64.b64decode(base64_code)
    with open("58.ttf", "wb") as fp:
        fp.write(font_face)
    font = TTFont("58.ttf")
    font.saveXML("58.xml")
    # All data under the glyf table, i.e. the shape of every glyph
    base_glyf = font["glyf"]
    # Mapping from each digit to its glyph shape
    num_glyf_map = {
        0: base_glyf["glyph00001"],
        1: base_glyf["glyph00002"],
        2: base_glyf["glyph00003"],
        3: base_glyf["glyph00004"],
        4: base_glyf["glyph00005"],
        5: base_glyf["glyph00006"],
        6: base_glyf["glyph00007"],
        7: base_glyf["glyph00008"],
        8: base_glyf["glyph00009"],
        9: base_glyf["glyph00010"],
    }
    # Mapping from code point to glyph name
    code_name_map = font.getBestCmap()
    for code, name in code_name_map.items():
        for number, shape in num_glyf_map.items():
            # Match on shape to connect each code with its digit
            if shape == base_glyf[name]:
                # getBestCmap returns integer code points, while the page uses
                # hexadecimal entities, so build the string to replace, e.g. "&#x9476;"
                codestr = f"&#x{code:x};"
                html = html.replace(codestr, str(number))
    with open("58.html", "w", encoding="utf-8") as fp:
        fp.write(html)


def main():
    anti_html()


if __name__ == '__main__':
    main()
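As an alternative to calling `html.replace` once per code, the substitution step can be done in a single pass with `re.sub`. This is a sketch under the assumption that the page writes the characters as hexadecimal entities; `replace_entities` and the sample mapping are illustrative names and values, not part of the original script.

```python
import re

# Hexadecimal numeric character references such as "&#x9476;"
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]+);")

def replace_entities(html_text, code_to_digit):
    """Swap every mapped entity for its digit and leave unmapped ones untouched."""
    def substitute(match):
        code = int(match.group(1), 16)
        digit = code_to_digit.get(code)
        return match.group(0) if digit is None else str(digit)
    return ENTITY_RE.sub(substitute, html_text)

# Hypothetical mapping and page fragment
cleaned = replace_entities(
    "<b>&#x9476;&#x9FA4;00</b> &#x4e2d;",
    {0x9476: 2, 0x9FA4: 5},
)
```

Parsing the hex value rather than comparing strings means upper-case and lower-case entities are treated the same way.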

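About font files:
A basic font file always contains the following tables:
cmap: mapping between Unicode code points and glyph names
head: global font information
hhea: horizontal header
hmtx: horizontal metrics
maxp: used to allocate memory for the font
name: font name, style name, copyright notice, and so on
glyf: glyph data, i.e. outline definitions and hinting instructions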
OS/2: OS/2 and Windows-specific metrics
post: PostScript information
Fonts with advanced typographic features also contain these tables:
BASE: Baseline data
GDEF: Glyph definition data
GPOS: Glyph positioning data
GSUB: Glyph substitution data
JSTF: Justification data
MATH: Math layout data
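To see which of these tables a given font actually contains, the table directory at the start of the file can be read with `struct` alone. This is a minimal sketch assuming a plain TrueType/OpenType (sfnt) file rather than a WOFF wrapper; `list_sfnt_tables` is an illustrative name, and the bytes in the usage lines are synthetic.

```python
import struct

def list_sfnt_tables(font_bytes):
    """Return {table tag: (offset, length)} from an sfnt table directory."""
    # Header: uint32 sfntVersion, uint16 numTables, then three uint16 search fields
    _version, num_tables = struct.unpack(">IH", font_bytes[:6])
    tables = {}
    for index in range(num_tables):
        start = 12 + 16 * index  # 12-byte header, 16 bytes per table record
        tag, _checksum, offset, length = struct.unpack(
            ">4sIII", font_bytes[start:start + 16]
        )
        tables[tag.decode("latin-1")] = (offset, length)
    return tables

# Synthetic directory with two records, just to exercise the parser
header = struct.pack(">IHHHH", 0x00010000, 2, 32, 1, 0)
records = struct.pack(">4sIII", b"cmap", 0, 44, 10) + struct.pack(">4sIII", b"glyf", 0, 56, 20)
tables = list_sfnt_tables(header + records)
```

With fontTools installed, `TTFont("58.ttf").keys()` gives the list of table tags without manual parsing.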
Some common font operations are covered at: https://blog.csdn.net/Obgo_6/article/details/101169682