发明授权
US08380488B1 Identifying a property of a document 有权
识别文档的属性

Identifying a property of a document
摘要:
Methods, systems and apparatus, including computer program products, for identifying properties of an electronic document. In one aspect, a sequence of bytes representing text in a document is received. A plurality of byte-n-grams are identified from the bytes. For multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings is identified. A respective encoding score for each of the multiple encodings is determined. A most likely encoding of the document is identified based on a highest encoding score among the encoding scores. In another aspect, a sequence of characters, having an encoding, are identified in a document. The sequence is segmented into features, each corresponding to two or more characters. A respective score for each of multiple languages is determined based on the features and a respective language model. A language of the document is identified based on the scores.
信息查询
0/0