Granted invention patent
- Patent title: Identifying a property of a document
- Patent title (Chinese): 识别文档的属性
- Application number: US11737603
- Filing date: 2007-04-19
- Publication number: US08380488B1
- Publication date: 2013-02-19
- Inventors: Xin Liu, Stewart Yang
- Applicants: Xin Liu, Stewart Yang
- Applicant address: US CA Mountain View
- Assignee: Google Inc.
- Current assignee: Google Inc.
- Current assignee address: US CA Mountain View
- Agent: Fish & Richardson P.C.
- Main classification: G06F17/28
- IPC classification: G06F17/28
Abstract:
Methods, systems and apparatus, including computer program products, for identifying properties of an electronic document. In one aspect, a sequence of bytes representing text in a document is received. A plurality of byte-n-grams are identified from the bytes. For multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings is identified. A respective encoding score for each of the multiple encodings is determined. A most likely encoding of the document is identified based on a highest encoding score among the encoding scores. In another aspect, a sequence of characters having an encoding is identified in a document. The sequence is segmented into features, each corresponding to two or more characters. A respective score for each of multiple languages is determined based on the features and a respective language model. A language of the document is identified based on the scores.
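The two aspects in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not say how likelihoods are estimated or combined, so this sketch assumes bigrams (n = 2), add-one-smoothed counts, and a score equal to the sum of log-likelihoods. The helper names (`ngrams`, `train`, `score`, `most_likely`), the `vocab` smoothing constant, and the toy training sentences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def ngrams(seq, n=2):
    """Return the overlapping n-grams of a bytes or str sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def train(samples, n=2):
    """Count n-grams over training sequences (a hypothetical model format)."""
    counts = Counter()
    for sample in samples:
        counts.update(ngrams(sample, n))
    return counts

def score(seq, counts, n=2, vocab=65536):
    """Sum the add-one-smoothed log-likelihoods of the n-grams in seq.

    vocab is an assumed smoothing constant (256 ** 2 possible byte bigrams).
    """
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + vocab))
               for g in ngrams(seq, n))

def most_likely(seq, models, n=2):
    """Return the label whose model yields the highest score for seq."""
    return max(models, key=lambda label: score(seq, models[label], n))

# Toy training data, invented for this example only.
JA_TEXT = ["これは日本語の文書です。", "日本語のテキストを書きます。"]

# One byte-bigram model per candidate encoding.
ENCODING_MODELS = {
    enc: train([text.encode(enc) for text in JA_TEXT])
    for enc in ("utf-8", "shift_jis", "euc_jp")
}

# One character-bigram model per candidate language
# (features of two characters each).
LANGUAGE_MODELS = {
    "en": train(["the quick brown fox jumps over the lazy dog",
                 "this is a document about the weather"]),
    "de": train(["der schnelle braune fuchs springt über den faulen hund",
                 "dies ist ein dokument über das wetter"]),
}

# First aspect: pick the encoding with the highest score for raw bytes.
raw = "日本語の文書".encode("shift_jis")
encoding = most_likely(raw, ENCODING_MODELS)
print(encoding, raw.decode(encoding))

# Second aspect: pick the language with the highest score for characters.
print(most_likely("the lazy dog", LANGUAGE_MODELS))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and add-one smoothing keeps an unseen n-gram from driving a score to negative infinity. A practical system would train on far larger corpora than the few sentences used here.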