Granted invention patent
- Patent title: Language identification in multilingual text
- Patent title (Chinese): 多语言文字中的语言识别
- Application number: US12904642
- Filing date: 2010-10-14
- Publication (announcement) number: US08635061B2
- Publication (announcement) date: 2014-01-21
- Inventors: Kang Li, Stephen Allen Kloder, Ian George Johnson, Siarhei Alonichau
- Applicants: Kang Li, Stephen Allen Kloder, Ian George Johnson, Siarhei Alonichau
- Applicant address: US WA Redmond
- Patentee: Microsoft Corporation
- Current patentee: Microsoft Corporation
- Current patentee address: US WA Redmond
- Agency: Shook, Hardy & Bacon L.L.P.
- Main classification: G06F17/20
- IPC classification: G06F17/20; G06F17/27; G10L15/00
Abstract:
Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. Each section is identified and assigned a weight: more informative sections receive a higher weight and less informative sections a lower one. A language likelihood score is determined for each word, phrase, or character n-gram in a section, and the scores within a section are combined for each language. The combined section scores are then summed to obtain a total document score for each language, and the languages can be ranked by this score to determine the primary language of the document.
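The scoring flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say how scores are combined or how weights are applied, so this sketch assumes additive, non-negative scores per character trigram and multiplies each section's combined score by its weight. The names `TOY_MODEL`, `char_ngrams`, `section_scores`, and `rank_languages` are hypothetical, and the toy scores are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical toy model: character trigram -> {language: likelihood score}.
# Real scores would come from trained language models; these are made up.
TOY_MODEL = {
    "the": {"en": 1.0},
    "der": {"de": 1.0},
    "ing": {"en": 0.8, "de": 0.1},
}


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def section_scores(text, model, n=3):
    """Combine per-n-gram scores into one score per language for a section.

    Assumption: scores are combined by simple addition.
    """
    totals = defaultdict(float)
    for gram in char_ngrams(text, n):
        for lang, score in model.get(gram, {}).items():
            totals[lang] += score
    return totals


def rank_languages(sections, model, n=3):
    """Sum weighted section scores per language and rank the languages.

    sections is an iterable of (plain_text, weight) pairs, where the weight
    is assumed to reflect how informative the section is.
    """
    document = defaultdict(float)
    for text, weight in sections:
        for lang, score in section_scores(text, model, n).items():
            document[lang] += weight * score
    # Highest total first; the first entry is the primary language.
    return sorted(document.items(), key=lambda item: item[1], reverse=True)


# A heavily weighted section (e.g. body text) and a lightly weighted one.
sections = [("the", 2.0), ("der", 1.0)]
ranking = rank_languages(sections, TOY_MODEL)
primary_language = ranking[0][0]
```

With these toy inputs, the more heavily weighted English section dominates the total, so `primary_language` is `"en"`. A production system would also need a policy for n-grams missing from the model, which this sketch simply skips.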
Publication/grant documents
- US20120095748A1 Language Identification in Multilingual Text, publication/grant date: 2012-04-19