Language identification in multilingual text

Granted invention patent

US08635061B2 Language identification in multilingual text (in force)



Patent title: Language identification in multilingual text
Patent title (Chinese): 多语言文字中的语言识别
Application number: US12904642

Filing date: 2010-10-14
Publication number: US08635061B2

Publication date: 2014-01-21
Inventors: Kang Li, Stephen Allen Kloder, Ian George Johnson, Siarhei Alonichau
Applicants: Kang Li, Stephen Allen Kloder, Ian George Johnson, Siarhei Alonichau
Applicant address: US WA Redmond
Assignee: Microsoft Corporation
Current assignee: Microsoft Corporation
Current assignee address: US WA Redmond
Agent: Shook, Hardy & Bacon L.L.P.
Main classification: G06F17/20
IPC classification: G06F17/20; G06F17/27; G10L15/00


Abstract:

Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. Each section is identified and assigned a weight: more informative sections receive a higher weight and less informative sections a lower weight. A language likelihood score is determined for each word, phrase, or character n-gram in a section, and the scores within a section are combined for each language. The combined section scores are then summed to obtain a total document score for each language, and these document scores can be ranked to determine the primary language of the document.
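
The scoring flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the section kinds, the weights in SECTION_WEIGHTS, the toy per-word likelihoods in WORD_SCORES, and the function names are all hypothetical. The abstract does not say how scores within a section are combined or how the weight is applied, so the sketch assumes a plain sum per section multiplied by the section weight. Decoding and tag stripping are left out.

```python
from collections import defaultdict

# Hypothetical section weights: more informative sections count for more.
SECTION_WEIGHTS = {"title": 3.0, "body": 1.0, "footer": 0.5}

# Toy per-word language likelihoods (illustrative values only).
WORD_SCORES = {
    "the": {"en": 0.9},
    "language": {"en": 0.8},
    "le": {"fr": 0.9},
    "langue": {"fr": 0.8},
}


def score_section(text):
    """Combine per-word likelihoods into one score per language (assumed: sum)."""
    totals = defaultdict(float)
    for word in text.lower().split():
        for lang, score in WORD_SCORES.get(word, {}).items():
            totals[lang] += score
    return dict(totals)


def rank_languages(sections):
    """Sum weighted section scores per language and rank, highest first.

    `sections` is a list of (kind, plain_text) pairs.
    """
    document = defaultdict(float)
    for kind, text in sections:
        weight = SECTION_WEIGHTS.get(kind, 1.0)  # unknown kinds get weight 1.0
        for lang, score in score_section(text).items():
            document[lang] += weight * score
    return sorted(document.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    doc = [("title", "the language"), ("body", "le langue le langue")]
    for lang, score in rank_languages(doc):
        print(lang, round(score, 2))
```

In this toy document the French body has more matching words, but the English title carries three times the weight, so English ranks first. That is the effect the section weighting is meant to produce.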


Publication/grant documents

US20120095748A1 Language Identification in Multilingual Text, publication/grant date: 2012-04-19
