Language Identification in Multilingual Text
    Invention application
    Language Identification in Multilingual Text (status: granted)

    Publication number: US20120095748A1

    Publication date: 2012-04-19

    Application number: US12904642

    Filing date: 2010-10-14

    IPC classification: G06F17/20

    CPC classification: G06F17/275 G06F17/30864

    Abstract: Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. The sections are identified and assigned a weight, wherein more informative sections are given a higher weight and less informative sections are given a lesser weight. A language likelihood score is determined for each word, phrase, or character n-gram in a section. The language likelihood scores within a section are combined for each language. The combined section scores are then summed together to obtain a total document score for each language. This results in a document score for each language, which can be ranked to determine the primary language for the document.
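
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify the scoring model, the section types, or the weights, so everything here is an assumption made for illustration: the toy training sentences in `TRAINING`, the values in `SECTION_WEIGHTS`, the "first line is the title" rule in `split_sections`, the use of UTF-8 decoding as the universal coding, and the choice of add-one-smoothed character trigrams normalized per section.

```python
import math
from collections import Counter

# Hypothetical toy training text; a real system would use large corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat with the hat",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato se sentó en la alfombra",
}

# Hypothetical weights: more informative sections count more.
SECTION_WEIGHTS = {"title": 3.0, "body": 1.0}


def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and return all character n-grams."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_model(text):
    counts = Counter(char_ngrams(text))
    return counts, sum(counts.values())


MODELS = {lang: build_model(text) for lang, text in TRAINING.items()}
# Number of distinct trigrams seen across all languages (for smoothing).
VOCAB = len(set().union(*(counts for counts, _ in MODELS.values())))


def ngram_log_likelihood(gram, lang):
    """Add-one-smoothed log-likelihood of one n-gram under one language."""
    counts, total = MODELS[lang]
    return math.log((counts[gram] + 1) / (total + VOCAB + 1))


def section_scores(text):
    """Combine n-gram scores within a section into one score per language.

    Scores are normalized to sum to 1 across languages within the section.
    """
    grams = char_ngrams(text)
    if not grams:
        return {lang: 1.0 / len(MODELS) for lang in MODELS}
    mean_ll = {
        lang: sum(ngram_log_likelihood(g, lang) for g in grams) / len(grams)
        for lang in MODELS
    }
    peak = max(mean_ll.values())
    unnormalized = {lang: math.exp(ll - peak) for lang, ll in mean_ll.items()}
    norm = sum(unnormalized.values())
    return {lang: value / norm for lang, value in unnormalized.items()}


def split_sections(text):
    """Assumed layout: first non-empty line is the title, the rest is body."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    sections = [("title", lines[0])]
    if len(lines) > 1:
        sections.append(("body", " ".join(lines[1:])))
    return sections


def rank_languages(raw):
    """Decode, split, weight, score, sum, and rank languages for a document."""
    text = raw.decode("utf-8", errors="replace")
    totals = {lang: 0.0 for lang in MODELS}
    for kind, content in split_sections(text):
        weight = SECTION_WEIGHTS.get(kind, 1.0)
        for lang, score in section_scores(content).items():
            totals[lang] += weight * score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


mixed = "the lazy dog\nthe cat sat on the mat\nel gato se sentó en la alfombra".encode("utf-8")
ranking = rank_languages(mixed)
print(ranking[0][0])  # primary language of the mixed document
```

Because each section's scores are normalized before weighting, a short but heavily weighted title can outvote a longer body, which is one way to read the abstract's distinction between more and less informative sections.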
