Language identification in multilingual text
    1.
    Granted patent
    Language identification in multilingual text (status: in force)

    Publication No.: US08635061B2

    Publication date: 2014-01-21

    Application No.: US12904642

    Filing date: 2010-10-14

    IPC classification: G06F17/20 G06F17/27 G10L15/00

    CPC classification: G06F17/275 G06F17/30864

    Abstract: Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. The sections are identified and assigned a weight, wherein more informative sections are given a higher weight and less informative sections are given a lesser weight. A language likelihood score is determined for each word, phrase, or character n-gram in a section. The language likelihood scores within a section are combined for each language. The combined section scores are then summed together to obtain a total document score for each language. This results in a document score for each language, which can be ranked to determine the primary language for the document.
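
The scoring pipeline in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes character trigrams, add-one smoothing, an average log-likelihood as the per-section combination, and a weighted sum across sections; none of these choices is specified by the abstract. All function names, the training samples, and the section weights are hypothetical.

```python
import math
from collections import Counter, defaultdict

N = 3  # n-gram length; an assumption, the abstract does not fix n


def char_ngrams(text, n=N):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profile(sample):
    """Count the n-grams of a training sample for one language."""
    return Counter(char_ngrams(sample))


def section_score(text, profile, vocab_size):
    """Average smoothed log-likelihood of a section under one profile."""
    grams = char_ngrams(text)
    if not grams:
        return 0.0
    total = sum(profile.values())
    log_prob = sum(
        math.log((profile[g] + 1) / (total + vocab_size)) for g in grams
    )
    return log_prob / len(grams)


def rank_languages(sections, profiles):
    """Sum weighted section scores per language and rank, best first."""
    vocab = set().union(*profiles.values())
    scores = defaultdict(float)
    for text, weight in sections:
        for lang, profile in profiles.items():
            scores[lang] += weight * section_score(text, profile, len(vocab))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy profiles; a real system would train on large corpora.
profiles = {
    "en": train_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": train_profile("der schnelle braune fuchs springt und der hund sieht die katze"),
}

# (section text, weight): the title-like section is assumed more informative.
sections = [("the lazy dog", 3.0), ("der hund", 1.0)]

ranking = rank_languages(sections, profiles)
primary_language = ranking[0][0]
```

Because every language is scored over the same sections with the same weights, the weighted sums are directly comparable, and the first entry of the ranking is taken as the primary language.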

    Language Identification in Multilingual Text
    2.
    Patent application
    Language Identification in Multilingual Text (status: in force)

    Publication No.: US20120095748A1

    Publication date: 2012-04-19

    Application No.: US12904642

    Filing date: 2010-10-14

    IPC classification: G06F17/20

    CPC classification: G06F17/275 G06F17/30864

    Abstract: Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. The sections are identified and assigned a weight, wherein more informative sections are given a higher weight and less informative sections are given a lesser weight. A language likelihood score is determined for each word, phrase, or character n-gram in a section. The language likelihood scores within a section are combined for each language. The combined section scores are then summed together to obtain a total document score for each language. This results in a document score for each language, which can be ranked to determine the primary language for the document.

    Machine translation system using well formed substructures
    3.
    Granted patent
    Machine translation system using well formed substructures (status: lapsed)

    Publication No.: US5848385A

    Publication date: 1998-12-08

    Application No.: US562686

    Filing date: 1995-11-27

    IPC classification: G06F17/27 G06F17/28

    Abstract: Source language text from an input interface is broken down into source language morphemes by a morphological analyzer. A syntactic analyzer converts the morphemes into source language signs labelled with identifiers and data identifying other signs which are grammatically related. A bilingual equivalence transformer transforms the source language signs to target language signs which are combined by a combiner to provide a first attempt at a target language structure. The structure is repeatedly evaluated by an evaluator and transformed by a transformer. The signs of well formed substructures identified by the evaluator are not dissociated from each other by the transformer. This process ends when either the whole target language structure is evaluated as being well formed or all transformations have been unsuccessfully evaluated.
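
The evaluate-and-transform loop at the end of this abstract can be illustrated with a toy Python sketch. It is only an analogy under strong assumptions, not the patent's evaluator or transformer: signs are plain words, "well formed" is approximated by a hypothetical set of allowed adjacent pairs, and the transformer simply reorders chunks. The names `well_formed_chunks`, `generate`, and `allowed_pairs` are invented for this example.

```python
from itertools import permutations


def well_formed_chunks(signs, allowed_pairs):
    """Evaluator: split signs into maximal runs of allowed adjacent pairs."""
    chunks = []
    for sign in signs:
        if chunks and (chunks[-1][-1], sign) in allowed_pairs:
            chunks[-1].append(sign)
        else:
            chunks.append([sign])
    return chunks


def generate(signs, allowed_pairs):
    """Alternate evaluation and transformation until done or stuck.

    Returns (signs, True) if the whole structure is well formed, or
    (signs, False) if every reordering was tried without improvement.
    """
    signs = list(signs)
    while True:
        chunks = well_formed_chunks(signs, allowed_pairs)
        if len(chunks) <= 1:
            return signs, True
        # Transformer: reorder whole chunks only, so the signs inside a
        # well-formed chunk are never separated from each other.
        for order in permutations(chunks):
            candidate = [sign for chunk in order for sign in chunk]
            if len(well_formed_chunks(candidate, allowed_pairs)) < len(chunks):
                signs = candidate
                break
        else:
            return signs, False


# Hypothetical toy grammar: which sign may directly follow which.
allowed_pairs = {("the", "big"), ("big", "brown"), ("brown", "dog")}
result = generate(["brown", "the", "dog", "big"], allowed_pairs)
```

Each accepted transformation strictly reduces the number of chunks, so the loop terminates. Treating chunks as atomic units is what keeps already well-formed substructures intact between iterations.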
