INTELLIGENT SYSTEM THAT DYNAMICALLY IMPROVES ITS KNOWLEDGE AND CODE-BASE FOR NATURAL LANGUAGE UNDERSTANDING
    Invention application
    Legal status: in force

    Publication number: US20160162466A1

    Publication date: 2016-06-09

    Application number: US14964512

    Filing date: 2015-12-09

    IPC classification: G06F17/27

    Abstract: Systems, methods, and apparatuses are presented for a novel natural language tokenizer and tagger. In some embodiments, a method for tokenizing text for natural language processing comprises: generating, from a pool of documents, a set of statistical models comprising one or more entries, each indicating a likelihood of appearance of a character/letter sequence in the pool of documents; receiving a set of rules comprising rules that identify character/letter sequences as valid tokens; transforming one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood; receiving a document to be processed; dividing the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and outputting the divided tokens for natural language processing.
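
The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that a "rule" is simply a string accepted as a valid token, that the statistical model is the relative frequency of whitespace-delimited sequences in the pool, and that "high likelihood" means exceeding a fixed threshold. The names POOL, build_model, promote, segmentations, score, and tokenize, and the threshold value 0.15, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical document pool used to build the statistical model.
POOL = [
    "now is the time",
    "the time is now",
    "nowhere is the answer",
    "here is the answer now",
]

def build_model(docs):
    """Map each whitespace-delimited sequence to its relative frequency in the pool."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}

def promote(model, rules, threshold):
    """Turn high-likelihood model entries into new rules (assumed form: plain strings)."""
    return set(rules) | {seq for seq, p in model.items() if p >= threshold}

def segmentations(chunk, vocab):
    """List every way to split chunk into pieces that all belong to vocab.
    Exhaustive recursion: fine for a sketch, exponential in the worst case."""
    if not chunk:
        return [[]]
    results = []
    for end in range(1, len(chunk) + 1):
        head = chunk[:end]
        if head in vocab:
            for rest in segmentations(chunk[end:], vocab):
                results.append([head] + rest)
    return results

def score(tokens, model, floor=1e-9):
    """Sum of log-likelihoods; sequences unseen in the pool get a small floor value."""
    return sum(math.log(model.get(tok, floor)) for tok in tokens)

def tokenize(text, rules, model):
    tokens = []
    for chunk in text.split():
        by_rules = segmentations(chunk, rules)
        if len(by_rules) == 1:
            # The rules tokenize this chunk unambiguously.
            tokens.extend(by_rules[0])
            continue
        # Rules are ambiguous (several splits) or silent (no split):
        # fall back to the statistical model to choose a split.
        candidates = by_rules or segmentations(chunk, model)
        if candidates:
            tokens.extend(max(candidates, key=lambda c: score(c, model)))
        else:
            tokens.append(chunk)  # nothing known about this chunk: keep it whole
    return tokens

model = build_model(POOL)
rules = promote(model, {"here", "nowhere", "time"}, threshold=0.15)
print(tokenize("nowhere isthe time", rules, model))
# prints ['nowhere', 'is', 'the', 'time']
```

In this run the entries "now", "is", and "the" exceed the threshold and become rules. The chunk "isthe" has exactly one rule-based split, so the rules decide it alone. The chunk "nowhere" has two ("nowhere" and "now" + "here"), so the statistical model breaks the tie in favor of the single token.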
