Patent application
- Patent title: INTELLIGENT SYSTEM THAT DYNAMICALLY IMPROVES ITS KNOWLEDGE AND CODE-BASE FOR NATURAL LANGUAGE UNDERSTANDING
- Application number: US14964512
- Filing date: 2015-12-09
- Publication number: US20160162466A1
- Publication date: 2016-06-09
- Inventors: Robert J. Munro, Rob Voigt, Schuyler D. Erle, Brendan D. Callahan, Gary C. King, Jessica D. Long, Jason Brenier, Tripti Saxena, Stefan Krawczyk
- Applicants: Robert J. Munro, Rob Voigt, Schuyler D. Erle, Brendan D. Callahan, Gary C. King, Jessica D. Long, Jason Brenier, Tripti Saxena, Stefan Krawczyk
- Applicant address: San Francisco, CA, US
- Patentee: Idibon, Inc.
- Current patentee: Idibon, Inc.
- Current patentee address: San Francisco, CA, US
- Main classification: G06F17/27
- IPC classification: G06F17/27
Abstract:
Systems, methods, and apparatuses are presented for a novel natural language tokenizer and tagger. In some embodiments, a method for tokenizing text for natural language processing comprises: generating from a pool of documents, a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents; receiving a set of rules comprising rules that identify character/letter sequences as valid tokens; transforming one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood; receiving a document to be processed; dividing the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and outputting the divided tokens for natural language processing.
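The tokenization method summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`build_model`, `promote_entries`, `tokenize`), the threshold value, the log-likelihood scoring, and the single-character fallback are all assumptions made for illustration. The statistical model here is simply the relative frequency of whitespace-delimited sequences in the pool, and the model is consulted to choose among segmentations when more than one is permitted by the rules.

```python
import math
from collections import Counter


def build_model(documents):
    """Hypothetical statistical model: relative frequency of each
    whitespace-delimited character sequence in the document pool."""
    counts = Counter(word for doc in documents for word in doc.split())
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def promote_entries(model, rules, threshold):
    """Turn high-likelihood model entries into new rules (valid tokens).
    The threshold is an assumed tuning parameter."""
    return set(rules) | {seq for seq, p in model.items() if p >= threshold}


def tokenize(text, rules, model, floor=1e-6, unknown=1e-9):
    """Divide unsegmented text into tokens.

    Only rule tokens (plus a heavily penalized single-character fallback)
    may appear in a segmentation. When the rules allow more than one
    segmentation, the one with the highest summed log-likelihood under
    the model is chosen.
    """
    max_len = max((len(r) for r in rules), default=1)
    # best[i] holds (score, tokens) for the best segmentation of text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in rules:
                score, tokens = best[start]
                gain = math.log(model.get(piece, floor))
                candidates.append((score + gain, tokens + [piece]))
        # Fallback: emit the last character alone with a large penalty.
        score, tokens = best[end - 1]
        candidates.append((score + math.log(unknown), tokens + [text[end - 1]]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]


# Toy document pool and hand-written starting rules (both invented here).
pool = ["now here", "now here", "no where", "now", "here"]
model = build_model(pool)
rules = promote_entries(model, {"no", "where"}, threshold=0.3)

print(tokenize("nowhere", rules, model))
```

In this toy run the rules alone admit both "no" + "where" and "now" + "here" for the input "nowhere". Because "now" and "here" are more frequent in the pool, the model resolves the ambiguity in favor of `['now', 'here']`.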