Publication No.: US20060116862A1
Publication date: 2006-06-01
Application No.: US11001654
Filing date: 2004-12-01
Applicant: Jill Carrier, Alwin Carus, William Cote, John Dowd, Kathryn Femina, Alan Frankel, Wensheng Han, Larissa Lapshina, Bernardo Rechea, Ana Santisteban, Amy Uhrbach
Inventor: Jill Carrier, Alwin Carus, William Cote, John Dowd, Kathryn Femina, Alan Frankel, Wensheng Han, Larissa Lapshina, Bernardo Rechea, Ana Santisteban, Amy Uhrbach
IPC classification: G06F17/20
CPC classification: G06F17/277
Abstract: The present invention pertains to a system and method for the tokenization of text. A tokenizer may include a featurizer configured to receive input text and convert the input text into tokens. According to one aspect of the invention, each token may include only one type of character, selected from the group consisting of letters, numbers, and punctuation. The tokenizer may also include a classifier. The classifier may be configured to receive the tokens from the featurizer and to analyze them, using a preclassifier, to determine whether they may be input into a predetermined classification model. If a token passes the preclassifier, the token is classified using the predetermined classification model. Additionally, according to a first aspect of the invention, the tokenizer may also include a finalizer. The finalizer may be configured to receive the tokens and to produce a final output.
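The three stages named in the abstract (featurizer, classifier with preclassifier, finalizer) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the patent: the `Token` type, the function names, the rule that only a lone period passes the preclassifier, and `toy_model`, which merely stands in for the predetermined classification model.

```python
import re
from dataclasses import dataclass

# A token is a maximal run of one character type: letters, digits, or punctuation.
_RUN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]+")


@dataclass
class Token:
    text: str
    kind: str                     # "letter", "number", or "punct"
    label: str = "unclassified"   # set only if the token reaches the model


def featurize(text):
    """Featurizer: convert input text into single-character-type tokens."""
    tokens = []
    for match in _RUN.finditer(text):
        s = match.group()
        if s[0].isalpha():
            kind = "letter"
        elif s[0].isdigit():
            kind = "number"
        else:
            kind = "punct"
        tokens.append(Token(s, kind))
    return tokens


def preclassify(token):
    """Hypothetical gate: only a lone period is sent to the model."""
    return token.kind == "punct" and token.text == "."


def toy_model(tokens, i):
    """Stand-in model (assumption): a period between two numbers is a decimal point."""
    prev_num = i > 0 and tokens[i - 1].kind == "number"
    next_num = i + 1 < len(tokens) and tokens[i + 1].kind == "number"
    return "decimal_point" if prev_num and next_num else "sentence_end"


def classify(tokens, model):
    """Classifier: label only the tokens that pass the preclassifier."""
    for i, tok in enumerate(tokens):
        if preclassify(tok):
            tok.label = model(tokens, i)
    return tokens


def finalize(tokens):
    """Finalizer: produce the final output as (text, kind, label) tuples."""
    return [(t.text, t.kind, t.label) for t in tokens]


def tokenize(text, model=toy_model):
    return finalize(classify(featurize(text), model))
```

Calling `tokenize("Take 2.5 mg.")` splits the input into six tokens; the two periods are the only tokens that reach the stand-in model, and all others keep the default label. The toy model ignores whitespace between tokens, which a real model would likely need to consider.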