-
Publication No.: US20090249182A1
Publication date: 2009-10-01
Application No.: US12059247
Filing date: 2008-03-31
Applicant: Beatrice Symington, Barry Haddow
Inventor: Beatrice Symington, Barry Haddow
IPC: G06F17/21
CPC classification number: G06F17/278
Abstract: There is disclosed a method of recognising named entities in a text-containing document, represented by text document data. The received text document data comprises a plurality of tokens, one or more of the said plurality of tokens being part of a plurality of entities. The text document data is analysed using one or more tagging modules which are operable to determine token label data in respect of at least the tokens which are part of a plurality of entities, wherein the token label data output by the one or more tagging modules comprises data representative of the location of the token within each of a plurality of entities. The token label data representative of the location of the token within each of a plurality of entities is used to determine the beginning and end of the entities which have been identified in the text document data. A plurality of tagging modules may be employed, each of which is adapted to determine token label data representative of the location of a token within a different subset of the entities represented by the text document data, wherein the token label data determined by the plurality of tagging modules together is representative of the location of the said token within a plurality of entities. Alternatively, a single tagging module may be employed which determines a compound tag selected from a group of compound tags, the group of compound tags including different tags in respect of a plurality of different combinations of the location of a respective token within a plurality of entities.
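The abstract describes per-token labels that record where a token sits inside each of several (possibly nested) entities, produced either by several tagging modules or by one module emitting compound tags. The following is a minimal sketch of that idea, not the patented implementation. It assumes a conventional B-/I-/O tagging scheme, one tag layer per tagging module, and a `|` separator for compound tags; the function names, the separator, and the example sentence are all hypothetical.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def decode_bio(tags: List[str]) -> List[Span]:
    """Recover entity spans from one layer of B-/I-/O tags (assumed scheme)."""
    spans: List[Span] = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def decode_layers(layers: List[List[str]]) -> List[Span]:
    """Combine spans from several tag layers, one layer per tagging module."""
    return [span for layer in layers for span in decode_bio(layer)]


def to_compound(layers: List[List[str]]) -> List[str]:
    """Merge the layers into one compound tag per token, e.g. 'I-ORG|B-LOC'."""
    return ["|".join(tags) for tags in zip(*layers)]


def from_compound(compound: List[str]) -> List[List[str]]:
    """Split compound tags back into their separate layers."""
    return [list(layer) for layer in zip(*(tag.split("|") for tag in compound))]


# Hypothetical example: "Edinburgh" is a location nested inside an organisation.
tokens = ["University", "of", "Edinburgh"]
outer = ["B-ORG", "I-ORG", "I-ORG"]
inner = ["O", "O", "B-LOC"]

print(decode_layers([outer, inner]))  # [(0, 3, 'ORG'), (2, 3, 'LOC')]
print(to_compound([outer, inner]))    # ['B-ORG|O', 'I-ORG|O', 'I-ORG|B-LOC']
```

With separate layers, each tagger only has to learn a small tag set; with compound tags, a single tagger sees every combination directly, at the cost of a larger tag set.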
-
Publication No.: US08495042B2
Publication date: 2013-07-23
Application No.: US12682648
Filing date: 2008-10-10
Applicant: Beatrice Symington
Inventor: Beatrice Symington
IPC: G06F17/30
CPC classification number: G06F17/30716
Abstract: Automatic information extraction apparatus for extracting data for review by a human curator from digital representations of documents comprising natural language text, the automatic information extraction apparatus having a plurality of selectable operating modes in which the automatic information extraction apparatus is operable to extract different data for review by a human curator. In the different operating modes, the information extraction apparatus may extract data with a different balance between recall and precision.
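One simple way to picture selectable operating modes with a different balance between recall and precision is a confidence threshold that changes with the mode. The sketch below is only an illustration under that assumption; the abstract does not say how the modes are implemented. The mode names, threshold values, `Candidate` type, and example items are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    text: str
    confidence: float  # hypothetical extractor score in [0, 1]


# Hypothetical modes: a lower threshold passes more items to the curator
# (favouring recall), a higher threshold passes fewer (favouring precision).
MODE_THRESHOLDS: Dict[str, float] = {
    "high_recall": 0.3,
    "balanced": 0.5,
    "high_precision": 0.8,
}


def extract_for_review(candidates: List[Candidate], mode: str) -> List[Candidate]:
    """Return the candidates a human curator would see in the given mode."""
    threshold = MODE_THRESHOLDS[mode]
    return [c for c in candidates if c.confidence >= threshold]


candidates = [
    Candidate("A interacts with B", 0.90),
    Candidate("C interacts with D", 0.60),
    Candidate("E interacts with F", 0.35),
]

for mode in MODE_THRESHOLDS:
    print(mode, [c.text for c in extract_for_review(candidates, mode)])
```

In this sketch the curator sees all three candidates in `high_recall` mode and only the highest-scoring one in `high_precision` mode.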
-
Publication No.: US20110099184A1
Publication date: 2011-04-28
Application No.: US12682648
Filing date: 2008-10-10
Applicant: Beatrice Symington
Inventor: Beatrice Symington
IPC: G06F17/30
CPC classification number: G06F17/30716
Abstract: Automatic information extraction apparatus for extracting data for review by a human curator from digital representations of documents comprising natural language text, the automatic information extraction apparatus having a plurality of selectable operating modes in which the automatic information extraction apparatus is operable to extract different data for review by a human curator. In the different operating modes, the information extraction apparatus may extract data with a different balance between recall and precision.