Graph based re-composition of document fragments for name entity recognition under exploitation of enterprise databases
    1.
    Granted invention patent
    Graph based re-composition of document fragments for name entity recognition under exploitation of enterprise databases
    Legal status: in force

    Publication number: US08229883B2

    Publication date: 2012-07-24

    Application number: US12413611

    Filing date: 2009-03-30

    IPC classification: G06F17/20 G06F17/30

    CPC classification: G06F17/30622

    Abstract: Methods and systems are described that involve recognizing complex entities from text documents with the help of structured data and Natural Language Processing (NLP) techniques. In one embodiment, the method includes receiving a document as input from a set of documents, wherein the document contains text or unstructured data. The method also includes identifying a plurality of text segments from the document via a set of tagging techniques. Further, the method includes matching the identified plurality of text segments against attributes of a set of predefined entities. Lastly, a best matching predefined entity is selected for each text segment from the plurality of text segments. In one embodiment, the system includes a set of documents, each document containing text or unstructured data. The system also includes a database storage unit that stores a set of predefined entities, wherein each entity contains a set of attributes. Further, the system includes a processor to identify a plurality of text segments from a document via a set of tagging techniques and to match the identified plurality of text segments against the set of attributes.
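
The three steps named in this abstract (identify text segments, match them against entity attributes, select the best-matching entity per segment) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Entity` schema, the punctuation-based `tag_segments` tagger, and the Jaccard token-overlap `score` are simple stand-ins, not the tagging techniques or the graph-based matching of the patent itself.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    """A predefined entity with named attributes (hypothetical schema)."""
    name: str
    attributes: dict


def tag_segments(text):
    """Toy tagger (assumption): split the document on punctuation and newlines."""
    parts = re.split(r"[.;,\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokens(s):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def score(segment, entity):
    """Jaccard overlap between segment tokens and all attribute-value tokens."""
    seg = tokens(segment)
    attr = set()
    for value in entity.attributes.values():
        attr |= tokens(str(value))
    if not seg or not attr:
        return 0.0
    return len(seg & attr) / len(seg | attr)


def best_matches(text, entities):
    """Return one (segment, best entity name or None, score) tuple per segment."""
    results = []
    for segment in tag_segments(text):
        best = max(entities, key=lambda e: score(segment, e), default=None)
        best_score = score(segment, best) if best else 0.0
        # A zero score means no attribute token matched, so report no entity.
        results.append((segment, best.name if best_score > 0 else None, best_score))
    return results


if __name__ == "__main__":
    entities = [
        Entity("customer_acme", {"company": "Acme Corp", "city": "Berlin"}),
        Entity("product_x200", {"product": "X200 pump", "vendor": "Hydro"}),
    ]
    for row in best_matches("Acme Corp in Berlin complained. The X200 pump failed", entities):
        print(row)
```

In a real deployment the entity list would presumably come from the database storage unit the abstract mentions; here it is an in-memory list so the sketch stays self-contained.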

    GRAPH BASED RE-COMPOSITION OF DOCUMENT FRAGMENTS FOR NAME ENTITY RECOGNITION UNDER EXPLOITATION OF ENTERPRISE DATABASES
    2.
    Invention application
    GRAPH BASED RE-COMPOSITION OF DOCUMENT FRAGMENTS FOR NAME ENTITY RECOGNITION UNDER EXPLOITATION OF ENTERPRISE DATABASES
    Legal status: in force

    Publication number: US20100250598A1

    Publication date: 2010-09-30

    Application number: US12413611

    Filing date: 2009-03-30

    IPC classification: G06F17/30

    CPC classification: G06F17/30622

    Abstract: Methods and systems are described that involve recognizing complex entities from text documents with the help of structured data and Natural Language Processing (NLP) techniques. In one embodiment, the method includes receiving a document as input from a set of documents, wherein the document contains text or unstructured data. The method also includes identifying a plurality of text segments from the document via a set of tagging techniques. Further, the method includes matching the identified plurality of text segments against attributes of a set of predefined entities. Lastly, a best matching predefined entity is selected for each text segment from the plurality of text segments. In one embodiment, the system includes a set of documents, each document containing text or unstructured data. The system also includes a database storage unit that stores a set of predefined entities, wherein each entity contains a set of attributes. Further, the system includes a processor to identify a plurality of text segments from a document via a set of tagging techniques and to match the identified plurality of text segments against the set of attributes.

    Systems and methods for modular information extraction
    3.
    Granted invention patent
    Systems and methods for modular information extraction
    Legal status: in force

    Publication number: US07987416B2

    Publication date: 2011-07-26

    Application number: US11939794

    Filing date: 2007-11-14

    IPC classification: G06F17/00

    CPC classification: G06F17/30864 G06F17/241

    Abstract: Embodiments of the present invention include a computer-implemented method of extracting information. In one embodiment, the present invention comprises defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators. Composite annotators may be created by specifying a composition of the reusable operators. Each operator may receive a searchable item, such as a web page or an annotation, and may generate one or more output annotations. The output annotations may be further processed by other reusable operators and the annotations may be stored in a repository for use during a search.
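
The idea of reusable operators that are composed into composite annotators can be sketched in a few lines of Python. The names below (`Annotation`, `sentence_splitter`, `year_finder`, `compose`, `annotate_and_store`) are hypothetical and chosen only for illustration; the abstract does not specify how operators or the repository are implemented, and a plain list stands in for the repository.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    """An output annotation: a label plus the text span it covers."""
    label: str
    text: str


# An operator takes the text of a searchable item and returns annotations.
Operator = Callable[[str], List[Annotation]]


def sentence_splitter(item: str) -> List[Annotation]:
    """Reusable operator: one 'sentence' annotation per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", item)
    return [Annotation("sentence", p.strip()) for p in parts if p.strip()]


def year_finder(item: str) -> List[Annotation]:
    """Reusable operator: one 'year' annotation per four-digit year (19xx or 20xx)."""
    return [Annotation("year", y) for y in re.findall(r"\b(?:19|20)\d{2}\b", item)]


def compose(first: Operator, second: Operator) -> Operator:
    """Composite annotator: run `second` on the text of each annotation from `first`."""
    def composite(item: str) -> List[Annotation]:
        out: List[Annotation] = []
        for ann in first(item):
            out.extend(second(ann.text))
        return out
    return composite


def annotate_and_store(item: str, annotator: Operator, repo: List[Annotation]) -> List[Annotation]:
    """Run an annotator and append its output to the repository (a list here)."""
    annotations = annotator(item)
    repo.extend(annotations)
    return annotations


if __name__ == "__main__":
    repository: List[Annotation] = []
    years_per_sentence = compose(sentence_splitter, year_finder)
    annotate_and_store("Filed in 2007. Granted in 2011! No date here.", years_per_sentence, repository)
    print(repository)
```

Because every operator shares the same signature, the output of one operator can be fed to another, which is what lets `compose` build larger annotators out of small single-purpose ones.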

    Systems and Methods for Modular Information Extraction
    4.
    Invention application
    Systems and Methods for Modular Information Extraction
    Legal status: in force

    Publication number: US20090125542A1

    Publication date: 2009-05-14

    Application number: US11939794

    Filing date: 2007-11-14

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/241

    Abstract: Embodiments of the present invention include a computer-implemented method of extracting information. In one embodiment, the present invention comprises defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators. Composite annotators may be created by specifying a composition of the reusable operators. Each operator may receive a searchable item, such as a web page or an annotation, and may generate one or more output annotations. The output annotations may be further processed by other reusable operators and the annotations may be stored in a repository for use during a search.