Ingesting documents using multiple ingestion pipelines

    Publication No.: US10572547B2

    Publication date: 2020-02-25

    Application No.: US16263248

    Filing date: 2019-01-31

    Abstract: A primary ingestion pipeline configured for use in natural language processing includes annotators configured for annotating documents. The annotators and documents to be annotated are evaluated. Based on the evaluations, an ingestion risk score is generated for each document. Each ingestion risk score represents a likelihood that an associated document will not successfully be annotated by the annotators. Each ingestion risk score is compared to a set of risk criteria. Based on the comparisons, a determination is made that each document of a first set of documents satisfies the set of risk criteria. A further determination is made, based on the comparisons, that each document of a second set of documents does not satisfy the set of risk criteria. In response to these determinations, the first set of documents is entered into the primary ingestion pipeline and the second set of documents is provided special handling.
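
The routing step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Document` and `Annotator` classes, the scoring heuristic (share of non-printable characters plus a length penalty relative to the most restrictive annotator), and the `threshold` value standing in for the "set of risk criteria".

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


@dataclass
class Annotator:
    name: str
    max_chars: int  # hypothetical: input size the annotator handles reliably


def ingestion_risk_score(doc, annotators):
    """Return a value in [0, 1]; higher means annotation is more likely to fail.

    Hypothetical heuristic, not the patented method. Expects a non-empty
    list of annotators.
    """
    if not doc.text.strip():
        return 1.0  # nothing to annotate
    garbled = sum(
        1 for ch in doc.text if not ch.isprintable() and not ch.isspace()
    )
    garble_ratio = garbled / len(doc.text)
    # The most restrictive annotator sets the effective length limit.
    limit = min(a.max_chars for a in annotators)
    length_risk = min(len(doc.text) / limit, 1.0) * 0.5
    return min(garble_ratio + length_risk, 1.0)


def route(docs, annotators, threshold=0.6):
    """Split docs into (primary pipeline, special handling) by risk score."""
    primary, special = [], []
    for doc in docs:
        score = ingestion_risk_score(doc, annotators)
        # A score at or below the threshold satisfies the risk criteria.
        (primary if score <= threshold else special).append(doc)
    return primary, special


if __name__ == "__main__":
    annotators = [Annotator("ner", 1000), Annotator("parser", 500)]
    docs = [
        Document("a", "Short clean sentence."),
        Document("b", ""),
        Document("c", "\x00\x01\x02\x03"),
    ]
    primary, special = route(docs, annotators)
    print([d.doc_id for d in primary], [d.doc_id for d in special])
```

A single numeric threshold is the simplest possible form of "risk criteria"; a fuller version might combine several per-annotator checks, and the special-handling branch could feed a secondary pipeline or a manual review queue.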

    INGESTING DOCUMENTS USING MULTIPLE INGESTION PIPELINES

    Publication No.: US20190163706A1

    Publication date: 2019-05-30

    Application No.: US16263248

    Filing date: 2019-01-31

    IPC classification: G06F16/93 G06F17/24 G06F16/33

    Abstract: A primary ingestion pipeline configured for use in natural language processing includes annotators configured for annotating documents. The annotators and documents to be annotated are evaluated. Based on the evaluations, an ingestion risk score is generated for each document. Each ingestion risk score represents a likelihood that an associated document will not successfully be annotated by the annotators. Each ingestion risk score is compared to a set of risk criteria. Based on the comparisons, a determination is made that each document of a first set of documents satisfies the set of risk criteria. A further determination is made, based on the comparisons, that each document of a second set of documents does not satisfy the set of risk criteria. In response to these determinations, the first set of documents is entered into the primary ingestion pipeline and the second set of documents is provided special handling.

    GENERATING NATURAL LANGUAGE TEXT SENTENCES AS TEST CASES FOR NLP ANNOTATORS WITH COMBINATORIAL TEST DESIGN
    4.
    Invention application
    Status: Granted

    Publication No.: US20160170972A1

    Publication date: 2016-06-16

    Application No.: US14572691

    Filing date: 2014-12-16

    IPC classification: G06F17/28 G06F17/27 G06F17/24

    Abstract: Test cases for a text annotator are generated by determining types of inputs to the annotator and analyzing language structures in a corpus to identify sentence types and grammar constructs. An input type can correspond to multiple grammar constructs. Test cases are generated by performing grammar tree transformations on selected fragments from the corpus based on the sentence types and the grammar constructs. Additional test cases are generated by replacing starting phrases in selected fragments with substitute phrases from dictionaries associated with the input types (a dictionary can include a false synonym for an input type for purposes of negative testing). The two generating approaches can be combined, i.e., performing one or more successive (different) grammar tree transformations to yield a sentence which is then subjected to phrase substitution.
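
The phrase-substitution approach from this abstract, including the false synonym used for negative testing, might look like the sketch below. The function name, the dictionary layout (substitute phrase mapped to whether the annotator is expected to annotate it), and the medication example are hypothetical; the grammar tree transformations are not covered here.

```python
def generate_test_cases(fragment, start_phrase, dictionary):
    """Build test sentences by swapping the starting phrase of a fragment.

    `dictionary` maps each substitute phrase to a bool: True for a genuine
    member of the input type (positive test), False for a false synonym
    (negative test). Returns a list of (sentence, should_annotate) pairs.
    """
    if not fragment.startswith(start_phrase):
        raise ValueError("fragment does not begin with the starting phrase")
    rest = fragment[len(start_phrase):]
    return [(phrase + rest, expected) for phrase, expected in dictionary.items()]


if __name__ == "__main__":
    # Hypothetical dictionary for a "medication" input type.
    medication_dictionary = {
        "Ibuprofen": True,
        "Acetaminophen": True,
        "Aspiration": False,  # false synonym: resembles "Aspirin" but is not a drug
    }
    cases = generate_test_cases(
        "Aspirin was prescribed to the patient.", "Aspirin", medication_dictionary
    )
    for sentence, should_annotate in cases:
        print(should_annotate, sentence)
```

Pairing each generated sentence with its expected outcome lets positive and negative cases run through the same test harness.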

    Ingesting documents using multiple ingestion pipelines

    Publication No.: US10318591B2

    Publication date: 2019-06-11

    Application No.: US14728050

    Filing date: 2015-06-02

    Abstract: A primary ingestion pipeline configured for use in natural language processing includes annotators configured for annotating documents. The annotators and documents to be annotated are evaluated. Based on the evaluations, an ingestion risk score is generated for each document. Each ingestion risk score represents a likelihood that an associated document will not successfully be annotated by the annotators. Each ingestion risk score is compared to a set of risk criteria. Based on the comparisons, a determination is made that each document of a first set of documents satisfies the set of risk criteria. A further determination is made, based on the comparisons, that each document of a second set of documents does not satisfy the set of risk criteria. In response to these determinations, the first set of documents is entered into the primary ingestion pipeline and the second set of documents is provided special handling.

    INGESTING DOCUMENTS USING MULTIPLE INGESTION PIPELINES
    6.
    Invention application
    Status: Under examination (published)

    Publication No.: US20160359894A1

    Publication date: 2016-12-08

    Application No.: US14728050

    Filing date: 2015-06-02

    IPC classification: H04L29/06 G06F17/30 G06F17/24

    Abstract: A primary ingestion pipeline configured for use in natural language processing includes annotators configured for annotating documents. The annotators and documents to be annotated are evaluated. Based on the evaluations, an ingestion risk score is generated for each document. Each ingestion risk score represents a likelihood that an associated document will not successfully be annotated by the annotators. Each ingestion risk score is compared to a set of risk criteria. Based on the comparisons, a determination is made that each document of a first set of documents satisfies the set of risk criteria. A further determination is made, based on the comparisons, that each document of a second set of documents does not satisfy the set of risk criteria. In response to these determinations, the first set of documents is entered into the primary ingestion pipeline and the second set of documents is provided special handling.
