Abstract:
Test cases for a text annotator are generated by determining types of inputs to the annotator and analyzing language structures in a corpus to identify sentence types and grammar constructs. An input type can correspond to multiple grammar constructs. Test cases are generated by performing grammar tree transformations on selected fragments from the corpus based on the sentence types and the grammar constructs. Additional test cases are generated by replacing starting phrases in selected fragments with substitute phrases from dictionaries associated with the input types (a dictionary can include a false synonym for an input type for purposes of negative testing). The two generating approaches can be combined, i.e., performing one or more successive (different) grammar tree transformations to yield a sentence which is then subjected to phrase substitution.
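The phrase-substitution step described above can be illustrated with a minimal Python sketch. The dictionary contents, the `medication` input type, and the function name `substitute_phrases` are hypothetical and chosen only for illustration; the grammar tree transformations are not shown, and in the combined approach their output sentence would be passed in as `fragment`.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionaries keyed by input type. Each entry pairs a substitute
# phrase with a flag: True for a genuine synonym (positive test), False for a
# "false synonym" included for negative testing.
DICTIONARIES: Dict[str, List[Tuple[str, bool]]] = {
    "medication": [
        ("paracetamol", True),
        ("Tylenol", True),
        ("acetylene", False),  # false synonym: the annotator should not match it
    ],
}


def substitute_phrases(
    fragment: str,
    starting_phrase: str,
    input_type: str,
    dictionaries: Dict[str, List[Tuple[str, bool]]] = DICTIONARIES,
) -> List[Tuple[str, bool]]:
    """Return (test_sentence, expect_annotation) pairs for one corpus fragment.

    The first occurrence of starting_phrase is replaced by each substitute
    phrase registered for input_type. Fragments that do not contain the
    starting phrase yield no test cases.
    """
    if starting_phrase not in fragment:
        return []
    cases = []
    for phrase, is_synonym in dictionaries.get(input_type, []):
        sentence = fragment.replace(starting_phrase, phrase, 1)
        cases.append((sentence, is_synonym))
    return cases


cases = substitute_phrases(
    "The patient was given acetaminophen daily.", "acetaminophen", "medication"
)
for sentence, expected in cases:
    print(expected, sentence)
```

Each returned pair carries the expected outcome, so a test harness could run the annotator on the sentence and compare whether an annotation was produced against the flag.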
Abstract:
A primary ingestion pipeline configured for use in natural language processing includes annotators configured for annotating documents. The annotators and documents to be annotated are evaluated. Based on the evaluations, an ingestion risk score is generated for each document. Each ingestion risk score represents a likelihood that an associated document will not successfully be annotated by the annotators. Each ingestion risk score is compared to a set of risk criteria. Based on the comparisons, a determination is made that each document of a first set of documents satisfies the set of risk criteria. A further determination is made, based on the comparisons, that each document of a second set of documents does not satisfy the set of risk criteria. In response to these determinations, the first set of documents is entered into the primary ingestion pipeline and the second set of documents is provided special handling.
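The routing decision described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not say how the risk score is computed or what the risk criteria are, so the scoring heuristic (`toy_risk_score`), the single-threshold criterion (`max_risk`), and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Document:
    doc_id: str
    text: str


def toy_risk_score(doc: Document) -> float:
    """Hypothetical stand-in for the evaluation step.

    Returns the fraction of non-ASCII characters, treating an empty document
    as maximally risky. A real evaluation would also consider the annotators.
    """
    if not doc.text:
        return 1.0
    non_ascii = sum(1 for ch in doc.text if ord(ch) > 127)
    return non_ascii / len(doc.text)


def route_documents(
    documents: Iterable[Document],
    score_fn: Callable[[Document], float],
    max_risk: float = 0.5,  # assumed criterion: a single upper bound on risk
) -> Tuple[List[Document], List[Document]]:
    """Split documents into (primary pipeline, special handling)."""
    primary: List[Document] = []
    special: List[Document] = []
    for doc in documents:
        risk = score_fn(doc)
        # Documents satisfying the criterion enter the primary pipeline;
        # the rest are set aside for special handling.
        if risk <= max_risk:
            primary.append(doc)
        else:
            special.append(doc)
    return primary, special


docs = [
    Document("a", "plain text"),
    Document("b", "日本語のテキスト"),
    Document("c", ""),
]
primary, special = route_documents(docs, toy_risk_score)
print([d.doc_id for d in primary], [d.doc_id for d in special])
```

Passing the scoring function as a parameter keeps the routing logic independent of how risk is actually estimated, which the abstract leaves open.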