Abstract:
Techniques for organizing knowledge about a dataset storing data from or about multiple sources may be provided. For example, the data can be accessed from the multiple sources and categorized based on the data type. For each data type, a triple extraction technique specific to that data type may be invoked. One set of techniques can allow the extraction of triples from the data based on natural language-based rules. Another set of techniques can allow a similar extraction based on logic-based or structure-based rules. A triple may store a relationship between elements of the data. The extracted triples can be stored with corresponding identifiers in a list. Further, dictionaries storing associations between elements of the data and the triples can be updated. The list and the dictionaries can be used to return triples in response to a query that specifies one or more elements.
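The list-plus-dictionaries arrangement described above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the names `TripleStore`, `triples_from_text`, and `triples_from_record` are hypothetical, the "X is a Y" pattern is an assumed example of a natural-language rule, and the row-to-triple mapping is an assumed example of a structural rule.

```python
import re
from collections import defaultdict


class TripleStore:
    """Hypothetical store: a list of triples plus an element-to-ids dictionary."""

    def __init__(self):
        self.triples = []               # list position serves as the identifier
        self.index = defaultdict(set)   # element -> ids of triples containing it

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        triple_id = len(self.triples)
        self.triples.append(triple)
        for element in triple:
            self.index[element].add(triple_id)
        return triple_id

    def query(self, *elements):
        """Return triples that contain every requested element."""
        if not elements:
            return []
        id_sets = [self.index.get(e, set()) for e in elements]
        matching = set.intersection(*id_sets)
        return [self.triples[i] for i in sorted(matching)]


def triples_from_text(text):
    # Assumed natural-language rule: "<X> is a/an <Y>" yields (X, is_a, Y).
    pattern = re.compile(r"(\w+) is an? (\w+)")
    return [(m.group(1), "is_a", m.group(2)) for m in pattern.finditer(text)]


def triples_from_record(record_id, record):
    # Assumed structural rule: each column of a row yields (row, column, value).
    return [(record_id, column, value) for column, value in record.items()]


store = TripleStore()
for t in triples_from_text("Alice is an analyst. Bob is a manager."):
    store.add(*t)
for t in triples_from_record("row1", {"name": "Alice", "city": "Paris"}):
    store.add(*t)

print(store.query("Alice"))
print(store.query("Alice", "is_a"))
```

Keeping the dictionary values as sets of identifiers makes a multi-element query a set intersection, so the triples themselves are stored only once in the list.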
Abstract:
A contextual analysis engine systematically extracts, analyzes and organizes digital content stored in an electronic file such as a webpage. Content can be extracted using a text extraction module which is capable of separating the content which is to be analyzed from less meaningful content such as format specifications and programming scripts. The resulting unstructured corpus of plain text can then be passed to a text analytics module capable of generating a structured categorization of topics included within the content. This structured categorization can be organized based on a content topic ontology which may have been previously defined or which may be developed in real-time. The systems disclosed herein optionally include an input/output interface capable of managing workflows of the text extraction module and the text analytics module, administering a cache of previously generated results, and interfacing with other applications that leverage the disclosed contextual analysis services.
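The extract-then-categorize flow with a result cache can be sketched as follows. This is an illustrative sketch under stated assumptions: the `ONTOLOGY` dictionary, the keyword-counting categorizer, and the function names are hypothetical stand-ins for the text analytics module and content topic ontology, whose actual design the abstract does not specify.

```python
import hashlib
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical, predefined topic ontology: topic -> indicative terms.
ONTOLOGY = {
    "sports": {"football", "goal", "league"},
    "finance": {"stock", "market", "bank"},
}


def categorize(text, ontology):
    """Count, per topic, how many words of the text belong to that topic."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for topic, terms in ontology.items():
            if word in terms:
                counts[topic] += 1
    return dict(counts)


def analyze(html, cache):
    """Return cached results when the same content was analyzed before."""
    key = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = categorize(extract_text(html), ONTOLOGY)
    return cache[key]


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var goal = 1;</script></head>"
    "<body><p>The stock market rallied as the bank reported gains.</p></body></html>"
)
cache = {}
print(analyze(page, cache))
```

The word "goal" inside the script is ignored because the extractor drops script content before categorization, which is the separation of meaningful content from programming scripts that the abstract describes.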
Abstract:
Techniques are disclosed for using natural language processing techniques to define, manipulate, and interact with consumer segmentations. In such embodiments a content consumption analytics engine can be configured to receive and process a natural language segmentation query. The query may comprise, for example, a command that defines a new segmentation, a command that manipulates existing segmentations, or a command that solicits information relating to existing consumer segmentations. The query is parsed to identify individual grammatical tokens which are then correlated with specific segment token types through the use of a token repository. A custom thesaurus is used to identify synonymous terms for grammatical tokens which may not exist in the token repository. User feedback enables the custom thesaurus to learn additional synonyms for future use. Once the grammatical tokens are mapped onto the identified segment token types, a formal segment definition can be constructed based on a segment definition structure.
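The token-mapping step can be sketched as follows. This is a simplified illustration, assuming a whitespace tokenizer and a flat dictionary as the segment definition structure; `TOKEN_REPOSITORY`, `THESAURUS`, `parse_segment_query`, and `learn_synonym` are hypothetical names, and the entries are invented examples.

```python
# Hypothetical token repository: grammatical token -> (segment token type, value).
TOKEN_REPOSITORY = {
    "chrome": ("browser", "Chrome"),
    "firefox": ("browser", "Firefox"),
    "canada": ("country", "Canada"),
    "mobile": ("device", "Mobile"),
}

# Hypothetical custom thesaurus: synonym -> token known to the repository.
THESAURUS = {"phone": "mobile"}


def parse_segment_query(query, repository, thesaurus):
    """Map tokens to segment token types and build a simple segment definition."""
    definition = {}
    unknown = []
    for token in query.lower().split():
        # Fall back to the thesaurus when the token is not in the repository.
        canonical = token if token in repository else thesaurus.get(token)
        if canonical in repository:
            token_type, value = repository[canonical]
            definition.setdefault(token_type, []).append(value)
        else:
            unknown.append(token)
    return definition, unknown


def learn_synonym(thesaurus, term, canonical):
    """Record user feedback so the synonym resolves in later queries."""
    thesaurus[term.lower()] = canonical


thesaurus = dict(THESAURUS)
definition, unknown = parse_segment_query(
    "visitors on phone from canada", TOKEN_REPOSITORY, thesaurus
)
print(definition)
print(unknown)

learn_synonym(thesaurus, "handset", "mobile")
print(parse_segment_query("handset", TOKEN_REPOSITORY, thesaurus)[0])
```

Tokens that resolve through neither the repository nor the thesaurus are returned separately, which is where a feedback prompt to the user could be attached.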
Abstract:
A computer-implemented method and system identify the correct structured reading-order sequence of text segments that are extracted from a file structured in a portable document format. A probabilistic language model is generated from a large text corpus to comprise observed word sequence patterns for a given language. The language model measures whether splicing together a first text segment with another continuation text segment results in a phrase that is more likely than a phrase resulting from splicing together the first text segment with other continuation text segments. Sets of text segments are provided to the probabilistic model, where the sets comprise a first set, including the first text segment and a first continuation text segment, and a second set, including the first text segment and a second continuation text segment. A score is obtained for each set of text segments. The score is indicative of a likelihood of the set providing a correct structured reading-order sequence. The probabilistic language model may be generated in accordance with a Recurrent Neural Network or an n-gram model.
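The scoring of candidate continuations can be sketched with the n-gram option. This is a toy illustration, assuming a bigram model with add-one smoothing trained on three invented sentences; the class and function names are hypothetical, and the length-normalized average log-probability is one assumed choice of score, not necessarily the one used in the disclosure.

```python
import math
from collections import Counter


class BigramModel:
    """Toy bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus_sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus_sentences:
            words = sentence.lower().split()
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, prev, word):
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.unigrams[prev] + self.vocab_size
        return math.log(numerator / denominator)

    def score(self, first_segment, continuation):
        """Average bigram log-probability of the spliced phrase."""
        words = (first_segment + " " + continuation).lower().split()
        pairs = list(zip(words, words[1:]))
        if not pairs:
            return float("-inf")
        return sum(self.log_prob(a, b) for a, b in pairs) / len(pairs)


def best_continuation(model, first_segment, candidates):
    """Pick the continuation whose splice with the first segment scores highest."""
    return max(candidates, key=lambda c: model.score(first_segment, c))


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
    "a quick brown fox runs fast",
]
model = BigramModel(corpus)
first = "the quick brown"
candidates = ["dog sleeps all", "fox jumps over"]
print(best_continuation(model, first, candidates))
```

Averaging over the number of bigrams keeps continuations of different lengths comparable; summing raw log-probabilities would favor shorter candidates.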
Abstract:
Techniques for generating a query statement to query a dataset may be provided. For example, the query statement can be generated from natural language input, such as a natural language utterance. To do so, the input can be analyzed to detect a sentence, identify words in the sentence, and tag the words with the corresponding word types (e.g., nouns, verbs, adjectives, etc.). Expressions using the tags can be generated. Data about the expressions can be inputted to a classifier. Based on a detected pattern associated with the expressions, the classifier can predict a structure of the query statement, such as what expressions correspond to what clauses of the query statement. Based on this prediction, words associated with the expressions can be added to the clauses to generate the query statement and accordingly query the dataset.
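The tag, classify, and fill pipeline can be sketched as follows. This is a heavily simplified sketch: the `LEXICON` tagger stands in for a real part-of-speech tagger, and the `PATTERNS` lookup stands in for the trained classifier, so both are assumptions rather than the disclosed method. All names and the SQL-like output form are hypothetical.

```python
import re

# Hypothetical lexicon used in place of a real part-of-speech tagger.
LEXICON = {
    "show": "VERB", "list": "VERB",
    "customers": "NOUN", "orders": "NOUN", "city": "NOUN",
    "in": "PREP", "from": "PREP", "with": "PREP",
}

# Stand-in for the classifier: a tag pattern maps to the clause role of each word.
PATTERNS = {
    ("VERB", "NOUN"): ("skip", "table"),
    ("VERB", "NOUN", "PREP", "NOUN", "VALUE"):
        ("skip", "table", "skip", "where_column", "where_value"),
}


def tag_words(sentence):
    """Tag each word; words missing from the lexicon are treated as values."""
    words = re.findall(r"[A-Za-z0-9]+", sentence)
    return [(w, LEXICON.get(w.lower(), "VALUE")) for w in words]


def predict_structure(tags):
    """Return the clause roles for a tag sequence, or None if unrecognized."""
    return PATTERNS.get(tuple(tags))


def build_query(sentence):
    """Return a (statement, parameters) pair built from the sentence."""
    tagged = tag_words(sentence)
    roles = predict_structure([tag for _, tag in tagged])
    if roles is None:
        raise ValueError("no known query structure for this sentence")
    slots = {role: word for (word, _), role in zip(tagged, roles) if role != "skip"}
    statement = f"SELECT * FROM {slots['table']}"
    params = ()
    if "where_column" in slots:
        # The value is passed as a parameter rather than spliced into the text.
        statement += f" WHERE {slots['where_column']} = ?"
        params = (slots["where_value"],)
    return statement, params


print(build_query("Show customers with city Paris"))
print(build_query("List orders"))
```

Table and column names here come only from lexicon nouns, while free-form values travel as parameters, which avoids building the statement from unchecked user text.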