摘要:
Data elements from data sources and having a data value set are linked by using hash functions to determine a dimensionally reduced instance signature for each data element based on all data values associated with that data element to yield a plurality of dimensionally reduced instance signatures of equivalent fixed size such that similarities among the data values in the data value sets across all data elements is maintained among the plurality of instance signatures. Candidate pairs of data elements to link are identified using the plurality of instance signatures in locality sensitive hash functions, and a similarity index is generated for each candidate pair using a pre-determined measure of similarity. Candidate pairs of data elements having a similarity index above a given threshold are linked.
摘要:
Data elements from data sources and having a data value set are linked by using hash functions to determine a dimensionally reduced instance signature for each data element based on all data values associated with that data element to yield a plurality of dimensionally reduced instance signatures of equivalent fixed size such that similarities among the data values in the data value sets across all data elements is maintained among the plurality of instance signatures. Candidate pairs of data elements to link are identified using the plurality of instance signatures in locality sensitive hash functions, and a similarity index is generated for each candidate pair using a pre-determined measure of similarity. Candidate pairs of data elements having a similarity index above a given threshold are linked.
摘要:
Data elements from data sources and having a data value set are linked by using hash functions to determine a dimensionally reduced instance signature for each data element based on all data values associated with that data element to yield a plurality of dimensionally reduced instance signatures of equivalent fixed size such that similarities among the data values in the data value sets across all data elements is maintained among the plurality of instance signatures. Candidate pairs of data elements to link are identified using the plurality of instance signatures in locality sensitive hash functions, and a similarity index is generated for each candidate pair using a pre-determined measure of similarity. Candidate pairs of data elements having a similarity index above a given threshold are linked.
摘要:
A computer-implemented method, system, and article of manufacture for querying and integrating structured and unstructured data. The method includes: receiving entity information that is extracted from a first set of unstructured data using an open domain information extraction system, wherein the entity in-formation comprises relationship information between a first entity and a second entity of the first set of unstructured data; recognizing a pattern based on the relationship information and creating a schema for the first set of unstructured data based on the pattern; and associating an element of the created schema with (i) an entity of a second set of unstructured data or (ii) a schema element of an existing set of structured data if there is sufficient overall similarity between the created schema element and either the second unstructured data entity or the schema element of the existing structured data.
摘要:
A computer-implemented method, system, and article of manufacture for querying and integrating structured and unstructured data. The method includes: receiving entity information that is extracted from a first set of unstructured data using an open domain information extraction system, wherein the entity information comprises relationship information between a first entity and a second entity of the first set of unstructured data; recognizing a pattern based on the relationship information and creating a schema for the first set of unstructured data based on the pattern; and associating an element of the created schema with (i) an entity of a second set of unstructured data or (ii) a schema element of an existing set of structured data if there is sufficient overall similarity between the created schema element and either the second unstructured data entity or the schema element of the existing structured data.
摘要:
Methods and systems for aggregating search query results include receiving search query results and schema information for the query results from multiple heterogeneous sources, determining types for elements of the query results based on the schema information, determining potential aggregations for the query results based on the types, which are based on accumulated information from the plurality of heterogeneous resources, and aggregating the query results according to one or more of the potential aggregations.
摘要:
Methods and systems for determining schema element types are shown that include pooling potential annotations for an element of an unlabeled schema from a plurality of heterogeneous sources, scoring the pool of potential annotations according to relevancy using information using instance information from the plurality of heterogeneous sources to produce a relevancy score, and annotating the element of the unlabeled schema using the most relevant potential annotations.
摘要:
Methods and systems for determining schema element types are shown that include pooling potential annotations for an element of an unlabeled schema from a plurality of heterogeneous sources, scoring the pool of potential annotations according to relevancy using information using instance information from the plurality of heterogeneous sources to produce a relevancy score, and annotating the element of the unlabeled schema using the most relevant potential annotations.
摘要:
Aspects of the present invention provide a tool for extracting schema from a spreadsheet. In an embodiment, a set of data that is stored in an uncataloged tabular format, such as a spreadsheet, is retrieved. The structure of the retrieved set of data is surveyed to determine the dataset schema thereof. Then, data elements within the dataset schema are analyzed to obtain information regarding the data elements. Based on dataset schema and the element information, an interface can be constructed that allows remote access to the set of data.
摘要:
Aspects of the present invention provide a tool for extracting schema from a spreadsheet. In an embodiment, a set of data that is stored in an uncataloged tabular format, such as a spreadsheet, is retrieved. The structure of the retrieved set of data is surveyed to determine the dataset schema thereof. Then, data elements within the dataset schema are analyzed to obtain information regarding the data elements. Based on dataset schema and the element information, an interface can be constructed that allows remote access to the set of data.