Linking Data Elements Based on Similarity Data Values and Semantic Annotations
    3.
    Invention application
    Linking Data Elements Based on Similarity Data Values and Semantic Annotations (status: under examination, published)

    Publication No.: US20130332466A1

    Publication date: 2013-12-12

    Application No.: US13491724

    Filing date: 2012-06-08

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: Data elements from data sources, each having a data value set, are linked by using hash functions to determine a dimensionally reduced instance signature for each data element based on all data values associated with that data element. This yields a plurality of dimensionally reduced instance signatures of equivalent fixed size such that similarities among the data values in the data value sets across all data elements are maintained among the plurality of instance signatures. Candidate pairs of data elements to link are identified using the plurality of instance signatures in locality sensitive hash functions, and a similarity index is generated for each candidate pair using a pre-determined measure of similarity. Candidate pairs of data elements having a similarity index above a given threshold are linked.
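The pipeline in this abstract (fixed-size signatures, locality sensitive hashing to find candidate pairs, then a similarity threshold) can be illustrated with a minimal Python sketch. The abstract does not name the hash family or the similarity measure, so MinHash signatures, band-based LSH, and Jaccard similarity are assumptions made here for illustration; `NUM_HASHES`, `BANDS`, the function names, and the sample column names are all hypothetical. Value sets are assumed to be non-empty.

```python
import hashlib
from itertools import combinations

NUM_HASHES = 32  # signature length (assumed)
BANDS = 16       # number of LSH bands (assumed); rows per band = NUM_HASHES // BANDS


def _hash(seed: int, value: str) -> int:
    """Deterministic 64-bit hash of a value, salted by the seed."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(values):
    """Fixed-size MinHash signature built from all values of one data element."""
    return tuple(min(_hash(seed, v) for v in values) for seed in range(NUM_HASHES))


def candidate_pairs(signatures):
    """Elements whose signatures agree on at least one whole band become candidates."""
    rows = NUM_HASHES // BANDS
    pairs = set()
    for band in range(BANDS):
        buckets = {}
        for name, sig in signatures.items():
            buckets.setdefault(sig[band * rows:(band + 1) * rows], []).append(name)
        for names in buckets.values():
            pairs.update(combinations(sorted(names), 2))
    return pairs


def jaccard(a, b):
    """Similarity index for a candidate pair (Jaccard is an assumed choice)."""
    return len(a & b) / len(a | b)


def link(elements, threshold=0.5):
    """Return the candidate pairs whose similarity index is above the threshold."""
    sigs = {name: signature(vals) for name, vals in elements.items()}
    return sorted(
        (a, b) for a, b in candidate_pairs(sigs)
        if jaccard(elements[a], elements[b]) > threshold
    )


elements = {
    "customers.city": {"Paris", "Berlin", "Rome"},
    "orders.ship_city": {"Paris", "Berlin", "Rome"},
    "products.sku": {"A1", "B2"},
}
links = link(elements)
```

The LSH step only prunes the number of pairs that need an exact comparison; the final decision in this sketch still rests on the exact similarity index.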

    Querying and integrating structured and unstructured data
    4.
    Granted invention patent
    Querying and integrating structured and unstructured data (status: in force)

    Publication No.: US09037615B2

    Publication date: 2015-05-19

    Application No.: US13493174

    Filing date: 2012-06-11

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30946 G06F17/30292

    Abstract: A computer-implemented method, system, and article of manufacture for querying and integrating structured and unstructured data. The method includes: receiving entity information that is extracted from a first set of unstructured data using an open domain information extraction system, wherein the entity information comprises relationship information between a first entity and a second entity of the first set of unstructured data; recognizing a pattern based on the relationship information and creating a schema for the first set of unstructured data based on the pattern; and associating an element of the created schema with (i) an entity of a second set of unstructured data or (ii) a schema element of an existing set of structured data if there is sufficient overall similarity between the created schema element and either the second unstructured data entity or the schema element of the existing structured data.
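As a rough illustration of the method in this abstract, the Python sketch below takes (subject, relation, object) triples, as an open domain information extraction system might emit them, treats each frequently recurring relation as a schema element, and associates it with an existing structured column when an overall similarity score is high enough. The abstract does not define the pattern recognition or the similarity measure, so the support count, the averaging of name similarity and value overlap, and every name in the code are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_schema(triples, min_support=2):
    """Keep relations seen at least min_support times; map each to its object values."""
    values = defaultdict(set)
    counts = defaultdict(int)
    for _subject, relation, obj in triples:
        values[relation].add(obj)
        counts[relation] += 1
    return {rel: vals for rel, vals in values.items() if counts[rel] >= min_support}


def overall_similarity(name_a, values_a, name_b, values_b):
    """Average of name similarity and value overlap (an assumed scoring rule)."""
    name_sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    union = values_a | values_b
    value_sim = len(values_a & values_b) / len(union) if union else 0.0
    return (name_sim + value_sim) / 2


def associate(schema, existing, threshold=0.5):
    """Link each created schema element to its best existing column, if similar enough."""
    links = {}
    for rel, vals in schema.items():
        scores = {
            col: overall_similarity(rel, vals, col, col_vals)
            for col, col_vals in existing.items()
        }
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] >= threshold:
                links[rel] = best
    return links


# Hypothetical extracted triples and an existing structured table.
triples = [
    ("Alice", "works_for", "Acme"),
    ("Bob", "works_for", "Globex"),
    ("Carol", "born_in", "Oslo"),
]
existing = {"employer": {"Acme", "Globex"}, "city": {"Oslo", "Rome"}}

schema = build_schema(triples)
links = associate(schema, existing)
```

Here `born_in` is dropped because it occurs only once, and `works_for` is linked to `employer` mainly through shared values rather than name similarity.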

    QUERYING AND INTEGRATING STRUCTURED AND INSTRUCTURED DATA
    5.
    Invention application
    QUERYING AND INTEGRATING STRUCTURED AND INSTRUCTURED DATA (status: in force)

    Publication No.: US20130332478A1

    Publication date: 2013-12-12

    Application No.: US13493174

    Filing date: 2012-06-11

    IPC classification: G06F17/30

    CPC classification: G06F17/30946 G06F17/30292

    Abstract: A computer-implemented method, system, and article of manufacture for querying and integrating structured and unstructured data. The method includes: receiving entity information that is extracted from a first set of unstructured data using an open domain information extraction system, wherein the entity information comprises relationship information between a first entity and a second entity of the first set of unstructured data; recognizing a pattern based on the relationship information and creating a schema for the first set of unstructured data based on the pattern; and associating an element of the created schema with (i) an entity of a second set of unstructured data or (ii) a schema element of an existing set of structured data if there is sufficient overall similarity between the created schema element and either the second unstructured data entity or the schema element of the existing structured data.

    Optimizing sparse schema-less data in relational stores
    6.
    Granted invention patent
    Optimizing sparse schema-less data in relational stores (status: in force)

    Publication No.: US08918434B2

    Publication date: 2014-12-23

    Application No.: US13454559

    Filing date: 2012-04-24

    IPC classification: G06F17/30 G06F7/00

    CPC classification: G06F17/30292

    Abstract: Various embodiments of the invention relate to optimizing storage of schema-less data. A schema-less dataset including a plurality of resources is received. Each resource is associated with at least a plurality of properties. At least one set of co-occurring properties from the plurality of properties is identified. A graph including a plurality of nodes is generated. Each of the nodes represents a unique property in the set of co-occurring properties. The graph further includes an edge connecting each pair of nodes that represent co-occurring properties. A graph coloring operation is performed on the graph. The graph coloring operation includes assigning each of the nodes a color, where nodes connected by an edge are assigned different colors. A schema is generated that assigns a column identifier from a table to each unique property represented by one of the nodes in the graph based on the color assigned to the node.
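The steps in this abstract can be sketched in Python as follows: build a graph whose nodes are properties and whose edges join properties that occur on the same resource, color it so that adjacent nodes differ, and map each color to a column identifier. The abstract does not say which coloring algorithm is used, so the greedy, highest-degree-first strategy is an assumption, as are the `col<N>` identifiers, the function names, and the sample data.

```python
from itertools import combinations


def cooccurrence_graph(resources):
    """Nodes are properties; an edge joins two properties found on the same resource."""
    graph = {}
    for props in resources.values():
        props = set(props)
        for p in props:
            graph.setdefault(p, set())
        for a, b in combinations(props, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Give each node the smallest color not used by an already colored neighbor."""
    colors = {}
    # Highest degree first; ties broken by name so the result is deterministic.
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {colors[nb] for nb in graph[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


def column_schema(resources):
    """Map every property to a column identifier derived from its color."""
    coloring = greedy_coloring(cooccurrence_graph(resources))
    return {prop: f"col{color}" for prop, color in coloring.items()}


# Hypothetical schema-less dataset: resource id -> properties present on it.
resources = {
    "r1": {"name", "age"},
    "r2": {"name", "email"},
    "r3": {"isbn"},
}
schema = column_schema(resources)
```

Because properties that never co-occur can share a color, four distinct properties fit into two columns here, and no single resource needs the same column twice.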
