Discovery of linkage points between data sources

    Publication No.: US11531717B2

    Publication date: 2022-12-20

    Application No.: US16794895

    Filing date: 2020-02-19

    Abstract: Data records are linked across a plurality of datasets. Each dataset contains at least one data record, and each data record is associated with an entity and includes one or more attributes of that entity and a value for each attribute. Values associated with attributes are compared across datasets, and matching attributes having values that satisfy a predetermined similarity threshold are identified. In addition, linkage points between pairs of datasets are identified. Each linkage point links one or more pairs of data records. Each data record in the pair of data records is contained in one of a given pair of datasets, and each pair of data records is associated with a common entity having matching attributes in the given pair of datasets. Data records associated with the common entities are linked across datasets using the identified linkage points.
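The attribute comparison described in this abstract can be illustrated with a minimal sketch. It is not the patented method: it assumes that datasets are lists of Python dicts, that similarity is the Jaccard overlap of normalized value sets, and that the threshold is 0.5. The function names and sample data are hypothetical.

```python
from itertools import product

def normalize(value):
    """Lowercase and trim a value so that trivially different spellings match."""
    return str(value).strip().lower()

def column_values(dataset, attribute):
    """Collect the normalized values one attribute takes across all records."""
    return {normalize(r[attribute]) for r in dataset if attribute in r}

def jaccard(a, b):
    """Jaccard similarity of two value sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def find_linkage_points(ds_a, ds_b, threshold=0.5):
    """Return (attr_a, attr_b, score) for attribute pairs that meet the threshold."""
    attrs_a = sorted({k for r in ds_a for k in r})
    attrs_b = sorted({k for r in ds_b for k in r})
    points = []
    for x, y in product(attrs_a, attrs_b):
        score = jaccard(column_values(ds_a, x), column_values(ds_b, y))
        if score >= threshold:
            points.append((x, y, score))
    return sorted(points, key=lambda p: p[2], reverse=True)

def link_records(ds_a, ds_b, point):
    """Pair record indices whose values agree on the given linkage point."""
    x, y, _ = point
    index = {}
    for j, rec in enumerate(ds_b):
        if y in rec:
            index.setdefault(normalize(rec[y]), []).append(j)
    return [(i, j)
            for i, rec in enumerate(ds_a) if x in rec
            for j in index.get(normalize(rec[x]), [])]

# Hypothetical sample data: two datasets describing the same companies.
companies = [{"name": "Acme Corp", "ticker": "ACM"},
             {"name": "Globex", "ticker": "GBX"}]
filings = [{"symbol": "acm", "year": 2019},
           {"symbol": "GBX", "year": 2020},
           {"symbol": "XYZ", "year": 2020}]

points = find_linkage_points(companies, filings, threshold=0.5)
links = link_records(companies, filings, points[0])
```

Here `ticker` and `symbol` are the only attribute pair whose value sets overlap enough, so they form the single linkage point, and the records are paired through it.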

    Method and apparatus for identifying semantically related records

    Publication No.: US11227002B2

    Publication date: 2022-01-18

    Application No.: US14954664

    Filing date: 2015-11-30

    IPC classification: G06F16/35 G06F16/215

    Abstract: An apparatus and method of identifying semantically related records, including receiving input data from an input device, splitting the input data into a plurality of clusters according to semantic relationship, each of the clusters including a plurality of source terms and a plurality of target terms, transforming each of the plurality of clusters based on the transformation which includes tokenization of the plurality of clusters, for each of the plurality of clusters that are transformed, finding relatedness scores of a plurality of semantic relatedness measures with the plurality of target terms, building a vector of similarity scores for each of the plurality of target terms, and for each of the plurality of source terms, selecting a predetermined number of the plurality of target terms according to the similarity scores.
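The scoring and selection steps of this abstract can be sketched as follows. The sketch makes several assumptions that the abstract does not state: the clustering step is skipped (a single cluster is used), the two relatedness measures are token Jaccard overlap and a character-level ratio from Python's `difflib`, and the score vector is reduced to its mean for ranking. All names and sample terms are hypothetical.

```python
import re
from difflib import SequenceMatcher

def tokenize(term):
    """Split a term into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", term.lower())

def token_jaccard(a, b):
    """Relatedness measure 1: overlap of the two token sets."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def char_ratio(a, b):
    """Relatedness measure 2: character-level similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

MEASURES = (token_jaccard, char_ratio)

def score_vector(source, target):
    """One similarity score per measure for a (source, target) pair."""
    return [measure(source, target) for measure in MEASURES]

def top_k_targets(source_terms, target_terms, k=2):
    """For each source term, keep the k targets with the highest mean score."""
    result = {}
    for source in source_terms:
        scored = [(sum(score_vector(source, t)) / len(MEASURES), t)
                  for t in target_terms]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        result[source] = [t for _, t in scored[:k]]
    return result

matches = top_k_targets(
    ["heart attack"],
    ["Heart Attack (acute)", "kidney stone", "attack of the heart"],
    k=2,
)
```

Averaging the vector is only one way to combine the measures; a weighted sum or a learned model over the vector would fit the same outline.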

    Semantic concept discovery over event databases

    Publication No.: US11074266B2

    Publication date: 2021-07-27

    Application No.: US16157304

    Filing date: 2018-10-11

    Abstract: A concept discovery method, system, and computer program product include preparing a concept index for concepts built over a set of input data having input terms, building a vector representation of the concepts in the input data, receiving a set of query terms as an additional input, mapping the set of query terms to the concepts in the concept index, calculating at least one of a co-occurrence score for each of the concepts in the concept index by measuring their frequency of co-occurrence with the input terms' concepts and a similarity score for each of the concepts in the concept index by measuring the similarity of their vector representations according to a vector similarity measure, and ranking the concepts with respect to their relevance to the input terms by the at least one of the co-occurrence score and the similarity score.
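The two scores named in this abstract can be illustrated with a small sketch. It assumes that each event is a set of concepts, that a concept's vector is its co-occurrence count with every other concept, that cosine similarity is the vector similarity measure, and that ranking sorts by co-occurrence first and similarity second. The `concept_index` mapping, function names, and events are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_vectors(events):
    """Represent each concept by how often it co-occurs with other concepts."""
    vectors = {}
    for concepts in events:
        for c in concepts:
            vec = vectors.setdefault(c, Counter())
            for other in concepts:
                if other != c:
                    vec[other] += 1
    return vectors

def rank_concepts(events, query_concepts):
    """Rank non-query concepts by (co-occurrence score, similarity score)."""
    vectors = build_vectors(events)
    query = set(query_concepts)
    scores = {}
    for c, vec in vectors.items():
        if c in query:
            continue
        cooccurrence = sum(vec[q] for q in query)
        similarity = max((cosine(vec, vectors[q]) for q in query if q in vectors),
                         default=0.0)
        scores[c] = (cooccurrence, similarity)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical concept index mapping query terms to concepts.
concept_index = {"rally": "protest", "demonstration": "protest", "walkout": "strike"}
events = [{"protest", "police", "city"},
          {"protest", "strike", "union"},
          {"strike", "union", "factory"},
          {"election", "city"}]

query = [concept_index.get(term, term) for term in ["rally"]]
ranking = rank_concepts(events, query)
```

With this data, `strike` and `union` tie for the top rank, while `factory` and `election` never co-occur with `protest` and are ordered by vector similarity alone.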

    Data virtualization across heterogeneous formats

    Publication No.: US10740304B2

    Publication date: 2020-08-11

    Application No.: US14467640

    Filing date: 2014-08-25

    IPC classification: G06F16/21 G06F16/22

    Abstract: Various embodiments virtualize data across heterogeneous formats. In one embodiment, a plurality of heterogeneous data sources is received as input. A local schema graph including a set of attribute nodes and a set of type nodes is generated for each of the plurality of heterogeneous data sources. A global schema graph is generated based on each local schema graph that has been generated. The global schema graph comprises each of the local schema graphs and at least one edge between at least one of two or more attribute nodes and two or more type nodes from different local schema graphs. The edge indicates a relationship between the data sources represented by the different local schema graphs comprising the two or more attribute nodes based on a computed similarity between at least one value associated with each of the two or more attribute nodes.
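A minimal sketch of the local and global schema graphs follows. It assumes that each source has already been parsed into a list of dicts of one type, that nodes are plain tuples, and that similarity is the overlap of normalized value sets with a threshold of 0.5. Only edges between attribute nodes are added across sources; edges between type nodes are left out. The function names and sample sources are hypothetical.

```python
def local_schema_graph(source, type_name, records):
    """Build one type node and one attribute node per field of a source."""
    type_node = (source, "type", type_name)
    values = {}
    for record in records:
        for field, value in record.items():
            node = (source, "attr", field)
            values.setdefault(node, set()).add(str(value).strip().lower())
    edges = {(type_node, attr) for attr in values}
    return {"types": {type_node}, "values": values, "edges": edges}

def overlap(a, b):
    """Share of the smaller value set that also appears in the other set."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def global_schema_graph(local_graphs, threshold=0.5):
    """Merge the local graphs and connect similar attributes across sources."""
    merged = {"types": set(), "values": {}, "edges": set()}
    for graph in local_graphs:
        merged["types"] |= graph["types"]
        merged["values"].update(graph["values"])
        merged["edges"] |= graph["edges"]
    attrs = sorted(merged["values"])
    for i, a in enumerate(attrs):
        for b in attrs[i + 1:]:
            different_source = a[0] != b[0]
            score = overlap(merged["values"][a], merged["values"][b])
            if different_source and score >= threshold:
                merged["edges"].add((a, b))
    return merged

# Hypothetical sources, for example a CSV export and a JSON feed.
crm = local_schema_graph("crm", "Customer", [
    {"email": "a@x.com", "city": "Oslo"},
    {"email": "b@x.com", "city": "Rome"},
])
orders = local_schema_graph("orders", "Order", [
    {"contact": "A@x.com", "total": 10},
    {"contact": "c@x.com", "total": 20},
])
graph = global_schema_graph([crm, orders])
```

The resulting graph keeps both local graphs and adds one cross-source edge, between `crm.email` and `orders.contact`, because half of their values coincide.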

    Noise detection in knowledge graphs
    Invention application

    Publication No.: US20200097861A1

    Publication date: 2020-03-26

    Application No.: US16141303

    Filing date: 2018-09-25

    IPC classification: G06N99/00 G06F17/30

    Abstract: Techniques regarding autonomous classification and/or identification of various types of noise comprised within a knowledge graph are provided. For example, one or more embodiments described herein can comprise a system, which can comprise a memory that can store computer executable components. The system can also comprise a processor, operably coupled to the memory, and that can execute the computer executable components stored in the memory. The computer executable components can comprise a knowledge extraction component, operatively coupled to the processor, that can classify a type of noise comprised within a knowledge graph. The type of noise can be generated by an information extraction process.

    System, method, and recording medium for knowledge graph augmentation through schema extension

    Publication No.: US20190258675A1

    Publication date: 2019-08-22

    Application No.: US16399535

    Filing date: 2019-04-30

    IPC classification: G06F16/901 G06F16/36

    Abstract: A method, system, and recording medium for knowledge graph augmentation using data based on a statistical analysis of attributes in the data, including a ranking device configured to rank semantically similar input data elements to create a ranked list of attributes to augment an input of structured data and populate with a data string corresponding to the instances, where the ranking device further combines a set of filters to refine the ranked list of attributes, the set of filters including a first filter according to column ranges of columns, a second filter according to a column uniqueness of the columns, a third filter according to a type of data in a column of the columns, and a fourth filter according to a distribution of values in the columns.
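The four filters listed in this abstract (range, uniqueness, data type, distribution) can be sketched as simple checks on candidate columns. The abstract does not say how each filter is computed, so every rule and threshold below is an assumption made for illustration, and all names and sample columns are hypothetical.

```python
from statistics import mean, pstdev

def is_numeric(values):
    """True when every value is an int or float (booleans excluded)."""
    return all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in values)

def passes_filters(target, candidate, min_uniqueness=0.5, max_mean_shift=1.0):
    """Apply assumed type, uniqueness, range, and distribution filters."""
    if not target or not candidate:
        return False
    # Type filter: both columns must hold the same kind of data.
    if is_numeric(target) != is_numeric(candidate):
        return False
    # Uniqueness filter: reject columns dominated by repeated values.
    if len(set(candidate)) / len(candidate) < min_uniqueness:
        return False
    if is_numeric(target):
        # Range filter: the two value ranges must overlap.
        if max(candidate) < min(target) or min(candidate) > max(target):
            return False
        # Distribution filter: means must be close relative to the target spread.
        spread = pstdev(target) or 1.0
        if abs(mean(candidate) - mean(target)) / spread > max_mean_shift:
            return False
    return True

def refine(ranked_candidates, target):
    """Keep ranked candidate attributes that pass all filters, preserving order."""
    return [name for name, values in ranked_candidates
            if passes_filters(target, values)]

population = [100, 200, 300]
ranked = [("pop_estimate", [110, 190, 310]),
          ("area_code", [1, 1, 1]),
          ("mayor", ["a", "b", "c"]),
          ("elevation", [5000, 6000, 7000])]
kept = refine(ranked, population)
```

In this example only `pop_estimate` survives: `area_code` fails the uniqueness check, `mayor` fails the type check, and `elevation` fails the range check.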

    Knowledge aided feature engineering

    Publication No.: US11599826B2

    Publication date: 2023-03-07

    Application No.: US16741084

    Filing date: 2020-01-13

    IPC classification: G06N20/00 G06F11/34

    Abstract: Embodiments relate to a system, program product, and method for employing feature engineering to improve classifier performance. A first machine learning (ML) model with a first learning program is selected. The first selected ML model is operatively associated with a first structured dataset. First features in the first dataset directed at performance of the selected ML model are identified. A second structured dataset is assessed with respect to the identified features in the first dataset, and new features in the second dataset are identified, where the new features are semantically related to the identified features in the first dataset. The first dataset is dynamically augmented with the identified new features in the second dataset. The dynamically augmented first dataset is applied to the selected ML model to subject an embedded learning algorithm of the selected ML model to training using the augmented first dataset.

    Expressive temporal predictions over semantically driven time windows

    Publication No.: US10795937B2

    Publication date: 2020-10-06

    Application No.: US15230932

    Filing date: 2016-08-08

    IPC classification: G06F16/901 G06N20/00 G06N5/02

    Abstract: Methods, systems, and computer program products for expressive temporal predictions over semantically-driven time windows are provided herein. A computer-implemented method includes identifying, within a knowledge graph pertaining to a given prediction, a subset of the knowledge graph related to one or more predicted training examples, wherein the subset comprises (i) a set of nodes and (ii) one or more relationships among the set of nodes; determining, for the identified subset, one or more snapshots of the knowledge graph relevant to the given prediction; quantifying a validity window for the one or more predicted training examples, wherein the validity window comprises a temporal bound for prediction validity; and computing a validity window for the given prediction based on the quantified validity window for the one or more predicted training examples.