Repairing data through domain knowledge

    Publication No.: US10127268B2

    Publication date: 2018-11-13

    Application No.: US15288899

    Filing date: 2016-10-07

    Abstract: Correcting data in a dataset. A set of data tokens from a tabular data store are grouped into a plurality of different clusters based on similarity of tokens. A reference cluster is selected from among the plurality of different clusters such that the plurality of clusters includes a reference cluster and one or more other clusters. One or more tokens in the one or more other clusters are transformed. Transforming tokens is performed based on a cost of transforming tokens. The effect on the reference cluster of adding the transformed tokens to the reference cluster is determined. Using this information, a correction for a token in the dataset is identified. The data store is updated to correct the token.
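The steps named in this abstract (cluster, pick a reference cluster, transform at a cost, measure the effect, correct) can be illustrated with a minimal sketch. It is not the patented method: the abstract does not say how similarity, cost, or effect are computed, so every choice below is an assumption. Clusters are groups of identical tokens, the reference cluster is the most frequent one, the transformation cost is Levenshtein edit distance, and the effect is the reference cluster's share of the column before and after absorbing the transformed tokens. The names `edit_distance`, `propose_corrections`, and `max_cost` are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as the assumed transformation cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]


def propose_corrections(tokens, max_cost=2):
    """Return (corrections, (share_before, share_after)) for one column."""
    if not tokens:
        return {}, (0.0, 0.0)
    # Assumption: a cluster is a group of identical tokens.
    clusters = Counter(tokens)
    # Assumption: the reference cluster is the most frequent one.
    reference, ref_size = clusters.most_common(1)[0]
    corrections, gained = {}, 0
    for token, size in clusters.items():
        if token == reference:
            continue
        # Transform a token only when doing so is cheap enough.
        if edit_distance(token, reference) <= max_cost:
            corrections[token] = reference
            gained += size
    total = len(tokens)
    # Assumed "effect": growth of the reference cluster's share of the column.
    return corrections, (ref_size / total, (ref_size + gained) / total)


column = ["Seattle", "Seatle", "Seattle", "seattle", "Boston", "Seattle"]
corrections, (before, after) = propose_corrections(column)
# Update the column in place of the data store.
repaired = [corrections.get(t, t) for t in column]
```

In this example "Seatle" and "seattle" are each one edit away from the reference token and are rewritten, while "Boston" is too costly to transform and is left alone.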

    Joining semantically-related data using big table corpora

    Publication No.: US10198471B2

    Publication date: 2019-02-05

    Application No.: US14726547

    Filing date: 2015-05-31

    Abstract: Examples of the disclosure enable performing semantic joins using a big table corpus. Pairs of values from at least two data sets are identified. The pairs of values include one value from a first one of the data sets and one value from a second one of the data sets. Statistical co-occurrence scores for the identified pairs of values are determined based on historical co-occurrence data. The determined statistical co-occurrence scores are used for predicting a semantic relationship between the at least two data sets. The predicted semantic relationship is used for joining the at least two data sets.
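As a rough illustration of joining on co-occurrence scores, the sketch below scores value pairs against a small stand-in for a table corpus and joins each left value to its best-scoring right value. The abstract does not name a scoring function, so pointwise mutual information (PMI) is an assumption, as are the toy corpus, the value-level matching rule, and the hypothetical names `count_cooccurrences`, `pmi`, and `semantic_join`.

```python
import math
from collections import Counter
from itertools import permutations


def count_cooccurrences(corpus_rows):
    """Count how often each value, and each ordered value pair, shares a row."""
    value_counts, pair_counts = Counter(), Counter()
    for row in corpus_rows:
        values = set(row)
        value_counts.update(values)
        pair_counts.update(permutations(values, 2))
    return value_counts, pair_counts


def pmi(a, b, value_counts, pair_counts, n_rows):
    """Pointwise mutual information of a and b (assumed score)."""
    joint = pair_counts[(a, b)]
    if joint == 0:
        return float("-inf")  # never seen together in the corpus
    return math.log(joint * n_rows / (value_counts[a] * value_counts[b]))


def semantic_join(left, right, corpus_rows):
    """Pair each left value with its best right value when PMI is positive."""
    value_counts, pair_counts = count_cooccurrences(corpus_rows)
    n_rows = len(corpus_rows)

    def score(a, b):
        return pmi(a, b, value_counts, pair_counts, n_rows)

    joined = []
    for a in left:
        best = max(right, key=lambda b: score(a, b), default=None)
        if best is not None and score(a, best) > 0:
            joined.append((a, best))
    return joined


# Toy "historical" corpus: rows gathered from many tables, one of them noisy.
corpus = [
    ("Washington", "WA"),
    ("Washington", "WA"),
    ("Oregon", "OR"),
    ("Oregon", "OR"),
    ("Washington", "OR"),  # noisy row
    ("Idaho", "ID"),
]
joined = semantic_join(["Washington", "Oregon", "Idaho"], ["ID", "OR", "WA"], corpus)
```

The noisy row gives ("Washington", "OR") a negative PMI, so it loses to ("Washington", "WA"). A raw co-occurrence count would also rank the correct pair first here, but PMI additionally discounts values that are common across the corpus.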

    Repairing data through domain knowledge

    Publication No.: US10970271B2

    Publication date: 2021-04-06

    Application No.: US16161695

    Filing date: 2018-10-16

    Abstract: Correcting data in a dataset. A set of data tokens from a tabular data store are grouped into a plurality of different clusters based on similarity of tokens. A reference cluster is selected from among the plurality of different clusters such that the plurality of clusters includes a reference cluster and one or more other clusters. One or more tokens in the one or more other clusters are transformed. The effect on the reference cluster of adding the transformed tokens to the reference cluster is determined. Using this information, a correction for a token in the dataset is identified. The data store is updated to correct the token using the identified correction.

    Repairing data through domain knowledge
    Patent application

    Publication No.: US20180101561A1

    Publication date: 2018-04-12

    Application No.: US15288899

    Filing date: 2016-10-07

    Abstract: Correcting data in a dataset. A set of data tokens from a tabular data store are grouped into a plurality of different clusters based on similarity of tokens. A reference cluster is selected from among the plurality of different clusters such that the plurality of clusters includes a reference cluster and one or more other clusters. One or more tokens in the one or more other clusters are transformed. Transforming tokens is performed based on a cost of transforming tokens. The effect on the reference cluster of adding the transformed tokens to the reference cluster is determined. Using this information, a correction for a token in the dataset is identified. The data store is updated to correct the token.
