-
Publication number: US10127268B2
Publication date: 2018-11-13
Application number: US15288899
Application date: 2016-10-07
Applicant: Microsoft Technology Licensing, LLC
Inventor: Kris Kuppuswamy Ganjam, Yeye He, Anja Gruenheid
Abstract: Correcting data in a dataset. A set of data tokens from a tabular data store are grouped into a plurality of different clusters based on similarity of tokens. A reference cluster is selected from among the plurality of different clusters such that the plurality of clusters includes a reference cluster and one or more other clusters. One or more tokens in the one or more other clusters are transformed. Transforming tokens is performed based on a cost of transforming tokens. The effect on the reference cluster of adding the transformed tokens to the reference cluster is determined. Using this information, a correction for a token in the dataset is identified. The data store is updated to correct the token.
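Illustrative sketch (not taken from the patent): the abstract above describes clustering tokens, picking a reference cluster, and transforming tokens from other clusters at some cost. The Python below is one minimal way to picture that flow under loud assumptions: each distinct spelling forms its own cluster, the largest cluster is the reference, the transformation cost is Levenshtein edit distance, and the "effect" check is simply that the smaller cluster is absorbed by the larger one. The names `edit_distance`, `suggest_corrections`, and `max_cost` are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as an assumed transformation cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def suggest_corrections(tokens, max_cost=2):
    """Map rare spellings to the reference spelling when the cost is low."""
    counts = Counter(tokens)  # assumption: one cluster per distinct spelling
    reference, ref_size = counts.most_common(1)[0]  # assumption: largest cluster
    corrections = {}
    for token, size in counts.items():
        if token == reference:
            continue
        cost = edit_distance(token, reference)
        # Assumed acceptance rule: cheap transformation into a larger cluster.
        if cost <= max_cost and size < ref_size:
            corrections[token] = reference
    return corrections


column = ["Seattle", "Seattle", "Seattle", "Seatle", "Boston"]
print(suggest_corrections(column))  # {'Seatle': 'Seattle'}
```

"Boston" is left alone because transforming it into "Seattle" costs more than `max_cost`; a fuller treatment would consider several candidate reference clusters rather than only the largest.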
-
Publication number: US10198471B2
Publication date: 2019-02-05
Application number: US14726547
Application date: 2015-05-31
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yeye He, Kris Kuppuswamy Ganjam, Xu Chu
IPC: G06F17/30
Abstract: Examples of the disclosure enable performing semantic joins using a big table corpus. Pairs of values from at least two data sets are identified. The pairs of values include one value from a first one of the data sets and one value from a second one of the data sets. Statistical co-occurrence scores for the identified pairs of values are determined based on historical co-occurrence data. The determined statistical co-occurrence scores are used for predicting a semantic relationship between the at least two data sets. The predicted semantic relationship is used for joining the at least two data sets.
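Illustrative sketch (not taken from the patent): the abstract above scores value pairs by historical co-occurrence and uses the scores to predict how two data sets join. The Python below pictures that idea under stated assumptions: the "historical data" is a small list of rows, the score is pointwise mutual information (PMI, swapped in here as one common co-occurrence statistic), and each left value is joined to its best-scoring right value. The names `build_cooccurrence`, `pmi`, and `semantic_join` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_cooccurrence(corpus_rows):
    """Count how often each value, and each pair of values, appears in a row."""
    value_counts, pair_counts = Counter(), Counter()
    for row in corpus_rows:
        values = set(row)
        value_counts.update(values)
        pair_counts.update(frozenset(p) for p in combinations(sorted(values), 2))
    return value_counts, pair_counts, len(corpus_rows)


def pmi(a, b, value_counts, pair_counts, n_rows):
    """Pointwise mutual information of a and b; -inf if never seen together."""
    joint = pair_counts[frozenset((a, b))]
    if joint == 0:
        return float("-inf")
    return math.log(joint * n_rows / (value_counts[a] * value_counts[b]))


def semantic_join(left, right, stats):
    """Pair each left value with its best-scoring right value (score > 0)."""
    joined = []
    for a in left:
        best = max(right, key=lambda b: pmi(a, b, *stats))
        if pmi(a, best, *stats) > 0:
            joined.append((a, best))
    return joined


# Toy stand-in for a table corpus: rows in which values were seen together.
corpus = [("WA", "Washington"), ("WA", "Washington"),
          ("CA", "California"), ("CA", "California"), ("OR", "Oregon")]
stats = build_cooccurrence(corpus)
print(semantic_join(["WA", "CA"], ["California", "Washington"], stats))
# [('WA', 'Washington'), ('CA', 'California')]
```

The join succeeds even though "WA" and "Washington" share almost no characters, which is the case a string-similarity join would miss.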
-
Publication number: US10970271B2
Publication date: 2021-04-06
Application number: US16161695
Application date: 2018-10-16
Applicant: Microsoft Technology Licensing, LLC
Inventor: Kris Kuppuswamy Ganjam, Yeye He, Anja Gruenheid
IPC: G06F16/23, G06F16/215, G06F16/28, G06F16/35, G06F16/2457
Abstract: Correcting data in a dataset. A set of data tokens from a tabular data store are grouped into a plurality of different clusters based on similarity of tokens. A reference cluster is selected from among the plurality of different clusters such that the plurality of clusters includes a reference cluster and one or more other clusters. One or more tokens in the one or more other clusters are transformed. The effect on the reference cluster of adding the transformed tokens to the reference cluster is determined. Using this information, a correction for a token in the dataset is identified. The data store is updated to correct the token using the identified correction.
-
Publication number: US20180101561A1
Publication date: 2018-04-12
Application number: US15288899
Application date: 2016-10-07
Applicant: Microsoft Technology Licensing, LLC
Inventor: Kris Kuppuswamy Ganjam, Yeye He, Anja Gruenheid
IPC: G06F17/30
CPC classification number: G06F17/30371, G06F17/30303, G06F17/3053, G06F17/30598, G06F17/3071
Abstract: Correcting data in a dataset. A set of data tokens from a tabular data store are grouped into a plurality of different clusters based on similarity of tokens. A reference cluster is selected from among the plurality of different clusters such that the plurality of clusters includes a reference cluster and one or more other clusters. One or more tokens in the one or more other clusters are transformed. Transforming tokens is performed based on a cost of transforming tokens. The effect on the reference cluster of adding the transformed tokens to the reference cluster is determined. Using this information, a correction for a token in the dataset is identified. The data store is updated to correct the token.
-