MACHINE-LEARNED PREDICTIVE MODELS AND SYSTEMS FOR DATA PREPARATION RECOMMENDATIONS

    Publication number: US20230043015A1

    Publication date: 2023-02-09

    Application number: US17969377

    Application date: 2022-10-19

    Inventors: Yeye He, Cong Yan

    Abstract: Systems are provided for facilitating the building and use of models used to make data preparation recommendations. The systems identify ground truth from a plurality of notebooks and utilize the ground truth to generate the corresponding data preparation recommendation models. The data preparation recommendation models are used to predict accurate (e.g., useful and relevant) data preparation steps based on user input and user notebook data. The data preparation computing system generates a recommendation prompt based on output from the data preparation recommendation model that can be viewed and/or selected by the user to be applied to the user's notebook data.

    Facilitating data type detection using existing code

    Publication number: US10795667B2

    Publication date: 2020-10-06

    Application number: US15850283

    Application date: 2017-12-21

    Inventors: Yeye He, Cong Yan

    Abstract: Methods, computer systems, computer-storage media, and graphical user interfaces are provided for facilitating data type detection, according to embodiments of the present invention. In one embodiment, existing code is searched to identify a set of functions related to a target data type. Such functions can be executed using positive example values and negative example values. For each executed function, a logical explanation is generated that represents a distinction in execution of the positive example values from the negative example values. The executed functions can then be ranked based on the extent to which the corresponding logical explanations distinguish execution of the positive example values from the negative example values. A function suggestion corresponding with at least a highest ranked function can then be provided, for example to a user, to indicate a function for use in detecting the target data type.
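
    The ranking step described in this abstract can be illustrated with a minimal sketch. It is not the patented method: the coarse outcome label ("truthy", "falsy", "raised") is an assumed stand-in for the logical explanation, and the candidate names (parse_ip, is_digits, length) and the scoring formula are hypothetical choices made for illustration only.

```python
import ipaddress
from typing import Callable, Dict, List, Tuple


def outcome(func: Callable[[str], object], value: str) -> str:
    """Summarize one execution as a coarse label (assumed stand-in for a trace)."""
    try:
        return "truthy" if func(value) else "falsy"
    except Exception:
        return "raised"


def separation_score(func: Callable[[str], object],
                     positives: List[str], negatives: List[str]) -> float:
    """Largest gap, over outcome labels, between the share of positive examples
    and the share of negative examples that produce that label."""
    pos = [outcome(func, v) for v in positives]
    neg = [outcome(func, v) for v in negatives]
    return max(
        pos.count(label) / len(pos) - neg.count(label) / len(neg)
        for label in ("truthy", "falsy", "raised")
    )


def rank_functions(candidates: Dict[str, Callable[[str], object]],
                   positives: List[str],
                   negatives: List[str]) -> List[Tuple[str, float]]:
    """Rank candidate functions from most to least distinguishing."""
    scored = [(name, separation_score(func, positives, negatives))
              for name, func in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates, as if found by searching existing code.
candidates = {
    "parse_ip": ipaddress.ip_address,  # raises ValueError on non-addresses
    "is_digits": str.isdigit,
    "length": len,
}
positives = ["192.168.0.1", "10.0.0.7"]      # values of the target type
negatives = ["hello", "12345", "999.1.1.1"]  # values of other types

ranking = rank_functions(candidates, positives, negatives)
print(ranking[0][0])
```

    In this toy setup the top-ranked entry would be offered as the function suggestion; a real system would need a much richer notion of execution difference than three labels.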

    Determining a hierarchical concept tree using a large corpus of table values

    Publication number: US10789229B2

    Publication date: 2020-09-29

    Application number: US15621767

    Application date: 2017-06-13

    Abstract: A table corpus processing server identifies concepts within enterprise domain data. The table corpus processing server is configured to iteratively group values in a table corpus based on co-occurrence statistics to produce a candidate hierarchical tree. The candidate hierarchical tree is then summarized by selecting nodes that can best “describe” the original corpus, which leads to a small tree that often corresponds to desired concept hierarchies. The table corpus processing server employs a parallel dynamic programming approach that allows the disclosed embodiments to scale with the amount of enterprise domain data being analyzed.
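
    The iterative grouping step can be sketched as follows. This covers only the grouping of values by co-occurrence, not the tree summarization or the parallel dynamic programming mentioned in the abstract. Jaccard similarity over shared columns and the greedy pairwise merge are assumptions chosen for illustration, and all function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def column_index(columns: List[List[str]]) -> Dict[str, Set[int]]:
    """Map each value to the set of column ids in which it occurs."""
    index: Dict[str, Set[int]] = {}
    for col_id, column in enumerate(columns):
        for value in column:
            index.setdefault(value, set()).add(col_id)
    return index


def jaccard(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_forest(columns: List[List[str]], min_similarity: float = 0.0) -> list:
    """Greedily merge the two most co-occurring groups until none co-occur.

    Each tree is either a value (leaf) or a pair of subtrees."""
    clusters = sorted(column_index(columns).items())
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]),
        )
        if jaccard(clusters[i][1], clusters[j][1]) <= min_similarity:
            break
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [tree for tree, _ in clusters]


def leaves(tree) -> Set[str]:
    """Collect the values under a tree node."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


# Toy corpus: each inner list is one table column.
columns = [
    ["Seattle", "Portland", "Boston"],
    ["Seattle", "Boston"],
    ["Java", "Python"],
    ["Python", "Java", "Rust"],
]
forest = build_forest(columns)
for tree in forest:
    print(sorted(leaves(tree)))
```

    On this toy corpus the city names and the language names end up in separate trees because they never share a column.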

    Concept expansion using tables
    4.
    Granted patent

    Publication number: US10769140B2

    Publication date: 2020-09-08

    Application number: US14754318

    Application date: 2015-06-29

    Abstract: Concept expansion using tables, such as web tables, can return entities belonging to a concept based on an input of the concept and at least one seed entity that belongs to the concept. A concept expansion frontend can receive the concept and seed entity and provide them to a concept expansion framework. The concept expansion framework can expand the coverage of entities for concepts, including tail concepts, using tables by leveraging rich content signals corresponding to concept names. Such content signals can include content matching the concept that appears in captions, early headings, page titles, surrounding text, anchor text, and queries for which the page has been clicked. The concept expansion framework can use the structured entities in tables to infer exclusive tables. Such inference differs from previous label propagation methods and involves modeling a table-entity relationship. The table-entity relationship reduces semantic drift without using a reference ontology.

    Extensible data transformations
    5.
    Granted patent

    Publication number: US10706066B2

    Publication date: 2020-07-07

    Application number: US15295858

    Application date: 2016-10-17

    Abstract: Methods, computer systems, computer-storage media, and graphical user interfaces are provided for facilitating data transformations, according to embodiments of the present invention. In one embodiment, a set of example values is received. A repository of transformation tools is searched to identify a new transformation tool as relevant to a data transformation associated with the received set of example values. The repository includes annotations associated with the new transformation tool. The new transformation tool is used to generate a transformation program that produces transformed output values. Additional annotations are generated for the new transformation tool based on the transformed output values.

    DISCOVERING SCHEMA USING ANCHOR ATTRIBUTES
    6.
    Patent application

    Publication number: US20190325046A1

    Publication date: 2019-10-24

    Application number: US15957378

    Application date: 2018-04-19

    Abstract: Systems, methods, and computer-executable instructions for partitioning a data set include receiving anchor attributes of a data set. The data set includes records, with each record including attributes. A set of filter attributes that are not mutually exclusive with any of the anchor attributes is determined. A set of candidate attributes that includes each unique attribute from the data set, excluding the anchor attributes and the filter attributes, is determined. For each of the anchor attributes and the candidate attributes, an attribute context is determined. For each of the candidate attributes, a context similarity with each of the anchor attributes is determined. A new anchor attribute is selected from the set of candidate attributes based on the context similarity.
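
    The selection of a new anchor attribute can be sketched under some loudly stated assumptions: a record is modeled as a set of attribute names, the context of an attribute is taken to be the set of attributes that co-occur with it, filter attributes are read as those that co-occur with at least one anchor, and context similarity is Jaccard similarity. None of these choices is confirmed by the abstract; the function names and sample records are hypothetical.

```python
from typing import Dict, List, Optional, Set


def contexts(records: List[Set[str]]) -> Dict[str, Set[str]]:
    """Context of an attribute: every attribute seen alongside it in some record."""
    ctx: Dict[str, Set[str]] = {}
    for record in records:
        for attr in record:
            ctx.setdefault(attr, set()).update(record - {attr})
    return ctx


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_new_anchor(records: List[Set[str]], anchors: Set[str]) -> Optional[str]:
    """Pick the candidate whose context is most similar to the anchors' contexts."""
    ctx = contexts(records)
    # Assumed reading: attributes that co-occur with an anchor are filtered out.
    filters = set().union(*(ctx[a] for a in anchors)) - anchors
    candidates = set(ctx) - anchors - filters
    if not candidates:
        return None

    def mean_similarity(candidate: str) -> float:
        return sum(jaccard(ctx[candidate], ctx[a]) for a in anchors) / len(anchors)

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(candidates), key=mean_similarity)


# Hypothetical heterogeneous records (only attribute names matter here).
records = [
    {"ts", "host", "cpu_pct"},
    {"ts", "host", "mem_mb"},
    {"ts", "host", "mount", "disk_gb"},
    {"order_id", "amount"},
]
new_anchor = select_new_anchor(records, anchors={"cpu_pct"})
print(new_anchor)
```

    Here mem_mb is selected because it appears in exactly the same context (ts, host) as the existing anchor cpu_pct while never co-occurring with it.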

    FACILITATING DATA TRANSFORMATIONS
    7.
    Patent application

    Publication number: US20180081954A1

    Publication date: 2018-03-22

    Application number: US15271154

    Application date: 2016-09-20

    CPC classification number: G06F16/258 G06F16/245

    Abstract: Methods, computer systems, computer-storage media, and graphical user interfaces are provided for facilitating data transformations, according to embodiments of the present invention. In one embodiment, a set of example values is received, including example input values that indicate data values to be transformed and example output values that indicate a desired form in which to transform data. Based on the set of example values, a data transformation function that is relevant to the set of example values is identified. The data transformation function is used to generate a transformation program to transform the example input values to the desired form in which to transform data. A suggestion of the transformation program can be provided to a user device, wherein selection of the transformation program suggestion results in a data transformation.
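
    A transformation-by-example lookup of this kind can be sketched in a few lines. This is an illustrative simplification, assuming the repository is a plain dictionary of single-argument functions and that a function is "relevant" when it reproduces every example output exactly. The repository contents and the names us_date_to_iso and find_transformations are hypothetical.

```python
from datetime import datetime
from typing import Callable, Dict, List, Tuple


def us_date_to_iso(text: str) -> str:
    """Convert MM/DD/YYYY to YYYY-MM-DD; raises ValueError on other input."""
    return datetime.strptime(text, "%m/%d/%Y").strftime("%Y-%m-%d")


# Hypothetical repository of existing transformation functions.
REPOSITORY: Dict[str, Callable[[str], str]] = {
    "upper": str.upper,
    "strip": str.strip,
    "us_date_to_iso": us_date_to_iso,
}


def find_transformations(examples: List[Tuple[str, str]],
                         repository: Dict[str, Callable[[str], str]]) -> List[str]:
    """Return the names of functions that map every example input to its output."""
    matches = []
    for name, func in repository.items():
        try:
            if all(func(src) == dst for src, dst in examples):
                matches.append(name)
        except Exception:
            # A function that fails on an example is simply not a match.
            continue
    return matches


examples = [("03/22/2018", "2018-03-22"), ("12/01/2017", "2017-12-01")]
suggestions = find_transformations(examples, REPOSITORY)
print(suggestions)

# Applying the suggested function to a value that was not among the examples.
print(REPOSITORY[suggestions[0]]("07/04/2016"))
```

    A fuller system would compose several functions into a program and rank the candidates; the sketch stops at exact single-function matches.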

    Machine-learned predictive models and systems for data preparation recommendations

    Publication number: US11928564B2

    Publication date: 2024-03-12

    Application number: US17969377

    Application date: 2022-10-19

    Inventors: Yeye He, Cong Yan

    CPC classification number: G06N20/00 G06N5/02

    Abstract: Systems are provided for facilitating the building and use of models used to make data preparation recommendations. The systems identify ground truth from a plurality of notebooks and utilize the ground truth to generate the corresponding data preparation recommendation models. The data preparation recommendation models are used to predict accurate (e.g., useful and relevant) data preparation steps based on user input and user notebook data. The data preparation computing system generates a recommendation prompt based on output from the data preparation recommendation model that can be viewed and/or selected by the user to be applied to the user's notebook data.

    Leveraging a collection of training tables to accurately predict errors within a variety of tables

    Publication number: US11698892B2

    Publication date: 2023-07-11

    Application number: US17510327

    Application date: 2021-10-25

    Inventors: Yeye He, Pei Wang

    CPC classification number: G06F16/2282 G06F16/215 G06F17/18 G06N20/00

    Abstract: The present disclosure relates to systems, methods, and computer-readable media for using a variety of hypothesis tests to identify errors within tables and other structured datasets. For example, systems disclosed herein can generate a modified table from an input table by removing one or more entries from the input table. The systems disclosed herein can further leverage a collection of training tables to determine probabilities associated with whether the input table and modified table are drawn from the collection of training tables. The systems disclosed herein can additionally compare the probabilities to accurately determine whether the one or more entries include errors therein. The systems disclosed herein may apply to a variety of different sizes and types of tables to identify different types of common errors within input tables.
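
    The compare-before-and-after-removal idea can be sketched for a single numeric column. The test statistic (maximum absolute z-score), the smoothed tail probability, and the ratio threshold of 2.0 are all assumptions made for illustration; the abstract does not specify them. The training columns below are made-up toy data.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def max_z(column: List[float]) -> float:
    """Assumed test statistic: the largest absolute z-score in the column."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return 0.0
    return max(abs(v - mu) / sigma for v in column)


def tail_probability(stat: float, training_stats: List[float]) -> float:
    """Smoothed share of training columns at least as extreme as `stat`."""
    at_least = sum(s >= stat for s in training_stats)
    return (1 + at_least) / (1 + len(training_stats))


def find_error(column: List[float], training_columns: List[List[float]],
               ratio_threshold: float = 2.0) -> Optional[Tuple[int, float]]:
    """Return (index, ratio) for the entry whose removal makes the column look
    most like the training columns, or None if no removal helps enough."""
    training_stats = [max_z(c) for c in training_columns]
    p_full = tail_probability(max_z(column), training_stats)
    best: Optional[Tuple[int, float]] = None
    for i in range(len(column)):
        modified = column[:i] + column[i + 1:]  # input column minus one entry
        ratio = tail_probability(max_z(modified), training_stats) / p_full
        if ratio >= ratio_threshold and (best is None or ratio > best[1]):
            best = (i, ratio)
    return best


# Toy "training tables": columns assumed to be error-free.
training = [
    [10, 12, 11, 13, 12, 11],
    [100, 98, 103, 101, 99, 102],
    [5, 6, 5, 7, 6, 6],
    [50, 52, 49, 51, 50, 48],
]
suspect = [20, 22, 21, 2100, 19, 23]  # 2100 looks like a typo for 21
result = find_error(suspect, training)
print(result)
```

    With so few training columns the probabilities are very coarse; the point of the sketch is only the comparison between the original and the modified column.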

    Repairing data through domain knowledge

    Publication number: US10970271B2

    Publication date: 2021-04-06

    Application number: US16161695

    Application date: 2018-10-16

    Abstract: Correcting data in a dataset. A set of data tokens from a tabular data store are grouped into a plurality of different clusters based on similarity of tokens. A reference cluster is selected from among the plurality of different clusters such that the plurality of clusters includes a reference cluster and one or more other clusters. One or more tokens in the one or more other clusters are transformed. The effect on the reference cluster of adding the transformed tokens to the reference cluster is determined. Using this information, a correction for a token in the dataset is identified. The data store is updated to correct the token using the identified correction.
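
    The cluster-transform-check loop in this abstract can be sketched as follows. The sketch assumes that token similarity means an identical character-class signature, that the reference cluster is the largest one, and that a transformed token is acceptable when adding it leaves the reference cluster's signature unchanged. The transformation list is hypothetical, and the sketch returns proposed corrections rather than updating a data store.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List


def signature(token: str) -> str:
    """Character-class pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", token))


# Hypothetical domain-knowledge transformations.
TRANSFORMS: List[Callable[[str], str]] = [
    str.strip,
    lambda t: t.replace("O", "0"),  # letter O mistyped for zero
    lambda t: t.replace("/", "-"),
]


def propose_corrections(tokens: List[str]) -> Dict[str, str]:
    """Group tokens by signature, then try to move outliers into the largest group."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for token in tokens:
        clusters[signature(token)].append(token)
    reference = max(clusters, key=lambda sig: len(clusters[sig]))

    corrections: Dict[str, str] = {}
    for sig, members in clusters.items():
        if sig == reference:
            continue
        for token in members:
            for transform in TRANSFORMS:
                candidate = transform(token)
                # Accept the first transformation that fits the reference cluster.
                if signature(candidate) == reference:
                    corrections[token] = candidate
                    break
    return corrections


tokens = [
    "2021-01-05", "2021-02-17", "2021-03-09",
    "2021/04/11", " 2021-05-23", "2O21-06-30",
]
corrections = propose_corrections(tokens)
print(corrections)
```

    Tokens that no transformation can bring into the reference cluster are left without a proposed correction.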
