Adaptive match indexes
    1.
    Granted invention patent

    Publication No.: US11372928B2

    Publication date: 2022-06-28

    Application No.: US16775611

    Filing date: 2020-01-29

    Abstract: Determine first count of first records storing first value in first field, second count of second records storing second value in second field, third count of third records storing third value in third field. Determine count threshold using first, second and third counts, dispersion measure based on dispersion of values stored in second field by first records and other dispersion measure based on other dispersion of values stored in third field by first records. Train machine-learning model to determine dispersion measure threshold based on dispersion and other dispersion measures. If first count is greater than count threshold, and dispersion measure is greater than dispersion measure threshold, create match index based on first and second fields. Receive prospective record storing first value in first field, second value in second field. Use match index to identify record storing first value in first field, second value in second field as matching prospective record.
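The decision rule in this abstract can be sketched as follows. This is a minimal illustration, not the patented method: Shannon entropy stands in for the unspecified dispersion measure, both thresholds are passed in by hand rather than derived from counts or learned by a model, and the function names (`entropy`, `build_match_index`) and sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits; an assumed stand-in for the dispersion measure."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def build_match_index(records, first, second, first_value,
                      count_threshold, dispersion_threshold):
    """Build a composite (first, second) index only when first_value is both
    frequent and well dispersed over the second field; otherwise return None."""
    first_records = [r for r in records if r[first] == first_value]
    if len(first_records) <= count_threshold:
        return None
    if entropy([r[second] for r in first_records]) <= dispersion_threshold:
        return None
    index = defaultdict(list)
    for r in first_records:
        index[(r[first], r[second])].append(r)
    return dict(index)


records = [
    {"last": "Smith", "city": "Austin"},
    {"last": "Smith", "city": "Boston"},
    {"last": "Smith", "city": "Denver"},
    {"last": "Smith", "city": "Austin"},
    {"last": "Jones", "city": "Austin"},
]
# Thresholds are hand-picked here; the abstract derives or learns them.
index = build_match_index(records, "last", "city", "Smith",
                          count_threshold=3, dispersion_threshold=1.0)
prospective = {"last": "Smith", "city": "Boston"}
matches = index.get((prospective["last"], prospective["city"]), [])
```

With these sample records, "Smith" occurs four times and its cities have an entropy of 1.5 bits, so the composite index is created and the prospective record is found with a single dictionary lookup.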

    ADAPTIVE FIELD-LEVEL MATCHING
    2.
    Published patent application

    Publication No.: US20210342353A1

    Publication date: 2021-11-04

    Application No.: US16862667

    Filing date: 2020-04-30

    Abstract: Adaptive field-level matching is described. A system identifies first elements in a field of a prospective record for a database, and second elements in the field of a candidate record, in the database, for matching the prospective record. The system identifies features corresponding to any of the first elements that are identical to any of the second elements, any of the first elements that are absent from the second elements, and any of the second elements that are absent from the first elements. A machine-learning model uses the features to determine a field match score for the candidate record's field. Another machine-learning model weighs the field match score and weighs another field match score for another field of the candidate record to determine a record match score for the candidate record. If the record match score satisfies a threshold, the system identifies the candidate record as matching the prospective record.
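The three feature groups named in this abstract (shared elements, elements only in the prospective record, elements only in the candidate) can be sketched as below. The two machine-learning models are replaced by assumptions: a simple overlap ratio for the field score and hand-set field weights for the record score. All function names and sample values are hypothetical.

```python
def field_features(prospective_value, candidate_value):
    """Count elements that are shared, prospective-only, and candidate-only."""
    a = set(prospective_value.lower().split())
    b = set(candidate_value.lower().split())
    return {
        "shared": len(a & b),
        "only_prospective": len(a - b),
        "only_candidate": len(b - a),
    }


def field_score(features):
    """Assumed placeholder for the learned field model: an overlap ratio."""
    total = sum(features.values())
    return features["shared"] / total if total else 0.0


def record_score(prospective, candidate, field_weights):
    """Assumed placeholder for the second model: a weighted mean of field scores."""
    weighted = sum(
        weight * field_score(field_features(prospective[f], candidate[f]))
        for f, weight in field_weights.items()
    )
    return weighted / sum(field_weights.values())


prospective = {"name": "Acme Corp", "city": "San Jose"}
candidate = {"name": "Acme Corporation", "city": "San Jose"}
score = record_score(prospective, candidate, {"name": 2.0, "city": 1.0})
is_match = score >= 0.5  # the threshold value is an assumption
```

In a learned version, the feature dictionary would be the model input, so that a missing element and a conflicting element can be penalized differently per field.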

    Machine-learnt field-specific tokenization
    3.
    Granted invention patent

    Publication No.: US11163740B2

    Publication date: 2021-11-02

    Application No.: US16525945

    Filing date: 2019-07-30

    Abstract: A training set is created via creating adjacent classified substrings by using character classes to replace corresponding characters in adjacent substrings in each training character string, and associating each pair of adjacent classified substrings and each pair of adjacent substrings with corresponding labels indicating whether corresponding pairs include any token boundary. The system splits input character string into beginning and ending parts and creates classified beginning part by replacing beginning part character with corresponding class and classified ending part by replacing ending part character with corresponding class. The machine-learning model determines probability of token identification, based on training set to determine count of instances that classified beginning part is paired with classified ending part and count of corresponding labels that indicate inclusion of any token boundary. If token identification probability satisfies threshold, the system identifies beginning part as token and ending part as remainder of input character string.
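The counting scheme in this abstract can be sketched in simplified form. The assumptions are significant: each "part" is reduced to a single character on either side of a candidate boundary, the four character classes (`D`, `A`, `S`, `P`) are invented for illustration, and the training data is a hypothetical list of pre-tokenized strings.

```python
from collections import defaultdict


def char_class(ch):
    """Map a character to an assumed class: digit, alpha, space, or other."""
    if ch.isdigit():
        return "D"
    if ch.isalpha():
        return "A"
    if ch.isspace():
        return "S"
    return "P"


def train(tokenized_examples):
    """Count, per pair of adjacent character classes, how often the pair was
    seen and how often a token boundary fell between the two characters."""
    seen = defaultdict(int)
    boundary = defaultdict(int)
    for tokens in tokenized_examples:
        text = "".join(tokens)
        cuts, pos = set(), 0
        for token in tokens[:-1]:
            pos += len(token)
            cuts.add(pos)
        for i in range(1, len(text)):
            pair = (char_class(text[i - 1]), char_class(text[i]))
            seen[pair] += 1
            if i in cuts:
                boundary[pair] += 1
    return seen, boundary


def tokenize(text, model, threshold=0.5):
    """Split wherever the estimated boundary probability meets the threshold."""
    seen, boundary = model
    tokens, start = [], 0
    for i in range(1, len(text)):
        pair = (char_class(text[i - 1]), char_class(text[i]))
        count = seen.get(pair, 0)
        if count and boundary.get(pair, 0) / count >= threshold:
            tokens.append(text[start:i])
            start = i
    tokens.append(text[start:])
    return tokens


model = train([["123", " ", "Main", " ", "St"], ["45", "B"]])
tokens = tokenize("9A Elm", model)
```

Because the model keys on character classes rather than literal characters, it generalizes from "45B" to the unseen "9A": the digit-to-letter transition is what carries the boundary evidence.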

    MACHINE-LEARNT FIELD-SPECIFIC STANDARDIZATION
    4.
    Published patent application

    Publication No.: US20210034638A1

    Publication date: 2021-02-04

    Application No.: US16528175

    Filing date: 2019-07-31

    Abstract: A system tokenizes raw values and corresponding standardized values into raw token sequences and corresponding standardized token sequences. A machine-learning model learns standardization from token insertions and token substitutions that modify the raw token sequences to match the corresponding standardized token sequences. The system tokenizes an input value into an input token sequence. The machine-learning model determines a probability of inserting an insertion token after an insertion markable token in the input token sequence. If the probability of inserting the insertion token satisfies a threshold, the system inserts the insertion token after the insertion markable token in the input token sequence. The machine-learning model determines a probability of substituting a substitution token for a substitutable token in the input token sequence. If the probability of substituting the substitution token satisfies another threshold, the system substitutes the substitution token for the substitutable token in the input token sequence.
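The substitution half of this abstract can be sketched with simple counting. This is a deliberately reduced illustration: it assumes whitespace tokenization, handles only raw/standardized pairs of equal token length (so token insertions are skipped entirely), and estimates the substitution probability as a relative frequency. The function names and training pairs are hypothetical.

```python
from collections import Counter, defaultdict


def learn_substitutions(pairs):
    """Count which standardized token each raw token was aligned with."""
    counts = defaultdict(Counter)
    for raw, standardized in pairs:
        raw_tokens, std_tokens = raw.split(), standardized.split()
        if len(raw_tokens) != len(std_tokens):
            continue  # pairs that need token insertions are out of scope here
        for r, s in zip(raw_tokens, std_tokens):
            counts[r.lower()][s] += 1
    return counts


def standardize(value, counts, threshold=0.6):
    """Substitute a token when its most frequent replacement is likely enough."""
    out = []
    for token in value.split():
        seen = counts.get(token.lower())
        if seen:
            best, n = seen.most_common(1)[0]
            if n / sum(seen.values()) >= threshold:
                out.append(best)
                continue
        out.append(token)
    return " ".join(out)


counts = learn_substitutions([
    ("12 main st", "12 Main Street"),
    ("4 oak st", "4 Oak Street"),
    ("st louis", "St Louis"),
])
result = standardize("7 elm st", counts)
```

Here "st" was standardized to "Street" in two of three training occurrences, so it is substituted at the default threshold of 0.6 but left alone at a stricter threshold such as 0.7.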

    Recommending data providers' datasets based on database value densities
    5.
    Granted invention patent

    Publication No.: US10817479B2

    Publication date: 2020-10-27

    Application No.: US15631306

    Filing date: 2017-06-23

    Abstract: Recommending data providers' datasets based on database value densities is described. A database system determines a provider dataset density for a value by identifying a frequency of the value in a dataset that is provided by a data provider. The database system determines a user database density for the value by identifying a frequency of the value in a database used by a data user. The database system determines a relative density based on a relationship between the provider dataset density and the user database density. The database system determines an evaluation metric for the value, based on a combination of the relative density and the user database density. The database system causes a recommendation to be outputted, based on a relationship of the evaluation metric relative to other evaluation metrics for other values, which recommends that the data user acquire at least a part of the dataset.

    Match index creation
    6.
    Granted invention patent

    Publication No.: US10817465B2

    Publication date: 2020-10-27

    Application No.: US15496905

    Filing date: 2017-04-25

    Abstract: A system identifies a first number of distinct values stored in a first field by a dataset of records. The system identifies a second number of distinct values stored in a second field by the dataset of records. The system creates a trie from values stored in a field by multiple records, the field corresponding to the first field or the second field, based on comparing the first number to the second number. The system associates a node in the trie with one of the multiple records, based on a value stored in the field by the record. The system identifies a branch sequence in the trie as a key for a prospective record, based on a prospective value stored in a corresponding field by the prospective record. The system uses the key for the prospective record to identify one of the multiple records that matches the prospective record.
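A trie-based match index along these lines can be sketched as follows. The abstract does not say which direction the distinct-value comparison goes, so choosing the field with more distinct values is an assumption, as are the nested-dictionary node layout, the case folding, and all names and sample records.

```python
def pick_field(records, first, second):
    """Assumed rule: index the field that has more distinct values."""
    n_first = len({r[first] for r in records})
    n_second = len({r[second] for r in records})
    return first if n_first >= n_second else second


def new_node():
    return {"children": {}, "records": []}


def build_trie(records, field):
    """Insert each record under the character path of its field value."""
    root = new_node()
    for r in records:
        node = root
        for ch in r[field].lower():
            node = node["children"].setdefault(ch, new_node())
        node["records"].append(r)
    return root


def lookup(trie, value):
    """Follow the branch sequence for value; return records stored at its end."""
    node = trie
    for ch in value.lower():
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["records"]


records = [
    {"name": "Ann", "state": "CA"},
    {"name": "Anna", "state": "CA"},
    {"name": "Bob", "state": "NY"},
]
field = pick_field(records, "name", "state")
trie = build_trie(records, field)
found = lookup(trie, "ANNA")
```

The branch sequence a-n-n-a acts as the key for the prospective value, so matching costs one step per character regardless of how many records are indexed.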

    Rule set induction
    7.
    Granted invention patent

    Publication No.: US10552744B2

    Publication date: 2020-02-04

    Application No.: US15368173

    Filing date: 2016-12-02

    Abstract: System receives inputs, each input associated with a label and having features, creates a rule for each feature, each rule including a feature and a label, each rule stored in a hierarchy, and distributes each rule into a partition associated with a label or another partition associated with another label. System identifies a number of inputs that include a feature for a rule in the rule partition, and identifies another number of inputs that include both the feature for the rule and another feature for another rule in the rule partition. System deletes the rule from the hierarchy if the ratio of the other number of inputs to the number of inputs satisfies a threshold and an additional number of inputs that includes the other antecedent feature is at least as much as the number. System predicts a label for an input including features by applying each remaining rule to the input.
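The pruning condition in this abstract can be sketched in a flattened form. The rule hierarchy is omitted, each feature gets one rule for its majority label, and a rule is dropped when another rule for the same label covers nearly all of its inputs and has at least as much support. The majority-label choice, the voting in `predict`, and all names and data are assumptions.

```python
from collections import Counter, defaultdict


def induce_rules(inputs, threshold=0.9):
    """inputs: list of (feature_set, label). Returns {feature: label} after
    deleting rules made redundant by a co-occurring, better-supported rule."""
    label_counts = defaultdict(Counter)
    for features, label in inputs:
        for f in features:
            label_counts[f][label] += 1
    rules = {f: c.most_common(1)[0][0] for f, c in label_counts.items()}
    support = {f: sum(c.values()) for f, c in label_counts.items()}
    kept = dict(rules)
    for f in sorted(rules):
        for g in sorted(rules):
            if f == g or g not in kept or rules[f] != rules[g]:
                continue
            both = sum(1 for feats, _ in inputs if f in feats and g in feats)
            if both / support[f] >= threshold and support[g] >= support[f]:
                del kept[f]
                break
    return kept


def predict(rules, features):
    """Apply every remaining rule whose feature is present; majority vote."""
    votes = Counter(rules[f] for f in features if f in rules)
    return votes.most_common(1)[0][0] if votes else None


inputs = [
    ({"free", "winner"}, "spam"),
    ({"free", "winner", "cash"}, "spam"),
    ({"free"}, "spam"),
    ({"meeting"}, "ham"),
    ({"meeting", "agenda"}, "ham"),
]
rules = induce_rules(inputs)
label = predict(rules, {"free", "lunch"})
```

In this sample, "winner" and "cash" never appear without "free", and "agenda" never appears without "meeting", so only the two broader rules survive.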

    SEARCH QUERY RESULT SET COUNT ESTIMATION
    8.
    Published patent application

    Publication No.: US20190236475A1

    Publication date: 2019-08-01

    Application No.: US15882800

    Filing date: 2018-01-29

    Abstract: Search query result set count estimation is described. A system parses data set query that includes first query attribute and second query attribute. The system identifies first hierarchy of connected nodes including a first node representing a first query attribute, and a second hierarchy of other connected nodes including a second node representing a second query attribute. The system identifies a directed arc connecting first correlated node in first hierarchy to second correlated node in second hierarchy. The system identifies cross-hierarchy probabilities of correlations between values of a first attribute represented by the first correlated node and values of a second attribute represented by the second correlated node. The system outputs query result set estimated count generated from cross-hierarchy probabilities, probabilities that values of first attribute are associated with values corresponding to first node, and probabilities that values of second attribute are associated with values corresponding to second node.

    System and method for mapping source columns to target columns
    9.
    Granted invention patent
    Status: In force

    Publication No.: US08972336B2

    Publication date: 2015-03-03

    Application No.: US13773286

    Filing date: 2013-02-21

    Abstract: A system and method for mapping columns from a source file to a target file. The header for each source column is evaluated heuristically to see if the header matches a predefined entity. The contents of a group of cells in the source column are evaluated probabilistically to determine a probability that the cell contents correspond to at least one of the predefined entities. A score is assigned to the likelihood that the column corresponds to one or more predefined entities. If the score meets a threshold, then the correspondence between the source column and one or more predefined entities is mapped. If the score fails to meet the threshold, then the correspondence between the source column and one or more undefined entities is mapped. Finally, each source column is transformed into a target column in accord with the map.
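The header-plus-contents scoring described in this abstract can be sketched as below. Everything concrete here is an assumption made for illustration: the `ENTITIES` table, its header aliases and regular expressions, the 0.4/0.6 weighting between header and content evidence, and the 0.5 threshold.

```python
import re

# Hypothetical predefined entities: header aliases plus a content pattern.
ENTITIES = {
    "email": {
        "aliases": {"email", "e-mail", "email address"},
        "pattern": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    },
    "phone": {
        "aliases": {"phone", "tel", "telephone"},
        "pattern": re.compile(r"^\+?[\d\s().-]{7,}$"),
    },
}


def score_column(header, cells, entity):
    """Combine a heuristic header check with the fraction of matching cells."""
    spec = ENTITIES[entity]
    header_hit = 1.0 if header.strip().lower() in spec["aliases"] else 0.0
    sample = [c.strip() for c in cells if c.strip()]
    if sample:
        content = sum(bool(spec["pattern"].match(c)) for c in sample) / len(sample)
    else:
        content = 0.0
    return 0.4 * header_hit + 0.6 * content  # assumed weights


def map_column(header, cells, threshold=0.5):
    """Map to the best-scoring entity, or to 'undefined' below the threshold."""
    best = max(ENTITIES, key=lambda e: score_column(header, cells, e))
    if score_column(header, cells, best) >= threshold:
        return best
    return "undefined"


mapping = {
    "Contact": map_column("Contact", ["a@b.com", "c@d.org"]),
    "Phone": map_column("Phone", ["555-123-4567", "n/a"]),
    "Notes": map_column("Notes", ["hello", "world"]),
}
```

Splitting the score this way lets an unhelpful header such as "Contact" still map correctly when the cell contents are unambiguous, while a column with neither kind of evidence falls through to an undefined entity.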

