Abstract:
Some embodiments of the present invention include a method for identifying duplicate records from a group of records in a database system. The method includes generating a cluster of records from a group of records based on one or more keys; splitting the cluster of records into multiple subsets of records, with each subset having fewer records than the cluster of records, wherein the splitting of the cluster of records into multiple subsets of records is based on a number of records in the cluster of records exceeding a threshold; causing duplicate sets of records in each of the subsets of records to be identified, wherein a duplicate set of records includes one or more records, and wherein when a duplicate set of records includes two or more records, the two or more records are duplicates of one another; merging all of the duplicate sets of records identified from the multiple subsets of records to form a first group of duplicate sets of records; and forming a representative set of records by selecting a representative record from each of the duplicate sets in the first group of duplicate sets of records.
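The steps in this abstract can be sketched as follows. This is a minimal illustration, not the claimed method: the function names, the greedy grouping used to find duplicate sets, the fixed-size chunking used to split a cluster, and the choice of the first record as the representative are all assumptions made for the example.

```python
from collections import defaultdict

def find_duplicate_sets(records, is_duplicate):
    """Group records greedily: each record joins the first set whose
    first member it duplicates, otherwise it starts a new set."""
    sets = []
    for rec in records:
        for dup_set in sets:
            if is_duplicate(dup_set[0], rec):
                dup_set.append(rec)
                break
        else:
            sets.append([rec])
    return sets

def representative_records(records, key, is_duplicate, threshold):
    # 1. Cluster records by key.
    clusters = defaultdict(list)
    for rec in records:
        clusters[key(rec)].append(rec)

    representatives = []
    for cluster in clusters.values():
        # 2. Split only clusters larger than the threshold (assumed: fixed-size chunks).
        if len(cluster) > threshold:
            subsets = [cluster[i:i + threshold]
                       for i in range(0, len(cluster), threshold)]
        else:
            subsets = [cluster]
        # 3. Find duplicate sets per subset and merge them into one group.
        merged = []
        for subset in subsets:
            merged.extend(find_duplicate_sets(subset, is_duplicate))
        # 4. Pick one representative per duplicate set (assumed: the first record).
        representatives.extend(dup_set[0] for dup_set in merged)
    return representatives

def normalize(s):
    return " ".join(s.lower().split())

records = ["ann smith", "Ann Smith", "ann  smith", "al lee", "bob jones", "Bob Jones"]
reps = representative_records(
    records,
    key=lambda r: r[0].lower(),
    is_duplicate=lambda a, b: normalize(a) == normalize(b),
    threshold=2,
)
```

In this run, "ann  smith" survives as a representative because it landed in a different subset from its duplicates. Duplicates that straddle subsets are only caught by a further pass over the representative set.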
Abstract:
A system and method for associating a character string with one or more defined entities of a contact record. An input character string is received. The string is first evaluated to determine whether its structure is recognized. If not, the string is compared to entries in a lookup table. If the structure is not recognized and the string is not found in the lookup table, a posterior probability is calculated for each of a set of defined entities over a limited set of string processing features. The result of this probabilistic scoring determines which of the defined entities to associate with the character string.
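The three-stage cascade described here (structure check, then lookup table, then posterior scoring) might look like the sketch below. The regular expressions, lookup entries, features, priors, and likelihoods are invented for illustration, and the posterior is computed with a naive Bayes assumption that the abstract does not state.

```python
import math
import re

# Hypothetical structure patterns, lookup entries, and model parameters.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,}$"),
}
LOOKUP = {"ceo": "title", "acme corp": "company"}
PRIORS = {"name": 0.5, "company": 0.3, "title": 0.2}
LIKELIHOODS = {  # P(feature is true | entity)
    "name": {"has_digit": 0.01, "two_tokens": 0.8},
    "company": {"has_digit": 0.2, "two_tokens": 0.4},
    "title": {"has_digit": 0.05, "two_tokens": 0.3},
}

def features(s):
    return {
        "has_digit": any(c.isdigit() for c in s),
        "two_tokens": len(s.split()) == 2,
    }

def classify(s):
    # Stage 1: is the structure of the string recognized?
    for entity, pattern in PATTERNS.items():
        if pattern.match(s):
            return entity
    # Stage 2: is the string in the lookup table?
    if s.lower() in LOOKUP:
        return LOOKUP[s.lower()]
    # Stage 3: unnormalized log posterior per entity; pick the largest.
    feats = features(s)
    log_posterior = {}
    for entity, prior in PRIORS.items():
        lp = math.log(prior)
        for name, present in feats.items():
            p = LIKELIHOODS[entity][name]
            lp += math.log(p if present else 1.0 - p)
        log_posterior[entity] = lp
    return max(log_posterior, key=log_posterior.get)
```

For example, `classify("jane@example.com")` is resolved by the structure check, `classify("CEO")` by the lookup table, and `classify("Jane Doe")` only by the probabilistic stage.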
Abstract:
Recommending data providers' datasets based on database value densities is described. A database system determines a provider dataset density for a value by identifying a frequency of the value in a dataset that is provided by a data provider. The database system determines a user database density for the value by identifying a frequency of the value in a database used by a data user. The database system determines a relative density based on a relationship between the provider dataset density and the user database density. The database system determines an evaluation metric for the value, based on a combination of the relative density and the user database density. The database system causes a recommendation to be outputted, based on a relationship of the evaluation metric relative to other evaluation metrics for other values, which recommends that the data user acquire at least a part of the dataset.
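As a rough illustration of the density calculation, the sketch below treats a density as a relative frequency and the relative density as the ratio of provider density to user density. The abstract does not say how the evaluation metric combines the relative density with the user database density. The weighting used here (user density times the log of the relative density) and the rule of recommending only on a positive top metric are assumptions chosen for the example.

```python
import math
from collections import Counter

def densities(values):
    """Relative frequency of each value in a collection."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def evaluation_metrics(provider_values, user_values):
    provider = densities(provider_values)
    user = densities(user_values)
    metrics = {}
    for value, user_density in user.items():
        provider_density = provider.get(value, 0.0)
        if provider_density == 0.0:
            continue  # the provider has nothing to offer for this value
        relative_density = provider_density / user_density
        # Assumed combination: weight the log-ratio by how much the user cares.
        metrics[value] = user_density * math.log(relative_density)
    return metrics

def recommend(provider_values, user_values):
    """Return the value that best justifies acquiring the dataset, or None."""
    metrics = evaluation_metrics(provider_values, user_values)
    if not metrics:
        return None
    best = max(metrics, key=metrics.get)
    return best if metrics[best] > 0 else None

provider_values = ["CA"] * 6 + ["NY"] * 3 + ["TX"]
user_values = ["CA"] * 2 + ["NY"] * 6 + ["TX"] * 2
best_value = recommend(provider_values, user_values)
```

Here "CA" is three times denser in the provider's dataset than in the user's database, so it is the only value with a positive metric.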
Abstract:
An attempt by a user to log in from a source server to a destination server is identified. A destination score is determined based on the count of attempts by the user to log in to the destination server and the count of attempts by the user to log in to all destination servers. A source given destination score is determined based on the count of attempts by the user to log in from the source server to the destination server, and the count of attempts by the user to log in to the destination server. An outlier score is determined based on values associated with the destination score and the source given destination score. An alert is output if the outlier score satisfies a threshold.
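The two scores can be read as estimates of P(destination | user) and P(source | destination, user). The sketch below makes that reading concrete. The class name, the choice to count the current attempt before scoring, and the outlier score defined as the negative log of the product are assumptions, since the abstract only says the outlier score is based on values associated with the two scores.

```python
import math
from collections import Counter

class LoginMonitor:
    def __init__(self, threshold):
        self.threshold = threshold
        self.by_user = Counter()           # attempts per user, all destinations
        self.by_user_dest = Counter()      # attempts per (user, destination)
        self.by_user_src_dest = Counter()  # attempts per (user, source, destination)

    def observe(self, user, source, destination):
        """Record a login attempt and return (outlier_score, alert)."""
        self.by_user[user] += 1
        self.by_user_dest[(user, destination)] += 1
        self.by_user_src_dest[(user, source, destination)] += 1

        destination_score = (self.by_user_dest[(user, destination)]
                             / self.by_user[user])
        source_given_destination = (
            self.by_user_src_dest[(user, source, destination)]
            / self.by_user_dest[(user, destination)])

        # Assumed combination: rarer (source, destination) pairs score higher.
        outlier = -math.log(destination_score * source_given_destination)
        return outlier, outlier >= self.threshold

monitor = LoginMonitor(threshold=2.0)
routine = [monitor.observe("alice", "web1", "db1") for _ in range(9)]
score, alert = monitor.observe("alice", "laptop7", "db9")
```

After nine routine logins to the same destination, a first login to a new destination accounts for one tenth of the user's attempts, which gives a score of ln(10) and triggers the alert at this threshold.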
Abstract:
User scores based on bulk record updates are described. A system receives record updates submitted by a user. The system subtracts a penalty debit from a user score, which corresponds to the user, for each record which corresponds to at least one of the record updates and which is removed from purchasing availability. The system adds a full credit to the user score for each record which corresponds to at least one of the record updates and which is purchased. The system adds a partial credit to the user score for each record which corresponds to at least one of the record updates and which is yet to be purchased and which is yet to be removed from purchasing availability, wherein the partial credit is a positive value that is less than the full credit. The system enables the user to access records, based on the user score.
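The scoring rule can be sketched in a few lines. The status labels, the credit and penalty amounts, and the `can_access` threshold check are hypothetical. The abstract fixes only that the partial credit is positive and smaller than the full credit, and that each record counts once however many updates touch it.

```python
def user_score(updated_record_ids, status_by_record,
               full_credit=1.0, partial_credit=0.25, penalty=1.0):
    """Score a user's bulk updates; each updated record is counted once."""
    assert 0 < partial_credit < full_credit
    score = 0.0
    for record_id in set(updated_record_ids):
        status = status_by_record[record_id]
        if status == "removed":        # removed from purchasing availability
            score -= penalty
        elif status == "purchased":
            score += full_credit
        else:                          # not yet purchased and not yet removed
            score += partial_credit
    return score

def can_access(score, minimum_score=0.0):
    # Assumed access policy: a simple minimum score.
    return score >= minimum_score

updates = ["r1", "r1", "r2", "r3", "r4"]  # r1 was updated twice
statuses = {"r1": "purchased", "r2": "removed", "r3": "pending", "r4": "pending"}
score = user_score(updates, statuses)
```

With these values the score is 1.0 - 1.0 + 0.25 + 0.25 = 0.5.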
Abstract:
A system creates three tries based on the values stored in the first three fields of records. The system associates a node in the third trie with a record, based on the value stored in the record's third field. The system associates the node with a first dispersion measure, based on the values stored in the first field of the records associated with the node, and with a second dispersion measure, based on the values stored in the second field of those records. The system identifies a branch sequence in the third trie as a key for a prospective record, based on the value stored in the prospective record's third field. The system uses the key to identify a subset of records that match the prospective record. If the count of records in the subset exceeds a threshold, the system identifies another branch sequence in the first trie or the second trie as another key for the prospective record, based on the first dispersion measure and the second dispersion measure. The system uses the key and the other key to identify at least one record that matches the prospective record.
Abstract:
A system tokenizes the values stored in records' fields and creates a trie from the tokenized values, with each branch labeled with a tokenized value and each node storing a count of the records associated with the tokenized value sequence that begins at the trie root. The system tokenizes the value stored in a record's field and identifies nodes, beginning from the trie root, that correspond to the token value sequence of the tokenized value, until it identifies a node that stores a count less than a node threshold. The system identifies the branch sequence comprising each identified node as the record's key, associates the key with the node storing the count less than the node threshold, and associates the record with the key. The system tokenizes a prospective value stored in a prospective record's field and identifies nodes, beginning from the trie root, that correspond to another token value sequence of the tokenized prospective value, until it identifies another node that stores another count less than the node threshold. The system identifies the other node's key as the prospective record's key, and identifies an existing record that matches the prospective record by using the prospective record's key.
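A compact sketch of this count-bounded trie key follows. The whitespace tokenizer, the class and method names, and two fallbacks are assumptions: a value whose nodes all meet the threshold uses its full token sequence as the key, and a prospective value that leaves the trie gets no key.

```python
from collections import defaultdict

class Node:
    def __init__(self):
        self.count = 0      # records whose token sequence passes through this node
        self.children = {}  # token -> Node

def tokenize(value):
    return value.lower().split()

class TrieIndex:
    def __init__(self, values, node_threshold):
        self.root = Node()
        self.node_threshold = node_threshold
        self.records_by_key = defaultdict(list)
        # Build the trie and the per-node counts first ...
        for value in values:
            node = self.root
            for token in tokenize(value):
                node = node.children.setdefault(token, Node())
                node.count += 1
        # ... then assign each record its key.
        for value in values:
            self.records_by_key[self.key_for(value)].append(value)

    def key_for(self, value):
        """Walk from the root until a node's count drops below the threshold."""
        node, key = self.root, []
        for token in tokenize(value):
            node = node.children.get(token)
            if node is None:
                return None  # assumed: unseen token sequence yields no key
            key.append(token)
            if node.count < self.node_threshold:
                break
        return tuple(key)

    def candidates(self, prospective_value):
        return self.records_by_key.get(self.key_for(prospective_value), [])

index = TrieIndex(
    ["Acme Corp East", "Acme Corp West", "Acme Corp West Division", "Bolt Ltd"],
    node_threshold=2,
)
```

Because "bolt" is already rare, `index.candidates("Bolt Incorporated")` stops at the first token and returns the "Bolt Ltd" record. Values under the common "acme corp" prefix need more tokens before the count falls below the threshold.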
Abstract:
Transforming columns from source files to target files is described. A system associates a source column in a source file with an entity of multiple entities associated with target columns comprising a target file, based on a first set of features that describes contents of cells of a first source column that is adjacent to the source column, a second set of features that describes contents of cells of a second source column that is adjacent to the source column, and a third set of features that describes contents of cells of the source column. The system creates a mapping of the source column to a target column associated with the entity, and transforms the mapped source column to the target column in accord with the mapping.
Abstract:
Client-server hybrid A.I. scores for customized actions are described. A client generates client scores corresponding to client customized actions by applying a user-specific model to an action received from a user, the user-specific model based on at least one historical action received from the user. The client requests a server to provide server scores corresponding to server customized actions by applying a cross-user model to the action received from the user, the cross-user model based on historical actions associated with server users. The client generates hybrid scores corresponding to hybrid customized actions by combining the client scores with the server scores, in response to receiving the server scores from the server. The client causes the hybrid customized actions to be outputted based on the corresponding hybrid scores.
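The combining step might be sketched as below. The abstract does not specify how client and server scores are combined. The weighted average, the default score of 0.0 for an action that only one side scored, and the function names are assumptions, and the models themselves are replaced by fixed example scores.

```python
def hybrid_scores(client_scores, server_scores, client_weight=0.5):
    """Combine per-action scores from the user-specific (client) model
    and the cross-user (server) model with an assumed weighted average."""
    actions = set(client_scores) | set(server_scores)
    return {
        action: client_weight * client_scores.get(action, 0.0)
        + (1.0 - client_weight) * server_scores.get(action, 0.0)
        for action in actions
    }

def ranked_actions(scores):
    """Order actions by descending score, breaking ties alphabetically."""
    return sorted(scores, key=lambda action: (-scores[action], action))

client = {"reply": 0.9, "archive": 0.2}  # example output of a user-specific model
server = {"reply": 0.3, "forward": 0.8}  # example output of a cross-user model
combined = hybrid_scores(client, server)
order = ranked_actions(combined)
```

An action scored by only one side, such as "forward", still appears in the hybrid ranking but is discounted by the missing side's weight.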
Abstract:
A system receives inputs, each associated with a label and having features, and creates a rule for each feature, each rule including a feature and a label and stored in a hierarchy. The system distributes each rule into a partition associated with a label or into another partition associated with another label. The system identifies a number of inputs that include the feature for a rule in a rule partition, and identifies another number of inputs that include both the feature for the rule and another feature for another rule in the same rule partition. The system deletes the rule from the hierarchy if the ratio of the other number of inputs to the number of inputs satisfies a threshold and an additional number of inputs that include the other feature is at least as large as the number of inputs. The system predicts a label for an input that includes features by applying each remaining rule to the input.
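The pruning condition can be illustrated with single-feature rules. In this sketch a rule is a (feature, label) pair, counts are taken over all inputs, a rule is only pruned against a rule that is still kept, and prediction is a majority vote of the surviving rules. Each of these choices is an assumption, since the abstract leaves them open, and the flat rule set here stands in for the hierarchy.

```python
from collections import Counter, defaultdict

def prune_rules(inputs, threshold):
    """inputs: list of (set_of_features, label). Returns the surviving rules."""
    # One rule per (feature, label) pair, partitioned by label.
    partitions = defaultdict(set)
    for feats, label in inputs:
        for f in feats:
            partitions[label].add(f)

    def count(*wanted):
        # Number of inputs that include every wanted feature.
        return sum(1 for feats, _ in inputs if all(f in feats for f in wanted))

    kept = {(f, label) for label, feats in partitions.items() for f in feats}
    for label, feats in partitions.items():
        for f in sorted(feats):
            n = count(f)
            for g in sorted(feats):
                if g == f or (g, label) not in kept:
                    continue
                # Delete f's rule if g nearly always accompanies f and g is
                # at least as frequent, so g's rule already covers f's inputs.
                if count(f, g) / n >= threshold and count(g) >= n:
                    kept.discard((f, label))
                    break
    return kept

def predict(rules, feats):
    votes = Counter(label for f, label in rules if f in feats)
    return votes.most_common(1)[0][0] if votes else None

inputs = [
    ({"free", "winner"}, "spam"), ({"free", "winner"}, "spam"), ({"free"}, "spam"),
    ({"meeting"}, "ham"), ({"meeting", "agenda"}, "ham"),
]
rules = prune_rules(inputs, threshold=1.0)
```

With a threshold of 1.0, the rules for "winner" and "agenda" are deleted because "free" and "meeting" appear in every input that contains them.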