摘要:
A computer system and method for determining whether the subject matter described in a received document is substantially similar to the subject matter of other documents in a document corpus, such that the received document can be considered a duplicate document. After receiving a first document, a set of tokens for the first document is generated. A non-fielded relevance search on a token index is executed. The relevance search returns a set of candidate duplicate documents with scores corresponding to each candidate document. For each candidate document with a score above a threshold, filtering is performed on each candidate document to determine whether each candidate document is a true duplicate of the first document. A set of candidate documents with a score above the threshold that were not disqualified as candidate documents is then provided.
摘要:
According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to a source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results set is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents are stored or displayed.
摘要:
A system and method for determining the likelihood of two documents describing substantially similar subject matter is presented. A set of tokens for each of two documents is obtained, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising a token from each set of tokens. For each token pair in the matrix, a similarity score is determined. Those token pairs in the matrix with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describing substantially similar subject matter.
摘要:
Normalized data models are programmatically generated from a combination of product configuration model data, product configuration engine runtime validation, normalized data mappings, and settings files declaring the scope of model content. A master model generation process effectively transforms conventional configuration data into normalized configuration data. The normalized configuration data allows a user to, for example, conduct comparative product configurations. In one embodiment, a normalized model generation process generates normalized data model representing attributes and normalized features of a product. In one embodiment, the normalized configuration data model is then added to in-memory data structures used during runtime contextual configuration analysis, thus reducing the total number of data items preserved as efficiencies result from eliminating duplication and effective use of search structures. In-memory representation of the normalized configuration data model can then be serialized to disk as a file to be loaded for runtime use in a deployment.
摘要:
According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document, is provided. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results set is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents are stored or displayed.
摘要:
A method is disclosed that includes identifying an inventory item corresponding to a product configuration. The product configuration is defined using a feature map. The inventory item is also defined using the feature map. Each entry of the feature map corresponds to one of a number of features of a product.
摘要:
A method is disclosed that includes identifying an inventory item corresponding to a product configuration. The product configuration is defined using a feature map. The inventory item is also defined using the feature map. Each entry of the feature map corresponds to one of a number of features of a product.
摘要:
Systems and methods for determining a set of variation-phrases from a collection of documents in a document corpus is presented. Potential variation-phrase pairs among the various documents in the document corpus are identified. The identified potential variation-phrase pairs are then added to a variation-phrase set. The potential variation-phrase pairs in the variation-phrase set are filtered to remove those potential variation-phrase pairs that do not satisfy a predetermined criteria. After filtering the variation-phrase set, the resulting variation-phrase set is stored in a data store.
摘要:
Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
摘要:
Systems and methods for determining whether a first document is a potential duplicate of a second document such that the two documents describe the same or substantially the same subject matter, wherein the first and second documents include attribute data in attribute fields. A set of rules is obtained for determining whether the first document is a potential duplicate of the second document. Moreover, for each rule in the set of rules, a determination is made as to whether data in a first set of attributes of the first document is contained in a second set of attributes of the second document. According to the results of the evaluated rules in the rules set, determining whether the first document is a potential duplicate of the second document. If, according to the evaluated rules in the rules set, the first document is determined to be a potential duplicate of the second document, storing a reference to the first document in a set of potential duplicates of the second document.