Duplicate entry detection system and method
    1.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US08046372B1

    Publication (announcement) date: 2011-10-25

    Application No.: US11754237

    Filing date: 2007-05-25

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30616

    Abstract: A computer system and method for determining whether the subject matter described in a received document is substantially similar to the subject matter of other documents in a document corpus, such that the received document can be considered a duplicate document. After a first document is received, a set of tokens for the first document is generated. A non-fielded relevance search on a token index is executed. The relevance search returns a set of candidate duplicate documents, each with a corresponding score. Each candidate document with a score above a threshold is filtered to determine whether it is a true duplicate of the first document. The set of candidate documents that scored above the threshold and were not disqualified by the filtering is then provided.
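
The flow in this abstract (tokenize, search a token index, apply a score threshold, then filter) can be sketched as follows. This is only an illustration under stated assumptions: the whitespace tokenizer, the fraction-of-matching-tokens score, the 0.6 threshold, and the `same_numbers` filter are all hypothetical stand-ins, since the abstract does not specify any of them.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (assumed; the abstract does not define tokens)."""
    return set(text.lower().split())


def build_token_index(corpus):
    """Map each token to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def relevance_search(tokens, index):
    """Non-fielded search: score = fraction of query tokens found in a document."""
    hits = defaultdict(int)
    for token in tokens:
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return {doc_id: count / len(tokens) for doc_id, count in hits.items()}


def same_numbers(received_text, candidate_text):
    """Hypothetical filter: disqualify candidates whose numeric tokens differ."""
    def digits(text):
        return {t for t in tokenize(text) if t.isdigit()}
    return digits(received_text) == digits(candidate_text)


def find_duplicates(document, corpus, index, threshold=0.6, is_true_duplicate=None):
    """Return sorted ids of candidates above the threshold that survive filtering."""
    tokens = tokenize(document)
    if not tokens:
        return []
    scores = relevance_search(tokens, index)
    candidates = [doc_id for doc_id, score in scores.items() if score > threshold]
    if is_true_duplicate is not None:
        candidates = [d for d in candidates if is_true_duplicate(document, corpus[d])]
    return sorted(candidates)
```

With a corpus containing "red widget large 10 pack" and "red widget large 12 pack", both score above the threshold for the first text, and the numeric filter then disqualifies the 12-pack.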

    Identifying potential duplicates of a document in a document corpus
    2.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US07895225B1

    Publication (announcement) date: 2011-02-22

    Application No.: US11952020

    Filing date: 2007-12-06

    IPC classification: G06F7/00 G06F17/00

    Abstract: According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results sets is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
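
A rough sketch of position-based scoring across several results sets follows. It assumes the queries have already been executed and takes their ordered results as input. The reciprocal-of-position score, the summing across queries, and the `min_score` selection criterion are illustrative assumptions; the abstract only says the score depends on ordinal position.

```python
from collections import defaultdict


def score_by_position(result_sets):
    """Sum 1/position over every results set a document appears in (assumed formula)."""
    scores = defaultdict(float)
    for results in result_sets:
        for position, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / position
    return dict(scores)


def select_potential_duplicates(result_sets, min_score=1.0):
    """Keep documents whose combined score meets the (hypothetical) selection criterion."""
    scores = score_by_position(result_sets)
    selected = [(doc_id, score) for doc_id, score in scores.items() if score >= min_score]
    # Highest score first; ties broken by document id for a stable order.
    return sorted(selected, key=lambda item: (-item[1], item[0]))
```

A document that ranks near the top for several queries accumulates a high score, while one that appears once in a low position falls below the cutoff.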

    Generating similarity scores for matching non-identical data strings
    3.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US07814107B1

    Publication (announcement) date: 2010-10-12

    Application No.: US11754241

    Filing date: 2007-05-25

    IPC classification: G06F7/00 G06F17/00

    CPC classification: G06F17/30011

    Abstract: A system and method for determining the likelihood that two documents describe substantially similar subject matter is presented. A set of tokens for each of two documents is obtained, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising a token from each set of tokens. For each token pair in the matrix, a similarity score is determined. Those token pairs in the matrix with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describe substantially similar subject matter.
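
The token-pair matrix described here can be sketched as below. The pair similarity uses `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in; the patent's actual string measure, the 0.8 threshold, and the way matched pair scores are combined into one document score are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def pair_similarity(a, b):
    """Similarity of two tokens in [0, 1] (stand-in measure, not the patent's)."""
    return SequenceMatcher(None, a, b).ratio()


def document_similarity(tokens_a, tokens_b, threshold=0.8):
    """Return (document score, matched pairs) for two token lists."""
    if not tokens_a or not tokens_b:
        return 0.0, []
    # Matrix of scores: one row per token of A, one column per token of B.
    matrix = [[pair_similarity(a, b) for b in tokens_b] for a in tokens_a]
    # Select the pairs whose score is above the threshold.
    matched = [
        (tokens_a[i], tokens_b[j], matrix[i][j])
        for i in range(len(tokens_a))
        for j in range(len(tokens_b))
        if matrix[i][j] > threshold
    ]
    # Assumed combination: best matched score per row, normalized by the larger set.
    best_per_row = {}
    for i in range(len(tokens_a)):
        row_scores = [s for s in matrix[i] if s > threshold]
        if row_scores:
            best_per_row[i] = max(row_scores)
    score = sum(best_per_row.values()) / max(len(tokens_a), len(tokens_b))
    return score, matched
```

Near-identical tokens such as "widget" and "widgets" still pair up under this measure, so two listings that differ only in spelling details score close to 1.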

    Method and system for generating a normalized configuration model
    4.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US07567922B1

    Publication (announcement) date: 2009-07-28

    Application No.: US10924630

    Filing date: 2004-08-24

    IPC classification: G06Q30/00

    CPC classification: G06Q30/00 G06Q30/0621

    Abstract: Normalized data models are programmatically generated from a combination of product configuration model data, product configuration engine runtime validation, normalized data mappings, and settings files declaring the scope of model content. A master model generation process effectively transforms conventional configuration data into normalized configuration data. The normalized configuration data allows a user to, for example, conduct comparative product configurations. In one embodiment, a normalized model generation process generates a normalized data model representing attributes and normalized features of a product. In one embodiment, the normalized configuration data model is then added to in-memory data structures used during runtime contextual configuration analysis, thus reducing the total number of data items preserved, as efficiencies result from eliminating duplication and from effective use of search structures. The in-memory representation of the normalized configuration data model can then be serialized to disk as a file to be loaded for runtime use in a deployment.

    Identifying potential duplicates of a document in a document corpus
    5.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US09195714B1

    Publication (announcement) date: 2015-11-24

    Application No.: US13030114

    Filing date: 2011-02-17

    IPC classification: G06F17/30

    Abstract: According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results sets is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.

    Method and apparatus for inventory searching
    6.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US08744931B1

    Publication (announcement) date: 2014-06-03

    Application No.: US13571602

    Filing date: 2012-08-10

    IPC classification: G06Q10/00 G06Q30/00

    CPC classification: G06Q10/087 G06Q30/0633

    Abstract: A method is disclosed that includes identifying an inventory item corresponding to a product configuration. The product configuration is defined using a feature map. The inventory item is also defined using the feature map. Each entry of the feature map corresponds to one of a number of features of a product.
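
One way to picture a shared feature map is as a bitmask in which each bit position stands for one product feature. The sketch below makes that assumption; the abstract does not say the map is a bitmask, and the `FEATURES` catalog, the item ids, and the "item must include every configured feature" matching rule are hypothetical.

```python
# Hypothetical feature catalog; each position is one entry of the feature map.
FEATURES = ["sunroof", "leather", "navigation", "tow_package"]


def to_feature_map(features, catalog=FEATURES):
    """Encode a collection of feature names as an integer bitmask over the catalog."""
    bits = 0
    for feature in features:
        bits |= 1 << catalog.index(feature)
    return bits


def find_matching_items(config_map, inventory):
    """Return ids of inventory items whose feature map includes every configured feature."""
    return [
        item_id
        for item_id, item_map in inventory.items()
        if item_map & config_map == config_map
    ]
```

Because the configuration and the inventory items use the same encoding, matching reduces to a single bitwise comparison per item.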

    Method and apparatus for inventory searching
    7.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US08244604B1

    Publication (announcement) date: 2012-08-14

    Application No.: US12749803

    Filing date: 2010-03-30

    IPC classification: G06Q10/00 G06Q30/00

    CPC classification: G06Q10/087 G06Q30/0633

    Abstract: A method is disclosed that includes identifying an inventory item corresponding to a product configuration. The product configuration is defined using a feature map. The inventory item is also defined using the feature map. Each entry of the feature map corresponds to one of a number of features of a product.

    Determining variation sets among product descriptions
    8.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US07970773B1

    Publication (announcement) date: 2011-06-28

    Application No.: US11863020

    Filing date: 2007-09-27

    IPC classification: G06F7/00

    CPC classification: G06F17/2211 Y10S707/917

    Abstract: Systems and methods for determining a set of variation-phrases from a collection of documents in a document corpus are presented. Potential variation-phrase pairs among the various documents in the document corpus are identified. The identified potential variation-phrase pairs are then added to a variation-phrase set. The potential variation-phrase pairs in the variation-phrase set are filtered to remove those potential variation-phrase pairs that do not satisfy predetermined criteria. After filtering the variation-phrase set, the resulting variation-phrase set is stored in a data store.
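
The identify, collect, and filter steps can be illustrated with a toy heuristic. The abstract does not say how potential pairs are found or which criteria are applied, so everything specific here is an assumption: a pair is proposed when two titles of equal length differ in exactly one token, and the filter keeps pairs seen at least `min_support` times.

```python
from collections import Counter
from itertools import combinations


def candidate_pairs(titles):
    """Count token pairs from titles that differ in exactly one position (assumed heuristic)."""
    counts = Counter()
    for first, second in combinations(titles, 2):
        tokens_a, tokens_b = first.lower().split(), second.lower().split()
        if len(tokens_a) != len(tokens_b):
            continue
        diffs = [(x, y) for x, y in zip(tokens_a, tokens_b) if x != y]
        if len(diffs) == 1:
            # Store each pair in sorted order so (red, blue) and (blue, red) coincide.
            counts[tuple(sorted(diffs[0]))] += 1
    return counts


def variation_phrase_set(titles, min_support=2):
    """Filter the candidate pairs by a hypothetical minimum-support criterion."""
    return {pair for pair, n in candidate_pairs(titles).items() if n >= min_support}
```

In this sketch a color swap that recurs across unrelated products survives the filter, while a one-off difference does not.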

    Filtering invalid tokens from a document using high IDF token filtering
    9.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US07908279B1

    Publication (announcement) date: 2011-03-15

    Application No.: US11856581

    Filing date: 2007-09-17

    CPC classification: G06F17/2211 Y10S707/917

    Abstract: Systems and methods for filtering tokens from a document, for use in determining whether the document describes subject matter substantially similar to that of another document, are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected, and the token within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by the second document according to the determination.
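
The IDF ceiling described in this abstract can be sketched as follows. The smoothed IDF formula, the field names, and the choice to apply the ceiling to every field of the document are assumptions for illustration; the abstract only states that tokens with a higher IDF than the selected field's top token are removed.

```python
import math


def make_idf(corpus_token_sets):
    """Build an IDF function over a corpus given as a list of token sets."""
    n_docs = len(corpus_token_sets)

    def idf(token):
        doc_freq = sum(1 for tokens in corpus_token_sets if token in tokens)
        # Smoothed so that tokens unseen in the corpus get a finite, maximal IDF.
        return math.log((1 + n_docs) / (1 + doc_freq))

    return idf


def filter_high_idf_tokens(document, reference_field, idf):
    """Drop tokens rarer than the rarest token of the reference field."""
    reference_tokens = document.get(reference_field, [])
    if not reference_tokens:
        return {field: list(tokens) for field, tokens in document.items()}
    ceiling = max(idf(token) for token in reference_tokens)
    return {
        field: [token for token in tokens if idf(token) <= ceiling]
        for field, tokens in document.items()
    }
```

Under these assumptions a noise token that appears nowhere else in the corpus exceeds the ceiling set by the reference field and is dropped before any comparison takes place.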

    Comparison engine for identifying documents describing similar subject matter
    10.
    Granted invention patent
    Legal status: in force

    Publication (announcement) No.: US07904462B1

    Publication (announcement) date: 2011-03-08

    Application No.: US11953726

    Filing date: 2007-12-10

    IPC classification: G06F7/00 G06F17/00

    CPC classification: G06Q30/06

    Abstract: Systems and methods are presented for determining whether a first document is a potential duplicate of a second document such that the two documents describe the same or substantially the same subject matter, wherein the first and second documents include attribute data in attribute fields. A set of rules is obtained for determining whether the first document is a potential duplicate of the second document. For each rule in the set of rules, a determination is made as to whether data in a first set of attributes of the first document is contained in a second set of attributes of the second document. According to the results of the evaluated rules in the rules set, a determination is made as to whether the first document is a potential duplicate of the second document. If, according to the evaluated rules in the rules set, the first document is determined to be a potential duplicate of the second document, a reference to the first document is stored in a set of potential duplicates of the second document.
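
A minimal sketch of rule-based containment checks follows. It assumes each rule is a pair of attribute-name tuples, that "contained" means every word from the first document's attributes appears among the words of the second document's attributes, and that all rules must hold. None of these details come from the abstract, and the attribute names are hypothetical.

```python
def words_in(document, fields):
    """Collect the lowercase words found in the given attribute fields."""
    return {word for field in fields for word in document.get(field, "").lower().split()}


def rule_holds(rule, first_doc, second_doc):
    """True if the first document's data for the rule is contained in the second's."""
    source_fields, target_fields = rule
    needed = words_in(first_doc, source_fields)
    available = words_in(second_doc, target_fields)
    # A rule with no data to check is treated as not satisfied (assumed policy).
    return bool(needed) and needed <= available


def record_if_potential_duplicate(first_id, first_doc, second_id, second_doc, rules, duplicates):
    """Store first_id under second_id when every rule holds (assumed combination)."""
    if all(rule_holds(rule, first_doc, second_doc) for rule in rules):
        duplicates.setdefault(second_id, set()).add(first_id)
        return True
    return False
```

The `duplicates` dictionary plays the role of the stored set of potential duplicates, keyed by the second document's id.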
