Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
    5.
    Type: granted invention patent
    Legal status: in force

    Publication number: US07409383B1

    Publication date: 2008-08-05

    Application number: US10813590

    Filing date: 2004-03-31

    IPC classification: G06F17/30 G06F7/00 G06F17/21

    Abstract: A stopword detection component detects stopwords (also stop-phrases) in search queries input to keyword-based information retrieval systems. Potential stopwords are initially identified by comparing the terms in the search query to a list of known stopwords. Context data is then retrieved based on the search query and the identified stopwords. In one implementation, the context data includes documents retrieved from a document index. In another implementation, the context data includes categories relevant to the search query. Sets of retrieved context data are compared to one another to determine if they are substantially similar. If the sets of context data are substantially similar, this fact may be used to infer that the removal of the potential stopword(s) is not material to the search. If the sets of context data are not substantially similar, the potential stopword can be considered material to the search and should not be removed from the query.
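
    The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the stopword list, the toy document index, the AND-style `retrieve` function, the Jaccard overlap measure and the 0.5 threshold are all assumptions chosen for illustration, since the abstract only requires that the sets of context data be "substantially similar".

```python
from typing import Callable, Dict, List, Set

# Hypothetical list of known stopwords (assumption, not from the patent).
KNOWN_STOPWORDS: Set[str] = {"the", "a", "an", "of", "in", "to"}

# Toy document index (assumption): document id -> text.
DOCS: Dict[str, str] = {
    "d1": "the matrix film review",
    "d2": "the matrix sequel cast",
    "d3": "matrix multiplication tutorial",
    "d4": "sparse matrix storage formats",
    "d5": "matrix inverse in numpy",
    "d6": "history of france overview",
    "d7": "a short history of france",
}


def retrieve(terms: List[str]) -> Set[str]:
    """Return ids of documents that contain every query term (AND semantics)."""
    return {
        doc_id
        for doc_id, text in DOCS.items()
        if all(term in text.split() for term in terms)
    }


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two result sets; 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def material_stopwords(
    query: str,
    retrieve_fn: Callable[[List[str]], Set[str]],
    threshold: float = 0.5,
) -> List[str]:
    """Return the potential stopwords whose removal changes the context data."""
    terms = query.lower().split()
    baseline = retrieve_fn(terms)  # context data for the full query
    material = []
    for i, term in enumerate(terms):
        if term not in KNOWN_STOPWORDS:
            continue  # only terms on the known-stopword list are candidates
        reduced = terms[:i] + terms[i + 1:]  # query without this candidate
        if jaccard(baseline, retrieve_fn(reduced)) < threshold:
            material.append(term)  # results differ: keep the word in the query
    return material


print(material_stopwords("the matrix", retrieve))         # ['the']
print(material_stopwords("history of france", retrieve))  # []
```

    In this toy index, dropping "the" from "the matrix" widens the result set from two documents to five, so the word is treated as material; dropping "of" from "history of france" leaves the results unchanged, so it can be removed.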


    Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
    6.
    Type: granted invention patent
    Legal status: in force

    Publication number: US08214385B1

    Publication date: 2012-07-03

    Application number: US13098956

    Filing date: 2011-05-02

    IPC classification: G06F17/30 G06F7/00

    Abstract: A stopword detection component detects stopwords (also stop-phrases) in search queries input to keyword-based information retrieval systems. Potential stopwords are initially identified by comparing the terms in the search query to a list of known stopwords. Context data is then retrieved based on the search query and the identified stopwords. In one implementation, the context data includes documents retrieved from a document index. In another implementation, the context data includes categories relevant to the search query. Sets of retrieved context data are compared to one another to determine if they are substantially similar. If the sets of context data are substantially similar, this fact may be used to infer that the removal of the potential stopword(s) is not material to the search. If the sets of context data are not substantially similar, the potential stopword can be considered material to the search and should not be removed from the query.


    Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
    8.
    Type: granted invention patent
    Legal status: in force

    Publication number: US07945579B1

    Publication date: 2011-05-17

    Application number: US12185651

    Filing date: 2008-08-04

    IPC classification: G06F17/30 G06F7/00

    Abstract: A stopword detection component detects stopwords (also stop-phrases) in search queries input to keyword-based information retrieval systems. Potential stopwords are initially identified by comparing the terms in the search query to a list of known stopwords. Context data is then retrieved based on the search query and the identified stopwords. In one implementation, the context data includes documents retrieved from a document index. In another implementation, the context data includes categories relevant to the search query. Sets of retrieved context data are compared to one another to determine if they are substantially similar. If the sets of context data are substantially similar, this fact may be used to infer that the removal of the potential stopword(s) is not material to the search. If the sets of context data are not substantially similar, the potential stopword can be considered material to the search and should not be removed from the query.


    Selectively deleting clusters of conceptually related words from a generative model for text
    9.
    Type: granted invention patent
    Legal status: in force

    Publication number: US07877371B1

    Publication date: 2011-01-25

    Application number: US11703582

    Filing date: 2007-02-07

    IPC classification: G06F17/30

    CPC classification: G06F17/3071

    Abstract: One embodiment of the present invention provides a system that selectively deletes clusters of conceptually-related words from a probabilistic generative model for textual documents. During operation, the system receives a current model, which contains terminal nodes representing random variables for words and contains one or more cluster nodes representing clusters of conceptually related words. Nodes in the current model are coupled together by weighted links, so that if an incoming link from a node that has fired causes a cluster node to fire with a probability proportionate to a weight of the incoming link, an outgoing link from the cluster node to another node causes the other node to fire with a probability proportionate to the weight of the outgoing link. Next, the system processes a given cluster node in the current model for possible deletion. This involves determining a number of outgoing links from the given cluster node to terminal nodes or cluster nodes in the current model. If the determined number of outgoing links is less than a minimum value, or if the frequency with which the given cluster node fires is less than a minimum frequency, the system deletes the given cluster node from the current model.
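
    The deletion rule at the end of the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions: the `ClusterNode` data structure, the node names, the threshold values, the single-pass evaluation and the cleanup of links that point at deleted clusters are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClusterNode:
    """Hypothetical cluster node: weighted outgoing links and a firing frequency."""
    outgoing: Dict[str, float] = field(default_factory=dict)  # target name -> weight
    fire_frequency: float = 0.0  # how often the cluster fires (assumed precomputed)


def prune_clusters(
    clusters: Dict[str, ClusterNode],
    min_links: int,
    min_frequency: float,
) -> Dict[str, ClusterNode]:
    """Return a copy of the model without clusters that fail either threshold.

    A cluster is deleted if it has fewer than `min_links` outgoing links or
    fires less often than `min_frequency`. Both checks are evaluated once
    against the current model (assumption: no repeated passes).
    """
    doomed = {
        name
        for name, node in clusters.items()
        if len(node.outgoing) < min_links or node.fire_frequency < min_frequency
    }
    kept: Dict[str, ClusterNode] = {}
    for name, node in clusters.items():
        if name in doomed:
            continue
        # Assumption: links that pointed at a deleted cluster are dropped too.
        links = {t: w for t, w in node.outgoing.items() if t not in doomed}
        kept[name] = ClusterNode(links, node.fire_frequency)
    return kept


# Toy model: terminal nodes are plain words, cluster names start with "C_".
model = {
    "C_pets": ClusterNode({"cat": 0.6, "dog": 0.7, "C_sparse": 0.1}, 0.05),
    "C_sparse": ClusterNode({"zebra": 0.2}, 0.04),  # too few outgoing links
    "C_rare": ClusterNode({"quark": 0.5, "lepton": 0.4, "boson": 0.3}, 0.0001),  # fires too rarely
}

pruned = prune_clusters(model, min_links=2, min_frequency=0.001)
print(sorted(pruned))             # ['C_pets']
print(pruned["C_pets"].outgoing)  # {'cat': 0.6, 'dog': 0.7}
```

    Only "C_pets" survives here: "C_sparse" has a single outgoing link and "C_rare" fires below the minimum frequency, which are the two conditions the abstract names.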
