-
Publication No.: US20120030206A1
Publication date: 2012-02-02
Application No.: US12846064
Filing date: 2010-07-29
Applicant: Shuming Shi, Ji-Rong Wen
Inventor: Shuming Shi, Ji-Rong Wen
IPC classification: G06F17/30
CPC classification: G06F17/30864, G06F17/30707
Abstract: A topic modeling architecture is used to discover high-quality semantic classes from a large collection of raw semantic classes (RASCs) for use in generating responses to queries. A specific semantic class is identified from a collection of RASCs, and a preprocessing operation is conducted to remove one or more items with a semantic class frequency less than a predetermined threshold. A topic model is then applied to the specific semantic class for each of the items that remain in the specific semantic class after the preprocessing operation. A postprocessing operation is then conducted on the items of the specific semantic class to merge and sort the results of the topic model and generate final semantic classes for use by a search engine to respond to a query.
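The preprocessing step described in this abstract can be illustrated with a short Python sketch. It assumes that an item's "semantic class frequency" is the number of RASCs that contain the item, which the abstract does not define. The function name `filter_rare_items`, the sample data, and the threshold value are hypothetical and do not come from the patent.

```python
from collections import Counter


def filter_rare_items(rascs, min_frequency):
    """Drop items that occur in fewer than min_frequency RASCs.

    Assumption (not from the patent): an item's frequency is the
    number of raw semantic classes (RASCs) that contain it.
    """
    # Count each item at most once per RASC.
    frequency = Counter(item for rasc in rascs for item in set(rasc))
    filtered = [
        [item for item in rasc if frequency[item] >= min_frequency]
        for rasc in rascs
    ]
    # Discard RASCs left empty by the filter.
    return [rasc for rasc in filtered if rasc]


# Hypothetical sample data: three raw semantic classes.
rascs = [
    ["apple", "banana", "kiwi"],
    ["apple", "banana", "lisp"],
    ["apple", "pear"],
]
kept = filter_rare_items(rascs, min_frequency=2)
print(kept)
```

Only the filtering step is shown; the topic model and the merge-and-sort postprocessing would operate on the lists that remain.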
-
Publication No.: US20110078131A1
Publication date: 2011-03-31
Application No.: US12569978
Filing date: 2009-09-30
Applicant: Ji-Rong Wen, Yu Chen, Guomao Xin, Yunxiao Ma, Yi Liu, Zhicheng Dou, Qing Yu, Shuming Shi
Inventor: Ji-Rong Wen, Yu Chen, Guomao Xin, Yunxiao Ma, Yi Liu, Zhicheng Dou, Qing Yu, Shuming Shi
CPC classification: G06F16/951
Abstract: Described is the running of search-related experiments on a full (or partial) offline snapshot copy of the search engine documents of an actual production system. A snapshot experimentation subsystem runs experimental code related to web searches on the offline data. This includes running experimental index-building code to build an experimental index (e.g., to test a new document feature), and/or running experimental search-related code, such as code that ranks search results according to experimental ranking code, implements an experimental search strategy, and/or generates experimental captions.
-
Publication No.: US20100145956A1
Publication date: 2010-06-10
Application No.: US12697056
Filing date: 2010-01-29
Applicant: Shuming Shi, Zaiqing Nie, Ji-Rong Wen, Mingjie Zhu, Fei Xing
Inventor: Shuming Shi, Zaiqing Nie, Ji-Rong Wen, Mingjie Zhu, Fei Xing
IPC classification: G06F17/30
CPC classification: G06F17/30616, G06F17/30864, Y10S707/99932
Abstract: A search method uses pseudo-anchor text associated with search objects to improve search performance. The pseudo-anchor text may be extracted in combination with an identifier of the search objects (such as a pseudo-URL) from a digital corpus such as a collection of documents. Pseudo-anchor texts for each object are preferably extracted from candidate anchor blocks using a machine-learning-based approach. The pseudo-anchor texts are made available for searching and used to help rank the objects in a search result to improve search performance. The method may be used in vertical search of objects such as published articles, products, and images that lack explicit URLs and anchor text information.
-
Publication No.: US08874581B2
Publication date: 2014-10-28
Application No.: US12846064
Filing date: 2010-07-29
Applicant: Shuming Shi, Ji-Rong Wen
Inventor: Shuming Shi, Ji-Rong Wen
IPC classification: G06F17/30
CPC classification: G06F17/30864, G06F17/30707
Abstract: A topic modeling architecture is used to discover high-quality semantic classes from a large collection of raw semantic classes (RASCs) for use in generating responses to queries. A specific semantic class is identified from a collection of RASCs, and a preprocessing operation is conducted to remove one or more items with a semantic class frequency less than a predetermined threshold. A topic model is then applied to the specific semantic class for each of the items that remain in the specific semantic class after the preprocessing operation. A postprocessing operation is then conducted on the items of the specific semantic class to merge and sort the results of the topic model and generate final semantic classes for use by a search engine to respond to a query.
-
Publication No.: US08112421B2
Publication date: 2012-02-07
Application No.: US11781220
Filing date: 2007-07-20
Applicant: Nan Sun, Qing Yu, Shuming Shi, Ji-Rong Wen
Inventor: Nan Sun, Qing Yu, Shuming Shi, Ji-Rong Wen
IPC classification: G06F17/30
CPC classification: G06F17/30675
Abstract: A learning system for a search ranking function model may include a computer program that iteratively refines the model using new queries and associated documents from an unlabeled training set. The unlabeled training set may include a set of queries for which the associated documents have not been labeled as "relevant" or otherwise labeled. The new queries may be selected based on a similarity to and an accuracy of each neighbor from a labeled set, such as a labeled validation set. Upon selection, the documents associated with the new queries may be labeled. The new queries and their associated documents may be accumulated into a labeled training set, and a refined model may be learned based on the augmented labeled training set. The model may be iteratively refined until it is determined that the model is adequate.
-
Publication No.: US08001130B2
Publication date: 2011-08-16
Application No.: US11459857
Filing date: 2006-07-25
Applicant: Ji-Rong Wen, Shuming Shi, Wei-Ying Ma, Yunxiao Ma, Zaiqing Nie
Inventor: Ji-Rong Wen, Shuming Shi, Wei-Ying Ma, Yunxiao Ma, Zaiqing Nie
IPC classification: G06F17/30
CPC classification: G06F17/30687, G06F17/30864, G06F17/30896, Y10S707/99936
Abstract: A method and system are provided for determining relevance of an object to a term based on a language model. The relevance system provides records extracted from web pages that relate to the object. To determine the relevance of the object to a term, the relevance system first determines, for each record of the object, a probability of generating that term using a language model of the record of that object. The relevance system then calculates the relevance of the object to the term by combining the probabilities. The relevance system may also weight the probabilities based on the accuracy or reliability of the extracted information for each data source.
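The scoring idea in this abstract can be sketched in a few lines of Python. The abstract does not say which language model is used or how the per-record probabilities are combined, so this sketch assumes an unsmoothed maximum-likelihood unigram model and a weighted average. The function names, sample records, and reliability weights are all hypothetical.

```python
def term_probability(term, record_tokens):
    """Maximum-likelihood unigram probability of `term` in one record.

    Assumption: no smoothing; a real system would likely smooth.
    """
    if not record_tokens:
        return 0.0
    return record_tokens.count(term) / len(record_tokens)


def object_relevance(term, records, weights):
    """Combine per-record probabilities as a weighted average.

    `weights` stand in for the per-source accuracy or reliability
    mentioned in the abstract; the averaging rule is an assumption.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        w * term_probability(term, tokens)
        for tokens, w in zip(records, weights)
    )
    return weighted / total_weight


# Two hypothetical records extracted for the same object.
records = [
    ["digital", "camera", "zoom", "camera"],  # P(camera) = 2/4
    ["lens", "bag", "strap", "camera"],       # P(camera) = 1/4
]
weights = [3.0, 1.0]  # the first source is treated as more reliable
score = object_relevance("camera", records, weights)
print(score)
```

With these inputs the more reliable source dominates, so the score sits closer to 0.5 than to 0.25.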
-
Publication No.: US20110078132A1
Publication date: 2011-03-31
Application No.: US12570004
Filing date: 2009-09-30
Applicant: Guomao Xin, Shuming Shi, Yunxiao Ma, Ji-Rong Wen
Inventor: Guomao Xin, Shuming Shi, Yunxiao Ma, Ji-Rong Wen
IPC classification: G06F17/30
CPC classification: G06F16/334
Abstract: Described is a flexible framework for index building and document retrieval in a search environment that allows different search scenario applications to reuse index building and document retrieval code for non-scenario-specific functionality. Interfaces to various functionality of an index builder and retrieval engine are defined. An application calls the interfaces to specify custom code to perform a search scenario when needed, or uses default code when non-scenario-specific functionality may be used.
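The pattern of reusing default code and overriding only the scenario-specific parts can be sketched as below. This is an illustration of the general pattern only: the class names `Tokenizer`, `IndexBuilder`, and `HyphenSplittingTokenizer` and their methods are hypothetical and are not the interfaces defined in the patent.

```python
class Tokenizer:
    """Default, non-scenario-specific tokenization (hypothetical)."""

    def tokenize(self, text):
        return text.lower().split()


class IndexBuilder:
    """Builds a tiny inverted index; the tokenizer is pluggable."""

    def __init__(self, tokenizer=None):
        # Fall back to the default code when no custom code is given.
        self.tokenizer = tokenizer or Tokenizer()
        self.postings = {}

    def add(self, doc_id, text):
        for token in self.tokenizer.tokenize(text):
            self.postings.setdefault(token, set()).add(doc_id)

    def lookup(self, token):
        return sorted(self.postings.get(token, set()))


class HyphenSplittingTokenizer(Tokenizer):
    """Custom code for one scenario: also split on hyphens."""

    def tokenize(self, text):
        return super().tokenize(text.replace("-", " "))


default_builder = IndexBuilder()
default_builder.add(1, "Full-text search")

custom_builder = IndexBuilder(tokenizer=HyphenSplittingTokenizer())
custom_builder.add(1, "Full-text search")

print(default_builder.lookup("text"))  # default code keeps "full-text" whole
print(custom_builder.lookup("text"))   # custom code indexes "text" separately
```

Both builders share the same indexing and lookup code; only the tokenization step differs between the two scenarios.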
-
Publication No.: US07877384B2
Publication date: 2011-01-25
Application No.: US11681161
Filing date: 2007-03-01
Applicant: Qing Yu, Shuming Shi, Zhiwei Li, Ji-Rong Wen, Wei-Ying Ma
Inventor: Qing Yu, Shuming Shi, Zhiwei Li, Ji-Rong Wen, Wei-Ying Ma
IPC classification: G06F17/30
CPC classification: G06F17/30864, G06F17/30265
Abstract: A method and system for determining relevance of a document having text and images to a text string are provided. A scoring system identifies image text associated with an image of the document. The scoring system calculates an image score indicating relevance of the image text to the text string. The image score may be used in many applications, such as searching, summary generation, document classification, image search, and image classification.
-
Publication No.: US08073838B2
Publication date: 2011-12-06
Application No.: US12697056
Filing date: 2010-01-29
Applicant: Shuming Shi, Ji-Rong Wen, Mingjie Zhu, Fei Xing, Zaiqing Nie
Inventor: Shuming Shi, Ji-Rong Wen, Mingjie Zhu, Fei Xing, Zaiqing Nie
IPC classification: G06F17/30
CPC classification: G06F17/30616, G06F17/30864, Y10S707/99932
Abstract: A search method uses pseudo-anchor text associated with search objects to improve search performance. The pseudo-anchor text may be extracted in combination with an identifier of the search objects (such as a pseudo-URL) from a digital corpus such as a collection of documents. Pseudo-anchor texts for each object are preferably extracted from candidate anchor blocks using a machine-learning-based approach. The pseudo-anchor texts are made available for searching and used to help rank the objects in a search result to improve search performance. The method may be used in vertical search of objects such as published articles, products, and images that lack explicit URLs and anchor text information.
-
Publication No.: US20110264658A1
Publication date: 2011-10-27
Application No.: US13175796
Filing date: 2011-07-01
Applicant: Ji-Rong Wen, Shuming Shi, Wei-Ying Ma, Yunxiao Ma, Zaiqing Nie
Inventor: Ji-Rong Wen, Shuming Shi, Wei-Ying Ma, Yunxiao Ma, Zaiqing Nie
IPC classification: G06F17/30
CPC classification: G06F16/3346, G06F16/951, G06F16/986, Y10S707/99936
Abstract: A method and system are provided for determining relevance of an object to a term based on a language model. The relevance system provides records extracted from web pages that relate to the object. To determine the relevance of the object to a term, the relevance system first determines, for each record of the object, a probability of generating that term using a language model of the record of that object. The relevance system then calculates the relevance of the object to the term by combining the probabilities. The relevance system may also weight the probabilities based on the accuracy or reliability of the extracted information for each data source.