Web-scale entity relationship extraction that extracts pattern(s) based on an extracted tuple
    1.
    Granted invention patent
    Status: Active

    Publication No.: US08504490B2

    Publication date: 2013-08-06

    Application No.: US12757722

    Filing date: 2010-04-09

    IPC classification: G06F15/18

    Abstract: Methods and systems for Web-scale entity relationship extraction are usable to build large-scale entity relationship graphs from any data corpora stored on a computer-readable medium or accessible through a network. Such entity relationship graphs may be used to navigate previously undiscoverable relationships among entities within data corpora. Additionally, the entity relationship extraction may be configured to utilize discriminative models to jointly model correlated data found within the selected corpora.

    Query Reformulation Using Post-Execution Results Analysis
    2.
    Invention application
    Status: Under examination (published)

    Publication No.: US20130086024A1

    Publication date: 2013-04-04

    Application No.: US13248894

    Filing date: 2011-09-29

    IPC classification: G06F17/30

    CPC classification: G06F16/951 G06F16/3338

    Abstract: Systems, methods, devices, and media are described to facilitate the training and employing of a three-class classifier for post-execution search query reformulation. In some embodiments, the classifier is trained through a supervised learning process, based on a training set of queries mined from a query log. Query reformulation candidates are determined for each query in the training set, and searches are performed using each reformulation candidate and the un-reformulated training query. The resulting document lists are analyzed to determine ranking and topic drift features, and to calculate a quality classification. The features and classification for each reformulation candidate are used to train the classifier in an offline mode. In some embodiments, the classifier is employed in an online mode to dynamically perform query reformulation on user-submitted queries.
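
The offline step of turning one reformulation candidate into a training example might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the three classes, the features, or the quality measure, so the labels `better`/`same`/`worse`, the Jaccard-based drift feature, precision at k, and the `margin` threshold are all hypothetical choices.

```python
def jaccard(a, b):
    """Jaccard overlap of two document-id lists (1.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that are in the relevant set."""
    return sum(1 for doc in results[:k] if doc in relevant) / k


def make_training_example(orig_results, reform_results, relevant, k=3, margin=0.1):
    """Build (features, label) for one reformulation candidate.

    Assumption: topic drift is approximated by how little the two top-k
    lists overlap, and the quality class compares precision at k.
    """
    features = {
        "topic_drift": 1.0 - jaccard(orig_results[:k], reform_results[:k]),
        "top_rank_changed": orig_results[:1] != reform_results[:1],
    }
    delta = (precision_at_k(reform_results, relevant, k)
             - precision_at_k(orig_results, relevant, k))
    if delta > margin:
        label = "better"
    elif delta < -margin:
        label = "worse"
    else:
        label = "same"
    return features, label


features, label = make_training_example(
    orig_results=["d1", "d2", "d3"],
    reform_results=["d1", "d4", "d5"],
    relevant={"d1", "d4", "d5"},
)
```

The resulting `(features, label)` pairs could then be fed to any supervised multi-class learner; which learner the application uses is not stated in the abstract.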

    Pseudo-anchor text extraction
    3.
    Granted invention patent
    Status: Active

    Publication No.: US08073838B2

    Publication date: 2011-12-06

    Application No.: US12697056

    Filing date: 2010-01-29

    IPC classification: G06F17/30

    Abstract: A search method uses pseudo-anchor text associated with search objects to improve search performance. The pseudo-anchor text may be extracted in combination with an identifier of the search objects (such as a pseudo-URL) from a digital corpus such as a collection of documents. Pseudo-anchor texts for each object are preferably extracted from candidate anchor blocks using a machine-learning-based approach. The pseudo-anchor texts are made available for searching and used to help rank the objects in a search result to improve search performance. The method may be used in vertical search of objects such as published articles, products and images that lack explicit URLs and anchor text information.

    WEB OBJECT RETRIEVAL BASED ON A LANGUAGE MODEL
    4.
    Invention application
    Status: Under examination (published)

    Publication No.: US20110264658A1

    Publication date: 2011-10-27

    Application No.: US13175796

    Filing date: 2011-07-01

    IPC classification: G06F17/30

    Abstract: A method and system are provided for determining relevance of an object to a term based on a language model. The relevance system provides records extracted from web pages that relate to the object. To determine the relevance of the object to a term, the relevance system first determines, for each record of the object, a probability of generating that term using a language model of the record of that object. The relevance system then calculates the relevance of the object to the term by combining the probabilities. The relevance system may also weight the probabilities based on the accuracy or reliability of the extracted information for each data source.
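
The record-level scoring described in this abstract can be illustrated with a minimal sketch. It assumes a unigram language model with add-alpha smoothing and a reliability-weighted average as the combination rule; the abstract specifies neither, and the names `term_probability`, `object_relevance`, `vocab_size`, and the reliability weights are hypothetical.

```python
from collections import Counter


def term_probability(term, record_text, vocab_size, alpha=1.0):
    """P(term | record) under a unigram model with add-alpha smoothing."""
    tokens = record_text.lower().split()
    counts = Counter(tokens)
    return (counts[term.lower()] + alpha) / (len(tokens) + alpha * vocab_size)


def object_relevance(term, records, vocab_size):
    """Combine per-record probabilities into one score for the object.

    `records` is a list of (text, reliability) pairs, where reliability is
    an assumed weight for the data source the record was extracted from.
    """
    total_weight = sum(weight for _, weight in records)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        weight * term_probability(term, text, vocab_size)
        for text, weight in records
    )
    return weighted / total_weight


# Two records extracted for the same object from sources of differing reliability.
records = [
    ("web object retrieval with language models", 0.9),
    ("retrieval of pages", 0.5),
]
score_hit = object_relevance("retrieval", records, vocab_size=1000)
score_miss = object_relevance("banana", records, vocab_size=1000)
```

A term that occurs in the object's records scores higher than one that does not, and records from more reliable sources contribute more to the combined score.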

    Retrieval of structured documents
    5.
    Granted invention patent
    Status: Active

    Publication No.: US08046370B2

    Publication date: 2011-10-25

    Application No.: US12211793

    Filing date: 2008-09-16

    Applicant: Ji-Rong Wen, Hang Cui

    Inventor: Ji-Rong Wen, Hang Cui

    IPC classification: G06F7/00 G06F17/30

    Abstract: This disclosure relates to performing a query for a search term against a database containing a plurality of structured documents. Structured documents that do not include the search term are filtered out during an initial search. The matched structured documents, i.e., those that do contain the search term, are evaluated by ranking their individual elements based on how well each element matches the search term. The ranking of the individual elements is indicated to the user, who can then access the individual elements.
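
The two-stage flow in this abstract (filter out non-matching documents, then rank the elements of the matches) might look like the sketch below. The document representation (a dict of element name to text) and the scoring rule (relative term frequency within the element) are assumptions made for illustration; the abstract does not define how well an element "matches".

```python
def search_structured(documents, term):
    """Return [(doc_id, [element names ranked best-first])] for matching docs.

    `documents` maps a document id to a dict of element name -> text.
    """
    term = term.lower()
    results = []
    for doc_id, elements in documents.items():
        # Initial search: drop documents that never mention the term.
        if not any(term in text.lower().split() for text in elements.values()):
            continue
        scored = []
        for name, text in elements.items():
            tokens = text.lower().split()
            if not tokens:
                continue
            # Assumed score: share of the element's tokens equal to the term.
            score = tokens.count(term) / len(tokens)
            if score > 0:
                scored.append((score, name))
        # Highest score first; ties broken by element name for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        results.append((doc_id, [name for _, name in scored]))
    return results


documents = {
    "a": {"title": "entity extraction",
          "body": "extraction of entity pairs from web pages"},
    "b": {"title": "query logs", "body": "mining logs"},
}
hits = search_structured(documents, "entity")
```

Here document `b` is dropped in the initial search, and within document `a` the short title outranks the longer body because the term makes up a larger share of it.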

    WEB-SCALE ENTITY RELATIONSHIP EXTRACTION
    6.
    Invention application
    Status: Active

    Publication No.: US20110251984A1

    Publication date: 2011-10-13

    Application No.: US12757722

    Filing date: 2010-04-09

    IPC classification: G06F15/18 G06F17/30

    Abstract: Methods and systems for Web-scale entity relationship extraction are usable to build large-scale entity relationship graphs from any data corpora stored on a computer-readable medium or accessible through a network. Such entity relationship graphs may be used to navigate previously undiscoverable relationships among entities within data corpora. Additionally, the entity relationship extraction may be configured to utilize discriminative models to jointly model correlated data found within the selected corpora.

    Using Anchor Text With Hyperlink Structures for Web Searches
    7.
    Invention application
    Status: Active

    Publication No.: US20110238644A1

    Publication date: 2011-09-29

    Application No.: US12748903

    Filing date: 2010-03-29

    IPC classification: G06F3/14 G06F17/30

    CPC classification: G06F17/30887

    Abstract: This document describes tools for adjusting anchor text weight to provide more relevant search engine results. Specifically, these tools take advantage of a site-relationship model to consider relationships not only between an anchor text source site and a destination page but also relationships between multiple anchor text source sites to improve web searches. Consideration of these relationships aids in determining a new anchor text weight, which in turn results in more relevant search results.

    Interactive System for Extracting Data from a Website
    9.
    Invention application
    Status: Under examination (published)

    Publication No.: US20110191381A1

    Publication date: 2011-08-04

    Application No.: US12696061

    Filing date: 2010-01-29

    IPC classification: G06F17/30

    CPC classification: G06F16/00

    Abstract: Described is a technology for efficiently labeling a webpage. A wrapper tool labels records of a webpage at the record level. If a wrapper already exists that is appropriate for labeling a record, the wrapper tool automatically labels that record. For unlabeled records, the tool provides a user interface to label those records, and updates the set of existing wrappers with a new wrapper that is generated based upon the labeling operation; the new wrapper is then applied to any unlabeled records if appropriate for those records. As a result, a user typically needs to label only relatively few records, with the wrappers generated for those records automatically used to label the other unlabeled records of the webpage.
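
The labeling loop in this abstract can be sketched roughly as follows. This is a toy model, not the application's wrapper format: a "wrapper" here is just a mapping from a record's coarse shape to field labels, and `shape`, `label_page`, and the `ask_user` callback (standing in for the user interface) are hypothetical names.

```python
def shape(record):
    """Coarse signature of a record: one type tag per field (assumed)."""
    return tuple(
        "num" if field.replace(".", "", 1).isdigit() else "text"
        for field in record
    )


def label_page(records, wrappers, ask_user):
    """Label every record, asking the user only when no wrapper fits.

    `wrappers` maps a record shape to a list of field labels and is
    updated in place; `ask_user(record)` returns labels for one record.
    """
    labeled = []
    user_calls = 0
    for record in records:
        key = shape(record)
        if key not in wrappers:
            # No suitable wrapper: get labels from the user and keep them
            # as a new wrapper for later records with the same shape.
            wrappers[key] = ask_user(record)
            user_calls += 1
        labeled.append(dict(zip(wrappers[key], record)))
    return labeled, user_calls


def ask_user(record):
    # Stand-in for the interactive labeling step.
    return ["name", "price"] if shape(record) == ("text", "num") else ["qty", "name"]


records = [("Widget", "9.99"), ("Gadget", "19.50"), ("4", "Sprocket")]
labeled, user_calls = label_page(records, {}, ask_user)
```

In this run the user labels two of the three records; the second record is labeled automatically by the wrapper generated from the first.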

    WEBPAGE ENTITY EXTRACTION THROUGH JOINT UNDERSTANDING OF PAGE STRUCTURES AND SENTENCES
    10.
    Invention application
    Status: Active

    Publication No.: US20110078554A1

    Publication date: 2011-03-31

    Application No.: US12569912

    Filing date: 2009-09-30

    IPC classification: G06F17/21

    CPC classification: G06F17/278

    Abstract: Described is a technology for understanding entities of a webpage, e.g., to label the entities on the webpage. An iterative and bidirectional framework processes a webpage, including a text understanding component (e.g., extended Semi-CRF model) that provides text segmentation features to a structure understanding component (e.g., extended HCRF model). The structure understanding component uses the text segmentation features and visual layout features of the webpage to identify a structure (e.g., labeled block). The text understanding component in turn uses the labeled block to further understand the text. The process continues iteratively until a similarity criterion is met, at which time the entities may be labeled. Also described is the use of multiple mentions of a set of text in the webpage to help in labeling an entity.
