System And Method For Generating Training Data For Function Approximation Of An Unknown Process Such As A Search Engine Ranking Algorithm
    11.
    Patent application
    Status: under examination (published)

    Publication No.: US20100057719A1

    Publication date: 2010-03-04

    Application No.: US12367656

    Filing date: 2009-02-09

    IPC classification: G06F17/30

    CPC classification: G06F16/951

    Abstract: A system and method for generating training data for a machine learning system. A training data generator server sends at least one keyword to a search engine. The training data generator server receives at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second ranks being based on the keyword. The training data generator server assigns a first label to the first page based on the first rank and assigns a second label to the second page based on the second rank. The first page, the second page, the first label and the second label are forwarded to a machine learning server.
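
The rank-to-label step in this abstract can be sketched as follows. This is a minimal illustration, not the patented method: the abstract only says that each label is based on the page's rank, so the binary `top_k` rule, the `RankedPage` type, the function names and the sample URLs are all assumptions. No real search engine is queried; the ranked results are supplied in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankedPage:
    url: str   # hypothetical page identifier
    rank: int  # 1-based position returned for the keyword


def label_for_rank(rank: int, top_k: int = 3) -> int:
    """Assumed labeling rule: 1 for ranks within top_k, 0 otherwise."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return 1 if rank <= top_k else 0


def build_training_data(keyword: str, pages: list, top_k: int = 3) -> list:
    """Return (keyword, url, label) rows ready to forward to a learner."""
    return [(keyword, p.url, label_for_rank(p.rank, top_k)) for p in pages]


# Hypothetical results for one keyword.
pages = [RankedPage("a.example/1", 1), RankedPage("b.example/2", 7)]
rows = build_training_data("running shoes", pages)
```

A graded scheme (for example, one label per rank bucket) would fit the abstract equally well; the binary rule is used here only to keep the sketch short.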


    Unsupervised, automated web host dynamicity detection, dead link detection and prerequisite page discovery for search indexed web pages
    12.
    Granted patent
    Status: in force

    Publication No.: US07610267B2

    Publication date: 2009-10-27

    Application No.: US11203832

    Filing date: 2005-08-13

    IPC classification: G06F17/30

    Abstract: Automated crawling of page links associated with a site domain that was previously crawled involves computing the dynamicity of a site based on totals of continuous dead links, live links and/or prerequisite pages encountered while crawling page links corresponding to the site. The degree to which links are crawled is optimized based on the dynamicity of the site. Some pages require that another particular page (i.e., a prerequisite page) is retrieved from the host prior to retrieving a given page, e.g., so that the prerequisite page can set a cookie. Prerequisite pages are determined based on stored information about pages that were retrieved, during a previous crawl, prior to retrieving a page. Prerequisite pages are identified to a search system so that when a user clicks on the URL for the page, the request is redirected to the prerequisite page to set the cookie appropriately.
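
A rough sketch of the two ideas in this abstract, under stated assumptions: the abstract does not give a formula for dynamicity or say how crawl depth depends on it, so the dead-link ratio, the `recrawl_fraction` policy, the `resolve_click` helper and the sample URLs below are all hypothetical.

```python
def dynamicity(dead_links: int, live_links: int) -> float:
    """Assumed score in [0, 1]: share of previously indexed links now dead."""
    total = dead_links + live_links
    if total == 0:
        return 0.0
    return dead_links / total


def recrawl_fraction(score: float, floor: float = 0.1) -> float:
    """Assumed policy: recheck at least `floor` of links, more as the site churns."""
    return max(floor, min(1.0, score))


def resolve_click(url: str, prerequisites: dict) -> str:
    """Return the URL to request first: the prerequisite page if one is known."""
    return prerequisites.get(url, url)


score = dynamicity(dead_links=30, live_links=70)
fraction = recrawl_fraction(score)

# Hypothetical mapping learned from the order of retrievals in a previous crawl.
prerequisites = {"shop.example/cart": "shop.example/login"}
first_request = resolve_click("shop.example/cart", prerequisites)
```

Pages without a recorded prerequisite are requested directly, which is why `resolve_click` falls back to the clicked URL.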


    Unsupervised learning tool for feature correction
    13.
    Patent application
    Status: in force

    Publication No.: US20070043707A1

    Publication date: 2007-02-22

    Application No.: US11253023

    Filing date: 2005-10-17

    IPC classification: G06F17/30

    CPC classification: G06F17/30861

    Abstract: Techniques for correcting miscategorized features excerpted from web pages are provided. For each of several categories and several pages on a particular web site, a separate feature may be excerpted from that page and associated with that page in relation to that category. Often, many of the “high confidence” features that have been associated with the same category are found to be associated with similar characteristics regardless of the pages from which those features were excerpted. Thus, a set of category characteristics, which are often found associated with the “high confidence” features in a particular category, may be determined. For each page, a candidate feature that is associated with the set of category characteristics may be identified in that page. If, in relation to the particular category, a feature other than the candidate feature is associated with that page, then that other feature may be replaced by the candidate feature.
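
The correction loop described in this abstract might look like the sketch below for a single category. The data layout (dictionaries with `assigned`, `candidates`, `traits` and `confidence` keys), the 0.9 confidence threshold, and the use of the single most common trait tuple as the "set of category characteristics" are assumptions made for illustration.

```python
from collections import Counter


def category_traits(pages, min_confidence=0.9):
    """Most common traits among high-confidence features of one category."""
    counts = Counter(
        p["assigned"]["traits"]
        for p in pages
        if p["assigned"]["confidence"] >= min_confidence
    )
    return counts.most_common(1)[0][0] if counts else None


def correct_features(pages, min_confidence=0.9):
    """Return the corrected feature text per page for one category."""
    traits = category_traits(pages, min_confidence)
    corrected = []
    for p in pages:
        chosen = p["assigned"]["text"]
        if traits is not None:
            for cand in p["candidates"]:
                if cand["traits"] == traits:
                    chosen = cand["text"]  # candidate replaces a mismatched feature
                    break
        corrected.append(chosen)
    return corrected


# Hypothetical "price" category on three pages of one site.
price = ("span", "price")
pages = [
    {"assigned": {"text": "$10", "traits": price, "confidence": 0.95},
     "candidates": [{"text": "$10", "traits": price}]},
    {"assigned": {"text": "$12", "traits": price, "confidence": 0.97},
     "candidates": [{"text": "$12", "traits": price}]},
    {"assigned": {"text": "Free shipping", "traits": ("div", "promo"), "confidence": 0.4},
     "candidates": [{"text": "Free shipping", "traits": ("div", "promo")},
                    {"text": "$15", "traits": price}]},
]
fixed = correct_features(pages)
```

On the third page the low-confidence "Free shipping" feature is replaced by the candidate whose traits match those shared by the high-confidence features.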


    System And Method For Generating An Approximation Of A Search Engine Ranking Algorithm
    14.
    Patent application
    Status: lapsed

    Publication No.: US20100057718A1

    Publication date: 2010-03-04

    Application No.: US12367646

    Filing date: 2009-02-09

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: A system and method for determining a ranking function for a search engine. A training data processor receives training data, the training data including at least a first page, a first label, a second page and a second label. A feature extraction processor receives the first page, identifies first features in the first page and calculates first values relating to the first features. The feature extraction processor receives the second page, identifies second features in the second page and calculates second values relating to the second features. A machine learning processor receives the first features, the first values, the first label, the second features, the second values, and the second label. The machine learning processor generates a ranking function based on the first features, the first values, the first label, the second features, the second values, and the second label.
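
The pipeline in this abstract (extract features and values from labeled pages, then fit a ranking function) can be illustrated with the sketch below. The abstract names neither the features nor the learning algorithm, so the two features (`keyword_count`, `length`), the perceptron update rule, the function names and the sample page texts are assumptions chosen to keep the example dependency-free.

```python
def extract_features(page_text: str, keyword: str) -> dict:
    """Map a page to named feature values (hypothetical feature set)."""
    words = page_text.lower().split()
    return {
        "keyword_count": float(words.count(keyword.lower())),
        "length": float(len(words)),
    }


def rank_score(weights: dict, feats: dict) -> float:
    """Linear ranking function: weighted sum of feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train_ranker(examples, epochs=20, lr=0.1):
    """Fit weights with a perceptron; examples are (features, label in {0, 1})."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            predicted = 1 if rank_score(weights, feats) > 0 else 0
            error = label - predicted
            if error:
                for name, value in feats.items():
                    weights[name] = weights.get(name, 0.0) + lr * error * value
    return weights


# Hypothetical labeled pages for the keyword "shoes".
first = extract_features("shoes shoes running shoes sale", "shoes")
second = extract_features("garden tools and hoses for sale today", "shoes")
weights = train_ranker([(first, 1), (second, 0)])
```

After training, `rank_score(weights, first)` exceeds `rank_score(weights, second)`, so the learned function orders the two pages consistently with their labels.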
