Unsupervised learning tool for feature correction
    1.
    Granted invention patent (status: in force)

    Publication number: US07483903B2

    Publication date: 2009-01-27

    Application number: US11253023

    Filing date: 2005-10-17

    CPC classification: G06F17/30861

    Abstract: Techniques for correcting miscategorized features excerpted from web pages are provided. For each of several categories and several pages on a particular web site, a separate feature may be excerpted from that page and associated with that page in relation to that category. Often, many of the “high confidence” features that have been associated with the same category are found to be associated with similar characteristics regardless of the pages from which those features were excerpted. Thus, a set of category characteristics, which are often found associated with the “high confidence” features in a particular category, may be determined. For each page, a candidate feature that is associated with the set of category characteristics may be identified in that page. If, in relation to the particular category, a feature other than the candidate feature is associated with that page, then that other feature may be replaced by the candidate feature.
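The correction loop in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the `Feature` class, the string-encoded characteristics, the 0.8 confidence threshold and the 0.6 share threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    text: str
    characteristics: frozenset  # hypothetical encoding of markup traits


def category_characteristics(high_conf, min_share=0.6):
    """Characteristics shared by at least min_share of the high-confidence features."""
    counts = Counter(c for f in high_conf for c in f.characteristics)
    return frozenset(c for c, n in counts.items() if n >= min_share * len(high_conf))


def correct_features(candidates, assigned, confidence, threshold=0.8):
    """Replace an assigned feature when another candidate fits the category traits."""
    high_conf = [assigned[p] for p in assigned if confidence[p] >= threshold]
    traits = category_characteristics(high_conf) if high_conf else frozenset()
    corrected = dict(assigned)
    if not traits:
        return corrected  # nothing to learn from, leave assignments alone
    for page, feats in candidates.items():
        matches = [f for f in feats if traits <= f.characteristics]
        if matches and assigned[page] not in matches:
            corrected[page] = matches[0]
    return corrected


# Toy data: three product pages, category "title" (all values are made up).
h1 = frozenset({"tag:h1", "bold"})
nav = frozenset({"tag:li"})
candidates = {
    "p1": [Feature("Widget A", h1), Feature("Home", nav)],
    "p2": [Feature("Widget B", h1), Feature("Home", nav)],
    "p3": [Feature("Widget C", h1), Feature("Home", nav)],
}
assigned = {
    "p1": candidates["p1"][0],
    "p2": candidates["p2"][0],
    "p3": candidates["p3"][1],  # miscategorized: a navigation link
}
confidence = {"p1": 0.95, "p2": 0.9, "p3": 0.4}
corrected = correct_features(candidates, assigned, confidence)
```

In this toy run the two high-confidence titles share the `tag:h1` and `bold` traits, so the navigation link assigned to `p3` is swapped for the candidate that carries the same traits.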

    Unsupervised, automated web host dynamicity detection, dead link detection and prerequisite page discovery for search indexed web pages
    2.
    Granted invention patent (status: in force)

    Publication number: US07610267B2

    Publication date: 2009-10-27

    Application number: US11203832

    Filing date: 2005-08-13

    IPC classification: G06F17/30

    Abstract: Automated crawling of page links associated with a site domain that was previously crawled involves computing the dynamicity of a site based on totals of continuous dead links, live links and/or prerequisite pages encountered while crawling page links corresponding to the site. The degree to which links are crawled is optimized based on the dynamicity of the site. Some pages require that another particular page (i.e., a prerequisite page) is retrieved from the host prior to retrieving a given page, e.g., so that the prerequisite page can set a cookie. Prerequisite pages are determined based on stored information about pages that were retrieved, during a previous crawl, prior to retrieving a page. Prerequisite pages are identified to a search system so that when a user clicks on the URL for the page, the request is redirected to the prerequisite page to set the cookie appropriately.
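The abstract does not give a formula for dynamicity, so the following is only one plausible reading, sketched under stated assumptions: dynamicity is taken to be the share of dead links seen during a re-crawl, the longest run of consecutive dead links is tracked alongside it, and the score is mapped to a re-crawl fraction with a hypothetical floor of 0.1. The function names and status strings are illustrative.

```python
def dynamicity(link_statuses):
    """Return (dead-link share, longest run of consecutive dead links).

    link_statuses is an iterable of "dead" / "live" strings in crawl order.
    """
    total = dead = run = longest = 0
    for status in link_statuses:
        total += 1
        if status == "dead":
            dead += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    if total == 0:
        return 0.0, 0
    return dead / total, longest


def recrawl_fraction(score, floor=0.1):
    """Crawl a larger share of a site's links the more dynamic the site looks."""
    return min(1.0, max(floor, score))


statuses = ["live", "dead", "dead", "live", "dead"]
score, longest_dead_run = dynamicity(statuses)
fraction = recrawl_fraction(score)
```

A stable site (few dead links) is then re-crawled only at the floor rate, while a site whose links decay quickly is re-crawled more thoroughly.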

    Unsupervised learning tool for feature correction
    3.
    Invention patent application (status: in force)

    Publication number: US20070043707A1

    Publication date: 2007-02-22

    Application number: US11253023

    Filing date: 2005-10-17

    IPC classification: G06F17/30

    CPC classification: G06F17/30861

    Abstract: Techniques for correcting miscategorized features excerpted from web pages are provided. For each of several categories and several pages on a particular web site, a separate feature may be excerpted from that page and associated with that page in relation to that category. Often, many of the “high confidence” features that have been associated with the same category are found to be associated with similar characteristics regardless of the pages from which those features were excerpted. Thus, a set of category characteristics, which are often found associated with the “high confidence” features in a particular category, may be determined. For each page, a candidate feature that is associated with the set of category characteristics may be identified in that page. If, in relation to the particular category, a feature other than the candidate feature is associated with that page, then that other feature may be replaced by the candidate feature.

    METHODS AND SYSTEMS FOR ASSESSING EXCESSIVE ACCESSORY LISTINGS IN SEARCH RESULTS
    4.
    Invention patent application (status: in force)

    Publication number: US20120259844A1

    Publication date: 2012-10-11

    Application number: US13082226

    Filing date: 2011-04-07

    IPC classification: G06F17/30

    Abstract: A system and method for assessing excessive accessory listings in search results includes a processor-implemented textual mining module that parses a data field of a document and generates at least one token from the data field. A processor-implemented scoring module calculates a score for the at least one token, with the at least one token score representing a likelihood that the at least one token belongs to one of two binary classifications. The processor-implemented scoring module also calculates a score for the document based on the at least one token score, with the document score representing a probability of the document being in one of the two binary classifications. A processor-implemented decision tree module inputs the document score and document attribute values into a decision tree and generates an output representing a refined score based on the document score and at least one of the document attribute values.
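The three modules named in this abstract (token mining, scoring, decision tree) can be sketched as a small pipeline. The details below are assumptions for illustration only: token scores are Laplace-smoothed frequencies from a tiny hand-labeled set, the document score is a plain average of token scores, and the "decision tree" is a hand-written two-level rule on a hypothetical `price` attribute with a made-up cutoff.

```python
import re
from collections import Counter


def tokenize(field):
    """Textual mining step: split a data field (here, a listing title) into tokens."""
    return re.findall(r"[a-z0-9]+", field.lower())


def train_token_scores(labeled_titles):
    """Estimate P(accessory | token) with add-one smoothing."""
    accessory, total = Counter(), Counter()
    for title, is_accessory in labeled_titles:
        for tok in set(tokenize(title)):
            total[tok] += 1
            if is_accessory:
                accessory[tok] += 1
    return {t: (accessory[t] + 1) / (total[t] + 2) for t in total}


def document_score(title, token_scores, default=0.5):
    """Average the token scores; unseen tokens get a neutral default."""
    toks = tokenize(title)
    if not toks:
        return default
    return sum(token_scores.get(t, default) for t in toks) / len(toks)


def refined_score(doc_score, price, price_cutoff=30.0):
    """Hand-written stand-in for a decision tree over score and one attribute."""
    if doc_score >= 0.5:
        return min(1.0, doc_score + 0.2) if price < price_cutoff else doc_score
    return doc_score if price < price_cutoff else max(0.0, doc_score - 0.2)


labeled = [
    ("phone case", True),
    ("phone charger cable", True),
    ("smart phone 64gb", False),
    ("case for tablet", True),
]
scores = train_token_scores(labeled)
case_score = document_score("phone case", scores)
phone_score = document_score("smart phone 64gb", scores)
```

A cheap listing that already looks like an accessory is pushed further toward the accessory class, while an expensive listing that looks like a main product is pushed further away from it.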

    Record boundary identification and extraction through pattern mining
    5.
    Invention patent application (status: in force)

    Publication number: US20070027882A1

    Publication date: 2007-02-01

    Application number: US11192620

    Filing date: 2005-07-28

    IPC classification: G06F7/00

    Abstract: Techniques for identifying discrete records within a multi-record document are provided. According to one technique, a document is encoded based on some combination of visual tag encoding, text category encoding, and text content encoding that produces hash values based on the contents of portions of the document. According to one technique, repeating candidate patterns are identified in a document so encoded. The candidate patterns may be identified in a “fuzzy” manner that allows for some inconsistencies in the individual pattern instances. According to one technique, the identified candidate patterns are validated based on specified factors to determine a “best” pattern. According to one technique, the boundaries of discrete records in a multi-record document are marked based on the portions of the document that correspond to an identified repeating pattern.
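The encode, mine and mark steps can be sketched in a much simplified form. This sketch makes several assumptions that are not in the abstract: the encoding keeps only opening tag names, pattern matching is exact rather than "fuzzy", and the "best" pattern is simply the one whose non-overlapping occurrences cover the most tokens. Function names and length limits are hypothetical.

```python
import re
from collections import defaultdict


def encode(html):
    """Tag encoding: reduce the document to its sequence of opening tag names."""
    return re.findall(r"<([a-zA-Z0-9]+)", html)


def best_repeating_pattern(tokens, min_len=2, max_len=6):
    """Return (coverage, pattern, starts) for the best exact repeating n-gram, or None."""
    best = None
    for n in range(min_len, max_len + 1):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)
        for pattern, starts in positions.items():
            kept, next_free = [], 0
            for s in starts:  # keep only non-overlapping occurrences
                if s >= next_free:
                    kept.append(s)
                    next_free = s + n
            if len(kept) < 2:
                continue  # a pattern must repeat to be a record candidate
            coverage = len(kept) * n
            if best is None or coverage > best[0]:
                best = (coverage, pattern, kept)
    return best


html = (
    "<ul>"
    "<li><b>A</b><i>1</i></li>"
    "<li><b>B</b><i>2</i></li>"
    "<li><b>C</b><i>3</i></li>"
    "</ul>"
)
tokens = encode(html)
coverage, pattern, record_starts = best_repeating_pattern(tokens)
```

Each index in `record_starts` marks the token at which one record begins, so the record boundaries fall between consecutive start positions.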

    System and method for generating an approximation of a search engine ranking algorithm
    8.
    Granted invention patent (status: lapsed)

    Publication number: US08255391B2

    Publication date: 2012-08-28

    Application number: US12367646

    Filing date: 2009-02-09

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: A system and method for determining a ranking function for a search engine. A training data processor receives training data, the training data including at least a first page, a first label, a second page and a second label. A feature extraction processor receives the first page, identifies first features in the first page and calculates first values relating to the first features. The feature extraction processor receives the second page and identifies second features and calculates second values relating to the second features. A machine learning processor receives the first features, the first values, the first label, the second features, the second values, and the second label. The machine learning processor generates a ranking function based on first features, the first values, the first label, the second features, the second values, and the second label.
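The flow from labeled pages to a learned ranking function can be illustrated with a small sketch. The abstract does not name the features or the learning algorithm, so both are assumptions here: two hypothetical text features plus a bias term, and a linear model fitted by stochastic gradient descent on squared error.

```python
def extract_features(page, query):
    """Feature extraction: map a page to numeric feature values (hypothetical features)."""
    words = page.lower().split()
    q = query.lower()
    return [
        words.count(q) / max(len(words), 1),  # query term frequency
        1.0 if words[:1] == [q] else 0.0,     # page starts with the query term
        1.0,                                  # bias term
    ]


def fit_ranking_function(examples, query, epochs=500, lr=0.1):
    """Fit linear weights to (page, label) pairs and return a scoring function."""
    rows = [(extract_features(page, query), label) for page, label in examples]
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in rows:
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]

    def rank(page):
        return sum(w * xi for w, xi in zip(weights, extract_features(page, query)))

    return rank


first_page, first_label = "jaguar cars for sale jaguar dealers", 1.0
second_page, second_label = "weather report for tuesday", 0.0
rank = fit_ranking_function(
    [(first_page, first_label), (second_page, second_label)], query="jaguar"
)
```

Calling `rank(page)` on any page then yields a score that can be used to order results for the query; with only two training pages this is a toy, not a usable ranker.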

    Record boundary identification and extraction through pattern mining
    9.
    Granted invention patent (status: in force)

    Publication number: US07606816B2

    Publication date: 2009-10-20

    Application number: US11192620

    Filing date: 2005-07-28

    IPC classification: G06F7/00 G06F17/00

    Abstract: Techniques for identifying discrete records within a multi-record document are provided. According to one technique, a document is encoded based on some combination of visual tag encoding, text category encoding, and text content encoding that produces hash values based on the contents of portions of the document. According to one technique, repeating candidate patterns are identified in a document so encoded. The candidate patterns may be identified in a “fuzzy” manner that allows for some inconsistencies in the individual pattern instances. According to one technique, the identified candidate patterns are validated based on specified factors to determine a “best” pattern. According to one technique, the boundaries of discrete records in a multi-record document are marked based on the portions of the document that correspond to an identified repeating pattern.

    Techniques for unsupervised web content discovery and automated query generation for crawling the hidden web
    10.
    Invention patent application (status: under examination, published)

    Publication number: US20070022085A1

    Publication date: 2007-01-25

    Application number: US11224887

    Filing date: 2005-09-12

    IPC classification: G06F17/30

    CPC classification: G06F16/9566

    Abstract: Unsupervised crawling of the hidden Web utilizes a query engine, coupled to a crawler system, that automatically and intelligently inserts keywords into text input controls in Web page forms so that the filled form can be submitted to a server to retrieve dynamically generated Web content for indexing. The keywords used to fill form controls are based on the content of corresponding Web pages, which is automatically discovered to generate a set of keywords for filling the controls. The set of keywords can be expanded to include related keywords from other Web pages and Web sites and, therefore, to provide more effective coverage for crawling the Web content. The expanded set of keywords can be continuously expanded by recursively performing similarity analyses based on results from crawling the same and other Web sites.
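The first stage described here, turning page content into keywords and then into form submissions, can be sketched as follows. The keyword selection rule (most frequent non-stopword terms), the stopword list, the example URL and the input name `q` are all assumptions for illustration; the keyword expansion and similarity analysis from the abstract are not modeled.

```python
import re
from collections import Counter
from urllib.parse import urlencode

# Tiny illustrative stopword list; a real crawler would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "for", "to", "in", "our", "by", "or"})


def page_keywords(text, k=3):
    """Pick the k most frequent non-stopword terms from the page text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(k)]


def form_queries(action_url, input_name, keywords):
    """Build one GET request URL per keyword for a single text input control."""
    return [f"{action_url}?{urlencode({input_name: kw})}" for kw in keywords]


page_text = (
    "Search our catalog of used books. "
    "Books by author, books by title, rare books and used maps."
)
keywords = page_keywords(page_text, k=2)
queries = form_queries("http://example.com/search", "q", keywords)
```

Each generated URL stands in for one filled-in form submission; the pages returned could then be mined for further keywords to widen coverage.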
