Handling dynamic URLs in crawl for better coverage of unique content
    1.
    Type: granted invention patent
    Status: in force

    Publication No.: US07827166B2

    Publication date: 2010-11-02

    Application No.: US11580443

    Filing date: 2006-10-13

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: Techniques for identifying duplicate webpages are provided. In one technique, one or more parameters of a first unique URL are identified where each of the one or more parameters does not substantially affect the content of the corresponding webpage. The first URL and subsequent URLs may be rewritten to drop each of the one or more parameters. Each of the subsequent URLs is compared to the first URL. If a subsequent URL is the same as the first URL, then the corresponding webpage of the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example, alphabetically. If any URLs are the same, then the webpages of the duplicate URLs are not accessed or crawled.
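
    The two techniques in the abstract (dropping parameters that do not affect content, and sorting the remaining parameters) can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say how content-neutral parameters are identified, so the `IGNORED_PARAMS` set, the function names `canonicalize` and `urls_to_crawl`, and the example.com URLs are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters already judged not to affect page content.
# How they are identified is not described in the abstract.
IGNORED_PARAMS = frozenset({"sessionid", "ref"})


def canonicalize(url, ignored=IGNORED_PARAMS):
    """Rewrite a URL: drop ignored parameters, sort the rest alphabetically."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    params.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


def urls_to_crawl(urls, ignored=IGNORED_PARAMS):
    """Return the URLs whose rewritten form has not been seen before."""
    seen = set()
    selected = []
    for url in urls:
        key = canonicalize(url, ignored)
        if key in seen:
            continue  # duplicate after rewriting: do not access or crawl it
        seen.add(key)
        selected.append(url)
    return selected


if __name__ == "__main__":
    candidates = [
        "http://example.com/item?id=7&color=red&sessionid=abc",
        "http://example.com/item?color=red&id=7&sessionid=xyz",
        "http://example.com/item?id=8&color=red",
    ]
    for url in urls_to_crawl(candidates):
        print(url)
```

    In this sketch the second candidate URL is skipped because, once `sessionid` is dropped and the parameters are sorted, it rewrites to the same string as the first.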

    SYSTEM AND METHOD FOR DETECTING DUPLICATE CONTENT ITEMS
    2.
    Type: invention patent application
    Status: under examination (published)

    Publication No.: US20090125516A1

    Publication date: 2009-05-14

    Application No.: US11939834

    Filing date: 2007-11-14

    IPC classification: G06F17/30 G06F15/18

    CPC classification: G06F16/958

    Abstract: Generally, the present invention provides systems, methods and computer program products for detecting different content items with similar content by examining the anchor text of the link. A method of the present invention comprises selecting one of a plurality of websites, crawling the selected website to identify one or more content items, and downloading one or more content items of the selected website. A determination is then made as to the one or more linking relationships from the one or more content items of the selected website and one or more linking rules are learned based upon association rule mining of the one or more content items. The one or more linking rules are then applied to one or more content items of one or more websites in order to determine storage of the one or more content items based upon the one or more linking rules on a search provider's central server.

    Handling dynamic URLs in crawl for better coverage of unique content
    3.
    Type: invention patent application
    Status: in force

    Publication No.: US20080091685A1

    Publication date: 2008-04-17

    Application No.: US11580443

    Filing date: 2006-10-13

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: Techniques for identifying duplicate webpages are provided. In one technique, one or more parameters of a first unique URL are identified where each of the one or more parameters does not substantially affect the content of the corresponding webpage. The first URL and subsequent URLs may be rewritten to drop each of the one or more parameters. Each of the subsequent URLs is compared to the first URL. If a subsequent URL is the same as the first URL, then the corresponding webpage of the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example, alphabetically. If any URLs are the same, then the webpages of the duplicate URLs are not accessed or crawled.
