Invention Grant
US07827166B2 Handling dynamic URLs in crawl for better coverage of unique content
Active
- Patent Title: Handling dynamic URLs in crawl for better coverage of unique content
- Application No.: US11580443
- Application Date: 2006-10-13
- Publication No.: US07827166B2
- Publication Date: 2010-11-02
- Inventor: Priyank S. Garg, Arnabnil Bhattacharjee
- Applicant: Priyank S. Garg, Arnabnil Bhattacharjee
- Applicant Address: US CA Sunnyvale
- Assignee: Yahoo! Inc.
- Current Assignee: Yahoo! Inc.
- Current Assignee Address: US CA Sunnyvale
- Agency: Hickman Palermo Truong & Becker LLP
- Agent: Christian A. Nicholes; Daniel D. Ledesma
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
Techniques for identifying duplicate webpages are provided. In one technique, one or more parameters of a first unique URL are identified where each of the one or more parameters does not substantially affect the content of the corresponding webpage. The first URL and subsequent URLs may be rewritten to drop each of the one or more parameters. Each of the subsequent URLs is compared to the first URL. If a subsequent URL is the same as the first URL, then the corresponding webpage of the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example, alphabetically. If any of the resulting URLs are the same, then the webpages of the duplicate URLs are not accessed or crawled.
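
The two techniques in the abstract (dropping parameters that do not affect content, and sorting the remaining parameters) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names `canonicalize` and `urls_to_crawl`, the `IGNORED_PARAMS` set, and the example.com URLs are all hypothetical, and the sketch simply assumes the content-neutral parameters are already known, whereas the abstract describes identifying them.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: parameters believed not to affect page content (hypothetical names).
IGNORED_PARAMS = frozenset({"sessionid", "ref"})


def canonicalize(url, ignored=IGNORED_PARAMS):
    """Rewrite a URL by dropping ignored parameters and sorting the rest."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    params.sort()  # alphabetical order, so parameter order no longer matters
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


def urls_to_crawl(urls, ignored=IGNORED_PARAMS):
    """Return the URLs whose rewritten form has not been seen before."""
    seen = set()
    keep = []
    for url in urls:
        key = canonicalize(url, ignored)
        if key in seen:
            continue  # duplicate after rewriting: do not access or crawl it
        seen.add(key)
        keep.append(url)
    return keep


if __name__ == "__main__":
    candidates = [
        "http://example.com/p?a=1&b=2",
        "http://example.com/p?b=2&a=1&sessionid=9",
        "http://example.com/q?a=1",
    ]
    for url in urls_to_crawl(candidates):
        print(url)
```

In this sketch the second candidate rewrites to the same string as the first, so it is skipped, while the third has a different path and is kept. Keeping a set of rewritten URLs is one simple way to perform the comparison the abstract describes; a production crawler would likely need a more scalable structure.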
Public/Granted literature
- US20080091685A1 Handling dynamic URLs in crawl for better coverage of unique content Public/Granted day: 2008-04-17