Invention Grant
- Patent Title: System and method for focused re-crawling of web sites
- Application No.: US12054482
- Application Date: 2008-03-25
- Publication No.: US07882099B2
- Publication Date: 2011-02-01
- Inventor: Neeraj Agrawal, Sreeram Viswanath Balakrishnan, Sachindra Joshi
- Applicant: Neeraj Agrawal, Sreeram Viswanath Balakrishnan, Sachindra Joshi
- Applicant Address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current Assignee: International Business Machines Corporation
- Current Assignee Address: US NY Armonk
- Agency: Gibb I.P. Law Firm, LLC
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
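The pattern-discovery and restriction steps in the abstract can be illustrated with a minimal sketch. The abstract does not say how patterns are derived, so everything here is an assumption made for illustration: patterns are taken to be host-plus-directory URL prefixes, the relevant/irrelevant partition is supplied as input rather than computed, and the names `path_prefixes`, `discover_exclusion_patterns`, `allowed`, and the `min_support` threshold are hypothetical, not taken from the patent.

```python
from collections import Counter
from urllib.parse import urlparse


def path_prefixes(url):
    """Yield host + directory prefixes of a URL, e.g. 'example.com/forum/'."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    prefix = parts.netloc
    # The last segment is treated as the page itself, not a directory.
    for segment in segments[:-1]:
        prefix = prefix + "/" + segment
        yield prefix + "/"


def discover_exclusion_patterns(relevant, irrelevant, min_support=2):
    """Return prefixes seen on at least min_support irrelevant pages
    and on no relevant page (an assumed heuristic, not the patented method)."""
    relevant_prefixes = {p for url in relevant for p in path_prefixes(url)}
    counts = Counter(p for url in irrelevant for p in path_prefixes(url))
    candidates = {
        p for p, n in counts.items()
        if n >= min_support and p not in relevant_prefixes
    }
    # Drop any pattern already covered by a shorter, more general one.
    return sorted(
        p for p in candidates
        if not any(p != q and p.startswith(q) for q in candidates)
    )


def allowed(url, exclusion_patterns):
    """Decide whether a subsequent crawl may fetch this URL."""
    parts = urlparse(url)
    key = parts.netloc + parts.path
    return not any(key.startswith(p) for p in exclusion_patterns)
```

In a re-crawl loop under these assumptions, each candidate URL would be passed through `allowed` before being fetched, so directories that produced only irrelevant pages in the first crawl are skipped. Inclusion patterns could be handled symmetrically by swapping the roles of the two sets.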
Public/Granted literature
- US20080168041A1 SYSTEM AND METHOD FOR FOCUSED RE-CRAWLING OF WEB SITES, Public/Granted day: 2008-07-10