Invention Grant
- Patent Title: Method and techniques for determining crawling schedule
- Patent Title (Chinese, translated): Methods and techniques for determining a crawl schedule
- Application No.: US13348438
- Application Date: 2012-01-11
- Publication No.: US08862569B2
- Publication Date: 2014-10-14
- Inventor: Cheng Xu, Qiying Lin, Xin Li
- Applicant: Cheng Xu, Qiying Lin, Xin Li
- Applicant Address: US CA Mountain View
- Assignee: Google Inc.
- Current Assignee: Google Inc.
- Current Assignee Address: US CA Mountain View
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
Methods, systems and computer-readable storage medium for determining a crawling schedule. In an aspect, a method includes obtaining crawl history data for a Web site having Web pages, determining a status of the Web pages, determining a total quantity of Web pages that have a status of deleted, calculating a probability that another Web page of the Web site will be removed based on the total quantity, and storing data associating the calculated probability with the Web site. The method can further include determining, for a plurality of sets of the previous time periods, a respective crawl penalty as a combination of a penalty for crawling the Web site and a penalty for showing a deleted Web page based on the calculated probability, and determining a re-crawl schedule based on the crawl penalties.
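The steps named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the `CrawlRecord` type, the estimator (deleted pages divided by pages crawled), the linear penalty model, the cost weights, and the candidate intervals are all hypothetical and are not taken from the patent's claims.

```python
from dataclasses import dataclass


@dataclass
class CrawlRecord:
    """One page observation from a site's crawl history (hypothetical type)."""
    url: str
    status: str  # "live" or "deleted"


def deletion_probability(history):
    """Estimate the chance that another page of the site will be removed.

    Assumed estimator: the share of crawled pages whose status is "deleted".
    """
    if not history:
        return 0.0
    deleted = sum(1 for record in history if record.status == "deleted")
    return deleted / len(history)


def crawl_penalty(interval_days, p_delete, crawl_cost=1.0, stale_cost=10.0,
                  horizon_days=30):
    """Combine a penalty for crawling with a penalty for showing deleted pages.

    Assumed model: crawling cost grows with the number of crawls in the
    horizon, and the stale-page penalty grows linearly with the interval.
    """
    crawls = horizon_days / interval_days
    return crawl_cost * crawls + stale_cost * p_delete * interval_days


def choose_interval(history, candidates=(1, 7, 30)):
    """Pick the candidate re-crawl interval (in days) with the lowest penalty."""
    p_delete = deletion_probability(history)
    return min(candidates, key=lambda days: crawl_penalty(days, p_delete))


if __name__ == "__main__":
    history = [
        CrawlRecord("https://example.com/a", "deleted"),
        CrawlRecord("https://example.com/b", "live"),
    ]
    print(deletion_probability(history))
    print(choose_interval(history))
```

Under these assumed weights, a site whose pages are rarely deleted is assigned a long re-crawl interval, because the crawling penalty dominates; a site with frequent deletions is assigned a short one, because the penalty for showing a deleted page dominates.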
Public/Granted literature
- US20130179424A1, METHOD AND TECHNIQUES FOR DETERMINING CRAWLING SCHEDULE, Public/Granted day: 2013-07-11