Incremental crawling method for website information

Invention publication

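Patent title: Incremental crawling method for website information (一种网站信息增量爬取方法)
Application number: CN201410783643.8

Filing date: 2014-12-16
Publication number: CN104516956A

Publication date: 2015-04-15
Inventors: 刘学, 脱立恒, 董微, 刘照邻
Applicants: Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所); 上海尚恩华科网络科技股份有限公司
Applicant address: No. 21 North 4th Ring West Road, Haidian District, Beijing
Patentees: Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所); 上海尚恩华科网络科技股份有限公司
Current patentees: Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所); 上海尚恩华科网络科技股份有限公司
Current patentee address: No. 21 North 4th Ring West Road, Haidian District, Beijing
Agency: 北京亿腾知识产权代理事务所
Agent: 陈霁
Main classification: G06F17/30
IPC classification: G06F17/30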

Abstract:

The invention discloses an incremental crawling method for website information. The method comprises: crawling data of a set length in the order in which the website presents it, and placing the data into a data queue in that same order, with a comparison window located at the tail of the queue; checking the duplication rate between the data in the comparison window and the already-crawled data; stopping the crawl when the duplication rate reaches a preset value; and otherwise repeating the above process until the duplication rate between the data in the comparison window and the already-crawled data reaches the preset value, at which point crawling stops. For incremental crawling of websites whose information is not strictly sorted by time, the method reduces crawling cost while keeping the miss rate within an allowable level. During operation, the set crawl length and the data queue length can be adjusted dynamically, which improves the efficiency of the algorithm and accommodates different requirements on miss rate and crawling cost.
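
The stopping rule described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `incremental_crawl`, the `fetch_batch(offset, count)` callback, and the parameters `batch_len`, `window_len`, and `dup_threshold` are all hypothetical names chosen for illustration. The sketch also assumes each item has a hashable identifier (for example a URL) and keeps the batch and window lengths fixed, whereas the abstract notes that they may be adjusted dynamically.

```python
from typing import Callable, Hashable, List, Set


def incremental_crawl(
    fetch_batch: Callable[[int, int], List[Hashable]],  # hypothetical page fetcher
    seen: Set[Hashable],         # identifiers stored by earlier crawls
    batch_len: int = 10,         # "set length" of data fetched per round
    window_len: int = 5,         # size of the comparison window at the queue tail
    dup_threshold: float = 0.8,  # preset duplication rate that stops the crawl
) -> List[Hashable]:
    """Return items not seen before, fetched in the site's presentation order."""
    queue: List[Hashable] = []
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_len)
        if not batch:  # the site has no more items to present
            break
        queue.extend(batch)
        offset += len(batch)
        # The comparison window covers the last items appended to the queue.
        window = queue[-window_len:]
        dup_rate = sum(item in seen for item in window) / len(window)
        if dup_rate >= dup_threshold:
            break  # the tail is mostly old data, so older pages are skipped
    # Keep only new items, preserving order and dropping in-queue repeats.
    new_items: List[Hashable] = []
    for item in queue:
        if item not in seen:
            seen.add(item)
            new_items.append(item)
    return new_items


if __name__ == "__main__":
    # A listing that is only roughly newest-first: item 100 appears before 103.
    site = [105, 104, 100, 103, 102, 99, 98, 97] + list(range(96, 79, -1))
    already_crawled = set(range(0, 102))
    fresh = incremental_crawl(
        lambda offset, count: site[offset:offset + count],
        already_crawled,
        batch_len=4,
        window_len=4,
        dup_threshold=0.75,
    )
    print(fresh)  # [105, 104, 103, 102]
```

In this example the first window contains only one known item (rate 0.25), so crawling continues. The second window contains three known items out of four (rate 0.75), so the crawl stops after fetching 8 of the 25 listed items. A lower threshold or a smaller window stops earlier at the risk of missing out-of-order new items, which is the trade-off between miss rate and crawling cost that the abstract refers to.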

Publication/grant documents

CN104516956B Incremental crawling method for website information. Publication/grant date: 2017-12-01
