-
Publication No.: US20090119291A1
Publication date: 2009-05-07
Application No.: US12348336
Filing date: 2009-01-05
Applicant: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
Inventor: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
IPC: G06F17/30
CPC classification number: G06F17/30864, Y10S707/99932, Y10S707/99937
Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
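The scoring and ordering steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the `pages` layout, and the weights `w_new` and `w_prev` (standing in for the two "percentages") are assumptions made for illustration, and a "new" URL is taken here to mean any link absent from the lookup structure of known URLs.

```python
def hub_score(links, known_urls, prev_score, w_new=0.5, w_prev=0.5):
    """Sum of a weighted count of new URLs and a weighted previous score.

    w_new and w_prev are hypothetical weights; the abstract only says
    that each term is "a percentage" of its input.
    """
    new_links = set(links) - set(known_urls)   # links not yet in the lookup structure
    first_value = w_new * len(new_links)       # share of the number of new URLs
    second_value = w_prev * prev_score         # share of the previous hub score
    return first_value + second_value


def recrawl_order(pages, known_urls, w_new=0.5, w_prev=0.5):
    """Return (url, score) pairs sorted from highest to lowest hub score.

    pages maps a page URL to (links found on the page, previous hub score).
    """
    scored = [
        (url, hub_score(links, known_urls, prev, w_new, w_prev))
        for url, (links, prev) in pages.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    known = {"/a", "/b"}
    pages = {
        "/about": (["/a", "/b"], 1.0),        # no new links, some history
        "/index": (["/a", "/c", "/d"], 0.0),  # two new links, no history
    }
    for url, score in recrawl_order(pages, known):
        print(url, score)
```

Because the previous score feeds into the next one, a page that repeatedly surfaces new links keeps a high score across crawls, while a page that stops doing so decays at a rate set by `w_prev`.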
-
Publication No.: US08041705B2
Publication date: 2011-10-18
Application No.: US12348336
Filing date: 2009-01-05
Applicant: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
Inventor: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
IPC: G06F17/30
CPC classification number: G06F17/30864, Y10S707/99932, Y10S707/99937
Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
-
Publication No.: US07496557B2
Publication date: 2009-02-24
Application No.: US11241469
Filing date: 2005-09-30
Applicant: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
Inventor: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
IPC: G06F13/30
CPC classification number: G06F17/30864, Y10S707/99932, Y10S707/99937
Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
-