-
Publication No.: US20110093533A1
Publication date: 2011-04-21
Application No.: US12988078
Filing date: 2008-04-17
IPC classification: G06F15/173, G06F15/16
CPC classification: G06F16/972, G06F16/958
Abstract: Methods, systems, and apparatus, including computer program products, for generating sitemaps. The method includes scanning network traffic between a server and one or more clients requesting resources from the server, the network traffic including resource request messages from the one or more clients and resources served by the server in response to the resource request messages. The method also includes automatically extracting data from the traffic served by the server to the one or more clients, the extracted data including one or more Uniform Resource Locators that identify the resources served by the server to the one or more clients. The method automatically generates a sitemap from the extracted data, and stores the sitemap in a computer-readable memory.
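The extract-then-generate flow described in this abstract can be illustrated with a minimal sketch. It assumes the scanned traffic has already been reduced to records with hypothetical `url` and `status` keys; the function names are illustrative and are not taken from the patent. Only the sitemaps.org `urlset`/`url`/`loc` element names are standard.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def extract_served_urls(traffic):
    """Return URLs the server actually served (2xx), in first-seen order.

    `traffic` is an assumed, simplified view of the scanned traffic:
    an iterable of dicts with "url" and "status" keys.
    """
    seen = {}
    for record in traffic:
        if 200 <= record["status"] < 300:
            seen.setdefault(record["url"], None)  # dict keeps insertion order
    return list(seen)


def build_sitemap(urls):
    """Serialize the URLs as a sitemaps.org <urlset> document (a string)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com/a", "status": 200},
        {"url": "https://example.com/missing", "status": 404},
        {"url": "https://example.com/a", "status": 200},
    ]
    print(build_sitemap(extract_served_urls(sample)))
```

Writing the returned string to a file would correspond to the storing step; how the traffic is actually captured is outside this sketch.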
-
Publication No.: US08655864B1
Publication date: 2014-02-18
Application No.: US13493872
Filing date: 2012-06-11
Applicant: Alan C. Strohm, Feng Hu, Sascha B. Brawer, Maximilian Ibel, Ralph M. Keller, Narayanan Shivakumar, Elad Gil
Inventor: Alan C. Strohm, Feng Hu, Sascha B. Brawer, Maximilian Ibel, Ralph M. Keller, Narayanan Shivakumar, Elad Gil
IPC classification: G06F17/30
CPC classification: G06F17/30864, Y10S707/99933
Abstract: A method of analyzing documents or relationships between documents includes receiving a notification of an available metadata document containing information about one or more network-accessible documents, obtaining a document format indicator associated with the metadata document, selecting a document crawler using the document format indicator, and crawling at least some of the network-accessible documents using the selected document crawler.
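Selecting a crawler from a format indicator, as this abstract describes, amounts to a dispatch step. The sketch below is an illustration only: the indicator values `"xml"` and `"text"` and all function names are hypothetical, and the stand-in "crawlers" merely list the document URLs they would fetch instead of performing any network access.

```python
import xml.etree.ElementTree as ET


def urls_from_xml_sitemap(text):
    """Stand-in crawler for an XML metadata document: collect <loc> values."""
    root = ET.fromstring(text)
    return [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]


def urls_from_text_list(text):
    """Stand-in crawler for a plain-text metadata document: one URL per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical registry mapping a format indicator to a crawler.
CRAWLERS = {"xml": urls_from_xml_sitemap, "text": urls_from_text_list}


def select_crawler(format_indicator):
    """Return the crawler registered for the given format indicator."""
    try:
        return CRAWLERS[format_indicator]
    except KeyError:
        raise ValueError(
            f"unsupported format indicator: {format_indicator!r}") from None


if __name__ == "__main__":
    crawler = select_crawler("text")
    print(crawler("https://example.com/a\nhttps://example.com/b\n"))
```

A registry keeps the selection step separate from the crawling step, so support for another metadata format can be added without touching the caller.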
-
Publication No.: US08037055B2
Publication date: 2011-10-11
Application No.: US12861663
Filing date: 2010-08-23
CPC classification: G06F17/30864
Abstract: Methods and systems for a sitemap generating client for web crawlers are described. The client accesses one or more sources of document information about the documents available on a website, such as the file system, access logs, or pre-made URL lists. Document information is extracted from the sources and one or more sitemaps are generated based on the extracted document information. A notification is transmitted to a remote computer, informing that the sitemap(s) are available for access and likely have been updated. If the remote computer is associated with a web crawler, the remote computer may access the sitemap(s) and use the sitemaps to schedule a crawl of documents included or available on the website.
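The client described in this abstract merges several sources of document information and then notifies a remote computer. A minimal sketch of those two steps follows. It assumes access-log lines in Common Log Format and a hypothetical notification endpoint that takes a `sitemap` query parameter; neither the function names nor the endpoint come from the patent, and nothing is actually sent.

```python
import re
from urllib.parse import urlencode

# Matches the request path and status code in a Common Log Format line,
# e.g. ... "GET /a.html HTTP/1.1" 200 2326
LOG_REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3}) ')


def collect_urls(base_url, file_paths=(), log_lines=(), url_list=()):
    """Merge three assumed sources into one sorted, de-duplicated URL list."""
    urls = set(url_list)
    # File-system paths are taken to be relative to the web root.
    urls.update(base_url + "/" + path.lstrip("/") for path in file_paths)
    for line in log_lines:
        match = LOG_REQUEST.search(line)
        if match and match.group(2).startswith("2"):  # successful requests only
            urls.add(base_url + match.group(1))
    return sorted(urls)


def notification_url(ping_endpoint, sitemap_url):
    """Build the URL a client could request to announce an updated sitemap."""
    return ping_endpoint + "?" + urlencode({"sitemap": sitemap_url})


if __name__ == "__main__":
    print(collect_urls("https://example.com", file_paths=["docs/a.html"]))
    print(notification_url("https://crawler.example/ping",
                           "https://example.com/sitemap.xml"))
```

The merged list would then be serialized as one or more sitemaps before the notification is issued.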
-
Publication No.: US08037054B2
Publication date: 2011-10-11
Application No.: US12823358
Filing date: 2010-06-25
CPC classification: G06F17/30864
Abstract: Methods and systems for a web crawler scheduler that utilizes sitemaps from websites are described. A web crawler scheduling system receives a notification from a website or web server. In response to the notification, the system accesses one or more sitemap(s) for documents associated with the website or web server. The system schedules crawls of the documents based on information identified from the sitemaps. The system crawls at least a subset of the documents scheduled for crawling.
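One way to picture scheduling "based on information identified from the sitemaps" is a priority queue over sitemap entries. The sketch below is an assumption-laden illustration, not the patented scheduler: it uses the `loc`, `lastmod`, and `priority` fields defined by the sitemaps.org protocol, a hypothetical `last_crawled` record, and a hypothetical crawl budget that limits the run to a subset of the documents.

```python
import heapq
from datetime import date


def schedule_crawls(entries, last_crawled, budget):
    """Pick up to `budget` URLs: changed since last crawl, highest priority first.

    entries      -- dicts with "loc", "lastmod" (date) and "priority" (float)
    last_crawled -- maps URL to the date it was last crawled (assumed bookkeeping)
    """
    heap = []
    for entry in entries:
        previous = last_crawled.get(entry["loc"])
        if previous is None or entry["lastmod"] > previous:
            # heapq is a min-heap, so negate the priority to pop the largest first.
            heapq.heappush(heap, (-entry["priority"], entry["loc"]))
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]


if __name__ == "__main__":
    sitemap_entries = [
        {"loc": "https://example.com/a", "lastmod": date(2024, 1, 5), "priority": 0.5},
        {"loc": "https://example.com/b", "lastmod": date(2024, 1, 1), "priority": 0.9},
    ]
    print(schedule_crawls(sitemap_entries, {}, budget=1))
```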
-
Publication No.: US20100262592A1
Publication date: 2010-10-14
Application No.: US12823358
Filing date: 2010-06-25
IPC classification: G06F17/30
CPC classification: G06F17/30864
Abstract: Methods and systems for a web crawler scheduler that utilizes sitemaps from websites are described. A web crawler scheduling system receives a notification from a website or web server. In response to the notification, the system accesses one or more sitemap(s) for documents associated with the website or web server. The system schedules crawls of the documents based on information identified from the sitemaps. The system crawls at least a subset of the documents scheduled for crawling.
-
Publication No.: US08234266B2
Publication date: 2012-07-31
Application No.: US12693310
Filing date: 2010-01-25
Applicant: Alan C. Strohm, Feng Hu, Sascha B. Brawer, Maximilian Ibel, Ralph M. Keller, Narayanan Shivakumar, Elad Gil
Inventor: Alan C. Strohm, Feng Hu, Sascha B. Brawer, Maximilian Ibel, Ralph M. Keller, Narayanan Shivakumar, Elad Gil
IPC classification: G06F17/30
CPC classification: G06F17/30864, Y10S707/99933
Abstract: A method of analyzing documents or relationships between documents includes receiving a notification of an available metadata document containing information about one or more network-accessible documents, obtaining a document format indicator associated with the metadata document, selecting a document crawler using the document format indicator, and crawling at least some of the network-accessible documents using the selected document crawler.
-
Publication No.: US20120036118A1
Publication date: 2012-02-09
Application No.: US13271160
Filing date: 2011-10-11
IPC classification: G06F17/30
CPC classification: G06F17/30864
Abstract: Methods and systems for a web crawler scheduler that utilizes sitemaps from websites are described. A web crawler scheduling system receives a notification from a website or web server. In response to the notification, the system accesses one or more sitemap(s) for documents associated with the website or web server. The system schedules crawls of the documents based on information identified from the sitemaps. The system crawls at least a subset of the documents scheduled for crawling.
-
Publication No.: US20100318508A1
Publication date: 2010-12-16
Application No.: US12861663
Filing date: 2010-08-23
IPC classification: G06F17/30
CPC classification: G06F17/30864
Abstract: Methods and systems for a sitemap generating client for web crawlers are described. The client accesses one or more sources of document information about the documents available on a website, such as the file system, access logs, or pre-made URL lists. Document information is extracted from the sources and one or more sitemaps are generated based on the extracted document information. A notification is transmitted to a remote computer, informing that the sitemap(s) are available for access and likely have been updated. If the remote computer is associated with a web crawler, the remote computer may access the sitemap(s) and use the sitemaps to schedule a crawl of documents included or available on the website.
-
Publication No.: US20100125564A1
Publication date: 2010-05-20
Application No.: US12693310
Filing date: 2010-01-25
Applicant: Alan C. Strohm, Feng Hu, Sascha B. Brawer, Maximilian Ibel, Ralph M. Keller, Narayanan Shivakumar, Elad Gil
Inventor: Alan C. Strohm, Feng Hu, Sascha B. Brawer, Maximilian Ibel, Ralph M. Keller, Narayanan Shivakumar, Elad Gil
CPC classification: G06F17/30864, Y10S707/99933
Abstract: A method of analyzing documents or relationships between documents includes receiving a notification of an available metadata document containing information about one or more network-accessible documents, obtaining a document format indicator associated with the metadata document, selecting a document crawler using the document format indicator, and crawling at least some of the network-accessible documents using the selected document crawler.
-
Publication No.: US07653617B2
Publication date: 2010-01-26
Application No.: US11415947
Filing date: 2006-05-01
Applicant: Alan C. Strohm, Feng Hu, Sascha B. Brawer, Maximilian Ibel, Ralph M. Keller, Narayanan Shivakumar, Elad Gil
Inventor: Alan C. Strohm, Feng Hu, Sascha B. Brawer, Maximilian Ibel, Ralph M. Keller, Narayanan Shivakumar, Elad Gil
IPC classification: G06F17/30
CPC classification: G06F17/30864, Y10S707/99933
Abstract: A method of analyzing documents or relationships between documents includes receiving a notification of an available metadata document containing information about one or more network-accessible documents, obtaining a document format indicator associated with the metadata document, selecting a document crawler using the document format indicator, and crawling at least some of the network-accessible documents using the selected document crawler.