System and method for enhanced browser-based web crawling

Granted invention patent

US07519902B1 System and method for enhanced browser-based web crawling (Expired)


Patent title: System and method for enhanced browser-based web crawling
Application number: US09607370

Filing date: 2000-06-30
Publication number: US07519902B1

Publication date: 2009-04-14
Inventors: Reiner Kraft, Jussi P. Myllymaki
Applicants: Reiner Kraft, Jussi P. Myllymaki
Applicant address: US NY Armonk
Assignee: International Business Machines Corporation
Current assignee: International Business Machines Corporation
Current assignee address: US NY Armonk
Agency: Fleit Gibbons Gutman Bongini & Bianco P.L.
Agents: Leonard T. Guzman; Jon A. Gibbons
Main classification: G06N3/00
IPC classification: G06N3/00


Abstract:

This invention pioneers an enhanced crawling mechanism and technique called “Enhanced Browser Based Web Crawling”. It permits the fault-tolerant gathering of dynamic data documents on the World Wide Web (WWW). The Enhanced Browser Based Web Crawler technology of this invention is implemented by incorporating the intricate functionality of a web browser into the crawler engine so that documents are properly analyzed. Essentially, the Enhanced Browser Based Crawler acts like a web browser after retrieving the initially requested document: it loads additional or included documents as needed (e.g. inline frames, frames, images, applets, audio, video, or equivalents). The crawler then executes client-side script or code and produces the final HTML markup. This final HTML markup is what a browser would ordinarily render for presentation to the user. However, unlike a web browser, this invention does not render the composed document for viewing. Instead, it analyzes or summarizes the document, thereby extracting valuable metadata and other important information contained within it. This invention also integrates optical character recognition (OCR) techniques into the crawler architecture, so that the crawler's summarization process can correctly summarize image content (e.g. GIF, JPEG, or an equivalent).
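
The analysis stage described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the final HTML markup is already available as a string, and it skips script execution, network retrieval, and OCR entirely. The names `PageSummarizer`, `summarize`, and `EMBED_ATTRS` are hypothetical, and the tag-to-attribute mapping is an illustrative assumption. The sketch only lists the included documents a browser-like crawler would go on to load and collects basic metadata.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags whose referenced documents a browser would load along with the page.
# This mapping is an illustrative assumption, not taken from the patent;
# for <applet> it ignores the codebase attribute for simplicity.
EMBED_ATTRS = {
    "frame": "src",
    "iframe": "src",
    "img": "src",
    "applet": "code",
    "audio": "src",
    "video": "src",
    "script": "src",
}


class PageSummarizer(HTMLParser):
    """Collects embedded resource URLs and basic metadata from HTML markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []  # (tag, absolute URL) pairs in document order
        self.metadata = {}   # <meta name=... content=...> pairs, names lowercased
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"].lower()] = attrs["content"]
        elif tag in EMBED_ATTRS:
            ref = attrs.get(EMBED_ATTRS[tag])
            if ref:
                # Resolve relative references against the page URL.
                self.resources.append((tag, urljoin(self.base_url, ref)))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html, base_url):
    """Return the title, metadata, and embedded resources found in the markup."""
    parser = PageSummarizer(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "metadata": parser.metadata,
        "resources": parser.resources,
    }
```

Calling `summarize(markup, "http://example.com/dir/index.html")` returns a dictionary with the page title, the named `<meta>` entries, and the absolute URLs of frames, images, and other included documents. In a fuller crawler, image URLs from that list would be the natural input to an OCR step, which this sketch leaves out.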


Information lookup

Espacenet

IPC classification:

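G	Physics
G06	Computing; calculating or counting
G06N	Computer systems based on specific computational models
G06N3/00	Computer systems based on biological models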