Granted invention patent
- Patent title: System and method for enhanced browser-based web crawling
- Patent title (Chinese): 用于增强基于浏览器的网络爬网的系统和方法
- Application number: US09607370
- Filing date: 2000-06-30
- Publication number: US07519902B1
- Publication date: 2009-04-14
- Inventors: Reiner Kraft, Jussi P. Myllymaki
- Applicants: Reiner Kraft, Jussi P. Myllymaki
- Applicant address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current assignee: International Business Machines Corporation
- Current assignee address: US NY Armonk
- Agency: Fleit Gibbons Gutman Bongini & Bianco P.L.
- Agents: Leonard T. Guzman; Jon A. Gibbons
- Main classification: G06N3/00
- IPC classification: G06N3/00
Abstract:
This invention pioneers an enhanced crawling mechanism and technique called “Enhanced Browser Based Web Crawling”. It permits the fault-tolerant gathering of dynamic data documents on the World Wide Web (WWW). The Enhanced Browser Based Web Crawler technology of this invention is implemented by incorporating the functionality of a web browser into the crawler engine so that documents are properly analyzed. Essentially, the Enhanced Browser Based Crawler acts like a web browser after retrieving the initially requested document: it loads additional or included documents as needed (e.g., inline frames, frames, images, applets, audio, video, or equivalents). The crawler then executes client-side script or code and produces the final HTML markup. A web browser would ordinarily use this final markup to render the document for presentation to the user. Unlike a web browser, however, this invention does not render the composed document for viewing. Rather, it analyzes or summarizes it, thereby extracting metadata and other important information contained within the document. This invention also integrates optical character recognition (OCR) techniques into the crawler architecture, so that the summarization process can properly summarize image content (e.g., GIF, JPEG, or an equivalent).
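The pipeline in the abstract (retrieve, load included documents, produce final markup, summarize instead of render) can be illustrated with a minimal sketch. It covers only the last stage: given final HTML markup, it lists the included resources a crawler would load next and extracts basic metadata. The names `PageSummarizer`, `summarize`, and `INCLUDE_ATTRS`, and the choice of which tags count as includes, are assumptions made for illustration and do not come from the patent. Script execution and OCR are deliberately left out; they would need a browser engine and an OCR library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags whose URL attribute points at an included document or resource.
# This mapping is an illustrative assumption, not taken from the patent.
INCLUDE_ATTRS = {
    "frame": "src",
    "iframe": "src",
    "img": "src",
    "applet": "code",
    "audio": "src",
    "video": "src",
    "script": "src",
}


class PageSummarizer(HTMLParser):
    """Collects included resources and basic metadata from HTML markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.includes = []  # (tag, absolute URL) pairs a crawler would load next
        self.meta = {}      # lowercased name -> content from <meta> tags
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url_attr = INCLUDE_ATTRS.get(tag)
        if url_attr and attrs.get(url_attr):
            # Resolve relative references against the page URL.
            self.includes.append((tag, urljoin(self.base_url, attrs[url_attr])))
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def summarize(html, base_url):
    """Return a summary dict for one already-composed HTML document."""
    parser = PageSummarizer(base_url)
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "meta": parser.meta, "includes": parser.includes}
```

Calling `summarize(page_html, page_url)` returns the title, the `<meta>` name/content pairs, and the absolute URLs of included resources. In a fuller crawler, image URLs from `includes` would be the natural input to an OCR step.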
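Information lookup
IPC classification:
G | Physics |
G06 | Computing; calculating or counting |
G06N | Computer systems based on specific computational models |
G06N3/00 | Computer systems based on biological models |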