Self-adaptive web crawling and text extraction

Invention Grant

US10922366B2 Self-adaptive web crawling and text extraction (Active)

Patent Title: Self-adaptive web crawling and text extraction
Application No.: US15936666

Application Date: 2018-03-27
Publication No.: US10922366B2

Publication Date: 2021-02-16
Inventor: Chen-Yu Huang, Sheng-Wei Lee, June-Ray Lin, Ci-Hao Wu, Hsieh-Lung Yang, Ying-Chen Yu
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Applicant Address: US NY Armonk
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Current Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Current Assignee Address: US NY Armonk
Agent: Erik K. Johnson
Main IPC: G06F16/951
IPC: G06F16/951; G06F16/958; H04L29/08; G06F40/103; G06F16/9535; G06F16/33; G06F40/279

Abstract:

A method, a computer system, and a computer program product for crawling a web page and extracting its main content are provided. The present invention may include retrieving an HTML document associated with a web page. The present invention may then include identifying at least one entry point located in the retrieved HTML document by utilizing a self-adaptive entry point locator. The present invention may also include extracting a main content article associated with the retrieved HTML document based on the identified at least one entry point. The present invention may further include presenting the extracted main content associated with the retrieved HTML document to the user.
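
The four steps named in the abstract (retrieve, locate an entry point, extract, present) can be illustrated with a minimal Python sketch. The abstract does not describe how the self-adaptive entry point locator works, so the sketch substitutes a simple stand-in heuristic: the container element holding the most paragraph text is treated as the entry point. The names `EntryPointLocator`, `fetch_html`, `extract_main_content`, and the sample page are all hypothetical and are not taken from the patent.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
CONTAINER_TAGS = {"article", "main", "section", "div", "td", "body"}


class EntryPointLocator(HTMLParser):
    """Stand-in heuristic (an assumption, not the patented locator):
    credit each <p> to its nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open elements as (tag, node_id)
        self.paragraphs = {}   # container node_id -> list of paragraph texts
        self.labels = {}       # node_id -> label such as "div#story"
        self._next_id = 0
        self._buffer = None    # text pieces of the <p> currently being read
        self._owner = None     # container that owns the current <p>

    def _flush(self):
        # Store the paragraph collected so far, with whitespace normalised.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text and self._owner is not None:
                self.paragraphs.setdefault(self._owner, []).append(text)
        self._buffer = None
        self._owner = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag
        node_id = self._next_id
        self._next_id += 1
        attr_map = dict(attrs)
        self.labels[node_id] = tag + ("#" + attr_map["id"] if attr_map.get("id") else "")
        if tag == "p":
            self._flush()  # tolerate an unclosed previous <p>
            self._owner = next(
                (nid for t, nid in reversed(self.stack) if t in CONTAINER_TAGS), None)
            self._buffer = []
        self.stack.append((tag, node_id))

    def handle_endtag(self, tag):
        if tag == "p" or tag in CONTAINER_TAGS:
            self._flush()
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def fetch_html(url, timeout=10):
    """Step 1: retrieve the HTML document for a web page."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_main_content(html):
    """Steps 2 and 3: pick the entry point, then extract its paragraphs."""
    locator = EntryPointLocator()
    locator.feed(html)
    locator.close()
    locator._flush()
    if not locator.paragraphs:
        return None, ""
    best = max(locator.paragraphs,
               key=lambda nid: sum(len(p) for p in locator.paragraphs[nid]))
    return locator.labels[best], "\n\n".join(locator.paragraphs[best])


SAMPLE = """
<html><body>
<div id="nav"><p>Home</p><p>About</p></div>
<div id="story">
  <p>First paragraph of the article, long enough to dominate the page.</p>
  <p>Second paragraph with more of the main text.</p>
</div>
<div id="footer"><p>Copyright notice</p></div>
</body></html>
"""

# Step 4: present the extracted main content to the user.
entry, text = extract_main_content(SAMPLE)
print("entry point:", entry)
print(text)
```

For a live page, `extract_main_content(fetch_html(url))` would chain the retrieval step in front. A fixed text-length score like this one does not adapt to page layout, which is presumably the gap the claimed self-adaptive locator addresses.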

Public/Granted literature

US20190303501A1 SELF-ADAPTIVE WEB CRAWLING AND TEXT EXTRACTION Public/Granted day: 2019-10-03

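IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models, see G06N)
G06F16/00	Information retrieval; database structures therefor; file system structures therefor
G06F16/90	• Details of database functions independent of the retrieved data types
G06F16/95	•• Retrieval from the web
G06F16/951	••• Indexing; web crawling techniques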