Invention Grant
- Patent Title: Self-adaptive web crawling and text extraction
- Application No.: US15936666
- Application Date: 2018-03-27
- Publication No.: US10922366B2
- Publication Date: 2021-02-16
- Inventor: Chen-Yu Huang, Sheng-Wei Lee, June-Ray Lin, Ci-Hao Wu, Hsieh-Lung Yang, Ying-Chen Yu
- Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Applicant Address: US NY Armonk
- Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Current Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Current Assignee Address: US NY Armonk
- Agent: Erik K. Johnson
- Main IPC: G06F16/951
- IPC: G06F16/951 ; G06F16/958 ; H04L29/08 ; G06F40/103 ; G06F16/9535 ; G06F16/33 ; G06F40/279

Abstract:
A method, a computer system, and a computer program product for crawling a web page and extracting its main content are provided. The present invention may include retrieving an HTML document associated with a web page. The present invention may then include identifying at least one entry point located in the retrieved HTML document by utilizing a self-adaptive entry point locator. The present invention may also include extracting a main content article associated with the retrieved HTML document based on the identified at least one entry point. The present invention may further include presenting the extracted main content associated with the retrieved HTML document to a user.
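
The steps named in the abstract (take an HTML document, locate an entry point, extract the main content from it) can be illustrated with a minimal Python sketch. The abstract does not say how the self-adaptive entry point locator works, so the sketch substitutes a plain text-density heuristic: the container element holding the most non-link text is treated as the entry point. The names `EntryPointLocator`, `extract_main_content`, `CONTAINERS`, and `SKIP` are hypothetical, and retrieval over the network is left out; the function takes the HTML as a string.

```python
from html.parser import HTMLParser

# Assumption: these tag sets are illustrative choices, not taken from the patent.
CONTAINERS = {"article", "main", "section", "div", "td", "body"}
SKIP = {"script", "style", "nav", "header", "footer", "aside"}


class EntryPointLocator(HTMLParser):
    """Collects text per container element (stand-in heuristic, not the patented locator)."""

    def __init__(self):
        super().__init__()
        self.open = []        # ids of currently open container elements
        self.chunks = {}      # container id -> list of (text, inside_link)
        self.skip_depth = 0   # > 0 while inside script/nav/footer and similar
        self.link_depth = 0   # > 0 while inside an <a> element
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in CONTAINERS:
            self._next_id += 1
            self.open.append(self._next_id)
            self.chunks[self._next_id] = []

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in CONTAINERS and self.open:
            self.open.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.open and not self.skip_depth:
            # Attribute the text to the nearest enclosing container only.
            self.chunks[self.open[-1]].append((text, self.link_depth > 0))


def extract_main_content(html: str) -> str:
    """Return the text of the container with the most non-link text."""
    locator = EntryPointLocator()
    locator.feed(html)
    locator.close()
    if not locator.chunks:
        return ""

    def score(chunks):
        # Link text is kept in the output but does not count toward the score.
        return sum(len(text) for text, in_link in chunks if not in_link)

    best = max(locator.chunks.values(), key=score)
    return " ".join(text for text, _ in best)


sample = """<html><body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<div id="sidebar"><a href="/x">Related story</a></div>
<div id="story"><p>First paragraph of the article.</p><p>Second paragraph.</p></div>
<footer>Copyright</footer>
</body></html>"""

print(extract_main_content(sample))
```

In this sample the navigation bar and footer are skipped, the sidebar scores zero because it contains only link text, and the `story` container is selected. A fixed heuristic like this is not self-adaptive; it only shows where such a locator would sit between retrieval and extraction.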
Public/Granted literature
- US20190303501A1 SELF-ADAPTIVE WEB CRAWLING AND TEXT EXTRACTION Public/Granted day: 2019-10-03