Invention application
- Patent title: Extraction of Content from a Web Page
- Patent title (Chinese): 从网页提取内容
- Application number: US13817656
- Filing date: 2010-10-26
- Publication number: US20130283148A1
- Publication date: 2013-10-24
- Inventors: Suk Hwan Lim, Jian-Ming Jin, Li-Wei Zheng, Jian Fan, Eamonn O'Brien-Strain, Parag Joshi
- Applicants: Suk Hwan Lim, Jian-Ming Jin, Li-Wei Zheng, Jian Fan, Eamonn O'Brien-Strain, Parag Joshi
- International application: PCT/CN2010/001698 WO 20101026
- Main classification: G06F17/22
- IPC classification: G06F17/22
Abstract:
A system and method are provided for extracting main content from a web page. Web page segmentation is performed on a web page to provide affinity-grouped segments. Descriptive features of at least one of the affinity-grouped segments are computed. At least one of the affinity-grouped segments is classified as a main body segment based on the computed descriptive features. Additional affinity-grouped segments are classified as to a document function based on the computed descriptive features. Classified affinity-grouped segments are assembled according to their classified document functions to provide the main content.
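
The pipeline described in the abstract (segment, compute features, classify the main body, classify the remaining segments by document function, assemble) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the application: the `Segment` class, the three features (word count, link density, font size), the thresholds, and the role names `title` and `navigation` are all hypothetical. Segmentation itself is assumed to have already produced the affinity-grouped segments.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One affinity-grouped segment (hypothetical representation)."""
    text: str
    link_chars: int = 0        # characters that sit inside hyperlinks
    font_size: float = 12.0    # dominant font size of the segment
    features: dict = field(default_factory=dict)
    role: str = "other"


def compute_features(seg: Segment) -> None:
    """Compute a few descriptive features (illustrative choices only)."""
    n_chars = max(len(seg.text), 1)
    seg.features = {
        "word_count": len(seg.text.split()),
        "link_density": seg.link_chars / n_chars,
        "font_size": seg.font_size,
    }


def classify(segments: list[Segment]) -> None:
    """Label one segment as the main body, then label the others."""
    for seg in segments:
        compute_features(seg)

    # Assumed heuristic: the main body is the wordiest low-link segment.
    candidates = [s for s in segments if s.features["link_density"] < 0.5]
    if not candidates:
        return
    main = max(candidates, key=lambda s: s.features["word_count"])
    main.role = "main_body"

    body_size = main.features["font_size"]
    for seg in segments:
        if seg is main:
            continue
        f = seg.features
        if f["link_density"] >= 0.5:
            seg.role = "navigation"
        elif f["font_size"] > body_size and f["word_count"] <= 20:
            seg.role = "title"
        else:
            seg.role = "other"


def assemble(segments: list[Segment]) -> str:
    """Join segments in a fixed role order; unlisted roles are dropped."""
    parts = []
    for role in ("title", "main_body"):
        parts.extend(s.text for s in segments if s.role == role)
    return "\n\n".join(parts)


def extract_main_content(segments: list[Segment]) -> str:
    classify(segments)
    return assemble(segments)
```

A rule-based classifier is used here only to keep the sketch short; the abstract does not say how classification is performed, so a trained classifier over the same features would fit the description equally well.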