Text Information Extraction Method Based on Webpage Characteristics

Invention publication

CN107247742A Text information extraction method based on webpage characteristics (Invalid - Rejected)


Patent title: 一种基于网页特征的正文信息抽取方法
Patent title (English): Text information extraction method based on webpage characteristics
Application number: CN201710346591.1

Filing date: 2017-05-17
Publication (announcement) number: CN107247742A

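Publication (announcement) date: 2017-10-13
Inventors: Li Xiaolin (李晓林), Liu Zhijie (刘志杰), Xie Tingting (谢婷婷), Yan Ke (严柯), Zhang Yi (张懿)
Applicant: Wuhan Institute of Technology (武汉工程大学)
Applicant address: No. 206 Guanggu 1st Road, East Lake New Technology Development Zone, Wuhan, Hubei Province
Patentee: Wuhan Institute of Technology (武汉工程大学)
Current patentee: Wuhan Institute of Technology (武汉工程大学)
Current patentee address: No. 206 Guanggu 1st Road, East Lake New Technology Development Zone, Wuhan, Hubei Province
Agency: Ningbo Yinzhou Yongzhi Patent Agency (宁波市鄞州甬致专利代理事务所)
Agent: Pan Liliang (潘李亮)
Main classification: G06F17/30
IPC classification: G06F17/30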

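Abstract (translated from Chinese):

The invention relates to the field of information extraction, and in particular to a method for extracting body text based on webpage characteristics. Using features such as page layout, it preprocesses the page source into a set of line numbers and their text, then extracts the body of the page by applying a line-text threshold and a line-spacing threshold, and finally refines the extraction result according to punctuation marks. The method works well on different types of pages and has a degree of generality.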

Abstract (English):

The invention relates to the field of information extraction technology, and particularly to a text information extraction method based on webpage characteristics. The method comprises the steps of preprocessing the page source code into a set of line numbers and text based on characteristics such as the page layout; then extracting the page body text through a line text threshold and a line spacing threshold; and finally optimizing the extraction result based on punctuation. The method works well on pages of different types and has a certain versatility.
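The three stages named in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure claimed in the patent: the abstract gives no threshold values or punctuation rule, so `MIN_LINE_CHARS`, `MAX_LINE_GAP`, the `PUNCTUATION` set, the "block with the most characters" selection, and the rule of trimming unpunctuated lines from the block edges are all hypothetical choices made here.

```python
import re
from html import unescape

# Hypothetical values: the abstract names these thresholds but gives no numbers.
MIN_LINE_CHARS = 20   # line-text threshold: minimum characters for a "dense" line
MAX_LINE_GAP = 3      # line-spacing threshold: largest line-number gap inside one block
PUNCTUATION = set("，。！？；：、,.!?;:")

# Script/style blocks, comments, then any remaining tag.
_MARKUP = re.compile(
    r"<(script|style)\b.*?</\1\s*>|<!--.*?-->|<[^>]*>",
    re.IGNORECASE | re.DOTALL,
)


def preprocess(source):
    """Turn page source into (line_number, text) pairs with markup removed."""
    # Replace markup with its own newlines so source line numbers are preserved.
    stripped = _MARKUP.sub(lambda m: "\n" * m.group(0).count("\n"), source)
    rows = []
    for number, line in enumerate(stripped.split("\n"), start=1):
        text = unescape(line).strip()
        if text:
            rows.append((number, text))
    return rows


def extract_body(rows):
    """Return the run of dense lines, with small gaps, holding the most characters."""
    dense = [(n, t) for n, t in rows if len(t) >= MIN_LINE_CHARS]
    best, current = [], []
    for n, t in dense:
        # A gap above the threshold starts a new candidate block.
        if current and n - current[-1][0] > MAX_LINE_GAP:
            current = []
        current.append((n, t))
        if sum(len(x) for _, x in current) > sum(len(x) for _, x in best):
            best = list(current)
    return best


def refine(block):
    """Trim leading and trailing lines that contain no punctuation (assumed rule)."""
    has_punct = [any(c in PUNCTUATION for c in t) for _, t in block]
    if not any(has_punct):
        return []
    first = has_punct.index(True)
    last = len(has_punct) - 1 - has_punct[::-1].index(True)
    return block[first:last + 1]


def extract_main_text(source):
    """Run the three stages and join the surviving lines."""
    return "\n".join(t for _, t in refine(extract_body(preprocess(source))))
```

Calling `extract_main_text(page_source)` on a page's HTML string returns the candidate body text. Navigation bars and footers tend to be short or unpunctuated, which is why a length threshold followed by a punctuation check can separate them from article paragraphs; real pages would likely need the thresholds tuned.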
