Granted invention patent
- Patent title: Domain constraint path based data record extraction
- Patent title (Chinese): 基于域约束路径的数据记录提取
- Application number: US13356241
- Filing date: 2012-01-23
- Publication number: US09171080B2
- Publication date: 2015-10-27
- Inventors: Xinying Song, Zhiyuan Chen, Yunbo Cao, Chin-Yew Lin
- Applicants: Xinying Song, Zhiyuan Chen, Yunbo Cao, Chin-Yew Lin
- Applicant address: US WA Redmond
- Assignee: Microsoft Technology Licensing LLC
- Current assignee: Microsoft Technology Licensing LLC
- Current assignee address: US WA Redmond
- Agents: Dan Choi; Judy Yee; Micky Minhas
- Main classification: G06F17/30
- IPC classification: G06F17/30; G06F17/22
Abstract:
Described herein are techniques for extracting data records containing user-generated content from documents. The documents may be processed into document trees in which sub-trees represent the data records of the document. Domain constraints may be used to locate structured portions of the document tree. For example, anchor trees may be located as being sets of sibling sub-trees with similar tag paths that contain the domain constraints. The anchor trees may then be used to determine a record boundary (e.g., the start offset and length) of the data records. Finally, the data records may be extracted based on the anchor trees and the record boundaries.
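The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the date regex standing in for a domain constraint, the names `Node`, `TreeBuilder`, `tag_path` and `extract_records`, and the sample HTML are all hypothetical. It assumes well-formed HTML, treats tag paths as similar only when they are identical, and fixes the start offset at 0 (each record is assumed to begin at its anchor tree), which is a simplification of the boundary step.

```python
import re
from html.parser import HTMLParser

DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # hypothetical domain constraint: a post date
VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags without a closing tag

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

    def full_text(self):
        return " ".join([self.text] + [c.full_text() for c in self.children])

class TreeBuilder(HTMLParser):
    """Builds a simple document tree; assumes well-formed, properly nested HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        if tag not in VOID and self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        self.cur.text += data

def tag_path(node):
    path = []
    while node is not None:
        path.append(node.tag)
        node = node.parent
    return tuple(reversed(path))

def extract_records(html, constraint=DATE):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # 1. Group nodes whose own text satisfies the constraint by tag path.
    groups = {}
    for node in builder.root.walk():
        if constraint.search(node.text):
            groups.setdefault(tag_path(node), []).append(node)
    pivots = max(groups.values(), key=len, default=[])
    if len(pivots) < 2:
        return []
    # 2. Climb in lockstep until the ancestors are siblings: the anchor trees.
    anchors = pivots
    while len({id(n.parent) for n in anchors}) > 1:
        anchors = [n.parent for n in anchors]
    parent = anchors[0].parent
    starts = sorted({parent.children.index(n) for n in anchors})
    # 3. Record boundary (simplified): offset 0, length = smallest anchor gap.
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    length = min(gaps) if gaps else 1
    # 4. Extract each record as (start index, length, normalized text).
    records = []
    for start in starts:
        subtrees = parent.children[start:start + length]
        text = " ".join(" ".join(s.full_text() for s in subtrees).split())
        records.append((start, length, text))
    return records

SAMPLE = """<html><body><h1>Forum thread</h1><div id="posts">
  <div class="head"><span>alice</span><span>2012-01-23</span></div>
  <div class="body"><p>First post</p></div>
  <div class="head"><span>bob</span><span>2012-01-24</span></div>
  <div class="body"><p>Reply</p></div>
</div></body></html>"""

for record in extract_records(SAMPLE):
    print(record)
```

Each record here spans two sibling sub-trees (a header and a body), which is why the boundary is expressed as a start index plus a length rather than as a single sub-tree.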
Published/granted documents
- US20120124086A1 Domain Constraint Path Based Data Record Extraction (publication/grant date: 2012-05-17)