Granted invention patent
US09171080B2 Domain constraint path based data record extraction (in force)

Abstract:
Described herein are techniques for extracting data records containing user-generated content from documents. The documents may be processed into document trees in which sub-trees represent the data records of the document. Domain constraints may be used to locate structured portions of the document tree. For example, anchor trees may be identified as sets of sibling sub-trees that have similar tag paths and contain the domain constraints. The anchor trees may then be used to determine a record boundary (e.g., the start offset and length) of the data records. Finally, the data records may be extracted based on the anchor trees and the record boundaries.
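The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented method, only one possible reading under several assumptions: the domain constraint is taken to be a date string (as a forum post date might be), "similar" tag paths are simplified to identical tag paths, the start offset is fixed at 0, and the record length is taken as the smallest gap between consecutive anchors. The names `Node`, `constraint_path`, `find_anchors`, and `extract_records` are hypothetical and do not come from the patent.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical document-tree node: a tag, optional text, child nodes."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


# Assumed domain constraint: an ISO-style date such as a post date.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def constraint_path(node):
    """Tag path from node down to the first text matching DATE, else None."""
    if DATE.search(node.text):
        return (node.tag,)
    for child in node.children:
        sub = constraint_path(child)
        if sub is not None:
            return (node.tag,) + sub
    return None


def find_anchors(parent):
    """Indices of sibling sub-trees that share the most common constraint path."""
    by_path = {}
    for i, child in enumerate(parent.children):
        path = constraint_path(child)
        if path is not None:
            by_path.setdefault(path, []).append(i)
    best = max(by_path.values(), key=len, default=[])
    # A single match is not treated as a repeated structure.
    return best if len(best) >= 2 else []


def extract_records(parent):
    """Slice the siblings into records, one record per anchor tree."""
    anchors = find_anchors(parent)
    if not anchors:
        return []
    # Assumed boundary: start offset 0, length = smallest gap between anchors.
    length = min(b - a for a, b in zip(anchors, anchors[1:]))
    return [parent.children[i:i + length] for i in anchors]


def post(date, body):
    """Build two sibling sub-trees: a header carrying the date, then the body."""
    header = Node("div", children=[Node("span", text=date)])
    return [header, Node("p", text=body)]


page = Node("body", children=post("2012-01-05", "first") + post("2012-01-06", "second"))
records = extract_records(page)
print([[n.tag for n in rec] for rec in records])
```

In this toy page the two header sub-trees act as anchor trees, and each record spans the header plus the body paragraph that follows it. A fuller treatment would need a tolerant tag-path similarity measure and a search over candidate offsets, which this sketch leaves out.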