Invention patent application
- Patent title: Record boundary identification and extraction through pattern mining
- Patent title (Chinese): 通过模式挖掘记录边界识别和提取
- Application number: US11192620
- Filing date: 2005-07-28
- Publication number: US20070027882A1
- Publication date: 2007-02-01
- Inventor: Parashuram Kulkarni
- Applicant: Parashuram Kulkarni
- Main classification: G06F7/00
- IPC classification: G06F7/00
Abstract:
Techniques for identifying discrete records within a multi-record document are provided. According to one technique, a document is encoded based on some combination of visual tag encoding, text category encoding, and text content encoding that produces hash values based on the contents of portions of the document. According to one technique, repeating candidate patterns are identified in a document so encoded. The candidate patterns may be identified in a “fuzzy” manner that allows for some inconsistencies in the individual pattern instances. According to one technique, the identified candidate patterns are validated based on specified factors to determine a “best” pattern. According to one technique, the boundaries of discrete records in a multi-record document are marked based on the portions of the document that correspond to an identified repeating pattern.
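The pipeline described in the abstract (encode, find repeating patterns, pick a "best" pattern, mark boundaries) can be illustrated with a minimal Python sketch. It is not the method claimed in the application. The function names `encode`, `best_repeating_pattern`, and `record_boundaries`, the symbol names `NUM` and `TEXT`, and the coverage-based scoring are all assumptions made for illustration. The sketch matches patterns exactly rather than in a "fuzzy" manner, and it omits hash-based text content encoding.

```python
import re
from collections import defaultdict

# Split a markup string into tags and text runs (simplified; not a real HTML parser).
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def encode(document):
    """Map each tag or text run to a coarse symbol (hypothetical encoding)."""
    symbols = []
    for token in TOKEN_RE.findall(document):
        token = token.strip()
        if not token:
            continue
        if token.startswith("<"):
            symbols.append(token.lower())   # stands in for visual tag encoding
        elif re.fullmatch(r"[\d.,$]+", token):
            symbols.append("NUM")           # stands in for text category encoding
        else:
            symbols.append("TEXT")
    return symbols


def best_repeating_pattern(symbols, min_len=2, max_len=8):
    """Return (pattern, starts) for the exact n-gram covering the most symbols.

    Coverage (pattern length times non-overlapping occurrences) is an assumed
    stand-in for the validation factors mentioned in the abstract.
    """
    best, best_coverage = ((), []), 0
    for n in range(min_len, max_len + 1):
        positions = defaultdict(list)
        for i in range(len(symbols) - n + 1):
            starts = positions[tuple(symbols[i:i + n])]
            # Keep only occurrences that do not overlap the previous one.
            if not starts or i >= starts[-1] + n:
                starts.append(i)
        for gram, starts in positions.items():
            coverage = n * len(starts)
            if len(starts) >= 2 and coverage > best_coverage:
                best, best_coverage = (gram, starts), coverage
    return best


def record_boundaries(symbols):
    """Return (start, end) symbol index pairs, one per detected record."""
    pattern, starts = best_repeating_pattern(symbols)
    return [(s, s + len(pattern)) for s in starts]


doc = ("<ul><li><b>Pen</b>1.50</li><li><b>Ink</b>3.25</li>"
       "<li><b>Pad</b>2.00</li></ul>")
symbols = encode(doc)
print(record_boundaries(symbols))
```

In this toy document each `<li>` element encodes to the same six-symbol sequence, so the sketch reports three records. Real pages rarely repeat this cleanly, which is presumably why the abstract allows for inconsistencies between pattern instances.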