Patent application
US20070027882A1 Record boundary identification and extraction through pattern mining (granted)

  • Patent title: Record boundary identification and extraction through pattern mining
  • Application number: US11192620
    Filing date: 2005-07-28
  • Publication number: US20070027882A1
    Publication date: 2007-02-01
  • Inventor: Parashuram Kulkarni
  • Applicant: Parashuram Kulkarni
  • Main classification: G06F7/00
  • IPC classification: G06F7/00
Record boundary identification and extraction through pattern mining
Abstract:
Techniques for identifying discrete records within a multi-record document are provided. According to one technique, a document is encoded based on some combination of visual tag encoding, text category encoding, and text content encoding that produces hash values based on the contents of portions of the document. According to one technique, repeating candidate patterns are identified in a document so encoded. The candidate patterns may be identified in a “fuzzy” manner that allows for some inconsistencies in the individual pattern instances. According to one technique, the identified candidate patterns are validated based on specified factors to determine a “best” pattern. According to one technique, the boundaries of discrete records in a multi-record document are marked based on the portions of the document that correspond to an identified repeating pattern.
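The pipeline described in the abstract (encode the document, find repeating candidate patterns, pick a "best" pattern, mark record boundaries) can be illustrated with a minimal sketch. This is not the patented method: the function names (`encode`, `best_pattern`, `record_boundaries`), the two text categories (`NUM`, `TEXT`), and the coverage score (occurrences times pattern length) are all assumptions made for illustration. It also matches patterns exactly, whereas the abstract allows "fuzzy" matching, and it omits the hash-based text content encoding.

```python
import re

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def encode(html):
    """Encode a document as a list of coarse symbols (illustrative scheme)."""
    symbols = []
    for tok in TOKEN_RE.findall(html):
        tok = tok.strip()
        if not tok:
            continue
        if tok.startswith("<"):
            # Tag encoding: keep only the tag name, drop attributes.
            m = re.match(r"</?\s*([A-Za-z0-9]+)", tok)
            if m is None:
                continue  # comments, doctype, etc.
            prefix = "/" if tok.startswith("</") else ""
            symbols.append(prefix + m.group(1).lower())
        elif re.fullmatch(r"[$€£]?\d[\d,.]*", tok):
            # Text category encoding (assumed categories): numeric text.
            symbols.append("NUM")
        else:
            symbols.append("TEXT")
    return symbols

def non_overlapping_starts(symbols, pattern):
    """Greedy left-to-right scan for non-overlapping exact occurrences."""
    n, starts, i = len(pattern), [], 0
    while i + n <= len(symbols):
        if tuple(symbols[i:i + n]) == pattern:
            starts.append(i)
            i += n
        else:
            i += 1
    return starts

def best_pattern(symbols, min_len=2):
    """Return (score, pattern, starts) for the highest-coverage repeat."""
    best, seen = None, set()
    for n in range(min_len, len(symbols) // 2 + 1):
        for i in range(len(symbols) - n + 1):
            pattern = tuple(symbols[i:i + n])
            if pattern in seen:
                continue
            seen.add(pattern)
            starts = non_overlapping_starts(symbols, pattern)
            if len(starts) < 2:
                continue
            # Assumed validation factor: symbols covered by the pattern.
            score = len(starts) * n
            if best is None or score > best[0]:
                best = (score, pattern, starts)
    return best

def record_boundaries(symbols):
    """Mark each occurrence of the best pattern as a (start, end) record."""
    found = best_pattern(symbols)
    if found is None:
        return []
    _, pattern, starts = found
    return [(s, s + len(pattern)) for s in starts]

html = ("<ul><li><b>Pen</b> $2</li><li><b>Ink</b> $5</li>"
        "<li><b>Pad</b> $3</li></ul>")
print(record_boundaries(encode(html)))  # [(1, 7), (7, 13), (13, 19)]
```

Each returned pair is a half-open range of symbol indices; here the three ranges correspond to the three `<li>` items. The exhaustive n-gram scan is only practical for small inputs and is used here purely to keep the sketch short.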