- Patent title: System and method for identifying and labeling fields of text associated with scanned business documents
- Application number: US10970930
- Filing date: 2004-10-22
- Publication number: US07689037B2
- Publication date: 2010-03-30
- Inventors: John C. Handley, M. Armon Rahgozar, Dennis L. Venable, Pamela B. Spiteri, Anoop M. Namboodiri, Richard Zanibbi
- Applicants: John C. Handley, M. Armon Rahgozar, Dennis L. Venable, Pamela B. Spiteri, Anoop M. Namboodiri, Richard Zanibbi
- Applicant address: Norwalk, CT, US
- Assignee: Xerox Corporation
- Current assignee: Xerox Corporation
- Current assignee address: Norwalk, CT, US
- Agency: Basch & Nickerson LLP
- Agent: Michael J. Nickerson
- Main classification: G06K9/34
- IPC classification: G06K9/34
Abstract:
A system for electronically distilling information from a business document uses a network scanner to electronically scan a platen area, having a business document thereon, to create a bitmap. A network server carries out a segmentation process to segment the scan-generated bitmap into a bitmap object, the bitmap object corresponding to the scanned business document; a bitmap-to-text conversion process to convert the bitmap object into a block of text; a semantic recognition process to generate a structured representation of semantic entities corresponding to the scanned business document; and a document generation process to convert the structured representation into a structured text file. The semantic recognition process includes the processes of generating, for each line of text having a keyword therein, a terminal symbol corresponding to that keyword; generating, for each line of text having no keyword and no numeric characters therein, an alphabetic terminal symbol; generating, for each line of text having no keyword but having a numeric character therein, an alphanumeric terminal symbol; generating a string of terminal symbols from the generated terminal symbols; determining a probable parsing of the generated string of terminal symbols; labeling each text line, according to a determined function, with non-terminal symbols; and parsing the business document information text into fields of business document information text based upon the non-terminal symbol of each text line and the determined probable parsing of the generated string of terminal symbols.
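The terminal-symbol step of the semantic recognition process can be illustrated with a minimal Python sketch. It covers only the per-line classification (keyword, alphabetic, alphanumeric) and the assembly of the terminal string; the probable parsing and non-terminal labeling steps are not shown. The keyword table, the symbol names, and the function names are all hypothetical, since the abstract does not list them, and skipping blank lines is likewise an assumption.

```python
import re

# Hypothetical keyword table: the abstract does not enumerate the keywords
# or the names of their terminal symbols.
KEYWORD_SYMBOLS = {
    "phone": "PHONE_KW",
    "fax": "FAX_KW",
    "email": "EMAIL_KW",
}

# Hypothetical names for the two non-keyword terminal symbols.
ALPHABETIC = "ALPHA"
ALPHANUMERIC = "ALNUM"


def terminal_symbol(line: str) -> str:
    """Map one line of recognized text to a terminal symbol."""
    # A line containing a keyword gets that keyword's terminal symbol.
    for word in re.findall(r"[a-z]+", line.lower()):
        if word in KEYWORD_SYMBOLS:
            return KEYWORD_SYMBOLS[word]
    # No keyword: a line with any numeric character is alphanumeric.
    if any(ch.isdigit() for ch in line):
        return ALPHANUMERIC
    # No keyword and no numeric characters: alphabetic.
    return ALPHABETIC


def terminal_string(lines):
    """Build the string (here, a list) of terminal symbols for a text block.

    Blank lines are skipped; this is an assumption of the sketch.
    """
    return [terminal_symbol(line) for line in lines if line.strip()]


if __name__ == "__main__":
    card = ["John Smith", "Acme Corp", "123 Main St", "Phone: 555-0100"]
    print(terminal_string(card))
```

In the described system, a string like this would then be handed to the parsing step, which determines a probable parse and labels each line with a non-terminal symbol.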