Granted invention patent
US08645819B2 Detection and extraction of elements constituting images in unstructured document files
Status: in force
- Patent title: Detection and extraction of elements constituting images in unstructured document files
- Application number: US13162858
- Filing date: 2011-06-17
- Publication number: US08645819B2
- Publication date: 2014-02-04
- Inventor: Hervé Déjean
- Applicant: Hervé Déjean
- Applicant address: Norwalk, CT, US
- Assignee: Xerox Corporation
- Current assignee: Xerox Corporation
- Current assignee address: Norwalk, CT, US
- Agency: Fay Sharpe LLP
- Main classification: G06F17/00
- IPC classification: G06F17/00
Abstract:
A method and a system for detecting and extracting images in an electronic document are disclosed. The method includes receiving an electronic document and identifying elements of a page. The identified elements include a set of graphical elements and a set of text elements. The method may include identifying and excluding elements which serve as graphical page constructs and/or text formatting elements. The page can then be segmented, based on (remaining) graphical elements and identified white spaces, to generate a set of image blocks. Text elements that are associated with a respective image block are identified as captions. Overlapping candidate images are then grouped to form a new image. The new image can thus include candidate images which would, without the identification of their caption(s), each be treated as a respective image.
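The final grouping step described in the abstract can be illustrated with a minimal sketch. It assumes that each candidate image is represented by an axis-aligned bounding box and that "overlapping" means the boxes share some area; the `Box` class and the `group_overlapping` function are hypothetical names introduced here, not taken from the patent. Page segmentation and caption identification are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a candidate image (assumed representation)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # True when the two boxes share a region of non-zero area.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)

    def union(self, other: "Box") -> "Box":
        # Smallest box that encloses both boxes.
        return Box(min(self.x0, other.x0), min(self.y0, other.y0),
                   max(self.x1, other.x1), max(self.y1, other.y1))


def group_overlapping(boxes):
    """Merge overlapping boxes repeatedly until no two remaining boxes overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result = []
        for box in merged:
            for i, kept in enumerate(result):
                if box.overlaps(kept):
                    # Grow the kept box; a later pass catches new overlaps.
                    result[i] = kept.union(box)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged


candidates = [Box(0, 0, 10, 10), Box(5, 5, 15, 15), Box(100, 100, 110, 110)]
groups = group_overlapping(candidates)
```

In this example the first two candidates overlap and are merged into one enclosing box, while the third stays separate. The loop repeats until a pass makes no merges, so chains of overlaps (A overlaps B, B overlaps C) also collapse into a single group.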