System and method for logical structuring of documents based on trailing and leading pages
    1.
    Granted patent
    Status: in force

    Publication No.: US09110868B2

    Publication date: 2015-08-18

    Application No.: US12974843

    Filing date: 2010-12-21

    Applicant: Hervé Déjean

    Inventor: Hervé Déjean

    IPC classification: G06F17/00 G06F17/21

    CPC classification: G06F17/211 G06F17/212

    Abstract: A system, method, and computer program product for determining the structure of a document are provided. The method includes receiving a set of document pages for a document and linking one page frame to each of a plurality of document pages in the set. For each document page linked to a page frame, a content bounding box surrounding the content on the document page is identified, and the document page is categorized, based at least in part on the geometrical relationship between the page frame and the content bounding box of the document page. The document page can then be identified as a logical cut based at least in part on the categorization of the document page. Information, such as a table of contents or updated table of contents, can then be output, based on the determined logical unit(s) of the document.

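A minimal sketch of the categorization step described in the abstract, assuming a single page frame shared by all pages and a y axis that grows downward. The names `Box`, `categorize`, and `logical_cuts`, the page labels, and the 20-unit tolerance are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; y grows downward (assumed convention)."""
    x0: float
    y0: float
    x1: float
    y1: float


def categorize(frame: Box, content: Box, tol: float = 20.0) -> str:
    """Label a page by where its content box sits inside the page frame."""
    starts_low = content.y0 - frame.y0 > tol   # blank space at the top
    ends_high = frame.y1 - content.y1 > tol    # blank space at the bottom
    if starts_low and not ends_high:
        return "leading"    # e.g. a chapter opening page
    if ends_high and not starts_low:
        return "trailing"   # e.g. the last page of a chapter
    if not starts_low and not ends_high:
        return "full"
    return "short"          # blank space at both ends


def logical_cuts(frame, contents, tol=20.0):
    """Return page indices at which a new logical unit is assumed to start."""
    labels = [categorize(frame, c, tol) for c in contents]
    cuts = set()
    for i, label in enumerate(labels):
        if label == "leading" and i > 0:
            cuts.add(i)          # a unit starts on a leading page
        if label == "trailing" and i + 1 < len(labels):
            cuts.add(i + 1)      # a unit starts after a trailing page
    return sorted(cuts)
```

With a frame of `Box(50, 50, 550, 750)`, a page whose content ends at y = 400 is labeled trailing, so the following page is treated as the start of a new logical unit.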

    System and method for page frame detection
    2.
    Granted patent
    Status: in force

    Publication No.: US08645821B2

    Publication date: 2014-02-04

    Application No.: US12892138

    Filing date: 2010-09-28

    Applicant: Hervé Déjean

    Inventor: Hervé Déjean

    IPC classification: G06F17/25

    CPC classification: G06K9/00463

    Abstract: A system and method for page frame detection for pages of a document are disclosed. The method includes receiving a set of document pages for a document, each page having at least one detected object. For each page in the set, the method includes determining dimensions of a bounding box that encompasses the detected objects of the page and determining margin dimensions, based on a position of the bounding box on the page. A page frame is computed as a combination of bounding box dimensions and margin dimensions, based on frequencies of the bounding box dimensions and margin dimensions computed for the set of pages. The computed page frame is matched to pages of the document. Information based on the matching, such as content of text objects within the matched page frame, can be output.

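The frequency-based idea in the abstract can be sketched as follows, assuming each page is a list of object boxes `(x0, y0, x1, y1)` and that left margin, top margin, width, and height are voted on independently. The function names and the 5-unit rounding step are hypothetical choices made for illustration.

```python
from collections import Counter


def bounding_box(objects):
    """Smallest box (x0, y0, x1, y1) enclosing all object boxes on a page."""
    return (min(o[0] for o in objects), min(o[1] for o in objects),
            max(o[2] for o in objects), max(o[3] for o in objects))


def most_frequent(values, step=5):
    """Most common value after rounding to a multiple of `step`."""
    buckets = Counter(step * round(v / step) for v in values)
    return buckets.most_common(1)[0][0]


def page_frame(pages, step=5):
    """Estimate one frame (x0, y0, x1, y1) for a set of pages."""
    boxes = [bounding_box(objs) for objs in pages if objs]
    left = most_frequent([b[0] for b in boxes], step)            # left margin
    top = most_frequent([b[1] for b in boxes], step)             # top margin
    width = most_frequent([b[2] - b[0] for b in boxes], step)
    height = most_frequent([b[3] - b[1] for b in boxes], step)
    return (left, top, left + width, top + height)
```

Because each dimension is chosen by frequency, a few short pages (for example a half-empty last page) do not shrink the estimated frame.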

    Versatile page number detector
    3.
    Granted patent
    Status: in force

    Publication No.: US07797622B2

    Publication date: 2010-09-14

    Application No.: US11599947

    Filing date: 2006-11-15

    IPC classification: G06F17/27

    CPC classification: G06F17/30569 G06K9/00469

    Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.

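As a rough illustration of sequence-based detection, the sketch below handles only one numbering scheme (Arabic numerals that increase by 1 per page) and groups candidate terms by the offset between their value and the page index. This grouping, the `min_len` threshold, and the function name are simplifying assumptions; the abstract does not specify them.

```python
import re
from collections import defaultdict


def detect_page_numbers(pages, min_len=3):
    """Map page index -> page number for pages covered by an incrementing run.

    `pages` is a list of lists of text fragments, one list per page.
    """
    by_offset = defaultdict(dict)  # offset -> {page index: number}
    for idx, fragments in enumerate(pages):
        for frag in fragments:
            text = frag.strip()
            if re.fullmatch(r"\d+", text):
                value = int(text)
                # Terms of one sequence share the same value-minus-index offset.
                by_offset[value - idx][idx] = value

    result = {}
    # Prefer sequences supported by more pages; drop weakly supported ones.
    for _, hits in sorted(by_offset.items(), key=lambda kv: -len(kv[1])):
        if len(hits) < min_len:
            continue
        for idx, value in hits.items():
            result.setdefault(idx, value)
    return result
```

Stray numbers such as a year on the title page form sequences of length 1 and are discarded, while pages without a detected number are simply left out of the result.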

    Matching a page layout for each page of a document to a page template candidate from a list of page layout candidates
    4.
    Granted patent
    Status: in force

    Publication No.: US08719700B2

    Publication date: 2014-05-06

    Application No.: US12773125

    Filing date: 2010-05-04

    Applicant: Hervé Déjean

    Inventor: Hervé Déjean

    IPC classification: G06K9/62

    CPC classification: G06F17/243 G06F17/248

    Abstract: A computer-implemented method and system for generation of page templates are provided. The method includes providing a document in computer memory. Using a computer processor, page elements within the document are identified and labeled. For each page of the document, a set of geometric relations between pairs of page elements co-occurring on the page is computed, and the set of geometric relations is associated with the page. The method also includes generating a set of page template candidates based at least in part on the computed geometric relations, selecting page templates from the set of page template candidates, and outputting the selected page templates.

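One way to picture the pairwise geometric relations is the sketch below. It assumes each page is a dict mapping an element label to a box `(x0, y0, x1, y1)`, uses only two relations (`above`, `left_of`), and treats every relation set seen on at least `min_support` pages as a template candidate. All of these choices are hypothetical simplifications.

```python
from collections import Counter
from itertools import combinations


def relations(page):
    """Return the set of (a, relation, b) triples for one labeled page."""
    rels = set()
    for (a, box_a), (b, box_b) in combinations(sorted(page.items()), 2):
        if box_a[3] <= box_b[1]:
            rels.add((a, "above", b))
        elif box_b[3] <= box_a[1]:
            rels.add((b, "above", a))
        if box_a[2] <= box_b[0]:
            rels.add((a, "left_of", b))
        elif box_b[2] <= box_a[0]:
            rels.add((b, "left_of", a))
    return frozenset(rels)


def template_candidates(pages, min_support=2):
    """Relation sets shared by at least `min_support` pages, most frequent first."""
    counts = Counter(relations(p) for p in pages)
    return [rels for rels, c in counts.most_common() if c >= min_support]
```

Two pages with different coordinates but the same arrangement of header, body, and page number produce the same relation set and therefore support the same candidate.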

    Detection and extraction of elements constituting images in unstructured document files
    5.
    Granted patent
    Status: in force

    Publication No.: US08645819B2

    Publication date: 2014-02-04

    Application No.: US13162858

    Filing date: 2011-06-17

    Applicant: Hervé Déjean

    Inventor: Hervé Déjean

    IPC classification: G06F17/00

    CPC classification: G06F17/211

    Abstract: A method and a system for detecting and extracting images in an electronic document are disclosed. The method includes receiving an electronic document and identifying elements of a page. The identified elements include a set of graphical elements and a set of text elements. The method may include identifying and excluding elements which serve as graphical page constructs and/or text formatting elements. The page can then be segmented, based on (remaining) graphical elements and identified white spaces, to generate a set of image blocks. Text elements that are associated with a respective image block are identified as captions. Overlapping candidate images are then grouped to form a new image. The new image can thus include candidate images which would, without the identification of their caption(s), each be treated as a respective image.

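The final grouping step can be illustrated with a simple box-merging loop. The sketch assumes each candidate image is already a box `(x0, y0, x1, y1)` that includes its caption; how candidates and captions are found is outside its scope, and the function names are illustrative.

```python
def overlaps(a, b):
    """True if boxes (x0, y0, x1, y1) share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_overlapping(boxes):
    """Merge overlapping boxes repeatedly until all remaining boxes are disjoint."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since the merged box may now overlap others
    return merged
```

The scan restarts after every merge because a grown box can start to overlap a candidate that was previously disjoint from both of its parts.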

    Generate-and-test method for column segmentation
    6.
    Granted patent
    Status: in force

    Publication No.: US08560937B2

    Publication date: 2013-10-15

    Application No.: US13155011

    Filing date: 2011-06-07

    Applicant: Hervé Déjean

    Inventor: Hervé Déjean

    IPC classification: G06F17/27

    CPC classification: G06K9/00463

    Abstract: A system, method, and computer program product for segmenting a document are disclosed. The method considers a zone of a document, such as a page frame or other zone which is a predetermined ratio thereof, and while there are remaining elements in the zone, iteratively tests different segmentations of the zone into n candidate columns, and computes a width of a gutter for each n-candidate. Assuming that the gutter width computed meets a threshold test, which may be based on the arrangement of the elements in the columns, and the candidate columns for the n-candidate each contain at least a threshold number of elements, elements are assigned to respective ones of n segmented columns within which they are located. For example, line elements are arranged in blocks of text within the columns, enabling a reading order for sequences of text, such as complete sentences and paragraphs, to be computed.

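A generate-and-test loop of this kind might look like the sketch below. It assumes elements are given by their horizontal extents `(x0, x1)`, that candidate columns split the zone into equal widths, and that larger values of n are tried first; the thresholds and the function name are illustrative, and the handling of elements that span several columns is omitted.

```python
def segment_columns(elements, zone, max_n=4, min_gutter=10.0, min_elems=2):
    """Return a list of columns, each a list of (x0, x1) element extents."""
    zx0, zx1 = zone
    for n in range(max_n, 1, -1):                 # generate: try n candidate columns
        width = (zx1 - zx0) / n
        columns = [[] for _ in range(n)]
        for x0, x1 in elements:
            center = (x0 + x1) / 2
            k = min(n - 1, max(0, int((center - zx0) // width)))
            columns[k].append((x0, x1))
        # Test 1: every candidate column must hold enough elements.
        if any(len(col) < min_elems for col in columns):
            continue
        # Test 2: the narrowest gutter between neighbouring columns must be wide enough.
        gutter = min(
            min(x0 for x0, _ in right) - max(x1 for _, x1 in left)
            for left, right in zip(columns, columns[1:])
        )
        if gutter >= min_gutter:
            return columns
    return [list(elements)]                       # fall back to a single column
```

For a zone of `(0, 600)` with lines spanning 50 to 290 and 310 to 550, the 4- and 3-column candidates fail the element-count test and the 2-column candidate passes with a gutter of 20.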

    Captions detector
    7.
    Granted patent
    Status: in force

    Publication No.: US07852499B2

    Publication date: 2010-12-14

    Application No.: US11528261

    Filing date: 2006-09-27

    Applicant: Hervé Déjean

    Inventor: Hervé Déjean

    IPC classification: G06F15/00 G06F3/00

    CPC classification: G06F17/2745

    Abstract: To detect captions in a document that includes text fragments and objects of interest, a signature is assigned to each text fragment. The signature is the value for that text fragment of a text fragment representation comprising at least one text fragment attribute. A caption signature is identified as a signature assigned to a substantial number of text fragments that are near at least one object of interest in the document. One or more captions are detected as one or more text fragments each assigned a caption signature.

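The signature idea can be sketched as follows, assuming the signature is the pair (font, size), that "near" means directly above or below an object within a small gap, and that "substantial" means at least two near fragments making up at least half of all fragments with that signature. These definitions and names are assumptions made for illustration.

```python
from collections import Counter


def near(frag_box, obj_box, max_gap=15.0):
    """True if the fragment sits just above or below the object and overlaps it horizontally."""
    fx0, fy0, fx1, fy1 = frag_box
    ox0, oy0, ox1, oy1 = obj_box
    h_overlap = fx0 < ox1 and ox0 < fx1
    below = 0 <= fy0 - oy1 <= max_gap
    above = 0 <= oy0 - fy1 <= max_gap
    return h_overlap and (below or above)


def signature(fragment):
    """Signature of a fragment: here simply its font name and size."""
    return (fragment["font"], fragment["size"])


def detect_captions(fragments, objects, min_ratio=0.5, min_count=2):
    """Return fragments whose signature is mostly found near objects of interest."""
    total = Counter(signature(f) for f in fragments)
    near_counts = Counter(
        signature(f) for f in fragments
        if any(near(f["box"], o) for o in objects)
    )
    caption_sigs = {
        s for s, c in near_counts.items()
        if c >= min_count and c / total[s] >= min_ratio
    }
    return [f for f in fragments if signature(f) in caption_sigs]
```

Body text that happens to sit next to a figure is not reported, because most fragments sharing its signature are far from any object.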

    Table of contents extraction with improved robustness
    8.
    Granted patent
    Status: lapsed

    Publication No.: US07743327B2

    Publication date: 2010-06-22

    Application No.: US11360963

    Filing date: 2006-02-23

    IPC classification: G06F17/21

    CPC classification: G06F17/2745

    Abstract: In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of contents entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of contents entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to distribution of the linked text fragments.

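A much simplified sketch of the entry-to-body linking idea: it looks for the longest contiguous run of fragments that each reappear later in the document, in the same order, and accepts the run only if every link points past the run itself. Exact text matching after the normalization in `norm`, the `min_entries` threshold, and both function names are assumptions; the reduction and validation criteria in the patent are not reproduced here.

```python
import re


def norm(text):
    """Lowercase and drop leader dots and a trailing page number."""
    return re.sub(r"[\s.]*\d*\s*$", "", text).strip().lower()


def find_toc(fragments, min_entries=3):
    """Return (start, end, links) with `end` exclusive, or None if no run qualifies."""
    keys = [norm(f) for f in fragments]
    best = None
    for start in range(len(keys)):
        links, last, end = [], -1, start
        while end < len(keys):
            # Link entry `end` to the first later fragment with the same text,
            # keeping links in increasing document order.
            target = next(
                (k for k in range(max(end, last) + 1, len(keys))
                 if keys[k] and keys[k] == keys[end]),
                None,
            )
            if target is None:
                break
            links.append(target)
            last = target
            end += 1
        # Validate: enough entries, and all links fall after the run itself.
        if len(links) >= min_entries and links[0] >= end:
            if best is None or len(links) > len(best[2]):
                best = (start, end, links)
    return best
```

Requiring links to increase through the document is a crude stand-in for a validation criterion on how the linked fragments are distributed.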

    System and method for identifying regular geometric structures in document pages
    9.
    Granted patent
    Status: in force

    Publication No.: US09008443B2

    Publication date: 2015-04-14

    Application No.: US13530141

    Filing date: 2012-06-22

    Applicant: Hervé Déjean

    Inventor: Hervé Déjean

    IPC classification: G06K9/46 G06K9/00

    CPC classification: G06K9/00463

    Abstract: A system and method for identifying regular geometric structures in a document page are disclosed. For a document page for which a set of page elements has been identified, the method includes identifying, where present, geometric relations among a subset of the page elements, from a predefined set of geometric relations, and a geometric structure comprising regular rows and regular columns, based on the identified geometric relations. Constraints of a definition of a regular geometric structure are applied to the identified geometric structure and, where the subset of page elements includes regular rows and regular columns forming a geometric structure which meets the constraints of the definition of a regular geometric structure, the subset of the page elements is identified as forming a regular geometric structure and may be labeled or tested to determine if it can be expanded by adding one or more rows or columns.

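As a small illustration of testing for regular rows and columns, the sketch below takes the top-left corners of page elements, clusters their x and y coordinates, and accepts the subset only if every row/column cell is occupied exactly once. The alignment tolerance, the "complete grid" definition of regularity, and the function names are assumptions, and the expansion step mentioned in the abstract is not covered.

```python
from bisect import bisect_right


def cluster(values, tol):
    """Group sorted values whose consecutive gaps are <= tol; return each group's minimum."""
    groups = []
    for v in sorted(values):
        if not groups or v - groups[-1][-1] > tol:
            groups.append([v])
        else:
            groups[-1].append(v)
    return [g[0] for g in groups]


def regular_grid(points, tol=2.0):
    """Return (rows, cols) if the (x, y) points form a complete grid, else None."""
    xs = cluster([p[0] for p in points], tol)   # column positions
    ys = cluster([p[1] for p in points], tol)   # row positions
    if len(xs) < 2 or len(ys) < 2:
        return None
    # Map each point to its (row, column) cell via the cluster minima.
    cells = {(bisect_right(ys, y) - 1, bisect_right(xs, x) - 1) for x, y in points}
    if len(cells) == len(points) == len(xs) * len(ys):
        return (len(ys), len(xs))
    return None
```

Six elements aligned on two x positions and three y positions are reported as a 3 by 2 structure; removing any one of them leaves an incomplete grid, and the function returns `None`.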

    SYSTEM AND METHOD FOR UNSUPERVISED GENERATION OF PAGE TEMPLATES
    10.
    Patent application
    Status: in force

    Publication No.: US20110276874A1

    Publication date: 2011-11-10

    Application No.: US12773125

    Filing date: 2010-05-04

    Applicant: Hervé Déjean

    Inventor: Hervé Déjean

    IPC classification: G06F3/14 G06F11/07

    CPC classification: G06F17/243 G06F17/248

    Abstract: A computer-implemented method and system for generation of page templates are provided. The method includes providing a document in computer memory. Using a computer processor, page elements within the document are identified and labeled. For each page of the document, a set of geometric relations between pairs of page elements co-occurring on the page is computed, and the set of geometric relations is associated with the page. The method also includes generating a set of page template candidates based at least in part on the computed geometric relations, selecting page templates from the set of page template candidates, and outputting the selected page templates.
