Search and retrieval of documents indexed by optical character recognition
    1.
    Type: granted invention patent
    Status: in force

    Publication No.: US08208765B2

    Publication date: 2012-06-26

    Application No.: US11972446

    Filing date: 2008-01-10

    IPC classification: G06K9/00

    Abstract: An image of a character string of M characters is clipped from a document image and divided into individual character images, and the image features of each character image are extracted. Based on these features, the N character images (N an integer greater than 1) most similar to each character image are selected as candidate characters, in descending order of similarity, from a character image feature dictionary that stores image features character by character, and a first index matrix of M×N cells is prepared. The candidate character string formed by the candidate characters in the first column of the first index matrix is subjected to lexical analysis according to a language model, whereby a second index matrix whose character string makes sense is prepared. In the language model, statistics are collected first and the lexical analysis is then performed.
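
    Illustrative sketch: the first index matrix described in the abstract can be pictured as one row of N ranked candidates per character image. The Python below is a minimal sketch under stated assumptions, not the patented implementation: the feature vectors, the toy dictionary, the cosine similarity measure, and the names `cosine` and `build_index_matrix` are all hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index_matrix(char_features, dictionary, n):
    """Return an M x N matrix: row i holds the N dictionary characters most
    similar to the i-th character image, most similar first."""
    matrix = []
    for feat in char_features:
        ranked = sorted(dictionary,
                        key=lambda ch: cosine(feat, dictionary[ch]),
                        reverse=True)
        matrix.append(ranked[:n])
    return matrix


# Hypothetical feature dictionary: one made-up 3-dimensional vector per character.
DICTIONARY = {
    "c": [1.0, 0.1, 0.0],
    "e": [0.9, 0.3, 0.1],
    "a": [0.2, 1.0, 0.1],
    "o": [0.8, 0.2, 0.3],
    "t": [0.0, 0.2, 1.0],
    "l": [0.1, 0.0, 0.9],
}

# Made-up features for M = 3 segmented character images.
observed = [[1.0, 0.1, 0.0], [0.2, 1.0, 0.1], [0.0, 0.2, 1.0]]
index_matrix = build_index_matrix(observed, DICTIONARY, n=2)
first_column = "".join(row[0] for row in index_matrix)
```

    Here `first_column` is the candidate character string that the abstract passes on to lexical analysis; the remaining columns keep the runner-up candidates for each position.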

    CHARACTER IMAGE EXTRACTING APPARATUS AND CHARACTER IMAGE EXTRACTING METHOD
    2.
    Type: invention patent application
    Status: in force

    Publication No.: US20090028435A1

    Publication date: 2009-01-29

    Application No.: US11963613

    Filing date: 2007-12-21

    IPC classification: G06K9/46

    Abstract: In an extracting step, an extracting portion obtains linked components, each composed of a plurality of mutually linked pixels, from a character string region composed of a plurality of characters, and extracts section elements from the character string region, each section element being enclosed by a figure circumscribing a linked component. In a first altering step, a first altering portion combines those extracted section elements that at least partly overlap one another into a new section element. In a first selecting step, a first selecting portion, using a reference size determined in advance, selects from among the section elements altered in the first altering step those larger than the reference size.
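
    Illustrative sketch: the three steps in the abstract (extract circumscribed section elements, combine overlapping ones, keep those above a reference size) can be tried on a tiny bitmap. This is a minimal sketch under assumptions the abstract does not state: 8-connectivity, axis-aligned bounding rectangles as the circumscribing figure, and "size" taken as the longer side of the rectangle. All function names are hypothetical.

```python
from collections import deque


def bounding_boxes(bitmap):
    """Bounding box (top, left, bottom, right) of each 8-connected
    component of '#' pixels, in scan order."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if bitmap[r][c] != "#" or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and bitmap[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def overlaps(a, b):
    """True if two boxes share at least one pixel position."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_overlapping(boxes):
    """Union overlapping boxes repeatedly until no two boxes overlap."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


def select_large(boxes, reference_size):
    """Keep boxes whose longer side exceeds the reference size."""
    return [b for b in boxes
            if max(b[2] - b[0] + 1, b[3] - b[1] + 1) > reference_size]


BITMAP = [
    "####.......",
    "#..........",
    "#.#.....#..",
    "#.......#..",
    "####....#.#",
]
raw = bounding_boxes(BITMAP)          # four components
combined = merge_overlapping(raw)     # the dot inside the "C" is absorbed
kept = select_large(combined, reference_size=2)  # the isolated dot is dropped
```

    In this toy bitmap the isolated pixel inside the C-shaped stroke is folded into the stroke's rectangle, and the stray pixel at the right edge falls below the reference size, so two section elements remain.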

    Document image processing apparatus
    3.
    Type: granted invention patent
    Status: in force

    Publication No.: US08160402B2

    Publication date: 2012-04-17

    Application No.: US11972477

    Filing date: 2008-01-10

    IPC classification: G06K9/03 G06K9/18

    Abstract: An image of a character string of M characters is clipped from a document image and divided character by character, and the image features of each character image are extracted. Based on these features, the N character images (N an integer greater than 1) most similar to each character image are selected as candidate characters, in descending order of similarity, from a character image feature dictionary that stores image features character by character, and a first index matrix of M×N cells is prepared. The candidate character string formed by the candidate characters in the first column of the first index matrix is subjected to lexical analysis according to a predetermined language model, whereby a second index matrix, adjusted so that its character string makes sense, is prepared for use in searching.
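
    Illustrative sketch: one way to picture the adjustment from the first to the second index matrix is to pick, for each row, the candidate that yields the most plausible string and move it to the first column. The abstract does not specify the language model; the sketch below assumes a toy character-bigram count table and a small penalty for lower-ranked candidates, and it enumerates every combination, which is only practical for short strings. The names `adjust_first_column`, `BIGRAMS`, and `rank_penalty` are hypothetical.

```python
from itertools import product


def adjust_first_column(index_matrix, bigram_counts, rank_penalty=0.5):
    """Choose one candidate per row so that the resulting string scores best
    under the bigram table, then move each chosen candidate to column 0."""
    def score(choice):
        chars = [row[k] for row, k in zip(index_matrix, choice)]
        lm = sum(bigram_counts.get(a + b, 0) for a, b in zip(chars, chars[1:]))
        # Prefer higher-ranked (more similar) candidates when the model is indifferent.
        return lm - rank_penalty * sum(choice)

    column_ranges = [range(len(row)) for row in index_matrix]
    best = max(product(*column_ranges), key=score)
    second = []
    for row, k in zip(index_matrix, best):
        second.append([row[k]] + row[:k] + row[k + 1:])
    return second


# First index matrix whose first column reads "cot".
first_matrix = [["c", "e"], ["o", "a"], ["t", "l"]]

# Made-up bigram counts standing in for language-model statistics.
BIGRAMS = {"ca": 5, "at": 6, "co": 4, "ot": 2}

second_matrix = adjust_first_column(first_matrix, BIGRAMS)
adjusted = "".join(row[0] for row in second_matrix)
```

    With these made-up counts the first column changes from "cot" to "cat" while the displaced candidate "o" stays in the row as a runner-up, so a later search can still match either reading.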

    Image document processing device, image document processing method, program, and storage medium
    4.
    Type: granted invention patent
    Status: in force

    Publication No.: US08290269B2

    Publication date: 2012-10-16

    Application No.: US11953695

    Filing date: 2007-12-10

    CPC classification: G06K9/6828 G06F17/30253

    Abstract: A headline-region initial processing section clips a headline-region image from an image document, divides the image into individual character images, and extracts the features of each character image. Based on these features, a candidate-character-sequence generating section selects N character images (N an integer greater than 1) as candidate characters, in order of degree of matching, from a font-feature dictionary that stores the features of individual character images, and generates an M×N index matrix, where M is the number of characters in the extracted character sequence. Based on the index matrix, a document-name generating section generates a meaningful document name for the image document. An image-document-DB management section manages the accumulated image documents using the document name. The result is an image document processing device and an image document processing method that automatically generate and manage a meaningful document name representing the contents of the image document, without user operation.

    DOCUMENT IMAGE PROCESSING APPARATUS, DOCUMENT IMAGE PROCESSING METHOD, DOCUMENT IMAGE PROCESSING PROGRAM, AND RECORDING MEDIUM ON WHICH DOCUMENT IMAGE PROCESSING PROGRAM IS RECORDED
    5.
    Type: invention patent application
    Status: in force

    Publication No.: US20090028446A1

    Publication date: 2009-01-29

    Application No.: US11972446

    Filing date: 2008-01-10

    IPC classification: G06K9/72

    Abstract: An image of a character string of M characters is clipped from a document image and divided into individual character images, and the image features of each character image are extracted. Based on these features, the N character images (N an integer greater than 1) most similar to each character image are selected as candidate characters, in descending order of similarity, from a character image feature dictionary that stores image features character by character, and a first index matrix of M×N cells is prepared. The candidate character string formed by the candidate characters in the first column of the first index matrix is subjected to lexical analysis according to a language model, whereby a second index matrix whose character string makes sense is prepared. In the language model, statistics are collected first and the lexical analysis is then performed.

    Image document processing device, image document processing method, program, and storage medium
    6.
    Type: invention patent application
    Status: in force

    Publication No.: US20080181505A1

    Publication date: 2008-07-31

    Application No.: US11953695

    Filing date: 2007-12-10

    IPC classification: G06K9/46

    CPC classification: G06K9/6828 G06F17/30253

    Abstract: A headline-region initial processing section clips a headline-region image from an image document, divides the image into individual character images, and extracts the features of each character image. Based on these features, a candidate-character-sequence generating section selects N character images (N an integer greater than 1) as candidate characters, in order of degree of matching, from a font-feature dictionary that stores the features of individual character images, and generates an M×N index matrix, where M is the number of characters in the extracted character sequence. Based on the index matrix, a document-name generating section generates a meaningful document name for the image document. An image-document-DB management section manages the accumulated image documents using the document name. The result is an image document processing device and an image document processing method that automatically generate and manage a meaningful document name representing the contents of the image document, without user operation.

    Image document processing device, image document processing method, program, and storage medium
    7.
    Type: granted invention patent
    Status: in force

    Publication No.: US08295600B2

    Publication date: 2012-10-23

    Application No.: US11952823

    Filing date: 2007-12-07

    Abstract: An image document processing device extracts a character sequence image of M characters from an image document, divides the image into individual character images, and extracts the features of each character image. Based on these features, it selects N character images (N an integer greater than 1), in order of degree of matching, from a font-feature dictionary that stores the features of all character images by font, and generates an M×N index matrix for the extracted character sequence. In searching, the device searches an index-information storage section for each search character in a search keyword of an input search expression, and extracts the image documents whose index matrix includes the search keyword. The result is an image document processing device and an image document processing method that index without user operation and search with high precision without OCR recognition.
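
    Illustrative sketch: searching stored index matrices character by character can be pictured as looking for consecutive rows whose candidate lists contain the successive keyword characters. The scoring rule below (a hit in column k of an N-column row is worth N - k) and the names `match_score`, `search`, and `INDEX_STORE` are assumptions made for this sketch; the abstract does not give the matching rule.

```python
def match_score(index_matrix, keyword):
    """Best score of `keyword` against any run of consecutive rows, or None
    if no run contains every keyword character among its candidates."""
    best = None
    for start in range(len(index_matrix) - len(keyword) + 1):
        total = 0
        for offset, ch in enumerate(keyword):
            row = index_matrix[start + offset]
            if ch not in row:
                break
            # Earlier columns are better matches, so they score higher.
            total += len(row) - row.index(ch)
        else:
            if best is None or total > best:
                best = total
    return best


def search(index_store, keyword):
    """Return ids of documents whose index matrix contains the keyword,
    best-scoring first (ties broken by document id)."""
    hits = []
    for doc_id, matrix in index_store.items():
        s = match_score(matrix, keyword)
        if s is not None:
            hits.append((s, doc_id))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in hits]


# Hypothetical index-information store: document id -> M x N index matrix.
INDEX_STORE = {
    "scan_001": [["c", "e"], ["a", "o"], ["t", "l"], ["s", "5"]],
    "scan_002": [["e", "c"], ["o", "a"], ["t", "f"]],
    "scan_003": [["d", "o"], ["o", "a"], ["g", "q"]],
}

results = search(INDEX_STORE, "cat")
```

    Because every row keeps several candidates, "scan_002" is still retrieved for "cat" even though its first-choice reading is "eot"; it simply ranks below "scan_001", where all three characters are first choices.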

    DOCUMENT IMAGE PROCESSING APPARATUS AND DOCUMENT IMAGE PROCESSING METHOD
    8.
    Type: invention patent application
    Status: under examination (published)

    Publication No.: US20090030882A1

    Publication date: 2009-01-29

    Application No.: US11972476

    Filing date: 2008-01-10

    IPC classification: G06F7/06 G06F17/30

    CPC classification: G06F16/40 G06F16/5846

    Abstract: A document image processing apparatus is provided that reduces the effort of finding a desired heading in a document image. A heading region extracting portion searches an index information DB and extracts heading regions containing a search keyword. An order setting portion automatically sets an order for the extracted heading regions according to a predetermined rule. A displaying portion displays a document image in which the extracted heading regions are highlighted in the order set by the order setting portion. The display order of search results may be set by determining the importance of the extracted heading regions based on the number of occurrences of the search keyword and the features of the character images in the heading regions.
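
    Illustrative sketch: the ordering rule mentioned at the end of the abstract can be pictured as a scoring function over the extracted heading regions. The abstract names only the inputs (keyword count and character-image features), so the sketch assumes a made-up rule: the keyword count plus a small weight on average character height. The field names, the weight, and the function names are hypothetical.

```python
def importance(region, keyword, height_weight=0.1):
    """Assumed importance: keyword occurrences plus a bonus for larger type."""
    return region["text"].count(keyword) + height_weight * region["char_height"]


def order_regions(regions, keyword):
    """Keep heading regions containing the keyword, most important first."""
    hits = [r for r in regions if keyword in r["text"]]
    return sorted(hits, key=lambda r: importance(r, keyword), reverse=True)


# Made-up heading regions: recognized text and average character height in pixels.
REGIONS = [
    {"text": "search and search again", "char_height": 12},
    {"text": "index matrix search", "char_height": 24},
    {"text": "results", "char_height": 30},
]

ordered = order_regions(REGIONS, "search")
ordered_texts = [r["text"] for r in ordered]
```

    With these weights the larger heading that mentions the keyword once outranks the smaller heading that mentions it twice, and the heading without the keyword is not extracted at all; a different predetermined rule would give a different order.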

    Character image extracting apparatus and character image extracting method
    9.
    Type: granted invention patent
    Status: in force

    Publication No.: US08750616B2

    Publication date: 2014-06-10

    Application No.: US11963613

    Filing date: 2007-12-21

    IPC classification: G06K9/34

    Abstract: In an extracting step, an extracting portion obtains linked components, each composed of a plurality of mutually linked pixels, from a character string region composed of a plurality of characters, and extracts section elements from the character string region, each section element being enclosed by a figure circumscribing a linked component. In a first altering step, a first altering portion combines those extracted section elements that at least partly overlap one another into a new section element. In a first selecting step, a first selecting portion, using a reference size determined in advance, selects from among the section elements altered in the first altering step those larger than the reference size.
