Abstract:
An image of a character string composed of M characters is clipped from a document image and divided into individual character images, and image features are extracted from each character image. Based on these features, N (N > 1, integer) character images are selected as candidate characters, in descending order of similarity, from a character image feature dictionary that stores image features on a per-character basis, and a first index matrix of M×N cells is prepared. The candidate character string formed by the candidate characters in the first column of the first index matrix is subjected to lexical analysis according to a language model, whereby a second index matrix containing a character string that makes sense is prepared. For the language model, statistics are gathered first, and the lexical analysis is then performed.
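The candidate selection described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how similarity is measured, so cosine similarity is swapped in, and the names similarity, build_index_matrix, the toy dictionary, and its feature values are all hypothetical.

```python
import math

def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index_matrix(char_features, dictionary, n):
    """Return an M x N matrix: row i holds the n dictionary characters most
    similar to the i-th character image, best match first."""
    matrix = []
    for feat in char_features:
        ranked = sorted(dictionary,
                        key=lambda ch: similarity(feat, dictionary[ch]),
                        reverse=True)
        matrix.append(ranked[:n])
    return matrix

# Toy feature dictionary: character -> feature vector (values are made up).
dictionary = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0],
    "C": [0.0, 0.0, 1.0],
    "D": [0.7, 0.7, 0.0],
}
# Features of M = 2 segmented character images (also made up).
char_features = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2]]
index_matrix = build_index_matrix(char_features, dictionary, n=3)
```

Each row corresponds to one character position, and the first column holds the best match for every position, which is the string later passed to lexical analysis.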
Abstract:
In an extracting step, an extracting portion obtains linked components, each composed of a plurality of mutually linked pixels, from a character string region composed of a plurality of characters, and extracts section elements from the character string region, each section element being enclosed by a figure circumscribing a linked component. In a first altering step, a first altering portion combines those extracted section elements that at least partly overlap one another to prepare new section elements. In a first selecting step, a first selecting portion selects, from among the section elements altered in the first altering step, those whose size is greater than a reference size determined in advance.
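The three steps above can be sketched roughly as follows. The sketch assumes a binary image given as a list of rows, 8-connectivity for linked pixels, axis-aligned rectangles as the circumscribing figures, and "size" taken as the longer side of the rectangle. None of these choices is stated in the abstract, and all function names are hypothetical.

```python
from collections import deque

def bounding_boxes(grid):
    """Bounding box (top, left, bottom, right) of each 8-connected group of 1-pixels."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if grid[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                top, left, bottom, right = r, c, r, c
                while queue:  # breadth-first walk over one linked component
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and grid[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes

def overlaps(a, b):
    """True if two boxes share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def merge_overlapping(boxes):
    """Repeatedly replace any two overlapping boxes by their common bounding box."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes

def select_large(boxes, reference_size):
    """Keep boxes whose longer side exceeds the reference size (assumed size measure)."""
    return [b for b in boxes
            if max(b[2] - b[0] + 1, b[3] - b[1] + 1) > reference_size]

# Toy image: an L-shaped stroke, a dot inside its bounding box, and a stray speck.
grid = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
extracted = bounding_boxes(grid)
altered = merge_overlapping(extracted)
selected = select_large(altered, reference_size=2)
```

In the toy image the dot is absorbed into the L-shaped stroke's box during merging, and the isolated speck is dropped by the size filter.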
Abstract:
An image of a character string composed of M characters is clipped from a document image and divided character by character, and image features are extracted from each character image. Based on these features, N (N > 1, integer) character images are selected as candidate characters, in descending order of similarity, from a character image feature dictionary that stores image features on a per-character basis, and a first index matrix of M×N cells is prepared. The candidate character string formed by the candidate characters in the first column of the first index matrix is subjected to lexical analysis according to a predetermined language model, whereby a second index matrix, adjusted so that its character string makes sense, is prepared for use in searching.
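One way to picture the adjustment from the first to the second index matrix is sketched below. The abstract does not specify the language model, so a table of character bigram counts and a Viterbi-style search are swapped in purely for illustration. The function adjust_index_matrix, the sample matrix, and the counts are hypothetical, and each row is assumed to hold distinct single-character candidates.

```python
def adjust_index_matrix(matrix, bigram_counts):
    """Choose one candidate per row so that the summed bigram counts along the
    string are maximal, then move the chosen candidate to the first column."""
    # best[ch] = (score of the best string ending in ch, that string as a list)
    best = {ch: (0, [ch]) for ch in matrix[0]}
    for row in matrix[1:]:
        new_best = {}
        for ch in row:
            # Ties go to the earlier (higher-ranked) predecessor.
            prev = max(best, key=lambda p: best[p][0] + bigram_counts.get(p + ch, 0))
            score = best[prev][0] + bigram_counts.get(prev + ch, 0)
            new_best[ch] = (score, best[prev][1] + [ch])
        best = new_best
    path = max(best.values(), key=lambda v: v[0])[1]
    # Reorder every row so the chosen character comes first.
    return [[ch] + [c for c in row if c != ch] for row, ch in zip(matrix, path)]

# First index matrix: the first column reads "cal".
first_matrix = [["c", "e", "o"], ["a", "o", "u"], ["l", "t", "i"]]
# Made-up bigram statistics standing in for the language model.
bigram_counts = {"ca": 5, "at": 9, "al": 2, "co": 3, "ot": 1}
second_matrix = adjust_index_matrix(first_matrix, bigram_counts)
first_column = "".join(row[0] for row in second_matrix)
```

Only the order inside each row changes. The lower-ranked candidates stay in the matrix, so a later search can still match them.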
Abstract:
A headline-region initial processing section clips a headline-region image from an image document, divides the image into individual character images, and extracts features of the individual character images. Based on the features, a candidate-character-sequence generating section selects N (an integer greater than 1) character images as candidate characters, in order of degree of matching, from a font-feature dictionary that stores features of individual character images, and generates an M×N index matrix, where M is the number of characters in the extracted character sequence. Based on the index matrix, a document-name generating section generates a meaningful document name for the image document. An image-document-DB management section manages the accumulated image documents using the document name. This provides an image document processing device and an image document processing method that automatically generate and manage a meaningful document name representing the contents of the image document, without user operation.
Abstract:
An image document processing device extracts a character sequence image of M characters from an image document, divides the image into individual character images, and extracts features of the individual character images. Based on the features, it selects N (an integer greater than 1) character images, in order of degree of matching, from a font-feature dictionary that stores features of all character images for each font, and generates an M×N index matrix for the extracted character sequence. When searching, the device searches an index-information storage section for each search character in a search keyword of an input search expression, and extracts the image documents whose index matrix contains the search keyword. This provides an image document processing device and an image document processing method that allow indexing without user operation and highly precise searching without OCR recognition.
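The keyword lookup can be sketched as below, assuming the index-information storage section is modelled as a plain dictionary from a document identifier to its index matrix. The matching rule used here (every search character must appear among the candidates of consecutive rows) is a simplification chosen for illustration, and the names find_keyword, search_documents, and index_store are hypothetical.

```python
def find_keyword(index_matrix, keyword):
    """Return the start rows at which every keyword character appears among
    the candidates of the corresponding consecutive rows."""
    if not keyword:
        return []
    k = len(keyword)
    return [start for start in range(len(index_matrix) - k + 1)
            if all(keyword[i] in index_matrix[start + i] for i in range(k))]

def search_documents(index_store, keyword):
    """Map each matching document id to the positions where the keyword matches."""
    results = {}
    for doc_id, matrix in index_store.items():
        hits = find_keyword(matrix, keyword)
        if hits:
            results[doc_id] = hits
    return results

# Toy index store with two documents (candidates are made up).
index_store = {
    "doc1": [["c", "e", "o"], ["a", "o", "u"], ["l", "t", "i"]],
    "doc2": [["d", "o", "b"], ["o", "a", "c"], ["g", "q", "y"]],
}
results_cat = search_documents(index_store, "cat")
results_at = search_documents(index_store, "at")
```

Because all N candidates of a row are checked, "cat" is found in doc1 even though its first column reads "cal". This is how such a search can tolerate a wrong top-ranked match without committing to a single recognition result.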
Abstract:
A document image processing apparatus is provided that reduces the effort needed to find a desired heading in a document image. A heading region extracting portion searches an index information DB and extracts the heading regions that contain a search keyword. An order setting portion automatically sets, according to a predetermined rule, an order for the heading regions extracted by the heading region extracting portion. A displaying portion displays a document image in which the extracted heading regions are highlighted in accordance with the order set by the order setting portion. The display order of the search results may be set by determining the importance of the extracted heading regions based on the number of times the search keyword appears and on features of the character images in the heading regions.
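A minimal sketch of one possible ordering rule follows. The abstract leaves the predetermined rule open, so this example simply ranks regions by keyword count and then by character height, as a stand-in for the character image features. The HeadingRegion fields and the function order_regions are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class HeadingRegion:
    text: str
    keyword_count: int  # how often the search keyword appears in the region
    char_height: int    # assumed character-image feature, in pixels

def order_regions(regions):
    """Most important first: more keyword hits, then larger characters."""
    return sorted(regions,
                  key=lambda r: (r.keyword_count, r.char_height),
                  reverse=True)

regions = [
    HeadingRegion("Intro", 1, 12),
    HeadingRegion("Search results", 2, 10),
    HeadingRegion("Search", 1, 18),
]
ordered_texts = [r.text for r in order_regions(regions)]
```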