摘要:
A character area extracting device includes a reflective and non-reflective area separation unit separating image data into reflective and non-reflective areas, and binarizing the image data by changing a first threshold value when it is inappropriate; a reflective area binarizing unit separating the reflective area into character and background areas, and binarizing it by changing a second threshold value when it is inappropriate; a non-reflective area binarizing unit separating the non-reflective area into the character and background areas, and binarizing it by changing a third threshold value when it is inappropriate; a reflective and non-reflective area separation evaluation unit; and a line extracting unit connecting the character areas of the reflective and non-reflective areas and extracting positional information of the connected character areas in the image data.
摘要:
According to an aspect of an embodiment, a method of detecting boundary line information contained in image information comprising a plurality of pixels in either one of first and second states, comprising: detecting a first group of pixels in the first state disposed continuously in said image information to determine first line information and detecting a second group of pixels in the first state disposed adjacently with each other and surrounded by pixels in the second state to determine edge information based on the contour of the second group of pixels; and determining the boundary line information on the basis of the information of the relation of relative position of the line information and the edge information and the size of the first and second group of pixels.
摘要:
According to an aspect of an embodiment, a method of detecting boundary line information contained in image information comprising a plurality of pixels in either one of first and second states, comprising: detecting a first group of pixels in the first state disposed continuously in said image information to determine first line information and detecting a second group of pixels in the first state disposed adjacently with each other and surrounded by pixels in the second state to determine edge information based on the contour of the second group of pixels; and determining the boundary line information on the basis of the information of the relation of relative position of the line information and the edge information and the size of the first and second group of pixels.
摘要:
A form processing apparatus extracts layout information and character information from a form document. A candidate extracting unit extracts word candidates from the character information. A frequency digitizing unit calculates emission probability of a word candidate from each element. A relation digitizing unit calculates transition probability that relationship between word candidates is established. An evaluating unit calculates an evaluation value indicative of a probability of appearance of word candidates in respective logical elements. A determining unit determines the element and a word candidate thereof as the element and a character string thereof in the form document, based on the evaluation value.
摘要:
A document type identifying apparatus includes in advance a database storing therein keywords used as keys that identify document types in association with each document type. The document type identifying apparatus aligns word strings written on a document and generates partial keyword strings for each keyword by using the keywords stored in the database. The partial keyword strings are to be checked for matching with the word strings written on the document. Then, the document type identifying apparatus checks matching of the grouped and aligned word strings with the partial keyword strings and obtains, for each keyword, each number of matched words with the highest matching rates between the grouped word strings that are successfully matched and the partial keyword strings. Then, each number of matched words is used to calculate each evaluation value to determine the document type.
摘要:
A program causes a computer to function as a document recognition apparatus, having an extraction unit for extracting connected components of pixels from an input image, a generation unit for generating a reference element that is connected components of pixels extracted by the extraction unit and combined elements obtained by combining the reference element and connected components of pixels adjacent to the reference element as an element to be estimated, a calculation unit for calculating a degree of certainty that indicates how much the element to be estimated generated by the generation unit seems to be a character, and a determination unit for identifying elements that seem to be characters among the elements to be estimated based on the degree of certainty calculated by the calculation unit.
摘要:
A form processing apparatus extracts layout information and character information from a form document. A candidate extracting unit extracts word candidates from the character information. A frequency digitizing unit calculates emission probability of a word candidate from each element. A relation digitizing unit calculates transition probability that relationship between word candidates is established. An evaluating unit calculates an evaluation value indicative of a probability of appearance of word candidates in respective logical elements. A determining unit determines the element and a word candidate thereof as the element and a character string thereof in the form document, based on the evaluation value.
摘要:
In an apparatus for analyzing a layout of a document, a character candidate element generator generates character candidate elements from black pixel linkage components of a document image. A horizontally oriented line rectangle generator sets a plurality of character candidate elements as a line candidate rectangle, among character candidate elements aligned in horizontal line orientation, when each amount of displacement of the set character candidate elements in a vertical orientation with respect to the horizontal line orientation, is smaller than or equal to a threshold value. A horizontally oriented paragraph-box generator sets a plurality of line candidate elements having approximately the same length as each other in the vertical orientation, as a paragraph candidate element.
摘要:
A program causes a computer to function as a document recognition apparatus, having an extraction unit for extracting connected components of pixels from an input image, a generation unit for generating a reference element that is connected components of pixels extracted by the extraction unit and combined elements obtained by combining the reference element and connected components of pixels adjacent to the reference element as an element to be estimated, a calculation unit for calculating a degree of certainty that indicates how much the element to be estimated generated by the generation unit seems to be a character, and a determination unit for identifying elements that seem to be characters among the elements to be estimated based on the degree of certainty calculated by the calculation unit.
摘要:
A form processing program which is capable of automatically extracting keywords. When the image of a scanned form is entered, a layout recognizer extracts a readout region of the form image, a character recognizer recognizes characters within the readout region. A form logical definition database stores form logical definitions defining strings as keywords according to logical structures which are common to forms of same type. A possible string extractor extracts as possible strings combinations of recognized characters each of which satisfies defined relationships of a string. A linking unit links the possible strings according to positional relationships, and determines a combination of possible strings as keywords.