摘要:
An image processing apparatus for detecting paragraphs in a textual image includes an input component for receiving an input image in which textual lines and words have been identified and a page classification component for classifying the input image as a first or second page type. The apparatus also includes a paragraph detection component for classifying all textual lines on the input image as a beginning paragraph line or a continuation paragraph line. The apparatus is also provided with a paragraph creation component for creating paragraphs that include textual lines between two successive beginning paragraph lines, including a first of the two successive beginning paragraph lines. The paragraphs that have been identified may be classified by the type of alignment they exhibit. For instance, paragraphs may be classified according to whether they are left aligned, right aligned, center aligned or justified.
摘要:
Line segmentation in an OCR process is performed to detect the positions of words within an input textual line image by extracting features from the input to locate breaks and then classifying the breaks into one of two break classes which include inter-word breaks and inter-character breaks. An output including the bounding boxes of the detected words and a probability that a given break belongs to the identified class can then be provided to downstream OCR or other components for post-processing. Advantageously, by reducing line segmentation to the extraction of features, including the position of each break and the number of break features, and break classification, the task of line segmentation is made less complex but with no loss of generality.
摘要:
Line segmentation in an OCR process is performed to detect the positions of words within an input textual line image by extracting features from the input to locate breaks and then classifying the breaks into one of two break classes which include inter-word breaks and inter-character breaks. An output including the bounding boxes of the detected words and a probability that a given break belongs to the identified class can then be provided to downstream OCR or other components for post-processing. Advantageously, by reducing line segmentation to the extraction of features, including the position of each break and the number of break features, and break classification, the task of line segmentation is made less complex but with no loss of generality.
摘要:
An image processing apparatus for detecting paragraphs in a textual image includes an input component for receiving an input image in which textual lines and words have been identified and a page classification component for classifying the input image as a first or second page type. The apparatus also includes a paragraph detection component for classifying all textual lines on the input image as a beginning paragraph line or a continuation paragraph line. The apparatus is also provided with a paragraph creation component for creating paragraphs that include textual lines between two successive beginning paragraph lines, including a first of the two successive beginning paragraph lines. The paragraphs that have been identified may be classified by the type of alignment they exhibit. For instance, paragraphs may be classified according to whether they are left aligned, right aligned, center aligned or justified.
摘要:
Page segmentation in an optical character recognition process is performed to detect textual objects and/or image objects. Textual objects in an input gray scale image are detected by selecting candidates for native lines which are sets of horizontally neighboring connected components (i.e., subsets of image pixels where each pixel from the set is connected with all remaining pixels from the set) having similar vertical statistics defined by values of baseline (the line upon which most text characters “sit”) and mean line (the line under which most of the characters “hang”). Binary classification is performed on the native line candidates to classify them as textual or non-textual through examination of any embedded regularity. Image objects are indirectly detected by detecting the image's background using the detected text to define the background. Once the background is detected, what remains (i.e., the non-background) is an image object.
摘要:
Page segmentation in an optical character recognition process is performed to detect textual objects and/or image objects. Textual objects in an input gray scale image are detected by selecting candidates for native lines which are sets of horizontally neighboring connected components (i.e., subsets of image pixels where each pixel from the set is connected with all remaining pixels from the set) having similar vertical statistics defined by values of baseline (the line upon which most text characters “sit”) and mean line (the line under which most of the characters “hang”). Binary classification is performed on the native line candidates to classify them as textual or non-textual through examination of any embedded regularity. Image objects are indirectly detected by detecting the image's background using the detected text to define the background. Once the background is detected, what remains (i.e., the non-background) is an image object.
摘要:
An electronic model of the image document is created by undergoing an OCR process. The electronic model includes elements (e.g., words, text lines, paragraphs, images) of the image document that have been determined by each of a plurality of sequentially executed stages in the OCR process. The electronic model serves as input information which is supplied to each of the stages by a previous stage that processed the image document. A graphical user interface is presented to the user so that the user can provide user input data correcting a mischaracterized item appearing in the document. Based on the user input data, the processing stage which produced the initial error that gave rise to the mischaracterized item corrects the initial error. Stages of the OCR process subsequent to this stage then correct any consequential errors arising in their respective stages as a result of the initial error.
摘要:
Embodiments of the present invention relate to classifying pages of an electronic document, such as a scanned book page. OCR software is applied to the contents of the electronic document, revealing semantic information about the content of the electronic document. Software-based features are applied to the semantic information to determine the type of page the electronic document is. Page types may include table of contents (TOC), table of figures (TOF), bibliography, index, or other types of pages commonly found in a book, magazine, or other publication. Once determined, the determined page type is stored and used by other software engines.
摘要:
A processing device may parse a group of strokes representing a mathematical expression. The group of strokes may be examined to determine whether the group of strokes satisfies any of a finite set of rules. When the group of strokes, included in a region, satisfies any of the finite set of rules, the region may be partitioned according to a satisfied one of the finite set of rules. The group of strokes included in the region may be further examined to determine whether the group of strokes may be further partitioned according to any of the finite set of rules. After all regions have been examined and no further partitioning of regions may be performed, all mathematical symbols of the mathematical expression may be isolated in at least some of the regions and may be recognized.
摘要:
A processing device may parse a group of strokes representing a mathematical expression. The group of strokes may be examined to determine whether the group of strokes satisfies any of a finite set of rules. When the group of strokes, included in a region, satisfies any of the finite set of rules, the region may be partitioned according to a satisfied one of the finite set of rules. The group of strokes included in the region may be further examined to determine whether the group of strokes may be further partitioned according to any of the finite set of rules. After all regions have been examined and no further partitioning of regions may be performed, all mathematical symbols of the mathematical expression may be isolated in at least some of the regions and may be recognized.