Abstract:
A method of separating finely textured and solid regions in a binary image from other regions such as those containing text and line graphics. The image is subjected to a first set of operations (10) that eliminates OFF pixels that are near ON pixels, which tends to thicken text and lines and solidify textured regions. The image is then subjected to a second set of operations (12) that eliminates ON pixels that are near OFF pixels. This thins out and eliminates the previously thickened text and lines, but leaves the previously solidified textured regions substantially intact.
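The two stages described above can be sketched with plain binary dilation and erosion. This is a minimal illustration, not the patented sequence of operations: the square structuring elements, the radii `grow` and `shrink`, and the helper names are all assumptions made for the example.

```python
def dilate(img, r):
    """Turn ON every pixel within r (square window) of an ON pixel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for yy in range(max(0, y - r), min(h, y + r + 1)):
                    for xx in range(max(0, x - r), min(w, x + r + 1)):
                        out[yy][xx] = 1
    return out


def erode(img, r):
    """Keep a pixel ON only if its whole (2r+1)-square window is ON.
    Pixels outside the image are treated as OFF."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                0 <= yy < h and 0 <= xx < w and img[yy][xx]
                for yy in range(y - r, y + r + 1)
                for xx in range(x - r, x + r + 1)))
    return out


def textured_mask(img, grow=1, shrink=2):
    """First stage removes OFF pixels near ON pixels (thickens lines,
    solidifies texture); second stage removes ON pixels near OFF pixels
    (erases the thickened lines, keeps the solidified regions)."""
    return erode(dilate(img, grow), shrink)


def demo_image():
    """Hypothetical test image: one thin line and one checkerboard patch."""
    img = [[0] * 30 for _ in range(20)]
    for x in range(1, 29):
        img[2][x] = 1                      # one-pixel-thick horizontal line
    for y in range(8, 16):
        for x in range(8, 16):
            if (x + y) % 2 == 0:
                img[y][x] = 1              # finely textured region
    return img
```

With these assumed radii, `textured_mask(demo_image())` leaves a solid core where the checkerboard was and clears the row that held the thin line. The second radius is chosen larger than the first so that thickened lines cannot survive the erosion.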
Abstract:
A signature for a page of text is generated. The signature serves as an identifier of the text page. Positions of words in a text page are determined. Positions of multiple second words in the text page are determined relative to the position of a first word in the text page. A signature value is generated that describes the second word positions relative to the first word position. The signature value is stored. Additional signatures for the text page can be generated, each signature describing positions of other words in the text page relative to a word in the text page for which the signature is being generated. The signatures can be used to compare the text page to another text page and generate a measure of similarity that describes the result of the comparison.
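One possible way to turn relative word positions into comparable signatures is sketched below. The abstract does not say how second words are chosen or how positions are encoded, so the choice of the `k` nearest words, the quantization step `cell`, and the Jaccard similarity measure are all assumptions of this example.

```python
def word_signature(anchor, others, k=3, cell=10):
    """Describe the k nearest words relative to the anchor word.
    Positions are (x, y) tuples; offsets are quantized to `cell` units."""
    ax, ay = anchor
    nearest = sorted(
        others,
        key=lambda p: ((p[0] - ax) ** 2 + (p[1] - ay) ** 2, p))[:k]
    return tuple(sorted(((x - ax) // cell, (y - ay) // cell)
                        for x, y in nearest))


def page_signatures(words, k=3, cell=10):
    """Generate one signature per word on the page."""
    return {
        word_signature(w, words[:i] + words[i + 1:], k, cell)
        for i, w in enumerate(words)
    }


def similarity(sigs_a, sigs_b):
    """Jaccard overlap of two signature sets (assumed similarity measure)."""
    if not sigs_a and not sigs_b:
        return 1.0
    return len(sigs_a & sigs_b) / len(sigs_a | sigs_b)
```

Because each signature stores only offsets from its own anchor word, a page and a uniformly shifted copy of it produce the same signature set under this sketch.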
Abstract:
A system for printing glyph frames around known obstructions. All frames in an area are determined to be obstructed or unobstructed, based on their location with respect to other printed areas. The unobstructed locations can be numbered and glyph data printed within. In the alternative, the good locations can be numbered modulo some number much smaller than the number of available locations to provide redundancy. The unobstructed locations can be stored either in the sync lines or in the data area of other locations known to be unobstructed. Also, the frame itself can be identified as obstructed or unobstructed to provide more redundancy.
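The frame classification and numbering can be illustrated as follows. This is a sketch under assumptions: frames and obstructions are modeled as axis-aligned rectangles `(x0, y0, x1, y1)`, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (x0, y0, x1, y1) rectangles share any area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def number_frames(frames, obstructions, modulo=None):
    """Map the index of each unobstructed frame to its frame number.
    With `modulo` set, numbers wrap around, so several frames share a
    number and can carry redundant copies of the same data."""
    numbering = {}
    next_id = 0
    for idx, frame in enumerate(frames):
        if any(overlaps(frame, ob) for ob in obstructions):
            continue                      # obstructed: no glyph data here
        numbering[idx] = next_id if modulo is None else next_id % modulo
        next_id += 1
    return numbering
```

For four frames in a row where the second is covered by other printed matter, sequential numbering assigns 0, 1, 2 to the remaining three, while `modulo=2` assigns 0, 1, 0.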
Abstract:
This invention provides self-clocking glyph shape codes for encoding digital data in the shapes of glyphs that are suitable for printing on hardcopy recording media. Advantageously, the glyphs are selected so that they tend not to degrade into each other when they are degraded and/or distorted as a result, for example, of being photocopied, transmitted via facsimile, and/or scanned into an electronic document processing system. Moreover, for at least some applications, the glyphs desirably are composed of printed pixel patterns containing nearly the same number of ON pixels and nearly the same number of OFF pixels, such that the code that is rendered by printing such glyphs on substantially uniformly spaced centers appears to have a generally uniform texture. In the case of codes printed at higher spatial densities, this texture is likely to be perceived as a generally uniform gray tone. Binary image processing and convolution filtering techniques for decoding such codes also are disclosed, but this application focuses on the codes.
Abstract:
A method of automatically identifying sentence boundaries in a document image without performing character recognition to generate an ASCII representation of the document text. The identification process begins by selecting a connected component from the multiplicity of connected components of a text line. Next, it is determined whether the selected connected component might represent a period based upon its shape. If the selected connected component is dot shaped, then it is determined whether the selected connected component might represent a colon. Finally, if the selected connected component is dot shaped and not part of a colon, the selected connected component is labeled as a sentence boundary.
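The decision sequence above (dot shaped, then not part of a colon) might look like the sketch below. Components are reduced to bounding boxes, and the size and spacing thresholds relative to an assumed `x_height` are illustrative guesses rather than values from the source. A fuller implementation would probably also need to reject other dot-shaped marks, such as the dot of an "i", which this sketch does not attempt.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Bounding box of a connected component (y grows downward)."""
    x: int
    y: int
    w: int
    h: int


def is_dot(box, x_height):
    """Shape test: small relative to the x-height and roughly square."""
    small = box.w <= 0.4 * x_height and box.h <= 0.4 * x_height
    squarish = 0.5 <= box.w / box.h <= 2.0
    return small and squarish


def is_colon_part(box, line_boxes, x_height):
    """A dot is part of a colon if another dot is stacked above or below it."""
    for other in line_boxes:
        if other is box or not is_dot(other, x_height):
            continue
        x_overlap = min(box.x + box.w, other.x + other.w) - max(box.x, other.x)
        v_gap = abs((box.y + box.h / 2) - (other.y + other.h / 2))
        if x_overlap > 0 and v_gap <= x_height:
            return True
    return False


def sentence_boundaries(line_boxes, x_height):
    """Return the components labeled as sentence boundaries."""
    return [b for b in line_boxes
            if is_dot(b, x_height) and not is_colon_part(b, line_boxes, x_height)]
```

No character recognition is involved: every test uses only the geometry of the connected components in one text line.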
Abstract:
A method of automatically generating a thematic summary from a document image without performing character recognition to generate an ASCII representation of the document text. The method begins with decomposition of the document image into text blocks and text lines. Using the median x-height of text blocks, the main body of text is identified. Afterward, word image equivalence classes and sentence boundaries within the blocks of the main body of text are determined. The word image equivalence classes are used to identify thematic words. These, in turn, are used to score the sentences within the main body of text, and the highest scoring sentences are selected for extraction.
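The scoring and selection stage can be sketched once word images have been grouped into equivalence classes. In this example each sentence is a list of integer class IDs standing in for word images, and thematic words are assumed to be simply the most frequent classes; the source does not specify the selection rule, so that choice and the parameter names are assumptions.

```python
from collections import Counter


def summarize(sentences, n_thematic=2, n_select=1):
    """Pick the highest scoring sentences.

    sentences: list of sentences, each a list of equivalence-class IDs.
    Returns (indices of selected sentences in reading order, thematic IDs).
    """
    counts = Counter(c for sentence in sentences for c in sentence)
    thematic = {c for c, _ in counts.most_common(n_thematic)}
    # Score each sentence by how many thematic word occurrences it contains.
    scores = [sum(1 for c in sentence if c in thematic)
              for sentence in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:n_select]), thematic
```

For example, with `sentences = [[1, 2, 3], [1, 1, 4], [5, 6, 7], [1, 4, 4]]` and `n_select=2`, classes 1 and 4 are treated as thematic and the second and fourth sentences are selected.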
Abstract:
An efficient image processing technique automatically analyzes an image scanned at 300 or greater dpi and measures an image characteristic of the input image from which it is possible to determine whether the image has ever been previously scanned or printed at low resolution at some time in its history. The technique is effective in classifying an image that was at one time embodied in paper form and scanned at a vertical resolution of 100 dpi or less, such as a facsimile document scanned in standard mode, or at 200 pixels/inch (referred to as "fine fax mode"). The technique performs measurements on the pixels included in the vertical or horizontal edges of symbols contained in the input image, and produces a distribution of the measurements. A numerical interpretation of the measurement distribution data is used to classify the image. The invention is computationally efficient because it may be applied to only a small percentage (e.g., 7%) of a document image as long as the subimage selected contains symbols such as characters. The invention may be incorporated into a document image management system where identification of documents that contain the artifacts of low resolution document images could be used to improve subsequent processing of the image, such as, for example, in an OCR system.
Abstract:
A robust technique for determining whether a field (43, 45, 47a-d) on a form (40'), which has been converted to a binary input image, contains a mark utilizes an approach of making an initial determination of the approximate location of the field, and then refining such determination. The form is assumed to have registration marks (fiducials) with the field at a known location relative to the fiducials. The fiducials are identified (50), and the approximate location of the field is determined (55) from the fiducial positions and the known relation between the fiducials and the field. At this point, a portion of the image (referred to as the subimage) is extracted (57). The subimage is typically somewhat larger than the field so that it can be assumed that the field is within the subimage. The field has machine-printed lines along at least part of the field perimeter. In order to distinguish these lines from hand-printed marks in the field, a copy of the subimage is subjected to a set of operations (60) on the actual pixels in the subimage that provides nominal information on the location of these lines. The boundaries of the subimage are then changed (62) to nominally exclude the lines.
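The initial, approximate field location can be illustrated with a small geometric sketch. It assumes exactly two fiducials and a similarity transform (translation, rotation, uniform scale) between the form template and the scanned image; the source does not state the number of fiducials or the transform model, and the `margin` parameter and function names are hypothetical.

```python
import math


def fit_similarity(template_fids, image_fids):
    """Fit z' = a*z + b from two fiducial correspondences.
    Points are (x, y) tuples, handled as complex numbers x + yj."""
    p0, p1 = (complex(*p) for p in template_fids)
    q0, q1 = (complex(*q) for q in image_fids)
    a = (q1 - q0) / (p1 - p0)     # rotation and scale
    b = q0 - a * p0               # translation
    return a, b


def locate_subimage(field_box, a, b, margin=5):
    """Map a template field box (x0, y0, x1, y1) into the image and pad it,
    so the field can be assumed to lie inside the extracted subimage."""
    x0, y0, x1, y1 = field_box
    corners = [a * complex(x, y) + b for x in (x0, x1) for y in (y0, y1)]
    xs = [c.real for c in corners]
    ys = [c.imag for c in corners]
    return (math.floor(min(xs)) - margin, math.floor(min(ys)) - margin,
            math.ceil(max(xs)) + margin, math.ceil(max(ys)) + margin)
```

The padded box corresponds to the subimage that is "somewhat larger than the field"; the later refinement steps that detect and exclude the machine-printed lines are not covered here.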
Abstract:
The present invention provides a robust technique for quickly determining whether a binary input image originated as a blank page. The technique provides reliable sensing in the presence of various image and scanner noise in the input image. In broad terms, the invention contemplates reducing the input image with a low threshold, labeling (by size) connected components (8-connected or 4-connected), and performing a threshold analysis. The threshold analysis typically entails size and numerical thresholds, taking into account the characteristic dimensions of expected types of noise. In specific embodiments, the reduction is performed as a textured reduction wherein the image is divided into tiles, and a single row of pixels in each tile is checked to see whether there are any ON pixels. If there are, the corresponding pixel in the reduced image is ON, otherwise it is OFF. Optional morphological operations are performed to remove expected sources of noise (e.g., pepper noise and thin horizontal lines). The invention further recognizes that a faxed page may contain vertical streaks that are not part of the original paper document. Thus, the threshold analysis typically allows a certain number of such streaks to be present without concluding that the page is not blank. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.
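The pipeline described above (textured reduction, connected component labeling, threshold analysis with a streak allowance) can be sketched as follows. Tile size, the size threshold, the rule that a one-pixel-wide component counts as a streak, and the permitted number of streaks are all assumed values for illustration, and the optional morphological cleanup is omitted.

```python
from collections import deque


def textured_reduce(img, tile=4, row_in_tile=0):
    """One output pixel per tile: ON if the single checked row of the
    tile contains any ON pixel."""
    h, w = len(img), len(img[0])
    out = []
    for ty in range(0, h - tile + 1, tile):
        row = img[ty + row_in_tile]
        out.append([int(any(row[tx:tx + tile]))
                    for tx in range(0, w - tile + 1, tile)])
    return out


def components(img):
    """Return (size, width, height) for each 8-connected component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            size, x0, x1, y0, y1 = 0, x, x, y, y
            while queue:
                cy, cx = queue.popleft()
                size += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            found.append((size, x1 - x0 + 1, y1 - y0 + 1))
    return found


def is_blank(img, tile=4, min_size=3, max_streaks=2):
    """Threshold analysis on the reduced image."""
    reduced = textured_reduce(img, tile)
    if not reduced or not reduced[0]:
        return True
    streaks = 0
    for size, width, _height in components(reduced):
        if size < min_size:
            continue              # small enough to be pepper noise
        if width == 1:
            streaks += 1          # narrow vertical component: fax streak
        else:
            return False          # substantial content: page is not blank
    return streaks <= max_streaks
```

Checking only one row per tile keeps the reduction cheap, at the cost of missing marks that fall entirely between checked rows; that trade-off is acceptable when the goal is only to decide whether the page holds substantial content.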
Abstract:
Font-independent spotting of user-defined keywords in a scanned image. Word identification is based on features of the entire word without the need for segmentation or OCR, and without the need to recognize non-keywords. Font-independent character models are created using hidden Markov models (HMMs) and arbitrary keyword models are built from the character HMM components. Word or text line bounding boxes are extracted from the image, a set of features based on the word shape (and preferably also the word internal structure) within each bounding box is extracted, this set of features is applied to a network that includes one or more keyword HMMs, and a determination is made. The identification of word bounding boxes for potential keywords includes the steps of reducing the image (say by 2×) and subjecting the reduced image to vertical and horizontal morphological closing operations. The bounding boxes of connected components in the resulting image are then used to hypothesize word or text line bounding boxes, and the original bitmaps within the boxes are used to hypothesize words. In a particular embodiment, a range of structuring elements is used for the closing operations to accommodate the variation of inter- and intra-character spacing with font and font size.