Abstract:
An automatic character cell determining apparatus determines the character cells within the text image of a document. A connected component generator generates connected components from the pixels comprising the text image. An aligning device aligns skewed and warped lines to the proper image axes. A bounding box generator generates a bounding box surrounding each connected component. A character cell determining device, which locates character cells that each include one or more connected components, has: a vertical splaying device and a horizontal splaying device for ensuring white space between lines and between connected components; a vertical profile device for determining the vertical position of each line; a splitting device for splitting ligatures of two or more connected components; and a character cell generator for generating character cells that group together one or more connected components.
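The component, bounding box, and cell-grouping stages can be illustrated with a minimal Python sketch. It is not the claimed apparatus: it assumes a binary image given as a list of rows (1 = ink), a single, already deskewed text line, and 8-connectivity, and it omits the splaying, vertical profile, and ligature splitting devices. The names `connected_components` and `character_cells`, and the rule of merging boxes whose horizontal extents overlap, are assumptions made for illustration.

```python
from collections import deque

def connected_components(image):
    """Return one bounding box (left, top, right, bottom) per 8-connected ink blob."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes

def character_cells(boxes):
    """Assumed grouping rule: merge boxes whose horizontal extents overlap."""
    cells = []
    for left, top, right, bottom in sorted(boxes):
        if cells and left <= cells[-1][2]:
            l, t, r, b = cells[-1]
            cells[-1] = (l, min(t, top), max(r, right), max(b, bottom))
        else:
            cells.append((left, top, right, bottom))
    return cells

# A tiny "i" (dot plus stem) followed by an "l".
image = [
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
boxes = connected_components(image)   # three components: dot, stem, "l"
cells = character_cells(boxes)        # two cells: the "i" and the "l"
```

In this example the dot and stem of the "i" are separate connected components but fall into one character cell, which is the situation the abstract describes as a cell including more than one connected component.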
Abstract:
A first method for exact and inexact matching of documents stored in a document database includes the step of converting the documents in the database to a compact tokenized form. A test string or test document is then converted to the same compact tokenized form and compared with the tokenized database documents to determine whether the test string occurs in the documents of the database or whether any document in the database corresponds to the test document. A second method for inexact matching of a test document to the documents in the database includes generating a set of one or more floating point values for each document in the database and for the test document. The sets of floating point values for the database documents are then compared to the set for the test document to determine a degree of matching. A threshold value is established, and each document in the database whose matching value is closer to the test document than the threshold is considered to be an inexact match of the test document.
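Both methods can be sketched in a few lines of Python. The abstract does not specify the token alphabet, the floating point values, or the distance measure, so everything below is an assumption made for illustration: tokens are coarse character shape classes (`A` for ascender-height characters, `g` for descenders, `x` for other lowercase letters, `.` for anything else), the floating point set is the relative frequency of each token class, and the matching value is Euclidean distance. The function names are hypothetical.

```python
import math

ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def tokenize(text):
    """Map each character to an assumed shape-class token."""
    out = []
    for ch in text:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            out.append("A")
        elif ch in DESCENDERS:
            out.append("g")
        elif ch.islower():
            out.append("x")
        elif ch.isspace():
            out.append(" ")
        else:
            out.append(".")
    return "".join(out)

def contains(document, query):
    """First method: compare the test string and document in tokenized form."""
    return tokenize(query) in tokenize(document)

def features(text):
    """Second method: a small set of floating point values per document."""
    tokens = tokenize(text).replace(" ", "")
    n = len(tokens) or 1  # avoid division by zero for empty text
    return [tokens.count(k) / n for k in "Agx."]

def inexact_matches(database, test_document, threshold):
    """Return names of documents whose distance to the test document is below threshold."""
    target = features(test_document)
    return [name for name, text in database.items()
            if math.dist(features(text), target) < threshold]

database = {"a": "the big dog barks", "b": "12345 67890"}
found = contains(database["a"], "big dog")
close = inexact_matches(database, "the big dog bark", threshold=0.1)
```

With these assumed tokens, different words of the same shape (for example "big" and "bog") produce identical token strings, so a tokenized match identifies candidate occurrences rather than proving literal equality. The threshold of 0.1 is an arbitrary value chosen for the example.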