Abstract:
The presence of a non-text object is sensed in a mixed object document to be archived in an information retrieval system. In addition to text objects, a mixed object document can contain non-text objects such as image objects, graphics objects, formatted objects, font objects, voice objects, video objects and animation objects. This enables the creation of key words which characterize the non-text object, for incorporation in the inverted file index of the data base, thereby enabling the later retrieval of either the entire document or the independent retrieval of the non-text object through the use of such key words.
Abstract:
The invention is characterized as a data processing architecture and method for multi-stage processing of mail, using knowledge based techniques. The system includes OCR-scanning a multipart address field of a mail piece at a sending location, the address field including at least two portions, a first stage routing portion (destination city, state, country, zip code) and a second stage routing portion (destination street address, building floor, corporate addressee internal routing). At the sending location, the image of the entire address field is captured by an OCR head and stored in memory. A serial number is printed on the mail piece. The first routing portion is then converted into sorting signals to sort the mail piece to a truck at the sending location which is to be dispatched to the city, state and country indicated in the first stage routing portion. Then, while the mail piece is in transit by truck to the destination city, the image of the second stage routing portion is analyzed by a knowledge base processor to resolve street addresses, building floor, corporate addressee internal routing information and addressee name. The deferred execution of the analysis by the knowledge base processor is possible because of the sporadic volume of mail pieces submitted to the system. While the mail piece is in transit on the truck, the knowledge processor completes its analysis and is able to transmit, by electronic communications link to the destination location, the information that the mail piece is on its way and the second stage routing information needed to automatically sort and deliver the mail piece to its corporate addressee.
Abstract:
A system for reducing the computation required to match a misspelled word against various candidates from a dictionary to find one or more words that represent the best match to the misspelled word. The major facility offered is the ability to computationally discern the degree of apparent match that exists between words that do not perfectly match a given target word without requiring the computationally tedious procedure of character by character positional matching, which necessitates shifting and realignment to accommodate differences between the candidate and target words due to character differences or added and dropped syllables. The system includes a method for storing and retrieving words from the dictionary based on their likelihood of being the correct version of a misspelled word and then reviewing those words further using the Prescan Alpha Content Match to reduce the number of candidates that must then be examined in a high resolution positional match to find the candidate(s) which matches the misspelled word with the greatest character affinity. The Prescan Alpha Content Match reduces the number of candidates in contention so as to make a high resolution match computationally feasible on a real-time basis.
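The idea of a position-independent prescan can be illustrated with a minimal sketch. The abstract does not give the scoring rule used by the Prescan Alpha Content Match, so the letter-count overlap below, the function names `alpha_content_score` and `prescan`, and the 0.7 threshold are all assumptions chosen for illustration only.

```python
from collections import Counter

def alpha_content_score(target: str, candidate: str) -> float:
    """Position-independent overlap of letter counts, in the range [0, 1].

    Assumed scoring rule (not taken from the abstract): shared letters,
    counted with multiplicity, divided by the longer word's length.
    """
    t, c = Counter(target.lower()), Counter(candidate.lower())
    shared = sum((t & c).values())  # multiset intersection of letters
    return shared / max(len(target), len(candidate), 1)

def prescan(target: str, candidates: list[str], threshold: float = 0.7) -> list[str]:
    """Keep only candidates whose letter content is close to the target's."""
    return [w for w in candidates if alpha_content_score(target, w) >= threshold]

if __name__ == "__main__":
    survivors = prescan("recieve", ["receive", "recipe", "deceive", "table"])
    print(survivors)  # candidates that would go on to the positional match
```

Because no shifting or realignment is involved, each comparison costs one pass over the letters of both words; only the survivors would be handed to the more expensive positional match.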
Abstract:
The combination of dictionary driven hyphenation, specialized algorithmic hyphenation and intelligent blank insertion provides improved right margin justification capability in a text processing system. When hyphenation is required for right margin justification, the system compares the word to be hyphenated to a prestored dictionary of words containing hyphenation points. When the word to be hyphenated matches one of the dictionary words the hyphenation points are retrieved and the word is split at the right margin. If the word to be hyphenated does not match one of the dictionary words, then a specialized list of prestored hyphenated suffixes and prestored statistical character digrams are compared to the word to determine the appropriate hyphenation points. Once the word has been split, the system searches the line for sets of predetermined words which may be separated from other words in the sentence by adding space to the line with a minimum of aesthetic distortion. Space is then added to the line until the line ending equals the right margin. The text is then printed.
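The two-tier lookup described above (dictionary first, suffix list as fallback) might be sketched as follows. The dictionary entries, the suffix list, and the function names are hypothetical, and the statistical digram step and the blank-insertion step are omitted for brevity.

```python
# Hypothetical data: break offsets are character positions where a hyphen may go.
HYPHENATION_DICT = {"justification": [3, 5, 7, 9]}  # jus-ti-fi-ca-tion
SUFFIXES = ["tion", "ment", "ness", "ing"]          # assumed suffix list

def hyphenation_points(word: str) -> list[int]:
    """Dictionary lookup first; fall back to a break before a known suffix."""
    w = word.lower()
    if w in HYPHENATION_DICT:
        return HYPHENATION_DICT[w]
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return [len(w) - len(suffix)]
    return []

def split_at_margin(word: str, room: int):
    """Split at the last break whose prefix plus hyphen fits in `room` columns."""
    fitting = [p for p in hyphenation_points(word) if p + 1 <= room]
    if not fitting:
        return None  # no legal break fits; the word moves to the next line
    p = max(fitting)
    return word[:p] + "-", word[p:]

if __name__ == "__main__":
    print(split_at_margin("justification", 8))
    print(hyphenation_points("placement"))
```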
Abstract:
An improved system for identifying and compacting text data to be transmitted over communications lines and thereby reducing the data volume and transmission time. Transmitting and receiving text processing systems are provided identical library memories containing words commonly used in correspondence. Each word in a document to be communicated is compared to the transmitting system's word library and, if found in the library, only the library address is transmitted. If the word is not found in the library, then it is added to the transmitting system's library, sent, and added to the receiving system's library. The receiving system reconstructs the document by using the received addresses to access the appropriate words from its library and place them in the document. The system combines this word match encoding with character match encoding and facsimile run length encoding for communicating words not found in the system library. The character match process requires a template match and non-linear difference code summation combined with N-dimensional weighting using prestored feature vectors for statistically determining the match between an input character and characters stored in the system library.
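The word-match part of this scheme can be sketched as a pair of functions that keep two libraries in step. The token format (`"addr"` and `"word"` tuples) and the function names are assumptions made for illustration; the character match and run length encoding stages are not modeled.

```python
def encode(words: list[str], library: list[str]) -> list[tuple]:
    """Replace each known word by its library address; learn unknown words."""
    index = {w: i for i, w in enumerate(library)}
    tokens = []
    for w in words:
        if w in index:
            tokens.append(("addr", index[w]))
        else:
            tokens.append(("word", w))   # sent in full this one time
            index[w] = len(library)
            library.append(w)            # sender's library grows
    return tokens

def decode(tokens: list[tuple], library: list[str]) -> list[str]:
    """Rebuild the document, adding newly received words to the library."""
    words = []
    for kind, value in tokens:
        if kind == "addr":
            words.append(library[value])
        else:
            library.append(value)        # receiver's library grows identically
            words.append(value)
    return words

if __name__ == "__main__":
    sender, receiver = ["dear", "sir"], ["dear", "sir"]
    message = "dear sir thank you dear sir thank".split()
    tokens = encode(message, sender)
    print(tokens)
    print(decode(tokens, receiver) == message, sender == receiver)
```

The sketch relies on both sides appending new words in the same order, which is what keeps addresses valid on the receiving end.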
Abstract:
A system that intelligently abstracts and archives a document for storage and interprets a free form user retrieval query to recall the document from the storage file. The system includes a method for automatically selecting keywords from the document using a parts-of-speech directory. A method is given for weighting the importance or centrality of each keyword with respect to the document of its origin. Using the same logic paths, the system takes a free form query that describes the document in the same manner that it would have to be described to a secretary to "find" it in a filing cabinet, automatically determines the key matching terms, and finds the archived document(s) with the greatest affinity.
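As a rough illustration of keyword selection, weighting, and query matching through one shared code path, consider the sketch below. The abstract does not specify the directory contents, the weighting formula, or the affinity measure, so the tiny `PART_OF_SPEECH` table, the relative-frequency weights, and the summed-weight affinity are all assumptions.

```python
from collections import Counter

# Hypothetical parts-of-speech directory; a real one would be far larger.
PART_OF_SPEECH = {
    "invoice": "noun", "pumps": "noun", "contract": "noun", "warehouse": "noun",
    "the": "article", "for": "preposition", "was": "verb", "sent": "verb",
}

def keywords(text: str) -> dict[str, float]:
    """Select nouns as keywords and weight each by its relative frequency."""
    words = [w.strip('.,;:?!"').lower() for w in text.split()]
    counts = Counter(w for w in words if PART_OF_SPEECH.get(w) == "noun")
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}

def affinity(query: str, doc_weights: dict[str, float]) -> float:
    """Sum the document weights of the keywords found in the query."""
    return sum(doc_weights.get(w, 0.0) for w in keywords(query))

def best_match(query: str, archive: dict[str, dict[str, float]]) -> str:
    """Return the name of the archived document with the greatest affinity."""
    return max(archive, key=lambda name: affinity(query, archive[name]))

if __name__ == "__main__":
    archive = {
        "a": keywords("The invoice for the pumps. The invoice was sent."),
        "b": keywords("The contract for the warehouse."),
    }
    print(best_match("find the invoice about pumps", archive))
```

Running the query through the same `keywords` function as the documents mirrors the abstract's point that archiving and retrieval use the same logic paths.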
Abstract:
A system for reducing the computation required to match a misspelled word against various candidates from a dictionary to find one or more words that represent the best match to the misspelled word. The major facility offered is the ability to computationally discern the degree of apparent match that exists between words that do not perfectly match a given target word without requiring the computationally tedious procedure of character by character positional matching, which necessitates shifting and realignment to accommodate differences between the candidate and target words due to character differences or added and dropped syllables. The system includes a method for storing and retrieving words from the dictionary based on their likelihood of being the correct version of a misspelled word and then reviewing those words further to reduce the number of candidates that must then be examined in a high resolution positional match to find the candidate(s) which matches the misspelled word with the greatest character affinity. This technique reduces the number of candidates in contention so as to make a high resolution match computationally feasible on a real-time basis. The discriminant potential and the real-time computational burden associated with the technique are balanced in an optimal manner.
Abstract:
A system for reducing storage requirements and accessing times in a text processing machine for automatic spelling verification and hyphenation functions. The system includes a method for storing a word list file and accessing the word list file such that legal prefixes and suffixes are truncated and only the unique root element, or "stem", of a word is stored. A set of unique rules is provided for prefix/suffix removal during compilation of the word list file and subsequent accessing of the word list file. Spelling verification is accomplished by applying the rules to the words whose spelling is to be verified and application of the said rules provides, under most circumstances, a natural hyphenation break point at the prefix-stem and stem-suffix junctions.
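A minimal sketch of stem-based verification follows. The abstract does not list the actual removal rules, so the prefix and suffix tuples, the stored stems, and the function name `analyze` are hypothetical; the sketch only shows how stripping affixes can both verify a word and yield break points at the junctions.

```python
# Hypothetical rule data; the abstract does not enumerate the real rules.
PREFIXES = ("un", "re", "pre")
SUFFIXES = ("ing", "ed", "ness", "ly")
STEMS = {"kind", "play", "view"}  # only unique stems are stored

def analyze(word: str):
    """Return (stem, break_points) if the word reduces to a stored stem.

    break_points are the prefix-stem and stem-suffix junctions, which can
    double as hyphenation points. Returns None if no stem is found.
    """
    w = word.lower()
    for p in ("",) + PREFIXES:
        if not w.startswith(p):
            continue
        for s in ("",) + SUFFIXES:
            if not w.endswith(s):
                continue
            end = len(w) - len(s)
            stem = w[len(p):end]
            if stem in STEMS:
                breaks = [i for i in (len(p), end) if 0 < i < len(w)]
                return stem, breaks
    return None

if __name__ == "__main__":
    print(analyze("unkindness"))  # junctions give un-kind-ness
    print(analyze("xyzzy"))       # not verifiable against the stem list
```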
Abstract:
A multi-channel multi-genre character recognition discriminator is disclosed which performs the decision making process between strings of characters coming from a multi-channel (i.e., three or more channels) alpha-numeric output optical character reader (OCR) system for use in such applications as, for example, text processing and mail processing. The multi-channel output OCR uses separate recognition processes for each genre or character set indicative of a distinct group with respect to style (i.e., font) or form, and attempts to recognize each character independently as belonging to each respective genre. For example, in a three channel output OCR for reading mixed numeric, English and Russian Cyrillic character sets, the English alphabetic interpretation of a scanned word is outputted as an English alphabetic subfield on a first OCR output line, the Cyrillic interpretation of the scanned word is outputted as a Cyrillic subfield on a second OCR output line, and the numeric interpretation of the scanned word is outputted as a numeric subfield on a third OCR output line. A multi-channel multi-genre character recognition discriminator analyzes these three subfield character streams by calculating a first conditional probability that, given the OCR has scanned and recognized an English alphabetic character E_i, numeric N_K and Cyrillic C_J characters were respectively misrecognized by their recognition channels; a second conditional probability that, given the OCR has scanned and recognized a Cyrillic character C_J, numeric N_K and English E_i characters were respectively misrecognized by their recognition channels; and a third conditional probability that, given the OCR scanned and recognized a numeric character N_K, English E_i and Cyrillic C_J characters were respectively misrecognized by their recognition channels. 
These conditional probabilities are developed character by character for each character within a string thereof or a word. A first product of all the first type conditional probabilities is calculated for all of the characters in a word (which may, of course, contain only a single character); similarly second and third products are calculated for the second and third conditional probabilities, respectively. The magnitudes of the products of these conditional probabilities are then compared in an N-channel comparator, and the highest probability subfield is selected as the most probable interpretation of the word scanned by the OCR.
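The product-and-compare step can be sketched in a few lines. The per-character probability values below are invented for illustration, and the function name `select_channel` is an assumption; how the conditional probabilities themselves are derived is outside the scope of this sketch.

```python
import math

def select_channel(char_probs: dict[str, list[float]]):
    """Multiply per-character conditional probabilities for each channel
    and pick the channel with the largest product (the N-channel comparator).
    """
    products = {ch: math.prod(ps) for ch, ps in char_probs.items()}
    best = max(products, key=products.get)
    return best, products

if __name__ == "__main__":
    # Hypothetical values for a three-character word read by three channels.
    word_probs = {
        "english":  [0.9, 0.8, 0.9],
        "cyrillic": [0.3, 0.6, 0.2],
        "numeric":  [0.1, 0.05, 0.2],
    }
    best, products = select_channel(word_probs)
    print(best, products)
```

For long words a practical variant would sum logarithms instead of multiplying, to avoid floating-point underflow; that substitution is a common numerical technique and is not something the abstract states.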