Abstract:
A signature is computed for an object comprising or represented by a set of vectors in a vector space of dimensionality D. Statistics are computed that indicate how the vectors of the set are distributed among a set of regions R_i, i = 1, ..., N, of the vector space. At least some of the statistics associated with each region are binarized to generate sets of binary values a_i, i = 1, ..., N, indicative of the statistics of the vectors belonging to the respective regions. A vector set signature that includes the sets of binary values a_i, i = 1, ..., N, is then defined for the set of vectors. The computing, binarizing, and defining operations may be repeated for two sets of vectors, and a quantitative comparison of the two sets determined from the corresponding vector set signatures.
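The abstract does not fix the regions, the statistics, or the comparison measure, so the following is only a minimal sketch under stated assumptions: each region is taken to be the set of points nearest to one of N given centroids, the statistic is the per-dimension mean of the vectors falling in the region, binarization compares that mean with the centroid, and two signatures are compared by Hamming distance. The function names are hypothetical.

```python
import numpy as np

def vector_set_signature(vectors, centroids):
    """Return an (N, D) array of bits: one row of binary values per region."""
    vectors = np.asarray(vectors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Assumption: region i is the set of points whose nearest centroid is centroids[i].
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    region = dists.argmin(axis=1)
    bits = np.zeros(centroids.shape, dtype=np.uint8)
    for i, centroid in enumerate(centroids):
        members = vectors[region == i]
        if len(members):
            # Assumed statistic: per-dimension mean of the region's vectors,
            # binarized by comparing it with the region centroid.
            bits[i] = members.mean(axis=0) > centroid
    return bits

def signature_distance(sig_a, sig_b):
    """Hamming distance between two signatures (an assumed comparison measure)."""
    return int(np.count_nonzero(sig_a != sig_b))

centroids = [[0.0, 0.0], [10.0, 10.0]]
sig_x = vector_set_signature([[1, -1], [2, -2], [11, 12]], centroids)
sig_y = vector_set_signature([[-1, -1], [9, 12]], centroids)
print(sig_x.tolist(), sig_y.tolist(), signature_distance(sig_x, sig_y))
```

Because each signature is a fixed-size bit array regardless of how many vectors the object contains, two objects with different numbers of vectors can still be compared directly.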
Abstract:
A method is provided for selecting fields of an electronic form for automatic population with candidate text segments. The candidate text segments can be obtained by capturing an image of a document, applying optical character recognition to the captured image to identify textual content, and tagging candidate text segments in the textual content for fields of the form. The method includes, for each of a plurality of fields of the form, computing a field exclusion function based on at least one parameter selected from a text length parameter, an optical character recognition error rate, a tagging error rate, and a field relevance parameter; and determining whether to select the field for automatic population based on the computed field exclusion function.
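The abstract names the parameters of the field exclusion function but not its form. The sketch below assumes one plausible form purely for illustration: the probability that a populated field is correct is estimated from the text length, the per-character OCR error rate, and the tagging error rate, and is then weighted by the field relevance. The class, the formula, and the threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FieldStats:
    name: str
    text_length: int           # typical number of characters in the field
    ocr_error_rate: float      # assumed per-character OCR error probability
    tagging_error_rate: float  # probability the text segment is tagged wrongly
    relevance: float           # 0..1, how much the field matters to the user

def exclusion_score(field):
    """Assumed exclusion function: higher means 'leave for manual entry'."""
    # Probability that every character of the field survives OCR intact.
    p_ocr_ok = (1.0 - field.ocr_error_rate) ** field.text_length
    # Probability that the field is both read and tagged correctly.
    p_correct = p_ocr_ok * (1.0 - field.tagging_error_rate)
    return 1.0 - field.relevance * p_correct

def select_fields(fields, threshold=0.5):
    """Names of the fields chosen for automatic population."""
    return [f.name for f in fields if exclusion_score(f) < threshold]

fields = [
    FieldStats("surname", 10, 0.01, 0.05, 1.0),
    FieldStats("free_text_notes", 200, 0.01, 0.05, 1.0),
]
print(select_fields(fields))
```

Under these assumptions a short field is selected while a long free-text field is excluded, since a single OCR error anywhere in the text forces the user to correct it.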
Abstract:
Words of an input string are morphologically analyzed to identify their alternative base forms and parts of speech. The analyzed words of the input string are used to compile the input string into a first finite-state network. The first finite-state network is matched with a second finite-state network of multiword expressions to identify all subpaths of the first finite-state network that match one or more complete paths in the second finite-state network. Each matching subpath of the first finite-state network and path of the second finite-state network identify a multiword expression in the input string. The morphological analysis is performed without disambiguating words and without segmenting the input string into sentences, so that the first finite-state network is compiled with at least one path that identifies alternative base forms or parts of speech of a word in the input string.
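The matching step can be illustrated with plain Python data structures standing in for the two finite-state networks. In this sketch, which is a simplification and not the finite-state implementation itself, the input is a lattice holding every (base form, part of speech) alternative for each token, and each multiword expression is a sequence of such pairs. The toy analyzer dictionary and all names are hypothetical.

```python
# Hypothetical toy morphological analyzer: surface form -> alternative analyses.
TOY_ANALYSES = {
    "kicked": {("kick", "VERB")},
    "the": {("the", "DET")},
    "bucket": {("bucket", "NOUN"), ("bucket", "VERB")},
    "saw": {("see", "VERB"), ("saw", "NOUN")},
}

# Each multiword expression is a path of (base form, part of speech) steps.
MWES = [
    (("kick", "VERB"), ("the", "DET"), ("bucket", "NOUN")),
]

def analyze(tokens):
    """Build the lattice: all analyses are kept, nothing is disambiguated."""
    return [TOY_ANALYSES.get(t.lower(), {(t.lower(), "UNK")}) for t in tokens]

def find_mwes(tokens, mwes):
    """Return (start, end, mwe) for every subpath that matches a complete MWE."""
    lattice = analyze(tokens)
    matches = []
    for start in range(len(lattice)):
        for mwe in mwes:
            end = start + len(mwe)
            if end <= len(lattice) and all(
                step in lattice[start + k] for k, step in enumerate(mwe)
            ):
                matches.append((start, end, mwe))
    return matches

# No sentence segmentation: the whole string is one token sequence.
tokens = "He kicked the bucket . They saw it".split()
matches = find_mwes(tokens, MWES)
print(matches)
```

Keeping both analyses of "bucket" in the lattice is what lets the expression match without a prior disambiguation pass.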
Abstract:
Multiword expressions are mapped to identifiers using finite-state networks. Each of a plurality of multiword expressions is encoded into a regular expression. Each regular expression encodes a base form common to a plurality of derivative forms defined by ones of the multiword expressions. Each of the plurality of regular expressions is compiled with factorization into a set of finite-state networks. A union of the finite-state networks in the set of finite-state networks is performed to define a multiword finite-state network and a set of subnets. The multiword finite-state network and the set of subnets are traversed to identify a path corresponding to one of the plurality of multiword expressions, wherein only transitions originating from the multiword finite-state network are accounted for to ascertain a path number identifying a base form of the one of the plurality of multiword expressions.
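As a rough analogy, and not the factorized finite-state compilation the abstract describes, the mapping from derivative forms to a base-form identifier can be sketched with Python's `re` module. Each hypothetical pattern below encodes one base form together with its inflected variants, the patterns are joined into a single union, and the index of the matching top-level alternative plays the role of the path number.

```python
import re

# Hypothetical patterns: one regular expression per base form.
MWE_PATTERNS = [
    r"kick(?:s|ed|ing)? the bucket",
    r"(?:take|takes|took|taken|taking) into account",
]

def build_matcher(patterns):
    """Join the patterns into one union with a named group per base form."""
    union = "|".join(f"(?P<mwe{i}>{p})" for i, p in enumerate(patterns))
    return re.compile(union, re.IGNORECASE)

def mwe_identifier(matcher, text):
    """Return the index of the base form that matches text, or None."""
    m = matcher.fullmatch(text)
    if m is None:
        return None
    # Only the top-level named group counts toward the identifier; the
    # inflection alternatives inside it are non-capturing and are ignored.
    return int(m.lastgroup[len("mwe"):])

matcher = build_matcher(MWE_PATTERNS)
print(mwe_identifier(matcher, "kicked the bucket"))
print(mwe_identifier(matcher, "took into account"))
print(mwe_identifier(matcher, "kick the ball"))
```

All inflected variants of an expression resolve to the same identifier because the variation is confined to the inner non-capturing groups, which loosely mirrors counting only transitions of the main network and not those of the subnets.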
Abstract:
An executable for a new linguistic service is produced using preexisting source code for an ancestor service that is a less specified ancestor of the new linguistic service in a hierarchy. The preexisting source code is modified, such as by further specifying it, to produce modified source code for responding to requests for the new linguistic service, where each request identifies the new linguistic service and indicates linguistic data on which it is to be performed. The modified source code is then used to produce the executable for the new linguistic service. The preexisting source code can, for example, define a top-level class in an object-oriented programming language, with common parameters including input parameters with information for obtaining the linguistic data and result parameters with information for returning results of the new linguistic service.
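The hierarchy of services can be pictured with a small class sketch. It assumes Python as the object-oriented language and uses invented class and service names; the top-level class carries the common input and result handling, and each descendant further specifies its ancestor.

```python
from dataclasses import dataclass

@dataclass
class ServiceRequest:
    service: str  # identifies the requested linguistic service
    text: str     # linguistic data on which the service is performed

class LinguisticService:
    """Top-level class: common request checking and result packaging."""
    name = "linguistic-service"

    def handle(self, request):
        if request.service != self.name:
            raise ValueError(f"request is for {request.service!r}, not {self.name!r}")
        return {"service": self.name, "result": self.process(request.text)}

    def process(self, text):
        raise NotImplementedError("a descendant service must specify this step")

class Tokenizer(LinguisticService):
    """Ancestor service: whitespace tokenization (hypothetical example)."""
    name = "tokenize"

    def process(self, text):
        return text.split()

class LowercaseTokenizer(Tokenizer):
    """New service produced by further specifying the ancestor's code."""
    name = "tokenize-lowercase"

    def process(self, text):
        return [token.lower() for token in super().process(text)]

response = LowercaseTokenizer().handle(
    ServiceRequest("tokenize-lowercase", "Hello World")
)
print(response)
```

The new service reuses the ancestor's request handling unchanged and overrides only the step it specializes.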
Abstract:
The invention relates to a method and a computer system for enhanced part-of-speech (POS) tagging and for grammatically disambiguating a phrase. A phrase is usually a short multiword expression that may be ambiguous. By introducing grammatical constraints, the invention supports both POS tagging and grammatical disambiguation of the phrase. According to an identifier for the phrase, the phrase is supplemented with artificial context information. The supplemented phrase is then POS-tagged or grammatically disambiguated. Important applications are POS tagging, Automatic Term Encoding, Headword Detection and Information Retrieval.
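The idea of supplementing a phrase with artificial context can be shown with a deliberately tiny rule-based tagger. Everything here is an assumption made for illustration: the lexicon, the two context rules, and the phrase identifiers "NP" and "VP" with their artificial context words.

```python
# Hypothetical lexicon: word -> possible part-of-speech tags.
LEXICON = {
    "the": {"DET"},
    "to": {"PART"},
    "record": {"NOUN", "VERB"},
    "sales": {"NOUN"},
}

# Hypothetical artificial context prepended according to the phrase identifier.
CONTEXT = {"NP": ["the"], "VP": ["to"]}

def tag(tokens):
    """Toy tagger: resolves ambiguous words from the previous tag only."""
    tags = []
    for token in tokens:
        options = LEXICON.get(token, {"NOUN"})
        if len(options) == 1:
            tags.append(next(iter(options)))
        elif tags and tags[-1] == "PART" and "VERB" in options:
            tags.append("VERB")
        elif tags and tags[-1] == "DET" and "NOUN" in options:
            tags.append("NOUN")
        else:
            tags.append(sorted(options)[0])  # arbitrary fallback
    return tags

def tag_phrase(phrase, phrase_type):
    """Supplement the phrase with context, tag it, then drop the context."""
    context = CONTEXT[phrase_type]
    words = phrase.split()
    tags = tag(context + words)[len(context):]
    return list(zip(words, tags))

print(tag_phrase("record sales", "NP"))
print(tag_phrase("record sales", "VP"))
```

The same two-word phrase receives different tags for "record" depending on the identifier, because the artificial context supplies the grammatical constraint the isolated phrase lacks.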