摘要:
The invention is a method, system and computer program for automatically discovering concepts from a corpus of documents and automatically generating a labeled concept hierarchy. The method involves extraction of signatures from the corpus of documents. The similarity between signatures is computed using a statistical measure. The frequency distribution of signatures is refined to alleviate any inaccuracy in the similarity measure. The signatures are also disambiguated to address the polysemy problem. The similarity measure is recomputed based on the refined frequency distribution and disambiguated signatures. The recomputed similarity measure reflects actual similarity between signatures. The recomputed similarity measure is then used for clustering related signatures. The signatures are clustered to generate concepts and concepts are arranged in a concept hierarchy. The concept hierarchy automatically generates query for a particular concept and retrieves relevant documents associated with the concept.
摘要:
Computer systems and methods incorporate user annotations (metadata) regarding various pages or sites, including annotations by a querying user and by members of a trust network defined for the querying user into search and browsing of a corpus such as the World Wide Web. A trust network is defined for each user, and annotations by any member of a first user's trust network are made visible to the first user during search and/or browsing of the corpus. Users can also limit searches to content annotated by members of their trust networks or by members of a community selected by the user.
摘要:
Fixed-pitch, fixed-font characters embedded in a noisy gray-scale image of picture elements (pels) within a complex background can be extracted prior to execution of any recognition operations by first deriving a normalized Boolean-coded image from the gray-scale image. Then, a subset of at least three uncontaminated character triples is formed by filtering the Boolean-coded image. Next, an affine transform is approximated from locations in the Boolean-coded image of at least three noncollinear ones of the uncontaminated character triples. Lastly, the locations in a logical matrix array of all possible character triples are estimated according to the affine transform.