Abstract:
Disclosed are a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The search is performed in two levels. A search query includes a search operator consisting of a plurality of search sub-expressions, each having an associated weight value. The search engine returns a document or documents having a weight value sum that exceeds a threshold weight value sum. The search operator is implemented as a Boolean predicate that functions as a Weighted AND (WAND).
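The weighted-sum predicate described in this abstract can be illustrated with a minimal sketch. It assumes documents are plain sets of terms and each sub-expression is a single term; the names `wand` and `search` are hypothetical, and the choice of a non-strict comparison (`>=`) is an assumption made here so that unit weights with a threshold equal to the number of terms behave like AND.

```python
from typing import Dict, List, Set, Tuple


def wand(doc_terms: Set[str], weighted_terms: List[Tuple[str, float]], threshold: float) -> bool:
    """Return True when the weights of the matching sub-expressions reach the threshold."""
    total = sum(weight for term, weight in weighted_terms if term in doc_terms)
    return total >= threshold


def search(docs: Dict[str, Set[str]], weighted_terms: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Return the ids of all documents that satisfy the WAND predicate."""
    return [doc_id for doc_id, terms in docs.items() if wand(terms, weighted_terms, threshold)]


docs = {
    "d1": {"search", "engine"},
    "d2": {"search"},
    "d3": {"storage"},
}
query = [("search", 1.0), ("engine", 1.0)]

all_terms = search(docs, query, threshold=2.0)  # behaves like AND
any_term = search(docs, query, threshold=1.0)   # behaves like OR
```

Raising or lowering the threshold moves the predicate between conjunction-like and disjunction-like behaviour without changing the query terms.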
Abstract:
Disclosed are a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The search is performed in two levels. Also disclosed are a system, method and computer program product to process document data. The method includes inputting a document and operating at least one text analysis engine that comprises a plurality of coupled annotators for tokenizing document data and for identifying and annotating a particular type of semantic content. Operating the at least one text analysis engine generates a plurality of views of a document, where each of the plurality of views is derived from a different tokenization of the document. The method further includes storing the plurality of views in a common data structure associated with the document.
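As a rough illustration of several views stored in one structure, the sketch below runs two different tokenizers over the same text and keeps each result under its own name. The class `AnalyzedDocument`, the two tokenizers and the function `analyze` are hypothetical names chosen for this example and are not taken from the abstract.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AnalyzedDocument:
    """Common data structure: the raw text plus one token list per view."""
    text: str
    views: Dict[str, List[str]] = field(default_factory=dict)


def whitespace_tokens(text: str) -> List[str]:
    # Split on whitespace only, keeping punctuation attached to words.
    return text.split()


def word_tokens(text: str) -> List[str]:
    # Lowercase the text and keep runs of word characters only.
    return re.findall(r"\w+", text.lower())


def analyze(text: str, tokenizers: Dict[str, Callable[[str], List[str]]]) -> AnalyzedDocument:
    """Build one view per tokenizer and store them all with the document."""
    doc = AnalyzedDocument(text)
    for view_name, tokenize in tokenizers.items():
        doc.views[view_name] = tokenize(text)
    return doc


doc = analyze("Hello, World!", {"whitespace": whitespace_tokens, "word": word_tokens})
```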
Abstract:
A method for searching a document collection includes providing an index of terms indicating the documents in which the terms appear. A first statistical distribution of each of at least some of the terms in the index and a second statistical distribution of each of at least some of the categories are estimated over the documents in the collection. A query including one or more of the terms and a category restriction referring to at least one of the categories is accepted. A modified term distribution is produced by operating on the first statistical distribution of at least one of the terms in the query using the second statistical distribution, responsively to the category restriction. The query is applied to the index to return a response, in which occurrences of the at least one of the terms are scored responsively to the modified term distribution.
Abstract:
A method for searching a document collection includes providing an index of terms indicating the documents in which the terms appear. A first statistical distribution of each of at least some of the terms in the index and a second statistical distribution of each of at least some of the categories are estimated over the documents in the collection. A query including one or more of the terms and a category restriction referring to at least one of the categories is accepted. A modified term distribution is produced by operating on the first estimated statistical distribution of at least one of the terms in the query using the second estimated statistical distribution of the at least one of the categories, responsively to the category restriction. The query is applied to the index so as to return a response, in which occurrences of the at least one of the terms are scored responsively to the modified term distribution.
Abstract:
Method, system, and computer program product are provided for scoring of crowd-computing inputs. A group of data is provided to crowd-computing participants and the participants are requested to provide candidate members of the group of data. The computer-implemented method performed includes: receiving an input by a participant, wherein the input is a candidate member; counting multiple inputs of the same candidate member by participants; validating a candidate member; rewarding the participants inputting the candidate member, with a higher reward for participants who input the candidate member earlier than other participants; and supplying the rewards to participants once the candidate member has been validated.
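The reward rule described here can be sketched as follows. The abstract does not give a formula, so dividing a base reward by the participant's rank in submission order is purely an assumption for illustration, as are the names `rewards_for_candidate` and `base_reward`.

```python
from typing import Dict, List


def rewards_for_candidate(submitters: List[str], validated: bool, base_reward: float = 10.0) -> Dict[str, float]:
    """Reward each participant who submitted the candidate, earlier ones more.

    `submitters` lists participant names in the order their inputs arrived;
    a participant may appear more than once. No reward is paid until the
    candidate member has been validated.
    """
    if not validated:
        return {}
    # Deduplicate while preserving the order of first submission.
    ordered = list(dict.fromkeys(submitters))
    # Assumed rule: reward decays with rank (1st gets base, 2nd base/2, ...).
    return {name: base_reward / rank for rank, name in enumerate(ordered, start=1)}


inputs = ["ann", "bob", "ann", "cy"]
input_count = len(inputs)  # multiple inputs of the same candidate are counted
paid = rewards_for_candidate(inputs, validated=True)
unpaid = rewards_for_candidate(inputs, validated=False)
```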
Abstract:
Method, system, and computer program product for indexing and searching entity-relationship data are provided. The method includes: defining a logical document model for entity-relationship data including: representing an entity as a document containing the entity's searchable content and metadata; dually representing the entity as a document and as a category; and representing each relationship instance for the entity as a category set that contains categories of all participating entities in the relationship. The method also includes: translating entity-relationship data into the logical document model; and indexing the entity-relationship data of the populated logical document model as an inverted index. The method may include searching indexed entity-relationship data using a faceted search, wherein the categories are all categories required for supporting faceted navigation.
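One way to picture the logical document model is the sketch below. It assumes entities are given as an id-to-text mapping and relationship instances as tuples of entity ids; the category naming scheme `entity/<id>` and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_documents(entities: Dict[str, str], relationships: List[Tuple[str, ...]]) -> Dict[str, dict]:
    """Represent each entity as a document and as a category, and attach
    one category set per relationship instance it participates in."""
    docs = {
        entity_id: {"content": content, "category": f"entity/{entity_id}", "category_sets": []}
        for entity_id, content in entities.items()
    }
    for participants in relationships:
        # The category set holds the categories of all participating entities.
        category_set = frozenset(f"entity/{p}" for p in participants)
        for p in participants:
            docs[p]["category_sets"].append(category_set)
    return docs


def build_inverted_index(docs: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Map each lowercased content token to the ids of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc["content"].lower().split():
            index[token].add(doc_id)
    return index


entities = {"alice": "Alice engineer", "acme": "Acme company"}
relationships = [("alice", "acme")]  # e.g. an employment relationship instance
docs = build_documents(entities, relationships)
index = build_inverted_index(docs)
```

A faceted search could then count the categories found in the category sets of the matching documents; that step is omitted here.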
Abstract:
A method and system for using social bookmarks wherein a social bookmark is a triplet of the entities of user, document, and tag. The method includes: collecting multiple bookmarks; representing the bookmarks as a three-dimensional space or matrix of the number of times a user u used tag t to bookmark document d; measuring the similarity of two entities of the same type; and using the similarity to weight bookmarks or entities. The weightings may be used to provide a measure of the usefulness of a bookmark for describing a document for retrieval purposes. Two dimensions of the bookmark space may also be used to predict the third dimension.
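The bookmark space and an entity similarity can be sketched as below. The abstract does not name a similarity measure, so cosine similarity between two users' (document, tag) count vectors is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Bookmark = Tuple[str, str, str]  # (user, document, tag)


def build_space(bookmarks: Iterable[Bookmark]) -> Counter:
    """Sparse three-dimensional count: (user, document, tag) -> number of uses."""
    return Counter(bookmarks)


def user_vector(space: Counter, user: str) -> Dict[Tuple[str, str], int]:
    """Slice of the space for one user, keyed by (document, tag)."""
    return {(d, t): n for (u, d, t), n in space.items() if u == user}


def cosine(a: Dict, b: Dict) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


space = build_space([
    ("u1", "d1", "python"),
    ("u2", "d1", "python"),
    ("u3", "d2", "java"),
])
same_taste = cosine(user_vector(space, "u1"), user_vector(space, "u2"))
no_overlap = cosine(user_vector(space, "u1"), user_vector(space, "u3"))
```

The same slicing could be applied along the document or tag dimension to compare two documents or two tags.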
Abstract:
A method and system are provided for maintaining profiles of information channels available on the Web, wherein the information channels are accessed via pull-only protocols. The method includes monitoring one or more channels by a channel pull action at a monitoring rate, wherein the monitoring rate is determined for the one or more channels based on the number of update events in a previous time period. The method may optionally include filtering the update events in the time period by a novelty measure, wherein the filtering disregards events that do not include significant novel information. The monitoring rate is adapted based on reinforcement learning applying iterative learning rules over time.
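A minimal sketch of an adaptive monitoring rate follows. The abstract does not specify the learning rule, so the incremental update used here (move the rate a fraction of the way toward the number of novel events seen in the last period) is an assumption, as are the names and the default parameter values.

```python
from typing import Iterable


def count_novel_events(novelty_scores: Iterable[float], min_novelty: float = 0.5) -> int:
    """Count update events whose novelty score passes the filter."""
    return sum(1 for score in novelty_scores if score >= min_novelty)


def adapt_rate(current_rate: float, novel_events: int,
               learning_rate: float = 0.5, min_rate: float = 0.1) -> float:
    """Assumed iterative rule: nudge the pulls-per-period rate toward the
    number of novel events observed in the previous period, never dropping
    below `min_rate` so the channel keeps being polled."""
    updated = current_rate + learning_rate * (novel_events - current_rate)
    return max(min_rate, updated)


novel = count_novel_events([0.9, 0.1, 0.7])      # two events pass the filter
busier = adapt_rate(4.0, novel_events=8)         # channel got busier
quieter = adapt_rate(4.0, novel_events=0)        # channel went quiet
floor = adapt_rate(0.1, novel_events=0)          # clamped at the minimum
```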
Abstract:
A method and system for improved query expansion in faceted search are provided. The method includes: receiving a search query; expanding the search query to obtain query expansion terms; and receiving a facet selection for the search query. A facet profile is retrieved in the form of collected important terms for the facet; and the query expansion terms are weighted by comparing them to the facet profile. The query expansion terms are re-ranked and the method includes executing the re-weighted query expansion terms whilst filtering for the facet.
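The re-weighting step can be illustrated with the sketch below. It assumes the expansion terms and the facet profile are both term-to-weight mappings, and that a term's weight is boosted in proportion to its weight in the profile; that boosting formula and the function name `reweight` are assumptions, not details from the abstract.

```python
from typing import Dict, List, Tuple


def reweight(expansion_terms: Dict[str, float], facet_profile: Dict[str, float]) -> List[Tuple[str, float]]:
    """Boost expansion terms that also appear in the facet profile and
    return them re-ranked by their new weight, highest first."""
    boosted = {
        term: weight * (1.0 + facet_profile.get(term, 0.0))
        for term, weight in expansion_terms.items()
    }
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)


# Expansion terms for an ambiguous query, before a facet is selected.
expansion = {"engine": 0.5, "habitat": 0.6}
# Hypothetical profile of important terms collected for a "cars" facet.
cars_profile = {"engine": 0.9, "dealer": 0.4}

ranked = reweight(expansion, cars_profile)
```

In this example "engine" overtakes "habitat" once the facet profile is taken into account; the re-weighted terms would then be executed with a filter on the selected facet.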
Abstract:
A method and system for prioritising operations on network objects are provided. The method includes gathering Web 2.0 available relationship data on the relationships between network entities, wherein network entities are network users and network objects. The relationship data for a network entity is analysed and a first relative score is determined based on the relationship data. For a network object, a second relative score is determined which is a dynamic score based on user interactions with the network object and formed using the first relative scores of network entities interacting with the object. The method then prioritizes an operation on a network object using the second relative score.
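The two scores can be sketched as follows. The abstract does not define how either score is computed, so this example assumes the first score is an entity's relationship count normalised by the largest count, and the second score is the sum of the first scores of the users who interacted with the object. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def first_scores(relationships: List[Tuple[str, str]]) -> Dict[str, float]:
    """Assumed first relative score: relationship count divided by the maximum count."""
    degree: Counter = Counter()
    for a, b in relationships:
        degree[a] += 1
        degree[b] += 1
    top = max(degree.values(), default=1)
    return {entity: count / top for entity, count in degree.items()}


def second_score(obj: str, interactions: List[Tuple[str, str]], scores: Dict[str, float]) -> float:
    """Assumed second score: sum of the first scores of users who interacted with `obj`."""
    return sum(scores.get(user, 0.0) for user, target in interactions if target == obj)


def prioritize(objects: List[str], interactions: List[Tuple[str, str]], scores: Dict[str, float]) -> List[str]:
    """Order objects so that the operation is applied to high scorers first."""
    return sorted(objects, key=lambda o: second_score(o, interactions, scores), reverse=True)


relationships = [("ann", "bob"), ("ann", "cy"), ("ann", "dee")]
interactions = [("ann", "page1"), ("bob", "page2")]  # (user, object)
scores = first_scores(relationships)
order = prioritize(["page2", "page1"], interactions, scores)
```

Because the second score depends on the interaction list, recomputing it as new interactions arrive keeps the ordering dynamic.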