Abstract:
A dangling web page processing system ranks dangling web pages on the web. The system ranks dangling web pages of high quality that cannot be crawled by a crawler. In addition, the system adjusts ranks to penalize dangling web pages that return errors when links on the dangling web pages are crawled. By providing a rank for dangling web pages, the present system allows the concentration of crawling resources on those dangling web pages that have the highest rank in the uncrawled region. The system operates locally to the dangling web pages, providing efficient determination of ranks for the dangling web pages. The system explicitly discriminates against web pages on the basis of whether they point to penalty pages, i.e., pages that return an error when a link is followed. By incorporating more fine-grained information such as this into ranking, the system can improve the quality of individual search results and better manage resources for crawling.
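The ranking idea above can be illustrated with a minimal sketch. The abstract does not give a formula, so everything here is an assumption made for illustration: the function name `rank_dangling`, the `penalty_weight` parameter, the rule that a crawled page's score is reduced in proportion to the fraction of its outlinks that hit penalty pages, and the rule that each uncrawled target receives an equal share of the adjusted score.

```python
from collections import defaultdict


def rank_dangling(links, base_score, penalty_pages, penalty_weight=0.5):
    """Rank uncrawled (dangling) pages from the crawled pages that link to them.

    links:          dict mapping each crawled page to its list of outlinks
    base_score:     dict mapping each crawled page to a prior score
    penalty_pages:  set of URLs that returned an error when followed
    penalty_weight: hypothetical knob controlling how hard to penalize
    """
    ranks = defaultdict(float)
    for src, targets in links.items():
        if not targets:
            continue
        bad = sum(1 for t in targets if t in penalty_pages)
        # Assumed rule: shrink the source's score by the fraction of
        # its outlinks that point to penalty pages.
        adjusted = base_score.get(src, 0.0) * (1.0 - penalty_weight * bad / len(targets))
        share = adjusted / len(targets)
        for t in targets:
            # Only uncrawled, non-error targets receive a rank.
            if t not in links and t not in penalty_pages:
                ranks[t] += share
    return dict(ranks)


links = {"a": ["x", "y"], "b": ["x", "dead"]}
base_score = {"a": 1.0, "b": 1.0}
ranks = rank_dangling(links, base_score, penalty_pages={"dead"})

# Crawl the highest-ranked dangling pages first.
crawl_order = sorted(ranks, key=ranks.get, reverse=True)
```

Each source is scored using only its own outlinks, which is one way to read the abstract's claim that the system operates locally to the dangling pages.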
Abstract:
A method is carried out by storing information describing configurations of discussion threads formed of respective series of EMTs that are exchanged among at least two individuals. The discussion threads have a root EMT, zero or more reply EMTs, and a last offspring EMT. The method is further carried out by compacting the EMT discussion threads, and indexing the compacted EMT discussion threads.
Abstract:
A method for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. Constructing the inverted index includes generating a full path token and an associated full path token posting list. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
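The indexing and query steps described above can be sketched roughly as follows. This is an illustrative sketch rather than the claimed method: the `/` path separator, the helper names `path_prefixes`, `build_index`, and `run_query`, and the sample documents are all assumptions, and the separate full path token posting list mentioned in the abstract is not modeled.

```python
from collections import defaultdict


def path_prefixes(facet_token, sep="/"):
    """Return every path prefix of a facet token, including the token itself."""
    parts = facet_token.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def build_index(docs):
    """Build an inverted index: indexed token -> sorted posting list of doc ids.

    docs maps a document id to the facet tokens annotated on that document.
    """
    index = defaultdict(set)
    for doc_id, facet_tokens in docs.items():
        for token in facet_tokens:
            for prefix in path_prefixes(token):
                index[prefix].add(doc_id)
    return {tok: sorted(ids) for tok, ids in index.items()}


def run_query(index, constraints):
    """Look up the posting list for each constraint and intersect them."""
    postings = [set(index.get(tok, ())) for tok in constraints]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: ["location/europe/france", "type/report"],
    2: ["location/europe/spain", "type/report"],
    3: ["location/asia/japan", "type/memo"],
}
index = build_index(docs)
european_reports = run_query(index, ["location/europe", "type/report"])
```

Indexing every path prefix means a constraint on an inner category such as `location/europe` resolves to a single posting-list lookup instead of a traversal of the facet tree at query time.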
Abstract:
A method and system for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
Abstract:
A system and method of indexing a plurality of entities located in a taxonomy, the entities comprising sets of terms, comprises receiving terms in an index structure; building a posting list for an entity with respect to the locations of the set of terms defining the entity and data associated with the respective terms; and indexing a name of a group comprising the entities within this group at the location of the entities with the data of the group comprising the name of the respective entity at each location. The building of the posting list comprises storing the location of the term and data associated with the term in an entry in the posting list for the term. The method comprises indexing aliases of the name of the group comprising the term, and using an inverted list index to associate data with each occurrence of an index term.
Abstract:
A method includes describing the thread configurations of a volume of well-ordered electronic message transmissions (EMT) and utilizing the thread configuration data to conduct selective searches of the EMT volume. An apparatus includes a thread processor and a query manager. The thread processor analyzes the EMT threads and records the thread configuration data. The query manager utilizes the thread configuration data to conduct selective searches of the EMT volume.
Abstract:
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
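A minimal sketch of sharded phrase posting lists follows. The abstract does not say how lists are partitioned or how the schedule is optimized, so the choices here are assumptions: documents are assigned to shards by `doc_id % num_shards`, each shard intersects its own slices locally, and a shard is skipped when any query phrase has no postings there. The names `shard_postings` and `query_shards` are hypothetical, and tiering is not modeled.

```python
def shard_postings(phrase_postings, num_shards):
    """Split each phrase posting list into per-shard slices by document id."""
    shards = [dict() for _ in range(num_shards)]
    for phrase, doc_ids in phrase_postings.items():
        for doc_id in doc_ids:
            shards[doc_id % num_shards].setdefault(phrase, []).append(doc_id)
    return shards


def query_shards(shards, phrases):
    """Intersect the phrase posting lists shard by shard and merge the hits."""
    hits = []
    for shard in shards:
        lists = [set(shard.get(p, ())) for p in phrases]
        # Skip the shard entirely if any phrase is absent from it:
        # the intersection would be empty, so no work is needed there.
        if lists and all(lists):
            hits.extend(set.intersection(*lists))
    return sorted(hits)


phrase_postings = {"new york": [1, 2, 5, 8], "pizza": [2, 3, 8, 9]}
shards = shard_postings(phrase_postings, num_shards=2)
matches = query_shards(shards, ["new york", "pizza"])
```

Because every document lives in exactly one shard, each shard can compute its part of the intersection independently, and only the matching document ids need to be sent back for merging.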
Abstract:
A producer node may be included in a hierarchical, tree-shaped processing architecture, the architecture including at least one distributor node configured to distribute queries within the architecture, including distribution to the producer node and at least one other producer node within a predefined subset of producer nodes. The distributor node may be further configured to receive results from the producer node and results from the at least one other producer node and to output compiled results therefrom. The producer node may include a query pre-processor configured to process a query received from the distributor node to obtain a query representation using query features compatible with searching a producer index associated with the producer node to thereby obtain the results from the producer node, and a query classifier configured to input the query representation and output a prediction, based thereon, as to whether processing of the query by the at least one other producer node within the predefined subset of producer nodes will cause results of the at least one other producer node to be included within the compiled results.