摘要:
A method for indexing a plurality of documents, that includes a plurality of duplicate documents, first identifies one or more duplicate groups of documents from among the plurality of documents. Then, one index of content for the duplicate group is created instead of indexing the content from every document within the duplicate group. However, in contrast to the content index, an index of metadata for each of the documents in the duplicate group is created. Thus the content of each duplicate group is indexed only once, while a search engine using such indexing techniques retains the capability to answer queries as if the duplicated content was indexed for each document of the group.
摘要:
A method and system for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
摘要:
A method for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. Constructing the inverted index includes generating a full path token and an associated full path token posting list. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
摘要:
A method for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. Constructing the inverted index includes generating a full path token and an associated full path token posting list. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
摘要:
A method and system for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
摘要:
A method for indexing a plurality of documents, that includes a plurality of duplicate documents, first identifies one or more duplicate groups of documents from among the plurality of documents. Then, one index of content for the duplicate group is created instead of indexing the content from every document within the duplicate group. However, in contrast to the content index, an index of metadata for each of the documents in the duplicate group is created. Thus the content of each duplicate group is indexed only once, while a search engine using such indexing techniques retains the capability to answer queries as if the duplicated content was indexed for each document of the group.
摘要:
Provided are a system and article of manufacture for searching documents for ranges of numeric values. Document identifiers for documents include at least one value that is a member of a set of values. A number of posting lists is generated, wherein each posting list is associated with a range of consecutive values within the set of values and includes document identifiers for documents including at least one value within the range of consecutive values associated with the posting list, and wherein each document identifier is associated with one value in the set of values included in the document identified by the document identifier. The generated posting lists are stored, wherein the posting lists are used to process a query on a range of values within the set of values. A query on a query range of values within the set of values is received and a determination is made of a minimum number of posting lists associated with consecutive values that together include the query range of values. The determined posting lists are merged to form a merged posting list including document identifiers of documents including values within the query range. The document identifiers in the merged posting list are returned.
摘要:
Provided are a method, system, and article of manufacture for searching documents for ranges of numeric values. Document identifiers for documents are accessed, wherein the documents include at least one value that is a member of a set of values. A number of posting lists are generated. Each posting list is associated with a range of consecutive values within the set of values and includes document identifiers for documents including at least one value within the range of consecutive values associated with the posting list, and wherein each document identifier is associated with one value in the set of values included in the document identified by the document identifier. The generated posting lists are stored, wherein the posting lists are used to process a query on a range of values within the set of values.
摘要:
Provided are a method, system, and article of manufacture for searching documents for ranges of numeric values. Document identifiers for documents are accessed, wherein the documents include at least one value that is a member of a set of values. A number of posting lists are generated. Each posting list is associated with a range of consecutive values within the set of values and includes document identifiers for documents including at least one value within the range of consecutive values associated with the posting list, and wherein each document identifier is associated with one value in the set of values included in the document identified by the document identifier. The generated posting lists are stored, wherein the posting lists are used to process a query on a range of values within the set of values.
摘要:
Provided are a method, system, and program for searching documents for ranges of numeric values. Document identifiers for documents are accessed, wherein the documents include at least one value that is a member of a set of values. A number of posting lists are generated. Each posting list is associated with a range of consecutive values within the set of values and includes document identifiers for documents having values within the range of consecutive values associated with the posting list. Each document identifier is associated with one value in the set of values included in the document identified by the document identifier. The generated posting lists are stored.