摘要:
A method, information processing system, and computer program storage product to compress sorted values. At least a first prefix and a second prefix in a plurality of prefixes are compared. Each prefix comprises at least a portion of a plurality of sorted values. A respective prefix comprises a set of consecutive characters including at least a first character of a respective sorted value. The respective sorted value further comprising a respective suffix comprising consecutive characters of the respective sorted value that are after the respective prefix. At least a respective first character of the first prefix and a respective first character of the second prefix are determined to be substantially identical. The first prefix is merged with the second prefix into a single prefix comprising the first character. A set of suffixes associated with the first prefix is updated to reflect an association with the second prefix.
摘要:
A method, information processing system, and computer program storage product for compressing sorted values is disclosed. At least a first prefix and a second prefix in a plurality of prefixes are compared. Each prefix comprises at least a portion of a plurality of sorted values. A respective prefix comprises a set of consecutive characters including at least a first character of a respective sorted value. The respective sorted value further comprising a respective suffix comprising consecutive characters of the respective sorted value that are after the respective prefix. At least a respective first character of the first prefix and a respective first character of the second prefix are determined to be substantially identical. The first prefix is merged with the second prefix into a single prefix comprising the first character. A set of suffixes associated with the first prefix is updated to reflect an association with the second prefix.
摘要:
A method, information processing system, and computer readable storage product estimate a compression factor. A set of key values within an index are analyzed. Each key value is associated with a record identifier (“RID”) list comprising a set of RIDs. The index is in an uncompressed format and includes a total byte length. A number of RIDs associated with each key value is estimated for each key value in the set of key values. A total byte length for all RID deltas between each at least two consecutive RIDs within a RID list is estimated for each RID list based on the number of RIDs that have been determined. The total byte length estimated for each RID list is accumulated. A compression factor associated with the index is determined by dividing the total byte length that has been accumulated by the byte length of the index.
摘要:
A method, information processing system, and computer readable storage product estimate a compression factor. A set of key values within an index are analyzed. Each key value is associated with a record identifier (“RID”) list comprising a set of RIDs. The index is in an uncompressed format and includes a total byte length. A number of RIDs associated with each key value is estimated for each key value in the set of key values. A total byte length for all RID deltas between each at least two consecutive RIDs within a RID list is estimated for each RID list based on the number of RIDs that have been determined. The total byte length estimated for each RID list is accumulated. A compression factor associated with the index is determined by dividing the total byte length that has been accumulated by the byte length of the index.
摘要:
Techniques are disclosed for selecting a delete-safe compression method for a plurality of delta encoded data values (e.g., delta encoded integers or deltas). For example, a computer-implemented method for selecting an optimal delete-safe compression algorithm from among two or more compression algorithms for use on a plurality of delta encoded data values includes the following steps. The maximum number of data values eliminated by each of the two or more compression algorithms is computed. For the plurality of delta encoded data values to be compressed, the minimum size of the plurality of delta encoded data values before compression thereof is computed. A delete-safe threshold value is computed based on the minimum size of the plurality of delta encoded data values. Then, the compression algorithm is selected from the two or more compression algorithms that achieves the delete-safe threshold value.
摘要:
Techniques are disclosed for encoding a variable length structure such that it facilitates forward and reverse scans of a list of such structures as needed. While the techniques are applicable to a wide variety of applications, they are particularly well-suited for use with structures such as those found in compressed database indexes. For example, a computer-implemented method for processing one or more variable length data structures includes the following steps. Each variable length data structure is obtained. Each variable length structure comprises one or more data block. A variable length encoding process is applied to the one or more blocks of each variable length data structure which comprises setting a continuation data value in each block to a first value or a second value, wherein the setting of the continuation data values enables bi-directional scanning of each variable length structure.
摘要:
Techniques are disclosed for encoding a variable length structure such that it facilitates forward and reverse scans of a list of such structures as needed. While the techniques are applicable to a wide variety of applications, they are particularly well-suited for use with structures such as those found in compressed database indexes. For example, a computer-implemented method for processing one or more variable length data structures includes the following steps. Each variable length data structure is obtained. Each variable length structure comprises one or more data block. A variable length encoding process is applied to the one or more blocks of each variable length data structure which comprises setting a continuation data value in each block to a first value or a second value, wherein the setting of the continuation data values enables bi-directional scanning of each variable length structure.
摘要:
Techniques are disclosed for selecting a delete-safe compression method for a plurality of delta encoded data values (e.g., delta encoded integers or deltas). For example, a computer-implemented method for selecting an optimal delete-safe compression algorithm from among two or more compression algorithms for use on a plurality of delta encoded data values includes the following steps. The maximum number of data values eliminated by each of the two or more compression algorithms is computed. For the plurality of delta encoded data values to be compressed, the minimum size of the plurality of delta encoded data values before compression thereof is computed. A delete-safe threshold value is computed based on the minimum size of the plurality of delta encoded data values. Then, the compression algorithm is selected from the two or more compression algorithms that achieves the delete-safe threshold value.
摘要:
A method for organizing deep Web services is provided. In one aspect, the method obtains a collection of sources and their associated attributes and/or input modes, for instance, using a crawling algorithm. The method uses this information to organize the sources into communities. A mining algorithm such as the hyperclique mining algorithm is used to obtain cliques of highly correlated attributes. A clustering algorithm such as the hierarchical agglomerative clustering algorithm is used to further cluster the cliques of attributes into larger cliques, which in the present disclosure is referred to as signatures. The sources that are associated with each signature form a community and a graph representation of the communities is constructed, where the vertices are communities and the edges are the shared attributes.
摘要:
A system and method for searching deep web services are provided. The system and method in one aspect allow organizing communities, sources and schema attributes in a multi-tier containment relationship; searching representative schema attributes in one or more communities; searching representative services in one or more communities; searching for related schema attributes; and searching for related communities.