Abstract:
A method includes generating a plurality of sets of pairs of records from a set of records, one set for each attribute-position pair in the set of records, each attribute-position pair being indicative of a position of an attribute in a record. Further, the method includes forming, electronically, a plurality of groups, each group comprising two attribute-position pairs having different attributes. The method also includes determining, electronically, for each group, the number of pairs of records that are common to the two attribute-position pairs of that group. Furthermore, the method includes extracting results based on a first group of the plurality of groups if the number of pairs of records that are common to the two attribute-position pairs of the first group is greater than a second threshold, is highest among the plurality of groups, and no group having three or more attribute-position pairs with different attributes is possible.
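The counting step above can be illustrated with a minimal sketch. It assumes a record is a dictionary mapping an attribute name to a list of values, so that the list index plays the role of the position, and that two records form a pair for an attribute-position pair when they hold the same value there. The names `pairs_per_attribute_position` and `best_group` are hypothetical, and the check that no group of three or more attribute-position pairs is possible is left out.

```python
from itertools import combinations


def pairs_per_attribute_position(records):
    """Map each (attribute, position) to the record-index pairs sharing a value there."""
    by_value = {}
    for index, record in enumerate(records):
        for attribute, values in record.items():
            for position, value in enumerate(values):
                by_value.setdefault((attribute, position, value), []).append(index)
    pair_sets = {}
    for (attribute, position, _value), indexes in by_value.items():
        for pair in combinations(indexes, 2):
            pair_sets.setdefault((attribute, position), set()).add(pair)
    return pair_sets


def best_group(pair_sets, threshold):
    """Return the two-member group with the most common record pairs above threshold."""
    best, best_count = None, threshold
    for first, second in combinations(sorted(pair_sets), 2):
        if first[0] == second[0]:
            continue  # a group needs two different attributes
        common = len(pair_sets[first] & pair_sets[second])
        if common > best_count:
            best, best_count = (first, second), common
    return best, (best_count if best else 0)


records = [
    {"name": ["ann", "lee"], "city": ["oslo"]},
    {"name": ["ann", "lee"], "city": ["oslo"]},
    {"name": ["ann", "kim"], "city": ["rome"]},
]
pair_sets = pairs_per_attribute_position(records)
group, count = best_group(pair_sets, threshold=0)
```

Ties are resolved in favor of the first group encountered in sorted order, which is an arbitrary choice made for this sketch.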
Abstract:
A method includes generating, electronically, one or more matching patterns for one or more pairs of attribute values. Each pair includes two attribute values: a first attribute value from a first record and a second attribute value from a second record. The first attribute value and the second attribute value satisfy a first criterion. Further, the method includes identifying, electronically, a matching segment between the first attribute value and the second attribute value of a first pair. The method also includes repeating the identifying for each pair. Moreover, the method includes computing a similarity score for the first pair using one of the first pair and the matching segment, based on the one or more matching patterns and on the matching segments of the one or more pairs satisfying a second criterion. The method also includes repeating the computing for each pair.
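One way to picture the matching-segment step is with the longest common substring of the two attribute values. The sketch below uses Python's standard `difflib.SequenceMatcher` for that purpose. The abstract does not define the matching patterns or the scoring rule, so the ratio used in the hypothetical `segment_score` function is only an assumption.

```python
from difflib import SequenceMatcher


def matching_segment(first_value, second_value):
    """Return the longest contiguous segment shared by the two attribute values."""
    matcher = SequenceMatcher(None, first_value, second_value, autojunk=False)
    match = matcher.find_longest_match(0, len(first_value), 0, len(second_value))
    return first_value[match.a:match.a + match.size]


def segment_score(first_value, second_value):
    """Assumed score: segment length relative to the longer attribute value."""
    longest = max(len(first_value), len(second_value))
    if longest == 0:
        return 0.0
    return len(matching_segment(first_value, second_value)) / longest


segment = matching_segment("johnsmith", "jsmith")
score = segment_score("johnsmith", "jsmith")
```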
Abstract:
A method of grouping nodes within a distributed network is provided. The example method includes performing a leader node self-determination operation by which each node within the distributed network determines whether to become a leader node or a non-leader node, each leader node being the leader of a group including at least one node. Next, requests are sent, from each leader node, requesting at least one non-leader node to join the group associated with the leader node. First-received requests are accepted, at each non-leader node, such that each accepting non-leader node transitions from a non-leader node to a dependent node dependent upon the requesting leader node. A next set of requests is sent, from each remaining non-leader node, requesting to join the group associated with at least one leader node. A determination is made, at each requested leader node, as to whether to accept the non-leader node into the group associated with the requested leader node. Based on the determination, at each requested leader node, the non-leader node is either accepted into the group associated with the requested leader node or rejected from the group.
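The two request phases can be simulated in a few lines. This is a sketch under stated assumptions, not the patented procedure: the self-determination rule is passed in as a hypothetical `decide_leader` callable, message order is modeled by iteration order, and the leader's acceptance test in the second phase is assumed to be a simple group-size limit (`max_size`).

```python
def form_groups(nodes, neighbors, decide_leader, max_size):
    """Simulate leader self-determination followed by two rounds of join requests."""
    leaders = [node for node in nodes if decide_leader(node)]
    groups = {leader: [leader] for leader in leaders}
    leader_of = {}

    # Phase 1: each leader invites its neighbors; a non-leader accepts
    # the first request it receives and becomes a dependent node.
    for leader in leaders:
        for node in neighbors.get(leader, []):
            if node not in groups and node not in leader_of:
                leader_of[node] = leader
                groups[leader].append(node)

    # Phase 2: each remaining non-leader asks leaders to join; a leader
    # accepts only while its group is below the assumed size limit.
    unassigned = []
    for node in nodes:
        if node in groups or node in leader_of:
            continue
        for leader in leaders:
            if len(groups[leader]) < max_size:
                leader_of[node] = leader
                groups[leader].append(node)
                break
        else:
            unassigned.append(node)
    return groups, unassigned


nodes = ["a", "b", "c", "d", "e"]
neighbors = {"a": ["b", "c"], "d": ["c"]}
groups, unassigned = form_groups(nodes, neighbors, lambda n: n in {"a", "d"}, max_size=3)
```

In this run node `c` receives requests from both leaders but stays with `a`, whose request arrived first, while node `e` is rejected by the full group of `a` and accepted by `d`.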
Abstract:
A distinct-count estimate is obtained in a guaranteed small footprint using a two-level hash distinct-count sketch. A first hash fills the first-level hash buckets with an exponentially decreasing number of data-elements. These are then uniformly hashed to an array of second-level hash tables that have an associated total-element counter and bit-location counters. These counters are used to identify singletons and so provide a distinct-sample and a distinct-count. An estimate of the total distinct-count is obtained by dividing the distinct-count by the probability of mapping a data-element to that bucket. An estimate of the total distinct-source frequencies of destination addresses can be found in a similar fashion. By further associating the distinct-count sketch with a list of singletons, a total singleton count, and a heap containing the destination addresses ordered by their distinct-source frequencies, a tracking distinct-count sketch may be formed that has considerably improved query time.
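The role of the total-element counter and the bit-location counters can be shown with a small sketch of a single second-level bucket. It assumes data-elements are non-negative integers and that net counts never go negative. The class name `CountSignature` and its methods are hypothetical. A bucket holds a singleton exactly when every bit-location counter equals either zero or the total.

```python
class CountSignature:
    """One second-level bucket: a total counter plus one counter per bit location."""

    def __init__(self, bits=32):
        self.total = 0
        self.bit_counts = [0] * bits

    def update(self, element, delta=1):
        """Insert (delta=1) or delete (delta=-1) one occurrence of element."""
        self.total += delta
        for i in range(len(self.bit_counts)):
            if (element >> i) & 1:
                self.bit_counts[i] += delta

    def singleton(self):
        """Return the single distinct element in the bucket, or None."""
        if self.total <= 0:
            return None
        if any(count not in (0, self.total) for count in self.bit_counts):
            return None
        return sum(1 << i for i, count in enumerate(self.bit_counts) if count == self.total)


signature = CountSignature(bits=8)
signature.update(5)
signature.update(5)
only_five = signature.singleton()
signature.update(6)
mixed = signature.singleton()
signature.update(6, delta=-1)
after_delete = signature.singleton()
```

Two distinct elements differ in at least one bit, so that bit's counter falls strictly between zero and the total, which is why the test cannot mistake a mixed bucket for a singleton.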
Abstract:
A method of estimating set-expression cardinalities over data streams with guaranteed small maintenance time per data-element update. The method only examines each data element once and uses a limited amount of memory. The time-efficient stream synopsis extends 2-level hash-sketches by randomly, but uniformly, pre-hashing data-elements prior to logarithmically hashing them to a first-level hash-table. This generates a set of independent 2-level hash-sketches. The set-union cardinality can be estimated by determining the smallest hash-bucket index j at which only a predetermined fraction of the b hash-buckets has a non-empty union |A∪B|. Once a set-union cardinality is estimated, general set-expression cardinalities may be estimated by counting witness elements for the set-expression, i.e., those first-level hash-buckets that are both a singleton for the set-expression and a set-union singleton. The set-expression cardinality is the set-union cardinality times the number of witness elements divided by the number of hash-buckets.
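The final estimate described above is a simple proportion, shown here as a worked sketch. It assumes the set-union cardinality has already been estimated and that each of the independent sketches reports two flags for the chosen first-level bucket: whether it is a set-union singleton and whether it is a singleton for the set expression. The function names are hypothetical.

```python
def count_witnesses(buckets):
    """Count buckets that are both a set-union singleton and an expression singleton."""
    return sum(
        1
        for union_singleton, expression_singleton in buckets
        if union_singleton and expression_singleton
    )


def estimate_set_expression(union_estimate, buckets):
    """Scale the set-union estimate by the fraction of buckets that are witnesses."""
    return union_estimate * count_witnesses(buckets) / len(buckets)


buckets = [(True, True), (True, False), (False, False), (True, True)]
estimate = estimate_set_expression(1000, buckets)
```

With two witnesses among four buckets and a set-union estimate of 1000, the sketch yields 500.0.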
Abstract:
Improved techniques are disclosed for processing data stream queries wherein a data stream is obtained, a set of aggregate queries to be executed on the data stream is obtained, and a query plan for executing the set of aggregate queries on the data stream is generated. In a first method, the generated query plan includes generating at least one intermediate aggregate query, wherein the intermediate aggregate query combines a subset of aggregate queries from the set of aggregate queries so as to pre-aggregate data from the data stream prior to execution of the subset of aggregate queries such that the generated query plan is optimized for computational expense based on a given cost model. In a second method, the generated query plan includes identifying similar filters in two or more aggregate queries of the set of aggregate queries and combining the similar filters into a single filter such that the single filter is usable to pre-filter data input to the two or more aggregate queries.
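The first method can be illustrated with count aggregates. In this minimal sketch, two group-by queries (one per column) are answered from a single intermediate aggregate grouped by both columns, so the stream is scanned once. The column names `src` and `dst` and the function names are assumptions, and the cost model that decides when such sharing pays off is not modeled.

```python
from collections import Counter


def pre_aggregate(stream, keys):
    """Intermediate aggregate: count rows per combination of the given keys."""
    counts = Counter()
    for row in stream:
        counts[tuple(row[key] for key in keys)] += 1
    return counts


def roll_up(intermediate, keys, target):
    """Answer a coarser group-by query from the intermediate aggregate."""
    positions = [keys.index(key) for key in target]
    result = Counter()
    for group, count in intermediate.items():
        result[tuple(group[i] for i in positions)] += count
    return result


stream = [
    {"src": "a", "dst": "x"},
    {"src": "a", "dst": "y"},
    {"src": "b", "dst": "x"},
    {"src": "a", "dst": "x"},
]
keys = ("src", "dst")
intermediate = pre_aggregate(stream, keys)
by_src = roll_up(intermediate, keys, ("src",))
by_dst = roll_up(intermediate, keys, ("dst",))
```

The saving comes from the intermediate aggregate usually holding far fewer groups than the stream has rows, so each final query touches less data.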
Abstract:
A system for, and method of, configuring border gateway selection for transit traffic flows in a computer network. In one embodiment, the system includes: (1) a border gateway modeler that builds a model of cooperating border gateways, the model including capacities of the border gateways and (2) a traffic flow optimizer, associated with the border gateway modeler, that initially assigns traffic to the border gateways in accordance with a generalized assignment problem and subsequently reassigns the traffic to the border gateways based on cost until the capacities are respected.
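The assign-then-repair idea can be sketched as follows. This is not a generalized assignment problem solver: the initial step is replaced by a simpler capacity-ignoring choice of the cheapest gateway per flow, and the repair loop then moves flows off overloaded gateways at the smallest cost increase. The function name, the data layout, and the assumption of positive demands are all hypothetical.

```python
def assign_flows(flows, capacity, cost):
    """flows: {flow: demand}; capacity: {gateway: limit}; cost: {(flow, gateway): cost}."""
    # Initial assignment: cheapest gateway per flow, ignoring capacities.
    assignment = {f: min(capacity, key=lambda g: cost[f, g]) for f in flows}
    load = {g: 0 for g in capacity}
    for f, g in assignment.items():
        load[g] += flows[f]

    # Repair: move flows off overloaded gateways until capacities are respected.
    while True:
        overloaded = {g for g in capacity if load[g] > capacity[g]}
        if not overloaded:
            return assignment
        moves = [
            (cost[f, target] - cost[f, g], f, target)
            for f, g in assignment.items()
            if g in overloaded
            for target in capacity
            if target != g and load[target] + flows[f] <= capacity[target]
        ]
        if not moves:
            raise ValueError("no feasible reassignment found")
        _, f, target = min(moves)  # cheapest single move
        load[assignment[f]] -= flows[f]
        load[target] += flows[f]
        assignment[f] = target


flows = {"f1": 5, "f2": 5, "f3": 5}
capacity = {"g1": 10, "g2": 10}
cost = {
    ("f1", "g1"): 1, ("f1", "g2"): 4,
    ("f2", "g1"): 1, ("f2", "g2"): 2,
    ("f3", "g1"): 1, ("f3", "g2"): 3,
}
assignment = assign_flows(flows, capacity, cost)
```

All three flows first land on `g1`, which exceeds its capacity, and the loop moves `f2` to `g2` because that move adds the least cost.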
Abstract:
The present invention discloses a document descriptor extraction method and system. The method and system create a document descriptor by generalizing input sequences within a document; factoring the input sequences and generalized input sequences; and selecting a document descriptor from the input sequences, generalized sequences, and factored sequences, preferably using minimum description length (MDL) principles. Novel algorithms are employed to perform the generalizing, factoring, and selecting.
Abstract:
A new method for identifying a predetermined number of data points of interest in a large data set. The data points of interest are ranked in relation to the distance to their neighboring points. The method employs partition-based detection algorithms to partition the data points and then compute upper and lower bounds for each partition. These bounds are then used to eliminate those partitions that cannot contain any of the predetermined number of data points of interest. The data points of interest are then computed from the remaining partitions that were not eliminated. The present method eliminates a significant number of data points from consideration as the points of interest, thereby resulting in substantial savings in computational expense compared to conventional methods employed to identify such points.
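The ranking that the partitions are meant to speed up can be written as a brute-force baseline. The sketch assumes the ranking measure is the distance from a point to its k-th nearest neighbor, which is one common reading of "distance to their neighboring points". It performs no partitioning or pruning, and the function names are hypothetical.

```python
import math


def kth_neighbor_distance(index, points, k):
    """Distance from points[index] to its k-th nearest neighbor."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]


def top_points_of_interest(points, k, n):
    """Return the n points with the largest k-th nearest neighbor distance."""
    ranked = sorted(
        range(len(points)),
        key=lambda i: kth_neighbor_distance(i, points, k),
        reverse=True,
    )
    return [points[i] for i in ranked[:n]]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
result = top_points_of_interest(points, k=2, n=1)
```

This baseline compares every pair of points. The partition bounds described above aim to avoid that work by discarding whole partitions before any per-point distances are computed.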