Abstract:
The present invention relates to a computer method, apparatus and programmed medium for clustering databases containing data with categorical attributes. The present invention assigns a pair of points to be neighbors if their similarity exceeds a certain threshold. The similarity value for pairs of points can be based on non-metric information. The present invention determines a total number of links between each cluster and every other cluster based upon the neighbors of the clusters. A goodness measure between each cluster and every other cluster is then calculated from the total number of links between the two clusters and the total number of points within each of them. The present invention merges the two clusters with the best goodness measure. Thus, clustering is performed accurately and efficiently by merging data based on the number of links between the data to be clustered.
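The link-based merging described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the Jaccard similarity, the convention that a point counts as its own neighbor, the exponent used to normalize the goodness measure, and all function names are choices made for the example; the abstract only says that the goodness measure depends on link counts and cluster sizes.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of categorical values (assumed similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbor_sets(points, theta):
    """Map each point index to the indices whose similarity exceeds theta.

    Assumption: every point is also treated as its own neighbor.
    """
    nbrs = {i: {i} for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if jaccard(points[i], points[j]) > theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs


def cross_links(c1, c2, nbrs):
    """Total number of common neighbors over all pairs (p in c1, q in c2)."""
    return sum(len(nbrs[p] & nbrs[q]) for p in c1 for q in c2)


def goodness(c1, c2, nbrs, theta):
    """Links divided by a size-based normalizer (assumed form, needs 0 <= theta < 1)."""
    e = 1 + 2 * (1 - theta) / (1 + theta)
    n1, n2 = len(c1), len(c2)
    return cross_links(c1, c2, nbrs) / ((n1 + n2) ** e - n1 ** e - n2 ** e)


def link_cluster(points, k, theta):
    """Repeatedly merge the pair of clusters with the best goodness measure."""
    nbrs = neighbor_sets(points, theta)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: goodness(clusters[ab[0]], clusters[ab[1]], nbrs, theta),
        )
        if cross_links(clusters[a], clusters[b], nbrs) == 0:
            break  # no remaining pair of clusters shares any link
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


baskets = [
    {"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"},
    {"x", "y", "z"}, {"x", "y", "w"}, {"x", "z", "w"},
]
result = link_cluster(baskets, k=2, theta=0.4)
```

With these six baskets the two groups share no items, so every cross-group link count is zero and the merges stay within each group.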
Abstract:
The present invention relates to a computer method, apparatus and programmed medium for clustering large databases. The present invention represents each cluster to be merged by a constant number of well-scattered points that capture the shape and extent of the cluster. The chosen scattered points are shrunk towards the mean of the cluster by a shrinking fraction to form a representative set of data points that efficiently represent the cluster. The clusters with the closest pair of representative points are merged to form a new cluster. The use of an efficient representation of the clusters allows the present invention to obtain improved clustering while efficiently eliminating outliers.
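The representative-point idea in this abstract can be sketched as below. The abstract does not say how the well-scattered points are chosen, so the farthest-point heuristic here is an assumption, as are the function names and the parameter names `c` (number of scattered points) and `alpha` (shrinking fraction).

```python
import math


def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def scattered_points(points, c):
    """Pick up to c well-scattered points (assumed farthest-point heuristic)."""
    unique = list(dict.fromkeys(points))  # drop exact duplicates, keep order
    center = mean_point(points)
    # Start with the point farthest from the mean.
    chosen = [max(unique, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(unique)):
        # Add the point whose nearest already-chosen point is farthest away.
        chosen.append(
            max(
                (p for p in unique if p not in chosen),
                key=lambda p: min(math.dist(p, q) for q in chosen),
            )
        )
    return chosen


def representatives(points, c, alpha):
    """Shrink each scattered point towards the cluster mean by fraction alpha."""
    center = mean_point(points)
    return [
        tuple(x + alpha * (m - x) for x, m in zip(p, center))
        for p in scattered_points(points, c)
    ]


def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: their closest pair of representatives."""
    return min(math.dist(p, q) for p in reps_a for q in reps_b)


cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
reps = representatives(cluster, c=4, alpha=0.5)
```

A merge step would then join the two clusters with the smallest `cluster_distance`. Shrinking pulls the representatives away from the cluster boundary, which dampens the influence of outliers on that distance.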
Abstract:
A technique that uses a weighted divide-and-conquer approach for clustering a set S of n data points to find k final centers. The technique comprises 1) partitioning the set S into P disjoint pieces S1, ..., SP; 2) for each piece Si, determining a set Di of k intermediate centers; 3) assigning each data point in each piece Si to the nearest one of the k intermediate centers; 4) weighting each of the k intermediate centers in each set Di by the number of points in the corresponding piece Si assigned to that center; and 5) clustering the weighted intermediate centers together to find said k final centers, the clustering performed using a specific error metric and a clustering method A.
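The five steps above can be sketched as follows. The abstract leaves the error metric and "clustering method A" open, so this example assumes squared Euclidean error and a small weighted Lloyd-style k-means with a deterministic farthest-first initialization; the strided partition and all function names are likewise illustrative.

```python
import math


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))


def weighted_kmeans(points, weights, k, iters=20):
    """Weighted Lloyd iterations; stands in for 'clustering method A' (assumption)."""
    centers = [points[0]]
    while len(centers) < k:  # deterministic farthest-first initialization
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            j = nearest(p, centers)
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Keep the old center when no weight was assigned to it.
        centers = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] else centers[j]
            for j in range(k)
        ]
    return centers


def divide_and_conquer(S, k, num_pieces):
    pieces = [S[i::num_pieces] for i in range(num_pieces)]  # 1) disjoint pieces
    centers, weights = [], []
    for piece in pieces:
        if not piece:
            continue
        D = weighted_kmeans(piece, [1.0] * len(piece), k)  # 2) intermediate centers
        counts = [0] * k
        for p in piece:
            counts[nearest(p, D)] += 1  # 3) assign each point to its nearest center
        centers.extend(D)
        weights.extend(counts)  # 4) weight each center by its assigned points
    return weighted_kmeans(centers, weights, k)  # 5) cluster the weighted centers


S = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
final_centers = divide_and_conquer(S, k=2, num_pieces=2)
```

Only the k weighted centers of each piece, not the piece itself, are carried into the final step, which is what lets the pieces be processed one at a time.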
Abstract:
A system and method are provided for summarizing dynamic data from distributed sources through the use of histograms. In particular, the method comprises receiving a first data signal at a first location, determining a first array sketch of the first data signal, and constructing a first output histogram from the first array sketch and a first robust histogram via a first hybrid histogram. Array sketches of a number of data signals may be calculated, and added to yield a single vector sum. The histogram is constructed from the vector sum. In that way, the vector sum may be analyzed without revealing the individual data signals that form the basis of the sum.
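The statement that array sketches can be added into a single vector sum relies on the sketch being a linear function of the signal. The snippet below illustrates only that linearity property, assuming a random ±1 projection as the sketch; the abstract does not define the array sketch, so `make_projection`, `array_sketch`, and `add_sketches` are hypothetical names for this example.

```python
import random


def make_projection(rows, length, seed=0):
    """Shared random +/-1 matrix; every site must use the same one."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(length)] for _ in range(rows)]


def array_sketch(signal, projection):
    """Linear sketch: one inner product of the signal per projection row."""
    return [sum(r * x for r, x in zip(row, signal)) for row in projection]


def add_sketches(sketches):
    """Element-wise sum of sketches computed at different locations."""
    return [sum(values) for values in zip(*sketches)]


projection = make_projection(rows=8, length=5, seed=1)
site_a = [1, 0, 2, 0, 3]
site_b = [0, 4, 0, 1, 1]

combined = add_sketches(
    [array_sketch(site_a, projection), array_sketch(site_b, projection)]
)
direct = array_sketch([a + b for a, b in zip(site_a, site_b)], projection)
```

Because the sketch is linear, `combined` equals `direct`: the sketch of the summed signal can be obtained from the per-site sketches alone, without exchanging the individual signals.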
Abstract:
The present invention relates to a method and apparatus for optimizing queries. The present invention discloses an efficient method for providing answers to queries under parametric aggregation constraints.
Abstract:
Certain exemplary embodiments provide a method comprising: automatically: receiving a plurality of elements for each of a plurality of continuous data streams; treating the plurality of elements as a first data stream matrix that defines a first dimensionality; reducing the first dimensionality of the first data stream matrix to obtain a second data stream matrix; computing a singular value decomposition of the second data stream matrix; and based on the singular value decomposition of the second data stream matrix, quantifying approximate linear correlations between the plurality of elements.
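The pipeline in this abstract (reduce the dimensionality of the stream matrix, then decompose the smaller matrix) can be sketched as follows. Several parts are assumptions for illustration: the reduction is a random ±1 projection over the time axis, and instead of a full singular value decomposition the code computes only the leading singular value and right singular vector by power iteration, to stay within the standard library.

```python
import math
import random


def reduce_rows(A, k, seed=0):
    """Project an n x m matrix (time x streams) down to k x m (assumed method)."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    B = [[0.0] * m for _ in range(k)]
    for i in range(k):
        for t in range(n):
            r = rng.choice((-1.0, 1.0)) / math.sqrt(k)
            for j in range(m):
                B[i][j] += r * A[t][j]
    return B


def leading_singular_pair(B, iters=100):
    """Top singular value and right singular vector via power iteration on B^T B."""
    m = len(B[0])
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        Bv = [sum(row[j] * v[j] for j in range(m)) for row in B]
        w = [sum(B[i][j] * Bv[i] for i in range(len(B))) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(sum(row[j] * v[j] for j in range(m)) ** 2 for row in B))
    return sigma, v


# Two streams over 50 time steps; the second is exactly twice the first.
A = [[float(t), 2.0 * t] for t in range(1, 51)]
B = reduce_rows(A, k=4)
sigma, v = leading_singular_pair(B)
ratio = v[1] / v[0]
```

Here the reduced matrix has 4 rows instead of 50, yet its leading right singular vector still points along (1, 2), so `ratio` recovers the linear relation between the two streams. For noisy data the recovered relation would only be approximate.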
Abstract:
A system and method are provided for monitoring dynamic data from distributed sources through the use of histograms. In the method, an array sketch of a digital signal is determined, a robust histogram is constructed from the array sketch, and an output histogram is constructed from the array sketch and the robust histogram via a hybrid histogram. Dyadic intervals of a representation of the array sketch are used in constructing the robust histogram.
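For readers unfamiliar with dyadic intervals, the sketch below shows the standard decomposition of an integer range into intervals of the form [j·2^l, (j+1)·2^l). How the method actually uses dyadic intervals when building the robust histogram is not stated in the abstract; `dyadic_cover` is a hypothetical helper that only illustrates the concept.

```python
def dyadic_cover(lo, hi):
    """Split [lo, hi) into the fewest dyadic intervals [j*2^l, (j+1)*2^l).

    Assumes non-negative integers with lo <= hi.
    """
    out = []
    while lo < hi:
        # Largest power of two that divides lo (any power of two if lo == 0).
        size = lo & -lo if lo else 1 << hi.bit_length()
        # Shrink until the interval fits inside the remaining range.
        while size > hi - lo:
            size >>= 1
        out.append((lo, lo + size))
        lo += size
    return out


buckets = dyadic_cover(3, 11)  # [(3, 4), (4, 8), (8, 10), (10, 11)]
```

Any range over a domain of size N breaks into at most about 2·log2(N) such intervals, which is why estimates kept per dyadic interval can be combined to answer arbitrary range queries.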
Abstract:
A device and a method are provided. Approximate match operations are performed for each of a group of attributes for each of a group of tuples with respect to a query to create a respective ranking for each of the group of attributes. The rankings of the group of attributes are combined to provide a ranking score for each of the group of tuples. Data representing a ranking score of each of the group of tuples is generated according to a position of a respective ranking of each one of the group of tuples for a first k positions of the ranking. K of top ranked ones of the group of tuples are identified based at least in part on the generated data, wherein a number of the group of tuples is n and k
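The per-attribute ranking and combination described in this abstract can be sketched as below. The abstract does not name the approximate match function or the combination rule, so this example assumes character-bigram Jaccard similarity per attribute and a sum of rank positions as the combined score; the function names and sample tuples are hypothetical.

```python
def bigram_similarity(a, b):
    """Jaccard similarity of character bigrams (assumed approximate match)."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def rank_by_attribute(tuples, query, attr):
    """Rank position (1 = best) of every tuple id for a single attribute."""
    ordered = sorted(
        tuples,
        key=lambda tid: (-bigram_similarity(tuples[tid][attr], query[attr]), tid),
    )
    return {tid: pos for pos, tid in enumerate(ordered, start=1)}


def top_k(tuples, query, k):
    """Combine per-attribute rankings by summing positions; lower is better."""
    rankings = [rank_by_attribute(tuples, query, attr) for attr in query]
    combined = {tid: sum(r[tid] for r in rankings) for tid in tuples}
    return sorted(tuples, key=lambda tid: (combined[tid], tid))[:k]


tuples = {
    1: {"name": "jon smith", "city": "new york"},
    2: {"name": "john smith", "city": "newyork"},
    3: {"name": "jane doe", "city": "boston"},
}
query = {"name": "john smith", "city": "new york"}
best = top_k(tuples, query, k=2)
```

This version ranks all tuples on every attribute before combining. The abstract refers to using only the first k positions of each ranking, which this simplified sketch does not attempt to reproduce.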