Method, apparatus and programmed medium for clustering databases with categorical attributes
    1.
    Granted invention patent
    Status: no longer in force

    Publication No.: US6049797A

    Publication date: 2000-04-11

    Application No.: US55940

    Filing date: 1998-04-07

    IPC classification: G06F17/30 G06K9/62

    Abstract: The present invention relates to a computer method, apparatus and programmed medium for clustering databases containing data with categorical attributes. The present invention assigns a pair of points to be neighbors if their similarity exceeds a certain threshold. The similarity value for pairs of points can be based on non-metric information. The present invention determines a total number of links between each cluster and every other cluster based upon the neighbors of the clusters. A goodness measure between each cluster and every other cluster is then calculated, based upon the total number of links between each cluster and every other cluster and the total number of points within each cluster and every other cluster. The present invention merges the two clusters with the best goodness measure. Thus, clustering is performed accurately and efficiently by merging data based on the number of links between the data to be clustered.
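The neighbor, link and goodness steps described in this abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the Jaccard similarity, the normalising exponent in `goodness`, and all function names are assumptions made for the example, since the abstract states only that goodness depends on link counts and cluster sizes.

```python
from itertools import combinations

def jaccard(a, b):
    """Assumed similarity for categorical records given as sets of items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def neighbor_sets(points, theta):
    """Two points are neighbors when their similarity exceeds theta."""
    nbrs = [set() for _ in points]
    for i, j in combinations(range(len(points)), 2):
        if jaccard(points[i], points[j]) > theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def link(nbrs, i, j):
    """Number of neighbors that points i and j have in common."""
    return len(nbrs[i] & nbrs[j])

def cross_links(nbrs, ci, cj):
    """Total number of links between two clusters (lists of point indices)."""
    return sum(link(nbrs, p, q) for p in ci for q in cj)

def goodness(nbrs, ci, cj, theta):
    """Links between clusters, normalised by cluster sizes.
    The exponent is an assumption of this sketch, not taken from the abstract."""
    e = 1 + 2 * (1 - theta) / (1 + theta)
    ni, nj = len(ci), len(cj)
    return cross_links(nbrs, ci, cj) / ((ni + nj) ** e - ni ** e - nj ** e)

def cluster(points, k, theta):
    """Repeatedly merge the pair of clusters with the best goodness measure."""
    nbrs = neighbor_sets(points, theta)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ab: goodness(nbrs, clusters[ab[0]], clusters[ab[1]], theta))
        if cross_links(nbrs, clusters[a], clusters[b]) == 0:
            break  # no linked clusters remain, so stop merging
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Because merging is driven by shared neighbors rather than by a distance between points, the sketch never needs a metric on the categorical records, only a pairwise similarity.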

    Programmed medium for clustering large databases
    2.
    Granted invention patent
    Status: no longer in force

    Publication No.: US6092072A

    Publication date: 2000-07-18

    Application No.: US55941

    Filing date: 1998-04-07

    IPC classification: G06F17/30

    Abstract: The present invention relates to a computer method, apparatus and programmed medium for clustering large databases. The present invention represents each cluster to be merged by a constant number of well scattered points that capture the shape and extent of the cluster. The chosen scattered points are shrunk towards the mean of the cluster by a shrinking fraction to form a representative set of data points that efficiently represent the cluster. The clusters with the closest pair of representative points are merged to form a new cluster. The use of an efficient representation of the clusters allows the present invention to obtain improved clustering while efficiently eliminating outliers.
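A minimal sketch of the representative-point idea in this abstract, under stated assumptions: the farthest-point heuristic used to pick "well scattered" points, the Euclidean distance, and the names `scattered_points`, `representatives` and `cluster_distance` are illustrative choices, not details given in the abstract.

```python
import math

def mean(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def scattered_points(points, c):
    """Pick up to c well scattered points (assumed farthest-point heuristic):
    start with the point farthest from the mean, then repeatedly add the
    point farthest from everything chosen so far."""
    distinct = list(dict.fromkeys(points))  # drop exact duplicates, keep order
    centroid = mean(points)
    chosen = [max(distinct, key=lambda p: math.dist(p, centroid))]
    while len(chosen) < min(c, len(distinct)):
        chosen.append(max((p for p in distinct if p not in chosen),
                          key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen

def representatives(points, c, alpha):
    """Shrink each scattered point towards the cluster mean by fraction alpha."""
    centroid = mean(points)
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centroid))
            for p in scattered_points(points, c)]

def cluster_distance(reps_a, reps_b):
    """Distance between clusters: the closest pair of representative points."""
    return min(math.dist(p, q) for p in reps_a for q in reps_b)
```

For the square `[(0, 0), (4, 0), (0, 4), (4, 4)]` with `c=2` and `alpha=0.5`, the sketch picks two opposite corners and pulls each halfway to the centre. Shrinking is what dampens the influence of outliers, since points far from the mean move the most.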

    Computer implemented scalable, incremental and parallel clustering based on weighted divide and conquer
    3.
    Granted invention patent
    Status: in force

    Publication No.: US06907380B2

    Publication date: 2005-06-14

    Application No.: US10726254

    Filing date: 2003-12-01

    Abstract: A technique that uses a weighted divide and conquer approach for clustering a set S of n data points to find k final centers. The technique comprises 1) partitioning the set S into P disjoint pieces S1, ..., SP; 2) for each piece Si, determining a set Di of k intermediate centers; 3) assigning each data point in each piece Si to the nearest one of the k intermediate centers; 4) weighting each of the k intermediate centers in each set Di by the number of points in the corresponding piece Si assigned to that center; and 5) clustering the weighted intermediate centers together to find said k final centers, the clustering performed using a specific error metric and a clustering method A.
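The five steps in this abstract might be sketched as below. The abstract leaves the error metric and the clustering method A open; this sketch assumes squared Euclidean error and a weighted Lloyd-style k-means with farthest-first seeding as a stand-in for A. All function names are hypothetical, and the input is assumed to be a non-empty list of tuples.

```python
def sq_dist(p, q):
    """Squared Euclidean distance (the assumed error metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def weighted_kmeans(points, weights, k, iters=20):
    """Stand-in for clustering method A: Lloyd iterations on weighted points,
    seeded deterministically with a farthest-first traversal."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            j = nearest(p, centers)
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Keep the old center when no weight was assigned to it.
        centers = [tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centers[j]
                   for j in range(k)]
    return centers

def divide_and_conquer(points, k, pieces):
    """Steps 1-5: partition, cluster each piece, weight the intermediate
    centers by their assigned point counts, then cluster the weighted centers."""
    inter_centers, inter_weights = [], []
    size = -(-len(points) // pieces)  # ceiling division: points per piece
    for start in range(0, len(points), size):
        piece = points[start:start + size]                        # step 1
        centers = weighted_kmeans(piece, [1.0] * len(piece), k)   # step 2
        counts = [0] * k
        for p in piece:
            counts[nearest(p, centers)] += 1                      # steps 3 and 4
        inter_centers.extend(centers)
        inter_weights.extend(counts)
    return weighted_kmeans(inter_centers, inter_weights, k)       # step 5
```

Each piece is processed independently of the others, which is what makes the scheme a candidate for incremental or parallel execution: only the k weighted centers per piece need to be kept for the final step.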

    Method and apparatus for using histograms to produce data summaries
    4.
    Granted invention patent
    Status: in force

    Publication No.: US07965643B1

    Publication date: 2011-06-21

    Application No.: US12217958

    Filing date: 2008-07-10

    IPC classification: H04J1/16

    CPC classification: H04L43/045 H04L63/1408

    Abstract: A system and method are provided for summarizing dynamic data from distributed sources through the use of histograms. In particular, the method comprises receiving a first data signal at a first location, determining a first array sketch of the first data signal, and constructing a first output histogram from the first array sketch and a first robust histogram via a first hybrid histogram. Array sketches of a number of data signals may be calculated, and added to yield a single vector sum. The histogram is constructed from the vector sum. In that way, the vector sum may be analyzed without revealing the individual data signals that form the basis of the sum.
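The property this abstract relies on, that sketches of separate signals can be added and the sum analysed on its own, holds for any linear sketch. The example below illustrates it with a random +1/-1 projection. The abstract does not say how an "array sketch" is built, so `sign_matrix` and `sketch` are assumptions chosen only to show the linearity; the histogram construction itself is not attempted here.

```python
import random

def sign_matrix(rows, cols, seed=0):
    """Shared random +1/-1 matrix; every site must use the same seed."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]

def sketch(signal, signs):
    """Linear sketch: one inner product of the signal per row of signs."""
    return [sum(s * x for s, x in zip(row, signal)) for row in signs]

def add_sketches(sketches):
    """Coordinate-wise sum of sketches computed at different locations."""
    return [sum(values) for values in zip(*sketches)]
```

Because `sketch` is linear, `add_sketches([sketch(a, m), sketch(b, m)])` equals `sketch` applied to the element-wise sum of `a` and `b`, so a collector that only receives the sketches can work with the combined signal without seeing `a` or `b` individually.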

    Apparatus and method for correlating synchronous and asynchronous data streams
    6.
    Granted invention patent
    Status: in force

    Publication No.: US08131792B1

    Publication date: 2012-03-06

    Application No.: US12125973

    Filing date: 2008-05-23

    CPC classification: G06K9/00536

    Abstract: Certain exemplary embodiments provide a method comprising: automatically: receiving a plurality of elements for each of a plurality of continuous data streams; treating the plurality of elements as a first data stream matrix that defines a first dimensionality; reducing the first dimensionality of the first data stream matrix to obtain a second data stream matrix; computing a singular value decomposition of the second data stream matrix; and based on the singular value decomposition of the second data stream matrix, quantifying approximate linear correlations between the plurality of elements.
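A rough sketch of the pipeline in this abstract: stack the streams as matrix rows, reduce the dimensionality, then inspect the singular values. Several substitutions are made and should be read as assumptions. A Gaussian random projection stands in for the unspecified reduction step, and power iteration on the Gram matrix stands in for a full singular value decomposition, recovering only the largest singular value. The function names are hypothetical.

```python
import math
import random

def random_projection(rows, d, seed=0):
    """Reduce each length-n row to length d with a Gaussian random matrix."""
    rng = random.Random(seed)
    n = len(rows[0])
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)] for _ in range(n)]
    return [[sum(r[t] * proj[t][c] for t in range(n)) for c in range(d)] for r in rows]

def gram(rows):
    """Matrix of inner products between streams (rows times rows transposed)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

def dominant_energy(rows, iters=100):
    """Fraction of total squared singular value mass carried by the largest
    singular value. Values near 1 suggest the streams are close to linearly
    dependent on one common direction."""
    g = gram(rows)
    total = sum(g[i][i] for i in range(len(g)))
    if total == 0.0:
        return 0.0
    # Start from the Gram column with the largest diagonal entry.
    start = max(range(len(g)), key=lambda i: g[i][i])
    v = g[start][:]
    eigenvalue = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        v = [x / norm for x in v]
        gv = [sum(a * b for a, b in zip(row, v)) for row in g]
        eigenvalue = sum(a * b for a, b in zip(v, gv))  # Rayleigh quotient
        v = gv
    return eigenvalue / total
```

If every stream is a scalar multiple of one base stream, the reduced matrix still has rank one and `dominant_energy` returns a value very close to 1. A production version would compute the full decomposition with a numerical library rather than this hand-rolled iteration.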

    Apparatus and method for correlating synchronous and asynchronous data streams
    8.
    Granted invention patent
    Status: in force

    Publication No.: US07437397B1

    Publication date: 2008-10-14

    Application No.: US10822316

    Filing date: 2004-04-12

    IPC classification: G06F17/15

    CPC classification: G06K9/00536

    Abstract: Certain exemplary embodiments provide a method comprising: automatically: receiving a plurality of elements for each of a plurality of continuous data streams; treating the plurality of elements as a first data stream matrix that defines a first dimensionality; reducing the first dimensionality of the first data stream matrix to obtain a second data stream matrix; computing a singular value decomposition of the second data stream matrix; and based on the singular value decomposition of the second data stream matrix, quantifying approximate linear correlations between the plurality of elements.

    Apparatus and method for merging results of approximate matching operations
    9.
    Granted invention patent
    Status: in force

    Publication No.: US07415461B1

    Publication date: 2008-08-19

    Application No.: US11195888

    Filing date: 2005-08-03

    IPC classification: G06F7/00

    Abstract: A device and a method are provided. Approximate match operations are performed for each of a group of attributes for each of a group of tuples with respect to a query to create a respective ranking for each of the group of attributes. The rankings of the group of attributes are combined to provide a ranking score for each of the group of tuples. Data representing a ranking score of each of the group of tuples is generated according to a position of a respective ranking of each one of the group of tuples for a first k positions of the ranking. K of top ranked ones of the group of tuples are identified based at least in part on the generated data, wherein a number of the group of tuples is n and k
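One way to picture the merging step in this abstract is a position-based aggregation over the first k positions of each attribute ranking. The scoring rule below (sum of positions, with tuples absent from a top-k list charged k + 1) is an assumption made for illustration, as the abstract is cut off before any rule is spelled out. `merge_top_k` is a hypothetical name.

```python
def merge_top_k(rankings, k):
    """Combine per-attribute rankings (lists of tuple ids, best match first)
    into an overall top-k list. Lower combined score means a better rank."""
    # Position of each tuple id within the first k entries of every ranking.
    positions = [{tid: pos for pos, tid in enumerate(r[:k], start=1)}
                 for r in rankings]
    candidates = set().union(*positions)
    # Assumed scoring rule: sum of positions, k + 1 when outside the top k.
    scores = {tid: sum(p.get(tid, k + 1) for p in positions) for tid in candidates}
    # Ties are broken by tuple id, so ids must be mutually comparable.
    return sorted(candidates, key=lambda tid: (scores[tid], tid))[:k]
```

With rankings `["a", "b", "c", "d"]`, `["b", "a", "d", "c"]` and `["a", "c", "b", "d"]` and `k=2`, only the first two positions of each list are read: `a` scores 4, `b` scores 6, `c` scores 8, so the merged result is `["a", "b"]`.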
