Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications
    1.
    Granted invention patent
    Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications (Active)

    Publication No.: US07333998B2

    Publication date: 2008-02-19

    Application No.: US10808064

    Filing date: 2004-03-24

    IPC classification: G06F17/30

    Abstract: A system that incorporates an interactive graphical user interface for visualizing clusters (categories) and segments (summarized clusters) of data. Specifically, the system automatically categorizes incoming case data into clusters, summarizes those clusters into segments, determines similarity measures for the segments, scores the selected segments through the similarity measures, and then forms and visually depicts hierarchical organizations of those selected clusters. The system also automatically and dynamically reduces, as necessary, a depth of the hierarchical organization, through elimination of unnecessary hierarchical levels and inter-nodal links, based on similarity measures of segments or segment groups. Attribute/value data that tends to meaningfully characterize each segment is also scored, rank ordered based on normalized scores, and then graphically displayed. The system permits a user to browse through the hierarchy, and, to readily comprehend segment inter-relationships, selectively expand and contract the displayed hierarchy, as desired, as well as to compare two selected segments or segment groups together and graphically display the results of that comparison. An alternative discriminant-based cluster scoring technique is also presented.

    Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications
    2.
    Granted invention patent
    Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications (Active)

    Publication No.: US06742003B2

    Publication date: 2004-05-25

    Application No.: US09845151

    Filing date: 2001-04-30

    IPC classification: G06F17/30

    Abstract: A system that incorporates an interactive graphical user interface for visualizing clusters (categories) and segments (summarized clusters) of data. Specifically, the system automatically categorizes incoming case data into clusters, summarizes those clusters into segments, determines similarity measures for the segments, scores the selected segments through the similarity measures, and then forms and visually depicts hierarchical organizations of those selected clusters. The system also automatically and dynamically reduces, as necessary, a depth of the hierarchical organization, through elimination of unnecessary hierarchical levels and inter-nodal links, based on similarity measures of segments or segment groups. Attribute/value data that tends to meaningfully characterize each segment is also scored, rank ordered based on normalized scores, and then graphically displayed. The system permits a user to browse through the hierarchy, and, to readily comprehend segment inter-relationships, selectively expand and contract the displayed hierarchy, as desired, as well as to compare two selected segments or segment groups together and graphically display the results of that comparison. An alternative discriminant-based cluster scoring technique is also presented.

    Varying cluster number in a scalable clustering system for use with large databases
    3.
    Granted invention patent
    Varying cluster number in a scalable clustering system for use with large databases (Active)

    Publication No.: US06449612B1

    Publication date: 2002-09-10

    Application No.: US09607365

    Filing date: 2000-06-30

    IPC classification: G06F7/04

    Abstract: In one exemplary embodiment the invention provides a data mining system for use in finding clusters of data items in a database or any other data storage medium. A portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters in a clustering model. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Each time data is read from the database, a holdout set of data is used to evaluate the then-current model as well as other possible cluster models chosen from a candidate set of cluster models. The evaluation of the holdout data set allows a cluster model with a different cluster number K′ to be chosen if that model more accurately models the data based upon the evaluation of the holdout set.
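The holdout step in this abstract can be illustrated with a minimal sketch. The abstract does not say how candidate models are scored, so the average log-likelihood under a 1-D Gaussian mixture used here is an assumption, and the names `log_likelihood` and `choose_model` are hypothetical.

```python
import math

def log_likelihood(holdout, model):
    """Average log-likelihood of 1-D holdout points under a Gaussian mixture.

    `model` is a list of (weight, mean, variance) tuples, one per cluster.
    Scoring by log-likelihood is an assumption, not taken from the abstract.
    """
    total = 0.0
    for x in holdout:
        density = sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in model
        )
        total += math.log(density)
    return total / len(holdout)

def choose_model(holdout, candidates):
    """Return the cluster number K whose candidate model scores best on the holdout set.

    `candidates` maps a cluster number K to a mixture model with K components.
    """
    return max(candidates, key=lambda k: log_likelihood(holdout, candidates[k]))

# Two tight groups of points: the K=2 candidate should fit the holdout set better.
holdout = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
candidates = {
    1: [(1.0, 2.5, 7.0)],
    2: [(0.5, 0.0, 0.05), (0.5, 5.0, 0.05)],
}
best_k = choose_model(holdout, candidates)
```

Scoring on data that was not used for fitting is what lets a different cluster number K′ win on merit rather than simply because a larger model fits the training buffer more closely.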

    Iterative validation and sampling-based clustering using error-tolerant frequent item sets
    4.
    Granted invention patent
    Iterative validation and sampling-based clustering using error-tolerant frequent item sets (Expired)

    Publication No.: US06490582B1

    Publication date: 2002-12-03

    Application No.: US09500172

    Filing date: 2000-02-08

    IPC classification: G06F17/30

    Abstract: Iterative validation for efficiently determining error-tolerant frequent itemsets is disclosed. A description of the application of error-tolerant frequent itemsets to efficiently determining clusters, as well as to initializing clustering algorithms, is also given. In one embodiment, a method determines a sample set of error-tolerant frequent itemsets (ETFs) within a uniform random sample of data within a database. This sample set of ETFs is independently validated, so that, for example, spurious ETFs and spurious dimensions within the ETFs can be removed. The validated sample set of ETFs is added to the set of ETFs for the database. This process is repeated with additional uniform samples that are mutually exclusive from prior uniform samples, to continue building the database's set of ETFs, until no new sample sets can be found. The method is significantly more efficient than disk-based methods in the prior art, and the data clusters found are often not discovered by traditional clustering algorithms in the prior art.

    Scalable system for clustering of large databases having mixed data attributes
    5.
    Granted invention patent
    Scalable system for clustering of large databases having mixed data attributes (Active)

    Publication No.: US06581058B1

    Publication date: 2003-06-17

    Application No.: US09700606

    Filing date: 2001-01-31

    IPC classification: G06F17/30

    Abstract: One exemplary embodiment of a scalable clustering algorithm accesses a database of records having attributes or data fields of both enumerated discrete and ordered values and brings a portion of the data records into a rapid access memory. A cluster model for the data includes a table of probabilities for the enumerated, discrete data fields of the data records. The cluster model for data fields that are ordered comprises a mean and spread of the cluster. The cluster model is updated from the database records brought into the rapid access memory. At least some of the database records in the rapid access memory are summarized and stored within the rapid access memory. A criterion is then evaluated to determine if further data should be accessed from the database to further cluster data records in the database. Based on the evaluating step, additional database records in the database are accessed and brought into the rapid access memory for further updating of the cluster model.

    Scalable system for clustering of large databases
    6.
    Granted invention patent
    Scalable system for clustering of large databases (Expired)

    Publication No.: US06374251B1

    Publication date: 2002-04-16

    Application No.: US09040219

    Filing date: 1998-03-17

    IPC classification: G06F17/00

    Abstract: A data mining system for use in finding clusters of data items in a database or any other data storage medium. The clusters are used in categorizing the data in the database into K different clusters within each of M models. An initial set of estimates (or guesses) of the parameters of each cluster of each model to be explored (e.g. centroids in K-means) is provided from some source. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original guesses at the parameters of the model in each of the K clusters over all M models. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criteria have been met.
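The buffer-and-compress loop described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: it handles a single model (M = 1), compresses every buffered point immediately rather than retaining some, and stops when centroids move less than a tolerance. The names `ClusterStats`, `nearest`, and `scalable_kmeans` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ClusterStats:
    """Sufficient statistics for one cluster: count, per-dimension sum and sum of squares."""
    n: int = 0
    s: list = field(default_factory=list)
    ss: list = field(default_factory=list)

    def add(self, point):
        if not self.s:
            self.s = [0.0] * len(point)
            self.ss = [0.0] * len(point)
        self.n += 1
        for d, x in enumerate(point):
            self.s[d] += x
            self.ss[d] += x * x

    def mean(self):
        return [v / self.n for v in self.s]

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )

def scalable_kmeans(batches, centroids, tol=1e-6):
    """Update centroids one buffer at a time, keeping only sufficient statistics."""
    stats = [ClusterStats() for _ in centroids]
    for batch in batches:                      # each batch stands in for one buffer load
        for point in batch:
            stats[nearest(point, centroids)].add(point)   # compress, then discard the point
        new = [st.mean() if st.n else c for st, c in zip(stats, centroids)]
        shift = max(sum((a - b) ** 2 for a, b in zip(n_, c)) for n_, c in zip(new, centroids))
        centroids = new
        if shift < tol:                        # assumed stopping criterion
            break
    return centroids, stats

batches = [[(0.0, 0.0), (10.0, 10.0)], [(1.0, 1.0), (11.0, 11.0)]]
centroids, stats = scalable_kmeans(batches, [[1.0, 1.0], [9.0, 9.0]])
```

Because the count, sum, and sum of squares are enough to recover each cluster's mean and spread, the raw records never need to be revisited once they are folded into the statistics.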

    Data clustering using error-tolerant frequent item sets
    7.
    Granted invention patent
    Data clustering using error-tolerant frequent item sets (Active)

    Publication No.: US06567936B1

    Publication date: 2003-05-20

    Application No.: US09500173

    Filing date: 2000-02-08

    IPC classification: G06F11/00

    Abstract: A generalization of frequent item sets to error-tolerant frequent item sets (ETFs) is disclosed, together with its application in data clustering, using error-tolerant frequent item sets either to build clusters or as an initialization technique for standard clustering algorithms. Efficient, feasible computational algorithms for computing ETFs from very large databases are presented. In one embodiment, a method determines a plurality of weak ETFs, which are strongly tolerant of errors, and determines a plurality of strong ETFs therefrom, which are less tolerant of errors. The resulting clusters can be used as an initial model for a standard clustering approach, or may themselves be used as the end clusters. In one embodiment, the data covered by the strong clusters is removed from the data, and the process is repeated, until no more weak clusters can be found. The invention includes methods for constructing ETFs from more general data types: data sets that include categorical discrete, continuous, and binary attributes.
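The abstract does not define error tolerance precisely. The sketch below assumes one common formulation: an item set counts as a strong ETF when at least a fraction `kappa` of rows each contain at least a fraction `1 - eps` of its items. That definition, the parameter names, and both function names are assumptions made for illustration.

```python
def strong_etf_support(rows, itemset, eps):
    """Fraction of rows that contain at least (1 - eps) of the items in `itemset`."""
    needed = (1 - eps) * len(itemset)
    hits = sum(1 for row in rows if len(row & itemset) >= needed)
    return hits / len(rows)

def is_strong_etf(rows, itemset, eps, kappa):
    """True if the error-tolerant support of `itemset` reaches the threshold `kappa`."""
    return strong_etf_support(rows, itemset, eps) >= kappa

# Each row is the set of items present in one record.
rows = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"d"}]
itemset = {"a", "b", "c"}

tolerant = is_strong_etf(rows, itemset, eps=0.34, kappa=0.5)  # rows may miss one item
exact = is_strong_etf(rows, itemset, eps=0.0, kappa=0.5)      # classic frequent item set
```

With `eps=0.0` the check reduces to ordinary frequent-item-set support, which shows why the error-tolerant version is described as a generalization.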

    Scalable system for expectation maximization clustering of large databases
    8.
    Granted invention patent
    Scalable system for expectation maximization clustering of large databases (Expired)

    Publication No.: US06263337B1

    Publication date: 2001-07-17

    Application No.: US09083906

    Filing date: 1998-05-22

    IPC classification: G06F17/00

    Abstract: In one exemplary embodiment the invention provides a data mining system for use in finding clusters of data items in a database or any other data storage medium. Before the data evaluation begins, a choice is made of the number M of models to be explored and the number K of clusters within each of the M models. The clusters are used in categorizing the data in the database into K different clusters within each model. An initial set of estimates for a data distribution of each model to be explored is provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters over all M models. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database.

    Method for refining the initial conditions for clustering with applications to small and large database clustering
    9.
    Granted invention patent
    Method for refining the initial conditions for clustering with applications to small and large database clustering (Expired)

    Publication No.: US6115708A

    Publication date: 2000-09-05

    Application No.: US34834

    Filing date: 1998-03-04

    IPC classification: G06F17/30

    Abstract: As an optimization problem, clustering data (unsupervised learning) is known to be a difficult problem. Most practical approaches use a heuristic, typically gradient-descent, algorithm to search for a solution in the huge space of possible solutions. Such methods are by definition sensitive to starting points, and it is well known that clustering algorithms are extremely sensitive to initial conditions. Most methods for guessing an initial solution simply make random guesses. In this paper we present a method that takes an initial condition and efficiently produces a refined starting condition. The method is applicable to a wide class of clustering algorithms for discrete and continuous data. We demonstrate how this method is applied to the popular K-means clustering algorithm and show that refined initial starting points indeed lead to improved solutions. The technique can be used as an initializer for other clustering solutions. The method is based on an efficient technique for estimating the modes of a distribution and runs in time guaranteed to be less than overall clustering time for large data sets. The method is also scalable and hence can be efficiently used on huge databases to refine starting points for scalable clustering algorithms in data mining applications.
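The abstract does not spell out how the refined starting condition is computed. One plausible reading, sketched below as an assumption, is to run K-means on several small random subsamples, pool the resulting centroids, re-cluster the pool starting from each subsample solution, and keep the solution with the lowest distortion over the pool. All function names and parameters here are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iters=20):
    """Plain K-means from the given starting centroids; empty clusters keep their centroid."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            groups[k].append(p)
        new = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids

def distortion(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(dist2(p, c) for c in centroids) for p in points)

def refine(points, initial, n_subsamples=5, subsample_size=20, seed=0):
    """Return a refined starting condition derived from small random subsamples."""
    rng = random.Random(seed)
    solutions = [
        kmeans(rng.sample(points, min(subsample_size, len(points))), initial)
        for _ in range(n_subsamples)
    ]
    pooled = [c for sol in solutions for c in sol]          # all subsample centroids
    smoothed = [kmeans(pooled, sol) for sol in solutions]   # re-cluster the pool
    return min(smoothed, key=lambda cs: distortion(pooled, cs))

# Two well-separated groups and a poor initial guess that sits inside one of them.
points = [((i % 5) * 0.1, 0.0) for i in range(50)] + [(10 + (i % 5) * 0.1, 0.0) for i in range(50)]
refined = sorted(refine(points, [(0.0, 0.0), (1.0, 0.0)]))
```

The refined centroids would then be handed to the full clustering run as its starting point. Because each subsample is small, the refinement cost stays well below that of clustering the whole data set.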

    Clustering of databases having mixed data attributes
    10.
    Granted invention patent
    Clustering of databases having mixed data attributes (Active)

    Publication No.: US07246125B2

    Publication date: 2007-07-17

    Application No.: US09886771

    Filing date: 2001-06-21

    IPC classification: G06F17/30 G06F15/16

    Abstract: A computer data processing system and a method for clustering data in a database. The method comprises providing a database having a number of data records with both discrete and continuous attributes; grouping together data records from the database that have specified discrete attribute configurations; clustering data records having the same or similar specified discrete attribute configuration, based on the continuous attributes, to produce an intermediate set of data clusters; and merging together clusters from the intermediate set of data clusters to produce a clustering model.
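The three steps in this abstract (group by discrete configuration, cluster each group on its continuous attributes, merge the intermediate clusters) can be sketched as follows. The abstract names neither the clustering routine nor the merge criterion, so the one-pass radius-based clustering and the centroid-distance merge rule below are assumptions, as are all function and parameter names.

```python
from collections import defaultdict

def mean(vectors):
    """Per-dimension mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def leader_cluster(vectors, radius):
    """One-pass clustering: join the first cluster whose centroid is within `radius`."""
    clusters = []
    for v in vectors:
        for members in clusters:
            if dist2(v, mean(members)) <= radius ** 2:
                members.append(v)
                break
        else:
            clusters.append([v])
    return clusters

def cluster_mixed(records, radius, merge_radius):
    """Cluster (discrete_config, continuous_vector) records in three steps."""
    # Step 1: group records that share a discrete attribute configuration.
    groups = defaultdict(list)
    for discrete, continuous in records:
        groups[discrete].append(continuous)
    # Step 2: cluster each group on its continuous attributes (intermediate clusters).
    intermediate = [
        (config, members)
        for config, vectors in groups.items()
        for members in leader_cluster(vectors, radius)
    ]
    # Step 3: merge intermediate clusters whose centroids are close (assumed criterion).
    model = []
    for config, members in intermediate:
        for entry in model:
            if dist2(mean(members), mean(entry["members"])) <= merge_radius ** 2:
                entry["configs"].add(config)
                entry["members"].extend(members)
                break
        else:
            model.append({"configs": {config}, "members": list(members)})
    return model

records = [
    (("red", "S"), (1.0, 1.0)),
    (("red", "S"), (1.2, 0.9)),
    (("blue", "S"), (1.1, 1.0)),
    (("blue", "L"), (8.0, 8.0)),
    (("blue", "L"), (8.1, 7.9)),
]
model = cluster_mixed(records, radius=1.0, merge_radius=1.0)
```

Grouping on the discrete configuration first means that distances are only ever computed over continuous attributes, which sidesteps the question of how to measure distance between categorical values.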
