Method to reduce I/O for hierarchical data partitioning methods

Granted invention patent

US6055539A Method to reduce I/O for hierarchical data partitioning methods (expired)


Patent title: Method to reduce I/O for hierarchical data partitioning methods
Application number: US884080
Filing date: 1997-06-27
Publication number: US6055539A
Publication date: 2000-04-25
Inventors: Vineet Singh, Anurag Srivastava
Applicants: Vineet Singh, Anurag Srivastava
Applicant address: Armonk, NY
Patentee: International Business Machines Corporation
Current patentee: International Business Machines Corporation
Current patentee address: Armonk, NY
Main classification: G06F17/30
IPC classification: G06F17/30


Abstract:

A method and system for generating a decision-tree classifier from a training set of records, independent of the system memory size. The method includes the steps of: generating an attribute list for each attribute of the records, sorting the attribute lists for numeric attributes, and generating a decision tree by repeatedly partitioning the records using the attribute lists. For each node, split points are evaluated to determine the best split test for partitioning the records at the node. Preferably, a Gini index and class histograms are used in determining the best splits. The Gini index indicates how well a split point separates the records, while the class histograms reflect the class distribution of the records at the node. Also, a hash table is built as the attribute list of the split attribute is divided among the child nodes, which is then used for splitting the remaining attribute lists of the node. The method reduces I/O read time by combining the read for partitioning the records at a node with the read required for determining the best split test for the child nodes. Further, it requires writes of the records only at one out of n levels of the decision tree, where n ≥ 2. Finally, a novel data layout on disk minimizes disk seek time. The I/O optimizations work in a general environment for hierarchical data partitioning. They also work in a multi-processor environment. After the generation of the decision tree, any prior art pruning methods may be used for pruning the tree.
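
To make the abstract's steps concrete, here is a minimal in-memory Python sketch of one node's processing: building per-attribute lists, scanning a sorted numeric list with class histograms to find the split with the lowest weighted Gini index, and using a hash table of record ids to divide the remaining lists. It is an illustration under assumptions, not the patented implementation: all function names, the tuple layout `(value, class_label, record_id)`, and the midpoint threshold are hypothetical choices, only numeric attributes are handled, and the disk layout and combined-read I/O optimizations are not modeled.

```python
from collections import Counter


def build_attribute_lists(records, attrs):
    """records: list of (attribute_dict, class_label). Returns one list per
    attribute of (value, class_label, record_id), sorted by value."""
    return {
        a: sorted((rec[a], label, rid) for rid, (rec, label) in enumerate(records))
        for a in attrs
    }


def gini(hist, n):
    """Gini index of a class histogram that holds n records."""
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in hist.values())


def best_numeric_split(attr_list):
    """Scan a sorted attribute list once and return (weighted_gini, threshold)
    for the best test `value <= threshold`. The midpoint threshold is an
    assumption made for this sketch."""
    n = len(attr_list)
    below = Counter()                                    # classes already scanned
    above = Counter(label for _, label, _ in attr_list)  # classes still ahead
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label, _ = attr_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # cannot split between two equal values
        n_below = i + 1
        score = (n_below / n) * gini(below, n_below) \
            + ((n - n_below) / n) * gini(above, n - n_below)
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


def split_lists(attr_lists, split_attr, threshold):
    """Record each record id's side in a hash table while dividing the split
    attribute's list, then use the table to divide every list. Filtering keeps
    each child's lists sorted, so no re-sort is needed."""
    side = {rid: value <= threshold for value, _, rid in attr_lists[split_attr]}
    left = {a: [e for e in lst if side[e[2]]] for a, lst in attr_lists.items()}
    right = {a: [e for e in lst if not side[e[2]]] for a, lst in attr_lists.items()}
    return left, right


# Toy training set (made-up values, for illustration only).
records = [
    ({"age": 23, "salary": 15}, "high"),
    ({"age": 17, "salary": 40}, "high"),
    ({"age": 43, "salary": 60}, "low"),
    ({"age": 68, "salary": 50}, "low"),
]
lists = build_attribute_lists(records, ["age", "salary"])
score, threshold = best_numeric_split(lists["age"])
left, right = split_lists(lists, "age", threshold)
```

On this toy data the `age` list separates the two classes perfectly, so the chosen split has a weighted Gini index of zero, and the `salary` lists of both children come out already sorted because the hash-table pass preserves order.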


