3. Method to reduce I/O for hierarchical data partitioning methods
    Type: Granted invention patent
    Status: Expired

    Publication number: US6055539A

    Publication date: 2000-04-25

    Application number: US884080

    Filing date: 1997-06-27

    IPC classification: G06F17/30

    Abstract: A method and system for generating a decision-tree classifier from a training set of records, independent of the system memory size. The method includes the steps of: generating an attribute list for each attribute of the records, sorting the attribute lists for numeric attributes, and generating a decision tree by repeatedly partitioning the records using the attribute lists. For each node, split points are evaluated to determine the best split test for partitioning the records at the node. Preferably, a gini index and class histograms are used in determining the best splits. The gini index indicates how well a split point separates the records while the class histograms reflect the class distribution of the records at the node. Also, a hash table is built as the attribute list of the split attribute is divided among the child nodes, which is then used for splitting the remaining attribute lists of the node. The method reduces I/O read time by combining the read for partitioning the records at a node with the read required for determining the best split test for the child nodes. Further, it requires writes of the records only at one out of n levels of the decision tree, where n ≥ 2. Finally, a novel data layout on disk minimizes disk seek time. The I/O optimizations work in a general environment for hierarchical data partitioning. They also work in a multi-processor environment. After the generation of the decision tree, any prior art pruning methods may be used for pruning the tree.

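The gini-index and hash-table steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`gini`, `best_numeric_split`, `partition_by_hash`), the `(value, class_label, record_id)` layout of an attribute-list entry, and the use of a midpoint as the split value are all hypothetical. The I/O optimizations and the on-disk layout are not modeled.

```python
from collections import Counter


def gini(hist, total):
    """Gini index of a class histogram: 1 minus the sum of squared class fractions."""
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in hist.values())


def best_numeric_split(attribute_list):
    """Find the best test `value <= split` for one numeric attribute.

    attribute_list holds (value, class_label, record_id) entries that are
    already sorted by value. One scan moves each record from the 'above'
    class histogram to the 'below' histogram and scores every candidate.
    Returns (split_value, weighted_gini); split_value is None if no split exists.
    """
    n = len(attribute_list)
    below = Counter()
    above = Counter(entry[1] for entry in attribute_list)
    best = (None, float("inf"))
    for i in range(n - 1):
        value, label, _ = attribute_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if value == next_value:
            continue  # no split point between equal values
        n_below = i + 1
        n_above = n - n_below
        score = (n_below / n) * gini(below, n_below) + (n_above / n) * gini(above, n_above)
        if score < best[1]:
            best = ((value + next_value) / 2, score)  # midpoint is an assumption
    return best


def partition_by_hash(split_list, other_list, split_value):
    """Split another attribute list using a hash table built from the split attribute.

    The hash table maps record_id to True when the record goes to the left
    child. Filtering preserves the sort order of other_list, so the child
    lists need no re-sorting.
    """
    goes_left = {rid: value <= split_value for value, _, rid in split_list}
    left = [entry for entry in other_list if goes_left[entry[2]]]
    right = [entry for entry in other_list if not goes_left[entry[2]]]
    return left, right
```

Because the attribute list is pre-sorted, each candidate split is scored by updating two histograms rather than rescanning the records, which is what makes a single pass per attribute sufficient.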

4. Structure and method for efficient parallel high-dimensional similarity join
    Type: Granted invention patent
    Status: Expired

    Publication number: US5987468A

    Publication date: 1999-11-16

    Application number: US989847

    Filing date: 1997-12-12

    Abstract: Multidimensional similarity join finds pairs of multi-dimensional points that are within some small distance of each other. Databases in domains such as multimedia and time-series can require a high number of dimensions. The ε-k-d-B tree has been proposed as a data structure that scales better as the number of dimensions increases compared to previous data structures such as the R-tree (and variations), grid-file, and k-d-B tree. We present a cost model of the ε-k-d-B tree and use it to optimize the leaf size. This new leaf size is shown to be better in most situations compared to previous work that used a constant leaf size. We present novel parallel procedures for the ε-k-d-B tree. A load-balancing strategy based on equi-depth histograms is shown to work well for uniform or low-skew situations, whereas another based on weighted, equi-depth histograms works far better for high-skew datasets. The latter strategy is only slightly slower than the former strategy for low-skew datasets. The weights for the latter strategy are based on the same cost model that is used to determine optimal leaf sizes.

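To make the two ideas in the abstract concrete, the sketch below defines the similarity-join problem with a simple sort-and-sweep baseline and computes unweighted equi-depth cut points. It is not the ε-k-d-B tree and does not include the cost model or the weighted histograms; the function names `similarity_join` and `equi_depth_boundaries` and the choice of Euclidean distance are assumptions made for illustration.

```python
import math


def similarity_join(points, eps):
    """Return index pairs (i, j), i < j, whose Euclidean distance is <= eps.

    Baseline only: points are sorted on their first coordinate and each
    point is compared with the points whose first coordinate lies within
    eps of it. An index structure would prune on more dimensions.
    """
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    pairs = []
    for a in range(len(order)):
        i = order[a]
        for b in range(a + 1, len(order)):
            j = order[b]
            if points[j][0] - points[i][0] > eps:
                break  # every later point is farther along dimension 0
            if math.dist(points[i], points[j]) <= eps:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


def equi_depth_boundaries(values, parts):
    """Cut points that split the values into `parts` buckets of near-equal count.

    Each worker would then receive one bucket, so the point counts are
    balanced even when the value distribution is not uniform.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // parts] for k in range(1, parts)]
```

Equal point counts per bucket do not guarantee equal join work on skewed data, which is the motivation the abstract gives for weighting the histogram by the cost model.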

5. Patient rule induction method on large disk resident data sets and parallelization thereof
    Type: Granted invention patent
    Status: In force

    Publication number: US07269586B1

    Publication date: 2007-09-11

    Application number: US09470444

    Filing date: 1999-12-22

    IPC classification: G06F17/30 G06F7/00 G06F17/60

    Abstract: The present invention relates to analysis of large, disk resident data sets using a Patient Rule Induction Method (PRIM) in a computer system wherein a relational data table is initially received. The relational data table includes continuous attributes, discrete attributes, a meta parameter and a cost attribute. The cost attribute represents cost output values based on continuous attribute values and discrete attribute values as inputs. A hyper-rectangle is then formed which encloses a multi-dimensional space defined by the continuous attribute values and the discrete attribute values. The continuous attribute values and the discrete attribute values are represented as points within the multi-dimensional space. A plurality of points along edges of the hyper-rectangle are then removed based on an average of the cost output value from the plurality of points until a count of the points enclosed within the hyper-rectangle equals the meta parameter. Discrete attribute values and continuous attribute values which were removed from the hyper-rectangle are next added along edges of the hyper-rectangle until a sum of the cost output value over the multi-dimensional space enclosed by the hyper-rectangle changes. In a further embodiment a parallel architecture computer system calculates the cost attribute average values over the plurality of points enclosed by the hyper-rectangle in parallel. The invention analyzes large disk resident data sets without having to load the data set into main memory and can be practiced on a parallel computer architecture or a symmetric multi-processor architecture to improve performance.

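The peeling phase described in the abstract (removing points along the edges of the hyper-rectangle until the point count equals the meta parameter) can be sketched in memory as below. Several things here are assumptions rather than details from the abstract: the name `prim_peel`, the peel fraction `alpha`, the choice to keep the peel that maximizes the mean cost of the remaining points, and the restriction to continuous attributes with distinct values. The pasting phase and the disk-resident and parallel processing are not modeled.

```python
def prim_peel(rows, costs, min_points, alpha=0.1):
    """Shrink a bounding box by repeatedly peeling points off one of its faces.

    rows: list of tuples of continuous attribute values.
    costs: list of cost values, parallel to rows.
    min_points: stop when this many points remain (the meta parameter).
    Returns (box, inside), where box is a list of [low, high] per dimension
    and inside is the sorted list of row indices still enclosed.
    """
    dims = len(rows[0])
    inside = list(range(len(rows)))
    box = [[min(r[d] for r in rows), max(r[d] for r in rows)] for d in range(dims)]
    while len(inside) > min_points:
        # Peel a fraction alpha of the points, never going below min_points.
        k = max(1, int(alpha * len(inside)))
        k = min(k, len(inside) - min_points)
        best = None  # (mean cost of kept points, dimension, side, kept indices)
        for d in range(dims):
            by_value = sorted(inside, key=lambda i: rows[i][d])
            for side, kept in (("low", by_value[k:]), ("high", by_value[:-k])):
                mean_cost = sum(costs[i] for i in kept) / len(kept)
                if best is None or mean_cost > best[0]:
                    best = (mean_cost, d, side, kept)
        _, d, side, inside = best
        kept_values = [rows[i][d] for i in inside]
        if side == "low":
            box[d][0] = min(kept_values)
        else:
            box[d][1] = max(kept_values)
    return box, sorted(inside)
```

Each candidate peel needs only the average cost of the points that would remain, which is the quantity the abstract says a parallel architecture can compute in parallel.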

    Real time electronic service interaction management system and method

    Publication number: US07016936B2

    Publication date: 2006-03-21

    Application number: US09858704

    Filing date: 2001-05-15

    IPC classification: G06F15/16

    CPC classification: G06Q10/10 Y10S707/99931

    Abstract: The real-time electronic service interaction management system and method of the invention facilitate presentation of information that increases the probability of desirable target interaction. Desirable target interaction includes metrics associated with campaign objectives (e.g., maximize profits) and constraints (e.g., budget constraints). The system and method automatically develop interaction motivation plans that determine a stimulation action (e.g., information presented to a target). An interaction motivation plan is a procedure used to determine a stimulation action to present to a target with specific attributes under certain system attributes. The present invention adaptively optimizes and tests interaction motivation plans to permit automated learning about target individual interaction activities and accordingly modifies interaction motivation plans both in real time and over the lifetime of a campaign. It also facilitates the development of behavioral models that provide predictions associated with the probability of target behavior based upon a set of target characteristics and system attributes.

7. Method and apparatus to sense and multicast window events to a plurality of existing applications for concurrent execution
    Type: Granted invention patent
    Status: Expired

    Publication number: US5742778A

    Publication date: 1998-04-21

    Application number: US602386

    Filing date: 1996-02-16

    Abstract: A multicasting system for multicasting window events to various application programs running on a computer system, each such program having an application window. A global control program runs on the computer system and has a global control window. Through the global control program, a user selects one or more of the application programs to receive incoming window events. Later, when the global control window is active, any incoming window event is received in that window. The global control program automatically multicasts each such event to every application program that the user has selected to receive incoming window events. Events may be multicast directly to child windows of the various application windows. The global control window may have a global child window that receives incoming window events; such events are multicast directly to selected child windows of the application programs. The application programs may be resident locally or on a remote computer system. If window events are received out of sequence, the global control program may either ignore them or resequence them for proper operation.

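The select-then-multicast flow, including the resequencing option mentioned at the end of the abstract, might look roughly like the following. The class `GlobalControl`, its method names, and the use of integer sequence numbers starting at 0 are hypothetical; the abstract does not say how event order is detected. Real window-system calls are replaced by plain Python callables.

```python
class GlobalControl:
    """Toy stand-in for a global control program that multicasts events."""

    def __init__(self):
        self.handlers = {}    # application name -> callable receiving an event
        self.selected = set() # applications the user chose to receive events
        self.next_seq = 0     # sequence number expected next (an assumption)
        self.pending = {}     # out-of-order events buffered by sequence number

    def register(self, name, handler):
        self.handlers[name] = handler

    def select(self, name):
        if name not in self.handlers:
            raise KeyError(name)
        self.selected.add(name)

    def receive(self, seq, event):
        """Accept one incoming event and multicast events in sequence order.

        Events that arrive early are buffered until the missing ones show up;
        events older than next_seq are ignored.
        """
        if seq < self.next_seq:
            return
        self.pending[seq] = event
        while self.next_seq in self.pending:
            ready = self.pending.pop(self.next_seq)
            for name in sorted(self.selected):
                self.handlers[name](ready)
            self.next_seq += 1
```

In this sketch an application that is registered but not selected never sees an event, which mirrors the user-driven selection step in the abstract.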

8. Method and apparatus for classification of high dimensional data
    Type: Granted invention patent
    Status: Expired

    Publication number: US06563952B1

    Publication date: 2003-05-13

    Application number: US09420252

    Filing date: 1999-10-18

    IPC classification: G06K9/62

    CPC classification: G06K9/6276

    Abstract: The present invention is an apparatus and method for classifying high-dimensional sparse datasets. A raw data training set is flattened by converting it from categorical representation to a boolean representation. The flattened data is then used to build a class model on which new data not in the training set may be classified. In one embodiment, the class model takes the form of a decision tree, and large itemsets and cluster information are used as attributes for classification. In another embodiment, the class model is based on the nearest neighbors of the data to be classified. An advantage of the invention is that, by flattening the data, classification accuracy is increased by eliminating artificial ordering induced on the attributes. Another advantage is that the use of large itemsets and clustering increases classification accuracy.

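As a rough illustration of flattening and of the nearest-neighbor embodiment, the sketch below turns each categorical record into a sparse set of boolean items and classifies a new record by majority vote among its closest training records. The `attribute=value` item encoding, the Hamming distance, the parameter `k`, and the function names are assumptions; the abstract does not specify them, and the itemset and clustering features are not shown.

```python
from collections import Counter


def flatten(record):
    """Convert a categorical record (dict) into a sparse boolean representation.

    Each 'attribute=value' item present in the set stands for a boolean
    column that is true; all other columns are implicitly false. No ordering
    among the values of an attribute is implied.
    """
    return frozenset(f"{attr}={value}" for attr, value in record.items())


def hamming(a, b):
    """Number of boolean columns on which two flattened records differ."""
    return len(a ^ b)


def classify_nearest(training, record, k=3):
    """Majority class among the k training records nearest to `record`.

    training: list of (record_dict, class_label) pairs.
    """
    target = flatten(record)
    ranked = sorted(training, key=lambda pair: hamming(flatten(pair[0]), target))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Encoding colors as integers would make "red" numerically closer to "green" than to "yellow"; with the boolean items every pair of distinct values is equally far apart, which is the artificial-ordering problem the abstract says flattening removes.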

9. Method and apparatus for reducing the computational requirements of K-means data clustering
    Type: Granted invention patent
    Status: Expired

    Publication number: US5983224A

    Publication date: 1999-11-09

    Application number: US962470

    Filing date: 1997-10-31

    IPC classification: G06F17/30 G06K9/62

    Abstract: The present invention is directed to an improved data clustering method and apparatus for use in data mining operations. The present invention determines the pattern vectors of a k-d tree structure which are closest to a given prototype cluster by pruning prototypes through geometrical constraints, before a k-means process is applied to the prototypes. For each sub-branch in the k-d tree, a candidate set of prototypes is formed from the parent of a child node. The minimum and maximum distances from any point in the child node to any prototype in the candidate set are determined. The smallest of the maximum distances found is compared to the minimum distances of each prototype in the candidate set. Those prototypes with a minimum distance greater than the smallest of the maximum distances are pruned or eliminated. Pruning the number of remote prototypes reduces the number of distance calculations for the k-means process, significantly reducing the overall computation time.

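The pruning test in the abstract can be written down directly for one k-d tree node. The sketch assumes that a node is described by an axis-aligned bounding box given as `lo` and `hi` corner tuples and that distances are Euclidean; the function names are hypothetical, and the tree traversal and the k-means update itself are omitted.

```python
import math


def min_max_dist(lo, hi, prototype):
    """Minimum and maximum Euclidean distance from a prototype to the box [lo, hi]."""
    d_min = d_max = 0.0
    for low, high, x in zip(lo, hi, prototype):
        nearest = min(max(x, low), high)  # clamp the prototype onto the box
        farthest = low if abs(x - low) > abs(x - high) else high
        d_min += (x - nearest) ** 2
        d_max += (x - farthest) ** 2
    return math.sqrt(d_min), math.sqrt(d_max)


def prune_prototypes(lo, hi, candidates):
    """Keep only the prototypes that could be nearest to some point in the box.

    A prototype is dropped when even its closest approach to the box is
    farther than the smallest worst-case distance of any candidate.
    """
    dists = [min_max_dist(lo, hi, p) for p in candidates]
    smallest_max = min(d_max for _, d_max in dists)
    return [p for p, (d_min, _) in zip(candidates, dists) if d_min <= smallest_max]
```

The surviving list becomes the candidate set for the node's children, so distance calculations deeper in the tree involve fewer prototypes.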

10. Real-time user behavior prediction
    Type: Granted invention patent
    Status: In force

    Publication number: US08468110B1

    Publication date: 2013-06-18

    Application number: US12841831

    Filing date: 2010-07-22

    IPC classification: G06F17/00

    CPC classification: G06F11/3438 G06F9/453

    Abstract: The disclosed embodiments provide a system that facilitates use of an application. During operation, the system obtains an activity history of interaction between the user and the application during use of the application by the user. Next, the system applies a predictive model to the activity history to predict a probability of a user action in the application. Finally, the system facilitates subsequent real-time use of the application by the user based on the probability of the user action.