Systems and methods for sequential modeling in less than one sequential scan
    1.
    Patent application
    Systems and methods for sequential modeling in less than one sequential scan (status: lapsed)

    Publication number: US20060026110A1

    Publication date: 2006-02-02

    Application number: US10903336

    Filing date: 2004-07-30

    IPC classification: G06F15/18

    CPC classification: G06N99/005 Y10S707/99931

    Abstract: Most recent research on scalable inductive learning over very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the dataset and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.
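
    The abstract does not say how Hoeffding's inequality is applied, so the following is only a minimal sketch of the standard two-sided bound that makes "less than one scan" plausible: once enough examples have been seen for the bound to fall below a chosen tolerance, the remaining data need not be read. The function names, the tolerance, and the failure probability delta are illustrative assumptions, not taken from the patent.

```python
import math


def hoeffding_epsilon(value_range: float, n: int, delta: float) -> float:
    """Half-width eps such that the mean of n independent samples, each
    confined to an interval of width value_range, lies within eps of the
    true mean with probability at least 1 - delta (two-sided Hoeffding)."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, tolerance: float, delta: float) -> int:
    """Smallest n for which hoeffding_epsilon(value_range, n, delta) <= tolerance."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * tolerance ** 2))


# Hypothetical setting: per-example scores in [0, 1], 5% tolerance, 95% confidence.
n = examples_needed(1.0, 0.05, 0.05)
print(n, hoeffding_epsilon(1.0, n, 0.05))
```

    Note that the required sample size depends only on the value range, the tolerance, and delta, not on the size of the dataset, which is why a scan can stop early on very large inputs.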

    System and method for continuous diagnosis of data streams
    2.
    Patent application
    System and method for continuous diagnosis of data streams (status: lapsed)

    Publication number: US20060010093A1

    Publication date: 2006-01-12

    Application number: US10880913

    Filing date: 2004-06-30

    IPC classification: G06F17/30

    Abstract: In connection with the mining of time-evolving data streams, there is provided a general framework that mines changes and reconstructs models from a data stream with unlabeled instances or a limited number of labeled instances. In particular, there are defined herein statistical profiling methods that extend a classification tree in order to guess the percentage of drifts in the data stream without any labeled data. Exact error can be estimated by actively sampling a small number of true labels. If the estimated error is significantly higher than empirical expectations, there is preferably re-sampled a small number of true labels to reconstruct the decision tree from the leaf node level.

    Index Structure for Supporting Structural XML Queries
    3.
    Patent application
    Index Structure for Supporting Structural XML Queries (status: lapsed)

    Publication number: US20070271243A1

    Publication date: 2007-11-22

    Application number: US11780095

    Filing date: 2007-07-19

    IPC classification: G06F17/30

    Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches or wildcards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, and hence has a performance advantage over methods that index either content or structure alone. ViST supports dynamic index update, and it relies solely on B+Trees without using any specialized data structures that are not well supported by common database management systems (hereinafter referred to as “DBMSs”).
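
    The idea of answering a structural query by non-contiguous subsequence matching can be illustrated with a small sketch. It assumes that a structure-encoded sequence is the preorder list of (symbol, path-from-root) pairs, and it uses a plain linear scan instead of the B+Tree-backed index the abstract describes; the tuple-based tree representation and all names are hypothetical. Naive matching of this kind can also report false matches for some branching queries, which this sketch does not handle.

```python
def encode(node, prefix=()):
    """Preorder structure-encoded sequence of a tree.

    A node is (label, [children]); each element of the result pairs a
    label with the tuple of ancestor labels leading to it."""
    label, children = node
    sequence = [(label, prefix)]
    for child in children:
        sequence.extend(encode(child, prefix + (label,)))
    return sequence


def is_subsequence(query_seq, doc_seq):
    """True if query_seq occurs in doc_seq in order, not necessarily contiguously."""
    remaining = iter(doc_seq)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in query_seq)


# Hypothetical document: Purchase with a Seller (who has a Name) and a Buyer.
doc = ("Purchase", [("Seller", [("Name", [])]), ("Buyer", [])])
query = ("Purchase", [("Buyer", [])])

print(is_subsequence(encode(query), encode(doc)))
```

    Because every element carries its root path, a query for a Name directly under Purchase does not match the document above, even though both labels occur in it.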

    System and method for sequencing XML documents for tree structure indexing
    4.
    Patent application
    System and method for sequencing XML documents for tree structure indexing (status: lapsed)

    Publication number: US20060161575A1

    Publication date: 2006-07-20

    Application number: US11035889

    Filing date: 2005-01-14

    IPC classification: G06F7/00

    Abstract: Sequence-based XML indexing aims at avoiding expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. Herein, there is addressed the problem of query equivalence with respect to this transformation, and there is introduced a performance-oriented principle for sequencing tree structures. With query equivalence, XML queries can be performed through subsequence matching without join operations, post-processing, or other special handling for problems such as false alarms. There is identified a class of sequencing methods for this purpose, and there is presented a novel subsequence matching algorithm that observes query equivalence. Also introduced is a performance-oriented principle to guide the sequencing of tree structures. For any given XML dataset, the principle finds an optimal sequencing strategy according to its schema and its data distribution; there is thus presented herein a novel method that realizes this principle.

    System and method for scalable cost-sensitive learning
    5.
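    Patent application
    System and method for scalable cost-sensitive learning (status: under examination, published)

    Publication number: US20050125434A1

    Publication date: 2005-06-09

    Application number: US10725378

    Filing date: 2003-12-03

    IPC classification: G06F7/00

    CPC classification: G06N20/00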

    Abstract: A method (and structure) for processing an inductive learning model for a dataset of examples includes dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first subset of the N subsets.

    SYSTEMS AND METHODS FOR SEQUENTIAL MODELING IN LESS THAN ONE SEQUENTIAL SCAN
    6.
    Patent application
    SYSTEMS AND METHODS FOR SEQUENTIAL MODELING IN LESS THAN ONE SEQUENTIAL SCAN (status: lapsed)

    Publication number: US20080052255A1

    Publication date: 2008-02-28

    Application number: US11931129

    Filing date: 2007-10-31

    IPC classification: G06F15/18 G06N7/00

    CPC classification: G06N99/005 Y10S707/99931

    Abstract: Most recent research on scalable inductive learning over very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the dataset and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.

    Systems and methods for subspace clustering
    7.
    Patent application
    Systems and methods for subspace clustering (status: lapsed)

    Publication number: US20050278324A1

    Publication date: 2005-12-15

    Application number: US10858541

    Filing date: 2004-05-31

    IPC classification: G06F7/00 G06K9/62

    CPC classification: G06K9/6215 Y10S707/99936

    Abstract: Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including e-Commerce target marketing, bioinformatics (large scale scientific data analysis), and automatic computing (web usage analysis). However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle datasets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many datasets are also characterized by their sequentiality; for instance, customer purchase records and network event logs are usually modeled as data sequences. Hence, it becomes important to enable pattern-based clustering methods i) to handle large datasets, and ii) to discover pattern similarity embedded in data sequences. There is presented herein a novel method that offers this capability.
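
    To make "a coherent pattern of rise and fall" concrete, the sketch below uses the pairwise score commonly associated with the pCluster algorithm that the abstract cites as prior art: two objects are pattern-similar on a set of dimensions if, for every pair of dimensions, the difference of their differences stays within a threshold. This illustrates the notion of pattern similarity only, not the method the application claims; the function names and the threshold delta are assumptions.

```python
from itertools import combinations


def p_score(x, y, a, b):
    """Difference between how x and y change from dimension b to dimension a."""
    return abs((x[a] - x[b]) - (y[a] - y[b]))


def pattern_similar(x, y, dims, delta):
    """True if x and y rise and fall together (within delta) on every pair of dims."""
    return all(p_score(x, y, a, b) <= delta for a, b in combinations(dims, 2))


# Hypothetical objects: y is x shifted by 10, so their values are far apart
# but their shape is identical; z rises where x falls.
x = [1, 5, 3]
y = [11, 15, 13]
z = [1, 3, 5]

print(pattern_similar(x, y, dims=[0, 1, 2], delta=0))
print(pattern_similar(x, z, dims=[0, 1, 2], delta=1))
```

    A distance-based method would place x and y far apart, whereas the pairwise score treats them as a perfect match.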

    System and method for mining time-changing data streams
    8.
    Patent application
    System and method for mining time-changing data streams (status: granted, in force)

    Publication number: US20050278322A1

    Publication date: 2005-12-15

    Application number: US10857030

    Filing date: 2004-05-28

    IPC classification: G06F7/00

    Abstract: There is provided a general framework for mining concept-drifting data streams using weighted ensemble classifiers. An ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., is trained from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. An empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
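
    The weighting scheme can be sketched as follows, under the assumption that each classifier's expected accuracy is estimated by its accuracy on the most recent labeled chunk and used directly as its voting weight. The abstract does not give the actual weighting formula, so this choice, the callable-based model representation, and all names are illustrative.

```python
from collections import defaultdict


def accuracy(model, chunk):
    """Fraction of (x, label) pairs in chunk that model predicts correctly."""
    return sum(model(x) == label for x, label in chunk) / len(chunk)


def weighted_vote(models, weights, x):
    """Class whose supporting models carry the largest total weight."""
    score = defaultdict(float)
    for model, weight in zip(models, weights):
        score[model(x)] += weight
    return max(score, key=score.get)


# Hypothetical ensemble: three threshold rules trained on earlier chunks.
models = [lambda x: x > 0, lambda x: x > 5, lambda x: False]
# Most recent labeled chunk, reflecting the current concept (positive iff x > 0).
recent_chunk = [(1, True), (3, True), (7, True), (-2, False)]

weights = [accuracy(m, recent_chunk) for m in models]
print(weights, weighted_vote(models, weights, 2))
```

    In this example an unweighted majority vote would classify x = 2 as negative, while the weighted vote follows the one model that still fits the recent data.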

    System and method for adaptive pruning
    9.
    Patent application
    System and method for adaptive pruning (status: lapsed)

    Publication number: US20050131873A1

    Publication date: 2005-06-16

    Application number: US10737123

    Filing date: 2003-12-16

    IPC classification: G06F17/30

    CPC classification: G06F17/30539 G06F17/30598

    Abstract: Disclosed is a method and structure for searching data in databases using an ensemble of models. First, the invention performs training. This training orders the models within the ensemble by prediction accuracy and joins different numbers of models together to form sub-ensembles. The models are joined together in each sub-ensemble in order of prediction accuracy. Next in the training process, the invention calculates a confidence value for each of the sub-ensembles. The confidence is a measure of how closely results from the sub-ensemble will match results from the ensemble. The size of each sub-ensemble varies with the level of confidence, whereas the size of the ensemble is fixed. After the training, the invention can make a prediction. First, the invention selects a sub-ensemble that meets a given level of confidence. As the level of confidence is raised, a sub-ensemble that has more models will be selected, and as the level of confidence is lowered, a sub-ensemble that has fewer models will be selected. Finally, the invention applies the selected sub-ensemble, in place of the ensemble, to an example to make a prediction.
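
    The training and selection steps described above can be sketched as follows. The sketch assumes majority voting, measures confidence as the fraction of held-out examples on which a sub-ensemble agrees with the full ensemble, and picks the smallest sub-ensemble that reaches the requested level; the abstract fixes none of these details, so they and all names are illustrative assumptions.

```python
from collections import Counter


def vote(models, x):
    """Majority prediction; ties go to the prediction encountered first."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


def train_sub_ensembles(models, accuracies, examples):
    """Rank models by accuracy and score each top-k sub-ensemble by how
    often it agrees with the full ensemble on the given examples."""
    order = sorted(range(len(models)), key=lambda i: -accuracies[i])
    ranked = [models[i] for i in order]
    full = [vote(ranked, x) for x in examples]
    confidences = []
    for k in range(1, len(ranked) + 1):
        agree = sum(vote(ranked[:k], x) == f for x, f in zip(examples, full))
        confidences.append(agree / len(examples))
    return ranked, confidences


def select_size(confidences, level):
    """Smallest k whose top-k sub-ensemble meets the confidence level."""
    for k, confidence in enumerate(confidences, start=1):
        if confidence >= level:
            return k
    return len(confidences)


# Hypothetical ensemble of threshold rules with assumed accuracies.
models = [lambda x: x > 4, lambda x: x > 0, lambda x: x > 2]
ranked, confidences = train_sub_ensembles(models, [0.7, 0.9, 0.8], [-1, 1, 3, 5])
k = select_size(confidences, level=0.9)
print(confidences, k, vote(ranked[:k], 3))
```

    Raising the level selects a larger sub-ensemble and lowering it selects a smaller one, which is the trade-off between prediction cost and fidelity to the full ensemble that the abstract describes.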

    Index structure for supporting structural XML queries
    10.
    Patent application
    Index structure for supporting structural XML queries (status: lapsed)

    Publication number: US20050114314A1

    Publication date: 2005-05-26

    Application number: US10723206

    Filing date: 2003-11-26

    IPC classification: G06F17/30

    Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches or wildcards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, and hence has a performance advantage over methods that index either content or structure alone. ViST supports dynamic index update, and it relies solely on B+Trees without using any specialized data structures that are not well supported by common database management systems (hereinafter referred to as “DBMSs”).
