SYSTEM AND METHOD OF MINING TIME-CHANGING DATA STREAMS USING A DYNAMIC RULE CLASSIFIER HAVING LOW GRANULARITY
    11.
    Patent application
    Status: Lapsed

    Publication No.: US20080222060A1

    Publication date: 2008-09-11

    Application No.: US12121942

    Filing date: 2008-05-16

    IPC classification: G06F15/18

    CPC classification: G06N5/025

    Abstract: A dynamic rule classifier for mining a data stream includes at least one window for viewing data contained in the data stream and a set of rules for mining the data. Rules are added and the set of rules is updated by algorithms when a drift in a concept within the data occurs, causing unacceptable drops in classification accuracy. The dynamic rule classifier is also implemented as a method and a computer program product.
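
    The window-plus-rules idea in the abstract above can be illustrated with a minimal sketch. It is not the patented classifier: the class name `WindowedRuleClassifier`, the one-attribute lookup rules, and the `window_size` and `min_accuracy` parameters are all assumptions made for illustration. The sketch keeps a sliding window of recent labeled records and rebuilds its rules from that window whenever windowed accuracy falls below a threshold.

```python
from collections import deque


class WindowedRuleClassifier:
    """Toy rule classifier that relearns its rules when windowed accuracy drops."""

    def __init__(self, window_size=4, min_accuracy=0.75):
        self.window = deque(maxlen=window_size)  # most recent (value, label) pairs
        self.min_accuracy = min_accuracy         # hypothetical acceptance threshold
        self.rules = {}                          # attribute value -> predicted label
        self.updates = 0                         # how many times rules were rebuilt

    def predict(self, value):
        return self.rules.get(value)

    def _window_accuracy(self):
        hits = sum(1 for v, y in self.window if self.rules.get(v) == y)
        return hits / len(self.window)

    def _rebuild_rules(self):
        # Later records overwrite earlier ones, so the newest concept wins.
        self.rules = {v: y for v, y in self.window}
        self.updates += 1

    def observe(self, value, label):
        """Add a labeled record; rebuild the rules if accuracy is unacceptable."""
        self.window.append((value, label))
        if self._window_accuracy() < self.min_accuracy:
            self._rebuild_rules()


clf = WindowedRuleClassifier()
# The concept flips after the fourth record: "a" moves from label 1 to label 0.
stream = [("a", 1), ("b", 0), ("a", 1), ("b", 0), ("a", 0), ("b", 1)]
for value, label in stream:
    clf.observe(value, label)
```

    After the flip, the second drifted record pushes windowed accuracy to 0.5, which triggers a rebuild, so the classifier ends up predicting the new concept.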

    Methods and apparatus for mining attribute associations
    12.
    Patent application
    Status: Lapsed

    Publication No.: US20050027710A1

    Publication date: 2005-02-03

    Application No.: US10630992

    Filing date: 2003-07-30

    IPC classification: G06F7/00 G06F17/18 G06F17/30

    Abstract: Attribute association discovery techniques that support relational-based data mining are disclosed. In one aspect of the invention, a technique for mining attribute associations in a relational data set comprises the following steps/operations. Multiple items are obtained from the relational data set. Then, attribute associations are discovered using: (i) multi-attribute mining templates formed from at least a portion of the multiple items; and (ii) one or more mining preferences specified by a user. The invention provides a novel architecture for the mining search space so as to exploit the inter-relationships among patterns of different templates. The framework is relational-sensitive and supports interactive and online mining.

    Method for fast relevance discovery in time series
    13.
    Granted patent
    Status: In force

    Publication No.: US07447723B2

    Publication date: 2008-11-04

    Application No.: US11563900

    Filing date: 2006-11-28

    IPC classification: G06F17/15

    CPC classification: G06K9/00536

    Abstract: A method for measuring time series relevance using state transition points. The method takes time series data and relevance threshold data as input, converts all time series values to ranks within the [0,1] interval, and calculates the valid range of the transition point in [0,1]. For each pair of time series X and Y, it then verifies whether a time series Z exists such that the relevances between X and Z, and between Y and Z, are known, and from these deduces whether the relevance of X and Y is (i) higher or (ii) lower than the given threshold. If such a Z is found, all remaining calculations for X and Y are terminated. Otherwise, if no such Z exists, the time series are segmented and the segmented time series are used to estimate the relevance. A hill climbing algorithm is applied in the valid range to find the true relevance.
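
    The first step in the abstract above, converting values to ranks within [0,1], can be sketched as follows. The function name `to_unit_ranks` and the tie handling (tied values share their average rank) are assumptions; the abstract does not say how ties are treated, and the later steps (transition points, the deduction through Z, hill climbing) are not modeled here.

```python
def to_unit_ranks(series):
    """Map each value to its rank scaled into [0, 1]; ties share the average rank."""
    n = len(series)
    if n == 1:
        return [0.0]
    # Indices of the series sorted by value.
    order = sorted(range(n), key=lambda i: series[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to cover the whole group of values tied with position i.
        j = i
        while j + 1 < n and series[order[j + 1]] == series[order[i]]:
            j += 1
        avg = (i + j) / 2  # average zero-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg / (n - 1)
        i = j + 1
    return ranks


ranks = to_unit_ranks([10, 30, 20])
tied = to_unit_ranks([5, 5, 7])
```

    Because only the ordering of values matters after this step, two series on very different scales become directly comparable.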

    System and method for indexing weighted-sequences in large databases
    14.
    Granted patent
    Status: In force

    Publication No.: US07418455B2

    Publication date: 2008-08-26

    Application No.: US10723229

    Filing date: 2003-11-26

    IPC classification: G06F7/00 G06F17/00

    Abstract: The present invention provides an index structure for managing weighted-sequences in large databases. A weighted-sequence is defined as a two-dimensional structure in which each element in the sequence is associated with a weight. A series of network events, for instance, is a weighted-sequence because each event is associated with a timestamp. Querying a large sequence database by events' occurrence patterns is a first step towards understanding the temporal causal relationships among the events. The index structure proposed herein enables the efficient retrieval from the database of all subsequences (contiguous and non-contiguous) that match a given query sequence both by events and by weights. The index structure also takes into consideration the nonuniform frequency distribution of events in the sequence data.
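
    To make the query semantics in the abstract above concrete, here is a brute-force sketch of matching "both by events and by weights". It is not the index structure the patent proposes. The function name `find_matches` and the matching rule are assumptions: a subsequence matches when its events equal the query's events in order and each element's weight offset from the first matched element equals the corresponding offset in the query.

```python
def find_matches(sequence, query):
    """Return index tuples of subsequences matching `query` by event and relative weight.

    `sequence` and `query` are lists of (event, weight) pairs; `query` is non-empty.
    """
    matches = []

    def extend(start, picked):
        q = len(picked)
        if q == len(query):
            matches.append(tuple(picked))
            return
        for i in range(start, len(sequence)):
            event, weight = sequence[i]
            if event != query[q][0]:
                continue
            if picked:
                # Offset from the first matched element must equal the query's offset.
                base = sequence[picked[0]][1]
                if weight - base != query[q][1] - query[0][1]:
                    continue
            extend(i + 1, picked + [i])

    extend(0, [])
    return matches


# Network events with timestamps as weights.
events = [("a", 0), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("c", 7)]
pattern = [("a", 0), ("b", 2), ("c", 4)]
hits = find_matches(events, pattern)
```

    The second hit is non-contiguous in neither case here, but the search does skip over unrelated elements (for example the second "a" at index 2 while completing the first match), which is what the abstract means by non-contiguous subsequences. An index exists precisely to avoid this exhaustive scan.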

    System and method for indexing weighted-sequences in large databases
    15.
    Patent application
    Status: In force

    Publication No.: US20050114298A1

    Publication date: 2005-05-26

    Application No.: US10723229

    Filing date: 2003-11-26

    IPC classification: G06F17/30

    Abstract: The present invention provides an index structure for managing weighted-sequences in large databases. A weighted-sequence is defined as a two-dimensional structure in which each element in the sequence is associated with a weight. A series of network events, for instance, is a weighted-sequence because each event is associated with a timestamp. Querying a large sequence database by events' occurrence patterns is a first step towards understanding the temporal causal relationships among the events. The index structure proposed herein enables the efficient retrieval from the database of all subsequences (contiguous and non-contiguous) that match a given query sequence both by events and by weights. The index structure also takes into consideration the nonuniform frequency distribution of events in the sequence data.

    System and method for load shedding in data mining and knowledge discovery from stream data
    16.
    Granted patent
    Status: In force

    Publication No.: US08060461B2

    Publication date: 2011-11-15

    Application No.: US12372568

    Filing date: 2009-02-17

    IPC classification: G06F7/00 G06F17/00

    CPC classification: G06K9/6297 H04L43/028

    Abstract: Load shedding schemes for mining data streams. A scoring function is used to rank the importance of stream elements, and those elements with high importance are investigated. In the context of not knowing the exact feature values of a data stream, the use of a Markov model is proposed herein for predicting the feature distribution of a data stream. Based on the predicted feature distribution, one can make classification decisions to maximize the expected benefits. In addition, there is proposed herein the employment of a quality of decision (QoD) metric to measure the level of uncertainty in decisions and to guide load shedding. A load shedding scheme such as presented herein assigns available resources to multiple data streams to maximize the quality of classification decisions. Furthermore, such a load shedding scheme is able to learn and adapt to changing data characteristics in the data streams.
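
    The interplay of Markov prediction and uncertainty-guided load shedding described above can be sketched roughly as follows. The abstract does not define the QoD metric or the scoring function, so `decision_confidence` (the largest predicted probability) is a hypothetical stand-in, and the function names, the shared transition matrix, and the `budget` parameter are likewise assumptions. The sketch predicts each stream's next state distribution with one Markov step and spends a limited observation budget on the streams whose prediction is least certain.

```python
def predict_next(dist, transition):
    """One Markov step: next[j] = sum_i dist[i] * transition[i][j]."""
    n = len(transition)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]


def decision_confidence(dist):
    """Hypothetical stand-in for a quality-of-decision score: top predicted probability."""
    return max(dist)


def choose_streams_to_observe(stream_dists, transition, budget):
    """Return the `budget` streams whose predicted state is least certain."""
    predicted = {name: predict_next(d, transition) for name, d in stream_dists.items()}
    ranked = sorted(predicted, key=lambda name: decision_confidence(predicted[name]))
    return ranked[:budget]


# Two-state model shared by all streams (an assumption made for brevity).
transition = [[0.9, 0.1],
              [0.5, 0.5]]
last_known = {
    "s1": [1.0, 0.0],  # predicted next: [0.9, 0.1], fairly certain
    "s2": [0.0, 1.0],  # predicted next: [0.5, 0.5], least certain
    "s3": [0.5, 0.5],  # predicted next: [0.7, 0.3]
}
to_observe = choose_streams_to_observe(last_known, transition, budget=2)
```

    Streams that are not selected would be classified from their predicted distribution alone, which is where the shed load is saved.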

    System and method for scalable cost-sensitive learning
    17.
    Granted patent
    Status: In force

    Publication No.: US07904397B2

    Publication date: 2011-03-08

    Application No.: US12690502

    Filing date: 2010-01-20

    IPC classification: G06F15/18 G06N3/00 G06N3/12

    CPC classification: G06N99/005

    Abstract: A method (and structure) for processing an inductive learning model for a dataset of examples includes dividing the dataset of examples into a plurality of subsets of data and generating, using a processor on a computer, a learning model using examples of a first subset of data of the plurality of subsets of data. The learning model being generated for the first subset comprises an initial stage of an evolving aggregate learning model (ensemble model) for an entirety of the dataset, the ensemble model thereby providing an evolving estimated learning model for the entirety of the dataset if all the subsets were to be processed. The generating of the learning model using data from a subset includes calculating a value for at least one parameter that provides an objective indication of an adequacy of a current stage of the ensemble model.
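
    The subset-by-subset ensemble in the abstract above can be sketched as a loop that adds one model per subset and checks an adequacy value after each stage. Everything specific here is an assumption: the base learner (per-value positive rate), the adequacy parameter (accuracy on a holdout set), the `target_accuracy` stopping rule, and all function names. The sketch also ignores misclassification costs, so it shows only the progressive-ensemble aspect, not the cost-sensitive one.

```python
from collections import defaultdict


def train_subset_model(examples):
    """Base learner: per feature value, the fraction of positive labels in this subset."""
    counts = defaultdict(lambda: [0, 0])  # value -> [positives, total]
    for x, y in examples:
        counts[x][0] += y
        counts[x][1] += 1
    return {x: pos / total for x, (pos, total) in counts.items()}


def ensemble_predict(models, x):
    """Average member estimates (0.5 when a member never saw x); threshold at 0.5."""
    avg = sum(m.get(x, 0.5) for m in models) / len(models)
    return 1 if avg >= 0.5 else 0


def grow_ensemble(subsets, holdout, target_accuracy):
    """Add one model per subset; stop early once holdout accuracy reaches the target."""
    models, accuracy = [], 0.0
    for subset in subsets:
        models.append(train_subset_model(subset))
        hits = sum(ensemble_predict(models, x) == y for x, y in holdout)
        accuracy = hits / len(holdout)  # adequacy of the current ensemble stage
        if accuracy >= target_accuracy:
            break
    return models, accuracy


subsets = [
    [("a", 1), ("b", 1)],  # a noisy first subset mislabels "b"
    [("a", 1), ("b", 0)],
    [("a", 1), ("b", 0)],
    [("a", 0), ("b", 0)],  # never processed: the ensemble is already adequate
]
holdout = [("a", 1), ("b", 0)]
models, accuracy = grow_ensemble(subsets, holdout, target_accuracy=1.0)
```

    The point of the adequacy check is scalability: training stops after three of the four subsets, so the remaining data never has to be scanned.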

    System and method for ranked keyword search on graphs
    18.
    Granted patent
    Status: In force

    Publication No.: US07702620B2

    Publication date: 2010-04-20

    Application No.: US11693471

    Filing date: 2007-03-29

    IPC classification: G06F17/30

    Abstract: Arrangements and methods for providing for the efficient implementation of ranked keyword searches on graph-structured data. Since it is difficult to directly build indexes for general schemaless graphs, conventional techniques rely heavily on graph traversal at run time. The previous lack of more knowledge about graphs also resulted in great difficulties in applying pruning techniques. To address these problems, there is introduced herein a new scoring function while the block is used as an intermediate access level; the result is an opportunity to create sophisticated indexes for keyword search. Also proposed herein is a cost-balanced expansion algorithm to conduct a backward search, which provides a good theoretical guarantee in terms of the search cost.
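
    For readers unfamiliar with backward search, the sketch below shows the plain traversal-based baseline, not the block index or the cost-balanced expansion that the abstract proposes. It runs a breadth-first search backward (against edge direction) from the nodes containing each keyword and ranks every node that can reach all keywords. The scoring rule (sum of hop counts, lower is better) and the names `backward_distances` and `ranked_roots` are assumptions, and the query is assumed to be non-empty.

```python
from collections import deque


def backward_distances(reverse_adj, sources):
    """Multi-source BFS over reversed edges: hops from each node to its nearest source."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for pred in reverse_adj.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)
    return dist


def ranked_roots(edges, node_keywords, query):
    """Return (score, node) pairs for nodes that reach every query keyword, best first."""
    reverse_adj = {}
    for src, dst in edges:
        reverse_adj.setdefault(dst, []).append(src)
    per_keyword = []
    for kw in query:
        sources = [n for n, kws in node_keywords.items() if kw in kws]
        per_keyword.append(backward_distances(reverse_adj, sources))
    # Candidate roots are nodes reached by the backward search of every keyword.
    common = set(per_keyword[0]).intersection(*per_keyword[1:])
    return sorted((sum(d[n] for d in per_keyword), n) for n in common)


# Tiny bibliographic graph: a repository links to papers, papers link to authors.
edges = [("p1", "a1"), ("p1", "a2"), ("p2", "a2"), ("r", "p1"), ("r", "p2")]
node_keywords = {"a1": {"wang"}, "a2": {"yu"}, "p1": set(), "p2": set(), "r": set()}
results = ranked_roots(edges, node_keywords, ["wang", "yu"])
```

    Paper `p1` ranks first because it reaches both authors in one hop each; `p2` is excluded because it never reaches "wang". On a large graph this exhaustive expansion is the run-time cost the abstract sets out to reduce.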

    System and method for sequence-based subspace pattern clustering
    19.
    Granted patent
    Status: Lapsed

    Publication No.: US07565346B2

    Publication date: 2009-07-21

    Application No.: US10858541

    Filing date: 2004-05-31

    IPC classification: G06F17/30

    CPC classification: G06K9/6215 Y10S707/99936

    Abstract: Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including e-Commerce target marketing, bioinformatics (large scale scientific data analysis), and automatic computing (web usage analysis), etc. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle datasets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality, for instance, customer purchase records and network event logs are usually modeled as data sequences. Hence, it becomes important to enable pattern-based clustering methods i) to handle large datasets, and ii) to discover pattern similarity embedded in data sequences. There is presented herein a novel method that offers this capability.
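
    The distinction the abstract draws between value similarity and pattern similarity can be shown with a small check. This is only an illustration of "a coherent pattern of rise and fall in subspaces", not the patented sequence-based method or the pCluster algorithm. The function name `is_shift_coherent`, the shift-only notion of coherence, and the `tolerance` parameter are assumptions.

```python
def is_shift_coherent(row_a, row_b, dims, tolerance):
    """True if row_a - row_b is constant (within tolerance) across the chosen dimensions."""
    diffs = [row_a[d] - row_b[d] for d in dims]
    return max(diffs) - min(diffs) <= tolerance


# The two objects are far apart in value but rise and fall together on dims 0-2.
obj_a = [1, 5, 3, 9]
obj_b = [11, 15, 13, 2]
coherent_in_subspace = is_shift_coherent(obj_a, obj_b, dims=[0, 1, 2], tolerance=0)
coherent_in_full_space = is_shift_coherent(obj_a, obj_b, dims=[0, 1, 2, 3], tolerance=0)
```

    A distance-based method would never group these two objects, yet on the subspace of the first three dimensions they follow exactly the same pattern, offset by a constant.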

    SYSTEM AND METHOD FOR RANKED KEYWORD SEARCH ON GRAPHS
    20.
    Patent application
    Status: In force

    Publication No.: US20080243811A1

    Publication date: 2008-10-02

    Application No.: US11693471

    Filing date: 2007-03-29

    IPC classification: G06F17/30

    Abstract: Arrangements and methods for providing for the efficient implementation of ranked keyword searches on graph-structured data. Since it is difficult to directly build indexes for general schemaless graphs, conventional techniques rely heavily on graph traversal at run time. The previous lack of more knowledge about graphs also resulted in great difficulties in applying pruning techniques. To address these problems, there is introduced herein a new scoring function while the block is used as an intermediate access level; the result is an opportunity to create sophisticated indexes for keyword search. Also proposed herein is a cost-balanced expansion algorithm to conduct a backward search, which provides a good theoretical guarantee in terms of the search cost.
