摘要:
Disclosed in a method and structure for searching data in databases using an ensemble of models. First the invention performs training. This training orders models within the ensemble in order of prediction accuracy and joins different numbers of models together to form sub-ensembles. The models are joined together in the sub-ensemble in the order of prediction accuracy. Next in the training process, the invention calculates confidence values of each of the sub-ensembles. The confidence is a measure of how closely results form the sub-ensemble will match results from the ensemble. The size of each of the sub-ensembles is variable depending upon the level of confidence, while, to the contrary, the size of the ensemble is fixed. After the training, the invention can make a prediction. First, the invention selects a sub-ensemble that meets a given level of confidence. As the level of confidence is raised, a sub-ensemble that has more models will be selected and as the level of confidence is lowered, a sub-ensemble that has fewer models will be selected. Finally, the invention applies the selected sub-ensemble, in place of the ensemble, to an example to make a prediction.
摘要:
Uncertain data is classified by constructing an error adjusted probability density estimate for the data, and applying a subspace exploration process to the probability density estimate to classify the data.
摘要:
Arrangements are provided for performing structural clustering between different time series. Time series data relating to a plurality of time series is accepted, structural features relating to the time series data are ascertained, and at least one distance between different time series via employing the structural features is determined. The different time series may be partitioned into clusters based on the at least one distance, and/or the k closest matches to a given time series query based on the at least one distance may be returned.
摘要:
Disclosed is a method, information processing system, and computer readable medium for preserving privacy of nonstationary data streams. The method includes receiving at least one nonstationary data stream with time dependent data. Calculating, for a given instant of sub-space of time, A set of first-moment statistical values is calculated, for a given instant of sub-space of time, for the data. The first moment statistical values include a principal component for the sub-space of time. The data is perturbed with noise along the principal component in proportion to the first-moment of statistical values so that at least part of a set of second-moment statistical values for the data is perturbed by the noise only within a predetermined variance.
摘要:
A method (and structure) for processing an inductive learning model for a dataset of examples, includes dividing the dataset of examples into a plurality of subsets of data and generating, using a processor on a computer, a learning model using examples of a first subset of data of the plurality of subsets of data. The learning model being generated for the first subset comprises an initial stage of an evolving aggregate learning model (ensemble model) for an entirety of the dataset, the ensemble model thereby providing an evolving estimated learning model for the entirety of the dataset if all the subsets were to be processed. The generating of the learning model using data from a subset includes calculating a value for at least one parameter that provides an objective indication of an adequacy of a current stage of the ensemble model.
摘要:
Disclosed are a method, information processing system, and computer readable medium for managing data collection in a distributed processing system. The method includes dynamically collecting at least one statistical query pattern associated with a selected group of information processing nodes. The statistical query pattern is dynamically collected from a plurality of information processing nodes in a distributed processing system. At least one operating attribute distribution associated with an operating attribute that has been queried for the selected group is dynamically monitored. The selected group is dynamically configured, based on the query pattern and the operating attribute distribution, to periodically push a set of attributes associated with the each information processing node in the selected group.
摘要:
Streaming environments typically dictate incomplete or approximate algorithm execution, in order to cope with sudden surges in the data rate. Such limitations are even more accentuated in mobile environments (such as sensor networks) where computational and memory resources are typically limited. Introduced herein is a novel “resource adaptive” algorithm for spectrum and periodicity estimation on a continuous stream of data. The formulation is based on the derivation of a closed-form incremental computation of the spectrum, augmented by an intelligent load-shedding scheme that can adapt to available CPU resources. Experimentation indicates that the proposed technique can be a viable and resource efficient solution for real-time spectrum estimation.
摘要:
Uncertain data is classified by constructing an error adjusted probability density estimate for the data, and applying a subspace exploration process to the probability density estimate to classify the data.
摘要:
Sequence-based XML indexing aims at avoiding expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. Herein, there is addressed the problem of query equivalence with respect to this transformation, and thereis introduced a performance-oriented principle for sequencing tree structures. With query equivalence, XML queries can be performed through subsequence matching without join operations, post-processing, or other special handling for problems such as false alarms. There is identified a class of sequencing methods for this purpose, and there is presented a novel subsequence matching algorithm that observe query equivalence. Also introduced is a performance-oriented principle to guide the sequencing of tree structures. For any given XML dataset, the principle finds an optimal sequencing strategy according to its schema and its data distribution; there is thus presented herein a novel method that realizes this principle.
摘要:
In connection with the mining of time-evolving data streams, a general framework that mines changes and reconstructs models from a data stream with unlabeled instances or a limited number of labeled instances. In particular, there are defined herein statistical profiling methods that extend a classification tree in order to guess the percentage of drifts in the data stream without any labelled data. Exact error can be estimated by actively sampling a small number of true labels. If the estimated error is significantly higher than empirical expectations, there preferably re-sampled a small number of true labels to reconstruct the decision tree from the leaf node level.