Abstract:
Methods and apparatus are provided for outlier detection in databases by determining sparse low dimensional projections. These sparse projections are used to determine which points are outliers. The invention thereby provides a novel definition of exceptions, or outliers, for high dimensional data.
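By way of illustration only, the idea of flagging points that fall in sparse low dimensional projections can be sketched as follows. The abstract does not state how sparsity is measured or how projections are searched. This sketch assumes an equi-depth grid with `phi` ranges per dimension, a standard-score "sparsity coefficient" comparing each cell count with its expected value, and a brute-force scan over all `k`-dimensional projections. The names `equi_depth_bins`, `sparse_projection_outliers`, `phi`, `k`, and `threshold` are hypothetical.

```python
from itertools import combinations
from math import sqrt

def equi_depth_bins(values, phi):
    """Assign each value to one of phi ranges holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * phi // len(values)
    return bins

def sparse_projection_outliers(points, phi=3, k=2, threshold=-1.0):
    """Return indices of points lying in abnormally sparse k-dimensional cells."""
    n, d = len(points), len(points[0])
    grid = [equi_depth_bins([p[j] for p in points], phi) for j in range(d)]
    f = (1.0 / phi) ** k              # chance of landing in a cell if dimensions were independent
    expected = n * f
    std = sqrt(n * f * (1 - f))
    outliers = set()
    for dims in combinations(range(d), k):   # brute force over projections (assumption)
        cells = {}
        for i in range(n):
            key = tuple(grid[j][i] for j in dims)
            cells.setdefault(key, []).append(i)
        for members in cells.values():
            # Assumed sparsity coefficient: standard score of the cell count.
            if (len(members) - expected) / std <= threshold:
                outliers.update(members)
    return sorted(outliers)

# Three diagonal clusters plus two points that pair a "low" value with a "high" one.
cluster_a = [(float(i), float(i)) for i in range(9)]
cluster_b = [(100.0 + i, 100.0 + i) for i in range(10)]
cluster_c = [(200.0 + i, 200.0 + i) for i in range(9)]
odd_points = [(4.5, 204.5), (204.5, 4.5)]
points = cluster_a + cluster_b + cluster_c + odd_points
flagged = sparse_projection_outliers(points, phi=3, k=2, threshold=-1.0)
```

Neither odd point is extreme in any single dimension. They stand out only because the combination of ranges they occupy is nearly empty, which is the sense of "sparse projection" assumed here.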
Abstract:
Techniques are disclosed for predicting the future behavior of data streams through the use of current trends of the data stream. By way of example, a technique for predicting the future behavior of a data stream comprises the following steps/operations. Statistics are obtained from the data stream. Estimated statistics for a future time interval are generated by using at least a portion of the obtained statistics. A portion of the estimated statistics is utilized to generate one or more representative pseudo-data records within the future time interval. The pseudo-data records are utilized for forecasting at least one characteristic of the data stream.
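By way of illustration only, the steps above (obtain statistics, estimate future statistics, generate a pseudo-data record) can be sketched as follows. The abstract does not say which statistics are kept or how they are extrapolated. This sketch assumes per-interval record counts and per-dimension means, a least-squares linear trend, and a single pseudo-data record placed at the estimated mean. All function names are hypothetical.

```python
def interval_stats(records):
    """Summarize one time interval as (record count, per-dimension mean)."""
    n = len(records)
    dims = len(records[0])
    return n, [sum(r[j] for r in records) / n for j in range(dims)]

def linear_extrapolate(ys, t_future):
    """Fit a least-squares line through (0, ys[0]), (1, ys[1]), ... and evaluate it.

    Needs at least two observations, otherwise the slope is undefined.
    """
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys)) / denom
    return y_mean + slope * (t_future - t_mean)

def forecast_pseudo_record(intervals, t_future):
    """Return (estimated count, pseudo-data record) for interval index t_future."""
    stats = [interval_stats(iv) for iv in intervals]
    dims = len(stats[0][1])
    est_count = linear_extrapolate([count for count, _ in stats], t_future)
    record = [linear_extrapolate([mean[j] for _, mean in stats], t_future)
              for j in range(dims)]
    return est_count, record

# Three past intervals whose mean drifts by (+2, -2) per interval.
history = [
    [(1.0, 10.0), (3.0, 10.0)],
    [(3.0, 8.0), (5.0, 8.0)],
    [(5.0, 6.0), (7.0, 6.0)],
]
est_count, pseudo = forecast_pseudo_record(history, t_future=3)
```

The pseudo-data record can then be passed to whatever analysis would normally run on real records from that interval, for example to forecast which cluster or class will dominate.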
Abstract:
Methods and apparatus for generating at least one output data set from at least one input data set for use in association with a data mining process are provided. First, data statistics are constructed from the at least one input data set. Then, an output data set is generated from the data statistics. The output data set differs from the input data set but maintains one or more correlations from within the input data set. The correlations may be the inherent correlations between different dimensions of a multidimensional input data set. A significant amount of information from the input data set may be hidden so that the privacy level of the data mining process may be increased.
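By way of illustration only, generating an output data set from statistics of the input data set can be sketched as follows. The abstract does not name the statistics or the generation procedure. This sketch assumes the statistics are the mean vector and covariance matrix, and that output records are drawn from a Gaussian with those parameters via a Cholesky factor. The helper names `mean_and_cov`, `cholesky`, and `synthesize` are hypothetical, and the covariance matrix is assumed to be non-degenerate.

```python
import random
from math import sqrt

def mean_and_cov(data):
    """Data statistics: per-dimension mean and sample covariance matrix."""
    n, d = len(data), len(data[0])
    mean = [sum(r[j] for r in data) / n for j in range(d)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in data) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov

def cholesky(m):
    """Lower-triangular factor L with L * L^T = m (m assumed positive definite)."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = sqrt(max(m[i][i] - s, 0.0))
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low

def synthesize(data, n_out, seed=0):
    """Draw n_out new records that share the input's mean and covariance."""
    rng = random.Random(seed)
    mean, cov = mean_and_cov(data)
    low = cholesky(cov)
    d = len(mean)
    out = []
    for _ in range(n_out):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        out.append([mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                    for i in range(d)])
    return out

# Input with a strong positive correlation between its two dimensions.
original = [(float(i), 2.0 * i + (5.0 if i % 2 else -5.0)) for i in range(50)]
synthetic = synthesize(original, 1000)
```

Only the statistics are consulted when generating output, so the synthetic records are not copies of input records, while the correlation between the two dimensions carries over.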
Abstract:
A technique is provided for classifying time series data using a rule-based wavelet decomposition approach. The method is applicable to a wide variety of time series data sets. The process applies a combination of wavelet decomposition, discretization and rule generation to training time series data in order to classify instances of test time series data. The wavelet decomposition explores the data at varying levels of granularity to classify instances of the test time series data.
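By way of illustration only, the combination of wavelet decomposition, discretization and rule generation can be sketched as follows. The abstract does not specify the wavelet, the discretization, or the rule format. This sketch assumes an unnormalized Haar wavelet, a three-symbol discretization with a hypothetical cut-off `eps`, and single-condition rules of the form "coefficient position has symbol, therefore class", kept when their confidence reaches `min_conf`. Series lengths are assumed to be powers of two.

```python
from collections import Counter, defaultdict

def haar(series):
    """Unnormalized Haar decomposition: [overall average, coarse details, ..., fine details]."""
    coeffs, current = [], list(series)
    while len(current) > 1:
        avgs = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        diffs = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = diffs + coeffs      # coarser levels end up in front of finer ones
        current = avgs
    return current + coeffs

def symbol(c, eps):
    """Map one coefficient to '+', '-' or '0'."""
    if c > eps:
        return "+"
    if c < -eps:
        return "-"
    return "0"

def discretize(coeffs, eps=0.5):
    return tuple(symbol(c, eps) for c in coeffs)

def learn_rules(train, min_conf=0.9):
    """Keep (position, symbol) -> class rules whose confidence is at least min_conf."""
    counts = defaultdict(Counter)
    for series, label in train:
        for item in enumerate(discretize(haar(series))):
            counts[item][label] += 1
    rules = {}
    for item, by_label in counts.items():
        label, hits = by_label.most_common(1)[0]
        if hits / sum(by_label.values()) >= min_conf:
            rules[item] = label
    return rules

def classify(rules, series, default=None):
    """Let every matching rule vote; return the majority class."""
    items = enumerate(discretize(haar(series)))
    votes = Counter(rules[item] for item in items if item in rules)
    return votes.most_common(1)[0][0] if votes else default

train = [
    ([1, 2, 3, 4], "up"), ([2, 3, 4, 5], "up"),
    ([4, 3, 2, 1], "down"), ([5, 4, 3, 2], "down"),
]
rules = learn_rules(train)
```

On this toy data the only rules that survive refer to the coarsest detail coefficient, which captures the overall direction of the series. Finer coefficients would matter for classes that differ in local shape.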
Abstract:
Techniques are provided for finding query responses from database queries using an interactive process between a user (e.g., a person entering a query to a database) and a computer system (e.g., a computing system upon which the database resides or which has access to the database). The interactive process comprises providing the user with one or more visual perspectives as feedback on the distribution of points in the database. These visual perspectives may be considered by the user in order for the user to provide feedback to the computer system. The computer system may then use the user-provided feedback to determine the best response to the query.
Abstract:
In one aspect of the invention, a method of performing a conceptual similarity search comprises the steps of: generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search; building a conceptual index of documents with the one or more word-chains; and evaluating a similarity query using the conceptual index. The evaluating step preferably returns one or more of the closest documents resulting from the search; one or more matching word-chains in the one or more documents; and one or more matching topical words of the one or more documents.
Abstract:
A method is provided for automatically generating associations of items included in a database. A user first specifies a support criterion indicating a strength of desired associations of items contained in the database. Then, a recursive program is executed for generating a hierarchical tree structure comprising one or more levels of database itemsets, with each itemset representing item associations determined to have satisfied the specified support criterion. The recursive program includes steps of: characterizing nodes of the tree structure as being either active and enabling generation of new nodes at a new level of the tree, or inactive, at any given time; enabling traversal of the tree structure in a predetermined manner and projecting each of the transactions included in the database onto currently active nodes of the tree structure to generate projected transaction results; and counting the projected transaction results of the projected transactions at the active nodes to determine whether the further itemsets satisfy the specified support criterion. All itemsets meeting the specified support criterion are added to the tree structure at a new level.
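By way of illustration only, the recursive projection-and-count idea can be sketched as follows. This is a simplified depth-first variant and not the traversal order or the active/inactive node bookkeeping of the method itself. It assumes the support criterion is a minimum absolute count and that items are ordered lexicographically. The names `frequent_itemsets` and `expand` are hypothetical.

```python
from collections import Counter

def frequent_itemsets(transactions, min_support):
    """Return {itemset tuple: support count} for all itemsets meeting min_support."""
    result = {}

    def expand(prefix, projected):
        # Count candidate extensions using only the transactions projected onto this node.
        counts = Counter(item for t in projected for item in t)
        for item in sorted(i for i, c in counts.items() if c >= min_support):
            itemset = prefix + (item,)
            result[itemset] = counts[item]          # new node one level below prefix
            # Project: keep transactions containing item, restricted to later items.
            child = [tuple(x for x in t if x > item) for t in projected if item in t]
            expand(itemset, [t for t in child if t])

    expand((), [tuple(sorted(set(t))) for t in transactions])
    return result

baskets = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
itemsets = frequent_itemsets(baskets, min_support=3)
```

Because each node counts only within its projected transactions, the work at deeper levels shrinks as the itemsets grow. Here `("a", "b", "c")` is never added because only two baskets survive the projection onto `("a", "b")`.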
Abstract:
A method (and system) for supervised network clustering includes receiving and reading node labels from a plurality of nodes on a network, as executed by a processor on a computer having access to the network, the network defined as a group of entities interconnected by links. The node labels are used to define densities associated with the nodes. Node components are extracted from the network by applying thresholds to the densities. Smaller components having a size below a user-defined threshold are merged.
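By way of illustration only, the sequence of label-based densities, thresholding, component extraction and merging can be sketched as follows. The abstract does not define the density or the merge rule, so both are assumptions here. Density is taken to be the share of a node's neighbours carrying the node's own label, components are connected components among nodes whose density meets the threshold, and all components below `min_size` are merged into one group. Nodes below the density threshold are left unassigned. The function name and parameters are hypothetical.

```python
from collections import defaultdict

def supervised_components(edges, labels, density_threshold=0.6, min_size=3):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Assumed density: fraction of neighbours that share the node's label.
    density = {u: sum(labels[v] == labels[u] for v in adj[u]) / len(adj[u])
               for u in adj}
    dense = {u for u in adj if density[u] >= density_threshold}

    # Connected components restricted to dense nodes (iterative depth-first search).
    comp_of, comps = {}, []
    for start in sorted(dense):
        if start in comp_of:
            continue
        comp_of[start] = len(comps)
        stack, members = [start], []
        while stack:
            u = stack.pop()
            members.append(u)
            for v in adj[u]:
                if v in dense and v not in comp_of:
                    comp_of[v] = len(comps)
                    stack.append(v)
        comps.append(sorted(members))

    big = [c for c in comps if len(c) >= min_size]
    small = sorted(u for c in comps if len(c) < min_size for u in c)
    # Assumed merge rule: pool every undersized component into a single group.
    return big + ([small] if small else [])

edges = [(1, 2), (2, 3), (1, 3),      # triangle labelled "x"
         (4, 5), (5, 6), (4, 6),      # triangle labelled "y"
         (3, 7), (7, 4),              # node 7 bridges the two triangles
         (8, 9), (10, 11)]            # two isolated pairs
labels = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y", 7: "x",
          8: "x", 9: "x", 10: "y", 11: "y"}
clusters = supervised_components(edges, labels)
```

The bridge node has mixed-label neighbours, so its density falls below the threshold and the two triangles come out as separate components even though they are connected in the raw network.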
Abstract:
A system and method are provided for resource adaptive classification of data streams. Embodiments classify data received in a computer by discretizing the received data, constructing an intermediate data structure from the received data as training instances, performing subspace sampling on the received data as test instances, and adaptively classifying the received data based on statistics of the subspace sampling.
Abstract:
An object and attributes that describe that object are identified. The attributes are grouped into attribute patterns, and classification classes are identified. For each identified class a sketch table containing a plurality of parallel hash tables is created. For the object to be classified, each attribute pattern is processed using all of the hash functions for each sketch table, resulting in a plurality of values under each sketch table for a single attribute pattern. The lowest value is selected for each sketch table. The distribution of values across all sketch tables is evaluated for each attribute pattern, producing a discriminatory power for each attribute pattern. Attribute patterns having a discriminatory power above a given threshold are selected and added to associated sketch table values. The sketch table with the largest overall sum is identified, and the associated class is assigned to the object described by the attribute patterns.
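By way of illustration only, the per-class sketch tables and the lowest-value selection can be sketched as follows. Taking the lowest of several hashed counters is the estimate used by a count-min sketch, which this sketch adopts. The abstract does not define discriminatory power, so it is assumed here to be the largest class's share of the summed estimates. The hash construction, table sizes, class `SketchTable`, and functions `train` and `classify` are hypothetical.

```python
import hashlib

class SketchTable:
    """One class's sketch: `depth` parallel hash tables of `width` counters each."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, pattern):
        # One hash function per row, derived by salting SHA-256 with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{pattern}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, pattern):
        for row, col in self._cells(pattern):
            self.rows[row][col] += 1

    def estimate(self, pattern):
        # The lowest counter is the one least inflated by hash collisions.
        return min(self.rows[row][col] for row, col in self._cells(pattern))

def train(examples):
    """Build one sketch table per class from (attribute patterns, class) pairs."""
    tables = {}
    for patterns, label in examples:
        table = tables.setdefault(label, SketchTable())
        for p in patterns:
            table.add(p)
    return tables

def classify(tables, patterns, min_power=0.7):
    totals = {label: 0 for label in tables}
    for p in patterns:
        est = {label: t.estimate(p) for label, t in tables.items()}
        total = sum(est.values())
        if total == 0:
            continue                      # pattern never seen in training
        power = max(est.values()) / total  # assumed discriminatory power
        if power >= min_power:
            for label, value in est.items():
                totals[label] += value
    return max(totals, key=totals.get)

examples = [
    (("free", "win"), "spam"), (("free", "cash"), "spam"), (("win", "cash"), "spam"),
    (("meeting", "notes"), "ham"), (("meeting", "lunch"), "ham"), (("notes", "lunch"), "ham"),
]
tables = train(examples)
predicted = classify(tables, ("free", "meeting", "win"))
```

The sketch tables use fixed memory regardless of how many distinct attribute patterns occur, at the cost of occasional overestimates when patterns collide in every row.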