Abstract:
An unreliable training set is modified to produce a reliable training set for use in supervised classification. The training set is modified by determining which data in the set are incorrect and reconstructing those incorrect data. The reconstruction includes modifying the labels associated with the data so that the labels are correct. The modification can be performed iteratively.
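One way to picture the iterative label repair is the minimal sketch below. It assumes that a label is "incorrect" when it disagrees with the majority label of the k nearest neighbours. That criterion, the function name `repair_labels`, and the parameters `k` and `max_rounds` are illustrative assumptions; the abstract does not state how incorrect data are detected.

```python
from collections import Counter


def repair_labels(points, labels, k=3, max_rounds=10):
    """Relabel points that disagree with the majority of their k nearest neighbours.

    Assumption: neighbour-majority voting stands in for the unspecified
    error-detection step. Repeats until no label changes or max_rounds is hit.
    """
    labels = list(labels)
    for _ in range(max_rounds):
        new_labels = list(labels)
        for i, p in enumerate(points):
            # Indices of the k nearest other points (squared Euclidean distance).
            neighbours = sorted(
                (j for j in range(len(points)) if j != i),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
            )[:k]
            majority, count = Counter(labels[j] for j in neighbours).most_common(1)[0]
            # Only overwrite a label when a strict majority of neighbours disagrees.
            if count > k // 2 and majority != labels[i]:
                new_labels[i] = majority
        if new_labels == labels:
            break  # converged: no label was modified in this round
        labels = new_labels
    return labels


points = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,), (5.1,), (5.2,), (5.3,)]
noisy = ["a", "a", "b", "a", "b", "b", "b", "b"]  # third label is wrong
repaired = repair_labels(points, noisy)
```

Each round votes against the labels from the previous round, so the result does not depend on the order in which points are visited.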
Abstract:
A method and apparatus for minimizing the time required to obtain results for a content-based query in a database. More specifically, the database is partitioned into a plurality of groups. A schedule, or sequence of groups, is then assigned to each operation of the query; the schedule represents the order in which that operation will be applied to the groups. Each schedule is arranged so that each application of the operation operates on the group that will yield intermediate results closest to the final results.
Abstract:
Linear optimization queries, which commonly arise in decision support and resource planning applications, are queries that retrieve the top N data records (where N is an integer greater than zero) that satisfy a specific optimization criterion. The optimization criterion is to either maximize or minimize a linear equation. The coefficients of the linear equation are given at query time. Methods and apparatus are disclosed for constructing, maintaining and utilizing a multidimensional indexing structure of database records to improve the execution speed of linear optimization queries. Database records with numerical attributes are organized into a number of layers, and each layer represents a geometric structure called a convex hull. Such linear optimization queries are processed by searching from the outermost layer of this multi-layer indexing structure inwards. At least one record per layer will satisfy the query criterion, and the number of layers that need to be searched depends on the spatial distribution of records, the query-issued linear coefficients, and N, the number of records to be returned. When N is small compared to the total size of the database, answering the query typically requires searching only a small fraction of all relevant records, resulting in a substantial speedup compared to linearly scanning the entire dataset.
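The layered index can be sketched for two numerical attributes as follows. This is a simplified illustration, not the disclosed apparatus: the names `convex_hull`, `build_layers` and `top_n` are hypothetical, duplicate records are collapsed, and the query simply scans the first N layers rather than stopping as early as possible. Scanning N layers is sufficient because a record on layer j is never better than the best record on each layer outside it, so at least j - 1 records score at least as well.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the 2D convex hull (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def build_layers(points):
    """Peel hulls repeatedly: layer 0 is the outermost hull, and so on inwards."""
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(list(remaining))
        layers.append(hull)
        remaining -= set(hull)
    return layers


def top_n(layers, weights, n):
    """Top-n records maximizing weights . record, reading only the outer n layers."""
    def score(p):
        return weights[0] * p[0] + weights[1] * p[1]

    candidates = [p for layer in layers[:n] for p in layer]
    return sorted(candidates, key=score, reverse=True)[:n]


records = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
layers = build_layers(records)
best_two = top_n(layers, (1, 1), 2)
```

A minimization query can reuse the same index by negating the coefficients.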
Abstract:
An apparatus and method for approximating the data stored in a database by generating multiple projections and representations from the database, such that OLAP queries for the original database (such as aggregation and histogram operations) may be applied to the approximated version of the database, which can be much smaller than the original database. Other aspects optimize a mapping, via a mapping (or dimension) table, of non-numeric or numeric attributes to other numeric attributes such that the error incurred in applying queries to the approximated version of the database is minimized. Still further aspects define boundaries of approximations so that the boundaries are preserved when approximated versions of the database are generated.
Abstract:
An object tracking technique is provided which produces a set of objects, given: (i) a potentially large data set; (ii) a set of dimensions along which the data has been ordered; and (iii) a set of functions for measuring the similarity between data elements. Each of these objects is defined by a list of data elements. Each data element on this list carries the probability that the data element is part of the object. The method produces these lists via an adaptive, knowledge-based search function which directs the search toward high-probability data elements. This reduces the number of data element combinations evaluated while preserving the most flexibility in defining the associations of data elements which comprise an object.
Abstract:
A computer-based technique is provided for retrieving one or more items from a database in response to a query specified by a user via one or more example sets. Preferably the example sets include multiple positive and negative example sets. The method comprises the following steps. First, a scoring function is constructed from the one or more example sets. The scoring function gives higher scores to database items that are more closely related to the query than to database items that are not as closely related to the query. The scoring function is operable for use with a multidimensional indexing structure associated with the database. Then, the one or more database items that have the highest score as computed using the scoring function are retrieved via the multidimensional indexing structure.
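As a rough illustration of query-by-example scoring, the sketch below builds a scoring function from one positive and one negative example set. The specific score (distance to the nearest negative example minus distance to the nearest positive example) and the names `make_scorer` and `retrieve` are assumptions made for this example. A plain scan with `heapq.nlargest` stands in for the multidimensional indexing structure, which the abstract does not detail.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_scorer(positives, negatives):
    """Build a scoring function from example sets (assumed form, see lead-in).

    Items close to a positive example and far from every negative example
    receive higher scores.
    """
    def score(item):
        pos = min(sq_dist(item, p) for p in positives)
        neg = min(sq_dist(item, q) for q in negatives) if negatives else 0.0
        return neg - pos

    return score


def retrieve(database, positives, negatives, k):
    """Return the k highest-scoring items; a linear scan replaces the index here."""
    return heapq.nlargest(k, database, key=make_scorer(positives, negatives))


database = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (3.0, 3.0)]
hits = retrieve(database, positives=[(1.0, 1.0)], negatives=[(5.0, 5.0)], k=2)
```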
Abstract:
A computer system and method for performing similarity searches which is phase and scale insensitive and which allows similarity searches to be performed at a semantic level. Each sequence in a database is preferably segmented at multiple projections and/or resolution levels. The sequences may represent objects having multi-dimensional features such as temporal and/or spatial-temporal data. Preferably, the segmenting logic starts with the finest resolution, and each sequence is parsed into a number of disjoint segments, wherein each segment has uniform features. The uniform features could be segments having a constant slope, or waveform segments representable by a single function. The segments may then be re-sampled into a fixed-length vector with appropriate normalization. A label may also be assigned to each segment via conventional clustering/classification methods. The above steps are iterated at successive projections and/or resolution levels until each sequence in the database has been independently segmented and clustered. Thus, the labels are preferably extracted in a pseudo-hierarchical manner in which the label of the lowest resolution representation of the sequence is extracted first. The representation of each time series at various resolutions and/or projections captures different characteristics of the same time series (or 2D/3D objects). Recall that each segment represents a region having uniform features. The segmentation at each individual resolution and/or projection thus enables recognition or emphasis of different characteristics within segments having uniform features.
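The segmentation and re-sampling steps at a single resolution can be sketched as below, assuming constant slope as the uniform feature. The function names, the linear interpolation, the min-max normalization and the tolerance `tol` are illustrative choices rather than details taken from the abstract. In this simplified version, adjacent segments share their boundary sample, and `length` must be at least 2.

```python
def segment_by_slope(series, tol=1e-9):
    """Split a series into runs of constant slope.

    Returns (start, end) index pairs, inclusive; neighbouring segments share
    the breakpoint sample in this sketch.
    """
    if len(series) < 2:
        return [(0, len(series) - 1)] if series else []
    segments = []
    start = 0
    prev = series[1] - series[0]
    for i in range(1, len(series) - 1):
        slope = series[i + 1] - series[i]
        if abs(slope - prev) > tol:
            segments.append((start, i))  # slope changed: close the current run
            start = i
            prev = slope
    segments.append((start, len(series) - 1))
    return segments


def resample(values, length):
    """Linearly interpolate values onto `length` evenly spaced positions."""
    if len(values) == 1:
        return [float(values[0])] * length
    out = []
    for k in range(length):
        pos = k * (len(values) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def normalize(vec):
    """Min-max normalize to [0, 1]; a flat segment maps to all zeros."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0.0] * len(vec)
    return [(v - lo) / (hi - lo) for v in vec]


def segment_features(series, length=5):
    """Fixed-length, normalized vector for every constant-slope segment."""
    return [
        normalize(resample(series[a:b + 1], length))
        for a, b in segment_by_slope(series)
    ]


series = [0, 1, 2, 3, 3, 3, 3, 1, -1]
segments = segment_by_slope(series)
features = segment_features(series)
```

The resulting vectors all have the same length regardless of segment duration, so they could be passed to a conventional clustering or classification method to assign a label to each segment.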
Abstract:
The similarity measure has been one of the critical issues for successful content-based retrieval. Simple quadratic forms of distance are inadequate, as they do not necessarily correspond to perceived similarity, nor are they adaptive to different applications. This patent application describes a new sequential query processing algorithm for evaluating content-based composite object queries. The composite objects consist of spatial and temporal arrangements of simple objects. The simple objects are defined in terms of spatial, temporal, feature and semantic attributes. The query method defines a process for executing a best-first search for the matches to the query, while providing a flexible framework for broadening the search space as required. The query method guarantees that there are no false dismissals of the candidate composite objects.