Abstract:
A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
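
The grouping steps above can be sketched in Python as follows. This is a minimal illustration, not the claimed method: the Jaccard measure over field-tagged tokens, the use of connected components to form groups, and the "highest total similarity" rule for choosing the canonical record are all assumptions, and the function names are hypothetical.

```python
from itertools import combinations

def tokens(record):
    """Split each attribute value into tokens, tagged with the field name."""
    return {(field, tok)
            for field, value in record.items()
            for tok in str(value).lower().split()}

def similarity(a, b):
    """Jaccard similarity over field-tagged tokens (an assumed measure)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def group_records(records, threshold):
    """Group record indices by connected components of the similarity graph.

    An edge joins two records when their score is greater than `threshold`.
    Returns (groups, scores), where scores maps (i, j) with i < j to a score.
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    scores = {}
    for i, j in combinations(range(len(records)), 2):
        s = similarity(records[i], records[j])
        scores[(i, j)] = s
        if s > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values()), scores

def canonical(group, scores):
    """Pick the member with the highest total similarity to its group
    (an assumed selection rule)."""
    def total(i):
        return sum(scores[(min(i, j), max(i, j))] for j in group if j != i)
    return max(group, key=total)
```

Connected components are transitive, so two records can share a group without being directly similar to each other; a stricter grouping rule would need a different graph criterion.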
Abstract:
A system that facilitates estimating functional relationships associated with one or more columns in a database comprises a sampling component that receives a random sample of records within the database. An estimate generator component calculates an estimate of strength of functional relationships based at least in part upon the received samples. For example, the estimate generator component can calculate an estimate of strength of a column as a key column based at least in part upon the received samples.
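
As a rough illustration of estimating key strength from a sample, the sketch below returns the ratio of distinct values to sampled rows. The abstract does not specify the estimator, so this ratio is an assumption, and the function name and row representation (a list of dicts) are hypothetical.

```python
import random

def key_strength_estimate(rows, column, sample_size, seed=0):
    """Ratio of distinct values to sampled rows for `column`.

    A value of 1.0 means no duplicates were seen in the sample; values
    near 0 mean the column is far from being a key.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    values = [row[column] for row in sample]
    return len(set(values)) / len(values) if values else 0.0
```

This naive ratio tends to overstate key strength on small samples, because duplicates that exist in the full table are less likely to appear together in the sample.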
Abstract:
Techniques for estimating the progress of database queries are described herein. In a first implementation, a respective lower-bound parameter is associated with each node in an operator tree that represents a given database query, and the progress of the database query at a given point is estimated based upon the lower-bound parameters. In a second implementation, the progress of the query is estimated by associating respective lower-bound and upper-bound parameters with each node in the operator tree. The progress of the query at the given point is then estimated based on the lower-bound and upper-bound parameters. The progress estimate is computed by dividing the work done so far by the sums of the above averages for each node in the tree.
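
The second implementation can be sketched as below. The abstract does not define "the above averages"; this sketch assumes they are the per-node means of the lower and upper bounds, and that work is counted in tuples processed. The `OpNode` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class OpNode:
    name: str
    done: int    # tuples processed so far by this operator
    lower: int   # lower bound on the tuples it will process in total
    upper: int   # upper bound on the tuples it will process in total
    children: list = field(default_factory=list)

def walk(node):
    """Yield every node in the operator tree, parent first."""
    yield node
    for child in node.children:
        yield from walk(child)

def estimate_progress(root):
    """Work done so far divided by the sum of per-node bound averages."""
    nodes = list(walk(root))
    work_done = sum(n.done for n in nodes)
    expected_total = sum((n.lower + n.upper) / 2 for n in nodes)
    if expected_total == 0:
        return 1.0
    # Clamp to 1.0 in case the bound averages understate the real work.
    return min(1.0, work_done / expected_total)
```

For example, a scan that has read 500 of exactly 1000 tuples feeding a filter that has emitted 100 tuples with bounds 100 and 1000 gives 600 / 1550, or roughly 39 percent.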
Abstract:
A method of estimating selectivity of a given string predicate in a database query. In the method, selectivities of substrings of various substring lengths are estimated. For example, the selectivity of substrings from length 1 (or some constant q) to the length of the given string predicate may be estimated. The method then selects a candidate substring for each substring length based on estimated selectivities of the substrings. The estimated selectivities of the candidate substrings are combined. The combined estimated selectivity of the candidate substrings is returned as the estimated selectivity of the given string predicate.
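
The overall shape of this method can be sketched as a skeleton with pluggable parts. The abstract does not say how substring selectivities are estimated, how the candidate is chosen for each length, or how candidates are combined, so `estimate`, `pick`, and `combine` are hypothetical parameters; the `min` defaults are placeholders, not the claimed rules.

```python
def estimate_predicate_selectivity(pattern, estimate, q=1, pick=min, combine=min):
    """Combine per-length candidate substring selectivities.

    pattern  -- the string predicate
    estimate -- caller-supplied function mapping a substring to an
                estimated selectivity in [0, 1]
    q        -- shortest substring length considered
    pick     -- chooses one candidate selectivity per length
    combine  -- merges the per-length candidates into one estimate
    """
    if q < 1 or q > len(pattern):
        raise ValueError("q must be between 1 and len(pattern)")
    candidates = []
    for length in range(q, len(pattern) + 1):
        # All distinct substrings of this length.
        subs = {pattern[i:i + length]
                for i in range(len(pattern) - length + 1)}
        candidates.append(pick(estimate(s) for s in subs))
    return combine(candidates)
```

Any string containing the whole predicate also contains each of its substrings, so when the per-substring estimates are exact, the minimum over substrings is an upper bound on the true selectivity.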
Abstract:
A system that facilitates automatic selection of a physical configuration of a database comprises an optimizer component that determines simulated physical structures and creates a hypothetical configuration based thereon. A reduction component progressively reduces the size of the hypothetical configuration until its size is below a threshold. For example, the simulated physical structures can be based at least in part upon a workload.
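
The reduction step can be illustrated with a simple greedy loop. The abstract does not say how the configuration is shrunk, so dropping the structure with the lowest benefit per unit of size is an assumption, as are the `(name, size, benefit)` tuples and the function name.

```python
def reduce_configuration(structures, size_threshold):
    """Drop structures until the total size is below `size_threshold`.

    structures -- list of (name, size, benefit) tuples with size > 0,
                  where benefit is an assumed per-structure score
    Returns the structures that remain, least valuable first.
    """
    # Sort by benefit per unit of size, least valuable first.
    config = sorted(structures, key=lambda s: s[2] / s[1])
    total = sum(size for _, size, _ in config)
    while config and total >= size_threshold:
        _, size, _ = config.pop(0)
        total -= size
    return config
```

A real tuning tool could also shrink a configuration by merging or narrowing structures rather than only dropping them; this sketch covers the dropping case alone.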
Abstract:
An automated physical database design tool may provide an integrated physical design recommendation for horizontal partitioning, indexes and indexed views, all three features being tuned together (in concert). Manageability requirements may be specified when optimizing for performance. User-specified configuration may enable the specification of a partial physical design without materialization of the physical design. The tuning process may be performed for a production server but may be conducted substantially on a test server. Secondary indexes may be suggested for XML columns. Tuning of a database may be invoked by any owner of a database. Usage of objects may be evaluated and a recommendation for dropping unused objects may be issued. Reports may be provided concerning the count and percentage of queries in the workload that reference a particular database, and/or the count and percentage of queries in the workload that reference a particular table or column. A feature may be provided whereby a weight may be associated with each statement in the workload, enabling relative importance of particular statements to be specified. An in-row length for a column may be specified. If a value for the column exceeds the specified in-row length for that column, the portion of the value not exceeding the specified in-row length may be stored in the row while the portion of the value exceeding the specified in-row length may be stored in an overflow area. Rebuild and reorganization recommendations may be generated.
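
The in-row length rule described above amounts to splitting a value at a fixed offset. A minimal sketch, assuming values are byte strings and using a hypothetical function name:

```python
def split_for_storage(value: bytes, in_row_length: int):
    """Return (in_row_part, overflow_part) for a column value.

    The first `in_row_length` bytes stay in the row; anything beyond
    that goes to the overflow area. Short values have empty overflow.
    """
    if in_row_length < 0:
        raise ValueError("in_row_length must be non-negative")
    return value[:in_row_length], value[in_row_length:]
```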
Abstract:
The results of a database query are estimated by performing a sampling of weighted tuples in a database, where the weighting is based on a probability of usage of the tuples required in executing a workload. A probability is associated with each tuple sampled. An aggregate is computed over values in each sampled tuple while multiplying by the inverses of the probabilities associated with each tuple sampled.
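
The inverse-probability weighting can be illustrated with a small sketch. It assumes sampling with replacement and a SUM aggregate, in which case averaging value / probability over the draws is the standard Hansen-Hurwitz estimator. The per-tuple weights stand in for workload usage probabilities and are hypothetical inputs, as are the function names.

```python
import random

def weighted_sample(rows, weights, k, seed=0):
    """Draw k rows with replacement, proportionally to `weights`.

    Returns a list of (row, probability) pairs, where probability is the
    per-draw chance of selecting that row.
    """
    rng = random.Random(seed)
    total = sum(weights)
    probs = [w / total for w in weights]
    picks = rng.choices(range(len(rows)), weights=probs, k=k)
    return [(rows[i], probs[i]) for i in picks]

def estimate_sum(sample, value, predicate=lambda row: True):
    """Estimate SUM(value) over rows matching `predicate`.

    Each sampled value is scaled by the inverse of its selection
    probability, then the scaled values are averaged over all draws.
    """
    k = len(sample)
    return sum(value(r) / p for r, p in sample if predicate(r)) / k
```

When the weights happen to be proportional to the values being summed, every draw contributes the same scaled amount and the estimate equals the true sum; in general it is only correct in expectation.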
Abstract:
An XML transformation tool that constructs a relational database with associated physical structures that can be populated with shredded XML data. A mapping transformation enumerator examines queries in the workload and enumerates mapping transformations that use XSD-specific constraints and statistics on XML data and can be used to generate mappings from XSD to relational database schema that may lead to better performance in the presence of physical design. A design tuner searches mappings generated from a default mapping using enumerated transformations, together with physical design structures associated with those mappings, and selects a preferred mapping and the physical design structures. Cost estimates for performing queries in the workload are determined for the relational database implementing the mapping and associated physical design structures.
Abstract:
A method for estimating the result of a query on a database having data records arranged in tables. The database has an expected workload that includes a set of queries that can be executed on the database. An expected workload is derived comprising a set of queries that can be executed on the database. A sample is constructed by selecting data records for inclusion in the sample in a manner that minimizes an estimation error when the data records are acted upon by a query in the expected workload to provide an expected result. The query accesses the sample and is executed on the sample, returning an estimated query result. The expected workload can be constructed by specifying a degree of overlap between records selected by queries in the given workload and records selected by queries in the expected workload.
Abstract:
An index and materialized view selection wizard produces a fast and reasonable recommendation for a configuration of indexes, materialized views, and indexes on materialized views which are beneficial given a specified workload for a given database and database server. Candidate materialized views and indexes are obtained, and a joint enumeration of the combined materialized views and indexes is performed to obtain a recommended configuration. The configuration includes indexes, materialized views and indexes on materialized views. Candidate materialized views are obtained by first determining which subsets of tables are referenced in queries in the workload and then finding interesting table subsets. Next, interesting subsets are considered on a per-query basis to determine which are syntactically relevant for a query. Materialized views which are likely to be used for the workload are then generated along with a set of merged materialized views. Clustered indexes and non-clustered indexes on materialized views are then generated. The indexes, materialized views and indexes on materialized views are then enumerated together to form the recommended configuration.
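
The first two candidate-generation steps can be sketched as follows. The abstract does not define "interesting" or "syntactically relevant", so this sketch assumes a subset is interesting when the queries referencing it account for at least a given fraction of total workload cost, and relevant to a query when all of its tables appear in that query. The function names, the cost figures, and the `max_size` cap are hypothetical.

```python
from itertools import combinations

def interesting_table_subsets(workload, min_cost_fraction, max_size=3):
    """Find table subsets whose referencing queries carry enough cost.

    workload -- list of (tables_referenced, cost) pairs, one per query
    Returns a set of sorted table-name tuples.
    """
    total_cost = sum(cost for _, cost in workload)
    subset_cost = {}
    for tables, cost in workload:
        for size in range(1, min(max_size, len(tables)) + 1):
            for subset in combinations(sorted(tables), size):
                subset_cost[subset] = subset_cost.get(subset, 0) + cost
    return {s for s, c in subset_cost.items()
            if c >= min_cost_fraction * total_cost}

def relevant_subsets(query_tables, interesting):
    """Interesting subsets whose tables all appear in the given query."""
    return {s for s in interesting if set(s) <= set(query_tables)}
```

Capping the subset size keeps the enumeration from growing exponentially on queries that join many tables.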