-
Publication number: US11544630B2
Publication date: 2023-01-03
Application number: US16417145
Filing date: 2019-05-20
Applicant: ORACLE INTERNATIONAL CORPORATION
Inventor: Tomas Karnagel, Sam Idicula, Nipun Agarwal
Abstract: The present invention relates to dimensionality reduction for machine learning (ML) models. Herein are techniques that individually rank features and combine features based on their rank to achieve an optimal combination of features that may accelerate training and/or inferencing, prevent overfitting, and/or provide insights into somewhat mysterious datasets. In an embodiment, a computer calculates, for each feature of a training dataset, a relevance score based on: a relevance scoring function, and statistics of values, of the feature, that occur in the training dataset. A rank based on relevance scores of the features is calculated for each feature. A sequence of distinct subsets of the features, based on the ranks of the features, is generated. For each distinct subset of the sequence of distinct feature subsets, a fitness score is generated based on training a machine learning (ML) model that is configured for the distinct subset.
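The rank-then-combine flow in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the relevance scoring function (absolute Pearson correlation), the choice of top-k prefixes as the "sequence of distinct subsets", and the caller-supplied `fitness` callback are all assumptions introduced here.

```python
from statistics import mean, pstdev

def relevance(values, labels):
    """Assumed scoring function: absolute Pearson correlation with the label."""
    sx, sy = pstdev(values), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    mx, my = mean(values), mean(labels)
    cov = mean((x - mx) * (y - my) for x, y in zip(values, labels))
    return abs(cov / (sx * sy))

def rank_features(columns, labels):
    """Return feature names ordered from most to least relevant."""
    scores = {name: relevance(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

def subset_sequence(ranked):
    """Assumed subset sequence: top-1, top-2, ..., top-n features by rank."""
    for k in range(1, len(ranked) + 1):
        yield ranked[:k]

def best_subset(columns, labels, fitness):
    """Score each subset with a hypothetical fitness callback; keep the best.

    In practice `fitness` would train an ML model on the subset and return
    a validation score; here it is whatever the caller passes in.
    """
    ranked = rank_features(columns, labels)
    return max(subset_sequence(ranked), key=lambda s: fitness(s, columns, labels))
```

Because the subsets are prefixes of the ranking, only n candidate subsets are trained instead of all 2^n combinations, which is one plausible reading of how ranking keeps the search cheap.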
-
Publication number: US11782926B2
Publication date: 2023-10-10
Application number: US17573897
Filing date: 2022-01-12
Applicant: Oracle International Corporation
Inventor: Sam Idicula, Tomas Karnagel, Jian Wen, Seema Sundara, Nipun Agarwal, Mayur Bency
IPC: G06F16/2453, G06N20/00, G06F16/21, G06N20/20
CPC classification number: G06F16/24545, G06F16/217, G06N20/00, G06N20/20
Abstract: Embodiments utilize trained query performance machine learning (QP-ML) models to predict an optimal compute node cluster size for a given in-memory workload. The QP-ML models include models that predict query task runtimes at various compute node cardinalities, and models that predict network communication time between nodes of the cluster. Embodiments also utilize an analytical model to predict overlap between predicted task runtimes and predicted network communication times. Based on this data, an optimal cluster size is selected for the workload. Embodiments further utilize trained data capacity machine learning (DC-ML) models to predict a minimum number of compute nodes needed to run a workload. The DC-ML models include models that predict the size of the workload dataset in a target data encoding, models that predict the amount of memory needed to run the queries in the workload, and models that predict the memory needed to accommodate changes to the dataset.
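A rough sketch of the final selection step might look like the following. The three predictors are hypothetical callables standing in for the trained QP-ML models and the analytical overlap model; combining them as task time plus network time minus overlap, and picking the size with the lowest total, are assumptions made for illustration. `min_nodes` stands in for the lower bound that the DC-ML models would supply.

```python
def choose_cluster_size(sizes, task_time, network_time, overlap, min_nodes=1):
    """Pick the candidate node count with the lowest predicted total runtime.

    task_time(n) and network_time(n) are hypothetical stand-ins for the
    QP-ML predictions; overlap(t, c) stands in for the analytical model.
    """
    def total(n):
        t, c = task_time(n), network_time(n)
        return t + c - overlap(t, c)  # assumed way of combining the predictions

    # Sizes below the memory-driven minimum cannot hold the workload at all.
    feasible = [n for n in sizes if n >= min_nodes]
    if not feasible:
        raise ValueError("no candidate size meets the minimum node count")
    return min(feasible, key=total)
```

With toy predictors where task time shrinks with more nodes and network time grows, the minimum lands at an intermediate size rather than at the largest cluster.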
-
Publication number: US11520834B1
Publication date: 2022-12-06
Application number: US17387841
Filing date: 2021-07-28
Applicant: Oracle International Corporation
Inventor: Tomas Karnagel, Suratna Budalakoti, Onur Kocberber, Nipun Agarwal, Alan Wood
IPC: G06F16/00, G06F16/9035
Abstract: Techniques are described for generating an approximate frequency histogram using a series of Bloom filters (BF). For example, to estimate the f1 and f2 cardinalities in a dataset, an ordered chain of three BFs is established (“BF1”, “BF2”, and “BF3”). An insertion operation is performed for each datum in the dataset, whereby the BFs are tested in order (starting at BF1) for the datum. If the datum is represented in a currently-tested BF, the subsequent BF in the chain is tested for the datum. If the datum is not represented in the currently-tested BF, the datum is added to the BF, a counter for the BF is incremented, and the insertion operation for the current datum ends. To estimate the cardinality of f1-values in the dataset, the BF2-counter is subtracted from the BF1-counter. Similarly, to estimate the cardinality of f2-values in the dataset, the BF3-counter is subtracted from the BF2-counter.
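The insertion and subtraction steps described in this abstract can be sketched as below. The Bloom filter itself is a deliberately simple stand-in (bit array plus salted BLAKE2b hashes); its sizing parameters and the class names `BloomFilter` and `BloomChain` are assumptions for illustration, not details from the patent.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; sizes and hash choice are illustrative only."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        data = repr(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

class BloomChain:
    """Ordered chain of Bloom filters, each with an insertion counter."""

    def __init__(self, length=3, **bf_kwargs):
        self.filters = [BloomFilter(**bf_kwargs) for _ in range(length)]
        self.counters = [0] * length

    def insert(self, datum):
        # Test the filters in order; add to the first one that lacks the datum.
        for i, bf in enumerate(self.filters):
            if datum not in bf:
                bf.add(datum)
                self.counters[i] += 1
                return
        # Present in every filter: the datum occurs more often than tracked.

    def estimate(self, k):
        """Estimated count of distinct values occurring exactly k times."""
        return self.counters[k - 1] - self.counters[k]
```

With the default chain of three filters, `estimate(1)` and `estimate(2)` correspond to the f1 and f2 cardinalities. The results are approximate because Bloom filter false positives can push a datum further down the chain than its true frequency warrants.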
-
Publication number: US20220138199A1
Publication date: 2022-05-05
Application number: US17573897
Filing date: 2022-01-12
Applicant: Oracle International Corporation
Inventor: Sam Idicula, Tomas Karnagel, Jian Wen, Seema Sundara, Nipun Agarwal, Mayur Bency
IPC: G06F16/2453, G06N20/00, G06F16/21, G06N20/20
Abstract: Embodiments utilize trained query performance machine learning (QP-ML) models to predict an optimal compute node cluster size for a given in-memory workload. The QP-ML models include models that predict query task runtimes at various compute node cardinalities, and models that predict network communication time between nodes of the cluster. Embodiments also utilize an analytical model to predict overlap between predicted task runtimes and predicted network communication times. Based on this data, an optimal cluster size is selected for the workload. Embodiments further utilize trained data capacity machine learning (DC-ML) models to predict a minimum number of compute nodes needed to run a workload. The DC-ML models include models that predict the size of the workload dataset in a target data encoding, models that predict the amount of memory needed to run the queries in the workload, and models that predict the memory needed to accommodate changes to the dataset.
-
Publication number: US20210365805A1
Publication date: 2021-11-25
Application number: US16877882
Filing date: 2020-05-19
Applicant: Oracle International Corporation
Inventor: Tomas Karnagel, Onur Kocberber, Farhan Tauheed, Nipun Agarwal
Abstract: Techniques for estimating the number of distinct values in a data set using machine learning are provided. In one technique, a sample of a data set is retrieved where the sample is a strict subset of the data set. The sample is analyzed to identify feature values of multiple features of the sample. The feature values are inserted into a machine-learned model that computes a prediction regarding a number of distinct values in the data set. An estimated number of distinct values that is based on the prediction is stored in association with the data set.
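As a hedged illustration of the sample-to-prediction flow, the sketch below derives a few features from a sample and hands them to a model. The abstract does not say which features are used; the ones here (sample fraction, distinct count in the sample, and the number of values seen once or twice) are assumptions, and `model` is a hypothetical callable standing in for the machine-learned model.

```python
from collections import Counter

def sample_features(sample, total_rows):
    """Assumed feature set computed from a sample of the data set."""
    counts = Counter(sample)                  # value -> occurrences in sample
    freq_of_freq = Counter(counts.values())   # occurrences -> number of values
    return {
        "sample_fraction": len(sample) / total_rows,
        "sample_distinct": len(counts),
        "f1": freq_of_freq[1],  # values seen exactly once
        "f2": freq_of_freq[2],  # values seen exactly twice
    }

def estimate_ndv(sample, total_rows, model):
    """Feed the features to a hypothetical model and clamp its prediction."""
    feats = sample_features(sample, total_rows)
    prediction = model(feats)
    # The true count is at least what the sample shows and at most the row count.
    return max(feats["sample_distinct"], min(total_rows, round(prediction)))
```

For experimentation, a naive scale-up such as `lambda f: f["sample_distinct"] / f["sample_fraction"]` can be passed as `model`; a trained regressor would take its place in a real system.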
-
Publication number: US12014286B2
Publication date: 2024-06-18
Application number: US16914816
Filing date: 2020-06-29
Applicant: Oracle International Corporation
Inventor: Farhan Tauheed, Onur Kocberber, Tomas Karnagel, Nipun Agarwal
CPC classification number: G06N5/04, G06F16/2282, G06N20/00
Abstract: Herein are approaches for self-optimization of a database management system (DBMS) such as in real time. Adaptive just-in-time sampling techniques herein estimate database content statistics that a machine learning (ML) model may use to predict configuration settings that conserve computer resources such as execution time and storage space. In an embodiment, a computer repeatedly samples database content until a dynamic convergence criterion is satisfied. In each iteration of a series of sampling iterations, a subset of rows of a database table are sampled, and estimates of content statistics of the database table are adjusted based on the sampled subset of rows. Immediately or eventually after detecting dynamic convergence, a machine learning (ML) model predicts, based on the content statistic estimates, an optimal value for a configuration setting of the DBMS.
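The iterative sampling loop can be sketched as follows. Several simplifications are assumed: the column mean stands in for the "content statistics", a fixed relative-change tolerance held over a few consecutive rounds stands in for the dynamic convergence criterion, and the downstream ML prediction of a configuration setting is not shown.

```python
import random

def adaptive_sample(rows, batch_size=100, tolerance=0.01, stable_rounds=3, seed=0):
    """Sample batches of rows until the running mean stops moving.

    Returns (estimate, rows_sampled). All parameters are illustrative.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # sample without replacement in random order

    seen, total, estimate, stable = 0, 0.0, None, 0
    while seen < len(order):
        batch = order[seen:seen + batch_size]
        total += sum(rows[i] for i in batch)
        seen += len(batch)
        new_estimate = total / seen

        # Count consecutive rounds in which the estimate barely changed.
        if estimate is not None and abs(new_estimate - estimate) <= tolerance * max(abs(estimate), 1e-12):
            stable += 1
        else:
            stable = 0
        estimate = new_estimate

        if stable >= stable_rounds:
            break  # converged before scanning the whole table
    return estimate, seen
```

The point of the loop is that a well-behaved column converges after a small fraction of the table is read, while a skewed one keeps sampling until the estimate settles or the table is exhausted.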
-
Publication number: US11615265B2
Publication date: 2023-03-28
Application number: US16547312
Filing date: 2019-08-21
Applicant: Oracle International Corporation
Inventor: Tomas Karnagel, Sam Idicula, Hesam Fathi Moghadam, Nipun Agarwal
Abstract: The present invention relates to dimensionality reduction for machine learning (ML) models. Herein are techniques that individually rank features and combine features based on their rank to achieve an optimal combination of features that may accelerate training and/or inferencing, prevent overfitting, and/or provide insights into somewhat mysterious datasets. In an embodiment, a computer ranks features of datasets of a training corpus. For each dataset and for each landmark percentage, a target ML model is configured to receive only a highest ranking landmark percentage of features, and a landmark accuracy achieved by training the ML model with the dataset is measured. Based on the landmark accuracies and meta-features values of the dataset, a respective training tuple is generated for each dataset. Based on all of the training tuples, a regressor is trained to predict an optimal amount of features for training the target ML model.
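One way to picture the per-dataset tuple generation is sketched below. The helper names, the rounding rule for a landmark percentage, and the `train_and_score` callback (which would train the target ML model on the given features and return its accuracy) are assumptions for illustration. How the regressor's target value is derived is not stated in the abstract, so the sketch stops at the input tuple.

```python
import math

def top_percentage(ranked_features, percentage):
    """Keep the highest-ranked `percentage` of features (at least one)."""
    keep = max(1, math.ceil(len(ranked_features) * percentage / 100))
    return ranked_features[:keep]

def training_tuple(ranked_features, meta_features, landmarks, train_and_score):
    """Build one tuple: dataset meta-features followed by landmark accuracies.

    train_and_score(features) is a hypothetical callback that trains the
    target model on only those features and returns the measured accuracy.
    """
    accuracies = [
        train_and_score(top_percentage(ranked_features, p)) for p in landmarks
    ]
    return tuple(meta_features) + tuple(accuracies)
```

One such tuple would be produced per dataset in the training corpus, and the collection of tuples would then be used to fit the regressor.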
-
Publication number: US11567937B2
Publication date: 2023-01-31
Application number: US17318972
Filing date: 2021-05-12
Applicant: Oracle International Corporation
Inventor: Sam Idicula, Tomas Karnagel, Jian Wen, Seema Sundara, Nipun Agarwal, Mayur Bency
IPC: G06F16/2453, G06N20/00, G06F16/21, G06N20/20
Abstract: Embodiments implement a prediction-driven, rather than a trial-driven, approach to automate database configuration parameter tuning for a database workload. This approach uses machine learning (ML) models to test performance metrics resulting from application of particular database parameters to a database workload, and does not require live trials on the DBMS managing the workload. Specifically, automatic configuration (AC) ML models are trained, using a training corpus that includes information from workloads being run by DBMSs, to predict performance metrics based on workload features and configuration parameter values. The trained AC-ML models predict performance metrics resulting from applying particular configuration parameter values to a given database workload being automatically tuned. Based on correlating changes to configuration parameter values with changes in predicted performance metrics, an optimization algorithm is used to converge to an optimal set of configuration parameters. The optimal set of configuration parameter values is automatically applied for the given workload.
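The prediction-driven loop can be illustrated with a small sketch in which every candidate configuration is scored by a model rather than by a live trial. The abstract does not name the optimization algorithm; coordinate-wise hill climbing is swapped in here as an assumption, and `predict` is a hypothetical callable standing in for the trained AC-ML models (lower score is treated as better).

```python
def tune(predict, workload, candidates, start):
    """Search configuration values using predicted metrics only.

    predict(workload, config) -> predicted cost (hypothetical model call).
    candidates maps each parameter name to the values worth trying.
    """
    config = dict(start)
    best = predict(workload, config)
    improved = True
    while improved:
        improved = False
        for name, values in candidates.items():
            for value in values:
                trial = {**config, name: value}
                score = predict(workload, trial)
                if score < best:  # keep only strict improvements
                    config, best, improved = trial, score, True
    return config, best
```

Because each step calls the model instead of the DBMS, many configurations can be evaluated without disturbing the running workload; only the final configuration would be applied.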
-
Publication number: US11061902B2
Publication date: 2021-07-13
Application number: US16298837
Filing date: 2019-03-11
Applicant: Oracle International Corporation
Inventor: Sam Idicula, Tomas Karnagel, Jian Wen, Seema Sundara, Nipun Agarwal, Mayur Bency
IPC: G06F16/2453, G06N20/00, G06F16/21, G06N20/20
Abstract: Embodiments implement a prediction-driven, rather than a trial-driven, approach to automate database configuration parameter tuning for a database workload. This approach uses machine learning (ML) models to test performance metrics resulting from application of particular database parameters to a database workload, and does not require live trials on the DBMS managing the workload. Specifically, automatic configuration (AC) ML models are trained, using a training corpus that includes information from workloads being run by DBMSs, to predict performance metrics based on workload features and configuration parameter values. The trained AC-ML models predict performance metrics resulting from applying particular configuration parameter values to a given database workload being automatically tuned. Based on correlating changes to configuration parameter values with changes in predicted performance metrics, an optimization algorithm is used to converge to an optimal set of configuration parameters. The optimal set of configuration parameter values is automatically applied for the given workload.
-
Publication number: US11620547B2
Publication date: 2023-04-04
Application number: US16877882
Filing date: 2020-05-19
Applicant: Oracle International Corporation
Inventor: Tomas Karnagel, Onur Kocberber, Farhan Tauheed, Nipun Agarwal
Abstract: Techniques for estimating the number of distinct values in a data set using machine learning are provided. In one technique, a sample of a data set is retrieved where the sample is a strict subset of the data set. The sample is analyzed to identify feature values of multiple features of the sample. The feature values are inserted into a machine-learned model that computes a prediction regarding a number of distinct values in the data set. An estimated number of distinct values that is based on the prediction is stored in association with the data set.