ESTIMATING OPTIMAL TRAINING DATA SET SIZES FOR MACHINE LEARNING MODEL SYSTEMS AND APPLICATIONS

    公开(公告)号:US20230376849A1

    公开(公告)日:2023-11-23

    申请号:US18318212

    申请日:2023-05-16

    CPC classification number: G06N20/00

    Abstract: In various examples, estimating optimal training data set sizes for machine learning model systems and applications. Systems and methods are disclosed that estimate an amount of data to include in a training data set, where the training data set is then used to train one or more machine learning models to reach a target validation performance. To estimate the amount of training data, subsets of an initial training data set may be used to train the machine learning model(s) in order to determine estimates for the minimum amount of training data needed to train the machine learning model(s) to reach the target validation performance. The estimates may then be used to generate one or more functions, such as a cumulative density function and/or a probability density function, wherein the function(s) is then used to estimate the amount of training data needed to train the machine learning model(s).

    OPTIMIZED ACTIVE LEARNING USING INTEGER PROGRAMMING

    公开(公告)号:US20230244985A1

    公开(公告)日:2023-08-03

    申请号:US17591039

    申请日:2022-02-02

    CPC classification number: G06N20/00

    Abstract: In various examples, a representative subset of data points are queried or selected using integer programming to minimize the Wasserstein distance between the selected data points and the data set from which they were selected. A Generalized Benders Decomposition (GBD) may be used to decompose and iteratively solve the minimization problem, providing a globally optimal solution (an identified subset of data points that match the distribution of their data set) within a threshold tolerance. Data selection may be accelerated by applying one or more constraints while iterating, such as optimality cuts that leverage properties of the Wasserstein distance and/or pruning constraints that reduce the search space of candidate data points. In an active learning implementation, a representative subset of unlabeled data points may be selected using GBD, labeled, and used to train machine learning model(s) over one or more cycles of active learning.

Patent Agency Ranking