USING MULTIPLE TRAINED MODELS TO REDUCE DATA LABELING EFFORTS

    Publication No.: US20240281431A1

    Publication date: 2024-08-22

    Application No.: US18647425

    Filing date: 2024-04-26

    CPC classification number: G06F16/2379 G06N20/00

    Abstract: A method of labeling training data includes inputting a plurality of unlabeled input data samples into each of a plurality of pre-trained neural networks and extracting a set of feature embeddings from multiple layer depths of each of the plurality of pre-trained neural networks. The method also includes generating a plurality of clusterings from the set of feature embeddings. The method also includes analyzing, by a processing device, the plurality of clusterings to identify a subset of the plurality of unlabeled input data samples that belong to a same unknown class. The method also includes assigning pseudo-labels to the subset of the plurality of unlabeled input data samples.
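One way to read the "analyzing the plurality of clusterings" step is as an agreement check: samples that land in the same cluster in every clustering are treated as likely members of one unknown class. The sketch below illustrates that reading only. The function name `consensus_pseudo_labels`, the `min_size` threshold, and the unanimity rule are assumptions for illustration, not details taken from the abstract.

```python
from collections import defaultdict


def consensus_pseudo_labels(clusterings, min_size=2):
    """Assign pseudo-labels to samples that are grouped together in every clustering.

    clusterings: list of label lists, one per (model, layer) embedding; all
    lists have the same length. Returns {sample_index: pseudo_label}.
    The unanimity rule and min_size are illustrative assumptions.
    """
    n_samples = len(clusterings[0])
    groups = defaultdict(list)
    for i in range(n_samples):
        # The tuple of cluster ids across all clusterings is the sample's signature.
        signature = tuple(labels[i] for labels in clusterings)
        groups[signature].append(i)

    pseudo = {}
    next_label = 0
    # Visit groups in order of their first member so label numbering is stable.
    for signature in sorted(groups, key=lambda s: groups[s][0]):
        members = groups[signature]
        if len(members) >= min_size:
            for i in members:
                pseudo[i] = next_label
            next_label += 1
    return pseudo


clusterings = [
    [0, 0, 0, 1, 1, 2],  # e.g. model A, shallow layer (made-up labels)
    [1, 1, 1, 0, 0, 0],  # e.g. model A, deep layer
    [2, 2, 0, 1, 1, 1],  # e.g. model B
]
pseudo = consensus_pseudo_labels(clusterings)
print(pseudo)  # {0: 0, 1: 0, 3: 1, 4: 1}
```

Samples 2 and 5 stay unlabeled because no other sample shares their cluster assignment across all three clusterings.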

    Using multiple trained models to reduce data labeling efforts

    Publication No.: US11983171B2

    Publication date: 2024-05-14

    Application No.: US18219333

    Filing date: 2023-07-07

    CPC classification number: G06F16/2379 G06N20/00

    Abstract: A method of labeling a dataset includes inputting a testing set comprising a plurality of input data samples into a plurality of pre-trained machine learning models to generate a set of embeddings output by the plurality of pre-trained machine learning models. The method further includes performing an iterative cluster labeling algorithm that includes generating a plurality of clusterings from the set of embeddings, analyzing the plurality of clusterings to identify a target embedding with a highest cluster quality, analyzing the target embedding to determine a compactness for each of the plurality of clusterings of the target embedding, and identifying a target cluster among the plurality of clusterings of the target embedding based on the compactness. The method further includes assigning pseudo-labels to the subset of the plurality of input data samples that are members of the target cluster.
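A single round of the cluster labeling loop described in this abstract might be sketched as follows. Everything specific here is an assumption for illustration: the abstract does not name a clustering algorithm or define "cluster quality" or "compactness", so the sketch uses a small k-means, scores quality as the smallest centroid separation divided by the mean within-cluster distance, and measures compactness as the mean distance to the centroid. The names `kmeans`, `compactness`, and `label_most_compact_cluster` are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means (farthest-point initialization)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


def compactness(points, labels, centroids):
    """Mean distance to the centroid per non-empty cluster (lower is tighter)."""
    result = {}
    for j, centroid in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            result[j] = sum(dist(p, centroid) for p in members) / len(members)
    return result


def label_most_compact_cluster(embeddings, k, pseudo_label):
    """One round: pick the best-clustered embedding, then its tightest cluster."""
    best = None
    for name, points in embeddings.items():
        labels, centroids = kmeans(points, k)
        comp = compactness(points, labels, centroids)
        separation = min(dist(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:])
        # Assumed quality score: well-separated, tight clusters score high.
        quality = separation / (sum(comp.values()) / len(comp) + 1e-12)
        if best is None or quality > best[0]:
            best = (quality, name, labels, comp)
    _, name, labels, comp = best
    target = min(comp, key=comp.get)  # most compact cluster in the target embedding
    return name, {i: pseudo_label for i, lab in enumerate(labels) if lab == target}


# Made-up 2-D embeddings of the same six samples from two hypothetical models.
embeddings = {
    "model_a": [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.5, 5), (5, 5.5)],
    "model_b": [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)],
}
name, pseudo = label_most_compact_cluster(embeddings, k=2, pseudo_label="unknown_0")
print(name, pseudo)
```

On this toy data the round selects `model_a` and pseudo-labels samples 0, 1, and 2. An iterative version would presumably remove the newly labeled samples and repeat, which this sketch leaves out.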

    USING MULTIPLE TRAINED MODELS TO REDUCE DATA LABELING EFFORTS

    Publication No.: US20230350880A1

    Publication date: 2023-11-02

    Application No.: US18219333

    Filing date: 2023-07-07

    CPC classification number: G06F16/2379 G06N20/00

    Abstract: A method of labeling a dataset includes inputting a testing set comprising a plurality of input data samples into a plurality of pre-trained machine learning models to generate a set of embeddings output by the plurality of pre-trained machine learning models. The method further includes performing an iterative cluster labeling algorithm that includes generating a plurality of clusterings from the set of embeddings, analyzing the plurality of clusterings to identify a target embedding with a highest cluster quality, analyzing the target embedding to determine a compactness for each of the plurality of clusterings of the target embedding, and identifying a target cluster among the plurality of clusterings of the target embedding based on the compactness. The method further includes assigning pseudo-labels to the subset of the plurality of input data samples that are members of the target cluster.
