Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
    Type: Invention application
    Status: Under examination (published)

    Publication No.: US20130097103A1

    Publication date: 2013-04-18

    Application No.: US13274002

    Filing date: 2011-10-14

    IPC classification: G06F15/18 G06F17/30

    CPC classification: G06N20/00

    Abstract: Techniques for creating training sets for predictive modeling are provided. In one aspect, a method for generating training data from an unlabeled data set is provided which includes the following steps. A small initial set of data is selected from the unlabeled data set. Labels are acquired for the initial set of data selected from the unlabeled data set resulting in labeled data. The data in the unlabeled data set is clustered using a semi-supervised clustering process along with the labeled data to produce data clusters. Data samples are chosen from each of the clusters to use as the training data. The selecting, presenting, clustering and choosing steps are repeated with one or more additional sets of data selected from the unlabeled data set until a desired amount of training data has been obtained, wherein at each iteration an amount of the labeled data is increased.
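The loop described in the abstract can be illustrated with a minimal sketch. This is one possible reading, not the method claimed in the application: the abstract does not name a clustering algorithm, so a simple seeded k-means (labeled points fix the cluster seeds and stay in their class's cluster) stands in for the "semi-supervised clustering process". The names `build_training_set`, `seeded_kmeans`, `oracle`, `target_size` and `batch_size` are hypothetical, and the sketch assumes that the samples chosen from each cluster are the ones sent for labeling in the next round.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def seeded_kmeans(data, labeled, n_iter=10):
    """Assumed stand-in for semi-supervised clustering.

    labeled maps data index -> class label. One cluster is created per
    class seen so far; labeled points seed the centroids and stay pinned
    to the cluster of their own class.
    """
    classes = sorted(set(labeled.values()))
    centroids = [mean([data[i] for i, y in labeled.items() if y == c])
                 for c in classes]
    assign = []
    for _ in range(n_iter):
        assign = [classes.index(labeled[i]) if i in labeled
                  else nearest(p, centroids)
                  for i, p in enumerate(data)]
        for k in range(len(classes)):
            # Never empty: every cluster contains at least its seeds.
            members = [data[i] for i, a in enumerate(assign) if a == k]
            centroids[k] = mean(members)
    return assign, len(classes)


def build_training_set(data, oracle, target_size, batch_size=4, seed=0):
    """Grow a labeled set until target_size samples have labels.

    oracle(i) returns the label of data[i] (for example a human annotator).
    """
    rng = random.Random(seed)
    labeled = {}
    goal = min(target_size, len(data))
    while len(labeled) < goal:
        if not labeled:
            # Step 1: small initial set drawn at random from the unlabeled data.
            picks = rng.sample(range(len(data)), min(batch_size, len(data)))
        else:
            # Step 3: cluster all data together with the labels so far.
            assign, n_clusters = seeded_kmeans(data, labeled)
            # Step 4: choose the same number of samples from each cluster.
            per_cluster = max(1, batch_size // n_clusters)
            picks = []
            for k in range(n_clusters):
                candidates = [i for i, a in enumerate(assign)
                              if a == k and i not in labeled]
                picks += rng.sample(candidates, min(per_cluster, len(candidates)))
        # Step 2: acquire labels; the labeled set grows at every iteration.
        for i in picks[: goal - len(labeled)]:
            labeled[i] = oracle(i)
    return labeled
```

Calling `build_training_set(points, annotate, target_size=100)` with a list of coordinate tuples and a labeling callback returns a dictionary from data index to label. Drawing an equal number of samples from every cluster, rather than sampling the whole pool uniformly, is what pushes the result toward class balance in this sketch.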
