CREATING A TRAINING DATA SET BASED ON UNLABELED TEXTUAL DATA
    Invention application
    Status: Under examination (published)

    Publication number: WO2017040663A1

    Publication date: 2017-03-09

    Application number: PCT/US2016/049700

    Filing date: 2016-08-31

    Applicant: SKYTREE, INC.

    CPC classification number: G06F17/30675 G06F17/30705 G06N99/005

    Abstract: A system and method are disclosed for obtaining a plurality of unlabeled text documents; obtaining an initial concept; obtaining keywords from a knowledge source based on the initial concept; scoring the plurality of unlabeled documents based at least in part on the initial keywords; determining a categorization of the documents based on the scores; performing a first feature selection and creating a first vector space representation of each document in a first category and a second category, the first and second categories based on the scores, the first vector space representation serving as one or more labels for an associated unlabeled textual document; and generating the training set including a subset of the obtained unlabeled textual documents, the subset of the obtained unlabeled documents including documents belonging to the first category and documents belonging to the second category.
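
The steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method claimed in the application: the `KNOWLEDGE_SOURCE` dictionary, the keyword-fraction score, the `high` and `low` thresholds, and the document-frequency-gap feature selection are all hypothetical choices made for brevity.

```python
import re
from collections import Counter

# Hypothetical stand-in for a knowledge source: concept -> related keywords.
KNOWLEDGE_SOURCE = {
    "sports": {"game", "team", "score", "player", "league"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(tokens, keywords):
    """Fraction of tokens that are keywords (0.0 for an empty document)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in keywords) / len(tokens)


def select_features(first_docs, second_docs, k):
    """Keep the k terms whose document frequency differs most between
    the two categories (an assumed, deliberately simple criterion)."""
    first_df = Counter(t for doc in first_docs for t in set(doc))
    second_df = Counter(t for doc in second_docs for t in set(doc))
    n_first = max(len(first_docs), 1)
    n_second = max(len(second_docs), 1)

    def gap(term):
        return abs(first_df[term] / n_first - second_df[term] / n_second)

    vocab = set(first_df) | set(second_df)
    # Sort by descending gap, breaking ties alphabetically for determinism.
    return sorted(vocab, key=lambda t: (-gap(t), t))[:k]


def build_training_set(documents, concept, high=0.2, low=0.0, k=5):
    """Score unlabeled documents, split them into two categories, and
    return (features, training_set). Documents scoring strictly between
    `low` and `high` are left out of the training set."""
    keywords = KNOWLEDGE_SOURCE[concept]
    tokenized = [tokenize(d) for d in documents]
    scores = [keyword_score(t, keywords) for t in tokenized]

    first = [i for i, s in enumerate(scores) if s >= high]
    second = [i for i, s in enumerate(scores) if s <= low]

    features = select_features(
        [tokenized[i] for i in first], [tokenized[i] for i in second], k
    )

    training_set = []
    for category, indices in (("first", first), ("second", second)):
        for i in indices:
            counts = Counter(tokenized[i])
            # Term-count vector over the selected features.
            vector = [counts[f] for f in features]
            training_set.append(
                {"text": documents[i], "category": category, "vector": vector}
            )
    return features, training_set
```

Calling `build_training_set(docs, "sports")` on a list of raw strings returns the selected feature terms and one record per retained document. Dropping the mid-scoring documents is a design choice of this sketch, intended to keep only the examples the keyword score is most confident about.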
