SYSTEM AND METHOD FOR AUTOMATICALLY CLASSIFYING DOCUMENTS
    Patent application
    Status: Under examination (published)

    Publication number: US20140214835A1

    Publication date: 2014-07-31

    Application number: US13840285

    Filing date: 2013-03-15

    IPC classification: G06F17/30

    CPC classification: G06F16/35

    Abstract: A system and method for automatically classifying documents using an annotated topic tree is provided. A set of topics may be extracted from a document corpus such that each document in the document corpus is associated with a topic model. A sample set of documents may be selected from the document corpus during a current sampling round. The topic models associated with the sample set of documents may be annotated by human reviewers with coding information. Each coded document may be coded as ‘responsive’, ‘non-responsive’, ‘arguably responsive’, or ‘null’, and/or for other codes or issues related to the topic model associated with that document. An annotated topic tree may be formed based on the annotated topic model. One or more machine learning algorithms may be used to project the information in the annotated topic tree to the rest of the document corpus. A voting algorithm, which may comprise a plurality of machine learning algorithms, may also be used to project the sampling judgments to the rest of the document corpus. To continuously enhance the performance of automatic classification of documents, the projection results may be analyzed after each sampling round.
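
The voting step mentioned in the abstract could be sketched roughly as follows. This is an illustration only, not the method claimed in the application: it assumes each document's topic model is a flat vector of topic weights (the annotated topic tree is not modelled), and the document IDs, weights, and the three simple classifiers (`nearest_neighbor`, `nearest_centroid`, `dominant_topic`) are hypothetical stand-ins for whatever machine learning algorithms an actual system would use.

```python
from collections import Counter
from math import dist

# Hypothetical sample: topic-weight vector plus the reviewer's code.
# All values are illustrative, not taken from the patent.
sample = {
    "doc1": ([0.9, 0.1, 0.0], "responsive"),
    "doc2": ([0.8, 0.2, 0.0], "responsive"),
    "doc3": ([0.1, 0.1, 0.8], "non-responsive"),
    "doc4": ([0.0, 0.3, 0.7], "non-responsive"),
}
# Hypothetical remainder of the corpus, not yet coded.
unlabeled = {
    "doc5": [0.7, 0.2, 0.1],
    "doc6": [0.1, 0.2, 0.7],
}

def nearest_neighbor(vec):
    """Code of the single closest sampled document (Euclidean distance)."""
    return min(sample.values(), key=lambda item: dist(vec, item[0]))[1]

def nearest_centroid(vec):
    """Code whose mean topic vector is closest to vec."""
    groups = {}
    for topics, code in sample.values():
        groups.setdefault(code, []).append(topics)
    centroids = {
        code: [sum(col) / len(col) for col in zip(*vectors)]
        for code, vectors in groups.items()
    }
    return min(centroids, key=lambda code: dist(vec, centroids[code]))

def dominant_topic(vec):
    """Most common code among sampled documents sharing vec's dominant topic."""
    top = vec.index(max(vec))
    codes = [
        code for topics, code in sample.values()
        if topics.index(max(topics)) == top
    ]
    # Fall back to 'null' when no sampled document has this dominant topic.
    return Counter(codes).most_common(1)[0][0] if codes else "null"

def project(vec, classifiers=(nearest_neighbor, nearest_centroid, dominant_topic)):
    """Majority vote over the classifiers; ties go to the code voted first."""
    votes = Counter(clf(vec) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Project the sampling judgments onto the uncoded documents.
projected = {doc_id: project(vec) for doc_id, vec in unlabeled.items()}
```

In a real review workflow the projected codes would be checked after each sampling round, for example by having reviewers code a fresh sample and comparing it against `projected`, before deciding whether another round is needed.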
