A Cluster-Based Text Duplicate Checking Method

Invention Grant

Patent Title: A Cluster-Based Text Duplicate Checking Method (一种基于聚类的文本查重方法)
Application No.: CN201610839650.4

Application Date: 2016-09-21
Publication No.: CN106446148B

Publication Date: 2019-08-09
Inventor: 贾倩, 王立伟, 王彦静, 杜俊鹏, 姜悦, 杨玉堃, 张冶, 郭大庆, 池元成, 张丽晔, 许怡婷, 康磊晶
Applicant: 中国运载火箭技术研究院
Applicant Address: 北京市丰台区北京9200信箱38分箱
Assignee: 中国运载火箭技术研究院
Current Assignee: 中国运载火箭技术研究院
Current Assignee Address: 北京市丰台区北京9200信箱38分箱
Agency: 中国航天科技专利中心
Agent: 范晓毅
Main IPC: G06F16/33
IPC: G06F16/33; G06F16/35; G06F16/34

Abstract:

The invention discloses a cluster-based text duplicate checking method comprising five steps: (1) data acquisition, in which text data is stored in a database and on a file server; (2) preprocessing, in which the text data undergoes word segmentation and feature vector extraction; (3) clustering, in which the preprocessed text data in the database is clustered and a center feature vector is computed for each cluster; (4) primary duplicate checking, in which the feature vector of the text to be checked is extracted and compared with the center vectors of the clusters in the database, and every cluster whose center feature vector lies within a set distance threshold is recorded; and (5) secondary duplicate checking, in which the feature vector of the text is compared with the feature vectors of the texts in each recorded cluster, and any text whose feature vector lies within a given distance threshold is marked as a duplicate. Because the detailed comparison is limited to the recorded clusters, the method avoids unnecessary comparisons and improves the efficiency of text duplicate checking.
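The two-stage check described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not name a segmentation tool, feature type, clustering algorithm, distance measure, or threshold values, so whitespace tokenization, normalized term-frequency vectors, plain k-means, Euclidean distance, and the thresholds `0.8` and `0.3` are all placeholders chosen for the example. All function and variable names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text, vocab):
    """Step 2 (assumed form): whitespace segmentation, then an
    L2-normalized term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]

def distance(a, b):
    """Euclidean distance (an assumed choice of distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean, used as a cluster's center feature vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iterations=10):
    """Step 3: plain k-means. The first k vectors seed the centers, which
    keeps the sketch deterministic; a real system would seed more carefully."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = mean(members)
    return centers, labels

def find_duplicates(query, vectors, centers, labels,
                    center_threshold, doc_threshold):
    # Step 4 (primary check): record clusters whose center is near the query.
    candidates = {c for c, center in enumerate(centers)
                  if distance(query, center) < center_threshold}
    # Step 5 (secondary check): compare only with texts in recorded clusters.
    return [i for i, (v, l) in enumerate(zip(vectors, labels))
            if l in candidates and distance(query, v) < doc_threshold]

# Step 1 stand-in: an in-memory list instead of a database and file server.
corpus = [
    "rocket engine thrust test report",
    "annual budget planning meeting notes",
    "rocket engine thrust test summary",
    "annual budget planning meeting minutes",
]
vocab = sorted({w for doc in corpus for w in doc.split()})
vectors = [vectorize(doc, vocab) for doc in corpus]
centers, labels = kmeans(vectors, k=2)

query = vectorize("rocket engine thrust test report draft", vocab)
duplicates = find_duplicates(query, vectors, centers, labels,
                             center_threshold=0.8, doc_threshold=0.3)
print([corpus[i] for i in duplicates])
```

With k clusters, the primary check costs k comparisons against the centers, and the secondary check touches only the members of the recorded clusters rather than every stored text, which is where the efficiency gain claimed in the abstract would come from.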

Public/Granted literature

CN106446148A A Cluster-Based Text Duplicate Checking Method, Public/Granted day: 2017-02-22

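IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F16/00	Information retrieval; database structures; file system structures
G06F16/30	. Unstructured text data (document management systems: see G06F 16/93)
G06F16/33	.. Querying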