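A Domain-Adaptation-Based Word Segmentation Method for Web Text

Granted invention patent

CN107291837B A Domain-Adaptation-Based Word Segmentation Method for Web Text. Status: lapsed (patent right terminated)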


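Patent title: A Domain-Adaptation-Based Word Segmentation Method for Web Text
Application number: CN201710397541.6

Filing date: 2017-05-31
Publication (announcement) number: CN107291837B

Publication (announcement) date: 2020-04-03
Inventors: Sun Xu (孙栩), Xu Jingjing (许晶晶), Ma Shuming (马树铭)
Applicant: Peking University (北京大学)
Applicant address: No. 5 Yiheyuan Road, Haidian District, Beijing
Patentee: Peking University
Current patentee: Peking University
Current patentee address: No. 5 Yiheyuan Road, Haidian District, Beijing
Agency: Beijing Wanxiang Xinyue Intellectual Property Agency Co., Ltd. (北京万象新悦知识产权代理有限公司)
Agent: Huang Fengru (黄凤茹)
Main classification: G06F16/35
IPC classifications: G06F16/35; G06F40/289; G06N3/08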

Abstract:

The invention discloses a domain-adaptation-based word segmentation method for social network text. It builds an ensemble neural network and applies self-training, so that the ensemble model is trained on cross-domain news corpora together with labeled and unlabeled social network data. Specifically, the method divides the social network text into a labeled set and an unlabeled set and uses both as input; takes news-domain corpora as the source corpora and pre-trains source classifiers on them; ensembles the source classifiers by assigning a weight to each; trains the ensemble neural network model on the social network corpus; and uses the trained model for prediction, thereby improving word segmentation of social network text. The invention addresses the poor performance caused by having too little data in the social network domain and can effectively improve word segmentation of social network text.
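The weighting and self-training steps named in the abstract can be sketched roughly as follows. This is a minimal illustration and not the patented model: the abstract does not specify the network architecture, how the weights are computed, or how unlabeled sentences are selected. The B/M/E/S character tag scheme, the softmax-normalized weights, the confidence threshold, and every function name here are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumed tag scheme: Begin / Middle / End of a word, or Single-character word.
TAGS = ("B", "M", "E", "S")

# A source classifier maps a sentence to one probability row per character.
Classifier = Callable[[str], List[List[float]]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw weight scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(text: str,
                     classifiers: Sequence[Classifier],
                     weight_scores: Sequence[float]) -> List[List[float]]:
    """Mix the per-character outputs of the source classifiers by weight."""
    weights = softmax(weight_scores)
    outputs = [clf(text) for clf in classifiers]
    mixed = []
    for i in range(len(text)):
        row = [sum(w * out[i][t] for w, out in zip(weights, outputs))
               for t in range(len(TAGS))]
        mixed.append(row)
    return mixed


def decode(probs: List[List[float]]) -> List[str]:
    """Pick the most probable tag for each character (greedy decoding)."""
    return [TAGS[max(range(len(TAGS)), key=lambda t: row[t])] for row in probs]


def tags_to_words(text: str, tags: Sequence[str]) -> List[str]:
    """Cut the sentence after every character tagged E or S."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # keep a trailing, unterminated word
        words.append(current)
    return words


def select_pseudo_labels(unlabeled: Sequence[str],
                         classifiers: Sequence[Classifier],
                         weight_scores: Sequence[float],
                         threshold: float = 0.9) -> List[Tuple[str, List[str]]]:
    """Self-training step: keep only sentences the ensemble tags confidently."""
    selected = []
    for text in unlabeled:
        probs = ensemble_predict(text, classifiers, weight_scores)
        # A sentence qualifies if its least confident character is confident enough.
        if probs and min(max(row) for row in probs) >= threshold:
            selected.append((text, decode(probs)))
    return selected
```

In a full training loop, the sentences returned by `select_pseudo_labels` would be added to the labeled social network set and the ensemble retrained, repeating until no new sentences qualify. Here the weight scores are plain numbers to keep the sketch short; in a trained model they would be learned parameters.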

Publication / grant documents

CN107291837A A Domain-Adaptation-Based Word Segmentation Method for Web Text. Publication/grant date: 2017-10-24


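IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F16/00	Information retrieval; database structures; file system structures
G06F16/30	· Unstructured textual data (document management systems: see G06F 16/93)
G06F16/35	·· Clustering; classification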