Self-Adaptive Chinese Word Segmentation Method Based on Embedded Representation

Invention publication


Patent title: 一种基于嵌入式表示的自适应中文分词方法
Patent title (English): Self-adaptive Chinese word segmentation method based on embedded representation
Application number: CN201710269840.1

Filing date: 2017-04-24
Publication (announcement) number: CN107145483A

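Publication (announcement) date: 2017-09-08
Inventors: Li Si (李思), Bao Zuyi (包祖贻), Xu Weiran (徐蔚然), Gao Sheng (高升)
Applicant: Beijing University of Posts and Telecommunications
Applicant address: No. 10 Xitucheng Road, Haidian District, Beijing
Patentee: Beijing University of Posts and Telecommunications
Current patentee: Beijing University of Posts and Telecommunications
Current patentee address: No. 10 Xitucheng Road, Haidian District, Beijing
Main classification: G06F17/27
IPC classification: G06F17/27 ; G06N3/04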

Abstract:

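An embodiment of the invention discloses an adaptive Chinese word segmentation method based on embedded representations, belonging to the field of information processing. The method is characterized in that a word segmentation network and a character language model share a single character embedding layer. On the one hand, the character embeddings pass through a segmentation network based on a convolutional neural network, which yields hidden multi-granularity local features of the text to be segmented; a feed-forward layer then produces tag probabilities for each character; finally, tag inference yields the optimal segmentation at the sentence level. On the other hand, unlabeled text is randomly sampled and fed to a character language model based on a long short-term memory (LSTM) recurrent neural network, which predicts the character at the next position and thereby constrains the segmentation network. The character language model captures character co-occurrence relationships in Chinese text from different domains, and the embedded representation passes this information to the segmentation network, which improves the domain transfer ability of segmentation and gives the method considerable practical value.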
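The tag inference step can be illustrated with a small Viterbi decoder. This is a minimal sketch, not the patented implementation: the abstract does not name a tag set or an inference algorithm, so the B/M/E/S tags (begin, middle, end, single), the hard transition constraints, and the function names `viterbi`, `tags_to_words`, and `to_log` are all assumptions made for illustration. The per-character probabilities stand in for the output of the feed-forward layer.

```python
import math

# Assumed tag set: B(egin), M(iddle), E(nd) of a multi-character word, S(ingle).
TAGS = ("B", "M", "E", "S")
# Assumed hard constraints: which tag may follow which.
ALLOWED = {"B": ("M", "E"), "M": ("M", "E"), "E": ("B", "S"), "S": ("B", "S")}
START = ("B", "S")  # a sentence must open a word
END = ("E", "S")    # a sentence must close a word
NEG_INF = float("-inf")


def to_log(dist):
    """Turn a tag -> probability dict into a tag -> log-probability dict."""
    return {tag: math.log(p) for tag, p in dist.items()}


def viterbi(log_probs):
    """Return the highest-scoring valid tag sequence.

    log_probs holds one dict per character, mapping each tag to its
    log-probability.
    """
    if not log_probs:
        return []
    score = {t: (log_probs[0][t] if t in START else NEG_INF) for t in TAGS}
    back = []  # back[i - 1][t] is the best previous tag when position i has tag t
    for step in log_probs[1:]:
        new_score, ptr = {}, {}
        for t in TAGS:
            best_prev, best = None, NEG_INF
            for p in TAGS:
                if t in ALLOWED[p] and score[p] > best:
                    best_prev, best = p, score[p]
            new_score[t] = best + step[t] if best_prev is not None else NEG_INF
            ptr[t] = best_prev
        score = new_score
        back.append(ptr)
    tags = [max(END, key=lambda t: score[t])]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return tags[::-1]


def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in END:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Hypothetical per-character tag probabilities for the string "分词难".
probs = [
    {"B": 0.60, "M": 0.05, "E": 0.05, "S": 0.30},
    {"B": 0.05, "M": 0.20, "E": 0.35, "S": 0.40},
    {"B": 0.10, "M": 0.10, "E": 0.10, "S": 0.70},
]
tags = viterbi([to_log(d) for d in probs])
print(tags, tags_to_words("分词难", tags))
```

In this toy input, picking the most probable tag per character gives B, S, S, which is not a valid sequence because a word opened by B is never closed. Decoding over the whole sentence returns B, E, S and the words 分词 and 难, which is the kind of sentence-level consistency the abstract refers to.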

Abstract (English):

The embodiment of the invention discloses a self-adaptive Chinese word segmentation method based on embedded representation, which belongs to the field of information processing. The method is characterized in that a word segmentation network and a character language model share one character embedding layer. On the one hand, the character embeddings are passed through the word segmentation network, which is based on a convolutional neural network, to obtain hidden multi-granularity local features of the text to be segmented; the label probability of each character is then obtained through a feed-forward layer; finally, label inference is applied to obtain the optimal segmentation result at the sentence level. On the other hand, unlabeled text is randomly sampled, and a character language model based on a long short-term memory (LSTM) recurrent neural network predicts the character that follows each character, which constrains the word segmentation network. By modeling character co-occurrence relationships in texts from different domains with the character language model and passing that information to the word segmentation network through the embedded representation, the method improves the domain transfer ability of word segmentation and has considerable practical value.
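One way to picture how the language model constrains the segmentation network is a joint training objective over the two branches. The abstract does not say how the two objectives are combined, so the weighted sum below, the `lm_weight` parameter, and the function names are assumptions for illustration only. The score rows stand in for the outputs of the two networks, both of which would read the same character embedding table.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def mean_cross_entropy(score_rows, targets):
    """Average negative log-likelihood of the target index in each row."""
    losses = [-log_softmax(row)[t] for row, t in zip(score_rows, targets)]
    return sum(losses) / len(losses)


def joint_loss(seg_scores, seg_tags, lm_scores, next_chars, lm_weight=0.5):
    """Assumed combination: segmentation loss plus a weighted language-model loss.

    seg_scores / seg_tags come from labeled text (one row of tag scores and one
    gold tag index per character); lm_scores / next_chars come from unlabeled
    text (one row of vocabulary scores and the index of the true next character).
    """
    seg = mean_cross_entropy(seg_scores, seg_tags)
    lm = mean_cross_entropy(lm_scores, next_chars)
    return seg + lm_weight * lm
```

Because both loss terms would depend on the shared embedding parameters, minimizing a sum of this kind lets unlabeled text from a new domain shape the embeddings that the segmentation branch consumes, which matches the transfer mechanism the abstract describes.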

Publication/grant documents

CN107145483B Self-adaptive Chinese word segmentation method based on embedded representation. Publication/grant date: 2018-09-04
