Multi-language identification method based on coding and machine learning

Invention Publication

Patent title: 一种基于编码和机器学习的多语种识别方法
Patent title (English): Multi-language identification method based on coding and machine learning
Application number: CN201611001398.6

Filing date: 2016-11-14
Publication (announcement) number: CN106528535A

Publication (announcement) date: 2017-03-22
Inventors: 王宇, 徐晓燕, 周渊, 刘庆良, 郑彩娟, 王海平, 黄成, 周游, 陈婷婷
Applicants: 北京赛思信安技术股份有限公司, 国家计算机网络与信息安全管理中心
Applicant address: 北京市朝阳区霞光里8号承冀诚大厦二层
Patentees: 北京赛思信安技术股份有限公司, 国家计算机网络与信息安全管理中心
Current patentees: 北京赛思信安技术股份有限公司, 国家计算机网络与信息安全管理中心
Current patentee address: 北京市朝阳区霞光里8号承冀诚大厦二层
Patent agency: 北京永创新实专利事务所
Agent: 祗志洁
Main classification: G06F17/27
IPC classification: G06F17/27

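Abstract:

The invention provides a multi-language identification method based on character encoding and machine learning, in the field of computer processing of natural language. The method identifies the language of a text with both a machine learning unit and an encoding identification unit; during encoding identification, the number of words in each language is also counted. When the result of the machine learning unit falls within the judgment interval of the encoding identification unit and the two units identify the same language, a single identified language is output. When the encoding identification unit detects more than one language, a mixed-language rule is applied: if the second language's share of the words in the text reaches a set proportion, the text is judged to be mixed-language. For long texts, the invention can take a random sample before making the determination, which improves identification efficiency. The invention can accurately and efficiently identify 99 languages, including Simplified and Traditional Chinese, Japanese, French, and English, and also supports identification of mixed-language text. It has broad application prospects in massive data analysis and public opinion monitoring.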

Abstract (English):

The invention provides a multi-language identification method based on encoding and machine learning, and belongs to the technology of natural language processing by computers. The method comprises the following steps: performing language identification on a text through a machine learning unit and an encoding identification unit; counting the number of words of each language during the encoding identification; outputting a single identified language when the identification result of the machine learning unit is within the judgment interval of the encoding identification unit and the languages identified by the two units are consistent; when the encoding identification unit identifies a plurality of languages, applying a mixed-language rule, under which the text is judged to be mixed-language if the proportion of words in the text belonging to the second language reaches a set proportion. A long text can be randomly sampled before it is judged, to improve identification efficiency. The method can identify 99 languages accurately and efficiently, such as Simplified and Traditional Chinese, Japanese, French, and English, and supports identification of mixed-language text, so it has wide application prospects in mass data analysis and public opinion monitoring.
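
The decision flow described in the abstract can be illustrated with a small Python sketch. This is not the patented implementation: the Unicode ranges, the `CANDIDATES` table standing in for the "judgment interval", the 0.2 mixed-language ratio, the sampling sizes, and the names `sample_text`, `count_units`, and `identify` are all assumptions made for illustration. The machine learning unit is represented only by its output label (`ml_label`), scripts stand in for languages on the encoding side, and the "unresolved" outcome is a placeholder because the abstract does not say what happens when the two units disagree.

```python
import random
from collections import Counter

# Illustrative Unicode ranges only; the abstract does not list the real ones.
SCRIPT_RANGES = {
    "han": [(0x4E00, 0x9FFF)],
    "kana": [(0x3040, 0x30FF)],
    "cyrillic": [(0x0400, 0x04FF)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}
# Hypothetical "judgment interval": languages each script may correspond to.
CANDIDATES = {
    "han": {"zh-Hans", "zh-Hant", "ja"},
    "kana": {"ja"},
    "cyrillic": {"ru", "uk", "bg"},
    "latin": {"en", "fr", "de", "es"},
}
PER_CHAR_SCRIPTS = {"han", "kana"}  # every character counts as one unit


def script_of(ch):
    """Return the script name for a character, or None if it is unmapped."""
    cp = ord(ch)
    for name, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def count_units(text):
    """Approximate per-script word counts from code points alone."""
    counts = Counter()
    for token in text.split():
        prev = None
        for ch in token:
            script = script_of(ch)
            if script is None:
                continue  # digits and punctuation are ignored
            # One unit per character for Han/kana, one per run otherwise.
            if script in PER_CHAR_SCRIPTS or script != prev:
                counts[script] += 1
            prev = script
    return counts


def sample_text(text, max_chars=2000, chunk=200, seed=None):
    """Return short text unchanged; otherwise join randomly chosen chunks."""
    if len(text) <= max_chars:
        return text
    rng = random.Random(seed)
    starts = range(0, len(text), chunk)
    picked = sorted(rng.sample(starts, max(1, max_chars // chunk)))
    # Chunk boundaries may split a word; acceptable for a rough estimate.
    return " ".join(text[s:s + chunk] for s in picked)


def identify(text, ml_label, mixed_ratio=0.2, max_chars=2000, seed=None):
    """Combine the encoding-side counts with a machine learning label."""
    counts = count_units(sample_text(text, max_chars, seed=seed))
    if not counts:
        return ("unknown", [])
    ranked = counts.most_common()
    total = sum(counts.values())
    top = ranked[0][0]
    # Mixed-language rule: the second script reaches the set proportion.
    if len(ranked) > 1 and ranked[1][1] / total >= mixed_ratio:
        return ("mixed", [top, ranked[1][0]])
    # Single language: the label must fall inside the candidate set.
    if ml_label in CANDIDATES[top]:
        return ("single", [ml_label])
    return ("unresolved", [top])
```

Under these assumptions, `identify("hello world 你好 世界", "en")` returns `("mixed", ["han", "latin"])`, because the Latin-script words make up one third of the counted units, while a purely English sentence with the label `"en"` returns `("single", ["en"])`.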

Publication/Grant Documents

CN106528535B Multi-language identification method based on coding and machine learning. Publication/grant date: 2019-04-26
