Big Data Processing Method Based on Machine Learning

Invention Publication


Patent title: 基于机器学习的大数据处理方法
Patent title (English): Big data processing method based on machine learning
Application number: CN201811039771.6

Filing date: 2018-09-06
Publication (announcement) number: CN109214004A

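Publication (announcement) date: 2019-01-15
Inventor: Not disclosed (inventor requested non-publication)
Applicant: Guangzhou Zhihong Technology Co., Ltd. (广州知弘科技有限公司)
Applicant address: Room A30, Building (1), No. 68 Nanxiang 1st Road, Huangpu District, Guangzhou, Guangdong Province
Patentee: Guangzhou Zhihong Technology Co., Ltd. (广州知弘科技有限公司)
Current patentee: Guizhou Aerospace Cloud Network Technology Co., Ltd. (贵州航天云网科技有限公司)
Current patentee address: Room A30, Building (1), No. 68 Nanxiang 1st Road, Huangpu District, Guangzhou, Guangdong Province
Main classification: G06F17/27
IPC classification: G06F17/27 ; G06F16/33 ; G06N3/04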

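Abstract:

The invention provides a big data processing method based on machine learning, comprising: given a search query, filtering the words of the initial query against a general-purpose stop word list and retaining the meaningful search terms; representing words as semantic vectors using a semantic block model; on the basis of the semantic vectors, using cosine similarity to find, for each initial search term, the several other words most similar to it, which serve as expanded search terms; substituting the expanded search terms for the corresponding terms in the initial query and taking the newly generated term sequence as an expanded query; and obtaining expanded queries with different wordings from the permutations and combinations of the expanded search terms. The invention improves the MapReduce parallel framework so that it better suits the needs of text data mining. Addressing the irregular nature of social text, it also uses semantic vectors to represent and analyze text data effectively, making it suitable for social text mining, analysis and computation at any scale.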

Abstract (English):

The invention provides a big data processing method based on machine learning, which comprises the following steps: given a retrieval sentence, filtering the words in the initial retrieval by using a universal stop word list, and reserving meaningful retrieval words. A semantic block model is used to represent words as semantic vectors. On the basis of the semantic vectors, cosine similarity is used to find, for each initial search term, the closest words among the other words, which are used as extended search terms. The corresponding extended search terms are used to replace the original search terms in the initial search sentence, and the newly generated search term sequence is used as the extended search sentence. According to the permutation and combination of the extended search terms, extended search sentences with different expressions are obtained. The invention improves the parallel framework of MapReduce and is better adapted to the needs of text data mining. Aiming at the irregularity of social text, semantic vectors are used to represent and analyze the text data effectively, which is suitable for the analysis and calculation of social text mining of various scales.
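
The query expansion steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not specify the semantic block model, so a small hand-written table of toy vectors (`VECTORS`) stands in for it, and the stop word list (`STOP_WORDS`), the function names, and the parameter `k` are all hypothetical. The "permutation and combination" step is interpreted here as a Cartesian product over each term's candidates, which is one plausible reading.

```python
import math
from itertools import product

# Hypothetical stop word list and toy embeddings, for illustration only.
# A real system would obtain the vectors from a trained semantic model.
STOP_WORDS = {"the", "a", "of", "for", "in"}
VECTORS = {
    "cheap":      (0.9, 0.1, 0.0),
    "affordable": (0.8, 0.2, 0.1),
    "budget":     (0.85, 0.05, 0.2),
    "phone":      (0.1, 0.9, 0.1),
    "smartphone": (0.15, 0.95, 0.05),
    "handset":    (0.2, 0.8, 0.2),
    "weather":    (0.0, 0.1, 0.9),
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_stop_words(query):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def nearest_terms(term, k=2):
    """Return the k vocabulary words most similar to `term`, best first."""
    if term not in VECTORS:
        return []  # no vector available, so no expansion for this term
    scored = [
        (cosine(VECTORS[term], vec), other)
        for other, vec in VECTORS.items()
        if other != term
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

def expand_query(query, k=2):
    """Build expanded queries by substituting similar terms position by position.

    Each position keeps the original term as one candidate, so the first
    result is the stop-word-filtered original query itself.
    """
    terms = filter_stop_words(query)
    choices = [[t] + nearest_terms(t, k) for t in terms]
    return [" ".join(combo) for combo in product(*choices)]

if __name__ == "__main__":
    for expanded in expand_query("the cheap phone", k=1):
        print(expanded)
```

The number of expanded queries grows multiplicatively with `k` and with the number of retained terms, so a practical system would likely cap `k` or rank and prune the candidates before retrieval.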

Publication/Grant Documents

CN109214004B Big data processing method based on machine learning. Publication/grant date: 2019-11-05
