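Distributed Big Data Functional Dependency Discovery Method Based on the Spark Platform

Invention publication

CN109918410A Distributed Big Data Functional Dependency Discovery Method Based on the Spark Platform (patent in force)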

Patent title (Chinese): 基于Spark平台的分布式大数据函数依赖发现方法
Patent title (English): A distributed big data functional dependency discovery method based on a Spark platform
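Application number: CN201811285204.9

Filing date: 2018-10-31
Publication number: CN109918410A

Publication date: 2019-06-21
Inventors: Zhang Haisu (张海粟), Wang Long (王龙), Zuo Qingyun (左青云), Li Taowei (李韬伟), Zhang Sheng (张胜), Wu Zhaolin (吴照林), Liu Pengfei (刘鹏飞), Zhu Mingdong (朱明东), Dai Jianwei (戴剑伟), Xu Fei (徐飞), Liu Peilei (刘培磊), Wen Feng (文峰), Liu Yibo (刘一博), Zhang Yan (张岩)
Applicant: National University of Defense Technology of the Chinese People's Liberation Army (中国人民解放军国防科技大学)
Applicant address: No. 45 Jiefang Gongyuan Road, Jiang'an District, Wuhan, Hubei Province
Patentee: National University of Defense Technology of the Chinese People's Liberation Army
Current patentee: National University of Defense Technology of the Chinese People's Liberation Army
Current patentee address: No. 45 Jiefang Gongyuan Road, Jiang'an District, Wuhan, Hubei Province
Agency: Wuhan Kehao Intellectual Property Agency (武汉科皓知识产权代理事务所)
Agent: Yan Yan (严彦)
Main classification: G06F16/2458
IPC classification: G06F16/2458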

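Abstract:

The invention provides a distributed functional dependency discovery method for big data on the Spark platform. The method comprises: data partitioning, in which the data is partitioned according to the number of CPU cores allocated to each node of the Spark cluster; generating all non-empty subsets of the attribute set, in which the set of all attributes in the database is used to generate a collection containing every non-empty subset, in preparation for computing the number of equivalence classes of every attribute set; accumulating the equivalence class counts of the attribute sets across nodes, in which equivalence class computation yields the set of (attribute set, equivalence class count) pairs for the global database; and iterating over the attribute sets to generate the functional dependency set, in which candidate functional dependencies are constructed from the subsets of each attribute set and each candidate is tested to determine whether it holds. The method solves the load imbalance and inefficiency of functional dependency discovery algorithms in distributed environments and greatly improves the execution efficiency of functional dependency discovery.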
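The subset generation and dependency test described in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not spell out the exact test, so the sketch assumes the standard partition-based criterion: a candidate X → A holds exactly when X and X ∪ {A} have the same number of equivalence classes. All function names and the sample rows are hypothetical, and the result includes non-minimal dependencies.

```python
from itertools import combinations

def non_empty_subsets(attrs):
    """Yield every non-empty subset of attrs as a sorted tuple."""
    attrs = sorted(attrs)
    for size in range(1, len(attrs) + 1):
        yield from combinations(attrs, size)

def class_count(rows, attr_set):
    """Equivalence class count: distinct value combinations on attr_set."""
    return len({tuple(row[a] for a in attr_set) for row in rows})

def discover_fds(rows, attrs):
    """Return (lhs, rhs) pairs where lhs -> rhs holds under the assumed criterion."""
    counts = {s: class_count(rows, s) for s in non_empty_subsets(attrs)}
    fds = []
    for lhs, lhs_count in counts.items():
        for rhs in attrs:
            if rhs in lhs:
                continue  # skip trivial dependencies
            combined = tuple(sorted(lhs + (rhs,)))
            # Adding rhs does not split any class, so lhs determines rhs.
            if counts[combined] == lhs_count:
                fds.append((lhs, rhs))
    return fds

# Hypothetical sample data, not taken from the patent.
ROWS = [
    {"id": 1, "city": "Wuhan", "province": "Hubei"},
    {"id": 2, "city": "Wuhan", "province": "Hubei"},
    {"id": 3, "city": "Changsha", "province": "Hunan"},
    {"id": 4, "city": "Yichang", "province": "Hubei"},
]
FDS = discover_fds(ROWS, ["id", "city", "province"])
```

Because every non-empty subset is enumerated, the number of counts grows as 2^n - 1 in the number of attributes, which is why the per-attribute-set counting step is the natural one to distribute.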

Abstract (English):

The invention provides a distributed big data functional dependency discovery method based on a Spark platform. The method comprises the steps of: partitioning the data according to the number of CPU cores allocated to each node of a Spark cluster; generating all the non-empty subsets of the attribute set, including generating a set containing all the non-empty subsets from all the attribute sets in the database, in preparation for computing the number of equivalence classes of all the attribute sets; accumulating the equivalence class counts of the attribute sets on each node, and obtaining the set of (attribute set, equivalence class count) pairs of the global database through equivalence class calculation; and iterating over each attribute set to generate a functional dependency set, namely constructing candidate functional dependencies from the subsets of each attribute set and judging whether each functional dependency holds. The method solves the problems of unbalanced load and low efficiency of functional dependency discovery algorithms in a distributed environment, and greatly improves the execution efficiency of functional dependency discovery.
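The partitioning and cross-node accumulation steps can be sketched in plain Python by simulating partitions as lists. This is not the patented procedure: the abstract does not say how rows are assigned to partitions, and simply summing per-partition counts over-counts any value that appears in more than one partition. The sketch therefore swaps in a set union of per-partition projections, which is correct for any split. The names `split_into_partitions`, `local_projections`, `merge`, and the `num_partitions` parameter (standing in for the cluster's total allocated core count) are assumptions.

```python
from functools import reduce
from itertools import combinations

def split_into_partitions(rows, num_partitions):
    """Round-robin split; num_partitions stands in for the total core count."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def local_projections(partition, attr_sets):
    """Per-partition step: distinct value tuples for every attribute set."""
    return {s: {tuple(row[a] for a in s) for row in partition} for s in attr_sets}

def merge(left, right):
    """Combine partial results by set union so shared values count once."""
    return {s: left[s] | right[s] for s in left}

def global_class_counts(rows, attrs, num_partitions):
    """Return {attribute set: global equivalence class count}."""
    attrs = sorted(attrs)
    attr_sets = [c for k in range(1, len(attrs) + 1)
                 for c in combinations(attrs, k)]
    partials = [local_projections(p, attr_sets)
                for p in split_into_partitions(rows, num_partitions)]
    merged = reduce(merge, partials)
    return {s: len(values) for s, values in merged.items()}

# Hypothetical sample data, not taken from the patent.
ROWS = [
    {"id": 1, "city": "Wuhan", "province": "Hubei"},
    {"id": 2, "city": "Wuhan", "province": "Hubei"},
    {"id": 3, "city": "Changsha", "province": "Hunan"},
    {"id": 4, "city": "Yichang", "province": "Hubei"},
]
COUNTS = global_class_counts(ROWS, ["id", "city", "province"], num_partitions=2)
```

With two partitions, "Wuhan" lands in both, so a naive sum of local city counts would give 4 while the union gives the correct 3. On an actual Spark cluster the same map-then-merge shape would typically be expressed with RDD or DataFrame operations rather than Python lists.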

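Publication/grant documents

CN109918410B Distributed Big Data Functional Dependency Discovery Method Based on the Spark Platform. Publication/grant date: 2020-12-04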

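IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F16/00	Information retrieval; database structures therefor; file system structures therefor
G06F16/20	. of structured data, e.g. relational data
G06F16/24	.. Querying
G06F16/245	... Query processing
G06F16/2458	.... Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries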