
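Jump Filter: Dynamic Sketch Design for Big Data Governance
Cite this article:FU Peng-Tao, LUO Lai-Long, GUO De-Ke, ZHAO Xiang, LI Shang-Sen, WANG Huai-Min. Jump Filter: Dynamic Sketch Design for Big Data Governance[J]. Journal of Software, 2023, 34(3): 1193-1212
Authors:FU Peng-Tao  LUO Lai-Long  GUO De-Ke  ZHAO Xiang  LI Shang-Sen  WANG Huai-Min
Affiliation:College of Systems Engineering, National University of Defense Technology, Changsha, Hunan 410073, China;College of Systems Engineering, National University of Defense Technology, Changsha, Hunan 410073, China;College of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China;College of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China
Funding:National Natural Science Foundation of China (U19B2024, 62002378, 61772544); Scientific Research Fund of National University of Defense Technology (ZK20-30)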
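Abstract:With the rapid development of information technology, data volume continues to grow exponentially and the value in data is hard to mine, which poses great challenges for efficient management and control at every stage of the data life cycle, including collection, cleaning, storage, and sharing. Sketch techniques use hash tables, matrices, or bit vectors to track core properties of data such as frequency, cardinality, and membership, so that the sketch itself becomes metadata, and they are widely used in scenarios such as sharing, transmission, and updating. The fast-moving nature of big data has in turn given rise to dynamic sketch techniques. Existing dynamic sketches dynamically maintain a chain- or tree-structured list of probabilistic data structures, which lets their capacity expand or shrink with the size of the data stream, but they suffer from excessive space overhead and from time overhead that grows with data cardinality. Based on the advanced jump consistent hash theory, this paper designs a dynamic sketch technique for big data governance. The method simultaneously achieves space overhead that grows linearly with data cardinality and constant time overhead for data processing and analysis, and can effectively support a variety of demanding big data processing and analysis tasks. Comparative experiments against traditional methods on multiple synthetic and real datasets verify the effectiveness and efficiency of the proposed method.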

Keywords:big data  big data governance  metadata  dynamic sketch  probabilistic data structure
Received:2022-05-14
Revised:2022-07-29

Jump Filter: Dynamic Sketch Design for Big Data Governance
FU Peng-Tao, LUO Lai-Long, GUO De-Ke, ZHAO Xiang, LI Shang-Sen, WANG Huai-Min. Jump Filter: Dynamic Sketch Design for Big Data Governance[J]. Journal of Software, 2023, 34(3): 1193-1212
Authors:FU Peng-Tao  LUO Lai-Long  GUO De-Ke  ZHAO Xiang  LI Shang-Sen  WANG Huai-Min
Affiliation:College of Systems Engineering, National University of Defense Technology, Changsha 410073, China;College of Systems Engineering, National University of Defense Technology, Changsha 410073, China;College of Computer Science, National University of Defense Technology, Changsha 410073, China
Abstract:With the rapid development of information technology, the volume of data keeps growing exponentially, and the value of data is hard to mine. This brings significant challenges to the efficient management and control of each stage of the data life cycle, such as data collection, cleaning, storage, and sharing. A sketch uses a hash table, matrix, or bit vector to track core characteristics of data, such as frequency, cardinality, and membership. This makes the sketch itself a form of metadata, which has been widely used in sharing, transmission, update, and other scenarios. The rapid-flow characteristics of big data have spawned dynamic sketches. Existing dynamic sketches dynamically maintain a list of probabilistic data structures organized as a chain or tree, which lets their capacity expand or shrink with the size of the data stream. However, they suffer from excessive space overhead and from time overhead that increases with the dataset cardinality. This paper designs a dynamic sketch for big data governance based on the advanced Jump Consistent Hash. The method simultaneously achieves space overhead that grows linearly with the dataset cardinality and constant time overhead for data processing and analysis, effectively supporting the demanding processing and analysis tasks of big data governance. The validity and efficiency of the proposed method are verified by comparing it with traditional methods on various synthetic and real-world datasets.
Keywords:big data  big data governance  metadata  dynamic sketch  probabilistic data structure
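
The abstract names Jump Consistent Hash as the basis of the design but does not describe the construction. As background only, the following is a minimal Python sketch of the generally published jump consistent hash function (Lamping and Veach), not the authors' Jump Filter. The function name `jump_consistent_hash` and the use of integer keys are assumptions made for illustration. It maps a key to one of `num_buckets` buckets without any lookup table, and when a bucket is added a key either stays where it was or moves to the new bucket.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket index in [0, num_buckets).

    Illustrative only: this is the standard jump consistent hash,
    not the Jump Filter construction from the paper.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    mask = 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit arithmetic
    key &= mask
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # Linear congruential step that drives the pseudo-random jumps.
        key = (key * 2862933555777941757 + 1) & mask
        # Jump forward to the next bucket count at which this key would move.
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


if __name__ == "__main__":
    # Growing from 4 to 5 buckets: each key keeps its bucket or moves to bucket 4.
    for k in range(10):
        print(k, jump_consistent_hash(k, 4), jump_consistent_hash(k, 5))
```

The number of loop iterations depends on the bucket count rather than on the number of stored items, which is the kind of property a dynamic sketch can exploit when it grows; how the paper applies it is not stated in the abstract.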
This article is indexed in Wanfang Data and other databases.