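Research Progress on Similarity Joins over Massive Data Based on the MapReduce Framework
Cite this article: PANG Jun, YU Ge, XU Jia, GU Yu. Research progress on similarity joins over massive data based on the MapReduce framework[J]. Computer Science, 2015, 42(1): 1-5, 27.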
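Authors: PANG Jun, YU Ge, XU Jia, GU Yu
Affiliations: 1. School of Information Science and Engineering, Northeastern University, Shenyang 110819
2. School of Information System and Management, National University of Defense Technology, Changsha 410073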
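Funding: This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the National Natural Science Foundation of China (61272179,8), the Doctoral Program Foundation of the Ministry of Education (20120042110028), and the Ministry of Education-Intel Special Research Fund for Information Technology (MOE-INTEL-2012-06).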
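Abstract: As a basic operation in massive data processing, the similarity join on massive data plays an important role in research areas such as text clustering, plagiarism detection, and entity resolution. Meanwhile, the MapReduce programming model is widely used for massive data processing because of its good scalability, fault tolerance, and ease of use. Consequently, similarity join query techniques for massive data based on the MapReduce framework have become one of the hot topics in massive data processing. First, this paper summarizes the major challenges that the inherent characteristics of massive data and the shortcomings of the MapReduce programming framework pose to existing similarity join query techniques. Second, it gives a definition of similarity joins on massive data and classifies them according to three different criteria. Next, it analyzes in detail the latest similarity join query techniques for massive data of set, string, and vector types, and compares these techniques in terms of efficiency, scope of applicability, and other aspects. Finally, it discusses the key open problems in similarity join query techniques for massive data and proposes some promising solutions.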

Keywords: Massive data; Similarity join; MapReduce; Top-k

Similarity Joins on Massive Data Based on MapReduce Framework
PANG Jun, YU Ge, XU Jia and GU Yu. Similarity Joins on Massive Data Based on MapReduce Framework[J]. Computer Science, 2015, 42(1): 1-5, 27.
Authors: PANG Jun, YU Ge, XU Jia and GU Yu
Affiliation: School of Information Science and Engineering, Northeastern University, Shenyang 110819, China; School of Information Science and Engineering, Northeastern University, Shenyang 110819, China; School of Information System and Management, National University of Defense Technology, Changsha 410073, China; and School of Information Science and Engineering, Northeastern University, Shenyang 110819, China
Abstract: As a basic operation of large-scale data processing, similarity joins on large-scale data play an important role in document clustering, plagiarism detection, entity resolution and many other fields. On the other hand, the MapReduce programming model is widely applied to massive data processing because of its good scalability, fault tolerance and usability. Therefore, similarity joins on massive data based on MapReduce have become one of the hot topics in the field of massive data processing. Firstly, the major challenges that the rapid growth of data volume poses for similarity join queries are summarized. Then, a definition of similarity joins on massive data is presented, and these joins are classified according to three different criteria. In addition, the current state of set, string and vector similarity joins on massive data is analyzed in detail, and the techniques are compared in terms of efficiency, applicability and other aspects. Finally, the open research problems and trends in this area are discussed and promising solutions are suggested.
Keywords: Massive data; Similarity join; MapReduce; Top-k
This article is indexed by Wanfang Data and other databases.