Spark Based Condensed Nearest Neighbor Algorithm
Cite this article: ZHANG Su-fang, ZHAI Jun-hai, WANG Ting-ting, HAO Pu, WANG Cong, ZHAO Chun-ling. Spark Based Condensed Nearest Neighbor Algorithm[J]. Computer Science, 2018, 45(Z6): 406-410.
Authors: ZHANG Su-fang  ZHAI Jun-hai  WANG Ting-ting  HAO Pu  WANG Cong  ZHAO Chun-ling
Affiliation: Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding, Hebei 071000, China; Hebei Province Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China
Funding: This work was supported by the National Natural Science Foundation of China (71371063), the Natural Science Foundation of Hebei Province (F2017201026), the Natural Science Research Program of Hebei University (799207217071), and the Undergraduate Innovation Training Program of Hebei University (2017071).
Abstract: K-nearest neighbors (K-NN) is a lazy learning algorithm: no classification model needs to be trained before K-NN is used to classify data. The K-NN algorithm is simple in concept and easy to implement. Its drawback is a heavy computational cost, because classifying a test instance requires computing the distance between that instance and every instance in the training set. The condensed nearest neighbors (CNN) algorithm can overcome this drawback of K-NN. However, CNN is inherently iterative, so its efficiency becomes very low on big data sets. To address this problem, this paper proposes an algorithm named Spark CNN. In big data environments, Spark CNN improves efficiency dramatically compared with the MapReduce-based CNN, and experiments on five big data sets confirm this conclusion.
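The paper itself does not include source code. For reference, the classic CNN selection rule that the abstract describes can be sketched as follows in Python; the function name and details are illustrative, not taken from the paper:

```python
import numpy as np

def cnn_select(X, y):
    """Sketch of the classic condensed nearest neighbor (CNN) rule.

    Seed the condensed set with the first training instance, then sweep
    the training data repeatedly: any instance that 1-NN over the current
    condensed set misclassifies is added to the set. Stop when a full
    sweep adds nothing. Illustrative only; not the paper's implementation.
    """
    keep = [0]                      # indices of the condensed set
    changed = True
    while changed:                  # this outer loop is the iterative
        changed = False             # behavior that makes CNN slow on big data
        for i in range(len(X)):
            # 1-NN prediction for instance i using only the condensed set
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(np.argmin(dists))]
            if y[nearest] != y[i]:  # misclassified -> keep this instance
                keep.append(i)
                changed = True
    return X[keep], np.asarray(y)[keep]
```

Because every sweep computes distances from each training instance to the whole condensed set, and the number of sweeps is data-dependent, the cost grows quickly with the training set size, which is the inefficiency the paper targets.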

Keywords: Condensed nearest neighbors  Big data  Instance selection  Iterative calculation  Lazy learning
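The paper's Spark CNN algorithm is not reproduced here. As a hypothetical illustration of the general idea (parallelizing each CNN sweep by broadcasting the condensed set to the workers), a PySpark sketch might look like the following; all names and the batch-absorption strategy are assumptions, not the authors' design:

```python
import numpy as np
from pyspark.sql import SparkSession

# Hypothetical sketch: parallelize each CNN sweep on Spark by broadcasting
# the current condensed set and scanning the partitions for misclassified
# instances. Unlike the strictly sequential classic rule, this batch
# variant absorbs all misclassified instances found in one sweep.
# This is NOT the algorithm from the paper.

spark = SparkSession.builder.appName("spark-cnn-sketch").getOrCreate()
sc = spark.sparkContext

def misclassified(partition, store_X, store_y):
    # Yield the (x, label) pairs that 1-NN over the condensed set gets wrong.
    for x, label in partition:
        d = np.linalg.norm(store_X - x, axis=1)
        if store_y[int(np.argmin(d))] != label:
            yield (x, label)

def spark_cnn(rdd, seed):
    # rdd: RDD of (numpy feature vector, label) pairs; seed: one such pair.
    store = [seed]                  # the condensed set, kept on the driver
    while True:
        bx = sc.broadcast(np.array([x for x, _ in store]))
        by = sc.broadcast(np.array([l for _, l in store]))
        wrong = rdd.mapPartitions(
            lambda p: misclassified(p, bx.value, by.value)).collect()
        if not wrong:               # a full sweep added nothing: done
            return store
        store.extend(wrong)

# Example usage (X, y held on the driver):
# store = spark_cnn(sc.parallelize(list(zip(X, y))).cache(), (X[0], y[0]))
```

Keeping the training data in memory across iterations (via `cache()`) is the usual reason Spark outperforms MapReduce on iterative algorithms like CNN, since MapReduce rereads the data from disk in every round.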
