Spark Based Condensed Nearest Neighbor Algorithm
Cite this article: ZHANG Su-fang, ZHAI Jun-hai, WANG Ting-ting, HAO Pu, WANG Cong, ZHAO Chun-ling. Spark Based Condensed Nearest Neighbor Algorithm[J]. Computer Science, 2018, 45(Z6): 406-410.
Authors: ZHANG Su-fang  ZHAI Jun-hai  WANG Ting-ting  HAO Pu  WANG Cong  ZHAO Chun-ling
Affiliation: Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding, Hebei 071000, China; Hebei Province Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China
Funding: This work was supported by the National Natural Science Foundation of China (71371063), the Natural Science Foundation of Hebei Province (F2017201026), the Natural Science Research Program of Hebei University (799207217071), and the Undergraduate Innovation Training Program of Hebei University (2017071).
Abstract: K-nearest neighbors (K-NN) is a lazy learning algorithm: no classification model needs to be trained before K-NN is used to classify data. The K-NN algorithm is simple in concept and easy to implement. Its drawback is a heavy computational cost, because classifying a test instance requires computing the distance between that instance and every instance in the training set. The condensed nearest neighbors (CNN) algorithm can overcome this drawback of K-NN. However, CNN is inherently iterative, so its efficiency becomes very low on big data sets. To address this problem, this paper proposes an algorithm named Spark CNN. In big data environments, Spark CNN improves efficiency dramatically compared with the MapReduce-based CNN, and experiments on five big data sets confirm this conclusion.
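The paper itself does not include source code. For reference, the classic CNN selection rule that the abstract describes can be sketched as follows in Python; the function name and details are illustrative, not taken from the paper:

```python
import numpy as np

def cnn_select(X, y):
    """Sketch of the classic condensed nearest neighbor (CNN) rule.

    Seed the condensed set with the first training instance, then sweep
    the training data repeatedly: any instance that 1-NN over the current
    condensed set misclassifies is added to the set. Stop when a full
    sweep adds nothing. Illustrative only; not the paper's implementation.
    """
    keep = [0]                      # indices of the condensed set
    changed = True
    while changed:                  # this outer loop is the iterative
        changed = False             # behavior that makes CNN slow on big data
        for i in range(len(X)):
            # 1-NN prediction for instance i using only the condensed set
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(np.argmin(dists))]
            if y[nearest] != y[i]:  # misclassified -> keep this instance
                keep.append(i)
                changed = True
    return X[keep], np.asarray(y)[keep]
```

Because every sweep computes distances from each training instance to the whole condensed set, and the number of sweeps is data-dependent, the cost grows quickly with the training set size, which is the inefficiency the paper targets.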

Keywords: Condensed nearest neighbors  Big data  Instance selection  Iterative calculation  Lazy learning
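The paper's Spark CNN algorithm is not reproduced here. As a hypothetical illustration of the general idea (parallelizing each CNN sweep by broadcasting the condensed set to the workers), a PySpark sketch might look like the following; all names and the batch-absorption strategy are assumptions, not the authors' design:

```python
import numpy as np
from pyspark.sql import SparkSession

# Hypothetical sketch: parallelize each CNN sweep on Spark by broadcasting
# the current condensed set and scanning the partitions for misclassified
# instances. Unlike the strictly sequential classic rule, this batch
# variant absorbs all misclassified instances found in one sweep.
# This is NOT the algorithm from the paper.

spark = SparkSession.builder.appName("spark-cnn-sketch").getOrCreate()
sc = spark.sparkContext

def misclassified(partition, store_X, store_y):
    # Yield the (x, label) pairs that 1-NN over the condensed set gets wrong.
    for x, label in partition:
        d = np.linalg.norm(store_X - x, axis=1)
        if store_y[int(np.argmin(d))] != label:
            yield (x, label)

def spark_cnn(rdd, seed):
    # rdd: RDD of (numpy feature vector, label) pairs; seed: one such pair.
    store = [seed]                  # the condensed set, kept on the driver
    while True:
        bx = sc.broadcast(np.array([x for x, _ in store]))
        by = sc.broadcast(np.array([l for _, l in store]))
        wrong = rdd.mapPartitions(
            lambda p: misclassified(p, bx.value, by.value)).collect()
        if not wrong:               # a full sweep added nothing: done
            return store
        store.extend(wrong)

# Example usage (X, y held on the driver):
# store = spark_cnn(sc.parallelize(list(zip(X, y))).cache(), (X[0], y[0]))
```

Keeping the training data in memory across iterations (via `cache()`) is the usual reason Spark outperforms MapReduce on iterative algorithms like CNN, since MapReduce rereads the data from disk in every round.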
