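Network big data classification processing method based on Spark and distributed KNN classifier (original title: Spark框架结合分布式KNN分类器的网络大数据分类处理方法)

Cite this article:	Cao Yu, Wang Nan, Xu Zhichao. Network big data classification processing method based on Spark and distributed KNN classifier [J]. Application Research of Computers (计算机应用研究), 2019, 36(11).

Authors:	Cao Yu, Wang Nan, Xu Zhichao

Affiliations:	Computer Department, Harbin Finance University, Harbin 150030; School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117; College of Computer Science, Jilin University, Changchun 130012; School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117

Funding:	National Natural Science Foundation of China (61702213); Jilin Provincial Department of Education "13th Five-Year Plan" Science and Technology Research Project (JJKH20180463KJ)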

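Abstract:	Existing big data classification methods struggle to meet the time and storage constraints of big data applications. To address this, the paper proposes a design method for a parallel multi-label K-nearest neighbor classifier for big data, built on the Apache Spark framework. To reduce the cost of existing MapReduce schemes by using in-memory operations, the method first divides the training set into several partitions using Apache Spark's parallel mechanism. In the map phase, it finds the K nearest neighbors of the sample to be predicted within each partition; in the reduce phase, it determines the final K nearest neighbors from the map-phase results. Finally, it aggregates the label sets of the neighbors in parallel and outputs the target label set of the sample by maximizing the posterior probability. In experiments on four big data classification datasets, including PokerHand, the proposed method achieved a lower Hamming loss, demonstrating its effectiveness.
Keywords:	classification processing; Apache Spark; parallel mechanism; data mining; Hamming loss; K-nearest neighbor
Received:	2018/5/7
Revised:	2018/7/3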
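
The partition, map, and reduce steps described in the abstract can be sketched in plain Python. This is only an illustrative sketch under assumptions: the page gives no implementation details, so the function names (`local_knn`, `merge_knn`, `distributed_knn`), the Euclidean distance, and the in-memory list of partitions are all hypothetical stand-ins for whatever the authors run on Spark.

```python
import heapq
import math


def local_knn(partition, query, k):
    """Map step (assumed form): k nearest neighbours of `query` in one partition.

    Each partition is a list of (feature_tuple, label_set) pairs.
    Returns a list of (distance, label_set) pairs, closest first.
    """
    scored = [(math.dist(features, query), labels) for features, labels in partition]
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


def merge_knn(left, right, k):
    """Reduce step (assumed form): merge two candidate lists, keep the k closest."""
    return heapq.nsmallest(k, left + right, key=lambda pair: pair[0])


def distributed_knn(partitions, query, k):
    """Run the map step on every partition, then fold the results together."""
    candidates = [local_knn(part, query, k) for part in partitions]  # map
    result = []
    for cand in candidates:  # reduce
        result = merge_knn(result, cand, k)
    return result


# Toy data: two partitions of a tiny multi-label training set.
partitions = [
    [((0.0, 0.0), {"a"}), ((5.0, 5.0), {"b"})],
    [((1.0, 0.0), {"a", "c"}), ((9.0, 9.0), {"b"})],
]
neighbours = distributed_knn(partitions, (0.0, 0.0), k=2)
```

Because every partition returns its own k best candidates, the true global k nearest neighbours are always among the merged candidates, so the reduce step loses nothing compared with a brute-force search over the whole training set.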
Network big data classification processing method based on Spark and distributed KNN classifier

Cao Yu, Wang Nan and Xu Zhichao. Network big data classification processing method based on Spark and distributed KNN classifier [J]. Application Research of Computers, 2019, 36(11).

Authors:	Cao Yu, Wang Nan and Xu Zhichao

Affiliation:	Computer Department, Harbin Finance University, Harbin, Heilongjiang

Abstract:	Existing big data classification methods struggle to meet the time and storage constraints of big data applications. To address this, this paper proposes a design method for a parallel multi-label K-nearest neighbor classifier for big data, built on the Apache Spark framework. To reduce the cost of existing MapReduce schemes by using in-memory operations, the method first divides the training set into several partitions using Apache Spark's parallel mechanism. In the map phase, it finds the K nearest neighbors of the sample to be predicted within each partition; in the reduce phase, it determines the final K nearest neighbors from the map-phase results. Finally, it aggregates the label sets of the neighbors in parallel and outputs the target label set of the sample by maximizing the posterior probability. In experiments on four big data classification datasets, including PokerHand, the proposed method achieved a lower Hamming loss, demonstrating its effectiveness.

Keywords:	classification processing; Apache Spark; parallelism; data mining; Hamming loss; K-nearest neighbor
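
The final stage named in the abstract (aggregating neighbor label sets, choosing labels by maximum posterior probability, and scoring with Hamming loss) can be sketched as follows. This is a minimal sketch under assumptions: the abstract does not give the exact decision rule, so the code uses the standard per-label Bayesian comparison from multi-label KNN, and the names `count_label_votes`, `map_predict`, `hamming_loss`, and the prior and likelihood tables are hypothetical inputs that would normally be estimated from training data.

```python
def count_label_votes(neighbor_label_sets, all_labels):
    """Aggregate neighbour label sets: how many neighbours carry each label."""
    return {
        label: sum(label in labels for labels in neighbor_label_sets)
        for label in all_labels
    }


def map_predict(votes, prior, likelihood_pos, likelihood_neg):
    """Keep a label when its posterior beats the posterior of its absence.

    prior[label]             -- assumed P(label present)
    likelihood_pos[label][c] -- assumed P(c neighbours carry label | present)
    likelihood_neg[label][c] -- assumed P(c neighbours carry label | absent)
    """
    predicted = set()
    for label, c in votes.items():
        p_present = prior[label] * likelihood_pos[label][c]
        p_absent = (1.0 - prior[label]) * likelihood_neg[label][c]
        if p_present > p_absent:
            predicted.add(label)
    return predicted


def hamming_loss(true_sets, pred_sets, n_labels):
    """Fraction of (sample, label) pairs that are predicted wrongly."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * n_labels)


# Toy example with k = 2 neighbours and two labels.
votes = count_label_votes([{"a"}, {"a", "b"}], ["a", "b"])
prior = {"a": 0.5, "b": 0.5}
likelihood_pos = {"a": [0.1, 0.3, 0.6], "b": [0.1, 0.3, 0.6]}
likelihood_neg = {"a": [0.6, 0.3, 0.1], "b": [0.6, 0.3, 0.1]}
predicted = map_predict(votes, prior, likelihood_pos, likelihood_neg)
loss = hamming_loss([{"a"}, {"b"}], [{"a"}, {"a", "b"}], n_labels=2)
```

Hamming loss counts each wrongly included or wrongly omitted label once, so lower values mean the predicted label sets are closer to the true ones.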
This article is indexed in Wanfang Data and other databases.