
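A Distributed Assistant Associative Classification Algorithm in a Big Data Environment
Cite this article: ZHANG Ming-Wei, ZHU Zhi-Liang, LIU Ying, ZHANG Bin. A Distributed Assistant Associative Classification Algorithm in a Big Data Environment [J]. Journal of Software (软件学报), 2015, 26(11): 2795-2810.
Authors: ZHANG Ming-Wei  ZHU Zhi-Liang  LIU Ying  ZHANG Bin
Affiliations: Software College, Northeastern University, Shenyang, Liaoning 110004; Software College, Northeastern University, Shenyang, Liaoning 110004; Software College, Northeastern University, Shenyang, Liaoning 110004; College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110004
Funding: National Natural Science Foundation of China (61100027, 61374178, 61202085, 61572117, 61572116); Fundamental Research Funds for the Central Universities (N13 0417003); Specialized Research Fund for the Doctoral Program of Higher Education (20120042120010)
Abstract: In many real-world classification applications, the class label of new data must ultimately be decided by a domain expert, and the classifier's result serves only as an aid. Moreover, as the value hidden in big data receives growing attention, classifier training will gradually move from single datasets to spatially distributed datasets, and assistant classification in big data environments will become an important branch of future classification applications. Existing classification research, however, pays little attention to such applications. Assistant classification in a big data environment faces three problems: 1) the training set is a distributed big dataset; 2) in space, the class distributions of the local data sources that make up the training set differ from one another; 3) in time, the training set changes dynamically and class drift can occur. Taking these problems into account, a distributed assistant associative classification method for big data environments is proposed. The method first presents an algorithm for building a distributed associative classifier in a big data environment. The algorithm uses horizontal weighting to account for spatial differences in the class distribution of the classification dataset, and introduces an "antecedent space support-correlation coefficient" measure framework to remedy the performance deficiency of associative classification algorithms on imbalanced data. It then presents an adaptive-factor-based method for dynamically adjusting the assistant associative classifier, which makes full use of the results fed back by domain experts in real time while the classifier is in use, so as to improve its classification performance on dynamic datasets, slow its degradation, and reduce how often it must be retrained. Experimental results show that the method can fairly quickly train associative classifiers with relatively high classification accuracy on distributed datasets and can improve classification performance as the datasets keep expanding and changing, making it an effective method for assistant classification applications in big data environments.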

Keywords: big data  distributed  assistant classification  associative classification  dynamic classifier
Received: 2015/5/27
Revised: 2015/8/26

Distributed Assistant Associative Classification Algorithm in Big Data Environment
ZHANG Ming-Wei, ZHU Zhi-Liang, LIU Ying and ZHANG Bin. Distributed Assistant Associative Classification Algorithm in Big Data Environment [J]. Journal of Software, 2015, 26(11): 2795-2810.
Authors: ZHANG Ming-Wei  ZHU Zhi-Liang  LIU Ying and ZHANG Bin
Affiliation: Software College, Northeastern University, Shenyang 110004, China; Software College, Northeastern University, Shenyang 110004, China; Software College, Northeastern University, Shenyang 110004, China; and College of Information Science and Engineering, Northeastern University, Shenyang 110004, China
Abstract: In many practical classification applications, the class label of new data must ultimately be confirmed by a domain expert, and the classifier's output plays only an assistant role. In addition, as the implicit value of big data attracts more attention, classifier training is shifting from single datasets to spatially distributed datasets, and assistant classification in big data environments will become an important branch of future classification applications. However, existing classification research pays little attention to this kind of application. Assistant classification in a big data environment faces three problems: 1) the training set is a distributed big dataset; 2) in space, the class distributions of the local datasets that make up the training set commonly differ; and 3) in time, the training set is dynamic and its class distribution may change. To address these problems, this paper proposes a distributed assistant associative classification approach for big data environments. First, an algorithm for constructing a distributed associative classifier in a big data environment is presented. In this algorithm, horizontal weighting accounts for the spatial differences in class distribution across the classification dataset, and a measure framework of "antecedent space support-correlation coefficient" mitigates the performance deficiency of associative classification algorithms on datasets with imbalanced class distributions. Next, an adaptive-factor-based dynamic adjustment method for the assistant associative classifier is proposed. This method makes full use of domain experts' real-time feedback to adjust the classifier dynamically while it is in use, which improves its performance on dynamic datasets and reduces how often it must be retrained. Experimental results demonstrate that the approach can train associative classifiers with relatively high classification accuracy fairly quickly on distributed datasets, and can improve their performance as the datasets continually expand and change. It is therefore an effective approach for assistant classification applications in big data environments.
Keywords: big data  distributed  assistant classification  associative classification  dynamic classifier
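
The abstract names two rule measures, antecedent support and a correlation coefficient, computed over several local data sources with per-source weighting, but it does not give their definitions. As a rough illustration only, the sketch below scores a single class association rule X -> c by summing weighted local counts, then taking the fraction of records that match X and the standard phi coefficient between "matches X" and "has class c". The `SiteCounts` type, the `weight` field, the weighted-sum aggregation, and all function names are assumptions made for this example, not the paper's algorithm.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class SiteCounts:
    """Counts for one rule X -> c at one local data source (hypothetical type)."""
    n_xc: int            # records that match antecedent X and have class c
    n_x: int             # records that match antecedent X
    n_c: int             # records that have class c
    n: int               # all records at this source
    weight: float = 1.0  # assumed per-source weight; the paper's weighting is not given


def aggregate(sites):
    """Combine local counts by a weighted sum (an assumed form of weighting)."""
    n_xc = sum(s.weight * s.n_xc for s in sites)
    n_x = sum(s.weight * s.n_x for s in sites)
    n_c = sum(s.weight * s.n_c for s in sites)
    n = sum(s.weight * s.n for s in sites)
    return n_xc, n_x, n_c, n


def antecedent_support(n_x, n):
    """Fraction of (weighted) records that match the antecedent X."""
    return n_x / n if n else 0.0


def phi_coefficient(n_xc, n_x, n_c, n):
    """Standard phi coefficient of the 2x2 table for 'matches X' vs 'has class c'."""
    denom = sqrt(n_x * (n - n_x) * n_c * (n - n_c))
    if denom == 0:
        return 0.0  # degenerate table: one margin is empty, correlation undefined
    return (n * n_xc - n_x * n_c) / denom


if __name__ == "__main__":
    # Two illustrative sources with different class distributions (made-up counts).
    sites = [SiteCounts(30, 40, 50, 100), SiteCounts(10, 10, 50, 100)]
    totals = aggregate(sites)
    print(antecedent_support(totals[1], totals[3]), phi_coefficient(*totals))
```

Unlike confidence, the phi coefficient compares the rule's hit rate with the overall frequency of the class, which is one common reason for preferring a correlation measure when classes are imbalanced.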