
An improvement of the random forest algorithm based on comprehensive sampling without replacement

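Cite this article:	LI Hui, LI Zheng, SHE Kun. An improvement of the random forest algorithm based on comprehensive sampling without replacement[J]. Computer Engineering & Science, 2015, 37(7): 1233-1238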

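Authors:	LI Hui, LI Zheng, SHE Kun

Affiliation:	School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731

Funding:	Supported by the Sichuan Province Science and Technology Support Program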

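Abstract:	Data mining is an important method in big-data service computing and is significant for optimizing service computing. As a typical data mining method, random forest achieves relatively high accuracy and is therefore widely used. To handle big-data problems in service computing more accurately and efficiently, further improving the accuracy and efficiency of random forests has become an extremely important research topic. By changing the sample size of the training set and the sampling method, and by analyzing both balanced and unbalanced sample sets, we find that with these two improvements the generalization error on balanced sample sets decreases by 12%~20% within the optimized range; changing the sampling method alone shortens the running time of the algorithm and improves efficiency by 10%~40%; and for unbalanced data, efficiency is also clearly improved. Both theory and experiments show that the improved random forest algorithm based on comprehensive sampling without replacement increases accuracy on balanced samples, making this data mining method better suited to big-data analysis and processing in service computing.
Keywords:	random forest; balanced data; unbalanced data; sampling without replacement
Received:	2014-08-05
Revised:	2015-07-25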
An improvement of random forests algorithm based on comprehensive sampling without replacement

LI Hui, LI Zheng, SHE Kun. An improvement of random forests algorithm based on comprehensive sampling without replacement[J]. Computer Engineering & Science, 2015, 37(7): 1233-1238

Authors:	LI Hui, LI Zheng, SHE Kun

Affiliation:	School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract:	Data mining is an important method in big data and service computing. As a typical data mining method, random forest is widely used because of its low error rate. To deal with big data more accurately and efficiently, we further improve the accuracy and efficiency of the random forest. We show, both theoretically and experimentally, that our method can decrease the generalization error by about 12%~20% when the number of samples drawn exceeds the number of samples in the data set. Moreover, we replace repeated sampling (sampling with replacement) with a simpler method that proves equivalent to it. In this way, we reduce the time needed to build the forest, improving efficiency by about 10%~40% when this change is used alone. This gain can also make up for the efficiency loss caused by the first improvement. Combining the two methods, we improve efficiency on unbalanced data by 10% and improve accuracy on balanced data by more than 12% without any impact on efficiency. Therefore, the proposed method is more suitable than the original method for big data analysis and processing in service computing.

Keywords:	random forest; balanced data; unbalanced data; sampling without replacement
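
The abstract contrasts repeated sampling (sampling with replacement, as in classic bagging) with sampling without replacement, but it does not give the details of the authors' procedure. The following is only a minimal sketch of the two generic sampling schemes, not the algorithm from the paper. All function names are hypothetical, and the choice to size the no-replacement subsample by the expected number of distinct items in a bootstrap sample is an assumption made purely for illustration.

```python
import random


def bootstrap_indices(n, m, rng):
    """Draw m indices from range(n) WITH replacement (duplicates allowed)."""
    return [rng.randrange(n) for _ in range(m)]


def no_replacement_indices(n, m, rng):
    """Draw m distinct indices from range(n) WITHOUT replacement (needs m <= n)."""
    return rng.sample(range(n), m)


def expected_distinct_fraction(n, m):
    """Expected fraction of the n items seen at least once in m draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** m


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 1000

# Classic bootstrap: n draws with replacement from n items.
boot = bootstrap_indices(n, n, rng)

# Assumed illustration: a subsample without replacement whose size matches
# the expected number of distinct items in the bootstrap sample.
frac = expected_distinct_fraction(n, n)
sub = no_replacement_indices(n, round(n * frac), rng)

print("distinct fraction in bootstrap sample:", len(set(boot)) / n)
print("expected distinct fraction:", frac)
print("subsample size without replacement:", len(sub))
```

For large n the expected distinct fraction approaches 1 - 1/e, roughly 63.2%, which is why a bootstrap sample of size n typically leaves about a third of the training items unused for a given tree.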
