
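An improvement of random forests algorithm based on comprehensive sampling without replacement
Cite this article: LI Hui, LI Zheng, SHE Kun. An improvement of random forests algorithm based on comprehensive sampling without replacement[J]. Computer Engineering & Science, 2015, 37(7): 1233-1238
Authors: LI Hui  LI Zheng  SHE Kun
Affiliation: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731
Funding: Supported by the Sichuan Province Science and Technology Support Program
Abstract: Data mining is an important method in big-data service computing and matters greatly for optimizing service computing. As a typical data mining method, random forest achieves relatively high accuracy and is therefore widely used. To handle big-data problems in service computing more accurately and efficiently, further improving the accuracy and efficiency of random forest has become an extremely important research topic. By changing the sample size of the training set and the sampling method, and by analyzing both balanced and unbalanced sample sets, we find that with these two improvements, within the optimized interval, the generalization error on balanced sample sets decreases by 12%~20%. Changing the sampling method alone shortens the running time of the algorithm, improving efficiency by 10%~40%. For unbalanced data, efficiency also improves noticeably. Both theory and experiments show that the improved random forest algorithm based on comprehensive sampling without replacement raises accuracy on balanced samples, making this data mining method better suited to big-data analysis and processing in service computing.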

Keywords: random forest  balanced data  unbalanced data  sampling without repetition
Received: 2014-08-05
Revised: 2015-07-25

An improvement of random forests algorithm based on comprehensive sampling without replacement
LI Hui,LI Zheng,SHE Kun. An improvement of random forests algorithm based on comprehensive sampling without replacement[J]. Computer Engineering & Science, 2015, 37(7): 1233-1238
Authors:LI Hui  LI Zheng  SHE Kun
Affiliation:(School of Computer Science and Engineering,University of Electronic Science and Technology of China,Chengdu 611731,China)
Abstract:Data mining is an important method in big data and service computing. As a typical data mining method, random forest is widely used because of its low error rate. To handle big data more accurately and efficiently, we further improve the accuracy and efficiency of the random forest. We demonstrate both theoretically and experimentally that our method can decrease the generalization error by about 12%~20% when the number of samples we draw exceeds the number of samples. Moreover, we replace repeated sampling with a simpler method, which we show to be equivalent to repeated sampling. In this way, we reduce the time needed to build the forest, improving efficiency by about 10%~40% when this change is used alone. This gain also compensates for the efficiency loss caused by the first improvement. Combining the two methods, we improve efficiency on unbalanced data by 10% and improve accuracy on balanced data by over 12% without any impact on efficiency. The proposed method is therefore more suitable for big data analysis and processing in service computing than the original method.
Keywords:random forest  balanced data  unbalanced data  sampling without replacement
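
The abstract contrasts repeated sampling (the bootstrap used by standard random forests) with sampling without replacement, but this page does not give the paper's actual procedure. The following is only a minimal sketch of the two sampling schemes, not the authors' algorithm. The function names, the data set size n = 1000, and the subsample size k = 632 are all assumptions chosen for illustration; k is set near 63.2% of n because a bootstrap sample of size n contains roughly that fraction of distinct items.

```python
import random

def bootstrap_indices(n, rng):
    """Standard bagging: draw n indices WITH replacement (duplicates allowed)."""
    return [rng.randrange(n) for _ in range(n)]

def without_replacement_indices(n, k, rng):
    """Draw k distinct indices WITHOUT replacement (requires k <= n)."""
    return rng.sample(range(n), k)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 1000                 # hypothetical training-set size
k = 632                  # hypothetical subsample size, about 63.2% of n

boot = bootstrap_indices(n, rng)
sub = without_replacement_indices(n, k, rng)

# Fraction of distinct items in the bootstrap sample (about 1 - 1/e on average).
unique_fraction = len(set(boot)) / n

print(f"bootstrap: {len(boot)} draws, {len(set(boot))} distinct")
print(f"without replacement: {len(sub)} draws, {len(set(sub))} distinct")
```

Each tree in a forest would be trained on the rows selected by one such index list. The without-replacement draw never produces duplicates, so a tree sees k distinct rows rather than n rows with repeats.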