
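A data sampling method based on double decision tree
Cite this article:CHEN Li, FEI Hongxiao, DING Hailun, CHENG Lin, ZHAI Jiyu. A data sampling method based on double decision tree[J]. Computer Engineering & Science, 2019, 41(1): 130-135.
Authors:CHEN Li  FEI Hongxiao  DING Hailun  CHENG Lin  ZHAI Jiyu
Affiliation:School of Geosciences and Info-Physics, Central South University, Changsha, Hunan 410075; School of Software, Central South University, Changsha, Hunan 410075
Funding:National Natural Science Foundation of China (61602525); Central South University 2017 Undergraduate Free Exploration Project (201710533267, ZY20170769)
Abstract:In data mining, a basic assumption is that the training samples and the test samples follow the same data distribution. As data volumes grow, however, finding representative data in massive data sets becomes particularly difficult. A study of existing data selection methods shows that traditional methods such as simple random sampling and progressive sampling are not integrated with the data mining tool, so their results are subject to chance and uncertainty. The sampled data can therefore hardly guarantee the basic assumption of data mining, which in turn leads to a large generalization error in the final model. To address inter-class imbalance during data sampling, a structured data sampling method based on double decision trees is proposed. First, a decision tree is generated with the C4.5 algorithm and used to select suitable data and data collection points from the data source. A second decision tree then evaluates the quality of the selected data set, so that sampling is both efficient and of high quality. Experiments show that, compared with simple random sampling, models trained on the newly sampled data achieve markedly higher accuracy.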

Keywords:decision tree  data sampling  machine learning
Received:2017-10-17
Revised:2019-01-25

A data sampling method based on double decision tree
CHEN Li, FEI Hongxiao, DING Hailun, CHENG Lin, ZHAI Jiyu. A data sampling method based on double decision tree[J]. Computer Engineering & Science, 2019, 41(1): 130-135.
Authors:CHEN Li  FEI Hongxiao  DING Hailun  CHENG Lin  ZHAI Jiyu
Affiliation:(1. School of Geosciences and Info-Physics, Central South University, Changsha 410075; 2. School of Software, Central South University, Changsha 410075, China)
Abstract:In data mining, a basic assumption is that the training samples and the test samples follow the same data distribution. As data volumes grow, however, finding representative data in massive data sets becomes particularly difficult. Studying existing data selection methods, we find that traditional methods such as simple random sampling and progressive sampling are not integrated with the data mining tool, which makes their sampling effect difficult to evaluate. Because their results are subject to chance and uncertainty, the sampled data can hardly guarantee this basic assumption, which in turn increases the generalization error of the model. To solve these problems, we propose a structured data sampling method based on double decision trees. First, we generate a decision tree with the C4.5 algorithm and use it to select appropriate data and data collection points in the data source. Then, we generate another decision tree to evaluate the quality of the selected data set, achieving efficient, high-quality data sampling. Experiments show that, compared with simple random sampling, models trained on data sampled by our method achieve clearly higher accuracy.
Keywords:decision tree  data sampling  machine learning  
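
The abstract names C4.5 as the algorithm behind the first tree but does not say how that tree guides data selection. As a rough illustration only, the sketch below computes the gain ratio, the split criterion C4.5 is known for, on categorical attributes, and then draws the same fraction of rows from each partition of the best attribute. The function `stratified_sample`, the toy data, and the sampling rule are assumptions made for this example, not the authors' procedure; the second, evaluating tree is not covered.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, labels, attr):
    """C4.5 split criterion: information gain divided by split information."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    total = len(labels)
    # Expected label entropy after partitioning on attr.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    # Split information is the entropy of the attribute values themselves.
    split_info = entropy([row[attr] for row in rows])
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return (entropy(labels) - remainder) / split_info


def stratified_sample(rows, attr, fraction, seed=0):
    """Hypothetical selection step (an assumption, not from the paper):
    draw the same fraction of row indices from every partition of attr."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[attr]].append(i)
    chosen = []
    for indices in groups.values():
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy data: attribute "a" determines the class, attribute "b" is uninformative.
rows = [
    {"a": "x", "b": "p"},
    {"a": "x", "b": "q"},
    {"a": "y", "b": "p"},
    {"a": "y", "b": "q"},
]
labels = [0, 0, 1, 1]

best_attr = max(["a", "b"], key=lambda attr: gain_ratio(rows, labels, attr))
sample_indices = stratified_sample(rows, best_attr, fraction=0.5)
```

On the toy data, attribute `a` separates the two classes perfectly, so it is chosen as the split, and the sample keeps one row from each of its partitions. Plain random sampling of two rows could instead return two rows of the same class.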
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).