Software defect prediction based on imbalanced datasets |
| |
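Cite this article: | Zhang Deping. Software defect prediction based on imbalanced datasets [J]. Application Research of Computers (计算机应用研究), 2017, 34(7). |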
| |
Author: | Zhang Deping |
| |
Author affiliation: | College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics |
| |
Funding: | Supported by the National Defense Key Project fund, No. JCKY2016206B001, and the National Defense General Project fund, No. JCKY2014206C002 |
| |
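Abstract: | Data imbalance is a serious and unavoidable problem in software defect prediction research. To address it, this paper proposes an algorithm that combines over-sampling, which synthesizes new samples from a distribution function, with random under-sampling. The algorithm first fits a distribution function to the principal components obtained after dimensionality reduction, then uses the fitted distribution function to generate random numbers, filters the generated random numbers, and finally combines the result with random under-sampling. The experimental data are taken from the NASA MDP datasets, and the method is compared with the classic SMOTE under-sampling method. The G-mean and F-measure values show that the former's predictions are clearly better than the latter's, with higher prediction accuracy.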
|
Keywords: | Software failure prediction; Imbalanced data; Principal component analysis; Classification and regression tree |
Received: | 2016/9/20 |
Revised: | 2017/5/15 |
Software defect prediction based on imbalanced datasets |
| |
Affiliation: | College of Computer Science and Technology, Nanjing University of Aeronautics & Astronautics, Nanjing 210016 |
| |
Abstract: | Data imbalance is a serious and unavoidable problem in software defect prediction research. To address it, this paper proposes a sampling method that combines over-sampling, which uses a distribution function to synthesize new samples, with random under-sampling. The method first reduces the dimension of the original dataset, then fits a distribution function to the principal components and uses it to generate random values. These values are filtered by truncation and by removal of noise samples, and the over-sampled data are combined with random under-sampling to obtain the training and testing sets. The datasets are taken from the NASA MDP datasets, and the results are compared with SMOTE combined with random under-sampling. A comparison of the G-mean and F-measure values shows that the method using the distribution function and random under-sampling outperforms SMOTE with random under-sampling. |
| |
Keywords: | Software failure prediction; Imbalanced datasets; Principal component analysis; Classification and regression tree |
|
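The sampling procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. It assumes that the principal-component scores have already been computed, that each component is modelled by an independent normal distribution (the abstract does not name the distribution family), and that filtering means discarding draws outside the observed range of each component. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def draw_truncated(rng, dist, lo, hi):
    """Draw one value from dist, rejecting draws outside [lo, hi]."""
    if lo == hi:  # constant component: nothing to sample
        return lo
    while True:
        value = rng.gauss(dist.mean, dist.stdev)
        if lo <= value <= hi:
            return value


def oversample_minority(scores, n_new, seed=0):
    """Synthesize n_new rows from minority-class principal-component scores.

    Assumption: one independent normal distribution is fitted per component,
    and draws are truncated to the observed range of that component.
    """
    rng = random.Random(seed)
    columns = list(zip(*scores))
    fits = [NormalDist.from_samples(col) for col in columns]
    bounds = [(min(col), max(col)) for col in columns]
    return [
        [draw_truncated(rng, f, lo, hi) for f, (lo, hi) in zip(fits, bounds)]
        for _ in range(n_new)
    ]


def undersample_majority(rows, n_keep, seed=0):
    """Keep a uniform random subset of the majority-class rows."""
    return random.Random(seed).sample(rows, n_keep)


def rebalance(minority, majority, target, seed=0):
    """Bring both classes to `target` rows each.

    Requires len(minority) <= target <= len(majority).
    """
    synthetic = oversample_minority(minority, target - len(minority), seed)
    kept = undersample_majority(majority, target, seed)
    return minority + synthetic, kept
```

Calling `rebalance(minority, majority, target=10)` on lists of component scores returns two classes of ten rows each. The choice of `target` is left to the caller, since the abstract does not state the class ratio used in the experiments.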
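The two evaluation measures named in the abstract, G-mean and F-measure, can be computed from a binary confusion matrix. The sketch below uses the common definitions, which are an assumption here because the abstract does not spell them out: G-mean as the geometric mean of recall and specificity, and F-measure as the harmonic mean of precision and recall (the balanced form with beta = 1). The function name is hypothetical.

```python
import math


def g_mean_and_f_measure(y_true, y_pred, positive=1):
    """Return (G-mean, F-measure) for binary labels.

    Assumption: `positive` marks the defective (minority) class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Ratios default to 0.0 when their denominator is empty.
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0

    g_mean = math.sqrt(recall * specificity)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return g_mean, f_measure
```

Both measures penalize a classifier that ignores the minority class, which is why they are preferred over plain accuracy on imbalanced data.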