Similar Documents
20 similar documents found
1.
Ji Haijin, Huang Song, Wu Yaning, Hui Zhanwei, Zheng Changyou. Software Quality Journal, 2019, 27(3): 923-968

Software defect prediction (SDP) plays a significant part in identifying the most defect-prone modules before software testing and in allocating limited testing resources. One of the most commonly used classifiers in SDP is naive Bayes (NB). Despite its simplicity, the NB classifier can often perform better than more complicated classification models. In NB, the features are assumed to be equally important, and the numeric features are assumed to follow a normal distribution. However, the features often do not contribute equally to the classification, and a Kolmogorov-Smirnov test usually shows that they do not follow a normal distribution; both issues may harm the performance of the NB classifier. Therefore, this paper proposes a new weighted naive Bayes method based on information diffusion (WNB-ID) for SDP. More specifically, for the equal-importance assumption, we investigate six weight assignment methods for setting the feature weights and choose the most suitable one based on the F-measure. For the normal-distribution assumption, we apply the information diffusion model (IDM) to compute the probability density of each feature instead of the default probability density function of the normal distribution. We carry out experiments on 10 software defect data sets from three types of projects in three different programming languages provided by the PROMISE repository. Several well-known classifiers and ensemble methods are included for comparison. The final experimental results demonstrate the effectiveness and practicability of the proposed method.
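
As a rough sketch of those two changes, the snippet below weights each feature's log-likelihood and replaces the Gaussian density of standard NB with a kernel density estimate; the KDE only stands in for the IDM, and the weight vector is assumed to come from one of the six weighting schemes, neither of which is reproduced here.

```python
import numpy as np

class WeightedKernelNB:
    """Naive Bayes with per-feature weights and kernel density estimates.

    Sketch only: the feature weights would come from one of the six weight
    assignment methods, and the Gaussian KDE below merely stands in for the
    information diffusion model (IDM); neither is reproduced here.
    """

    def __init__(self, feature_weights, bandwidth=1.0):
        self.w = np.asarray(feature_weights, dtype=float)
        self.h = bandwidth

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        self.samples_ = [X[y == c] for c in self.classes_]   # training data kept per class
        return self

    def _log_density(self, Xc, x_j, j):
        # Gaussian kernel density estimate of feature j for one class.
        diffs = (x_j[:, None] - Xc[None, :, j]) / self.h
        dens = np.exp(-0.5 * diffs ** 2).mean(axis=1) / (self.h * np.sqrt(2 * np.pi))
        return np.log(dens + 1e-12)

    def predict(self, X):
        X = np.asarray(X, float)
        scores = []
        for prior, Xc in zip(self.priors_, self.samples_):
            log_lik = sum(self.w[j] * self._log_density(Xc, X[:, j], j)
                          for j in range(X.shape[1]))
            scores.append(np.log(prior) + log_lik)
        return self.classes_[np.argmax(np.vstack(scores), axis=0)]
```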


2.
Various methods for ensemble selection and classifier combination have been designed to optimize the performance of ensembles of classifiers. However, the use of a large number of features in the training data can affect the classification performance of machine learning algorithms. The objective of this paper is to present a novel feature elimination (FE) based ensemble learning method, implemented as an extension to an existing machine learning environment. Standard 12-lead ECG signal recordings are used to diagnose arrhythmia by classifying subjects as normal or abnormal. The advantage of the proposed approach is that it reduces the size of the feature space by applying several feature elimination methods; the decisions obtained from these methods are coalesced to form fused data. The idea behind this work is thus to discover a reduced feature space such that a classifier built on this smaller data set performs no worse than a classifier built on the original data set. A random subspace based ensemble classifier is used with the PART rule learner as the base classifier. The approach is implemented and evaluated on the UCI ECG signal data, and classification performance is evaluated using mean absolute error, root mean squared error, relative absolute error, F-measure, classification accuracy, the receiver operating characteristic, and the area under the curve. The proposed approach provides an attractive overall classification accuracy of 91.11% on the unseen test set, and it is shown to perform well with ensemble sizes of 15 and 20.
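
A minimal sketch of the ensemble component, assuming scikit-learn: the random subspace method is obtained from a bagging wrapper that samples feature subsets rather than instances, and a CART decision tree stands in for the PART rule learner used in the paper; the ensemble sizes follow the abstract.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def random_subspace_ensemble(n_estimators=15, subspace_fraction=0.5):
    # bootstrap=False with max_features < 1.0 gives the random-subspace method:
    # every base tree sees all instances but only a random feature subset.
    return BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=n_estimators,
        max_features=subspace_fraction,
        bootstrap=False,
        bootstrap_features=False,
    )

# Hypothetical usage on a feature-reduced ECG matrix X_train (n_samples, n_features)
# and binary labels y_train (0 = normal, 1 = arrhythmia):
# model = random_subspace_ensemble(20).fit(X_train, y_train)
```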

3.
Software defect prediction can optimize the allocation of testing resources and improve software product quality by identifying potentially defective program modules in the project under test in advance. This paper presents an in-depth study of the cross-project defect prediction problem. When selecting instances from source projects, three different instance-similarity measures are considered, and their defect prediction results are found to be diverse. Therefore, an ensemble cross-project software defect prediction method based on the Box-Cox transformation, BCEL, is proposed. Specifically, different training sets are first selected from the candidate set using the different instance-similarity measures; a targeted Box-Cox transformation is then applied to each of these data sets, and different base classifiers are constructed with a specific classification method; finally, the three base classifiers are combined into an effective ensemble. The effectiveness of BCEL is verified on data sets from real projects, and the influence of the factors within BCEL on defect prediction performance is analyzed in depth.
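
A short sketch of the BCEL pipeline under stated assumptions: each similarity-selected training set is Box-Cox transformed, one base classifier is trained per set, and the members vote. The three instance-similarity measures are not reproduced, and the classifier choice is illustrative.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LogisticRegression

def train_bcel(training_sets, make_classifier=LogisticRegression):
    """training_sets: list of (X, y) pairs, each selected with a different similarity measure."""
    members = []
    for X, y in training_sets:
        X = np.asarray(X, float)
        shift = X.min(axis=0) - 1e-6                 # Box-Cox requires strictly positive inputs
        pt = PowerTransformer(method="box-cox", standardize=True)
        Xt = pt.fit_transform(X - shift)
        members.append((shift, pt, make_classifier().fit(Xt, y)))
    return members

def predict_bcel(members, X_new):
    X_new = np.asarray(X_new, float)
    votes = np.array([clf.predict(pt.transform(np.clip(X_new - shift, 1e-6, None)))
                      for shift, pt, clf in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote over binary 0/1 labels
```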

4.
Existing software defect prediction methods face problems such as class imbalance and high-dimensional data, and how to solve these problems effectively has become a research hotspot in the field. To address the class imbalance and low prediction accuracy in software defect prediction, this paper proposes DP_HSRS, a software defect prediction algorithm based on hybrid sampling and Random_Stacking. DP_HSRS first balances the imbalanced data with a hybrid sampling algorithm and then applies the Random_Stacking algorithm to the balanced data set for defect prediction. Random_Stacking is an effective improvement on the traditional Stacking algorithm: it builds multiple Stacking classifiers by fusing several classic classification algorithms with a Bagging mechanism, combines these Stacking classifiers by voting to obtain an ensemble classifier, and finally uses this ensemble to predict software defects. Experimental results on the NASA MDP data sets show that DP_HSRS outperforms existing algorithms and delivers better defect prediction performance.
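
A condensed, simplified sketch of that pipeline: plain random over-sampling stands in for the hybrid sampling step, and the classifier mix inside each Stacking member is an illustrative assumption rather than the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def oversample_minority(X, y, rng):
    """Duplicate random minority instances until both classes are equal in size."""
    X, y = np.asarray(X), np.asarray(y)
    counts = {c: np.sum(y == c) for c in np.unique(y)}
    minority = min(counts, key=counts.get)
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=max(counts.values()) - counts[minority], replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def random_stacking(X, y, n_members=5, seed=0):
    rng = np.random.default_rng(seed)
    Xb, yb = oversample_minority(X, y, rng)
    members = []
    for _ in range(n_members):
        boot = rng.integers(0, len(yb), size=len(yb))          # bagging over the balanced data
        stack = StackingClassifier(
            estimators=[("nb", GaussianNB()),
                        ("dt", DecisionTreeClassifier()),
                        ("rf", RandomForestClassifier(n_estimators=50))],
            final_estimator=LogisticRegression())
        members.append(stack.fit(Xb[boot], yb[boot]))
    return members

def predict_vote(members, X_new):
    votes = np.array([m.predict(X_new) for m in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)             # majority vote, labels 0/1
```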

5.
Software defect prediction aims to predict the defect proneness of new software modules from historical defect data so as to improve the quality of a software system. Software historical defect data has a complicated structure and a marked class-imbalance characteristic; how to fully analyze and utilize the existing historical defect data and build more precise and effective classifiers has attracted considerable interest from researchers in both academia and industry. Multiple kernel learning and ensemble learning are effective techniques in the field of machine learning. Multiple kernel learning can map the historical defect data to a higher-dimensional feature space and give them a better representation, and ensemble learning can use a series of weak classifiers to reduce the bias generated by the majority class and obtain better predictive performance. In this paper, we propose to use multiple kernel learning to predict software defects. Using the characteristics of the metrics mined from open source software, we obtain a multiple kernel classifier through an ensemble learning method, which has the advantages of both multiple kernel learning and ensemble learning. We thus propose a multiple kernel ensemble learning (MKEL) approach for software defect classification and prediction. Considering the cost of risk in software defect prediction, we design a new sample weight vector updating strategy to reduce the cost of risk caused by misclassifying defective modules as non-defective ones. We employ the widely used NASA MDP datasets as test data to evaluate the performance of all compared methods; experimental results show that MKEL outperforms several representative state-of-the-art defect prediction methods.
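
The kernel-combination part can be illustrated compactly with precomputed Gram matrices; the kernel choices and weights below are assumptions, and the boosting-style sample-weight updates of MKEL are not reproduced.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

def combined_gram(XA, XB, weights=(1.0, 1.0, 1.0), gammas=(0.1, 1.0)):
    # Weighted sum of several base kernels acts as one composite kernel.
    kernels = [linear_kernel(XA, XB),
               rbf_kernel(XA, XB, gamma=gammas[0]),
               rbf_kernel(XA, XB, gamma=gammas[1])]
    return sum(w * K for w, K in zip(weights, kernels))

def fit_multi_kernel_svm(X_train, y_train):
    clf = SVC(kernel="precomputed")
    clf.fit(combined_gram(X_train, X_train), y_train)
    return clf

# Prediction uses the cross-Gram matrix between test and training data:
# y_pred = clf.predict(combined_gram(X_test, X_train))
```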

6.

It is known that the naive Bayesian classifier (NB) works very well on some domains and poorly on others. The performance of NB suffers in domains that involve correlated features. C4.5 decision trees, on the other hand, typically perform better than the naive Bayesian algorithm on such domains. This paper describes a Selective Bayesian Classifier (SBC) that simply uses only those features that C4.5 would use in its decision tree when learning a small sample of the training set, thereby combining the two different natures of the classifiers. Experiments conducted on ten data sets indicate that SBC performs markedly better than NB on all domains, and SBC outperforms C4.5 on many of the data sets on which C4.5 outperforms NB. An Augmented Bayesian Classifier (ABC) is also tested on the same data, and SBC appears to perform as well as ABC. In most cases SBC can also eliminate more than half of the original attributes, which can greatly reduce the size of the training and test data as well as the running time. Further, the SBC algorithm typically learns faster than both C4.5 and NB, needing fewer training examples to reach high classification accuracy.
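
A minimal sketch of the SBC idea, assuming scikit-learn (whose CART trees stand in for C4.5): fit a small tree on a subsample, keep only the features it splits on, and train naive Bayes on that subset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def selective_bayes(X, y, sample_fraction=0.1, seed=0):
    X, y = np.asarray(X, float), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=max(1, int(sample_fraction * len(y))), replace=False)

    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    # tree_.feature holds the split feature per node; leaf nodes are marked with -2.
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])

    nb = GaussianNB().fit(X[:, used], y)
    return nb, used        # predict with: nb.predict(X_new[:, used])
```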

7.
Software systems have grown significantly in size and complexity. As a result, preventing software faults is extremely difficult. Software defect prediction (SDP) can assist developers in finding potential bugs and reducing maintenance costs. When it comes to lowering software costs and assuring software quality, SDP plays a critical role in software development. Automatically forecasting the number of errors in software modules is therefore important and may help developers allocate limited resources more efficiently. Several methods for detecting and addressing such flaws at low cost have been offered; these approaches, however, still need substantial improvement in terms of performance. Therefore, in this paper two deep learning (DL) models, a multilayer perceptron (MLP) and a deep neural network (DNN), are proposed. The proposed approaches combine the recently established whale optimization algorithm (WOA) with the complementary firefly algorithm (FA) to form the emphasized metaheuristic search (EMWS) algorithm, which selects fewer but closely related representative features. To find the best-performing classifier in terms of the prediction achievement measure, the classifiers were applied to five PROMISE repository datasets. Compared with existing methods, the proposed SDP technique performs better, with an accuracy of 0.91 on the JM1 dataset, 0.98 on KC2, 0.91 on PC1, 0.93 on MC2, and 0.92 on KC3.

8.
Objective: Time series often appear in medical databases, but only a few machine learning methods exist that process this kind of data properly. Most modeling techniques have been designed with a static data model in mind and are not suitable for coping with the dynamic nature of time series. Recurrent neural networks (RNNs) are often used to process time series, but only a few training algorithms exist for RNNs, and they are complex and often yield poor results. Therefore, researchers often turn to traditional machine learning approaches, such as support vector machines (SVMs), which can easily be set up and trained, and combine them with feature extraction (FE) and feature selection (FS) to process the high-dimensional temporal data. Recently, a new approach, called echo state networks (ESNs), has been developed to simplify the training process of RNNs. This approach allows the dynamics of a system to be modeled from time series data in a straightforward way. The objective of this study is to explore the advantages of using an ESN instead of other traditional classifiers combined with FE and FS for classification problems in the intensive care unit (ICU) when the input data consist of time series. While ESNs have mostly been used to predict the future course of a time series, we use the ESN model for classification instead. Although time series often appear in medical data, few medical applications of ESNs have been studied yet.

Methods and materials: An ESN is used to predict the need for dialysis between the fifth and tenth day after admission to the ICU. The input time series consist of diuresis and creatinine values measured during the first 3 days after admission. Data on 830 patients were used for the study, of whom 82 needed dialysis between the fifth and tenth day after admission. The ESN is compared to traditional classifiers, a sophisticated and a simple one, namely support vector machines and the naive Bayes (NB) classifier. Prior to the use of the SVM and NB classifiers, FE and FS are required to reduce the number of input features and thus alleviate the curse of dimensionality. Extensive feature extraction was applied to capture both the overall properties of the time series and the correlation between the different measurements in the time series. The feature selection method is a greedy hybrid filter-wrapper method using an NB classifier, which in each iteration selects the feature that improves prediction the most and shows little multicollinearity with the already selected set. Least squares regression with noise was used to train the linear readout function of the ESN to mitigate sensitivity to noise and overfitting. Fisher labeling was used to deal with the unbalanced data set. Parameter sweeps were performed to determine the optimal parameter values for the different classifiers. The area under the curve (AUC) and maximum balanced accuracy are used as performance measures; the required execution time was also measured.

Results: The classification performance of the ESN shows a significant difference at the 5% level compared to the performance of the SVM or the NB classifier combined with FE and FS. NB+FE+FS, with an average AUC of 0.874, has the best classification performance. This classifier is followed by the ESN, which has an average AUC of 0.849. SVM+FE+FS has the worst performance, with an average AUC of 0.838. The computation time needed to pre-process the data and to train and test the classifier is significantly lower for the ESN than for the SVM and NB.

Conclusion: It can be concluded that the use of an ESN has added value in predicting the need for dialysis through the analysis of time series data. The ESN requires significantly less processing time, needs no domain knowledge, is easy to implement, and can be configured using rules of thumb.
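
A compact echo state network with a ridge-regression readout gives the flavour of the classifier described above; reservoir size, spectral radius, and the output threshold are illustrative choices, and the noise injection and Fisher labeling used in the study are omitted.

```python
import numpy as np

class EchoStateClassifier:
    def __init__(self, n_inputs, n_reservoir=200, spectral_radius=0.9,
                 ridge=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_inputs))
        W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius
        self.W, self.ridge = W, ridge

    def _final_state(self, series):
        """Run one multivariate time series (T, n_inputs) through the reservoir."""
        x = np.zeros(self.W.shape[0])
        for u in series:
            x = np.tanh(self.W_in @ u + self.W @ x)
        return x

    def fit(self, series_list, labels):
        S = np.array([self._final_state(s) for s in series_list])
        y = np.asarray(labels, float)                  # 0 = no dialysis, 1 = dialysis
        # Ridge-regularised least-squares readout.
        A = S.T @ S + self.ridge * np.eye(S.shape[1])
        self.w_out = np.linalg.solve(A, S.T @ y)
        return self

    def decision_function(self, series_list):
        S = np.array([self._final_state(s) for s in series_list])
        return S @ self.w_out

    def predict(self, series_list, threshold=0.5):
        return (self.decision_function(series_list) >= threshold).astype(int)
```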

9.
When a software historical repository contains few labeled training samples, it is difficult to build an effective prediction model. To address this problem, this paper proposes a semi-supervised dictionary learning method for software defect prediction based on two-stage learning. In the first stage, a sparse representation classifier assigns probabilistic soft labels to a large number of unlabeled samples, which are then added to the labeled training set. In the second stage, discriminative dictionary learning is performed on the expanded training set, and defect proneness is finally predicted with the learned dictionary. Experiments on the NASA MDP and PROMISE AR data sets verify the superiority of the proposed method.

10.
In current software defect prediction (SDP) research, most previous empirical studies use only the datasets provided by the PROMISE repository, and this may threaten the external validity of previous empirical results. Instead of SDP dataset sharing, SDP model sharing is a potential solution to alleviate this problem and can encourage researchers in the research community and practitioners in industry to share more models. However, directly sharing models may result in privacy disclosure, such as through model inversion attacks. To the best of our knowledge, we are the first to apply differential privacy (DP) to privacy-preserving SDP model sharing, and we propose a novel method, DP-Share, since DP mechanisms can prevent such attacks when the privacy budget is carefully selected. In particular, DP-Share first performs data preprocessing on the dataset, such as over-sampling minority instances (i.e., defective modules) and discretizing continuous features to optimize privacy budget allocation. Then it uses a novel sampling strategy to create a set of training sets. Finally, it constructs decision trees based on these training sets, and these decision trees form a random forest (i.e., the model). The last phase of DP-Share uses the Laplace and exponential mechanisms to satisfy the requirements of DP. In our empirical studies, we choose nine experimental subjects from real software projects, use AUC (area under the ROC curve) as the performance measure, and use holdout as our model validation technique. After privacy and utility analysis, we find that DP-Share achieves better performance than the baseline method DF-Enhance in most cases when using the same privacy budget. Moreover, we also provide guidelines for effectively using the proposed method. Our work attempts to fill the research gap in terms of differential privacy for SDP, which can encourage researchers and practitioners to share more SDP models and thus effectively advance the state of the art of SDP.
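
The Laplace mechanism that the final phase relies on is easy to illustrate in isolation: class counts in a tree leaf are released with Laplace noise calibrated to sensitivity 1/epsilon, and the noisy majority becomes the leaf's prediction. This is a generic DP building block, not the full DP-Share algorithm.

```python
import numpy as np

def noisy_leaf_prediction(class_counts, epsilon, rng=None):
    """class_counts: true counts per class in one leaf; epsilon: privacy budget for this leaf."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(class_counts, dtype=float)
    # Adding or removing one training instance changes one count by 1 -> sensitivity 1.
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return int(np.argmax(noisy))

# e.g. a leaf with 30 non-defective and 5 defective modules, budget 0.5 for this leaf:
# noisy_leaf_prediction([30, 5], epsilon=0.5)
```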

11.
Software defect prediction is an effective way to improve software quality, and its effectiveness is closely related to the characteristics of the data set itself. To address redundant feature information and excessive dimensionality in defect prediction data sets, and exploiting the strong feature-learning ability of deep learning, a software defect prediction method based on a deep autoencoder network is proposed. The method first applies an unsupervised-learning-based sampling method to six open-source project data sets to resolve the class imbalance problem, and then trains a deep autoencoder model. The model reduces the feature dimensionality of the data; three classifiers are attached after the final layer, trained on the dimension-reduced training set, and the test set is then used for prediction. Experimental results show that, on data sets with high dimensionality and redundant feature information, the method outperforms both the baseline defect prediction models and models based on existing feature extraction methods, and it is applicable to different classification algorithms.
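
A minimal deep autoencoder for the feature-reduction step might look as follows (PyTorch is an assumption; the paper's layer sizes, sampling method, and downstream classifiers are not reproduced).

```python
import torch
import torch.nn as nn

class DefectAutoencoder(nn.Module):
    def __init__(self, n_features, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, code_dim))
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32), nn.ReLU(),
            nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reduce_features(X, epochs=200, lr=1e-3):
    """Train on the (class-balanced) training matrix X and return encoded features."""
    X = torch.as_tensor(X, dtype=torch.float32)
    model = DefectAutoencoder(X.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), X)         # reconstruction loss
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model.encoder(X).numpy()     # low-dimensional features for any classifier
```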

12.
Gender recognition plays a very important role in various applications such as human-computer interaction, surveillance, and security. Nonlinear support vector machines (SVMs) have been investigated for gender identification using the Face Recognition Technology (FERET) face image database, and SVM classifiers were shown to outperform traditional pattern classifiers (linear, quadratic, Fisher linear discriminant, and nearest neighbour). In this context, this paper aims to improve the SVM classification accuracy of the gender classification system and proposes new models for better performance. We evaluated different SVM learning algorithms; the SVM with a radial basis function kernel and a 5% outlier fraction outperformed the other SVM classifiers. We examined the effectiveness of different feature selection methods; AdaBoost performs better than the other feature selection methods in selecting the most discriminating features. We propose two classification methods that focus on training subsets of images among the training images: method 1 combines the outcomes of different classifiers built on different image subsets, whereas method 2 is based on clustering the training data and building a classifier for each cluster. Experimental results showed that both methods increase the classification accuracy.
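
Method 2 is simple enough to sketch: cluster the training feature vectors, train one SVM per cluster, and classify a new sample with the SVM of its nearest cluster. The cluster count and RBF kernel are illustrative assumptions; the 5% outlier handling and AdaBoost feature selection are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def fit_clustered_svms(X, y, n_clusters=4, seed=0):
    X, y = np.asarray(X, float), np.asarray(y)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    models = {}
    for c in range(n_clusters):
        mask = km.labels_ == c
        if len(np.unique(y[mask])) > 1:            # skip clusters containing a single class
            models[c] = SVC(kernel="rbf").fit(X[mask], y[mask])
    return km, models

def predict_clustered(km, models, X_new, default_label=0):
    X_new = np.asarray(X_new, float)
    clusters = km.predict(X_new)
    preds = np.full(len(X_new), default_label)     # labels assumed to be integers (0/1)
    for c, clf in models.items():
        sel = clusters == c
        if sel.any():
            preds[sel] = clf.predict(X_new[sel])
    return preds
```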

13.
王铁建  吴飞  荆晓远 《计算机科学》2017,44(12):131-134, 168
A multiple kernel dictionary learning method is proposed to predict whether software modules contain defects. Historical data used for software defect prediction has a complex structure and is class-imbalanced. A composite kernel built from multiple kernel functions maps the data into a high-dimensional feature space, and by selecting the bases of the multiple kernel dictionary, a class-balanced multiple kernel dictionary is obtained and used to classify new software modules and determine whether they contain defects. Experiments on the NASA MDP data sets show that, compared with other software defect prediction methods, multiple kernel dictionary learning better handles the complex structure and class imbalance of historical defect data and thus addresses the defect prediction problem more effectively.

14.
Software defect prediction identifies potentially defective program modules in the project under test in advance, which helps allocate testing resources reasonably and ultimately improves the quality of the software product. However, because a large number of metrics related to code complexity or the development process are considered when collecting defect prediction data sets, these data sets suffer from the curse of dimensionality. Drawing on the idea of search-based software engineering, a novel search-based wrapper feature selection framework, SBFS, is proposed. In its implementation, the framework first uses SMOTE to alleviate the class imbalance in the data set and then uses a genetic-algorithm-based feature selection method to choose the optimal feature subset on the training set. In the empirical study, the NASA data sets are used as the evaluation subjects, and the wrapper feature selection method with a forward selection strategy (FW), the wrapper method with a backward selection strategy (BW), and no feature selection (Origin) serve as baselines. The results show that SBFS is no worse than Origin in 90% of cases, no worse than BW in 82.3% of cases, and no worse than FW in 69.3% of cases. In addition, we find that with a decision tree classifier, applying SMOTE improves model performance in 71% of cases, whereas with naive Bayes and logistic regression classifiers, applying SMOTE improves predictive performance in only 47% and 43% of cases, respectively.
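
A stripped-down sketch of the genetic-algorithm wrapper at the core of SBFS: a bit mask encodes a feature subset, and the cross-validated F-measure of the wrapped classifier is the fitness. The SMOTE step is omitted (imbalanced-learn would provide it), and the population size, generations, and rates are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def ga_feature_selection(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """X: feature matrix; y: binary labels (1 = defective). Returns selected feature indices."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))           # one bit mask per individual

    def fitness(mask):
        if mask.sum() == 0:
            return 0.0
        return cross_val_score(GaussianNB(), X[:, mask.astype(bool)], y,
                               cv=3, scoring="f1").mean()

    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]       # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_feat) < p_mut                    # bit-flip mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([parents, children])

    best = pop[np.argmax([fitness(ind) for ind in pop])]
    return np.where(best.astype(bool))[0]
```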

15.

Software defect prediction (SDP) is a highly crucial task in the software development process, forecasting which modules are more prone to errors and faults before the testing phase begins. It aims to reduce the development cost of the software by focusing the testing effort on the predicted faulty modules. Although it helps ensure on-time delivery of a good-quality end product, class imbalance in the dataset is a major hindrance to SDP. This paper proposes a novel Neighbourhood based Under-Sampling (N-US) algorithm to handle the class imbalance issue and demonstrates its effectiveness in attaining high accuracy when predicting defective modules. The N-US algorithm under-samples the dataset to maximize the visibility of minority data points while restricting the excessive elimination of majority data points to avoid information loss. To assess its applicability, N-US is compared with three standard under-sampling techniques, and the study further investigates the performance of N-US as a trusted ally for SDP classifiers. Extensive experiments are conducted using benchmark datasets from the NASA repository, namely CM1, JM1, KC1, KC2, and PC1. The proposed SDP classifier with the N-US technique is compared statistically with baseline models to assess the effectiveness of the N-US algorithm for SDP. The proposed model outperforms the other candidate SDP models with the highest AUC score (95.6%), the highest accuracy (96.9%), and the ROC curve closest to the top left corner, and it shows the best prediction power statistically at a 95% confidence level.
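
An illustrative neighbourhood-based under-sampling routine in the spirit of N-US (the paper's exact rule is not reproduced): majority instances that crowd the neighbourhoods of minority instances are removed first, and removal stops once the classes are balanced, limiting information loss on the majority side.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbourhood_undersample(X, y, k=5, minority_label=1):
    X, y = np.asarray(X, float), np.asarray(y)
    minority, majority = X[y == minority_label], X[y != minority_label]
    maj_idx = np.where(y != minority_label)[0]

    nn = NearestNeighbors(n_neighbors=min(k, len(maj_idx))).fit(majority)
    _, neigh = nn.kneighbors(minority)                   # majority neighbours of each minority point
    crowding = np.bincount(neigh.ravel(), minlength=len(maj_idx))

    # Remove the most "crowding" majority points, but never more than needed to balance.
    n_remove = min(len(maj_idx) - len(minority), np.count_nonzero(crowding))
    remove = maj_idx[np.argsort(-crowding)[:max(n_remove, 0)]]
    keep = np.setdiff1d(np.arange(len(y)), remove)
    return X[keep], y[keep]
```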


16.
樊康新 《计算机工程》2009,35(24):191-193
To address shortcomings of the naive Bayes (NB) classifier, such as the model's sensitivity to the training samples and the difficulty of improving classification accuracy, a combined NB text classifier based on multiple feature selection methods is proposed. Following the Boosting classification algorithm, several different feature selection methods are used to build the feature word sets of the texts, NB classifiers are trained as the base classifiers of the Boosting iterations, and the final combined NB text classifier is produced by weighted voting over the base classifiers. Experimental results show that the combined classifier achieves better classification performance than a single NB text classifier.
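
A small sketch of the combination scheme, assuming scikit-learn: three different feature selection criteria each produce a feature set, an NB classifier is trained per set, and the members vote with weights; training accuracy stands in for the Boosting-derived weights.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB

def fit_combined_nb(X, y, k=500):
    """X: non-negative text features (e.g. term counts); y: class labels."""
    members = []
    for score_fn in (chi2, f_classif, mutual_info_classif):
        sel = SelectKBest(score_fn, k=min(k, X.shape[1])).fit(X, y)
        nb = MultinomialNB().fit(sel.transform(X), y)
        weight = nb.score(sel.transform(X), y)       # training accuracy as the vote weight
        members.append((sel, nb, weight))
    return members

def predict_combined(members, X_new, classes):
    votes = np.zeros((X_new.shape[0], len(classes)))
    for sel, nb, w in members:
        pred = nb.predict(sel.transform(X_new))
        for ci, c in enumerate(classes):
            votes[:, ci] += w * (pred == c)
    return np.asarray(classes)[np.argmax(votes, axis=1)]
```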

17.
To design an efficient software defect prediction model, a method combining particle swarm optimization (PSO) with naive Bayes (NB) is proposed. After discretizing the historical data, the method uses the error rate of NB classification as the particle fitness function to build the defect prediction model. Simulation experiments on the JM1 data from NASA software engineering projects show that the model outperforms comparable methods in predictive performance and achieves good prediction results.

18.
乔辉  周雁舟  邵楠 《计算机应用》2012,32(5):1436-1438
To address the poor predictive generalization of traditional software reliability prediction models in practical applications, a software reliability prediction model based on a learning vector quantization (LVQ) neural network is proposed. The structural characteristics of the LVQ network and its connection to software reliability prediction are first analyzed; the network is then used for reliability prediction, and simulation experiments are carried out in Matlab on instance data sets from the NASA software data project. Comparison with traditional prediction methods demonstrates the feasibility of the approach and its higher predictive generalization performance.

19.
A bootstrap technique for nearest neighbor classifier design
A bootstrap technique for nearest neighbor classifier design is proposed. Our primary interest is in designing classifiers for small training sample size situations. Conventional bootstrapping techniques sample the training samples with replacement; our technique, in contrast, generates bootstrap samples by locally combining original training samples. The nearest neighbor classifier is designed on the bootstrap samples and is tested on test samples independent of the training samples. The performance of the proposed classifier is demonstrated on three artificial data sets and one real data set. Experimental results show that the nearest neighbor classifier designed on the bootstrap samples outperforms conventional k-NN classifiers as well as edited 1-NN classifiers, particularly in high dimensions.
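
A sketch of the local-combination idea: each bootstrap prototype is the average of a randomly chosen training sample and its r nearest same-class neighbours, and a 1-NN classifier is built on the prototypes; r and the prototype count are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

def local_bootstrap_1nn(X, y, n_prototypes=200, r=3, seed=0):
    X, y = np.asarray(X, float), np.asarray(y)
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        nn = NearestNeighbors(n_neighbors=min(r + 1, len(Xc))).fit(Xc)
        picks = rng.integers(0, len(Xc), size=n_prototypes // len(np.unique(y)))
        _, neigh = nn.kneighbors(Xc[picks])              # each row: the point itself + r neighbours
        protos.append(Xc[neigh].mean(axis=1))            # locally combined bootstrap samples
        proto_labels.append(np.full(len(picks), c))
    clf = KNeighborsClassifier(n_neighbors=1)
    return clf.fit(np.vstack(protos), np.concatenate(proto_labels))
```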

20.
A new method, PCARules, for building a combined classifier from rule-based base classifiers is proposed. Although the new method also decides the class of a sample by weighted voting over the base classifiers' predictions, the way training data sets are created for the base classifiers is completely different from bagging and boosting. Instead of creating data sets for the base classifiers by sampling, the method randomly partitions the features into K subsets, uses PCA to obtain the principal components of each subset to form a new feature space, and maps all training data into this new feature space as the training set for the base classifiers. Experiments on 30 randomly selected data sets from the UCI machine learning repository show that the algorithm not only significantly improves the performance of rule-based classification but also achieves higher classification accuracy than traditional combination methods such as bagging and boosting on most of the data sets.
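
A sketch of the PCARules construction (rule learners are approximated by decision trees here): each ensemble member gets its own random partition of the features into K subsets, PCA is applied per subset, the rotated components are concatenated into a new feature space, and the members vote with weights; training accuracy stands in for the paper's weighting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def make_rotation(X, K, rng):
    """One random feature partition plus per-subset PCA; returns a transform function."""
    subsets = np.array_split(rng.permutation(X.shape[1]), K)
    pcas = [(cols, PCA().fit(X[:, cols])) for cols in subsets]
    return lambda Z: np.hstack([pca.transform(Z[:, cols]) for cols, pca in pcas])

def fit_pca_rules(X, y, n_members=10, K=5, seed=0):
    X, y = np.asarray(X, float), np.asarray(y)
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        rot = make_rotation(X, K, rng)
        clf = DecisionTreeClassifier().fit(rot(X), y)
        members.append((rot, clf, clf.score(rot(X), y)))   # training accuracy as vote weight
    return members

def predict_pca_rules(members, X_new, classes=(0, 1)):     # binary labels assumed
    X_new = np.asarray(X_new, float)
    votes = np.zeros((len(X_new), len(classes)))
    for rot, clf, w in members:
        pred = clf.predict(rot(X_new))
        for ci, c in enumerate(classes):
            votes[:, ci] += w * (pred == c)
    return np.asarray(classes)[np.argmax(votes, axis=1)]
```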
