Similar Documents
1.
王培  金聪  葛贺贺 《计算机应用》2012,32(6):1738-1740
Accurately and effectively predicting defect-prone software modules during development is an important way to improve software quality. Attribute selection can significantly improve the accuracy and efficiency of software defect prediction models. A mutual-information-based attribute selection method is proposed, and the selected optimal attribute subset is fed to software defect prediction models. The method adopts a forward search strategy and introduces a nonlinear balance coefficient into the evaluation function. Experimental results show that the attribute subsets provided by the mutual-information-based method improve the prediction accuracy and efficiency of various software defect prediction models.
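The following is a minimal sketch of the kind of forward-search, mutual-information attribute selection described in item 1, assuming scikit-learn and quantile-binned metrics; the exact form of the nonlinear balance coefficient is not given in the abstract, so the decay used below is purely illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score


def discretize(col, bins=10):
    """Quantile-bin a continuous metric so that mutual_info_score can be applied."""
    edges = np.quantile(col, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(col, edges)


def mi_forward_selection(X, y, n_select, beta0=1.0):
    """Greedy forward search: maximize relevance to the defect label while
    penalizing redundancy with already-selected attributes."""
    n_features = X.shape[1]
    Xd = np.column_stack([discretize(X[:, j]) for j in range(n_features)])
    relevance = mutual_info_classif(X, y, random_state=0)   # I(f_j; defect label)
    selected, remaining = [], list(range(n_features))
    while len(selected) < n_select and remaining:
        beta = beta0 / np.sqrt(1.0 + len(selected))          # assumed nonlinear balance coefficient
        scores = {
            j: relevance[j] - beta * sum(mutual_info_score(Xd[:, j], Xd[:, s]) for s in selected)
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```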

2.
Software defect prediction is aimed to find potential defects based on historical data and software features. Software features can reflect the characteristics of software modules. However, some of these features may be more relevant to the class (defective or non-defective), while others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and all feature subsets are selected from the feature ranking list in sequence. Finally, all feature subsets are evaluated on a k-nearest neighbor (KNN) model and measured by the area under the curve (AUC) metric for classification performance. The experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach performs better than or comparably to the baseline feature selection approaches in terms of classification performance.
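A rough sketch of the ranking-and-evaluation loop described above. The paper's exact similarity-measure weight update is not reproduced; a simple between-class/within-class separation ratio stands in for it, which is an assumption on my part, and the class labels are assumed to be 0/1.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score


def similarity_feature_weights(X, y):
    """Stand-in weights: larger when a feature separates the two classes (0/1)."""
    w = np.zeros(X.shape[1])
    pos, neg = X[y == 1], X[y == 0]
    for j in range(X.shape[1]):
        between = np.abs(pos[:, j].mean() - neg[:, j].mean())
        within = pos[:, j].std() + neg[:, j].std() + 1e-12
        w[j] = between / within
    return w


def evaluate_ranking(X, y, k_neighbors=5):
    """Evaluate every prefix of the descending feature ranking with KNN + AUC."""
    ranking = np.argsort(similarity_feature_weights(X, y))[::-1]
    results = []
    for m in range(1, len(ranking) + 1):
        subset = ranking[:m]
        knn = KNeighborsClassifier(n_neighbors=k_neighbors)
        proba = cross_val_predict(knn, X[:, subset], y, cv=5, method="predict_proba")[:, 1]
        results.append((m, roc_auc_score(y, proba)))
    return max(results, key=lambda t: t[1])   # best subset size and its AUC
```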

3.
Irrelevant and redundant features in software defect datasets degrade the performance of models that predict the number of defects. To address this problem, a hybrid feature selection method for defect-count prediction, HFSNFP, is proposed. First, the ReliefF algorithm computes the relevance between each feature and the number of defects, and the m most relevant features are retained. Then, spectral clustering groups these m features according to the associations between them. Finally, a wrapper-style selection picks the most relevant feature from each cluster in turn to form the final feature subset. Experimental results show that, compared with five existing filter-based feature selection methods, HFSNFP improves the detection rate while reducing the false alarm rate and achieves better G-measure and RMSE values; compared with two existing wrapper-based methods, HFSNFP preserves defect-count prediction performance while significantly reducing feature selection time.
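A sketch of the three-stage filter + clustering + wrapper pipeline described for HFSNFP, with two stand-ins that are my assumptions: mutual information replaces ReliefF for the relevance ranking, and a decision-tree regressor with cross-validated RMSE replaces the paper's wrapper model.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.cluster import SpectralClustering
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score


def hfsnfp_like(X, y, m=20, n_clusters=5):
    # Stage 1: keep the m features most relevant to the defect count.
    relevance = mutual_info_regression(X, y, random_state=0)
    top = np.argsort(relevance)[::-1][:m]

    # Stage 2: spectral clustering of the kept features by absolute correlation.
    corr = np.abs(np.corrcoef(X[:, top], rowvar=False))
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=0).fit_predict(corr)

    # Stage 3: wrapper step - from each cluster keep the feature whose addition
    # gives the best cross-validated score for a regression model.
    selected = []
    for c in range(n_clusters):
        members = top[labels == c]
        if len(members) == 0:
            continue
        best = max(members, key=lambda f: cross_val_score(
            DecisionTreeRegressor(random_state=0),
            X[:, selected + [f]], y, cv=5,
            scoring="neg_root_mean_squared_error").mean())
        selected.append(best)
    return selected
```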

4.
Estimating the number of defects in a software product is an important and challenging problem. A multitude of estimation techniques have been proposed for defect prediction. However, not all techniques are applicable in all cases. The selection of the proper approach depends on multiple factors: the features of the approach, the availability of resources, and the goals for using the estimated defect data. This paper presents a survey of existing estimation techniques and proposes a decision support approach for selecting the most suitable defect estimation technique for a project with specific goals. The results of the ranking clearly indicate that no estimation technique provides a single, comprehensive solution; the selection must be made according to a given scenario. Copyright © 2009 John Wiley & Sons, Ltd.

5.
Recent feature selection scores using pairwise constraints (must-link and cannot-link) have shown better performance than unsupervised methods and performance comparable to supervised ones. However, these scores use only the pairwise constraints and ignore the information available in the unlabeled data. Moreover, these constraint scores strongly depend on the must-link and cannot-link subsets given by the user. In this paper, we address these problems and propose a new semi-supervised constraint score that uses both pairwise constraints and local properties of the unlabeled data. Experiments using Kendall’s coefficient and accuracy rates show that this new score is less sensitive to the given constraints than previous scores while providing similar performance.

6.
New methodologies and tools have gradually made the life cycle for software development more human-independent. Much of the research in this field focuses on defect reduction, defect identification and defect prediction. Defect prediction is a relatively new research area that involves using various methods from artificial intelligence to data mining. Identifying and locating defects in software projects is a difficult task. Measuring software in a continuous and disciplined manner provides many advantages such as the accurate estimation of project costs and schedules as well as improving product and process qualities. This study proposes a model to predict the number of defects in the new version of a software product with respect to the previous stable version. The new version may contain changes related to a new feature, a modification in the algorithm, or bug fixes. Our proposed model aims to predict the defects introduced into the new version by analyzing the types of changes in an objective and formal manner as well as considering the lines of code (LOC) change. Defect predictors are helpful tools for both project managers and developers. Accurate predictors may help reduce test times and guide developers towards implementing higher-quality code. Our proposed model can aid software engineers in determining the stability of software before it goes into production. Furthermore, such a model may provide useful insight for understanding the effects of a feature, bug fix or change in the process of defect detection.

7.
李莉  石可欣  任振康 《计算机应用》2022,42(5):1554-1562
Cross-project software defect prediction addresses the shortage of training data in the project to be predicted, but the source and target projects usually differ substantially in data distribution, which degrades prediction performance. To address this problem, a cross-project defect prediction method based on feature selection and TrAdaBoost (CPDP-FSTr) is proposed. First, in the feature selection stage, kernel principal component analysis (KPCA) removes redundant data from the source project. Then, according to the attribute distributions of the source and target projects, the candidate source-project data closest to the target-project distribution are selected by distance. Finally, in the instance transfer stage, a TrAdaBoost method improved with an evaluation factor finds the source-project instances whose distribution is close to the few labeled instances in the target project and builds the defect prediction model. Using F1 as the evaluation metric, and compared with the cross-project defect prediction method based on feature clustering and TrAdaBoost (FeCTrA) and the cross-project defect prediction method based on multi-kernel ensemble learning (CMKEL), CPDP-FSTr improves prediction performance by 5.84% and 105.42% respectively on the AEEEM dataset and by 5.25% and 85.97% respectively on the NASA dataset, and its two-stage feature selection outperforms a single feature selection stage. Experimental results show that the proposed CPDP-FSTr achieves good prediction performance when the source-project feature selection ratio is 60% and the proportion of labeled instances in the target project is 20%.
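A compact sketch of the first two stages of CPDP-FSTr as summarized above: kernel PCA over the projects, then distance-based selection of candidate source instances. The TrAdaBoost stage and its modified evaluation factor are omitted, and measuring distance to the target centroid in the KPCA space is an assumption rather than the paper's exact criterion.

```python
import numpy as np
from sklearn.decomposition import KernelPCA


def select_source_instances(X_source, X_target, n_components=10, keep_ratio=0.6):
    # Stage 1: kernel PCA projects source and target into a shared nonlinear space.
    kpca = KernelPCA(n_components=n_components, kernel="rbf")
    Z_source = kpca.fit_transform(X_source)
    Z_target = kpca.transform(X_target)

    # Stage 2: keep the source instances closest to the target centroid.
    centroid = Z_target.mean(axis=0)
    dist = np.linalg.norm(Z_source - centroid, axis=1)
    keep = np.argsort(dist)[: int(keep_ratio * len(dist))]
    return keep   # indices of source instances passed on to the TrAdaBoost stage
```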

8.
In recent years, considerable attention has been devoted to research on the link prediction (LP) problem in complex networks. This problem tries to predict the likelihood that an association between two currently unconnected nodes in a network will appear in the future. One of the most important approaches to the LP problem is based on supervised machine learning (ML) techniques for classification. Although many works have presented promising results with this approach, choosing the set of features (variables) to train the classifiers is still a major challenge. In this article, we report on the effects of three different automatic variable selection strategies (Forward, Backward and Evolutionary) applied to the feature-based supervised learning approach in LP applications. The results of the experiments show that the use of these strategies does lead to better classification models than classifiers built with the complete set of variables. The experiments were performed over three datasets (Microsoft Academic Network, Amazon and Flickr), each containing more than twenty different features, including topological and domain-specific ones. We also describe the specification and implementation of the process used to support the experiments. It combines the feature selection strategies, six different classification algorithms (SVM, K-NN, naïve Bayes, CART, random forest and multilayer perceptron) and three evaluation metrics (Precision, F-Measure and Area Under the Curve). Moreover, this process includes a novel ML voting-committee-inspired approach that suggests sets of features to represent data in LP applications. It mines the log of the experiments in order to identify sets of features frequently selected to produce classification models with high performance. The experiments showed interesting correlations between frequently selected features and datasets.

9.
10.
A number of software cost estimation methods have been presented in the literature over the past decades. Analogy based estimation (ABE), which is essentially a case based reasoning (CBR) approach, is one of the most popular techniques. In order to improve the performance of ABE, many previous studies proposed effective approaches to optimize the weights of the project features (feature weighting) in its similarity function. However, ABE is still criticized for its low prediction accuracy, large memory requirement, and expensive computation cost. To alleviate these drawbacks, in this paper we propose the project selection technique for ABE (PSABE), which reduces the whole project base into a small subset that consists only of representative projects. Moreover, PSABE is combined with feature weighting to form FWPSABE for a further improvement of ABE. The proposed methods are validated on four datasets (two real-world sets and two artificial sets) and compared with conventional ABE, feature weighted ABE (FWABE), and machine learning methods. The promising results indicate that the project selection technique can significantly improve analogy based models for software cost estimation.
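A sketch of analogy-based estimation over a reduced project base, in the spirit of PSABE. Choosing the project nearest to each k-means centroid as a representative is a stand-in for the paper's project selection technique, and the weighted Euclidean similarity is a common ABE choice rather than the paper's exact function; both are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def select_representatives(X_hist, n_repr=30):
    """Shrink the historical project base to n_repr representative projects."""
    km = KMeans(n_clusters=n_repr, n_init=10, random_state=0).fit(X_hist)
    reps = []
    for c in range(n_repr):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        d = np.linalg.norm(X_hist[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(d)])
    return np.array(reps)


def abe_estimate(x_new, X_hist, effort_hist, weights, k=3):
    """Weighted-Euclidean analogy: average effort of the k most similar projects."""
    d = np.sqrt(((weights * (X_hist - x_new)) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return effort_hist[nearest].mean()
```

Typical use would first call select_representatives on the full project base, then run abe_estimate against only the retained projects.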

11.
Naive Bayes is one of the most widely used algorithms in classification problems because of its simplicity, effectiveness, and robustness. It is suitable for many learning scenarios, such as image classification, fraud detection, web mining, and text classification. Naive Bayes is a probabilistic approach based on the assumptions that features are independent of each other and that their weights are equally important. However, in practice, features may be interrelated, and in that case such assumptions may cause a dramatic decrease in performance. In this study, following preprocessing steps, a Feature Dependent Naive Bayes (FDNB) classification method is proposed. Features are included in the calculation as pairs, creating dependence between them. This method was applied to the software defect prediction problem and experiments were carried out using the widely recognized NASA PROMISE data sets. The obtained results show that this new method is more successful than the standard Naive Bayes approach and that it has competitive performance with other feature-weighting techniques. A further aim of this study is to demonstrate that, to be reliable, a learning model must be constructed by using only training data, as otherwise misleading results arise from the use of the entire data set.
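A minimal sketch of the feature-pair idea behind FDNB: discretized feature pairs act as joint attributes inside an otherwise standard Naive Bayes computation. Quantile-bin discretization and equal weighting of all pairs are assumptions, not details taken from the paper.

```python
import numpy as np
from itertools import combinations


def discretize(X, bins=5):
    """Quantile-bin every column; returns the binned matrix and the bin edges."""
    edges = [np.quantile(X[:, j], np.linspace(0, 1, bins + 1)[1:-1]) for j in range(X.shape[1])]
    Xd = np.column_stack([np.digitize(X[:, j], edges[j]) for j in range(X.shape[1])])
    return Xd, edges


def fdnb_fit(Xd, y, bins=5):
    classes = np.unique(y)
    pairs = list(combinations(range(Xd.shape[1]), 2))
    prior = {c: np.mean(y == c) for c in classes}
    likelihood = {}
    for c in classes:
        Xc = Xd[y == c]
        for (i, j) in pairs:
            counts = np.ones((bins, bins))            # Laplace smoothing
            for a, b in zip(Xc[:, i], Xc[:, j]):
                counts[a, b] += 1
            likelihood[(c, i, j)] = counts / counts.sum()
    return classes, pairs, prior, likelihood


def fdnb_predict(x, classes, pairs, prior, likelihood):
    """x is a single row discretized with the same bin edges as the training data."""
    scores = {}
    for c in classes:
        logp = np.log(prior[c])
        for (i, j) in pairs:
            logp += np.log(likelihood[(c, i, j)][x[i], x[j]])
        scores[c] = logp
    return max(scores, key=scores.get)
```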

12.
A critique of software defect prediction models
Many organizations want to predict the number of defects (faults) in software systems, before they are deployed, to gauge the likely delivered quality and maintenance effort. To help with this, numerous software metrics and statistical models have been developed, with a correspondingly large literature. We provide a critical review of this literature and the state of the art. Most of the wide range of prediction models use size and complexity metrics to predict defects. Others are based on testing data, the “quality” of the development process, or take a multivariate approach. The authors of the models have often made heroic contributions to a subject otherwise bereft of empirical studies. However, there are a number of serious theoretical and practical problems in many studies. The models are weak because of their inability to cope with the, as yet, unknown relationship between defects and failures. There are fundamental statistical and data quality problems that undermine model validity. More significantly, many prediction models tend to model only part of the underlying problem and seriously misspecify it. To illustrate these points, the Goldilocks Conjecture, that there is an optimum module size, is used to show the considerable problems inherent in current defect prediction approaches. Careful and considered analysis of past and new results shows that the conjecture lacks support and that some models are misleading. We recommend holistic models for software defect prediction, using Bayesian belief networks, as alternative approaches to the single-issue models used at present. We also argue for research into a theory of “software decomposition” in order to test hypotheses about defect introduction and help construct a better science of software engineering.

13.
Just-in-time defect prediction can remind software developers and managers to verify and fix bugs at the moment they appear, thus improving the effectiveness and validity of bug fixing. Existing studies mainly focus on just-in-time prediction for software files (JIT-F). JIT-F is a binary classification problem, which classifies (hence predicts) a file change as buggy or clean. This article provides a detailed analysis of just-in-time defect prediction for software hunks (JIT-H), which predicts bugs at a finer level of granularity and hence further improves the efficiency of bug fixing. Classification is performed using a bagging ensemble: aggregated combinations of random under-sampling plus multiple classifiers (J48 and Random Forest). An empirical study with 10 open source projects was conducted to validate the effectiveness of JIT-H. Experimental results show that JIT-H is effective at predicting defects in software hunk changes. Compared with JIT-F, JIT-H is more cost effective. Additionally, analysis of the change features indicates that Text Vector features and hunk-level change features are more important than features in other groups and levels.
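A sketch of the kind of ensemble described for JIT-H: bagging over randomly under-sampled training sets, each round fitting a decision tree (standing in for J48) and a random forest, with predicted probabilities averaged. The number of rounds and the use of scikit-learn models are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


def undersample(X, y, rng):
    """Randomly drop majority-class instances so both classes have equal size."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]


def fit_bagged_ensemble(X, y, rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(rounds):
        Xb, yb = undersample(X, y, rng)
        models.append(DecisionTreeClassifier(random_state=0).fit(Xb, yb))
        models.append(RandomForestClassifier(n_estimators=100, random_state=0).fit(Xb, yb))
    return models


def predict_proba(models, X):
    """Average the defect probability over all models in the ensemble."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```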

14.
Software quality engineering comprises several quality assurance activities such as testing, formal verification, inspection, fault tolerance, and software fault prediction. Until now, many researchers have developed and validated fault prediction models using machine learning and statistical techniques. Different kinds of software metrics and diverse feature reduction techniques have been used to improve the models’ performance. However, these studies did not investigate the effect of dataset size, metrics set, and feature selection techniques on software fault prediction. This study focuses on high-performance fault predictors based on machine learning such as Random Forests and on algorithms based on a new computational intelligence approach called Artificial Immune Systems. We used public NASA datasets from the PROMISE repository to make our predictive models repeatable, refutable, and verifiable. The research questions concerned the effects of dataset size, metrics set, and feature selection techniques. To answer these questions, seven test groups were defined. Additionally, nine classifiers were examined on each of the five public NASA datasets. According to this study, Random Forests provides the best prediction performance for large datasets and Naive Bayes is the best prediction algorithm for small datasets in terms of the Area Under the Receiver Operating Characteristics Curve (AUC) evaluation parameter. The parallel implementation of the Artificial Immune Recognition Systems (AIRS2Parallel) algorithm is the best Artificial Immune Systems paradigm-based algorithm when method-level metrics are used.

15.
To address the complexity of data dimensionality and the class imbalance problem in software defect prediction, a defect prediction method based on a surrogate-assisted multi-objective firefly algorithm (SMO-MSFFA) is proposed. The method adopts a multi-group strategy firefly algorithm (MSFFA), takes minimizing the feature selection ratio and maximizing the model AUC as its two objectives, and builds defect prediction models using random forest (RF), support vector machine (SVM), and K-nearest neighbors (KNN) as classifiers. Considering the iterative nature of evolutionary algorithms, a surrogate model is embedded to compute part of the individual fitness evaluations offline and so shorten computation time. Experiments on the PC1, KC1, and MC1 projects from the public NASA dataset show that, compared with NSGA-II, the mean model AUC increases by 0.17 on PC1, decreases by 0.01 on KC1, and increases by 0.09 on MC1; the average feature selection ratio decreases by 0.08, 0.17, and 0.05 respectively; and the average runtime increases by 131 s on PC1 and decreases by 199 s and 431 s on KC1 and MC1 respectively. The experimental results show that the proposed method has clear advantages in improving model performance, reducing the feature selection ratio, and shortening computation time.

16.
Band selection (dimensionality reduction) plays an essential role in hyper-spectral image processing and applications. This article presents a unified comparison framework for systematic performance comparison of filter-based feature selection models and conducts a comparative evaluation of four methods for hyper-spectral band selection: the maximal minimal associated index (MMAIQ), the mutual-information-based max-dependency criterion (mRMR), Relief feature selection (Relief-F), and correlation-based feature selection (CFS). The evaluation covers effectiveness, robustness, and classification accuracy, using five measuring indices: class separability, feature entropy, feature stability, feature redundancy, and classification accuracy. Three images acquired by different sensors were used to investigate the performance of the metrics. Experimental results show that MMAIQ achieves the best results on all data sets for the measurements used, except for feature stability, where mRMR and Relief-F exhibit their superiority.

17.
Over the last few years, research has been oriented toward developing a machine vision system for automatically locating and identifying defects on rails. Rail defects exhibit different properties and are divided into various categories related to the type and position of flaws on the rail. Several kinds of interrelated factors cause rail defects, such as the type of rail, construction conditions, and the speed and/or frequency of trains using the rail. The aim of this paper is to present an experimental comparison of three filtering approaches, based on texture analysis of rail surfaces, for detecting the presence or absence of a particular class of surface defects: corrugation.

18.
To improve the performance of prediction models and resolve the disagreement caused by different attribute subsets, a prediction model based on the partial correlation method is proposed. First, a partial correlation between static code attributes and the number of defects is identified on public datasets; then, based on the partial correlation coefficients, a code complexity density attribute is computed; finally, a new defect prediction model is built on this attribute. Experiments show that the model achieves a high recall and a good F-measure, further confirming the conclusion that the partial correlation between code attributes and module defects is an important factor affecting the performance of software quality prediction. This conclusion helps build more stable and reliable software defect prediction models.
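A small sketch of the partial-correlation computation underlying this approach: the correlation between one code metric and the defect count after regressing out the linear influence of the remaining metrics (computed here via regression residuals). How the resulting coefficients are turned into the complexity-density attribute is not detailed in the abstract, so it is not reproduced.

```python
import numpy as np


def partial_corr(x, y, Z):
    """Partial correlation of x and y controlling for the columns of Z."""
    Z1 = np.column_stack([np.ones(len(x)), Z])            # add intercept
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]   # residual of x given Z
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]   # residual of y given Z
    return np.corrcoef(rx, ry)[0, 1]
```

For example, partial_corr(loc, defects, other_metrics) would give the correlation between lines of code and defect count with the other static metrics held fixed; the variable names here are purely illustrative.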

19.

Context

Software defect prediction studies usually build models using within-company data, but very few focus on prediction models trained with cross-company data. Models built on within-company data are difficult to employ in practice because such local data repositories are often lacking. Recently, transfer learning has attracted more and more attention for building classifiers in a target domain using data from a related source domain. It is very useful when the distributions of training and test instances differ, but is it appropriate for cross-company software defect prediction?

Objective

In this paper, we consider the cross-company defect prediction scenario where source and target data are drawn from different companies. In order to harness cross-company data, we exploit a transfer learning method to build a faster and highly effective prediction model.

Method

Unlike prior works that select training data similar to the test data, we propose a novel algorithm called Transfer Naive Bayes (TNB) that uses the information of all the proper features in the training data. Our solution estimates the distribution of the test data and transfers cross-company data information into the weights of the training data. The defect prediction model is built on these weighted data.

Results

This article presents a theoretical analysis of the compared methods and reports experimental results on data sets from different organizations. The results indicate that TNB is more accurate in terms of AUC (the area under the receiver operating characteristic curve) and requires less runtime than the state-of-the-art methods.

Conclusion

It is concluded that when there are too few local training data to train good classifiers, useful feature-level knowledge from training data with a different distribution may help. We are optimistic that our transfer learning method can guide optimal resource allocation strategies, which may reduce software testing cost and increase the effectiveness of the software testing process.
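A sketch of the instance-weighting idea behind TNB as described in the Method section above: source (training) instances are weighted by how well their feature values fall inside the ranges observed in the target (test) data, and a weighted Naive Bayes model is then trained. The gravitation-style weight and the use of Gaussian Naive Bayes with sample weights are simplifications of the paper's formulation and should be read as assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def tnb_weights(X_source, X_target):
    """Weight each source instance by how many of its features fall inside the
    per-feature [min, max] range observed in the target data."""
    lo, hi = X_target.min(axis=0), X_target.max(axis=0)
    k = X_source.shape[1]
    s = ((X_source >= lo) & (X_source <= hi)).sum(axis=1)   # features inside the target range
    return s / (k - s + 1) ** 2                              # gravitation-style weight (assumed form)


def fit_tnb(X_source, y_source, X_target):
    w = tnb_weights(X_source, X_target)
    model = GaussianNB()
    model.fit(X_source, y_source, sample_weight=w)           # weighted Naive Bayes
    return model
```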

20.
Several models have been developed that attempt to predict the total number of defects in a software product. One such approach uses the capture–recapture model, a technique employed by biologists for estimating wildlife populations. In this method, once the software is built and defects begin to be identified, a prediction can be made of the total number of software defects present. However, capture–recapture models rely on expert inspectors, and the technique cannot be employed once the software has been released. The work reported here extends the capture–recapture technique to the post-inspection phase, and to settings where inspection data is unavailable, by using user defect reports. The proposed technique does not rely on expert inspectors and is particularly suitable for open source software.
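For reference, the two-source capture–recapture estimate that this line of work builds on is the Lincoln–Petersen estimator: if two independent defect-finding activities find n1 and n2 defects with m defects in common, the total defect population is estimated as n1*n2/m. The sketch below uses purely illustrative numbers; mapping user defect reports onto "captures" is this paper's extension and is not reproduced here.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Estimated total number of defects from two overlapping defect samples."""
    if m == 0:
        raise ValueError("no overlap between the two samples; estimate undefined")
    return n1 * n2 / m


# Example: reporter group A finds 40 defects, group B finds 30, and 12 are shared.
print(lincoln_petersen(40, 30, 12))   # -> 100.0 estimated total defects
```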
