Similar Documents
20 similar documents found (search time: 31 ms)
1.
Noise detection for software measurement datasets is a topic of growing interest. The presence of class and attribute noise in software measurement datasets degrades the performance of machine learning-based classifiers, and the identification of these noisy modules improves the overall performance. In this study, we propose a noise detection algorithm based on software metrics threshold values. The threshold values are obtained from Receiver Operating Characteristic (ROC) analysis. This paper focuses on case studies of five public NASA datasets and details the construction of Naive Bayes-based software fault prediction models both before and after applying the proposed noise detection algorithm. Experimental results show that this noise detection approach is very effective for detecting class noise and that the performance of fault predictors using a Naive Bayes algorithm with a logNum filter improves if the class labels of identified noisy modules are corrected.
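A minimal sketch of the kind of ROC-threshold noise detection the abstract describes, assuming scikit-learn, a metric matrix X and binary fault labels y; the Youden-J cut-off and the majority vote over metrics are illustrative assumptions, not the authors' exact algorithm:

# Sketch: derive one threshold per software metric from ROC analysis, then flag modules
# whose metric-based vote disagrees with their recorded class label as noise candidates.
# The Youden-J cut-off and the majority vote are illustrative choices.
import numpy as np
from sklearn.metrics import roc_curve

def roc_thresholds(X, y):
    """One threshold per metric column, chosen by maximizing Youden's J = TPR - FPR."""
    thresholds = []
    for j in range(X.shape[1]):
        fpr, tpr, thr = roc_curve(y, X[:, j])
        thresholds.append(thr[np.argmax(tpr - fpr)])
    return np.array(thresholds)

def flag_noisy_modules(X, y):
    """Flag a module when most of its metrics say 'looks faulty' but the label disagrees
    (or vice versa)."""
    thr = roc_thresholds(X, y)
    looks_faulty = (X >= thr).mean(axis=1) >= 0.5
    return looks_faulty != y.astype(bool)

Modules flagged this way could then have their class labels corrected, or be removed, before training the Naive Bayes fault predictor.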

2.
Software quality engineering comprises several quality assurance activities such as testing, formal verification, inspection, fault tolerance, and software fault prediction. Many researchers have developed and validated fault prediction models using machine learning and statistical techniques, and different kinds of software metrics and diverse feature reduction techniques have been applied to improve model performance. However, these studies did not investigate the effect of dataset size, metrics set, and feature selection techniques on software fault prediction. This study focuses on high-performance fault predictors based on machine learning, such as Random Forests, and on algorithms based on a new computational intelligence approach called Artificial Immune Systems. We used public NASA datasets from the PROMISE repository to make our predictive models repeatable, refutable, and verifiable. The research questions concern the effects of dataset size, metrics set, and feature selection techniques. To answer these questions, seven test groups were defined, and nine classifiers were examined for each of the five public NASA datasets. According to this study, Random Forests provides the best prediction performance for large datasets and Naive Bayes is the best prediction algorithm for small datasets in terms of the Area Under the Receiver Operating Characteristic Curve (AUC) evaluation parameter. The parallel implementation of the Artificial Immune Recognition Systems (AIRS2Parallel) algorithm is the best Artificial Immune Systems paradigm-based algorithm when method-level metrics are used.
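To make the evaluation concrete, here is a hedged sketch of the AUC-based comparison with scikit-learn (generic X and y; the study's actual experimental protocol, parameter settings, and AIRS implementations are not reproduced):

# Sketch: compare Random Forests and Naive Bayes by cross-validated AUC, the
# evaluation parameter used in the study. Dataset loading is assumed elsewhere.
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def compare_auc(X, y, cv=10):
    models = {
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
        "NaiveBayes": GaussianNB(),
    }
    return {name: cross_val_score(m, X, y, cv=cv, scoring="roc_auc").mean()
            for name, m in models.items()}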

3.
We address the important problem of software quality analysis when there is limited software fault or fault-proneness data. A software quality model is typically trained using software measurement and fault data obtained from a previous release or similar project. Such an approach assumes that fault data is available for all the training modules. Various issues in software development may limit the availability of fault-proneness data for all the training modules. Consequently, the available labeled training dataset may be such that the trained software quality model does not provide reliable predictions. More specifically, the small set of modules with known fault-proneness labels is not sufficient for capturing the software quality trends of the project. We investigate semi-supervised learning with the Expectation Maximization (EM) algorithm for software quality estimation with limited fault-proneness data. The hypothesis is that knowledge stored in the software attributes of the unlabeled program modules will aid in improving software quality estimation. Software data collected from a large NASA software project is used during the semi-supervised learning process. The software quality model is evaluated with multiple test datasets collected from other NASA software projects. Compared to software quality models trained only with the available set of labeled program modules, the EM-based semi-supervised learning scheme improves the generalization performance of the software quality models.
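A simplified illustration of the idea, assuming scikit-learn: a hard-label EM (classification-EM) loop around Gaussian Naive Bayes that pseudo-labels the unlabeled modules and refits; the paper's actual EM formulation and base learner may differ:

# Sketch: augment a small labeled set with pseudo-labeled unlabeled modules via a
# hard-label EM loop. E-step infers missing labels, M-step refits on all modules.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def em_semi_supervised(X_lab, y_lab, X_unlab, n_iter=10):
    model = GaussianNB().fit(X_lab, y_lab)
    for _ in range(n_iter):
        y_pseudo = model.predict(X_unlab)          # E-step: label the unlabeled modules
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, y_pseudo])
        model = GaussianNB().fit(X_all, y_all)     # M-step: refit on labeled + pseudo-labeled
    return model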

4.
Mining class outliers: concepts, algorithms and applications in CRM (cited 4 times: 0 self-citations, 4 by others)
Outliers, commonly referred to as exceptional cases, exist in many real-world databases. Detection of such outliers is important for many applications and has recently attracted much attention from the data mining research community. However, most existing methods are designed for mining outliers from a single dataset without considering the class labels of data objects. In this paper, we consider the class outlier detection problem: ‘given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels’. By generalizing two pioneering contributions [Proc WAIM02 (2002); Proc SSTD03] in this field, we develop the notion of class outlier and propose practical solutions by extending existing outlier detection algorithms to this case. Furthermore, its potential applications in CRM (customer relationship management) are also discussed. Finally, experiments on real datasets show that our method can find interesting outliers and is of practical use.
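One way to make the class-outlier notion concrete is to score each observation against the other members of its own class; the per-class Local Outlier Factor below is only an illustration of that idea, not the algorithms generalized in the paper:

# Sketch: score each observation relative to its own class, so points that are
# suspicious *given their class label* receive high scores. LOF is used purely as
# an illustration; each class is assumed to have at least a few members.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def class_outlier_scores(X, y, n_neighbors=20):
    scores = np.zeros(len(X))
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        lof = LocalOutlierFactor(n_neighbors=min(n_neighbors, len(idx) - 1))
        lof.fit(X[idx])
        scores[idx] = -lof.negative_outlier_factor_   # higher = more outlying within its class
    return scores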

5.
To further improve the fitting and prediction performance of existing non-homogeneous Poisson process (NHPP) software reliability growth models, this work first studies imperfect debugging models from the perspective of the growth trend of the total number of faults and proposes two general imperfect debugging framework models, which consider, respectively, a linear relationship and a differential relationship between the total fault content function and the cumulative detected fault function; expressions for the cumulative number of detected faults and the total fault content are then derived. Second, the fitting and prediction performance of the two proposed general imperfect debugging models is compared with that of six existing imperfect debugging models on six real failure datasets. The results show that the proposed general imperfect debugging framework models achieve excellent fitting and prediction performance on most of the failure datasets, demonstrating their effectiveness and practicality. Based on an in-depth analysis of the performance of the proposed models and the other imperfect debugging models on these datasets, recommendations are given for selecting imperfect debugging models in practical applications.
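For the linear-relationship case sketched above, assuming a constant detection rate b and total fault content a(t) = a + alpha*m(t), the mean value function has the closed form m(t) = a/(1-alpha) * (1 - exp(-b*(1-alpha)*t)); the SciPy fit below is a generic illustration of fitting such a model to a failure dataset, not the paper's exact framework:

# Sketch: fit an imperfect-debugging NHPP mean value function in which the total fault
# content grows linearly with the detected faults, a(t) = a + alpha * m(t). Solving
# dm/dt = b*(a(t) - m(t)) with m(0) = 0 gives the closed form used below.
import numpy as np
from scipy.optimize import curve_fit

def mean_value(t, a, b, alpha):
    return a / (1.0 - alpha) * (1.0 - np.exp(-b * (1.0 - alpha) * t))

def fit_srgm(t, faults):
    """t: testing times; faults: cumulative detected faults observed at those times."""
    p0 = [faults[-1], 0.1, 0.1]                              # rough initial guesses
    bounds = ([1e-6, 1e-6, 0.0], [np.inf, np.inf, 0.99])     # keep alpha < 1
    params, _ = curve_fit(mean_value, t, faults, p0=p0, bounds=bounds)
    return dict(zip(["a", "b", "alpha"], params))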

6.
The evaluation of unsupervised outlier detection algorithms is a constant challenge in data mining research. Little is known regarding the strengths and weaknesses of different standard outlier detection models, and the impact of parameter choices for these algorithms. The scarcity of appropriate benchmark datasets with ground-truth annotation is a significant impediment to the evaluation of outlier methods. Even when labeled datasets are available, their suitability for the outlier detection task is typically unknown. Furthermore, the biases of commonly used evaluation measures are not fully understood. It is thus difficult to ascertain the extent to which newly proposed outlier detection methods improve over established methods. In this paper, we perform an extensive experimental study on the performance of a representative set of standard k-nearest-neighbor-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose. Based on the overall performance of the outlier detection methods, we provide a characterization of the datasets themselves, and discuss their suitability as outlier detection benchmark sets. We also examine the most commonly used measures for comparing the performance of different methods, and suggest adaptations that are more suitable for the evaluation of outlier detection results.
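As a concrete example of the evaluation setting, a sketch (assuming scikit-learn and a ground-truth outlier indicator) of the simplest k-nearest-neighbor outlier score evaluated with ROC AUC, one of the commonly used measures such studies examine:

# Sketch: a basic kNN outlier score (distance to the k-th nearest neighbor) evaluated
# against ground-truth outlier labels with ROC AUC, as in benchmark-style studies.
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

def knn_outlier_scores(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
    dist, _ = nn.kneighbors(X)
    return dist[:, -1]                                # distance to the k-th true neighbor

def evaluate(X, is_outlier, k=10):
    return roc_auc_score(is_outlier, knn_outlier_scores(X, k))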

7.
Recent advances in supervised salient object detection modeling have resulted in significant performance improvements on benchmark datasets. However, most existing salient object detection models assume that at least one salient object exists in the input image. Such an assumption often leads to less appealing saliency maps on background images with no salient object at all. Therefore, handling those cases can reduce the false positive rate of a model. In this paper, we propose a supervised learning approach for jointly addressing the salient object detection and existence prediction problems. Given a set of background-only images and images with salient objects, as well as their salient object annotations, we adopt the structural SVM framework and formulate the two problems jointly in a single integrated objective function: saliency labels of superpixels are involved in a classification term conditioned on the salient object existence variable, which in turn depends on both global image and regional saliency features and saliency label assignments. The loss function also considers both image-level and region-level misclassifications. Extensive evaluation on benchmark datasets validates the effectiveness of our proposed joint approach compared to the baseline and state-of-the-art models.

8.
Statistical topic models for multi-label document classification (cited 2 times: 0 self-citations, 2 by others)
Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.

9.
Due to the difficulties posed by outliers and skewed data, the prediction of breast cancer survivability has presented many challenges in the field of data mining and pattern recognition, especially in medical research. To solve these problems, we have proposed a hybrid approach to generating higher-quality data sets for the creation of improved breast cancer survival prediction models. This approach comprises two main steps: (1) an outlier filtering approach based on C-Support Vector Classification (C-SVC) to identify and eliminate outlier instances; and (2) an over-sampling approach using sampling with replacement to increase the number of instances in the minority class. In order to assess the capability and effectiveness of the proposed approach, several measurement methods including basic performance measures (e.g., accuracy, sensitivity, and specificity), Area Under the receiver operating characteristic Curve (AUC) and F-measure were utilized. Moreover, a 10-fold cross-validation method was used to reduce the bias and variance of the results of the breast cancer survivability prediction models. Results indicate that the proposed approach improves the performance of breast cancer survivability prediction models by up to 28.34% due to the improved training data space.
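A sketch of the two data-preparation steps, using scikit-learn's SVC and NumPy resampling; treating instances the C-SVC cannot reconcile with their labels as outliers is an illustrative reading of the filtering step, not the paper's exact procedure:

# Sketch of the two pre-processing steps: (1) drop instances inconsistent with a fitted
# C-SVC (treated here as outliers), then (2) over-sample the minority class with
# replacement so the two classes are balanced. Binary labels are assumed.
import numpy as np
from sklearn.svm import SVC

def filter_outliers(X, y):
    svc = SVC(C=1.0, kernel="rbf").fit(X, y)
    keep = svc.predict(X) == y            # keep instances the model agrees with
    return X[keep], y[keep]

def oversample_minority(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=counts.max() - len(idx), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])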

10.
It is well known that software defect prediction is one of the most important tasks for software quality improvement. The use of defect predictors allows test engineers to focus on defective modules, so testing resources can be allocated effectively and quality assurance costs can be reduced. For within-project defect prediction (WPDP), there should be sufficient data within a company to train a prediction model. Without such local data, cross-project defect prediction (CPDP) is feasible since it uses data collected from similar projects in other companies. Software defect datasets suffer from a class imbalance problem that increases the difficulty for the learner to predict defects. In addition, the impact of imbalanced data on the real performance of models can be hidden by the performance measures chosen. We investigate whether class imbalance learning can be beneficial for CPDP. In our approach, the asymmetric misclassification cost and the similarity weights obtained from distributional characteristics are closely associated to guide the appropriate resampling mechanism. We performed the A-statistic effect size test to evaluate the magnitude of the improvement, and used the Wilcoxon rank-sum test for statistical significance. The experimental results show that our approach can provide higher prediction performance than both the existing CPDP technique and the existing class imbalance technique.
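The two statistical tools named above can be computed as follows (a generic SciPy sketch using the standard rank-based formula for the Vargha-Delaney A statistic; not code from the paper):

# Sketch: Vargha-Delaney A12 effect size (rank-based formula) and the Wilcoxon
# rank-sum test, e.g. applied to two sets of per-run performance scores.
import numpy as np
from scipy.stats import rankdata, ranksums

def vargha_delaney_a12(treatment, control):
    m, n = len(treatment), len(control)
    ranks = rankdata(np.concatenate([treatment, control]))
    r1 = ranks[:m].sum()
    return (r1 / m - (m + 1) / 2) / n     # 0.5 = no effect; > 0.5 favors treatment

def compare(treatment, control):
    stat, p_value = ranksums(treatment, control)
    return vargha_delaney_a12(treatment, control), p_value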

11.

Context

Assessing software quality at the early stages of the design and development process is very difficult since most of the software quality characteristics are not directly measurable. Nonetheless, they can be derived from other measurable attributes. For this purpose, software quality prediction models have been extensively used. However, building accurate prediction models is hard due to the lack of data in the domain of software engineering. As a result, the prediction models built on one data set show a significant deterioration of their accuracy when they are used to classify new, unseen data.

Objective

The objective of this paper is to present an approach that optimizes the accuracy of software quality predictive models when used to classify new data.

Method

This paper presents an adaptive approach that takes already-built predictive models and adapts them (one at a time) to new data. We use an ant colony optimization algorithm in the adaptation process. The approach is validated on the stability of classes in object-oriented software systems and can easily be used for any other software quality characteristic. It can also be easily extended to work with software quality prediction problems involving more than two classification labels.

Results

Results show that our approach outperforms the machine learning algorithm C4.5 as well as random guessing. It also preserves the expressiveness of the models, which provide not only the classification label but also guidelines to attain it.

Conclusion

Our approach is an adaptive one that can be seen as taking predictive models that have already been built from common domain data and adapting them to context-specific data. This is suitable for the domain of software quality since the data is very scarce, and hence predictive models built from one data set are hard to generalize and reuse on new data.

12.
Fault prediction is an important topic for industry: by providing effective methods for predictive maintenance, it allows companies to achieve significant time and cost savings. In this paper we describe an application developed to predict and explain door failures on metro trains. The aim was twofold: first, devising prediction techniques capable of detecting door failures early from diagnostic data; second, describing failures in terms of properties distinguishing them from normal behavior. Data pre-processing was a complex task aimed at overcoming a number of issues with the dataset, such as size, sparsity, bias, burst effects and trust. Since failure premonitory signals did not share common patterns, but were only characterized as non-normal device signals, fault prediction was performed using outlier detection. Fault explanation was finally achieved by exhibiting device features showing abnormal values. An experimental evaluation was performed to assess the quality of the proposed approach. Results show that high-degree outliers are effective indicators of incipient failures. Also, explanation in terms of abnormal feature values (responsible for outlierness) seems to be quite expressive. Some aspects of the proposed approach deserve particular attention. We introduce a general framework for the failure detection problem based on an abstract model of diagnostic data, along with a formal problem statement. Together they provide the basis for the definition of an effective data pre-processing technique in which the behavior of a device, in a given time frame, is summarized through a number of suitable statistics. This approach strongly mitigates the issues related to data errors and noise, thus enabling effective outlier detection. All this, in our view, provides the grounds for a general methodology for advanced prognostic systems.
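A rough sketch of the pre-processing idea described above, assuming pandas/scikit-learn and a diagnostic log with timestamp, device and value columns: summarize each device and time frame with a few statistics, then score the summaries with an off-the-shelf outlier detector (Isolation Forest here is an illustrative stand-in for the paper's detector):

# Sketch: summarize each device/time-frame with simple statistics, then score the
# summaries with an outlier detector; high-scoring frames are failure candidates.
import pandas as pd
from sklearn.ensemble import IsolationForest

def summarize(df, frame="1D"):
    """df: diagnostic readings with a datetime 'timestamp', a 'device' id and a numeric 'value'."""
    grouped = df.set_index("timestamp").groupby("device").resample(frame)["value"]
    return grouped.agg(["mean", "std", "min", "max", "count"]).fillna(0.0)

def score_frames(stats):
    # Higher score = more anomalous device/time-frame summary.
    iso = IsolationForest(random_state=0).fit(stats.values)
    return -iso.score_samples(stats.values)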

13.
Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.

14.
Most traditional outlier detection methods operate on a single dataset, or on a single dataset obtained by fusing multiple sources; their results therefore ignore both the correlation knowledge across multiple data sources and key information within individual sources. To detect outlier correlation knowledge across multiple sources, a relevant-subspace-based multi-source outlier detection algorithm, RSMOD, is proposed. By combining the bidirectional influence of the k-nearest-neighbor set and the reverse-nearest-neighbor set, an object influence space for multi-source data is defined, which improves the accuracy of outlier measurement. On the basis of this influence space, a sparsity factor and a sparsity difference factor for multi-source data are proposed to effectively characterize how sparse a data object is within the multi-source data; the relevant-subspace measure is then redefined so that it applies to multi-source datasets, and an outlier detection algorithm based on the relevant subspace is given. Experiments on synthetic datasets and the real U.S. census dataset verify the performance of the RSMOD algorithm and analyze the outlier correlation knowledge derived from multiple datasets.

15.
Software quality assurance is a vital component of software project development. A software quality estimation model is trained using software measurement and defect (software quality) data of a previously developed release or similar project. Such an approach assumes that the development organization has experience with systems similar to the current project and that defect data are available for all modules in the training data. In software engineering practice, however, various practical issues limit the availability of defect data for modules in the training data. In addition, the organization may not have experience developing a similar system. In such cases, the task of software quality estimation or labeling modules as fault prone or not fault prone falls on the expert. We propose a semisupervised clustering scheme for software quality analysis of program modules with no defect data or quality-based class labels. It is a constraint-based semisupervised clustering scheme that uses k-means as the underlying clustering algorithm. Software measurement data sets obtained from multiple National Aeronautics and Space Administration software projects are used in our empirical investigation. The proposed technique is shown to aid the expert in making better estimations as compared to predictions made when the expert labels the clusters formed by an unsupervised learning algorithm. In addition, the software quality knowledge learnt during the semisupervised process provided good generalization performance for multiple test data sets. An analysis of program modules that remain unlabeled subsequent to our semisupervised clustering scheme provided useful insight into the characteristics of their software attributes.

16.
When only a small number of labeled training samples are available in a software history repository, it is difficult to build an effective prediction model. To address this problem, this paper proposes a semi-supervised dictionary learning approach to software defect prediction based on two-stage learning. In the first stage, a sparse representation classifier assigns probabilistic soft labels to a large number of unlabeled samples, which are then added to the labeled training set. In the second stage, discriminative dictionary learning is performed on the expanded training set, and defect proneness is finally predicted with the learned dictionary. Experiments on the NASA MDP and PROMISE AR datasets verify the superiority of the proposed method.

17.
Software fault prediction is the process of developing models that help developers detect faulty classes or modules in early phases of the development life cycle and determine the modules that need more refactoring in the maintenance phase. Software reliability refers to the probability that no failure occurs during a given period of time; when we describe a system as not reliable, it means that it contains many errors. Such errors may be acceptable in some systems, but they can lead to crucial problems in critical systems like aircraft, space shuttles, and medical systems. Therefore, locating faulty software modules is an essential step because it helps define the modules that need more refactoring or more testing. In this article, an approach is developed by integrating a genetic algorithm (GA) with a support vector machine (SVM) classifier and a particle swarm algorithm for software fault prediction, as a step toward a better software fault prediction technique. The developed approach is applied to 24 datasets (12 NASA MDP and 12 Java open-source projects), where NASA MDP is considered a large-scale dataset and the Java open-source projects are considered small-scale datasets. Results indicate that integrating GA with SVM and the particle swarm algorithm improves the performance of the software fault prediction process when applied to large-scale and small-scale datasets and overcomes the limitations of previous studies.

18.
In addition to classification and regression, outlier detection has emerged as a relevant activity in deep learning. In comparison with previous approaches, where the original features of the examples were used to separate highly dissimilar examples from the rest, deep learning can automatically extract useful features from raw data, thus removing the need for most of the feature engineering effort usually required with classical machine learning approaches. This requires training the deep learning algorithm with labels identifying the examples or with numerical values. Although outlier detection in deep learning has usually been undertaken by training the algorithm with categorical labels (as a classifier), it can also be performed by using the algorithm as a regressor. Nowadays numerous urban areas have deployed a network of sensors for monitoring multiple variables about air quality. The measurements of these sensors can be treated individually (as time series) or collectively. Collectively, a variable monitored by a network of sensors can be transformed into a map. Maps can be used as images in machine learning algorithms, including computer vision algorithms, for outlier detection. The identification of anomalous episodes in air quality monitoring networks allows later processing of these time periods with finer-grained scientific packages involving fluid dynamics and chemical evolution software, or the identification of malfunctioning stations. In this work, a Convolutional Neural Network is trained, as a regressor, using as input Ozone-urban images generated from the Air Quality Monitoring Network of Madrid (Spain). The learned features are processed by the Density-based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to identify anomalous maps. Comparisons with other deep learning architectures are undertaken, for instance, autoencoders (undercomplete and denoising) for learning salient features of the maps that are later used as input to DBSCAN. The proposed approach is able to efficiently find maps with local anomalies compared to other approaches based on raw images or on latent features extracted with autoencoder architectures and DBSCAN.
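A sketch of the final stage described above, assuming the CNN's learned feature vector for each ozone map is already available: cluster the features with DBSCAN and treat noise points as anomalous maps (eps and min_samples are illustrative, not the paper's values):

# Sketch: cluster learned feature vectors (one per map) with DBSCAN; points labelled
# -1 (noise) are treated as the anomalous maps.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def anomalous_maps(features, eps=0.5, min_samples=5):
    z = StandardScaler().fit_transform(features)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(z)
    return np.where(labels == -1)[0]     # indices of maps flagged as anomalies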

19.
Almost all drift detection mechanisms designed for classification problems work reactively: after receiving the complete data set (input patterns and class labels), they apply a sequence of procedures to identify some change in the class-conditional distribution, i.e., a concept drift. However, detecting changes after they occur can in some situations be harmful to the process under analysis. This paper proposes a proactive approach for abrupt drift detection, called DetectA (Detect Abrupt Drift). Briefly, this method is composed of three steps: (i) label the patterns from the test set (an unlabelled data block) using an unsupervised method; (ii) compute some statistics from the train and test sets, conditioned on the class labels given for the train set; and (iii) compare the training and testing statistics using a multivariate hypothesis test. Based on the results of the hypothesis tests, we attempt to detect drift on the test set before the real labels are obtained. A procedure for creating datasets with abrupt drift has been proposed to perform a sensitivity analysis of the DetectA model. The result of the sensitivity analysis suggests that the detector is efficient and suitable for high-dimensional datasets, blocks with any proportion of drift, and datasets with class imbalance. The performance of the DetectA method, with different configurations, was also evaluated on real and artificial datasets, using an MLP as the classifier. The best results were obtained using one of the detection methods, with the proactive approach being a top contender for improving the accuracy of the underlying base classifier.
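Step (iii) can be illustrated with a two-sample Hotelling T^2 test comparing the mean vectors of a training block and a (cluster-labelled) test block, e.g. per class; this is a generic sketch, and DetectA's actual statistics and hypothesis test may differ:

# Sketch: two-sample Hotelling T^2 test on mean vectors of two blocks of p-dimensional
# data (p >= 2). A small p-value suggests the blocks differ, i.e. a possible drift.
import numpy as np
from scipy.stats import f

def hotelling_t2(X1, X2):
    n1, n2, p = len(X1), len(X2), X1.shape[1]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    s_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(s_pooled, d)
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    p_value = f.sf(f_stat, p, n1 + n2 - p - 1)
    return t2, p_value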

20.
Software reliability growth models play an important role in reliability assessment and assurance. To address the waiting delay between fault detection and fault removal during software testing, a generalized dynamic integrated neural network model that accounts for the fault-removal waiting delay (RWD-SRGM) is proposed. Taking the diversity of software engineering practice into account, the model uses a neural network approach to build a generalized dynamic integrated model and incorporates the removal waiting delay to perform fault detection and prediction. Experiments on two real failure datasets (DS1 and DS2) compare the proposed model with existing software reliability growth models; the results show that the neural network model considering the fault-removal waiting delay achieves the best fit and exhibits better software reliability assessment performance and model generality.
