1.
Two important problems which can affect the performance of classification models are high dimensionality (an overabundance of independent features in the dataset) and imbalanced data (a skewed class distribution which leaves at least one class with many fewer instances than the others). To resolve these problems concurrently, we propose an iterative feature selection approach, which repeatedly applies data sampling (to address class imbalance) followed by feature selection (to address high dimensionality), and finally performs an aggregation step which combines the ranked feature lists from the separate iterations of sampling. This approach is designed to find a ranked feature list which is particularly effective on the more balanced dataset resulting from sampling, while minimizing the risk of losing data through the sampling step and missing important features. To demonstrate this technique, we employ 18 different feature selection algorithms and Random Undersampling with two post-sampling class distributions. We also investigate the use of sampling and feature selection without the iterative step (i.e., using the ranked list from a single iteration rather than combining the lists from multiple iterations), and compare these results with those from the version which uses iteration. Our study is carried out using three groups of datasets with different levels of class balance, all of which were collected from a real-world software system. All of our experiments use four different learners and one feature subset size. We find that our proposed iterative feature selection approach outperforms the non-iterative approach.
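A minimal sketch of the iterative idea described above: repeatedly undersample to a balanced distribution, rank the features on each sampled dataset, and aggregate the per-iteration rankings. The chi-squared ranker, the mean-rank aggregation, the number of iterations, and the synthetic data are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2

def undersample(X, y, rng):
    """Randomly undersample the majority class to a balanced distribution."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

def iterative_feature_ranking(X, y, n_iter=10, seed=0):
    """Repeat (sample -> rank features), then aggregate the ranked lists."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(n_iter):
        Xs, ys = undersample(X, y, rng)
        scores, _ = chi2(np.abs(Xs), ys)       # chi2 requires non-negative features
        ranks.append(np.argsort(np.argsort(-scores)))  # rank 0 = best this iteration
    mean_rank = np.mean(ranks, axis=0)         # aggregate across iterations
    return np.argsort(mean_rank)               # final ranked feature list

X, y = make_classification(n_samples=2000, n_features=50, weights=[0.95], random_state=1)
print(iterative_feature_ranking(X, y)[:10])    # top-10 aggregated features
```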
2.
The problem of missing values in software measurement data used in empirical analysis has led to the proposal of numerous potential solutions. Imputation procedures, for example, have been proposed to ‘fill in’ the missing values with plausible alternatives. We present a comprehensive study of imputation techniques using real-world software measurement datasets. Two datasets with dramatically different properties were utilized in this study, with missing values injected according to three different missingness mechanisms (MCAR, MAR, and NI). We consider the occurrence of missing values in multiple attributes, and compare three procedures: Bayesian multiple imputation, k-nearest neighbor imputation, and mean imputation. We also examine the relationship between noise in the dataset and the performance of the imputation techniques, which has not been addressed previously. Our comprehensive experiments demonstrate conclusively that Bayesian multiple imputation is an extremely effective imputation technique.
Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering at Florida Atlantic University and the Director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the program chair and general chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively. He has served on technical program committees of various international conferences, symposia, and workshops. He has also served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.

Jason Van Hulse received the Ph.D. degree in Computer Engineering from the Department of Computer Science and Engineering at Florida Atlantic University in 2007, the M.A. degree in Mathematics from Stony Brook University in 2000, and the B.S. degree in Mathematics from the University at Albany in 1997. His research interests include data mining and knowledge discovery, machine learning, computational intelligence, and statistics. He has published numerous peer-reviewed research papers in various conferences and journals, and is a member of the IEEE, IEEE Computer Society, and ACM. He has worked in the data mining and predictive modeling field at First Data Corp. since 2000, and is currently Vice President, Decision Science.
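A small sketch of the comparison described in this abstract, under the MCAR mechanism only. The synthetic data are illustrative, and scikit-learn's IterativeImputer (which defaults to a BayesianRidge estimator) stands in here for true Bayesian multiple imputation; it produces a single imputation per fit rather than multiple draws.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X, _ = make_regression(n_samples=500, n_features=8, random_state=0)

# Inject 20% missing values completely at random (the MCAR mechanism).
mask = rng.random(X.shape) < 0.20
X_miss = X.copy()
X_miss[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "kNN": KNNImputer(n_neighbors=5),
    "iterative (Bayesian ridge)": IterativeImputer(random_state=0),
}
for name, imp in imputers.items():
    X_hat = imp.fit_transform(X_miss)
    # Evaluate only the cells that were actually imputed.
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(f"{name:28s} RMSE on imputed cells: {rmse:.3f}")
```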
3.
Software product and process metrics can be useful predictors of which modules are likely to have faults during operations. Developers and managers can use such predictions by software quality models to focus enhancement efforts before release. However, in practice, software quality modeling methods in the literature may not produce a useful balance between the two kinds of misclassification rates, especially when there are few faulty modules. This paper presents a practical classification rule in the context of classification tree models that allows appropriate emphasis on each type of misclassification according to the needs of the project. This is especially important when the faulty modules are rare. An industrial case study using classification trees illustrates the tradeoffs. The trees were built using the TREEDISC algorithm, which is a refinement of the CHAID algorithm. We examined two releases of a very large telecommunications system, and built models suited to two points in the development life cycle: the end of coding and the end of beta testing. Both trees had only five significant predictors, out of 28 and 42 candidates, respectively. We interpreted the structure of the classification trees, and we found the models had useful accuracy.
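TREEDISC itself is not available in common open-source libraries, but the tradeoff the abstract describes can be illustrated with an ordinary decision tree whose class weights shift the emphasis between the two misclassification rates. The weights, depth, and synthetic data below are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced data: ~5% "fault-prone" (class 1) modules.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for w in (1, 5, 20):  # weight on missing a fault-prone module
    tree = DecisionTreeClassifier(class_weight={0: 1, 1: w}, max_depth=5, random_state=0)
    tree.fit(X_tr, y_tr)
    pred = tree.predict(X_te)
    type1 = np.mean(pred[y_te == 0] == 1)  # false alarms on not-fault-prone modules
    type2 = np.mean(pred[y_te == 1] == 0)  # missed fault-prone modules
    print(f"weight {w:2d}: Type I = {type1:.3f}, Type II = {type2:.3f}")
```

Raising the weight trades a higher false-alarm rate for fewer missed fault-prone modules, which is the emphasis choice the paper leaves to the project.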
4.
Software metrics-based quality estimation models can be effective tools for identifying which modules are likely to be fault-prone or not fault-prone. The use of such models prior to system deployment can considerably reduce the likelihood of faults discovered during operations, hence improving system reliability. A software quality classification model is calibrated using metrics from a past release or similar project, and is then applied to modules currently under development. Subsequently, a timely prediction of which modules are likely to have faults can be obtained. However, software quality classification models used in practice may not provide a useful balance between the two misclassification rates, especially when there are very few faulty modules in the system being modeled. This paper presents, in the context of case-based reasoning, two practical classification rules that allow appropriate emphasis on each type of misclassification as per the project requirements. The suggested techniques are especially useful for high-assurance systems where faulty modules are rare. The proposed generalized classification methods emphasize the costs of misclassification and the unbalanced distribution of the faulty program modules. We illustrate the proposed techniques with a case study that consists of software measurements and fault data collected over multiple releases of a large-scale legacy telecommunication system. In addition to investigating the two classification methods, a brief comparison of the techniques is also presented. The level of classification accuracy and model robustness observed for the case study would be beneficial in achieving high software reliability in subsequent system releases. Similar observations are made in our empirical studies with other case studies.
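A rough sketch of the flavor of rule discussed above, with k-nearest neighbors standing in for the paper's case-based reasoner: a module is labeled fault-prone when the evidence ratio between the two classes exceeds a tunable constant c. The classifier, the threshold values, and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=4000, n_features=15, weights=[0.95], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

# kNN as a simple stand-in for a case-based reasoner over past modules.
cbr = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
p_fp = cbr.predict_proba(X_te)[:, 1]           # fault-prone evidence per module

for c in (1.0, 0.25, 0.05):                    # smaller c = more averse to missing faults
    pred = (p_fp > c * (1 - p_fp)).astype(int) # rule: fp iff p_fp / p_nfp > c
    type1 = np.mean(pred[y_te == 0] == 1)
    type2 = np.mean(pred[y_te == 1] == 0)
    print(f"c={c:4.2f}: Type I = {type1:.3f}, Type II = {type2:.3f}")
```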
5.
Many software quality models use only software product metrics to predict module reliability. For evolving systems, however, software process measures are also important. In this case study, the authors use module history data to predict module reliability in a subsystem of JStars, a real-time military system.
6.
Analyzing software measurement data with clustering techniques
For software quality estimation, software development practitioners typically construct quality-classification or fault prediction models using software metrics and fault data from a previous system release or a similar software project. Engineers then use these models to predict the fault proneness of software modules in development. Software quality estimation using supervised-learning approaches is difficult without software fault measurement data from similar projects or earlier system releases. Cluster analysis with expert input is a viable unsupervised-learning solution for predicting software modules' fault proneness and for identifying potentially noisy modules. Data analysts and software engineering experts can collaborate more closely to construct and collect more informative software metrics.
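A toy sketch of the unsupervised workflow: cluster modules on their metrics, then label whole clusters rather than individual modules. In the paper an expert inspects each cluster; here a crude mean-complexity proxy plays that role, and the metric columns and thresholds are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Toy module metrics: [lines of code, cyclomatic complexity, fan-out]
metrics = np.vstack([
    rng.normal([120, 5, 3], [40, 2, 1], size=(180, 3)),     # mostly small modules
    rng.normal([900, 40, 15], [200, 10, 4], size=(20, 3)),  # a few large, complex ones
])

km = KMeans(n_clusters=4, n_init=10, random_state=3)
labels = km.fit_predict(StandardScaler().fit_transform(metrics))

# Summarize each cluster so an expert (or a proxy rule) can label it
# fault-prone or not as a whole.
for k in range(4):
    mean = metrics[labels == k].mean(axis=0)
    print(f"cluster {k}: n={np.sum(labels == k):3d}, "
          f"mean LOC={mean[0]:6.1f}, mean complexity={mean[1]:5.1f}")
```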
7.
Many development organizations try to minimize faults in software as a means for improving customer satisfaction. Assuring high software quality often entails time-consuming and costly development processes. A software quality model based on software metrics can be used to guide enhancement efforts by predicting which modules are fault-prone. This paper presents statistical techniques to determine which predictions by a classification tree should be considered uncertain. We conducted a case study of a large legacy telecommunications system. One release was the basis for the training dataset, and the subsequent release was the basis for the evaluation dataset. We built a classification tree using the TREEDISC algorithm, which is based on χ² tests of contingency tables. The model predicted whether a module was likely to have faults discovered by customers, or not, based on software product, process, and execution metrics. We simulated practical use of the model by classifying the modules in the evaluation dataset. The model achieved useful accuracy, in spite of the very small proportion of fault-prone modules in the system. We assessed by statistical tests whether the classes assigned to the leaves were appropriate, and found sizable subsets of modules with uncertain classifications. Discovering which modules have uncertain classifications allows sophisticated enhancement strategies to resolve the uncertainties. Moreover, TREEDISC is especially well suited to identifying uncertain classifications.
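One simple way to flag uncertain leaves, sketched below, is a per-leaf binomial test of whether the leaf's fault-prone rate differs significantly from the overall prior; leaves where the test fails are treated as uncertain. This is an illustrative stand-in, not the paper's exact statistical procedure, and the tree, data, and 0.05 cutoff are assumptions.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=12, weights=[0.93], random_state=4)
tree = DecisionTreeClassifier(max_depth=4, random_state=4).fit(X, y)

leaf_ids = tree.apply(X)   # leaf index for every module
prior = y.mean()           # overall fault-prone rate
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    n, k = int(in_leaf.sum()), int(y[in_leaf].sum())
    # Is this leaf's fault-prone rate significantly different from the prior?
    p = binomtest(k, n, prior).pvalue
    verdict = "confident" if p < 0.05 else "UNCERTAIN"
    print(f"leaf {leaf:3d}: n={n:4d}, fp-rate={k / n:.3f}, p={p:.3f} -> {verdict}")
```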
8.
The pairwise attribute noise detection algorithm
Analyzing the quality of data prior to constructing data mining models is emerging as an important issue. Algorithms for identifying noise in a given data set can provide a good measure of data quality. Considerable attention has been devoted to detecting class noise or labeling errors. In contrast, limited research has been devoted to detecting instances with attribute noise, in part due to the difficulty of the problem. We present a novel approach for detecting instances with attribute noise and demonstrate its usefulness with case studies using two different real-world software measurement data sets. Our approach, called the Pairwise Attribute Noise Detection Algorithm (PANDA), is compared with a nearest-neighbor, distance-based outlier detection technique (denoted DM) investigated in the related literature. Since what constitutes noise is domain specific, our case studies use a software engineering expert to inspect the instances identified by the two approaches and determine whether they actually contain noise. It is shown that PANDA provides better noise detection performance than the DM algorithm.

Jason Van Hulse is a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. His research interests include data mining and knowledge discovery, machine learning, computational intelligence, and statistics. He is a student member of the IEEE and IEEE Computer Society. He received the M.A. degree in mathematics from Stony Brook University in 2000, and is currently Director, Decision Science at First Data Corporation.

Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering at Florida Atlantic University, and the director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these subjects. He has been a principal investigator and project leader in a number of projects with industry, government, and other research-sponsoring agencies. He is a member of the IEEE, the IEEE Computer Society, and the IEEE Reliability Society. He served as the program chair and general chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively. He has also served on technical program committees of various international conferences, symposia, and workshops. He has served as North American editor of the Software Quality Journal, and is on the editorial boards of the journals Empirical Software Engineering, Software Quality, and Fuzzy Systems.

Haiying Huang received the M.S. degree in computer engineering from Florida Atlantic University, Boca Raton, Florida, USA, in 2002. She is currently a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. Her research interests include software engineering, computational intelligence, data mining, software measurement, software reliability, and quality engineering.
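A loose sketch of the pairwise intuition behind PANDA, not the published algorithm's exact scoring: for every attribute pair (j, k), bin the data on k and measure how far each instance's value of j deviates from its bin, then accumulate the deviations per instance so that instances with anomalous attribute values rank highest. The binning scheme, deviation measure, and toy data are all assumptions.

```python
import numpy as np

def panda_like_scores(X, n_bins=5):
    """For each attribute pair (j, k): bin on k, then score each instance by
    how far its value of j sits from its bin's mean (in standard deviations).
    Scores are summed over all pairs; high totals suggest attribute noise."""
    n, d = X.shape
    scores = np.zeros(n)
    for k in range(d):
        edges = np.quantile(X[:, k], np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(X[:, k], edges)
        for j in range(d):
            if j == k:
                continue
            for b in np.unique(bins):
                idx = bins == b
                mu, sd = X[idx, j].mean(), X[idx, j].std() + 1e-9
                scores[idx] += np.abs(X[idx, j] - mu) / sd
    return scores

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))
X[:5, 2] = 25.0                              # inject attribute noise into 5 instances
print(np.argsort(-panda_like_scores(X))[:8])  # the noisy instances should rank first
```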
9.
A practical classification-rule for software-quality models
A practical classification rule for an SQ (software quality) model considers the project's need to use the model to guide the targeting of software RE (reliability enhancement) efforts, such as extra reviews early in development. Such a rule is often more useful than the alternatives. This paper discusses several classification rules for SQ models and recommends a generalized classification rule, in which the effectiveness and efficiency of the model for guiding software RE efforts can be explicitly considered. This is the first application of this rule to SQ modeling that we know of. Two case studies illustrate the application of the generalized classification rule. A telecommunications-system case study models membership in the class of fault-prone modules as a function of the number of interfaces to other modules. A military-system case study models membership in the class of fault-prone modules as a function of a set of process metrics that depict the development history of a module. These case studies are examples where balanced misclassification rates resulted in more useful and practical SQ models than other classification rules did.
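A minimal sketch of a generalized rule of this form applied to any probabilistic classifier: predict fault-prone iff P(fp|x)/P(nfp|x) > c, sweeping c toward the value that balances the two error rates, as the case studies above found useful. Logistic regression, the c grid, and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=10, weights=[0.9], random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=6)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]

# Generalized rule: predict fault-prone iff P(fp|x) / P(nfp|x) > c.
for c in (1.0, 0.3, 0.1):
    pred = (p / (1 - p + 1e-12) > c).astype(int)
    type1 = np.mean(pred[y_te == 0] == 1)   # not-fault-prone modules flagged fp
    type2 = np.mean(pred[y_te == 1] == 0)   # fault-prone modules missed
    print(f"c={c:4.1f}: Type I = {type1:.3f}, Type II = {type2:.3f}")
```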
10.
This paper presents an empirical study that evaluates software quality models over several releases, to address the question, “How long will a model yield useful predictions?” The classification and regression trees (CART) algorithm is introduced; CART can achieve a preferred balance between the two types of misclassification rates. This is desirable because misclassification of fault-prone modules often has much more severe consequences than misclassification of those that are not fault-prone. The case study developed two classification-tree models based on four consecutive releases of a very large legacy telecommunication system. Forty-two software product, process, and execution metrics were candidate predictors. Model 1 used measurements of the first release as the training data set; this model had 11 important predictors. Model 2 used measurements of the second release as the training data set; this model had 15 important predictors. Measurements of subsequent releases were used as evaluation data sets. Analysis of the models' predictors yielded insights into various software development practices. Both models had accuracy that would be useful to developers. One might suppose that software quality models lose their value very quickly over successive releases due to evolution of the product and the underlying development processes; we found the models remained useful over all the releases studied.
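A small sketch of the evaluation design described above: train a CART-style tree on one release, then score every later release to see whether the error rates degrade. The synthetic "releases" below simulate mild drift with a feature shift, which is an assumption standing in for real product evolution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Simulate four releases: the same underlying concept with mild drift per release.
releases = [
    make_classification(n_samples=1500, n_features=20, weights=[0.93],
                        shift=0.1 * r, random_state=7)
    for r in range(4)
]

# Train a CART-style tree on release 1, evaluate on each later release.
X0, y0 = releases[0]
cart = DecisionTreeClassifier(max_depth=5, class_weight="balanced",
                              random_state=7).fit(X0, y0)
for r, (X, y) in enumerate(releases[1:], start=2):
    pred = cart.predict(X)
    type1 = np.mean(pred[y == 0] == 1)
    type2 = np.mean(pred[y == 1] == 0)
    print(f"release {r}: Type I = {type1:.3f}, Type II = {type2:.3f}")
```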