Similar Documents (20 results)
1.
Context: Although independent imputation techniques have been studied comprehensively in software effort prediction, few studies address embedded methods for handling missing data in this setting. Objective: We propose the BREM (Bayesian Regression and Expectation Maximization) algorithm for software effort prediction, together with two embedded strategies for handling missing data. Method: The MDT (Missing Data Toleration) strategy ignores missing data when using BREM for software effort prediction, while the MDI (Missing Data Imputation) strategy uses observed data to impute missing data iteratively while elaborating the predictive model. Results: Experiments on the ISBSG and CSBSG datasets demonstrate that, when there are no missing values in the historical dataset, BREM outperforms LR (Linear Regression), BR (Bayesian Regression), SVR (Support Vector Regression), and the M5′ regression tree, provided the test set is no greater than 30% of the whole historical dataset for ISBSG and 25% for CSBSG. When there are missing values in the historical datasets, BREM with the MDT and MDI strategies significantly outperforms the independent imputation techniques MI, BMI, CMI, MINI, and M5′. Moreover, the MDI strategy gives BREM more accurate imputations of the missing values than the independent imputation techniques, provided the level of missing data in the training set is no larger than 10% for both datasets. Conclusion: The experimental results suggest that BREM is promising for software effort prediction. When there are missing values, the MDI strategy is the preferred one to embed with BREM.
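As a rough illustration of the embedded MDI idea, the sketch below alternates between fitting a regressor and re-imputing missing features from the current state of the data. It uses scikit-learn's BayesianRidge as a stand-in for BREM, so everything here is an assumption about the general scheme rather than the authors' algorithm.

```python
# Sketch of MDI-style embedded imputation: refit the model and re-impute
# missing feature values in alternation (BayesianRidge is a hypothetical
# stand-in for BREM, not the paper's method).
import numpy as np
from sklearn.linear_model import BayesianRidge

def mdi_style_fit(X, y, n_iters=10):
    """Fit a regressor on a feature matrix X that contains NaNs."""
    X = X.copy()
    missing = np.isnan(X)
    # Start from column-mean imputation.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    model = BayesianRidge()
    for _ in range(n_iters):
        model.fit(X, y)
        # Re-impute each incomplete feature from the remaining features.
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            aux = BayesianRidge().fit(X[~rows][:, others], X[~rows, j])
            X[rows, j] = aux.predict(X[rows][:, others])
    return model, X
```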

2.
The problem of missing values in software measurement data used in empirical analysis has led to numerous proposed solutions. Imputation procedures, for example, have been proposed to 'fill in' the missing values with plausible alternatives. We present a comprehensive study of imputation techniques using real-world software measurement datasets. Two datasets with dramatically different properties were used, with missing values injected according to three different missingness mechanisms (MCAR, MAR, and NI). We consider the occurrence of missing values in multiple attributes and compare three procedures: Bayesian multiple imputation, k nearest neighbor imputation, and mean imputation. We also examine the relationship between noise in the dataset and the performance of the imputation techniques, which has not been addressed previously. Our comprehensive experiments demonstrate conclusively that Bayesian multiple imputation is an extremely effective imputation technique.
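A minimal sketch of the kind of experiment described above: inject MCAR missingness into a complete dataset, impute with mean and k-NN, and score error on cells whose true values are known. The dataset and missing rate are illustrative choices, not the paper's; Bayesian multiple imputation is omitted because it has no one-line scikit-learn equivalent.

```python
# Inject MCAR missingness, impute two ways, and measure RMSE against
# the known ground truth (illustrative setup, not the paper's).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X = load_diabetes().data
mask = rng.random(X.shape) < 0.2          # 20% MCAR missingness
X_miss = np.where(mask, np.nan, X)

for name, imp in [("mean", SimpleImputer(strategy="mean")),
                  ("kNN", KNNImputer(n_neighbors=5))]:
    X_hat = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(f"{name}: RMSE on injected-missing cells = {rmse:.4f}")
```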

Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively. He has served on the technical program committees of various international conferences, symposia, and workshops. He has also served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.

Jason Van Hulse received the Ph.D. degree in Computer Engineering from the Department of Computer Science and Engineering at Florida Atlantic University in 2007, the M.A. degree in Mathematics from Stony Brook University in 2000, and the B.S. degree in Mathematics from the University at Albany in 1997. His research interests include data mining and knowledge discovery, machine learning, computational intelligence, and statistics. He has published numerous peer-reviewed research papers in various conferences and journals, and is a member of the IEEE, IEEE Computer Society, and ACM. He has worked in the data mining and predictive modeling field at First Data Corp. since 2000, and is currently Vice President, Decision Science.

3.
A Missing Data Imputation Algorithm Based on EM and Bayesian Networks
Datasets with large amounts of missing data are common in practical applications, and handling missing data has become a research focus in the classification field. This paper analyzes and compares several general-purpose missing data imputation algorithms and proposes a new imputation algorithm based on EM and Bayesian networks. The algorithm uses naive Bayes to estimate the initial values for the EM algorithm, then combines EM with a Bayesian network in an iterative process to determine the final updater, producing the imputed, complete dataset at the same time. Experimental results show that, compared with classical imputation algorithms, the new algorithm achieves higher classification accuracy while saving considerable overhead.
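For readers unfamiliar with EM-based imputation, here is a minimal EM-style loop for continuous data under a single multivariate Gaussian: alternately re-estimate the mean and covariance, then replace each missing block with its conditional expectation. This is a textbook simplification (the covariance correction term of full EM is omitted), not the naive-Bayes-initialized EM-plus-Bayesian-network algorithm proposed in the paper.

```python
# Simplified EM-style Gaussian imputation: iterate parameter estimation
# and conditional-mean fill-in (illustrative, not the paper's algorithm).
import numpy as np

def em_gaussian_impute(X, n_iters=20):
    X = X.copy()
    miss = np.isnan(X)
    # Initialize missing entries with column means.
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])
    for _ in range(n_iters):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in np.where(miss.any(axis=1))[0]:
            m, o = miss[i], ~miss[i]
            # Conditional mean of the missing block given the observed block.
            coef = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
            X[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
    return X
```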

4.
In recent years, industry and academia have faced a severe missing data problem: missing values greatly reduce data usability. Existing missing value imputation techniques incur considerable time overhead and can hardly meet the real-time requirements of big data queries. This work therefore studies efficient processing of aggregate queries in the presence of missing values, effectively combining sampling-based approximate aggregate query processing with missing value imputation to quickly return aggregate results that meet users' needs. It adopts a block-level sampling strategy, performs missing value imputation on the collected samples, and reconstructs an unbiased estimate of the aggregate result from the imputed samples. Experimental results on real and synthetic datasets show that, compared with the current best method, the proposed approach greatly improves query efficiency while guaranteeing the same accuracy.
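As a loose sketch of the combination described above, the function below samples blocks, imputes missing entries within the sampled blocks only, and scales the sampled sum up to an estimate of the full aggregate. The uniform block sampling, mean-based fill, and SUM aggregate are assumptions for illustration; the paper's estimator is not reproduced here.

```python
# Block-level sampling plus in-sample imputation for an approximate SUM.
# The scale-up is unbiased when blocks are equal-sized and sampled
# uniformly without replacement (illustrative assumptions).
import numpy as np

def approx_sum(values, block_size=100, sample_frac=0.1, seed=0):
    """values: 1-D array with NaNs marking missing entries."""
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(len(values) / block_size))
    chosen = rng.choice(n_blocks, size=max(1, int(sample_frac * n_blocks)),
                        replace=False)
    total = 0.0
    for b in chosen:
        block = values[b * block_size:(b + 1) * block_size]
        # Impute missing entries inside the sampled block only.
        filled = np.where(np.isnan(block), np.nanmean(block), block)
        total += filled.sum()
    # Scale the sampled sum up to the full dataset.
    return total * n_blocks / len(chosen)
```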

5.
6.
A very common problem when building software engineering models is dealing with missing data, and a range of imputation techniques exists to address it. However, selecting the appropriate imputation technique can itself be difficult. One reason is that these techniques make assumptions about the underlying missingness mechanism, that is, how the missing values are distributed within the data set. The problem is compounded by the fact that, for small data sets, it may be very difficult to determine what the missingness mechanism is, so there is a danger of using an inappropriate imputation technique. It is therefore necessary to determine the safest default assumption about the missingness mechanism when imputing small data sets. We examine experimentally two simple and commonly used techniques, Class Mean Imputation (CMI) and k Nearest Neighbors (k-NN), coupled with two missingness mechanisms: missing completely at random (MCAR) and missing at random (MAR). We draw two conclusions. First, for our analysis CMI is the preferred technique since it is more accurate. Second, and more importantly, the impact of the missingness mechanism on imputation accuracy is not statistically significant. This is a useful finding since it suggests that even for small data sets we can reasonably make the weaker assumption that the missingness mechanism is MAR. Thus both imputation techniques have practical application for small software engineering data sets with missing values.
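Class Mean Imputation, one of the two techniques compared above, is simple enough to sketch in a few lines: each missing value is replaced by the mean of its attribute computed within the instance's class. The column names below are hypothetical.

```python
# Class Mean Imputation (CMI): fill each NaN with the class-wise mean
# of the attribute (toy frame with hypothetical columns).
import numpy as np
import pandas as pd

df = pd.DataFrame({"effort": [10.0, np.nan, 12.0, 30.0, np.nan],
                   "class":  ["small", "small", "small", "large", "large"]})
df["effort"] = (df.groupby("class")["effort"]
                  .transform(lambda s: s.fillna(s.mean())))
print(df)
```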

7.
Anomaly and attack detection in the IoT environment is one of the prime challenges in the Internet of Things domain and requires immediate attention. Anomalies and attacks such as scan, malicious operation, denial of service, spying, data type probing, wrong setup, and malicious control can cause an IoT system to fail. Datasets generated in an IoT environment usually have missing values, and their presence makes a classifier unsuitable for the classification task. This article introduces (a) a novel technique for imputing missing data values, (b) a classifier based on feature transformation, and (c) an imputation measure for computing similarity between any two instances, which can also serve as a similarity measure. The performance of the proposed classifier is studied on imputed datasets obtained by applying K-means, F-Kmeans, and the proposed imputation method. Experiments are also conducted by applying existing and proposed classifiers to the imputed dataset obtained using the proposed imputation technique. For the experimental study, we used an open-source dataset named Distributed Smart Space Orchestration System, publicly available from Kaggle, and validated the results using the Wilcoxon non-parametric statistical test. The proposed approach outperforms existing classifiers when imputation is performed with the F-Kmeans and K-means techniques. With the proposed imputation and classification techniques, accuracies for the attack classes scan, malicious operation, denial of service, spying, data type probing, and wrong setup are 100%, and 99% for the malicious control class.

8.
Yeon, Hanbyul; Seo, Seongbum; Son, Hyesook; Jang, Yun. The Journal of Supercomputing, 2022, 78(2): 1759–1782.

A Bayesian network is derived from conditional probability and is useful for inferring the next state of currently observed variables. If data are lost or corrupted during collection or transfer, the characteristics of the original data may be distorted and biased, so predictions from a Bayesian network built on missing data are not reliable. Various techniques based on statistics or machine learning have been studied to resolve such imperfections, but since the complete data are unknown, there is no single optimal way to impute missing values. In this paper, we present a visual analysis system that supports decision-making for imputing missing values in panel data. The system allows data analysts to explore the causes of missing data in panel datasets and to compare the performance of candidate imputation models using Bayesian network accuracy and the Kolmogorov–Smirnov test. We evaluate how the system supports the decision-making process for data imputation with datasets from different domains.
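The Kolmogorov–Smirnov comparison mentioned above can be sketched directly with SciPy: test whether a model's imputed values follow the same distribution as the variable's observed values. The arrays below are synthetic placeholders for a panel-data column.

```python
# Two-sample KS test comparing observed values with a model's fill-ins
# (synthetic placeholder data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
observed = rng.normal(50, 5, size=500)    # non-missing entries
imputed = rng.normal(51, 6, size=120)     # the imputation model's fill-ins
stat, p_value = ks_2samp(observed, imputed)
print(f"KS statistic={stat:.3f}, p={p_value:.3f}")
# A small p-value suggests the imputed values are distributed differently
# from the observed ones, flagging the imputation model as suspect.
```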


9.
Completeness is an important dimension of data quality. Because of the inherent uncertainty of data and the randomness and inaccuracy of data collection, many real-world datasets share two characteristics: 1) they are very large, and 2) they are often incomplete and inaccurate. Partitioning large datasets into data windows for processing is therefore an important technique, yet most research on missing data estimation ignores both these dataset characteristics and the use of windows; moreover, with a fixed-size data window, an algorithm's accuracy and performance are easily affected by the window size and the distribution of values within the window. Assuming the data satisfy certain domain-specific constraints, this paper first proposes a new time-based, dynamically self-adaptive data window detection algorithm, and then, on top of this window, an improved fuzzy k-means clustering algorithm for estimating the missing values of incomplete data. Experiments show that, compared with other algorithms, the method better adapts to the characteristics of the dataset, offers good performance, and guarantees accuracy.

10.
Datasets are becoming ever larger, and missing values in such datasets pose a serious threat to data analysts. Although researchers have developed various techniques to handle missing values in different kinds of datasets, little effort has been devoted to missing values in mixed attributes in large datasets. This paper proposes novel strategies for dealing with this issue. The significant attributes (covariates) required for imputation are first selected using the gain ratio measure to decrease computational complexity. Since analyzing continuous attributes during imputation is complex, they are first discretized using a novel methodology called Bayesian classifier-based discretization. Missing values are then imputed using a Bayesian max–min ant colony optimization algorithm, which hybridizes ACO with Bayesian principles; a local search technique is introduced into the ACO implementation to improve its exploitative capability. The proposed methodology is evaluated on real datasets with missing rates ranging from 5% to 50%, and the experimental results show that the proposed discretization and imputation algorithms produce better results than existing methods.
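The gain ratio measure used here for covariate selection is standard: information gain normalized by the split information of the attribute. A compact sketch for discrete-valued arrays follows (names illustrative).

```python
# Gain ratio = (H(target) - H(target | feature)) / H(feature),
# for discrete-valued feature and target arrays.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, target):
    h_target = entropy(target)
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    # Conditional entropy of the target given the feature's value.
    h_cond = sum(w * entropy(target[feature == v])
                 for v, w in zip(values, weights))
    split_info = entropy(feature)
    return (h_target - h_cond) / split_info if split_info > 0 else 0.0
```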

11.
The article presents an experimental study of multiclass Support Vector Machine (SVM) methods on a cardiac arrhythmia dataset with missing attribute values for electrocardiogram (ECG) diagnosis. An incomplete dataset and high data dimensionality can both degrade classifier performance, so imputation of missing data and discriminant analysis are commonly used as preprocessing steps for such large datasets. The article evaluates the performance of the One-Against-All (OAA) and One-Against-One (OAO) approaches in kernel multiclass SVM for heartbeat classification, combined with imputation and dimension reduction techniques. The results indicate that the OAA approach is superior to OAO in multiclass SVM for ECG data analysis with missing values.
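A hedged sketch of the comparison pipeline: impute, reduce dimensionality, then wrap a kernel SVM in One-Against-All and One-Against-One strategies. Since the arrhythmia dataset is not bundled with scikit-learn, a synthetic multiclass problem stands in; all sizes and parameters are assumptions.

```python
# Impute -> reduce dimensionality -> multiclass SVM, comparing OAA vs OAO
# (synthetic data stands in for the arrhythmia dataset).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=60, n_informative=20,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan  # inject missing

for name, wrapper in [("OAA", OneVsRestClassifier), ("OAO", OneVsOneClassifier)]:
    clf = make_pipeline(SimpleImputer(strategy="mean"), PCA(n_components=20),
                        wrapper(SVC(kernel="rbf")))
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```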

12.

One relevant problem in data quality is missing data. Despite the frequency and importance of the missing data problem, many machine learning algorithms handle missing data rather naively. Missing data should, however, be treated carefully, otherwise bias might be introduced into the induced knowledge. In this work, we analyze the use of the k-nearest neighbor algorithm as an imputation method. Imputation denotes a procedure that replaces the missing values in a data set with plausible values. One advantage of this approach is that the missing data treatment is independent of the learning algorithm used, which allows the user to select the most suitable imputation method for each situation. Our analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperform mean or mode imputation, a method broadly used to treat missing values.
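k-nearest-neighbor imputation, the method analyzed above, can be sketched with scikit-learn's KNNImputer (a modern stand-in; the paper predates this implementation): each missing cell is filled with the mean of that feature over the k most similar rows.

```python
# kNN imputation: each NaN is replaced by the mean of the feature over
# the k nearest complete-enough rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)
```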

13.
Fuzzy rule-based classification systems (FRBCSs) are known for their ability to deal with low-quality data and to obtain good results in such scenarios. However, they are rarely applied to problems with missing data, even though real-life data mined in practice is frequently incomplete because of missing values in attributes. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the best known is based on preprocessing, commonly called imputation. In this work, we focus on FRBCSs and present and analyze 14 different approaches to the treatment of missing attribute values. The analysis involves three different methods, distinguishing between Mamdani and TSK models. The results establish the benefit of using imputation methods for FRBCSs with missing values. The analysis suggests that each type of FRBCS behaves differently, and that using particular imputation methods, conditioned on the type of FRBCS, can improve the accuracy obtained.

14.
Missing value imputation is a highly challenging task in machine learning and data mining. Missing values in a data source can substantially degrade the performance of learning algorithms and the quality of what is learned, and existing imputation methods still fall short of users' needs. This paper proposes a missing value imputation method based on grey system theory. The method combines instance-based non-parametric fitting with grey theory techniques and imputes the missing data repeatedly until the results converge or meet the user's requirements. Experimental results show that the method outperforms the existing KNN imputation method and ordinary mean substitution in both imputation quality and efficiency.

15.
The knowledge discovery process is supported by information gathered from collected data sets, which often contain errors in the form of missing values. Data imputation is the activity of estimating values for missing data items. This study focuses on developing automated data imputation models, based on artificial neural networks, for monotone patterns of missing values. The work proposes a single imputation approach relying on a multilayer perceptron whose training is conducted with different learning rules, and a multiple imputation approach based on combining a multilayer perceptron with k-nearest neighbours. Eighteen real and simulated databases were exposed to a perturbation experiment with randomly generated monotone missing data patterns. An empirical test was carried out on these data sets covering both approaches (single and multiple imputation) as well as three classical single imputation procedures: mean/mode imputation, regression, and hot-deck. The experiments thus involved five imputation methods in total. The results, across different performance measures, demonstrate that, compared with traditional tools, both proposals improve the level of automation and data quality while offering satisfactory performance.
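The single-imputation approach above can be sketched as follows: train a multilayer perceptron to predict an incompletely observed attribute from the fully observed ones, then fill the holes with its predictions. With a monotone pattern, columns can be imputed one after another this way. The data and architecture below are illustrative assumptions.

```python
# MLP-based single imputation of one incompletely observed attribute
# from fully observed predictors (synthetic data, illustrative setup).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_complete = rng.normal(size=(300, 3))                 # fully observed block
target = X_complete @ [1.0, -2.0, 0.5] + rng.normal(0, 0.1, 300)
miss = rng.random(300) < 0.3                           # missing indicator

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X_complete[~miss], target[~miss])              # train on observed rows
target_imputed = target.copy()
target_imputed[miss] = mlp.predict(X_complete[miss])   # fill the holes
```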

16.
A Novel Framework for Imputation of Missing Values in Databases
Many industrial and research databases are plagued by missing values; evident examples include databases associated with instrument maintenance, medical applications, and surveys. One common way to cope with missing values is imputation (filling in). Given the rapid growth in database sizes, it is imperative to devise a new imputation methodology together with efficient algorithms. The main objective of this paper is to develop a unified framework supporting a host of imputation methods. In developing this framework, we require that its use should, on average, significantly improve imputation accuracy while maintaining the same asymptotic computational complexity as the individual methods. We also provide a comprehensive review of representative imputation techniques. Notably, using the framework with a low-quality single-imputation method yields imputation accuracy comparable to that achieved by more advanced imputation techniques. We demonstrate, both theoretically and experimentally, that the proposed framework has linear computational complexity and therefore does not affect the asymptotic complexity of the associated imputation method.

17.
Model averaging, or combining, is often considered an alternative to model selection. Frequentist Model Averaging (FMA) is examined extensively, and strategies for applying FMA methods in the presence of missing data are presented, based on two distinct approaches. The first combines estimates from a set of appropriate models weighted by scores of a missing-data-adjusted criterion developed in the recent model selection literature. The second averages over the estimates of a set of models with weights based on conventional model selection criteria, but with the missing data replaced by imputed values before estimating the models. For this purpose, three easy-to-use imputation methods that are programmed in currently available statistical software are considered, and a simple recursive algorithm is further adapted to implement a generalized regression imputation in such a way that missing values are predicted successively. The latter algorithm proves quite useful when two or more missing values occur simultaneously in a given row of observations. Focusing on a binary logistic regression model, the properties of the FMA estimators resulting from these strategies are explored by means of a Monte Carlo study. The results show that, in many situations, averaging after imputation is preferable to averaging with weights that adjust for the missing data, and model-average estimators often provide better estimates than any single model. As an illustration, the proposed methods are applied to a dataset from a study of Duchenne muscular dystrophy detection.
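The recursive regression-imputation idea can be sketched as follows: impute the column with the fewest holes first, then let it serve as a predictor when imputing the next column, so later predictions benefit from earlier fills. This is a generic numeric rendering, not the paper's exact algorithm.

```python
# Recursive (sequential) regression imputation: columns are imputed in
# order of increasing missingness, and already-imputed columns serve as
# predictors for later ones (cells whose predictors are still incomplete
# are left for a second pass).
import numpy as np
from sklearn.linear_model import LinearRegression

def recursive_impute(X):
    X = X.copy()
    order = np.argsort(np.isnan(X).sum(axis=0))  # fewest-missing first
    for j in order:
        rows = np.isnan(X[:, j])
        if not rows.any():
            continue
        others = [k for k in range(X.shape[1]) if k != j]
        # Use only rows whose predictor columns are currently complete.
        ok = ~np.isnan(X[:, others]).any(axis=1)
        fit = ok & ~rows
        model = LinearRegression().fit(X[fit][:, others], X[fit, j])
        fill = ok & rows
        X[fill, j] = model.predict(X[fill][:, others])
    return X
```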

18.
This paper investigates the characteristics of a clinical dataset using a combination of feature selection and classification methods to handle missing values and to understand the underlying statistical characteristics of a typical clinical dataset. A large clinical dataset typically presents challenges such as missing values, high dimensionality, and unbalanced classes, which pose inherent problems for feature selection and classification algorithms. With most clinical datasets, an initial exploration is carried out and attributes with more than a certain percentage of missing values are eliminated. Then, with the help of missing value imputation, feature selection, and classification algorithms, prognostic and diagnostic models are developed. The paper draws two main conclusions. 1) Despite the nature and large size of clinical datasets, the choice of missing value imputation method does not affect final performance; what is crucial is that the dataset accurately represents the clinical problem, so the methods of imputing missing values are not critical for developing classifiers and prognostic/diagnostic models. 2) Supervised learning proves more suitable for mining clinical data than unsupervised methods, and non-parametric classifiers such as decision trees give better results than parametric classifiers such as radial basis function networks (RBFNs).
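The initial exploration step described above, eliminating attributes past a missing-value threshold, is a one-liner in pandas; the threshold and toy frame below are illustrative.

```python
# Drop attributes whose fraction of missing values exceeds a threshold
# before any imputation or feature selection (toy clinical frame).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [63, 54, np.nan, 71],
                   "lab_a": [1.2, np.nan, np.nan, np.nan],
                   "lab_b": [0.4, 0.5, 0.7, np.nan]})
threshold = 0.5
df_reduced = df.loc[:, df.isna().mean() <= threshold]  # drops lab_a (75% missing)
```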

19.
Traditional methods for repairing missing values in time series usually assume the data are generated by a linear dynamical system, whereas time series are more often nonlinear. This paper therefore proposes a time series imputation model based on a residual-connected long short-term memory (LSTM) network, called RSI-LSTM, to effectively capture the nonlinear dynamics of time series and mine the latent association between missing data and the most recent non-missing data. Specifically, an LSTM network models the nonlinear dynamics of the series, and residual connections are introduced to mine the relation between historical values and missing values, improving the model's imputation capability. RSI-LSTM is first used to repair missing data in a univariate daily power supply dataset; then, on the power load dataset of Problem A of the 9th Electrical Engineering Mathematical Modeling Contest, meteorological factors are introduced as multivariate inputs to RSI-LSTM to improve its repair of missing time series values. Two general multivariate time series datasets are also used to verify the model's imputation ability. Experimental results show that on both univariate and multivariate datasets, RSI-LSTM repairs missing values better than plain LSTM, with the mean squared error (MSE) reduced by 10% overall.
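A hedged PyTorch sketch of the core architectural idea: an LSTM whose output is added to the last observation through a residual (skip) connection, so the prediction is anchored to recent non-missing history. The authors' exact RSI-LSTM architecture and training setup are not reproduced here.

```python
# LSTM with a residual connection for one-step-ahead prediction of the
# value to impute (illustrative architecture, not the paper's exact model).
import torch
import torch.nn as nn

class ResidualLSTM(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.lstm(x)
        # Residual connection: prediction = last observation + learned delta,
        # linking the missing value directly to recent non-missing history.
        return x[:, -1, :] + self.head(out[:, -1, :])

model = ResidualLSTM(n_features=1)
window = torch.randn(8, 24, 1)            # 8 windows of 24 past steps
next_step = model(window)                 # predicted value for the gap
```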

20.
In real-life data mining, information is frequently lost because of missing values in attributes. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the best known is based on preprocessing, commonly called imputation. In this work, we focus on a classification task with twenty-three classification methods and fourteen different imputation approaches to missing values treatment, which are presented and analyzed. The analysis takes a group-based approach, distinguishing three categories of classification methods. Each category behaves differently, and the evidence obtained shows that particular imputation methods can improve the accuracy obtained within each category. The study establishes the benefit of using imputation methods when preprocessing data sets with missing values and suggests that the choice of imputation method should be conditioned on the classifier group.
