Similar Documents (20 results)
1.
Data preparation is an important step in mining incomplete data. To address this problem, this paper introduces a new imputation approach called Shell Neighbors imputation, or simply SNI. SNI fills in an incomplete instance (one with missing values) in a given dataset using only its left and right nearest neighbors with respect to each attribute, referred to as its Shell Neighbors. The left and right nearest neighbors are selected from a set of nearest neighbors of the incomplete instance, whose size is determined by cross-validation. SNI is then generalized to handle missing data in datasets with mixed attributes, i.e., both continuous and categorical. Experiments evaluating the proposed approach demonstrate that the generalized SNI method outperforms kNN imputation in both imputation accuracy and classification accuracy.
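The core of SNI can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it searches all complete records instead of a cross-validated set of k nearest neighbors, and it handles continuous attributes only.

```python
def sni_impute(record, complete_rows):
    """Fill None entries of `record` from its Shell Neighbors: for each
    observed attribute, the complete rows lying nearest on the left and on
    the right of the record's value on that attribute."""
    observed = [d for d, v in enumerate(record) if v is not None]
    shell = []
    for d in observed:
        left = [r for r in complete_rows if r[d] <= record[d]]
        right = [r for r in complete_rows if r[d] >= record[d]]
        if left:   # nearest neighbor from below on attribute d
            shell.append(min(left, key=lambda r: record[d] - r[d]))
        if right:  # nearest neighbor from above on attribute d
            shell.append(min(right, key=lambda r: r[d] - record[d]))
    filled = list(record)
    for d, v in enumerate(record):
        if v is None:  # average the Shell Neighbors on the missing attribute
            filled[d] = sum(r[d] for r in shell) / len(shell)
    return filled
```

With `complete_rows = [[1, 10], [2, 20], [3, 30]]`, the record `[2, None]` is filled from the neighbors bracketing value 2 on the first attribute.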

2.
Intelligent handling of abnormal information is a frontier and challenging research direction in information processing. Addressing the anomaly that acquired information may be missing due to noise interference and other factors, an intelligent classification algorithm for incomplete (missing) data is proposed. For a given incomplete sample, the method first derives one or more versions of estimated samples from the class information of its nearest neighbors, which preserves imputation accuracy while effectively characterizing the imprecision caused by the missing values; the samples carrying estimated values are then classified. Finally, a new belief classification method is proposed within the evidential reasoning framework: samples that are hard to assign to a single class are allocated to a corresponding composite class, which describes the class uncertainty caused by missing values and reduces the risk of misclassification. Real data sets from the UCI repository are used to validate the algorithm; experimental results show that it handles the classification of incomplete data effectively.

3.
Multivariate extensions of well-known linear mixed-effects models have been increasingly utilized for inference by multiple imputation in the analysis of multilevel incomplete data. The normality assumption for the underlying error terms and random effects plays a crucial role in simulating the posterior predictive distribution from which the multiple imputations are drawn. The plausibility of this normality assumption on the subject-specific random effects is assessed. Specifically, the performance of multiple imputation created under a multivariate linear mixed-effects model is investigated on a diverse set of incomplete data sets simulated under varying distributional characteristics. Under moderate amounts of missing data, the simulation study confirms that the underlying model leads to a well-calibrated procedure with negligible biases and actual coverage rates close to nominal rates in estimates of the regression coefficients. Estimation quality of the random-effect variance and association measures, however, is negatively affected both by misspecification of the random-effect distribution and by the number of incompletely observed variables. The adverse impacts include lower coverage rates and increased biases.

4.
A soft computing system for optimizing deep drilling operations under high-speed conditions in the manufacture of steel components is presented. The input data include cutting parameters and the axial cutting force obtained from the power consumption of the feed motor of the milling centres. Two coolant strategies are tested: traditional working fluid and Minimum Quantity Lubrication (MQL). The model is constructed in three phases. First, a new strategy is proposed to evaluate and complete the set of available measurements; its primary objective is to decide whether further drilling experiments are required to develop an accurate roughness prediction model. An important aspect of this strategy is the imputation of missing data, used to fully exploit both complete and incomplete measurements. The proposed imputation algorithm is based on a genetic algorithm and aims to improve prediction accuracy. In the second phase, a bag of multilayer perceptrons is used to model the impact of deep drilling settings on borehole roughness. Finally, this model is supplied with the borehole dimensions, coolant option, and expected axial force to produce a 3D surface showing the expected borehole roughness as a function of the drilling process settings; this plot is the output needed to use the model under real workshop conditions. The proposed system is capable of approximating the optimal model used to control deep drilling tasks on steel components for industrial use.

5.
The handling of missing values is a topic of growing interest in the software quality modeling domain. Data values may be absent from a dataset for numerous reasons, for example, the inability to measure certain attributes. As software engineering datasets are sometimes small in size, discarding observations (or program modules) with incomplete data is usually not desirable. Deleting data from a dataset can result in a significant loss of potentially valuable information. This is especially true when the missing data is located in an attribute that measures the quality of the program module, such as the number of faults observed in the program module during testing and after release. We present a comprehensive experimental analysis of five commonly used imputation techniques. This work also considers three different mechanisms governing the distribution of missing values in a dataset, and examines the impact of noise on the imputation process. To our knowledge, this is the first study to thoroughly evaluate the relationship between data quality and imputation. Further, our work is unique in that it employs a software engineering expert to oversee the evaluation of all of the procedures and to ensure that the results are not inadvertently influenced by poor quality data. Based on a comprehensive set of carefully controlled experiments, we conclude that Bayesian multiple imputation and regression imputation are the most effective techniques, while mean imputation performs extremely poorly. Although a preliminary evaluation has been conducted using Bayesian multiple imputation in the empirical software engineering domain, this is the first work to provide a thorough and detailed analysis of this technique. Our studies also demonstrate conclusively that the presence of noisy data has a dramatic impact on the effectiveness of imputation techniques.
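Two of the techniques compared above, mean imputation and regression imputation, can be sketched in a few lines. This is a single-predictor toy version under the assumption that the predictor column is complete; the study's actual experimental setup is much richer.

```python
def mean_impute(ys):
    """Replace each missing value with the mean of the observed values."""
    obs = [y for y in ys if y is not None]
    m = sum(obs) / len(obs)
    return [m if y is None else y for y in ys]

def regression_impute(xs, ys):
    """Predict each missing y from its (fully observed) x using simple
    least-squares regression fitted on the complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    return [a + b * x if y is None else y for x, y in zip(xs, ys)]
```

On `xs = [1, 2, 3, 4]`, `ys = [2, 4, 6, None]`, regression imputation recovers the trend (fills 8.0) where mean imputation fills the uninformative 4.0, illustrating why the study finds mean imputation weak.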

6.
Energy consumption prediction for a CNC machining process is important for energy-efficiency optimization strategies. To improve generalization, more and more parameters are acquired for energy prediction modeling. However, the data collected from workshops may be incomplete because of misoperation, unstable network connections, frequent transfers, etc. This work proposes a framework for energy modeling based on incomplete data to address this issue. First, some necessary preliminary operations are applied to the incomplete data sets. Then, missing values are estimated to generate a new complete data set using generative adversarial imputation nets (GAIN). Next, the gene expression programming (GEP) algorithm is used to train the energy model on the generated data sets. Finally, the predictive accuracy of the obtained model is tested. Computational experiments investigate the performance of the proposed framework at different rates of missing data. Experimental results demonstrate that even when the missing-data rate increases to 30%, the proposed framework can still make efficient predictions, with RMSE and MAE of 0.903 kJ and 0.739 kJ, respectively.

7.
《Information Sciences》2005,169(1-2):1-25
Imputation of missing data is of interest in many areas, such as survey data editing, medical record maintenance, and DNA microarray data analysis. This paper is devoted to an experimental analysis of a set of imputation methods developed within the so-called least-squares approximation approach, a non-parametric, computationally efficient multidimensional technique. First, we review global methods for least-squares data imputation. Then we propose extensions of these algorithms based on the nearest-neighbours approach. An experimental study of the algorithms on generated data sets is conducted. It appears that the straight algorithms may work rather well on data of simple structure and/or with a small number of missing entries. In more complex cases, however, the only winner within the least-squares approximation approach is INI, a method proposed in this paper as a combination of global and local imputation algorithms.
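A minimal sketch of the global least-squares idea: initialize missing cells with column means, then alternate between fitting a rank-1 least-squares factorization X ≈ u vᵀ and refilling the missing cells from the fit. This is a simplified stand-in, not the paper's algorithms; the INI method additionally combines such global imputation with a nearest-neighbours (local) step.

```python
def rank1_ls_impute(rows, iters=300):
    """Iterative rank-1 least-squares imputation (None marks missing)."""
    X = [list(r) for r in rows]
    miss = [(i, j) for i, r in enumerate(X) for j, v in enumerate(r)
            if v is None]
    ncols = len(X[0])
    # initialize missing cells with column means of observed entries
    for j in range(ncols):
        obs = [r[j] for r in X if r[j] is not None]
        mean = sum(obs) / len(obs)
        for r in X:
            if r[j] is None:
                r[j] = mean
    v = [1.0] * ncols
    for _ in range(iters):
        # one alternating-least-squares sweep for the factors u, v
        vv = sum(x * x for x in v)
        u = [sum(r[j] * v[j] for j in range(ncols)) / vv for r in X]
        uu = sum(x * x for x in u)
        v = [sum(X[i][j] * u[i] for i in range(len(X))) / uu
             for j in range(ncols)]
        for i, j in miss:          # refill missing cells from the fit
            X[i][j] = u[i] * v[j]
    return X
```

On an exactly rank-1 matrix with one hidden entry, the iteration converges toward the value consistent with the low-rank structure.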

8.
In machine learning theory and applications, feature selection is a common way of reducing the dimensionality of high-dimensional data. Most traditional feature selection methods assume complete data sets, and the missing-data case that is common in practice has received little attention. Targeting the unobserved information and outliers present in incomplete data, a robust feature selection method based on probabilistic matrix factorization is proposed. A cluster-based probabilistic matrix factorization model approximates the missing values in the data set, effectively measuring the similarity of data between neighboring clusters, reducing the problem size, and lowering the imputation error. Since the imputed values may contain a small number of outliers, feature selection is performed with a method based on the l2,1 loss function; on this basis, a complete feature selection workflow for incomplete data sets is given and its convergence is analyzed theoretically. The method exploits all the information in the incomplete data set and copes with the influence of outliers. Experimental results show that, compared with traditional feature selection methods, it selects fewer irrelevant features on synthetic data sets, reduces the influence of outliers, and achieves higher classification accuracy with more accurate features on real data sets.

9.
Most existing imputation methods for incomplete data handle only a single type of missing variable and perform relatively poorly on large-scale data. To address the mixed-type missingness found in real big data, this paper proposes a new model, SXGBI (Spark-based eXtreme Gradient Boosting Imputation), which imputes incomplete data containing both continuous and categorical missing variables and generalizes to fast processing of big data. By improving the ensemble learning method XGBoost, it combines several completion algorithms into a single ensemble learner, with a parallel design built on the Spark distributed computing framework so that it runs well on Spark clusters. Experiments show that as the missing rate grows, SXGBI achieves better imputation results than the other methods tested on the RMSE, PFC, and F1 metrics, and that it can be applied effectively to large-scale data sets.

10.
Current research on decision-theoretic rough sets concentrates mainly on complete, discrete-valued information systems; incomplete continuous data has rarely been studied. Considering this problem, an incomplete neighborhood decision-theoretic rough set model is proposed. An incomplete neighborhood relation is first introduced for incomplete continuous data; this binary relation is then used to reconstruct the traditional decision-theoretic rough set, yielding the incomplete neighborhood decision-theoretic rough set model. Based on the principle of decision cost, an attribute reduction algorithm that minimizes decision cost is further proposed. Experiments show that the proposed algorithm achieves higher attribute reduction performance.

11.
Data imputation is a common practice when dealing with incomplete data. Irrespective of the particular technique, the results of imputation are commonly numeric, meaning that once the data have been imputed they are indistinguishable from the original data available prior to imputation. In this study, the crux of the proposed approach is to represent imputed (missing) entries as information granules and in this manner quantify the quality of the imputation process and of the ensuing data. We establish a two-stage imputation mechanism: we start with any method of numeric imputation and then form a granular representative of the missing value. In this sense, the approach can be regarded as an enhancement of existing imputation techniques. Proceeding with the detailed imputation schemes, we discuss two ways of imputation. In the first, imputation is realized for individual variables of the data sets and afterwards enhanced by the buildup of information granules. In the second, we use fuzzy clustering, Fuzzy C-Means (FCM), to establish a structure in the data and then use this information in the imputation process. The design of information granules invokes the fundamentals of Granular Computing, namely the principle of justifiable granularity and the allocation of information granularity. Numeric experiments on a suite of publicly available data sets offer detailed insights into the main facets of the overall design process and deliver a parametric analysis of the methods.
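The two-stage idea can be illustrated with a deliberately simple sketch: impute numerically first (here, the attribute mean), then wrap the imputed value in an interval granule whose half-width reflects the spread of the observed values, so imputed entries remain distinguishable from real data. The coverage-based width rule below is a crude stand-in for the principle of justifiable granularity, not the paper's construction.

```python
def granular_impute(ys, coverage=0.5):
    """Stage 1: numeric (mean) imputation. Stage 2: represent each imputed
    entry as an interval granule covering roughly `coverage` of the
    observed values around the imputed center."""
    obs = sorted(y for y in ys if y is not None)
    center = sum(obs) / len(obs)
    # half-width: distance within which about `coverage` of observed
    # values fall (illustrative width rule)
    idx = min(len(obs) - 1, int(coverage * len(obs)))
    radius = sorted(abs(y - center) for y in obs)[idx]
    return [(center - radius, center + radius) if y is None else y
            for y in ys]
```

Observed entries pass through unchanged; each missing entry becomes an interval rather than a bare number.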

12.
To improve the insufficient imputation accuracy for missing data in incomplete information systems, and taking an aquaculture early-warning information system as the application background, a missing-data imputation algorithm based on attribute correlation is proposed. While effectively preserving the determinacy of the early-warning system, the algorithm studies limited tolerance relations and decision rules, computes the limited tolerance class of each object with missing values according to a newly defined limited tolerance relation, introduces the notion of correlation between condition attributes, and constructs a new extended matrix for imputation, thereby restoring the completeness of the system. The algorithm is validated on missing data from sea bass farming; compared with other methods, it shows clear improvements in imputation accuracy and runtime.

13.
In recent years, industry and academia have faced a severe missing-data problem: missing values greatly reduce data usability. Existing imputation techniques incur substantial time overhead and can hardly meet the real-time requirements of big-data queries. This work therefore studies efficient aggregate query processing in the presence of missing values, effectively combining sampling-based approximate aggregate query processing with imputation to quickly return aggregate results that satisfy user requirements. It adopts a block-level sampling strategy, imputes missing values only within the collected sample, and reconstructs an unbiased estimate of the aggregate result from the imputed sample. Experiments on real and synthetic data sets show that, at the same accuracy, the proposed method greatly improves query efficiency over the current best methods.
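The approach above can be sketched as: sample whole blocks, impute missing values inside the sample only, and scale the sample aggregate up to an unbiased estimate of the population SUM. The block layout and the mean-imputation choice here are illustrative assumptions, not the paper's specific algorithms.

```python
import random

def approx_sum(blocks, sample_frac=0.5, seed=0):
    """Approximate SUM over data with missing values (None) using
    block-level sampling plus in-sample imputation."""
    rng = random.Random(seed)
    k = max(1, int(len(blocks) * sample_frac))
    sample = rng.sample(blocks, k)          # draw k blocks uniformly
    values = [v for b in sample for v in b]
    obs = [v for v in values if v is not None]
    mean = sum(obs) / len(obs)
    imputed = [mean if v is None else v for v in values]  # fill in-sample only
    # scale up: each block enters the sample with probability k / len(blocks)
    return sum(imputed) * len(blocks) / k
```

Because imputation runs only on the sampled blocks, the per-query cost stays proportional to the sample size rather than the full data set.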

14.
Existing imputation methods for handling incomplete data introduce new noise into the data set. To address this problem, a prediction model based on maximum-margin theory is proposed that makes predictions directly from samples containing missing features; experiments verify that the model outperforms commonly used imputation-based models.

15.
The knowledge discovery process is supported by information gathered from collected data sets, which often contain errors in the form of missing values. Data imputation is the activity aimed at estimating values for missing data items. This study focuses on the development of automated data imputation models based on artificial neural networks for monotone patterns of missing values. The present work proposes a single imputation approach relying on a multilayer perceptron whose training is conducted with different learning rules, and a multiple imputation approach based on the combination of multilayer perceptrons and k-nearest neighbours. Eighteen real and simulated databases were exposed to a perturbation experiment with random generation of a monotone missing-data pattern. An empirical test was accomplished on these data sets covering both approaches (single and multiple imputation), and three classical single imputation procedures, mean/mode imputation, regression, and hot-deck, were also considered. The experiments therefore involved five imputation methods. The results, considering different performance measures, demonstrated that, in comparison with traditional tools, both proposals improve the automation level and data quality, offering satisfactory performance.
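One of the three classical single-imputation baselines mentioned above, hot-deck, is easy to sketch: each missing entry is copied from the most similar complete record (the donor). The nearest-donor rule used here is one common variant, assumed for illustration.

```python
def hot_deck(rows):
    """Fill None entries of each row from its nearest complete donor row,
    with similarity measured over the row's observed attributes."""
    complete = [r for r in rows if None not in r]

    def dist(a, b):
        # squared distance over the attributes observed in `a`
        return sum((x - y) ** 2 for x, y in zip(a, b) if x is not None)

    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        donor = min(complete, key=lambda c: dist(r, c))
        filled.append([donor[j] if v is None else v
                       for j, v in enumerate(r)])
    return filled
```

Unlike mean imputation, hot-deck always fills with a value that actually occurs in the data, which preserves the marginal distribution of each attribute.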

16.
王俊陆  王玲  王妍  宋宝燕 《计算机科学》2017,44(2):98-102, 106
With the development of the Internet and information technology, missing and corrupted data have become increasingly common; in particular, as data collection shifts from humans to machines, unstable storage media and losses during network transmission make missingness even more severe. Large numbers of missing values in a database not only seriously degrade query quality but also distort the results of data mining and data analysis, thereby misleading decisions. There is currently no general-purpose imputation method; most strategies target one particular type of missing value. For the complex case where different missing types occur together in incomplete data, an imputation method for incomplete data based on tuple similarity (IATS) is proposed. Weighted association rules are mined from the incomplete data set and used to fill ordinary missing values; for abnormal missingness, a data recommendation algorithm with a recommend-and-filter strategy computes tuple similarity and performs the corresponding imputation, which greatly improves effective data utilization and the quality of users' query results. Experiments show that IATS achieves better accuracy while maintaining the imputation rate.

17.
Current algorithms for handling incomplete data impute missing values with low accuracy. To address this problem, an imputation algorithm based on CFS clustering and an improved denoising autoencoder model is proposed. The CFS clustering algorithm first clusters the incomplete data set; the denoising autoencoder model is then improved and, guided by the clustering result, used to fill the missing values. To enable CFS clustering of incomplete data sets, a partial distance strategy is proposed for measuring the distance between incomplete data objects. Experimental results show that the proposed algorithm fills missing data effectively.
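A partial distance strategy of the kind mentioned above is typically formulated as follows (the exact rescaling used in the paper is an assumption here): the distance is computed over the jointly observed attributes and rescaled to the full dimensionality, so objects with different missing patterns remain comparable.

```python
def partial_distance(a, b):
    """Euclidean-style distance between two incomplete objects (None marks
    missing), computed on jointly observed attributes and rescaled by the
    fraction of usable dimensions."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not None and y is not None]
    if not shared:
        return float("inf")  # no common attributes: incomparable
    sq = sum((x - y) ** 2 for x, y in shared)
    return (sq * len(a) / len(shared)) ** 0.5
```

On complete vectors it reduces to the ordinary Euclidean distance, so it can drop into a clustering algorithm such as CFS without changing the complete-data behavior.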

18.
In real-life data, information is frequently lost in data mining because of missing attribute values. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the best known is based on preprocessing, commonly known as imputation. In this work, we focus on a classification task in which twenty-three classification methods and fourteen different imputation approaches to missing-value treatment are presented and analyzed. The analysis follows a group-based approach that distinguishes three categories of classification methods. Each category behaves differently, and the evidence obtained shows that the use of certain missing-value imputation methods can improve the accuracy obtained with these methods. The study establishes the benefit of using imputation methods to preprocess data sets with missing values, and the analysis suggests that particular imputation methods should be chosen according to the group of classifiers.

19.
This paper reviews the state of research on incomplete data in data mining and the characteristics of ICA and SOM, proposes IVS-IDH, a processing model for incomplete data based on ICA and SOM, and studies how to handle incomplete data whose variables are correlated and non-Gaussian. Building on SOM, it obtains visual analysis results for incomplete data sets, thereby overcoming the shortcomings of the incomplete-data processing method proposed by Wang S.

20.
张安珍  李建中  高宏 《软件学报》2020,31(2):406-420
This paper studies aggregate query processing over incomplete data based on symbolic semantics. Incomplete data, also called missing data, contains two types of missing values: fillable and unfillable. Existing imputation algorithms cannot guarantee the accuracy of query results after filling, so this paper instead gives interval estimates of aggregate query results over incomplete data. The traditional relational database model is extended with symbolic semantics into a general incomplete database model that handles both fillable and unfillable missing values. Under this model, a new semantics for aggregate query results over incomplete data is proposed: the reliable result. A reliable result is an interval estimate of the true query result that contains the true result with high probability. Linear-time methods are given for computing reliable results of SUM, COUNT, and AVG queries. Extensive experiments on real and synthetic data sets verify the effectiveness of the proposed methods.
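The interval flavour of a reliable result for SUM can be illustrated with a deterministic sketch. The assumption here, made for illustration only, is that every missing cell is known to lie in an attribute domain [lo, hi]; the paper's model and its probabilistic guarantees are considerably richer.

```python
def reliable_sum(values, lo, hi):
    """Bracket the true SUM of a column with missing values (None) by
    assigning every missing cell its smallest / largest possible value
    from the assumed domain [lo, hi]. Runs in linear time."""
    base = sum(v for v in values if v is not None)
    n_missing = sum(1 for v in values if v is None)
    return (base + n_missing * lo, base + n_missing * hi)
```

The true SUM is guaranteed to lie inside the returned interval whenever the domain bounds hold, which is the deterministic analogue of the reliable-result guarantee.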
