Similar documents
 18 similar documents found
1.
A missing-value imputation algorithm based on Mahalanobis distance   Total citations: 1, self-citations: 0, citations by others: 1
杨涛  骆嘉伟  王艳  吴君浩 《计算机应用》2005,25(12):2868-2871
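This paper proposes a Mahalanobis-distance-based imputation algorithm for estimating missing data in gene expression datasets. The algorithm selects nearest-neighbor genes by the Mahalanobis distance between genes and feeds estimates already obtained into subsequent estimation steps; it then uses the information-theoretic concept of entropy to compute weighting coefficients for the nearest neighbors, which yields the imputed values. Experimental results show that the algorithm is effective and outperforms other nearest-neighbor-based methods for handling missing values.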

2.
马茜  谷峪  李芳芳  于戈 《软件学报》2016,27(9):2332-2347
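In recent years, with the wide deployment of sensing networks, sensory data have grown explosively. However, owing to the inherent limitations of hardware devices, the randomness of deployment environments, and human error during data processing, among other factors, sensory data usually contain a large number of missing values. Most existing upper-layer analysis tools cannot handle datasets with missing values, so imputing missing data is indispensable. Many imputation algorithms already exist, but when missing data are dense their accuracy is hard to guarantee, and they do not consider how the imputation order affects imputation accuracy. To address this, the paper proposes OMSMVI (order-sensitive missing value imputation framework for multi-source sensory data). The framework exploits the multi-dimensional correlations specific to sensory data (temporal, spatial, and attribute correlation) to measure the similarity between data sources. It then builds, from this multi-dimensional similarity, a similarity graph centered on the data sources with missing values, and uses already-imputed values as observations in subsequent imputation. Taking the overall distribution of the missing data sources into account, it performs order-sensitive imputation: it first decides the order in which missing values are to be imputed and then imputes them. Imputing in order effectively mitigates the accuracy loss that arises, when missing data are dense, from the low similarity between a missing data source and its complete neighbors. Finally, the paper improves on KNN imputation with a new neighborhood-based imputation algorithm, NI (neighborhood-based imputation), which uses the multi-dimensional similarity of sensory data to find all neighboring nodes of a missing data source; this removes the difficulty of choosing K in KNN imputation and further improves accuracy. Experiments on two real datasets, with comparisons against baseline imputation algorithms, verify the accuracy and effectiveness of the approach.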

3.
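To improve the accuracy of estimates for missing data in wireless sensor networks (WSN), a self-deciding interpolation algorithm is proposed. The algorithm selects among different estimation strategies according to the spatial correlation of the dataset and the continuity of the missing data, and introduces the autoregressive moving average (ARMA) model into missing-data interpolation. Compared with traditional missing-value estimation algorithms, it accounts for both the characteristics of wireless sensor networks and the characteristics of the dataset itself. Tests on a real dataset show that the algorithm improves the accuracy of missing-value estimates.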

4.
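A missing-value imputation method based on support vector machines is proposed. The method divides imputation into two cases: missing values of continuous attributes and missing values of categorical attributes. For continuous attributes, support vector regression predicts the missing values; for categorical attributes, support vector classification predicts them. Comparative experiments on several UCI datasets and the MINIT handwritten Arabic numeral dataset show that the method achieves a higher recovery rate than traditional mean imputation and imputation based on decision-tree regression.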

5.
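Missing data degrade data quality and can lead to inaccurate analysis results and less reliable models; imputing missing values reduces bias and eases subsequent analysis. Most imputation algorithms assume that multiple missing values are weakly correlated or uncorrelated, and few consider the correlation among missing values or the order of imputation. In the sales domain, imputing missing values independently makes less use of the information they carry and thus considerably harms imputation accuracy. To address these problems, this paper takes the sales domain as its subject. Based on the multi-dimensional features of sales behavior and the spatial distribution characteristics of the outputs of different models, it explores an update mechanism for imputing multiple missing values and studies an incremental imputation method for multiple missing values in sales data: missing features are ranked by feature correlation, and already-imputed data are fused in as information for the incremental imputation of later missing values. The algorithm considers both model generalization and the information correlation among missing data, and combines multi-model fusion to impute multiple missing values effectively. Finally, extensive comparative experiments on a real chain-pharmacy sales dataset verify the effectiveness of the proposed algorithm.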

6.
何云  皮德常 《计算机科学》2015,42(11):251-255, 283
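Gene expression data frequently contain missing values, which hinders research on gene expression. A new similarity measure, the reduced relational grade, is proposed, and on this basis an iterative missing-data imputation algorithm based on the reduced relational grade (RKNNimpute) is presented. The reduced relational grade is an improvement on the grey relational grade: it achieves the same effect while significantly reducing the algorithm's time complexity. RKNNimpute uses the reduced relational grade as its similarity measure, adds imputed genes to the candidate set of neighbor genes, and imputes the remaining missing data iteratively, improving both imputation quality and performance. Extensive experiments on time-series, non-time-series, and mixed gene expression datasets evaluate the performance of RKNNimpute. The results show that the reduced relational grade is an efficient distance measure and that the proposed RKNNimpute algorithm outperforms conventional imputation algorithms.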

7.
陈静杰  车洁 《计算机科学》2017,44(Z6):109-111, 125
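To reduce the negative impact of missing data on the accuracy of statistical inference about aircraft fuel consumption, and to address the shortcomings of nearest-neighbor imputation algorithms based on the traditional Euclidean distance, the Mahalanobis distance, and the reduced relational grade, an imputation algorithm based on the standardized Euclidean distance is proposed for estimating missing fuel-flow values in QAR (Quick Access Recorder) data. The algorithm selects the nearest-neighbor samples by the standardized Euclidean distance between QAR data samples and computes the neighbors' weighting coefficients with the entropy weighting method; the weighted average of the fuel flow in the nearest-neighbor samples gives the estimate of the missing fuel flow. Experimental results show that the standardized Euclidean distance measures sample similarity effectively and that the proposed algorithm outperforms conventional imputation algorithms, making it an effective way to handle missing aircraft fuel-consumption data.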
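The neighbor-selection and weighted-average steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and inverse-distance weights stand in for the entropy-based weights, whose formula the abstract does not give.

```python
import math

def column_stds(rows):
    """Population standard deviation of each feature column (1.0 for constant columns)."""
    n = len(rows)
    stds = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        stds.append(math.sqrt(var) or 1.0)
    return stds

def standardized_euclidean(a, b, stds):
    """Euclidean distance after dividing each coordinate difference by that feature's std."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, stds)))

def knn_impute(features, targets, query, k=3):
    """Estimate a missing target as a weighted mean over the k nearest complete samples.

    ASSUMPTION: weights are inverse distances, a stand-in for the entropy weighting
    named in the abstract.
    """
    stds = column_stds(features)
    scored = sorted(
        (standardized_euclidean(f, query, stds), t)
        for f, t in zip(features, targets)
    )[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in scored]
    return sum(w * t for w, (_, t) in zip(weights, scored)) / sum(weights)
```

Dividing by each column's standard deviation keeps a feature with a large numeric range from dominating the neighbor search, which is the usual motivation for preferring this distance over the plain Euclidean one.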

8.
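This work studies a new method for estimating the remaining useful life (RUL) of lithium batteries when observations are missing. The algorithmic framework comprises a preprocessing module and a prediction module and introduces the extreme learning machine (ELM). The preprocessing module fills in missing observations using single and multiple imputation techniques, and the prediction module estimates the remaining life by one-step or multi-step-ahead prediction. Combining imputation with ahead-prediction algorithms yields an intelligent RUL estimation system for lithium batteries that handles time-series data with missing observations. The system is robust and automatically produces a complete time-series dataset. Experimental results show that the new method suits intelligent diagnosis and prognosis systems for lithium batteries and has broad application value.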

9.
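Missing-value imputation is a highly challenging task in machine learning and data mining. Missing values in a data source considerably degrade the performance of learning algorithms and the quality of what is learned, and existing imputation methods do not yet meet users' needs. A missing-value imputation method based on grey system theory is proposed. It uses instance-based nonparametric fitting together with grey-theory techniques and imputes the missing data repeatedly until the result converges or meets the user's requirements. Experimental results show that the method outperforms both the existing KNN imputation method and ordinary mean substitution in imputation quality and efficiency.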

10.
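The traditional kNN (k-Nearest Neighbor) imputation algorithm is strongly disturbed by noise among the k nearest neighbors when filling in missing data. To address this, an optimized missing-data imputation algorithm based on kNN-DBSCAN (k-Nearest Neighbor, Density-based Spatial Clustering of Applications with Noise) is proposed. The density-based DBSCAN clustering algorithm is applied within kNN imputation: kNN first produces the original set of k nearest neighbors for the target record, DBSCAN then detects and removes noisy records from that set to give the current k-nearest-neighbor set, and this set is finally fed into the kNN computation to impute the missing target value. In addition, because DBSCAN is sensitive to its parameter settings, the parameters are determined by analyzing the statistical properties of the dataset rather than by human judgment. The algorithm is validated on real data, and the results show that it imputes the missing target data more accurately than the traditional kNN algorithm.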
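The three-step pipeline above (find the k nearest neighbors, drop those DBSCAN flags as noise, impute from the rest) might look roughly like this. It is a sketch under stated assumptions rather than the paper's algorithm: the function names are hypothetical, `eps` and `min_pts` are passed in by hand instead of being derived from dataset statistics, the noise check runs on each neighbor's features together with its target value, and the final estimate is a plain mean.

```python
import math

def dist(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan_noise(points, eps, min_pts):
    """Indices DBSCAN would label as noise: not a core point and not within eps of one."""
    n = len(points)
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]  # includes i itself
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    return {i for i in range(n)
            if i not in core and not any(j in core for j in neighbours[i])}

def knn_dbscan_impute(features, targets, query, k, eps, min_pts):
    """Mean of the k nearest neighbours' targets after removing DBSCAN noise."""
    order = sorted(range(len(features)), key=lambda i: dist(features[i], query))[:k]
    # ASSUMPTION: noise is detected on feature values plus the target value.
    rows = [list(features[i]) + [targets[i]] for i in order]
    noise = dbscan_noise(rows, eps, min_pts)
    kept = [targets[i] for pos, i in enumerate(order) if pos not in noise]
    kept = kept or [targets[i] for i in order]  # fall back if every neighbour is flagged
    return sum(kept) / len(kept)
```

With one outlying neighbor among four, the filtered mean ignores it, whereas a plain kNN mean would be pulled toward the outlier.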

11.
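Current algorithms for handling incomplete data fill in missing values with low accuracy. To address this, an incomplete-data imputation algorithm based on CFS clustering and an improved autoencoder model is proposed. The CFS clustering algorithm clusters the incomplete dataset, the denoising autoencoder model is improved, and the improved autoencoder fills in the missing data according to the clustering result. To let the CFS clustering algorithm cluster incomplete datasets, a partial distance strategy is proposed for measuring the distance between incomplete data objects. Experimental results show that the proposed algorithm fills in missing data effectively.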
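The abstract does not spell out its partial distance strategy. One common formulation, shown here purely as an assumption, computes the squared distance over the attributes observed in both objects and rescales it by the ratio of total to shared attributes. The function name and the use of `None` for a missing entry are illustrative choices.

```python
import math

def partial_distance(a, b):
    """Distance over attributes observed in both records, rescaled to full dimensionality.

    Missing entries are None. Returns None when the records share no observed attribute.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    squared = sum((x - y) ** 2 for x, y in shared)
    # Scale up so records compared on few attributes are not judged artificially close.
    return math.sqrt(squared * len(a) / len(shared))
```

For complete records the scaling factor is 1 and the result is the ordinary Euclidean distance, so a clustering algorithm can use the same function for complete and incomplete objects alike.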

12.
Yeon  Hanbyul  Seo  Seongbum  Son  Hyesook  Jang  Yun 《The Journal of supercomputing》2022,78(2):1759-1782

A Bayesian network is derived from conditional probability and is useful for inferring the next state of the currently observed variables. If data are lost or corrupted during collection or transfer, the characteristics of the original data may be distorted and biased. Predicted values from a Bayesian network built on data with missing values are therefore unreliable. Various statistical and machine learning techniques have been studied to resolve such imperfections, but because the complete data are unknown, there is no optimal way to impute missing values. In this paper, we present a visual analysis system that supports decision-making when imputing missing values in panel data. The system allows data analysts to explore the cause of missing data in panel datasets. It also lets them compare the performance of suitable imputation models using the Bayesian network accuracy and the Kolmogorov–Smirnov test. We evaluate how the visual analysis system supports the decision-making process for data imputation with datasets from different domains.
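As an illustration of the Kolmogorov–Smirnov comparison mentioned above, the two-sample statistic is the largest vertical gap between two empirical distribution functions. The sketch below computes only that statistic, with no p-value. Applying it to an observed column and its imputed counterpart is an assumption about how such a system might use it, and the function name is hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so tied values are counted together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap
```

A value near 0 suggests the imputed values follow much the same distribution as the observed ones, while a value near 1 suggests the imputation model has shifted the distribution.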


13.
An effective way to increase noise robustness in automatic speech recognition is to label the noisy speech features as either reliable or unreliable (‘missing’), and replace (‘impute’) the missing ones by clean speech estimates. Conventional imputation techniques employ parametric models and impute the missing features on a frame-by-frame basis. At low SNRs, frame-based imputation techniques fail because many time frames contain few, if any, reliable features. In previous work, we introduced an exemplar-based method, dubbed sparse imputation, which can impute missing features using reliable features from neighbouring frames. We achieved substantial gains in performance at low SNRs for a connected digit recognition task. In this work, we investigate whether the exemplar-based approach can be generalised to a large vocabulary task. Experiments on artificially corrupted speech show that sparse imputation substantially outperforms a conventional imputation technique when the ideal ‘oracle’ reliability of features is used. With error-prone estimates of feature reliability, sparse imputation performance is comparable to our baseline imputation technique in the cleanest conditions, and substantially better at lower SNRs. With noisy speech recorded in realistic noise conditions, sparse imputation performs slightly worse than our baseline imputation technique in the cleanest conditions, but substantially better in the noisier conditions.

14.
Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a Hot deck method, a Naïve-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with a high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, improve classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naïve-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation.

15.
DNA microarray experiments inevitably generate gene expression data with missing values. An important and necessary pre-processing step is thus to impute these missing values. Existing imputation methods exploit gene correlation among all experimental conditions for estimating the missing values. However, related genes coexpress in subsets of experimental conditions only. In this paper, we propose to use biclusters, which contain similar genes under a subset of conditions, for characterizing the gene similarity and then estimating the missing values. To further improve the accuracy of missing value estimation, an iterative framework is developed with a stopping criterion based on minimizing uncertainty. Extensive experiments have been conducted on artificial datasets, real microarray datasets, and one non-microarray dataset. Our proposed bicluster-based approach is able to reduce errors in missing value estimation.

16.
Complete and high-quality deformation monitoring data are critical for shield tunnel construction safety and quality. In engineering practice, missing data frequently occur during instrumentation, adversely impacting further analysis and decision making. Existing imputation methods either ignore the crucial interactions between different parameters during shield tunneling or focus on the global characteristics of deformation data while neglecting their local differences. This paper proposes a novel hybrid model, MCCB, combining multi-view matrix completion algorithms, a convolutional neural network (CNN), and bidirectional long short-term memory (BiLSTM) to impute missing deformation values in shield tunnel monitoring data. The performance of the proposed method is verified using bridge deformation data from a practical project in Beijing, with different missing patterns of the bridge deformation data filled in. The experimental results show that the proposed model can effectively learn the various characteristics of the deformation data, outperforms the four selected models and its two sub-models, and can be used to improve the accuracy of deformation prediction through data imputation. The novelty of this study has two aspects. First, the complicated interactions between different parameters and the local differences in the data are considered simultaneously, which existing studies have not addressed. Second, matrix completion and deep learning algorithms are combined for imputing missing deformation values. To the best of our knowledge, no research on engineering construction has implemented this technique before.

17.
Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see if it is better to tolerate missing data or to try to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.

18.
This paper proposes to utilize information within incomplete instances (instances with missing values) when estimating missing values. Accordingly, a simple and efficient nonparametric iterative imputation algorithm, called the NIIA method, is designed for iteratively imputing missing target values. The NIIA method imputes each missing value several times until the algorithm converges. In the first iteration, only the complete instances are used to estimate missing values. The information within incomplete instances is utilized from the second imputation iteration onward. We conduct experiments to evaluate the efficiency of the method, and demonstrate that: (1) utilizing information within incomplete instances makes it easier to capture the distribution of a dataset; and (2) the NIIA method outperforms the existing methods in accuracy, an advantage that is most pronounced when datasets have a high missing ratio.
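The iteration scheme described here (complete instances only in the first pass, then all instances with their current estimates until convergence) can be sketched as follows. The abstract does not say which nonparametric estimator NIIA uses, so Gaussian kernel regression on a single fully observed feature is assumed here, and the function names, `bandwidth`, `tol`, and `max_iter` are all hypothetical.

```python
import math

def kernel_estimate(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate of y at x0 using a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def iterative_impute(xs, ys, bandwidth=1.0, tol=1e-6, max_iter=100):
    """Fill None entries of ys; xs is a fully observed feature."""
    missing = [i for i, y in enumerate(ys) if y is None]
    if not missing:
        return list(ys)
    complete = [i for i, y in enumerate(ys) if y is not None]
    filled = list(ys)
    # First pass: estimate from complete instances only.
    for i in missing:
        filled[i] = kernel_estimate(xs[i], [xs[j] for j in complete],
                                    [ys[j] for j in complete], bandwidth)
    # Later passes: also use the other instances' current imputed values.
    for _ in range(max_iter):
        previous = list(filled)
        for i in missing:
            others = [j for j in range(len(xs)) if j != i]
            filled[i] = kernel_estimate(xs[i], [xs[j] for j in others],
                                        [previous[j] for j in others], bandwidth)
        if max(abs(filled[i] - previous[i]) for i in missing) < tol:
            break
    return filled
```

Observed targets are never overwritten; only the entries that started as `None` change between passes, and the loop stops once the largest change falls below `tol`.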
