Similar documents
20 similar documents retrieved (search time: 281 ms)
1.
Model averaging, or combining, is often considered an alternative to model selection. Frequentist Model Averaging (FMA) is examined extensively, and strategies for applying FMA methods in the presence of missing data, based on two distinct approaches, are presented. The first approach combines estimates from a set of appropriate models weighted by scores of a missing-data-adjusted criterion developed in the recent model selection literature. The second approach averages over the estimates of a set of models with weights based on conventional model selection criteria, but with the missing data replaced by imputed values prior to estimating the models. For this purpose, three easy-to-use imputation methods programmed in currently available statistical software are considered, and a simple recursive algorithm is further adapted to implement a generalized regression imputation so that the missing values are predicted successively. The latter algorithm proves quite useful when two or more missing values occur simultaneously in a given row of observations. Focusing on a binary logistic regression model, the properties of the FMA estimators resulting from these strategies are explored by means of a Monte Carlo study. The results show that in many situations, averaging after imputation is preferred to averaging using weights that adjust for the missing data, and model-averaged estimators often provide better estimates than those resulting from any single model. As an illustration, the proposed methods are applied to a dataset from a study of Duchenne muscular dystrophy detection.
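A minimal sketch of the frequentist model averaging idea, assuming smoothed-AIC weights over a small candidate set; ordinary least squares and plain AIC stand in for the paper's logistic model and missing-data-adjusted criterion, and all names here are illustrative:

```python
import numpy as np

def aic_weights(aics):
    """Smoothed AIC weights: w_i proportional to exp(-delta_AIC_i / 2)."""
    d = np.asarray(aics, dtype=float) - np.min(aics)
    w = np.exp(-0.5 * d)
    return w / w.sum()

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and a Gaussian AIC (up to constants)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    aic = n * np.log(np.mean(resid ** 2)) + 2 * k
    return beta, aic

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)   # x2 is an irrelevant regressor

# Candidate models: {x1}, {x2}, {x1, x2}
designs = [np.column_stack([np.ones(n), x1]),
           np.column_stack([np.ones(n), x2]),
           np.column_stack([np.ones(n), x1, x2])]
fits = [fit_ols(X, y) for X in designs]
w = aic_weights([aic for _, aic in fits])

# Model-averaged intercept: weighted combination across candidate models
avg_intercept = sum(wi * beta[0] for wi, (beta, _) in zip(w, fits))
```

The poorly fitting model (x2 only) receives essentially zero weight, so the averaged estimate is dominated by the well-fitting candidates.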

2.
One solution for completing missing values is to use correlations between the attributes of the data; the difficulty is that such relations are hard to identify in data that contain missing values. Accordingly, we develop a kernel-based missing data imputation method in this paper. The approach aims at optimal inference on statistical parameters (mean, distribution function and quantiles) after the missing data are imputed, and we refer to it as the parameter optimization (POP) algorithm. We evaluate the approach experimentally and demonstrate that the POP algorithm (random regression imputation) is much better than deterministic regression imputation both in efficiency and in the inferences it supports on the above parameters.
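The contrast the POP evaluation draws on, between deterministic and random regression imputation, can be sketched as follows (a simplified illustration with a single linear predictor rather than the paper's kernel-based method): deterministic imputation fills in conditional means and shrinks the variance of the imputed variable, while adding a random residual preserves it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)
miss = rng.random(n) < 0.3            # 30% of y missing completely at random
obs = ~miss

# Fit a regression of y on x using the observed pairs
X = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(X, y[obs], rcond=None)
resid_sd = np.std(y[obs] - X @ beta)

pred = beta[0] + beta[1] * x[miss]

y_det = y.copy()
y_det[miss] = pred                    # deterministic regression imputation

y_rand = y.copy()                     # random regression imputation: add noise
y_rand[miss] = pred + rng.normal(scale=resid_sd, size=miss.sum())
```

The variance of the deterministically imputed sample is visibly below that of the complete data, whereas the randomly imputed sample tracks it closely.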

3.
Data-driven soft sensors have been applied extensively in the process industry for process monitoring and control. Linear soft sensors, which are valid only within a relatively small operating envelope, are insufficient in practice when processes transit among several operating modes. Moreover, owing to causes such as sensor malfunction and multi-rate sampling schemes for different process variables, the missing data problem is commonly encountered in the process industry. In this paper, soft sensor development with irregular/missing output data is considered, and a multiple-model linear parameter varying (LPV) modeling scheme is proposed for handling nonlinearity. The efficiency of the proposed algorithm is demonstrated through several numerical simulation examples as well as a pilot-scale experiment. Comparison with traditional missing data treatment methods in terms of parameter estimation accuracy shows that the developed soft sensors enjoy improved performance by employing the expectation-maximization (EM) algorithm to handle the missing process data and the model switching problem.

4.
The problem of anomaly and attack detection in the IoT environment is one of the prime challenges in the Internet of Things domain and requires immediate attention. Anomalies and attacks such as scan, malicious operation, denial of service, spying, data type probing, wrong setup and malicious control can lead to the failure of an IoT system. Datasets generated in an IoT environment usually have missing values, whose presence makes a classifier unsuitable for the classification task. This article introduces (a) a novel technique for imputing missing data values, (b) a classifier based on feature transformation, and (c) an imputation measure for computing the similarity between any two instances, which can also serve as a similarity measure. The performance of the proposed classifier is studied using imputed datasets obtained by applying K-means, F-Kmeans and the proposed imputation methods. Experiments are also conducted by applying existing and proposed classifiers to the imputed dataset obtained using the proposed imputation technique. The experimental study uses an open-source dataset, the distributed smart space orchestration system dataset, publicly available from Kaggle, and results are validated using the Wilcoxon non-parametric statistical test. The proposed approach outperforms existing classifiers when imputation is performed using the F-Kmeans and K-means imputation techniques. When the proposed imputation and classification techniques are applied, accuracies for the attack classes scan, malicious operation, denial of service, spying, data type probing and wrong setup are 100%, and 99% for the malicious control attack class.

5.
We present an automatic speech recognition system that uses a missing data approach to compensate for challenging environmental noise containing both additive and convolutive components. The unreliable, noise-corrupted (“missing”) components are identified using a Gaussian mixture model (GMM) classifier based on a diverse range of acoustic features. To perform speech recognition on the partially observed data, the missing components are substituted with clean speech estimates computed using both sparse imputation and cluster-based GMM imputation. Compared to two reference mask estimation techniques based on interaural level and time difference pairs, the proposed missing data approach significantly improved keyword accuracy rates in all signal-to-noise ratio conditions when evaluated on the CHiME reverberant multisource environment corpus. Of the imputation methods, cluster-based imputation outperformed sparse imputation. The highest keyword accuracy was achieved when the system was trained on imputed data, which made it more robust to possible imputation errors.

6.
Nonlinear structural equation models with nonignorable missing outcomes from reproductive dispersion models are proposed to identify the relationship between manifest and latent variables in modern educational, medical, social and psychological studies. The nonignorable missing mechanism is specified by a logistic regression model. An EM algorithm is developed to obtain maximum likelihood estimates of the structural parameters and of the parameters in the logistic regression model. Local influence is assessed on the basis of the conditional expectation of the complete-data log-likelihood function, with diagnostics obtained, via the conformal normal curvature, from observations of the missing data and latent variables generated by the Gibbs sampler and the Metropolis-Hastings algorithm. A simulation study and a real example illustrate the application of the proposed methodologies.

7.
马茜、谷峪、李芳芳、于戈 《软件学报》(Journal of Software), 2016, 27(9): 2332-2347
In recent years, with the wide deployment of sensor networks, sensory data has grown explosively. However, owing to inherent hardware limitations, randomness in the deployment environment, and human error during data processing, sensory data usually contains a large number of missing values. Since most existing upper-layer analysis tools cannot handle datasets containing missing values, imputing the missing data is indispensable. Many imputation algorithms exist, but when missing values are dense their accuracy is hard to guarantee, and they ignore the influence of the imputation order on accuracy. This paper therefore proposes OMSMVI (order-sensitive missing value imputation framework for multi-source sensory data). The framework fully exploits the multi-dimensional correlations peculiar to sensory data (temporal, spatial, and attribute correlations) to measure the similarity between data sources; based on this multi-dimensional similarity, it builds a similarity graph centered on the source with missing data, and treats already-imputed values as observations in subsequent imputation. Considering the overall distribution of the missing sources, an order-sensitive imputation is proposed: the imputation order is decided first, and the missing values are then filled in that order. Ordered imputation effectively mitigates the accuracy degradation that arises under dense missingness, when the complete neighbors of a missing source have low similarity to it. Finally, the KNN imputation algorithm is improved into a new neighborhood-based imputation algorithm, NI, which uses the multi-dimensional similarity of sensory data to find all neighbor nodes of a missing source, avoiding the difficulty of choosing K in KNN imputation and further improving accuracy. Comparisons with baseline imputation algorithms on two real datasets verify the accuracy and effectiveness of the algorithm.
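A toy rendition of the order-sensitive idea, with each column standing for one data source and a fixed similarity matrix in place of OMSMVI's multi-dimensional similarity graph; the function and its fill rule are illustrative assumptions, not the paper's NI algorithm:

```python
import numpy as np

def order_sensitive_impute(X, sim):
    """Impute NaNs column by column (each column = one data source).
    Columns with the strongest available neighbours are filled first, and
    already-imputed values serve as observations for later columns."""
    X = X.copy()
    m = X.shape[1]
    missing_cols = [j for j in range(m) if np.isnan(X[:, j]).any()]
    # Decide the imputation order: best-supported source first
    missing_cols.sort(key=lambda j: -max(sim[j][k] for k in range(m) if k != j))
    for j in missing_cols:
        for i in np.flatnonzero(np.isnan(X[:, j])):
            nbrs = [k for k in range(m) if k != j and not np.isnan(X[i, k])]
            w = np.array([sim[j][k] for k in nbrs])
            X[i, j] = np.dot(w, X[i, nbrs]) / w.sum()   # similarity-weighted fill
    return X

# Three sources; sources 0 and 1 are highly similar, source 2 less so
sim = np.array([[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
X = np.array([[10.0,   11.0,   20.0],
              [np.nan, 12.0,   21.0],
              [14.0,   np.nan, np.nan]])
Xf = order_sensitive_impute(X, sim)
```

Note how the value imputed first (column 0, row 2) is reused as an observation when the remaining entries of that row are filled.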

8.
A Mahalanobis-distance-based missing value imputation algorithm  (total citations: 1; self-citations: 0; citations by others: 1)
杨涛、骆嘉伟、王艳、吴君浩 《计算机应用》(Journal of Computer Applications), 2005, 25(12): 2868-2871
A Mahalanobis-distance-based imputation algorithm is proposed to estimate missing values in gene expression datasets. The algorithm selects nearest-neighbour genes by the Mahalanobis distance between genes and reuses previously obtained estimates in subsequent estimation; the weighting coefficients of the nearest neighbours are then computed using the entropy concept from information theory, yielding the imputed value. Experimental results demonstrate the effectiveness of the algorithm, whose performance is superior to other nearest-neighbour-based missing value treatments.
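The two ingredients, Mahalanobis-distance neighbour selection and a weighted combination of the neighbours' values, can be sketched as below. Simple distance-decay weights stand in for the paper's entropy-based weighting, and all names are illustrative:

```python
import numpy as np

def mahalanobis_knn_impute(X, row, col, k=3):
    """Fill X[row, col] from the k genes (rows) nearest in Mahalanobis
    distance, measured on the columns where the target row is observed."""
    obs_cols = [c for c in range(X.shape[1]) if c != col and not np.isnan(X[row, c])]
    cand = [r for r in range(X.shape[0]) if r != row
            and not np.isnan(X[r, col])
            and not np.isnan(X[r, obs_cols]).any()]
    Z = X[np.ix_(cand, obs_cols)]
    VI = np.linalg.pinv(np.cov(Z, rowvar=False))        # (pseudo-)inverse covariance
    diff = Z - X[row, obs_cols]
    d2 = np.einsum('ij,jk,ik->i', diff, VI, diff)       # squared Mahalanobis distances
    d = np.sqrt(np.maximum(0.0, d2))
    nearest = np.array(cand)[np.argsort(d)[:k]]
    w = np.exp(-np.sort(d)[:k])                         # distance-decay weights
    w /= w.sum()                                        # (paper: entropy-based weights)
    return float(w @ X[nearest, col])

# Gene 0 is missing its first sample; genes 1, 2 and 4 are close neighbours
X = np.array([
    [np.nan, 2.0, 3.0, 4.0],
    [1.1,    2.1, 3.1, 4.1],
    [0.9,    1.9, 2.9, 3.9],
    [5.0,    1.0, 0.0, 2.0],
    [1.0,    2.0, 3.0, 4.0],
])
est = mahalanobis_knn_impute(X, row=0, col=0, k=3)
```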

9.
A composite multiple-model approach based on multivariate Gaussian process regression (MGPR) with correlated noises is proposed in this paper. In complex industrial processes, the observation noises of multiple response variables can be correlated with each other and the process is nonlinear. To model multivariate nonlinear processes with correlated noises, a dependent multivariate Gaussian process regression (DMGPR) model is developed. The covariance functions of this DMGPR model are formulated by considering the “between-data” correlation, the “between-output” correlation, and the correlation between noise variables. Further, owing to the complexity of nonlinear systems as well as possible multiple-mode operation of industrial processes, this paper proposes a composite multiple-model DMGPR approach based on the Gaussian mixture model algorithm (GMM-DMGPR) to improve the performance of the DMGPR model. The proposed modelling approach utilizes the weights of all samples belonging to each sub-DMGPR model, evaluated by the GMM algorithm, when estimating model parameters through the expectation-maximization (EM) algorithm. The effectiveness of the proposed GMM-DMGPR approach is demonstrated by two numerical examples and a three-level drawing process in carbon fiber production.

10.
Data imputation is a common practice when dealing with incomplete data. Irrespective of the existing spectrum of techniques, the results of imputation are commonly numeric, meaning that once the data have been imputed they are indistinguishable from the original data available prior to imputation. The crux of the proposed approach is to represent imputed (missing) entries as information granules and in this manner quantify the quality of the imputation process and of the ensuing data. We establish a two-stage imputation mechanism: we start with any method of numeric imputation and then form a granular representative of the missing value. In this sense, the approach can be regarded as an enhancement of existing imputation techniques. Proceeding with the detailed imputation schemes, we discuss two ways of imputation. In the first, imputation is realized for individual variables of the data sets and afterwards enhanced by the buildup of information granules. In the second, we use fuzzy clustering, Fuzzy C-Means (FCM), to establish a structure in the data and then use this information in the imputation process. The design of information granules invokes the fundamentals of Granular Computing, namely the principle of justifiable granularity and the allocation of information granularity. Numeric experiments on a suite of publicly available data sets offer detailed insights into the main facets of the overall design process and deliver a parametric analysis of the methods.
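The principle of justifiable granularity behind these granular representatives can be sketched as a coverage-versus-specificity trade-off around a numerically imputed value. This is a deliberately simplified scalar version; the scoring function is an illustrative choice, not the authors' exact formulation:

```python
import numpy as np

def justifiable_interval(center, data, specificity=1.0):
    """Grow an interval [center - r, center + r] around an imputed value,
    balancing coverage (fraction of data inside) against specificity
    (narrowness of the interval)."""
    span = np.ptp(data)
    best_r, best_q = 0.0, -np.inf
    for r in np.linspace(1e-6, span, 200):
        cov = np.mean(np.abs(data - center) <= r)      # coverage
        spec = max(0.0, 1.0 - r / span)                # specificity
        q = cov * spec ** specificity
        if q > best_q:
            best_q, best_r = q, r
    return center - best_r, center + best_r

dense = np.linspace(4.9, 5.1, 50)    # neighbours tightly packed around the imputed value
sparse = np.linspace(0.0, 10.0, 50)  # neighbours widely dispersed
lo_d, hi_d = justifiable_interval(5.0, dense)
lo_s, hi_s = justifiable_interval(5.0, sparse)
```

A tight neighbourhood yields a narrow granule (a confident imputation), while a dispersed one yields a wide granule, which is exactly the quality signal the numeric result alone cannot carry.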

11.
A new matching procedure is introduced, based on imputing missing data by means of a local linear estimator of the underlying population regression function, which is not assumed to be linear. The procedure is compared to other traditional approaches, more precisely hot deck methods as well as methods based on kNN estimators. Performance is measured by the matching noise, i.e., the discrepancy between the distribution generating genuine data and the distribution generating imputed values.

13.
Whilst the conventional approach in structural design is based on reliability-calibrated factored design formulae, performance-based design customizes a solution to the specific circumstance. In this work, an artificial neural network approach is employed to determine implicit limit state functions for reliability evaluations in performance-based design and to optimally evaluate a set of design variables under specified performance criteria and corresponding desired reliability levels. Case examples are shown for reliability design. Through the establishment of response and reliability databases, structural response computations are integrated with the evaluation of design parameters for specified target reliabilities, and the design can be accomplished. With this methodology, for the same performance requirements, pertinent design parameters can be altered in order to evaluate feasible design alternatives, to explore the use of various structural materials and to define the required material quality control.

14.
Missing data degrades data quality and can make analysis results inaccurate and models less reliable; imputing missing values reduces bias and facilitates subsequent analysis. Most imputation algorithms assume that multiple missing values are weakly correlated or uncorrelated, and rarely consider the correlation among missing values or the imputation order. In the sales domain, imputing missing values independently under-uses the information they carry and substantially harms imputation accuracy. Addressing these problems, this paper takes the sales domain as its research target. Based on the multi-dimensional characteristics of sales behavior, it uses the spatial distribution of the outputs of different models to explore an update mechanism for multiple missing values and studies an incremental imputation method for multiple missing values in sales data: the missing features are ordered by feature correlation, and already-imputed data are fused as information elements to incrementally impute the remaining missing values. The algorithm accounts for both model generalization and the informational correlation among missing data, and combines multi-model fusion to impute multiple missing values effectively. Extensive comparative experiments on a real chain-pharmacy sales dataset verify the effectiveness of the proposed algorithm.

15.
The imbalanced data problem emerged more than two decades ago as one of the most important and challenging problems. Indeed, missing information about the minority class leads to a significant degradation in classifier performance. Moreover, comprehensive research has shown that certain factors increase the problem's complexity; these additional difficulties are closely related to the data distribution over the decision classes. In spite of the numerous methods that have been proposed, the flexibility of existing solutions needs further improvement. We therefore offer a novel rough-granular computing approach (RGA) to address these issues. New synthetic examples are generated only in specific regions of the feature space; this selective oversampling is applied to reduce the number of misclassified minority class examples. A strategy relevant to a given problem is obtained by forming information granules and analyzing their degrees of inclusion in the minority class. Potential inconsistencies are eliminated by an editing phase based on a similarity relation. The most significant algorithm parameters (the number of nearest neighbours, the complexity threshold, the distance threshold and the cardinality redundancy) are tuned in an iterative process, and each data model is built with different parameter values. Results of an experimental study on datasets from the UCI repository show that the proposed method of inducing the neighbourhoods of examples is crucial to the proper creation of synthetic positive instances; the proposed algorithm outperforms related methods on most of the tested datasets, and a set of valid parameters for the RGA technique is established.
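Generating synthetic minority examples only in safe regions can be sketched as a SMOTE-like loop with a neighbourhood purity check, a simple stand-in for the paper's granule inclusion-degree analysis; the function, threshold, and parameters are illustrative:

```python
import numpy as np

def selective_oversample(X, y, minority=1, k=5, n_new=20, seed=0):
    """Synthesize minority examples by interpolation, but only where a
    seed example's k-neighbourhood is dominated by its own class."""
    rng = np.random.default_rng(seed)
    Xm = X[y == minority]
    out = []
    for _ in range(n_new):
        i = rng.integers(len(Xm))
        d = np.linalg.norm(X - Xm[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]               # skip the point itself
        if np.mean(y[nbrs] == minority) < 0.5:
            continue                                # unsafe region: generate nothing
        j = rng.choice(np.flatnonzero(y[nbrs] == minority))
        lam = rng.random()
        out.append(Xm[i] + lam * (X[nbrs[j]] - Xm[i]))
    return np.array(out)

rng0 = np.random.default_rng(4)
Xmaj = rng0.normal(size=(30, 2))                    # majority cluster near the origin
Xmin = 5.0 + 0.1 * rng0.normal(size=(10, 2))        # compact minority cluster at (5, 5)
X = np.vstack([Xmaj, Xmin])
y = np.array([0] * 30 + [1] * 10)
S = selective_oversample(X, y)
```

All synthetic points land inside the minority cluster, never in the region the majority class occupies.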

16.
Because the k nearest neighbours used by the k-nearest-neighbour imputation algorithm (kNNI) may contain noisy records, a new imputation algorithm, mutual k-nearest-neighbour imputation (MkNNI, Mutual k-Nearest Neighbor Imputation), is proposed. A record used to fill a missing value must not only be among the k nearest neighbours of the missing record; its own k nearest neighbours must also contain the missing record. This effectively prevents the k nearest neighbours selected by kNNI from containing noise. Experimental results show that the imputation accuracy of MkNNI is generally better than that of kNNI.
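The mutual-neighbour test can be sketched as follows, with a small, illustrative construction in which a noisy record sits among the query's k nearest neighbours but fails the mutual check and is excluded:

```python
import numpy as np

def mutual_knn_impute(X, y, x_query, k=2):
    """Impute a missing target value as the mean over *mutual* k-nearest
    neighbours: records in the query's kNN whose own kNN also contain the query."""
    Xa = np.vstack([X, x_query])
    q = len(Xa) - 1
    D = np.linalg.norm(Xa[:, None, :] - Xa[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.argsort(D, axis=1)[:, :k]
    mutual = [j for j in knn[q] if q in knn[j]]
    donors = mutual if mutual else list(knn[q])     # fall back to plain kNN
    return float(np.mean(y[donors]))

# Record 1 (y = 10) is near the query but clusters with records 2 and 3,
# so the query is not among its own nearest neighbours.
X = np.array([[0.1, 0.0], [0.3, 0.0], [0.4, 0.02], [0.4, -0.02]])
y = np.array([1.0, 10.0, 10.0, 10.0])
est = mutual_knn_impute(X, y, np.array([0.0, 0.0]), k=2)
```

Plain kNN (k = 2) would average records 0 and 1 to get 5.5, whereas the mutual check keeps only record 0.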

17.
Accurate and reliable information about buildings can greatly improve post-earthquake responses, such as search and rescue, repair and recovery. Building Information Modeling (BIM), rapid scanning and other assessment technologies offer the opportunity not only to retrieve as-built information but also to compile as-damaged models. This research proposes an information model to facilitate the data flow for post-earthquake assessment of reinforced concrete structures. The schema development was based on typical damage modes and the existing Industry Foundation Class (IFC) schema. Two examples of damaged structures from recent earthquake events, compiled using an experimental damage modeling software, illustrate the use of the data model. The model introduces two new classes, one to represent segments of building elements and the other to model the relationships between segments and cracks. A unique feature is the ability to model the process of damage with a binary tree structure. Methods for exporting as-damaged instance models using IFC are also discussed.

18.
The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. Imputation algorithms have traditionally been compared in terms of the similarity between imputed and original values. However, this traditional approach, sometimes referred to as prediction ability, does not allow inferring the influence of imputed values on the ultimate modeling tasks (e.g., classification). Based on extensive experimental work, we study the influence of five nearest-neighbor based imputation algorithms (KNNImpute, SKNN, IKNNImpute, KMI and EACImpute) and two simple algorithms widely used in practice (Mean Imputation and Majority Method) on classification problems. To assess these algorithms experimentally, missing values were simulated on six datasets by means of two missingness mechanisms: Missing Completely at Random (MCAR) and Missing at Random (MAR). The latter allows the probabilities of missingness to depend on observed data but not on missing data, whereas under the former the distribution of missingness does not depend on the observed data either. The quality of the imputed values is assessed by two measures: prediction ability and classification bias. Experimental results show that IKNNImpute outperforms the other algorithms under the MCAR mechanism, while KNNImpute, SKNN and EACImpute in turn provided the best results under the MAR mechanism. Finally, our experiments also show that the best prediction results (in terms of mean squared error) do not necessarily yield the least classification bias.
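A minimal rendition of the experimental protocol: simulate MCAR missingness, impute with a simple mean and with a nearest-neighbour scheme, and compare prediction ability via mean squared error. The dataset and the two imputers are toy stand-ins for the seven methods actually studied:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = x + 0.1 * rng.normal(size=n)      # two strongly related attributes
data = np.column_stack([x, y])

# MCAR: missingness independent of both observed and missing values
miss = rng.random(n) < 0.2
truth = data[miss, 1].copy()
data_miss = data.copy()
data_miss[miss, 1] = np.nan

# Mean imputation
mean_imp = np.nanmean(data_miss[:, 1])

# Nearest-neighbour imputation on the complete attribute
obs = ~miss
def knn_impute(xq, k=5):
    idx = np.argsort(np.abs(x[obs] - xq))[:k]
    return data_miss[obs][idx, 1].mean()

knn_vals = np.array([knn_impute(xq) for xq in x[miss]])

# Prediction ability: mean squared error against the held-out true values
mse_mean = np.mean((mean_imp - truth) ** 2)
mse_knn = np.mean((knn_vals - truth) ** 2)
```

With correlated attributes, the neighbour-based imputer recovers the missing values far more accurately than the unconditional mean, which is the baseline gap the paper's comparison starts from.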

19.
Error generation is the core essence of mechanism-model-based fault detection methods, but it has rarely been applied in statistical process monitoring. This paper proposes a missing-data-based error generation strategy that takes as the new monitored object the errors reflecting how well the sampled data fit the statistical model. The proposed missing-data-based PCA (MD-PCA) method assumes, one variable at a time, that its measurement is missing, infers an estimate of the missing value with a missing-data treatment method, and performs PCA-model-based fault detection on the errors between the actual and estimated values. The advantage of performing fault detection on the errors is that the generated errors can, to some extent, reduce the non-Gaussianity of the original measured variables; moreover, each error captures the component of the corresponding missing variable that is uncorrelated with the other measured variables, better revealing the nature of each measured variable. Experiments on the TE process fully verify the advantages of the proposed method and the feasibility and superiority of MD-PCA for fault detection.
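The error-generation step can be sketched by pretending each variable is missing in turn and estimating it from the others. A simple standardized-error threshold stands in here for the PCA-based monitoring the paper applies to these errors, and all details are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 4
t = rng.normal(size=(n, 2))                       # two latent driving factors
X = t @ rng.normal(size=(2, m)) + 0.1 * rng.normal(size=(n, m))

def make_errors(X_ref, X_new):
    """For each variable, assume its measurement is missing, estimate it from
    the remaining variables by least squares, and return actual - estimate."""
    E = np.empty_like(X_new)
    for j in range(X_ref.shape[1]):
        others = [k for k in range(X_ref.shape[1]) if k != j]
        beta, *_ = np.linalg.lstsq(X_ref[:, others], X_ref[:, j], rcond=None)
        E[:, j] = X_new[:, j] - X_new[:, others] @ beta
    return E

E_train = make_errors(X, X)
sd = E_train.std(axis=0)                          # per-variable error scale

normal = X[:1].copy()
faulty = X[:1].copy()
faulty[0, 0] += 5.0                               # sensor bias fault on variable 1

z_normal = np.abs(make_errors(X, normal) / sd).max()
z_faulty = np.abs(make_errors(X, faulty) / sd).max()
```

A bias that is modest on the raw measurement scale becomes an enormous standardized error, because the error channel only carries what the other variables cannot explain.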

20.
The presence of rounded zeros is an important drawback for the statistical analysis of compositional data, since data analysis methodology based on log-ratios cannot be applied under these conditions. In this paper rounded zeros are treated as a special kind of missing data, and an EM-type computational algorithm for replacing them is provided. The procedure is based on the additive log-ratio transformation and assumes an additive logistic normal model for the data. First, the alr transformation moves the data from the constrained simplex to unconstrained real space. Next, the missing transformed data are imputed using modified EM steps. Last, the imputed data are transformed back into the simplex to obtain a compositional data set free of rounded zeros. Additionally, a sequential strategy is proposed for the case of rounded zeros in all components of a composition. This work focuses on the algorithm's properties and on computational implementation details, and its effectiveness is analyzed on simulated data sets with a range of detection limits. Special attention is paid to the effects on the covariance structure of a compositional data set. Results confirm the good behavior of the proposal. Finally, MATLAB routines implementing the algorithm are made available to the reader.
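The alr machinery and the role of the detection limit can be sketched as follows. The multiplicative replacement shown is a naive stand-in for the paper's modified EM steps, and the fraction 0.65 of the detection limit is an assumed rule-of-thumb choice, not taken from the paper:

```python
import numpy as np

def alr(x):
    """Additive log-ratio transform: D-part composition -> R^(D-1),
    using the last part as the divisor."""
    x = np.asarray(x, dtype=float)
    return np.log(x[..., :-1] / x[..., -1:])

def alr_inv(z):
    """Inverse alr: map log-ratio coordinates back onto the unit simplex."""
    z = np.asarray(z, dtype=float)
    y = np.exp(np.concatenate([z, np.zeros(z.shape[:-1] + (1,))], axis=-1))
    return y / y.sum(axis=-1, keepdims=True)

def replace_rounded_zero(x, dl, frac=0.65):
    """Replace rounded zeros below detection limit dl with frac*dl and rescale
    the nonzero parts so the result stays on the simplex (the paper instead
    imputes the zeros with modified EM steps in alr space)."""
    x = np.asarray(x, dtype=float)
    zeros = (x == 0)
    delta = frac * dl
    return np.where(zeros, delta, x * (1.0 - zeros.sum() * delta))

comp = np.array([0.2, 0.5, 0.3])
roundtrip = alr_inv(alr(comp))                    # alr then inverse recovers comp

filled = replace_rounded_zero(np.array([0.0, 0.6, 0.4]), dl=0.01)
```

After replacement, every part is strictly positive, so the log-ratio machinery applies again.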
