Similar documents
1.
It is now widely accepted that multiple imputation (MI) methods handle the uncertainty of missing data more appropriately than single imputation methods. Several standard statistical software packages, such as SAS, R and STATA, have standard procedures or user-written programs to perform MI. The performance of these packages is generally acceptable for most types of data. However, it is unclear whether these applications are appropriate for imputing data with a large proportion of zero values resulting in a semi-continuous distribution. In addition, it is not clear whether the use of these applications is suitable when the distribution of the data needs to be preserved for subsequent analysis. This article reports the findings of a simulation study carried out to evaluate the performance of the MI procedures for handling semi-continuous data within these statistical packages. Complete resource use data on 1060 participants from a large randomized clinical trial were used as the simulation population from which 500 bootstrap samples were obtained and missing data imposed. The findings of this study showed differences in the performance of the MI programs when imputing semi-continuous data. Caution should be exercised when deciding which program should perform MI on this type of data.

2.
Risk models that aim to predict the future course and outcome of disease processes are increasingly used in health research, and it is important that they are accurate and reliable. Most of these risk models are fitted using routinely collected data in hospitals or general practices. Clinical outcomes such as short-term mortality will be near-complete, but many of the predictors may have missing values. A common approach to dealing with this is to perform a complete-case analysis. However, this may lead to overfitted models and biased estimates if entire patient subgroups are excluded. The aim of this paper is to investigate a number of methods for imputing missing data to evaluate their effect on risk model estimation and the reliability of the predictions. Multiple imputation methods, including hot-decking and multiple imputation by chained equations (MICE), were investigated along with several single imputation methods. A large national cardiac surgery database was used to create simulated yet realistic datasets. The results suggest that complete-case analysis may produce unreliable risk predictions and should be avoided. Conditional mean imputation performed well in our scenario, but may not be appropriate if using variable selection methods. MICE was amongst the best performing multiple imputation methods with regards to the quality of the predictions. Additionally, it produced the least biased estimates, with good coverage, and hence is recommended for use in practice.
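As a hedged illustration of the chained-equations approach evaluated here, the Python sketch below runs a MICE-style workflow with scikit-learn's IterativeImputer on simulated data, fits a logistic risk model to each completed data set and pools the coefficients. The data, predictors and number of imputations are illustrative placeholders, not the cardiac surgery database used in the paper.

```python
# Hedged sketch: a MICE-style workflow with scikit-learn's IterativeImputer (an
# FCS/chained-equations implementation), not the software used in the paper.
# Data, predictors and the outcome are illustrative placeholders.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                     # hypothetical predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan           # impose ~20% missingness in predictors

m = 5                                           # number of imputed data sets
coefs = []
for k in range(m):
    imp = IterativeImputer(sample_posterior=True, random_state=k)  # draws, not means
    Xk = imp.fit_transform(X)
    fit = LogisticRegression(max_iter=1000).fit(Xk, y)
    coefs.append(fit.coef_.ravel())

# Rubin's rules: the pooled point estimate is the mean of the per-imputation estimates
# (a full analysis would also combine within- and between-imputation variances).
pooled = np.mean(coefs, axis=0)
print("pooled coefficients:", np.round(pooled, 3))
```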

3.
This paper provides an overview of multiple imputation and current perspectives on its use in medical research. We begin with a brief review of the problem of handling missing data in general and place multiple imputation in this context, emphasizing its relevance for longitudinal clinical trials and observational studies with missing covariates. We outline how multiple imputation proceeds in practice and then sketch its rationale. We explore the problem of obtaining proper imputations in some detail and distinguish two main classes of approach: methods based on fully multivariate models, and those that iterate conditional univariate models. We show how the use of so-called uncongenial imputation models is particularly valuable for sensitivity analyses and also for certain analyses in clinical trial settings. We also touch upon other forms of sensitivity analysis that use multiple imputation. Finally, we give some open questions that the increasing use of multiple imputation has thrown up, which we believe are useful directions for future research.

4.
This article addresses the problem of missing process data in data-driven dynamic modeling approaches. The key motivation is to avoid using imputation methods or deleting key process information when identifying the model, and to utilize the rest of the information appropriately at the model building stage. To this end, a novel approach is developed that adapts nonlinear iterative partial least squares (NIPALS) algorithms from both partial least squares (PLS) and principal component analysis (PCA) for use in subspace identification. Note that existing subspace identification approaches often utilize singular value decomposition (SVD) as part of the identification algorithm, which is generally not robust to missing data. In contrast, the NIPALS algorithms used in this work leverage the inherent correlation structure of the identification matrices to minimize the impact of missing data values while generating an accurate system model. Furthermore, in computing the system matrices, the calculated scores from the latent variable methods are utilized as the states of the system. The efficacy of the proposed approach is shown via simulation of a nonlinear batch process example.
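The following Python sketch illustrates the general idea of a NIPALS-type PCA that tolerates missing entries by restricting each regression step to the observed values. It is a simplified stand-in for the identification algorithm described above; the toy data, rank and tolerances are assumptions.

```python
# Hedged sketch: NIPALS PCA that skips missing entries when regressing scores on
# loadings (and vice versa). Illustrates the general idea, not the paper's exact method.
import numpy as np

def nipals_pca_missing(X, n_components=2, tol=1e-8, max_iter=500):
    X = np.array(X, dtype=float)
    M = ~np.isnan(X)                  # mask of observed entries
    Xw = np.where(M, X, 0.0)          # missing entries contribute zero to the sums
    n, p = X.shape
    T = np.zeros((n, n_components))   # scores
    P = np.zeros((p, n_components))   # loadings
    for a in range(n_components):
        t = Xw[:, np.argmax(M.sum(axis=0))].copy()    # start from a well-observed column
        for _ in range(max_iter):
            # loadings: regress each column of X on t using observed rows only
            pl = (Xw.T @ t) / (M.T @ (t ** 2) + 1e-12)
            pl /= np.linalg.norm(pl)
            # scores: regress each row of X on the loadings using observed columns only
            t_new = (Xw @ pl) / (M @ (pl ** 2) + 1e-12)
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, a], P[:, a] = t, pl
        Xw = Xw - np.where(M, np.outer(t, pl), 0.0)   # deflate observed entries only
    return T, P

# toy usage: low-rank data with 20% of the entries removed
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(50, 6))
X[rng.random(X.shape) < 0.2] = np.nan
T, P = nipals_pca_missing(X, n_components=2)
print(T.shape, P.shape)
```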

5.
For handling missing data, newer methods such as those based on multiple imputation are generally more accurate than older ones and entail weaker assumptions. Yet most do assume that data are missing at random (MAR). The issue of assessing whether the MAR assumption holds to begin with has been largely ignored. In fact, no way to directly test MAR is available. We propose an alternate assumption, MAR+, that can be tested. MAR+ always implies MAR, so inability to reject MAR+ bodes well for MAR. In contrast, MAR implies MAR+ not universally, but under certain conditions that are often plausible; thus, rejection of MAR+ can raise suspicions about MAR. Our approach is applicable mainly to studies that are not longitudinal. We present five illustrative medical examples, in most of which it turns out that MAR+ fails. There are limits to the ability of sophisticated statistical methods to correct for missing data. Efforts to try to prevent missing data in the first place should therefore receive more attention in medical studies than they have heretofore attracted. If MAR+ is found to fail for a study whose data have already been gathered, extra caution may need to be exercised in the interpretation of the results.

6.
Multiple imputation (MI) is now well established as a flexible, general method for the analysis of data sets with missing values. Most implementations assume the missing data are 'missing at random' (MAR), that is, given the observed data, the reason for the missing data does not depend on the unseen data. However, although this is a helpful and simplifying working assumption, it is unlikely to be true in practice. Assessing the sensitivity of the analysis to the MAR assumption is therefore important. However, there is very limited MI software for this. Further, analysis of a data set with missing values that are not missing at random (NMAR) is complicated by the need to extend the MAR imputation model to include a model for the reason for dropout. Here, we propose a simple alternative. We first impute under MAR and obtain parameter estimates for each imputed data set. The overall NMAR parameter estimate is a weighted average of these parameter estimates, where the weights depend on the assumed degree of departure from MAR. In some settings, this approach gives results that closely agree with joint modelling as the number of imputations increases. In others, it provides ball-park estimates of the results of full NMAR modelling, indicating the extent to which it is necessary and providing a check on its results. We illustrate our approach with a small simulation study, and the analysis of data from a trial of interventions to improve the quality of peer review.
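A minimal Python sketch of the re-weighting idea follows: impute under MAR, estimate the parameter on each completed data set, and combine the estimates with weights governed by a sensitivity parameter delta. The particular weight formula (weights proportional to exp(delta × sum of imputed values)) and the toy data are illustrative assumptions rather than the exact specification in the paper.

```python
# Hedged sketch: MAR imputations re-weighted to reflect an assumed departure from MAR.
# delta = 0 recovers the plain MAR multiple-imputation estimate.
import numpy as np

rng = np.random.default_rng(2)
n = 300
y = rng.normal(loc=10.0, scale=2.0, size=n)           # outcome of interest
miss = rng.random(n) < 0.3                             # 30% of outcomes missing

K = 50                                                 # number of MAR imputations
estimates, imputed_sums = [], []
for k in range(K):
    y_k = y.copy()
    # MAR imputation here is simply a draw from the observed-data distribution
    y_k[miss] = rng.normal(y[~miss].mean(), y[~miss].std(), size=miss.sum())
    estimates.append(y_k.mean())                       # parameter: the mean outcome
    imputed_sums.append(y_k[miss].sum())

estimates, imputed_sums = np.array(estimates), np.array(imputed_sums)

def nmar_estimate(delta):
    """Weighted average of per-imputation estimates; weights encode the NMAR assumption."""
    w = np.exp(delta * (imputed_sums - imputed_sums.mean()))   # centred for stability
    return np.sum(w * estimates) / np.sum(w)

for delta in (0.0, -0.05, -0.1):
    print(f"delta={delta:+.2f}  estimate={nmar_estimate(delta):.3f}")
```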

7.
The goal of multiple imputation is to provide valid inferences for statistical estimates from incomplete data. To achieve that goal, imputed values should preserve the structure in the data, as well as the uncertainty about this structure, and include any knowledge about the process that generated the missing data. Two approaches for imputing multivariate data exist: joint modeling (JM) and fully conditional specification (FCS). JM is based on parametric statistical theory, and leads to imputation procedures whose statistical properties are known. JM is theoretically sound, but the joint model may lack flexibility needed to represent typical data features, potentially leading to bias. FCS is a semi-parametric and flexible alternative that specifies the multivariate model by a series of conditional models, one for each incomplete variable. FCS provides tremendous flexibility and is easy to apply, but its statistical properties are difficult to establish. Simulation work shows that FCS behaves very well in the cases studied. The present paper reviews and compares the approaches. JM and FCS were applied to pubertal development data of 3801 Dutch girls that had missing data on menarche (two categories), breast development (five categories) and pubic hair development (six stages). Imputations for these data were created under two models: a multivariate normal model with rounding and a conditionally specified discrete model. The JM approach introduced biases in the reference curves, whereas FCS did not. The paper concludes that FCS is a useful and easily applied flexible alternative to JM when no convenient and realistic joint distribution can be specified.
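To make the "series of conditional models" idea concrete, the Python sketch below runs a bare-bones FCS loop with Gaussian linear conditionals on simulated continuous data. It is not the imputation model used for the categorical pubertal-development variables above, and all names and settings are illustrative.

```python
# Hedged sketch of FCS: cycle over the incomplete variables, each time regressing the
# current variable on all others and replacing its missing values with draws from the
# fitted conditional model. Gaussian linear conditionals only, for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 400
z = rng.normal(size=n)
X = np.column_stack([z + rng.normal(scale=0.5, size=n),
                     2 * z + rng.normal(scale=0.5, size=n),
                     -z + rng.normal(scale=0.5, size=n)])
mask = rng.random(X.shape) < 0.25                     # 25% missing completely at random
X_miss = np.where(mask, np.nan, X)

def fcs_impute(X_miss, n_cycles=10, rng=rng):
    X_imp = X_miss.copy()
    miss = np.isnan(X_miss)
    col_means = np.nanmean(X_miss, axis=0)
    X_imp[miss] = np.take(col_means, np.where(miss)[1])   # start from mean imputation
    for _ in range(n_cycles):
        for j in range(X_imp.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X_imp, j, axis=1)
            obs = ~miss[:, j]
            model = LinearRegression().fit(others[obs], X_imp[obs, j])
            resid_sd = np.std(X_imp[obs, j] - model.predict(others[obs]))
            pred = model.predict(others[miss[:, j]])
            # draw from the conditional model rather than plugging in the mean
            X_imp[miss[:, j], j] = pred + rng.normal(scale=resid_sd, size=pred.size)
    return X_imp

X_completed = fcs_impute(X_miss)
print(np.round(np.corrcoef(X.ravel(), X_completed.ravel())[0, 1], 3))
```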

8.
High-quality data play a paramount role in the monitoring, control, and prediction of the wastewater treatment process (WWTP) and can effectively ensure the efficient and stable operation of the system. Missing values seriously degrade the accuracy, reliability and completeness of the data due to network collapses, connection errors and data transformation failures. In these cases, it is infeasible to recover the missing data by relying on correlations with other variables. To tackle this issue, a univariate imputation method (UIM) is proposed for WWTP data that integrates a decomposition method with imputation algorithms. First, the seasonal-trend decomposition based on loess (STL) method is utilized to decompose the original time series into seasonal, trend and remainder components to deal with the nonstationary characteristics of WWTP data. Second, support vector regression is used to approximate the nonlinearity of the trend and remainder components respectively and to provide estimates of their missing values, while a self-similarity decomposition fills the seasonal component based on its periodic pattern. Third, the imputed components are merged to obtain the final imputation result. Finally, six WWTP time series are used to evaluate the imputation performance of the proposed UIM against seven existing methods based on two indicators. The experimental results illustrate that the proposed UIM is effective for WWTP time series under different missing ratios and is therefore a promising method for imputing WWTP time series.
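A hedged Python sketch of the decompose-then-impute scheme follows: provisionally interpolate the gaps so STL can run, impute the trend and remainder components at the missing time points with support vector regression, reuse the periodic seasonal pattern there, and recombine. The simulated series, the window lengths and the SVR settings are assumptions for illustration, not the configuration of the proposed UIM.

```python
# Hedged sketch: STL decomposition + SVR imputation of trend/remainder + periodic
# self-similarity for the seasonal part, then merge. Illustrative settings only.
import numpy as np
from statsmodels.tsa.seasonal import STL
from sklearn.svm import SVR

rng = np.random.default_rng(4)
period = 24                                    # e.g. hourly data with a daily cycle
t = np.arange(period * 30, dtype=float)
y = 5 + 0.01 * t + 2 * np.sin(2 * np.pi * t / period) + 0.3 * rng.normal(size=t.size)
miss = rng.random(t.size) < 0.1                # 10% of the samples are missing
y_obs = y.copy()
y_obs[miss] = np.nan

# 1) provisional gap fill so STL can run
y_filled = np.interp(t, t[~miss], y_obs[~miss])
res = STL(y_filled, period=period).fit()
trend_c, seas_c, resid_c = map(np.asarray, (res.trend, res.seasonal, res.resid))

# 2) SVR on the time index for the trend and remainder components
def svr_fill(component):
    model = SVR(C=10.0, gamma="scale").fit(t[~miss, None], component[~miss])
    out = component.copy()
    out[miss] = model.predict(t[miss, None])
    return out

trend = svr_fill(trend_c)
remainder = svr_fill(resid_c)

# 3) seasonal component: reuse the average pattern at the same phase (self-similarity)
phase = t.astype(int) % period
seasonal_pattern = np.array([seas_c[phase == k].mean() for k in range(period)])
seasonal = seas_c.copy()
seasonal[miss] = seasonal_pattern[phase[miss]]

# 4) merge the three imputed components
y_imputed = trend + seasonal + remainder
print("RMSE on the missing points:", np.sqrt(np.mean((y_imputed[miss] - y[miss]) ** 2)))
```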

9.
Analysis of differential abundance in proteomics data sets requires careful application of missing value imputation. Missing abundance values vary widely when performing comparisons across different sample treatments. For example, one would expect a consistent rate of "missing at random" (MAR) across batches of samples and varying rates of "missing not at random" (MNAR) depending on the inherent differences between sample treatments within the study. A missing value imputation strategy must thus be selected that accounts for both MAR and MNAR simultaneously. Several important issues must be considered when deciding on the appropriate missing value imputation strategy: (1) when it is appropriate to impute data; and (2) how to choose a method that reflects the combination of MAR and MNAR that occurs in an experiment. This paper provides an evaluation of missing value imputation strategies used in proteomics and presents a case for the use of hybrid left-censored missing value imputation approaches that can handle the MNAR problem common to proteomics data.
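The Python sketch below illustrates one possible hybrid strategy under stated assumptions: scattered missing values treated as MAR are imputed from the observed correlation structure (k-nearest-neighbour imputation), while values judged MNAR are drawn from a down-shifted, narrowed left-censored normal distribution. The MAR/MNAR split rule and the shift/width constants are invented for illustration and are not the paper's recommendation.

```python
# Hedged sketch of a hybrid MAR/MNAR imputation for log-intensity data.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(5)
n_proteins, n_samples = 200, 12
X = rng.normal(loc=25, scale=2, size=(n_proteins, n_samples))       # log2 intensities
X[rng.random(X.shape) < 0.05] = np.nan                              # scattered MAR gaps
low = X[:, :6].mean(axis=1) < 24                                    # low-abundance proteins
X[np.ix_(low, range(6))] = np.where(rng.random((low.sum(), 6)) < 0.5,
                                    np.nan, X[np.ix_(low, range(6))])  # MNAR in group 1

# crude split: a missing value is "MNAR" if most of its treatment group is also missing
group = np.array([0] * 6 + [1] * 6)
mnar_mask = np.zeros_like(X, dtype=bool)
for g in (0, 1):
    cols = group == g
    mostly_missing = np.isnan(X[:, cols]).mean(axis=1) > 0.5
    mnar_mask[:, cols] |= np.isnan(X[:, cols]) & mostly_missing[:, None]
mar_mask = np.isnan(X) & ~mnar_mask

# MAR part: kNN imputation using the observed correlation structure
X_knn = KNNImputer(n_neighbors=5).fit_transform(X)
X_imp = X.copy()
X_imp[mar_mask] = X_knn[mar_mask]

# MNAR part: left-censored draws (down-shifted, narrowed normal per sample)
mu, sd = np.nanmean(X, axis=0), np.nanstd(X, axis=0)
rows, cols = np.where(mnar_mask)
X_imp[rows, cols] = rng.normal(mu[cols] - 1.8 * sd[cols], 0.3 * sd[cols])
print("remaining NaNs:", np.isnan(X_imp).sum())
```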

10.
Lack of knowledge of the first principles that describe the behavior of processed particulate mixtures has created significant attention to data-driven models for characterizing the performance of pharmaceutical processes, which are often treated as black-box operations. Uncertainty contained in the experimental data sets, however, can decrease the quality of the produced predictive models. In this work, the effect of missing and noisy data on the predictive capability of surrogate modeling methodologies such as Kriging and the Response Surface Method (RSM) is evaluated. The key areas that affect the final error of prediction and the computational efficiency of the algorithm were found to be: (a) the method used to assign initial estimate values to the missing elements and (b) the iterative procedure used to further improve these initial estimates. The proposed approach combines the most appropriate initialization technique with the Expectation Maximization Principal Component Analysis (EM-PCA) algorithm to impute missing elements and minimize noise. Comparative analysis of different initial imputation techniques, such as the mean, a matching procedure, and a Kriging-based approach, shows that the former two approaches give more accurate, "warm-start" estimates of the missing data points that can significantly reduce computational time requirements. Experimental data from two case studies of different unit operations of the pharmaceutical powder tablet production process (feeding and mixing) are used as examples to illustrate the performance of the proposed methodology. Results show that by introducing an extra imputation step, the pseudo-complete data sets created produce very accurate predictive responses, whereas discarding incomplete observations leads to loss of valuable information and distortion of the predictive response. Results are also given for different percentages of missing data and different missing patterns. © 2010 American Institute of Chemical Engineers AIChE J, 2010
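As a simplified stand-in for the warm-start-plus-EM-PCA procedure described above, the Python sketch below seeds the missing cells with column means and then alternates between fitting a low-rank PCA model and replacing the missing cells with its reconstruction. The rank, tolerances and toy data are assumptions.

```python
# Hedged sketch: iterative low-rank PCA imputation with a cheap warm start
# (column means here; the paper also examines matching- and Kriging-based starts).
import numpy as np

def iterative_pca_impute(X, rank=2, n_iter=100, tol=1e-7):
    miss = np.isnan(X)
    X_imp = X.copy()
    X_imp[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])   # warm start
    for _ in range(n_iter):
        mu = X_imp.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_imp - mu, full_matrices=False)
        recon = mu + (U[:, :rank] * s[:rank]) @ Vt[:rank]              # rank-r reconstruction
        delta = np.max(np.abs(recon[miss] - X_imp[miss]))
        X_imp[miss] = recon[miss]                                      # refine missing cells
        if delta < tol:
            break
    return X_imp

# toy usage: a noisy rank-2 response matrix with 15% of the entries missing
rng = np.random.default_rng(6)
X_true = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 8)) + 0.05 * rng.normal(size=(60, 8))
X_obs = X_true.copy()
X_obs[rng.random(X_obs.shape) < 0.15] = np.nan
X_hat = iterative_pca_impute(X_obs, rank=2)
print("max abs error on imputed cells:",
      np.round(np.max(np.abs(X_hat[np.isnan(X_obs)] - X_true[np.isnan(X_obs)])), 3))
```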

11.
Basic aspects in the handling of fatty acid-data have remained largely underexposed. Of these, we aimed to address three statistical methodological issues, by quantitatively exemplifying their imminent confounding impact on analytical outcomes: (1) presenting results as relative percentages or absolute concentrations, (2) handling of missing/non-detectable values, and (3) using structural indices for data-reduction. Therefore, we reanalyzed an example dataset containing erythrocyte fatty acid-concentrations of 137 recurrently depressed patients and 73 controls. First, correlations between data presented as percentages and concentrations varied for different fatty acids, depending on their correlation with the total fatty acid-concentration. Second, multiple imputation of non-detects resulted in differences in significance compared to zero-substitution or omission of non-detects. Third, patients’ chain length-, unsaturation-, and peroxidation-indices were significantly lower compared to controls, which corresponded with patterns interpreted from individual fatty acid tests. In conclusion, results from our example dataset show that statistical methodological choices can have a significant influence on outcomes of fatty acid analysis, which emphasizes the relevance of: (1) hypothesis-based fatty acid-presentation (percentages or concentrations), (2) multiple imputation, preventing bias introduced by non-detects; and (3) the possibility of using (structural) indices, to delineate fatty acid-patterns thereby preventing multiple testing.

12.
Mendelian randomisation analyses use genetic variants as instrumental variables (IVs) to estimate causal effects of modifiable risk factors on disease outcomes. Genetic variants typically explain a small proportion of the variability in risk factors; hence Mendelian randomisation analyses can require large sample sizes. However, an increasing number of genetic variants have been found to be robustly associated with disease-related outcomes in genome-wide association studies. Use of multiple instruments can improve the precision of IV estimates, and also permit examination of underlying IV assumptions. We discuss the use of multiple genetic variants in Mendelian randomisation analyses with continuous outcome variables where all relationships are assumed to be linear. We describe possible violations of IV assumptions, and how multiple instrument analyses can be used to identify them. We present an example using four adiposity-associated genetic variants as IVs for the causal effect of fat mass on bone density, using data on 5509 children enrolled in the ALSPAC birth cohort study. We also use simulation studies to examine the effect of different sets of IVs on precision and bias. When each instrument independently explains variability in the risk factor, use of multiple instruments increases the precision of IV estimates. However, inclusion of weak instruments could increase finite sample bias. Missing data on multiple genetic variants can diminish the available sample size, compared with single instrument analyses. In simulations with additive genotype-risk factor effects, IV estimates using a weighted allele score had similar properties to estimates using multiple instruments. Under the correct conditions, multiple instrument analyses are a promising approach for Mendelian randomisation studies. Further research is required into multiple imputation methods to address missing data issues in IV estimation.
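A hedged Python sketch of instrumental-variable estimation with multiple genetic variants follows: a two-stage least-squares estimate using all variants jointly, and the equivalent analysis using a weighted allele score. The simulated genotypes, effect sizes and weights are illustrative and do not reproduce the ALSPAC example.

```python
# Hedged sketch: 2SLS with multiple genetic instruments vs. a weighted allele score.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n, n_snps = 5000, 4
G = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)   # genotypes coded 0/1/2
u = rng.normal(size=n)                                      # unmeasured confounder
beta_g = np.array([0.15, 0.10, 0.08, 0.05])                 # per-allele effects on exposure
x = G @ beta_g + u + rng.normal(size=n)                     # exposure (e.g. fat mass)
y = 0.4 * x - u + rng.normal(size=n)                        # outcome (e.g. bone density)

# two-stage least squares with all instruments
stage1 = LinearRegression().fit(G, x)
x_hat = stage1.predict(G)
tsls = LinearRegression().fit(x_hat[:, None], y).coef_[0]

# weighted allele score: weight each variant by its per-allele effect on the exposure
# (known here by construction; in practice the weights would come from external data)
score = G @ beta_g
s_hat = LinearRegression().fit(score[:, None], x).predict(score[:, None])
score_iv = LinearRegression().fit(s_hat[:, None], y).coef_[0]

print(f"2SLS estimate: {tsls:.3f}   allele-score IV estimate: {score_iv:.3f}   truth: 0.400")
```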

13.
In modern industrial processes, timely detection and diagnosis of process abnormalities are critical for monitoring process operations. Various fault detection and diagnosis (FDD) methods have been proposed and implemented, the performance of which, however, could be drastically influenced by the common presence of incomplete or missing data in real industrial scenarios. This paper presents a new FDD approach based on an incomplete data imputation technique for process fault recognition. It employs the modified stacked autoencoder, a deep learning structure, in the phase of incomplete data treatment, and classifies data representations rather than the imputed complete data in the phase of fault identification. A benchmark process, the Tennessee Eastman process, is employed to illustrate the effectiveness and applicability of the proposed method.

15.
This work concerns the identification of nonlinear processes in the presence of multiple noise-corrupted and correlated scheduling variables with missing data. The dynamics of the hidden scheduling variables are represented by a state-space model with unknown parameters. To assure generality, it is assumed that the multiple correlated scheduling variables are corrupted with unknown disturbances and that the identification dataset is incomplete with missing data. A multiple-model approach is proposed to formulate the identification problem of nonlinear systems under the framework of the expectation-maximization algorithm. The parameters of the local process models and scheduling variable models, as well as the hyperparameters of the weighting function, are simultaneously estimated. The particle smoothing technique is adopted to handle the computation of expectation functions. The efficiency of the proposed method is demonstrated through several simulated examples. Through an experimental study on a pilot-scale multitank system, the practical advantages are further illustrated. © 2015 American Institute of Chemical Engineers AIChE J, 61: 3270–3287, 2015

16.
王幼琴  赵忠盖  刘飞 《化工学报》2016,67(3):931-939
Linear parameter-varying (LPV) systems recast the modelling of multi-stage, nonlinear processes as the identification of multiple linear local models, and have attracted considerable attention in recent years. This work considers the offline modelling of LPV systems with missing data. First, a binary variable is introduced to indicate whether each output sample is missing, key process variables are selected as scheduling variables, and the main operating points are determined. Local sub-models are then built around the different operating points; the missing output values and the model membership of each sample are treated as hidden variables, the parameters are estimated with the EM algorithm, and the sub-models are fused with Gaussian weighting functions. Finally, simulation experiments on a typical second-order process and a continuous stirred-tank reactor (CSTR) demonstrate the effectiveness of the proposed multi-model approach and algorithm.

17.
In the present work, we consider the problem of variable duration economic model predictive control of batch processes subject to multi-rate and missing data. To this end, we first generalize a recently developed subspace-based model identification approach for batch processes to handle multi-rate and missing data by utilizing the incremental singular value decomposition technique. Exploiting the fact that the proposed identification approach is capable of handling inconsistent batch lengths, the resulting dynamic model is integrated into a tiered EMPC formulation that optimizes process economics (including batch duration). Simulation case studies involving application to the energy intensive electric arc furnace process demonstrate the efficacy of the proposed approach compared to a traditional trajectory tracking approach subject to limited availability of process measurements, missing data, measurement noise, and constraints. © 2017 American Institute of Chemical Engineers AIChE J, 63: 2705–2718, 2017

18.
Multivariate meta-analysis is increasingly utilised in biomedical research to combine data from multiple comparative clinical studies for evaluating drug efficacy and safety profiles. When the probability of the event of interest is rare, or when the individual study sample sizes are small, a substantial proportion of studies may not have any event of interest. Conventional meta-analysis methods either exclude such studies or include them through an ad hoc continuity correction that adds an arbitrary positive value to each cell of the corresponding 2 × 2 tables, which may result in less accurate conclusions. Furthermore, different continuity corrections may result in inconsistent conclusions. In this article, we discuss a bivariate Beta-binomial model derived from the Sarmanov family of bivariate distributions and a bivariate generalised linear mixed effects model for binary clustered data to make valid inferences. These bivariate random effects models use all available data without ad hoc continuity corrections, and naturally account for the potential correlation between treatment (or exposure) and control groups within studies. We then utilise the bivariate random effects models to reanalyse two recent meta-analysis data sets.

19.
Process data suffer from many different types of imperfections, for example, bad data due to sensor problems, multi-rate sampling, outliers, and compressed data. Since most modelling and data analysis methods are developed to analyze regularly sampled and well-conditioned data sets, there is a need for pre-treatment of the data. Traditionally, data conditioning or pre-treatment has been done without taking into account the end use of the data; for example, univariate methods have been used to interpolate bad data even when the intended end use of the data is multivariate analysis. In this paper we consider pre-treatment and data analysis as a collective problem and propose data conditioning methods in a multivariate framework. We first review classical process data analysis methods and well-established missing data handling techniques used in statistical surveys and biostatistics. The application of these missing data techniques is demonstrated in three different instances: (i) principal components analysis (PCA) is extended in a data augmentation (DA) framework for dealing with missing values, (ii) an iterative missing data technique is used to synchronize uneven-length batch process data, and (iii) a PCA-based iterative missing data technique is used to restore the correlation structure of compressed data.

20.
Erroneous information from sensors affects process monitoring and control. An algorithm with multiple model identification methods will improve the sensitivity and accuracy of sensor fault detection and data reconciliation (SFD&DR). A novel SFD&DR algorithm with four types of models, including an outlier-robust Kalman filter, locally weighted partial least squares, predictor-based subspace identification, and approximate linear dependency-based kernel recursive least squares, is proposed. The residuals are further analyzed by artificial neural networks and a voting algorithm. The performance of the SFD&DR algorithm is illustrated by clinical data from artificial pancreas experiments with people with diabetes. The glucose-insulin metabolism has time-varying parameters and nonlinearities, providing a challenging system for fault detection and data reconciliation. Data from 17 clinical experiments collected over 896 h were analyzed; the results indicate that the proposed SFD&DR algorithm is capable of detecting and diagnosing sensor faults and reconciling the erroneous sensor signals with better model-estimated values. © 2018 American Institute of Chemical Engineers AIChE J, 65: 629–639, 2019
