Similar Documents
1.
Multiple imputation under the multivariate normality assumption has often been considered a workable model-based approach in dealing with incomplete continuous data. A situation where the measurements are taken on a continuous scale with an eventual interest in ordinalized versions via a threshold concept is commonly encountered in applied research, especially in the medical and social sciences. In practice, researchers ordinarily impute missing values for continuous outcomes under a Gaussian imputation model, and then ordinalize them via pre-specified cutoff points. An alternative strategy is to create multiply imputed data sets after ordinalization, under a log-linear imputation model that uses a saturated multinomial structure. In this work, the performances of the two imputation methods were examined on a fairly broad range of simulated incomplete data sets that exhibit varying distributional characteristics such as skewness and multimodality. Efficiency and accuracy measures were investigated to determine how well each procedure works. The conclusion drawn is that ordinalization before carrying out a log-linear imputation should be the preferred procedure except for a few special cases. It is recommended that researchers use the less common second strategy whenever the interest centers on ordinal quantities that are obtained from underlying continuous measurements. This advantage is probably due to the transformation of non-Gaussian features into better-behaved categorical trends in this particular missing-data environment. This finding outweighs the argument that continuous variables intrinsically convey more information, leading to a counter-intuitive but potentially beneficial result for practitioners.
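As a concrete illustration of the two strategies being compared, the following sketch imputes a continuous variable under a simple Gaussian model and then ordinalizes it with fixed cutoffs, versus ordinalizing first and drawing imputations from the observed category frequencies. The toy data, cutoffs and single-draw imputation are illustrative assumptions, not the simulation design of the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a skewed continuous outcome with ~20% values missing at random.
y = rng.gamma(shape=2.0, scale=1.5, size=500)
miss = rng.random(500) < 0.2
y_obs = np.where(miss, np.nan, y)
cutoffs = [1.5, 3.0, 5.0]          # pre-specified thresholds -> 4 ordinal levels

def ordinalize(x, cuts):
    return np.digitize(x, cuts)    # returns levels 0..len(cuts)

# Strategy A: Gaussian imputation of the continuous values, then ordinalize.
mu, sigma = np.nanmean(y_obs), np.nanstd(y_obs)
y_a = y_obs.copy()
y_a[miss] = rng.normal(mu, sigma, miss.sum())      # draw from fitted normal
ord_a = ordinalize(y_a, cutoffs)

# Strategy B: ordinalize first, then impute categories from the observed
# (saturated multinomial) category frequencies.
ord_obs = ordinalize(y_obs, cutoffs).astype(float)
ord_obs[miss] = np.nan
levels, counts = np.unique(ord_obs[~miss], return_counts=True)
probs = counts / counts.sum()
ord_b = ord_obs.copy()
ord_b[miss] = rng.choice(levels, size=miss.sum(), p=probs)

truth = ordinalize(y, cutoffs)
print("accuracy, impute-then-ordinalize:", np.mean(ord_a[miss] == truth[miss]))
print("accuracy, ordinalize-then-impute:", np.mean(ord_b[miss] == truth[miss]))
```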

2.
Generalized linear mixed models are popular for regressing a discrete response when there is clustering, e.g. in longitudinal studies or in hierarchical data structures. It is standard to assume that the random effects have a normal distribution. Recently, it has been examined whether wrongly assuming a normal distribution for the random effects is important for the estimation of the fixed-effects parameters. While it has been shown that misspecifying the distribution of the random effects has a minor effect in the context of linear mixed models, the conclusion for generalized mixed models is less clear. Some studies report a minor impact, while others report that the assumption of normality really matters, especially when the variance of the random effect is relatively high. Since it is unclear whether the normality assumption is truly satisfied in practice, it is important that generalized mixed models are available which relax the normality assumption. A replacement of the normal distribution is proposed: a mixture of Gaussian distributions specified on a grid, in which only the weights of the mixture components are estimated, using a penalized approach that ensures a smooth distribution for the random effects. The parameters of the model are estimated in a Bayesian context using MCMC techniques. The usefulness of the approach is illustrated on two longitudinal studies using R-functions.

3.
Normality is one of the most common assumptions made in the development of statistical models such as the fixed effect model and the random effect model. White and MacDonald [1980. Some large-sample tests for normality in the linear regression model. JASA 75, 16-18] and Bonett and Woodward [1990. Testing residual normality in the ANOVA model. J. Appl. Statist. 17, 383-387] showed that many tests of normality perform well when applied to the residuals of a fixed effect model. The elements of the error vector are not independent in random effects models, and standard tests of normality are not expected to perform properly when applied to the residuals of a random effects model. In this paper, we propose a transformation method to convert the correlated error vector into an uncorrelated vector. Moreover, under the normality assumption, the uncorrelated vector becomes an independent vector. Thus, all the existing methods can then be implemented. Monte-Carlo simulations are used to evaluate the feasibility of the transformation. Results show that this transformation method can preserve the Type I error rate and provide greater power under most alternatives.
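A minimal sketch of this kind of decorrelating transformation: Cholesky whitening of the residual vector by the marginal covariance implied by a one-way random-intercept model (treated as known here), after which a standard normality test can be applied. The simulated data and the use of the Shapiro-Wilk test are illustrative assumptions, not necessarily the authors' exact construction.

```python
import numpy as np
from scipy import stats, linalg

rng = np.random.default_rng(1)

# One-way random effects model: y_ij = mu + a_i + e_ij with k groups of size n.
k, n = 30, 5
sigma2_a, sigma2_e = 2.0, 1.0
a = rng.normal(0.0, np.sqrt(sigma2_a), size=k)
y = 10.0 + np.repeat(a, n) + rng.normal(0.0, np.sqrt(sigma2_e), size=k * n)

# Marginal covariance of the error vector: block-diagonal, each block
# sigma2_a * J_n + sigma2_e * I_n (taken as known here for simplicity).
block = sigma2_a * np.ones((n, n)) + sigma2_e * np.eye(n)
V = linalg.block_diag(*([block] * k))

# Residuals from the overall mean, then Cholesky whitening:
# if r ~ N(0, V) and V = L L', then L^{-1} r has uncorrelated components.
r = y - y.mean()
L = np.linalg.cholesky(V)
u = np.linalg.solve(L, r)

# A standard normality test can now be applied to the transformed vector.
print("Shapiro-Wilk on raw residuals:   p =", stats.shapiro(r).pvalue)
print("Shapiro-Wilk on whitened vector: p =", stats.shapiro(u).pvalue)
```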

4.
Missingness frequently complicates the analysis of longitudinal data. A popular solution for dealing with incomplete longitudinal data is the use of likelihood-based methods, when, for example, linear, generalized linear, or non-linear mixed models are considered, due to their validity under the assumption of missing at random (MAR). Semi-parametric methods such as generalized estimating equations (GEEs) offer another attractive approach but require the assumption of missing completely at random (MCAR). Weighted GEE (WGEE) has been proposed as an elegant way to ensure validity under MAR. Alternatively, multiple imputation (MI) can be used to pre-process incomplete data, after which GEE is applied (MI-GEE). Focusing on incomplete binary repeated measures, both methods are compared using so-called asymptotic as well as small-sample simulations, in a variety of correctly specified and incorrectly specified models. In spite of the asymptotic unbiasedness of WGEE, results provide striking evidence that MI-GEE is both less biased and more accurate in the small to moderate sample sizes which typically arise in clinical trials.
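A minimal sketch of the MI-GEE idea for incomplete binary repeated measures, under assumptions: a toy data set, a crude visit-specific Bernoulli imputation, M imputed data sets, a GEE fit with an exchangeable working correlation via statsmodels, and simple averaging of the coefficients. It is not the simulation set-up of the paper, and WGEE is not shown.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Toy longitudinal binary data: 200 subjects, 4 visits, treatment effect on the logit.
n_sub, n_vis = 200, 4
trt = rng.integers(0, 2, n_sub)
rows = []
for i in range(n_sub):
    for t in range(n_vis):
        p = 1 / (1 + np.exp(-(-0.5 + 0.8 * trt[i] + 0.2 * t)))
        rows.append((i, t, trt[i], rng.binomial(1, p)))
df = pd.DataFrame(rows, columns=["id", "time", "trt", "y"])

# Make ~25% of the post-baseline responses missing.
drop = (df["time"] > 0) & (rng.random(len(df)) < 0.25)
df.loc[drop, "y"] = np.nan

def impute_once(d):
    """Draw each missing y from a Bernoulli with the observed rate at that visit."""
    d = d.copy()
    rates = d.groupby("time")["y"].mean()
    miss = d["y"].isna()
    d.loc[miss, "y"] = rng.binomial(1, rates.loc[d.loc[miss, "time"]].to_numpy())
    return d

# MI-GEE: impute M times, fit a GEE with an exchangeable working correlation,
# then pool the coefficient estimates (simple averaging shown here).
M, coefs = 10, []
for _ in range(M):
    di = impute_once(df)
    X = sm.add_constant(di[["trt", "time"]])
    model = sm.GEE(di["y"], X, groups=di["id"],
                   family=sm.families.Binomial(),
                   cov_struct=sm.cov_struct.Exchangeable())
    coefs.append(model.fit().params)
print(pd.concat(coefs, axis=1).mean(axis=1))
```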

5.
The linear discriminant analysis (LDA) is a linear classifier which has proven to be powerful and competitive compared to the main state-of-the-art classifiers. However, the LDA algorithm assumes that the sample vectors of each class are generated from underlying multivariate normal distributions with a common covariance matrix but different means (i.e., homoscedastic data). This assumption has restricted the use of LDA considerably. Over the years, authors have defined several extensions to the basic formulation of LDA. One such method is the heteroscedastic LDA (HLDA), which is proposed to address the heteroscedasticity problem. Another method is the nonparametric DA (NDA), where the normality assumption is relaxed. In this paper, we propose a novel Bayesian logistic discriminant (BLD) model which can address both the normality and heteroscedasticity problems. The normality assumption is relaxed by approximating the underlying distribution of each class with a mixture of Gaussians. Hence, the proposed BLD provides more flexibility and better classification performance than the LDA, HLDA and NDA. Subclass and multinomial versions of the BLD are also proposed. The posterior distribution of the BLD model is elegantly approximated by a tractable Gaussian form using a variational transformation and Jensen's inequality, allowing a straightforward computation of the weights. An extensive comparison of the BLD to the LDA, support vector machine (SVM), HLDA, NDA and subclass discriminant analysis (SDA), performed on artificial and real data sets, has shown the advantages and superiority of our proposed method. In particular, the experiments on face recognition have clearly shown a significant improvement of the proposed BLD over the LDA.

6.
Most existing imputation methods for incomplete data are limited to a single type of missing variable and are comparatively weak on large-scale data. To handle missing values of mixed types in real big data, this paper proposes a new model, SXGBI (Spark-based eXtreme Gradient Boosting Imputation), which imputes incomplete data containing both continuous and categorical missing variables and generalizes to fast processing of large data sets. By extending the ensemble learning method XGBoost, it combines several imputation algorithms into a single ensemble learner and is parallelized on the Spark distributed computing framework, so that it runs well on Spark clusters. Experiments show that, as the missing rate grows, SXGBI achieves better imputation results than the other methods tested on the RMSE, PFC and F1 metrics, and that it can be applied effectively to large-scale data sets.
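The single-machine core of such a gradient-boosting imputer might look as follows. This is a simplified illustration rather than the SXGBI implementation: it loops over incomplete columns, trains an XGBoost regressor or classifier on rows where that column is observed, and predicts the missing entries; the Spark parallelization and the combination of several base imputers are omitted.

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier, XGBRegressor

def boosted_impute(df: pd.DataFrame, cat_cols: set) -> pd.DataFrame:
    """Impute each incomplete column from the other columns with gradient boosting."""
    out = df.copy()
    # Crude pre-fill so the predictor columns contain no NaN.
    pre = out.fillna(out.median(numeric_only=True))
    for col in out.columns[out.isna().any()]:
        miss = out[col].isna()
        X_train = pre.loc[~miss].drop(columns=[col])
        X_pred = pre.loc[miss].drop(columns=[col])
        if col in cat_cols:
            model = XGBClassifier(n_estimators=200, max_depth=4)
            y_train = out.loc[~miss, col].astype(int)   # integer class labels
        else:
            model = XGBRegressor(n_estimators=200, max_depth=4)
            y_train = out.loc[~miss, col]
        model.fit(X_train, y_train)
        out.loc[miss, col] = model.predict(X_pred)
    return out

# Toy usage: two continuous columns and one integer-coded categorical column.
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=300),
                   "x2": rng.normal(size=300),
                   "c": rng.integers(0, 3, size=300).astype(float)})
df["x2"] += df["x1"]                       # give the imputer something to learn
for col in df.columns:
    df.loc[rng.random(300) < 0.15, col] = np.nan
print(boosted_impute(df, cat_cols={"c"}).isna().sum())
```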

7.
A very common problem when building software engineering models is dealing with missing data. To address this, a range of imputation techniques exists. However, selecting the appropriate imputation technique can also be a difficult problem. One reason for this is that these techniques make assumptions about the underlying missingness mechanism, that is, how the missing values are distributed within the data set. The problem is compounded by the fact that, for small data sets, it may be very difficult to determine what the missingness mechanism is. This means there is a danger of using an inappropriate imputation technique. Therefore, it is necessary to determine the safest default assumption about the missingness mechanism for imputation techniques when dealing with small data sets. We examine experimentally two simple and commonly used techniques, Class Mean Imputation (CMI) and k Nearest Neighbors (k-NN), coupled with two missingness mechanisms, missing completely at random (MCAR) and missing at random (MAR). We draw two conclusions. First, for our analysis, CMI is the preferred technique since it is more accurate. Second, and more importantly, the impact of the missingness mechanism on imputation accuracy is not statistically significant. This is a useful finding since it suggests that even for small data sets we can reasonably make the weaker assumption that the missingness mechanism is MAR. Thus both imputation techniques have practical application for small software engineering data sets with missing values.
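Both techniques are simple to express; a minimal sketch under assumptions (sklearn's KNNImputer stands in for k-NN imputation, class mean imputation fills a missing value with the mean of its class, and the toy column names are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(4)

# Toy software-engineering data set: size/effort metrics plus a project class.
df = pd.DataFrame({
    "cls": rng.integers(0, 3, 60),
    "loc": rng.lognormal(8, 1, 60),
    "effort": rng.lognormal(6, 1, 60),
})
df.loc[rng.random(60) < 0.2, "effort"] = np.nan      # inject missing values

# Class Mean Imputation: replace a missing value with the mean of its class.
cmi = df.copy()
cmi["effort"] = cmi.groupby("cls")["effort"].transform(lambda s: s.fillna(s.mean()))

# k-NN imputation: fill from the k most similar cases.
knn = pd.DataFrame(KNNImputer(n_neighbors=3).fit_transform(df), columns=df.columns)

print("still missing after CMI :", cmi["effort"].isna().sum())
print("still missing after k-NN:", knn["effort"].isna().sum())
```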

8.
During drug development, nonlinear mixed effects models are routinely used to study a drug's pharmacokinetics and pharmacodynamics. The distribution of the random effects is of special interest because it describes the heterogeneity of the drug's kinetics or dynamics in the population of individuals studied. Parametric models are widely used, but they rely on a normality assumption which may be too restrictive. In practice, this assumption is often checked using the empirical distribution of the random effects' empirical Bayes estimates. Unfortunately, when data are sparse (as in phase III clinical trials), this method is unreliable. In this context, nonparametric estimators of the random effects distribution are attractive. Several nonparametric methods (estimators and their associated computation algorithms) have been proposed, but their use is limited. Indeed, their practical and theoretical properties are unclear and they have a reputation for being computationally expensive. Four nonparametric methods are evaluated in comparison with the usual parametric method. Statistical and computational features are reviewed and practical performances are compared in simulation studies mimicking real pharmacokinetic analyses. The nonparametric methods seemed very useful when data are sparse. On a simple pharmacokinetic model, all the nonparametric methods performed roughly equivalently. On a more challenging pharmacokinetic model, differences between the methods were clearer.

9.
张安珍  李建中  高宏 《软件学报》2020,31(2):406-420
This paper studies aggregate query processing over incomplete data based on symbolic semantics. Incomplete data is also known as missing data, and missing values fall into two types: imputable and non-imputable. Existing imputation algorithms cannot guarantee the accuracy of query results computed after imputation, so this paper derives interval estimates of aggregate query results over incomplete data. The traditional relational database model is extended with symbolic semantics into a general incomplete-database model that can handle both imputable and non-imputable missing values. Under this model, a new semantics for aggregate query results over incomplete data is proposed: reliable results. A reliable result is an interval estimate of the true query result which guarantees, with high probability, that the true result lies within the estimated interval. Linear-time algorithms are given for computing reliable results of SUM, COUNT and AVG queries. Extensive experiments on real and synthetic data sets verify the effectiveness of the proposed methods.
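The flavor of such interval-valued answers can be sketched in a few lines with a simplified, purely deterministic bounding: observed values are exact and each missing value is only known to lie in a plausible [lo, hi] range. This is an illustration of the idea, not the paper's probabilistic "reliable result" estimator.

```python
# Simplified interval semantics for aggregates over incomplete data:
# observed values are exact, each missing value (None) is only known to
# lie in a plausible [lo, hi] range.

def reliable_sum(values, lo, hi):
    s_lo = sum(v if v is not None else lo for v in values)
    s_hi = sum(v if v is not None else hi for v in values)
    return s_lo, s_hi

def reliable_count(values):
    observed = sum(v is not None for v in values)
    return observed, len(values)        # missing rows may or may not qualify

def reliable_avg(values, lo, hi):
    s_lo, s_hi = reliable_sum(values, lo, hi)
    return s_lo / len(values), s_hi / len(values)

data = [3.0, None, 7.5, 2.0, None, 4.0]
print(reliable_sum(data, lo=0.0, hi=10.0))    # (16.5, 36.5)
print(reliable_count(data))                   # (4, 6)
print(reliable_avg(data, lo=0.0, hi=10.0))
```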

10.
A simulation study is performed to investigate the robustness of the maximum likelihood estimator of fixed effects from a linear mixed model when the error distribution is misspecified. Inference for the fixed effects under the assumption of independent normally distributed errors with constant variance is shown to be robust when the errors are either non-Gaussian or heteroscedastic, except when the error variance depends on a covariate that is included in the model in interaction with time. Inference is impaired when the errors are correlated. In the latter case, the model including a random slope in addition to the random intercept is more robust than the random intercept model. The use of Cholesky residuals and conditional residuals to evaluate the fit of a linear mixed model is also discussed.

11.
Linear mixed-effects models involve fixed effects, random effects and covariance structures, which require model selection to simplify a model and to enhance its interpretability and predictability. In this article, we develop, in the context of linear mixed-effects models, the generalized degrees of freedom and an adaptive model selection procedure defined by a data-driven model complexity penalty. Numerically, the procedure performs well against its competitors not only in selecting fixed effects but in selecting random effects and covariance structure as well. Theoretically, asymptotic optimality of the proposed methodology is established over a class of information criteria. The proposed methodology is applied to the BioCycle Study, to determine predictors of hormone levels among premenopausal women and to assess variation in hormone levels both between and within women across the menstrual cycle.

12.
Typically, the fundamental assumption in non-linear regression models is the normality of the errors. Even though this model offers great flexibility for modeling these effects, it suffers from the same lack of robustness against departures from distributional assumptions as other statistical models based on the Gaussian distribution. It is of practical interest, therefore, to study non-linear models which are less sensitive to departures from normality, as well as related assumptions. Thus the current methods proposed for linear regression models need to be extended to non-linear regression models. This paper discusses non-linear regression models for longitudinal data with errors that follow a skew-elliptical distribution. Additionally, we discuss Bayesian statistical methods for the classification of observations into two or more groups based on skew-models for non-linear longitudinal profiles. Parameter estimation for a discriminant model that classifies individuals into distinct predefined groups or populations uses appropriate posterior simulation schemes. The methods are illustrated with data from a study involving 173 pregnant women. The main objective in this study is to predict normal versus abnormal pregnancy outcomes from beta human chorionic gonadotropin data available at early stages of pregnancy.

13.
A corrective imputation method for incomplete data under constructive covering
Handling incomplete data is an important problem in data mining, machine learning and related fields, and missing-value imputation is the mainstream way to deal with it. Most existing imputation methods apply techniques from statistics and machine learning to analyze the remaining information in the original data and derive reasonable values to replace the missing parts. Imputation methods can be broadly divided into single imputation and multiple imputation, each with its own advantages in different scenarios. However, few methods go further to exploit neighborhood information in the spatial distribution of the samples and use it to correct the imputed values. Motivated by this, this paper proposes a framework, applicable to many existing imputation methods, for improving their imputation quality; it consists of three parts: pre-imputation, spatial neighborhood information mining, and corrective imputation. Experiments with 7 imputation methods on 8 UCI data sets verify the effectiveness and robustness of the proposed framework.
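A minimal sketch of the three-stage idea (pre-imputation, neighborhood-information mining, corrective imputation), using a column-mean pre-fill and a nearest-neighbor correction; this is an illustrative reconstruction under assumptions, not the constructive-covering algorithm of the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)

# Toy data with correlated columns and ~15% missing entries.
X_true = rng.multivariate_normal([0, 0, 0],
                                 [[1, .7, .5], [.7, 1, .6], [.5, .6, 1]], size=400)
mask = rng.random(X_true.shape) < 0.15
X = np.where(mask, np.nan, X_true)

# Stage 1: pre-imputation with column means.
col_means = np.nanmean(X, axis=0)
X_pre = np.where(np.isnan(X), col_means, X)

# Stage 2: mine neighborhood information in the pre-imputed sample space.
nn = NearestNeighbors(n_neighbors=6).fit(X_pre)
_, idx = nn.kneighbors(X_pre)                 # each row's neighbors (incl. itself)

# Stage 3: corrective imputation - replace each pre-filled entry with the
# average of that feature over the row's neighbors.
X_corr = X_pre.copy()
rows, cols = np.where(mask)
for i, j in zip(rows, cols):
    neighbors = idx[i][1:]                    # drop the row itself
    X_corr[i, j] = X_pre[neighbors, j].mean()

print("MAE, mean pre-fill   :", np.abs(X_pre[mask] - X_true[mask]).mean())
print("MAE, after correction:", np.abs(X_corr[mask] - X_true[mask]).mean())
```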

14.
In real applications, high rates of missing data are often found and have to be preprocessed before the analysis. The literature on missing-data imputation is abundant. However, the most precise imputation methods require a long time and sometimes specific software, which implies a significant delay before final results are available. The Mixed Intelligent-Multivariate Missing Imputation (MIMMI) method is proposed as a hybrid missing-imputation methodology based on clustering. The MIMMI is a non-parametric method that combines prior expert knowledge with multivariate analysis, without requiring assumptions on the probabilistic models of the variables (normality, exponentiality, etc.). The proposed imputation values implicitly take into account the joint distribution of all variables and can be determined in a relatively short time. The MIMMI uses the conditional mean according to the self-underlying structure of the data set. It provides a good trade-off between accuracy and both simplicity and the time required for data preparation. The mechanics of the method are illustrated with several case studies, both synthetic and real applications related to human behaviour. In both cases, results of acceptable quality were obtained in a short time.
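The core of a clustering-based conditional-mean imputation can be sketched as follows: k-means on mean-pre-filled data, then within-cluster means replace the missing entries. The cluster count, pre-fill and simulated data are illustrative assumptions, and the expert-knowledge component of MIMMI is not represented.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)

# Toy data drawn from three shifted groups, with 20% missing entries.
centers = np.array([[0, 0], [4, 4], [8, 0]])
X_true = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])
mask = rng.random(X_true.shape) < 0.2
X = np.where(mask, np.nan, X_true)

# Pre-fill with column means so clustering can run, then cluster.
X_pre = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pre)

# Conditional-mean imputation: fill each missing entry with the mean of its
# own cluster, computed from the values actually observed in that cluster.
X_imp = X.copy()
for k in range(3):
    in_k = labels == k
    cluster_means = np.nanmean(X[in_k], axis=0)
    block = X_imp[in_k]
    block[np.isnan(block)] = np.take(cluster_means, np.where(np.isnan(block))[1])
    X_imp[in_k] = block

print("MAE vs truth:", np.abs(X_imp[mask] - X_true[mask]).mean())
```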

16.
Measurement error models often arise in epidemiological and clinical research. Usually, in this setup it is assumed that the latent variable has a normal distribution. However, the normality assumption may not always be correct. The skew-normal/independent distributions are a class of asymmetric thick-tailed distributions which includes the skew-normal distribution as a special case. In this paper, we explore the use of skew-normal/independent distributions as a robust alternative for the null-intercept measurement error model under a Bayesian paradigm. We assume that the random errors and the unobserved value of the covariate (latent variable) jointly follow a skew-normal/independent distribution, providing an appealing robust alternative to the routine use of the symmetric normal distribution in this type of model. Specific distributions examined include univariate and multivariate versions of the skew-normal distribution, the skew-t distributions, the skew-slash distributions and the skew contaminated normal distributions. The methods developed are illustrated using a real data set from a dental clinical trial.

17.
Multiply imputed data sets can be created with the approximate Bayesian bootstrap (ABB) approach under the assumption of ignorable nonresponse. The theoretical development and inferential validity are predicated upon asymptotic properties, and biases are known to occur in small-to-moderate samples. There have been attempts to reduce the finite-sample bias for the multiple imputation variance estimator. In this note, we present an empirical study evaluating the comparative performance of the two proposed bias-correction techniques and their impact on precision. The results suggest that, to varying degrees, bias improvements are outweighed by efficiency losses for the variance estimator. We argue that the original ABB has better small-sample properties than the modified versions in terms of the integrated behavior of accuracy and precision, as measured by the root mean-square error.
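The ABB step itself is short enough to sketch directly for a single variable with ignorable nonresponse: resample the observed values with replacement to form a donor pool, draw the imputations from that pool, and pool the M completed-data estimates with Rubin's rules. The simulated data and the choice of the mean as the target quantity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def abb_impute(y_obs: np.ndarray, n_missing: int) -> np.ndarray:
    """One approximate-Bayesian-bootstrap draw of the missing values."""
    donors = rng.choice(y_obs, size=y_obs.size, replace=True)   # bootstrap step
    return rng.choice(donors, size=n_missing, replace=True)     # imputation step

# Toy data: 40 observed values, 15 missing, M completed data sets.
y_obs = rng.normal(50, 10, size=40)
n_mis, M = 15, 20

means, variances = [], []
for _ in range(M):
    completed = np.concatenate([y_obs, abb_impute(y_obs, n_mis)])
    means.append(completed.mean())
    variances.append(completed.var(ddof=1) / completed.size)    # within-imputation variance

# Rubin's rules: total variance = within + (1 + 1/M) * between.
qbar = np.mean(means)
within = np.mean(variances)
between = np.var(means, ddof=1)
total = within + (1 + 1 / M) * between
print(f"pooled mean = {qbar:.2f}, pooled variance = {total:.4f}")
```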

18.
Using normal distribution assumptions, one can obtain confidence intervals for variance components in a variety of applications. A normal-based interval, which has exact coverage probability under normality, is usually constructed from a pivot so that the endpoints of the interval depend on the data as well as the distribution of the pivotal quantity. Alternatively, one can employ a point estimation technique to form a large-sample (or approximate) confidence interval. A commonly used approach to estimate variance components is the restricted maximum likelihood (REML) method. The endpoints of a REML-based confidence interval depend on the data and the asymptotic distribution of the REML estimator. In this paper, simulation studies are conducted to evaluate the performance of the normal-based and the REML-based intervals for the intraclass correlation coefficient under non-normal distribution assumptions. Simulated coverage probabilities and expected lengths provide guidance as to which interval procedure is favored for a particular scenario. Estimating the kurtosis of the underlying distribution plays a central role in implementing the REML-based procedure. An empirical example is given to illustrate the usefulness of the REML-based confidence intervals under non-normality.
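For the balanced one-way random effects model, the normal-based interval referred to above can be computed from the ANOVA F statistic using standard textbook (Searle-type) formulas; a minimal sketch follows, with simulated data as an assumption and without the REML-based, kurtosis-adjusted interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Balanced one-way random effects data: k groups, n observations per group.
k, n = 25, 6
sigma2_a, sigma2_e = 1.0, 2.0          # true ICC = 1 / (1 + 2) = 1/3
y = (rng.normal(0, np.sqrt(sigma2_a), size=(k, 1))
     + rng.normal(0, np.sqrt(sigma2_e), size=(k, n)))

# ANOVA mean squares.
grand = y.mean()
msb = n * ((y.mean(axis=1) - grand) ** 2).sum() / (k - 1)
msw = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (k * (n - 1))
F = msb / msw
icc_hat = (msb - msw) / (msb + (n - 1) * msw)

# Exact normal-theory interval from quantiles of the F distribution.
alpha = 0.05
f_u = stats.f.ppf(1 - alpha / 2, k - 1, k * (n - 1))
f_l = stats.f.ppf(alpha / 2, k - 1, k * (n - 1))
lower = (F / f_u - 1) / (F / f_u + n - 1)
upper = (F / f_l - 1) / (F / f_l + n - 1)
print(f"ICC estimate = {icc_hat:.3f}, 95% normal-based CI = ({lower:.3f}, {upper:.3f})")
```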

19.
The handling of missing values is a topic of growing interest in the software quality modeling domain. Data values may be absent from a dataset for numerous reasons, for example, the inability to measure certain attributes. As software engineering datasets are sometimes small in size, discarding observations (or program modules) with incomplete data is usually not desirable. Deleting data from a dataset can result in a significant loss of potentially valuable information. This is especially true when the missing data is located in an attribute that measures the quality of the program module, such as the number of faults observed in the program module during testing and after release. We present a comprehensive experimental analysis of five commonly used imputation techniques. This work also considers three different mechanisms governing the distribution of missing values in a dataset, and examines the impact of noise on the imputation process. To our knowledge, this is the first study to thoroughly evaluate the relationship between data quality and imputation. Further, our work is unique in that it employs a software engineering expert to oversee the evaluation of all of the procedures and to ensure that the results are not inadvertently influenced by poor quality data. Based on a comprehensive set of carefully controlled experiments, we conclude that Bayesian multiple imputation and regression imputation are the most effective techniques, while mean imputation performs extremely poorly. Although a preliminary evaluation has been conducted using Bayesian multiple imputation in the empirical software engineering domain, this is the first work to provide a thorough and detailed analysis of this technique. Our studies also demonstrate conclusively that the presence of noisy data has a dramatic impact on the effectiveness of imputation techniques.
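To make two ends of that comparison concrete, a minimal sketch of mean imputation versus regression imputation for a single incomplete fault-count attribute; the column names and the simulated relationship are illustrative assumptions, and Bayesian multiple imputation is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)

# Toy module-level data: two size metrics and a fault count that depends on them.
n = 300
df = pd.DataFrame({"loc": rng.lognormal(7, 0.5, n), "cc": rng.poisson(12, n)})
df["faults"] = 0.002 * df["loc"] + 0.3 * df["cc"] + rng.normal(0, 2, n)
truth = df["faults"].copy()
df.loc[rng.random(n) < 0.3, "faults"] = np.nan          # 30% missing fault counts
miss = df["faults"].isna()

# Mean imputation: every missing value gets the observed mean.
mean_imp = df["faults"].fillna(df["faults"].mean())

# Regression imputation: predict missing values from the complete attributes.
reg = LinearRegression().fit(df.loc[~miss, ["loc", "cc"]], df.loc[~miss, "faults"])
reg_imp = df["faults"].copy()
reg_imp[miss] = reg.predict(df.loc[miss, ["loc", "cc"]])

print("MAE, mean imputation      :", np.abs(mean_imp[miss] - truth[miss]).mean())
print("MAE, regression imputation:", np.abs(reg_imp[miss] - truth[miss]).mean())
```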

20.
The estimation of the components of variance of the one-way random effects model is considered when the assumption of normality is removed. The asymptotic relative efficiency of the estimators, derived by minimizing the mean squared error, is evaluated. It is found that when the kurtosis is large, the estimators obtained are more efficient than the well-known ANOVA estimators. The methodology is illustrated on the systolic blood pressure data of Miall and Oldham [7].
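For reference, the classical ANOVA (method-of-moments) estimators of the variance components for the balanced one-way random effects model, against which such estimators are usually compared, can be computed in a few lines; the simulated data below are an assumption, not the blood-pressure data of [7].

```python
import numpy as np

rng = np.random.default_rng(10)

# Balanced one-way random effects model: k subjects, n replicate measurements.
k, n = 40, 4
sigma2_a, sigma2_e = 5.0, 3.0
y = (rng.normal(0, np.sqrt(sigma2_a), size=(k, 1))
     + rng.normal(120, np.sqrt(sigma2_e), size=(k, n)))

# ANOVA mean squares and the classical variance-component estimators:
#   sigma2_e_hat = MSW,   sigma2_a_hat = (MSB - MSW) / n.
grand = y.mean()
msb = n * ((y.mean(axis=1) - grand) ** 2).sum() / (k - 1)
msw = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (k * (n - 1))
print("sigma2_e estimate:", msw)
print("sigma2_a estimate:", (msb - msw) / n)
```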
