Similar Literature
 Found 20 similar articles; search took 31 ms
1.
Longitudinal data refer to the situation where repeated observations are available for each sampled object. Clustered data, where observations are nested in a hierarchical structure within objects (without time necessarily being involved), represent a similar type of situation. Methodologies that take this structure into account allow for the possibilities of systematic differences between objects that are not related to attributes and autocorrelation within objects across time periods. A standard methodology in the statistics literature for this type of data is the mixed effects model, where these differences between objects are represented by so-called “random effects” that are estimated from the data (population-level relationships are termed “fixed effects,” together resulting in a mixed effects model). This paper presents a methodology that combines the structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods. We apply the resulting estimation method, called the RE-EM tree, to pricing in online transactions, showing that the RE-EM tree is less sensitive to parametric assumptions and provides improved predictive power compared to linear models with random effects and regression trees without random effects. We also apply it to a smaller data set examining accident fatalities, and show that the RE-EM tree strongly outperforms a tree without random effects while performing comparably to a linear model with random effects. We also perform extensive simulation experiments to show that the estimator improves predictive performance relative to regression trees without random effects and is comparable or superior to using linear models with random effects in more general situations.
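
The alternation between a tree fit and a random effects update that this abstract describes can be sketched roughly as below. This is a minimal illustration under strong assumptions, not the authors' implementation: it uses a random intercept only, a one-split stump in place of a full regression tree, and a fixed constant `shrink` standing in for the residual-to-random-effect variance ratio that a mixed effects fit would estimate. The names `fit_stump` and `re_em_sketch` are hypothetical.

```python
from statistics import mean


def fit_stump(x, y):
    """Fit a one-split regression stump on one feature by least squares.

    Assumes x contains at least two distinct values.
    """
    best = None
    for t in sorted(set(x))[:-1]:  # every threshold that leaves both sides non-empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr


def re_em_sketch(groups, x, y, n_iter=20, shrink=1.0):
    """Alternate a stump fit on y - b[g] with an update of the intercepts b[g]."""
    b = {g: 0.0 for g in set(groups)}
    f = None
    for _ in range(n_iter):
        # Step 1: remove the current random intercepts and fit the tree part.
        adjusted = [yi - b[g] for g, yi in zip(groups, y)]
        f = fit_stump(x, adjusted)
        # Step 2: update each intercept as a shrunken mean residual.
        # sum / (n + shrink) is the random-intercept predictor when
        # shrink equals the (assumed known) variance ratio.
        for g in b:
            resid = [yi - f(xi) for gi, xi, yi in zip(groups, x, y) if gi == g]
            b[g] = sum(resid) / (len(resid) + shrink)
    return f, b
```

A fixed number of iterations is used here for brevity; a fuller version would stop when the intercepts or the likelihood stop changing.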

2.
3.
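This paper studies parameter identification for tree-structured models using the EM algorithm and, exploiting the structural properties of such models, proposes a matrix-form EM algorithm. Building on the original algorithm, a series of matrices describes all the information in the original model, and the iterations of the original algorithm are carried out as matrix operations, yielding estimates of the model parameters. Experimental results show that, compared with the original algorithm, the proposed method has a better form of expression and better algorithmic performance, and its advantages are more pronounced under high-precision and heavy-load conditions.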

4.
Multi-level nonlinear mixed effects (ML-NLME) models have received a great deal of attention in recent years because of the flexibility they offer in handling the repeated-measures data arising from various disciplines. In this study, we propose both maximum likelihood and restricted maximum likelihood estimations of ML-NLME models with two-level random effects, using first-order conditional expansion (FOCE) and the expectation–maximization (EM) algorithm. The FOCE–EM algorithm was compared with the most popular Lindstrom and Bates (LB) method in terms of computational and statistical properties. Basal area growth series data measured from Chinese fir (Cunninghamia lanceolata) experimental stands and simulated data were used for evaluation. The FOCE–EM and LB algorithms gave the same parameter estimates and fit statistics for models that converged under both. However, FOCE–EM converged for all the models, while LB did not, especially for the models in which two-level random effects are simultaneously considered in several base parameters to account for between-group variation. We recommend the use of FOCE–EM in ML-NLME models, particularly when convergence is a concern in model selection.

5.
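Shi Yanwen (师彦文), Wang Hongjie (王宏杰). Computer Science (《计算机科学》), 2017, 44(Z11): 98-101
To classify imbalanced data sets effectively, this paper proposes a classifier that combines cost-sensitive learning with the random forest algorithm. First, a new impurity measure is proposed that accounts not only for the total cost of the decision tree but also for the differences in cost that the same node incurs for different samples. Second, the random forest procedure is run: the data set is sampled K times to build K base classifiers. Then, decision trees are built with the classification and regression tree (CART) algorithm using the proposed impurity measure, forming a forest of decision trees. Finally, the random forest makes its classification decision by voting. In experiments on data sets from the UCI repository, the classifier performs well on three performance measures (classification accuracy, area under the ROC curve (AUC), and the Kappa coefficient) compared with the traditional random forest and existing cost-sensitive random forest classifiers.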
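
Two of the ingredients named in this abstract, an impurity measure that reflects misclassification cost and a final vote over the base classifiers, can be illustrated with a short sketch. The abstract does not define the paper's impurity measure, so `weighted_gini` below is only an assumed stand-in: a Gini impurity in which each class count is scaled by a per-class cost. Both function names and the `cost` dictionary are hypothetical.

```python
from collections import Counter


def weighted_gini(labels, cost):
    """Gini impurity with each class count scaled by its misclassification cost.

    `cost` maps a class label to a positive weight (an assumption of this sketch).
    """
    counts = Counter(labels)
    total = sum(cost[c] * n for c, n in counts.items())
    if total == 0:
        return 0.0
    return 1.0 - sum((cost[c] * n / total) ** 2 for c, n in counts.items())


def majority_vote(predictions):
    """Combine the predictions of the K base classifiers for one sample."""
    return Counter(predictions).most_common(1)[0][0]
```

With equal costs this reduces to the ordinary Gini impurity; raising the cost of the minority class makes a node that contains a few minority samples look less pure, which pushes the tree to keep splitting there.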

6.
Tree models are valuable tools for predictive modeling and data mining. Traditional tree-growing methodologies such as CART are known to suffer from problems including greediness, instability, and bias in split rule selection. Alternative tree methods, including Bayesian CART (Chipman et al., 1998; Denison et al., 1998), random forests (Breiman, 2001a), bootstrap bumping (Tibshirani and Knight, 1999), QUEST (Loh and Shih, 1997), and CRUISE (Kim and Loh, 2001), have been proposed to resolve these issues from various aspects, but each has its own drawbacks. Gray and Fan (2003) described a genetic algorithm approach to constructing decision trees called tree analysis with randomly generated and evolved trees (TARGET) that performs a better search of the tree model space and largely resolves the problems with current tree modeling techniques. Utilizing the Bayesian information criterion (BIC), Fan and Gray (2005) developed a version of TARGET for regression tree analysis. In this article, we consider the construction of classification trees using TARGET. We modify the BIC to handle a categorical response variable, but we also adjust its penalty component to better account for the model complexity of TARGET. We also incorporate the option of splitting rules based on linear combinations of two or three variables in TARGET, which greatly improves the prediction accuracy of TARGET trees. Comparisons of TARGET to existing methods, using simulated and real data sets, indicate that TARGET has advantages over these other approaches.

7.
Accuracy is a critical factor in predictive modeling. A predictive model such as a decision tree must be accurate to draw conclusions about the system being modeled. This research aims at analyzing and improving the performance of classification and regression trees (CART), a decision tree algorithm, by evaluating and deriving a new methodology based on the performance of real-world data sets that were studied. This paper introduces a new approach to tree induction to improve the efficiency of the CART algorithm by combining the existing functionality of CART with the addition of artificial neural networks (ANNs). Trained ANNs are utilized by the tree induction algorithm by generating new, synthetic data, which have been shown to improve the overall accuracy of the decision tree model when actual training samples are limited. In this paper, traditional decision trees developed by the standard CART methodology are compared with the enhanced decision trees that utilize the ANN’s synthetic data generation, or CART+. This research demonstrates the improved accuracies that can be obtained with CART+, which can ultimately improve the knowledge that can be extracted by researchers about a system being modeled.

8.
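Privacy protection in machine learning is currently one of the most active research topics in information security. For classification under privacy protection, this paper proposes an AdaBoost ensemble classification algorithm based on differential privacy: CART-DPsAdaBoost (CART-Differential Privacy structure of AdaBoost). The algorithm incorporates the basic idea of Bagging into the Boosting process to increase sample diversity. During feature perturbation based on the random subspace method, it uses the exponential mechanism to select split points for continuous features and the Gini index to select the best discrete feature, builds CART boosted trees as the base classifiers of the ensemble, and adds noise according to the Laplace mechanism. The privacy budget is allocated across the whole algorithm so as to satisfy the differential privacy requirement. The experiments analyze how the privacy level affects the ensemble classification model at different tree depths and derive the optimal tree depth and privacy budget range. Unlike comparable algorithms, the method requires no discretization preprocessing of the data. Experimental results on the Adult and Census Income data sets show that the model achieves good classification accuracy while balancing privacy and utility. In addition, introducing two randomization schemes, sample perturbation and feature perturbation, allows the method to handle large-scale, high-dimensional classification problems effectively.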
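
The Laplace mechanism mentioned in this abstract can be illustrated on a single count, such as the number of samples of one class in a leaf. The sketch below is generic and makes no claim about how the paper splits its privacy budget across trees or depths; the function names, the default sensitivity of 1, and the `rng` parameter are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

A smaller `epsilon` gives stronger privacy but noisier counts, which is the trade-off between privacy and utility that the abstract refers to.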

9.
An extension of the Cox proportional hazards model for clustered survival data is proposed. This allows both general random effects (frailties) and time-varying regression coefficients, the latter being smooth functions of time. The model is fitted using a mixed-model representation of penalized spline smoothing which offers a unified framework for estimation of the baseline hazard, the smooth effects and the random effects. The estimator is computed using a stacked Laplace-EM (SLaEM) algorithm. More specifically, the smoothing parameters are integrated out in the log-likelihood via a Laplace approximation. The approximation itself involves an integrated log-likelihood over the random cluster effects, for which the EM algorithm is used. A marginal Akaike information criterion is developed for selection among possible candidate models. The time-varying and mixed effects model is applied to unemployment data taken from the German Socio-Economic Panel. The duration of unemployment is modeled in a flexible way including smooth covariate effects and individual random effects.

10.
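Because they handle large volumes of high-dimensional data effectively, tolerate noisy points well, and represent knowledge effectively, tree-based algorithms are the most commonly used techniques for customer segmentation in CRM. This paper analyzes how several tree-based algorithms, namely the C4.5 decision tree, the CART decision tree, and the balanced random forest (BRF), perform on telecom customer segmentation, with a BP neural network as the reference for comparison. The study concludes that the balanced random forest performs best on the telecom customer problem.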

11.
A mixed effects least squares support vector machine (LS-SVM) classifier is introduced to extend the standard LS-SVM classifier for handling longitudinal data. The mixed effects LS-SVM model contains a random intercept and allows classification of highly unbalanced data, in the sense that there is an unequal number of observations for each case at non-fixed time points. The methodology consists of a regression modeling step and a classification step based on the obtained regression estimates. Regression and classification of new cases are performed in a straightforward manner by solving a linear system. It is demonstrated that the methodology can be generalized to deal with multi-class problems and can be extended to incorporate multiple random effects. The technique is illustrated on simulated data sets and real-life problems concerning human growth.

12.
In this paper, an online soft computing model based on an integration between the fuzzy ARTMAP (FAM) neural network and the classification and regression tree (CART) for undertaking data classification problems is presented. The online FAM network is useful for conducting incremental learning with data samples, whereas the CART model prevails in depicting the knowledge learned explicitly in a tree structure. Capitalizing on their respective advantages, the hybrid FAM‐CART model is capable of learning incrementally while explaining its predictions with knowledge elicited from data samples. To evaluate the usefulness of FAM‐CART, 2 sets of benchmark experiments with a total of 12 problems are used in both offline and online learning modes. The results are examined and compared with those published in the literature. The experimental outcome positively indicates that the online FAM‐CART model is useful for tackling data classification tasks. In addition, a decision tree is produced to help users understand the predictions, which is an important property of the hybrid FAM‐CART model in supporting decision‐making tasks.

13.
Regression analysis is a machine learning approach that aims to accurately predict the value of continuous output variables from certain independent input variables, via automatic estimation of their latent relationship from data. Tree-based regression models are popular in the literature due to their flexibility in modeling higher-order non-linearity and their great interpretability. Conventionally, regression tree models are trained in a two-stage procedure, i.e. recursive binary partitioning is employed to produce a tree structure, followed by a pruning process of removing insignificant leaves, with the possibility of assigning multivariate functions to terminal leaves to improve generalisation. This work introduces a novel methodology of node partitioning which, in a single optimisation model, simultaneously performs the two tasks of identifying the break-point of a binary split and assignment of multivariate functions to either leaf, thus leading to an efficient regression tree model. Using six real world benchmark problems, we demonstrate that the proposed method consistently outperforms a number of state-of-the-art regression tree models and methods based on other techniques, with an average improvement of 7–60% on the mean absolute errors (MAE) of the predictions.

14.
Generalized linear mixed models (GLMM) form a very general class of random effects models for discrete and continuous responses in the exponential family. They are useful in a variety of applications. The traditional likelihood approach for GLMM usually involves high-dimensional integrations which are computationally intensive. In this work, we investigate the case of binary outcomes analyzed under a two-stage probit normal model with random effects. First, it is shown how ML estimates of the fixed effects and variance components can be computed using a stochastic approximation of the EM algorithm (SAEM). The SAEM algorithm can be applied directly, or in conjunction with a parameter expansion version of EM to speed up the convergence. A procedure is also proposed to obtain REML estimates of variance components and REML-based estimates of fixed effects. Finally, an application to a real data set involving a clinical trial is presented, in which these techniques are compared to other procedures (penalized quasi-likelihood, maximum likelihood, Bayesian inference) already available in classical software (SAS Glimmix, SAS Nlmixed, WinBUGS), as well as to a Monte Carlo EM (MCEM) algorithm.

15.
The zero-inflated Poisson (ZIP) regression model is a popular approach to the analysis of count data with excess zeros. For correlated count data where the observations are either repeated or clustered outcomes from individual subjects, a ZIP mixed regression model may be appropriate. However, the ZIP model may often fail to fit such data either because of over-dispersion or because of under-dispersion in relation to the Poisson distribution. In this paper, we extend the ZIP mixed regression model to the zero-inflated generalized Poisson (ZIGP) mixed regression model, where the baseline discrete distribution is the generalized Poisson (GP) distribution, which is a natural extension of the standard Poisson distribution. Furthermore, the random effects are considered in both the zero-inflated and GP components throughout the paper. An EM algorithm for estimating parameters is proposed based on the best linear unbiased prediction-type (BLUP) log-likelihood and the residual maximum likelihood (REML). Meanwhile, several score tests are presented for testing the ZIP mixed regression model against the ZIGP mixed regression model, and for testing the significance of regression coefficients in the zero-inflation and generalized Poisson portions. A numerical example is given to illustrate our methodology and the properties of the score test statistics are investigated through Monte Carlo simulations.
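
As a point of reference for the ZIP baseline that this abstract starts from, the sketch below evaluates the standard zero-inflated Poisson probability mass function: with probability `pi` the outcome is a structural zero, and otherwise it follows a Poisson distribution with mean `lam`. It covers only the plain ZIP distribution, not the generalized Poisson extension, the random effects, or the EM estimation the paper proposes; the function name and parameter names are illustrative choices.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) under a zero-inflated Poisson with rate lam and zero-inflation pi."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # A zero can come from the structural-zero part or from the Poisson part.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson
```

Setting `pi = 0` recovers the ordinary Poisson distribution, which is why zero inflation can be assessed by testing whether `pi` differs from zero.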

16.
Classification of large datasets is an important data mining problem. Many classification algorithms have been proposed in the literature, but studies have shown that so far no algorithm uniformly outperforms all other algorithms in terms of quality. In this paper, we present a unifying framework called Rain Forest for classification tree construction that separates the scalability aspects of algorithms for constructing a tree from the central features that determine the quality of the tree. The generic algorithm is easy to instantiate with specific split selection methods from the literature (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT and QUEST). In addition to its generality, in that it yields scalable versions of a wide range of classification algorithms, our approach also offers performance improvements of over a factor of three over the SPRINT algorithm, the fastest scalable classification algorithm proposed previously. In contrast to SPRINT, however, our generic algorithm requires a certain minimum amount of main memory, proportional to the set of distinct values in a column of the input relation. Given current main memory costs, this requirement is readily met in most if not all workloads.

17.
Count data arise widely in medical trials, public health, surveys and environmental studies. In analyzing count data, it is important to find out whether zero inflation exists and how to select the most suitable model. However, the classic AIC criterion for model selection is invalid when observations are missing. In this paper, we develop a new model selection criterion in line with AIC for zero-inflated regression models with missing covariates. This method is a modified version of the Monte Carlo EM algorithm, which is based on the data augmentation scheme. One of the main attractions of this new method is that it is applicable for comparison of candidate models regardless of whether there are missing data or not. What is more, it is very simple to compute, as it is just a by-product of the Monte Carlo EM algorithm once the parameter estimates are obtained. A simulation study and a real example are used to illustrate the proposed methodologies.

18.
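This paper proposes a range image segmentation algorithm based on tree-structured splitting of elliptical clusters. Drawing on the physical meaning of the decomposition of a cluster's covariance matrix, the algorithm uses the two-dimensional scatter of the data to determine both the direction and the length of the splitting perturbation vector, and iteratively splits clusters to provide initial values for the expectation-maximization algorithm. The algorithm also exploits the physical meaning of the Gaussian mixture model of surface normals to reduce the number of clustering runs, and determines the number of classes adaptively using thresholds with a clear geometric meaning. Experiments were carried out on 60 real range images from two kinds of range cameras, with objective comparisons against the conventional tree-structured perturbation scheme and K-means initialization. The experiments show that the new initialization scheme achieves better results with fewer clustering runs.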

19.
Random effects in generalized linear mixed models (GLMM) are used to explain the serial correlation of longitudinal categorical data. Because the covariance matrix is high dimensional and must be positive definite, its structure is usually assumed to be constant over subjects and restricted to a form such as an AR(1) structure. However, these assumptions are too strong and can result in biased estimates of the fixed effects. In this paper we propose Bayesian modeling of the GLMM with regression models for the parameters of the random effects covariance matrix, using a moving average Cholesky decomposition which factors the covariance matrix into moving average (MA) parameters and innovation variances (IVs). We analyze lung cancer data using our proposed model.

20.
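The rapid development of big data and cloud computing has provided technical support for mining the rich scientific and economic value of meteorological data, and has promoted the wide use of Hadoop, together with its file storage system (HDFS, Hadoop Distributed File System) and its distributed computing model, in meteorological data processing. Because meteorological data exhibit the 4V characteristics of big data, new data processing algorithms are also needed to improve processing efficiency. Based on a study of the principles of decision tree algorithms, a random forest model is built on the Hadoop cloud platform, offering a new possibility for applying data mining algorithms on cloud platforms. The meteorological big data cloud platform, designed around the decision tree (CART, Classification And Regression Trees) mining algorithm, adopts the Hadoop system architecture and the MapReduce workflow and is deployed as a cluster. The overall architecture is divided into an infrastructure layer, a data management and processing layer, and an application layer. The platform reduces the time needed to build decision trees and provides functions such as efficient processing and mining analysis of meteorological data.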
