20 similar documents found (search time: 15 ms)
This paper introduces a tree-based model that combines aspects of classification and regression trees (CART) and smooth transition regression (STR). The model is called the STR-tree. The main idea is to specify a parametric nonlinear model through a tree-growing procedure. The resulting model can be analyzed as a smooth transition regression with multiple regimes. Decisions about splits are based entirely on a sequence of Lagrange multiplier (LM) tests of hypotheses. An alternative specification strategy based on 10-fold cross-validation is also discussed, and a Monte Carlo experiment is carried out to evaluate the performance of the proposed methodology in comparison with standard techniques. The STR-tree model outperforms CART in selecting the correct architecture of the simulated trees. Furthermore, the LM test seems to be a promising alternative to 10-fold cross-validation. Function approximation is also analyzed. When tested on real and simulated data sets, the STR-tree model shows better predictive ability than CART.
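
As a rough illustration of the smooth-transition idea described above, the sketch below replaces a hard CART-style split with a logistic transition between two leaf values. The function names, the slope parameter `gamma` and the location parameter `c` are illustrative assumptions; the abstract does not give the paper's exact parameterization.

```python
import math


def logistic_transition(x, gamma, c):
    """Logistic weight in [0, 1]; approaches a hard split at c as gamma grows.

    Written in a numerically stable form so large |gamma * (x - c)| does not overflow.
    """
    z = gamma * (x - c)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def smooth_split_predict(x, left_value, right_value, gamma, c):
    """Blend two leaf values with the transition weight (hypothetical helper)."""
    g = logistic_transition(x, gamma, c)
    return (1.0 - g) * left_value + g * right_value
```

With a large `gamma` the prediction behaves like an ordinary tree split at `c`; with a small `gamma` the two regimes are mixed over a wide range of `x`.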

In this paper, we develop a semi-supervised regression algorithm to analyze data sets which contain both categorical and numerical attributes. This algorithm partitions the data sets into several clusters and at the same time fits a multivariate regression model to each cluster. This framework allows one to incorporate both multivariate regression models for numerical variables (supervised learning methods) and k-mode clustering algorithms for categorical variables (unsupervised learning methods). The estimates of regression models and k-mode parameters can be obtained simultaneously by minimizing a function which is the weighted sum of the least-square errors in the multivariate regression models and the dissimilarity measures among the categorical variables. Both synthetic and real data sets are presented to demonstrate the effectiveness of the proposed method.

Many pattern classification algorithms such as Support Vector Machines (SVMs), Multi-Layer Perceptrons (MLPs), and K-Nearest Neighbors (KNNs) require data to consist of purely numerical variables. However, many real-world data sets consist of both categorical and numerical variables. In this paper we suggest an effective method of converting mixed categorical and numerical data into purely numerical data for binary classification. Since the suggested method is based on the theory of learning Bayesian Network Classifiers (BNCs), it is computationally efficient and robust to noise and data loss. The suggested method is also expected to extract sufficient information for estimating a minimum-error-rate (MER) classifier. Simulations on artificial and real-world data sets are conducted to demonstrate the competitiveness of the suggested method when the number of values in each categorical variable is large and BNCs accurately model the data.

Heterogeneous (mixed-type) data present significant challenges in both supervised and unsupervised learning. The situation is even more complicated when nominal variables have several levels (values) that make using indicator variables (for every categorical level) infeasible. With unsupervised learning, several fairly involved, computationally intensive, nonlinear multivariate techniques iteratively alternate data transformations with optimal scoring. These seek to optimize an objective on the basis of a covariance matrix. Our goal is to find a computationally efficient and flexible method for mapping categorical variables to numeric scores in mixed-type data. We attempt to go beyond optimizing second-order statistics (such as covariance) and enable distance-based methods by exploring mutual relationships or bumps of dependencies between variables. This is a new objective for a scoring method that's based on patterns learned from all the available variables.

朱杰 (Zhu Jie), 陈黎飞 (Chen Lifei). 《计算机应用》 (Journal of Computer Applications), 2017, 37(4): 1026-1031
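To address the difficulty of defining a distance function between objects when clustering categorical data, a categorical data clustering algorithm based on Bayesian probability estimation is proposed. First, an attribute-weighted probability model is proposed, in which each categorical attribute is assigned a weight reflecting its importance. Second, by transforming Bayes' formula, a clustering objective function based on maximum likelihood estimation is defined, and a partition-based clustering algorithm is proposed; the algorithm no longer depends on distances between objects but instead clusters according to the weighted likelihood between objects and the partitions of the data set. Third, an expression for computing the attribute weights is derived, leading to the conclusion that the weight of a categorical attribute is inversely proportional to the information entropy of its symbol distribution. Experiments on real and synthetic data sets show that, compared with existing distance-based clustering algorithms, the proposed algorithm improves clustering accuracy, with gains of 5% to 48% on bioinformatics data in particular, and yields practically meaningful attribute weights.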
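
The entropy-based weighting mentioned in this abstract can be sketched roughly as follows. This is only an illustration of "weight inversely proportional to the entropy of the symbol distribution": the function names, the small constant `eps` and the normalization to a unit sum are assumptions, and the paper's derived expression may differ.

```python
import math
from collections import Counter


def attribute_entropy(column):
    """Shannon entropy (natural log) of the symbol distribution of one attribute."""
    counts = Counter(column)
    n = len(column)
    return -sum((k / n) * math.log(k / n) for k in counts.values())


def inverse_entropy_weights(rows, eps=1e-9):
    """Weights proportional to 1 / entropy, normalized to sum to 1.

    eps (an assumption) avoids division by zero for constant attributes.
    """
    columns = list(zip(*rows))
    raw = [1.0 / (attribute_entropy(col) + eps) for col in columns]
    total = sum(raw)
    return [w / total for w in raw]
```

For example, with rows `[("a", "x"), ("a", "y"), ("a", "x"), ("b", "y")]` the first attribute has the more concentrated symbol distribution and therefore receives the larger weight.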

A Bayesian network is developed to monitor a production process where categorical attribute data are available. The number of sample items in each category is entered each time period, allowing the revised probability that the system is in-control or in one of multiple out-of-control states to be calculated. In contrast to other Bayesian methods, qualitative knowledge can be combined with sample data. The network permits the classification of the system into more than two states, so diagnostic analysis can be performed simultaneously with inference. The system state can be updated to reflect evidence on variables that complements the sample data.

When conducting experiments, the selected quality characteristic should as far as possible be a continuous variable and be easy to measure. Due to the inherent nature of the quality characteristic, or for reasons of measurement convenience and cost-effectiveness, the data observed in many experiments are ordered categorical. To analyze ordered categorical data for optimizing factor settings, there are three widely accepted approaches: Taguchi’s accumulation analysis, Nair’s scoring scheme and Jeng’s weighted probability scoring scheme. In this paper, a simpler method for analyzing ordered categorical data, named the weighted SN ratio method, is introduced. A case study on optimizing the polysilicon deposition process, minimizing surface defects and achieving the target thickness in a very large-scale integrated circuit, is used to demonstrate the four approaches. Finally, the efficiency of the four approaches in optimizing factor settings is compared using simulated experimental data that follow normal, Weibull and Gamma distributions. The results indicate that the weighted SN ratio method is easy to compute and uses one-step optimization to obtain the optimal factor settings. Its efficiency is slightly lower than that of the scoring scheme but better than that of accumulation analysis and the weighted probability scoring scheme.

In this paper, a class of denoised nonlinear regression estimators is suggested for a nonlinear measurement error model where the variables in error are observed together with an auxiliary variable. The programming involved in this denoised nonlinear regression estimation is relatively simple and can be adapted with little effort from existing programs for nonlinear regression estimation. We establish the consistency and asymptotic normality of such denoised estimators based on the least squares and M-methods. A simulation study is carried out to illustrate the performance of these estimates. An empirical application of the model to production models in economics further demonstrates the potential of the proposed modeling procedures.

A flexible Bayesian approach to a generalized linear model is proposed to describe the dependence of binary data on explanatory variables. The inverse of the exponential power cumulative distribution function is used as the link to the binary regression model. The exponential power family provides distributions with both lighter and heavier tails compared to the normal distribution, and includes the normal and an approximation to the logistic distribution as particular cases. The idea of using a data augmentation framework and a mixture representation of the exponential power distribution is exploited to derive efficient Gibbs sampling algorithms for both informative and noninformative settings. Some examples are given to illustrate the performance of the proposed approach when compared with other competing models.

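Existing learning methods are impractical for learning dynamic Bayesian networks from completely time-asymmetric data. A method is proposed that learns the structure of a dynamic Bayesian network from such data with the help of transfer variables. First, the sequence of transfer variables between adjacent time slices is learned; then, the local structure of the dynamic Bayesian network is learned based on node ordering and local score-and-search; finally, the structure of the whole dynamic Bayesian network is obtained through temporal expansion.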

We propose a model for a point-referenced spatially correlated ordered categorical response and methodology for inference. Models and methods for spatially correlated continuous response data are widespread, but models for spatially correlated categorical data, and especially ordered multi-category data, are less developed. Bayesian models and methodology have been proposed for the analysis of independent and clustered ordered categorical data, and also for binary and count point-referenced spatial data. We combine and extend these methods to describe a Bayesian model for point-referenced (as opposed to lattice) spatially correlated ordered categorical data. We include simulation results and show that our model offers superior predictive performance as compared to a non-spatial cumulative probit model and a more standard Bayesian generalized linear spatial model. We demonstrate the usefulness of our model in a real-world example to predict ordered categories describing stream health within the state of Maryland.

The problem of missing data in building multidimensional composite indicators is a delicate one that is often underrated. An imputation method particularly suitable for categorical data is proposed. This method is discussed in detail in the framework of nonlinear principal component analysis and compared to other missing data treatments which are commonly used in this analysis. Its performance relative to these other methods is evaluated through a simulation procedure performed on both an artificial case, varying the experimental conditions, and a real case. The proposed procedure is implemented using R.

The modeling of uncertainty in continuous and categorical regionalized variables is a common issue in the geosciences. We present a hybrid continuous/categorical model, in which the continuous variable is represented by the transform of a Gaussian random field, while the categorical variable is obtained by truncating one or more Gaussian random fields. The dependencies between the continuous and categorical variables are reproduced by assuming that all the Gaussian random fields are spatially cross-correlated. Algorithms and computer programs are proposed to infer the model parameters and to co-simulate the variables, and illustrated through a case study on a mining data set.

Variable selection for Poisson regression when the response variable is potentially underreported is considered. A logistic regression model is used to model the latent underreporting probabilities. An efficient MCMC sampling scheme is designed, incorporating uncertainty about which explanatory variables affect the dependent variable and which affect the underreporting probabilities. Validation data is required in order to identify and estimate all parameters. A simulation study illustrates favorable results both in terms of variable selection and parameter estimation. Finally, the procedure is applied to a real data example concerning deaths from cervical cancer.

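To address shortcomings in existing definitions of similarity for categorical variables, a new similarity definition is proposed. Using this definition, the data set is abstracted as an undirected graph and the clustering process is converted into finding the connected components of that graph, which leads to a connected-component-based clustering algorithm for categorical variables. To analyze the clustering quality of the algorithm quantitatively, a new evaluation index for clustering results is proposed for data sets whose class memberships are known. Experimental results show that the proposed algorithm achieves high clustering accuracy and efficiency.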
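
The graph-based clustering step described in this abstract can be sketched as below. The abstract does not state its new similarity definition, so this sketch substitutes the simple matching coefficient and a user-chosen `threshold`; both, along with all function names, are assumptions for illustration. Two records are linked by an edge when their similarity reaches the threshold, and each connected component of the resulting undirected graph becomes one cluster.

```python
def simple_matching(a, b):
    """Fraction of attributes on which two categorical records agree (stand-in similarity)."""
    return sum(1 for u, v in zip(a, b) if u == v) / len(a)


def connected_component_clusters(rows, threshold):
    """Label each row with the index of its connected component (union-find)."""
    n = len(rows)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge between every pair whose similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if simple_matching(rows[i], rows[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Map component roots to consecutive labels in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

The pairwise loop is quadratic in the number of records, which is acceptable for a sketch but would need indexing or blocking for large data sets.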

Machine Learning - The purpose of this paper is to introduce a new distance metric learning algorithm for ordinal regression. Ordinal regression addresses the problem of predicting classes for...

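The original convex quadratic programming problem in support vector regression is transformed into a smooth unconstrained problem, and an unconstrained support vector regression machine is constructed, so that many mature and effective unconstrained optimization algorithms can be applied to support vector regression. A smooth support vector regression algorithm is proposed; experimental results show that it converges faster and fits more accurately than other regression training methods.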

王凯 (Wang Kai). 《微计算机信息》 (Microcomputer Information), 2007, 23(3X): 232-233, 190
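The original convex quadratic programming problem in support vector regression is transformed into a smooth unconstrained problem, and an unconstrained support vector regression machine is constructed, so that many mature and effective unconstrained optimization algorithms can be applied to support vector regression. A smooth support vector regression algorithm is proposed; experimental results show that it converges faster and fits more accurately than other regression training methods.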

Bayesian wavelet networks for nonparametric regression   Total citations: 2, self-citations: 0, citations by others: 2
Radial wavelet networks have been proposed previously as a method for nonparametric regression. We analyze their performance within a Bayesian framework. We derive probability distributions over both the dimension of the networks and the network coefficients by placing a prior on the degrees of freedom of the model. This process bypasses the need to test or select a finite number of networks during the modeling process. Predictions are formed by mixing over many models of varying dimension and parameterization. We show that the complexity of the models adapts to the complexity of the data and produces good results on a number of benchmark test series.

A new technique based on Bayesian quantile regression that models the dependence of a quantile of one variable on the values of another using a natural cubic spline is presented. Inference is based on the posterior density of the spline and an associated smoothing parameter and is performed by means of a Markov chain Monte Carlo algorithm. Examples of the application of the new technique to two real environmental data sets and to simulated data for which polynomial modelling is inappropriate are given. An aid for making a good choice of proposal density in the Metropolis-Hastings algorithm is discussed. The new nonparametric methodology provides more flexible modelling than the currently used Bayesian parametric quantile regression approach.
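
Quantile regression in general rests on the check (pinball) loss, and Bayesian formulations commonly use it as the exponent of an asymmetric Laplace likelihood; whether this particular paper does so is not stated in the abstract. The sketch below only illustrates the loss itself, with hypothetical function names, by picking the sample value that minimizes the total check loss.

```python
def check_loss(residual, tau):
    """Check (pinball) loss for quantile level tau in (0, 1)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def empirical_quantile_by_check_loss(values, tau):
    """Sample value minimizing the summed check loss (a brute-force illustration)."""
    return min(values, key=lambda q: sum(check_loss(v - q, tau) for v in values))
```

Positive residuals are weighted by `tau` and negative ones by `1 - tau`, which is what pulls the minimizer toward the requested quantile instead of the mean.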
