Similar articles (20 results found)
1.
Balance-sheet data offer a potentially large number of candidate predictors of corporate financial failure. In this paper we provide a novel predictor selection procedure based on non-parametric regression and the classification tree method (CART) and test its performance within a standard logit model. We show that a simple logit model with dummy variables created in accordance with the nodes of the estimated classification tree outperforms both a standard logit model with stepwise-selected financial ratios and CART itself. On a population of Slovenian companies our method achieves remarkable rates of precision in out-of-sample bankruptcy prediction. Our selection method thus represents an efficient way of introducing non-linear effects of predictor variables on the default probability in standard single-index models like logit. These findings are robust to choice-based sampling of estimation samples.

2.
Remote sensing often involves the estimation of in situ quantities from remote measurements. Linear regression, where there are no non-linear combinations of regressors, is a common approach to this prediction problem in the remote sensing community. A review of recent remote sensing articles using univariate linear regression indicates that in the majority of cases, ordinary least squares (OLS) linear regression has been applied, with approximately half the articles using the in situ observations as regressors and the other half using the inverse regression with remote measurements as regressors. OLS implicitly assumes an underlying normal structural data model to arrive at unbiased estimates of the response. OLS regression can be a biased predictor in the presence of measurement errors when the regression problem is based on a functional rather than structural data model. Parametric (Modified Least Squares) and non-parametric (Theil-Sen) consistent predictors are given for linear regression in the presence of measurement errors together with analytical approximations of their prediction confidence intervals. Three case studies involving estimation of leaf area index from nadir reflectance estimates are used to compare these unbiased estimators with OLS linear regression. A comparison to Geometric Mean regression, a standardized version of Reduced Major Axis regression, is also performed. The Theil-Sen approach is suggested as a potential replacement of OLS for linear regression in remote sensing applications. It offers simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and requires limited a priori information regarding measurement errors.
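
As a rough illustration of the Theil-Sen idea mentioned in this abstract, the sketch below takes the slope as the median of all pairwise slopes. The function name `theil_sen`, the intercept rule (median of the offsets `y - slope * x`) and the toy data are assumptions made for this example only; the abstract does not say which variant the authors use.

```python
from itertools import combinations
from statistics import median


def theil_sen(x, y):
    """Return (slope, intercept) of a Theil-Sen line fit.

    Slope: median of the slopes over all pairs of points with distinct x.
    Intercept: median of y_i - slope * x_i (one common convention, assumed here).
    """
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Toy data: four points on y = 2x plus one gross outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 100.0]
slope, intercept = theil_sen(x, y)
print(slope, intercept)
```

Because the medians ignore the single outlier, the fit on this toy data stays on the line through the other four points, which is the robustness property the abstract refers to.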

3.
《Ergonomics》2012,55(5):499-511
This paper aims to demonstrate the effects of measurement errors on psychometric measurements in ergonomics studies. A variety of sources can cause random measurement errors in ergonomics studies and these errors can distort virtually every statistic computed and lead investigators to erroneous conclusions. The effects of measurement errors on the five most widely used statistical analysis tools are discussed and illustrated: correlation; ANOVA; linear regression; factor analysis; linear discriminant analysis. It is shown that measurement errors can greatly attenuate correlations between variables, reduce the statistical power of ANOVA, distort (overestimate, underestimate or even change the sign of) regression coefficients, underrate the explanatory contributions of the most important factors in factor analysis and depreciate the significance of the discriminant function and the discrimination abilities of individual variables in discriminant analysis. The discussion is restricted to subjective scales and survey methods and their reliability estimates. Other methods applied in ergonomics research, such as physical and electrophysiological measurements and chemical and biomedical analysis methods, also have issues of measurement errors, but they are beyond the scope of this paper. As there has been increasing interest in the development and testing of theories in ergonomics research, it has become very important for ergonomics researchers to understand the effects of measurement errors on their experimental results, which the authors believe is critical to research progress in theory development and cumulative knowledge in the ergonomics field.
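
The attenuation of correlations described in this abstract is commonly expressed with the classical attenuation formula from test theory: the observed correlation equals the true correlation times the square root of the product of the two reliabilities. The sketch below assumes that textbook formula; the function name and the example numbers are illustrative and are not taken from the paper.

```python
import math


def attenuated_correlation(r_true, reliability_x, reliability_y):
    """Expected observed correlation when both scales contain random error.

    Classical attenuation formula (assumed here):
        r_observed = r_true * sqrt(reliability_x * reliability_y)
    """
    return r_true * math.sqrt(reliability_x * reliability_y)


# A true correlation of 0.5 measured with one scale of reliability 0.64
# and one error-free scale.
r_obs = attenuated_correlation(0.5, 0.64, 1.0)
print(round(r_obs, 3))
```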

4.
Researchers in lidar (Light Detection And Ranging) strive to find the most appropriate laser-based metrics as predictors in regression models for estimating forest structural variables. Many previously developed models are scale-dependent: they need to be fitted and then applied at the same scale or pixel size. The objective of this paper is to develop methods for scale-invariant estimation of forest biomass using lidar data. We propose two scale-invariant models for biomass: a linear functional model and an equivalent nonlinear model that use lidar-derived canopy height distributions (CHD) and canopy height quantile functions (CHQ) as predictors, respectively. The two models are called functional regression models because the predictors CHD and CHQ are themselves functions or functional data. The model formulation is justified mathematically under moderate assumptions. We also created a fine-resolution biomass map by mapping individual tree component biomass in a temperate forest of eastern Texas with a lidar tree-delineation approach. The map was used as reference data to synthesize training and test datasets at multiple scales for validating the two scale-invariant models. Results suggest that the models can accurately predict biomass and yield consistent predictive performance across a variety of scales, with R² ranging from 0.80 to 0.95 (RMSE from 14.3 Mg/ha to 33.7 Mg/ha) among all the fitted models. Results also show that a training data size of around 50 plots or fewer was enough to guarantee a good fit of the linear functional model. Our findings demonstrate the effectiveness of CHD and CHQ as lidar metrics for estimating biomass as well as the capability of lidar for mapping biomass at a range of scales. The functional regression models of this study are useful for lidar-based forest inventory tasks where the analysis units vary in size and shape. They also hold promise for estimating other forest characteristics such as below-ground biomass, timber volume, crown fuel weight, and Leaf Area Index.

5.
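Defects of applying the least squares method in industrial data mining   Cited by: 1 (self-citations: 1, citations by others: 0)
Industrial data modeling often uses the least squares method for parameter estimation; provided the data satisfy certain conditions, it yields unbiased estimates of the parameters. Industrial data, however, generally contain measurement errors, and when error-contaminated data are used as the independent variables in a least squares regression, the resulting parameter estimates are often biased and fail to reflect the structural relationships among the variables correctly. A two-variable model is therefore analyzed in depth. Under reasonable assumptions on the measurement errors, an analytical relation is established, in the statistical sense, between the true values of the estimated parameters, the measurement errors, and the biased parameter estimates, which lays a theoretical foundation for subsequent parameter correction. A simulation example demonstrates the effectiveness of the method.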
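
For orientation, the classical textbook form of such a relation for a single regressor is written out below. It is an assumption that the paper's relation resembles it, and the symbols ($x$, $w$, $u$, $\sigma_x^2$, $\sigma_u^2$) are introduced here for illustration only. Suppose $y = \beta x + \varepsilon$, but only $w = x + u$ is observed, with $u$ uncorrelated with $x$ and $\varepsilon$.

```latex
\hat{\beta}_{\mathrm{OLS}}
  \;\xrightarrow{\;p\;}\;
  \beta \,\frac{\sigma_x^{2}}{\sigma_x^{2} + \sigma_u^{2}},
\qquad\text{so}\qquad
\beta \;\approx\; \hat{\beta}_{\mathrm{OLS}}\,
  \frac{\sigma_x^{2} + \sigma_u^{2}}{\sigma_x^{2}} .
```

Under these assumptions the least squares slope is shrunk toward zero by the ratio of true-signal variance to observed variance, and knowing (or assuming) the error variance $\sigma_u^2$ is what makes a correction possible.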

6.
Maximum likelihood (ML) in the linear model overfits when the number of predictors (M) exceeds the number of objects (N). One possible solution is the relevance vector machine (RVM), which is a form of automatic relevance detection and gained popularity in the pattern recognition and machine learning community through the well-known textbook of Bishop (2006). RVM assigns individual precisions to the weights of predictors, which are then estimated by maximizing the marginal likelihood (type II ML or empirical Bayes). We investigated the selection properties of RVM both analytically and by experiments in a regression setting. We show analytically that RVM selects predictors when the absolute z-ratio (|least squares estimate|/standard error) exceeds 1 in the case of orthogonal predictors and, for M = 2, that this still holds true for correlated predictors when the other z-ratio is large. RVM selects the stronger of two highly correlated predictors. In experiments with real and simulated data, RVM is outcompeted by other popular regularization methods (LASSO and/or PLS) in terms of prediction performance. We conclude that type II ML is not the general answer in high-dimensional prediction problems. In extensions of RVM to obtain stronger selection, improper priors (based on the inverse gamma family) have been assigned to the inverse precisions (variances) with parameters estimated by penalized marginal likelihood. We critically assess this approach and suggest a proper variance prior related to the Beta distribution which gives similar selection and shrinkage properties and allows a fully Bayesian treatment.
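
The selection rule stated in this abstract for orthogonal predictors can be written down directly. The sketch below only encodes that stated threshold (absolute z-ratio above 1); it is not an RVM implementation, and the function name and inputs are hypothetical.

```python
def rvm_selected(estimates, std_errors):
    """Flag predictors whose absolute z-ratio exceeds 1.

    estimates:  least squares estimates, one per predictor
    std_errors: their standard errors
    Applies only to the orthogonal-predictor case described in the abstract.
    """
    return [abs(b / se) > 1.0 for b, se in zip(estimates, std_errors)]


# z-ratios are 2.0, 0.6 and -1.5, so the first and third predictors are kept.
flags = rvm_selected([2.0, 0.3, -1.5], [1.0, 0.5, 1.0])
print(flags)
```

A threshold of 1 is far more permissive than the usual significance cut-off near 2, which is consistent with the abstract's remark that extensions were proposed to obtain stronger selection.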

7.
In fitting regression models, data analysts are often faced with many predictor variables which may influence the outcome. Several strategies for selecting variables to identify a subset of 'important' predictors have been available for many years. A further issue in model building is how to deal with non-linearity in the relationship between the outcome and a continuous predictor. Traditionally, for such predictors either a linear functional relationship or a step function after grouping is assumed. However, the assumption of linearity may be incorrect, leading to a misspecified final model. For multivariable model building, a systematic approach to investigating possible non-linear functional relationships, based on fractional polynomials (FP) combined with backward elimination, was proposed recently. So far a program has been available only in Stata, which has prevented a more general application of this useful procedure. The approach is introduced, its advantages are shown in two examples, a new way to present FP functions is illustrated and a SAS macro is briefly introduced. Differences from the Stata and R programs are noted.
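
To make the fractional polynomial idea concrete, the sketch below generates first-degree FP transformations of a positive predictor. The power set `(-2, -1, -0.5, 0, 0.5, 1, 2, 3)`, with power 0 read as the natural logarithm, is the conventional choice in the FP literature and is an assumption here, since the abstract does not list it; the function names are hypothetical and no model fitting or backward elimination is shown.

```python
import math

# Conventional FP power set (assumed); power 0 denotes log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(x, power):
    """Apply one fractional polynomial power to a strictly positive value."""
    if x <= 0:
        raise ValueError("fractional polynomials need x > 0")
    return math.log(x) if power == 0 else x ** power


def fp1_candidates(values):
    """Return {power: transformed values} for every candidate FP1 term."""
    return {p: [fp_transform(v, p) for v in values] for p in FP_POWERS}


candidates = fp1_candidates([1.0, 2.0, 4.0])
print(candidates[0.5])
```

In a full procedure each candidate column would be fitted in turn and the best-fitting power compared against the linear term, which is where the model selection described in the abstract comes in.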

8.
As advanced control rooms for new process control plants are being designed, the question arises as to whether operators of the future need to have a particular set of cognitive characteristics to make the most of those advanced control rooms. This issue was investigated by examining the interaction between ecological interface design (EID) and individual differences in the context of a process control microworld. A number of potential predictors of performance were investigated, including: demographic data, type of interface, type of instruction, and data from two cognitive style tests. Eight linear regression analyses were conducted to determine which variables were the strongest predictors of performance. The results indicate that the strongest and most consistent predictor of performance was the interaction between a holist cognitive style score and an interface based on the principles of EID. That is, individuals who used an EID interface and who had high holist scores were the best performers. It seems that these individuals have the relational thinking ability that is required to exploit the value of the higher-order functional information provided by an EID interface. This empirical result has practical implications for operator selection.

9.
《Ergonomics》2012,55(7):791-808
Individual differences in vigilance are ubiquitous and relevant to a variety of work environments in industrial, transportation, medical and security settings. Despite much previous work, mostly on personality traits, it remains difficult to identify vigilant operators. This paper reviews recent research that may point towards practically useful predictor variables for vigilance. Theoretical approaches to identifying predictors that accommodate the heterogeneous nature of vigilance tasks are compared. The article surveys recent empirical studies using personality measures, ability tests and scales for stress and coping as predictors of vigilance. Promising new constructs include trait scales linked to fatigue, abnormal personality and the stress state of task engagement. Implications of the data reviewed for occupational selection are discussed. Selection should be based on a multivariate assessment strategy, cognitive task analysis of the operational vigilance task and use of work sample measures to capture typical stress responses to the task.

10.
A new mathematical programming model is proposed to address the subset selection problem in multiple linear regression, where the objective is to select a minimal subset of predictor variables without sacrificing any explanatory power. A parametric solution of this model yields a number of efficient subsets. To obtain this solution, an optimal algorithm or one of two heuristic algorithms is applied repeatedly. The subsets generated are compared to ones generated by several standard procedures. The results suggest that the new approach finds subsets that compare favorably against the standard procedures in terms of the generally accepted measure, adjusted R².
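
Since the comparison in this abstract rests on adjusted R², a small sketch of the standard formula may help. It assumes a model with an intercept, `n` observations and `p` predictors; the function name and the example numbers are illustrative only.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a linear model with an intercept.

    r2: ordinary coefficient of determination
    n:  number of observations
    p:  number of predictors (excluding the intercept)
    """
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Same raw R-squared, but the smaller subset scores higher after adjustment.
small = adjusted_r2(0.90, n=21, p=4)
large = adjusted_r2(0.90, n=21, p=10)
print(round(small, 3), round(large, 3))
```

The penalty for extra predictors is what lets the measure reward a minimal subset that keeps the explanatory power of a larger one.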

11.
Empirical characterization of random forest variable importance measures   Cited by: 2 (self-citations: 0, citations by others: 2)
Microarray studies yield data sets consisting of a large number of candidate predictors (genes) on a small number of observations (samples). When interest lies in predicting phenotypic class using gene expression data, often the goals are both to produce an accurate classifier and to uncover the predictive structure of the problem. Most machine learning methods, such as k-nearest neighbors, support vector machines, and neural networks, are useful for classification. However, these methods provide no insight regarding the covariates that best contribute to the predictive structure. Other methods, such as linear discriminant analysis, require the predictor space be substantially reduced prior to deriving the classifier. A recently developed method, random forests (RF), does not require reduction of the predictor space prior to classification. Additionally, RF yields variable importance measures for each candidate predictor. This study examined the effectiveness of RF variable importance measures in identifying the true predictor among a large number of candidate predictors. An extensive simulation study was conducted using 20 levels of correlation among the predictor variables and 7 levels of association between the true predictor and the dichotomous response. We conclude that the RF methodology is attractive for use in classification problems when the goals of the study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables. Such goals are common among microarray studies, and therefore application of the RF methodology for the purpose of obtaining variable importance measures is demonstrated on a microarray data set.

12.
Bootstrap estimated true and false positive rates and ROC curve   Cited by: 1 (self-citations: 0, citations by others: 1)
Diagnostic studies and new biomarkers are assessed by the estimated true and false positive rates of the classification rule. One diagnostic rule is considered for high-dimensional predictor data. Cross-validation and the leave-one-out bootstrap are discussed to estimate true and false positive rates of classifiers by the machine learning methods Adaboost, Bagging, Random Forest, (penalized) logistic regression and support vector machines. The .632+ bootstrap estimation of the misclassification error has been previously proposed to adjust for the overfitting of the apparent error. This idea is generalized to the estimation of true and false positive rates. Tree-based simulation models with 8 and 50 binary non-informative variables are analysed to examine the properties of the estimators. Finally, a bootstrap estimation of receiver operating characteristic (ROC) curves is suggested and a .632+ bootstrap estimation of ROC curves is discussed. This approach is applied to high-dimensional gene expression data of leukemia and predictors of image data for glaucoma diagnosis.
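
For readers unfamiliar with the .632+ estimator that this abstract builds on, the sketch below combines the apparent error and the leave-one-out bootstrap error for the misclassification case. It follows one common formulation of the estimator; the function and argument names are hypothetical, the three inputs are assumed to be computed elsewhere, and the paper's generalization to true and false positive rates is not reproduced.

```python
def err_632_plus(err_apparent, err_loo_boot, no_info_rate):
    """Combine apparent and leave-one-out bootstrap error (.632+ style).

    err_apparent: error on the training data (optimistic)
    err_loo_boot: leave-one-out bootstrap error (pessimistic)
    no_info_rate: error expected if predictors and labels were independent
    """
    err_loo = min(err_loo_boot, no_info_rate)
    if err_loo > err_apparent and no_info_rate > err_apparent:
        # Relative overfitting rate, between 0 and 1.
        r = (err_loo - err_apparent) / (no_info_rate - err_apparent)
    else:
        r = 0.0
    # Weight grows from 0.632 (no overfitting) to 1 (complete overfitting).
    weight = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - weight) * err_apparent + weight * err_loo


print(err_632_plus(0.05, 0.20, 0.50))
```

With no overfitting the weight is 0.632, the plain .632 estimator; with complete overfitting the weight reaches 1 and the estimate falls back on the leave-one-out bootstrap error alone.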

13.
Methods of regression diagnostics for functional regression models are developed which relate a functional response to predictor variables that can be multivariate vectors or random functions. For this purpose, a residual process is defined by subtracting the predicted from the observed response functions. This residual process is expanded into functional principal components (FPC), and the corresponding FPC scores are used as natural proxies for the residuals in functional regression models. For the case of a univariate covariate, a randomization test is proposed based on these scores to examine if the residual process depends on the covariate. If this is the case, it indicates lack of fit of the model. Graphical methods based on the FPC scores of observed and fitted functions can be used to complement more formal tests. The methods are illustrated with data from a recent study of Drosophila fruit flies regarding life-cycle gene expression trajectories as well as functional data from a dose-response experiment for Mediterranean fruit flies (Ceratitis capitata).

14.
We present a Bayesian variable selection method for the setting in which the number of independent variables or predictors in a particular dataset is much larger than the available sample size. While most of the existing methods allow some degree of correlation among predictors but do not consider these correlations for variable selection, our method accounts for correlations among the predictors in variable selection. Our correlation-based stochastic search (CBS) method, the hybrid-CBS algorithm, extends a popular search algorithm for high-dimensional data, the stochastic search variable selection (SSVS) method. Similar to SSVS, we search the space of all possible models using variable addition, deletion or swap moves. However, our moves through the model space are designed to accommodate correlations among the variables. We describe our approach for continuous, binary, ordinal, and count outcome data. The impact of choices of prior distributions and hyperparameters is assessed in simulation studies. We also examined the performance of variable selection and prediction as the correlation structure of the predictors varies. We found that the hybrid-CBS resulted in lower prediction errors and better identified the true outcome-associated predictors than SSVS when predictors were moderately to highly correlated. We illustrate the method on data from a proteomic profiling study of melanoma, a type of skin cancer.

15.
The rank and regression rank score tests of linear hypotheses in the linear regression model are modified for measurement error models. The modified tests are still distribution free. Some tests of linear subhypotheses are invariant to the nuisance parameter, others are based on the aligned ranks using the R-estimators. The asymptotic relative efficiencies of the tests with respect to tests in models without measurement errors are evaluated. The simulation study illustrates the powers of the tests.

16.
Many automatic speech recognition (ASR) systems rely solely on pronunciation dictionaries and language models to take into account information about language. Implicitly, morphology and syntax are to a certain extent embedded in the language models, but the richness of such linguistic knowledge is not exploited. This paper studies the use of morpho-syntactic (MS) information in a post-processing stage of an ASR system, by reordering N-best lists. Each sentence hypothesis is first part-of-speech tagged. A morpho-syntactic score is computed over the tag sequence with a long-span language model and combined with the acoustic and word-level language model scores. This new sentence-level score is finally used to rescore N-best lists by reranking or consensus. Experiments on a French broadcast news task show that morpho-syntactic knowledge improves the word error rate and confidence measures. In particular, it was observed that the errors corrected are not only agreement errors and errors on short grammatical words but also other errors on lexical words where the hypothesized lemma was modified.
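
The reranking step described in this abstract can be pictured as a weighted sum of per-hypothesis scores. In the sketch below the dictionary keys (`acoustic`, `lm`, `ms`), the weights and the log-score values are all hypothetical; the abstract does not give the combination formula, so a simple linear combination of log scores is assumed.

```python
def rerank(nbest, w_lm=1.0, w_ms=1.0):
    """Sort N-best hypotheses by a combined sentence-level score.

    Each hypothesis is a dict with log-domain scores:
      'acoustic' (acoustic model), 'lm' (word-level language model),
      'ms' (morpho-syntactic tag-sequence model).
    Higher combined score is better. A linear combination is assumed.
    """
    def combined(h):
        return h["acoustic"] + w_lm * h["lm"] + w_ms * h["ms"]

    return sorted(nbest, key=combined, reverse=True)


nbest = [
    {"text": "les chat dorment", "acoustic": -10.0, "lm": -5.0, "ms": -6.0},
    {"text": "les chats dorment", "acoustic": -10.5, "lm": -5.0, "ms": -2.0},
]
baseline = rerank(nbest, w_ms=0.0)   # ignore the morpho-syntactic score
rescored = rerank(nbest, w_ms=1.0)   # include it
print(baseline[0]["text"], "->", rescored[0]["text"])
```

In this made-up example the tag-sequence score penalizes the agreement error, so the grammatical hypothesis moves to the top once the morpho-syntactic term is included.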

17.
Some regularization methods, including the group lasso and the adaptive group lasso, have been developed for the automatic selection of grouped variables (factors) in conditional mean regression. In many practical situations, such a problem arises naturally when a set of dummy variables is used to represent a categorical factor and/or when a set of basis functions of a continuous variable is included in the predictor set. Complementary to these earlier works, the simultaneous and automatic factor selection is examined in quantile regression. To incorporate the factor information into regularized model fitting, the adaptive sup-norm regularized quantile regression is proposed, which penalizes the empirical check loss function by the sum of factor-wise adaptive sup-norm penalties. It is shown that the proposed method possesses the oracle property. A simulation study demonstrates that the proposed method is a more appropriate tool for factor selection than the adaptive lasso regularized quantile regression.
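
The two ingredients named in this abstract, the check loss and a factor-wise sup-norm penalty, are simple to write down. The sketch below uses the standard check loss of quantile regression and takes the penalty as a weighted sum of the largest absolute coefficient in each group; the function names, the grouping and the weights are illustrative assumptions, and no optimization is performed.

```python
def check_loss(residual, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def sup_norm_penalty(groups, weights):
    """Weighted sum over factors of the largest |coefficient| in each factor."""
    return sum(w * max(abs(b) for b in g) for g, w in zip(groups, weights))


# At tau = 0.25, negative residuals cost three times as much as positive ones.
print(check_loss(2.0, 0.25), check_loss(-2.0, 0.25))

# Two factors: one with two dummy coefficients, one with a single coefficient.
print(sup_norm_penalty([[1.0, -3.0], [0.5]], [1.0, 2.0]))
```

Because the penalty depends only on the largest coefficient of a factor, shrinking it to zero removes the whole factor at once, which is what makes it a factor selection device.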

18.
In this paper, we propose the weighted fusion, a new penalized regression and variable selection method for data with correlated variables. The weighted fusion can potentially incorporate information redundancy among correlated variables for estimation and variable selection. Weighted fusion is also useful when the number of predictors p is larger than the number of observations n. It allows the selection of more than n variables in a motivated way. Real data and simulation examples show that weighted fusion can improve variable selection and prediction accuracy.

19.
A linear multivariate measurement error model AX=B is considered. The errors in [A B] are row-wise finite dependent, and within each row, the errors may be correlated. Some of the columns may be observed without errors, and in addition the error covariance matrix may differ from row to row. The columns of the error matrix are united into two uncorrelated blocks, and in each block, the total covariance structure is supposed to be known up to a corresponding scalar factor. Moreover, the row data are clustered into two groups, according to the behavior of the rows of the true A matrix. The change point is unknown and is estimated in the paper. After that, based on the method of corrected objective function, strongly consistent estimators of the scalar factors and X are constructed, as the numbers of rows in the clusters tend to infinity. Since Toeplitz/Hankel structure is allowed, the results are applicable to system identification, with a change point in the input data.

20.
Stock index forecasting is one of the most difficult tasks that financial organizations, firms and private investors have to face. Support vector regression (SVR) has become a popular alternative in stock index forecasting tasks due to its generalization capability in obtaining a unique solution. However, the major limitation of SVR is that it cannot capture the relative importance of independent variables to the dependent variable when many potential independent variables are considered. This study combines a feature selection method with SVR to build a stock index forecasting model. The proposed model uses multivariate adaptive regression splines (MARS), an effective nonlinear and nonparametric regression methodology, to identify important forecasting variables. The significant predictor variables obtained then serve as the inputs for the SVR model. Experimental results reveal that the important variables obtained from MARS can improve the forecasting performance of the SVR models. Moreover, the MARS results provide useful information about the relationship between the selected predictor variables and the stock index through the obtained basis functions, important predictor variables and the MARS prediction function. Hence, the proposed stock index forecasting model can generate good forecasting performance and exhibits the capability of identifying significant predictor variables, which provide valuable information for further investment decisions/strategies.
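
The basis functions mentioned in this abstract are, in standard MARS, pairs of hinge functions placed at a knot. The sketch below shows such a pair and a toy additive prediction built from them; the function names, knots and coefficients are invented for illustration and do not come from the study, and the knot search that MARS performs is not shown.

```python
def hinge_pair(x, knot):
    """Return the mirrored hinge functions (max(0, x - knot), max(0, knot - x))."""
    return max(0.0, x - knot), max(0.0, knot - x)


def mars_like_prediction(x, intercept, terms):
    """Evaluate intercept + sum(coef_right * h_right + coef_left * h_left).

    terms: list of (knot, coef_right, coef_left) tuples (hypothetical values).
    """
    total = intercept
    for knot, coef_right, coef_left in terms:
        right, left = hinge_pair(x, knot)
        total += coef_right * right + coef_left * left
    return total


# One knot at 3.0: slope 2 above the knot, slope -1 below it.
terms = [(3.0, 2.0, 1.0)]
print(mars_like_prediction(5.0, 10.0, terms), mars_like_prediction(1.0, 10.0, terms))
```

A predictor that never appears in any selected hinge term contributes nothing to the fit, which is how a MARS model can double as a variable selection step before SVR.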
