Similar literature
Found 20 similar documents (search time: 31 ms)
1.
2.
3.

Multilevel modeling is often used in the social sciences for analyzing data that has a hierarchical structure, e.g., students nested within schools. In an earlier study, we investigated the performance of various prediction rules for predicting a future observable within a hierarchical data set (Afshartous & de Leeuw, 2004). We apply the multilevel prediction approach to the NELS:88 educational data in order to assess the predictive performance on a real data set; four candidate models are considered and predictions are evaluated via both cross-validation and bootstrapping methods. The goal is to develop model selection criteria that assess the predictive ability of candidate multilevel models. We also introduce two plots that 1) aid in visualizing the extent to which the multilevel model predictions are “shrunk” or translated from the OLS predictions, and 2) help identify whether certain groups exist for which the predictions are particularly good or bad.
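The idea of predictions being "shrunk" away from per-group estimates can be illustrated with a minimal sketch. It assumes the simplest random-intercept case with no covariates and with known variance components; the function names, the variances `tau2` and `sigma2`, and the scores are all hypothetical and are not taken from the study or from the NELS:88 data.

```python
def shrinkage_weight(n_j, tau2, sigma2):
    """Weight on the group's own mean: tau2 / (tau2 + sigma2 / n_j).

    tau2 is the (assumed known) between-group variance and sigma2 the
    (assumed known) within-group variance; n_j is the group size.
    """
    return tau2 / (tau2 + sigma2 / n_j)


def shrunken_group_mean(group_values, grand_mean, tau2, sigma2):
    """Pull the raw group mean toward the grand mean by the shrinkage weight."""
    n_j = len(group_values)
    group_mean = sum(group_values) / n_j
    lam = shrinkage_weight(n_j, tau2, sigma2)
    return lam * group_mean + (1.0 - lam) * grand_mean


# Hypothetical school with two students; all numbers are invented.
scores = [60.0, 70.0]
prediction = shrunken_group_mean(scores, grand_mean=50.0, tau2=4.0, sigma2=8.0)
```

In this toy case the weight is 0.5, so the raw group mean of 65 is pulled halfway toward the grand mean of 50. Smaller groups get smaller weights and are therefore shrunk more strongly.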


4.
This paper provides statistical guidance on the development and application of model-based geostatistical methods for disease prevalence mapping. We illustrate the different stages of the analysis, from exploratory analysis to spatial prediction of prevalence, through a case study on malaria mapping in Tanzania. Throughout the paper, we distinguish between predictive modelling, whose main focus is on maximizing the predictive accuracy of the model, and explanatory modelling, where greater emphasis is placed on understanding the relationships between the health outcome and risk factors. We demonstrate that these two paradigms can result in different modelling choices. We also propose a simple approach for detecting over-fitting based on inspection of the correlation matrix of the estimators of the regression coefficients. To enhance the interpretability of geostatistical models, we introduce the concept of domain effects in order to assist variable selection and model validation. The statistical ideas and principles illustrated here in the specific context of disease prevalence mapping are more widely applicable to any regression model for the analysis of epidemiological outcomes but are particularly relevant to geostatistical models, for which the separation between fixed and random effects can be ambiguous.

5.
Maximum likelihood principal component regression (MLPCR) is an errors-in-variables method used to accommodate measurement error information when building multivariate calibration models. A hindrance of MLPCR has been the substantial demand on computational resources sometimes made by the algorithm, especially for certain types of error structures. Operations on the large matrices involved are memory intensive and time consuming, especially when techniques such as cross-validation are used. This work describes the use of wavelet transforms (WT) as a data compression method for MLPCR. It is shown that the error covariance matrices in the wavelet and spectral domains are related through a two-dimensional WT. This allows the user to account for any effects of the wavelet transform on spectral and error structures. The wavelet transform can be applied to MLPCR when using either the full error covariance matrix or the smaller pooled error covariance matrix. Simulated and experimental near-infrared data sets are used to demonstrate the benefits of using wavelets with the MLPCR algorithm. In all cases, significant compression can be obtained while maintaining favorable predictive ability. Considerable time savings were also attained, with improvements ranging from a factor of 2 to a factor of 720. Using the WT-compressed data in MLPCR gave a reduction in prediction errors compared to using the raw data in MLPCR. An analogous reduction in prediction errors was not always seen when using PCR.

6.
Fu GH, Xu QS, Li HD, Cao DS, Liang YZ. Applied Spectroscopy, 2011, 65(4): 402-408
In this paper a novel wavelength region selection algorithm, called elastic net grouping variable selection combined with partial least squares regression (EN-PLSR), is proposed for multi-component spectral data analysis. The EN-PLSR algorithm can automatically select successive strongly correlated prediction variable groups related to the response variable using two steps. First, a portion of the correlated predictors are selected and divided into subgroups by means of the grouping effect of elastic net estimation. Then, a recursive leave-one-group-out strategy is employed to further shrink the variable groups in terms of the root mean square error of cross-validation (RMSECV) criterion. The performance of the algorithm with real near-infrared (NIR) spectroscopic data sets shows that the EN-PLSR algorithm is competitive with full-spectrum PLS and moving window partial least squares (MWPLS) regression methods and it is suitable for use with strongly correlated spectroscopic data.

7.
A machine learning–based framework for modeling the error introduced by surrogate models of parameterized dynamical systems is proposed. The framework entails the use of high‐dimensional regression techniques (e.g., random forests and LASSO) to map a large set of inexpensively computed “error indicators” (i.e., features) produced by the surrogate model at a given time instance to a prediction of the surrogate‐model error in a quantity of interest (QoI). This eliminates the need for the user to hand‐select a small number of informative features. The methodology requires a training set of parameter instances at which the time‐dependent surrogate‐model error is computed by simulating both the high‐fidelity and surrogate models. Using these training data, the method first determines regression‐model locality (via classification or clustering) and subsequently constructs a “local” regression model to predict the time‐instantaneous error within each identified region of feature space. We consider two uses for the resulting error model: (1) as a correction to the surrogate‐model QoI prediction at each time instance and (2) as a way to statistically model arbitrary functions of the time‐dependent surrogate‐model error (e.g., time‐integrated errors). We apply the proposed framework to model errors in reduced‐order models of nonlinear oil‐water subsurface flow simulations, with time‐varying well‐control (bottom‐hole pressure) parameters. The reduced‐order models used in this work entail application of trajectory piecewise linearization in conjunction with proper orthogonal decomposition. When the first use of the method is considered, numerical experiments demonstrate consistent improvement in accuracy in the time‐instantaneous QoI prediction relative to the original surrogate model, across a large number of test cases. When the second use is considered, results show that the proposed method provides accurate statistical predictions of the time‐ and well‐averaged errors.

8.
9.
Penalized regression methods that perform simultaneous model selection and estimation are ubiquitous in statistical modeling. The use of such methods is often unavoidable as manual inspection of all possible models quickly becomes intractable when there are more than a handful of predictors. However, automated methods usually fail to incorporate domain knowledge, exploratory analyses, or other factors that might guide a more interactive model-building approach. A hybrid approach is to use penalized regression to identify a set of candidate models and then to use interactive model-building to examine this candidate set more closely. To identify a set of candidate models, we derive point and interval estimators of the probability that each model along a solution path will minimize a given model selection criterion, for example, the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), conditional on the observed solution path. Then models with a high probability of selection are considered for further examination. Thus, the proposed methodology attempts to strike a balance between algorithmic modeling approaches that are computationally efficient but fail to incorporate expert knowledge, and interactive modeling approaches that are labor intensive but informed by experience, intuition, and domain knowledge. Supplementary materials for this article are available online.

10.
Lung cancer is a leading cause of cancer‐related death worldwide. The early diagnosis of cancer has been demonstrated to be greatly helpful for curing the disease effectively. Microarray technology provides a promising approach of exploiting gene profiles for cancer diagnosis. In this study, the authors propose a gene expression programming (GEP)‐based model to predict lung cancer from microarray data. The authors use two gene selection methods to extract the significant lung cancer related genes, and accordingly propose different GEP‐based prediction models. Prediction performance evaluations and comparisons between the authors’ GEP models and three representative machine learning methods, support vector machine, multi‐layer perceptron and radial basis function neural network, were conducted thoroughly on real microarray lung cancer datasets. Reliability was assessed by cross‐data set validation. The experimental results show that the GEP model using fewer feature genes outperformed the other models in terms of accuracy, sensitivity, specificity and area under the receiver operating characteristic curve. It is concluded that the GEP model is a better solution to lung cancer prediction problems.
Inspec keywords: lung, cancer, medical diagnostic computing, patient diagnosis, genetic algorithms, feature selection, learning (artificial intelligence), support vector machines, multilayer perceptrons, radial basis function networks, reliability, sensitivity analysis
Other keywords: lung cancer prediction, cancer‐related death, cancer diagnosis, gene profiles, gene expression programming‐based model, gene selection, GEP‐based prediction models, prediction performance evaluations, representative machine learning methods, support vector machine, multilayer perceptron, radial basis function neural network, real microarray lung cancer datasets, cross‐data set validation, reliability, receiver operating characteristic curve

11.
Comparisons of prediction models from the new augmented classical least squares (ACLS) and partial least squares (PLS) multivariate spectral analysis methods were conducted using simulated data containing deviations from the idealized model. The simulated data were based on pure spectral components derived from real near-infrared spectra of multicomponent dilute aqueous solutions. Simulated uncorrelated concentration errors, uncorrelated and correlated spectral noise, and nonlinear spectral responses were included to evaluate the methods on situations representative of experimental data. The statistical significance of differences in prediction ability was evaluated using the Wilcoxon signed rank test. The prediction differences were found to be dependent on the type of noise added, the numbers of calibration samples, and the component being predicted. For analyses applied to simulated spectra with noise-free nonlinear response, PLS was shown to be statistically superior to ACLS for most of the cases. With added uncorrelated spectral noise, both methods performed comparably. Using 50 calibration samples with simulated correlated spectral noise, PLS showed an advantage in 3 out of 9 cases, but the advantage dropped to 1 out of 9 cases with 25 calibration samples. For cases with different noise distributions between calibration and validation, ACLS predictions were statistically better than PLS for two of the four components. Also, when experimentally derived correlated spectral error was added, ACLS gave better predictions that were statistically significant in 15 out of 24 cases simulated. On data sets with nonuniform noise, neither method was statistically better, although ACLS usually had smaller standard errors of prediction (SEPs). The varying results emphasize the need to use realistic simulations when making comparisons between various multivariate calibration methods. 
Even when the differences between the standard errors of prediction were statistically significant, in most cases the differences in SEP were small. This study demonstrated that unlike CLS, ACLS is competitive with PLS in modeling nonlinearities in spectra without knowledge of all the component concentrations. This competitiveness is important when maintaining and transferring models for system drift, spectrometer differences, and unmodeled components, since ACLS models can be rapidly updated during prediction when used in conjunction with the prediction augmented classical least squares (PACLS) method, while PLS requires full recalibration.

12.
A class of multivariate calibration methods called augmented classical least squares (ACLS) has been proposed which combines an explicit linear additive model with the predictive power of inverse models, such as principal component regression (PCR) and partial least squares (PLS). Because of its use of the explicit linear additive model, ACLS provides an interesting framework to incorporate different sources of prior information, such as measured pure component spectra, in the model. In this study, the predictive power of ACLS models incorporating different amounts of prior information has been compared to that of PCR and PLS using two examples, a designed experiment and one with biological samples. In both cases, the ACLS models showed predictive power comparable to PLS under idealized validation conditions. When a different interferent structure was present in the validation samples, the predictive power of the inverse models (PCR and PLS) dramatically decreased, with an increase in root-mean-squared error of prediction by a factor of 3.5 for the first example and a factor of 2 in the second example. The incorporation of prior information in the ACLS framework was found to considerably reduce or even completely remove these dramatic effects, especially when the pure component contributions for the interferents were taken into account.

13.
Recent work has shown that ridge regression (RR) is Pareto to partial least squares (PLS) and principal component regression (PCR) when the variance indicator, the Euclidean norm of the regression coefficients, ||p||, is plotted against the bias indicator, the root mean square error of calibration (RMSEC). Simplex optimization demonstrates that RR is Pareto for several other spectral data sets when ||p|| is used with RMSEC and the root mean square error of evaluation (RMSEE) as optimization criteria. From this investigation, it was observed that while RR is Pareto optimal, PLS and PCR harmonious models are nearly equivalent to harmonious RR models. Additionally, it was found that RR is Pareto robust, i.e., models formed at one temperature were then used to predict samples at another temperature. Wavelength selection is commonly performed to improve analysis results such that bias indicators RMSEC, RMSEE, root mean square error of validation, or root mean square error of cross-validation decrease using a subset of wavelengths. Just as critical to an analysis of selected wavelengths is an assessment of variance. Using wavelengths deemed optimal in a previous study, this paper reports on the variance/bias tradeoff. An approach that forms the Pareto model with a Pareto wavelength subset is suggested.
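The trade-off between the coefficient norm (variance indicator) and RMSEC (bias indicator) can be seen in a deliberately small sketch. It assumes a single predictor, no intercept, and invented data, so the norm reduces to an absolute value; the function name `ridge_path` and the penalty values are hypothetical and do not reproduce the paper's spectral data sets or its simplex optimization.

```python
import math


def ridge_path(x, y, penalties):
    """For each ridge penalty, return (penalty, coefficient norm, RMSEC).

    One-predictor, no-intercept ridge regression: b = sum(x*y) / (sum(x*x) + penalty).
    Assumes x is not all zeros.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    path = []
    for lam in penalties:
        b = sxy / (sxx + lam)
        # RMSEC: root mean square error on the calibration samples themselves.
        rmsec = math.sqrt(sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / len(y))
        path.append((lam, abs(b), rmsec))
    return path


# Invented calibration data with an exact slope of 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
path = ridge_path(x, y, penalties=[0.0, 14.0, 42.0])
```

As the penalty grows, the coefficient norm shrinks while RMSEC rises, which is the pair of competing criteria that a plot of norm against RMSEC displays.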

14.
The limits of quantitative multivariate assays for the analysis of extra virgin olive oil samples from various Greek sites adulterated by sunflower oil have been evaluated based on their Fourier transform (FT) Raman spectra. Different strategies for wavelength selection were tested for calculating optimal partial least squares (PLS) models. Compared to the full spectrum methods previously applied, the optimum standard error of prediction (SEP) for the sunflower oil concentrations in spiked olive oil samples could be significantly reduced. One efficient approach (PMMS, pair-wise minima and maxima selection) used a special variable selection strategy based on a pair-wise consideration of significant respective minima and maxima of PLS regression vectors, calculated for broad spectral intervals and a low number of PLS factors. PMMS provided robust calibration models with a small number of variables. On the other hand, the Tabu search strategy recently published (search process guided by restrictions leading to Tabu list) achieved lower SEP values but at the cost of extensive computing time when searching for a global minimum and less robust calibration models. Robustness was tested by using packages of ten and twenty randomly selected samples within cross-validation for calculating independent prediction values. The best SEP values for one year's harvest with a total number of 66 Cretan samples were obtained by such spectral variable optimized PLS calibration models using leave-20-out cross-validation (values between 0.5 and 0.7% by weight). For the more complex population of olive oil samples from all over Greece (total number of 92 samples), results were between 0.7 and 0.9% by weight with a cross-validation sample package size of 20. Notably, the calibration method with Tabu variable selection has been shown to be a valid chemometric approach by which a single model can be applied with a low SEP of 1.4% for olive oil samples across three different harvest years.

15.
A new wavelength interval selection procedure, moving window partial least-squares regression (MWPLSR), is proposed for multicomponent spectral analysis. This procedure builds a series of PLS models in a window that moves over the whole spectral region and then locates useful spectral intervals in terms of the least complexity of PLS models reaching a desired error level. Based on a proposed theory demonstrating the necessity of wavelength selection, it is shown that MWPLSR provides a viable approach to eliminate the extra variability generated by non-composition-related factors such as the perturbations in experimental conditions and physical properties of samples. A salient advantage of MWPLSR is that the calibration model is very stable against the interference from non-composition-related factors. Moreover, the selection of spectral intervals in terms of the least model complexity enables the reduction of the size of a calibration sample set in calibration modeling. Two strategies are suggested for coupling the MWPLSR procedure with PLS for multicomponent spectral analysis: One is the inclusion of all selected intervals to develop a PLS calibration model, and the other is the combination of the PLS models built separately in each interval. The combination of multiple PLS models offers a novel potential tool for improving the performance of individual models. The proposed procedures are evaluated using two open-path Fourier transform infrared data sets and one near-infrared data set, each having different noise characteristics. The results reveal that the proposed procedures are very promising for vibrational spectroscopy-based multicomponent analyses and give much better prediction than the full-spectrum PLS modeling.
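The moving-window scan itself can be sketched in a few lines. This is only an illustration of the windowing idea under strong simplifications: instead of PLS models of varying complexity, each window is scored by a straight-line fit of the response on the window's mean intensity, and the window with the lowest calibration error is reported. The function names and the synthetic spectra are hypothetical.

```python
import math


def window_rmse(spectra, y, start, width):
    """Calibration RMSE of a straight-line fit of y on the mean intensity in one window.

    Stand-in for a PLS model restricted to channels start .. start + width - 1.
    """
    x = [sum(row[start:start + width]) / width for row in spectra]
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0.0:
        # Constant window signal: the best fit is the mean of y.
        return math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return math.sqrt(
        sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / n
    )


def moving_window_scan(spectra, y, width):
    """Slide a fixed-width window over all channels and score each position."""
    n_channels = len(spectra[0])
    return [
        (start, window_rmse(spectra, y, start, width))
        for start in range(n_channels - width + 1)
    ]


# Synthetic spectra (rows = samples, columns = channels); only channels 2 and 3
# track the concentration y, the other channels are unrelated noise.
y = [1.0, 2.0, 3.0, 4.0]
spectra = [
    [0.3, 0.1, 1.0, 1.0, 0.2, 0.4],
    [0.1, 0.4, 2.0, 2.0, 0.3, 0.1],
    [0.4, 0.2, 3.0, 3.0, 0.1, 0.3],
    [0.2, 0.3, 4.0, 4.0, 0.4, 0.2],
]
scan = moving_window_scan(spectra, y, width=2)
best_start, best_rmse = min(scan, key=lambda item: item[1])
```

Here the scan picks the window starting at channel 2, the only one that covers both informative channels. A faithful version would replace `window_rmse` with PLS fits over a range of factor counts and record the smallest count that reaches the desired error level.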

16.
The need for automated quality surveillance of liquid hydrocarbon fuels has driven the development of rapid fuel property modeling from spectroscopic sensor data. The correlation of near-infrared (NIR) and Raman spectroscopic data with jet and diesel fuel properties can be improved by the deliberate selection of continuous wavelength sub-ranges. An automatic wavelength selection strategy would allow for the unsupervised construction of partial least squares (PLS) regression models of increased predictive utility when supervised model construction and maintenance is not feasible. Changeable size moving window partial least squares (CSMWPLS) is one of the most thorough operations suited for this task. Unfortunately, the necessarily large number of PLS model constructions required by an automated version of this procedure limits the evaluation of the predictive ability of the resulting models through full cross-validation results. Presented here is a novel restricted version of the CSMWPLS algorithm in which the initial spectral range selection is accomplished through multiple interval PLS (iPLS) analyses, where analysis windows for the refinement step no longer move, and size changes are limited to a series of symmetric attenuations. It is shown that the proposed algorithm can provide significant PLS model improvements during the course of a fully automated analysis of jet and diesel fuel spectra in less time than an automated CSMWPLS algorithm.

17.
This work was aimed at determining whether artificial neural networks (ANN), implemented with backpropagation algorithms at default settings, can generate better predictive models than multiple linear regression (MLR) analysis. The study was based on timolol-loaded liposomes. Causal factors were used as tutorial data for the ANN and were fed into the computer program. The number of training cycles was identified in order to optimize the performance of the ANN. The optimization was performed by minimizing the error between the predicted and real response values in the training step. The results showed that training was stopped at 10,000 training cycles with 80% of the pattern values, because at this point the ANN generalizes better. Minimum validation error was achieved at 12 hidden neurons in a single layer. MLR has great prediction ability, with errors between predicted and real values lower than 1% in some of the parameters evaluated. Thus, the performance of this model was compared to that of the MLR using a factorial design. Optimal formulations were identified by minimizing the distance between measured and theoretical parameters, by estimating the prediction errors. Results indicate that the ANN shows much better predictive ability than the MLR model. These findings demonstrate the increased efficiency of the combination of ANN and design of experiments, compared to the conventional MLR modeling techniques.

18.
19.
It is especially significant for a manufacturing company to select a proper maintenance policy because maintenance affects not only economy, reliability and availability but also personnel safety. This article reports on research in backlash error data interpretation and compensation for intelligent predictive maintenance in machine centers based on artificial neural networks (ANNs). The backlash error, measurement system and prediction methods are analyzed in detail. The result indicates that it is possible to predict and compensate for the backlash error in both forward and backward directions in machine centers.

20.
Spectro-fluorescence signature (SFS) of water samples contains information that may be used to quantify dissolved organic carbon (DOC) if combined with multivariate analyses. A model was built through SFS and partial least squares (PLS) regression. The SFSs of 219 samples of natural water along the Raritan River and Millstone River watersheds located in central New Jersey, and their corresponding DOC concentrations, were used to build the model. Calibration, full cross-validation, and prediction performances of various models were statistically compared before optimal model selection. The final selected model, tested on the Passaic River watershed in northern New Jersey, provided a bias of 0.028 mg/l and a root mean squared error of prediction (RMSEP) of 0.35 mg/l. Linked to PLS, SFS can be a high-quality and cost-effective method to perform on-line rapid DOC measurements.
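Bias and RMSEP are simple summaries of prediction residuals, and a minimal sketch shows how they are commonly computed. The sign convention (predicted minus reference) and the example values are assumptions made for illustration; the abstract does not state the exact definitions used, and the numbers below are not the study's data.

```python
import math


def bias_and_rmsep(reference, predicted):
    """Return (bias, RMSEP) for paired reference and predicted values.

    bias  = mean of (predicted - reference)
    RMSEP = square root of the mean squared (predicted - reference)
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [p - r for r, p in zip(reference, predicted)]
    n = len(residuals)
    bias = sum(residuals) / n
    rmsep = math.sqrt(sum(e * e for e in residuals) / n)
    return bias, rmsep


# Hypothetical DOC concentrations in mg/l, invented for illustration only.
reference = [2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 2.5, 4.5, 4.5]
bias, rmsep = bias_and_rmsep(reference, predicted)
```

In this toy case the over- and under-predictions cancel, so the bias is zero while the RMSEP is 0.5 mg/l, which is why the two statistics are usually reported together.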
