Similar Documents (20 results)
1.
Dimensionality reduction is an important technique for preprocessing high-dimensional data. Because only one side of the original data is represented in a low-dimensional subspace, useful information may be lost. In the present study, novel dimensionality reduction methods were developed that are suitable for metabolome data in which the observations vary with time. Metabolomics deals with this type of data, which are often obtained in microorganism fermentation processes. However, no dimensionality reduction method that actively utilizes such information in the original data has been reported to date. The ordinary dimensionality reduction methods of principal component analysis (PCA), partial least squares (PLS), orthonormalized PLS (OPLS), and regularized Fisher discriminant analysis (RFDA) were extended by introducing differential penalties on the latent variables in each class. A nonlinear extension of this approach using kernel methods was also proposed, in the form of kernel-smoothed PCA, PLS, OPLS, and FDA. Since all of these methods are formulated as generalized eigenvalue problems, the solutions can be computed easily. These methods were then applied to intracellular metabolite data of a xylose-fermenting yeast in ethanol fermentation. Visualization in the low-dimensional subspace suggests that smoothed PCA successfully preserves information about the time course of observations during fermentation, and that RFDA can achieve high separation among different strains.
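As a rough illustration of how such a smoothness-penalized PCA reduces to a generalized eigenvalue problem, the sketch below penalizes the roughness of the score time course. The toy data, the penalty weight `lam`, and the exact placement of the penalty matrix are illustrative assumptions, not the authors' published formulation.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Toy time-course data: 50 time-ordered observations, 8 variables.
t_axis = np.linspace(0, 1, 50)
X = np.outer(np.sin(2 * np.pi * t_axis), rng.normal(size=8))
X += 0.3 * rng.normal(size=X.shape)
X -= X.mean(axis=0)  # column-center

# First-difference operator along the time-ordered samples.
D = np.diff(np.eye(len(X)), axis=0)

lam = 5.0  # smoothness penalty weight (a tuning assumption)

# Generalized eigenvalue problem: maximize the variance of the scores
# t = X w while penalizing their roughness along the time axis:
#   (X'X) w = mu * (I + lam * X'D'D X) w
A = X.T @ X
B = np.eye(X.shape[1]) + lam * (X.T @ D.T @ D @ X)
mu, W = eigh(A, B)       # eigenvalues in ascending order
w = W[:, -1]             # leading smoothed component
scores = X @ w           # smoothed score time course
```

Because the problem is a standard generalized symmetric eigenproblem, off-the-shelf solvers such as `scipy.linalg.eigh` return the full solution directly, which is what makes this family of extensions computationally cheap.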

2.
This paper reports the analysis of a multiblock environmental dataset consisting of 176 samples collected in Islamabad, Pakistan between February 2006 and August 2007. The concentrations of 32 elements in each sample were measured using Proton Induced X-ray Emission, plus black carbon, for both coarse and fine particulate matter. Six meteorological parameters were also recorded, namely maximum and minimum daily temperature, humidity, rainfall, wind speed, and pressure. The data were explored using Principal Components Analysis (PCA), Partial Least Squares (PLS), Consensus PCA, Multiblock PLS, the Mantel test, Procrustes analysis, and the RV coefficient. Seasonal trends can be identified and interpreted. Based on PLS models, the elemental composition of the airborne particulate matter (APM) can be used to predict the meteorological parameters. The results from block similarity measures show that fine APM resembles the meteorological parameters more closely than coarse APM. Multiblock PLS models, however, are not better than classical PLS regression. This paper also demonstrates the potential of the multiblock approach in environmental monitoring.

3.
4.
Metabolomics studies generate increasingly complex data tables, which are hard to summarize and visualize without appropriate tools. The use of chemometrics tools, e.g., principal component analysis (PCA), partial least squares (PLS), and orthogonal PLS (OPLS), is therefore of great importance, as these include efficient, validated, and robust methods for modeling information-rich chemical and biological data. Here the S-plot is proposed as a tool for visualization and interpretation of multivariate classification models, e.g., OPLS discriminant analysis, with two or more classes. The S-plot visualizes both the covariance and the correlation between the metabolites and the modeled class designation. Thereby the S-plot helps identify statistically and potentially biochemically significant metabolites, based both on their contributions to the model and on their reliability. An extension of the S-plot, the SUS-plot (shared and unique structure), is applied to compare the outcomes of multiple classification models against a common reference, e.g., a control. The example used is a gas chromatography–mass spectrometry based metabolomics study in plant biology in which two different transgenic poplar lines are compared to wild type. By using OPLS, improved visualization and discrimination of interesting metabolites could be demonstrated.
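A minimal sketch of the S-plot coordinates (covariance versus correlation of each variable with a predictive score) is given below. The toy two-class data and the use of a first PLS-style score in place of the OPLS-DA predictive score are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class "metabolite" table: 40 samples x 12 variables;
# variables 0-2 carry the class difference.
y = np.repeat([0.0, 1.0], 20)
X = rng.normal(size=(40, 12))
X[:, :3] += y[:, None] * 1.5
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS weight/score as a stand-in for the OPLS-DA predictive
# component (an assumption made for brevity).
w = Xc.T @ yc
w /= np.linalg.norm(w)
t = Xc @ w

# S-plot coordinates per variable:
#   p_cov  = cov(t, x_i)   -> magnitude (model contribution)
#   p_corr = corr(t, x_i)  -> reliability
n = len(t)
p_cov = (t @ Xc) / (n - 1)
p_corr = p_cov / (t.std(ddof=1) * Xc.std(axis=0, ddof=1))
```

Plotting `p_corr` against `p_cov` reproduces the S-shape the paper describes: variables in the far corners combine a large covariance with a high correlation and are the candidate biomarkers.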

5.
This paper describes an adaptation of Ergon's 2PLS approach (Compression into two-component PLS factorizations, J. Chemom. 2003; 17: 303–312) to represent a single predictor regression model in terms of a two-factor latent vector model. The purpose of this reduction is to aid model interpretation and diagnostics. Non-orthogonal score vectors are produced from two orthonormal loading vectors: one identical to the first PLS loading vector, and a second built from the regression vector. Using an invertible matrix, the factorization can alternatively be represented by two orthogonal score vectors, one of which is proportional to the centred predictions. An auxiliary set of loadings is also calculated; it captures a different model space but is provided because its associated residuals have useful properties. Identities connecting the two model spaces are given. The latent vector regression coefficients are not always least-squares estimates but can be represented as the solution to a two-term generalized ridge regression, and the consequences of this are addressed. The utility of TinyLVR is demonstrated with example models built using stepwise variate selection and ridge regression.

6.
Principal component regression (PCR) has been widely used for soft sensor modeling and quality prediction over the last several decades, and it remains popular in both academic research and industrial applications. However, most PCR models are determined by a projection method, which lacks a probabilistic interpretation of the process data. In fact, due to inevitable process noise, most process data are inherently random variables. Several probabilistic PCA methods have been proposed in recent years. Compared to deterministic modeling, a probabilistic model is more appropriate for characterizing the behavior of the random variables in the process. This paper first presents a probabilistic derivation of the PCR model (PPCR) and then extends it to a mixture form (MPPCR). For quality prediction in processes with multiple operating modes, a mixture probabilistic soft sensor is developed based on the MPPCR model. At the same time, the operating mode can be identified by the proposed soft sensor. To evaluate the performance of the MPPCR model, a numerical example and a benchmark simulation case study of the Tennessee Eastman process are provided. Different methods, including global, local, and multi-local PCR models, were compared with the proposed model, and the proposed MPPCR model performed best among them.
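The deterministic PCR baseline that the paper recasts probabilistically can be sketched as follows, on assumed toy process data with two latent factors: project onto the leading principal components, then regress the quality variable on the scores.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy process data: 100 samples, 6 correlated variables driven by
# 2 underlying factors, plus a noisy quality variable y.
Z = rng.normal(size=(100, 2))
X = Z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(100, 6))
beta_true = rng.normal(size=2)
y = Z @ beta_true + 0.1 * rng.normal(size=100)

# Deterministic PCR: PCA projection, then least squares on the scores.
Xc = X - X.mean(axis=0)
k = 2                                   # number of retained components
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = Xc @ Vt[:k].T                       # PCA scores
b, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
y_hat = T @ b + y.mean()                # soft-sensor prediction
```

The probabilistic versions replace this projection step with a latent-variable noise model fitted by maximum likelihood, which is what allows the mixture (multi-mode) extension described in the abstract.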

7.
As systems biology develops, various types of high-throughput omics data are rapidly becoming available. An increasing challenge is to analyze such massive data, interpret the results, and validate the findings. Data analysis for most omics techniques is still at an immature stage. The dimensionality of the data tables alone calls for new ways to reveal structure in the data without cognitive overload and excessive false discovery rates. Multi-block methods have been developed and adapted to find common variation patterns in data and to depict these findings in graphical displays, while providing tools that enhance the interpretation of the outcomes. In particular, multi-block methods based on latent variables are powerful tools for studying block and global variation patterns, e.g. by inspecting block and global score plots. These methods can be used to obtain a graphical overview of sample and variable variation patterns in an efficient way. However, visual detection of patterns may be subjective, and therefore validation tools are needed. In this paper, tools for validating visually identified patterns in multi-block results are presented. Cross-validated estimates of the Root Mean Square Error (RMSE) for block results are introduced for estimating the number of relevant PCs in Consensus Principal Component Analysis (CPCA) models. Furthermore, important variables are identified by approximate t-tests based on Procrustes-corrected jackknifing. For assessing the stability of score patterns, block stability plots are introduced. Outliers can be revealed graphically at both block and global level by stability plots.

8.
In this paper, a new method to approximate a data set by another data set with a constrained covariance matrix is proposed. The method is termed Approximation of a DIstribution for a given COVariance (ADICOV). The approximation can be solved in any projection subspace, including those of Principal Component Analysis (PCA) and Partial Least Squares (PLS). Given the direct relationship between covariance matrices and projection models, ADICOV is useful for testing whether a data set satisfies the covariance structure of a projection model. This idea is broadly applicable in chemometrics. ADICOV can also be used to simulate data with a specific covariance structure and data distribution. Some applications are illustrated in an industrial case study.
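The core idea, reshaping one data set so that it exhibits a prescribed covariance, can be sketched with a generic whitening-and-recoloring transform. This is only the underlying concept, not ADICOV's actual constrained approximation in a projection subspace; the data and target covariance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Source data and a target covariance structure to impose.
X = rng.normal(size=(200, 3))
C_target = np.array([[1.0, 0.8, 0.0],
                     [0.8, 1.0, 0.0],
                     [0.0, 0.0, 0.5]])

Xc = X - X.mean(axis=0)
C_x = np.cov(Xc, rowvar=False)

# Symmetric inverse square root of the sample covariance (whitening).
vals, vecs = np.linalg.eigh(C_x)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T

# Symmetric square root of the target covariance (recoloring).
vals_t, vecs_t = np.linalg.eigh(C_target)
L = vecs_t @ np.diag(np.sqrt(vals_t)) @ vecs_t.T

Y = Xc @ W @ L   # data set whose sample covariance equals C_target
```

Because `Y` keeps a one-to-one correspondence with the original rows, it approximates the original distribution while satisfying the covariance constraint exactly, which is the testing-and-simulation use the abstract describes.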

9.
Given the relevance of principal component analysis (PCA) to the treatment of spectrometric data, we evaluated the potential and limitations of this statistical approach for harvesting information from large sets of X-ray photoelectron spectroscopy (XPS) spectra. Examples highlight the contribution of PCA to data treatment by comparing the results of this analysis with those obtained by the usual XPS quantification methods. PCA was shown to improve the identification of chemical shifts of interest and to reveal correlations between peak components. First attempts to use the method led to poor results, which mainly reflected the distance between series of samples analyzed at different times. To weaken the effect of such variations of minor interest, a data normalization strategy was developed and tested. A second issue was encountered with spectra suffering from an even slightly inaccurate binding energy scale correction: minor shifts of the energy channels cause the PCA to be performed on incorrect variables and consequently yield misleading information. To improve the energy scale correction and to speed up this step of data pretreatment, a data processing method based on PCA was used. Finally, the overlap of different sources of variation was studied. Since the intensity of a given energy channel consists of electrons from several origins, having suffered inelastic collisions (background) or not (peaks), the PCA approach cannot compare them separately, which may lead to confusion or loss of information. By extracting the peaks from the background and considering them as new variables, the effect of the elemental composition could be taken into account in the case of spectra with very different backgrounds. In conclusion, PCA is a very useful diagnostic tool for the interpretation of XPS spectra, but it requires careful and appropriate data pretreatment.

10.
赵威  王伟 《工程力学》2013,30(2):272-277
For the uniform design response surface method widely used in multi-dimensional reliability problems, the limitations of fitting the regression model to sample data by ordinary least squares are analyzed, and an improved method is proposed on the basis of existing approaches. The method combines uniform design with partial least squares regression to fit the response surface model and compute the structural failure probability, effectively resolving the problems of multicollinearity among variables and of building a regression model under small-sample conditions. Numerical examples verify the applicability of the method; in particular, for high-dimensional reliability problems the results are more accurate than those of a least-squares-fitted response surface.
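A minimal sketch of why PLS regression copes with multicollinearity and small samples where ordinary least squares struggles is given below; the toy collinear design and the two-component choice are illustrative assumptions, not the paper's response-surface setting.

```python
import numpy as np

rng = np.random.default_rng(9)

# Small-sample, highly collinear design: 12 runs, 6 variables that are
# noisy copies of 2 underlying factors (the regime the paper targets).
F = rng.normal(size=(12, 2))
X = np.hstack([F + 0.05 * rng.normal(size=(12, 2)) for _ in range(3)])
y = F @ np.array([1.0, -2.0]) + 0.05 * rng.normal(size=12)

def pls1_fit(X, y, n_comp):
    """Fit PLS1 by NIPALS and return the fitted values."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = []
    for _ in range(n_comp):
        w = Xc.T @ yc               # weight vector from X-y covariance
        w /= np.linalg.norm(w)
        t = Xc @ w                  # score vector
        p = Xc.T @ t / (t @ t)      # loading vector
        Xc = Xc - np.outer(t, p)    # deflate X
        scores.append(t)
    T = np.column_stack(scores)
    q, *_ = np.linalg.lstsq(T, yc, rcond=None)
    return T @ q + y.mean()

y_fit = pls1_fit(X, y, 2)
```

Because the regression is performed on a few mutually orthogonal score vectors rather than on the six nearly collinear columns, the fit stays well-conditioned even with only 12 runs.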

11.
An analytical technique based on a kernel matrix representation is demonstrated to provide further chemically meaningful insight into partial least squares (PLS) regression models. The kernel matrix condenses essential information about the scores derived from PLS or principal component analysis (PCA), making it possible to establish a proper interpretation of the scores. A PLS model for the total nitrogen (TN) content of several Thai fish sauces is built from a set of near-infrared (NIR) transmittance spectra of the fish sauce samples. The kernel analysis of the scores effectively reveals that the variation of the spectral feature induced by the change in protein content is substantially associated with the total water content and protein hydration. Kernel analysis is also carried out on a set of time-dependent infrared (IR) spectra representing the transient evaporation of ethanol from a binary mixture of ethanol and oleic acid. A PLS model to predict the elapsed time is built from the IR spectra, and the kernel matrix is derived from the scores. Detailed analysis of the kernel matrix provides penetrating insight into the interaction between the ethanol and the oleic acid.
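The kernel (Gram) matrix condensed from the scores can be sketched as follows on assumed toy spectra; with all components retained it reproduces the linear kernel of the centered spectra, which is the sense in which it "condenses" the score information.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy spectral table: 15 "spectra" x 40 wavelengths.
X = rng.normal(size=(15, 40))
Xc = X - X.mean(axis=0)

# PCA scores via SVD of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                     # full set of PCA scores

# Kernel matrix built from the scores: K[i, j] compares samples i and j
# through their score vectors.
K = T @ T.T
```

Each entry `K[i, j]` is an inner product between two samples in score space, so inspecting rows of `K` shows which samples share the same spectral variation, the kind of sample-to-sample interpretation the abstract exploits.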

12.
A series of herbicidal materials, N-phenylacetamides (NPAs), has been studied for their Quantitative Structure–Activity Relationships (QSAR). The molecular structures as well as the activity data were taken from the literature [O. Kirino, C. Takayama, A. Mine, Quantitative structure–activity relationships of herbicidal N-(1-methyl-1-phenylethyl)phenylacetamides, Journal of Pesticide Science 11 (1986) 611–617]. The independent variables used to describe the structure of the compounds consisted of seven physicochemical properties, including the mode of molecular connection, a steric factor, a hydrophobic parameter, etc. Fifty different compounds constitute the sample set, which is divided into two groups: 47 form a training set and the remaining three a checking set. Through a systematic study using classic multivariate analyses such as Multiple Linear Regression (MLR), Principal Component Analysis (PCA), and Partial Least Squares (PLS) regression, several QSAR models were established. To better depict the nonlinear nature of the problem, multi-layered feed-forward (MLF) neural networks (NNs) were employed. The results indicated that the conventional multivariate analyses gave larger prediction errors, while the NN method showed better accuracy in both self-checking and prediction-checking. The error variance of the predictions made by the NNs was the smallest among all the methods tested, only around half that of the others.

13.
The aim of this study was to develop a new strategy for choosing excipients in tablet formulation. Multivariate techniques such as principal component analysis (PCA) and experimental design were combined in a multivariate design for screening experiments. Of a total of 87 investigated excipients, the initial screening experiments covered 5 lubricants, 9 binders, and 5 disintegrants, and 35 experiments were carried out. Although a reduced factorial design was used, the resulting PCA and partial least squares (PLS) models offered good insight into the possibilities of tablet formulation. They also offered solutions to the problems, clearly gave directions toward optimum formulations, and suggested several alternatives for achieving quality formulations. Additional experiments conducted to validate and verify the usefulness of the model were successful, resulting in several tablets of good quality. The conclusion is that a multivariate strategy in tablet formulation is efficient and can drastically reduce the number of experiments. Combining multivariate characterization, physicochemical properties, experimental design, multivariate design, and PLS would lead to an evolutionary strategy for tablet formulation. Since it includes a learning strategy that continuously incorporates data for new compounds and from conducted experiments, this would be an even more powerful tool than expert systems.

14.
Natural organic matter (NOM) from nine different water sources located in the southern part of Norway selected for the “NOM typing project” was characterised by using near infrared spectroscopy and multivariate data analysis.

The near infrared profiles of these NOM samples were corrected for the multiple scattering effect and differentiated twice before being subjected to multivariate data analysis. The preprocessed profiles were first subjected to multivariate calibration by the partial least squares (PLS) technique against previously determined values of four different biopolymer inputs of the NOM (carbohydrates, N-acetyl amino sugars, proteins, and polyhydroxy aromatics) as dependent variables. The profiles were then classified using principal component analysis (PCA).

The PLS calibration models obtained demonstrate that the biopolymer input of the NOM samples can be predicted with acceptable precision.

The PCA reveals that the samples fall into three different groups. This classification agrees with earlier classifications carried out using variables determined by alternative, expensive and time-consuming analytical techniques.
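The preprocessing pipeline described in this entry (scatter correction followed by double differentiation) might be sketched as below; the simple regression-based multiplicative scatter correction and plain double differencing are stand-ins for whichever exact corrections the authors used, and the toy spectra are assumed.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy NIR-like spectra: 10 samples x 100 wavelengths, with random
# multiplicative gain and additive offset on a shared profile.
base = np.sin(np.linspace(0, 4 * np.pi, 100))
gains = rng.uniform(0.8, 1.2, size=(10, 1))
offsets = rng.uniform(-0.2, 0.2, size=(10, 1))
S = gains * base + offsets + 0.01 * rng.normal(size=(10, 100))

# Multiplicative scatter correction (MSC): regress each spectrum on the
# mean spectrum, then undo the fitted gain a and offset b.
ref = S.mean(axis=0)
msc = np.empty_like(S)
for i, x in enumerate(S):
    a, b = np.polyfit(ref, x, 1)
    msc[i] = (x - b) / a

# Second derivative approximated by double differencing.
d2 = np.diff(msc, n=2, axis=1)
```

After MSC the sample-to-sample scatter variation collapses, and the second derivative suppresses any remaining baseline so that the PLS calibration sees chemical rather than physical variation.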


15.

16.
Industry is demanding quality control systems that enable not only certified safety of an end product but also a secure and efficient production system. Fast and accurate technologies are therefore required for developing real-time decision systems. Sensors based on Near-Infrared Spectroscopy (NIRS), together with chemometric models, have been studied for on-line quality control as a Process Analytical Technology (PAT) tool in several industries. A critical issue is the development of robust and sufficiently accurate mathematical models that can accommodate hundreds of very heterogeneous samples representing the large natural variability of the process and product; this holds especially for agro-food production. This paper evaluates the performance of linear (PLS) and non-linear regression algorithms (LOCAL and Locally Weighted Regression, LWR), plus a new local approach for predicting the ingredient composition of compound feeds, called the Local Central Algorithm (LCA). The comparison is based on complexity, accuracy, and the percentage of test set samples predicted. The new local modelling approach uses Principal Component Analysis (PCA) and the Mahalanobis Distance (MD) to select a training set, and calculates the final prediction from a central tendency statistic, such as the mean, of the local neighbours of the unknown sample. The results show that the proposed local strategy enables all the unknown samples in the test set to be predicted within seconds and performs comparably to LWR, although its RMSEP was somewhat higher than that of LWR or LOCAL. However, this approach produced smaller prediction errors than the other methods for less common ingredients that are not well represented even by a large number of training samples.
This finding could be relevant for the start-up phase of implementing NIRS sensors in the feed industry, when the libraries built on-line contain data from only a limited production period.
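The central idea of such a local approach (select neighbours by Mahalanobis distance in PCA score space, then predict with a central-tendency statistic) might be sketched as follows. The toy library, the number of components, and the neighbourhood size are illustrative assumptions, not the published LCA settings.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy spectral library with a reference value y per sample.
X = rng.normal(size=(300, 20))
X[:, 0] *= 3.0                          # one dominant direction
y = X[:, 0] * 2.0 + 0.1 * rng.normal(size=300)

# PCA of the library.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k_pc, k_nn = 5, 25                      # components / neighbourhood size
T = Xc @ Vt[:k_pc].T                    # library scores

# Project an unknown sample into the same score space.
x_new = rng.normal(size=20)
t_new = (x_new - X.mean(axis=0)) @ Vt[:k_pc].T

# Mahalanobis-style distance: Euclidean distance on variance-scaled
# scores (PCA scores are uncorrelated, so per-component scaling suffices).
scale = T.std(axis=0, ddof=1)
d = np.linalg.norm((T - t_new) / scale, axis=1)
neighbours = np.argsort(d)[:k_nn]

# Central-tendency prediction: the mean y of the local neighbours.
y_pred = y[neighbours].mean()
```

Because the prediction is just a neighbourhood mean rather than a locally refitted regression, each unknown sample costs one projection and one sort, which matches the "prediction in seconds" behaviour reported in the abstract.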

17.
Imaging mass spectrometry (IMS) is a promising technology that allows detailed analysis of the spatial distributions of (bio)molecules in organic samples. In many current applications, IMS relies heavily on (semi)automated exploratory data analysis procedures to decompose the data into characteristic component spectra and corresponding abundance maps, visualizing spectral and spatial structure. The most commonly used techniques are principal component analysis (PCA) and independent component analysis (ICA); both operate in an unsupervised manner. However, their decomposition estimates usually feature negative counts and are not amenable to direct physical interpretation. We propose probabilistic latent semantic analysis (pLSA) for non-negative decomposition and the elucidation of interpretable component spectra and abundance maps. We compare this algorithm to PCA, ICA, and non-negative PARAFAC (parallel factor analysis) and show on simulated and real-world data that pLSA and non-negative PARAFAC are superior to PCA and ICA in terms of the complementarity of the resulting components and reconstruction accuracy. We further combine pLSA decomposition with a statistical complexity estimation scheme based on the Akaike information criterion (AIC) to automatically estimate the number of components present in a tissue sample data set, and show that this yields sensible complexity estimates.
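A compact EM sketch of the pLSA decomposition of a non-negative count matrix follows; the toy pixel-by-channel data, the fixed two-component rank, and the iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy IMS-like count matrix: 30 "pixels" x 20 "m/z channels",
# generated from 2 latent components.
A = rng.random((30, 2))
B = rng.random((2, 20))
V = rng.poisson(50 * A @ B).astype(float)

# pLSA model: V[d, w] ~ total * sum_z p(z) p(d|z) p(w|z)
K = 2
p_z = np.full(K, 1.0 / K)
p_dz = rng.random((30, K)); p_dz /= p_dz.sum(axis=0)
p_wz = rng.random((20, K)); p_wz /= p_wz.sum(axis=0)

for _ in range(200):
    # E-step: responsibilities p(z|d,w), shape (30, 20, K).
    joint = p_z * p_dz[:, None, :] * p_wz[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate the factors from count-weighted posteriors.
    weighted = V[:, :, None] * post
    nz = weighted.sum(axis=(0, 1))
    p_dz = weighted.sum(axis=1) / nz    # abundance maps p(d|z)
    p_wz = weighted.sum(axis=0) / nz    # component spectra p(w|z)
    p_z = nz / nz.sum()

# Non-negative reconstruction of the data.
V_hat = (p_z * p_dz[:, None, :] * p_wz[None, :, :]).sum(axis=2) * V.sum()
```

Unlike PCA or ICA, every factor here is a probability distribution, so the component spectra `p_wz` and abundance maps `p_dz` are non-negative by construction, which is the interpretability advantage the abstract claims.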

18.
This article aimed to model the effects of raw material properties and roller compactor operating parameters (OPs) on the properties of roller-compacted ribbons and granules with the aid of principal component analysis (PCA) and partial least squares (PLS) projection. A database of raw material properties was established through extensive physical and mechanical characterization of several microcrystalline cellulose (MCC) and lactose grades and their blends. A design of experiments (DoE) was used for ribbon production. PLS models constructed with only the OPs modeled the roller compaction (RC) responses poorly. Including the raw material properties markedly improved the goodness of fit (R2 = 0.897) and model predictability (Q2 = 0.72).

19.
Several analytical applications of spectroscopy are based on the estimation of a linear model linking laboratory values to spectral data. Among the various procedures, the following three methods have been used to tackle the high collinearity commonly observed with spectral data: principal component regression (PCR), partial least squares (PLS), and latent root regression (LRR). A collection of 99 near-infrared spectra, each comprising 351 data points, was used to compare the three methods. The dependent variable was the specific production of pelleting. The spectral collection was divided into 49 observations for calibration and 50 for validation. The main elements of comparison were the minimum error observed on the verification set, the number of regressors introduced into the models, and the stability of the errors around the minimum values. The minimum errors were 3.29, 3.13, and 3.07 for PCR, PLS, and LRR, respectively. LRR required a large number of regressors to reach the minimum error. Nevertheless, it gave very stable results, and the errors did not increase markedly when an arbitrarily large number of regressors was introduced into the LRR model.

20.
The purpose of this paper was to evaluate a multivariate strategy for handling time-dependent kinetic data during formulation development. Dissolution profiles were evaluated by the Weibull equation, multiple linear regression (MLR), and principal component analysis (PCA), alone and in combination; in addition, soft independent modeling of class analogy (SIMCA) was performed. Employing a typical kinetic model for solid formulations (here, Weibull) showed difficulties with model adaptation, resulting in increased model standard deviation and thereby failure to identify significant variables. In general, the selection of a kinetic model is crucial for finding the significant formulation variables. Describing the dissolution profile by MLR models of individual time points captured the dissolution rates as a function of the formulation variables with good precision, and establishing prediction models made it easy to evaluate effects on the entire dissolution profile. The use of PCA/MLR (PCR) reduced the influence of noise from single measurements in a kinetic profile, since the statistical parameters represent the profile without depending on a physicochemically modeled profile. The use of PCA reduced the eight time-point variables to two latent variables (principal components), simplifying the classification of formulations and new samples as well as avoiding unwanted effects of model non-linearities between the factors and responses (model error). The group membership of new samples was demonstrated by SIMCA.
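The compression of eight time-point dissolution variables into two principal components can be sketched on assumed toy profiles generated from two underlying factors (rate and extent):

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy dissolution data: 24 formulations x 8 time points (minutes),
# profiles driven by two factors: release rate and release extent.
t = np.array([5, 10, 15, 20, 30, 45, 60, 90], dtype=float)
rate = rng.uniform(0.02, 0.12, size=24)
extent = rng.uniform(70, 100, size=24)
profiles = extent[:, None] * (1 - np.exp(-rate[:, None] * t))
profiles += rng.normal(scale=1.0, size=profiles.shape)

# PCA: the eight time-point variables compress to two latent variables
# without assuming any physicochemical kinetic model.
Xc = profiles - profiles.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores2 = Xc @ Vt[:2].T
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
```

Each formulation is now a point in a two-dimensional score plot, so classifying formulations (or new samples, as with SIMCA) reduces to comparing positions in that plane rather than comparing full eight-point profiles.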
