Similar literature
Found 20 similar documents; search took 96 ms
1.
There is an increasing need to develop powerful techniques to improve biomedical pattern discovery and visualization. This paper presents an automated approach, based on hybrid self-adaptive neural networks, to pattern identification and visualization for biomolecular data. The methods are tested on two datasets: leukemia expression data and DNA splice-junction sequences. Several supervised and unsupervised models are implemented and compared. A comprehensive evaluation study of some of their intrinsic mechanisms is presented. The results suggest that these tools may be useful to support biological knowledge discovery based on advanced classification and visualization tasks.

2.
3.
Metabolomic analysis by liquid chromatography-high-resolution mass spectrometry results in data sets with thousands of features arising from metabolites, fragments, isotopes, and adducts. Here we describe a software package, Metabolomic Analysis and Visualization ENgine (MAVEN), designed for efficient interactive analysis of LC-MS data, including in the presence of isotope labeling. The software contains tools for all aspects of the data analysis process, from feature extraction to pathway-based graphical data display. To facilitate data validation, a machine learning algorithm automatically assesses peak quality. Users interact with raw data primarily in the form of extracted ion chromatograms, which are displayed with overlaid circles indicating peak quality, and bar graphs of peak intensities for both unlabeled and isotope-labeled metabolite forms. Click-based navigation leads to additional information, such as raw data for specific isotopic forms or for metabolites changing significantly between conditions. Fast data processing algorithms result in nearly delay-free browsing. Drop-down menus provide tools for the overlay of data onto pathway maps. These tools enable animating series of pathway graphs, e.g., to show propagation of labeled forms through a metabolic network. MAVEN is released under an open source license at http://maven.princeton.edu.

4.
Data mining (DM) can be defined as the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. Modelling is the crucial step where DM algorithms are applied in order to extract data patterns. In order for domain experts, who play significant roles in the DM process, to make the most efficient and effective use of DM tools, these tools must incorporate appropriate visualization to facilitate the process of modelling. Yet, unfortunately, study of how visualization should be designed, particularly what components should be included and how to present them, has been rather limited. This paper surveys the current state of the art in the application of visualization techniques to better comprehend and improve the decision tree modelling process in three modes: visualization of tree models, visualization of model evaluation and visual interactive tree construction. A number of issues that have been overlooked and areas that need to be improved are identified through reviewing a collection of related research and examining six current DM software packages in terms of their design of a few important features in each mode of the visualization support for decision tree classification modelling. Although this article focuses on decision tree classification modelling, guidelines derived from this study can be beneficial to other modelling techniques as well. At the end of the paper, a desirable design of visualization support for DM modelling is proposed with a conceptual model.

5.
A new methodology for the alignment of matrix chromatographic data is proposed, based on the decomposition of a three-way array composed of a test and a reference data matrix using a suitably initialized and constrained parallel factor (PARAFAC) model. It allows one to perform matrix alignment when the test data matrix contains unexpected chemical interferences, in contrast to most of the available algorithms. A series of simulated analytical systems is studied, as well as an experimental one, all having calibrated analytes and also potential interferences in the test samples, i.e., requiring the second-order advantage for successful analyte quantitation. The results show that the newly proposed method is able to properly align the different data matrices, restoring the trilinearity which is required to process the calibration and test data with second-order multivariate calibration algorithms such as PARAFAC. Recent models including unfolded partial least-squares regression (U-PLS) and N-dimensional PLS (N-PLS), combined with residual bilinearization (RBL), are also applied to both simulated and experimental data. The experimental data correspond to the determination of the polycyclic aromatic hydrocarbons benzo[b]fluoranthene and benzo[k]fluoranthene in the presence of benzo[j]fluoranthene as interference. The analytical figures of merit provided by the second-order calibration models are compared and discussed.

6.
In the literature there are only a few papers concerned with classification methods for multi-way arrays. The most common procedure, by far, is to unfold the multi-way data array into an ordinary matrix and then to apply the traditional multivariate tools for classification. As opposed to unfolding the data, several possibilities exist for building classification models more directly based on the multi-way structure of the data. As an example, multi-way partial least squares discriminant analysis has been used as a supervised classification method; another alternative that has been investigated is to perform classification using Fisher's LDA or SIMCA on the score matrix from, e.g., a PARAFAC or a Tucker model. Despite a few attempts at applying such multi-way classification approaches, no one has looked into how such models are best built and implemented. In this work, the SIMCA method is extended to three-way arrays. Included in this work is also actual code that will work on general multi-way arrays rather than just three-way arrays. In analogy with two-way SIMCA, a decomposition model is separately built for the multi-way data for each class, using a multi-way decomposition method such as PARAFAC or Tucker3. In the choice of the best class dimensionality, i.e. the number of latent factors, the results of cross-validation and, above all, the sensitivity/specificity values are evaluated. In order to estimate the class limits for each class model, orthogonal and score distances are considered, and different statistics are implemented and tested to set confidence limits for these two parameters. Classification performance using different definitions of class boundaries and classification rules, including the use of cross-validated residuals and scores, is compared. The proposed N-SIMCA methodology and code have been tested on simulated data sets of varying dimensionality and on two case studies concerning food authentication tasks for typical food products.
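As a rough illustration of the unfolding step this abstract calls the most common procedure, the sketch below flattens a three-way array (samples x rows x columns) into an ordinary two-way matrix with one row per sample. The function name `unfold_samples` and the toy values are assumptions made for illustration; this is not the code released with the paper.

```python
def unfold_samples(array3):
    """Unfold a samples x J x K nested list into a samples x (J*K) matrix.

    Each sample's J x K slab is flattened row by row, so the result can be
    passed to ordinary two-way multivariate tools.
    """
    return [[value for row in sample for value in row] for sample in array3]


# Toy three-way array: 2 samples, each a 2 x 3 matrix (hypothetical values).
data = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
unfolded = unfold_samples(data)  # 2 rows, 6 columns
```

Unfolding discards the information about which columns belonged to the same mode, which is the motivation the abstract gives for building models directly on the multi-way structure instead.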

7.
Extraction and GC/MS analysis of the human blood plasma metabolome   Total citations: 14, self-citations: 0, citations by others: 14
Analysis of the entire set of low molecular weight compounds (LMC), the metabolome, could provide deeper insights into mechanisms of disease and novel markers for diagnosis. In this investigation, we developed an extraction and derivatization protocol, using experimental design theory (design of experiment), for analyzing the human blood plasma metabolome by GC/MS. The protocol was optimized by evaluating the data for more than 500 resolved peaks using multivariate statistical tools including principal component analysis and partial least-squares projections to latent structures (PLS). The performance of five organic solvents (methanol, ethanol, acetonitrile, acetone, chloroform), singly and in combination, was investigated to optimize the LMC extraction. PLS analysis demonstrated that methanol extraction was particularly efficient and highly reproducible. The extraction and derivatization conditions were also optimized. Quantitative data for 32 endogenous compounds showed good precision and linearity. In addition, the determined amounts of eight selected compounds agreed well with analyses by independent methods in accredited laboratories, and most of the compounds could be detected at absolute levels of approximately 0.1 pmol injected, corresponding to plasma concentrations between 0.1 and 1 µM. The results suggest that the method could be usefully integrated into metabolomic studies for various purposes, e.g., for identifying biological markers related to diseases.

8.
This paper reports on the transfer of calibration models between Fourier transform near-infrared (FT-NIR) instruments from four different manufacturers. The piecewise direct standardization (PDS) method is compared with the new hybrid calibration method known as prediction augmented classical least squares/partial least squares (PACLS/PLS). The success of a calibration transfer experiment is judged by prediction error and by the number of samples that are flagged as outliers that would not have been flagged as such if a complete recalibration were performed. Prediction results must be acceptable and the outlier diagnostics capabilities must be preserved for the transfer to be deemed successful. Previous studies have measured the success of a calibration transfer method by comparing only the prediction performance (e.g., the root mean square error of prediction, RMSEP). However, our study emphasizes the need to consider outlier detection performance as well. As our study illustrates, the RMSEP values for a calibration transfer can be within an acceptable range; however, statistical analysis of the spectral residuals can show that outlier performance varies significantly between competing transfer methods. There was no statistically significant difference in the prediction error between the PDS and PACLS/PLS methods when the same subset sample selection method was used for both methods. However, the PACLS/PLS method was better at preserving the outlier detection capabilities and therefore was judged to have performed better than the PDS algorithm when transferring calibrations with the use of a subset of samples to define the transfer function. The method of sample subset selection was found to make a significant difference in the calibration transfer results using the PDS algorithm, while the transfer results were less sensitive to subset selection when the PACLS/PLS method was used.
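For readers unfamiliar with the RMSEP figure mentioned in this abstract, a minimal sketch of its usual definition follows: the square root of the mean squared difference between reference and predicted values over a test set. The function name `rmsep` and the example numbers are illustrative assumptions, not values or code from the study.

```python
import math


def rmsep(reference, predicted):
    """Root mean square error of prediction over paired test-set values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("reference and predicted must be non-empty and equal in length")
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical reference values and model predictions for three test samples.
error = rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A single summary number like this says nothing about which samples would be flagged as outliers, which is the gap the abstract argues transfer studies should also examine.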

9.
Dimensionality reduction is an important technique for preprocessing of high-dimensional data. Because only one side of the original data is represented in a low-dimensional subspace, useful information may be lost. In the present study, novel dimensionality reduction methods were developed that are suitable for metabolome data, where observation varies with time. Metabolomics deals with this type of data, which are often obtained in microorganism fermentation processes. However, no dimensionality reduction method that utilizes information from the original data in a positive manner has been reported to date. The ordinary dimensionality reduction methods of principal component analysis (PCA), partial least squares (PLS), orthonormalized PLS (OPLS), and regularized Fisher discriminant analysis (RFDA) were extended by introducing differential penalties to the latent variables in each class. A nonlinear extension of this approach, using kernel methods, was also proposed in the form of kernel-smoothed PCA, PLS, OPLS, and FDA. Since all of these methods are formulated as generalized eigenvalue problems, the solutions can be computed easily. These methods were then applied to intracellular metabolite data of a xylose-fermenting yeast in ethanol fermentation. Visualization in the low-dimensional subspace suggests that smoothed PCA successfully preserves the information about the time course of observations during fermentation, and that RFDA can produce high separation among different strains.

10.
The possibilities of employing methods of chemometrics in order to characterize macromolecules are described. The review has been limited to chemometric methods concerning multivariate data analysis. Principal component analysis (PCA) has been shown to be very useful for pattern recognition problems arising from chromatographic and spectroscopic data. An example of using a classification technique, SIMCA (Soft Independent Modelling of Class Analogy), as a product control method is presented. The suitability of Partial Least Squares (PLS) for relating data of different natures, e.g. chemical data with biological data, is discussed. Moreover, examples ranging from spectroscopic determinations to QSAR (Quantitative Structure Activity Relationships) are illustrated.

11.
12.
Comparisons of prediction models from the new augmented classical least squares (ACLS) and partial least squares (PLS) multivariate spectral analysis methods were conducted using simulated data containing deviations from the idealized model. The simulated data were based on pure spectral components derived from real near-infrared spectra of multicomponent dilute aqueous solutions. Simulated uncorrelated concentration errors, uncorrelated and correlated spectral noise, and nonlinear spectral responses were included to evaluate the methods on situations representative of experimental data. The statistical significance of differences in prediction ability was evaluated using the Wilcoxon signed rank test. The prediction differences were found to be dependent on the type of noise added, the numbers of calibration samples, and the component being predicted. For analyses applied to simulated spectra with noise-free nonlinear response, PLS was shown to be statistically superior to ACLS for most of the cases. With added uncorrelated spectral noise, both methods performed comparably. Using 50 calibration samples with simulated correlated spectral noise, PLS showed an advantage in 3 out of 9 cases, but the advantage dropped to 1 out of 9 cases with 25 calibration samples. For cases with different noise distributions between calibration and validation, ACLS predictions were statistically better than PLS for two of the four components. Also, when experimentally derived correlated spectral error was added, ACLS gave better predictions that were statistically significant in 15 out of 24 cases simulated. On data sets with nonuniform noise, neither method was statistically better, although ACLS usually had smaller standard errors of prediction (SEPs). The varying results emphasize the need to use realistic simulations when making comparisons between various multivariate calibration methods. 
Even when the differences between the standard errors of prediction were statistically significant, in most cases the differences in SEP were small. This study demonstrated that unlike CLS, ACLS is competitive with PLS in modeling nonlinearities in spectra without knowledge of all the component concentrations. This competitiveness is important when maintaining and transferring models for system drift, spectrometer differences, and unmodeled components, since ACLS models can be rapidly updated during prediction when used in conjunction with the prediction augmented classical least squares (PACLS) method, while PLS requires full recalibration.

13.
Different spectroscopic approaches have proved to be excellent analytical tools for monitoring process-induced transformations of active pharmaceutical ingredients during pharmaceutical unit operations. In order to use these tools effectively, it is necessary to build calibration models that describe the relationship between the amount of each solid-state form of interest and the spectroscopic signal. In this study, near-infrared (NIR) and Raman spectroscopic methods have been evaluated for the quantification of hydrate and anhydrate forms in pharmaceutical powders. Process type spectrometers were used to collect the data and the role of the sampling procedure was examined. Multivariate regression models were compared with traditional univariate calibrations and special emphasis was placed on data treatment prior to multivariate modeling by partial least squares (PLS). It was found that the measured sample volume greatly affected the performance of the model whereby the calibrations were significantly improved by utilizing a larger sampling area. In addition, multivariate regression did not always improve the predictability of the data compared to univariate analysis. The data treatment prior to multivariate modeling had a significant influence on the quality of predictions with standard normal variate transformation generally proving to be the best preprocessing method. When the appropriate sampling techniques and data analysis methods were utilized, both NIR and Raman spectroscopy were found to be suitable methods for the quantification of anhydrate/hydrate in powder systems, and thus the method of choice will depend on the conditions in the process under investigation.
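The standard normal variate (SNV) transformation named in this abstract is commonly defined per spectrum: subtract the spectrum's own mean and divide by its own standard deviation. The sketch below follows that common definition using the sample standard deviation (n - 1); the function name `snv` and the toy spectrum are assumptions for illustration, and the study's exact preprocessing settings are not given in the abstract.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own statistics."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("a constant spectrum cannot be SNV-scaled")
    return [(x - mean) / sd for x in spectrum]


# Hypothetical three-point "spectrum"; real spectra have hundreds of wavelengths.
scaled = snv([1.0, 2.0, 3.0])
```

Because each spectrum is scaled independently, SNV needs no reference spectrum and can be applied to new samples without refitting.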

14.
Metabolomics experiments involve the simultaneous detection of a high number of metabolites, leading to large multivariate datasets, and computer-based applications are required to extract relevant biological information. A high-throughput metabolic fingerprinting approach based on ultra performance liquid chromatography (UPLC) and high resolution time-of-flight (TOF) mass spectrometry (MS) was developed for the detection of wound biomarkers in the model plant Arabidopsis thaliana. High-dimensional data were generated and analysed with chemometric methods. In addition, machine learning classification algorithms constitute promising tools to decipher complex metabolic phenotypes, but their application remains scarcely reported in that research field. The present work proposes a comparative evaluation of a set of diverse machine learning schemes in the context of metabolomic data with respect to their ability to provide a deeper insight into the metabolite network involved in the wound response. Standalone classifiers, i.e. J48 (decision tree), kNN (instance-based learner), SMO (support vector machine), multilayer perceptron and RBF network (neural networks) and Naive Bayes (probabilistic method), or combinations of classification and feature selection algorithms, such as Information Gain, RELIEF-F, Correlation Feature-based Selection and SVM-based methods, are concurrently assessed and cross-validation resampling procedures are used to avoid overfitting. This study demonstrates that machine learning methods represent valuable tools for the analysis of UPLC-TOF/MS metabolomic data. In addition, remarkable performance was achieved, while the models' stability showed the robustness and the interpretability potential. The results drew attention to both temporal and spatial metabolic patterns in the context of stress signalling and highlighted relevant biomarkers not evidenced with standard data treatment.

15.
The possibilities of employing methods of chemometrics in order to characterize macromolecules are described. The review has been limited to chemometric methods concerning multivariate data analysis. Principal component analysis (PCA) has been shown to be very useful for pattern recognition problems arising from chromatographic and spectroscopic data. An example of using a classification technique, SIMCA (Soft Independent Modelling of Class Analogy), as a product control method is presented. The suitability of Partial Least Squares (PLS) for relating data of different natures, e.g. chemical data with biological data, is discussed. Moreover, examples ranging from spectroscopic determinations to QSAR (Quantitative Structure Activity Relationships) are illustrated.

16.
Nord LI, Vaag P, Duus JØ. Analytical Chemistry, 2004, 76(16): 4790-4798
The quantification of organic and amino acids in beer using ¹H NMR spectroscopy is demonstrated. Quantification was made both by integration of signals in the spectra together with use of calibration references and by use of partial least-squares (PLS) regression. Results from the NMR quantifications were compared with those obtained from determinations by amino acid analysis on HPLC and organic acid analysis by capillary electrophoresis. The described NMR-based methods could satisfactorily be used for quantification of several of the investigated metabolites in beer down to approximately 10 mg/L and for most with a good to high accuracy compared to results obtained by HPLC and capillary electrophoresis (R² 0.90-0.99). This was achieved with a simple sample preparation and one-dimensional ¹H NMR spectra obtained in a few minutes. The use of PLS clearly improves the accuracy of the quantifications, based on comparison to results obtained by HPLC and capillary electrophoresis, and furthermore permits the determination of components with partially overlapped signals in the spectrum. NMR spectroscopy in combination with PLS will be a useful tool for the quantification of metabolites, not only in beer but also in other beverages and biofluids.

17.
The purpose of this study was to predict drug content and hardness of intact tablets using artificial neural networks (ANN) and near-infrared spectroscopy (NIRS). Tablets for the drug content study were compressed from mixtures of Avicel® PH-101, 0.5% magnesium stearate, and varying concentrations (0%, 1%, 2%, 5%, 10%, 20%, and 40% w/w) of theophylline. Tablets for the hardness study were compressed from mixtures of Avicel PH-101 and 0.5% magnesium stearate at varying compression forces ranging from 0.4 to 1 ton. An Intact Analyzer™ was used to obtain near infrared spectra from the tablets with varying drug contents, whereas a Rapid Content Analyzer™ (RCA) was used to obtain spectral data from the tablets with varying hardness. Two sets of tablets from each batch (i.e., tablets with varying drug content and hardness) were randomly selected. One set of tablets was used to generate appropriate calibration models, while the other set was used as the unknown (test) set. A total of 10 ANN calibration models (5 each with 10 and 160 inputs at appropriate wavelengths) and five separate 4-factor partial least squares (PLS) calibration models were generated to predict drug contents of the test tablets from the spectral data. For the prediction of tablet hardness, two ANN calibration models (one each with 10 and 160 inputs) and two 4-factor PLS calibration models were generated and used to predict the hardness of test tablets. The PLS calibration models were generated using Vision® software. Prediction of drug contents of test tablets using the ANN calibration models generated with 10 inputs was significantly better than the prediction obtained with the ANN calibration models with 160 inputs. For tablets with low drug concentrations (less than or equal to 2%w/w), prediction of drug content was better with either of the two ANN calibration models than with the PLS calibration models. 
However, prediction of drug contents of tablets with greater than or equal to 5% w/w drug was better with the PLS calibration models than with the ANN calibration models. Prediction of tablet hardness was better with the ANN calibration models generated with either 10 or 160 inputs than with the PLS calibration models. This work demonstrated that a well-trained ANN model is a powerful alternative technique for analysis of NIRS data. Moreover, the technique could be used in instances when the conventional modeling of data does not work adequately.

18.
In this work, it is demonstrated that the coating weight of printed layers can be determined in-line in a running printing press by near-infrared (NIR) reflection spectroscopy assisted by chemometric methods. Three different unpigmented lacquer systems, i.e., a conventional oil-based printing lacquer, an ultraviolet (UV)-curable formulation, and a water-based dispersion varnish, were printed on paper with coating weights between about 0.5 and 7 g m⁻². NIR spectra for calibration were recorded with a special metal reflector simulating the mounting conditions of the probe head at the printing press. Calibration models were developed on the basis of the partial least squares (PLS) algorithm and evaluated by independent test samples. The prediction performance of the developed models was examined at a sheet-fed offset printing press at line speeds between 90 and 180 m min⁻¹. Results show an excellent correlation of data predicted in-line from the NIR spectra with reference values obtained off-line by gravimetry. The prediction errors were found to be ≤ 0.2 g m⁻², which confirms the suitability of the developed spectroscopic method for process control in technical printing processes.

19.
An assessment of off-site exposure from spills/releases of toxic chemicals can be conducted by compiling site-specific operational, geographic, demographic, and meteorological data and by using screening-level public-domain modeling tools (e.g., RMP*Comp, ALOHA and DEGADIS). In general, the analysis is confined to the following: event-based simulations (which allow for the use of known, constant atmospheric conditions), known receptor distances (on the order of miles or less), a short time scale for the distances considered (on the order of tens of minutes or less), gently sloping rough terrain, dense and neutrally buoyant gas dispersion, known chemical inventory and infrastructure (used to define the source term), and a known toxic endpoint (which defines significance). While screening-level models are relatively simple to use, care must be taken to ensure that the results are meaningful. This approach allows one to assess risk from catastrophic release (e.g., via terrorism), or plausible release scenarios (related to standard operating procedures and industry standards). In addition, given receptor distance and toxic endpoint, the model can be used to predict the critical spill volume to realize significant off-site risk. This information can then be used to assess site storage and operation parameters and to determine the most economical and effective risk reduction measures to be applied.

20.
The most common analyses carried out to assess gas engine oil quality include determination of viscosity, total base number (TBN), and total acid number (TAN). TAN has been considered to be an important indicator of oil quality, specifically in terms of defining oxidation and the extent of acidic contamination of used oils. TAN can be determined by potentiometric titration, and typical values for used oils can reach up to 4 mg KOH/g. A more convenient approach for the determination of TAN is based on infrared (IR) spectral data and multivariate regression models. We developed partial least-squares (PLS) models for the determination of TAN using IR data measured from monograde mineral gas engine oils (SAE 40, medium ash) that have been used in sewer and wood gas engines run with gaseous fuels from a sewage plant and a wood gasification plant, respectively. The final model performance was 0.07 mg KOH/g for the standard error of prediction (SEP). Essential for the development of powerful empirical models was an appropriate variable selection by combining expert knowledge, biPLS or dyn-biPLS, and a genetic algorithm. The optimum complexities of the models (the number of PLS components) and their prediction performances have been estimated by repeated double cross validation (rdCV).
