The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. Recently we have presented a probabilistic model for peptide identification that uses the hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. For this we formulate the protein identification problem in terms of two independent models, a two-hypothesis binomial model and a multinomial model, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from the relative length of the protein in the database (the null hypothesis) or the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (number of database matches to its peptides) given the size of the data set (number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities for each score bin. The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from results of database search using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models are related to the cases when protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.
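As a rough illustration of the two-hypothesis binomial test described above, the following Python sketch computes the binomial probability of observing a protein a given number of times under each hypothesis and takes the logarithm of their ratio. All numbers are hypothetical, and the way the alternative per-spectrum probability would be derived from hypergeometric peptide scores is not given in the abstract, so `p_alt` is simply assumed here. This is not the PROT_PROBE implementation.

```python
from math import comb, log


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def log_likelihood_ratio(k: int, n: int, p_null: float, p_alt: float) -> float:
    """Log of P(k | alternative) / P(k | null); positive values favor the alternative."""
    return log(binomial_pmf(k, n, p_alt)) - log(binomial_pmf(k, n, p_null))


# Hypothetical inputs (assumptions, not values from the abstract).
n_spectra = 1000             # size of the data set
n_matches = 8                # database matches to the protein's peptides
p_null = 500 / 5_000_000     # protein length relative to total database length
p_alt = 0.008                # assumed per-spectrum probability from peptide scores

llr = log_likelihood_ratio(n_matches, n_spectra, p_null, p_alt)
```

A large positive `llr` would indicate that the observed number of matches is far better explained by the alternative hypothesis than by random matching proportional to protein length; the threshold for calling an identification would have to come from the model distributions.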

Recent technological advances have made multidimensional peptide separation techniques coupled with tandem mass spectrometry the method of choice for high-throughput identification of proteins. Due to these advances, the development of software tools for large-scale, fully automated, unambiguous peptide identification is highly necessary. In this work, we have used as a model the nuclear proteome from Jurkat cells and present a processing algorithm that allows accurate predictions of random matching distributions, based on the two SEQUEST scores Xcorr and DeltaCn. Our method permits a very simple and precise calculation of the probabilities associated with individual peptide assignments, as well as of the false discovery rate among the peptides identified in any experiment. A further mathematical analysis demonstrates that the score distributions are highly dependent on database size and precursor mass window and suggests that the probability associated with SEQUEST scores depends on the number of candidate peptide sequences available for the search. Our results highlight the importance of adjusting the filtering criteria to discriminate between correct and incorrect peptide sequences according to the circumstances of each particular experiment.

Glycosylation is the most widespread posttranslational modification in eukaryotes; however, the role of oligosaccharides attached to proteins has been little studied because of the lack of a sensitive and easy analytical method for oligosaccharide structures. Recently, tandem mass spectrometric techniques have revealed that oligosaccharides might have characteristic signal intensity profiles. We describe here a strategy for the rapid and accurate identification of the oligosaccharide structures on glycoproteins using only mass spectrometry. It is based on a comparison of the signal intensity profiles of multistage tandem mass (MSn) spectra between the analyte and a library of observational mass spectra acquired from structurally defined oligosaccharides prepared using glycosyltransferases. To identify the oligosaccharides released from biological materials efficiently, a computer suggests which ion among the fragment ions in the MS/MS spectrum should yield the most informative MS3 spectrum to distinguish similar oligosaccharides. Using this strategy, we were able to identify the structure of N-linked oligosaccharides in immunoglobulin G as an example.

Mass spectrometry based metabolomics represents a new area for bioinformatics technology development. While the computational tools currently available, such as XCMS, statistically assess and rank LC-MS features, they do not provide information about their structural identity. XCMS(2) is an open source software package which has been developed to automatically search tandem mass spectrometry (MS/MS) data against high quality experimental MS/MS data from known metabolites contained in a reference library (METLIN). Scoring of hits is based on a "shared peak count" method that identifies masses of fragment ions shared between the analytical and reference MS/MS spectra. Another functional component of XCMS(2) is the capability of providing structural information for unknown metabolites that are not in the METLIN database. This "similarity search" algorithm has been developed to detect possible structural motifs in the unknown metabolite which may produce characteristic fragment ions and neutral losses to related reference compounds contained in METLIN, even if the precursor masses are not the same.
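The "shared peak count" idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration only: the mass tolerance, the greedy one-to-one matching, and the example m/z values are assumptions, and the function is hypothetical rather than part of XCMS(2) or METLIN.

```python
def shared_peak_count(query_mz, reference_mz, tol=0.01):
    """Count query fragment masses that match a reference fragment mass within tol (Da).

    Each reference peak may be matched at most once (greedy matching, an assumption).
    """
    ref = sorted(reference_mz)
    used = [False] * len(ref)
    count = 0
    for mz in sorted(query_mz):
        for i, r in enumerate(ref):
            if not used[i] and abs(mz - r) <= tol:
                used[i] = True
                count += 1
                break
    return count


# Hypothetical fragment masses for an analytical spectrum and two library entries.
query = [85.03, 127.04, 145.05, 163.06]
library = {
    "compound_a": [85.029, 145.049, 200.10],
    "compound_b": [91.05, 119.08],
}

# Rank library entries by the number of shared fragment masses.
ranked = sorted(library, key=lambda name: shared_peak_count(query, library[name]), reverse=True)
```

Counting peaks ignores intensities entirely, which keeps the score simple but means that a few coincidental low-intensity matches weigh as much as the dominant fragments.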

TwinPeaks, a close variant of the SEQUEST protein identification algorithm, is capable of unrestricted, large-scale identification of post-translational modifications (PTMs). TwinPeaks is applied to a sample of 100441 tandem mass spectra from the HUPO Plasma Proteome Project data set, with the full non-redundant human database as the reference protein database. With a 3.5% error rate, TwinPeaks identifies a collection of 539 spectra that were not identified by the usual PTM-restricted identification algorithm. At this error rate, TwinPeaks increases the rate of spectra identifications by at least 17.6%, making unrestricted PTM identification an integral part of proteomics.

A statistical model for identifying proteins by tandem mass spectrometry  
A statistical model is presented for computing probabilities that proteins are present in a sample on the basis of peptides assigned to tandem mass (MS/MS) spectra acquired from a proteolytic digest of the sample. Peptides that correspond to more than a single protein in the sequence database are apportioned among all corresponding proteins, and a minimal protein list sufficient to account for the observed peptide assignments is derived using the expectation-maximization algorithm. Using peptide assignments to spectra generated from a sample of 18 purified proteins, as well as complex H. influenzae and Halobacterium samples, the model is shown to produce probabilities that are accurate and have high power to discriminate correct from incorrect protein identifications. This method allows filtering of large-scale proteomics data sets with predictable sensitivity and false positive identification error rates. Fast, consistent, and transparent, it provides a standard for publishing large-scale protein identification data sets in the literature and for comparing the results obtained from different experiments.
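To make the apportioning of shared peptides more concrete, the sketch below iterates between two steps: splitting each peptide among its candidate proteins in proportion to the current protein probabilities, and recomputing each protein's probability as the chance that at least one of its weighted peptide assignments is correct. The update formulas, the starting value of 0.5, and the example data are assumptions chosen for illustration; the abstract does not spell them out, so this should be read as an iteration in the spirit of expectation-maximization and not as the authors' model.

```python
def protein_probabilities(peptide_probs, peptide_to_proteins, n_iter=50):
    """Iteratively apportion shared peptides and estimate protein probabilities."""
    proteins = sorted({p for prots in peptide_to_proteins.values() for p in prots})
    prob = {p: 0.5 for p in proteins}  # arbitrary starting point (assumption)
    for _ in range(n_iter):
        # Apportion each peptide among its proteins by current protein probability.
        weights = {}
        for pep, prots in peptide_to_proteins.items():
            total = sum(prob[p] for p in prots)
            for p in prots:
                weights[(pep, p)] = prob[p] / total if total > 0 else 1.0 / len(prots)
        # Protein probability: at least one weighted peptide assignment is correct.
        updated = {}
        for p in proteins:
            miss = 1.0
            for pep, prots in peptide_to_proteins.items():
                if p in prots:
                    miss *= 1.0 - weights[(pep, p)] * peptide_probs[pep]
            updated[p] = 1.0 - miss
        prob = updated
    return prob


# Hypothetical example: P2 is supported only by a peptide it shares with P1.
peptide_probs = {"pepA": 0.95, "pepB": 0.90, "pepC": 0.90}
peptide_to_proteins = {"pepA": ["P1"], "pepB": ["P1", "P2"], "pepC": ["P1"]}
result = protein_probabilities(peptide_probs, peptide_to_proteins)
```

In this toy case the shared peptide is progressively credited to P1, which also has unique evidence, so P2 drops out of the list. That is the behavior one would expect from a procedure that seeks a minimal set of proteins explaining the observed peptides.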

Multistage mass spectrometry (MS(n)) generating so-called spectral trees is a powerful tool in the annotation and structural elucidation of metabolites and is increasingly used in the area of accurate mass LC/MS-based metabolomics to identify unknown, but biologically relevant, compounds. As a consequence, there is a growing need for computational tools specifically designed for the processing and interpretation of MS(n) data. Here, we present a novel approach to represent and calculate the similarity between high-resolution mass spectral fragmentation trees. This approach can be used to query multiple-stage mass spectra in MS spectral libraries. Additionally, the method can be used to calculate structure-spectrum correlations and potentially deduce substructures from spectra of unknown compounds. The approach was tested using two different spectral libraries composed of either human or plant metabolites which currently contain 872 MS(n) spectra acquired from 549 metabolites using Orbitrap FTMS(n). For validation purposes, for 282 of these 549 metabolites, 765 additional replicate MS(n) spectra acquired with the same instrument were used. Both the dereplication and de novo identification functionalities of the comparison approach are discussed. This novel MS(n) spectral processing and comparison approach increases the probability of assigning the correct identity to an experimentally obtained fragmentation tree. Ultimately, this tool may pave the way for constructing and populating large MS(n) spectral libraries that can be used for searching and matching experimental MS(n) spectra for annotation and structural elucidation of unknown metabolites detected in untargeted metabolomics studies.

Tandem mass spectrometry (MS/MS) plays an important role in the unambiguous identification and structural elucidation of biomolecules. In contrast to conventional MS/MS approaches for protein identification where an individual polypeptide is sequentially selected and dissociated, a multiplexed-MS/MS approach increases throughput by selecting several peptides for simultaneous dissociation using either infrared multiphoton dissociation (IRMPD) or multiple frequency sustained off-resonance irradiation (SORI) collisionally induced dissociation (CID). The high mass measurement accuracy and resolution of FTICR combined with knowledge of peptide dissociation pathways allows the fragments arising from several different parent ions to be assigned. Herein we report the application of multiplexed-MS/MS coupled with on-line separations for the identification of peptides present in complex mixtures (i.e., whole cell lysate digests). Software was developed to enable "on-the-fly" data-dependent peak selection of a subset of polypeptides from each FTICR MS acquisition. In the subsequent MS/MS acquisitions, several coeluting peptides were fragmented simultaneously using either IRMPD or SORI-CID techniques. The utility of this approach has been demonstrated using a bovine serum albumin tryptic digest separated by capillary LC where multiple peptides were readily identified in single MS/MS acquisitions. We also present initial results from multiplexed-MS/MS analysis of a D. radiodurans whole cell digest to illustrate the utility of this approach for high-throughput analysis of a bacterial proteome.

We report a new tandem mass spectrometric approach for the improved identification of polypeptides from mixtures (e.g., using genomic databases). The approach involves the dissociation of several species simultaneously in a single experiment and provides both increased speed and sensitivity. The data analysis makes use of the known fragmentation pathways for polypeptides and highly accurate mass measurements for both the set of parent polypeptides and their fragments. The accurate mass information makes it possible to attribute most fragments to a specific parent species. We provide an initial demonstration of this multiplexed tandem MS approach using an FTICR mass spectrometer with a mixture of seven polypeptides dissociated using infrared irradiation from a CO2 laser. The peptides were added to, and then successfully identified from, the largest genomic database yet available (C. elegans), which is equivalent in complexity to that for a specific differentiated mammalian cell type. Additionally, since only a few enzymatic fragments are necessary to unambiguously identify a protein from an appropriate database, it is anticipated that the multiplexed MS/MS method will allow the more rapid identification of complex protein mixtures with on-line separation of their enzymatically produced polypeptides.

Many security and surveillance tasks involve either finding an object in a cluttered scene or discriminating between like objects. For example, an observer might look for a person of known height and weight in a crowd, or he might want to positively identify a specific face. The paper "Modeling target acquisition tasks associated with security and surveillance" [Appl. Opt. 46, 4209 (2007)] describes a specific-object model used to predict the probability of accomplishing this type of task. We describe four facial identification experiments and apply the specific-object model to predict the results. Facial identification is accurately predicted by the specific-object model.

A new solar spectral irradiometer that operates in the visible and near-infrared spectral ranges has been developed. This instrument takes advantage of a new-concept optical head that collects the light that impinges on a hemispheric surface, thus improving the instrument's angular response with respect to traditional devices. The technical characteristics of the instrument are investigated and detailed, and its radiometric calibration, performed by means of a Langley-like method, is discussed. A new simplified theoretical model that accounts for the diffuse irradiance observed in an optically thin plane-parallel atmosphere has been developed to improve the fit of the irradiance diurnal evolution. An alternative polynomial parametric representation of monochromatic diffuse irradiance evolution has been attempted, but satisfactory results were not obtained from the fitting of experimental data. The new instrument could be useful to carry out remote-sensing validation campaigns.

Modern determination techniques for pesticides must yield identification quickly with high confidence for timely enforcement of tolerances. A protocol for the collection of liquid chromatography (LC) electrospray ionization (ESI)-quadrupole linear ion trap (Q-LIT) mass spectrometry (MS) library spectra was developed. Following the protocol, an enhanced product ion (EPI) library of 240 pesticides was developed by use of spectra collected from two laboratories. An LC-Q-LIT-MS workflow using scheduled multiple reaction monitoring (sMRM) survey scan, information-dependent acquisition (IDA) triggered collection of EPI spectra, and library search was developed and tested to identify the 240 target pesticides in one single LC-Q-LIT MS analysis. By use of LC retention time, one sMRM survey scan transition, and a library search, 75-87% of the 240 pesticides were identified in a single LC/MS analysis at fortified concentrations of 10 ng/g in 18 different foods. A conventional approach with LC-MS/MS using two MRM transitions produced the same identifications and comparable quantitative results with the same incurred foods as the LC-Q-LIT using EPI library search, finding 1.2-49 ng/g of either carbaryl, carbendazim, fenbuconazole, propiconazole, or pyridaben in peaches; carbendazim, imazalil, terbutryn, and thiabendazole in oranges; terbutryn in salmon; and azoxystrobin in ginseng. Incurred broccoli, cabbage, and kale were screened with the same EPI library using three LC-Q-LIT instruments and an LC-quadrupole time-of-flight (Q-TOF) instrument. The library search identified azoxystrobin, cyprodinil, fludioxonil, imidacloprid, metalaxyl, spinosyn A, D, and J, and spirotetramat with each instrument. The approach has a broad application in LC-MS/MS type targeted screening in food analysis.

In shotgun proteomics, tandem mass spectrometry is used to identify peptides derived from proteins. After the peptides are detected, proteins are reassembled via a reference database of protein or gene information. Redundancy and homology between protein records in databases make it challenging to assign peptides to proteins that may or may not be in an experimental sample. Here, a probability model is introduced for determining the likelihood that peptides are correctly assigned to proteins. This model derives consistent probability estimates for assembled proteins. The probability scores make it easier to confidently identify proteins in complex samples and to accurately estimate false-positive rates. The algorithm based on this model is robust in creating protein complements from peptides from bovine protein standards, yeast, Ustilago maydis cell lysates, and Arabidopsis thaliana leaves. It also eliminates the side effects of redundancy and homology from the reference databases by employing a new concept of peptide grouping and by coherently distinguishing distinct peptides from unique records and shared peptides from homologous proteins. The software that runs the algorithm, called PANORAMICS, provides a tool to help analyze the data based on a researcher's knowledge about the sample. The software operates efficiently and quickly compared to other software platforms.

Amino acid sequence variations resulting from single-nucleotide polymorphisms (SNPs) were identified using a novel mass spectrometric method. This method obtains 99+% protein sequence coverage for human hemoglobin in a single LC-microspray tandem mass spectrometry (microLC-MS/MS) experiment. Tandem mass spectrometry data was analyzed using a modified version of the computer program SEQUEST to identify the sequence variations. Conditions of sample preparation, chromatographic separation, and data collection were optimized to correctly identify amino acid changes in six variants of human hemoglobin (Hb C, Hb E, Hb D-Los Angeles, Hb G-Philadelphia, Hb Hope, and Hb S). Hemoglobin proteins were isolated and purified, dehemed, (S)-carboxyamidomethylated, and then subjected to a combination proteolytic digestion to obtain a complex peptide mixture with multiple overlaps in sequence. Reversed-phase chromatographic separation of peptides was achieved on-line with MS utilizing a robust fritless microelectrospray interface. Tandem mass spectrometry was performed on an ion trap mass spectrometer using automated data-dependent MS/MS procedures. Tandem mass spectra were collected from the five most abundant ions in each scan using dynamic and isotopic exclusion to minimize redundancy. The spectra were analyzed by a version of the SEQUEST algorithm modified to identify amino acid substitutions resulting from SNPs.

A method for speciation and identification of organoselenium metabolites found in human urine samples using high performance liquid chromatography/inductively coupled plasma mass spectrometry (HPLC/ICP-MS) and tandem mass spectrometry (MS/MS) is described. Reversed-phase chromatographic separation was used for sample fractionation with the ICP-MS functioning as an element-selective detector, and six distinct selenium-containing species were detected in a human urine sample. Fractions were then collected and analyzed using a triple quadrupole mass spectrometer with electrospray ionization and collision-induced dissociation to obtain structural information. The first two fractions were identified specifically as selenomethionine and selenocystamine, estimated to be present at approximately 11 and 40 ppb, respectively. To the best of our knowledge, this is the first time these two metabolites have been positively identified in human urine.

The native reference peptide (NRP) method has been adapted to measure the degree of protein nitration at a specific tyrosine residue. In these experiments, human serum albumin was modified in a myeloperoxidase-mediated reaction in the presence of nitrite, with nitration detected predominantly at one site, Y162. The time-dependent increase in nitration at this site was measured based on the increasing abundance of the peptide 162YnLYEIAR168 and the corresponding decrease in the 162YLYEIAR168 peptide in in-gel trypsin digests. The peptide 66LVNEVTEFAK75, also formed in the tryptic digest, was used as the native reference peptide. Quantitation was achieved by determining the chromatographic peak area of the two analyte peptides relative to the native reference peptide by LC/tandem mass spectrometric analyses with selected reaction monitoring. The NRP results were validated by correlation to the time-dependent increase in total protein-nitrotyrosine content determined by Western blot analysis. The precision and limit of detection of the assay were also evaluated and were found to be approximately 10% (relative standard deviation) and 5 fmol on-column, respectively. These results demonstrate the utility of the NRP method for quantitative analyses of posttranslational modifications, in terms of broad applicability, ease of experimental design, sensitivity, and precision.

In this case study, we apply a recently developed method to systematically predict the linear dependencies in concentration profiles and identify minimum requirements to enable optimisation of rate constants and pure component spectra via direct multivariate kinetic hard-modelling of spectroscopic data. This systematic method was applied to the rank-deficient acid catalysed reaction of benzophenone with phenylhydrazine in THF. Various experimental conditions (different dosing and initial concentrations) and data treatments (defining uncoloured species, including known component spectra into the analysis) were considered. For all these conditions, the kinetic mechanism of this condensation reaction was successfully validated by the agreement between fitted and independently measured mid-IR and UV–vis pure component spectra and by the highly reproducible fitted rate constants. This case study particularly demonstrated the value of the direct spectral fitting as a tool for the validation of rank-deficient kinetic mechanisms, as inherent contributions within the fitted component spectra, due to the definition of uncoloured species, can be systematically addressed.

We investigated and compared three approaches for shotgun protein identification by combining MS and MS/MS information using LTQ-Orbitrap high mass accuracy data. In the first approach, we employed a unique mass identifier method where MS peaks matched to peptides predicted from proteins identified from an MS/MS database search are first subtracted before using the MS peaks as unique mass identifiers for protein identification. In the second method, we used an accurate mass and time tag method by building a potential mass and retention time database from previous MudPIT analyses. For the third method, we used a peptide mass fingerprinting-like approach in combination with a randomized database for protein identification. We show that we can improve protein identification sensitivity for low-abundance proteins by combining MS and MS/MS information. Furthermore, "one-hit wonders" from MS/MS database searching can be further substantiated by MS information and the approach improves the identification of low-abundance proteins. The advantages and disadvantages of the three approaches are then discussed.

Most algorithms for identifying peptides from tandem mass spectra use information only from the final spectrum, ignoring non-mass-based information acquired routinely in liquid chromatography tandem mass spectrometry analyses. One physicochemical property that is always obtained but rarely exploited is peptide chromatographic retention time. Efforts to use chromatographic retention time to improve peptide identification are complicated by the variability of retention time across experimental conditions, which makes retention time calculations nongeneralizable. We show that peptide retention time can be reliably predicted by training and testing a support vector regressor on a small collection of data from a single liquid chromatography run. This model can be used to filter peptide identifications with observed retention time that deviates from predicted retention time. After filtering, positive peptide identifications increase by as much as 50% at a false discovery rate of 3%. We demonstrate that our dynamically trained model generalizes well across diverse chromatography conditions and methods for generating peptides, in particular improving peptide identification using nonspecific proteases.
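The filtering step described above can be illustrated with a small Python sketch. To keep the example dependency-free, an ordinary least-squares line on a single hypothetical peptide feature stands in for the support vector regressor used in the work itself; the feature, the training values, and the deviation threshold are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def filter_by_retention_time(identifications, slope, intercept, max_dev):
    """Keep identifications whose observed retention time is within max_dev of the prediction."""
    return [
        ident
        for ident in identifications
        if abs(ident["rt"] - (slope * ident["feature"] + intercept)) <= max_dev
    ]


# Hypothetical training data from confident identifications in a single run:
# a one-dimensional peptide feature (for example a hydrophobicity index) and
# the observed retention time in minutes.
train_feature = [1.0, 2.0, 3.0, 4.0]
train_rt = [10.0, 20.0, 30.0, 40.0]
slope, intercept = fit_line(train_feature, train_rt)

candidates = [
    {"peptide": "PEPTIDEA", "feature": 2.5, "rt": 26.0},  # close to the prediction
    {"peptide": "PEPTIDEB", "feature": 1.5, "rt": 40.0},  # far from the prediction
]
kept = filter_by_retention_time(candidates, slope, intercept, max_dev=5.0)
```

Training on data from the same run is the point of the approach: the fitted model absorbs the chromatographic conditions of that run, so no cross-experiment calibration is needed.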

A powerful technique for peptide and protein identification is tandem mass spectrometry followed by database search using a program such as SEQUEST or Mascot. These programs, however, become slow and lose sensitivity when allowing nonspecific cleavages or peptide modifications. De novo sequencing and hybrid methods such as sequence tagging offer speed and robustness for wider searches, yet these approaches require better spectra with more complete and consecutive fragmentation and, hence, are less sensitive to low-abundance peptides. Here we describe a new hybrid method that retains the sensitivity of pure database search. The method uses a small amount of de novo analysis to identify likely b- and y-ion peaks ("lookup peaks") that can then be used to extract candidate peptides from the database, with the number of candidates tunable to fit a computing budget. We describe a program called ByOnic that implements this method, and we benchmark ByOnic on several data sets, including one of mouse blood plasma spiked with low concentrations of recombinant human proteins. We demonstrate that ByOnic is more sensitive than sequence tagging and, indeed, more sensitive than the three most popular pure database search tools (SEQUEST, Mascot, and X!Tandem) on both the peptide and protein levels. On the mouse plasma samples, ByOnic consistently found spiked proteins missed by the other tools.

