Similar articles (20 results found)
1.
We present a new probability-based method for protein identification using tandem mass spectra and protein databases. The method employs a hypergeometric distribution to model frequencies of matches between fragment ions predicted for peptide sequences with a specific (M + H)+ value (at some mass tolerance) in a protein sequence database and an experimental tandem mass spectrum. The hypergeometric distribution constitutes the null hypothesis: all peptide matches to a tandem mass spectrum are random. It is used to generate a score characterizing the randomness of a database sequence match to an experimental tandem mass spectrum and to determine the level of significance of the null hypothesis. For each tandem mass spectrum and database search, a peptide is identified that has the least probability of being a random match to the spectrum, and the corresponding level of significance of the null hypothesis is determined. To check the validity of the hypergeometric model in describing fragment ion matches, we used a chi-squared (χ²) test. The distribution of frequencies and corresponding hypergeometric probabilities are generated for each tandem mass spectrum. No proteolytic cleavage specificity is used to create the peptide sequences from the database. We do not use any empirical probabilities in this method. The scores generated by the hypergeometric model do not have a significant molecular weight bias and are reasonably independent of database size. The approach has been implemented in a database search algorithm, PEP_PROBE. By using a large set of tandem mass spectra derived from a set of peptides created by digestion of a collection of known proteins using four different proteases, a false positive rate of 5% is demonstrated.
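
The scoring idea in the abstract above can be illustrated with a small sketch of a hypergeometric tail probability. The parameterization here is an assumption made for illustration (N = all predicted fragment ions in the candidate pool, K = those that match a peak in the spectrum, n = fragment ions predicted for one peptide, k = that peptide's matches); it is not PEP_PROBE's actual implementation, and the function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): exactly k matching items among n draws (without replacement)
    from a population of N items, K of which are matching."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def random_match_pvalue(k, N, K, n):
    """P(X >= k) under the null hypothesis that all matches are random.
    A smaller value means the observed match count is less likely to be chance."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(n, K) + 1))


if __name__ == "__main__":
    # Hypothetical numbers: 2000 predicted ions in the pool, 150 of them match
    # the spectrum; one peptide predicts 20 ions and 9 of them match.
    print(random_match_pvalue(9, 2000, 150, 20))
```

Under this reading, the peptide reported for a spectrum would be the candidate with the smallest tail probability.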

2.
We have developed an approach to identify the molecular weight of a peptide ion directly from its corresponding tandem mass spectrum using a cross-correlation function. We have shown that the monoisotopic molecular weight can be calculated for approximately 90% of tandem mass spectra identified from tryptic digests of complex protein mixtures. The accuracy of the calculated monoisotopic masses was dependent on the resolution and mass accuracy of the spectra analyzed, but was typically <0.25 amu for linear ion trap mass spectra. The ability to calculate accurate monoisotopic molecular weights for low-resolution ion trap data should significantly improve both the speed and performance of database searches for which typical mass accuracies of approximately 3 amu are employed. In addition, this strategy can also be used to identify the precursor ion for tandem mass spectra acquired using large ion selection windows in data-independent collision-activated dissociation and has the potential to identify multiplexed tandem mass spectra.

3.
The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. Recently we have presented a probabilistic model for peptide identification that uses a hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. For this we formulate the protein identification problem in terms of two independent models, two-hypothesis binomial and multinomial models, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from the relative length of the protein in the database (the null hypothesis) or the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (number of database matches to its peptides) given the size of the data set (number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities of each score bin. The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from results of database search using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models are related to the cases when protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.
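
The two-hypothesis binomial test described in this abstract can be sketched as follows. This is a minimal illustration, not PROT_PROBE's code: `p_null` and `p_alt` stand for the per-spectrum identification probabilities under the null and alternative hypotheses, and how they are derived from protein length or hypergeometric scores is left outside the sketch. All names are hypothetical.

```python
from math import comb, log


def binomial_pmf(k, n, p):
    """Probability of observing a protein exactly k times in n Bernoulli events,
    each with identification probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def log_likelihood_ratio(k, n, p_null, p_alt):
    """Log of binomial_pmf(k, n, p_alt) / binomial_pmf(k, n, p_null).
    The binomial coefficient cancels, so the ratio is computed in the log
    domain to avoid underflow when n is large. Requires 0 < p < 1 for both."""
    return k * log(p_alt / p_null) + (n - k) * log((1 - p_alt) / (1 - p_null))


if __name__ == "__main__":
    # Hypothetical numbers: 1000 spectra, 12 matches to one protein's peptides.
    print(log_likelihood_ratio(12, 1000, p_null=0.001, p_alt=0.01))
```

A positive value favors the alternative hypothesis; the threshold used to call an identification is a separate choice that the abstract ties to significance levels from the model distributions.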

4.
Na S, Paek E, Lee C. Analytical Chemistry, 2008, 80(5): 1520-1528
Tandem mass spectrometry (MS/MS) has become a common and useful tool for analyzing complex protein mixtures. Database search programs are the most popular means for peptide identification from MS/MS spectra. However, estimations of charge states of peptide MS/MS spectra obtained from low-resolution mass spectrometers have not been reliable. They require repetitive database searches and additional analyses of the search results. We propose here an algorithm designed to reliably differentiate doubly charged spectra from triply charged ones. We conducted a rigorous analysis of various spectral features and their effects. We employed the distinguishing features found in our analysis and developed a classifier for multiply charged spectra using a machine learning approach. The test on various data sets showed that our method could be successfully applied independent of experimental setup and mass instrument. This algorithm can be used to prefilter spectra so that only reasonably good spectra are submitted to database search programs, thereby saving considerable time. The software for MS/MS charge-state determination, which we named "CIFTER", is available at http://prix.uos.ac.kr/sifter/cifter.

5.
Reliable identification of posttranslational modifications is key to understanding various cellular regulatory processes. We describe a tool, InsPecT, to identify posttranslational modifications using tandem mass spectrometry data. InsPecT constructs database filters that proved to be very successful in genomics searches. Given an MS/MS spectrum S and a database D, a database filter selects a small fraction of database D that is guaranteed (with high probability) to contain a peptide that produced S. InsPecT uses peptide sequence tags as efficient filters that reduce the size of the database by a few orders of magnitude while retaining the correct peptide with very high probability. In addition to filtering, InsPecT also uses novel algorithms for scoring and validating in the presence of modifications, without explicit enumeration of all variants. InsPecT identifies modified peptides with accuracy equal to or better than that of other database search tools while being 2 orders of magnitude faster than SEQUEST, and substantially faster than X!TANDEM on complex mixtures. The tool was used to identify a number of novel modifications in different data sets, including many phosphopeptides in data provided by the Alliance for Cellular Signaling that were missed by other tools.

6.
Collision-induced dissociation (CID) is a common ion activation technique used to energize mass-selected peptide ions during tandem mass spectrometry. Characteristic fragment ions form from the cleavage of amide bonds within a peptide undergoing CID, allowing the inference of its amino acid sequence. The statistical characterization of these fragment ions is essential for improving peptide identification algorithms and for understanding the complex reactions taking place during CID. An examination of 1465 ion trap spectra from doubly charged tryptic peptides reveals several trends important to understanding this fragmentation process. While less abundant than y ions, b ions are present in sufficient numbers to aid sequencing algorithms. Fragment ions exhibit a characteristic series-specific relationship between their masses and intensities. Each residue influences fragmentation at adjacent amide bonds, with Pro quantifiably enhancing cleavage at its N-terminal amide bond and His increasing the formation of b ions at its C-terminal amide bond. Fragment ions corresponding to a formal loss of ammonia appear preferentially in peptides containing Gln and Asn. These trends are partially responsible for the complexity of peptide tandem mass spectra.
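
For readers unfamiliar with the b and y ion series mentioned above, the following sketch computes singly charged b and y fragment m/z values for an unmodified peptide from standard monoisotopic residue masses. It is an illustration only: it ignores modifications, neutral losses, and higher charge states, and the function name is hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # charge carrier


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.
    b_i holds the first i residues; y_i holds the last i residues plus water."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal prefixes
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal suffixes
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("GAK")
    print([round(x, 4) for x in b], [round(x, 4) for x in y])
```

Each amide bond cleavage yields one b/y pair, so a peptide of length L has L - 1 ions in each series.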

7.
Most algorithms for identifying peptides from tandem mass spectra use information only from the final spectrum, ignoring non-mass-based information acquired routinely in liquid chromatography tandem mass spectrometry analyses. One physicochemical property that is always obtained but rarely exploited is peptide chromatographic retention time. Efforts to use chromatographic retention time to improve peptide identification are complicated by the variability of retention time across experimental conditions, which makes retention time calculations nongeneralizable. We show that peptide retention time can be reliably predicted by training and testing a support vector regressor on a small collection of data from a single liquid chromatography run. This model can be used to filter peptide identifications with observed retention time that deviates from predicted retention time. After filtering, positive peptide identifications increase by as much as 50% at a false discovery rate of 3%. We demonstrate that our dynamically trained model generalizes well across diverse chromatography conditions and methods for generating peptides, in particular improving peptide identification using nonspecific proteases.

8.
In shotgun proteomics, tandem mass spectrometry is used to identify peptides derived from proteins. After the peptides are detected, proteins are reassembled via a reference database of protein or gene information. Redundancy and homology between protein records in databases make it challenging to assign peptides to proteins that may or may not be in an experimental sample. Here, a probability model is introduced for determining the likelihood that peptides are correctly assigned to proteins. This model derives consistent probability estimates for assembled proteins. The probability scores make it easier to confidently identify proteins in complex samples and to accurately estimate false-positive rates. The algorithm based on this model is robust in creating protein complements from peptides from bovine protein standards, yeast, Ustilago maydis cell lysates, and Arabidopsis thaliana leaves. It also eliminates the side effects of redundancy and homology from the reference databases by employing a new concept of peptide grouping and by coherently distinguishing distinct peptides from unique records and shared peptides from homologous proteins. The software that runs the algorithm, called PANORAMICS, provides a tool to help analyze the data based on a researcher's knowledge about the sample. The software operates efficiently and quickly compared to other software platforms.

9.
Tandem mass spectrometry (MS/MS) plays an important role in the unambiguous identification and structural elucidation of biomolecules. In contrast to conventional MS/MS approaches for protein identification where an individual polypeptide is sequentially selected and dissociated, a multiplexed-MS/MS approach increases throughput by selecting several peptides for simultaneous dissociation using either infrared multiphoton dissociation (IRMPD) or multiple frequency sustained off-resonance irradiation (SORI) collisionally induced dissociation (CID). The high mass measurement accuracy and resolution of FTICR combined with knowledge of peptide dissociation pathways allows the fragments arising from several different parent ions to be assigned. Herein we report the application of multiplexed-MS/MS coupled with on-line separations for the identification of peptides present in complex mixtures (i.e., whole cell lysate digests). Software was developed to enable "on-the-fly" data-dependent peak selection of a subset of polypeptides from each FTICR MS acquisition. In the subsequent MS/MS acquisitions, several coeluting peptides were fragmented simultaneously using either IRMPD or SORI-CID techniques. The utility of this approach has been demonstrated using a bovine serum albumin tryptic digest separated by capillary LC where multiple peptides were readily identified in single MS/MS acquisitions. We also present initial results from multiplexed-MS/MS analysis of a D. radiodurans whole cell digest to illustrate the utility of this approach for high-throughput analysis of a bacterial proteome.

10.
A widespread proteomics procedure for characterizing a complex mixture of proteins combines tandem mass spectrometry and database search software to yield mass spectra with identified peptide sequences. The same peptides are often detected in multiple experiments, and once they have been identified, the respective spectra can be used for future identifications. We present a method for collecting previously identified tandem mass spectra into a reference library that is used to identify new spectra. Query spectra are compared to references in the library to find the ones that are most similar. A dot product metric is used to measure the degree of similarity. With our largest library, the search of a query set finds 91% of the spectrum identifications and 93.7% of the protein identifications that could be made with a SEQUEST database search. A second experiment demonstrates that queries acquired on an LCQ ion trap mass spectrometer can be identified with a library of references acquired on an LTQ ion trap mass spectrometer. The dot product similarity score provides good separation of correct and incorrect identifications.
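
A dot product comparison between a query spectrum and a library spectrum can be sketched as below. The abstract does not give the preprocessing, so the fixed-width binning, the 1.0 m/z bin width, and the normalization to a cosine-style score are all assumptions for illustration; the function names are hypothetical.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.
    `peaks` is an iterable of (mz, intensity) pairs."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned


def dot_product_similarity(query, reference, bin_width=1.0):
    """Normalized dot product of two binned spectra, in the range [0, 1]
    for non-negative intensities. Returns 0.0 if either spectrum is empty."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    dot = sum(v * r.get(i, 0.0) for i, v in q.items())
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    query = [(147.1, 30.0), (218.2, 80.0), (305.3, 10.0)]
    library_entry = [(147.2, 25.0), (218.1, 90.0), (401.0, 5.0)]
    print(dot_product_similarity(query, library_entry))
```

In a library search, the query would be scored against every reference within the precursor mass tolerance and the highest-scoring reference reported.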

11.
We report a new tandem mass spectrometric approach for the improved identification of polypeptides from mixtures (e.g., using genomic databases). The approach involves the dissociation of several species simultaneously in a single experiment and provides both increased speed and sensitivity. The data analysis makes use of the known fragmentation pathways for polypeptides and highly accurate mass measurements for both the set of parent polypeptides and their fragments. The accurate mass information makes it possible to attribute most fragments to a specific parent species. We provide an initial demonstration of this multiplexed tandem MS approach using an FTICR mass spectrometer with a mixture of seven polypeptides dissociated using infrared irradiation from a CO2 laser. The peptides were added to, and then successfully identified from, the largest genomic database yet available (C. elegans), which is equivalent in complexity to that for a specific differentiated mammalian cell type. Additionally, since only a few enzymatic fragments are necessary to unambiguously identify a protein from an appropriate database, it is anticipated that the multiplexed MS/MS method will allow the more rapid identification of complex protein mixtures with on-line separation of their enzymatically produced polypeptides.

12.
Lu B, Ruse C, Xu T, Park SK, Yates J. Analytical Chemistry, 2007, 79(4): 1301-1310
We developed and compared two approaches for automated validation of phosphopeptide tandem mass spectra identified using database searching algorithms. Phosphopeptide identifications were obtained through SEQUEST searches of a protein database appended with its decoy (reversed sequences). Statistical evaluation and iterative searches were employed to create a high-quality data set of phosphopeptides. Automation of postsearch validation was approached by two different strategies. By using statistical multiple testing, we calculate a p value for each tentative peptide phosphorylation. In a second method, we use a support vector machine (SVM; a machine learning algorithm) binary classifier to predict whether a tentative peptide phosphorylation is true. We show good agreement (85%) between postsearch validation of phosphopeptide/spectrum matches by multiple testing and that from support vector machines. Automatic methods conform very well with manual expert validation in a blinded test. Additionally, the algorithms were tested on the identification of synthetic phosphopeptides. We show that phosphate neutral losses in tandem mass spectra can be used to assess the correctness of phosphopeptide/spectrum matches. An SVM classifier with a radial basis function provided classification accuracy from 95.7% to 96.8% of the positive data set, depending on search algorithm used. Establishing the efficacy of an identification is a necessary step for further postsearch interrogation of the spectra for complete localization of phosphorylation sites. Our current implementation performs validation of phosphoserine/phosphothreonine-containing peptides having one or two phosphorylation sites from data gathered on an ion trap mass spectrometer. The SVM-based algorithm has been implemented in the software package DeBunker. We illustrate the application of the SVM-based software DeBunker on a large phosphorylation data set.

13.
A novel methodology for the automated de novo identification of peptides via integer linear optimization (also referred to as integer linear programming or ILP) and tandem mass spectrometry is presented in this article. The various features of the mathematical model are presented and examples are used to illustrate the key concepts of the proposed approach. A variety of challenging peptide identification problems, accompanied by a comparative study with five state-of-the-art methods, are examined to illustrate the proposed method's ability to address (a) residue-dependent fragmentation properties that result in missing ion peaks and (b) the variability of resolution in different mass analyzers. A preprocessing algorithm is utilized to identify important m/z values in the tandem mass spectrum. Missing peaks, due to residue-dependent fragmentation characteristics, are dealt with using a two-stage algorithmic framework. A cross-correlation approach is used to resolve missing amino acid assignments and to select the most probable peptide by comparing the theoretical spectra of the candidate sequences that were generated from the ILP sequencing stages with the experimental tandem mass spectrum. The novel, proposed de novo method, denoted as PILOT, is compared to existing popular methods such as Lutefisk, PEAKS, PepNovo, EigenMS, and NovoHMM for a set of spectra resulting from QTOF and ion trap instruments.

14.
Peptide identification based on tandem mass spectrometry and database searching algorithms has become one of the central technologies in proteomics. At the heart of this technology is the ability to reproducibly acquire high-quality tandem mass spectra for database interrogation. The variability in tandem mass spectra generation is often assumed to be minimal, and peptide identifications are typically based on a single tandem mass spectrum. In this paper, we characterize the variance of scores derived from replicate tandem mass spectra using several database search algorithms and demonstrate the effects of spectral variability on the correct identification of peptides. We show that the variance associated with the collection of tandem mass spectra can be substantial, leading to sizable errors in search algorithm scores (approximately 5-25% RSD) and ultimately to incorrect assignments. Processing strategies are discussed to minimize the impact of tandem mass spectra variability on peptide identification.
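
The relative standard deviation (RSD) quoted in this abstract is the sample standard deviation expressed as a percentage of the mean. A minimal sketch for a set of replicate search scores follows; the example scores are made up for illustration and the function name is hypothetical.

```python
from statistics import mean, stdev


def relative_std_dev(scores):
    """Percent RSD of replicate scores: 100 * sample standard deviation / mean.
    Needs at least two scores and a nonzero mean."""
    return 100.0 * stdev(scores) / mean(scores)


if __name__ == "__main__":
    # Hypothetical search scores for one peptide from five replicate spectra.
    replicate_scores = [3.1, 2.8, 3.4, 2.6, 3.0]
    print(f"{relative_std_dev(replicate_scores):.1f}% RSD")
```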

15.
TwinPeaks, a close variant of the SEQUEST protein identification algorithm, is capable of unrestricted, large-scale identification of post-translational modifications (PTMs). TwinPeaks is applied to a sample of 100441 tandem mass spectra from the HUPO Plasma Proteome Project data set, with the full non-redundant human database as the reference protein database. With a 3.5% error rate, TwinPeaks identifies a collection of 539 spectra that were not identified by the usual PTM-restricted identification algorithm. At this error rate, TwinPeaks increases the rate of spectra identifications by at least 17.6%, making unrestricted PTM identification an integral part of proteomics.

16.
Fragmentation at the Xxx-Pro bond was analyzed for a group of peptide mass spectra that were acquired in a Finnigan ion trap mass spectrometer and were generated from proteins digested by enzymes and identified by the Sequest algorithm. Cleavage with formation of a + b + y ions occurred more readily at the Xxx-Pro bond than at other locations in these peptides, and the importance of this cleavage varied by the identity of Xxx. The most abundant Xxx-Pro relative bond cleavage ratios were observed when Xxx was Val, His, Asp, Ile, and Leu, whereas the least abundant cleavage ratios occurred when Xxx was Gly or Pro. Rationalization for these cleavage ratios at Xxx-Pro may include contribution of the Asp or His side chain to enhanced cleavage or the conformation of Pro, Gly, and the aliphatic residues Val, Ile, and Leu at the Xxx location in the Xxx-Pro bond. Although unusual fragmentation behavior has been noted for Pro-containing peptides, this analysis suggests that fragmentation at the Xxx-Pro bond is predictable and that this information may be used to improve the identification of proteins if it is incorporated into peptide sequencing algorithms.

17.
Du P, Angeletti RH. Analytical Chemistry, 2006, 78(10): 3385-3392
We present an algorithm for the deconvolution of isotope-resolved mass spectra of complex peptide mixtures where peaks and isotope series often overlap. The algorithm formulates the problem of mass spectrum deconvolution as a classical statistical problem of variable selection, which aims to interpret the spectrum with the least number of peptides. The LASSO method is used to perform automatic variable selection. The algorithm also makes use of the quantized distribution of peptide masses in the NCBInr database after in silico trypsin digestion as filters to aid the deconvolution process. Errors in the expected isotope pattern are accounted for to avoid spurious isotope series. The effectiveness of the algorithm is demonstrated on annotated ESI spectra of known peptides for which the peaks and isotope series are highly overlapping. The algorithm successfully finds all correct masses in the experimental spectra, except for one spectrum where an additional refinement procedure is required to obtain the correct results. Our results compare favorably to those from a widely used commercial program.

18.
Algorithmic search engines bridge the gap between large tandem mass spectrometry data sets and the identification of proteins associated with biological samples. Improvements in these tools can greatly enhance biological discovery. We present a new scoring scheme for comparing tandem mass spectra with a protein sequence database. The MASPIC (Multinomial Algorithm for Spectral Profile-based Intensity Comparison) scorer converts an experimental tandem mass spectrum into an m/z profile of probability and then scores peak lists from potential candidate peptides using a multinomial distribution model. The MASPIC scoring scheme incorporates intensity, spectral peak density variations, and m/z error distribution associated with peak matches into a multinomial distribution. The scoring scheme was validated on two standard protein mixtures and an additional set of spectra collected on a complex ribosomal protein mixture from Rhodopseudomonas palustris. The results indicate a 5-15% improvement over Sequest for high-confidence identifications. The performance gap grows as sequence database size increases. Additional tests on spectra from proteinase-K digest data showed similar performance improvements, demonstrating the advantages of using MASPIC for studying proteins digested with less specific proteases. All these investigations show MASPIC to be a versatile and reliable system for peptide tandem mass spectral identification.

19.
For automated production of tandem mass spectrometric data for proteins and peptides >3 kDa at >50 000 resolution, a dual online-offline approach is presented here that improves upon standard liquid chromatography-tandem mass spectrometry (LC-MS/MS) strategies. An integrated hardware and software infrastructure analyzes online LC-MS data and intelligently determines which targets to interrogate offline using a posteriori knowledge such as prior observation, identification, and degree of characterization. This platform represents a way to implement accurate mass inclusion and exclusion lists in the context of a proteome project, automating collection of high-resolution MS/MS data that cannot currently be acquired on a chromatographic time scale at equivalent spectral quality. For intact proteins from an acid extract of human nuclei fractionated by reversed-phase liquid chromatography (RPLC), the automated offline system generated 57 successful identifications of protein forms arising from 30 distinct genes, a substantial improvement over online LC-MS/MS using the same 12 T LTQ FT Ultra instrument. Analysis of human nuclei subjected to a shotgun Lys-C digest using the same RPLC/automated offline sampling identified 147 unique peptides containing 29 co- and post-translational modifications. Expectation values ranged from 10^-5 to 10^-99, allowing routine multiplexed identifications.

20.
Zhang W, Chait BT. Analytical Chemistry, 2000, 72(11): 2482-2489
We describe the protein search engine "ProFound", which employs a Bayesian algorithm to identify proteins from protein databases using mass spectrometric peptide mapping data. The algorithm ranks protein candidates by taking into account individual properties of each protein in the database as well as other information relevant to the peptide mapping experiment. The program consistently identifies the correct protein(s) even when the data quality is relatively low or when the sample consists of a simple mixture of proteins. Illustrative examples of protein identifications are provided.
