Similar literature
 Found 20 similar articles; search took 31 ms
1.
Two homologous sequences, which have diverged beyond the point where their homology can be recognised by a simple direct comparison, can be related through a third sequence that is suitably intermediate between the two. High scores, for a sequence match between the first and third sequences and between the second and the third sequences, imply that the first and second sequences are related even though their own match score is low. We have tested the usefulness of this idea using a database that contains the sequences of 971 protein domains whose structures are known and whose residue identities with each other are some 40% or less (PDB40D). On the basis of sequence and structural information, 2143 pairs of these sequences are known to have an evolutionary relationship. FASTA, in an all-against-all comparison of the sequences in the database, detected 320 (15%) of these relationships as well as three false positives (i.e. a 1% error rate). Using intermediate sequences found by FASTA matches of PDB40D sequences to those in the large non-redundant OWL database we could detect 550 evolutionary relationships with an error rate of 1%. This means the intermediate sequence procedure increases the ability to recognise the evolutionary relationships amongst the PDB40D sequences by 70%.
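The intermediate-sequence idea in this abstract can be illustrated with a minimal sketch. It assumes pairwise match scores have already been computed (for example by FASTA) and stored in a dictionary; the function name `intermediates`, the helper `pair`, the toy scores and the threshold are all hypothetical and are not taken from the paper.

```python
def pair(x, y):
    """Order-independent key for a pair of sequence names."""
    return frozenset((x, y))


def intermediates(scores, a, b, threshold):
    """Return names of sequences that match both a and b at or above threshold.

    scores maps pair(x, y) -> match score; missing pairs count as 0.
    """
    names = set().union(*scores.keys()) - {a, b}
    return sorted(
        c for c in names
        if scores.get(pair(a, c), 0.0) >= threshold
        and scores.get(pair(b, c), 0.0) >= threshold
    )


# Toy scores (hypothetical): A and B match weakly, but both match C strongly.
scores = {
    pair("A", "B"): 12.0,
    pair("A", "C"): 95.0,
    pair("B", "C"): 88.0,
    pair("A", "D"): 90.0,
    pair("B", "D"): 20.0,
}
linked = intermediates(scores, "A", "B", threshold=80.0)
```

Here `linked` contains only `C`, because `D` matches `A` strongly but not `B`. A real procedure would also need a calibrated score cutoff to hold the error rate down, which this sketch does not attempt.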

2.
The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried near Munich, Germany, develops and maintains genome oriented databases. It is commonplace that the amount of sequence data available increases rapidly, but not the capacity of qualified manual annotation at the sequence databases. Therefore, our strategy aims to cope with the data stream by the comprehensive application of analysis tools to sequences of complete genomes, the systematic classification of protein sequences and the active support of sequence analysis and functional genomics projects. This report describes the systematic and up-to-date analysis of genomes (PEDANT), a comprehensive database of the yeast genome (MYGD), a database reflecting the progress in sequencing the Arabidopsis thaliana genome (MATD), the database of assembled, annotated human EST clusters (MEST), and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). MIPS provides access through its WWW server (http://www.mips.biochem.mpg.de) to a spectrum of generic databases, including the above-mentioned as well as a database of protein families (PROTFAM), the MITOP database, and the all-against-all FASTA database.

3.
We propose a new method for detecting conserved RNA secondary structures in a family of related RNA sequences. Our method is based on a combination of thermodynamic structure prediction and phylogenetic comparison. In contrast to purely phylogenetic methods, our algorithm can be used for small data sets of approximately 10 sequences, efficiently exploiting the information contained in the sequence variability. The procedure constructs a prediction only for those parts of sequences that are consistent with a single conserved structure. Our implementation produces reasonable consensus structures without user interference. As an example we have analysed the complete HIV-1 and hepatitis C virus (HCV) genomes as well as the small segment of hantavirus. Our method confirms the known structures in HIV-1 and predicts previously unknown conserved RNA secondary structures in HCV.

4.
A method is described for searching protein sequence databases using tandem mass spectra of tryptic peptides. The approach uses a de novo sequencing algorithm to derive a short list of possible sequence candidates which serve as query sequences in a subsequent homology-based database search routine. The sequencing algorithm employs a graph theory approach similar to previously described sequencing programs. In addition, amino acid composition, peptide sequence tags and incomplete or ambiguous Edman sequence data can be used to aid in the sequence determinations. Although sequencing of peptides from tandem mass spectra is possible, one of the frequently encountered difficulties is that several alternative sequences can be deduced from one spectrum. Most of the alternative sequences, however, are sufficiently similar for a homology-based sequence database search to be possible. Unfortunately, the available protein sequence database search algorithms (e.g. Blast or FASTA) require a single unambiguous sequence as input. Here we describe how the publicly available FASTA computer program was modified in order to search protein databases more effectively in spite of the ambiguities intrinsic in de novo peptide sequencing algorithms.

5.
The goal of the fungal mitochondrial genome project (FMGP) is to sequence complete mitochondrial genomes for a representative sample of the major fungal lineages; to analyze the genome structure, gene content, and conserved sequence elements of these sequences; and to study the evolution of gene expression in fungal mitochondria. By using our new sequence data for evolutionary studies, we were able to construct phylogenetic trees that provide further solid evidence that animals and fungi share a common ancestor to the exclusion of chlorophytes and protists. With a database comprising multiple mitochondrial gene sequences, the level of support for our mitochondrial phylogenies is unprecedented, in comparison to trees inferred with nuclear ribosomal RNA sequences. We also found several new molecular features in the mitochondrial genomes of lower fungi, including: (1) tRNA editing, which is the same type as that found in the mitochondria of the amoeboid protozoan Acanthamoeba castellanii; (2) two novel types of putative mobile DNA elements, one encoding a site-specific endonuclease that confers mobility on the element, and the other constituting a class of highly compact, structured elements; and (3) a large number of introns, which provide insights into intron origins and evolution. Here, we present an overview of these results, and discuss examples of the diversity of structures found in the fungal mitochondrial genome.

6.
We present a method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called EMOTIF (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, EMOTIF generates a set of motifs with a wide range of specificities and sensitivities. EMOTIF also can generate motifs that describe possible subfamilies of a protein superfamily. A disjunction of such motifs often can represent the entire superfamily with high specificity and sensitivity. We have used EMOTIF to generate sets of motifs from all 7,000 protein alignments in the BLOCKS and PRINTS databases. The resulting database, called IDENTIFY (http://motif.stanford.edu/identify), contains more than 50,000 motifs. For each alignment, the database contains several motifs having a probability of matching a false positive that ranges from 10^-10 to 10^-5. Highly specific motifs are well suited for searching entire proteomes, while generating very few false predictions. IDENTIFY assigns biological functions to 25-30% of all proteins encoded by the Saccharomyces cerevisiae genome and by several bacterial genomes. In particular, IDENTIFY assigned functions to 172 proteins of unknown function in the yeast genome.

7.
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.

8.
A "single-base sequence" is a DNA sequence in which the identities and locations of bases of only one type have been determined. We present experimental procedures for single-base sequencing and describe the effective use of existing software (FASTA) in similarity comparisons of single-base sequences. We determined the theoretical and experimental minimum sequence lengths required for identification of a sequence within a large dataset and optimized the FASTA parameters for use in single-base similarity comparisons. Single-base sequences have been used to identify cDNAs occurring in a database. Single-base sequencing could be used to reduce the redundancy of "shot-gun sequencing."
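To make the definition of a single-base sequence concrete, the following minimal sketch reduces a fully known DNA string to the view in which only one base type is resolved. The function name `single_base_view` and the use of `N` as the placeholder for undetermined positions are assumptions for illustration; the abstract does not say how such sequences were encoded for FASTA.

```python
def single_base_view(seq, base, other="N"):
    """Keep positions holding `base`; replace every other position with `other`."""
    base = base.upper()
    return "".join(ch if ch == base else other for ch in seq.upper())


# Only the A positions of the sequence remain identifiable.
a_track = single_base_view("GATTACA", "A")
```

Two such tracks can then be compared position by position, which is why longer reads are needed before a match becomes unambiguous.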

9.
Recent increases in the number of genome sequencing projects mean that the amount of protein sequence in databases is increasing at an astonishing pace. In proteome studies, this is facilitating the identification of proteins from molecularly well-defined organisms. However, in studies of proteins from the majority of organisms, proteins must be identified by comparing analytical data to sequences in databases from other species. This process is known as cross-species protein identification. Here we present a new program, MultiIdent, which uses multiple protein parameters such as amino acid composition, peptide masses, sequence tags, estimated protein pI and mass, to achieve cross-species protein identification. The program is structured so that protein amino acid composition, which is highly conserved across species boundaries, first generates a set of candidate proteins. These proteins are then queried with other protein parameters such as sequence tags and peptide masses. A final list of database entries which considers all analytical parameters is presented, ranked by an integrated score. We illustrate the power of the approach with the identification of a set of standard proteins, and the identification of proteins from dog heart separated by two-dimensional gel electrophoresis. The MultiIdent program is available on the world-wide web at: http://www.expasy.ch/sprot/multiident.html.

10.
We have designed and implemented a system to carry out cross-genome comparisons of open reading frames (ORFs) from multiple genomes. This implementation includes a genome profiling system that allows us to explore pairwise comparisons at different levels of match similarity and ask biologically motivated queries involving number and identity of ORFs, their function, functional category, distribution in genomes or in biological domains, and statistics on their matches and match families. This analysis required precise definition of new classification terms and concepts. We define the terms genomic signature, summary signature, biologic domain signature, domain class, match level, match family, and extended match family, then use these terms to define concepts, including genomically universal proteins and proteins characteristic of sets of genomes. We initiate an analysis based on automated FASTA (Pearson, 1996) comparison of 22,419 conceptually translated protein sequences from nine microbial genomes.

11.
The determination of complete genome sequences provides us with an opportunity to describe and analyze evolution at the comprehensive level of genomes. Here we compare nine genomes with respect to their protein coding genes at two levels: (i) we compare genomes as "bags of genes" and measure the fraction of orthologs shared between genomes and (ii) we quantify correlations between genes with respect to their relative positions in genomes. Distances between the genomes are related to their divergence times, measured as the number of amino acid substitutions per site in a set of 34 orthologous genes that are shared among all the genomes compared. We establish a hierarchy of rates at which genomes have changed during evolution. Protein sequence identity is the most conserved, followed by the complement of genes within the genome. Next is the degree of conservation of the order of genes, whereas gene regulation appears to evolve at the highest rate. Finally, we show that some genomes are more highly organized than others: they show a higher degree of the clustering of genes that have orthologs in other genomes.

12.
Comparative analysis of the complete sequences of seven bacterial and three archaeal genomes leads to the first generalizations of emerging genome-based microbiology. Protein sequences are, generally, highly conserved, with ~70% of the gene products in bacteria and archaea containing ancient conserved regions. In contrast, there is little conservation of genome organization, except for a few essential operons. The most striking conclusions derived by comparison of multiple genomes from phylogenetically distant species are that the number of universally conserved gene families is very small and that multiple events of horizontal gene transfer and genome fusion are major forces in evolution.

13.
Raw sequence data representing the majority of a bacterial genome can be obtained at a tiny fraction of the cost of a completed sequence. To demonstrate the utility of such a resource, 870 single-stranded M13 clones were sequenced from a shotgun library of the Salmonella typhi Ty2 genome. The sequence reads averaged over 400 bases and sampled the genome with an average spacing of once every 5,000 bases. A total of 339,243 bases of unique sequence was generated (approximately 7% representation). The sample of 870 sequences was compared to the complete Escherichia coli K-12 genome and to the rest of the GenBank database, which can also be considered a collection of sampled sequences. Despite the incomplete S. typhi data set, interesting categories could easily be discerned. Sixteen percent of the sequences determined from S. typhi had close homologs among known Salmonella sequences (P < 1e-40 in BlastX or BlastN), reflecting the proportion of these genomes that have been sequenced previously; 277 sequences (32%) had no apparent orthologs in the complete E. coli K-12 genome (P > 1e-20), of which 155 sequences (18%) had no close similarities to any sequence in the database (P > 1e-5). Eight of the 277 sequences had similarities to genes in other strains of E. coli or plasmids, and six sequences showed evidence of novel phage lysogens or sequence remnants of phage integrations, including a member of the lambda family (P < 1e-15). Twenty-three sample sequences had a significantly closer similarity to a sequence in the database from organisms other than the E. coli/Salmonella clade (which includes Shigella and Citrobacter). These sequences are new candidate lateral transfer events to the S. typhi lineage or deletions on the E. coli K-12 lineage. Eleven putative junctions of insertion/deletion events greater than 100 bp were observed in the sample, indicating that well over 150 such events may distinguish S. typhi from E. coli K-12. The need for automatic methods to more effectively exploit sample sequences is discussed.

14.
Heterologous DNA sequences from rearrangements with the genomes of host cells, genomic fragments from hybrid cells, or impure tissue sources can threaten the purity of libraries that are derived from RNA or DNA. Hybridization methods can only detect contaminants from known or suspected heterologous sources, and whole library screening is technically very difficult. Detection of contaminating heterologous clones by sequence alignment is only possible when related sequences are present in a known database. We have developed a statistical test to identify heterologous sequences that is based on the differences in hexamer composition of DNA from different organisms. This test does not require that sequences similar to potential heterologous contaminants are present in the database, and can in principle detect contamination by previously unknown organisms. We have applied this test to the major public expressed sequence tag (EST) data sets to evaluate its utility as a quality control measure and a peer evaluation tool. There is detectable heterogeneity in most human and C. elegans EST data sets but it is not apparently associated with cross-species contamination. However, there is direct evidence for both yeast and bacterial sequence contamination in some public database sequences annotated as human. Results obtained with the hexamer test have been confirmed with similarity searches using sequences from the relevant data sets.
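The abstract does not give the form of the statistical test, so the following is only a rough sketch of how hexamer composition could separate sequences from two sources. It counts overlapping hexamers and sums a smoothed log-odds ratio between two reference tables; the function names `kmer_counts` and `log_odds`, the pseudocount, and the toy reference strings are all hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k=6):
    """Count overlapping k-mers made up only of A, C, G and T."""
    seq = seq.upper()
    valid = set("ACGT")
    return Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )


def log_odds(seq, ref_a, ref_b, k=6, pseudo=1.0):
    """Sum of log(P_a / P_b) over the k-mers of seq, with add-`pseudo` smoothing.

    Positive values mean seq looks more like reference A, negative like B.
    """
    n_kmers = 4 ** k
    total_a = sum(ref_a.values()) + pseudo * n_kmers
    total_b = sum(ref_b.values()) + pseudo * n_kmers
    score = 0.0
    for kmer, n in kmer_counts(seq, k).items():
        p_a = (ref_a.get(kmer, 0) + pseudo) / total_a
        p_b = (ref_b.get(kmer, 0) + pseudo) / total_b
        score += n * math.log(p_a / p_b)
    return score


# Toy reference tables (hypothetical stand-ins for two organisms).
ref_a = kmer_counts("AAAAAAAAAAAA")
ref_b = kmer_counts("CCCCCCCCCCCC")
```

In practice the reference tables would be built from large samples of each organism's DNA, and a significance threshold would have to be calibrated separately.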

15.
Hydropathy profile alignment is introduced as a tool in functional genomics. The architecture of membrane proteins is reflected in the hydropathy profile of the amino acid sequence. Both secondary and tertiary structural elements determine the profile which provides enough sensitivity to detect evolutionary links between membrane proteins that are based on structural rather than sequence similarities. Since structure is better conserved than amino acid sequence, the hydropathy profile can detect more distant evolutionary relationships than can be detected by the primary structure. The technique is demonstrated by two approaches in the analysis of a subset of membrane proteins coded on the Escherichia coli and Bacillus subtilis genomes. The subset includes secondary transporters of the 12 helix type. In the first approach, the hydropathy profiles of proteins for which no function is known are aligned with the profiles of all other proteins in the subset to search for structural paralogues with known function. In the second approach, family hydropathy profiles of 8 defined families of secondary transporters that fall into 4 different structural classes (SC-ST1-4) are used to screen the membrane protein set for members of the structural classes. The analysis reveals that over 100 membrane proteins on each genome fall in only two structural classes. The largest structural class, SC-ST1, correlates largely with the Major Facilitator Superfamily defined before, but the number of families within the class has increased up to 57. The second large structural class, SC-ST2 contains secondary transporters for amino acids and amines and consists of 12 families.
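A hydropathy profile of the kind discussed in this abstract is commonly computed as a sliding-window average of per-residue hydropathy values. The sketch below assumes the Kyte-Doolittle scale and a 19-residue window; the abstract names neither, so both are illustrative choices, as is the function name `hydropathy_profile`.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not stated in the abstract).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return the mean hydropathy of each full window along the sequence."""
    values = [KD[residue] for residue in seq.upper()]
    if len(values) < window:
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# A short hydrophobic stretch flanked by charged residues (toy input).
profile = hydropathy_profile("KKKKLLLLLLLLLLKKKK", window=5)
```

Aligning two such profiles, rather than the residues themselves, is what lets structural similarity show through when sequence similarity has faded.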

16.
The complete amino acid sequence of gladiolus bulb chitinase-a (GBC-a) was determined. First the tryptic peptides from GBC-a after it was reduced and S-carboxymethylated were sequenced and then the peptides were further studied by chemical cleavage of the enzyme. GBC-a consisted of 274 amino acid residues and had a molecular mass of 30,714 Da. Two consensus sequences essential for chitinase activity by plant class III chitinases were conserved in GBC-a, although its sequence similarity with plant class III chitinases was less than 20%. Sequence comparison of GBC-a with sequences of other proteins in the Protein Identification Resource (PIR) showed that the GBC-a sequence was 33% similar to that of narbonin, a seed storage 2S globulin from narbon beans.

17.
The ABC transporter is a major class of cellular translocation machinery in all bacterial species encoded in the largest set of paralogous genes. The operon structure is frequently found for the genes of three molecular components: the ATP-binding protein, the membrane protein, and the substrate-binding protein. Here, we developed an "ortholog group table" by comparison and classification of known and putative ABC transporters in the complete genomes of seven microorganisms. Our procedure was to first search and classify the most conserved ATP-binding protein components by the sequence similarity and then to classify the entire transporter units by examining the similarity of the other components and the conservation of the operon structure. The resulting 25 ortholog groups of ABC transporters were well correlated with known functions. Through the analysis, we could assign substrate specificity to hypothetical transporters, predict additional transporter operons, and identify novel types of putative transporters. The ortholog group table was also used as a reference data set for functional assignment in four additional genomes. In general, the ABC transporter operons were strongly conserved despite the extensive shuffling of gene locations in bacterial evolution. In Synechocystis, however, the tendency of forming operons was clearly diminished. Our result suggests that the ancestral ABC transporter operons may have arisen early in evolution before the speciation of bacteria and archaea.

18.
A detailed analysis of protein domains involved in DNA repair was performed by comparing the sequences of the repair proteins from two well-studied model organisms, the bacterium Escherichia coli and yeast Saccharomyces cerevisiae, to the entire sets of protein sequences encoded in completely sequenced genomes of bacteria, archaea and eukaryotes. Previously uncharacterized conserved domains involved in repair were identified, namely four families of nucleases and a family of eukaryotic repair proteins related to the proliferating cell nuclear antigen. In addition, a number of previously undetected occurrences of known conserved domains were detected; for example, a modified helix-hairpin-helix nucleic acid-binding domain in archaeal and eukaryotic RecA homologs. There is a limited repertoire of conserved domains, primarily ATPases and nucleases, nucleic acid-binding domains and adaptor (protein-protein interaction) domains that comprise the repair machinery in all cells, but very few of the repair proteins are represented by orthologs with conserved domain architecture across the three superkingdoms of life. Both the external environment of an organism and the internal environment of the cell, such as the chromatin superstructure in eukaryotes, seem to have a profound effect on the layout of the repair systems. Another factor that apparently has made a major contribution to the composition of the repair machinery is horizontal gene transfer, particularly the invasion of eukaryotic genomes by organellar genes, but also a number of likely transfer events between bacteria and archaea. Several additional general trends in the evolution of repair proteins were noticed; in particular, multiple, independent fusions of helicase and nuclease domains, and independent inactivation of enzymatic domains that apparently retain adaptor or regulatory functions.

19.
Graphical dot-matrix plots can provide the most complete and detailed comparison of two sequences. Presented here is DOTTER, a dot-plot program for X-windows which can compare DNA or protein sequences, and also DNA versus protein. The main novel feature of DOTTER is that the user can vary the stringency cutoffs interactively, so that the dot-matrix only needs to be calculated once. This is possible thanks to a 'Greyramp tool' that was developed to change the displayed stringency of the matrix by dynamically changing the greyscale rendering of the dots. The Greyramp tool allows the user to interactively change the lower and upper score limit for the greyscale rendering. This allows exploration of the separation between signal and noise, and fine-grained visualisation of different score levels in the dot-matrix. Other useful features are dot-matrix compression, mouse-controlled zooming, sequence alignment display and saving/loading of dot-matrices. Since the matrix only has to be calculated once and since the algorithm is fast and linear in space, DOTTER is practical to use even for sequences as long as cosmids. DOTTER was integrated in the gene-modelling module of the genomic database system ACEDB. This was done via the homology viewer BLIXEM in a way that also allows segments from the BLAST suite of searching programs to be superimposed on top of the full dot-matrix. This feature can also be used for very quick finding of the strongest matches. As examples, we analyse a Caenorhabditis elegans cosmid with several tandem repeat families, and illustrate how DOTTER can improve gene modelling.
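The key design point in this abstract, computing the score matrix once and changing only the rendering, can be sketched as follows. The scoring here is a plain count of identical residues in a short window, which is an assumption for illustration and not DOTTER's actual scoring; the function names `dot_matrix` and `render` are likewise hypothetical.

```python
def dot_matrix(a, b, window=3):
    """Score every window pair once: number of identical positions per window."""
    rows = len(a) - window + 1
    cols = len(b) - window + 1
    return [
        [sum(a[i + k] == b[j + k] for k in range(window)) for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix, low, high):
    """Map stored scores to grey levels 0..255 between the low and high limits.

    Changing low/high re-renders the picture without recomputing the scores.
    """
    span = max(high - low, 1)
    return [
        [min(255, max(0, round(255 * (score - low) / span))) for score in row]
        for row in matrix
    ]


matrix = dot_matrix("ACGT", "ACGT", window=2)  # computed once
loose = render(matrix, low=0, high=2)          # shows every score level
strict = render(matrix, low=1, high=2)         # suppresses weak matches
```

Because `render` only rescales stored scores, moving the limits is cheap even when the matrix itself was expensive to build.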

20.
MOTIVATION: Optimal sequence alignment based on the Smith-Waterman algorithm is usually too computationally demanding to be practical for searching large sequence databases. Heuristic programs like FASTA and BLAST have been developed which run much faster, but at the expense of sensitivity. RESULTS: In an effort to approximate the sensitivity of an optimal alignment algorithm, a new algorithm has been devised for the computation of a gapped alignment of two sequences. After scanning for high-scoring words and extensions of these to form fragments of similarity, the algorithm uses dynamic programming to build an accurate alignment based on the fragments initially identified. The algorithm has been implemented in a program called SALSA and the performance has been evaluated on a set of test sequences. The sensitivity was found to be close to the Smith-Waterman algorithm, while the speed was similar to FASTA (ktup = 2). AVAILABILITY: Searches can be performed from the SALSA homepage at http://dna.uio.no/salsa/ using a wide range of databases. Source code and precompiled executables are also available. CONTACT: torbjorn.rognes@labmed.uio.no
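For orientation, the optimal-alignment baseline that this abstract compares against can be written in a few lines. The sketch below is a plain Smith-Waterman local alignment score with a linear gap penalty; it is not the SALSA algorithm, and the scoring parameters and the function name `smith_waterman_score` are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, O(len(b)) memory)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

Every cell of the full matrix is visited, so the cost grows with the product of the two lengths; that cost is what word-based heuristics try to avoid by restricting dynamic programming to promising regions.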
