Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
2.
Scrutineer is an interactive, user-friendly program designed to search for motifs, patterns and profiles in the Swissprot, Protein Identification Resource (PIR) or SeqDb protein sequence databases. Basic capabilities include (i) searches for strings of amino acids with multiple choices at a given position; (ii) searches for strings including variable-length segments and delocalized constraints; (iii) searches over subsets of a database or particular regions within each sequence (e.g. the N-terminal one-third); (iv) searches involving secondary structure predictions, physicochemical characteristics, and the like; and (v) searches using aligned sequences as targets with various optional weighting schemes. The various search criteria and hits can be combined to locate complex targets. Once the data are loaded into virtual memory, all occurrences in PIR release 22.0 (3.7 × 10^6 amino acids) of a given short string of amino acids (e.g. a hexamer) are found in approximately 36 s. Scrutineer can also describe the entire database, user-specified hits, user-defined regions of sequence and all hits. The source code and accompanying manual are being freely distributed.
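
Capabilities (i) to (iii) above can be illustrated with ordinary regular expressions. The following is only a minimal sketch, not Scrutineer's own implementation: the function name `find_motif`, the toy sequence and the motif are all hypothetical.

```python
import re

def find_motif(sequence, pattern, region=None):
    """Return (start, matched_text) for each non-overlapping motif hit.

    region is an optional (start, end) pair restricting the search to
    part of the sequence, e.g. an N-terminal segment.
    """
    start, end = region if region else (0, len(sequence))
    regex = re.compile(pattern)
    return [(m.start(), m.group()) for m in regex.finditer(sequence, start, end)]

# Hypothetical toy sequence (not taken from any database).
seq = "MAGSPAAGKSTLLNAGQWERGKTAA"

# G, any four residues, G, K, then a choice of S or T at the last position.
fixed_gap = r"G.{4}GK[ST]"
# Same motif with a variable-length segment of three to five residues.
variable_gap = r"G.{3,5}GK[ST]"

print(find_motif(seq, fixed_gap))                  # hits at positions 2 and 15
print(find_motif(seq, variable_gap))
print(find_motif(seq, fixed_gap, region=(0, 12)))  # only the hit at position 2
```

A character class such as `[ST]` expresses multiple choices at one position, and a bounded repeat such as `.{3,5}` expresses a variable-length segment.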

3.
Cathy Wu, Michael Berry, Sailaja Shivakumar, Jerry McLarty. Machine Learning, 1995, 21(1-2): 177-193
A neural network classification method has been developed as an alternative approach to the search/organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward, back-propagation networks. The protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures semantics of n-gram words, has improved the generalization capability of the network. A full-scale protein classification system has been implemented on a Cray supercomputer to classify unknown sequences into 3311 PIR (Protein Identification Resource) superfamilies/families at a speed of less than 0.05 CPU second per sequence. The sensitivity is close to 90% overall, and approaches 100% for large superfamilies. The system could be used to reduce the database search time and is being used to help organize the PIR protein sequence database.
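
The n-gram counting step can be sketched as follows. This is an illustrative assumption about the encoding, not the authors' code: the function name `ngram_vector`, the 20-letter alphabet ordering and the base-20 position scheme are hypothetical, and the SVD compression step is omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet

def ngram_vector(sequence, n=2):
    """Count n-gram occurrences in a fixed-length vector.

    Each n-gram is mapped to a slot by reading it as a base-20 number,
    so the vector has 20**n entries and is typically long and sparse.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vector = [0] * (len(AMINO_ACIDS) ** n)
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        if any(ch not in index for ch in gram):
            continue  # skip n-grams containing unknown residue codes
        slot = 0
        for ch in gram:
            slot = slot * len(AMINO_ACIDS) + index[ch]
        vector[slot] += 1
    return vector

v = ngram_vector("ACAC", n=2)
print(len(v), sum(v))  # 400 slots, 3 n-grams counted
```

For n = 2 the vector already has 400 entries, most of them zero for a single sequence, which is the sparsity the abstract's compression step is meant to address.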

4.
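Analysis and localization of the nr database   Total citations: 12, self-citations: 0, citations by others: 12
By using NCBI's blastp program to compare the human protein sequences in the nr database with those in the IPI database, this work demonstrates the necessity of integrating the nr database into a protein annotation system. On this basis, a scheme was designed to localize the nr database; the records of the nr database were analyzed statistically and internal access to the database was provided, completing the first step toward integrating the nr database into a protein annotation system.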

5.
The PRECISE database was developed by our laboratory to allow for the systematic study of the ligand interactions common to a set of functionally related enzymes, where an interaction site is defined broadly as any residue(s) that interact with a ligand. During the construction of PRECISE, enzyme chains are extracted from the Protein Data Bank (PDB) and clustered according to functional homology as defined by the Enzyme Commission (EC) nomenclature system. A sequence representative is chosen from each cluster based on the criterion set forth by the non-redundant PDB set, and pair-wise alignments of each cluster member to the representative are performed. Atom-based residue–ligand interactions are calculated for each cluster member, and the summation of ligand interactions for all cluster members at each aligned position is determined. Although we were able to successfully align most clusters using a simple dynamic programming algorithm, several of the clusters created exhibited poor pair-wise alignments of each cluster member to its sequence representative. We hypothesized that the observed alignment problems were, in most cases, due to the incorrect separation and alignment of different domains in multi-domain proteins, a mistake that frequently causes error proliferation in functional annotation. Here we present the results of generating primary sequence patterns for each poorly aligned cluster in PRECISE to assess the extent to which incorrectly aligned multi-domain proteins contribute to poor pair-wise alignments of each cluster member to its representative. This requires the use of an iterative locally optimal pair-wise alignment algorithm to build a hierarchical similarity-based sequence pattern for a set of functionally related enzymes.
Our results show that poor alignments in PRECISE are caused most frequently by the misalignment of multi-domain proteins, and that the generation of primary sequence patterns for the assignment of sequence family membership yields better alignments for the functionally related enzyme clusters in PRECISE than our original alignment algorithm.
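
The abstract does not say which "simple dynamic programming algorithm" was used for the original pair-wise alignments. As a hedged illustration only, a textbook global alignment score (Needleman-Wunsch style) looks like this; the function name and the match, mismatch and gap scores are assumptions, not PRECISE parameters.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "AGT"))         # 1
```

A global method of this kind forces every residue of both chains into the alignment, which is one plausible reason multi-domain proteins align poorly against a single-domain representative.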

7.
Query processing in uncertain databases has become increasingly important due to the wide existence of uncertain data in many real applications. Unlike the handling of precise data, uncertain query processing needs to consider the data uncertainty and answer queries with confidence guarantees. In this paper, we formulate and tackle an important query, namely the probabilistic inverse ranking (PIR) query, which retrieves possible ranks of a given query object in an uncertain database with confidence above a probability threshold. We present effective pruning methods to reduce the PIR search space, which can be seamlessly integrated into an efficient query procedure. Moreover, we tackle the problem of PIR query processing in high dimensional spaces by reducing high dimensional uncertain data to a lower dimensional space. Furthermore, we study three interesting and useful aggregate PIR queries, that is, MAX, top-m, and AVG PIRs. We also study an important query type, PIR with an uncertain query object (namely UQ-PIR), and design specific rules to facilitate the pruning. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approaches over both real and synthetic data sets, under various experimental settings.
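
The basic query can be sketched without any pruning. The example below assumes, hypothetically, that the uncertain objects are independent and that the probability of each one ranking ahead of the query object has already been computed; the function name `inverse_ranking` is not from the paper.

```python
def inverse_ranking(ahead_probs, threshold):
    """Return {rank: probability} for ranks whose probability exceeds threshold.

    ahead_probs[i] is the probability that uncertain object i ranks ahead
    of the query object; objects are assumed independent.
    """
    # dist[k] = probability that exactly k objects rank ahead of the query.
    dist = [1.0]
    for p in ahead_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # this object does not rank ahead
            nxt[k + 1] += mass * p     # this object ranks ahead
        dist = nxt
    # The query's rank is one more than the number of objects ahead of it.
    return {k + 1: prob for k, prob in enumerate(dist) if prob > threshold}

print(inverse_ranking([0.5, 0.5], threshold=0.3))  # {2: 0.5}
```

The dynamic program costs time quadratic in the number of objects, which gives a sense of why pruning objects that are certainly ahead of or behind the query matters.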

8.
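Protein function is crucial for understanding the mechanisms of cellular and biological activity and for studying disease mechanisms. With sequence databases growing rapidly, traditional experimental and sequence alignment methods are insufficient to support large-scale protein function annotation. To address this, the EGNet (evolutionary graph network) model is proposed. It uses the protein pre-trained language model ESM2 and one-hot encoding to obtain protein sequence encodings, and derives residue co-evolution information, PI (paired interaction) and SPI (strong paired interaction), through sequence self-attention and physical computation. The two kinds of evolutionary information and the sequence encodings are then fed into a multi-layer cascaded graph convolutional network, which learns node features of the sequence encodings to achieve end-to-end protein function prediction. Compared with earlier methods, EGNet achieves better performance on EC (Enzyme Commission) class labels in the ENZYME database, with an F-score of 0.89 and an AUPR of 0.91. The results show that EGNet obtains good results when predicting protein function from a single sequence alone, and can therefore provide fast and effective protein function annotation.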

9.
This paper proposes a heuristic procedure to solve the problem of scheduling and routing shipments in a hybrid hub-and-spoke network when a set of feasible discrete intershipment times is given. The heuristic procedure may be used to assist in the cooperative operational planning of a physical goods network between shippers and logistics service providers, or to assist shippers in making logistics outsourcing decisions. The objective is to minimise the transportation and inventory holding costs. It is shown through a set of problem instances that this heuristic procedure provides better solutions than existing economic order quantity-based approaches. Computational results are presented and discussed.

10.
Concurrent broadcast involves the dissemination of a database, consisting of messages initially distributed among the nodes of a network, so that a copy of the entire database eventually resides at each node. One application is the dissemination of network status information for adaptive routing in a communications network. This paper examines the time complexity and communication complexity of several distributed procedures for concurrent broadcast. The procedures do not use information depending on the network topology. The worst-case time complexity of a flooding procedure for concurrent broadcast is shown to be linear in the number of nodes plus the number of messages, and no other procedure for concurrent broadcast has a better worst-case time complexity. A variant of flooding is proposed to eliminate redundant message receipts from the flooding process by real-time signaling between neighbors concerning messages residing at each. This variant can reduce communication complexity, while having a worst-case time complexity similar in form to that of the flooding procedure. Special properties of concurrent broadcast in a tree are also given. The present time complexity results can be used to bound the time during which inconsistent databases may reside at different nodes, to evaluate and compare procedures for (or including) concurrent broadcast, and to schedule a sequence of instances of concurrent broadcast so that the instances do not overlap and there is no need for sequence numbers.
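
Flooding can be simulated in a few lines. This is a simplified sketch under assumptions that differ from the paper's model: rounds are synchronous and a node may forward all of its messages to every neighbor in one round, so the round count here does not reproduce the paper's complexity bound. The function name `flood` and the example network are hypothetical.

```python
def flood(adjacency, initial):
    """Simulate synchronous flooding until every node holds every message.

    adjacency maps each node to a list of its neighbours;
    initial maps nodes to the set of messages they start with.
    Returns (number_of_rounds, final_message_sets).
    """
    known = {node: set(initial.get(node, ())) for node in adjacency}
    everything = set().union(*known.values())
    rounds = 0
    while any(msgs != everything for msgs in known.values()):
        # Nodes forward only what they held at the start of the round.
        snapshot = {node: set(msgs) for node, msgs in known.items()}
        for node, neighbours in adjacency.items():
            for nb in neighbours:
                known[nb] |= snapshot[node]
        if known == snapshot:
            raise ValueError("network is not connected")
        rounds += 1
    return rounds, known

# Hypothetical three-node path a - b - c with one message at each end.
network = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
rounds, final = flood(network, {"a": {"m1"}, "c": {"m2"}})
print(rounds)  # 2
```

Note that the procedure uses only each node's neighbor list, in line with the abstract's remark that the procedures do not depend on global topology information.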

11.
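Private information retrieval is an important secure multi-party computation protocol in which a querying user and a database owner wish to complete a query without revealing their respective private information to each other; the problem has broad application prospects in cooperative computation among multiple intelligence agencies. This paper applies cryptographic techniques to the preprocessing-assisted random server protocol and proposes a new private information retrieval scheme. While keeping the communication complexity of traditional PIR protocols unchanged, the scheme effectively reduces computational complexity and can be applied efficiently to file data retrieval. The security, computational complexity and communication complexity of the scheme are analyzed.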

12.
In the present paper a distance concept of databases is investigated. Two database instances are of distance 0 if they have the same number of attributes and satisfy exactly the same set of functional dependencies. This naturally leads to the poset of closures as a model of changing databases. The distance of two databases (closures) is defined to be the distance of the two closures in the Hasse diagram of that poset. We determine the diameter of the poset and show that the distance of two closures is equal to the natural lower bound, that is, to the size of the symmetric difference of the collections of closed sets. We also investigate the diameter of the set of databases with a given system of keys. Sharp upper bounds are given in the case when the minimal keys are 2 (or r)-element sets.
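
The stated result, that the distance equals the size of the symmetric difference of the collections of closed sets, can be checked on small examples by brute force. The sketch below represents each closure by a set of functional dependencies; the function names and the two-attribute example are assumptions made for illustration.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def closed_sets(universe, fds):
    """All closed attribute sets, found by closing every subset of the universe."""
    items = sorted(universe)
    return {closure(subset, fds)
            for size in range(len(items) + 1)
            for subset in combinations(items, size)}

def closure_distance(universe, fds1, fds2):
    """Size of the symmetric difference of the two collections of closed sets."""
    return len(closed_sets(universe, fds1) ^ closed_sets(universe, fds2))

universe = {"A", "B"}
no_fds = []
a_determines_b = [(frozenset("A"), frozenset("B"))]
print(closure_distance(universe, no_fds, a_determines_b))  # 1
```

Adding A → B removes exactly one closed set, {A}, so the two closures are neighbors in this measure. The enumeration is exponential in the number of attributes and is meant only for small examples.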

13.
Protein secondary structures describe protein construction in terms of regular spatial shapes, including alpha-helices, beta-strands, and loops, which a protein's amino acid chain can adopt in some of its regions. This information supports protein classification, functional annotation, and 3D structure prediction. The relevance of this information and the scope of its practical applications create a requirement for its effective storage and processing. Relational databases, widely used in commercial systems in recent years, are one of the serious alternatives: honed by years of experience, enriched with developed technologies, equipped with the declarative SQL query language, and accepted by a large community of programmers. Unfortunately, relational database management systems are not designed for efficient storage and processing of biological data, such as protein secondary structures. In this paper, we present a new search method implemented in the search engine of the PSS-SQL language. PSS-SQL allows the formulation of queries against a relational database in order to find proteins having secondary structures similar to the structural pattern specified by a user. We show how the search process can be accelerated by multiple scanning of the Segment Index and by a parallel implementation of the alignment procedure using multiple threads working on multiple-core CPUs.

14.
Even as data and analytics-driven applications are becoming increasingly popular, retrieving data from shared databases poses a threat to the privacy of their users. For example, investors/patients retrieve records about stocks/diseases they are interested in from a stock/medical database. Knowledge of such interest is sensitive information that the database server would have access to, unless some mitigating measures are deployed. Private information retrieval (PIR) is a promising security primitive to protect the privacy of users’ interests. PIR allows the retrieval of a data record from a database without letting the database server know which record is being retrieved. The privacy guarantees could either be information theoretic or computational. Alternatively, anonymizers, which hide the identities of data users, may be used to protect the privacy of users’ interests for some situations. In this paper, we study rPIR, a new family of information-theoretic PIR schemes using ramp secret sharing. We have designed four rPIR schemes, using three ramp secret sharing approaches, achieving answer communication costs close to the cost of non-private information retrieval. Evaluation shows that, for many practical settings, rPIR schemes can achieve lower communication costs and the same level of privacy compared with traditional information-theoretic PIR schemes and anonymizers. Efficacy of the proposed schemes is demonstrated for two very different scenarios (outsourced data sharing and P2P content delivery) with realistic analysis and experiments. In many situations of these two scenarios, rPIR’s advantage of low communication cost outweighs its disadvantages, which results in less expenditure and/or better quality of service compared with what may be achieved if traditional information-theoretic PIR and anonymizers are used.
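
For orientation, the simplest traditional information-theoretic PIR construction, the two-server XOR scheme, can be sketched as follows. This is not rPIR and uses no ramp secret sharing; it assumes two non-colluding servers holding identical copies of the database, and all function names are hypothetical.

```python
import secrets

def make_queries(n, index):
    """Build two selection vectors that differ only at the wanted index."""
    q1 = [secrets.randbelow(2) for _ in range(n)]  # uniformly random bits
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2

def answer(database, query):
    """Server side: XOR together the records selected by the query bits."""
    acc = 0
    for record, selected in zip(database, query):
        if selected:
            acc ^= record
    return acc

def retrieve(database, index):
    """Client side: send one query to each server and XOR the two answers."""
    q1, q2 = make_queries(len(database), index)
    # In a real deployment each call to answer() runs on a different server.
    return answer(database, q1) ^ answer(database, q2)

db = [5, 9, 12, 7]  # toy records encoded as small integers
print([retrieve(db, i) for i in range(len(db))])  # [5, 9, 12, 7]
```

Each server sees a uniformly random bit vector, so neither learns the index on its own. The price is a query as long as the database, which illustrates the kind of communication cost that schemes such as rPIR aim to lower.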

15.
A functional dependency (fd) family was recently defined [20] as the set of all instances satisfying some set of functional dependencies. A Boyce-Codd normal form, abbreviated BCNF, family is defined here as an fd-family specified by some BCNF set of functional dependencies. The purpose of this paper is to present set-theoretic/algebraic characterizations relating to both types of families. Two characterizations of F(I), the smallest fd-family containing the family I of instances, are established. The first involves the notion of agreement, a concept related to that of a closed set of attributes. The second describes F(I) as the smallest family of instances containing I and closed under four specific operations on instances. Companion results are also given for BCNF-families. The remaining results concern characterizations involving the well-known operations of projection, join and union. Two characterizations for when the projection of an fd-family is again an fd-family are given. Several corollaries are obtained, including the effective decidability of whether a projection of an fd-family is an fd-family. The problem for BCNF-families disappears since it is shown that the projection of a BCNF-family is always a BCNF-family. Analogous to results for fd-families presented in [20], characterizations of when the join and union of BCNF-families are BCNF-families are given. Finally, the collections of all fd-families and all BCNF-families are characterized in terms of inverse projection operations and intersection.
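
The abstract does not define "agreement". A common reading, assumed here, is the agree set of two tuples: the set of attributes on which they take the same value. Under that assumption, an instance satisfies X → Y exactly when every pair of tuples that agrees on X also agrees on Y. The function names and the toy instance are hypothetical.

```python
from itertools import combinations

def agree_set(t1, t2):
    """Attributes on which two tuples (dicts with the same keys) agree."""
    return frozenset(attr for attr in t1 if t1[attr] == t2[attr])

def satisfies(instance, lhs, rhs):
    """True if the instance satisfies the functional dependency lhs -> rhs."""
    for t1, t2 in combinations(instance, 2):
        agreed = agree_set(t1, t2)
        if lhs <= agreed and not rhs <= agreed:
            return False
    return True

instance = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 2},
    {"A": 2, "B": 1, "C": 1},
]
print(satisfies(instance, {"A"}, {"B"}))  # True
print(satisfies(instance, {"A"}, {"C"}))  # False
```

With this test, membership of an instance in the fd-family specified by a set of dependencies reduces to checking each dependency in turn.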

16.
Nearest neighbor editing aided by unlabeled data   Total citations: 1, self-citations: 0, citations by others: 1
This paper proposes a novel method for nearest neighbor editing. Nearest neighbor editing aims to increase the classifier’s generalization ability by removing noisy instances from the training set. Traditionally, nearest neighbor editing edits (removes/retains) each instance by the voting of the instances in the training set (labeled instances). However, motivated by semi-supervised learning, we propose a novel editing methodology which edits each training instance by the voting of all the available instances (both labeled and unlabeled instances). We expect that the editing performance could be boosted by appropriately using unlabeled data. Our idea relies on the fact that in many applications, in addition to the training instances, many unlabeled instances are also available since they do not need human annotation effort. Three popular data editing methods, including edited nearest neighbor, repeated edited nearest neighbor and All k-NN, are adopted to verify our idea. They are tested on a set of UCI data sets. Experimental results indicate that all three editing methods can achieve improved performance with the aid of unlabeled data. Moreover, the improvement is more remarkable when the ratio of training data to unlabeled data is small.
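
The traditional baseline that the paper extends, edited nearest neighbor using labeled instances only, can be sketched as below. How unlabeled instances take part in the vote is not specified in the abstract, so it is left out; the function name, the squared Euclidean distance and the toy data are assumptions.

```python
from collections import Counter

def edited_nearest_neighbor(points, labels, k=3):
    """Return indices of instances kept by classic edited nearest neighbor.

    An instance is kept only if the majority label among its k nearest
    other training instances matches its own label.
    Assumes at least two instances.
    """
    keep = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort the other instances by squared Euclidean distance to p.
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])))
        votes = Counter(labels[j] for j in others[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return keep

# Toy one-dimensional data: the instance at 0.15 carries a noisy label.
points = [(0.0,), (0.1,), (0.2,), (0.15,), (5.0,), (5.1,), (5.2,)]
labels = ["a", "a", "a", "b", "b", "b", "b"]
print(edited_nearest_neighbor(points, labels))  # [0, 1, 2, 4, 5, 6]
```

The instance at 0.15 is removed because all three of its nearest neighbors are labeled "a".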

17.
With the rapid growth in the number of genomics research articles, it has become a challenge for biomedical researchers to access this ever-increasing quantity of information and keep up with the newest discoveries about the functions of the proteins they are studying. Text mining has thus become essential for facilitating the functional annotation of proteins by utilizing the huge amounts of biomedical literature and transforming the knowledge into easily accessible database formats. In this paper, we propose the method of sentence pattern mining to extract protein functions from biomedical literature. To recognize variants of function terms correctly, we identify morphological, syntactic, and semantic variation forms. The proposed methods can be used to aid database curators in annotating protein functions and to assist biologists and medical researchers in searching protein functions from biomedical literature.

18.
Pattern Databases   Total citations: 1, self-citations: 0, citations by others: 1
The efficiency of A* searching depends on the quality of the lower bound estimates of the solution cost. Pattern databases enumerate all possible subgoals required by any solution, subject to constraints on the subgoal size. Each subgoal in the database provides a tight lower bound on the cost of achieving it. For a given state in the search space, all possible subgoals are looked up in the pattern database, with the maximum cost over all lookups being the lower bound. For sliding tile puzzles, the database enumerates all possible patterns containing N tiles and, for each one, contains a lower bound on the distance to correctly move all N tiles into their correct final location. For the 15-Puzzle, iterative-deepening A* with pattern databases (N = 8) reduces the total number of nodes searched on a standard problem set of 100 positions by over 1000-fold.
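
A pattern database can be built by breadth-first search backwards from the goal in an abstracted puzzle. The sketch below uses the smaller 3 × 3 puzzle and a three-tile pattern to stay fast; the function name, the use of -1 for "don't care" tiles and the choice to count every blank move are assumptions made for illustration.

```python
from collections import deque

def build_pattern_db(goal, pattern_tiles, width=3):
    """Map each abstract state to its distance from the abstract goal.

    Tiles outside pattern_tiles are replaced by -1 (indistinguishable);
    0 is the blank. Every blank move costs 1, so each stored value is a
    lower bound on the true solution length.
    """
    def abstract(state):
        return tuple(t if t == 0 or t in pattern_tiles else -1 for t in state)

    height = len(goal) // width
    start = abstract(goal)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        blank = state.index(0)
        row, col = divmod(blank, width)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                target = r * width + c
                moved = list(state)
                moved[blank], moved[target] = moved[target], moved[blank]
                moved = tuple(moved)
                if moved not in dist:
                    dist[moved] = dist[state] + 1
                    queue.append(moved)
    return dist, abstract

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
pattern_db, abstract = build_pattern_db(goal, {1, 2, 3})
one_move_away = (1, 2, 3, 4, 5, 6, 7, 0, 8)
print(pattern_db[abstract(one_move_away)])  # 1
```

During search, the heuristic for a state is the value looked up for its abstraction; with several databases, the maximum over the lookups is used, as the abstract describes.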

19.
Private information retrieval (PIR) is normally modeled as a game between two players: a user and a database. The user wants to retrieve some item from the database without the latter learning which item is retrieved. Most current PIR protocols are ill-suited to provide PIR from a search engine or large database: (i) their computational complexity is linear in the size of the database; (ii) they assume active cooperation by the database server in the PIR protocol. If the database cannot be assumed to cooperate, a peer-to-peer (P2P) user community is a natural alternative to achieve some query anonymity: a user gets her queries submitted on her behalf by other users in the P2P community. In this way, the database still learns which item is being retrieved, but it cannot obtain the real query histories of users, which become diffused among the peer users. We name this relaxation of PIR user-private information retrieval (UPIR). A peer-to-peer UPIR system is described in this paper which relies on an underlying combinatorial structure to reduce the required key material and increase availability. Extensive simulation results are reported and a distributed key management version of the system is described.

20.
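Ontology-based Deep Web data annotation   Total citations: 3, self-citations: 0, citations by others: 3
Yuan Liu, Li Zhanhuai, Chen Shiliang. Journal of Software (《软件学报》), 2008, 19(2): 237-245
Borrowing the idea of deep annotation from the Semantic Web field, this paper proposes a method for semantically annotating the query results of Web databases. To obtain complete and consistent annotation results, a domain ontology is introduced into the annotation process as the global schema that the Web databases follow. The features of query interfaces and query results are analyzed in detail, and a query-condition resetting strategy is adopted to determine the semantic labels of the query result data. Tests on multiple Web databases from different domains show that, with the support of a domain ontology, the method can add correct semantic labels to Web database query results, which verifies its effectiveness.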


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号