Similar literature
 Found 20 similar documents (search time: 187 ms)
1.
Proteins can be grouped into families according to features such as hydrophobicity, composition or structure, with the aim of establishing their common biological functions. This paper presents a system conceived to discover features (particular sequences of amino acids, or motifs) that occur very often in proteins of a given family but rarely in proteins of other families. These features can be used to classify unknown proteins, that is, to predict their function by analyzing the primary structure. Experiments were run on the enzyme subset extracted from the Protein Data Bank. The heuristic method was based on a genetic algorithm with operators specially tailored to the problem. The motifs found were used to build a decision tree with the C4.5 algorithm. The results were compared with motifs found by MEME, a freely available web tool. A further comparison was made with the classification results of two other systems: a neural network-based tool and a hidden Markov model-based tool. The final performance was measured using sensitivity (Se) and specificity (Sp): similar results were obtained for the proposed tool (78.79 and 95.82) and the neural network-based tool (74.65 and 94.80, respectively), while MEME and HMMER performed worse. Compared with the other approaches, the proposed system has the advantage of producing comprehensible rules. The results obtained for the enzyme dataset suggest that the proposed evolutionary computation method is effective at finding patterns for protein classification.
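
The sensitivity (Se) and specificity (Sp) figures quoted in this abstract can be illustrated with a minimal sketch. The abstract does not give the underlying confusion-matrix counts, so the counts below are hypothetical and the function name is an assumption made for illustration only.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (Se, Sp) as percentages from confusion-matrix counts.

    Se = TP / (TP + FN): share of true family members that were recognised.
    Sp = TN / (TN + FP): share of non-members that were correctly rejected.
    """
    se = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    return se, sp


# Hypothetical counts, not taken from the paper.
se, sp = sensitivity_specificity(tp=80, fn=20, tn=90, fp=10)
print(f"Se = {se:.2f}, Sp = {sp:.2f}")
```

Reporting both numbers matters here because the classes are unbalanced: a tool can reach a high Sp simply by rejecting almost everything, which Se would expose.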

2.
Approaches for indexing proteins and for fast, scalable searching for structures similar to a query structure have important applications such as protein structure and function prediction, protein classification and drug discovery. In this paper, we develop a new method for extracting local structural (or geometric) features from protein structures. These feature vectors are in turn converted into a set of symbols, which are then indexed using a suffix tree. For a given query, the suffix tree index can be used effectively to retrieve the maximal matches, which are then chained to obtain the local alignments. Finally, similar proteins are retrieved by their alignment score against the query. Our results show classification accuracy of up to 50% at the topology level and 92.9% at the class level according to the CATH classification. These results outperform the best previous methods. We also show that PSIST is highly scalable due to the external suffix tree indexing approach it uses; it is able to index about 70,500 domains from SCOP in under an hour.

3.
Since the concept of structural classes of proteins was proposed, the problem of protein classification has been tackled by many groups. Most of their classification criteria are based only on the helix/strand contents of proteins. In this paper, we propose a method for classifying protein structures based on their secondary structure sequences. It is a classification scheme that can confirm existing classifications. A mathematical model is constructed to describe protein secondary structure sequences, in which each secondary structure sequence corresponds to a transition probability matrix that characterizes and differentiates protein structures numerically. Its application to a set of real data indicates that our method can classify protein structures correctly. The final classification result is shown schematically, so the structural classifications can be inspected visually, unlike with traditional methods.
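
As a rough illustration of mapping a secondary structure sequence to a transition probability matrix, the sketch below counts adjacent-state transitions and normalises each row. The three-letter alphabet (H, E, C), the example sequence and the function name are assumptions; the abstract does not specify the paper's exact model.

```python
from collections import Counter

STATES = "HEC"  # assumed alphabet: helix, strand, coil


def transition_matrix(ss):
    """Row-normalised matrix of transitions between adjacent states.

    Entry [i][j] estimates P(next state = STATES[j] | current = STATES[i]).
    Rows for states with no outgoing transition are left as zeros.
    """
    counts = Counter(zip(ss, ss[1:]))  # counts of adjacent state pairs
    matrix = []
    for a in STATES:
        row_total = sum(counts[(a, b)] for b in STATES)
        matrix.append(
            [counts[(a, b)] / row_total if row_total else 0.0 for b in STATES]
        )
    return matrix


# Hypothetical secondary structure string, for illustration only.
for state, row in zip(STATES, transition_matrix("HHHHCCEEEC")):
    print(state, [round(p, 3) for p in row])
```

Two proteins can then be compared numerically, for example by a distance between their matrices, which is one way a matrix representation could feed a classifier.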

5.
This paper analyzes four geostatistical functions (semivariogram, semimadogram, covariogram, and correlogram) with the purpose of characterizing lung nodules as malignant or benign in computerized tomography images. The tests described in this paper were carried out using a sample of 30 nodules, 24 benign and 6 malignant. Stepwise discriminant analysis was used to determine which combination of measures was best able to discriminate between the benign and malignant nodules. Then, a linear discriminant analysis procedure was performed using the selected features to evaluate their ability to predict the classification of each nodule. A leave-one-out procedure was used to provide a less biased estimate of the linear discriminator’s performance. All analyzed functions yield an area under the receiver operating characteristic (ROC) curve above 0.800, which indicates accuracy between good and excellent. The preliminary results of this approach are very promising for characterizing nodules using geostatistical functions.
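
For readers unfamiliar with the geostatistical functions named above, here is a minimal one-dimensional sketch of the empirical semivariogram and semimadogram at a given lag. The paper works on CT image data, so the 1D simplification, the sample values and the function names are assumptions made for illustration.

```python
def _lagged_pairs(values, lag):
    """All pairs of samples separated by exactly `lag` positions."""
    pairs = [(values[i], values[i + lag]) for i in range(len(values) - lag)]
    if not pairs:
        raise ValueError("lag is too large for the number of samples")
    return pairs


def semivariogram(values, lag):
    """Half the mean squared difference between samples `lag` apart."""
    pairs = _lagged_pairs(values, lag)
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))


def semimadogram(values, lag):
    """Half the mean absolute difference between samples `lag` apart."""
    pairs = _lagged_pairs(values, lag)
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))


# Hypothetical intensity profile, not data from the paper.
profile = [1, 2, 4, 7]
print(semivariogram(profile, 2), semimadogram(profile, 2))
```

Evaluating these functions over several lags gives a vector of texture measures per nodule, which is the kind of input a discriminant analysis can select from.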

6.
Linux malware can pose a significant threat (Linux penetration is increasing exponentially) because little is known or understood about Linux OS vulnerabilities. We believe that now is the right time to devise non-signature-based zero-day (previously unknown) malware detection strategies, before Linux intruders take us by surprise. Therefore, in this paper, we first carry out a forensic analysis of Linux executable and linkable format (ELF) files. This analysis provides insight into different features that have the potential to discriminate malicious executables from benign ones. As a result, we select a set of 383 features extracted from ELF headers. We quantify the classification potential of the features using information gain and then remove redundant features by employing preprocessing filters. Finally, we carry out an extensive evaluation of classical rule-based machine learning classifiers (RIPPER, PART, C4.5 Rules, and decision tree J48) and bio-inspired classifiers (cAnt Miner, UCS, XCS, and GAssist) to select the best classifier for our system. We have evaluated our approach on an available collection of 709 Linux malware samples from VX Heavens and Offensive Computing. Our experiments show that ELF-Miner provides more than 99% detection accuracy with a false alarm rate below 0.1%.
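
The information-gain step mentioned in this abstract can be sketched as follows for a single discrete feature. The feature values and labels are hypothetical, the function names are assumptions, and real ELF header fields may be numeric and need discretising first, which this sketch does not do.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy data: one header-derived flag per sample.
labels = ["malware", "malware", "benign", "benign"]
print(information_gain([1, 1, 0, 0], labels))  # feature separates the classes
print(information_gain([1, 0, 1, 0], labels))  # feature carries no information
```

Ranking all candidate features by this score and keeping the highest-scoring ones is a common filter-style selection scheme; the abstract's additional redundancy filters are not modelled here.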

7.
The purpose of this paper is to show how a large group of students can work collaboratively and synchronously within the classroom using the cheapest possible technological support. Making use of the features of Single Display Groupware and Multiple Mice, we propose a computer-supported collaborative learning approach for large groups within the classroom. The approach uses a multiple classification matrix, and our application was built for language learning (in this case Spanish). The basic collaboration mechanism on which the approach is based is “silent collaboration,” in which students, through suggestions and exchanges, must compare their ideas to those of their classmates. An exploratory experimental study was performed along with a quantitative and qualitative study that analyzed the ease of use of the software, described how the conditions for collaborative learning were achieved, evaluated the learning achievements under the defined language objectives, and analyzed the impact of silent and spoken collaboration. Our initial findings are that silent collaboration proved to be an effective mechanism for achieving learning in large groups in the classroom.

8.
A successful application of data mining to bioinformatics is protein classification. A number of techniques have been developed to classify proteins according to important features in their sequences, secondary structures, or three-dimensional structures. In this paper, we introduce a novel approach to protein classification based on significant patterns discovered on the surface of a protein. We define a notion called α-surface. We discuss the geometric properties of α-surface and present an algorithm that calculates the α-surface from a finite set of points in R³. We apply the algorithm to extracting the α-surface of a protein and use a pattern discovery algorithm to discover frequently occurring patterns on the surfaces. The pattern discovery algorithm utilizes a new index structure called the ΔB⁺ tree. We use these patterns to classify the proteins. While most existing techniques focus on the binary classification problem, we apply our approach to classifying three families of proteins. Experimental results show the good performance of the proposed approach.

9.
In this paper we introduce the knowledge representation features of a new multi-paradigm programming language called go! that cleanly integrates logic, functional, object-oriented and imperative programming styles. Borrowing from L&O [1], go! allows knowledge to be represented as a set of labeled theories incrementally constructed using multiple inheritance. The theory label is a constructor for instances of the class. The instances are go!’s objects. A go! theory structure can be used to characterize any knowledge domain. In particular, it can be used to describe classes of things, such as people, students, etc., their subclass relationships and the characteristics of their key properties. That is, it can be used to represent an ontology. For each ontology class we give a type definition (we declare what properties, with what value type, instances of the class have) and a labeled theory that defines these properties. Subclass relationships are reflected using both type and theory inheritance rules. Following [2], we shall call this ontology-oriented programming. This paper describes the go! language and its use for ontology-oriented programming, comparing its expressiveness with OWL, particularly OWL Lite [3]. The paper assumes some familiarity with ontology specification using OWL-like languages and with logic and object-oriented programming.

10.
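To meet the need for rapid detection of emerging Android malware families, this paper proposes MAML-CAS, an Android malware family classification model that combines MAML (model-agnostic meta-learning) and CBAM (convolutional block attention module). The DEX files in the Android malware sample set are visualized as grayscale images, and a task set is constructed. Incorporating the mixed-domain attention mechanism CBAM, two convolutional neural networks with the same structure are designed to serve as the base learner and the meta-learner; while automatically extracting sample features from the task set, both learners strengthen the expression of key features along the channel and spatial dimensions. The two learners are trained with the meta-learning method MAML: the base learner learns the attributes of specific malware family classification tasks, while the meta-learner learns what the different tasks have in common. Once both learners are trained, MAML-CAS obtains its initialization parameters; when it faces a new Android malware family classification task, it does not need to be retrained and can iterate quickly with only a small number of samples. The trained base learner is then used to extract Android malware family features, and an SVM performs the family classification. Experimental results show that MAML-CAS detects emerging few-sample Android malware families well, with fast detection and good stability.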

11.
In instance-based learning, the ‘nearness’ between two instances (used for pattern classification) is generally determined by a similarity function, such as the Euclidean metric or the Value Difference Metric (VDM). However, Euclidean-like similarity functions are normally only suitable for domains with numeric attributes. The VDM metrics are mainly applicable to domains with symbolic attributes, and their complexity increases with the number of classes in a specific application domain. This paper proposes an instance-based learning approach to alleviate these shortcomings. Grey relational analysis is used to precisely describe the entire relational structure of all instances in a specific domain. By using the grey relational structure, new instances can be classified with high accuracy. Moreover, the total number of classes in a specific domain does not affect the complexity of the proposed approach. Forty classification problems are used for performance comparison. Experimental results show that the proposed approach yields higher performance than other methods that adopt one or both of the above similarity functions. The proposed method also yields higher performance than some other classification algorithms. Chi-Chun Huang is currently Assistant Professor in the Department of Information Management at National Kaohsiung Marine University, Kaohsiung, Taiwan. He received the Ph.D. degree from the Department of Electronic Engineering at National Taiwan University of Science and Technology in 2003. His research includes intelligent Internet systems, grey theory, machine learning, neural networks and pattern recognition. Hahn-Ming Lee is currently Professor in the Department of Computer Science and Information Engineering at National Taiwan University of Science and Technology, Taipei, Taiwan. He received the B.S. degree and Ph.D. degree from the Department of Computer Science and Information Engineering at National Taiwan University in 1984 and 1991, respectively. His research interests include intelligent Internet systems, fuzzy computing, neural networks and machine learning. He is a member of IEEE, TAAI, CFSA and IICM.
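
To give a feel for grey relational analysis as a similarity measure, the sketch below computes grey relational grades in one common textbook formulation (Deng's grey relational coefficient with distinguishing coefficient ζ = 0.5). It is not the paper's full "grey relational structure"; the function name, the toy data and the assumption that attributes are already normalised to comparable scales are all illustrative.

```python
def grey_relational_grades(reference, candidates, zeta=0.5):
    """Grey relational grade of each candidate with respect to `reference`.

    Coefficient per attribute k: (d_min + zeta*d_max) / (d_k + zeta*d_max),
    where d_k = |reference[k] - candidate[k]| and d_min, d_max are taken
    over all candidates and attributes. The grade is the mean coefficient.
    """
    deltas = [
        [abs(r - c) for r, c in zip(reference, cand)] for cand in candidates
    ]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:  # every candidate is identical to the reference
        return [1.0] * len(candidates)
    return [
        sum((d_min + zeta * d_max) / (d + zeta * d_max) for d in row) / len(row)
        for row in deltas
    ]


# Hypothetical stored instances and a new instance to classify.
stored = [[1.0, 2.0], [3.0, 2.0]]
grades = grey_relational_grades([1.0, 2.0], stored)
nearest = max(range(len(stored)), key=grades.__getitem__)
print(grades, nearest)
```

A nearest-neighbour style classifier can then assign the new instance the label of the stored instance with the highest grade; note that the cost of this computation depends on the number of instances and attributes, not on the number of classes.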

12.
Proteins can be annotated by classifying the protein of interest into a known protein family in order to infer its functional and structural features. This paper presents a new method for classifying protein sequences based upon the hydropathy blocks occurring in protein sequences. First, a fixed-dimensional feature vector is generated for each protein sequence using the frequency of the hydropathy blocks occurring in the sequence. Then, a support vector machine (SVM) classifier is used to classify the protein sequences into the known protein families. The experimental results show that proteins belonging to the same family or subfamily can be identified using features generated from the hydropathy blocks.

13.
The high dimensionality of microarray datasets makes multiclass tissue classification difficult in several ways, the main challenge being the selection of relevant and non-redundant features to form the predictor set for classifier training. The need to vary the emphasis on relevance and redundancy, through the degree of differential prioritization (DDP), during the search for the predictor set is also of no small importance. Furthermore, there are several types of decomposition technique for the feature selection (FS) problem: all-classes-at-once, one-vs.-all (OVA) or pairwise (PW). In multiclass problems there is also the need to consider the type of classifier aggregation used: non-aggregated (a single machine) or aggregated (OVA or PW). We first propose a systematic approach to combining the distinct problems of FS and classification. Then, using eight well-known multiclass microarray datasets, we empirically demonstrate the effectiveness of the DDP in various combinations of FS decomposition types and classifier aggregation methods. Aided by the variable DDP, feature selection leads to classification performance that is better than that of rank-based or equal-priorities scoring methods, and to accuracies higher than previously reported for benchmark datasets with a large number of classes. Finally, based on several criteria, we make general recommendations on the optimal combination of FS decomposition type and classifier aggregation method for multiclass microarray datasets.

14.
This paper deals with protein structure analysis, which is useful for understanding the function of proteins and therefore evolutionary relationships, since for proteins, function follows from form (shape). One of the basic approaches to structure analysis is protein fold recognition (a protein fold is a 3D pattern), which is applied when there is no significant sequence similarity between structurally similar proteins. It does not rely on sequence similarity and can be achieved with relevant features extracted from protein sequences. Given (numerical) features, an existing machine learning technique can then be applied to learn and classify the proteins represented by these features. In this paper, we experiment with the K-local hyperplane distance nearest neighbor algorithm (HKNN) [12] applied to protein fold recognition. The goal is to compare it with other methods tested on a real-world dataset [3]. Two tasks are considered: (1) classification into four structural classes of proteins and (2) classification into the 27 most populated protein folds composing these structural classes. Preliminary results demonstrate that HKNN can successfully compete with other methods (in both speed and accuracy) and thus encourage its further exploration in bioinformatics. The text was submitted by the author in English. Oleg G. Okun received his candidate of technical sciences (PhD) degree in 1996 from the Institute of Engineering Cybernetics, Belarussian Academy of Sciences. In 1998, he joined the Machine Vision Group of the University of Oulu, Finland, where he is currently a senior lecturer. His research interests include machine learning and data mining as well as their applications in bioinformatics and finance. He has about 50 scientific publications.

15.
16.
Presents a method for finding patterns in 3D graphs. Each node in a graph is an undecomposable or atomic unit and has a label. Edges are links between the atomic units. Patterns are rigid substructures that may occur in a graph after allowing for an arbitrary number of whole-structure rotations and translations as well as a small number (specified by the user) of edit operations in the patterns or in the graph. (When a pattern appears in a graph only after the graph has been modified, we call that appearance an "approximate occurrence.") The edit operations include relabeling a node, deleting a node and inserting a node. The proposed method is based on the geometric hashing technique, which hashes node-triplets of the graphs into a 3D table and compresses the label-triplets in the table. To demonstrate the utility of our algorithms, we discuss two applications of them in scientific data mining. First, we apply the method to locating frequently occurring motifs in two families of proteins pertaining to RNA-directed DNA polymerase and thymidylate synthase and use the motifs to classify the proteins. Then, we apply the method to clustering chemical compounds pertaining to aromatic compounds, bicyclic alkanes and photosynthesis. Experimental results indicate the good performance of our algorithms and high recall and precision rates for both classification and clustering.

17.
Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify it by type in the absence of domain-specific models. Our approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class. We use image features such as the percentages of text and non-text (graphics, images, tables, and rulings) content regions, column structures, relative point sizes of fonts, density of the content area, and statistics of features of connected components, all of which can be derived without class knowledge. To obtain class labels for training samples, we conducted a study in which subjects ranked document pages with respect to their resemblance to representative page images. Class labels can also be assigned based on known document types, or can be defined by the user. We implemented our classification scheme using decision tree classifiers and self-organizing maps. Received June 15, 2000 / Revised November 15, 2000

18.
19.
‘Particularism’ and ‘generalism’ refer to families of positions in the philosophy of moral reasoning, with the former playing down the importance of principles, rules or standards, and the latter stressing their importance. Part of the debate has taken an empirical turn, and this turn has implications for AI research and the philosophy of cognitive modeling. In this paper, Jonathan Dancy’s approach to particularism (arguably one of the best known and most radical approaches) is questioned on both logical and empirical grounds. Doubts are raised over whether Dancy’s brand of particularism can adequately explain the graded nature of similarity assessments in analogical arguments. Simple recurrent neural network models of moral case classification are also presented and discussed. This is done to raise concerns about Dancy’s suggestion that neural networks can help us to understand how we could classify situations in a way that is compatible with his particularism. Throughout, the idea of a surveyable standard (one with restricted length and complexity) plays a key role. Analogical arguments are taken to involve multidimensional similarity assessments, and surveyable contributory standards are taken to be attempts to articulate the dimensions of similarity that may exist between cases. This work will be of relevance both to those who have interests in computationally modeling human moral cognition and to those who are interested in how such models may or may not improve our philosophical understanding of such cognition.

20.