Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
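Effective analysis of protein families is an important challenge in bioinformatics, and clustering has become one of the main approaches to it. Defining the similarity between protein sequences with traditional sequence alignment assumes that homologous segments remain contiguous, which conflicts with genetic recombination. To better identify protein families, a protein sequence family mining algorithm, ProFaM, is proposed. ProFaM first mines the patterns that characterize protein sequences using a prefix-projection strategy, then constructs a similarity measure from the patterns and their weights, and finally clusters protein sequence families with a shared nearest neighbor method. This addresses the shortcomings of earlier methods in protein pattern mining and similarity design. Experiments on the protein family database Pfam confirm that ProFaM performs well in protein family analysis.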
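
The shared nearest neighbor step mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration in the Jarvis-Patrick style, not ProFaM itself: the function names, the precomputed `similarity` matrix, and the rule that two items must appear in each other's neighbor lists are all assumptions made for the example.

```python
def k_nearest(similarity, i, k):
    """Indices of the k items most similar to item i (excluding i itself)."""
    others = [j for j in range(len(similarity)) if j != i]
    others.sort(key=lambda j: similarity[i][j], reverse=True)
    return set(others[:k])


def snn_similarity(similarity, i, j, k):
    """Number of neighbors shared by i and j, or 0 if they are not mutual neighbors."""
    neighbors_i = k_nearest(similarity, i, k)
    neighbors_j = k_nearest(similarity, j, k)
    if j not in neighbors_i or i not in neighbors_j:
        return 0
    return len(neighbors_i & neighbors_j)


# Hypothetical pairwise similarities between four sequences.
similarity = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
shared_01 = snn_similarity(similarity, 0, 1, k=2)
shared_03 = snn_similarity(similarity, 0, 3, k=2)
```

Sequences 0 and 1 are mutual neighbors that share sequence 2, whereas sequence 3 is not among the two nearest neighbors of sequence 0, so that pair scores zero.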

2.
It has become increasingly popular to study animal behaviors with the assistance of video recordings. An automated video processing and behavior analysis system is desired to replace the traditional manual annotation. We propose a framework for automatic video-based behavior analysis systems, which consists of four major modules: behavior modeling, feature extraction from video sequences, basic behavior unit (BBU) discovery, and complex behavior recognition. BBU discovery is performed based on features extracted from video sequences, hence the fusion of multi-dimensional features is very important. In this paper, we explore the application of feature fusion techniques to BBU discovery with one and multiple cameras. We applied the vector fusion (SBP) method, a multivariate vector visualization technique, in fusing the features obtained from a single camera. This technique reduces the multi-dimensional data into two-dimensional (SBP) space, and the spatial and temporal analysis in SBP space can help discover the underlying data groups. Then we present a simple feature fusion technique for BBU discovery from multiple cameras with the affinity graph method. Finally, we present encouraging results on a physical system and a synthetic mouse-in-a-cage scenario from one, two, and three cameras. The feature fusion methods in this paper are simple yet effective.

3.
Machine learning is being applied in bioinformatics and computational biology to solve challenging problems that arise in the analysis and modeling of biological data such as DNA, RNA, and protein. The major problems in classifying protein sequences into existing families/superfamilies are the following: the selection of a suitable sequence encoding method, the extraction of an optimized subset of features that possesses significant discriminatory information, and the adaptation of an appropriate learning algorithm that classifies protein sequences with higher classification accuracy. Accurate classification of protein sequences would be helpful in determining the structure and function of novel protein sequences. In this article, we propose a distance-based sequence encoding algorithm that captures the sequence's statistical characteristics along with amino acid sequence-order information. A statistical metric-based feature selection algorithm is then adopted to identify a reduced set of features that represents the original feature space. The performance of the proposed technique is validated using some of the best-performing classifiers previously applied to protein sequence classification. An average classification accuracy of 92% was achieved on the yeast protein sequence data set downloaded from the benchmark UniProtKB database.

4.
This paper deals with protein structure analysis, which is useful for understanding the function of proteins and therefore evolutionary relationships, since for proteins, function follows from form (shape). One of the basic approaches to structure analysis is protein fold recognition (a protein fold is a 3D pattern), which is applied when there is no significant sequence similarity between structurally similar proteins. It does not rely on sequence similarity and can be achieved with relevant features extracted from protein sequences. Given (numerical) features, one of the existing machine learning techniques can then be applied to learn and classify proteins represented by these features. In this paper, we experiment with the K-local hyperplane distance nearest neighbor algorithm (HKNN) [12] applied to protein fold recognition. The goal is to compare it with other methods tested on a real-world dataset [3]. Two tasks are considered: (1) classification into four structural classes of proteins and (2) classification into the 27 most populated protein folds composing these structural classes. Preliminary results demonstrate that HKNN can successfully compete with other methods (in both speed and accuracy) and thus encourage its further exploration in bioinformatics. The text was submitted by the author in English. Oleg G. Okun received his candidate of technical sciences (PhD) degree in 1996 from the Institute of Engineering Cybernetics, Belarussian Academy of Sciences. In 1998, he joined the Machine Vision Group of the University of Oulu, Finland, where he is currently a senior lecturer. His research interests include machine learning and data mining as well as their applications in bioinformatics and finance. He has about 50 scientific publications.

5.
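How to extract features from protein sequences effectively has long been an important task in bioinformatics. This paper studies eight sequence feature extraction methods and examines their performance with different classifiers for predicting the coenzyme dependence type of oxidoreductases. Among them, the amino acid composition method performs worst, with an average prediction accuracy of only 64.96%. Combining amphiphilic pseudo amino acid composition with the new amino acid composition distribution, and using a support vector machine as the classifier, gives the best recognition, reaching 92.93%. In addition, there appears to be a certain match between feature extraction methods and classifiers, and the best recognition is obtained only when the best match between them is found.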
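
For orientation, the baseline amino acid composition feature named in this abstract is simply the vector of residue frequencies. The sketch below assumes the 20 standard one-letter residue codes and silently skips any other character; the function name and the example sequence are made up for illustration and are not taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the 20-dimensional vector of residue frequencies for a sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical ten-residue fragment.
features = amino_acid_composition("MKVLAAGIVK")
```

Because this vector discards residue order entirely, it is plausible that it carries less discriminatory information than the pseudo amino acid composition variants, which is consistent with the accuracies the abstract reports.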

6.
Proteins control all biological functions in living species. Protein structures fall into four major classes: all-α, all-β, α+β, and α/β. Each class performs a different function according to its nature. Owing to the explosion of protein sequences in databanks, identifying protein structure classes through conventional methods is difficult in terms of cost and time. Given the importance of protein structure classes, it is highly desirable to develop a computational model that discriminates them with high accuracy. For this purpose, we propose an in silico method that incorporates Pseudo Average Chemical Shift and a Support Vector Machine. Two feature extraction schemes, Pseudo Amino Acid Composition and Pseudo Average Chemical Shift, are used to extract valuable information from protein sequences. The performance of the proposed model is assessed on four benchmark datasets, 25PDB, 1189, 640, and 399, using the jackknife test. The success rates of the proposed model are 84.2%, 85.0%, 86.4%, and 89.2%, respectively, on the four datasets. The empirical results indicate that the proposed model compares promisingly with existing models in the literature and might be useful for future research.

7.
Aircraft noise is one of the most uncomfortable kinds of sounds. That is why many organizations have addressed this problem through noise contours around airports, for which they use the aircraft type as the key element. This paper presents a new computational model to identify the aircraft class with a better performance, because it introduces the take-off noise signal segmentation in time. A method for signal segmentation into four segments was created. The aircraft noise patterns are extracted using an LPC (Linear Predictive Coding) based technique and the classification is made combining the output of four parallel MLP (Multilayer Perceptron) neural networks, one for each segment. The individual accuracy of each network was improved using a wrapper feature selection method, increasing the model effectiveness with a lower computational cost. The aircraft are grouped into classes depending on the installed engine type. The model works with 13 aircraft categories with an identification level above 85% in real environments.

8.
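Among class description rules, characteristic rules describe the features of objects in a target class, while discriminant rules distinguish a class from its contrasting classes. This work studies the representation of class description rules based on an object cube structure and a method for discovering them. Experiments verify the feasibility of the method and yield class description rules expressed in high-level concepts, which help users identify specific classes.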

9.
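The fault signature signals of electric motors are generally non-stationary, and traditional fault feature extraction methods based on assumptions of linearity and stationarity cannot accurately extract the time-frequency variation of non-stationary signals. To address this problem, the paper adopts the Hilbert-Huang transform (HHT), which is better suited to nonlinear, non-stationary signals, and proposes combining ensemble empirical mode decomposition (EEMD) with the grey relational degree for motor fault feature extraction. It verifies the feasibility of EEMD in suppressing mode mixing and the effectiveness of the grey relational degree method in identifying spurious components. Real motor fault signals are then analyzed experimentally, with a BP artificial neural network performing fault identification on the extracted feature vectors, showing that the method effectively improves the accuracy of motor fault feature extraction.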

10.
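The structure and function of a genome are closely related, and its function is expressed mainly through DNA subsequences, so studying the structure of DNA sequences is of great significance for bioinformatics. This paper studies the problem of counting how often each length-k DNA subsequence occurs in a complete DNA sequence, and designs and implements an internal counting algorithm and an external counting algorithm for length-k DNA subsequences. The algorithm uses a hash function to map each length-k DNA subsequence to an integer key, turning the frequency counting problem into one of counting repeated integer keys, so that the classical B-tree algorithm can be used to solve it. Three improvements tailored to the problem are proposed to further improve the performance of the algorithm.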
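
The key idea of this abstract, mapping each length-k subsequence to an integer key and then counting repeated keys, can be sketched as follows. Two substitutions are assumptions of this sketch rather than details from the paper: the hash function is taken to be a plain base-4 encoding of A, C, G, T, and an in-memory `Counter` stands in for the B-tree.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per base
BASES = "ACGT"


def kmer_key(kmer):
    """Map a k-mer to an integer by reading it as a base-4 number."""
    key = 0
    for base in kmer:
        key = key * 4 + CODE[base]
    return key


def key_to_kmer(key, k):
    """Invert kmer_key for a known k."""
    letters = []
    for _ in range(k):
        letters.append(BASES[key % 4])
        key //= 4
    return "".join(reversed(letters))


def count_kmers(sequence, k):
    """Count occurrences of every length-k window, keyed by integer."""
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if all(base in CODE for base in window):  # skip windows with N etc.
            counts[kmer_key(window)] += 1
    return counts


counts = count_kmers("ACGTACGTAC", 3)
```

With this encoding, distinct k-mers never collide, and each key fits in 2k bits, which is what makes an ordered integer index such as a B-tree a natural container for the counts.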

11.
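Electric power wireless communication networks serve a large and widely distributed user base, carry highly concurrent services, and operate in complex environments, and they appear as multiple coexisting heterogeneous networks. To let smart power terminals dynamically select a network to access, network discovery and identification must be performed first. For a scenario in which a TD-LTE private wireless network, a WiMAX private wireless network, a smart grid neighborhood area network, and a 230 MHz electric power private wireless network coexist, a network identification algorithm is proposed that fuses the time-frequency features of physical-layer signals with MAC-layer protocol features. The algorithm combines improved sliding-window energy detection with weighted cyclostationary feature detection based on multi-period characteristics to perform network discovery and identification. Simulation results show that the algorithm effectively identifies multiple heterogeneous electric power wireless communication networks.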

12.
Sequences with low auto-correlation have been applied in code-division multiple access communication systems, radar, and cryptography. Using the inverse Gray mapping, a quaternary sequence of even length N can be obtained from two binary sequences of the same length, which are called component sequences. In this paper, using an interleaving method, we present several classes of component sequences from twin-prime sequence pairs or GMW sequence pairs given by Tang and Ding in 2010, or from two, three, or four binary sequences defined by cyclotomic classes of order 4. Hence we can obtain new classes of quaternary sequences, which are different from known ones, since known component sequences are constructed from a pair of binary sequences with optimal auto-correlation or from Sidel'nikov sequences.
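
To make the construction in this abstract concrete, the sketch below combines two binary component sequences into a quaternary sequence through the inverse Gray mapping and evaluates a periodic auto-correlation. The bit ordering of the Gray map (0→00, 1→01, 2→11, 3→10) and the sign convention in the correlation sum are common choices assumed here; the paper may use a different convention, and the example sequences are arbitrary.

```python
# Inverse of the Gray map 0 -> (0,0), 1 -> (0,1), 2 -> (1,1), 3 -> (1,0).
INVERSE_GRAY = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

POWERS_OF_I = [1, 1j, -1, -1j]  # i**0 .. i**3


def quaternary_from_components(a, b):
    """Combine two equal-length binary sequences into one quaternary sequence."""
    if len(a) != len(b):
        raise ValueError("component sequences must have the same length")
    return [INVERSE_GRAY[(x, y)] for x, y in zip(a, b)]


def autocorrelation(u, shift):
    """Periodic auto-correlation: sum over t of i**(u[t+shift] - u[t])."""
    n = len(u)
    return sum(POWERS_OF_I[(u[(t + shift) % n] - u[t]) % 4] for t in range(n))


# Arbitrary component sequences chosen only to exercise the mapping.
u = quaternary_from_components([0, 0, 1, 1], [0, 1, 1, 0])
```

At shift 0 the sum always equals the sequence length; a sequence family is considered good when the magnitudes at all other shifts stay small.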

13.
The analysis of small datasets in high dimensional spaces is inherently difficult. For two-class classification problems there are a few methods that are able to face the so-called curse of dimensionality. However, for multi-class sparsely sampled datasets there are hardly any specific methods. In this paper, we propose four multi-class classifier alternatives that effectively deal with this type of data. Moreover, these methods implicitly select a feature subset optimized for class separation. Accordingly, they are especially interesting for domains where an explanation of the problem in terms of the original features is desired. In the experiments, we applied the proposed methods to an MDMA powders dataset, where the problem was to recognize the production process. It turns out that the proposed multi-class classifiers perform well, while the few utilized features correspond to known MDMA synthesis ingredients. In addition, to show the general applicability of the methods, we applied them to several other sparse datasets, ranging from bioinformatics to chemometrics datasets having as few as tens of samples in tens to even thousands of dimensions and three to four classes. The proposed methods had the best average performance, while very few dimensions were effectively utilized.

14.
Classification of high-dimensional statistical data is usually not amenable to standard pattern recognition techniques because of an underlying small sample size problem. To address the problem of high-dimensional data classification in the face of a limited number of samples, a novel principal component analysis (PCA) based feature extraction/classification scheme is proposed. The proposed method yields a piecewise linear feature subspace and is particularly well-suited to difficult recognition problems where achievable classification rates are intrinsically low. Such problems are often encountered in cases where classes are highly overlapped, or in cases where a prominent curvature in data renders a projection onto a single linear subspace inadequate. The proposed feature extraction/classification method uses class-dependent PCA in conjunction with linear discriminant feature extraction and performs well on a variety of real-world datasets, ranging from digit recognition to classification of high-dimensional bioinformatics and brain imaging data.

15.
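An extensible twiddle factor table and FFT algorithm   Total citations: 1 (self-citations: 0, citations by others: 1)
This paper proposes a twiddle factor table in bit-reversed order for computing the fast Fourier transform. The table is extensible: in essence, its entries do not depend on the number of transform points, so when the number of points changes the table either need not be recomputed or is easily extended. Based on this table, the paper designs a regularly structured, essentially radix-4 algorithm and software program for computing the 2^n-point FFT, and the program is compared experimentally with the FFTW software package. Using the computation of protein sequence similarity as an example, the authors' algorithm is also compared with the corresponding algorithm in FFTW. The results show that the proposed algorithm saves about 31.7% of the computation time.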
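
One way to see why a bit-reversed twiddle factor table can be extended rather than recomputed is sketched below. It stores exp(-2πik/n) for k = 0..n/2-1 with k taken in bit-reversed order; under that layout, the table for a smaller power-of-two size is a prefix of the table for a larger one. The layout and function names are assumptions of this sketch, and the paper's actual table may be organized differently.

```python
import cmath


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def twiddle_table(n):
    """Twiddle factors for an n-point FFT (n a power of two), bit-reversed order."""
    half = n // 2
    bits = half.bit_length() - 1  # log2(half)
    return [
        cmath.exp(-2j * cmath.pi * bit_reverse(j, bits) / n)
        for j in range(half)
    ]


small = twiddle_table(8)    # 4 entries
large = twiddle_table(16)   # 8 entries; the first 4 match `small`
```

Entry j depends only on the binary digits of j, not on n, so growing the transform size only appends new entries to the end of the table.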

16.
We introduce a novel approach to signal classification based on evolving temporal pattern detectors (TPDs) that can find the occurrences of embedded temporal structures in discrete time signals and illustrate its application to characterizing the alcoholic brain using visually evoked response potentials. In contrast to conventional techniques used for most signal classification tasks, this approach unifies the feature extraction and classification steps. It makes no prior assumptions regarding the spectral characteristics of the data; it merely assumes that some temporal patterns exist that distinguish two classes of signals and therefore could be applied to new signal classification tasks where a body of prior work identifying important features does not exist. Evolutionary computation (EC) discovers a classifier by simply learning from the time series samples. The alcoholic classification (AC) problem consists of two sub-tasks, one spatial and one temporal: choosing a subset of electroencephalogram leads used to create a composite signal (the spatial task), and detecting temporal patterns in this signal that are more prevalent in the alcoholics than the controls (the temporal task). To accomplish this, a novel representation and crossover operator were devised that enable multiple feature subset tasks to be solved concurrently. Three TPD techniques are presented that differ in the mechanism by which partial credit is assigned to temporal patterns that deviate from the specified pattern. An EC approach is used for evolving a subset of sensors and the TPD specifications. We found evidence that partial credit does help evolutionary discovery. Regions on the skull of an alcoholic subject that produced abnormal electrical activity compared to the controls were located. These regions were consistent with prior findings in the literature.
The classification accuracy was measured as the area under the receiver operating characteristic (ROC) curve; the ROC area for the training set varied from 90.32% to 98.83% and for the testing set it varied from 87.17% to 95.9%.
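
The accuracy measure used in this abstract, the area under the ROC curve, can be computed directly from classifier scores as the fraction of positive/negative pairs that are ranked correctly, with ties counted as half. The sketch below uses that pairwise definition; the scores and labels are invented for illustration and have nothing to do with the EEG data in the paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise ranking definition."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical detector outputs: label 1 = alcoholic, 0 = control.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The quadratic loop is fine for small sets; a rank-based formulation gives the same value in O(n log n) when the number of samples is large.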

17.
Knowledge discovery from image data is a multi-step iterative process. This paper describes the procedure we have used to develop a knowledge discovery system that classifies regions of the ocean floor based on textural features extracted from acoustic imagery. The image is subdivided into rectangular cells called texture elements (texels); a gray-level co-occurrence matrix (GLCM) is computed for each texel in four directions. Secondary texture features are then computed from the GLCM, resulting in a feature vector representation of each texel instance. Alternatively, a region-growing approach is used to identify irregularly shaped regions of varying size which have a homogeneous texture and for which the texture features are computed. The Bayesian classifier Autoclass is used to cluster the instances. Feature extraction is one of the major tasks in knowledge discovery from images. The initial goal of this research was to identify regions of the image characterized by sand waves. Experiments were designed to use expert judgements to select the most effective set of features, to identify the best texel size, and to determine the number of meaningful classes in the data. The region-growing approach has proven to be more successful than the texel-based approach. This method provides a fast and accurate method for identifying provinces in the ocean floor of interest to geologists.
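
The GLCM step described in this abstract can be illustrated with a small pure-Python sketch. It counts ordered pixel pairs at a given offset inside one texel and derives one secondary feature (contrast). The offsets chosen for the four directions, the non-symmetric counting, and the 4-level example texel are assumptions made for illustration, not details taken from the paper.

```python
# Assumed (dx, dy) offsets for 0, 45, 90, and 135 degrees at distance 1.
DIRECTIONS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}


def glcm(image, dx, dy, levels):
    """Count ordered pairs (image[y][x], image[y+dy][x+dx]) inside the texel."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                matrix[image[y][x]][image[ny][nx]] += 1
    return matrix


def contrast(matrix):
    """Contrast feature: sum of (i - j)^2 weighted by normalized co-occurrence."""
    total = sum(sum(row) for row in matrix)
    weighted = sum(
        (i - j) ** 2 * count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )
    return weighted / total


# A made-up 4x4 texel with gray levels 0..3.
texel = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
horizontal = glcm(texel, *DIRECTIONS[0], levels=4)
feature_vector = [contrast(glcm(texel, dx, dy, 4)) for dx, dy in DIRECTIONS.values()]
```

Repeating this for every texel and stacking several such secondary features gives the kind of per-texel feature vector that a clustering step can then consume.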

18.
The kernel method, especially the kernel-fusion method, is widely used in social networks, computer vision, bioinformatics, and other applications. It deals effectively with nonlinear classification problems: it can map linearly inseparable biological sequence data from a low-dimensional to a high-dimensional space for more accurate differentiation, which enables the use of kernel methods to predict the structure and function of sequences. The kernel method is therefore significant in solving bioinformatics problems. Various kernels applied in bioinformatics are explained, which can help readers select suitable kernels for different tasks. Massive amounts of biological sequence data occur in practical applications. Research on using machine learning methods to obtain knowledge, and on how to explore biological structure and function through theoretical prediction, has always been emphasized in bioinformatics. The kernel method has gradually become an important learning algorithm that is widely used in gene expression and biological sequence prediction. This review focuses on the requirements of classification tasks on biological sequence data. It studies kernel methods and optimization algorithms, including methods of constructing kernel matrices based on the characteristics of biological sequences and kernel fusion methods within a multiple kernel learning framework.

19.
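Accurately identifying signal peptides is very important for protein research and localization. Compressed sensing can retain the main information of a biological sequence while reducing redundant information, projecting high-dimensional information onto a low-dimensional space for feature extraction. Based on compressed sensing combined with the dynamic time warping algorithm, this paper extracts new feature vectors and proposes a new, highly discriminative method for signal peptide feature extraction. The extracted features not only reflect important information such as the amino acid composition, ordering, and structure of signal peptides, but also nonlinearly warp and align the different regions of signal peptides along the time dimension, giving machine learning algorithms an effective feature representation of signal peptides. Experimental results show that the feature vectors extracted by the new method achieve recognition rates of 99.65%, 98.05%, and 98.56% on three datasets, Eukaryotes, Gram+ bacteria, and Gram- bacteria, respectively, and that the method can easily be applied to the identification of other biological sequences.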
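
The dynamic time warping component mentioned in this abstract aligns two sequences of possibly different lengths by letting one stretch against the other. Below is a minimal textbook version on numeric sequences, using absolute difference as the local cost and no window constraint; how the paper encodes residues numerically and combines DTW with compressed sensing is not covered, and the inputs are invented.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The second profile repeats one value, which warping absorbs at no cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The zero cost for the stretched pair shows the nonlinear alignment the abstract refers to: regions of different lengths are matched without penalty as long as their values agree.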

20.
This paper deals with an approach to knowledge discovery in databases applied to identify a dynamic model of a real machine. The problem considered is how to identify dynamic models suitable for model-based diagnosis of a physical object. Special attention is paid to unsupervised identification when large databases collected by a SCADA system are handled. A method for identifying dynamic models of objects and processes is presented, and its usefulness in technical diagnostics is shown. The method for analyzing quantitative dynamic data is based on available methods of knowledge discovery in databases. The essence of the method is to project the values of a considered set of attributes into the so-called multidimensional space of regressors. A genetic algorithm was used to select the subset of relevant features. Knowledge was induced using the support vector machine (SVM) method. The AIC measure as well as our own heuristic function were applied as evaluation criteria. The method was applied to discover a model of temperature changes in a pump. The research analyzed data gathered by an industrial system that records data on a particular object, a deep-well pumping station.
