首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 265 毫秒
1.
The combinatorial explosion of motif patterns occurring in 1D and 2D arrays leads to the consideration of special classes of motifs growing linearly with the size of the input array. Such motifs, called irredundant motifs, are able to succinctly represent all of the other motifs occurring in the same array within reasonable time and space bounds. In previous work irredundant motifs were extracted from 2D arrays in O(N2log2nloglogn) and O(N3) time, where N is the size of the 2D input array and n is its largest dimension.In this paper, we present an algorithm to extract irredundant motifs from 2D arrays that is quadratic in the size of the input. The input is defined on a binary alphabet. It is shown that the algorithm is optimal and practically faster than the previous ones.  相似文献   

2.
The problems of finding maximal and minimal equivalent representations for gapped and non-gapped motifs as well as finding motifs that characterize a fixed set of occurrence locations for a given string are studied. We apply two equivalence relations on representations. The first one is the well-known occurrence-equivalence of motifs. The second equivalence is introduced for patterns of occurrence locations, to characterize such patterns by motifs. For both equivalences, quadratic-time algorithms are given for finding a maximal representative of an equivalence class. Finding a minimal representative is shown to be NP-complete in both cases. For non-gapped motifs suffix-tree-based linear-time algorithms are given for finding maximal and minimal representatives. Maximal (minimal) gapped motifs are composed of blocks that are maximal (minimal) non-gapped motifs, maximal and minimal non-gapped motifs thus making up a small basis for all motifs. The implied bound on the number of gapped motifs that have a fixed number of non-gapped blocks is also given.  相似文献   

3.
The forecasting process of real-world time series has to deal with especially unexpected values, commonly known as outliers. Outliers in time series can lead to unreliable modeling and poor forecasts. Therefore, the identification of future outlier occurrence is an essential task in time series analysis to reduce the average forecasting error. The main goal of this work is to predict the occurrence of outliers in time series, based on the discovery of motifs. In this sense, motifs will be those pattern sequences preceding certain data marked as anomalous by the proposed metaheuristic in a training set. Once the motifs are discovered, if data to be predicted are preceded by any of them, such data are identified as outliers, and treated separately from the rest of regular data. The forecasting of outlier occurrence has been added as an additional step in an existing time series forecasting algorithm (PSF), which was based on pattern sequence similarities. Robust statistical methods have been used to evaluate the accuracy of the proposed approach regarding the forecasting of both occurrence of outliers and their corresponding values. Finally, the methodology has been tested on six electricity-related time series, in which most of the outliers were properly found and forecasted.  相似文献   

4.
ObjectiveThis paper presents an algorithm for the solution of the motif discovery problem (MDP).Methods and materialsMotif discovery problem can be considered in two cases: motifs with insertions/deletions, and motifs without insertions/deletions. The first group motifs can be found by stochastic and approximated methods. The second group can be found by using stochastic and approximated methods, but also deterministic method. We proved that the second group motifs can be found with a deterministic algorithm, and so, it can be said that the second motifs finding is a P-type problem as proved in this paper.Results and conclusionsAn algorithm was proposed in this paper for motif discovery problem. The proposed algorithm finds all motifs which are occurred in the sequence at least two times, and it also finds motifs of various sizes. Due to this case, this algorithm is regarded as Automatic Exact Motif Discovery Algorithm. All motifs of different sizes can be found with this algorithm, and this case was proven in this paper. It shown that automatic exact motif discovery is a P-type problem in this paper. The application of the proposed algorithm has been shown that this algorithm is superior to MEME, MEME3, Motif Sampler, WEEDER, CONSENSUS, AlignACE.  相似文献   

5.
The problem of bit dimension in a control memory specification has been tackled by many authors. Based on a switching formulation, Das et al., have developed a simplified methodology which gives a minimum solution. An alternative procedure exploiting the reachability concept of Petri net (PN) is proposed here. In the proposed technique all maximal compatible classes of microoperations for a given ROM and all the irredundant solutions of CM cover table proposed by Das et al. are determined. The porposed technique is simple and requires less computational effort as only vector addition on a matrix is needed.  相似文献   

6.
紧致依赖与内涵亏值   总被引:1,自引:0,他引:1  
马垣  张学东  迟呈英 《软件学报》2011,22(5):962-971
提出了"内涵亏值"与"紧致依赖"的概念,证明了由"紧致依赖"组成的依赖基对于"左部加属性、右部减属性"这一规则的公理系统是无冗余而完整的.由此发现了除Guigues-Duquenne基以外还有其他无冗余完整依赖基,改变了只有唯一的一个无冗余完整依赖基的传统观念,揭开了寻找多种无冗余完整依赖基以满足多样化需求的序幕.  相似文献   

7.
F. Lapscher 《Calcolo》1967,4(1):21-40
The preliminary computation of the «majorants» enables the determination of the minimal or quasi-minimal disjunctive irredundant forms of an incompletely specified function [3]. In practical cases, the search algorithm for the disjunctive irredundant forms is applicable only if the number of «majorants» is not too large. In this paper formulas are derived to determine the average value of this number. Then the computation of the average number of prime implicants, already given byMileto andPutzolu [4], is rederived in a new form. The result is used to built functions having a great number of prime implicants and yields, for a function ofn variables, a lower bound of the maximal number of prime implicants.  相似文献   

8.
Finding motifs in biological sequences is one of the most intriguing problems for string algorithm designers due to, on the one hand, the numerous applications of this problem in molecular biology and, on the other hand, the challenging aspects of the computational problem. Indeed, when dealing with biological sequences it is necessary to work with approximations (that is, to identify fragments that are not necessarily identical, but just similar, according to a given similarity notion), and this complicates the problem. Existing algorithms run in time linear with respect to the input size. Nevertheless, the output size can be very large due to the approximation (namely exponential in the approximation degree). This often makes the output unreadable, as well as slowing down the inference itself. A high degree of redundancy has been detected in the set of motifs that satisfy traditional requirements, even for exact motifs. Moreover, it has been observed many times that only a subset of these motifs, namely the maximal motifs, could be enough to provide the information of all of them. In this paper, we aim at removing such redundancy. We extend some notions of maximality already defined for exact motifs to the case of approximate motifs with Hamming distance, and we give a characterization of maximal motifs on the suffix tree. Given that this data structure is used by a whole class of motif extraction tools, we show how these tools can be modified to include the maximality requirement without changing the asymptotical complexity.  相似文献   

9.
10.
The discovery of information encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually encoded in patterns frequently occurring in the sequences, also called motifs. In fact, motif discovery has received much attention in the literature, and several algorithms have already been proposed, which are specifically tailored to deal with motifs exhibiting some kinds of "regular structure". Motivated by biological observations, this paper focuses on the mining of loosely structured motifs, i.e., of more general kinds of motif where several "exceptions" may be tolerated in pattern repetitions. To this end, an algorithm exploiting data structures conceived to efficiently handle pattern variabilities is presented and analyzed. Furthermore, a randomized variant with linear time and space complexity is introduced, and a theoretical guarantee on its performances is proven. Both algorithms have been implemented and tested on real data sets. Despite the ability of mining very complex kinds of pattern, performance results evidence a genome-wide applicability of the proposed techniques.  相似文献   

11.
郭海涛  霍红卫  于强 《自动化学报》2016,42(11):1718-1731
顺式调控模块(Cis-regulatory module,CRM)在真核生物基因的转录调控中起着重要作用,识别顺式调控模块是当前计算生物学的一个重要课题.虽然当前有许多计算方法用于识别顺式调控模块,但识别准确率仍有待进一步提高.将顺式调控模块的多种特征信息结合在一起,有助于提高识别顺式调控模块的准确率.基于此,本文提出了一种识别顺式调控模块的算法SegHMC(Segmental HMM model for discovery of cis-regulatory module).该算法建立了一种关于顺式调控模块识别问题的Segmental HMM模型,进一步扩展了顺式调控模块调控结构(或调控语法)的表示,不仅将顺式调控模块表示为模体(Motif)的组合,还进一步将模体共同出现的频率、模体顺序偏好以及顺式调控模块中相邻模体间的距离分布等特征引入到顺式调控模块的调控语法中.在模拟数据集和真实生物数据集上的实验结果表明,本文方法识别顺式调控模块的准确率显著优于当前的主要方法.  相似文献   

12.
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power. Editor: Johannes Fürnkranz. An erratum to this article can be found at  相似文献   

13.
Time series motifs are approximately repeated subsequences found within a longer time series. They have been in the literature since 2002, but recently they have begun to receive significant attention in research and industrial communities. This is perhaps due to the growing realization that they implicitly offer solutions to a host of time series problems, including rule discovery, anomaly detection, density estimation, semantic segmentation, summarization, etc. Recent work has improved the scalability so exact motifs can be computed on datasets with up to a million data points in tenable time. However, in some domains, for example seismology or climatology, there is an immediate need to address even larger datasets. In this work, we demonstrate that a combination of a novel algorithm and a high-performance GPU allows us to significantly improve the scalability of motif discovery. We demonstrate the scalability of our ideas by finding the full set of exact motifs on a dataset with one hundred and forty-three million subsequences, which is by far the largest dataset ever mined for time series motifs/joins; it requires ten quadrillion pairwise comparisons. Furthermore, we demonstrate that our algorithm can produce actionable insights into seismology and ethology.  相似文献   

14.
Biochemical research often involves examining structural relationships in molecules since scientists strongly believe in the causal relationship between structure and function. Traditionally, researchers have identified these patterns, or motifs, manually using domain expertise. However, with the massive influx of new biochemical data and the ability to gather data for very large molecules, there is great need for techniques that automatically and efficiently identify commonly occurring structural patterns in molecules. Previous automated substructure discovery approaches have each introduced variations of similar underlying techniques and have embedded domain knowledge. While doing so improves performance for the particular domain, this complicates extensibility to other domains. Also, they do not address scalability or noise, which is critical for macromolecules such as proteins. In this paper, we present MotifMiner, a general framework for efficiently identifying common motifs in most scientific molecular datasets. The approach combines structure-based frequent-pattern discovery with search space reduction and coordinate noise handling. We describe both the framework and several algorithms as well as demonstrate the flexibility of our system by analyzing protein and drug biochemical datasets.  相似文献   

15.
李洪波  周莉 《计算机工程》2010,36(13):65-67
给出单模式、二模式和三模式3种序列模式发现的基本概念,给出二模式和三模式的表示方法。该表示方法不会产生实际不存在的候选序列,从而有效地缩小候选空间,提高序列模式的计算速度。结合Apriori方法,基于3种基本模式,应用无冗余的模式增长原则和三级动态优化方法,提出一种序列模式发现的结构化动态优化方法。  相似文献   

16.
Redundancy in logic I: CNF propositional formulae   总被引:1,自引:0,他引:1  
A knowledge base is redundant if it contains parts that can be inferred from the rest of it. We study some problems related to the redundancy of a CNF formula. In particular, any CNF formula can be made irredundant by deleting some of its clauses: what results is an irredundant equivalent subset. We study the complexity of problems related to irredundant equivalent subsets: verification, checking existence of an irredundant equivalent subset with a given size, checking necessary and possible presence of clauses in irredundant equivalent subsets, and uniqueness. We also consider the problem of redundancy with different definitions of equivalence.  相似文献   

17.
18.
The exploration of repeated patterns with different lengths, also called variable-length motifs, has received a great amount of attention in recent years. However, existing algorithms to detect variable-length motifs in large-scale time series are very time-consuming. In this paper, we introduce a time- and space-efficient approximate variable-length motif discovery algorithm, Distance-Propagation Sequitur (DP-Sequitur), for detecting variable-length motifs in large-scale time series data (e.g. over one hundred million in length). The discovered motifs can be ranked by different metrics such as frequency or similarity, and can benefit a wide variety of real-world applications. We demonstrate that our approach can discover motifs in time series with over one hundred million points in just minutes, which is significantly faster than the fastest existing algorithm to date. We demonstrate the superiority of our algorithm over the state-of-the-art using several real world time series datasets.  相似文献   

19.
在DNA序列中查找基序是生物信息学中一个重要的计算问题,人们针对这一计算问题提出了多种模型和算法.由于DNA序列数据的复杂性,在其中有许多是比强信号基序更难提取的弱信号基序.而目前植入(ι,d)基序问题(PMP)和扩展植入(ι,d)基序问题(EMP)是较适合模拟弱信号基序查找的问题模型.本文归纳分析了基序查找的基本方法、策略和基序模型,指出了各种策略和模型的优势与不足.在此基础上对现有的基于植入基序查找问题模型的主要弱信号基序查找算法进行了分析和实验评估,为选择计算方法查找弱基序信号提供了参考,并讨论了该方向上尚未解决的问题和发展趋势.  相似文献   

20.
In many real-world situations, the method for computing the desired output from a set of inputs is unknown. One strategy for solving these types of problems is to learn the input-output functionality from examples in a training set. However, in many situations it is difficult to know what information is relevant to the task at hand. Subsequently, researchers have investigated ways to deal with the so-called problem of consistency of attributes, i.e., attributes that can distinguish examples from different classes. In this paper, we first prove that the notion of relevance of attributes is directly related to the consistency of attributes, and show how relevant, irredundant attributes can be selected. We then compare different relevant attribute selection algorithms, and show the superiority of algorithms that select irredundant attributes over those that select relevant attributes. We also show that searching for an "optimal" subset of attributes, which is considered to be the main purpose of attribute selection, is not the best way to improve the accuracy of classifiers. Employing sets of relevant, irredundant attributes improves classification accuracy in many more cases. Finally, we propose a new method for selecting relevant examples, which is based on filtering the so-called pattern frequency domain. By identifying examples that are nontypical in the determination of relevant, irredundant attributes, irrelevant examples can be eliminated prior to the learning process. Empirical results using artificial and real databases show the effectiveness of the proposed method in selecting relevant examples leading to improved performance even on greatly reduced training sets.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号