Similar literature
Found 20 similar documents; search took 31 ms
1.
Sequence complexity for biological sequence analysis (total citations: 2; self-citations: 0; citations by others: 2)
A new statistical model for DNA considers a sequence to be a mixture of regions with little structure and regions that are approximate repeats of other subsequences, i.e. instances of repeats do not need to match each other exactly. Both forward- and reverse-complementary repeats are allowed. The model has a small number of parameters which are fitted to the data. In general there are many explanations for a given sequence, and it is shown how to compute the total probability of the data given the model. Computer algorithms are described for these tasks. The model can be used to compute the information content of a sequence, either in total or base by base. This amounts to looking at sequences from a data-compression point of view and it is argued that this is a good way to tackle intelligent sequence analysis in general.

2.
A subsequence is obtained from a string by deleting any number of characters; thus in contrast to a substring, a subsequence is not necessarily a contiguous part of the string. Counting subsequences under various constraints has become relevant to biological sequence analysis, to machine learning, to coding theory, to the analysis of categorical time series in the social sciences, and to the theory of word complexity. We present theorems that lead to efficient dynamic programming algorithms to count (1) distinct subsequences in a string, (2) distinct common subsequences of two strings, (3) matching joint embeddings in two strings, (4) distinct subsequences with a given minimum span, and (5) sequences generated by a string allowing characters to come in runs of a length that is bounded from above.
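As an illustration of item (1), the following is a minimal sketch of the textbook dynamic program for counting distinct subsequences of a string. It is not taken from the paper; the function name `count_distinct_subsequences` and the choice to count the empty subsequence are assumptions made for this example.

```python
def count_distinct_subsequences(s: str) -> int:
    """Count the distinct subsequences of s, including the empty one."""
    total = 1  # only the empty subsequence exists before any character is read
    # For each character, the value of `total` just before its last occurrence.
    total_before_last = {}
    for ch in s:
        previous = total
        # Appending ch to every known subsequence doubles the count, but the
        # ones already created at the previous occurrence of ch are duplicates.
        total = 2 * total - total_before_last.get(ch, 0)
        total_before_last[ch] = previous
    return total


if __name__ == "__main__":
    print(count_distinct_subsequences("aba"))
```

The loop runs once per character, so the count is obtained in linear time for a fixed alphabet, whereas enumerating subsequences explicitly would take exponential time.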

3.
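Mining anomalous data in multivariate time series based on a sliding window (total citations: 1; self-citations: 0; citations by others: 1)
翁小清, 沈钧毅 《计算机工程》2007,33(12):102-104
A subsequence that differs significantly from the other subsequences of a multivariate time series (MTS) is called an anomalous subsequence (one that contains anomalous data). This paper proposes a sliding-window-based algorithm for mining anomalous MTS subsequences. It uses the extended Frobenius norm to compute the similarity between two MTS subsequences, uses a two-phase sequential query for K-nearest-neighbor search, and prunes MTS subsequences that cannot become candidate anomalous subsequences. Anomalous subsequences (containing anomalous data) were mined from an MTS dataset of stock trading on the Shanghai Stock Exchange, and the results show that the algorithm is effective.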

4.
Logistics faces great challenges in vehicle scheduling, and intelligent technologies need to be developed to solve the transportation problem. This paper proposes an improved Quantum-Inspired Evolutionary Algorithm (IQEA), a hybrid of the Quantum-Inspired Evolutionary Algorithm (QEA) and greedy heuristics. It extends the standard QEA by combining its principles with heuristic methods. The proposed algorithm is applied to a problem that may arise in real life. The problem can be categorized as a vehicle routing problem with time windows (VRPTW): it shares many characteristics of VRPTW, but more constraints need to be considered. The basic idea of the proposed IQEA is to embed a greedy heuristic into the standard QEA for the optimal recombination of consignment subsequences. The consignment sequence is the order in which vehicles are arranged to transport the consignments. The consignment subsequences are generated by cutting the whole consignment sequence according to the values of quantum bits. The computational results on the simulated problem show that IQEA is feasible for reaching a relatively optimal solution. Implementing the optimized schedule saves considerable cost compared with the initial schedule. The approach offers a promising, innovative way to solve VRPTW and improves QEA for complex problems with many constraints.

5.
Almost all RNA molecules, and consequently almost all subsequences of a large RNA molecule, form secondary structures. The presence of secondary structure in itself therefore does not indicate any functional significance. In fact, we cannot expect a conserved secondary structure for all parts of a viral genome or an mRNA, even if there is a significant level of sequence conservation. We present a novel method for detecting conserved RNA secondary structures in a family of related RNA sequences. The method is based on combining the prediction of base pair probability matrices and comparative sequence analysis. It can be applied to small sets of long sequences and does not require a priori knowledge of conserved sequence or structure motifs. As such it can be used to scan large amounts of sequence data for regions that warrant further experimental investigation. Applications to complete genomic RNAs of some viruses show that in all cases the known secondary structure features are identified. In addition, we predict a substantial number of conserved structural elements which have not been described so far.

6.
7.
We present an algorithm for combining the elements of subsequences of a sequence with an associative operator. The subsequences are given by a sliding window of varying size. Our algorithm is greedy and computes the result with the minimal number of operator applications.
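The abstract gives no details of its greedy algorithm, so the sketch below shows a different, well-known technique for the same task: a two-stack queue that keeps running aggregates and needs an amortized constant number of operator applications per window update. It does not claim the minimality property stated above. The class name `SlidingAggregator` and its methods are assumptions made for this example.

```python
class SlidingAggregator:
    """Aggregate a variable-size sliding window under an associative operator."""

    def __init__(self, op):
        self.op = op
        # Each entry is (value, aggregate). In `front` the aggregate covers the
        # entry and every newer element in `front`; in `back` it covers every
        # older element in `back` and the entry itself.
        self.front = []
        self.back = []

    def push(self, value):
        """Extend the window on the right with a new element."""
        agg = value if not self.back else self.op(self.back[-1][1], value)
        self.back.append((value, agg))

    def pop(self):
        """Shrink the window on the left and return the removed element."""
        if not self.front:
            # Move everything over, newest first, rebuilding suffix aggregates.
            while self.back:
                value, _ = self.back.pop()
                agg = value if not self.front else self.op(value, self.front[-1][1])
                self.front.append((value, agg))
        if not self.front:
            raise IndexError("pop from an empty window")
        return self.front.pop()[0]

    def query(self):
        """Return the aggregate of the current window, oldest element first."""
        if self.front and self.back:
            return self.op(self.front[-1][1], self.back[-1][1])
        if self.front:
            return self.front[-1][1]
        if self.back:
            return self.back[-1][1]
        raise ValueError("query on an empty window")


if __name__ == "__main__":
    window = SlidingAggregator(lambda x, y: x + y)
    for item in ["a", "b", "c"]:
        window.push(item)
    window.pop()
    print(window.query())
```

Because the operator is only assumed to be associative, the sketch preserves element order, so it also works for non-commutative operators such as string concatenation or matrix multiplication.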

8.
An active research topic in data mining is the discovery of sequential patterns, which finds all frequent subsequences in a sequence database. The generalized sequential pattern (GSP) algorithm was proposed to solve the mining of sequential patterns with time constraints, such as time gaps and sliding time windows. Recent studies indicate that the pattern-growth methodology could speed up sequence mining. However, the capabilities to mine sequential patterns with time constraints were previously available only within the Apriori framework. Therefore, we propose the DELISP (delimited sequential pattern) approach to provide the capabilities within the pattern-growth methodology. DELISP reduces the size of projected databases through bounded and windowed projection techniques. Bounded projection keeps only time-gap valid subsequences and windowed projection saves nonredundant subsequences satisfying the sliding time-window constraint. Furthermore, the delimited growth technique directly generates constraint-satisfying patterns and speeds up the pattern growing process. Comprehensive experiments show that DELISP has good scalability and outperforms the well-known GSP algorithm in the discovery of sequential patterns with time constraints.

9.
Wang Yuehua, Wu Youxi, Li Yan, Yao Fang, Fournier-Viger Philippe, Wu Xindong 《Applied Intelligence》2022,52(6):6646-6661
Applied Intelligence - Repetitive sequential pattern mining (SPM) with gap constraints is a data analysis task that consists of identifying patterns (subsequences) appearing many times in a...

10.
Several combinatorial problems, such as car sequencing and rostering, feature sequence constraints, restricting the number of occurrences of certain values in every subsequence of a given length. We present three new filtering algorithms for the sequence constraint, including the first that establishes domain consistency in polynomial time. The filtering algorithms have complementary strengths: One borrows ideas from dynamic programming; another reformulates it as a regular constraint; the last is customized. The last two algorithms establish domain consistency, and the customized one does so in polynomial time. We provide experimental results that demonstrate the practical usefulness of each. We also show that the customized algorithm applies naturally to a generalized version of the sequence constraint that allows subsequences of varied lengths. The significant computational advantage of using a single generalized sequence constraint over a semantically equivalent collection of among or sequence constraints is demonstrated empirically.
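To make the constraint itself concrete, here is a minimal sketch of a checker that tests whether a fully assigned sequence satisfies a sequence constraint. It only verifies a complete assignment and is not one of the filtering algorithms described above. The function name `satisfies_sequence` and its parameters (`targets`, `window`, `low`, `high`) are assumptions made for this example.

```python
def satisfies_sequence(values, targets, window, low, high):
    """Check that every run of `window` consecutive values contains between
    `low` and `high` (inclusive) values that belong to `targets`."""
    if window <= 0:
        raise ValueError("window must be positive")
    hits = [1 if v in targets else 0 for v in values]
    if len(hits) < window:
        return True  # no complete window exists, so nothing can be violated
    count = sum(hits[:window])
    if not low <= count <= high:
        return False
    # Slide the window one position at a time, updating the count in O(1).
    for i in range(window, len(hits)):
        count += hits[i] - hits[i - window]
        if not low <= count <= high:
            return False
    return True


if __name__ == "__main__":
    # Car-sequencing flavour: at least 1 and at most 2 cars with option "A"
    # in every 3 consecutive cars.
    print(satisfies_sequence("ABABBA", {"A"}, window=3, low=1, high=2))
```

A filtering algorithm goes further than this check: it removes values from variable domains that cannot appear in any satisfying assignment.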

11.
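韩敏, 姜涛, 冯守渤 《控制与决策》2020,35(9):2175-2181
Because the evolution of chaotic systems is complex, direct long-term prediction of chaotic time series usually fails to achieve good results. To address this problem, variational mode decomposition is used to transform a chaotic time series into a series of feature subsequences, and permutation entropy is used to assess whether the chosen number of subsequences is reasonable, ensuring that the feature subsequences contain the long-term evolution trend of the original series. In addition, an improved deterministic cycle-with-jumps state network is proposed as the prediction model for the subsequences. The reservoir of this network uses a topology with a unidirectional ring connection and bidirectional random jumps, which avoids both the low prediction accuracy caused by a deterministic reservoir connection structure and the network instability caused by random connections. Long-term prediction of time series with the proposed model, analyzed with several evaluation measures, shows that the model has a considerable advantage for long-term prediction.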
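Permutation entropy, which the abstract uses to judge the number of subsequences, can be sketched as follows. This is the standard ordinal-pattern definition and not the authors' implementation; the function name `permutation_entropy` and the default parameters `order=3` and `delay=1` are assumptions made for this example.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Shannon entropy (in bits) of the ordinal patterns found in `series`.

    `order` is the pattern length (at least 2) and `delay` the spacing between
    the samples that make up one pattern.
    """
    if order < 2:
        raise ValueError("order must be at least 2")
    n_patterns = len(series) - (order - 1) * delay
    if n_patterns <= 0:
        raise ValueError("series is too short for this order and delay")
    counts = Counter()
    for i in range(n_patterns):
        window = [series[i + k * delay] for k in range(order)]
        # The ordinal pattern is the index permutation that sorts the window.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        counts[pattern] += 1
    entropy = 0.0
    for c in counts.values():
        p = c / n_patterns
        entropy -= p * math.log2(p)
    if normalize:
        # Divide by the maximum possible entropy, log2(order!).
        entropy /= math.log2(math.factorial(order))
    return entropy


if __name__ == "__main__":
    print(permutation_entropy([4, 7, 9, 10, 6, 11, 3], order=3))
```

A strictly monotone series has a single ordinal pattern and therefore entropy 0, while a series whose patterns are all equally likely reaches the normalized maximum of 1.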

12.
We formally introduce a new data structure, called MiGaL for ‘Multiple Graph Layer’, composed of various graphs linked together by relations of abstraction/refinement. The new structure is useful for representing information that can be described at different levels of abstraction, each level corresponding to a graph. We then propose an algorithm for comparing two MiGaLs. The algorithm performs a step‐by‐step comparison starting with the most ‘abstract’ level. The result of the comparison at a given step is communicated to the next step using a special colouring scheme. MiGaLs represent a very natural model for comparing RNA secondary structures that may be seen at different levels of detail, going from the sequence of nucleotides, single or paired with another to participate in a helix, to the network of multiple loops that is believed to represent the most conserved part of RNAs having similar function. We therefore show how one can use MiGaLs to very efficiently compare two RNAs of any size at different levels of detail. Copyright © 2007 John Wiley & Sons, Ltd.

13.
Sequential pattern mining (SPM) is an important data mining problem with broad applications. SPM is a hard problem due to the huge number of intermediate subsequences to be considered. State of the art approaches for SPM (e.g., PrefixSpan; Pei et al. 2001) are largely based on the pattern-growth approach, where for each frequent prefix subsequence, only its related suffix subsequences need to be considered, and the database is recursively projected into smaller ones. Many authors have promoted the use of constraints to focus on the most promising patterns according to the interests of the end user. The top-k SPM problem is also used to cope with the difficulty of thresholding and to control the number of solutions. State of the art methods developed for SPM and top-k SPM, though efficient, are locked into a rather rigid search strategy, and suffer from a lack of declarativity and flexibility. Indeed, adding new constraints usually amounts to changing the data structures used in the core of the algorithm, and combining these new constraints often requires new developments. Recent works (e.g. Kemmar et al. 2014; Négrevergne and Guns 2015) have investigated the use of Constraint Programming (CP) for SPM. However, despite their nice declarative aspects, all these models have scaling problems, due to the huge size of their constraint networks. To address this issue, we propose the Prefix-Projection global constraint, which encapsulates both the subsequence relation and the frequency constraint. Its filtering algorithm relies on the principle of projected databases, which makes it possible to keep in each variable's domain only the values leading to a frequent pattern in the database. The Prefix-Projection filtering algorithm enforces domain consistency on the variable succeeding the current frequent prefix in polynomial time. This global constraint also allows for a straightforward implementation of additional constraints such as size, item membership, regular expressions and any combination of them. Experimental results show that our approach clearly outperforms existing CP approaches and competes well with the state-of-the-art methods on large datasets for mining frequent sequential patterns, sequential patterns under various constraints, and top-k sequential patterns. Unlike existing CP methods, our approach achieves better scalability.

14.
Although short interfering RNA (siRNA) has been widely used for studying gene functions in mammalian cells, its gene silencing efficacy varies markedly and there are only a few consistencies among the recently reported design rules/guidelines for selecting siRNA sequences effective for mammalian genes. We propose a method for selecting effective siRNA target sequences by using a radial basis function (RBF) network and statistical significance analysis for a large number of known effective and ineffective siRNAs. The siRNA classification is first carried out by using the RBF network and then the preferred and unpreferred nucleotides for effective siRNAs at individual positions are chosen by significance testing. The gene degradation measure is defined as a score based on the preferred and unpreferred nucleotides. The effectiveness of the proposed method was confirmed by evaluating effective and ineffective siRNAs for the recently reported genes (15 genes, 196 sequences) and comparing the scores thus obtained with those obtained using other scoring methods. Since the score is closely correlated with the degree of gene degradation, it can easily be used for selecting high-potential siRNA candidates. The evaluation results indicate that the proposed method may be applicable for many other genes. It will therefore be useful for selecting siRNA sequences in mammalian genes.

16.
An algorithmic method for statistically assessing the efficient market hypothesis (EMH) is developed based on two data mining tools: perceptually important points (PIPs), used to dynamically segment price series into subsequences, and dynamic time warping (DTW), used to find similar historical subsequences. Predictions are then made from the mappings of the most similar subsequences, and the prediction error statistic is used for the EMH assessment. The predictions are assessed on simulated price paths composed of stochastic trend and chaotic deterministic time series, and on real financial data from 18 world equity markets and the GBP/USD exchange rate. The main results establish that the proposed algorithm can capture the deterministic structure in simulated series, confirm the validity of the EMH on the examined equity indices, and indicate that predicting the exchange rate using PIPs and DTW can in some cases beat predicting with the last available price.
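For readers unfamiliar with DTW, the following is a minimal sketch of the classic dynamic program for two numeric sequences. The absolute difference as local cost and the absence of a warping-window constraint are assumptions made for this example, since the abstract does not state which variant is used. The function name `dtw_distance` is likewise hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences,
    using the absolute difference as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats the match with b[j-1]
                cost[i][j - 1],      # b[j-1] repeats the match with a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because one point may be matched to several points of the other sequence, subsequences of different lengths that trace the same shape receive a small distance, which is what makes DTW suitable for retrieving similar historical price segments.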

17.
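The m-subsequence proposed in this paper is a new sequence formed from the state-transition characteristics of an m-sequence by cross-changing the order of state transitions. Tests with the NIST randomness test suite verify that m-subsequences have randomness similar to that of m-sequences. The Berlekamp-Massey (BM) algorithm shows that this pseudorandom sequence has very high linear complexity; its complementary sequence is also verified to have very high linear complexity, and the m-subsequence is shown to have a good linear complexity profile and strong resistance to linear attacks. The number of m-subsequences is huge: for an m-sequence of period [...], changing the feedback function generates at least [...] m-subsequences. The feedback functions that generate m-subsequences are proven to have good algebraic immunity and fairly strong resistance to algebraic attacks. M-subsequences have good cryptographic properties and good application prospects.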
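As background, the sketch below generates an ordinary m-sequence with a linear feedback shift register and measures its period. It does not construct the m-subsequences described in the abstract, whose modified feedback functions are not given here. The function names `lfsr_sequence` and `smallest_period`, and the example recurrence s[n+4] = s[n] XOR s[n+1] (from the primitive polynomial x^4 + x + 1), are assumptions made for this example.

```python
def lfsr_sequence(taps, state, length):
    """Generate `length` bits from the linear recurrence
    s[n + L] = XOR of s[n + k] for k in taps, where L = len(state)."""
    bits = list(state)
    register_length = len(state)
    while len(bits) < length:
        n = len(bits) - register_length
        new_bit = 0
        for k in taps:
            new_bit ^= bits[n + k]
        bits.append(new_bit)
    return bits[:length]


def smallest_period(bits):
    """Smallest shift p for which the observed bits repeat, or None."""
    for p in range(1, len(bits)):
        if all(bits[i] == bits[i + p] for i in range(len(bits) - p)):
            return p
    return None


if __name__ == "__main__":
    # Two full periods of the 4-stage register, starting from state 1000.
    seq = lfsr_sequence(taps=(0, 1), state=(1, 0, 0, 0), length=30)
    print(seq[:15], smallest_period(seq))
```

With a primitive feedback polynomial of degree 4 the register visits all 15 nonzero states, so the output has period 15 and contains eight ones per period.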

18.
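Time series clustering based on partitioning with different time windows (total citations: 3; self-citations: 1; citations by others: 2)
To address the shortcomings of partitioning a time series into subsequences using identical time windows, a subsequence partitioning method based on different time windows is proposed. Because the resulting subsequences differ in length, and measuring subsequence similarity with the dynamic time warping algorithm is slow, a distance measure for irregular time series is presented. Experiments on the partitioning method with different time windows and on the irregular time series distance measure demonstrate the superiority of both.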

19.
Discovering approximately recurrent motifs (ARMs) in time series is an active area of research in data mining. Exact motif discovery is defined as the problem of efficiently finding the most similar pairs of time series subsequences and can be used as a basis for discovering ARMs. The most efficient algorithm for solving this problem is the MK algorithm, which was designed to find a single pair of time series subsequences with maximum similarity at a known length. This paper provides three extensions of the MK algorithm that allow it to find the top K similar subsequences at multiple lengths using both the Euclidean distance metric and a scale-invariant normalized version of it. The proposed algorithms are then applied to both synthetic data and real-world data with a focus on discovery of ARMs in human motion trajectories.

20.
Testing Web applications by modeling with FSMs (total citations: 6; self-citations: 0; citations by others: 6)
Researchers and practitioners are still trying to find effective ways to model and test Web applications. This paper proposes a system-level testing technique that combines test generation based on finite state machines with constraints. We use a hierarchical approach to model potentially large Web applications. The approach builds hierarchies of Finite State Machines (FSMs) that model subsystems of the Web applications, and then generates test requirements as subsequences of states in the FSMs. These subsequences are then combined and refined to form complete executable tests. The constraints are used to select a reduced set of inputs with the goal of reducing the state space explosion otherwise inherent in using FSMs. The paper illustrates the technique with a running example of a Web-based course student information system and introduces a prototype implementation to support the technique.
