首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
The exploration of repeated patterns with different lengths, also called variable-length motifs, has received a great amount of attention in recent years. However, existing algorithms to detect variable-length motifs in large-scale time series are very time-consuming. In this paper, we introduce a time- and space-efficient approximate variable-length motif discovery algorithm, Distance-Propagation Sequitur (DP-Sequitur), for detecting variable-length motifs in large-scale time series data (e.g. over one hundred million in length). The discovered motifs can be ranked by different metrics such as frequency or similarity, and can benefit a wide variety of real-world applications. We demonstrate that our approach can discover motifs in time series with over one hundred million points in just minutes, which is significantly faster than the fastest existing algorithm to date. We demonstrate the superiority of our algorithm over the state-of-the-art using several real world time series datasets.  相似文献   

2.
3.
生物序列motif识别问题是当今生物信息学面临的一个复杂问题,要设计一个能识别所有motif的方法几乎是不可能的。针对该问题,在免疫遗传算法中引入了统计估计,提高了motif识别的精度,根据个体的浓度和适应值概率。设计了免疫替换算子,有效地解决了种群的多样性问题,利用Gibbs Sampler算法生成种子,提高了免疫遗传算法的搜索速度,最后得到了一个基于免疫GA与Gibbs Sampler的生物序列motif识别算法,该算法充分发挥了免疫遗传算法和Gibbs Sampler算法的优越性,较好地解决了计算速度和计算精度之间的矛盾。实验表明,该算法是有效的。  相似文献   

4.
已有的变长模体发现算法存在速度慢、可扩展性较差,且结果中包含过短、过长和平凡匹配等无意义模体的问题。本文提出一种基于Matrix Profile的时间序列变长模体挖掘算法。该算法使用STOMP算法作为子程序,使用结合了增量计算的下界距离来加速候选模体提取过程;采用长度相似性条件和模体分组等价类方法踢除过短、过长和平凡匹配等无意义的模体。在数据集UCR上的实验表明,提出的算法在发现变长模体时,能够有效地过滤无意义模体,且具有较高的效率和准确率。  相似文献   

5.
Time series motifs are approximately repeated subsequences found within a longer time series. They have been in the literature since 2002, but recently they have begun to receive significant attention in research and industrial communities. This is perhaps due to the growing realization that they implicitly offer solutions to a host of time series problems, including rule discovery, anomaly detection, density estimation, semantic segmentation, summarization, etc. Recent work has improved the scalability so exact motifs can be computed on datasets with up to a million data points in tenable time. However, in some domains, for example seismology or climatology, there is an immediate need to address even larger datasets. In this work, we demonstrate that a combination of a novel algorithm and a high-performance GPU allows us to significantly improve the scalability of motif discovery. We demonstrate the scalability of our ideas by finding the full set of exact motifs on a dataset with one hundred and forty-three million subsequences, which is by far the largest dataset ever mined for time series motifs/joins; it requires ten quadrillion pairwise comparisons. Furthermore, we demonstrate that our algorithm can produce actionable insights into seismology and ethology.  相似文献   

6.
针对时间序列模体发现算法计算复杂,并且无法发现多实例模体的问题,提出基于子序列全连接和最大团的时间序列模体发现(TSSJMC)算法。首先,使用快速时间序列子序列全连接算法求得所有子序列之间的距离,生成距离矩阵;然后,设置相似性阈值,将距离矩阵转化为邻接矩阵,构造子序列相似图;最后采用最大团搜索算法从相似图中搜索最大团,最大团的顶点对应的时间序列为包含最多实例的模体。在公开的时间序列数据集上进行实验,选用已有的能够发现多实例模体的Brute Force和Random Projection算法作为对比对象,分别从准确性、效率、可扩展性和鲁棒性对TSSJMC算法进行分析并获得了客观的评判结果。实验结果表明,与Random Projection算法相比,TSSJMC算法在效率、可扩展性和鲁棒性法方面均有明显优势;与Brute Force算法相比,TSSJMC算法发现的模体实例数量虽略低,但其效率和可扩展性都优于Brute Force算法。因此,TSSJMC是质量和效率相平衡的算法。  相似文献   

7.
随着生物信息学的发展,模体识别已经成为一种能够从生物序列中提取有用生物信息的方法。文中介绍了有关模体的一些概念,讨论了模体识别算法(MEME)的基础,即EM(expectation maximization)算法,由于MEME算法是建立在EM算法的基础上的,所以又由此引出了MEME算法,并对MEME算法的一些基本问题比如时间复杂度、算法性能等进行了详细讨论,对算法的局限性和有待改进的地方作了说明。实践证明,MEME是一个较好的模体识别算法,它能够识别出蛋白质或者DNA序列中单个或多个模体,具有很大的灵活性。  相似文献   

8.
A streaming time series is a continuous and unbounded group of chronological observations that are found in many scientific and business applications. Motifs that are frequent subsequences are highly representative for the time series and play an important role in time series mining. Discovering motifs in time series has received much attention during recent years, and several algorithms have been proposed to solve this problem. However, these algorithms can only find motifs with a predefined length, which greatly affects their performance and practicality. Recent algorithms can discover motifs with different lengths, but require multiple scanning of the time series and are thus not applicable to streaming time series. In addition, it is difficult to determine the optimal length of interesting motifs; a suboptimal choice results in missing the key motifs or having too many redundant motifs. To overcome this challenge, we introduce the notion of a \(closed\) motif; a motif is \(closed\) if there is no motif with a longer length having the same number of occurrences. We propose a novel algorithm \(closedMotif\) to discover closed motifs in a single scan for streaming time series. We also use the nearest neighbor classifier with the most distinctive closed motifs to validate their potential in time series classification. Extensive experiments show that our approach can efficiently discover motifs with different lengths. In addition, our closed-motif-based classifier is shown to be more accurate than \(Logical\text{- }Shapelet\) , a state-of-the-art time series classifier. Finally, we demonstrate the scalability of \(closedMotif\) on several large datasets in diverse domains like video surveillance, sensor networks, and biometrics.  相似文献   

9.
The goal of motif discovery algorithms is to efficiently find unknown recurring patterns. In this paper, we focus on motif discovery in time series. Most available algorithms cannot utilize domain knowledge in any way which results in quadratic or at least super-linear time and space complexity. In this paper we define the Constrained Motif Discovery problem which enables utilization of domain knowledge into the motif discovery process. The paper then provides two algorithms called MCFull and MCInc for efficiently solving the constrained motif discovery problem. We also show that most unconstrained motif discovery problems be converted into constrained ones using a change-point detection algorithm. A novel change-point detection algorithm called the Robust Singular Spectrum Transform (RSST) is then introduced and compared to traditional Singular Spectrum Transform using synthetic and real-world data sets. The results show that RSST achieves higher specificity and is more adequate for finding constraints to convert unconstrained motif discovery problems to constrained ones that can be solved using MCFull and MCInc. We then compare the combination of RSST and MCFull or MCInc with two state-of-the-art motif discovery algorithms on a large set of synthetic time series. The results show that the proposed algorithms provided four to ten folds increase in speed compared the unconstrained motif discovery algorithms studied without any loss of accuracy. RSST+MCFull is then used in a real world human-robot interaction experiment to enable the robot to learn free hand gestures, actions, and their associations by watching humans and other robots interacting.  相似文献   

10.
In the paper, a novel two‐level algorithm of time‐series change detection is presented. In the first level, to identify non‐stationary sequences in a processed signal, preliminary detection of events is performed with a short‐term prediction comparison. In the second stage, to confirm the changes detected in the first level, a similarity method aimed at identification of unique changes is employed. The detection of changes in a non‐stationary time series is discussed, implemented algorithms are described and the results produced on a sample four financial time series are shown. General conditions for implementing the proposed algorithm as an immune‐like event detector are discussed.  相似文献   

11.
This article presents SwiftMotif, a novel technique for on-line motif detection in time series. With this technique, frequently occurring temporal patterns or anomalies can be discovered, for instance. The motif detection is based on a fusion of methods from two worlds: probabilistic modeling and similarity measurement techniques are combined with extremely fast polynomial least-squares approximation techniques. A time series is segmented with a data stream segmentation method, the segments are modeled by means of normal distributions with time-dependent means and constant variances, and these models are compared using a divergence measure for probability densities. Then, using suitable clustering algorithms based on these similarity measures, motifs may be defined. The fast time series segmentation and modeling techniques then allow for an on-line detection of previously defined motifs in new time series with very low run-times. SwiftMotif is suitable for real-time applications, accounts for the uncertainty associated with the occurrence of certain motifs, e.g., due to noise, and considers local variability (i.e., uniform scaling) in the time domain. This article focuses on the mathematical foundations and the demonstration of properties of SwiftMotif—in particular accuracy and run-time—using some artificial and real benchmark time series.  相似文献   

12.
Time series motifs are sets of very similar subsequences of a long time series. They are of interest in their own right, and are also used as inputs in several higher-level data mining algorithms including classification, clustering, rule-discovery and summarization. In spite of extensive research in recent years, finding time series motifs exactly in massive databases is an open problem. Previous efforts either found approximate motifs or considered relatively small datasets residing in main memory. In this work, we leverage off previous work on pivot-based indexing to introduce a disk-aware algorithm to find time series motifs exactly in multi-gigabyte databases which contain on the order of tens of millions of time series. We have evaluated our algorithm on datasets from diverse areas including medicine, anthropology, computer networking and image processing and show that we can find interesting and meaningful motifs in datasets that are many orders of magnitude larger than anything considered before.  相似文献   

13.
挖掘时间序列motif间潜在的关联规则可以在预测未来趋势方面发挥重要作用,时间序列motif即时间序列中先前未知的重复出现的模式。针对符号化时间序列提取motif导致信息丢失的问题,提出基于剪枝技术的motif提取算法PM_Motif,实现了保留原始信息的motif的精准快速提取;针对分割motif来发现其内部关联规则导致的规则不一致的问题,从motif间的关联规则入手,给出了基于AR_TSM方法的时间序列motif关联规则挖掘算法,从根本上避免了因motif分割引起的不确定性,保证了规则的一致性;最后,引入了关联规则评价参数RM,在多数据集上证明了关联规则的预测性能。  相似文献   

14.
In this paper, we study the problem of motif discoveries in unaligned DNA and protein sequences. The problem of motif identification in DNA and protein sequences has been studied for many years in the literature. Major hurdles at this point include computational complexity and reliability of the search algorithms. We propose a self-organizing neural network structure for solving the problem of motif identification in DNA and protein sequences. Our network contains several layers, with each layer performing classifications at different levels. The top layer divides the input space into a small number of regions and the bottom layer classifies all input patterns into motifs and nonmotif patterns. Depending on the number of input patterns to be classified, several layers between the top layer and the bottom layer are needed to perform intermediate classifications. We maintain a low computational complexity through the use of the layered structure so that each pattern's classification is performed with respect to a small subspace of the whole input space. Our self-organizing neural network will grow as needed (e.g., when more motif patterns are classified). It will give the same amount of attention to each input pattern and will not omit any potential motif patterns. Finally, simulation results show that our algorithm outperforms existing algorithms in certain aspects. In particular, simulation results show that our algorithm can identify motifs with more mutations than existing algorithms. Our algorithm works well for long DNA sequences as well.  相似文献   

15.
ObjectiveThis paper presents an algorithm for the solution of the motif discovery problem (MDP).Methods and materialsMotif discovery problem can be considered in two cases: motifs with insertions/deletions, and motifs without insertions/deletions. The first group motifs can be found by stochastic and approximated methods. The second group can be found by using stochastic and approximated methods, but also deterministic method. We proved that the second group motifs can be found with a deterministic algorithm, and so, it can be said that the second motifs finding is a P-type problem as proved in this paper.Results and conclusionsAn algorithm was proposed in this paper for motif discovery problem. The proposed algorithm finds all motifs which are occurred in the sequence at least two times, and it also finds motifs of various sizes. Due to this case, this algorithm is regarded as Automatic Exact Motif Discovery Algorithm. All motifs of different sizes can be found with this algorithm, and this case was proven in this paper. It shown that automatic exact motif discovery is a P-type problem in this paper. The application of the proposed algorithm has been shown that this algorithm is superior to MEME, MEME3, Motif Sampler, WEEDER, CONSENSUS, AlignACE.  相似文献   

16.
针对动态有向网络中的时序链路预测问题,充分分析动态有向网络中微观结构三元组模体的演化规律,使用指数平滑法季节加法(Holter-Winter-Additive)时序分析方法预测三元组模体的转换概率,引入牛顿法寻求时序分析方法中的最优参数;同时考虑到节点的社区属性对链路预测产生的影响,定义模体内节点的社区结构一致性重要指标,对三元组模体的影响力进行评估。基于此,首先使用时间序列分析方法对模体的转换概率进行预测,进而结合模体社区结构一致性的指标提出一种新的链路预测方法。使用不同的方法在三个真实的有向网络中进行验证,实验结果显示该方法能够达到更好的链路预测效果。  相似文献   

17.
Clustering heteroskedastic time series by model-based procedures   总被引:1,自引:0,他引:1  
Financial time series are often characterized by similar volatility structures. The detection of clusters of series displaying similar behavior could be important in understanding the differences in the estimated processes, without having to study and compare the estimated parameters across all the series. This is particularly relevant when dealing with many series, as in financial applications. The volatility of a time series can be characterized in terms of the underlying GARCH process. Using Wald tests and the Autoregressive metrics to measure the distance between GARCH processes, it is shown that it is possible to develop a clustering algorithm, which can provide three classifications (with increasing degree of deepness) based on the heteroskedastic patterns of the time series. The number of clusters is detected automatically and it is not fixed a priori or a posteriori. The procedure is evaluated by simulations and applied to the sector indices of the Italian market.  相似文献   

18.
The detection of very similar patterns in a time series, commonly called motifs, has received continuous and increasing attention from diverse scientific communities. In particular, recent approaches for discovering similar motifs of different lengths have been proposed. In this work, we show that such variable-length similarity-based motifs cannot be directly compared, and hence ranked, by their normalized dissimilarities. Specifically, we find that length-normalized motif dissimilarities still have intrinsic dependencies on the motif length, and that lowest dissimilarities are particularly affected by this dependency. Moreover, we find that such dependencies are generally non-linear and change with the considered data set and dissimilarity measure. Based on these findings, we propose a solution to rank (previously obtained) motifs of different lengths and measure their significance. This solution relies on a compact but accurate model of the dissimilarity space, using a beta distribution with three parameters that depend on the motif length in a non-linear way. We believe the incomparability of variable-length dissimilarities could have an impact beyond the field of time series, and that similar modeling strategies as the one used here could be of help in a more broad context and in diverse application scenarios.  相似文献   

19.
借鉴Gibbs采样思想,将序列峰值所对应的候选模体作为遗传算法的初始种群,提出一种改进的模体识别算法。将模体在序列中的出现次数作为变量加入到适应度函数中,使其更符合生物数据的特性。在算法变异操作中加入IUPAC简并码保持种群的多样性。对DBTSS数据库中的真实数据进行测试,结果表明该算法具有较高的识别精度和较快的搜索速度。  相似文献   

20.
The discovery of useful patterns embodied in a time series is of fundamental relevance in many real applications. Repetitive structures and common type of segments can also provide very useful information of patterns in financial time series. In this paper, we introduce a time series segmentation and characterization methodology combining a hybrid genetic algorithm and a clustering technique to automatically group common patterns from this kind of financial time series and address the problem of identifying stock market prices trends. This hybrid genetic algorithm includes a local search method aimed to improve the quality of the final solution. The local search algorithm is based on maximizing a likelihood ratio, assuming normality for the series and the subseries in which the original one is segmented. To do so, we select two stock market index time series: IBEX35 Spanish index (closing prices) and a weighted average time series of the IBEX35 (Spanish), BEL20 (Belgian), CAC40 (French) and DAX (German) indexes. These are processed to obtain segments that are mapped into a five dimensional space composed of five statistical measures, with the purpose of grouping them according to their statistical properties. Experimental results show that it is possible to discover homogeneous patterns in both time series.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号