首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
针对时间序列模体发现算法计算复杂,并且无法发现多实例模体的问题,提出基于子序列全连接和最大团的时间序列模体发现(TSSJMC)算法。首先,使用快速时间序列子序列全连接算法求得所有子序列之间的距离,生成距离矩阵;然后,设置相似性阈值,将距离矩阵转化为邻接矩阵,构造子序列相似图;最后采用最大团搜索算法从相似图中搜索最大团,最大团的顶点对应的时间序列为包含最多实例的模体。在公开的时间序列数据集上进行实验,选用已有的能够发现多实例模体的Brute Force和Random Projection算法作为对比对象,分别从准确性、效率、可扩展性和鲁棒性对TSSJMC算法进行分析并获得了客观的评判结果。实验结果表明,与Random Projection算法相比,TSSJMC算法在效率、可扩展性和鲁棒性法方面均有明显优势;与Brute Force算法相比,TSSJMC算法发现的模体实例数量虽略低,但其效率和可扩展性都优于Brute Force算法。因此,TSSJMC是质量和效率相平衡的算法。  相似文献   

2.
The exploration of repeated patterns with different lengths, also called variable-length motifs, has received a great amount of attention in recent years. However, existing algorithms to detect variable-length motifs in large-scale time series are very time-consuming. In this paper, we introduce a time- and space-efficient approximate variable-length motif discovery algorithm, Distance-Propagation Sequitur (DP-Sequitur), for detecting variable-length motifs in large-scale time series data (e.g. over one hundred million in length). The discovered motifs can be ranked by different metrics such as frequency or similarity, and can benefit a wide variety of real-world applications. We demonstrate that our approach can discover motifs in time series with over one hundred million points in just minutes, which is significantly faster than the fastest existing algorithm to date. We demonstrate the superiority of our algorithm over the state-of-the-art using several real world time series datasets.  相似文献   

3.
Time series motifs are sets of very similar subsequences of a long time series. They are of interest in their own right, and are also used as inputs in several higher-level data mining algorithms including classification, clustering, rule-discovery and summarization. In spite of extensive research in recent years, finding time series motifs exactly in massive databases is an open problem. Previous efforts either found approximate motifs or considered relatively small datasets residing in main memory. In this work, we leverage off previous work on pivot-based indexing to introduce a disk-aware algorithm to find time series motifs exactly in multi-gigabyte databases which contain on the order of tens of millions of time series. We have evaluated our algorithm on datasets from diverse areas including medicine, anthropology, computer networking and image processing and show that we can find interesting and meaningful motifs in datasets that are many orders of magnitude larger than anything considered before.  相似文献   

4.
The last decade has seen a flurry of research on all-pairs-similarity-search (or similarity joins) for text, DNA and a handful of other datatypes, and these systems have been applied to many diverse data mining problems. However, there has been surprisingly little progress made on similarity joins for time series subsequences. The lack of progress probably stems from the daunting nature of the problem. For even modest sized datasets the obvious nested-loop algorithm can take months, and the typical speed-up techniques in this domain (i.e., indexing, lower-bounding, triangular-inequality pruning and early abandoning) at best produce only one or two orders of magnitude speedup. In this work we introduce a novel scalable algorithm for time series subsequence all-pairs-similarity-search. For exceptionally large datasets, the algorithm can be trivially cast as an anytime algorithm and produce high-quality approximate solutions in reasonable time and/or be accelerated by a trivial porting to a GPU framework. The exact similarity join algorithm computes the answer to the time series motif and time series discord problem as a side-effect, and our algorithm incidentally provides the fastest known algorithm for both these extensively-studied problems. We demonstrate the utility of our ideas for many time series data mining problems, including motif discovery, novelty discovery, shapelet discovery, semantic segmentation, density estimation, and contrast set mining. Moreover, we demonstrate the utility of our ideas on domains as diverse as seismology, music processing, bioinformatics, human activity monitoring, electrical power-demand monitoring and medicine.  相似文献   

5.
Given the recent explosion of interest in streaming data and online algorithms, clustering of time-series subsequences, extracted via a sliding window, has received much attention. In this work, we make a surprising claim. Clustering of time-series subsequences is meaningless. More concretely, clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising because it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method that, based on the concept of time-series motifs, is able to meaningfully cluster subsequences on some time-series datasets.  相似文献   

6.
The problem of anomaly detection in time series has received a lot of attention in the past two decades. However, existing techniques cannot locate where the anomalies are within anomalous time series, or they require users to provide the length of potential anomalies. To address these limitations, we propose a self-learning online anomaly detection algorithm that automatically identifies anomalous time series, as well as the exact locations where the anomalies occur in the detected time series. In addition, for multivariate time series, it is difficult to detect anomalies due to the following challenges. First, anomalies may occur in only a subset of dimensions (variables). Second, the locations and lengths of anomalous subsequences may be different in different dimensions. Third, some anomalies may look normal in each individual dimension but different with combinations of dimensions. To mitigate these problems, we introduce a multivariate anomaly detection algorithm which detects anomalies and identifies the dimensions and locations of the anomalous subsequences. We evaluate our approaches on several real-world datasets, including two CPU manufacturing data from Intel. We demonstrate that our approach can successfully detect the correct anomalies without requiring any prior knowledge about the data.  相似文献   

7.
iSAX: disk-aware mining and indexing of massive time series datasets   总被引:1,自引:0,他引:1  
Current research in indexing and mining time series data has produced many interesting algorithms and representations. However, the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation which can be used to index datasets which are several orders of magnitude larger than anything else considered in the literature. To demonstrate the utility of this representation, we constructed a simple tree-based index structure which facilitates fast exact search and orders of magnitude faster, approximate search. For example, with a database of one-hundred million time series, the approximate search can retrieve high quality nearest neighbors in slightly over a second, whereas a sequential scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate search effectiveness, and lower bound comparisons between time series representations in a bit constrained environment. We further show how to exploit the combination of both exact and approximate search as sub-routines in data mining algorithms, allowing for the exact mining of truly massive real world datasets, containing tens of millions of time series.  相似文献   

8.
A streaming time series is a continuous and unbounded group of chronological observations that are found in many scientific and business applications. Motifs that are frequent subsequences are highly representative for the time series and play an important role in time series mining. Discovering motifs in time series has received much attention during recent years, and several algorithms have been proposed to solve this problem. However, these algorithms can only find motifs with a predefined length, which greatly affects their performance and practicality. Recent algorithms can discover motifs with different lengths, but require multiple scanning of the time series and are thus not applicable to streaming time series. In addition, it is difficult to determine the optimal length of interesting motifs; a suboptimal choice results in missing the key motifs or having too many redundant motifs. To overcome this challenge, we introduce the notion of a \(closed\) motif; a motif is \(closed\) if there is no motif with a longer length having the same number of occurrences. We propose a novel algorithm \(closedMotif\) to discover closed motifs in a single scan for streaming time series. We also use the nearest neighbor classifier with the most distinctive closed motifs to validate their potential in time series classification. Extensive experiments show that our approach can efficiently discover motifs with different lengths. In addition, our closed-motif-based classifier is shown to be more accurate than \(Logical\text{- }Shapelet\) , a state-of-the-art time series classifier. Finally, we demonstrate the scalability of \(closedMotif\) on several large datasets in diverse domains like video surveillance, sensor networks, and biometrics.  相似文献   

9.
已有的变长模体发现算法存在速度慢、可扩展性较差,且结果中包含过短、过长和平凡匹配等无意义模体的问题。本文提出一种基于Matrix Profile的时间序列变长模体挖掘算法。该算法使用STOMP算法作为子程序,使用结合了增量计算的下界距离来加速候选模体提取过程;采用长度相似性条件和模体分组等价类方法踢除过短、过长和平凡匹配等无意义的模体。在数据集UCR上的实验表明,提出的算法在发现变长模体时,能够有效地过滤无意义模体,且具有较高的效率和准确率。  相似文献   

10.
Biochemical research often involves examining structural relationships in molecules since scientists strongly believe in the causal relationship between structure and function. Traditionally, researchers have identified these patterns, or motifs, manually using domain expertise. However, with the massive influx of new biochemical data and the ability to gather data for very large molecules, there is great need for techniques that automatically and efficiently identify commonly occurring structural patterns in molecules. Previous automated substructure discovery approaches have each introduced variations of similar underlying techniques and have embedded domain knowledge. While doing so improves performance for the particular domain, this complicates extensibility to other domains. Also, they do not address scalability or noise, which is critical for macromolecules such as proteins. In this paper, we present MotifMiner, a general framework for efficiently identifying common motifs in most scientific molecular datasets. The approach combines structure-based frequent-pattern discovery with search space reduction and coordinate noise handling. We describe both the framework and several algorithms as well as demonstrate the flexibility of our system by analyzing protein and drug biochemical datasets.  相似文献   

11.
基于Shapelet剪枝和覆盖的时间序列分类算法   总被引:2,自引:0,他引:2  
原继东  王志海  韩萌 《软件学报》2015,26(9):2311-2325
时间序列shapelets是时间序列中能够最大限度地表示一个类别的子序列.解决时间序列分类问题的有效途径之一是通过shapelets转换技术,将shapelets的发现与分类器的构建相分离,其主要优点是优化了shapelets的选择过程,并能够灵活应用不同的分类策略.但该方法也存在不足:一是在shapelets转换时,用于产生最好分类结果的shapelets数量是很难确定的;二是被选择的shapelets之间往往存在着较大的相似性.针对这两个问题,首先提出了一种简单有效的shapelet剪枝技术,用于过滤掉相似的shapelets;其次,提出了一种基于shapelets覆盖的方法来确定用于数据转换的shapelets的数量.通过在多个数据集上的测试实验,表明了所提出的算法具有更高的分类准确率.  相似文献   

12.
Classification of time series has been attracting great interest over the past decade. While dozens of techniques have been introduced, recent empirical evidence has strongly suggested that the simple nearest neighbor algorithm is very difficult to beat for most time series problems, especially for large-scale datasets. While this may be considered good news, given the simplicity of implementing the nearest neighbor algorithm, there are some negative consequences of this. First, the nearest neighbor algorithm requires storing and searching the entire dataset, resulting in a high time and space complexity that limits its applicability, especially on resource-limited sensors. Second, beyond mere classification accuracy, we often wish to gain some insight into the data and to make the classification result more explainable, which global characteristics of the nearest neighbor cannot provide. In this work we introduce a new time series primitive, time series shapelets, which addresses these limitations. Informally, shapelets are time series subsequences which are in some sense maximally representative of a class. We can use the distance to the shapelet, rather than the distance to the nearest neighbor to classify objects. As we shall show with extensive empirical evaluations in diverse domains, classification algorithms based on the time series shapelet primitives can be interpretable, more accurate, and significantly faster than state-of-the-art classifiers.  相似文献   

13.
Time series are an important and interesting research field due to their many different applications. In our previous work, we proposed a time-series segmentation approach by combining a clustering technique, discrete wavelet transformation (DWT) and a genetic algorithm to automatically find segments and patterns from a time series. In this paper, we propose a perceptually important points (PIP)-based evolutionary approach, which uses PIP instead of DWT, to effectively adjust the length of subsequences and find appropriate segments and patterns, as well as avoid some problems that arose in the previous approach. To achieve this, an enhanced suitability factor in the fitness function is designed, modified from the previous approach. The experimental results on a real financial dataset show the effectiveness of the proposed approach.  相似文献   

14.
We investigate algorithms for efficiently detecting anomalies in real-valued one-dimensional time series. Past work has shown that a simple brute force algorithm that uses as an anomaly score the Euclidean distance between nearest neighbors of subsequences from a testing time series and a training time series is one of the most effective anomaly detectors. We investigate a very efficient implementation of this method and show that it is still too slow for most real world applications. Next, we present a new method based on summarizing the training time series with a small set of exemplars. The exemplars we use are feature vectors that capture both the high frequency and low frequency information in sets of similar subsequences of the time series. We show that this exemplar-based method is both much faster than the efficient brute force method as well as a prediction-based method and also handles a wider range of anomalies. We compare our algorithm across a large variety of publicly available time series and encourage others to do the same. Our exemplar-based algorithm is able to process time series in minutes that would take other methods days to process.  相似文献   

15.
16.
In this work we introduce the new problem of finding time seriesdiscords. Time series discords are subsequences of longer time series that are maximally different to all the rest of the time series subsequences. They thus capture the sense of the most unusual subsequence within a time series. While discords have many uses for data mining, they are particularly attractive as anomaly detectors because they only require one intuitive parameter (the length of the subsequence) unlike most anomaly detection algorithms that typically require many parameters. While the brute force algorithm to discover time series discords is quadratic in the length of the time series, we show a simple algorithm that is three to four orders of magnitude faster than brute force, while guaranteed to produce identical results. We evaluate our work with a comprehensive set of experiments on diverse data sources including electrocardiograms, space telemetry, respiration physiology, anthropological and video datasets. Eamonn Keogh is an Assistant Professor of computer science at the University of California, Riverside. His research interests include data mining, machine learning and information retrieval. Several of his papers have won best paper awards, including papers at SIGKDD and SIGMOD. Dr. Keogh is the recipient of a 5-year NSF Career Award for “Efficient discovery of previously unknown patterns and relationships in massive time series databases.” Jessica Lin is an Assistant Professor of information and software engineering at George Mason University. She received her Ph.D. from the University of California, Riverside. Her research interests include data mining and informational retrieval. Sang-Hee Lee is a paleoanthropologist at the University of California, Riverside. Her research interests include the evolution of human morphological variation and how different mechanisms (such as taxonomy, sex, age, and time) explain what is observed in fossil data. Dr. Lee obtained her Ph.D. in anthropology from the University of Michigan in 1999. Helga Van Herle is an Assistant Clinical Professor of medicine at the Division of Cardiology of the Geffen School of Medicine at UCLA. She received her M.D. from UCLA in 1993; completed her residency in internal medicine at the New York Hospital (Cornell University, 1993–1996) and her cardiology fellowship at UCLA (1997–2001). Dr. Van Herle holds a M.Sc. in bioengineering from Columbia University (1987) and a B.Sc. in Chemical Engineering from UCLA (1985)  相似文献   

17.
An active research topic in data mining is the discovery of sequential patterns, which finds all frequent subsequences in a sequence database. The generalized sequential pattern (GSP) algorithm was proposed to solve the mining of sequential patterns with time constraints, such as time gaps and sliding time windows. Recent studies indicate that the pattern-growth methodology could speed up sequence mining. However, the capabilities to mine sequential patterns with time constraints were previously available only within the Apriori framework. Therefore, we propose the DELISP (delimited sequential pattern) approach to provide the capabilities within the pattern-growth methodology. DELISP features in reducing the size of projected databases by bounded and windowed projection techniques. Bounded projection keeps only time-gap valid subsequences and windowed projection saves nonredundant subsequences satisfying the sliding time-window constraint. Furthermore, the delimited growth technique directly generates constraint-satisfactory patterns and speeds up the pattern growing process. The comprehensive experiments conducted show that DELISP has good scalability and outperforms the well-known GSP algorithm in the discovery of sequential patterns with time constraints.  相似文献   

18.
This paper presents a proposal for the extraction of association rules called G3PARM (Grammar-Guided Genetic Programming for Association Rule Mining) that makes the knowledge extracted more expressive and flexible. This algorithm allows a context-free grammar to be adapted and applied to each specific problem or domain and eliminates the problems raised by discretization. This proposal keeps the best individuals (those that exceed a certain threshold of support and confidence) obtained with the passing of generations in an auxiliary population of fixed size n. G3PARM obtains solutions within specified time limits and does not require the large amounts of memory that the exhaustive search algorithms in the field of association rules do. Our approach is compared to exhaustive search (Apriori and FP-Growth) and genetic (QuantMiner and ARMGA) algorithms for mining association rules and performs an analysis of the mined rules. Finally, a series of experiments serve to contrast the scalability of our algorithm. The proposal obtains a small set of rules with high support and confidence, over 90 and 99% respectively. Moreover, the resulting set of rules closely satisfies all the dataset instances. These results illustrate that our proposal is highly promising for the discovery of association rules in different types of datasets.  相似文献   

19.
Distance-based outliers: algorithms and applications   总被引:20,自引:0,他引:20  
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB (distance-based) outliers. Specifically, we show that (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., ); and (ii), outlier detection is a meaningful and important knowledge discovery task. First, we present two simple algorithms, both having a complexity of , k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm that has a complexity that is linear with respect to N, but exponential with respect to k. We provide experimental results indicating that this algorithm significantly outperforms the two simple algorithms for . Third, for datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most three passes over a dataset. Again, experimental results show that this algorithm is by far the best for . Finally, we discuss our work on three real-life applications, including one on spatio-temporal data (e.g., a video surveillance application), in order to confirm the relevance and broad applicability of DB outliers. Received February 15, 1999 / Accepted August 1, 1999  相似文献   

20.
基于滑动窗口的多变量时间序列异常数据的挖掘   总被引:1,自引:0,他引:1       下载免费PDF全文
翁小清    沈钧毅 《计算机工程》2007,33(12):102-104
与其它多变量时间序列(MTS)子序列显著不同的子序列,称为异常子序列(含异常数据)。该文提出了一种基于滑动窗口的MTS异常子序列的挖掘算法,使用扩展的Frobenius 范数来计算两个MTS子序列之间相似性,使用两阶段顺序查询来进行K-近邻查找,将不可能成为候选异常子序列的MTS子序列剪去,对上海证券交易所股票交易情况MTS数据集进行了异常子序列(含异常数据)挖掘,结果表明了算法的有效性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号