首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
针对时间序列子序列聚类存在的平凡相似和水平伸缩等问题,提出了一种新的子序列聚类算法。它采用多孔平滑滤波器组对时间序列进行低通平滑处理,在所得到的多个尺度序列上生成平凡簇,然后将各个平凡簇的代表子序列作为数据样本进行聚类。新方法利用平凡簇克服了子序列聚类中的平凡相似问题,并且可以在时间序列上发现不等长的相似子序列,较好地解决了水平轴伸缩问题。实验结果证明新算法对于子序列聚类具有比较好的效果。  相似文献   

2.
An algorithmic method for assessing statistically the efficient market hypothesis (EMH) is developed based on two data mining tools, perceptually important points (PIPs) used to dynamically segment price series into subsequences, and dynamic time warping (DTW) used to find similar historical subsequences. Then predictions are made from the mappings of the most similar subsequences, and the prediction error statistic is used for the EMH assessment. The predictions are assessed on simulated price paths composed of stochastic trend and chaotic deterministic time series, and real financial data of 18 world equity markets and the GBP/USD exchange rate. The main results establish that the proposed algorithm can capture the deterministic structure in simulated series, confirm the validity of EMH on the examined equity indices, and indicate that prediction of the exchange rates using PIPs and DTW could beat at cases the prediction of last available price.  相似文献   

3.
Summary Dijkstra has given a derivation of an efficient algorithm for a problem concerning monotonic subsequences, and extracted a proof of a related theorem from the algorithm. Here it is shown that a careful separation of concerns can lead to a beautiful conventional proof, a very different derivation of Dijkstra's algorithm, a more elegant proof from the algorithm, and the discovery of a duality property.  相似文献   

4.
In real life, data often appear in the form of sequences and this form of data is called sequence data. In this paper, a new definition on sequence similarity and a novel algorithm, Projection Algorithm, for sequence data searching are proposed. This algorithm is not required to access every datum in a sequence database. However, it guarantees that no qualified subsequence is falsely rejected. Moreover, the projection algorithm can be extended to match subsequences with different scales. With careful selection of parameters, most of the similar subsequences with different scales can be retrieved. We also show by experiments that the proposed algorithm can outperform the traditional sequential searching algorithm up to 96 times in terms of speed up.  相似文献   

5.
In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.  相似文献   

6.
We investigate algorithms for efficiently detecting anomalies in real-valued one-dimensional time series. Past work has shown that a simple brute force algorithm that uses as an anomaly score the Euclidean distance between nearest neighbors of subsequences from a testing time series and a training time series is one of the most effective anomaly detectors. We investigate a very efficient implementation of this method and show that it is still too slow for most real world applications. Next, we present a new method based on summarizing the training time series with a small set of exemplars. The exemplars we use are feature vectors that capture both the high frequency and low frequency information in sets of similar subsequences of the time series. We show that this exemplar-based method is both much faster than the efficient brute force method as well as a prediction-based method and also handles a wider range of anomalies. We compare our algorithm across a large variety of publicly available time series and encourage others to do the same. Our exemplar-based algorithm is able to process time series in minutes that would take other methods days to process.  相似文献   

7.
邹蕾  高学东 《计算机应用》2016,36(9):2472-2474
时间序列子序列匹配作为时间序列检索、聚类、分类、异常监测等挖掘任务的基础被广泛研究。但传统的时间序列子序列匹配都是对精确相同或近似相同的模式进行匹配,为此定义了一种全新的具有相似发展趋势的序列模式——时间序列同构关系,经过数学推导给出了时间序列同构关系判定的法则,并基于此提出了同构关系时间序列片段发现的算法。该算法首先对原始时间序列进行预处理,然后分段拟合后对各时间序列分段进行同构关系判定。针对现实背景数据难以满足理论约束的问题,通过定义一个同构关系容忍度参数使实际时间序列数据的同构关系挖掘成为可能。实验结果表明,该算法能有效挖掘出满足同构关系的时间序列片段。  相似文献   

8.
孙焘  朱晓明 《计算机科学》2017,44(2):270-274
多条序列的最长公共子序列可以代表多条序列的公共信息,其在诸多领域里有着重要的应用,如信息检索、基因序列匹配等。求解多条序列的最长公共子序列是著名的NP难问题,本质为多解问题。一些近似算法虽然时间复杂度较低,但只能求出单解,对于有多解的序列集合,求得的结果信息量损失较大。因此提出一个新的近似算法来解决最长公共子序列问题。算法引入了代数结构“格”,通过动态规划求解出两条序列的公共格,并递归求解当前格与当前序列的公共格。公共格中的路径保存了多条公共子序列使得最终求解出的最长公共子序列为多个。对算法的相关定理给出了理论证明,并通过实验验证了算法的正确性。  相似文献   

9.
Time series motifs are sets of very similar subsequences of a long time series. They are of interest in their own right, and are also used as inputs in several higher-level data mining algorithms including classification, clustering, rule-discovery and summarization. In spite of extensive research in recent years, finding time series motifs exactly in massive databases is an open problem. Previous efforts either found approximate motifs or considered relatively small datasets residing in main memory. In this work, we leverage off previous work on pivot-based indexing to introduce a disk-aware algorithm to find time series motifs exactly in multi-gigabyte databases which contain on the order of tens of millions of time series. We have evaluated our algorithm on datasets from diverse areas including medicine, anthropology, computer networking and image processing and show that we can find interesting and meaningful motifs in datasets that are many orders of magnitude larger than anything considered before.  相似文献   

10.
针对时间序列模体发现算法计算复杂,并且无法发现多实例模体的问题,提出基于子序列全连接和最大团的时间序列模体发现(TSSJMC)算法。首先,使用快速时间序列子序列全连接算法求得所有子序列之间的距离,生成距离矩阵;然后,设置相似性阈值,将距离矩阵转化为邻接矩阵,构造子序列相似图;最后采用最大团搜索算法从相似图中搜索最大团,最大团的顶点对应的时间序列为包含最多实例的模体。在公开的时间序列数据集上进行实验,选用已有的能够发现多实例模体的Brute Force和Random Projection算法作为对比对象,分别从准确性、效率、可扩展性和鲁棒性对TSSJMC算法进行分析并获得了客观的评判结果。实验结果表明,与Random Projection算法相比,TSSJMC算法在效率、可扩展性和鲁棒性法方面均有明显优势;与Brute Force算法相比,TSSJMC算法发现的模体实例数量虽略低,但其效率和可扩展性都优于Brute Force算法。因此,TSSJMC是质量和效率相平衡的算法。  相似文献   

11.
Clustering of time series subsequence data commonly produces results that are unspecific to the data set. This paper introduces a clustering algorithm, that creates clusters exclusively from those subsequences that occur more frequently in a data set than would be expected by random chance. As such, it partially adopts a pattern mining perspective into clustering. When subsequences are being labeled based on such clusters, they may remain without label. In fact, if the clustering was done on an unrelated time series it is expected that the subsequences should not receive a label. We show that pattern-based clusters are indeed specific to the data set for 7 out of 10 real-world sets we tested, and for window-lengths up to 128 time points. While kernel-density-based clustering can be used to find clusters with similar properties for window sizes of 8–16 time points, its performance degrades fast for increasing window sizes.
Dietmar H. DorrEmail:
  相似文献   

12.
We present a new approach to motion rearrangement that preserves the syntactic structures of an input motion automatically by learning a context‐free grammar from the motion data. For grammatical analysis, we reduce an input motion into a string of terminal symbols by segmenting the motion into a series of subsequences, and then associating a group of similar subsequences with the same symbol. To obtain the most repetitive and precise set of terminals, we search for an optimial segmentation such that a large number of subsequences can be clustered into groups with little error. Once the input motion has been encoded as a string, a grammar induction algorithm is employed to build up a context‐free grammar so that the grammar can reconstruct the original string accurately as well as generate novel strings sharing their syntactic structures with the original string. Given any new strings from the learned grammar, it is straightforward to synthesize motion sequences by replacing each terminal symbol with its associated motion segment, and stitching every motion segment sequentially. We demonstrate the usefulness and flexibility of our approach by learning grammars from a large diversity of human motions, and reproducing their syntactic structures in new motion sequences.  相似文献   

13.
Time series motifs are approximately repeated subsequences found within a longer time series. They have been in the literature since 2002, but recently they have begun to receive significant attention in research and industrial communities. This is perhaps due to the growing realization that they implicitly offer solutions to a host of time series problems, including rule discovery, anomaly detection, density estimation, semantic segmentation, summarization, etc. Recent work has improved the scalability so exact motifs can be computed on datasets with up to a million data points in tenable time. However, in some domains, for example seismology or climatology, there is an immediate need to address even larger datasets. In this work, we demonstrate that a combination of a novel algorithm and a high-performance GPU allows us to significantly improve the scalability of motif discovery. We demonstrate the scalability of our ideas by finding the full set of exact motifs on a dataset with one hundred and forty-three million subsequences, which is by far the largest dataset ever mined for time series motifs/joins; it requires ten quadrillion pairwise comparisons. Furthermore, we demonstrate that our algorithm can produce actionable insights into seismology and ethology.  相似文献   

14.
An active research topic in data mining is the discovery of sequential patterns, which finds all frequent subsequences in a sequence database. The generalized sequential pattern (GSP) algorithm was proposed to solve the mining of sequential patterns with time constraints, such as time gaps and sliding time windows. Recent studies indicate that the pattern-growth methodology could speed up sequence mining. However, the capabilities to mine sequential patterns with time constraints were previously available only within the Apriori framework. Therefore, we propose the DELISP (delimited sequential pattern) approach to provide the capabilities within the pattern-growth methodology. DELISP features in reducing the size of projected databases by bounded and windowed projection techniques. Bounded projection keeps only time-gap valid subsequences and windowed projection saves nonredundant subsequences satisfying the sliding time-window constraint. Furthermore, the delimited growth technique directly generates constraint-satisfactory patterns and speeds up the pattern growing process. The comprehensive experiments conducted show that DELISP has good scalability and outperforms the well-known GSP algorithm in the discovery of sequential patterns with time constraints.  相似文献   

15.
数据流技术在金融分析、网络监控等诸多领域得到了广泛应用,而已有的子序列匹配算法主要针对静态序列,难于直接应用到海量、高速和连续的流数据。本文在动态时间规整技术的基础上,提出了一种新颖的TopKSM算法,能渐进、实时地获取Top-K相似子序列。算法完全符合数据流"单遍扫描"的性能要求。大量的实验表明,与现有的SPRING算法相比,该算法具有更高的性能。  相似文献   

16.
Kernel is a kind of data summary which is elaborately extracted from a large dataset. Given a problem, the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio. It is widely used in geometric optimization, clustering, and approximate query processing, etc., for scalingthem up to massive data. In this paper, we focus on the minimum ε-kernel (MK) computation that asks for a kernel of the smallest size for large-scale data processing. For the open problem presented by Wang et al. that whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we first formalize the MK problem and analyze its complexity. Due to the NP-hardness of the MK problem in three or higher dimensions, an approximate algorithm, namely Set Cover-Based Minimum ε-Kernel algorithm (SCMK), is developed to solve it. We prove that the MC problem and the MK problem can be Turing-reduced to each other. Then, we discuss the update of MK under insertion and deletionoperations, respectively. Finally, a randomized algorithm, called the Randomized Algorithm of Set Cover-Based Minimumε-Kernel algorithm (RA-SCMK), is utilized to further reduce the complexity of SCMK. The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets. Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets, respectively. The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets. RA-SCMK runs up to three times faster than SCMK on synthetic datasets.  相似文献   

17.
时间序列中快速模式发现算法的研究   总被引:3,自引:0,他引:3  
针对长时间序列,该文提出了一种新的能快速发现序列中时序模式的检索方法。首先将时间序列分成若干等长的子序列;接着从每个子序列中提取特征序列,该特征序列能够反映子序列中数据的变化趋势;然后根据每个特征序列将相应的子序列分配到一系列盒子中,使得不同盒子中的子序列因数据变化趋势不同而不相似,而在同一盒子中的序列由于数据变化趋势相同而有可能相似;最后通过计算每个盒子中任意两个子序列间的欧几里德距离来发现所有的模式。有关实验证明该算法是行之有效的。  相似文献   

18.
A periodic datamining algorithm has been developed and used to extract distinct plasma fluctuations in multichannel oscillatory timeseries data. The technique uses the Expectation Maximisation algorithm to solve for the maximum likelihood estimates and cluster assignments of a mixture of multivariate independent von Mises distributions (EM-VMM). The performance of the algorithm shows significant benefits when compared to a periodic k-means algorithm and clustering using non-periodic techniques on several artificial datasets and real experimental data. Additionally, a new technique for identifying interesting features in multichannel oscillatory timeseries data is described (STFT-clustering). STFT-clustering identifies the coincidence of spectral features over most channels of a multi-channel array using the averaged short time Fourier transform of the signals. These features are filtered using clustering to remove noise. This method is particularly good at identifying weaker features and complements existing methods of feature extraction. Results from applying the STFT-clustering and EM-VMM algorithm to the extraction and clustering of plasma wave modes in the time series data from a helical magnetic probe array on the H-1NF heliac are presented.  相似文献   

19.
基于Shapelet剪枝和覆盖的时间序列分类算法   总被引:2,自引:0,他引:2  
原继东  王志海  韩萌 《软件学报》2015,26(9):2311-2325
时间序列shapelets是时间序列中能够最大限度地表示一个类别的子序列.解决时间序列分类问题的有效途径之一是通过shapelets转换技术,将shapelets的发现与分类器的构建相分离,其主要优点是优化了shapelets的选择过程,并能够灵活应用不同的分类策略.但该方法也存在不足:一是在shapelets转换时,用于产生最好分类结果的shapelets数量是很难确定的;二是被选择的shapelets之间往往存在着较大的相似性.针对这两个问题,首先提出了一种简单有效的shapelet剪枝技术,用于过滤掉相似的shapelets;其次,提出了一种基于shapelets覆盖的方法来确定用于数据转换的shapelets的数量.通过在多个数据集上的测试实验,表明了所提出的算法具有更高的分类准确率.  相似文献   

20.
基于Segmental-DTW的无监督行为序列分割   总被引:4,自引:0,他引:4  
吴晓婕  胡占义  吴毅红 《软件学报》2008,19(9):2285-2292
行为序列分割是行为分析与识别中最初始、最基础的一个步骤.提出了一种无监督的行为序列分割算法,主要步骤包括:(1)采用等长有重叠的时间窗口对视频序列进行粗分割;(2)将粗分割的视频段两两作比较,通过Segmental-DTW算法分割出两个视频段中最相似的行为片断;(3)将行为片断的相似性转化为邻接图表示,通过图聚类方法对分割出的行为片断进行聚类.该算法采用了从粗到细的分割思想,能够准确地分割出视频序列中大量出现的行为的片断,并将相同行为的片断聚为一类.分割结果可以直接用于行为建模和识别.实验结果也表明了分割出的行为片断具有较好的代表性和有效性.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号