首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We address the handling of time series search based on two important distance definitions: Euclidean distance and time warping distance. The conventional method reduces the dimensionality by means of a discrete Fourier transform. We apply the Haar wavelet transform technique and propose the use of a proper normalization so that the method can guarantee no false dismissal for Euclidean distance. We found that this method has competitive performance from our experiments. Euclidean distance measurement cannot handle the time shifts of patterns. It fails to match the same rise and fall patterns of sequences with different scales. A distance measure that handles this problem is the time warping distance. However, the complexity of computing the time warping distance function is high. Also, as time warping distance is not a metric, most indexing techniques would not guarantee any false dismissal. We propose efficient strategies to mitigate the problems of time warping. We suggest a Haar wavelet-based approximation function for time warping distance, called Low Resolution Time Warping, which results in less computation by trading off a small amount of accuracy. We apply our approximation function to similarity search in time series databases, and show by experiment that it is highly effective in suppressing the number of false alarms in similarity search.  相似文献   

2.
动态时间弯曲(DTW)距离支持时间序列的多种形变,具有较高的匹配精度,是一种重要的相似性度量方法.然而,该方法计算复杂度较高,制约了其在相似性搜索中的应用.为了平衡匹配精度与计算效率之间的矛盾,提出一种过滤搜索方法.首先,构造一种计算代价较低的DTW下界距离,用其进行粗略过滤,得到候选集;然后,利用提前终止策略,优化计算候选集中序列的DTW距离,得到搜索结果;最后,对所提出方法进行实验验证,结果表明,该方法能够提高DTW距离的相似性搜索效率,且具有非漏报性.  相似文献   

3.
Recently, databases have been used to store multimedia data such as images, maps, video clips, and music clips. In order to search them, they should be represented by various features, which are composed of high-dimensional vectors. As a result, the dimensionality of data is increased considerably, which causes ‘the curse of dimensionality’. The increase of data dimensionality causes poor performance of index structures. To overcome the problem, the research on the dimensionality reduction has been conducted. However, some reduction methods do not guarantee no false dismissal, while others incur high computational cost. This paper proposes dimensionality reduction techniques that guarantee no false dismissal while providing efficiency considerable by approximating distances with a few values. To provide the no false dismissal property, approximated distances should always be smaller than original distances. The Cauchy–Schwarz inequality and two trigonometrical equations are used as well as the dimension partitioning technique is applied to approximate distances in such a way to reduce the difference between the approximated distance and the original distance. As a result, the proposed techniques reduce the candidate set of a query result for efficient query processing.  相似文献   

4.
Time sequences, which are ordered sets of observations, have been studied in various database applications. In this paper, we introduce a new class of time sequences where each observation is represented by an interval rather than a number. Such sequences may arise in many situations. For instance, we may not be able to determine the exact value at a time point due to uncertainty or aggregation. Such observation may be represented better by a range of possible values. Similarity search with interval time sequences as both query and data sequences poses a new challenge for research. We first address the issue of (dis)similarity measures for interval time sequences. We choose an L1 norm-based measure because it effectively quantifies the degree of overlapping and remoteness between two intervals, and is invariant irrespective of the position of an interval when it is enclosed within another interval. We next propose an efficient indexing technique for fast retrieval of similar interval time sequences from large databases. More specifically, we propose: (1) to extract a segment-based feature vector for each sequence, and (2) to map each feature vector to either a point or a hyper-rectangle in a multi-dimensional feature space. We then show how we can use existing multi-dimensional index structures such as the R-tree for efficient query processing. The proposed method guarantees no false dismissals. Experimental results show that, for synthetic and real stock data, it is superior to sequential scanning in performance and scales well with the data size.  相似文献   

5.
Indexing sequences containing multiple moving objects by all features of these objects captured at every clock tick results in huge index structures due to the large number of extracted features in all sampled instances. Thus, the main problems with current systems that index sequences containing multiple moving objects are: huge storage requirements for index structures, slow search time and low accuracy due to lack of representation of the time-varying features of objects. In this paper, a technique called cTraj to address these problems is proposed. For each object in a sequence, cTraj captures the features at sampled instances. Then, it maps the object??s features at each sampled instance from high-dimensional feature space into a point in low-dimensional distance space. The sequence of points of an object in low-dimensional space is considered the time-varying feature trajectory of the object. To reduce storage requirements of an index structure, the sequence of points in each trajectory is represented by a minimum bounding box (MBB). cTraj indexes a sequence by the MBBs of its objects using a spatial access method (SAM), such as an R?tree; thus, greatly reducing storage requirements of the index and speeding up the search time. The cTraj technique does not result in any false dismissal, but the result might contain a few false alarms, which are removed by a two-step refinement process. The experiments show that the proposed cTraj technique produces effective results comparable to those of a sequential method, however much more efficient.  相似文献   

6.
吴枫  仲妍  吴泉源  贾焰  杨树强 《软件学报》2009,20(10):2867-2884
相似性搜索在股票交易行情、网络安全、传感器网络等众多领域应用广泛.由于这些领域中产生的数据具有无限的、连续的、快速的、实时的特性,所以需要适合数据流上的在线相似性搜索算法.首先,在具有或不具有全局约束条件下,分别提出了没有索引结构的DTW(dynamic time warping)下限函数LB_seg_WFglobalLB_seg_WF,它们是一种分段DTW技术,能够处理数据流上的非等长序列间在线相似性匹配问题.然后,为了进一步提高LB_seg_WFglobalLB_seg_WF的近似程度,提出了一系列的改进方法.最后,针对流上使用LB_seg_WFglobalLB_seg_WF可能会出现连续失效的情况,分别提出了DTW的下限函数LB_WFglobal(具有全局约束条件)和上限函数UB_WF、下限函数LB_WF(不具有全局约束条件).通过增量方式快速估计DTW,极大地减少了估计DTW的冗余计算量.通过理论分析和统计实验,验证了该方法的有效性.  相似文献   

7.
Similarity search is a core module of many data analysis tasks, including search by example, classification, and clustering. For time series data, Dynamic Time Warping (DTW) has been proven a very effective similarity measure, since it minimizes the effects of shifting and distortion in time. However, the quadratic cost of DTW computation to the length of the matched sequences makes its direct application on databases of long time series very expensive. We propose a technique that decomposes the sequences into a number of segments and uses cheap approximations thereof to compute fast lower bounds for their warping distances. We present several, progressively tighter bounds, relying on the existence or not of warping constraints. Finally, we develop an index and a multi-step technique that uses the proposed bounds and performs two levels of filtering to efficiently process similarity queries. A thorough experimental study suggests that our method consistently outperforms state-of-the-art methods for DTW similarity search.  相似文献   

8.
Dynamic time warping (DTW) distance has been effectively used in mining time series data in a multitude of domains. However, in its original formulation DTW is extremely inefficient in comparing long sparse time series, containing mostly zeros and some unevenly spaced nonzero observations. Original DTW distance does not take advantage of this sparsity, leading to redundant calculations and a prohibitively large computational cost for long time series. We derive a new time warping similarity measure (AWarp) for sparse time series that works on the run-length encoded representation of sparse time series. The complexity of AWarp is quadratic on the number of observations as opposed to the range of time of the time series. Therefore, AWarp can be several orders of magnitude faster than DTW on sparse time series. AWarp is exact for binary-valued time series and a close approximation of the original DTW distance for any-valued series. We discuss useful variants of AWarp: bounded (both upper and lower), constrained, and multidimensional. We show applications of AWarp to three data mining tasks including clustering, classification, and outlier detection, which are otherwise not feasible using classic DTW, while producing equivalent results. Potential areas of application include bot detection, human activity classification, search trend analysis, seismic analysis, and unusual review pattern mining.  相似文献   

9.
多维时序数据中的相似子序列搜索研究   总被引:4,自引:0,他引:4  
由于动态时间弯曲距离较之欧氏距离有更好鲁棒性,因此被广泛用作时序数据相似子序列搜索研究领域中的相似性度量.在单一维度上的相似子序列搜索可能不能获得足够的匹配结果作为继续深入分析的依据,因此通过引入在多维数据分析中常用的数据立方体模型将相似子序列搜索问题扩展到了多维场景之下,从而在多个维度上得到搜索结果以获取更多有价值的知识.在此基础上利用数据立方体相邻层次单元间的相关性对基本的搜索算法进行了改进,在保证准确性的基础上提高了搜索效率.在真实网络安全数据集上的实验验证了所提方法的有效性.  相似文献   

10.
This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D2 for ranges of parameters most frequently encountered in the study of biological sequences.  相似文献   

11.
Advances in geographical tracking, multimedia processing, information extraction, and sensor networks have created a deluge of probabilistic data. While similarity search is an important tool to support the manipulation of probabilistic data, it raises new challenges to traditional relational databases. The problem stems from the limited effectiveness of the distance metrics employed by existing database systems. On the other hand, several more complicated distance operators have proven their values for better distinguishing ability in specific probabilistic domains. In this paper, we discuss the similarity search problem with respect to Earth Mover’s Distance (EMD). EMD is the most successful distance metric for probability distribution comparison but is an expensive operator as it has cubic time complexity. We present a new database indexing approach to answer EMD-based similarity queries, including range queries and k-nearest neighbor queries on probabilistic data. Our solution utilizes primal-dual theory from linear programming and employs a group of B + trees for effective candidate pruning. We also apply our filtering technique to the processing of continuous similarity queries, especially with applications to frame copy detection in real-time videos. Extensive experiments show that our proposals dramatically improve the usefulness and scalability of probabilistic data management.  相似文献   

12.
Time series of remote sensing imagery or derived vegetation indices and biophysical products have been shown particularly useful to characterize land ecosystem dynamics. Various methods have been developed based on temporal trajectory analysis to characterize, classify and detect changes in ecosystem dynamics. Although time series similarity measures play an important role in these methods, a quantitative comparison of the similarity measures is lacking. The objective of this study was to provide an overview and quantitative comparison of the similarity measures in function of varying time series and ecosystem characteristics, such as amplitude, timing and noise effects. For this purpose, the performance was evaluated for the commonly used similarity measures (D), ranging from Manhattan (DMan), Euclidean (DE) and Mahalanobis (DMah) distance measures, to correlation (DCC), Principal Component Analysis (PCA; DPCA) and Fourier based (DFFT,Dξ,DFk) similarities. The quantitative comparison consists of a series of Monte-Carlo simulations based on subsets of global MODIS Normalized Difference Vegetation index (NDVI) and Enhanced Vegetation Index (EVI) and Leaf Area Index (LAI) data. Results of the simulations reveal four main groups of time series similarity measures with different sensitivities: (i) DMan, DE, DPCA, DFk quantify the difference in time series values, (ii) DMah accounts for temporal correlation and non-stationarity of variance, (iii) DCC measures the temporal correlation, and (iv) the Fourier based DFFT and Dξ show their specific sensitivity based on the selected Fourier components. The difference measures show relatively the highest sensitivity to amplitude effects, whereas the correlation based measures are highly sensitive to variations in timing and noise. The Fourier based measures, finally, depend highly on the signal to noise ratio and the balance between amplitude and phase dominance. The heterogeneity in sensitivity of each D stresses the importance of (i) understanding the time series characteristics before applying any classification of change detection approach and (ii) defining the variability one wants to identify/account for. This requires an understanding of the ecosystem dynamics and time series characteristics related to the baseline, amplitude, timing, noise and variability of the ecosystem time series. This is also illustrated in the quantitative comparison, where the different sensitivities of D for the NDVI, EVI, and LAI data relate specifically to the temporal characteristics of each data set. Additionally, the effect of noise and intra- and interclass variability is demonstrated in a case study based on land cover classification.  相似文献   

13.
We define similar video content as video sequences with almost identical content but possibly compressed at different qualities, reformatted to different sizes and frame-rates, undergone minor editing in either spatial or temporal domain, or summarized into keyframe sequences. Building a search engine to identify such similar content in the World-Wide Web requires: 1) robust video similarity measurements; 2) fast similarity search techniques on large databases; and 3) intuitive organization of search results. In a previous paper, we proposed a randomized technique called the video signature (ViSig) method for video similarity measurement. In this paper, we focus on the remaining two issues by proposing a feature extraction scheme for fast similarity search, and a clustering algorithm for identification of similar clusters. Similar to many other content-based methods, the ViSig method uses high-dimensional feature vectors to represent video. To warrant a fast response time for similarity searches on high dimensional vectors, we propose a novel nonlinear feature extraction scheme on arbitrary metric spaces that combines the triangle inequality with the classical Principal Component Analysis (PCA). We show experimentally that the proposed technique outperforms PCA, Fastmap, Triangle-Inequality Pruning, and Haar wavelet on signature data. To further improve retrieval performance, and provide better organization of similarity search results, we introduce a new graph-theoretical clustering algorithm on large databases of signatures. This algorithm treats all signatures as an abstract threshold graph, where the distance threshold is determined based on local data statistics. Similar clusters are then identified as highly connected regions in the graph. By measuring the retrieval performance against a ground-truth set, we show that our proposed algorithm outperforms simple thresholding, single-link and complete-link hierarchical clustering techniques.  相似文献   

14.
为了减少噪声数据对查询最优序列的影响,避免Euclidean距离对形态的敏感性,以及要求序列等长的缺点,提出了面向噪声数据的时间序列相似性搜索算法.运用SPC方法去除序列中的噪声数据;采用DTW距离作为度量函数,使用规范化方法使序列处于相同的分辨率下;采用LB_ Keogh下界函数对候选序列集合进行筛选.仿真实验结果表明,该算法在阈值较小时,对含有噪声数据序列的匹配能力较强.  相似文献   

15.
16.
Similarity search and detection is a central problem in time series data processing and management. Most approaches to this problem have been developed around the notion of dynamic time warping, whereas several dimensionality reduction techniques have been proposed to improve the efficiency of similarity searches. Due to the continuous increasing of sources of time series data and the cruciality of real-world applications that use such data, we believe there is a challenging demand for supporting similarity detection in time series in a both accurate and fast way. Our proposal is to define a concise yet feature-rich representation of time series, on which the dynamic time warping can be applied for effective and efficient similarity detection of time series. We present the Derivative time series Segment Approximation (DSA) representation model, which originally features derivative estimation, segmentation and segment approximation to provide both high sensitivity in capturing the main trends of time series and data compression. We extensively compare DSA with state-of-the-art similarity methods and dimensionality reduction techniques in clustering and classification frameworks. Experimental evidence from effectiveness and efficiency tests on various datasets shows that DSA is well-suited to support both accurate and fast similarity detection.  相似文献   

17.
针对时序数据相似性搜索面临的高维性问题,提出一种利用按沃尔什序数排列的离散沃尔什变换((DWHT)w)对时序数据进行维归约的方法.(DWHT)w是正交变换,变换矩阵简单,可以应用快速算法,对时序数据有更好的特征提取能力,用其索引时间序列数据在理论上具备非漏报性质.与基于离散傅里叶变换和基于离散沃尔什变换的对比实验表明,...  相似文献   

18.
随着大量三维人体运动数据库的建立,使得在数据库中实现基于内容的三维人体运动检索面临着诸多困难,文中提出一种分阶段的动态时间变形(DTW)优化算法的人体运动数据检索技术,可有效检索出逻辑上相似的运动。该算法首先对齐两个运动序列的坐标位置,基于窗口距离构造距离矩阵。其次采用基于全局和局部约束的DTW优化算法进行相似度匹配,得到两个运动间的对应关系。最后通过归一化相似度和DTW平均距离分阶段判断运动的相似性。实验结果表明,分阶段的DTW优化算法在提高效率的同时对长度不等的运动能取得较好的检索结果。  相似文献   

19.
In a digraph G, a vertex u is said to dominate itself and vertices v such that (u,v) is an arc of G. For a positive integer k, a k-tuple dominating set D of a digraph is a subset of vertices such that every vertex is dominated by at least k vertices in D. The k-tuple domination number of a given digraph is the minimum cardinality of a k-tuple dominating set of the digraph. In this letter, we give the exact values of the k-tuple domination number of de Bruijn and Kautz digraphs.  相似文献   

20.
Among many existing distance measures for time series data, Dynamic Time Warping (DTW) distance has been recognized as one of the most accurate and suitable distance measures due to its flexibility in sequence alignment. However, DTW distance calculation is computationally intensive. Especially in very large time series databases, sequential scan through the entire database is definitely impractical, even with random access that exploits some index structures since high dimensionality of time series data incurs extremely high I/O cost. More specifically, a sequential structure consumes high CPU but low I/O costs, while an index structure requires low CPU but high I/O costs. In this work, we therefore propose a novel indexed sequential structure called TWIST (Time Warping in Indexed Sequential sTructure) which benefits from both sequential access and index structure. When a query sequence is issued, TWIST calculates lower bounding distances between a group of candidate sequences and the query sequence, and then identifies the data access order in advance, hence reducing a great number of both sequential and random accesses. Impressively, our indexed sequential structure achieves significant speedup in a querying process. In addition, our method shows superiority over existing rival methods in terms of query processing time, number of page accesses, and storage requirement with no false dismissal guaranteed.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号