首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
高效时序相似搜索技术   总被引:6,自引:0,他引:6  
时序相似搜索被认为是将来最有前途的技术之一.然而,时序数据是典型的高维海量数据,如何开发高效算法非常关键.文中概述了时序相似搜索技术的研究现状和进展以及研究的主要内容,讨论了该技术的几个重要应用范例,并对一些典型算法进行了定量分析;然后晕点论述了高效时序相似搜索的关键技术,包括边界过滤、三角不等式修剪、多辨析率检索方法、过滤精炼方案等.最后讨论并分析了时序的近似相似搜索技术.上述所有技术通过对比,其正面和反面都被深入分析.最后指出了存在的问题和未来的研究热点和方向.  相似文献   

2.
Similarity join (SJ) in time-series databases has a wide spectrum of applications such as data cleaning and mining. Specifically, an SJ query retrieves all pairs of (sub)sequences from two time-series databases that epsiv-match with each other, where epsiv is the matching threshold. Previous work on this problem usually considers static time-series databases, where queries are performed either on disk-based multidimensional indexes built on static data or by nested loop join (NLJ) without indexes. SJ over multiple stream time series, which continuously outputs pairs of similar subsequences from stream time series, strongly requires low memory consumption, low processing cost, and query procedures that are themselves adaptive to time-varying stream data. These requirements invalidate the existing approaches in static databases. In this paper, we propose an efficient and effective approach to perform SJ among multiple stream time series incrementally. In particular, we present a novel method, Adaptive Radius-based Search (ARES), which can answer the similarity search without false dismissals and is seamlessly integrated into SJ processing. Most importantly, we provide a formal cost model for ARES, based on which ARES can be adaptive to data characteristics, achieving the minimum number of refined candidate pairs, and thus, suitable for stream processing. Furthermore, in light of the cost model, we utilize space-efficient synopses that are constructed for stream time series to further reduce the candidate set. Extensive experiments demonstrate the efficiency and effectiveness of our proposed approach.  相似文献   

3.
The similarity search problem has received considerable attention in database research community. In sensor network applications, this problem is even more important due to the imprecision of the sensor hardware, and variation of environmental parameters. Traditional similarity search mechanisms are both improper and inefficient for these highly energy-constrained sensors. A difficulty is that it is hard to predict which sensor has the most similar (or closest) data item such that many or even all sensors need to send their data to the query node for further comparison. In this paper, we propose a similarity search algorithm (SSA), which is a novel framework based on the concept of Hilbert curve over a data-centric storage structure, for efficiently processing similarity search queries in sensor networks. SSA successfully avoids the need of collecting data from all sensors in the network in searching for the most similar data item. The performance study reveals that this mechanism is highly efficient and significantly outperforms previous approaches in processing similarity search queries.  相似文献   

4.
在分布式数据流环境中,系统的通信带宽是一种瓶颈资源。在保证查询精度的前提下,有效地减少网络中数据流的传输量是解决这一问题的重要途径。通过分析现有的分布式数据流处理算法,总结出一个通用处理框架,以减少数据流的传输量。通用处理框架包括三个方面:最小化信息传输、使用数据流摘要表示完整信息以及通过预测维持系统的稳定性。  相似文献   

5.
Multiscale Representations for Fast Pattern Matching in Stream Time Series   总被引:1,自引:0,他引:1  
Similarity-based time series retrieval has been a subject of long term study due to its wide usage in many applications, such as financial data analysis and weather data forecasting. Its original task was to find those time series similar to a pattern time series data, where both the pattern and data time series are static. Recently, with an increasing demand on stream data management, similarity-based stream time series retrieval has raised new research issues due to its unique requirements during the stream processing, such as one-pass search and fast response. In this paper, we address the problem of matching both static and dynamic patterns over stream time series data. We will develop a novel multi-scale representation, called multi-scale segment mean (MSM), for stream time series data, which can be incrementally computed and thus perfectly adapted to the stream characteristics. Most importantly, we propose a novel multi-step filtering mechanism, SS, over the multi-scale representation. Analysis indicates that the mechanism can greatly prune the search space and thus offer fast response. Furthermore, batching processing optimization, the dynamic case where patterns are also from stream time series, and pattern matching over future stream time series are also discussed. Extensive experiments show the proposed scheme can efficiently filter out false candidates and detect patterns.  相似文献   

6.
Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest-neighbor queries are the most important query types. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. We introduce a generic scheme for such data mining algorithms and we investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries. The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parallelization yields an additional impressive speed-up. An extensive performance evaluation confirms the efficiency of our approach  相似文献   

7.
With the increasing demands for advanced use of streaming data, efficient execution of continuous queries is an important research issue. This paper focuses on event-driven continuous queries that are activated by foreign events such as data arrival and the progression of time. Existing approaches to multiple continuous query optimization decide the optimal query plan by extracting common subexpressions from the given queries. Event-driven queries containing the common subexpressions may produce many common intermediate results when they are activated within a small interval, but may produce only disjoint data when activated at completely different timings.This paper proposes an efficient data stream processing scheme for multiple event-driven continuous queries. In the proposed approach, we introduce query result caching to achieve a flexible way to share common operators among queries activated by unpredictable events. When a query is activated, an intermediate result generated for the query is stored into the cache area if it is expected to be reused by other queries. When other queries including the same operator are activated, they reuse the cached result if the cache includes reusable data. Efficiency of the proposed scheme is validated by intensive experimental evaluations.  相似文献   

8.
挖掘在线数据流的变化趋势并预测未来时间窗口上的可能值,可以为许多时间敏感的应用提供重要决策支持.通过将数量可能无限的流数据元素映射到离散的且数量有限的流数据状态空间,不断变化的流数据变化趋势可以模拟成连续的流数据状态变化的过程,进而在很小的时间与空间代价下,数据流状态变迁的趋势动态存储在状态变迁图中.通过分析状态变迁图中的流数据变迁的统计规律,数据流上未来时刻的可能值可以应用马尔可夫模型在线连续预测.  相似文献   

9.
Similarity search and detection is a central problem in time series data processing and management. Most approaches to this problem have been developed around the notion of dynamic time warping, whereas several dimensionality reduction techniques have been proposed to improve the efficiency of similarity searches. Due to the continuous increasing of sources of time series data and the cruciality of real-world applications that use such data, we believe there is a challenging demand for supporting similarity detection in time series in a both accurate and fast way. Our proposal is to define a concise yet feature-rich representation of time series, on which the dynamic time warping can be applied for effective and efficient similarity detection of time series. We present the Derivative time series Segment Approximation (DSA) representation model, which originally features derivative estimation, segmentation and segment approximation to provide both high sensitivity in capturing the main trends of time series and data compression. We extensively compare DSA with state-of-the-art similarity methods and dimensionality reduction techniques in clustering and classification frameworks. Experimental evidence from effectiveness and efficiency tests on various datasets shows that DSA is well-suited to support both accurate and fast similarity detection.  相似文献   

10.
Efficient similarity search for market basket data   总被引:2,自引:0,他引:2  
Several organizations have developed very large market basket databases for the maintenance of customer transactions. New applications, e.g., Web recommendation systems, present the requirement for processing similarity queries in market basket databases. In this paper, we propose a novel scheme for similarity search queries in basket data. We develop a new representation method, which, in contrast to existing approaches, is proven to provide correct results. New algorithms are proposed for the processing of similarity queries. Extensive experimental results, for a variety of factors, illustrate the superiority of the proposed scheme over the state-of-the-art method. Edited by R. Ng. Received: August 6, 2001 / Accepted: May 21, 2002 Published online: September 25, 2002  相似文献   

11.
Similarity query processing is becoming increasingly important in many applications such as data cleaning, record linkage, Web search, and document analytics. In this paper we study how to provide end-to-end similarity query support natively in a parallel database system. We discuss how to express a similarity predicate in its query language, how to build indexes, how to answer similarity queries (selections and joins) efficiently in the runtime engine, possibly using indexes, and how to optimize similarity queries. One particular challenge is how to incorporate existing similarity join algorithms, which often require a series of steps to achieve a high efficiency, including collecting token frequencies, finding matching record id pairs, and reassembling result records based on id pairs. We present a novel approach that uses existing runtime operators to implement such complex join algorithms without reinventing the wheel; doing so positions the system to automatically benefit from future improvements to those operators. The approach includes a technique to transform a similarity join plan into an efficient operator-based physical plan during query optimization by using a template expressed largely in the system’s user-level query language; this technique greatly simplifies the specification of such a transformation rule. We use Apache AsterixDB, a parallel Big Data management system, to illustrate and validate our techniques. We conduct an experimental study using several large, real datasets on a parallel computing cluster to assess the similarity query support. We also include experiments involving three other parallel systems and report the efficacy and performance results.  相似文献   

12.
Nearest-neighbor search of high-dimensionality spaces is critical for many applications, such as content-based retrieval from multimedia databases, similarity search of patterns in data mining, and nearest-neighbor classification. Unfortunately, even with the aid of the commonly used indexing schemes, the performance of nearest-neighbor (NN) queries deteriorates rapidly with the number of dimensions. We propose a method, called Clustering with Singular Value Decomposition (CSVD), which supports efficient approximate processing of NN queries, while maintaining good precision-recall characteristics. CSVD groups homogeneous points into clusters and separately reduces the dimensionality of each cluster using SVD. Cluster selection for NN queries relies on a branch-and-bound algorithm and within-cluster searches can be performed with traditional or in-memory indexing methods. Experiments with texture vectors extracted from satellite images show that CSVD achieves significantly higher dimensionality reduction than plain SVD for the same normalized mean squared error (NMSE), which translates into a higher efficiency in processing approximate NN queries.  相似文献   

13.
Approximation-Based Similarity Search for 3-D Surface Segments   总被引:1,自引:0,他引:1  
The issue of finding similar 3-D surface segments arises in many recent applications of spatial database systems, such as molecular biology, medical imaging, CAD, and geographic information systems. Surface segments being similar in shape to a given query segment are to be retrieved from the database. The two main questions are how to define shape similarity and how to efficiently execute similarity search queries. We propose a new similarity model based on shape approximation by multi-parametric surface functions that are adaptable to specific application domains. We then define shape similarity of two 3-D surface segments in terms of their mutual approximation errors. Applying the multi-step query processing paradigm, we propose algorithms to efficiently support complex similarity search queries in large spatial databases. A new query type, called the ellipsoid query, is utilized in the filter step. Ellipsoid queries, being specified by quadratic forms, represent a general concept for similarity search. Our major contribution is the introduction of efficient algorithms to perform ellipsoid queries on multidimensional index structures. Experimental results on a large 3-D protein database containing 94,000 surface segments demonstrate the successful application and the high performance of our method.  相似文献   

14.
Due to the resource limitation in the data stream environments, it has been reported that answering user queries according to the wavelet synopsis of a stream is an essential ability of a data stream management system (DSMS). In the literature, recent research has been elaborated upon minimizing the local error metric of an individual stream. However, many emergent applications such as stock marketing and sensor detection also call for the need of recording multiple streams in a commercial DSMS. As shown in our thorough analysis and experimental studies, minimizing global error in multiple-stream environments leads to good reliability for DSMS to answer the queries. In contrast, only minimizing local error may lead to a significant loss of query accuracy. As such, we first study in this paper the problem of maintaining the wavelet coefficients of multiple streams within collective memory so that the predetermined global error metric is minimized. Moreover, we also examine a promising application in the multistream environment, that is, the queries for top-k range sum. We resolve the problem of efficient top-k query processing with minimized global error by developing a general framework. For the purposes of maintaining the wavelet coefficients and processing top-k queries, several well- designed algorithms are utilized to optimize the performance of each primary component of this general framework. We also evaluate the proposed algorithms empirically on real and simulated data streams and show that our framework can process top-k queries accurately and efficiently.  相似文献   

15.
Querying time series data based on similarity   总被引:3,自引:0,他引:3  
We study similarity queries for time series data where similarity is defined, in a fairly general way, in terms of a distance function and a set of affine transformations on the Fourier series representation of a sequence. We identify a safe set of transformations supporting a wide variety of comparisons and show that this set is rich enough to formulate operations such as moving average and time scaling. We also show that queries expressed using safe transformations can efficiently be computed without prior knowledge of the transformations. We present a query processing algorithm that uses the underlying multidimensional index built over the data set to efficiently answer similarity queries. Our experiments show that the performance of this algorithm is competitive to that of processing ordinary (exact match) queries using the index, and much faster than sequential scanning. We propose a generalization of this algorithm for simultaneously handling multiple transformations at a time, and give experimental results on the performance of the generalized algorithm  相似文献   

16.
There is a growing interest in applications that utilize continuous sensing of individual activity or context, via sensors embedded or associated with personal mobile devices (e.g., smartphones). Reducing the energy overheads of sensor data acquisition and processing is essential to ensure the successful continuous operation of such applications, especially on battery-limited mobile devices. To achieve this goal, this paper presents a framework, called ACQUA, for ‘acquisition-cost’ aware continuous query processing. ACQUA replaces the current paradigm, where the data is typically streamed (pushed) from the sensors to the one or more smartphones, with a pull-based asynchronous model, where a smartphone retrieves appropriate blocks of relevant sensor data from individual sensors, as an integral part of the query evaluation process. We describe algorithms that dynamically optimize the sequence (for complex stream queries with conjunctive and disjunctive predicates) in which such sensor data streams are retrieved by the query evaluation component, based on a combination of (a) the communication cost & selectivity properties of individual sensor streams, and (b) the occurrence of the stream predicates in multiple concurrently executing queries. We also show how a transformation of a group of stream queries into a disjunctive normal form provides us with significantly greater degrees of freedom in choosing this sequence, in which individual sensor streams are retrieved and evaluated. While the algorithms can apply to a broad category of sensor-based applications, we specifically demonstrate their application to a scenario where multiple stream processing queries execute on a single smartphone, with the sensors transferring their data over an appropriate PAN technology, such as Bluetooth or IEEE 802.11. Extensive simulation experiments indicate that ACQUA’s intelligent batch-oriented data acquisition process can result in as much as 80 % reduction in the energy overhead of continuous query processing, without any loss in the fidelity of the processing logic.  相似文献   

17.
Configuration similarity is a special form of content-based image retrieval that considers relative object locations. It can be used as a standalone method, or to complement retrieval based on visual or semantic features. The corresponding queries ask for sets of objects that satisfy some spatio-temporal constraints, e.g., "find all triplets of objects (v/sub 1/, v/sub 2/, v/sub 3/), such that v/sub 1/ is northeast of v/sub 2/, which is inside v/sub 3/." Exhaustive processing (i.e., retrieval of the best solutions) of configuration similarity queries, in general, has exponential complexity and fast search for sub-optimal solutions is the only way to deal with the vast amounts of multimedia information in several real-time applications. In this paper we first discuss the utilization of nonsystematic search heuristics, based on genetic algorithms, simulated annealing and hill climbing approaches. An extensive experimentation with real and synthetic datasets reveals that hill climbing techniques are the best for the current problem; therefore, as a subsequent step we study the search space, and develop improved variations of hill climbing that take advantage of the special structure of the problem to enhance speed. The proposed heuristic methods significantly outperform systematic search when there is only limited time for query processing.  相似文献   

18.
数据流上的预测聚集查询处理算法   总被引:19,自引:3,他引:16  
实时数据流未来趋势的预测具有重要的实际应用意义.例如,在环境监测传感器网络中,通过对感知数据流进行预测聚集查询,观察者可以预测网络覆盖的区域在未来一段时间内的平均温度和湿度,以确定是否会发生异常事件.目前的研究工作多数集中在数据流上当前数据的查询,数据流上预测查询的研究工作还很少.采用多元线性回归方法,给出了数据流上的聚集值预测模型,提出了一种数据流预测聚集查询处理方法.当预测失败的次数大于预先给定的阈值时,给出了一种预测模型自动调整策略,以降低预测误差.还提出了滑动窗口的更新周期、数据流的流速对预测精度影响的数学模型.理论分析与实验结果表明,提出的预测聚集查询处理算法具有较高的性能,并且能够返回满足用户精度要求的预测查询结果.在实验中,采用TPC-H国际标准测试数据和TAO(tropical atmosphere ocean)测量的海洋表面空气温度数据来构造数据流.  相似文献   

19.
Object-oriented database management systems (OODBMSs) provide rich facilities for the modeling and processing of structural as well as behavioral properties of complex application objects. However, due to their inherent generality and continuously evolving functionalities, efficient implementations are important for these OODBMSs to support the present and future applications, particularly when the databases are very large. In this paper, we present several parallel, multi-wavefront algorithms based on two processing approaches, i.e., identification and elimination approaches, to verify association patterns specified in queries. Both approaches allow more processors to operate concurrently on a query than the traditional tree-structured query processing approach, thus introducing a higher degree of parallelism in query processing. A heuristic method is presented for partitioning an object-oriented database (OODB). The main consideration for partitioning the database is load balancing. This method also tries to reduce the communication time by reducing the length of the path that wavefronts need to be propagated. Multiple wavefront algorithms based on the two approaches for tree-structured queries have been implemented on an nCUBE 2 parallel computer. The implementation of the query processor allows multiple queries to be executed simultaneously. This implementation provides an environment for evaluating the algorithms and the heuristic method for partitioning the database. The evaluation results are presented in this paper.Recommended by: Patrick Valduriez  相似文献   

20.
Query processing for a data stream should also be continuous and rapid. This article proposes a novel approach for consistent collective evaluation of multiple continuous queries for filtering two different types of data streams: a relational stream and an XML stream. The proposed approach commonly provides region-based selection constructs: an attribute selection construct for relational queries and a path selection construct for XPath queries. Both collectively evaluate the selection predicates of the same attribute (path), based on the precomputed matching results of the queries in each of the disjoint regions divided by the selection predicates. The performance experiments show that the proposed approach is practically more efficient and stable than other approaches at run-time.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号