Similar literature
 20 similar documents found (search time: 503 ms)
1.
The similarity search problem has received considerable attention in the database research community. In sensor network applications, this problem is even more important because of the imprecision of sensor hardware and the variation of environmental parameters. Traditional similarity search mechanisms are both improper and inefficient for these highly energy-constrained sensors. The difficulty is that it is hard to predict which sensor holds the most similar (or closest) data item, so many or even all sensors need to send their data to the query node for further comparison. In this paper, we propose a similarity search algorithm (SSA), a novel framework based on the concept of the Hilbert curve over a data-centric storage structure, for efficiently processing similarity search queries in sensor networks. SSA avoids the need to collect data from all sensors in the network when searching for the most similar data item. The performance study reveals that this mechanism is highly efficient and significantly outperforms previous approaches in processing similarity search queries.
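The abstract does not spell out how SSA uses the Hilbert curve, so the following is only a minimal sketch of the underlying idea: mapping a two-dimensional grid cell to its position on a Hilbert curve so that cells close on the curve are also close in space. The function name `hilbert_index` and the grid-based setting are assumptions for illustration, not part of SSA.

```python
def hilbert_index(order, x, y):
    """Return the position of cell (x, y) along a Hilbert curve that
    fills a 2**order by 2**order grid (standard bitwise conversion)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Hypothetical usage: order the cells of an 8 x 8 grid along the curve.
cells = [(x, y) for x in range(8) for y in range(8)]
curve_order = sorted(cells, key=lambda c: hilbert_index(3, c[0], c[1]))
```

Consecutive cells in `curve_order` are always grid neighbours, which is the locality property that makes a one-dimensional curve position usable as a storage key for multi-dimensional readings.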

2.
Suffix trees and suffix arrays are fundamental full-text index data structures for solving problems in string processing. Since suffix trees and suffix arrays have different capabilities, some problems are solved more efficiently using suffix trees and others are solved more efficiently using suffix arrays. We consider efficient index data structures with the capabilities of both suffix trees and suffix arrays without requiring much space. When the size of an alphabet is small, enhanced suffix arrays are such index data structures. However, when the size of an alphabet is large, enhanced suffix arrays lose the power of suffix trees. Pattern searching in an enhanced suffix array takes O(m|Σ|) time while pattern searching in a suffix tree takes O(m log |Σ|) time, where m is the length of a pattern and Σ is an alphabet. In this paper, we present linearized suffix trees, which are efficient index data structures with the capabilities of both suffix trees and suffix arrays even when the size of an alphabet is large. A linearized suffix tree has all the functionalities of the enhanced suffix array and supports pattern search in O(m log |Σ|) time. From a different point of view, it can be considered a practical implementation of the suffix tree supporting O(m log |Σ|)-time pattern search. In addition, we present two efficient algorithms for computing suffix links on the enhanced suffix array and the linearized suffix tree. These are the first algorithms that run in O(n) time without using the range minima query. Our experimental results show that our algorithms are faster than the previous algorithms.
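For readers unfamiliar with suffix arrays, the sketch below shows the plain baseline only: a naively built suffix array and a binary search that finds all occurrences of a pattern with O(log n) prefix comparisons. It is neither the enhanced suffix array nor the linearized suffix tree of the abstract; the names `build_suffix_array` and `find_occurrences` are illustrative assumptions.

```python
def build_suffix_array(text):
    """Naive construction (sorting all suffixes); fine for small examples only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text using two
    binary searches over the suffix array."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "ana")
```

All suffixes that start with the pattern form one contiguous block of the suffix array, which is why two boundary searches are enough.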

3.
Range and nearest neighbor queries are the most common types of spatial queries, and they have been investigated extensively in recent decades because of their broad range of applications. In this paper, we study this problem in the context of fuzzy objects that have indeterminate boundaries. Fuzzy objects play an important role in many areas, such as biomedical image databases and GIS communities. Existing research on fuzzy objects mainly focuses on modeling basic fuzzy object types and operations, leaving the processing of more advanced queries largely untouched. In this paper, we propose two new kinds of spatial queries for fuzzy objects, namely the single threshold query and the continuous threshold query, to determine the query results which qualify at a certain probability threshold and within a probability interval, respectively. For efficient single threshold query processing, we optimize the classical R-tree-based search algorithm by deriving more accurate approximations for the distance function between fuzzy objects and the query object. To enhance the performance of continuous threshold queries, effective pruning rules are developed to reduce the search space and speed up the candidate refinement process. The efficiency of our proposed algorithms as well as the optimization techniques is verified with an extensive set of experiments using both synthetic and real datasets.

4.
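Li Lingli (黎玲利), Wang Hongzhi (王宏志), Gao Hong (高宏), Li Jianzhong (李建中). Journal of Software (《软件学报》), 2012, 23(6): 1561-1577
Keywords can be used to query XML data when the schema is unknown. In current keyword query processing over XML data streams, scoring functions often cannot satisfy the differing needs of all users. This paper proposes a skyline-based Top-K keyword query over XML data streams. For this kind of query, there is no need to consider the complex factors that affect the relevance of a result to the query; the skyline alone is used to select the results most relevant to the query. Two effective skyline-based Top-K keyword query processing algorithms over XML data streams are proposed, covering single-query and multi-query processing. The effectiveness and scalability of both algorithms are verified through extensive experiments. The experiments show that the efficiency of the proposed algorithms is almost unaffected by parameters such as the number of keywords, the number of query results, and the number of queries, and that running time is roughly linear in document size.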
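The skyline operator that this abstract relies on can be illustrated with a short sketch. It assumes each candidate result is described by a tuple of numeric criteria where smaller is better; the criteria themselves and the function names are hypothetical and are not taken from the paper, which works on XML streams rather than an in-memory list.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every criterion and strictly
    better in at least one (smaller values are better here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Return the points that no other point dominates (quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates scored on two criteria, e.g. (result size, keyword distance).
candidates = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
best = skyline(candidates)
```

Here `(3, 3)` and `(4, 4)` are dropped because `(2, 2)` is at least as good on both criteria, while the three remaining candidates are mutually incomparable, so no scoring function is needed to choose among them.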

5.
In this paper, we study the following problem. Given a database and a set of queries, we want to find a set of views that can compute the answers to the queries, such that the amount of space, in bytes, required to store the viewset is minimum on the given database. (We also handle problem instances where the input has a set of database instances, as described by an oracle that returns the sizes of view relations for given view definitions.) This problem is important for applications such as distributed databases, data warehousing, and data integration. We explore the decidability and complexity of the problem for workloads of conjunctive queries. We show that results differ significantly depending on whether the workload queries have self-joins. Further, for queries without self-joins we describe a very compact search space of views, which contains all views in at least one optimal viewset. We present techniques for finding a minimum-size viewset for a single query without self-joins by using the shape of the query and its constraints, and validate the approach by extensive experiments. Part of this article was published elsewhere [Chirkova, R., Li, C.: Materializing views with minimal size to answer queries. PODS (2003)]. In addition to the prior materials, this article contains new theoretical results, as well as new results on how to efficiently implement the proposed techniques (Sects. 5 and 5.4).

6.
Users are rarely familiar with the content of a data source they are querying, and therefore cannot avoid using keywords that do not exist in the data source. Traditional systems may respond with an empty result, causing dissatisfaction, while the data source in effect holds semantically related content. In this paper we study this no-but-semantic-match problem on XML keyword search and propose a solution which enables us to present the top-k semantically related results to the user. Our solution involves two steps: (a) extracting semantically related candidate queries from the original query and (b) processing candidate queries and retrieving the top-k semantically related results. Candidate queries are generated by replacement of non-mapped keywords with candidate keywords obtained from an ontological knowledge base. Candidate results are scored using their cohesiveness and their similarity to the original query. Since the number of queries to process can be large, with each result having to be analyzed, we propose pruning techniques to retrieve the top-k results efficiently. We develop two query processing algorithms based on our pruning techniques. Further, we exploit a property of the candidate queries to propose a technique for processing multiple queries in batch, which improves the performance substantially. Extensive experiments on two real datasets verify the effectiveness and efficiency of the proposed approaches.

7.
Approximation-Based Similarity Search for 3-D Surface Segments (total citations: 1; self-citations: 0; citations by others: 1)
The issue of finding similar 3-D surface segments arises in many recent applications of spatial database systems, such as molecular biology, medical imaging, CAD, and geographic information systems. Surface segments similar in shape to a given query segment are to be retrieved from the database. The two main questions are how to define shape similarity and how to efficiently execute similarity search queries. We propose a new similarity model based on shape approximation by multi-parametric surface functions that are adaptable to specific application domains. We then define shape similarity of two 3-D surface segments in terms of their mutual approximation errors. Applying the multi-step query processing paradigm, we propose algorithms to efficiently support complex similarity search queries in large spatial databases. A new query type, called the ellipsoid query, is utilized in the filter step. Ellipsoid queries, being specified by quadratic forms, represent a general concept for similarity search. Our major contribution is the introduction of efficient algorithms to perform ellipsoid queries on multidimensional index structures. Experimental results on a large 3-D protein database containing 94,000 surface segments demonstrate the successful application and the high performance of our method.
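An ellipsoid query specified by a quadratic form can be sketched as follows: a point p qualifies when (p − q)ᵀ A (p − q) ≤ ε² for a query point q and a matrix A. The linear scan below only illustrates the query predicate; the paper's contribution is evaluating it on multidimensional index structures, which is not shown. The function names and the choice of a symmetric positive-definite A given as nested lists are assumptions.

```python
def quadratic_form_distance_sq(p, q, A):
    """Squared quadratic-form distance (p - q)^T A (p - q)."""
    diff = [pi - qi for pi, qi in zip(p, q)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))


def ellipsoid_query(points, q, A, eps):
    """Return all points inside the ellipsoid centred at q (linear scan)."""
    return [p for p in points if quadratic_form_distance_sq(p, q, A) <= eps * eps]


# Hypothetical example: the second dimension is weighted four times as heavily.
A = [[1.0, 0.0], [0.0, 4.0]]
points = [(1.0, 0.0), (0.0, 1.0), (2.0, 0.0)]
hits = ellipsoid_query(points, (0.0, 0.0), A, 1.5)
```

With the identity matrix in place of `A` the query degenerates to an ordinary Euclidean range query, which is why the quadratic form is described as a general concept for similarity search.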

8.
Database systems are becoming increasingly popular for answering queries. Partial-match search queries are an important class of queries in such a system. Several storage structures have been proposed to answer these queries efficiently. The BD tree is an example of such a storage structure. A previous study indicated that the k-d tree performance is better than that of the BD tree for partial-match search queries. A recent paper reported some improved algorithms. However, it is unclear whether the improved algorithms show the BD tree in a favourable light for partial-match search queries. This paper explores the performance of these algorithms and compares their performance to that of the k-d tree. Since the BD tree construction process uses some heuristics to make it a better balanced tree, this paper also evaluates the effect of these heuristics on the partial-match search algorithms. The major conclusions of this study are that the BD tree performance for partial-match search is better than that of the k-d tree when an improved algorithm is used for partial-match search, and that only the DZ expression rearrangement heuristic has a substantial effect on partial-match search performance.
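To make the notion of a partial-match query concrete, here is a minimal sketch on the k-d tree side of the comparison only; the BD tree and its heuristics are not shown. A query fixes some attributes and leaves others unspecified (`None`). The dictionary-based node layout and the function names are assumptions for illustration.

```python
def build_kd_tree(points, depth=0):
    """Build a k-d tree by median split, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kd_tree(pts[:mid], depth + 1),       # values <= split value
        "right": build_kd_tree(pts[mid + 1:], depth + 1),  # values >= split value
    }


def partial_match(node, query, out=None):
    """Collect points that equal the query on every specified attribute."""
    if out is None:
        out = []
    if node is None:
        return out
    point, axis = node["point"], node["axis"]
    if all(q is None or q == v for q, v in zip(query, point)):
        out.append(point)
    q = query[axis]
    # Unspecified attribute: both subtrees may hold matches.
    # Specified attribute: prune the side that cannot contain the value.
    if q is None or q <= point[axis]:
        partial_match(node["left"], query, out)
    if q is None or q >= point[axis]:
        partial_match(node["right"], query, out)
    return out


tree = build_kd_tree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), (5, 9)])
matches = partial_match(tree, (5, None))
```

Pruning happens only at levels whose split attribute is specified in the query, so the cost of a partial-match search depends on how many attributes are fixed.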

9.
The multisearch problem is defined as follows. Given a data structure D modeled as a graph with n constant-degree nodes, perform O(n) searches on D. Let r be the length of the longest search path associated with a search process, and assume that the paths are determined "on-line." That is, the search paths may overlap arbitrarily. In this paper, we solve the multisearch problem for certain classes of graphs in O([formula] + r ([formula]/log n)) time on a [formula] × [formula]n mesh-connected computer. For many data structures, the search path traversed when answering one search query has length r = O(log n). For these cases, our algorithm processes O(n) such queries in asymptotically optimal Θ([formula]) time. The classes of graphs we consider contain many of the important data structures that arise in practice, ranging from simple trees to Kirkpatrick hierarchical search DAGs. Multisearch is a useful abstraction that can be used to implement parallel versions of standard sequential data structures on a mesh. As example applications, we consider a variety of parallel on-line tree traversals, as well as hierarchical representations of polyhedra and their many applications (line-polyhedron intersection queries, multiple tangent plane determination, intersecting convex polyhedra, and three-dimensional convex hull).

10.
Distributed moving object database servers offer a feasible solution to the scalability problems of centralized database systems. In these potentially large-scale systems, querying the time-varying location of specific moving objects can be particularly expensive in terms of running time. This work proposes a meta-index-based strategy that can significantly speed up the processing of these queries. The meta-index acts as an entry point for spatio-temporal queries and quickly drives the search process to the database servers that contain solutions. It also enables very fast approximate solutions to queries such as top-k and spatio-temporal range queries.

11.
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance. Work supported by NSF CAREER Award CCR-0093400 and the New York State Center for Advanced Technology in Telecommunications (CATT) at Polytechnic University.
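The three cache levels described above (query results, pairwise list intersections, inverted lists) can be sketched in a few lines. This is a toy model under strong assumptions: the caches are unbounded dictionaries, so the admission and eviction policies that the abstract calls crucial are deliberately left out, and the class and method names are hypothetical.

```python
class ThreeLevelCache:
    """Toy conjunctive-query processor with result, intersection and list caches."""

    def __init__(self, inverted_index):
        self.index = inverted_index      # term -> sorted doc ids (stands in for disk)
        self.result_cache = {}           # level 1: whole-query results
        self.intersection_cache = {}     # level 2: intersections of term pairs
        self.list_cache = {}             # level 3: inverted lists

    def _postings(self, term):
        if term not in self.list_cache:
            self.list_cache[term] = self.index.get(term, [])
        return self.list_cache[term]

    def _pair(self, a, b):
        key = (a, b) if a <= b else (b, a)
        if key not in self.intersection_cache:
            second = set(self._postings(key[1]))
            self.intersection_cache[key] = [d for d in self._postings(key[0]) if d in second]
        return self.intersection_cache[key]

    def query(self, terms):
        key = tuple(sorted(set(terms)))
        if key in self.result_cache:
            return self.result_cache[key]
        if not key:
            result = []
        elif len(key) == 1:
            result = list(self._postings(key[0]))
        else:
            # Start from a (possibly cached) pair intersection, then filter.
            result = self._pair(key[0], key[1])
            for term in key[2:]:
                allowed = set(self._postings(term))
                result = [d for d in result if d in allowed]
        self.result_cache[key] = result
        return result


engine = ThreeLevelCache({"a": [1, 2, 3, 5], "b": [2, 3, 7], "c": [3, 5, 8]})
first = engine.query(["a", "b"])
second = engine.query(["b", "a", "c"])   # reuses the cached ("a", "b") intersection
```

The second query is not identical to the first, so the result cache misses, but the intermediate level still saves the work of intersecting the lists for "a" and "b" again.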

12.
Peer-to-peer (P2P) technology provides a popular way of distributing, sharing, and locating resources in a large-scale distributed environment. However, most existing P2P systems only support queries over a single resource attribute, such as file name. Current multiple-attribute resource search methods often incur high maintenance cost and lack resilience to the highly dynamic environment of P2P networks. In this paper, we propose a Flabellate overlAy Network (FAN), a scalable and structured underlying P2P overlay supporting resource queries over multi-dimensional attributes. In FAN, the resources are mapped into a multi-dimensional Cartesian space based on the consistent hash values of the resource attributes. The mapping space is divided into non-overlapping and continuous subspaces based on the peer’s distance. This paper presents strategies for managing the extended adjacent subspaces, which is crucial to network maintenance and resource search in FAN. The algorithms for basic resource search and range query over FAN are also presented in this paper. To alleviate the load of the hot nodes, a virtual replica network (VRN) consisting of the nodes with the same replicas is proposed for replicating popular resources adaptively. The queries can be forwarded from the heavily loaded nodes to the lightly loaded ones through VRN. Theoretical analysis and experimental results show that FAN has a higher routing efficiency and lower network maintenance cost than the existing multi-attribute search methods. Also, VRN efficiently balances the network load and reduces the querying delay in FAN while incurring a relatively low overhead.

13.
A novel index structure based on the generalized suffix tree (PIGST) is proposed. Combined with post lists, PIGST can answer both structural and content queries. The distinct paths in an XML collection are mapped into strings. The construction algorithm of the PIGST for the path strings is based on a modified and improved version of a well-known suffix tree construction algorithm that requires only linear time and space. The query process needs only m character comparisons for direct containment queries, where m is the length of a query string. An efficient processing method for indirect containment queries that avoids the inefficient tree traversal operation is also presented. Experiments show that PIGST outperforms earlier approaches.

14.
Moritz G. Maaß. Software, 2006, 36(3): 305-331
We present new algorithms for computing matching statistics with suffix arrays. We show how the Multiple Common Substring Problem can be solved efficiently in practice with a new approach using matching statistics. This problem consists of finding the common substrings of a set of strings. For the computation of matching statistics we compare seven different methods based on suffix trees and suffix arrays. Most of the suffix array algorithms have an inferior asymptotic worst case running time but a very low memory overhead and small constants in the running time complexity. Our experiments show a good performance in practice. Copyright © 2005 John Wiley & Sons, Ltd.
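The abstract does not define matching statistics, so the sketch below assumes the usual definition: for each position i of a string s, the length of the longest prefix of s[i:] that occurs somewhere in a second string t. The quadratic brute-force version here is for illustration only and is unrelated to the seven suffix tree and suffix array methods the paper compares; the function name is an assumption.

```python
def matching_statistics(s, t):
    """ms[i] = length of the longest prefix of s[i:] that is a substring of t.
    Brute force via Python's substring test; use only on small inputs."""
    ms = []
    for i in range(len(s)):
        length = 0
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms


stats = matching_statistics("abcab", "cabx")
```

The largest entry of `stats` gives the length of a longest common substring of the two strings, which hints at how matching statistics lead to common-substring problems.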

15.
Querying time series data based on similarity (total citations: 3; self-citations: 0; citations by others: 3)
We study similarity queries for time series data where similarity is defined, in a fairly general way, in terms of a distance function and a set of affine transformations on the Fourier series representation of a sequence. We identify a safe set of transformations supporting a wide variety of comparisons and show that this set is rich enough to formulate operations such as moving average and time scaling. We also show that queries expressed using safe transformations can efficiently be computed without prior knowledge of the transformations. We present a query processing algorithm that uses the underlying multidimensional index built over the data set to efficiently answer similarity queries. Our experiments show that the performance of this algorithm is competitive with that of processing ordinary (exact match) queries using the index, and much faster than sequential scanning. We propose a generalization of this algorithm for handling multiple transformations simultaneously, and give experimental results on the performance of the generalized algorithm.

16.
Graphs are widely used for modeling complicated data such as social networks, chemical compounds, protein interactions and the semantic web. To effectively understand and utilize any collection of graphs, a graph database that efficiently supports elementary querying mechanisms is crucially required. For example, Subgraph and Supergraph queries are important types of graph queries which have many applications in practice. A primary challenge in computing the answers of graph queries is that pair-wise comparisons of graphs are usually hard problems. Relational database management systems (RDBMSs) have repeatedly been shown to be able to efficiently host different types of data such as complex objects and XML data. RDBMSs derive much of their performance from sophisticated optimizer components which make use of physical properties that are specific to the relational model such as sortedness, proper join ordering and powerful indexing mechanisms. In this article, we study the problem of indexing and querying graph databases using the relational infrastructure. We present a purely relational framework for processing graph queries. This framework relies on building a layer of graph features knowledge which captures metadata and summary features of the underlying graph database. We describe different querying mechanisms which make use of the layer of graph features knowledge to achieve scalable performance for processing graph queries. Finally, we conduct an extensive set of experiments on real and synthetic datasets to demonstrate the efficiency and the scalability of our techniques.

17.
An XML publish/subscribe system needs to filter a large number of queries over XML streams. Most existing systems only consider filtering simple XPath statements. In this paper, we focus on filtering the more complex Generalized-Tree-Pattern (GTP) queries. Our filtering mechanism is based on a novel Tree-of-Path (TOP) encoding scheme, which compactly represents the path matches for the entire document. First, we show that the TOP encodings can be efficiently produced via a shared bottom-up path matching. Second, with the aid of this TOP encoding, we can 1) achieve polynomial time and space complexity for post processing, 2) avoid redundant predicate evaluations, 3) allow an efficient duplicate-free and merge join-based algorithm for merging multiple encoded path matches and 4) simplify the processing of GTP queries. Overall, our approach maximizes the sharing opportunity across queries by exploiting suffix as well as prefix sharing. At the same time, our TOP encodings allow efficient post processing for GTP queries. Extensive performance studies show that our GFilter solution not only achieves significantly better filtering performance than state-of-the-art algorithms, but also is capable of efficiently filtering the more complex GTP queries.

18.
K.  Wen-Syan  M.  Data & Knowledge Engineering, 2000, 35(3): 259-298
Since media-based evaluation yields similarity values, the result of a multimedia database query, Q(Y1,…,Yn), is defined as an ordered list SQ of n-tuples of the form ⟨X1,…,Xn⟩. The query Q itself is composed of a set of fuzzy and crisp predicates, constants, variables, and conjunction, disjunction, and negation operators. Since many multimedia applications require partial matches, SQ includes results which do not satisfy all predicates. Due to the ranking and partial match requirements, traditional query processing techniques do not apply to multimedia databases. In this paper, we first focus on the problem of “given a multimedia query which consists of multiple fuzzy and crisp predicates, providing the user with a meaningful final ranking”. More specifically, we study the problem of merging similarity values in queries with multiple fuzzy predicates. We describe the essential multimedia retrieval semantics, compare these with the known approaches, and propose a semantics which captures the requirements of the multimedia retrieval problem. We then build on these results in answering the related problem of “given a multimedia query which consists of multiple fuzzy and crisp predicates, finding an efficient way to process the query.” We develop an algorithm to efficiently process queries with unordered fuzzy predicates (sub-queries). Although this algorithm can work with different fuzzy semantics, it benefits from the statistical properties of the semantics proposed in this paper. We also present experimental results for evaluating the proposed algorithm in terms of quality of results and search space reduction.

19.
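Xu Linhao (徐林昊), Qian Weining (钱卫宁), Zhou Aoying (周傲英). Journal of Software (《软件学报》), 2007, 18(6): 1443-1455
An important problem in data management for peer-to-peer computing is how to effectively support similarity search over multi-dimensional data spaces. Existing unstructured peer-to-peer data sharing systems support only a simple query processing method, namely match queries. By combining approximation techniques with routing indices, a simple and effective index structure, EVARI (extended approximate vector routing index), is designed. With EVARI, each node can not only process range queries on its locally shared data set but also forward queries to the neighbor nodes most likely to return results. To build EVARI, each node uses space-partitioning techniques to summarize its locally shared content and exchanges summary information with its neighbors. In addition, each node can reconfigure its neighbors so that related nodes are located close to one another, which optimizes the allocation of system resources and improves system performance. Simulation experiments demonstrate the good performance of the method.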

20.
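R-tree-based direction relation query processing (total citations: 8; self-citations: 1; citations by others: 8)
Xiao Yuqin (肖予钦), Zhang Ju (张巨), Jing Ning (景宁), Li Jun (李军). Journal of Software (《软件学报》), 2004, 15(1): 103-111
Direction relations describe the spatial ordering between objects. In recent years, direction relation query processing has attracted growing attention from researchers in spatial database application areas such as spatial data mining and geographic information systems. Direction relation query processing requires a direction join operation, yet current research on spatial joins concentrates mainly on topological and distance relations and rarely considers direction relations. This paper studies R-tree-based direction relation query processing: it defines a four-tuple model to represent the direction relation between object MBRs, proposes an R-tree-based method for the filter step of direction relation queries, and refines the refinement step into three different operations. The proposed method can efficiently process direction relation queries between arbitrary objects. Because direction relation queries in spatial data mining are usually performed between objects that satisfy a certain distance constraint, a query processing algorithm is also proposed that uses both direction and distance constraints to restrict the R-tree search space. Experiments show that, compared with direction relation query processing that does not use the R-tree, the proposed method performs very well in both I/O cost and CPU cost.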
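The four-tuple model of the paper is not given in the abstract, so the following is only a rough sketch of how a direction relation between two minimum bounding rectangles (MBRs) might be tested in a filter step. It assumes axis-aligned MBRs written as `(xmin, ymin, xmax, ymax)` and a simple projection-based classification; the function name and the returned labels are hypothetical.

```python
def mbr_direction(ref, target):
    """Classify where target lies relative to ref using MBR projections.
    Returns a combination of N/S and E/W, or "overlap" when the
    projections overlap on both axes (assumed convention)."""
    if target[1] >= ref[3]:
        north_south = "N"        # target lies entirely above ref
    elif target[3] <= ref[1]:
        north_south = "S"        # target lies entirely below ref
    else:
        north_south = ""
    if target[0] >= ref[2]:
        east_west = "E"          # target lies entirely to the right of ref
    elif target[2] <= ref[0]:
        east_west = "W"          # target lies entirely to the left of ref
    else:
        east_west = ""
    return (north_south + east_west) or "overlap"


reference = (0, 0, 2, 2)
label = mbr_direction(reference, (3, 3, 4, 4))
```

Because the test uses only rectangle coordinates, it is cheap enough to apply to R-tree entries during filtering, with exact object geometry examined only for the candidates that survive.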
