首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 156 毫秒
1.
The ubiquity of location enabled devices has resulted in a wide proliferation of location based applications and services. To handle the growing scale, database management systems driving such location based services (LBS) must cope with high insert rates for location updates of millions of devices, while supporting efficient real-time analysis on latest location. Traditional DBMSs, equipped with multi-dimensional index structures, can efficiently handle spatio-temporal data. However, popular open-source relational database systems are overwhelmed by the high insertion rates, real-time querying requirements, and terabytes of data that these systems must handle. On the other hand, key-value stores can effectively support large scale operation, but do not natively provide multi-attribute accesses needed to support the rich querying functionality essential for the LBSs. We present the design and implementation of $\mathcal {MD}$ -HBase, a scalable data management infrastructure for LBSs that bridges this gap between scale and functionality. Our approach leverages a multi-dimensional index structure layered over a key-value store. The underlying key-value store allows the system to sustain high insert throughput and large data volumes, while ensuring fault-tolerance, and high availability. On the other hand, the index layer allows efficient multi-dimensional query processing. Our optimized query processing technique accesses only the index and storage level entries that intersect with the query region, thus ensuring efficient query processing. We present the design of $\mathcal {MD}$ -HBase that demonstrates how two standard index structures—the K-d tree and the Quad tree—can be layered over a range partitioned key-value store to provide scalable multi-dimensional data infrastructure. Our prototype implementation using HBase, a standard open-source key-value store, can handle hundreds of thousands of inserts per second using a modest 16 node cluster, while efficiently processing multi-dimensional range queries and nearest neighbor queries in real-time with response times as low as few hundreds of milliseconds.  相似文献   

2.
3.
预测性连续时空区域查询在用户指定的时间范围期间持续地返回给定未来查询时间范围期间将出现在查询区域的移动对象。论文提出了一种预测性连续时空区域查询处理方法,设计了支持连续查询处理的两种索引结构。移动对象索引用于记录移动对象不断更新的位置信息,它用于支持查询的首次处理。连续查询索引结构用于记录所有查询结果可能受到移动对象位置变化影响的连续查询,它用于支持连续查询处理。实验表明,论文提出的方法能够有效地提高处理大量连续查询的效率。  相似文献   

4.
Recently there has been a considerable increase in the number of different Key-Value stores, for supporting data storage and applications on the cloud environment. While all these solutions try to offer highly available and scalable services on the cloud, they are significantly different with each other in terms of the architecture and types of the applications, they try to support. Considering three widely-used such systems: Cassandra, HBase and Voldemort; in this paper we compare them in terms of their support for different types of query workloads. We are mainly focused on the range queries. Unlike HBase and Cassandra that have built-in support for range queries, Voldemort does not support this type of queries via its available API. For this matter, practical techniques are presented on top of Voldemort to support range queries. Our performance evaluation is based on mixed query workloads, in the sense that they contain a combination of short and long range queries, beside other types of typical queries on key-value stores such as lookup and update. We show that there are trade-offs in the performance of the selected system and scheme, and the types of the query workloads that can be processed efficiently.  相似文献   

5.
由于数据仓库中存储着不同粒度、容量巨大的数据记录,所以如何有效地执行联机分析处理(OLAP)查询操作,特别是连接和聚集操作,便成为数据仓库领域的核心问题之一.为此,提出了一种降低连接和聚集操作的新算法(join and aggregation based on the complex multi-dimensional hierarchies,JACMDH).算法充分考虑了复杂多维层次的特点,在原有的位图连接索引(bitmap join index)的基础上,采用层次联合代理(hierarchy combined surrogate)和预先分组排序的方法,使得复杂的多维层次上的连接和聚集操作转化成事实表上的区域查询,从而在处理多维层次聚集的同时,提高了连接和聚集的效率.算法性能分析和实验数据表明,JACMDH算法和目前流行的算法相比,其性能有显著的提高.  相似文献   

6.
Indexing is one of the most important techniques to facilitate query processing over a multi-dimensional dataset. A commonly used strategy for such indexing is to keep the tree-structured index balanced. This strategy reduces query processing cost in the worst case, and can handle all different queries equally well. In other words, this strategy implies that all queries are uniformly issued, which is partially because the query distribution is not possibly known and will change over time in practice. A key issue we study in this work is whether it is the best to fully rely on a balanced tree-structured index in particular when datasets become larger and larger in the big data era. This means that, when a dataset becomes very large, it becomes unreasonable to assume that all data in any subspace are equally important and are uniformly accessed by all queries at the index level. Given the existence of query skew and the possible changes of query skew, in this paper, we study how to handle such query skew and such query skew changes at the index level without sacrifice of supporting any possible queries in a wellbalanced tree index and without a high overhead. To tackle the issue, we propose index-view at the index level, where an index-view is a short-cut in a balanced tree-structured index to access objects in the subspaces that are more frequently accessed, and propose a new index-view-centric framework for query processing using index-views in a bottom-up manner. We study index-views selection problem in both static and dynamic setting, and we confirm the effectiveness of our approach using large real and synthetic datasets.  相似文献   

7.
徐林昊  钱卫宁  周傲英 《软件学报》2007,18(6):1443-1455
对等计算数据管理中的一个重要问题是如何有效地支持多维数据空间上的相似性搜索.现有的非结构化对等计算数据共享系统仅支持简单的查询处理方法,即匹配查询处理.将近似技术和路由索引结合在一起,设计了一种简单、有效的索引结构EVARI(扩展近似向量路由索引).利用EVARI,每个节点不仅可以在本地共享的数据集上处理范围查询,而且还可以将查询转发给最有希望获得查询结果的邻居节点.为了建立EVARI,每个节点使用空间划分技术概括本地的共享内容,并与邻居节点交换概要信息.而且,每个节点都可以重新配置自己的邻居节点,使得相关节点位置相互邻近,优化了系统资源配置,提升了系统性能.仿真实验证明了该方法的良好性能.  相似文献   

8.
Efficient Distributed Skyline Queries for Mobile Applications   总被引:3,自引:0,他引:3       下载免费PDF全文
In this paper, we consider skyline queries in a mobile and distributed environment, where data objects are distributed in some sites (database servers) which are interconnected through a high-speed wired network, and queries are issued by mobile units (laptop, cell phone, etc.) which access the data objects of database servers by wireless channels. The inherent properties of mobile computing environment such as mobility, limited wireless bandwidth, frequent disconnection, make skyline queries more complicated. We show how to efficiently perform distributed skyline queries in a mobile environment and propose a skyline query processing approach, called efficient distributed skyline based on mobile computing (EDS-MC). In EDS-MC, a distributed skyline query is decomposed into five processing phases and each phase is elaborately designed in order to reduce the network communication, network delay and query response time. We conduct extensive experiments in a simulated mobile database system, and the experimental results demonstrate the superiority of EDS-MC over other skyline query processing techniques on mobile computing.  相似文献   

9.
何龙  陈晋川  杜小勇 《软件学报》2017,28(3):502-513
SOH(SQL over HDFS)系统通常将数据存储于分布式文件系统HDFS中,采用Map/Reduce或分布式查询引擎来处理查询任务。得益于HDFS以及Map/Reduce的容错能力和可扩展性,SOH系统可以很好地应对数据规模的飞速增长,完成分析型查询处理。然而,在处理选择型查询或交互式查询时,这类系统暴露出性能上的缺陷。本文提出一个通用的索引技术,可以应用于SOH系统中,以提高其查询处理的效率。分析了SOH系统访问HDFS文件的过程,指出了其中影响数据加载时间的关键因素;提出了split层和split内部双层索引机制;设计并实现了聚集索引和非聚集索引。最后,在标准数据集上进行了大量实验,并与现有基于HDFS的索引技术进行了比较。实验结果表明,所提出的索引技术可以有效地提高查询处理的效率。  相似文献   

10.
易佳  薛晨  王树鹏 《计算机科学》2017,44(5):172-177
分布式流查询是一种基于数据流的实时查询计算方法,近年来得到了广泛的关注和快速发展。综述了分布式流处理框架在实时关系型查询上取得的研究成果;对涉及分布式数据加载、分布式流计算框架、分布式流查询的产品进行了分析和比较;提出了基于Spark Streaming和Apache Kafka构建的分布式流查询模型,以并发加载多个文件源的形式,设计内存文件系统实现数据的快速加载,相较于基于Apache Flume的加载技术提速1倍以上。在Spark Streaming的基础上,实现了基于Spark SQL的分布式流查询接口,并提出了自行编码解析SQL语句的方法,实现了分布式查询。测试结果表明,在查询语句复杂的情况下,自行编码解析SQL的查询效率具有明显的优势。  相似文献   

11.
分布式NoSQL系统旨在提供大规模数据的高可用性,但缺乏内在的支持复杂查询的应用程序。传统的基于单一词汇倒排表的解决方案未达到良好的效果。因此,文中就文档型数据库在处理动态文档集时不支持多键作为主索引的缺点展开研究,提出了一种改进的组合索引方法。通过存储组合条件的倒列表,查询驱动机制可以从最近的查询记录中自适应地存储比较受欢迎的条件组合。该方法可以降低整体的带宽消耗,只需占用较少的存储资源等额外开销,明显改善了NoSQL系统的容量和响应时间。  相似文献   

12.
Approximate query processing using wavelets   总被引:7,自引:0,他引:7  
Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in its query processing scope, typically focusing on specific forms of aggregate queries. Furthermore, conventional approaches based on sampling or histograms appear to be inherently limited when it comes to approximating the results of complex queries over high-dimensional DSS data sets. In this paper, we propose the use of multi-dimensional wavelets as an effective tool for general-purpose approximate query processing in modern, high-dimensional applications. Our approach is based on building wavelet-coefficient synopses of the data and using these synopses to provide approximate answers to queries. We develop novel query processing algorithms that operate directly on the wavelet-coefficient synopses of relational tables, allowing us to process arbitrarily complex queries entirely in the wavelet-coefficient domain. This guarantees extremely fast response times since our approximate query execution engine can do the bulk of its processing over compact sets of wavelet coefficients, essentially postponing the expansion into relational tuples until the end-result of the query. We also propose a novel wavelet decomposition algorithm that can build these synopses in an I/O-efficient manner. Finally, we conduct an extensive experimental study with synthetic as well as real-life data sets to determine the effectiveness of our wavelet-based approach compared to sampling and histograms. Our results demonstrate that our techniques: (1) provide approximate answers of better quality than either sampling or histograms; (2) offer query execution-time speedups of more than two orders of magnitude; and (3) guarantee extremely fast synopsis construction times that scale linearly with the size of the data. Received: 7 August 2000 / Accepted: 1 April 2001 Published online: 7 June 2001  相似文献   

13.
Order-preserving submatrix (OPSM) has become important in modelling biologically meaningful subspace cluster, capturing the general tendency of gene expressions across a subset of conditions. With the advance of microarray and analysis techniques, big volume of gene expression datasets and OPSM mining results are produced. OPSM query can efficiently retrieve relevant OPSMs from the huge amount of OPSM datasets. However, improving OPSM query relevancy remains a difficult task in real life exploratory data analysis processing. First, it is hard to capture subjective interestingness aspects, e.g., the analyst’s expectation given her/his domain knowledge. Second, when these expectations can be declaratively specified, it is still challenging to use them during the computational process of OPSM queries. With the best of our knowledge, existing methods mainly focus on batch OPSM mining, while few works involve OPSM query. To solve the above problems, the paper proposes two constrained OPSM query methods, which exploit userdefined constraints to search relevant results from two kinds of indices introduced. In this paper, extensive experiments are conducted on real datasets, and experiment results demonstrate that the multi-dimension index (cIndex) and enumerating sequence index (esIndex) based queries have better performance than brute force search.  相似文献   

14.
Due to the recent massive data generation, preference queries are becoming an increasingly important for users because such queries retrieve only a small number of preferable data objects from a huge multi-dimensional dataset. A top-k dominating query, which retrieves the k data objects dominating the highest number of data objects in a given dataset, is particularly important in supporting multi-criteria decision making because this query can find interesting data objects in an intuitive way exploiting the advantages of top-k and skyline queries. Although efficient algorithms for top-k dominating queries have been studied over centralized databases, there are no studies which deal with top-k dominating queries in distributed environments. The recent data management is becoming increasingly distributed, so it is necessary to support processing of top-k dominating queries in distributed environments. In this paper, we address, for the first time, the challenging problem of processing top-k dominating queries in distributed networks and propose a method for efficient top-k dominating data retrieval, which avoids redundant communication cost and latency. Furthermore, we also propose an approximate version of our proposed method, which further reduces communication cost. Extensive experiments on both synthetic and real data have demonstrated the efficiency and effectiveness of our proposed methods.  相似文献   

15.
An elastic and highly available data store is a key component of many cloud applications. Existing data stores with strong consistency guarantees are designed and optimized for small updates, key-value access, and (if supported) small range queries over a predefined key column. This raises performance and availability problems for applications which inherently require large updates, non-key access, and large range queries. This paper presents a solution to these problems: Crescando/RB; a distributed, scan-based, main memory, relational data store (single table) with robust performance and high availability. The system addresses a real, large-scale industry use case: the Amadeus travel management system. This paper focuses on the distribution layer of Crescando/RB, the problem and theory behind it, the rationale underlying key design decisions, and the novel multicast protocol and replication framework it is composed of. Highlighting the key features of the distribution layer, we present experimental results showing that even under permanent node failures and large-scale data repartitioning, Crescando/RB remains fully available and capable of sustaining a heavy query and update load.  相似文献   

16.
17.
As RDF data continue to gain popularity, we witness the fast growing trend of RDF datasets in both the number of RDF repositories and the size of RDF datasets. Many known RDF datasets contain billions of RDF triples (subject, predicate and object). One of the grant challenges for managing these huge RDF data is how to execute RDF queries efficiently. In this paper, we address the query processing problems against the billion triple challenges. We first identify some causes for the problems of existing query optimization schemes, such as large intermediate results, initial query cost estimation errors. Then, we present our block-oriented dynamic query plan generation approach powered with pipelining execution. Our approach consists of two phases. In the first phase, a near-optimal execution plan for queries is chosen by identifying the processing blocks of queries. We group the join patterns sharing a join variable into building blocks of the query plan since executing them first provides opportunities to reduce the size of intermediate results generated. In the second phase, we further optimize the initial pipelining for a given query plan. We employ optimization techniques, such as sideways information passing and semi-join, to further reduce the size of intermediate results, improve the query processing cost estimation and speed up the performance of query execution. Experimental results on several RDF datasets of over a billion triples demonstrate that our approach outperforms existing RDF query engines that rely on dynamic programming based static query processing strategies.  相似文献   

18.
近年来,图的可达性查询已经成为一个研究热点。传统的可达性查询算法——GRAIL在处理k步可达性查询时具有较高的查询效率,但不适合处理不同分支顶点之间的k步可达性查询。为了解决上述问题,提出了一种新的双向双区间标签索引,进而实现了RE-GRAIL算法,从而有效解决了k步可达性查询问题。最后,在5个不同特征的数据集上进行实验,并从索引构建时间、索引大小、查询时间、扩展性4个方面进行验证。 实验结果表明,与众多同类算法相比,RE-GRAIL算法具有更好的性能。  相似文献   

19.
针对MapReduce数据块处理机制、高维数据分布特征和KNN查询需求,本文设计一种基于B 树的高维索引结构(iPartition),创新性提出基于主成分区分度的优化数据划分策略和邻接数据域分散存储等原则,将数据均匀划分到不同的Slave节点,使尽可能多的数据域对计算共同贡献,提升MapReduce任务处理并行性;利用B 树构造分布式的双层索引实现查询时数据范围快速过滤,降低高维计算代价。实验表明,iPartition在高维数据近似查询环境下,具有良好的性能和扩展性。  相似文献   

20.
一种支持多维数据范围查询的对等计算索引框架   总被引:1,自引:0,他引:1  
如何有效地支持多维数据范围查询是传统数据管理领域的研究热点之一.但是,在大规模分布式系统中,这仍然是一个具有挑战性的研究工作.VBI-tree是一个对等计算环境下基于平衡树的索引架构,在该架构上可以实现集中式环境下的多种支持多维数据索引的层次化树结构,例如R-tree,X-tree和M-tree等.VBI-tree设计的查询算法保证查询可以从树的任意位置开始,而不是像集中式环境下层次化树结构那样采用从树的根节点开始查询的方法,从而成功地避免了根节点引起的系统性能瓶颈问题.对于有N个节点的网络,索引方法可以保证查询效率是O(log N).VBI-tree提出了基于AVL-tree旋转的网络重构负载均衡策略可以有效地均衡负栽.另外,在数据操作频繁的情况下,为了提高索引的性能,在VBI-tree上建立特殊的祖先-子孙链接形成VBI-tree的结构.通过使用祖先-子孙链接,可保证对于相关查询区域的探索尽量发生在同层节点之间,而不是一直往根节点方向发送,从而减轻上层节点的查询负担,并且显著地降低了更新代价.模拟实验验证了提出的方法的有效性.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号