Similar literature
 20 similar documents found (search time: 31 ms)
1.
With the advent of cloud computing and big data, a number of key-value databases have emerged to cope with vast amounts of data. These systems provide large-scale data storage and efficient data operations based on primary keys, but they do not efficiently support range and k-Nearest Neighbor (kNN) queries on multi-dimensional datasets. In this paper, we introduce SPIKE, a sliced Pyramid-based index system for key-value data stores. SPIKE bridges the gap between data scale and querying functionality for highly available, scalable distributed key-value data stores. We first present SP-Index, the kernel indexing scheme. The SP-Index is designed as a two-level index mechanism consisting of a sliced pyramid space partition index and a distributed B-Tree index. On the basis of SP-Index, we have designed and implemented SPIKE on Cassandra, which provides efficient multi-dimensional complex query processing. We have conducted a set of comprehensive experiments with three types of datasets: synthetic datasets, TPC-H benchmark datasets and a real-world dataset. The experimental results show that SPIKE can efficiently handle multi-dimensional complex queries on large-scale key-value datasets. Evaluation results in comparison with existing systems demonstrate that SPIKE outperforms the compared systems, including the original Pyramid, MySQL Cluster and CCIndex, by dozens of times in complex query processing.
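As a rough illustration of the pyramid-based partitioning that SP-Index builds on, the sketch below computes the one-dimensional key of the original Pyramid technique for a point in the unit hypercube. It is not the SP-Index itself: the slicing step and the distributed B-Tree level are not described in the abstract and are omitted here, and the function name `pyramid_value` is hypothetical.

```python
from typing import Sequence


def pyramid_value(point: Sequence[float]) -> float:
    """Map a point in [0, 1]^d to a 1-D key (original Pyramid technique).

    The unit hypercube is split into 2*d pyramids whose tips meet at the
    centre (0.5, ..., 0.5). The key is the pyramid number plus the point's
    height, i.e. its distance from the centre along the dominant dimension.
    """
    d = len(point)
    if d == 0:
        raise ValueError("point must have at least one dimension")
    # Dimension in which the point lies furthest from the centre.
    j_max = max(range(d), key=lambda j: abs(0.5 - point[j]))
    height = abs(0.5 - point[j_max])
    # Pyramids 0..d-1 cover the "low" sides, d..2d-1 the "high" sides.
    pyramid = j_max if point[j_max] < 0.5 else j_max + d
    return pyramid + height


# Keys can then be stored in any ordered 1-D structure (e.g. a B-Tree).
print(pyramid_value([0.1, 0.4]))   # pyramid 0, height 0.4
print(pyramid_value([0.6, 0.9]))   # pyramid 3, height 0.4
```

Because every point in one pyramid maps into the same unit-length key interval, a multi-dimensional range query can be rewritten as a small number of one-dimensional interval scans, followed by a filter on the original coordinates.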

2.
Wide-column NoSQL databases are an important class of NoSQL (Not only SQL) databases which scale horizontally and feature high access performance on sparse tables. With current trends towards big Data Warehouses (DWs), it is attractive to run existing business intelligence/data warehousing applications on higher volumes of data in wide-column NoSQL databases for low latency by mapping multidimensional models to wide-column NoSQL models or using additional SQL add-ons. For example, applications like retail management can run over integrated data sets stored in big DWs or in the cloud to capture current item-selling trends. Many of these systems also employ Snapshot Isolation (SI) as a concurrency control mechanism to achieve high throughput for read-heavy workloads. SI works well in a DW environment, as analytical queries can work on consistent snapshots and are not affected by concurrent update jobs performed by online incremental Extract-Transform-Load (ETL) flows that refresh fact/dimension tables. However, the snapshot made available in the DW is often stale, since at the moment an analytical query is issued, the source updates (e.g. in a remote retail store) may not yet have been extracted and processed by the ETL process because of high input data volume or slow processing speed. This staleness may cause incorrect results for time-critical decision support queries. To address this problem, snapshots that are to be accessed by analytical queries must first be maintained by the corresponding ETL flows to reflect source updates according to given freshness needs. Snapshot maintenance in this work means maintaining the distributed data partitions that are required by a query.
Since most NoSQL databases are not ACID compliant and do not provide full-fledged distributed transaction support, a snapshot may be inconsistently derived when its data partitions are updated by different ETL maintenance jobs. This paper describes an extended version of the HBelt system [1], which tightly integrates the wide-column NoSQL database HBase with a clustered and pipelined ETL engine. Our objective is to efficiently refresh HBase tables with remote source updates while guaranteeing a consistent snapshot across distributed partitions for each scan request in analytical queries. A consistency model is defined and implemented to address this distributed snapshot maintenance. To achieve this, ETL jobs and analytical queries are scheduled in a distributed processing environment. In addition, a partitioned, incremental ETL pipeline is introduced to increase the performance of ETL (update) jobs. We validate the efficiency gain in terms of data pipelining and data partitioning using the TPC-DS benchmark, which simulates a modern decision support system for a retail product supplier. Experimental results show that high query throughput can be achieved in HBelt when distributed, refreshed snapshots are demanded.

3.
The typical workload in a database system consists of a mix of multiple queries of different types that run concurrently. Interactions among the different queries in a query mix can have a significant impact on database performance. Hence, optimizing database performance requires reasoning about query mixes rather than considering queries individually. Current database systems lack the ability to do such reasoning. We propose a new approach based on planning experiments and statistical modeling to capture the impact of query interactions. Our approach requires no prior assumptions about the internal workings of the database system or the nature and cause of query interactions, making it portable across systems. To demonstrate the potential of modeling and exploiting query interactions, we have developed a novel interaction-aware query scheduler for report-generation workloads. Our scheduler, called QShuffler, uses two query scheduling algorithms that leverage models of query interactions. The first algorithm is optimized for workloads where queries are submitted in large batches. The second algorithm targets workloads where queries arrive continuously, and scheduling decisions have to be made online. We report an experimental evaluation of QShuffler using TPC-H workloads running on IBM DB2. The evaluation shows that QShuffler, by modeling and exploiting query interactions, can consistently outperform (up to 4x) query schedulers in current database systems.

4.
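Predictive continuous spatio-temporal range queries continuously return, throughout a user-specified time interval, the moving objects that will appear in the query region during a given future query time window. This paper proposes a method for processing predictive continuous spatio-temporal range queries and designs two index structures that support continuous query processing. The moving-object index records the continuously updated positions of moving objects and supports the initial evaluation of a query. The continuous-query index records all continuous queries whose results may be affected by changes in moving-object positions and supports continuous query processing. Experiments show that the proposed method effectively improves the efficiency of processing large numbers of continuous queries.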

5.
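Quotient Cube and QC-tree attempt to condense the size of a data cube while preserving the semantics it implies. However, the former does not store semantic relationships, and the semantic relationships stored by the latter are obscure. To address this, the drill-down cube structure is proposed. It is the first to approach data cube storage from a semantic perspective: it stores not the contents of the classes but the direct drill-down relationships between classes. The drill-down cube not only greatly reduces the storage size of the data cube but also clearly expresses the drill-down semantics implied by the original cube. In addition, the drill-down cube offers good query response performance, which is especially pronounced for range queries. Experiments and analysis show that the drill-down cube clearly outperforms QC-tree in storage size and query response and is well suited to organizing and storing data cubes.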

6.
An elastic and highly available data store is a key component of many cloud applications. Existing data stores with strong consistency guarantees are designed and optimized for small updates, key-value access, and (if supported) small range queries over a predefined key column. This raises performance and availability problems for applications which inherently require large updates, non-key access, and large range queries. This paper presents a solution to these problems: Crescando/RB, a distributed, scan-based, main-memory, relational data store (single table) with robust performance and high availability. The system addresses a real, large-scale industry use case: the Amadeus travel management system. This paper focuses on the distribution layer of Crescando/RB, the problem and theory behind it, the rationale underlying key design decisions, and the novel multicast protocol and replication framework of which it is composed. Highlighting the key features of the distribution layer, we present experimental results showing that even under permanent node failures and large-scale data repartitioning, Crescando/RB remains fully available and capable of sustaining a heavy query and update load.

7.
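Distributed NoSQL systems aim to provide high availability for large-scale data, but they lack built-in support for applications with complex queries. Traditional solutions based on single-term inverted lists have not achieved good results. This paper therefore studies the shortcoming that document databases do not support multiple keys as the primary index when handling dynamic document collections, and proposes an improved composite index method. By storing inverted lists for combined conditions, a query-driven mechanism can adaptively store the more popular condition combinations drawn from recent query records. The method reduces overall bandwidth consumption, incurs only a small additional overhead such as storage, and markedly improves the capacity and response time of NoSQL systems.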

8.
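With the development of cloud computing, cloud storage technology uses cluster applications, virtualization, distributed file systems and similar functions to bring together large numbers of heterogeneous storage devices across the network to work cooperatively, relieving the storage pressure on legacy data centers. Data deduplication, in turn, is a technique that reduces storage space and network traffic, and with the widespread adoption of the cloud it is bound to be applied to cloud storage. Combining the two technologies will bring practical benefits to the IT storage industry. By studying data deduplication and cloud storage, this paper designs a deduplication architecture based on cloud storage and proposes a scheme that performs in-line deduplication on the client, combining block-level and byte-level deduplication, before the data is stored in the cloud. In this architecture, the bulk data is stored in HDFS, while the hash values of file data blocks are stored in HBase.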
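A minimal sketch of block-level deduplication on the client side, assuming fixed-size blocks and SHA-256 fingerprints. The abstract does not specify the chunking method or the hash function, and the byte-level stage is omitted. The in-memory dictionaries stand in for HBase (hash index) and HDFS (block storage); `BlockStore` and its methods are hypothetical names.

```python
import hashlib
from typing import Dict, List


class BlockStore:
    """Toy stand-in for the cloud side: a hash index plus block storage."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.blocks: Dict[str, bytes] = {}       # fingerprint -> block data
        self.files: Dict[str, List[str]] = {}    # file name -> fingerprints

    def put(self, name: str, data: bytes) -> int:
        """Store a file, uploading only unseen blocks. Returns new-block count."""
        recipe: List[str] = []
        uploaded = 0
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.blocks:   # in-line check before upload
                self.blocks[fingerprint] = block
                uploaded += 1
            recipe.append(fingerprint)
        self.files[name] = recipe
        return uploaded

    def get(self, name: str) -> bytes:
        """Rebuild a file from its list of block fingerprints."""
        return b"".join(self.blocks[fp] for fp in self.files[name])
```

Checking the fingerprint before sending a block is what makes the scheme in-line: duplicate blocks never leave the client, so both storage and network traffic shrink.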

9.
The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. In this paper, we present the design of a cloud multidatastore query language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. The query engine has a fully distributed architecture, which provides important opportunities for optimization. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized, e.g. by pushing down select predicates, using bind join, performing join ordering, or planning intermediate data shipping. Our experimental validation, with three data stores (graph, document and relational) and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatastore query language.

10.
There have been many studies on management of moving objects recently. Most of them try to optimize the performance of predictive window queries. However, not much attention is paid to two other important query types: the predictive range query and the predictive k nearest neighbor query. In this article, we focus on these two types of queries. The novelty of our work mainly lies in the introduction of the Transformed Minkowski Sum, which can be used to determine whether a moving bounding rectangle intersects a moving circular query region. This enables us to use the traditional tree traversal algorithms to perform range and kNN searches. We theoretically show that our algorithms based on the Transformed Minkowski Sum are optimal in terms of the number of tree node accesses. We also experimentally verify the effectiveness of our technique and show that our algorithms outperform alternative approaches.

11.
Cloud computing systems handle large volumes of data by using almost unlimited computational resources, while spatial data warehouses (SDWs) are multidimensional databases that store huge volumes of both spatial data and conventional data. Cloud computing environments have been considered adequate to host voluminous databases, process analytical workloads and deliver database as a service, while spatial online analytical processing (spatial OLAP) queries issued over SDWs are intrinsically analytical. However, hosting an SDW in the cloud and processing spatial OLAP queries over such a database impose novel obstacles. In this article, we introduce novel concepts such as cloud SDW and spatial OLAP as a service, and then detail the design of novel schemas for cloud SDW and spatial OLAP query processing over cloud SDW. Furthermore, we evaluate the performance of processing spatial OLAP queries in cloud SDWs using our own query processor aided by a cloud spatial index. Moreover, we describe the cloud spatial bitmap index, which improves the performance of processing spatial OLAP queries in cloud SDWs, and assess it through an experimental evaluation. The results of our experiments show that this index reduced the query response time by 58.20% to 98.89%.

12.
Database query verification schemes provide correctness guarantees for database queries. Typically, such guarantees are required and advisable where queries are executed on untrusted servers. This need to verify query results, even though they may have been executed on one’s own database, is something new that has arisen with the advent of cloud services. The traditional model of hosting one’s own databases on one’s own servers did not require such verification because the hardware and software were both entirely within one’s control, and therefore fully trusted. However, with the economic and technological benefits of cloud services beckoning, many are now considering outsourcing both data and execution of database queries to the cloud, despite obvious risks. This survey paper provides an overview of the field of database query verification and explores the current state of the art in terms of query execution and correctness guarantees provided for query results. We also indicate directions for future work in the area.

13.
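Key-value stores are designed to retrieve values from very large volumes of data while offering high availability, fault tolerance and scalability, and so provide much-needed infrastructure for location-based services (LBS). However, complex queries on multi-dimensional data cannot be handled efficiently, because key-value stores provide no means of accessing multiple attributes. To address the inability of the key-value store HBase to process multi-dimensional data efficiently, a unified index framework, New-grid, is proposed that enables HBase to support multi-dimensional queries. A set of nodes is organized in an improved P-grid overlay network, providing efficient data distribution, fault tolerance and query processing for multi-dimensional data. For indexing, a Hilbert space-filling curve is used to preserve data locality, so that multi-dimensional data in the key-value store is managed efficiently. HBase is used as the underlying storage layer to manage the data, and algorithms for range queries and k-nearest-neighbor queries are proposed that eliminate the overhead of maintaining separate index tables. Extensive experiments were conducted on Amazon EC2 using clusters of 4, 8 and 16 commodity nodes. The results show that New-grid outperforms MD-Hbase and MapReduce.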
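To illustrate how a Hilbert space-filling curve turns a multi-dimensional point into a single sortable key, the sketch below uses the standard iterative 2-D mapping on an n x n grid, where n is a power of two. It is only an illustration: how New-grid builds its row keys and how it handles more than two dimensions are not stated in the abstract, and the names `hilbert_index` and `_rotate` are hypothetical.

```python
from typing import Tuple


def _rotate(n: int, x: int, y: int, rx: int, ry: int) -> Tuple[int, int]:
    """Rotate/flip a quadrant so the sub-curve has the right orientation."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def hilbert_index(n: int, x: int, y: int) -> int:
    """Return the position of cell (x, y) along the Hilbert curve.

    n is the grid side length and must be a power of two;
    0 <= x, y < n.
    """
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a positive power of two")
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("cell outside the grid")
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)   # offset of the quadrant at this level
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d


# A row key could be derived from the index, e.g. zero-padded for sorting.
print(f"{hilbert_index(8, 3, 5):04d}")
```

Cells with consecutive keys are always neighbours in the grid, which is why a scan over a contiguous key range tends to touch spatially nearby points.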

14.
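In the cloud resource sharing service model, an improved E-SkipNet network is proposed to address multi-attribute range queries over cloud resources. First, E-SkipNet builds on the traditional distributed hash table (DHT) network SkipNet by introducing data attributes into the assignment of node NameIDs and placing each physical node in a single attribute domain, so as to support multi-attribute range queries. Second, on top of the original E-SkipNet, each physical node is mapped to multiple logical nodes that join multiple attribute domains simultaneously, and resources are published to different logical nodes according to their attributes. Finally, a uniform locality-preserving hash function is used to map and store resources, so that the order of attribute values is preserved within each attribute domain and range queries are supported. Simulation results show that, compared with the original E-SkipNet and the Multi-Attribute Addressable Network (MAAN), the improved E-SkipNet increases routing efficiency by 18.09% and 20.47%, respectively. The results indicate that the improved E-SkipNet supports more efficient multi-attribute range queries over cloud resources and achieves good load balancing in heterogeneous environments.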

15.
Efficient Execution of Multiple Queries on Deep Memory Hierarchy
This paper proposes a complementary novel idea, called MiniTasking, to further reduce the number of cache misses by improving the temporal data locality of multiple concurrent queries. Our idea is based on the observation that, in many workloads such as decision support systems (DSS), there is usually a significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing to improve temporal data locality by scheduling query execution at three levels: query-level batching, operator-level grouping and mini-task-level scheduling. The experimental results with various types of concurrent TPC-H query workloads show that, with the traditional N-ary Storage Model (NSM) layout, MiniTasking reduces L2 cache misses by up to 83% and thereby achieves a 24% reduction in execution time. With the Partition Attributes Across (PAX) layout, MiniTasking further reduces cache misses by 65% and execution time by 9%. For the TPC-H throughput test workload, MiniTasking improves the end performance by up to 20%.

16.
In recent years, the problems of using generic (i.e., relational) storage techniques for very specific applications have been detected and outlined and, as a consequence, some alternatives to relational DBMSs (e.g., HBase) have bloomed. Most of these alternatives sit on the cloud and benefit from cloud computing, which is nowadays a reality that helps us save money by eliminating fixed hardware and software costs and paying only per use. On top of this, specific querying frameworks that exploit the brute force of the cloud (e.g., MapReduce) have also been devised. The question that arises next is whether this (rather naive) exploitation of the cloud is an alternative to tuning DBMSs, or whether it still makes sense to consider other options when retrieving data from these settings. In this paper, we study the feasibility of solving OLAP queries with Hadoop (the Apache project implementing MapReduce) while benefiting from secondary indexes and partitioning in HBase. Our main contribution is the comparison of different access plans and the definition of criteria (i.e., cost estimation) to choose among them in terms of consumed resources (namely CPU, bandwidth and I/O).

17.
Data warehouse workloads are crucial for the support of on-line analytical processing (OLAP). The strategy to cope with OLAP queries on such huge amounts of data calls for the use of large parallel computers. The trend today is to use cluster architectures that show a reasonable balance between cost and performance. In such cases, it is necessary to tune the applications in order to minimize the amount of I/O and communication, such that the global execution time is reduced as much as possible.

In this paper, we model and analyze the most up-to-date strategies for ad hoc star join query processing in a cluster of computers. We show that, for ad hoc query processing and assuming a limited amount of available resources, these strategies still have room for improvement in terms of both I/O and inter-node data traffic. Our analysis concludes with the proposal of a hybrid solution that improves these two aspects compared to the previous techniques and shows near-optimal results in a broad spectrum of cases.


18.
Most real-world databases contain substantial amounts of time-referenced, or temporal, data. Recent advances in temporal query languages show that such database applications may benefit substantially from built-in temporal support in the DBMS. To achieve this, temporal query representation, optimization, and processing mechanisms must be provided. This paper presents a foundation for query optimization that integrates conventional and temporal query optimization and is suitable for both conventional DBMS architectures and ones where the temporal support is obtained via a layer on top of a conventional DBMS. This foundation captures duplicates and ordering for all queries, as well as coalescing for temporal queries, thus generalizing all existing approaches known to the authors. It includes a temporally extended relational algebra to which SQL and temporal SQL queries may be mapped, six types of algebraic equivalences, concrete query transformation rules that obey different equivalences, a procedure for determining which types of transformation rules are applicable for optimizing a query, and a query plan enumeration algorithm.

19.
Unstructured peer-to-peer infrastructure has been widely employed to support large-scale distributed applications. Many of these applications, such as location-based services and multimedia content distribution, require the support of range selection queries. Under the widely-adopted query shipping protocols, the cost of query processing is affected by the number of result copies or replicas in the system. Since range queries can return results that include poorly-replicated data items, the cost of these queries is usually dominated by the retrieval cost of these data items. In this work, we propose a popularity-aware prefetch-based approach that can effectively facilitate the caching of poorly-replicated data items that are potentially requested in subsequent range queries, resulting in substantial cost savings. We prove that the performance of retrieving poorly-replicated data items is guaranteed to improve under an increasing query load. Extensive experiments show that the overall range query processing cost decreases significantly under various query load settings.

20.
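Driven by multicore processor hardware and the demands of highly concurrent query workloads, recent research has focused not only on query optimization for the one-query-at-a-time model but also on optimization for the one-group-at-a-time model. By converting concurrent queries into a shared workload, some latency-bound operations, such as disk I/O and cache accesses, can be shared by multiple concurrent queries. Current work is usually based on shared query operators, such as scans, joins and predicate processing, and optimizes concurrent queries by generating a global execution plan. For complex analytical workloads, creating an optimized execution plan is a challenging problem. Building on the widely used star schema, this paper proposes a template OLAP query execution plan that simplifies query execution plans, with the goal of maximizing the utilization of query operators. 1) A join index technique based on surrogate keys is proposed, which converts the traditional value-probing join into an in-memory array index reference (AIR), making the join more CPU-efficient and supporting late materialization for aggregation. 2) Predicate processing for concurrent queries is simplified to cache-line-conscious predicate vectors, maximizing predicate evaluation performance for concurrent queries within a single cache-line access. 3) The approach is implemented with multicore parallelism and tested on the SSB benchmark. Experimental results show that shared scans and shared predicate processing can double the performance of concurrent OLAP query processing.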
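The array index reference (AIR) idea can be sketched as follows, assuming dimension rows are stored in an array whose positions equal their surrogate keys (0, 1, 2, ...). The tables, column names and the helper `sum_revenue_by_region` are hypothetical, and the sketch ignores the cache-line layout and multicore parallelism that the paper addresses.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Dimension table: the list position IS the surrogate key.
customer_region: List[str] = ["ASIA", "EUROPE", "ASIA", "AMERICA"]

# Fact table rows: (customer surrogate key, revenue).
lineorder: List[Tuple[int, int]] = [(0, 10), (1, 20), (2, 30), (3, 40), (0, 5)]


def sum_revenue_by_region(wanted: Set[str]) -> Dict[str, int]:
    # Predicate vector: evaluate the dimension predicate once per dimension row.
    predicate_vector = [region in wanted for region in customer_region]
    totals: Dict[str, int] = defaultdict(int)
    for cust_key, revenue in lineorder:
        # Join = direct array lookup by surrogate key, no probe on values.
        if predicate_vector[cust_key]:
            # Late materialization: fetch the group-by value only for survivors.
            totals[customer_region[cust_key]] += revenue
    return dict(totals)


print(sum_revenue_by_region({"ASIA", "EUROPE"}))
```

With several concurrent queries, each query would contribute its own predicate vector (or one bit per query in a shared vector), so a single scan of the fact table can serve all of them.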
