Similar Documents
20 similar documents found.
1.
OLAP queries involve many aggregations over large amounts of data in data warehouses. To process expensive OLAP queries efficiently, we propose a new method to rewrite a given OLAP query using the various kinds of materialized views that already exist in data warehouses. We first define the normal forms of OLAP queries and materialized views based on the selection and aggregation granularities, which are derived from the lattice of dimension hierarchies. Conditions for the usability of materialized views in rewriting a given query are specified by relationships between the components of their normal forms. We present a rewriting algorithm for OLAP queries that can effectively utilize materialized views with different selection granularities, selection regions, and aggregation granularities together. We also propose an algorithm to find a set of materialized views that yields a rewritten query that can be executed efficiently. We demonstrate the effectiveness and performance of the algorithm experimentally.
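The granularity-based usability condition can be pictured with a small sketch. The following Python fragment is a minimal illustration under assumed toy names and data layout (the paper's normal forms also cover selection regions, which must likewise be checked for coverage):

```python
# Dimension hierarchies: finer level -> next coarser level.
HIERARCHIES = {
    "time": {"day": "month", "month": "quarter", "quarter": "year"},
    "location": {"store": "city", "city": "country"},
}

def rolls_up_to(dim, fine, coarse):
    """True if level `coarse` is derivable from level `fine` in `dim`."""
    level = fine
    while level is not None:
        if level == coarse:
            return True
        level = HIERARCHIES[dim].get(level)
    return False

def view_answers_query(view_gran, query_gran):
    """Granularities are dicts of dimension -> level; the view is usable
    only if, on every queried dimension, it is at least as fine as the
    query (the query's level is reachable by rolling up the view's)."""
    return all(
        dim in view_gran and rolls_up_to(dim, view_gran[dim], q_level)
        for dim, q_level in query_gran.items()
    )

view = {"time": "day", "location": "city"}
query = {"time": "quarter", "location": "country"}
print(view_answers_query(view, query))  # True: day->quarter, city->country
```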

2.
Conventional data warehouses employ the query-at-a-time model, which maps each query to a distinct physical plan. When several queries execute concurrently, this model introduces contention and thrashing, because the physical plans, unaware of each other, compete for access to the underlying I/O and computation resources. As a result, while modern systems can efficiently optimize and evaluate a single complex data analysis query, their performance suffers significantly and can be highly erratic when multiple complex queries run at the same time. We present in this paper Cjoin, a new design that substantially improves throughput in large-scale data analytics systems processing many concurrent join queries. In contrast to the conventional query-at-a-time model, our approach employs a single physical plan that shares I/O, computation, and tuple storage across all in-flight join queries. We use an "always on" pipeline of non-blocking operators, managed by a controller that continuously examines the current query mix and optimizes the pipeline on the fly. Our design enables data analytics engines to scale gracefully to large data sets, provide predictable execution times, and reduce contention. We implemented Cjoin as an extension to the PostgreSQL DBMS. This prototype outperforms conventional commercial systems by an order of magnitude for tens to hundreds of concurrent queries.
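A minimal sketch of the shared-plan idea follows (an assumed simplification for illustration, not the actual Cjoin operator pipeline): a single scan of the fact table serves every in-flight query, tagging each tuple with a bitmask of the queries whose predicates it satisfies, so scan and join work are shared:

```python
def shared_scan(fact_tuples, query_predicates):
    """query_predicates: list (indexed by query id) of callables on a tuple.
    Yields (tuple, bitmask); bit q is set iff query q accepts the tuple."""
    for t in fact_tuples:
        mask = 0
        for qid, accepts in enumerate(query_predicates):
            if accepts(t):
                mask |= 1 << qid
        if mask:                      # drop tuples no in-flight query needs
            yield t, mask

facts = [{"region": "EU", "amount": 10}, {"region": "US", "amount": 99}]
preds = [
    lambda t: t["region"] == "EU",    # query 0
    lambda t: t["amount"] > 50,       # query 1
]
for t, mask in shared_scan(facts, preds):
    print(t, bin(mask))  # downstream per-query aggregators consume by bit
```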

3.
Data warehouses are very large databases usually designed using the star schema. Queries defined on data warehouses are generally complex due to the join operations involved. The performance of star-schema queries in data warehouses is highly critical, and its optimization is hard in general. Several query performance optimization methods exist, such as indexes and table partitioning. In this paper, we propose a new approach based on binary particle swarm optimization for solving the bitmap join index selection problem in data warehouses. This approach selects the optimal set of bitmap join indexes based on a mathematical cost model. Several experiments are performed to demonstrate the effectiveness of the proposed method on the bitmap join index selection problem. Further testing of the method is performed using a database-environment-specific cost function. The binary particle swarm optimization is found to be more effective than both the genetic algorithm and data-mining-based approaches.
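As a concrete illustration of the search procedure, here is a generic binary PSO sketch (the parameters and the toy cost interface are assumptions, not the paper's tuned variant; in the paper's setting, each bit would encode whether a candidate bitmap join index is built, and the cost function would be its mathematical cost model):

```python
import math
import random

def binary_pso(cost, n_bits, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Minimize cost(bit_list) over {0,1}^n_bits."""
    X = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    V = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_cost = [cost(x) for x in X]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_bits):
                V[i][d] = (w * V[i][d]
                           + c1 * random.random() * (pbest[i][d] - X[i][d])
                           + c2 * random.random() * (gbest[d] - X[i][d]))
                # the sigmoid turns the velocity into the probability of bit = 1
                X[i][d] = 1 if random.random() < 1 / (1 + math.exp(-V[i][d])) else 0
            c = cost(X[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = X[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = X[i][:], c
    return gbest, gbest_cost

# Toy cost standing in for a real storage+query model: build exactly 3
# of 10 candidate indexes.
best, best_cost = binary_pso(lambda x: abs(sum(x) - 3), n_bits=10)
print(best, best_cost)
```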

4.
Histograms can be useful in estimating the selectivity of queries in areas such as database query optimization and data exploration. In this paper, we propose a new histogram method for multidimensional data, called the Q-Histogram, based on the quad-tree, a popular index structure for multidimensional data sets. The compact representation of the target data obtainable from the quad-tree allows a histogram to be constructed with the minimum number of scans of the underlying data, i.e., a single scan. In addition to the advantage in computation time, the proposed method also outperforms other existing methods with respect to the quality of selectivity estimation. We present a new measure of data skew for a histogram bucket, called the weighted bucket skew, and then provide an effective technique for skew-tolerant organization of histograms. Finally, we compare the accuracy and efficiency of the proposed method with other existing methods using both real-life and synthetic data sets. The experimental results show that the proposed method generally outperforms other existing methods in terms of accuracy as well as computational efficiency.
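The sketch below shows the general flavour of a quad-tree-backed histogram (the capacity threshold, bucket layout, and the uniformity assumption inside leaves are illustrative choices, not the Q-Histogram's skew-aware organization):

```python
import random

class QuadNode:
    def __init__(self, x0, y0, x1, y1, capacity=64):
        self.box = (x0, y0, x1, y1)
        self.count, self.points, self.children = 0, [], None
        self.capacity = capacity

    def insert(self, x, y):
        self.count += 1                      # subtree total
        if self.children is None:
            self.points.append((x, y))
            if len(self.points) > self.capacity:
                self._split()
        else:
            self._child(x, y).insert(x, y)

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadNode(x0, y0, mx, my), QuadNode(mx, y0, x1, my),
                         QuadNode(x0, my, mx, y1), QuadNode(mx, my, x1, y1)]
        pts, self.points = self.points, []
        for px, py in pts:                   # push points down one level
            self._child(px, py).insert(px, py)

    def _child(self, x, y):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

def _overlap(a, b):
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def estimate(node, query_box):
    """Estimated point count in query_box, assuming uniformity in leaves."""
    ov = _overlap(node.box, query_box)
    if ov == 0.0:
        return 0.0
    if node.children is not None:
        return sum(estimate(c, query_box) for c in node.children)
    area = (node.box[2] - node.box[0]) * (node.box[3] - node.box[1])
    return node.count * ov / area

root = QuadNode(0, 0, 100, 100, capacity=4)
for _ in range(200):
    root.insert(random.uniform(0, 100), random.uniform(0, 100))
print(estimate(root, (10, 10, 30, 30)))  # roughly 200 * 400/10000 = 8
```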

5.
Data warehouses are built to answer queries efficiently over data integrated from various systems. To improve system performance, the issue of materializing views within data warehouses must be explored. This involves pre-computing a selected set of views, which are fact and dimension tables, under given resource and quality constraints. The quality constraints include query processing time, data maintenance time, and the freshness of data when queries are placed. A further concern is the update policy, which governs when data is reloaded into the warehouse. A model is proposed to determine the view selection and update policy when query arrivals follow Poisson processes, under constraints on system response time, storage space, and query-dependent currency of data (on systems capable of periodic and query-triggered updates). To the best of the researchers' knowledge, no other research has considered all these factors in one model. A two-phase greedy algorithm was developed to determine the optimal update policy for the view selection problem. Numerous experiments were performed to explore the sensitivity of the proposed model under various constraints and system parameter settings. The results show that the model responds reasonably to the tunings and that the proposed algorithm can rapidly find acceptable solutions.
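For flavour, a benefit-per-space greedy for the selection phase might look like the sketch below (a common baseline for view selection shown under assumed inputs; the paper's two-phase algorithm additionally determines the update policy, and a faithful greedy would recompute benefits as chosen views interact):

```python
def greedy_select(candidates, space_budget):
    """candidates: list of (view_name, size, saving), where `saving` is the
    expected reduction in total query cost if the view is materialized.
    NOTE: assumes savings are independent; a faithful variant recomputes
    each remaining view's benefit after every pick."""
    chosen, used = [], 0
    pool = sorted(candidates, key=lambda v: v[2] / v[1], reverse=True)
    for name, size, saving in pool:
        if used + size <= space_budget:
            chosen.append(name)
            used += size
    return chosen

views = [("by_month", 40, 120.0), ("by_store", 25, 90.0), ("by_day", 100, 150.0)]
print(greedy_select(views, space_budget=80))  # ['by_store', 'by_month']
```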

6.
《Information Systems》2005,30(2):133-149
Data warehouses collect masses of operational data, allowing analysts to extract information by issuing decision support queries on the otherwise discarded data. In many application areas (e.g., telecommunications), the warehoused data sets are multiple terabytes in size. Parts of these data sets are stored on very large disk arrays, while the remainder is stored on tape-based tertiary storage (which is one to two orders of magnitude less expensive than on-line storage). However, the inherently sequential nature of access to tape-based tertiary storage makes efficient access to tape-resident data difficult to accomplish through conventional databases. In this paper, we present a way to make access to a massive tape-resident data warehouse easy and efficient. Ad hoc decision support queries usually involve large-scale and complex aggregation over the detail data. These queries are difficult to express in SQL, and frequently require self-joins on the detail data (which are prohibitively expensive on disk-resident data and infeasible to compute on tape-resident data) or unnecessary multiple passes through the detail data. An extension to SQL, the extended multi-feature SQL (EMF SQL), expresses complex aggregation computations clearly without using self-joins. The detail data in a data warehouse usually represents a record of past activities, and is therefore temporal. We show that complex queries involving sequences can be easily expressed in EMF SQL. An EMF SQL query can be optimized to minimize the number of passes through the detail data required to evaluate it, in many cases to only one pass. We describe an efficient query evaluation algorithm along with a query optimization algorithm that minimizes the number of passes through the detail data and the amount of main memory required to evaluate the query. These algorithms are useful not only in the context of tape-resident data warehouses but also in data stream systems, which require similar processing techniques.
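To see why pass minimization matters, consider the classic multi-feature query "for each product, the fraction of sales above that product's average sale". In plain SQL this invites a self-join; a pass-minimized plan needs just two sequential scans of the detail data, as in this sketch (a toy data model for illustration, not EMF SQL syntax):

```python
from collections import defaultdict

def above_avg_fraction(scan):  # `scan` is re-iterable: each pass reads the tape once
    sums, counts = defaultdict(float), defaultdict(int)
    for product, amount in scan():          # pass 1: per-group averages
        sums[product] += amount
        counts[product] += 1
    above = defaultdict(int)
    for product, amount in scan():          # pass 2: the dependent aggregate
        if amount > sums[product] / counts[product]:
            above[product] += 1
    return {p: above[p] / counts[p] for p in counts}

rows = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 2.0), ("b", 5.0)]
print(above_avg_fraction(lambda: iter(rows)))  # {'a': 0.5, 'b': 0.333...}
```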

7.
Multidimensional data modeling for location-based services
With the recent and continuing advances in areas such as wireless communications and positioning technologies, mobile, location-based services are becoming possible. Such services deliver location-dependent content to their users. More specifically, these services may capture the movements and requests of their users in multidimensional databases, i.e., data warehouses, and content delivery may be based on the results of complex queries on these data warehouses. Such queries aggregate detailed data in order to find useful patterns, e.g., in the interaction of a particular user with the services. The application of multidimensional technology in this context poses a range of new challenges. The specific challenge addressed here concerns the provision of an appropriate multidimensional data model. In particular, the paper extends an existing multidimensional data model and algebraic query language to accommodate spatial values that exhibit partial containment relationships instead of the total containment relationships normally assumed in multidimensional data models. Partial containment introduces imprecision in aggregation paths. The paper proposes a method for evaluating the imprecision of such paths, and also offers transformations of dimension hierarchies with partial containment relationships into simple hierarchies, to which existing precomputation techniques are applicable.
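A small sketch makes partial containment concrete (the fractions, the product-based precision score, and the max-overlap transformation below are illustrative stand-ins, not the paper's actual imprecision measure):

```python
# Edges: child -> list of (parent, contained_fraction). Full containment
# is 1.0; a cell split across districts has fractional edges.
EDGES = {
    "cell_17": [("district_A", 0.7), ("district_B", 0.3)],
    "district_A": [("city_X", 1.0)],
    "district_B": [("city_X", 1.0)],
}

def path_precision(path):
    """Approximate precision of an aggregation path as the product of
    the containment fractions along its edges."""
    p = 1.0
    for child, parent in zip(path, path[1:]):
        p *= dict(EDGES[child]).get(parent, 0.0)
    return p

def to_simple_hierarchy(edges):
    """Coerce to a simple hierarchy: each child goes wholly to its
    maximal-overlap parent, enabling standard precomputation."""
    return {c: max(ps, key=lambda e: e[1])[0] for c, ps in edges.items()}

print(path_precision(["cell_17", "district_A", "city_X"]))  # 0.7
print(to_simple_hierarchy(EDGES))
```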

8.
Yao Liu  Hui Xiong 《Information Sciences》2006,176(9):1215-1240
A data warehouse stores current and historical records consolidated from multiple transactional systems. Securing data warehouses is of ever-increasing interest, especially in settings where data are sold in pieces to third parties for data mining. In such cases, existing data warehouse security techniques, such as data access control, may not be easy to enforce and can be ineffective. Instead, this paper proposes a data perturbation based approach, called the cubic-wise balance method, to provide privacy-preserving range queries on data cubes in a data warehouse. The approach is motivated by the observation that analysts are usually interested in summary data rather than individual data values. Indeed, our approach can provide closely estimated summary data for range queries without providing access to actual individual data values. As demonstrated by our experimental results on the APB benchmark data set from the OLAP Council, the cubic-wise balance method achieves both better privacy preservation and better range query accuracy than random data perturbation alternatives.
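The following sketch illustrates the general balancing idea behind such perturbation (an assumed simplification, not the cubic-wise balance method itself): noise that sums to zero within a block hides individual cells while leaving block-aligned range sums exact:

```python
import random

def perturb_block(values, scale=5.0):
    """Add zero-sum noise to one balancing block of the cube: individual
    cells change, but the block total (hence any range query that is a
    union of whole blocks) is preserved up to float rounding."""
    noise = [random.uniform(-scale, scale) for _ in values]
    shift = sum(noise) / len(noise)
    zero_sum = [n - shift for n in noise]   # noise now sums to ~0
    return [v + n for v, n in zip(values, zero_sum)]

block = [10.0, 12.0, 7.0, 11.0]
p = perturb_block(block)
print(sum(block), sum(p))  # equal block totals, different cell values
```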

9.
The GMAP: a versatile tool for physical data independence
Physical data independence is touted as a central feature of modern database systems. It allows users to frame queries in terms of the logical structure of the data, letting a query processor automatically translate them into optimal plans that access physical storage structures. Both relational and object-oriented systems, however, force users to frame their queries in terms of a logical schema that is directly tied to physical structures. We present an approach that eliminates this dependence. All storage structures are defined in a declarative language based on relational algebra as functions of a logical schema. We present an algorithm, integrated with a conventional query optimizer, that translates queries over this logical schema into plans that access the storage structures. We also show how to compile update requests into plans that update all relevant storage structures consistently and optimally. Finally, we report on experiments with a prototype implementation of our approach that demonstrate how it allows storage structures to be tuned to the expected or observed workload to achieve significantly better performance than is possible with conventional techniques.

10.
The analytic prediction of buffer hit probability, based on the characterization of database accesses from real reference traces, is extremely useful for workload management and system capacity planning. The knowledge can be helpful for proper allocation of buffer space to various database relations, as well as for the management of buffer space for a mixed transaction and query environment. Access characterization can also be used to predict the buffer invalidation effect in a multi-node environment which, in turn, can influence transaction routing strategies. However, it is a challenge to characterize the database access pattern of a real workload reference trace in a simple manner that can easily be used to compute buffer hit probability. In this article, we use a characterization method that distinguishes three types of access patterns from a trace: (1) locality within a transaction, (2) random accesses by transactions, and (3) sequential accesses by long queries. We then propose a concise way to characterize the access skew across randomly accessed pages by logically grouping the large number of data pages into a small number of partitions such that the frequency of accessing each page within a partition can be treated as equal. Based on this approach, we present a recursive binary partitioning algorithm that can infer the access skew characterization from the buffer hit probabilities for a subset of the buffer sizes. We validate the buffer hit predictions for single and multiple node systems using production database traces. We further show that the proposed approach can predict the buffer hit probability of a composite workload from those of its component files.
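To see how a partition-based skew characterization feeds a hit-probability estimate, consider this hedged illustration (an assumed idealized model in which the buffer always retains the hottest pages; the article's method and policy model are more refined):

```python
def hit_probability(partitions, buffer_pages):
    """partitions: list of (n_pages, total_access_prob); pages inside a
    partition are referenced with equal probability. The hit probability
    for a buffer of B pages is the summed per-reference probability of
    the B hottest pages."""
    per_page = sorted(((prob / n, n) for n, prob in partitions), reverse=True)
    hit, left = 0.0, buffer_pages
    for p, n in per_page:
        take = min(left, n)
        hit += take * p
        left -= take
        if left == 0:
            break
    return hit

# Skewed access: 10% of pages draw 80% of references, and so on.
parts = [(100, 0.80), (400, 0.15), (500, 0.05)]
print(hit_probability(parts, 150))  # 0.81875
```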

11.
Spatial data warehouses (SDWs) allow for spatial analysis together with analytical multidimensional queries over huge volumes of data. The challenge is to retrieve data related to ad hoc spatial query windows according to spatial predicates, avoiding the high cost of joining large tables. Mechanisms for efficient query processing over SDWs are therefore essential. In this paper, we propose two efficient indices for SDWs: the SB-index and the HSB-index. The proposed indices share the following characteristics: they enable multidimensional queries with spatial predicates for SDWs and also support predefined spatial hierarchies; furthermore, they compute the spatial predicate and transform it into a conventional one, which can be evaluated together with other conventional predicates by accessing a star-join Bitmap index. While the SB-index has a sequential data structure, the HSB-index uses a hierarchical data structure to enable spatial object clustering and a specialized buffer-pool to decrease the number of disk accesses. The advantages of the SB-index and the HSB-index over the DBMS resources for SDW indexing (i.e., star-join computation and materialized views) were investigated through performance tests, which issued roll-up operations extended with containment and intersection range queries. The performance results showed improvements ranging from 68% up to 99% over both the star-join computation and the materialized view. Furthermore, the proposed indices proved to be very compact, adding less than 1% to the storage requirements. Both the SB-index and the HSB-index are therefore excellent choices for SDW indexing. Choosing between them mainly depends on the query selectivity of spatial predicates: low query selectivity benefits the HSB-index, while the SB-index provides better performance for higher query selectivity.
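A sketch of the predicate-transformation step described above (the data layout and names are assumed for illustration): a sequential array maps each spatial dimension member's key to its minimum bounding rectangle, the spatial predicate is resolved on that array, and the result reduces to a conventional key list probed against the star-join Bitmap index:

```python
SB_INDEX = [  # (dimension key, minimum bounding rectangle)
    (1, (0, 0, 10, 10)),
    (2, (20, 0, 30, 10)),
    (3, (5, 5, 15, 15)),
]

def intersects(r, w):
    return not (r[2] < w[0] or w[2] < r[0] or r[3] < w[1] or w[3] < r[1])

def spatial_to_conventional(window):
    """Turn an ad hoc intersection-range predicate into a key list."""
    return [key for key, mbr in SB_INDEX if intersects(mbr, window)]

keys = spatial_to_conventional((4, 4, 12, 12))
print(keys)  # [1, 3]: evaluated next against the star-join Bitmap index
```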

12.
Cloud computing systems handle large volumes of data by using almost unlimited computational resources, while spatial data warehouses (SDWs) are multidimensional databases that store huge volumes of both spatial and conventional data. Cloud computing environments are considered adequate to host voluminous databases, process analytical workloads, and deliver database as a service, while spatial online analytical processing (spatial OLAP) queries issued over SDWs are intrinsically analytical. However, hosting an SDW in the cloud and processing spatial OLAP queries over such a database pose novel obstacles. In this article, we introduce the novel concepts of cloud SDW and spatial OLAP as a service, and then detail the design of novel schemas for cloud SDWs and for spatial OLAP query processing over them. Furthermore, we evaluate the performance of spatial OLAP query processing in cloud SDWs using our own query processor aided by a cloud spatial index. Moreover, we describe the cloud spatial bitmap index, designed to improve the performance of spatial OLAP queries in cloud SDWs, and assess it through an experimental evaluation. Results derived from our experiments revealed that this index was capable of reducing query response time by 58.20% to 98.89%.

13.
Querying time series data based on similarity
We study similarity queries for time series data where similarity is defined, in a fairly general way, in terms of a distance function and a set of affine transformations on the Fourier series representation of a sequence. We identify a safe set of transformations supporting a wide variety of comparisons and show that this set is rich enough to formulate operations such as moving average and time scaling. We also show that queries expressed using safe transformations can be computed efficiently without prior knowledge of the transformations. We present a query processing algorithm that uses the underlying multidimensional index built over the data set to efficiently answer similarity queries. Our experiments show that the performance of this algorithm is competitive with that of processing ordinary (exact match) queries using the index, and much faster than sequential scanning. We propose a generalization of this algorithm for simultaneously handling multiple transformations at a time, and give experimental results on the performance of the generalized algorithm.
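As an illustration of why Fourier representations suit this setting, the sketch below (a generic numpy illustration, not the paper's algorithm or index) keeps a few low-frequency coefficients as an index key and applies a moving average directly in the frequency domain:

```python
import numpy as np

K = 4  # low-frequency coefficients kept as the index key

def fingerprint(x):
    """Truncated DFT used as a compact index key for the sequence."""
    return np.fft.fft(x)[:K]

def moving_average_spectrum(x, w):
    """DFT of x smoothed by a circular moving average of width w:
    convolution in time is a pointwise product in frequency."""
    kernel = np.zeros(len(x))
    kernel[:w] = 1.0 / w
    return np.fft.fft(x) * np.fft.fft(kernel)

def key_distance(fa, fb):
    # Distance on truncated coefficients cannot exceed the full-spectrum
    # distance, so index filtering yields no false dismissals, only false
    # positives that a refinement step removes.
    return np.linalg.norm(fa - fb)

a = np.sin(np.linspace(0, 4 * np.pi, 64))
b = a + 0.1 * np.random.randn(64)
print(key_distance(fingerprint(a), fingerprint(b)))  # small: candidate match
```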

14.
Motivated by the globalization trend and Internet-speed competition, enterprises nowadays often divide into many departments or organizations, or even merge with companies located in different regions, to improve competitiveness and responsiveness. As a result, a geographically distributed enterprise typically operates a number of data warehouse systems. To meet distributed decision-making requirements, the data in these different warehouses must be exchanged and integrated. An open, vendor-independent, and efficient standard for transferring data between data warehouses over the Internet is therefore an important issue. However, current solutions for cross-warehouse data exchange rely either on record-based transfer or on plain-text files, which are neither adequate nor efficient. In this research, issues in multidimensional data exchange are studied and an intelligent XML-based multidimensional data exchange model is developed. In addition, a generic-construct-based approach is proposed to enable many-to-many systematic mapping between distributed data warehouses, introducing a consistent and unique standard exchange format. Based on the transformation model we develop between the multidimensional data model and the XML data model, and enhanced by the multidimensional metadata management mechanism proposed in this research, a general-purpose intelligent XML-based multidimensional data exchange process over the web is facilitated efficiently and with improved quality. Moreover, we develop an intelligent XML-based prototype system to exchange multidimensional data, which shows that the proposed model is feasible and that the exchange process becomes more systematic and efficient when metadata is used.

15.
The development of data warehouses begins with the definition of multidimensional models at the conceptual level in order to structure data, which facilitates easier data analysis for decision makers. Current proposals for conceptual multidimensional modelling focus on the design of static data warehouse structures, but few approaches model the queries that the data warehouse should support by means of OLAP (on-line analytical processing) tools. OLAP queries are therefore only defined once the rest of the data warehouse has been implemented, which prevents designers from verifying, from the very beginning of development, whether the decision maker will be able to obtain the required information from the data warehouse. This article presents a solution to this drawback: an extension to the Object Constraint Language (OCL) that includes a set of predefined OLAP operators. These operators can be used to define platform-independent OLAP queries as part of the specification of the data warehouse's conceptual multidimensional model. Furthermore, OLAP tools require the implementation of queries to ensure performance optimisations based on pre-aggregation. Notably, the OLAP queries defined by our approach can be automatically implemented alongside the rest of the data warehouse, in a coherent and integrated manner. This implementation is supported by a code-generation architecture aligned with model-driven technologies, in particular the MDA (model-driven architecture) proposal. Finally, our proposal has been validated by means of a set of sample data sets from a well-known case study.

16.
Analysis of historical data in data warehouses contributes significantly toward future decision-making. A number of design factors, including slowly changing dimensions (SCDs), affect the quality of such analysis. In SCDs, attribute values may change over time and must be tracked; they should maintain consistency and correctness of data while showing good query performance. We identify that SCDs can have three types of validity periods: disjoint, overlapping, and identical. We then show that the third type cannot be handled through the temporal star schema for temporal data warehouses (TDWs). We further show that a hybrid/Type-6 scheme may be used together with the temporal star schema to handle this shortcoming. We demonstrate that the use of a surrogate key in the hybrid scheme efficiently identifies data, avoids most time comparisons, and improves query performance. Finally, we compare the TDW and a surrogate-key-based temporal data warehouse (SKTDW) using query formulation, query performance, and data warehouse size as parameters. The results of our experiments on 23 queries of five different types show that SKTDW outperforms TDW for all query types, with average and maximum performance improvements of 165% and 1071%, respectively. The results of our experiments are statistically significant.
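To make the surrogate-key mechanics concrete, here is a minimal Type-6-style versioning sketch (the schema and helper below are illustrative assumptions, not the paper's SKTDW design). Facts reference `sk`, so a fact row pins the exact attribute version current at load time, and most queries avoid validity-time comparisons:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DimRow:
    sk: int                  # surrogate key, unique per version
    natural_key: str         # business key shared by all versions
    city: str
    valid_from: str
    valid_to: Optional[str]  # None = currently valid

def apply_change(rows, natural_key, new_city, change_date, next_sk):
    for r in rows:
        if r.natural_key == natural_key and r.valid_to is None:
            r.valid_to = change_date            # close the old version
    rows.append(DimRow(next_sk, natural_key, new_city, change_date, None))

dim = [DimRow(1, "C042", "Oslo", "2001-01-01", None)]
apply_change(dim, "C042", "Bergen", "2003-06-15", next_sk=2)
for r in dim:
    print(r)   # two versions, each addressable by its own surrogate key
```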

17.
An asynchronous iterative query processing method for Web data warehouses
何震瀛  李建中  高宏 《软件学报》2002,13(2):214-218
The explosive growth of information in data warehouses poses enormous challenges, and improving the query efficiency of data warehouses in the Web environment has become an important research problem in the data warehouse field. This paper studies the architecture and query methods of Web data warehouses. Based on an analysis of several implementation approaches for Web data warehouses, a hierarchical architecture for Web data warehouses is proposed, and on this basis an asynchronous iterative query processing method is presented. The method makes full use of pipeline parallelism: during query processing, nodes at different levels of the hierarchy run in pipelined fashion and complete query processing in parallel, improving query efficiency. Theoretical analysis shows that the method can effectively improve the query efficiency of Web data warehouses.
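A hedged sketch of the pipelined processing idea (the stage functions and data are placeholders, not the paper's system): each level consumes partial results from the level above and forwards its own results downstream, so all levels work concurrently:

```python
import queue
import threading

SENTINEL = object()

def stage(fn, inbox, outbox):
    """One hierarchy level: transform items until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x * 2, q0, q1)).start()  # level 1
threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2)).start()  # level 2

for chunk in range(5):     # partial results flow through the pipeline
    q0.put(chunk)
q0.put(SENTINEL)

while (r := q2.get()) is not SENTINEL:
    print(r)
```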

18.
A federation of data warehouses is understood as a set of data warehouses that can be processed as a whole at the logical level. Physically, the federation does not gather data into one place. This paper presents a formal framework for data and knowledge processing in data warehouse federations. The management system for a data warehouse federation consists of a user interface for posing user queries, a program for query decomposition, and a program for integrating the knowledge coming from different data warehouses as the answers to a user query. We propose a model for the query decomposition process and for knowledge integration, including an algorithm for processing knowledge inconsistency. Such inconsistency arises frequently because the knowledge extracted from different data warehouses often refers to the same subject yet does not agree.

19.
Indexing is one of the most important techniques for facilitating query processing over a multi-dimensional dataset. A commonly used strategy for such indexing is to keep the tree-structured index balanced. This strategy reduces query processing cost in the worst case and can handle all different queries equally well. In other words, it implicitly assumes that queries are uniformly issued, partly because the query distribution cannot be known in advance and will change over time in practice. A key issue we study in this work is whether it is best to rely fully on a balanced tree-structured index, particularly as datasets grow larger and larger in the big data era. When a dataset becomes very large, it is unreasonable to assume at the index level that all data in every subspace are equally important and uniformly accessed by all queries. Given the existence of query skew and its possible changes over time, in this paper we study how to handle such skew, and changes in it, at the index level, without sacrificing support for arbitrary queries in a well-balanced tree index and without high overhead. To tackle the issue, we propose index-views at the index level, where an index-view is a short-cut in a balanced tree-structured index to access objects in the subspaces that are more frequently accessed, and we propose a new index-view-centric framework for query processing using index-views in a bottom-up manner. We study the index-view selection problem in both static and dynamic settings, and we confirm the effectiveness of our approach using large real and synthetic datasets.
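A toy sketch of the short-cut idea (the structure below is assumed for illustration; the paper's index-views live inside a multi-dimensional tree, not a sorted array): the balanced index still answers every query, while an index-view jumps straight to a hot sub-range and skips the descent:

```python
import bisect

class IndexedTable:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.views = {}                  # (lo, hi) -> precomputed slice bounds

    def add_index_view(self, lo, hi):    # materialize a shortcut for a hot range
        self.views[(lo, hi)] = (bisect.bisect_left(self.keys, lo),
                                bisect.bisect_right(self.keys, hi))

    def range_query(self, lo, hi):
        if (lo, hi) in self.views:       # hot path: no search at all
            i, j = self.views[(lo, hi)]
        else:                            # cold path: ordinary balanced search
            i, j = (bisect.bisect_left(self.keys, lo),
                    bisect.bisect_right(self.keys, hi))
        return self.keys[i:j]

t = IndexedTable(range(0, 1000, 3))
t.add_index_view(30, 60)                 # an observed hot subspace
print(t.range_query(30, 60))
```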

20.
In this paper we present the Brown Dwarf, a distributed data analytics system designed to efficiently store, query, and update multidimensional data over commodity network nodes, without the use of any proprietary tool. Brown Dwarf distributes a centralized indexing structure among peers on the fly, reducing cube creation and querying times through parallelization. Analytical queries are naturally performed on-line through cooperating nodes that form an unstructured peer-to-peer overlay. Updates are also performed on-line, eliminating the usually costly overnight process. Moreover, the system employs an adaptive replication scheme that adjusts to workload skew as well as network churn by expanding or shrinking the units of the distributed data structure. Our system has been thoroughly evaluated on an actual testbed: it accelerates cube creation and querying by up to several tens of times compared to the centralized solution by exploiting the capabilities of the available network nodes working in parallel. It also adapts quickly even after sudden bursts in load, and remains largely unaffected by a considerable fraction of frequent node failures. These advantages are even more apparent for dense and skewed data cubes and workloads.
