Similar literature
 Found 20 similar documents; search took 15 ms
1.
The design of an OLAP system that supports real-time queries is a major research issue. One approach is to use data cubes, which are materialized, precomputed multidimensional views of data in a data warehouse. We can derive a set of data cubes to answer each frequently asked query directly. However, there are two practical problems: (1) the maintenance cost of the data cubes, and (2) the query cost of answering those queries. Maintaining a data cube requires disk storage and CPU computation, so the maintenance cost is related to both the total size and the total number of data cubes materialized. In most cases, materializing all data cubes is impractical. The maintenance cost may be reduced by merging some data cubes. However, the resulting larger data cubes will increase the cost of answering some queries. If the bounds on the maintenance cost and the query cost are too strict, we help the user decide which queries to sacrifice and exclude from consideration. We have defined an optimization problem in data cube system design. Given a maintenance-cost bound, a query-cost bound and a set of frequently asked queries, the task is to determine a set of data cubes such that the system can answer the largest possible subset of the queries without violating the two bounds. This is an NP-hard problem. We propose the approximate greedy algorithms GR, 2GM and 2GMM, which experiments on a census data set and a forest-cover-type data set show to be both effective and efficient.

2.
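Optimization problems in data cube system design   Total citations: 2, self-citations: 0, citations by others: 2
The design of online analytical processing (OLAP) systems that support real-time queries is currently an important research problem. A common approach is to use data cubes. For frequently asked queries, a corresponding set of data cubes can be derived so that each query can be answered directly. However, designing a cube-based system requires considering two problems: (1) the maintenance cost of the data cubes, and (2) the response time for answering the frequent queries. Given user-specified upper bounds on maintenance cost and response time, the set of data cubes must be optimized so that the system meets the user's requirements and answers as many queries as possible. This paper defines the optimization problem in data cube system design, which is NP-complete, and proposes approximation algorithms based on greedy removal and greedy merging. Experiments demonstrate the effectiveness of the algorithms.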

3.
View materialization is an effective method to increase query efficiency in a data warehouse and improve OLAP query performance. However, storage space becomes insufficient if all possible views are materialized in advance. Reducing query time by selecting a proper set of materialized views at a lower cost is crucial for efficient data warehousing. In addition, the costs of data warehouse creation, query, and maintenance have to be taken into account when views are materialized. In this paper, we propose efficient algorithms to select a proper set of materialized views, constrained by storage and cost considerations, to help speed up the entire data warehousing process. We derive a cost model for data warehouse query and maintenance as well as efficient view selection algorithms that effectively exploit the gain and loss metrics. The main contribution of our paper is to speed up the selection process of materialized views. At the same time, this greatly reduces the overall cost of data warehouse query and maintenance.

4.
Selection of views to materialize in a data warehouse   Total citations: 4, self-citations: 0, citations by others: 4
A data warehouse stores materialized views of data from one or more sources, with the purpose of efficiently implementing decision-support or OLAP queries. One of the most important decisions in designing a data warehouse is the selection of materialized views to be maintained at the warehouse. The goal is to select an appropriate set of views that minimizes total query response time and the cost of maintaining the selected views, given a limited amount of resources, e.g., materialization time, storage space, etc. In this work, we have developed a theoretical framework for the general problem of selection of views in a data warehouse. We present polynomial-time heuristics for selecting views to optimize total query response time under a disk-space constraint, for some important special cases of the general data warehouse scenario, viz.: 1) an AND view graph, where each query/view has a unique evaluation, e.g., when a multiple-query optimizer can be used to generate a global evaluation plan for the queries, and 2) an OR view graph, in which any view can be computed from any one of its related views, e.g., data cubes. We present proofs showing that the algorithms are guaranteed to provide a solution that is fairly close to (within a constant factor ratio of) the optimal solution. We extend our heuristic to general AND-OR view graphs. Finally, we address in detail the view-selection problem under a maintenance cost constraint and present provably competitive heuristics.
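
The idea of selecting views under a disk-space constraint can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a data-cube lattice (an OR view graph), a linear cost model in which a query costs as many rows as the smallest materialized view that can answer it, and one query per view. The function name `greedy_select`, the `sizes` table and the budget are all hypothetical.

```python
def greedy_select(sizes, base, budget):
    """Greedy benefit-per-unit-space view selection (illustrative sketch).

    sizes:  {frozenset of dimensions: row count}, all counts > 0 (assumed).
    base:   the finest view; it is always materialized and not charged.
    budget: extra space, in rows, available for additional views.
    """
    def cost(query, views):
        # Assumed cost model: scan the smallest view that covers the query.
        return min(sizes[v] for v in views if query <= v)

    def total(views):
        # Assumed workload: one query per view in the lattice.
        return sum(cost(q, views) for q in sizes)

    chosen, used = {base}, 0
    while True:
        current = total(chosen)
        best, best_ratio = None, 0.0
        for v in sizes:
            if v in chosen or used + sizes[v] > budget:
                continue
            ratio = (current - total(chosen | {v})) / sizes[v]
            if ratio > best_ratio:
                best, best_ratio = v, ratio
        if best is None:          # nothing fits or nothing helps
            return chosen
        chosen.add(best)
        used += sizes[best]


# Hypothetical lattice over two dimensions, a and b.
sizes = {
    frozenset("ab"): 100,
    frozenset("a"): 20,
    frozenset("b"): 50,
    frozenset(): 1,
}
chosen = greedy_select(sizes, base=frozenset("ab"), budget=25)
```

With this budget the sketch keeps the base view and adds the empty group-by and the view on `a`; the view on `b` does not fit.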

5.
View materialization is one of the most important techniques applied in multidimensional databases. The problem of selecting a set of views for materialization that minimizes query response time under a storage space constraint has received significant attention over the last twenty years. Many researchers concentrate on designing better view selection methods with respect to the running time or the cost of the solution. This paper summarizes our research on the problem of how much space should be allocated for view materialization to ensure good query performance. In order to comprehensively investigate the problem and minimize the influence of untypical cases, the experiments described in this paper were done on a large data set, including large data cubes, rarely considered in previous papers. In particular, the relation between the number of data cube views and the space limit, expressed as a percentage of the fully materialized data cube size and as a multiple of the base view size, is analysed. According to our experimental results, allocating a large amount of space for view materialization is not cost effective.

6.
The data cube operator computes group-bys for all possible combinations of a set of dimension attributes. Since computing a data cube typically incurs a considerable cost, the data cube is often precomputed and stored as materialized views in data warehouses. A materialized data cube needs to be updated when the source relations are changed. The incremental maintenance of a data cube is to compute and propagate only its changes, rather than recompute the entire data cube from scratch. For n dimension attributes, the data cube consists of 2^n group-bys, each of which is called a cuboid. To incrementally maintain a data cube with 2^n cuboids, the conventional methods compute 2^n delta cuboids, each of which represents the change of a cuboid. In this paper, we propose an efficient incremental maintenance method that can maintain a data cube using only a subset of the 2^n delta cuboids. We formulate an optimization problem to find the optimal subset of the 2^n delta cuboids that minimizes the total maintenance cost, and propose a heuristic solution that allows us to maintain a data cube using only delta cuboids. As a result, the cost of maintaining a data cube is substantially reduced. Through various experiments, we show the performance advantages of the proposed method over the conventional methods. We also extend the proposed method to handle partially materialized cubes and dimension hierarchies.
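
To make the count of cuboids concrete, here is a minimal sketch of a naive full cube computation: one group-by per subset of the n dimension attributes, hence 2^n cuboids. It is only an illustration of the data cube operator, not the incremental maintenance method of the abstract; the function name `data_cube`, the SUM aggregate and the sample rows are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dims, measure):
    """Return {group-by attributes: {group key: SUM(measure)}} for every cuboid."""
    cube = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):   # every subset of the dimensions
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                cuboid[key] += row[measure]
            cube[group_by] = dict(cuboid)
    return cube


# Hypothetical fact table with n = 2 dimensions.
rows = [
    {"region": "east", "year": 2020, "sales": 10},
    {"region": "east", "year": 2021, "sales": 5},
    {"region": "west", "year": 2020, "sales": 7},
]
cube = data_cube(rows, ("region", "year"), "sales")
```

Here `cube` holds 2^2 = 4 cuboids: the grand total `()`, `("region",)`, `("year",)` and `("region", "year")`. Each cuboid rescans the rows, which is why practical methods share work between cuboids or, as in the abstract, maintain only some delta cuboids.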

7.
We present a new full cube computation technique and a cube storage representation approach, called the multidimensional cyclic graph (MCG) approach. The data cube relational operator has exponential complexity and therefore its materialization involves both a huge amount of memory and a substantial amount of time. Reducing the size of data cubes, without a loss of generality, thus becomes a fundamental problem. Previous approaches, such as Dwarf, Star and MDAG, have substantially reduced the cube size using graph representations. In general, they eliminate prefix redundancy and some suffix redundancy from a data cube. The MCG differs significantly from previous approaches as it completely eliminates prefix and suffix redundancies from a data cube. A data cube can be viewed as a set of sub-graphs. In general, redundant sub-graphs are quite common in a data cube, but eliminating them is a hard problem. Dwarf, Star and MDAG approaches only eliminate some specific common sub-graphs. The MCG approach efficiently eliminates all common sub-graphs from the entire cube, based on an exact sub-graph matching solution. We propose a matching function to guarantee one-to-one mapping between sub-graphs. The function is computed incrementally, in a top-down fashion, and its computation uses a minimal amount of information to generate unique results. In addition, it is computed for any measurement type: distributive, algebraic or holistic. MCG performance analysis demonstrates that MCG is 20-40% faster than Dwarf, Star and MDAG approaches when computing sparse data cubes. Dense data cubes have a small number of aggregations, so there is not enough room for runtime and memory consumption optimization, therefore the MCG approach is not useful in computing such dense cubes. The compact representation of sparse data cubes enables the MCG approach to reduce memory consumption by 70-90% when compared to the original Star approach, proposed in [33]. 
In the same scenarios, the improved Star approach, proposed in [34], reduces memory consumption by only 10-30%, Dwarf by 30-50% and MDAG by 40-60%, when compared to the original Star approach. The MCG is the first approach that uses an exact sub-graph matching function to reduce cube size, avoiding unnecessary aggregation, i.e. improving cube computation runtime.

8.
This article presents a method for adaptively representing multidimensional data cubes using wavelet view elements in order to more efficiently support data analysis and querying involving aggregations. The proposed method decomposes the data cubes into an indexed hierarchy of wavelet view elements. The view elements differ from traditional data cube cells in that they correspond to partial and residual aggregations of the data cube. The view elements provide highly granular building blocks for synthesizing the aggregated and range-aggregated views of the data cubes. We propose a strategy for selectively materializing alternative sets of view elements based on the patterns of access of views. We present a fast and optimal algorithm for selecting a non-expansive set of wavelet view elements that minimizes the average processing cost for supporting a population of queries of data cube views. We also present a greedy algorithm for allowing the selective materialization of a redundant set of view element sets which, for measured increases in storage capacity, further reduces processing costs. Experiments and analytic results show that the wavelet view element framework performs better in terms of lower processing and storage cost than previous methods that materialize and store redundant views for online analytical processing (OLAP).

9.
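The materialized view selection problem is one of the most important problems in data warehouse design. To solve it efficiently, an enhanced genetic algorithm for selecting a set of materialized views is proposed, aiming to achieve good query performance and low view maintenance cost under a storage space constraint. The core idea is as follows. First, a preprocessing algorithm based on the maximum benefit per unit space generates an initial solution. This initial solution is then improved by a genetic algorithm that employs several optimization strategies, including a selection operator that combines improved tournament selection with elitist selection, a half-uniform crossover operator, and an adaptive mutation operator. In addition, invalid solutions produced during evolution are repaired with a loss function. Experimental results show that the algorithm outperforms heuristic algorithms and the classical genetic algorithm in optimization performance.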

10.
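The closed data cube is an effective lossless compression technique. It removes redundant information from the data cube, thereby reducing storage space and speeding up computation with almost no impact on query performance. Hadoop's MapReduce parallel computing model provides technical support for data cube computation, and Hadoop's distributed file system, HDFS, provides reliable storage for data cubes. To save storage space and speed up queries, the closed histogram cube is proposed on the basis of the traditional data cube. Building on the closed data cube, it further reduces storage space through encoding techniques and speeds up queries by building indexes. The Hadoop parallel computing platform supports the closed histogram cube in terms of both scalability and load balance. Experiments show that the closed histogram cube compresses the data cube effectively and offers high query performance, and that, in line with Hadoop's characteristics, adding nodes markedly speeds up computation.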

11.
The lifecycle of a data cube involves efficient construction and storage, fast query answering, and incremental updating. Existing ROLAP methods that implement data cubes are weak with respect to one or more of the above, focusing mainly on construction and storage. In this paper, we present a comprehensive ROLAP solution that addresses efficiently all functionality in the lifecycle of a cube and can be implemented easily over existing relational servers. It is a family of algorithms centered around a purely ROLAP construction method that provides fast computation of a fully materialized cube in compressed form, is incrementally updateable, and exhibits quick query response times that can be improved by low-cost indexing and caching. This is demonstrated through comprehensive experiments on both synthetic and real-world datasets, whose results have shown great promise for the performance and scalability potential of the proposed techniques, with respect to both the size and dimensionality of the fact table. The project is co-financed within Op. Education by the ESF (European Social Fund) and National Resources.

12.
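A survey of data cube computation methods   Total citations: 2, self-citations: 0, citations by others: 2
With the wide application of multidimensional data analysis in many fields, computation methods based on data cubes have attracted the attention of many researchers. This survey analyzes the factors that affect data cube computation, including data storage space, query processing efficiency, and the maintenance cost of the data cube, and describes materialization strategies for data cubes. It reviews existing computation methods in China and abroad from several perspectives, namely iceberg cubes, compact data cubes, high-dimensional data cubes, approximate computation, and stream data cubes, and analyzes the characteristics and applicable scope of each method.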

13.
This paper deals with the problem of physical clustering of multidimensional data that are organized in hierarchies on disk in a hierarchy-preserving manner. This is called hierarchical clustering. A typical case, where hierarchical clustering is necessary for reducing I/Os during query evaluation, is the most detailed data of an OLAP cube. The presence of hierarchies in the multidimensional space results in an enormous search space for this problem. We propose a representation of the data space that results in a chunk-tree representation of the cube. The model is adaptive to the cube's extensive sparseness and provides efficient access to subsets of data based on hierarchy value combinations. Based on this representation of the search space we formulate the problem as a chunk-to-bucket allocation problem, which is a packing problem as opposed to the linear ordering approach followed in the literature. We propose a metric to evaluate the quality of hierarchical clustering achieved (i.e., evaluate the solutions to the problem) and formulate the problem as an optimization problem. We prove its NP-hardness and provide an effective solution based on a linear-time greedy algorithm. The solution of this problem leads to the construction of the CUBE File data structure. We analyze in depth all steps of the construction and provide solutions for interesting sub-problems arising, such as the formation of bucket-regions, the storage of large data chunks and the caching of the upper nodes (root directory) in main memory. Finally, we provide an extensive experimental evaluation of the CUBE File's adaptability to the data space sparseness as well as to an increasing number of data points. The main result is that the CUBE File is highly adaptive to even the sparsest data spaces and, for realistic cases of data point cardinalities, provides hierarchical clustering of high quality and significant space savings.

14.
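Materialized view selection strategy in data warehouses   Total citations: 2, self-citations: 0, citations by others: 2
To improve the response efficiency of decision-support and OLAP queries, data warehouses commonly adopt materialized views. The selection strategy for materialized views is therefore one of the important problems in data warehouse research. Its goal is to select a set of materialized views that minimizes the sum of storage cost, maintenance cost, and query cost. This paper proposes a new materialized view selection algorithm, VSMF (views selection base on multi-factor), which uses the MVPP (multi-view processing plan) as the search space for view selection. Under a storage space constraint, the algorithm achieves multi-query optimization and view maintenance optimization at the same time.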

15.
In this paper, we study the quality-of-service (QoS)-aware replica placement problem in grid environments. Although there has been much work on the replica placement problem in parallel and distributed systems, most of it concerns average system performance and does not address the important issue of quality-of-service requirements. The few existing works that take QoS into consideration assume a simplified replication model, so their solutions may not be applicable to real systems. In this paper, we propose a more realistic model for replica placement, which considers the storage cost, update cost, and access cost of data replication, and also assumes that the capacity of each replica server is bounded. QoS-aware replica placement is NP-complete even in the simple model. We propose two heuristic algorithms, called greedy remove and greedy add, to approximate the optimal solution. Our extensive experimental results demonstrate that both greedy remove and greedy add find a near-optimal solution effectively and efficiently. Our algorithms can also adapt to various parallel and distributed environments.

16.
OLAP cubes provide exploratory query capabilities combining joins and aggregations at multiple granularity levels. However, cubes cannot intuitively or directly show the relationship between measures aggregated at different grouping levels. One prominent example is the percentage, which is widely used in most analytical applications. Considering this limitation, we introduce the percentage cube as a generalized data cube that takes percentages as its basic measure. More precisely, a percentage cube shows the fractional relationship in every cuboid between each aggregated measure on several dimensions and its rolled-up measure aggregated by fewer dimensions. We propose the syntax and introduce query optimizations to materialize the percentage cube. We show that percentage cubes are significantly harder to evaluate than standard data cubes because, in addition to the exponential number of cuboids, there is an additional exponential number of grouping column pairs (grouping columns at the individual level and the total level) on which percentages are computed. We propose alternative methods to prune the cube to identify interesting percentages, including a row count threshold, a percentage threshold, and selecting the top k percentages. We study percentage aggregations within the classification of distributive, algebraic, and holistic functions. Finally, we also consider the problem of incremental computation of the percentage cube. Experiments compare our query optimizations with existing SQL functions, evaluate the impact and speed of lattice pruning methods and study the effectiveness of the incremental computation.
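
The fractional relationship described above can be sketched for a single pair of grouping levels: each group at the individual level is divided by the total of the rolled-up group it belongs to. This is a minimal illustration under assumptions, not the paper's syntax or optimizations; `percentages`, its parameters and the sample rows are hypothetical, the total-level columns must be a subset of the individual-level columns, and every total is assumed to be non-zero.

```python
from collections import defaultdict


def percentages(rows, detail_dims, total_dims, measure):
    """Share of each detail-level group within its total-level group."""
    detail = defaultdict(float)
    totals = defaultdict(float)
    for row in rows:
        detail[tuple(row[d] for d in detail_dims)] += row[measure]
        totals[tuple(row[d] for d in total_dims)] += row[measure]

    # Positions of the total-level columns inside a detail-level key.
    positions = [detail_dims.index(d) for d in total_dims]
    result = {}
    for key, value in detail.items():
        total_key = tuple(key[i] for i in positions)
        result[key] = value / totals[total_key]   # assumes a non-zero total
    return result


# Hypothetical sales rows: city is the individual level, region the total level.
rows = [
    {"region": "east", "city": "A", "sales": 30},
    {"region": "east", "city": "B", "sales": 10},
    {"region": "west", "city": "C", "sales": 5},
]
shares = percentages(rows, ("region", "city"), ("region",), "sales")
```

A full percentage cube would repeat this for every pair of individual-level and total-level grouping columns, which is the additional exponential factor the abstract mentions.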

17.
On-line analytical processing (OLAP) typically involves complex aggregate queries over large datasets. The data cube has been proposed as a structure that materializes the results of such queries in order to accelerate OLAP. A significant fraction of the related work has been on Relational-OLAP (ROLAP) techniques, which are based on relational technology. Existing ROLAP cubing solutions mainly focus on “flat” datasets, which do not include hierarchies in their dimensions. Nevertheless, as shown in this paper, the nature of hierarchies introduces several complications into the entire lifecycle of a data cube including the operations of construction, storage, indexing, query processing, and incremental maintenance. This fact renders existing techniques essentially inapplicable in a significant number of real-world applications and mandates revisiting the entire cube lifecycle under the new perspective. In order to overcome this problem, the CURE algorithm has been recently proposed as an efficient mechanism to construct complete cubes over large datasets with arbitrary hierarchies and store them in a highly compressed format, compatible with the relational model. In this paper, we study the remaining phases in the cube lifecycle and introduce query-processing and incremental-maintenance algorithms for CURE cubes. These are significantly different from earlier approaches, which have been proposed for flat cubes constructed by other techniques and are inadequate for CURE due to its high compression rate and the presence of hierarchies. Our methods address issues such as cube indexing, query optimization, and lazy update policies. Especially regarding updates, such lazy approaches are applied for the first time on cubes. We demonstrate the effectiveness of CURE in all phases of the cube lifecycle through experiments on both real-world and synthetic datasets. 
Among the experimental results, we distinguish those that have made CURE the first ROLAP technique to complete the construction and usage of the cube of the highest-density dataset in the APB-1 benchmark (12 GB). CURE was in fact quite efficient on this, showing great promise with respect to the potential of the technique overall.

18.
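Path planning queries are a fundamental problem on graph data and have important applications in many fields. In practice, the queried paths are usually subject to constraints. In food delivery and ride sharing, for example, paths carry node constraints: the path must satisfy precedence relations between nodes. For path queries with node constraints, most existing work studies the single-source case, which is difficult to extend to the multi-source case. Because the multi-source path query problem with node constraints is NP-hard, most existing methods for it rely on greedy incremental processing, which does not scale well to static rule sets. This paper therefore proposes a heuristic algorithm based on sub-paths and an exact algorithm based on constraint-set expansion, and verifies their effectiveness on real data sets. Experimental results show that the exact algorithm yields exact solutions to the problem, while the heuristic algorithm quickly produces good approximate solutions.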

19.
Because it operates under a strict time constraint, query processing for data streams should be continuous and rapid. To guarantee this constraint, most previous research optimizes the evaluation order of multiple join operations in a set of continuous queries using a greedy optimization strategy, so that the order is re-optimized dynamically at run time in response to the time-varying characteristics of data streams. However, this method often results in a sub-optimal plan because the greedy strategy traces only the first promising plan. This paper proposes a new multiple query optimization approach, Adaptive Sharing-based Extended Greedy Optimization Approach (A-SEGO), that traces multiple promising partial plans simultaneously. A-SEGO presents a novel method for sharing the results of common sub-expressions in a set of queries cost-effectively. The number of partial plans can be flexibly controlled according to the query processing workload. In addition, to avoid invoking the optimization process too frequently, optimization is performed only when the current execution plan is no longer relatively efficient. A series of experiments is comparatively analyzed to evaluate the performance of the proposed method in various stream environments.

20.
The View Selection Problem is an optimization problem designed to enhance query performance through the pre-computation and storage of selected views given resource constraints. Ensuring that the materialized views can be updated within a reasonable time frame has become a chief concern for recent models. However, these methods are crafted simply to fit a solution within a feasible range and not to minimize the resource-intensive maintenance process. In this paper, we submit two novel advances in terms of model formulation and solution generation to reduce maintenance costs. Our proposed model, the Minimum-Maintenance View Selection Problem, combines previous techniques to minimize and constrain update costs. Furthermore, we define a series of maintenance-time-reducing principles in solution generation, embodied in a constructor heuristic. The model and constructor heuristic are evaluated using an existing clinical data warehouse and state-of-the-art heuristics. Our analysis shows that our model produces the lowest-cost solution relative to extant models. The results also indicate that algorithms seeded with our constructor heuristic produce solutions superior to those of all other methods tested.
