Similar Documents
20 similar documents found (search time: 457 ms)
1.
An Asynchronous Iterative Query Processing Method for Web Data Warehouses   (cited by: 2; self-citations: 0; by others: 2)
何震瀛  李建中  高宏 《软件学报》2002,13(2):214-218
The explosive growth of information poses a great challenge to data warehouses, and improving query efficiency in Web-based data warehouses has become an important problem in data warehouse research. This paper studies the architecture and query methods of Web data warehouses. Based on an analysis of several implementation approaches, it proposes a hierarchical architecture for Web data warehouses and, on that basis, an asynchronous iterative query method. The method makes full use of pipelined parallelism: during query processing, nodes at different levels of the Web data warehouse run in pipelined fashion and process queries in parallel, improving query efficiency. Theoretical analysis shows that the method can effectively improve the query efficiency of Web data warehouses.

2.
The rapidly increasing scale of data warehouses is challenging today’s data analytical technologies. A conventional data analytical platform processes data warehouse queries using a star schema: it normalizes the data into a fact table and a number of dimension tables, and during query processing it selectively joins the tables according to users’ demands. This model is space economical. However, it faces two problems when applied to big data. First, join is an expensive operation, which prohibits a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously. Second, join operations have to be executed repeatedly, while numerous join results could actually be reused by different queries. In this paper, we propose a new query processing framework for data warehouses. It pushes the join operations partially to the pre-processing phase and partially to the post-processing phase, so that data warehouse queries can be transformed into massively parallelized filter-aggregation operations on the fact table. In contrast to conventional query processing models, our approach is efficient, scalable and stable despite the large number of tables involved in the join. It is especially suitable for a large-scale parallel data warehouse. Our empirical evaluation on Hadoop shows that our framework exhibits linear scalability and outperforms some existing approaches by an order of magnitude.
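The core idea above, turning a star-schema query into a single scan over a fact table whose dimension attributes were joined in ahead of time, can be sketched in a few lines. The table contents, field names, and the `filter_aggregate` helper below are invented for illustration; this is a single-process sketch of the filter-aggregation step, not the paper's Hadoop framework.

```python
from collections import defaultdict

# Hypothetical mini fact table, already denormalized in a pre-processing
# phase: the dimension attributes (region, year) were joined in ahead of
# time, so the query below needs no runtime join at all.
FACT = [
    {"region": "EU", "year": 2011, "sales": 10.0},
    {"region": "EU", "year": 2012, "sales": 12.0},
    {"region": "US", "year": 2011, "sales": 7.0},
    {"region": "US", "year": 2012, "sales": 9.0},
]

def filter_aggregate(rows, predicate, group_key, measure):
    """Evaluate a star query as one filter-aggregation scan of the fact table."""
    acc = defaultdict(float)
    for row in rows:
        if predicate(row):                  # filter step
            acc[row[group_key]] += row[measure]   # aggregation step
    return dict(acc)

# Total sales per region for 2012 only: one scan, no join.
result = filter_aggregate(FACT, lambda r: r["year"] == 2012, "region", "sales")
```

Because each row is processed independently, the scan parallelizes trivially across fact-table partitions, which is what makes the approach attractive on MapReduce-style systems.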

3.
On-line analytical processing (OLAP) refers to the technologies that allow users to efficiently retrieve data from the data warehouse for decision-support purposes. Data warehouses tend to be extremely large: a data warehouse may well be hundreds of gigabytes to terabytes in size (Chaudhuri and Dayal, 1997). Queries tend to be complex and ad hoc, often requiring computationally expensive operations such as joins and aggregation. Given this, we are interested in developing strategies for improving query processing in data warehouses by exploring the applicability of parallel processing techniques. In particular, we exploit the natural partitionability of a star schema and render it even more efficient by applying DataIndexes, a storage structure that serves as both an index and data and lends itself naturally to vertical partitioning of the data. DataIndexes are derived from the various special-purpose access mechanisms currently supported in commercial OLAP products. Specifically, we propose a declustering strategy which incorporates both task and data partitioning, and present the Parallel Star Join (PSJ) Algorithm, which provides a means to perform a star join in parallel using efficient operations involving only rowsets and projection columns. We compare the performance of the PSJ Algorithm with two parallel query processing strategies. The first is a parallel join strategy utilizing the Bitmap Join Index (BJI), arguably the state-of-the-art OLAP join structure in use today. For the second strategy we choose a well-known parallel join algorithm, namely the pipelined hash algorithm. To assist in the performance comparison, we first develop a cost model of the disk access and transmission costs for all three approaches.
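The rowset idea that PSJ builds on can be shown in miniature: each dimension predicate is first resolved (as a bitmap join index would) to the set of fact-table row ids that satisfy it, and the star join then reduces to intersecting those rowsets before touching a projection column. All data and names below are invented; this illustrates only the rowset operation, not the full parallel algorithm.

```python
# Toy star join via rowsets. Each dimension predicate has already been
# resolved to the fact-table row ids it selects (the rowsets below);
# the join itself is just set intersection plus a projection-column scan.
fact_sales = [100, 200, 150, 300, 250]        # projection column: sales
rows_matching_region = {0, 1, 3}              # rowset from the region dimension
rows_matching_year = {1, 2, 3}                # rowset from the time dimension

matching = rows_matching_region & rows_matching_year  # rowset intersection
total = sum(fact_sales[i] for i in sorted(matching))  # aggregate the projection
```

Note that the fact rows themselves are touched only once, after all predicates have been combined, which is what makes rowset-based joins cheap relative to tuple-at-a-time joins.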

4.
《Information Sciences》2007,177(11):2238-2254
Many data warehouse systems have been developed recently, yet data warehouse practice is not sufficiently sophisticated for practical usage. Most data warehouse systems have some limitations in terms of flexibility, efficiency, and scalability. In particular, the sizes of these data warehouses are ever growing and becoming overloaded with data, a scenario that leads to difficulties in data maintenance and data analysis. This research focuses on data-information integration between data cubes and may contribute to resolving two concerns: the problem of redundancy and the problem of data cubes’ mutually independent information. This work presents a semantic cube model, which extends object-oriented technology to data warehouses and enables users to design generalization relationships between different cubes. In this regard, the objectives are to improve the performance of query integrity and to reduce data duplication in the data warehouse. To handle the increasing data volume in data warehouses, we discovered important inter-relationships that hold among data cubes, that facilitate information integration, and that prevent the loss of data semantics.

5.
Providing integrated access to multiple, distributed, heterogeneous databases and other information sources has become one of the leading issues in database research and industry. One of the most effective approaches is to extract and integrate information of interest from each source in advance and store it in a centralized repository, known as a data warehouse. When a query is posed, it is evaluated directly at the warehouse without accessing the original information sources. One technique this approach uses to improve the efficiency of query processing is materialized views. Essentially, materialized views are used for data warehouses, and various methods for relational databases have been developed. In this paper, we first discuss an object deputy approach to realize materialized object views for data warehouses, one which can also incorporate object-oriented databases. A framework has been developed using Smalltalk to prepare data for data warehousing, in which an object deputy model and database connecting tools have been implemented. The object deputy model provides an easy-to-use way to resolve inconsistency and conflicts while preparing data for data warehousing, as evidenced by our empirical study.

6.
A Multi-Table Join Algorithm for Query Processing in Data Warehouses   (cited by: 22; self-citations: 2; by others: 20)
蒋旭东  周立柱 《软件学报》2001,12(2):190-195
OLAP (online analytical processing) query processing in data warehouses frequently involves multi-table join operations, so improving multi-table join performance is a key problem in the data warehouse field. Based on the star schema of data warehouses, this paper presents a new multi-table join algorithm (M-Join). Compared with multi-table join processing in traditional relational database management systems, the algorithm fully exploits the characteristics of data warehouse data and of multi-table joins: it joins multiple tables in a single pass, which markedly improves query performance. Experimental results and an analysis of the algorithm are also given.
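The single-pass multi-way idea can be sketched as follows: hash every dimension table on its key up front, then resolve all foreign keys of each fact row during one scan, instead of cascading binary joins. Data and column names are invented for illustration; this shows the general one-pass star-join pattern, not the exact M-Join algorithm from the paper.

```python
# One-pass multi-way star join sketch: all dimension tables are hashed on
# their keys, and each fact row resolves every foreign key in a single scan.
dim_product = {1: "book", 2: "pen"}           # product_id -> name
dim_store = {10: "Beijing", 20: "Shanghai"}   # store_id -> city
fact = [(1, 10, 5), (2, 20, 3), (1, 20, 2)]   # (product_id, store_id, qty)

joined = [
    (dim_product[p], dim_store[s], qty)
    for p, s, qty in fact
    if p in dim_product and s in dim_store    # one pass over the fact table
]
```

Because dimension tables in a star schema are typically small relative to the fact table, holding all their hash tables in memory at once is usually feasible, which is what makes the one-pass approach attractive.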

7.
徐强 《计算机科学》2003,30(2):63-65
1. Overview of the virtual data model. Virtual data warehouse technology has attracted growing attention for its open, flexible architecture, demand-driven design, and unlimited extensibility. Compared with the supply-driven character of traditional data warehouses, virtual data warehouses are highly attractive to large companies and enterprises with many heterogeneous, complex data sources from different periods. Building on this technology, this paper proposes a virtual data warehouse model based on query optimization; it uses a multi-level, distributed data structure, in...

8.
A Heuristic Query Optimization Method Using Indexes in Data Warehouses   (cited by: 1; self-citations: 0; by others: 1)
Query processing in large data warehouses often involves joins over multiple fact tables. Traditional query optimization methods try to minimize the size of intermediate relations when computing multi-relation joins; they do not account for the facts that data warehouse data is massive and read-mostly and that fact tables usually carry indexes, so they often fail to achieve optimal results. Targeting the characteristics of data warehouse queries, this paper proposes a heuristic optimization method that uses indexes to speed up queries. Theoretical analysis and experiments show that the method clearly reduces both query processing cost and execution time, demonstrating its effectiveness.

9.
Query-Structure-Based User Browsing Guidance in OLAP Systems   (cited by: 4; self-citations: 0; by others: 4)
On-line analytical processing (OLAP) systems are the main front-end support tools for data warehouses; in an OLAP system, users access data by browsing. OLAP users generally have relatively stable information needs, and the structure of their queries reflects, to some degree, the information they care about, so query structure also remains fairly stable over time. Taking query structure as its basis, this paper analyzes the query behavior of OLAP users, proposes a method for building user profiles for OLAP systems, and discusses how to use these profiles to predict user behavior and guide user browsing. On this basis, while a user is browsing, the OLAP front end can mark places likely to interest the user and guide the user's next browsing actions, allowing information searches to be completed with ease.

10.
Yao Liu  Hui Xiong 《Information Sciences》2006,176(9):1215-1240
A data warehouse stores current and historical records consolidated from multiple transactional systems. Securing data warehouses is of ever-increasing interest, especially in areas where data are sold in pieces to third parties for data mining practices. In this case, existing data warehouse security techniques, such as data access control, may not be easy to enforce and can be ineffective. Instead, this paper proposes a data-perturbation-based approach, called the cubic-wise balance method, to provide privacy-preserving range queries on data cubes in a data warehouse. This approach is motivated by the following observation: analysts are usually interested in summary data rather than individual data values. Indeed, our approach can provide closely estimated summary data for range queries without providing access to actual individual data values. As demonstrated by our experimental results on the APB benchmark data set from the OLAP Council, the cubic-wise balance method can achieve both better privacy preservation and better range query accuracy than random data perturbation alternatives.
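The cubic-wise balance method itself is not reproduced here, but the observation it rests on, that summary answers can stay accurate while individual cells are masked, can be illustrated with a deliberately simpler stand-in: a zero-sum perturbation, where noise is added to every cell but constrained to sum to zero, so the grand total survives exactly. The function name, noise scale, and data below are invented.

```python
import random

def zero_sum_perturb(values, scale=5.0, seed=42):
    """Mask individual values with noise whose sum is forced to zero,
    so any query over the full range still returns the exact total."""
    rng = random.Random(seed)
    noise = [rng.uniform(-scale, scale) for _ in values]
    correction = sum(noise) / len(noise)
    noise = [n - correction for n in noise]   # now sum(noise) == 0 (up to rounding)
    return [v + n for v, n in zip(values, noise)]

cells = [120.0, 80.0, 95.0, 60.0]             # true (private) cell values
masked = zero_sum_perturb(cells)              # published, perturbed cells
# sum(masked) equals sum(cells) up to float rounding, while each
# individual cell no longer reveals its true value.
```

Real schemes such as the paper's must also keep *partial* range sums accurate, which this naive sketch does not attempt; it only demonstrates the summary-versus-individual trade-off.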

11.
Research on Implementation Techniques for Large Data Warehouses   (cited by: 2; self-citations: 0; by others: 2)
Large data warehouses are an effective way to store massive data, but their implementation raises many problems. After analyzing these problems, this paper proposes strategies for implementing large data warehouses and studies several key techniques: efficient computation of data cubes, incremental update maintenance, index optimization, failure recovery, cost models for schema design and query optimization, and the definition and management of metadata.

12.
Developing a data warehouse is an ongoing task where new requirements are constantly being added. A widely accepted approach for developing data warehouses is the hybrid approach, where requirements and data sources must be accommodated to a reconciled data warehouse model. During this process, relationships between conceptual elements specified by user requirements and those supplied by the data sources are lost, since no traceability mechanisms are included. As a result, the designer wastes additional time and effort updating the data warehouse whenever user requirements or data sources change. In this paper, we propose an approach to preserve traceability at the conceptual level for data warehouses. Our approach includes a set of traces and their formalization, in order to relate the multidimensional elements specified by user requirements with the concepts extracted from data sources. Therefore, we can easily identify how changes should be incorporated into the data warehouse, and derive it according to the new configuration. In order to minimize the effort required, we define a set of general Query/View/Transformation rules to automate the derivation of traces along with data warehouse elements. Finally, we describe a CASE tool that supports our approach and provide a detailed case study to show the applicability of the proposal.

13.
Materialized view selection is one of the important problems in data warehouse research. Materialized views stored in a data warehouse mainly serve OLAP queries, for which user query response time is the primary concern. This paper formulates the query-cost view selection problem and gives its cost model, then proposes a genetic-algorithm-based method and strategy for solving it. Experiments show that the algorithm achieves good results with high efficiency.
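A genetic algorithm for view selection of the kind described can be sketched as follows: each bit of a chromosome marks one candidate view as materialized, and fitness rewards query-cost savings while a storage budget rules out oversized selections. All costs, sizes, the budget, and the GA parameters below are invented; this is a toy illustration of the technique, not the paper's algorithm or cost model.

```python
import random

VIEW_SIZE = [50, 30, 40, 20]      # storage cost of each candidate view
QUERY_SAVING = [90, 35, 60, 25]   # query-cost saving if that view is materialized
BUDGET = 80                        # total storage available for views

def fitness(bits):
    """Total query saving of a selection; zero if it exceeds the budget."""
    size = sum(s for s, b in zip(VIEW_SIZE, bits) if b)
    if size > BUDGET:
        return 0.0                 # infeasible under the storage constraint
    return float(sum(g for g, b in zip(QUERY_SAVING, bits) if b))

def evolve(pop_size=20, generations=60, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in VIEW_SIZE] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]       # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(VIEW_SIZE))
            child = a[:cut] + b[cut:]          # one-point crossover
            if rng.random() < 0.1:             # occasional bit-flip mutation
                i = rng.randrange(len(child))
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()    # best selection found within the storage budget
```

Elitism (keeping the top half each generation) guarantees the best selection found so far is never lost, which matters when the fitness landscape has hard infeasibility cliffs like the budget constraint here.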

14.
Quality Management Problems and Methods for Data Warehouses   (cited by: 1; self-citations: 1; by others: 1)
A data warehouse is often a large-scale information system for an enterprise, so its quality management is important and difficult. Recently, some researchers have studied the problems of quality management in data warehouses from different views and achieved some good results. This paper broadly introduces the concepts, methods, and techniques of quality management in data warehouses, and discusses the important quality factors in data warehouses in detail.

15.
《Information Systems》2005,30(2):133-149
Data warehouses collect masses of operational data, allowing analysts to extract information by issuing decision support queries on the otherwise discarded data. In many application areas (e.g. telecommunications), the warehoused data sets are multiple terabytes in size. Parts of these data sets are stored on very large disk arrays, while the remainder is stored on tape-based tertiary storage (which is one to two orders of magnitude less expensive than on-line storage). However, the inherently sequential nature of access to tape-based tertiary storage makes efficient access to tape-resident data difficult to accomplish through conventional databases. In this paper, we present a way to make access to a massive tape-resident data warehouse easy and efficient. Ad hoc decision support queries usually involve large-scale, complex aggregation over the detail data. These queries are difficult to express in SQL, and frequently require self-joins on the detail data (which are prohibitively expensive on disk-resident data and infeasible to compute on tape-resident data) or unnecessary multiple passes through the detail data. An extension to SQL, extended multi-feature SQL (EMF SQL), expresses complex aggregation computations clearly without using self-joins. The detail data in a data warehouse usually represents a record of past activities, and is therefore temporal. We show that complex queries involving sequences can be easily expressed in EMF SQL. An EMF SQL query can be optimized to minimize the number of passes through the detail data required to evaluate it, in many cases to only one pass. We describe an efficient query evaluation algorithm along with a query optimization algorithm that minimizes the number of passes through the detail data and the amount of main memory required to evaluate the query. These algorithms are useful not only in the context of tape-resident data warehouses but also in data stream systems, which require similar processing techniques.

16.
Materialized views can effectively improve query efficiency in spatial data warehouses, but because of the complexity of spatial operations, view selection algorithms from traditional data warehouses do not transfer well to spatial data warehouses. To select queries for materialization under a storage space constraint and to adjust the materialized view set dynamically, accommodating the time-varying nature of user queries as well as ad hoc queries, this paper proposes a spatial materialized view selection algorithm, SMVS. Experimental results show that the algorithm is effective and feasible: it not only improves query performance but also solves the problem of query response performance degrading as the distribution of user queries changes.

17.
Data warehouses are built to answer queries efficiently over data integrated from various systems. To improve system performance, the issue of materializing views within data warehouses must be explored. This involves pre-computing a set of selected views, which are fact and dimension tables, under given resource and quality constraints. The quality constraints include query processing time, data maintenance time, and the freshness of data when queries are placed. Then there is the update policy, which governs the timing of data reloading in data warehouses. A model is proposed to determine the view selection and update policy when query arrivals follow Poisson processes, under constraints of system response time, storage space, and query-dependent currency of data (on systems capable of periodic and query-triggered updates). To the best of the researchers’ knowledge, no other research has considered all these factors in a single model. A two-phase greedy algorithm was developed to determine the optimal update policy for the view selection problem. Numerous experiments were performed to explore the sensitivity of the proposed model under various constraints and system parameter settings. The results show that the model responds reasonably to tuning and that the proposed algorithm can rapidly find acceptable solutions.

18.
Analysis of historical data in data warehouses contributes significantly toward future decision-making. A number of design factors, including slowly changing dimensions (SCDs), affect the quality of such analysis. In SCDs, attribute values may change over time and must be tracked. They should maintain consistency and correctness of data, and show good query performance. We identify that SCDs can have three types of validity periods: disjoint, overlapping, and same validity periods. We then show that the third type cannot be handled through the temporal star schema for temporal data warehouses (TDWs). We further show that a hybrid/Type 6 scheme and temporal star schema may be used to handle this shortcoming. We demonstrate that the use of a surrogate key in the hybrid scheme efficiently identifies data, avoids most time comparisons, and improves query performance. Finally, we compare the TDW and a surrogate key-based temporal data warehouse (SKTDW) using query formulation, query performance, and data warehouse size as parameters. The results of our experiments for 23 queries of five different types show that SKTDW outperforms TDW for all types of queries, with average and maximum performance improvements of 165% and 1071%, respectively. The results of our experiments are statistically significant.

19.
This paper presents and evaluates a simple but very effective method to implement large data warehouses on an arbitrary number of computers, achieving very high query execution performance and scalability. The data is distributed and processed on a potentially large number of autonomous computers using our technique called data warehouse striping (DWS). The major problem of the DWS technique is that it would require a very expensive cluster of computers with fault-tolerant capabilities to prevent a fault in a single computer from stopping the whole system. In this paper, we propose a radically different approach to deal with the unavailability of one or more computers in the cluster, allowing the use of DWS with a very large number of inexpensive computers. The proposed approach is based on approximate query answering techniques that make it possible to deliver an approximate answer to the user even when one or more computers in the cluster are not available. The evaluation presented in the paper shows both analytically and experimentally that the approximate results obtained this way have a very small error that is negligible in most cases.
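The approximate-answering idea can be sketched simply: when rows are striped uniformly across nodes, the partial aggregate from the live nodes can be scaled up by the fraction of nodes available to estimate the full answer. Node counts and partial sums below are invented for illustration; real DWS error bounds depend on how evenly the striping distributes the data.

```python
def approximate_sum(node_partials, available):
    """Estimate the global sum from the partial sums of the available nodes,
    assuming rows were striped uniformly across all nodes."""
    live = [node_partials[i] for i in available]
    n_total, n_live = len(node_partials), len(live)
    return sum(live) * n_total / n_live   # scale up for the missing nodes

partials = [100.0, 98.0, 102.0, 100.0]    # per-node partial sums of a measure
exact = sum(partials)                      # answer with all nodes up
estimate = approximate_sum(partials, available=[0, 1, 3])   # node 2 is down
```

With uniform striping, each node's partial is close to the per-node average, so losing one node out of many perturbs the scaled estimate only slightly, which is the source of the small errors reported in the paper.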

20.
为使数据仓库更好地为决策支持服务,本文提出了一种面向主题的智能查询方法,增强了数据仓库查询的智能化程度,使用户能够基于领域专家知识进行动态查询,并且为用户的查询提供更多的相关信息。  相似文献   


Copyright©北京勤云科技发展有限公司  京ICP备09084417号