Similar Documents
16 similar documents found (search time: 187 ms)
1.
Research on Optimizing the ETL Execution Process   (cited 2 times in total: 0 self-citations, 2 by others)
This paper proposes an ETL (Extraction-Transformation-Loading) optimization framework and studies the logical optimization of the ETL process, modeling the optimization problem as a state-space search problem. Each ETL workflow is treated as a state, the state space is constructed through a series of correct state transitions, and an algorithm is proposed to obtain the ETL workflow with the minimal execution time. Theoretical analysis and practice show that the approach works well.
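A minimal sketch of this state-space formulation, under illustrative assumptions (a state is an ordered tuple of activities, a correct transition swaps adjacent commutative activities, and the cost function is a toy lookup, none of which come from the paper itself):

```python
from collections import deque

def neighbors(workflow, commutes):
    """Yield states reachable by one correct transition: swapping
    adjacent activities that are allowed to commute."""
    for i in range(len(workflow) - 1):
        a, b = workflow[i], workflow[i + 1]
        if (a, b) in commutes or (b, a) in commutes:
            swapped = list(workflow)
            swapped[i], swapped[i + 1] = b, a
            yield tuple(swapped)

def exhaustive_search(initial, cost, commutes):
    """Breadth-first exploration of the state space, returning the
    workflow state with the minimal estimated execution time."""
    seen, queue, best = {initial}, deque([initial]), initial
    while queue:
        state = queue.popleft()
        if cost(state) < cost(best):
            best = state
        for nxt in neighbors(state, commutes):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best

# Toy cost model: filtering before the join is cheaper.
costs = {("extract", "filter", "join", "load"): 10,
         ("extract", "join", "filter", "load"): 25}
print(exhaustive_search(("extract", "join", "filter", "load"),
                        lambda s: costs.get(s, 100),
                        commutes={("join", "filter")}))
# ('extract', 'filter', 'join', 'load')
```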

2.
ETL is the key step by which a data warehouse obtains high-quality data, and it plays an important role in data warehouse construction and implementation. To address the shortcomings of the traditional serial ETL execution mode, this paper proposes a parallel ETL execution method combining Agents with activity priorities. The method computes the priority of each activity in the ETL process and uses Agent theory together with multi-threaded parallel computing to execute, in parallel, ETL activities that share the same priority and have no dependencies on each other. Experimental results show that the method achieves a good speedup on large data volumes and improves the execution efficiency of the ETL process.
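A sketch of the priority-plus-dependency idea using standard Python threads; the dependency table and the priority rule (longest dependency chain) are illustrative assumptions, not the paper's exact method:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical ETL activities with the activities they depend on.
dependencies = {
    "extract_orders": [],
    "extract_customers": [],
    "clean_orders": ["extract_orders"],
    "clean_customers": ["extract_customers"],
    "join": ["clean_orders", "clean_customers"],
    "load": ["join"],
}

def priority(activity, deps, memo=None):
    """Priority = length of the longest dependency chain below the
    activity; activities with equal priority and no mutual dependency
    are candidates for parallel execution."""
    memo = {} if memo is None else memo
    if activity not in memo:
        memo[activity] = 1 + max(
            (priority(d, deps, memo) for d in deps[activity]), default=0)
    return memo[activity]

def run_parallel(deps, action):
    levels = {}
    for act in deps:
        levels.setdefault(priority(act, deps), []).append(act)
    with ThreadPoolExecutor() as pool:
        for level in sorted(levels):              # earliest level first
            list(pool.map(action, levels[level])) # same level in parallel

run_parallel(dependencies, lambda a: print("running", a))
```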

3.
姚全珠  白敏  黄蔚 《计算机工程》2009,35(19):91-93,9
Considering the characteristics of the AP model, this paper gives formal definitions of the objects in the metamodel, optimizes the model-mapping algorithm, and proposes a model-driven mapping method from the conceptual model to the logical model. The improved algorithm can map single-source or multi-source data based on Extraction-Transformation-Loading (ETL) workflows and executes the state nodes concurrently, improving execution efficiency. Experimental results show that the method lays a good foundation for model-driven ETL design and for rapidly implementing ETL in data integration.

4.
Research on a Distributed ETL Engine Based on Multi-Agents and Workflows   (cited 1 time in total: 0 self-citations, 1 by others)
丁进  郭朝珍 《计算机应用》2009,29(1):319-322
To address the shortcomings of the centralized execution mode of traditional ETL tools, this paper proposes a distributed ETL engine architecture combining multiple Agents with workflows. The architecture consists of one master engine and multiple execution engines; execution engines autonomously register their execution services with the master engine, and distributed computing together with multi-threaded parallel computing allows several execution engines to cooperatively execute an ETL workflow, improving the flexibility and throughput of the whole system. Experimental results show that the engine has good scalability and load-balancing behavior and improves execution efficiency.
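A single-process schematic of the master/executor registration-and-dispatch idea; the class and method names are assumptions standing in for the distributed implementation:

```python
import threading
from queue import Queue

class MasterEngine:
    """Coordinating engine: executors register their services, and
    ETL activities are dispatched to the registered executors."""
    def __init__(self):
        self.executors = []
        self.lock = threading.Lock()

    def register(self, executor):
        with self.lock:
            self.executors.append(executor)

    def dispatch(self, activities):
        # Simple round-robin dispatch to registered executors.
        for i, activity in enumerate(activities):
            self.executors[i % len(self.executors)].submit(activity)

class ExecutorEngine:
    def __init__(self, name, master):
        self.name = name
        self.tasks = Queue()
        master.register(self)           # autonomous service registration
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, activity):
        self.tasks.put(activity)

    def _run(self):
        while True:
            activity = self.tasks.get()
            print(f"{self.name} executing {activity}")
            self.tasks.task_done()

master = MasterEngine()
workers = [ExecutorEngine(f"executor-{i}", master) for i in range(3)]
master.dispatch(["extract", "transform", "load", "index"])
for w in workers:
    w.tasks.join()                      # wait for all queued work
```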

5.
To identify how workflows execute in a distributed environment, this paper studies distributed workflow management systems and, by analyzing the XML-format runtime logs of distributed workflow execution sites, proposes an incremental workflow-mining algorithm. The algorithm analyzes and merges the activity-execution time sequences from many workflow execution sites to reconstruct the workflow model of the distributed environment. It consists of two main parts: a time-sequence mining algorithm that extracts the execution time sequences of activities from workflow execution logs, and a workflow recognition algorithm that, based on those sequences, identifies the structured workflow model. A worked example demonstrates the effectiveness of the algorithm.
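A toy sketch of the two phases: mine the directly-follows ordering from timestamped activity logs, then merge the observed orderings into one dependency graph. The tuple-based log format is an assumption standing in for the XML logs described in the abstract:

```python
from collections import defaultdict

# Hypothetical per-site logs: (case_id, activity, timestamp) tuples.
site_logs = [
    [("c1", "receive", 1), ("c1", "check", 2), ("c1", "approve", 3)],
    [("c2", "receive", 1), ("c2", "check", 2), ("c2", "reject", 3)],
]

def mine_follows(logs):
    """Phase 1: extract directly-follows pairs from each case's
    time-ordered activity sequence."""
    follows = defaultdict(int)
    for log in logs:
        by_case = defaultdict(list)
        for case, act, ts in log:
            by_case[case].append((ts, act))
        for events in by_case.values():
            acts = [a for _, a in sorted(events)]
            for a, b in zip(acts, acts[1:]):
                follows[(a, b)] += 1
    return follows

def build_model(follows):
    """Phase 2: merge observed orderings into a single graph, an
    incremental approximation of the workflow structure."""
    graph = defaultdict(set)
    for (a, b), _count in follows.items():
        graph[a].add(b)
    return dict(graph)

print(build_model(mine_follows(site_logs)))
# {'receive': {'check'}, 'check': {'approve', 'reject'}}
```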

6.
Considering the characteristics of transaction-intensive workflow systems, this paper proposes a dynamic, self-adaptive workflow scheduling algorithm: the pre-computation scheduling algorithm. In this algorithm, every workflow application performs an initialization computation to produce a priority sequence of executable nodes, ensuring that each process instance has minimal execution and transmission costs across different runtime environments. Experimental results show that the algorithm runs efficiently in transaction-intensive environments.
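A minimal sketch of the initialization step, assuming a toy cost model in which each node carries pre-estimated execution and transmission costs (the node names and numbers are illustrative):

```python
# Illustrative per-node costs for one workflow instance:
# (execution_cost, transmission_cost) estimated for the current
# runtime environment before execution starts.
node_costs = {
    "validate": (2, 1),
    "enrich":   (5, 3),
    "persist":  (1, 4),
}

def precompute_schedule(costs):
    """Initialization: rank executable nodes by combined execution
    plus transmission cost (cheapest first), yielding a priority
    sequence reused for the whole process instance."""
    return sorted(costs, key=lambda n: sum(costs[n]))

print(precompute_schedule(node_costs))  # ['validate', 'persist', 'enrich']
```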

7.
孙月  于炯  朱建波 《计算机科学》2014,41(3):145-148,168
To address fairness in multi-user workflow scheduling, improve resource utilization, and satisfy the different QoS requirements of different users' DAG workflows, this paper proposes a preemptive dynamic scheduling model for multiple DAG workflows. The algorithm divides DAG workflows into priority classes according to their QoS requirements and schedules jobs so that higher-priority jobs claim resources first. Among DAG workflows of the same priority, tasks preempt resources based on a slowdown metric carrying heuristic information, further improving scheduling fairness; for jobs of different priorities, a threshold-based backfilling algorithm is proposed that improves resource utilization while keeping job scheduling fair.
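A sketch of the two mechanisms named above, under common textbook definitions used here as assumptions: slowdown = (wait + run) / run as the fairness heuristic among equal-priority DAGs, and a threshold test for backfilling across priorities:

```python
def slowdown(wait_time, run_time):
    """Slowdown: (wait + run) / run. The task whose DAG has the
    larger slowdown has been treated worse, so it wins the resource
    when two equal-priority DAGs compete."""
    return (wait_time + run_time) / run_time

def pick_preemptor(task_a, task_b):
    # Each task: {"name": ..., "wait": ..., "run": ...}
    sa = slowdown(task_a["wait"], task_a["run"])
    sb = slowdown(task_b["wait"], task_b["run"])
    return task_a if sa >= sb else task_b

def can_backfill(low_task_run, idle_window, threshold=0.8):
    """Threshold-based backfilling: a lower-priority task may fill an
    idle slot only if it fits comfortably, so higher-priority work is
    not delayed (the threshold value is illustrative)."""
    return low_task_run <= threshold * idle_window

winner = pick_preemptor({"name": "t1", "wait": 30, "run": 10},
                        {"name": "t2", "wait": 5, "run": 10})
print(winner["name"], can_backfill(6, 10))  # t1 True
```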

8.
任丰玲  于炯  杨兴耀 《计算机工程》2012,38(23):287-290
For the problem of scheduling multiple directed acyclic graph (DAG) workflows in a cloud computing environment, this paper proposes an algorithm based on minimizing data-transfer time and task completion time (LTCT), which handles scheduling among multiple DAG workflows of the same priority. For the case where the DAGs have different priorities, a hybrid multi-priority multi-DAG scheduling algorithm is given. Experimental results show that, compared with the E-Fairness algorithm, LTCT preserves fairness across DAGs while avoiding extra data-transfer overhead, helping to shorten the overall makespan of the workflows and improve resource utilization.
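A toy selection rule in the spirit of LTCT: among ready tasks from equally prioritized DAGs, choose the (task, node) pair minimizing estimated transfer time plus completion time. All data structures and numbers are illustrative assumptions:

```python
def ltct_select(ready_tasks, nodes):
    """Return the (task, node) pair minimizing transfer + finish time.
    ready_tasks: {task: (input_mb, compute_s)};
    nodes: {node: (bandwidth_mb_per_s, available_at_s)}."""
    best, best_cost = None, float("inf")
    for task, (mb, compute) in ready_tasks.items():
        for node, (bw, avail) in nodes.items():
            cost = mb / bw + avail + compute   # transfer + wait + compute
            if cost < best_cost:
                best, best_cost = (task, node), cost
    return best, best_cost

ready = {"A.t3": (100, 4.0), "B.t1": (10, 6.0)}
cluster = {"vm1": (50, 0.0), "vm2": (200, 2.0)}
print(ltct_select(ready, cluster))  # (('A.t3', 'vm1'), 6.0)
```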

9.
吴远红  徐宏炳 《计算机工程与设计》2007,28(10):2262-2264,2269
Extraction-Transformation-Loading (ETL) is a basic component for building and maintaining a data warehouse. Because it processes massive volumes of data, shortening its response time is a problem worth studying. This paper studies the logical optimization of the ETL process, modeling the optimization problem as a state-space search: each ETL workflow is treated as a state, the state space is constructed through a series of correct state transitions, and an algorithm is proposed to obtain the ETL workflow with the minimal execution time.

10.
Based on the dependencies and deadlines of tasks in a grid workflow, together with resource availability and MIPS (millions of instructions per second), this paper proposes a task-priority scheduling algorithm based on grid-resource prediction. The grid task workflow is abstracted as a directed acyclic graph, the critical path of the workflow is found, and the latest start time of each task is computed and used as the task's priority. The algorithm takes user requirements and resource types into account, as well as the reallocation of tasks after a scheduling failure. Experiments verify the effectiveness of the algorithm.
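A sketch of computing each task's latest start time from the deadline over a DAG; a smaller latest start means a more urgent (higher-priority) task, and tasks on the critical path have the tightest values. The graph and durations are illustrative:

```python
def latest_start_times(succ, duration, deadline):
    """Latest start = deadline minus the longest remaining path
    (in total duration) from the task through the workflow exit;
    used as the task priority (smaller value = more urgent)."""
    memo = {}
    def tail(task):                 # longest remaining work incl. task
        if task not in memo:
            memo[task] = duration[task] + max(
                (tail(s) for s in succ[task]), default=0)
        return memo[task]
    return {t: deadline - tail(t) for t in succ}

successors = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"], "t4": []}
durations = {"t1": 2, "t2": 4, "t3": 1, "t4": 3}
print(latest_start_times(successors, durations, deadline=12))
# {'t1': 3, 't2': 5, 't3': 8, 't4': 9} -> t1 is the most urgent
```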

11.
Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. Because ETL workflows are usually complex, various ETL facilities have been developed to address their control-flow process modeling and execution control. To evaluate the quality of such facilities, synthetic ETL workflow test cases, consisting of both control-flow and data-flow aspects, are needed to check ETL facility functionality at construction time and to validate the correctness and performance of ETL facilities at run time. Although some synthetic workflow and data-set test-case generation approaches exist in the literature, little work considers both aspects at the same time specifically for ETL workflow generators. To address this issue, this paper proposes a schema-aware ETL workflow generator with which users can characterize their ETL workflows by various parameters and obtain ETL workflow test cases with a control flow of ETL activities, consistent schemas, and associated recordsets. The generator works in three steps. First, given the types and ratios of individual activities and their connection characteristics, it produces ETL activities and forms an ETL skeleton that determines how the generated activities cooperate with each other. Second, given schema-transformation parameters, e.g. ranges for the numbers of attributes, it resolves attribute dependencies and refines input/output schemas with consistent attributes and their data types. In the last step, recordsets are generated following cardinality specifications. In the experiments, ETL workflows in specific patterns are produced to show the ability of the generator, and thousands of ETL workflow test cases are generated in seconds to verify its usability.
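A skeleton of the three generation steps under simplified assumptions (a linear skeleton, a single shared schema, uniform random recordsets); the parameter names and activity catalogue are hypothetical, not the paper's API:

```python
import random

def generate_skeleton(n_activities, ratios, seed=0):
    """Step 1: draw activity types by ratio and chain them into a
    linear ETL skeleton (a real generator also produces branches)."""
    rnd = random.Random(seed)
    types = rnd.choices(list(ratios), weights=list(ratios.values()),
                        k=n_activities)
    return [{"id": i, "type": t} for i, t in enumerate(types)]

def refine_schemas(skeleton, min_attrs=2, max_attrs=5, seed=0):
    """Step 2: give each activity input/output schemas, with one
    activity's output feeding the next activity's input so that
    attribute dependencies are satisfied."""
    rnd = random.Random(seed)
    schema = [f"attr{i}" for i in range(rnd.randint(min_attrs, max_attrs))]
    for act in skeleton:
        act["in_schema"], act["out_schema"] = list(schema), list(schema)
    return skeleton

def generate_recordsets(skeleton, cardinality=3, seed=0):
    """Step 3: produce recordsets matching the source schema."""
    rnd = random.Random(seed)
    src = skeleton[0]["in_schema"]
    return [{a: rnd.randint(0, 99) for a in src} for _ in range(cardinality)]

wf = refine_schemas(generate_skeleton(4, {"filter": 2, "join": 1, "agg": 1}))
print(wf[0]["type"], wf[0]["in_schema"], generate_recordsets(wf)[0])
```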

12.
State-space optimization of ETL workflows   (cited 3 times in total: 0 self-citations, 3 by others)
Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as a state and construct the state space through a set of correct state transitions. Moreover, we provide an exhaustive algorithm and two heuristic algorithms for minimizing the execution cost of an ETL workflow. The heuristic algorithm with greedy characteristics significantly outperforms the other two for a large set of experimental cases.
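A greedy hill-climbing sketch complementing the exhaustive search shown for entry 1: repeatedly take the single transition that most reduces cost and stop at a local minimum. The transition rule and cost function are toy assumptions:

```python
def greedy_optimize(initial, neighbors, cost):
    """Greedy heuristic: move to the cheapest neighboring state; stop
    at a local minimum. Far fewer states are visited than in the
    exhaustive search, at the risk of missing the global optimum."""
    state = initial
    while True:
        candidates = list(neighbors(state))
        if not candidates or cost(min(candidates, key=cost)) >= cost(state):
            return state
        state = min(candidates, key=cost)

# Toy transitions: swap interior activities; extract/load stay fixed.
def swaps(wf):
    for i in range(1, len(wf) - 2):
        s = list(wf)
        s[i], s[i + 1] = s[i + 1], s[i]
        yield tuple(s)

# Toy cost: the earlier the filter runs, the cheaper the workflow.
cost = lambda wf: wf.index("filter")
print(greedy_optimize(("extract", "join", "filter", "load"), swaps, cost))
# ('extract', 'filter', 'join', 'load')
```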

13.
Parallelizing the incremental ETL process is an effective way to improve the freshness of ODS data. Drawing on the theory of Communicating Sequential Processes, this paper studies a model of the incremental ETL process, formally analyzes how the execution states of incremental ETL events change in a parallel environment, and proposes a parallel scheduling algorithm for the incremental ETL process, solving the scheduling-policy problem for incremental ETL under parallel execution. Application and practice show that the model and algorithm impose a light load on source systems and deliver highly fresh data.

14.
Data sources (DSs) being integrated in a data warehouse frequently change their structures/schemas. As a consequence, in many cases, an already deployed ETL workflow stops its execution, yielding errors. Since in big companies the number of ETL workflows may reach tens of thousands, and since structural changes of DSs are frequent, automatically repairing an ETL workflow after such changes is of high practical importance. In our approach, we developed a framework, called E-ETL, for handling the evolution of an ETL layer. In the framework, an ETL workflow is repaired semi-automatically or automatically (depending on the case) as the result of structural changes in DSs, so that it works with the changed DSs. E-ETL supports two different repair methods: (1) user-defined rules, and (2) Case-Based Reasoning. In this paper, we present how Case-Based Reasoning may be applied to repairing ETL workflows. In particular, we contribute an algorithm for selecting the most suitable case for a given ETL evolution problem. The algorithm applies a technique for reducing cases in order to make them more universal and capable of solving more problems. The algorithm has been implemented in the E-ETL prototype and evaluated experimentally; the obtained results are also discussed in this paper.
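A sketch of case selection by similarity between a detected DS change and stored repair cases, with a simple generalization ("reduction") step that drops instance-specific fields. The case representation and the similarity weights are assumptions, not E-ETL's actual design:

```python
def similarity(change, case):
    """Score a stored repair case against a detected structural
    change: a matching change kind is essential; matching data type
    and table add smaller bonuses (weights are illustrative)."""
    if change["kind"] != case["kind"]:
        return 0.0
    score = 1.0
    score += 0.5 if change.get("dtype") == case.get("dtype") else 0.0
    score += 0.25 if change.get("table") == case.get("table") else 0.0
    return score

def reduce_case(case):
    """Generalize a case by dropping instance-specific fields so it
    can match (and repair) more future problems."""
    return {k: v for k, v in case.items() if k in ("kind", "dtype", "repair")}

def select_case(change, case_base):
    best = max(case_base, key=lambda c: similarity(change, c))
    return best if similarity(change, best) > 0 else None

cases = [
    {"kind": "attr_renamed", "dtype": "varchar", "table": "orders",
     "repair": "remap attribute in extraction query"},
    {"kind": "attr_deleted", "dtype": "int",
     "repair": "drop attribute from downstream activities"},
]
change = {"kind": "attr_renamed", "dtype": "varchar", "table": "clients"}
print(select_case(change, [reduce_case(c) for c in cases])["repair"])
```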

15.
In recent years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. Scientific workflows are becoming increasingly complex and more demanding in terms of computational resources, thus requiring parallel techniques and high-performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm in which resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was high-throughput computing, they are already being used to provide HPC environments where elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many open yet important challenges, such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a very complex task, since many different criteria must be taken into account and the elasticity characteristic must be exploited to optimize workflow execution. In this paper, we introduce an adaptive scheduling heuristic for the parallel execution of scientific workflows in the cloud based on three criteria: total execution time (makespan), reliability, and financial cost. Besides scheduling workflow activities based on a 3-objective cost model, this approach also scales resources up and down according to the restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution.
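A sketch of a 3-objective cost model in the spirit of the abstract: combine makespan, reliability, and financial cost estimates with scientist-supplied weights and pick the cheapest placement. The estimates and weights are illustrative assumptions, not SciCumulus values:

```python
def weighted_cost(est, weights):
    """Combine the three criteria into one scheduling score.
    est: {"time_s": ..., "reliability": 0..1, "cost_usd": ...};
    lower score = better placement for the activity."""
    return (weights["time"] * est["time_s"]
            + weights["reliability"] * (1.0 - est["reliability"])
            + weights["cost"] * est["cost_usd"])

def pick_vm(estimates, weights):
    return min(estimates, key=lambda vm: weighted_cost(estimates[vm], weights))

vms = {
    "small": {"time_s": 120, "reliability": 0.99, "cost_usd": 0.02},
    "large": {"time_s": 40,  "reliability": 0.95, "cost_usd": 0.10},
}
# A scientist who values speed over price (illustrative weights):
print(pick_vm(vms, {"time": 1.0, "reliability": 50.0, "cost": 10.0}))
# large
```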

16.
The use of rules in a distributed environment creates new challenges for the development of active rule execution models. In particular, since a single event can trigger multiple rules that execute over distributed sources of data, it is important to make use of concurrent rule execution whenever possible. This paper presents the details of the integration rule scheduling (IRS) algorithm. Integration rules are active database rules used for component integration in a distributed environment. The IRS algorithm identifies conflicts among multiple rules triggered by the same event through static, compile-time analysis of the read and write sets of each rule. A unique aspect of the algorithm is that the conflict analysis includes the effects of nested rule execution that occurs as a result of using an execution model with an immediate coupling mode. The algorithm therefore identifies conflicts that may occur through the concurrent execution of different rule-triggering sequences. The rules are then formed into a priority graph before execution, defining the order in which rules triggered by the same event should be processed; rules with the same priority can be executed concurrently. The IRS algorithm guarantees confluence in the final state of rule execution and is applicable to rule scheduling in both distributed and centralized rule execution environments.
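A compile-time sketch of the conflict test at the heart of IRS-style scheduling: two rules conflict if one's write set intersects the other's read or write set (the classic Bernstein conditions), and conflict-free rules may run concurrently. The rule contents are illustrative:

```python
def conflicts(rule_a, rule_b):
    """Static read/write-set analysis: rules conflict when one writes
    data the other reads or writes."""
    return bool(rule_a["write"] & (rule_b["read"] | rule_b["write"])
                or rule_b["write"] & rule_a["read"])

def priority_levels(rules):
    """Group rules triggered by one event into priority levels: each
    rule is placed after every earlier-listed rule it conflicts with;
    rules sharing a level can execute concurrently."""
    levels = {}
    for name, rule in rules.items():
        after = [levels[o] for o, r in rules.items()
                 if o in levels and conflicts(rule, r)]
        levels[name] = 1 + max(after, default=0)
    return levels

rules = {
    "r1": {"read": {"stock"}, "write": {"orders"}},
    "r2": {"read": {"orders"}, "write": {"invoices"}},
    "r3": {"read": {"stock"}, "write": {"alerts"}},
}
print(priority_levels(rules))  # {'r1': 1, 'r2': 2, 'r3': 1}
```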
