Similar Documents
20 similar documents found.
1.
吴远红  徐宏炳 《计算机工程与设计》2007,28(10):2262-2264,2269
Extraction-Transformation-Loading (ETL) is the basic building block for constructing and maintaining a data warehouse. Because ETL handles massive volumes of data, shortening its response time is a problem worth studying. This paper studies the logical optimization of the ETL process by modeling the optimization problem as a state-space search: each ETL workflow is regarded as a state, the state space is constructed through a series of correct state transitions, and an algorithm is proposed to obtain the ETL workflow with the minimum execution time.

2.
Research on the optimization of the ETL execution process (total citations: 2; self-citations: 0; other citations: 0)
This paper proposes an ETL (Extraction-Transformation-Loading) optimization framework and studies the logical optimization of the ETL process, modeling the optimization problem as a state-space search. Each ETL workflow is regarded as a state, the state space is constructed through a series of correct state transitions, and an algorithm is proposed to obtain the ETL workflow with the minimum execution time. Theoretical analysis and practice show that the approach is effective.
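Both abstracts above describe the same state-space formulation. Below is a minimal sketch of that idea, assuming hypothetical cost() and transitions() helpers (an estimated execution time and a generator of semantics-preserving rewritings); the published algorithms prune this space rather than exhausting it as this sketch does:

```python
import heapq
import itertools

def optimize_etl(initial, transitions, cost):
    """Best-first search over the state space of equivalent ETL workflows.

    initial     -- hashable representation of the starting workflow
    transitions -- function yielding semantically equivalent rewritings
                   of a workflow (the 'correct state transitions')
    cost        -- function estimating a workflow's execution time
    """
    counter = itertools.count()  # tie-breaker so states are never compared
    best = initial
    frontier = [(cost(initial), next(counter), initial)]
    seen = {initial}
    while frontier:
        c, _, wf = heapq.heappop(frontier)
        if c < cost(best):
            best = wf                     # keep the cheapest workflow seen
        for nxt in transitions(wf):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (cost(nxt), next(counter), nxt))
    return best
```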

3.
Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. As ETL workflows are usually complex, various ETL facilities have been developed to address their control-flow process modeling and execution control. To evaluate the quality of ETL facilities, synthetic ETL workflow test cases consisting of both control-flow and data-flow aspects are needed: they check ETL facility functionality at construction time and validate the correctness and performance of ETL facilities at run time. Although some approaches for generating synthetic workflow and data-set test cases exist in the literature, little work considers both aspects at the same time specifically for ETL workflow generators. To address this issue, this paper proposes a schema-aware ETL workflow generator with which users can characterize their ETL workflows by various parameters and obtain ETL workflow test cases comprising the control flow of ETL activities, compliant schemas and associated recordsets. The generator works in three steps. First, given the types and ratios of individual activities and their connection characteristics, it produces ETL activities and forms an ETL skeleton that determines how the generated activities cooperate with each other. Second, given schema-transformation parameters, e.g. ranges of numbers of attributes, it resolves attribute dependencies and refines input/output schemas with compliant attributes and their data types. In the last step, recordsets are generated following cardinality specifications. ETL workflows in specific patterns are produced in the experiments to show the ability of the generator, and further experiments generating thousands of ETL workflow test cases in seconds verify its usability.
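A minimal sketch of the three-step generation pipeline (skeleton, then schemas, then recordsets); every parameter and field name here is illustrative, not the generator's actual interface:

```python
import random

def generate_etl_case(n_activities, activity_ratios, attr_range, cardinality, seed=0):
    """Three-step sketch: skeleton -> schemas -> recordsets."""
    rng = random.Random(seed)
    # Step 1: draw activity types by ratio and connect them into a skeleton.
    names, weights = zip(*activity_ratios.items())
    types = rng.choices(names, weights=weights, k=n_activities)
    edges = [(i, i + 1) for i in range(n_activities - 1)]  # linear skeleton
    # Step 2: pick an attribute count within the given range and propagate the
    # source schema through the skeleton (dependencies resolve trivially here).
    source_schema = [f"attr_{j}" for j in range(rng.randint(*attr_range))]
    schemas = {i: list(source_schema) for i in range(n_activities)}
    # Step 3: generate a recordset for the source schema per the cardinality spec.
    records = [{a: rng.randint(0, 9999) for a in source_schema}
               for _ in range(cardinality)]
    return {"types": types, "edges": edges, "schemas": schemas, "records": records}

case = generate_etl_case(5, {"filter": 2, "join": 1, "aggregate": 1},
                         attr_range=(3, 8), cardinality=100)
```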

4.
Data sources (DSs) integrated into a data warehouse frequently change their structures/schemas. As a consequence, an already deployed ETL workflow often stops executing and yields errors. Since in big companies the number of ETL workflows may reach tens of thousands, and since structural changes of DSs are frequent, automatically repairing an ETL workflow after such changes is of high practical importance. We have developed a framework, called E-ETL, for handling the evolution of an ETL layer. In the framework, an ETL workflow is repaired semi-automatically or automatically (depending on the case) in response to structural changes in DSs, so that it works with the changed DSs. E-ETL supports two repair methods: (1) user-defined rules and (2) Case-Based Reasoning. In this paper, we present how Case-Based Reasoning may be applied to repairing ETL workflows. In particular, we contribute an algorithm for selecting the most suitable case for a given ETL evolution problem. The algorithm applies a technique for reducing cases in order to make them more universal and capable of solving more problems. The algorithm has been implemented in the E-ETL prototype and evaluated experimentally; the obtained results are also discussed in this paper.
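A minimal sketch of the case-selection step in the spirit of the description above: cases pair a structural-change pattern with a repair, and the most suitable case maximizes overlap with the observed change. The names and the similarity measure are assumptions for illustration, not E-ETL's actual algorithm:

```python
def select_case(change, case_base):
    """Pick the repair case whose change pattern best matches the observed
    structural change; prefer more specific (larger) patterns on ties."""
    def score(case):
        pattern = case["pattern"]
        overlap = len(pattern & change) / len(pattern | change)  # Jaccard
        return (overlap, len(pattern))
    return max(case_base, key=score)

# Hypothetical case base of structural-change patterns and repairs.
case_base = [
    {"pattern": {"column_renamed"}, "repair": "remap column in extraction"},
    {"pattern": {"column_dropped", "column_added"},
     "repair": "replace column reference and cast type"},
]
print(select_case({"column_renamed"}, case_base)["repair"])
```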

5.
Determining activity priorities in ETL workflows and their parallel implementation (total citations: 1; self-citations: 0; other citations: 0)
An ETL process is a data-centric workflow. This paper discusses the execution of ETL workflows and proposes an algorithm that computes the execution priority of each activity in an ETL workflow. During execution, multiple threads are created for sets of activities that share the same priority and have no dependencies on each other; executing these activities in parallel improves the execution efficiency of the ETL workflow. Experimental results show that, compared with the serial algorithm, the proposed parallel algorithm approaches the ideal speedup when the data volume is sufficiently large, and the speedup grows as the data volume increases.
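A minimal sketch of the idea, assuming a dependency map and an execute() callback (both hypothetical): each activity gets a priority equal to its depth in the dependency DAG, so activities sharing a priority are mutually independent and can run on separate threads:

```python
from concurrent.futures import ThreadPoolExecutor

def priority_levels(activities, deps):
    """Priority of an activity = its depth in the dependency DAG.

    deps maps an activity to the set of activities it depends on;
    an activity is strictly deeper than all of its dependencies, so
    same-priority activities can never depend on each other.
    """
    level = {}
    def depth(a):
        if a not in level:
            level[a] = 1 + max((depth(d) for d in deps.get(a, ())), default=-1)
        return level[a]
    for a in activities:
        depth(a)
    return level

def run_parallel(activities, deps, execute):
    levels = priority_levels(activities, deps)
    for p in sorted(set(levels.values())):            # sources first
        batch = [a for a in activities if levels[a] == p]
        with ThreadPoolExecutor() as pool:
            list(pool.map(execute, batch))            # run one level concurrently
```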

6.
Feature selection (attribute reduction) from large-scale incomplete data is a challenging problem in areas such as pattern recognition, machine learning and data mining. In rough set theory, feature selection from incomplete data aims to retain the discriminatory power of the original features. Many feature selection algorithms have been proposed to address this issue; however, they are often computationally time-consuming. To overcome this shortcoming, we introduce a theoretical framework based on rough set theory, called positive approximation, which can be used to accelerate heuristic feature selection from incomplete data. As an application of the proposed accelerator, a general feature selection algorithm is designed. By integrating the accelerator into heuristic algorithms, we obtain modified versions of several representative heuristic feature selection algorithms in rough set theory. Experiments show that these modified algorithms outperform their original counterparts, and that their advantage becomes more pronounced on larger data sets.

7.
Data-intensive Grid applications need access to large data sets that may each be replicated on different resources. Minimizing the overhead of transferring these data sets to the resources where the applications are executed requires selecting appropriate computational and data resources. In this paper, we consider the problem of scheduling an application composed of a set of independent tasks, each of which requires multiple data sets that are each replicated on multiple resources. We break this problem into two parts: first, matching each task (or job) to one compute resource for executing the job and one storage resource for each data set required by the job; and second, assigning the set of tasks to the selected resources. We model the first part as an instance of the well-known Set Covering Problem (SCP) and apply a known heuristic for SCP to match jobs to resources. The second part is tackled by extending the existing MinMin and Sufferage algorithms to schedule the set of distributed data-intensive tasks. Through simulation, we experimentally compare the SCP-based matching heuristic to others in conjunction with the task scheduling algorithms and present the results.
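A minimal sketch of the classic greedy SCP heuristic applied to the matching step: the universe is one job's requirements (a compute slot plus its data sets), and each candidate resource covers what it can supply. The resource names and cost model here are made up for illustration:

```python
def greedy_set_cover(universe, subsets, cost):
    """Classic greedy Set Covering heuristic: repeatedly pick the subset
    with the lowest cost per newly covered element."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = min(
            (s for s in subsets if subsets[s] & uncovered),
            key=lambda s: cost[s] / len(subsets[s] & uncovered),
        )
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

# One job needs a compute slot and data sets d1, d2; each site covers what
# it hosts (hypothetical costs: estimated transfer + execution time).
universe = {"compute", "d1", "d2"}
subsets = {"siteA": {"compute", "d1"}, "siteB": {"d2"}, "siteC": {"d1", "d2"}}
cost = {"siteA": 3.0, "siteB": 1.0, "siteC": 2.5}
print(greedy_set_cover(universe, subsets, cost))  # -> ['siteB', 'siteA']
```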

8.
姚全珠  白敏  黄蔚 《计算机工程》2009,35(19):91-93,9
Based on the characteristics of the AP model, this paper gives a formal definition of the objects in the metamodel, optimizes the model-mapping algorithm, and proposes a model-driven mapping method from the conceptual model to the logical model. The improved algorithm can map single-source or multi-source data in extraction-transformation-loading (ETL) workflows and execute the state nodes concurrently, improving execution efficiency. Experimental results show that the method lays a good foundation for model-driven ETL design and for rapidly implementing ETL in data integration.

9.
To address shortcomings in the ETL process optimization algorithm proposed by Simitsis et al. [1,2], an improved heuristic search algorithm is proposed. Experiments show that the improved algorithm effectively reduces the actual execution cost and overcomes the short-sightedness of the original algorithm.

10.
We consider a special case of heuristics, namely numeric heuristic evaluation functions, and their use in artificial intelligence search algorithms. The problems they are applied to fall into three general classes: single-agent path-finding problems, two-player games, and constraint-satisfaction problems. In a single-agent path-finding problem, such as the Fifteen Puzzle or the travelling salesman problem, a single agent searches for a shortest path from an initial state to a goal state. Two-player games, such as chess and checkers, involve an adversarial relationship between two players, each trying to win the game. In a constraint-satisfaction problem, such as the 8-Queens problem, the task is to find a state that satisfies a set of constraints. All of these problems are computationally intensive, and heuristic evaluation functions are used to reduce the amount of computation required to solve them. In each case we explain the nature of the evaluation functions used, how they are used in search algorithms, and how they can be automatically learned or acquired.
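As a concrete instance of such a numeric evaluation function, here is a sketch of the Manhattan-distance heuristic commonly used for the Fifteen Puzzle; it is admissible, i.e. it never overestimates the remaining number of moves:

```python
def manhattan_distance(state, size=4):
    """Sum over tiles of the grid distance to the tile's goal position.

    state is a tuple of length size*size, with 0 for the blank;
    tile t belongs at index t-1 in the goal configuration.
    """
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue                       # the blank is not counted
        goal = tile - 1
        total += abs(idx // size - goal // size) + abs(idx % size - goal % size)
    return total

# A state one move away from the goal evaluates to 1.
goal = tuple(range(1, 16)) + (0,)
almost = goal[:14] + (0, 15)
print(manhattan_distance(almost))  # -> 1
```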

11.
Resource scheduling has long been a hot research topic in cloud computing, yet most existing work focuses on meeting users' time or cost requirements and rarely considers their security requirements during scheduling. To address this problem, this paper models the resource-scheduling problem for workflow tasks in a typical cloud environment, proposes a security-constraint model, and solves the problem with a variable-neighborhood particle swarm optimization algorithm. Finally, the algorithm is compared with the max-min ant colony algorithm and a genetic algorithm on the CloudSim simulation platform; experimental results show that the proposed algorithm offers good usability and search capability.

12.
This paper introduces the basic concepts of secure workflows and state machines and, through a study of the security properties of workflows, proposes a secure workflow model based on a multi-layer state machine. The model architecture comprises three layers, a workflow layer, a control layer and a data layer, and the execution of a secure workflow is analyzed from the perspectives of tasks, events and data respectively. Finally, the authorization functions in the multi-layer state machine are introduced, and the authorization process at each layer of the secure workflow model is described in detail.

13.
Ye Xin, Li Jia, Liu Sihao, Liang Jiwei, Jin Yaochu. Natural Computing, 2019, 18(4): 735-746

Aiming to solve the problem of instance-intensive workflow scheduling in a private cloud environment, this paper first formulates a scheduling optimization model that takes the communication time between tasks into account. The objective of the model is to minimize the execution time of all workflow instances. Then, a hybrid scheduling method based on a batch strategy and an improved genetic algorithm, termed the fragmentation-based genetic algorithm, is proposed according to the characteristics of instance-intensive cloud workflows; task priority dispatching rules are also taken into account. Simulations are conducted to compare the proposed method with the canonical genetic algorithm and two heuristic algorithms. Our simulation results demonstrate that the proposed method considerably enhances the search efficiency of the genetic algorithm and outperforms the compared algorithms, in particular when the number of workflow instances is high and the computational resource available for optimization is limited.


14.
In this paper, we consider an identical parallel machine scheduling problem with sequence-dependent setup times and job release dates. An improved iterated greedy heuristic with a sinking temperature is presented to minimize the maximum lateness. To verify the developed heuristic, computational experiments are conducted on a well-known benchmark problem data set. The experimental results show that the proposed heuristic outperforms the basic iterated greedy heuristic and the state-of-the-art algorithms on the same benchmark. It is believed that this improved approach will also be helpful for other applications.
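A minimal sketch of an iterated greedy loop with a sinking (geometrically decreasing) acceptance temperature; the destroy/rebuild helpers and the objective function are assumptions for illustration, not the paper's concrete operators:

```python
import math
import random

def iterated_greedy(schedule, lateness, destroy, rebuild,
                    iters=1000, t0=1.0, alpha=0.99, seed=0):
    """Iterated greedy: destroy part of the incumbent, greedily rebuild it,
    and accept worse schedules with a probability driven by a temperature
    that sinks geometrically (t *= alpha), growing stricter over time."""
    rng = random.Random(seed)
    best = cur = schedule
    t = t0
    for _ in range(iters):
        cand = rebuild(destroy(cur, rng), rng)       # destruction-construction
        delta = lateness(cand) - lateness(cur)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur = cand                               # simulated-annealing-style acceptance
        if lateness(cur) < lateness(best):
            best = cur
        t *= alpha                                   # sinking temperature
    return best
```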

15.
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, and for their cleansing, customization and insertion into a data warehouse. The literature and personal experience have led us to conclude that the problems concerning ETL tools are primarily problems of complexity, usability and price. To deal with these problems we provide a uniform metamodel for ETL processes, covering the aspects of data warehouse architecture, activity modeling, contingency treatment and quality management. The ETL tool we have developed is capable of modeling and executing practical ETL scenarios by providing explicit primitives for capturing common tasks. It offers three ways to describe an ETL scenario: a graphical point-and-click front end and two declarative languages, XADL (an XML variant), which is more verbose and easy to read, and SADL (an SQL-like language), which has a quite compact syntax and is thus easier for authoring.


17.
In this paper, we propose a method for task scheduling and data assignment on heterogeneous hybrid-memory multiprocessor systems for real-time applications. In such a system, an important problem is how to schedule real-time application tasks to processors and assign data to hybrid memories consisting of dynamic random access memory and solid state drives, taking the performance of the solid state drives into account in the scheduling policy. To solve this problem, we propose two heuristic algorithms, called the improved greedy algorithm and the data-assignment-according-to-task-scheduling algorithm, which generate near-optimal solutions for real-time applications in polynomial time. We evaluate the performance of our algorithms by comparing them with a greedy algorithm, which is commonly used to solve heterogeneous task scheduling problems. Based on our extensive simulation study on two heterogeneous multiprocessor systems, we observe that our algorithms exhibit excellent performance and demonstrate that considering data allocation in task scheduling is significant for saving energy.

18.
Cloud computing is a relatively new concept in distributed systems and is widely accepted as a solution for high-performance and distributed computing. Its dynamism in provisioning virtual resources for organisations and laboratories and its pay-per-use policy make it very popular. A workflow models a process consisting of a series of steps that shape an application. Workflow scheduling is the method of assigning each workflow task to a processing resource such that the workflow's rules are satisfied. Some workflow scheduling algorithms take quality-of-service parameters such as cost and deadline into account. Some efforts have been made on workflow scheduling in cloud computing environments with different service level agreements, but most of them suffer from low speed. Here, we introduce a new hybrid heuristic algorithm based on particle swarm optimisation (PSO) and the gravitational search algorithm. The proposed algorithm takes deadline limitations into account in addition to processing cost and transfer cost. The proposed workflow scheduling approach can be used by both end-users and utility providers. The CloudSim toolkit is used as the cloud environment simulator, with Amazon EC2 pricing as the reference pricing. Our experimental results show about 70% cost reduction in comparison to non-heuristic implementations, 30% cost reduction in comparison to PSO, 30% cost reduction in comparison to the gravitational search algorithm and 50% cost reduction in comparison to a hybrid genetic-gravitational algorithm.

19.
To optimize task execution efficiency and execution cost simultaneously, a multi-objective scheduling optimization algorithm for DAG tasks in cloud computing environments is proposed. The algorithm models the multi-objective optimization problem as the search for a set of balanced Pareto-optimal solutions and solves the model heuristically. To measure the quality of the multi-objective trade-off solutions, an evaluation mechanism based on the hypervolume method is designed, yielding balanced schedules across mutually conflicting objectives. Simulation experiments on a configured cloud environment with three synthetic workflows and two real scientific workflows show that, compared with similar single-objective and multi-objective heuristic algorithms, the proposed algorithm produces solutions of higher quality and better balance, better matching the resource usage characteristics and workflow scheduling patterns of real clouds.
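A minimal sketch of the hypervolume indicator for the two-objective (time, cost) minimization case described above, assuming a reference point that is worse than every solution on the front:

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-D minimization front w.r.t. reference point ref:
    the area dominated by the front and bounded by ref. Larger is better."""
    area, prev_y = 0.0, ref[1]
    for x, y in sorted(front):      # sweep by the first objective
        if y < prev_y:              # skip dominated points
            area += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return area

# Hypothetical (time, cost) trade-off schedules; ref dominates none of them.
front = [(3.0, 9.0), (5.0, 4.0), (8.0, 2.0)]
print(hypervolume_2d(front, ref=(10.0, 10.0)))  # -> 36.0
```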

20.
Current conceptual workflow models use either informally defined conceptual models or several formally defined conceptual models that capture different aspects of the workflow, e.g., its data, process, and organizational aspects. To the best of our knowledge, there are no algorithms that can amalgamate these models to yield a single view of reality. A fragmented conceptual view is useful for systems analysis and documentation; however, it fails to realize the potential of conceptual models to provide a convenient interface for automating the design and management of workflows. First, as a step toward accomplishing this objective, we propose SEAM (State-Entity-Activity-Model), a conceptual workflow model defined in terms of set theory. Second, to the best of our knowledge, no previous attempt has been made to incorporate time into a conceptual workflow model; SEAM incorporates the temporal aspect of workflows. Third, we apply SEAM to the workflows of a real-life organizational unit. In this work, we show a subset of the workflows modeled for this organization using SEAM, and we demonstrate, via a prototype application, how the SEAM schema can be implemented on a relational database management system. We present the lessons we learned about the advantages obtained for the organization and, for developers who choose to use SEAM, the potential pitfalls in using the SEAM methodology to build workflow systems on relational platforms. The information contained in this work is sufficient to allow application developers to utilize SEAM as a methodology to analyze, design, and construct workflow applications on current relational database management systems. The definition of SEAM as a context-free grammar, the definition of its semantics, and its mapping to relational platforms should also suffice to allow the construction of an automated workflow design and construction tool with SEAM as the user interface.
