Similar Literature
20 similar documents found (search time: 15 ms)
1.
Efficient data-aware methods in job scheduling, distributed storage management and data management platforms are necessary for the successful execution of data-intensive applications. However, research on methods for data-intensive scientific applications is insufficient in large-scale distributed cloud and cluster computing environments, and data-aware methods are becoming more complex. In this paper, we propose a Data-Locality Aware Workflow Scheduling (D-LAWS) technique and a locality-aware resource management method for data-intensive scientific workflows in HPC cloud environments. D-LAWS applies data locality and data transfer time based on network bandwidth to scientific workflow task scheduling and balances resource utilization and task parallelism at the node level. Our method consolidates VMs and considers task parallelism by data flow when planning the task executions of a data-intensive scientific workflow. We additionally consider more complex workflow models and data locality pertaining to the placement and transfer of data prior to task executions. We implement and validate the methods based on fairness in cloud environments. Experimental results show that the proposed methods can improve the performance and data locality of data-intensive workflows in cloud environments.
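The abstract gives no pseudocode, but the core idea of scoring candidate nodes by bandwidth-based transfer-time estimates can be sketched minimally as below. All names, sizes and the uniform-bandwidth assumption are hypothetical; D-LAWS's actual algorithm additionally balances node-level parallelism and consolidates VMs.

```python
# Minimal sketch of locality-aware node selection in the spirit of D-LAWS
# (a hypothetical simplification, not the paper's actual algorithm).

def estimated_staging_cost(task_inputs, node, data_location, bandwidth_mbps):
    """Estimate the data-staging time (seconds) of running a task on a node.

    task_inputs: dict mapping input file name -> size in MB
    data_location: dict mapping file name -> node currently holding it
    bandwidth_mbps: network bandwidth in MB/s (assumed uniform between nodes)
    """
    transfer_mb = sum(size for f, size in task_inputs.items()
                      if data_location.get(f) != node)  # only remote files move
    return transfer_mb / bandwidth_mbps

def pick_node(task_inputs, nodes, data_location, bandwidth_mbps):
    """Choose the node that minimizes estimated transfer time for the task."""
    return min(nodes, key=lambda n: estimated_staging_cost(
        task_inputs, n, data_location, bandwidth_mbps))

if __name__ == "__main__":
    inputs = {"a.dat": 800, "b.dat": 200}            # sizes in MB
    where = {"a.dat": "node1", "b.dat": "node2"}     # current file placement
    best = pick_node(inputs, ["node1", "node2", "node3"], where, 100.0)
    print(best)  # node1: only b.dat (200 MB) would need to be transferred
```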

2.
In many-task computing (MTC), applications such as scientific workflows or parameter sweeps communicate via intermediate files; application performance strongly depends on the file system in use. The state of the art uses runtime systems providing in-memory file storage that is designed for data locality: files are placed on those nodes that write or read them. With data locality, however, task distribution conflicts with data distribution, leading to application slowdown, and worse, to prohibitive storage imbalance. To overcome these limitations, we present MemFS, a fully symmetrical, in-memory runtime file system that stripes files across all compute nodes, based on a distributed hash function. Our cluster experiments with Montage and BLAST workflows, using up to 512 cores, show that MemFS has both better performance and better scalability than the state-of-the-art, locality-based file system, AMFS. Furthermore, our evaluation on a public commercial cloud validates our cluster results. On this platform MemFS shows excellent scalability up to 1024 cores and is able to saturate the 10G Ethernet bandwidth when running BLAST and Montage.
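As an illustration of the hash-based striping MemFS describes, a toy placement function is sketched below. The stripe size, the SHA-1 hash and the modulo placement are assumptions chosen for clarity; MemFS's actual distributed hash function and stripe size may differ.

```python
# Toy illustration of symmetric, hash-based file striping: each fixed-size
# stripe of a file is mapped to a compute node by a hash of
# (file name, stripe index), spreading storage load across all nodes.

import hashlib

STRIPE_SIZE = 4 * 1024 * 1024  # 4 MiB stripes (assumed value)

def stripe_node(filename: str, stripe_index: int, nodes):
    """Deterministically map one stripe of a file to a node."""
    key = f"{filename}:{stripe_index}".encode()
    digest = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return nodes[digest % len(nodes)]

def placement(filename: str, filesize: int, nodes):
    """Return the node holding each stripe of the file, in order."""
    n_stripes = (filesize + STRIPE_SIZE - 1) // STRIPE_SIZE
    return [stripe_node(filename, i, nodes) for i in range(n_stripes)]

if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(8)]
    print(placement("m101_mosaic.fits", 20 * 1024 * 1024, nodes))
```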

3.
A growing number of data- and compute-intensive experiments have been modeled as scientific workflows in the last decade. Meanwhile, clouds have emerged as a prominent environment to execute this type of workflow. In this scenario, the investigation of workflow scheduling strategies aimed at reducing execution time became a top priority and a very popular research field. However, few works consider the problem of data file assignment when solving the task scheduling problem. Usually, a workflow is represented by a graph where nodes represent tasks, and the scheduling problem consists of allocating tasks to machines to be executed at a predefined time, aiming at reducing the makespan of the whole workflow. In this article, we show that the scheduling of scientific workflows can be improved when the task scheduling and the data file assignment problems are treated together. Thus, we propose a new workflow representation, where nodes of the workflow graph represent either tasks or data files, and define the Task Scheduling and Data Assignment Problem (TaSDAP) on this new model. We formulate this problem as an integer programming problem. Moreover, a hybrid evolutionary algorithm for solving it, named HEA-TaSDAP, is also introduced. To evaluate our approach we conducted two types of experiments: theoretical and practical. First, we compared HEA-TaSDAP with the solutions produced by the mathematical formulation and by other works from the related literature. Then, we considered real executions in the Amazon EC2 cloud using a real scientific workflow use case (SciPhy, for phylogenetic analyses). In all experiments HEA-TaSDAP outperformed the classical approaches from the related literature, such as Min–Min and HEFT.
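A small sketch of the extended representation described above, with both tasks and data files as graph nodes, follows. The field names and the SciPhy-flavored example are illustrative only; the paper formalizes the model as an integer program rather than as code.

```python
# Sketch of a workflow graph whose nodes are either tasks or data files,
# with edges task -> file (produces) and file -> task (consumed by).

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                      # "task" or "data"
    size_gb: float = 0.0           # meaningful for data nodes
    out: list = field(default_factory=list)

def link(producer: Node, consumer: Node):
    producer.out.append(consumer)

t1, t2 = Node("align", "task"), Node("tree", "task")
d1 = Node("msa.fasta", "data", size_gb=1.2)
link(t1, d1)   # "align" produces msa.fasta
link(d1, t2)   # "tree" consumes msa.fasta
print([n.name for n in t1.out], [n.name for n in d1.out])
```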

4.
5.
Cloud computing has established itself as an interesting computational model that provides a wide range of resources such as storage, databases and computing power for several types of users. Recently, the concept of cloud computing was extended to federated clouds, where resources from different cloud providers are inter-connected to perform a common action (e.g. execute a scientific workflow). Users can benefit from both single-provider and federated cloud environments to execute their scientific workflows, since they can get the necessary amount of resources on demand. Several of these workflows demand high performance and parallelism techniques, since many activities are data- and compute-intensive and can execute for hours, days or even weeks. Some Scientific Workflow Management Systems (SWfMS) already provide parallelism capabilities for scientific workflows in single-provider clouds. Most of them rely on creating a virtual cluster to execute the workflow in parallel, but they also rely on the user to estimate the number of virtual machines to allocate for this virtual cluster, and most SWfMS use this initial configuration for the entire workflow execution. Dimensioning the virtual cluster to execute the workflow in parallel is then a top-priority task, since an under- or over-dimensioned virtual cluster can hurt workflow performance or unnecessarily increase financial costs. This dimensioning is far from trivial in a single-provider cloud, and especially so in federated clouds, due to the huge number of virtual machine types to choose from in each location and provider. In this article, we propose an approach named GraspCC-fed to produce an optimal (or near-optimal) estimate of the number of virtual machines to allocate for each workflow. GraspCC-fed extends a previously proposed GRASP-based heuristic for executing standalone applications to consider scientific workflows executed in both single-provider and federated clouds. For the experiments, GraspCC-fed was coupled to an adapted version of the SciCumulus workflow engine for federated clouds. We believe that GraspCC-fed can be an important decision-support tool that helps users determine an optimal virtual cluster configuration for parallel cloud-based scientific workflows.
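For readers unfamiliar with GRASP, the skeleton below shows the general shape of the heuristic GraspCC-fed builds on: repeated greedy-randomized construction followed by local search, keeping the best virtual-cluster configuration found. The VM types, prices and the core-hours cost model are stand-ins, not the paper's actual formulation.

```python
# Generic GRASP skeleton for dimensioning a virtual cluster (hypothetical
# cost model; GraspCC-fed's real objective and constraints differ).

import random

VM_TYPES = {"small": (1, 0.06), "medium": (2, 0.10), "large": (4, 0.18)}  # (cores, $/h)

def cost(config, work_core_hours, deadline_h):
    """Estimated monetary cost; inf if the deadline cannot be met."""
    cores = sum(VM_TYPES[t][0] * n for t, n in config.items())
    if cores == 0 or work_core_hours / cores > deadline_h:
        return float("inf")
    hours = work_core_hours / cores
    return hours * sum(VM_TYPES[t][1] * n for t, n in config.items())

def construct(rng):
    """Greedy-randomized construction: random VM counts per type."""
    return {t: rng.randint(0, 8) for t in VM_TYPES}

def local_search(config, work, deadline):
    """Remove one VM at a time while the cost keeps improving."""
    improved = True
    while improved:
        improved = False
        for t in VM_TYPES:
            if config[t] > 0:
                trial = dict(config); trial[t] -= 1
                if cost(trial, work, deadline) < cost(config, work, deadline):
                    config, improved = trial, True
    return config

def grasp(work_core_hours, deadline_h, iters=200, seed=42):
    rng, best = random.Random(seed), None
    for _ in range(iters):
        cand = local_search(construct(rng), work_core_hours, deadline_h)
        if best is None or (cost(cand, work_core_hours, deadline_h)
                            < cost(best, work_core_hours, deadline_h)):
            best = cand
    return best

if __name__ == "__main__":
    print(grasp(work_core_hours=64.0, deadline_h=4.0))
```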

6.
Security is increasingly critical for various scientific workflows that are big data applications and typically take a considerable amount of time to execute on large-scale distributed infrastructures. A cloud computing platform is such an infrastructure, enabling dynamic resource scaling on demand. Nevertheless, under the pay-per-use, hourly pricing model, users should pay attention to the cost incurred by renting virtual machines (VMs) from cloud data centers. Meanwhile, workflow tasks are generally heterogeneous and require different instance series (e.g., compute-optimized, memory-optimized, storage-optimized). In this paper, we propose a security- and cost-aware scheduling (SCAS) algorithm for heterogeneous tasks of scientific workflows in clouds. Our algorithm is based on a meta-heuristic optimization technique, particle swarm optimization (PSO), whose coding strategy is devised to minimize the total workflow execution cost while meeting the deadline and risk-rate constraints. Extensive experiments using three real-world scientific workflow applications and the CloudSim simulation framework demonstrate the effectiveness and practicality of our algorithm.
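A minimal PSO sketch for task-to-VM mapping in the spirit of SCAS is given below: each particle position is a real vector with one dimension per task, decoded into a VM index, and constraint violations are penalized in the fitness. The cost model, penalty form and all constants are illustrative; the paper's coding strategy and its risk-rate constraint are not reproduced here.

```python
# Hypothetical PSO sketch: continuous positions decoded to discrete
# task -> VM assignments, with a deadline penalty in the fitness.

import random

def decode(position, n_vms):
    """Map continuous particle coordinates to discrete VM indices."""
    return [int(abs(x)) % n_vms for x in position]

def fitness(position, task_hours, vm_price, deadline_h):
    """Rental cost plus a large penalty if the makespan misses the deadline."""
    n_vms = len(vm_price)
    vm_time = [0.0] * n_vms
    for task, vm in enumerate(decode(position, n_vms)):
        vm_time[vm] += task_hours[task]
    cost = sum(t * p for t, p in zip(vm_time, vm_price))
    return cost + (1e6 if max(vm_time) > deadline_h else 0.0)

def pso(task_hours, vm_price, deadline_h, n_particles=20, iters=200, seed=1):
    rng = random.Random(seed)
    dim, n_vms = len(task_hours), len(vm_price)
    fit = lambda x: fitness(x, task_hours, vm_price, deadline_h)
    xs = [[rng.uniform(0, n_vms) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    gbest = min(xs, key=fit)[:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (0.7 * vs[i][d] + 1.5 * r1 * (pbest[i][d] - xs[i][d])
                                           + 1.5 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            if fit(xs[i]) < fit(pbest[i]):
                pbest[i] = xs[i][:]
                if fit(xs[i]) < fit(gbest):
                    gbest = xs[i][:]
    return decode(gbest, n_vms), fit(gbest)

if __name__ == "__main__":
    # 6 tasks (hours of work) mapped onto 3 VM types priced per hour.
    print(pso([1.0, 2.0, 0.5, 1.5, 1.0, 2.5], [0.10, 0.20, 0.40], deadline_h=4.0))
```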

7.
An important challenge for the adoption of cloud computing in the scientific community remains the efficient allocation and execution of data-intensive scientific workflows to reduce execution time and the size of transferred data. The transferred-data overhead is becoming significant with emerging scientific workflows whose input/output files and intermediate data products range in the hundreds of gigabytes. The allocation of scientific workflows on public clouds can be described through a variety of perspectives and parameters, and has been proved to be NP-complete. This paper proposes an evolutionary approach for task allocation on public clouds considering data transfer and execution time. In our framework, a solution is represented using an allocation chromosome that encodes the allocation of tasks to nodes, and an ordering chromosome that defines the execution order according to the scientific workflow representation. We propose a multi-objective optimization that relies on a cloud cost model and employs tailored evolution operators. Starting from a population of possible solutions, we employ crossover and mutation operators on both chromosomes, aiming to optimize the data transferred between nodes as well as the total workflow runtime. The crossover operators combine parts of solutions to reduce data overhead, whereas the mutation operators swap parts of the same chromosome according to pre-defined rules. Our experimental study compares the proposed approach with current state-of-the-art approaches using synthetic and real-life workflows. Our algorithm performs similarly to existing heuristics for small workflows and shows up to 80% improvement for larger synthetic workflows. To further validate our approach, we compare the allocation and scheduling obtained by our approach with those obtained by popular scientific workflow managers when real workflows with hundreds of tasks are executed on a public cloud. The results show a 10% improvement in runtime over existing schedulers, driven by an 80% reduction in transferred data and optimized allocation and ordering of tasks. This improved data locality has greater impact, as it can be employed to improve and study data provenance and facilitate data persistence for scientific workflows.
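The two-chromosome encoding can be sketched as below, with a standard one-point crossover on the allocation chromosome and a precedence-aware swap mutation on the ordering chromosome. These are textbook operators chosen for illustration; the paper's tailored operators differ in detail.

```python
# Sketch of the two-chromosome encoding: an allocation chromosome
# (task index -> node) and an ordering chromosome (a topological
# permutation of task indices).

import random

rng = random.Random(0)

def one_point_crossover(alloc_a, alloc_b):
    """Combine two allocation chromosomes at a random cut point."""
    cut = rng.randrange(1, len(alloc_a))
    return alloc_a[:cut] + alloc_b[cut:]

def swap_mutation(ordering, precedence):
    """Swap two adjacent tasks unless a dependency edge forbids it."""
    i = rng.randrange(len(ordering) - 1)
    a, b = ordering[i], ordering[i + 1]
    if (a, b) not in precedence:      # b must not depend directly on a
        ordering[i], ordering[i + 1] = b, a
    return ordering

if __name__ == "__main__":
    alloc1, alloc2 = [0, 0, 1, 2], [2, 1, 1, 0]   # task -> node
    order = [0, 1, 2, 3]                          # execution order
    deps = {(0, 2), (1, 3)}                       # (u, v): v depends on u
    print(one_point_crossover(alloc1, alloc2))
    print(swap_mutation(order, deps))
```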

8.
In recent years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. Scientific workflows are becoming increasingly complex and more demanding in terms of computational resources, thus requiring parallel techniques and high performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm where resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was to provide high throughput computing, clouds are already being used to provide an HPC environment where elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many open, yet important, challenges, such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a very complex task, since many different criteria must be taken into account and the elasticity of the cloud must be exploited to optimize workflow execution. In this paper, we introduce an adaptive scheduling heuristic for parallel execution of scientific workflows in the cloud that is based on three criteria: total execution time (makespan), reliability and financial cost. Besides scheduling workflow activities based on a 3-objective cost model, this approach also scales resources up and down according to the restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution.
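The abstract does not spell out the cost model, but one common way to aggregate three criteria such as makespan, reliability and financial cost is a normalized weighted sum, sketched below. The normalization bounds and weights are assumptions for illustration, not the paper's actual 3-objective model.

```python
# Hypothetical weighted aggregation of makespan, reliability and cost.

def schedule_score(makespan_s, failure_prob, cost_usd,
                   w_time=0.5, w_rel=0.3, w_cost=0.2,
                   max_time=3600.0, max_cost=100.0):
    """Lower is better; each criterion is normalized to [0, 1]."""
    return (w_time * min(makespan_s / max_time, 1.0)
            + w_rel * failure_prob
            + w_cost * min(cost_usd / max_cost, 1.0))

# Example: compare two candidate activity placements.
print(schedule_score(1800, 0.02, 12.0))   # slower but reliable and cheap
print(schedule_score(1200, 0.15, 30.0))   # faster but riskier and pricier
```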

9.
Many scientific workflows are data intensive: large volumes of intermediate datasets are generated during their execution. Some valuable intermediate datasets need to be stored for sharing or reuse. Traditionally, they have been selectively stored according to system storage capacity, with the selection made manually. As doing science on clouds has become popular, more intermediate datasets in scientific cloud workflows can be stored by different storage strategies based on a pay-as-you-go model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate datasets can be regenerated, and on this basis we develop a novel algorithm that finds a minimum-cost storage strategy for the intermediate datasets in scientific cloud workflow systems. The strategy achieves the best trade-off between computation cost and storage cost by automatically storing the most appropriate intermediate datasets in cloud storage. This strategy can be utilised on demand as a minimum-cost benchmark for all other intermediate dataset storage strategies in the cloud. We utilise Amazon's cloud cost model and apply the algorithm to general random workflows as well as a specific astrophysics pulsar-searching scientific workflow for evaluation. The results show that the benchmark effectively demonstrates cost-effectiveness compared with other representative storage strategies.
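The underlying trade-off can be illustrated for a single dataset: keep paying storage, or delete it and pay regeneration compute each time it is needed. The prices below follow Amazon's published style (per GB-month storage, per compute hour) but the numbers and the per-dataset treatment are assumptions; the paper's algorithm makes this decision jointly over the whole dependency graph.

```python
# Simplified store-vs-regenerate comparison for one intermediate dataset.

def monthly_cost_if_stored(size_gb, storage_price=0.023):
    """Cost of simply keeping the dataset in cloud storage for a month."""
    return size_gb * storage_price

def monthly_cost_if_deleted(regen_cpu_hours, uses_per_month, cpu_price=0.10):
    """Cost of regenerating the dataset from its IDG predecessors on demand."""
    return regen_cpu_hours * cpu_price * uses_per_month

size_gb, regen_hours, uses = 500.0, 6.0, 3
store = monthly_cost_if_stored(size_gb)
delete = monthly_cost_if_deleted(regen_hours, uses)
print("store" if store < delete else "delete", store, delete)
```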

10.
11.
This paper focuses on data-intensive workflows and addresses the problem of scheduling workflow ensembles under cost and deadline constraints in Infrastructure as a Service (IaaS) clouds. Previous research in this area ignores file transfers between workflow tasks, which, as we show, often have a large impact on workflow ensemble execution. In this paper we propose and implement a simulation model for handling file transfers between tasks, featuring the ability to dynamically calculate bandwidth and supporting a configurable number of replicas, thus allowing us to simulate various levels of congestion. The resulting model is capable of representing a wide range of storage systems available on clouds: from in-memory caches (such as memcached), to distributed file systems (such as NFS servers) and cloud storage (such as Amazon S3 or Google Cloud Storage). We observe that file transfers may have a significant impact on ensemble execution; for some applications up to 90% of the execution time is spent on file transfers. Next, we propose and evaluate a novel scheduling algorithm that minimizes the number of transfers by taking advantage of data caching and file locality. We find that for data-intensive applications it performs better than other scheduling algorithms. Additionally, we modify the original scheduling algorithms to effectively operate in environments where file transfers take non-zero time.
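The dynamic-bandwidth idea can be sketched with a tiny event loop: a link's capacity is shared among concurrent transfers, so transfer times depend on congestion, and capacity is re-divided whenever a transfer finishes. The equal-sharing rule and units are assumptions; the paper's model also supports configurable replica counts, which this sketch omits.

```python
# Minimal event-driven simulation of transfers sharing one link fairly.

def simulate_transfers(sizes_mb, link_capacity_mbps):
    """Return the completion time (s) of each transfer under equal sharing."""
    remaining = list(sizes_mb)
    done = [0.0] * len(sizes_mb)
    clock = 0.0
    active = [i for i, s in enumerate(sizes_mb) if s > 0]
    while active:
        share = link_capacity_mbps / len(active)        # fair share per flow
        dt = min(remaining[i] for i in active) / share  # next completion event
        clock += dt
        for i in active:
            remaining[i] -= share * dt
        finished = [i for i in active if remaining[i] <= 1e-9]
        for i in finished:
            done[i] = clock
        active = [i for i in active if i not in finished]
    return done

print(simulate_transfers([100, 100, 400], link_capacity_mbps=100))
# -> [3.0, 3.0, 6.0]: congestion slows the small transfers down
```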

12.
Large-scale applications can be expressed as a set of tasks with data dependencies between them, also known as application workflows. Due to the scale and data-processing requirements of these applications, they require Grid computing and storage resources. So far, the focus has been on developing easy-to-use interfaces for composing these workflows and on finding an optimal mapping of tasks in the workflow to the Grid resources in order to minimize the completion time of the application. After this mapping is done, a workflow execution engine is required to run the workflow over the mapped resources. In this paper, we show that the performance of the workflow execution engine can also be a critical factor in determining the workflow completion time. Using Condor as the workflow execution engine, we examine the various factors that affect the completion time of a fine-granularity astronomy workflow. We show that changing the system parameters that influence these factors and restructuring the workflow can drastically reduce the completion time of this class of workflows. We also examine the effect of the optimizations developed for the astronomy application on a coarser-granularity biology application. We were able to reduce the completion time of the Montage and the Tomography application workflows by 90% and 50%, respectively.

13.
Cloud computing is a relatively new concept in distributed systems and is widely accepted as a solution for high performance and distributed computing. Its dynamism in providing virtual resources for organisations and laboratories and its pay-per-use policy make it very popular. A workflow models a process consisting of a series of steps that shape an application. Workflow scheduling is the method for assigning each workflow task to a processing resource in a way that satisfies specific workflow rules. Scheduling algorithms for workflows may take into account quality-of-service parameters such as cost and deadline. Some efforts have been made on workflow scheduling in cloud computing environments with different service level agreements, but most of them suffer from low speed. Here, we introduce a new hybrid heuristic algorithm based on particle swarm optimisation (PSO) and the gravitational search algorithm. The proposed algorithm takes deadline limitations into account, in addition to processing cost and transfer cost, and can be used by both end-users and utility providers. The CloudSim toolkit is used as the cloud environment simulator, with Amazon EC2 pricing as the reference pricing. Our experimental results show about 70% cost reduction in comparison to non-heuristic implementations, 30% in comparison to PSO, 30% in comparison to the gravitational search algorithm, and 50% in comparison to the hybrid genetic-gravitational algorithm.

14.
Scientific workflow applications are complex, data-intensive applications, common in disciplines that involve distributed data sources, such as structural biology, high-energy physics and neuroscience. Because the data are stored in a distributed fashion on Internet-based cloud computing platforms, executing a scientific workflow entails large volumes of data transfer. Cloud computing is a pay-per-use model, so data transfers incur transfer fees; in particular, when multiple workflows cooperate with each other, even higher transfer costs arise. This paper builds, from a global perspective, a transfer-cost model based on a multi-workflow data dependency graph and studies a data-placement optimization strategy based on the binary particle swarm optimization (BPSO) algorithm, thereby reducing the fees for renting cloud transfer resources.
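The standard BPSO update, which the strategy above builds on, can be sketched as follows: each bit of a particle encodes a placement decision (here, hypothetically, whether a dataset is placed at a given data center), and a sigmoid maps velocity to a bit-flip probability. The cost model is omitted entirely; only the textbook binary-PSO update rule is shown.

```python
# One binary-PSO update step: move velocities, then resample bits via the
# sigmoid rule. The placement interpretation of the bits is hypothetical.

import math
import random

rng = random.Random(7)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def bpso_step(bits, velocity, best_bits, global_bits, w=0.7, c1=1.5, c2=1.5):
    """Update one particle's velocity and binary position in place."""
    for d in range(len(bits)):
        r1, r2 = rng.random(), rng.random()
        velocity[d] = (w * velocity[d]
                       + c1 * r1 * (best_bits[d] - bits[d])
                       + c2 * r2 * (global_bits[d] - bits[d]))
        bits[d] = 1 if rng.random() < sigmoid(velocity[d]) else 0
    return bits, velocity

bits = [rng.randint(0, 1) for _ in range(8)]   # dataset j placed at center? 0/1
vel = [0.0] * 8
print(bpso_step(bits, vel, best_bits=[1] * 8, global_bits=[0] * 8))
```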

15.
Neuroimaging is a field that benefits from distributed computing infrastructures (DCIs) to perform data processing and analysis, which is often achieved using Grid workflow systems. Collaborative research in neuroimaging requires ways to facilitate exchange between different groups, in particular to enable sharing, re-use and interoperability of applications implemented as workflows. The SHIWA project provides solutions to facilitate sharing and exchange of workflows between workflow systems and DCI resources. In this paper we present and analyse how the SHIWA Platform was used to implement various cases in which workflow exchange supports collaboration in neuroscience. The SHIWA Platform and the implemented solutions are described and analysed from a “user” perspective, in this case workflow developers and neuroscientists. We conclude that the platform in its current form is valuable for these cases, and we identify remaining challenges.

16.
With the continuing development of distributed storage technology, more and more enterprise and government users store their data in the cloud, achieving distributed storage of big data and sharing of data resources. The decentralization, traceability, tamper-resistance and data-consistency properties of blockchain technology offer a new opportunity for addressing the privacy and security challenges of cloud storage. This paper proposes a blockchain-based distributed off-chain storage framework. Block nodes and storage nodes are deployed in the blockchain: block nodes execute the underlying blockchain mechanisms, while storage nodes store data and files. Separating the block and storage functions realizes off-chain storage and preserves the operating efficiency of the blockchain. In addition, the paper proposes a global interactive verification method based on the classical provable-data-possession mechanism to ensure distributed, reliable and provable storage of data files. When a user adds a block (stored file) to the blockchain, a fair-challenge audit mechanism is triggered, implicitly verifying the integrity of all files held in off-chain storage.
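A heavily simplified challenge-response sketch in the spirit of provable data possession (PDP) follows: the verifier precomputes digests over random (block, nonce) pairs, then later challenges the storage node, which must answer with H(nonce || block). Real PDP schemes use homomorphic tags so the verifier need not precompute per-challenge answers; this toy version only illustrates the audit flow described above.

```python
# Toy spot-check integrity audit (not a real PDP scheme).

import hashlib
import os
import random

BLOCK = 4096

def digest(nonce: bytes, block: bytes) -> str:
    return hashlib.sha256(nonce + block).hexdigest()

def split_blocks(data: bytes):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

# Setup: the verifier keeps only (index, nonce, expected-digest) tuples.
data = os.urandom(3 * BLOCK + 100)
blocks = split_blocks(data)
challenges = []
for _ in range(4):
    i, nonce = random.randrange(len(blocks)), os.urandom(16)
    challenges.append((i, nonce, digest(nonce, blocks[i])))

# Audit: the storage node recomputes digests from what it actually holds.
stored = split_blocks(data)   # replace with corrupted data to see a failure
ok = all(digest(n, stored[i]) == exp for i, n, exp in challenges)
print("storage proof:", "valid" if ok else "INVALID")
```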

17.
Many Grid workflow middleware services require knowledge about the performance behavior of Grid applications/services in order to effectively select, compose, and execute workflows in dynamic and complex Grid systems. To provide performance information for building such knowledge, Grid workflow performance tools have to select, measure, and analyze various performance metrics of workflows. However, there is a lack of comprehensive studies of the performance metrics that can be used to evaluate the performance of a workflow executed in the Grid. Moreover, given the complexity of both Grid systems and workflows, the semantics of essential performance-related concepts and relationships, and of the associated performance data in Grid workflows, should be well described. In this paper, we analyze the performance metrics that monitoring and analysis tools should provide during the evaluation of the performance of Grid workflows; these metrics are associated with multiple levels of abstraction. We introduce an ontology for describing performance data of Grid workflows and illustrate how the ontology can be utilized for monitoring and analyzing the performance of Grid workflows.

18.
游小容  曹晟 《计算机科学》2015,42(10):76-80
As a mature distributed cloud platform, Hadoop provides reliable and efficient storage services and is commonly used to store large files, but its efficiency drops significantly when handling massive numbers of small files. This paper proposes a Hadoop-based storage optimization scheme for the small files in massive educational resources: exploiting the associations between small educational-resource files, small files are merged into large files to reduce the file count; an index mechanism is used to access the small files; and metadata caching together with prefetching of associated small files improves read efficiency. Experiments show that these methods improve the efficiency with which the Hadoop file system stores and retrieves small files.
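The merge-plus-index mechanism can be shown with a minimal stand-alone sketch: pack many small files into one container file and keep an offset/length index, so a read becomes a single seek rather than one file-system object per small file. On Hadoop this would be done over HDFS (compare HAR files or SequenceFiles); the local-disk version below only illustrates the mechanism, and all file names are made up.

```python
# Local sketch of small-file merging with an offset index.

import json
import os

def merge(small_files, container_path, index_path):
    """Concatenate small files; record (offset, length) for each name."""
    index, offset = {}, 0
    with open(container_path, "wb") as out:
        for path in small_files:
            data = open(path, "rb").read()
            index[os.path.basename(path)] = (offset, len(data))
            out.write(data)
            offset += len(data)
    json.dump(index, open(index_path, "w"))

def read_small(name, container_path, index_path):
    """Fetch one small file back via the index: a single seek + read."""
    offset, length = json.load(open(index_path))[name]
    with open(container_path, "rb") as f:
        f.seek(offset)
        return f.read(length)

if __name__ == "__main__":
    for i in range(3):
        open(f"res{i}.txt", "w").write(f"resource {i}")
    merge([f"res{i}.txt" for i in range(3)], "container.bin", "index.json")
    print(read_small("res1.txt", "container.bin", "index.json"))
```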

19.
The paper proposes a model, design, and implementation of an efficient multithreaded engine for the execution of distributed service-based workflows with data streaming defined on a per-task basis. The implementation takes into account capacity constraints of the servers on which services are installed and, if needed, the workflow data footprint. Furthermore, it also considers the storage space of the workflow execution engine and its cost. Caching of service output data is implemented to speed up workflow execution. Input data is partitioned into data packets, which are passed to and processed by the services previously selected for workflow tasks so that the aforementioned constraints are met. The performance impact of the proposed mechanisms is investigated for workflow structures common in directed-acyclic-graph workflow applications. It is shown, for a real workflow with distributed processing of digital media content, that the initial budget needs to be properly distributed between the cost of services and the cost of intermediate storage to obtain good workflow execution times.
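The output-caching idea can be sketched as a bounded memo table keyed by (service, packet): repeated packets skip re-invocation at the price of engine-side storage. The LRU bound stands in for the storage budget discussed above, and the service function is a placeholder; the engine's actual cache policy is not described in the abstract.

```python
# Hypothetical bounded LRU cache of service outputs.

from collections import OrderedDict

class OutputCache:
    """Memoize (service, packet) results; the bound models the storage budget."""
    def __init__(self, max_entries=1000):
        self.store, self.max = OrderedDict(), max_entries

    def get_or_compute(self, service_id, packet, compute):
        key = (service_id, packet)
        if key in self.store:
            self.store.move_to_end(key)        # refresh LRU position
            return self.store[key]
        result = compute(packet)               # invoke the (remote) service
        self.store[key] = result
        if len(self.store) > self.max:
            self.store.popitem(last=False)     # evict least-recently used
        return result

cache = OutputCache(max_entries=2)
upper = lambda p: p.upper()                    # stand-in for a media service
print(cache.get_or_compute("svc-A", "frame1", upper))
print(cache.get_or_compute("svc-A", "frame1", upper))  # served from cache
```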

20.
As large-scale distributed systems gain momentum, the scheduling of workflow applications with multiple requirements on such computing platforms has become a crucial area of research. In this paper, we investigate the workflow scheduling problem in large-scale distributed systems from the Quality of Service (QoS) and data-locality perspectives. We present a scheduling approach that considers two models of synchronization for the tasks in a workflow application: (a) communication through the network and (b) communication through temporary files. Specifically, we investigate via simulation the performance of a heterogeneous distributed system where multiple soft real-time workflow applications arrive dynamically. The applications are scheduled under various tardiness bounds, taking into account the communication cost in the first case study, and the I/O cost and data locality in the second. The simulation results provide useful insights into the impact of the tardiness bound and data locality on system performance.
