首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
    
Task-based systems have become popular due to their ability to utilize the computational power of complex heterogeneous systems. A typical programming model used is the Sequential Task Flow (STF) model, which unfortunately only supports static task graphs. This can result in submission overhead and a static task graph that is not well-suited for execution on heterogeneous systems. A common approach is to find a balance between the granularity needed for accelerator devices and the granularity required by CPU cores to achieve optimal performance. To address these issues, we have extended the STF model in the StarPU runtime system by introducing the concept of hierarchical tasks. This allows for a more dynamic task graph and, when combined with an automatic data manager, it is possible to adjust granularity at runtime to best match the targeted computing resource. That data manager makes it possible to switch between various data layout without programmer input and allows us to enforce the correctness of the DAG as hierarchical tasks alter it during runtime. Additionally, submission overhead is reduced by using large-grain hierarchical tasks, as the submission process can now be done in parallel. We have shown that the hierarchical task model is correct and have conducted an early evaluation on shared memory heterogeneous systems using the Chameleon dense linear algebra library.  相似文献   

2.
    
David M. Rogers 《Software》2023,53(1):99-114
Runtime scheduling and workflow systems are an increasingly popular algorithmic component in HPC because they allow full system utilization with relaxed synchronization requirements. There are so many special-purpose tools for task scheduling, one might wonder why more are needed. Use cases seen on the Summit supercomputer needed better integration with MPI and greater flexibility in job launch configurations. Preparation, execution, and analysis of computational chemistry simulations at the scale of tens of thousands of processors revealed three distinct workflow patterns. A separate job scheduler was implemented for each one using extremely simple and robust designs: file-based, task-list based, and bulk-synchronous. Comparing to existing methods shows unique benefits of this work, including simplicity of design, suitability for HPC centers, short startup time, and well-understood per-task overhead. All three new tools have been shown to scale to full utilization of Summit, and have been made publicly available with tests and documentation. This work presents a complete characterization of the minimum effective task granularity for efficient scheduler usage scenarios. These schedulers have the same bottlenecks, and hence similar task granularities as those reported for existing tools following comparable paradigms.  相似文献   

3.
    
This work describes some first efforts to design a new Linpack-like benchmark useful to evaluate the performance of Heterogeneous Computing Resources. The benchmark is based on the Schur Complement reformulation of the solution of a linear equation system. Details about its implementation and evaluation, mainly in terms of performance scalability, are presented for a computing environment based on multi NVIDIA GP-GPUs nodes connected by an Infiniband network.  相似文献   

4.
    
The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared‐memory, multicore architectures. Current numerical libraries (e.g., linear algebra package) show clear limitations on such emerging systems mainly because of their coarse granularity tasks. Thus, many numerical algorithms need to be redesigned to better fit the architectural design of the multicore platform. The parallel linear algebra for scalable multicore architectures library developed at the University of Tennessee tackles this challenge by using tile algorithms to achieve a finer task granularity. These tile algorithms can then be represented by directed acyclic graphs, where nodes are the tasks and edges are the dependencies between the tasks. The paramount key to achieve high performance is to implement a runtime environment to efficiently schedule the execution of the directed acyclic graph across the multicore platform. This paper studies the impact on the overall performance of some parameters, both at the level of the scheduler (e.g., window size and locality) and the algorithms (e.g., left‐looking and right‐looking variants). We conclude that some commonly accepted rules for dense linear algebra algorithms may need to be revisited. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

5.
    
We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU‐CPU computation. We compare single and double precision performance of a modern GPU with unified architecture, and show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems, exploiting the potential of the processor for single precision arithmetic. Experimental results on a GTX280 using CUBLAS 2.0, the implementation of BLAS for NVIDIA® GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

6.
该文提出一个针对大型实对称正定稠密方程组或复对称非Hermitian稠密方程组线性求解器的并行分布式算法。它使用了不同于ScaLAPACK的J-变量块Cholesky分解算法和一维块循环列数据分配。该算法以MPI作为消息传递库,在最多可达16个处理器的集群上针对实对称正定稠密方程组可提供与ScaLAPACK近似的浮点操作性能,并可解决一些涉及复对称非Hermitian稠密方程组的电磁场散射问题。该算法的优点是执行Cholesky分解所需的存储量只是标准并行库ScaLAPACK的一半。仿真的数值结果表明该算法是正确、有效的。  相似文献   

7.
    
Any factorization/back substitution scheme for the solution of linear systems consists of two phases which are different in nature, and hence may be inefficient for parallel implementation on a single computational network. The Gauss-Jordan elimination scheme unifies the nature of the two phases of the solution process and thus seems to be more suitable for parallel architectures, especially if reconfiguration of the communication pattern is not permitted. In this communication, a computational network for the Gauss-Jordan algorithm is presented. This network compares favorably with optimal implementations of the Gauss elimination/back substitution algorithm.  相似文献   

8.
建立了一个异构分布式系统实时调度模型,对异构分布式系统中的任务及不同处理机资源进行了形式化描述.结合基版本/副版本技术,给出了用于异构分布式系统的实时任务轮转式容错调度算法.实例分析表明,该算法有效提高了异构处理机环境下的资源利用率以及整体计算性能.  相似文献   

9.
刘弢  范彬  吴承勇  张兆庆 《软件学报》2008,19(9):2181-2190
提出了一种具有数据流特征的Java并行程序设计模型,并针对该模型提出了一种基于运行时信息反馈的自适应优化算法,使得运行时系统可以利用数据流程序所暴露出的数据并行性,加速程序的运行.此外,在该模型中加入了数据流多态的概念,扩展了该模型的面向对象特性.在一个实际的开放源码Java虚拟机中实现了上述程序设计模型及优化方法.在实际多核多线程机器上的实验结果表明,所提出的程序设计模型及优化能够充分利用硬件的并行处理能力,显著地提高了程序的性能.  相似文献   

10.
    
Distributed computing systems are a viable and less expensive alternative to parallel computers. However, a serious difficulty in concurrent programming of a distributed system is how to deal with scheduling and load balancing of such a system which may consist of heterogeneous computers. Some distributed scheduling schemes suitable for parallel loops with independent iterations on heterogeneous computer clusters have been designed in the past. In this work we study self‐scheduling schemes for parallel loops with independent iterations which have been applied to multiprocessor systems in the past. We extend one important scheme of this type to a distributed version suitable for heterogeneous distributed systems. We implement our new scheme on a network of computers and make performance comparisons with other existing schemes. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

11.
    
While heterogeneous computing has emerged as a dominant trend in current and future High-Performance Computing (HPC) systems, it is also widely recognized that this shift has led to increased software complexity due to a proliferation of programming systems for different heterogeneous processors. One such example is the Heterogeneous-Compute Interface for Portability from AMD (HIP  ), which is composed of a C Runtime API and C++ Kernel Language. Many HPC applications will likely use HIP  on future exascale systems (e.g., Frontier and El Capitan), but HIP  currently only targets AMD and NVIDIA processors. This limitation creates challenges for users who would also like to run their applications on exascale systems based on other architectures (e.g., Aurora, which is based on Intel hardware) that are currently not targeted by HIP  . In this paper, we introduce the design and implementation of HIPLZ  , a compiler and runtime system that uses the Intel Level Zero API to support HIP  on Intel GPU architectures. We discuss the design of HIPLZ  , derived from HIPCL  (an implementation of HIP  on top of OpenCL  ), and portability issues that occur from using the Level Zero runtime as a backend. We evaluate our implementation by running several performance benchmarks and mini-apps written in HIP  on Intel architectures using HIPLZ  . Our results show that this approach provides competitive performance relative to Intel's OpenCL implementations on Intel Gen9 and UHD Graphics 770 GPUs, while providing good coverage of features needed by HPC applications. Overall, this approach is a promising demonstration of enabling performance portability for exascale systems.  相似文献   

12.
13.
In this paper, we consider the problem of multimedia synchronization based on scheduling the transmission of multimedia documents in a networked environment. Assuming channels with different bandwidth and delay characteristics are established between the multimedia server and the client, we formulate the scheduling problem to ensure interstream and intrastream synchronization as a parallel processor scheduling problem. Since the heterogeneous parallel processor scheduling problem is NP-hard, we propose two heuristic algorithms with time complexity ofO(n log n+nm), wherenis the number of data units to be scheduled andmthe number of channels available. We also develop an enumerative algorithm to obtain the exact solutions. Extensive computational simulations reveal that the heuristics consistently obtain near-optimal solutions. From the simulation results, we also identify special structures of multimedia documents along with characteristics of the available channels which affect the relative performance of the algorithms.  相似文献   

14.
实时多处理器系统的动态调度算法一直是实时系统中的重要研究课题.根据异构实时多处理器的特点,提出了一种新的异构实时动态调度算法P_IEFT.该算法采用了一个新的处理器分配策略——将任务分配到能最早完成任务的处理器上.该策略能够缩短调度长度,提高后继任务被接受的可能性,从而能够提高成功调度率.模拟结果表明,该调度算法的成功调度率高于近视算法和节约算法的成功调度率.  相似文献   

15.
基于异构分布式系统的实时容错调度算法   总被引:26,自引:1,他引:26  
目前文献中研究的实时容错调度算法都是基于同构分布式系统,系统中的所有处理机完全相同。该文首先建立了一个基于异构分布式系统实时容错调度模型,异构分布式系统中的各个处理机均不相同。基于该异构分布式系统模型,该文引入了可靠性代价(reliability cost)概念,并提出两种静态实时容错调度算法(RTFTNO和RTFTRC)用于调度周期性实时容错任务。算法RTFTRC在调度任务时,尽量使系统的可靠性代价最小;而算法RTFTNO在调度实时任务时,没有考虑系统的可靠性代价。该文详细讨论了两种调度算法的性能。性能模拟实验分别比较了两个算法的可靠性代价,超时比率和可调度性;并研究了任务的计算时间与可靠性代价的关系以及调度长度阈值与最小处理机个数的关系。实验结果表明,算法RTFTRC的性能优于算法RTFTNO。  相似文献   

16.
    
Transactional memory (TM) is an approach to concurrency control that aims to make writing parallel programs both effective and simple. The approach has been initially proposed for nondistributed multiprocessor systems, but it is gaining popularity in distributed systems to synchronize tasks at large scales. Efficiency and scalability are often the key issues in TM research; thus, performance benchmarks are an important part of it. However, while standard TM benchmarks like the Stanford Transactional Applications for Multi‐Processing suite and STMBench7 are available and widely accepted, they do not translate well into distributed systems. Hence, the set of benchmarks usable with distributed TM systems is very limited, and must be padded with microbenchmarks, whose simplicity and artificial nature often makes them uninformative or misleading. Therefore, this paper introduces Helenos, a realistic, complex, and comprehensive distributed TM benchmark based on the problem of the Facebook inbox, an application of the Cassandra distributed store.  相似文献   

17.
    
Given the complexity of current supercomputers and applications, being able to trace application executions to understand their behavior is not a luxury. As constraints, tracing systems have to be as little intrusive as possible in the application code and performances, and be precise enough in the collected data. In this article, we present how works the tracing system used by the task-based runtime system StarPU. We study the different sources of performance overhead coming from the tracing system and how to reduce these overheads. Then, we evaluate the accuracy of distributed traces with different clock synchronization techniques. Finally, we summarize our experiments and conclusions with the lessons we learned to efficiently trace applications, and the list of characteristics each tracing system should feature to be competitive. The reported experiments and implementation details comprise a feedback of integrating into a task-based runtime system state-of-the-art techniques to efficiently and precisely trace application executions. We highlight the points every application developer or end-user should be aware of to seamlessly integrate a tracing system or just trace application executions.  相似文献   

18.
    
Load balancing increases the efficient use of existing resources for parallel and distributed applications. At a coarse level of granularity, advances in runtime systems for parallel programs have been proposed in order to control available resources as efficiently as possible by utilizing idle resources and using task migration. Simultaneously, at a finer granularity level, advances in algorithmic strategies for dynamically balancing computational loads by data redistribution have been proposed in order to respond to variations in processor performance during the execution of a given parallel application. Combining strategies from each level of granularity can result in a system which delivers advantages of both. The resulting integration is systemic in nature and transfers the responsibility of efficient resource utilization from the application programmer to the runtime system. This paper presents the design and implementation of a system that combines an algorithmic fine-grained data parallel load balancing strategy with a systemic coarse-grained task-parallel load balancing strategy, and reports on recent experimental results of running a computationally intensive scientific application under this integrated system. The experimental results indicate that a distributed runtime environment which combines both task and data migration can provide performance advantages with little overhead. It also presents proposals for performance enhancements of the implementation, as well as future explorations for effective resource management. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

19.
    
High performance fast multipole method is crucial for the numerical simulation of many physical problems. In a previous study, we have shown that task‐based fast multipole method provides the flexibility required to process a wide spectrum of particle distributions efficiently on multicore architectures. In this paper, we now show how such an approach can be extended to fully exploit heterogeneous platforms. For that, we design highly tuned graphics processing unit (GPU) versions of the two dominant operators P2P and M2L) as well as a scheduling strategy that dynamically decides which proportion of subsequent tasks is processed on regular CPU cores and on GPU accelerators. We assess our method with the StarPU runtime system for executing the resulting task flow on an Intel X5650 Nehalem multicore processor possibly enhanced with one, two, or three Nvidia Fermi M2070 or M2090 GPUs (Santa Clara, CA, USA). A detailed experimental study on two 30 million particle distributions (a cube and an ellipsoid) shows that the resulting software consistently achieves high performance across architectures. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

20.
    
This paper presents a novel list‐based scheduling algorithm called Improved Predict Earliest Finish Time for static task scheduling in a heterogeneous computing environment. The algorithm calculates the task priority with a pessimistic cost table, implements the feature prediction with a critical node cost table, and assigns the best processor for the node that has at least 1 immediate successor as the critical node, thereby effectively reducing the schedule makespan without increasing the algorithm time complexity. Experiments regarding aspects of randomly generated graphs and real‐world application graphs are performed, and comparisons are made based on the scheduling length ratio, robustness, and frequency of the best result. The results demonstrate that the Improved Predict Earliest Finish Time algorithm outperforms the Predict Earliest Finish Time and Heterogeneous Earliest Finish Time algorithms in terms of the schedule length ratio, frequency of the best result, and robustness while maintaining the same time complexity.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号