
1.

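Elastic Data Dependences and Software Pipelining  Total citations: 1, self-citations: 0, citations by others: 1

Rong Hongbo, Tang Zhizhong 《软件学报》 (Journal of Software) 2001,12(6):894-906

The worst-case path is a major obstacle to software pipelining of loops that contain branches. In such loops, some data dependences (called elastic dependences) may or may not produce instances during the dynamic execution of the loop. Accordingly, elastic dependences that severely restrict parallelism can be replaced by less restrictive fictitious dependences before software pipelining is applied. If the resulting schedule violates an original elastic dependence, it is corrected with a push-down transformation. This alleviates or completely removes the constraint imposed by the worst-case path. The method complements classical control speculation; its distinguishing feature is that it allows the schedule to contain errors and then corrects them.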

2.

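Software Pipelining Guided by Cache Profiling Information

Zhou Qian, Feng Xiaobing, Zhang Zhaoqing 《计算机研究与发展》 (Journal of Computer Research and Development) 2008,45(5):834-840

Software pipelining is an important instruction scheduling technique that speeds up loop execution by overlapping instructions from different loop iterations. As the gap between processor speed and memory access speed keeps widening, memory access instructions, especially those that miss in the cache, increasingly become the bottleneck for system performance. Because the latency of these instructions is not fixed, predicting and hiding it during software pipelining is very important. Unlike earlier approaches to predicting memory latency, this work introduces cache profiling: dynamically collected profile information is used to predict memory access latency and to schedule accordingly. When the latency assumed for memory access instructions in a modulo-scheduled loop is increased, the initiation interval grows as well, so performance does not improve accordingly. The CSMS and FLMS algorithms change the latency of memory access instructions while avoiding, as far as possible, any increase in the initiation interval. The authors improve CSMS and FLMS so that memory latency is adjusted according to cache profiling information, which makes the result more accurate than earlier methods. Experiments show that the new method effectively improves program performance: the average improvement on the SPEC2000 benchmarks is about 1%, and individual cases improve by as much as 11%.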
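
The link between assumed load latency and initiation interval can be illustrated with the textbook lower bound used in modulo scheduling, MII = max(ResMII, RecMII). The sketch below is not the CSMS or FLMS algorithm from the abstract; the function names, resource names, unit counts, and latencies are all hypothetical, and it assumes the load lies on a loop-carried recurrence, which is one common way a longer latency raises the bound.

```python
from math import ceil
from typing import Dict, List, Tuple


def res_mii(uses: Dict[str, int], units: Dict[str, int]) -> int:
    """Resource-constrained bound: each resource class needs
    ceil(operations using it / available units) cycles per iteration."""
    return max(ceil(uses[r] / units[r]) for r in uses)


def rec_mii(cycles: List[Tuple[int, int]]) -> int:
    """Recurrence-constrained bound. Each dependence cycle is given as
    (total latency along the cycle, total iteration distance)."""
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)


def min_initiation_interval(uses: Dict[str, int],
                            units: Dict[str, int],
                            cycles: List[Tuple[int, int]]) -> int:
    """Lower bound on the initiation interval of a modulo schedule."""
    return max(res_mii(uses, units), rec_mii(cycles))


# Hypothetical loop body: 2 memory operations and 4 ALU operations
# on a machine with 1 memory port and 2 ALUs.
USES = {"mem": 2, "alu": 4}
UNITS = {"mem": 1, "alu": 2}

# One recurrence (load followed by an add, distance 1), scheduled first
# with a cache-hit latency of 2 cycles, then with a miss latency of 10.
HIT_CYCLE = [(2 + 1, 1)]
MISS_CYCLE = [(10 + 1, 1)]
```

With these made-up numbers, `min_initiation_interval(USES, UNITS, HIT_CYCLE)` is 3 and `min_initiation_interval(USES, UNITS, MISS_CYCLE)` is 11, which shows why raising the latency of every load indiscriminately can cost more than it hides.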

3.

Towards the optimal synchronization granularity for dynamic scheduling of pipelined computations on heterogeneous computing systems

I. Riakiotakis, F. M. Ciorba, T. Andronikos, G. Papakonstantinou, A. T. Chronopoulos 《Concurrency and Computation》 2012,24(18):2302-2327

Loops are the richest source of parallelism in scientific applications. A large number of loop scheduling schemes have therefore been devised for loops with and without data dependencies (modeled as dependence distance vectors) on heterogeneous clusters. The loops with data dependencies require synchronization via cross‐node communication. Synchronization requires fine‐tuning to overcome the communication overhead and to yield the best possible overall performance. In this paper, a theoretical model is presented to determine the granularity of synchronization that minimizes the parallel execution time of loops with data dependencies when these are parallelized on heterogeneous systems using dynamic self‐scheduling algorithms. New formulas are proposed for estimating the total number of scheduling steps when a threshold for the minimum work assigned to a processor is assumed. The proposed model uses these formulas to determine the synchronization granularity that minimizes the estimated parallel execution time. The accuracy of the proposed model is verified and validated via extensive experiments on a heterogeneous computing system. The results show that the theoretically optimal synchronization granularity, as determined by the proposed model, is very close to the experimentally observed optimal synchronization granularity, with no deviation in the best case, and within 38.4% in the worst case. Copyright © 2012 John Wiley & Sons, Ltd.
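
To make the notion of "number of scheduling steps under a minimum-work threshold" concrete, the sketch below counts the chunks handed out by guided self-scheduling, a well-known dynamic self-scheduling scheme in which each request receives ceil(remaining / workers) iterations. The abstract does not say which schemes the paper analyzes, so the choice of guided self-scheduling, the function name `gss_chunks`, and the parameter `min_chunk` are assumptions made for illustration only; the paper's own formulas are not reproduced here.

```python
from math import ceil
from typing import List


def gss_chunks(total: int, workers: int, min_chunk: int = 1) -> List[int]:
    """Return the chunk sizes assigned by guided self-scheduling.

    Each scheduling step hands out ceil(remaining / workers) iterations,
    but never fewer than min_chunk (unless fewer iterations remain).
    """
    if total < 0 or workers < 1 or min_chunk < 1:
        raise ValueError("total >= 0, workers >= 1, min_chunk >= 1 required")
    remaining = total
    chunks: List[int] = []
    while remaining > 0:
        size = max(ceil(remaining / workers), min_chunk)
        size = min(size, remaining)  # the last chunk may be smaller
        chunks.append(size)
        remaining -= size
    return chunks
```

For 100 iterations on 4 workers, `gss_chunks(100, 4)` takes 14 scheduling steps, while `gss_chunks(100, 4, min_chunk=10)` takes 8. Fewer steps mean fewer scheduling and synchronization events, at the price of coarser load balancing near the end of the loop.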

4.

Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators

《Parallel Computing》2014,40(8):425-447

EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:

• method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
• method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
• method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
• approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.

Hybrid platforms tested in this study contain different numbers of CPUs and GPUs – from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems – both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively.

5.

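Software Pipelining Based on Four-Phase Manual Optimization  Total citations: 1, self-citations: 1, citations by others: 0

Zhou Guojian, Wu Shaogang, Li Zusong, Shi Gang 《计算机工程》 (Computer Engineering) 2009,35(5):40-43

Code size is one of the important factors when optimizing embedded systems with limited storage resources. With this in mind, a four-phase manual optimization method for software pipelining (FPMO) is proposed, using the oprofile performance analysis tool and the EEMBC benchmark suite as the workload. Experimental results on the autocorrelation program from the telecom category show that FPMO obtains a 40.678% performance improvement at the cost of a 2.04% increase in code size, whereas automatic compiler optimization alone obtains a 38.33% performance improvement at the cost of a 33.35% increase in code size.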

6.

Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures

Azzam Haidar, Hatem Ltaief, Asim YarKhan, Jack Dongarra 《Concurrency and Computation》 2012,24(3):305-321

The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared‐memory, multicore architectures. Current numerical libraries (e.g., the linear algebra package, LAPACK) show clear limitations on such emerging systems mainly because of their coarse granularity tasks. Thus, many numerical algorithms need to be redesigned to better fit the architectural design of the multicore platform. The parallel linear algebra for scalable multicore architectures (PLASMA) library developed at the University of Tennessee tackles this challenge by using tile algorithms to achieve a finer task granularity. These tile algorithms can then be represented by directed acyclic graphs, where nodes are the tasks and edges are the dependencies between the tasks. The key to achieving high performance is to implement a runtime environment that efficiently schedules the execution of the directed acyclic graph across the multicore platform. This paper studies the impact on the overall performance of some parameters, both at the level of the scheduler (e.g., window size and locality) and the algorithms (e.g., left‐looking and right‐looking variants). We conclude that some commonly accepted rules for dense linear algebra algorithms may need to be revisited. Copyright © 2011 John Wiley & Sons, Ltd.

7.

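Design and Implementation of Open64 Software Pipelining on the X86 Platform

Liu Jiabing, Xu Yun 《计算机工程》 (Computer Engineering) 2013,(9)

Because the required hardware features are missing, the software pipelining in the Open64 compiler has no version that targets X86 processors. This paper therefore proposes an Open64 software pipelining framework suited to the X86 platform. The framework implements part of the processor's hardware behavior in software and uses a loop filtering method to exclude loops that are not suitable. To address the lack of a rotating register file, a register allocation algorithm is designed that uses general-purpose registers instead, and a modulo variable expansion module is added to guarantee correct execution. Experimental results show that, compared with a loop unrolling scheme, the framework yields an average performance improvement of 9%.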

8.

Performance Bounds of Algorithms for Scheduling Advertisements on a Web Page

Milind Dawande, Subodha Kumar, Chelliah Sriskandarajah 《Journal of Scheduling》 2003,6(4):373-394

Consider a set of $n$ advertisements (hereafter called ads) $A = \{A_1, \ldots, A_n\}$ competing to be placed in a planning horizon which is divided into $N$ time intervals called slots. An ad $A_i$ is specified by its size $s_i$ and frequency $w_i$. The size $s_i$ represents the amount of space the ad occupies in a slot. Ad $A_i$ is said to be scheduled if exactly $w_i$ copies of $A_i$ are placed in the slots subject to the restriction that a slot contains at most one copy of an ad. In this paper, we consider two problems. The MINSPACE problem minimizes the maximum fullness among all slots in a feasible schedule where the fullness of a slot is the sum of the sizes of ads assigned to the slot. For the MAXSPACE problem, in addition, we are given a common maximum fullness $S$ for all slots. The total size of the ads placed in a slot cannot exceed $S$. The objective is to find a feasible schedule of ads such that the total occupied slot space is maximized. We examine the complexity status of both problems and provide heuristics with performance guarantees.
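
The MINSPACE setting described above can be made concrete with a simple greedy rule: process ads from largest to smallest, and place the w_i copies of each ad into the w_i currently least-full slots. The sketch below is an illustrative heuristic written for this listing; the abstract does not spell out the paper's heuristics or their guarantees, so the function name `greedy_minspace` and the tie-breaking by slot index are assumptions.

```python
from typing import List, Tuple


def greedy_minspace(ads: List[Tuple[int, int]],
                    num_slots: int) -> Tuple[List[List[int]], int]:
    """Greedy schedule for MINSPACE.

    ads[i] = (size s_i, frequency w_i). Returns (slots, max_fullness),
    where slots[j] lists the indices of the ads placed in slot j.
    """
    if any(freq > num_slots for _, freq in ads):
        raise ValueError("an ad needs more copies than there are slots")

    fullness = [0] * num_slots
    slots: List[List[int]] = [[] for _ in range(num_slots)]

    # Largest ads first, so the small ones can even out the load later.
    order = sorted(range(len(ads)), key=lambda i: ads[i][0], reverse=True)
    for i in order:
        size, freq = ads[i]
        # Pick the freq least-full slots; they are distinct by construction,
        # so no slot ever receives two copies of the same ad.
        targets = sorted(range(num_slots), key=lambda j: fullness[j])[:freq]
        for j in targets:
            slots[j].append(i)
            fullness[j] += size
    return slots, max(fullness, default=0)
```

For example, `greedy_minspace([(3, 2), (2, 1), (1, 3)], 3)` places the three ads into three slots with a maximum fullness of 4. Each ad is always placed, so the schedule is feasible whenever every w_i is at most the number of slots; how close the result is to the optimum is not claimed here.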

9.

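Research on Multi-Cluster Software Pipelining for HXDSP Based on Multi-Task Deep Learning

Liu Chungang, Zhou Peng, Zheng Qilong 《计算机系统应用》 (Computer Systems & Applications) 2022,31(12):112-119

Deep learning models for compiler optimization currently tend to use single-task learning, which makes it hard to exploit the correlations among multiple tasks to improve the model's overall compilation speedup. To address this, a compiler optimization method based on multi-task deep learning is proposed. The method uses a graph neural network (GNN) to learn program features from the abstract syntax trees (ASTs) and control data flow graphs (CDFGs) of C programs, and then uses these features to predict simultaneously the initiation interval of HXDSP software pipelining and the loop unrolling factor. Experimental results on the DSPStone dataset show that the multi-task method achieves a 12% performance improvement over the single-task method.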

10.

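Real-Time Performance of Aperiodic Tasks under Fixed-Priority Preemptive Scheduling

Wang Qin, Yuan Lingling, Zhang Yan 《小型微型计算机系统》 (Journal of Chinese Computer Systems) 2011,32(6)

An embedded real-time system must meet not only its functional requirements but also its real-time performance requirements. For a given scheduling algorithm, real-time behavior depends on the arrival characteristics and the execution times of the tasks. The arrival characteristics are determined by the application environment. This paper therefore studies how task execution time affects real-time performance, to provide a reference for embedded system design. For the fixed-priority preemptive scheduling algorithm, queueing theory is applied to build a theoretical model of aperiodic real-time tasks. The model contains two aperiodic real-time tasks with different priorities and gives the effect of task execution time on real-time metrics such as the deadline miss ratio, task response time, and task queue length. An application example is given, and simulation results confirm the correctness of the theoretical model.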
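
As a rough illustration of how queueing theory relates execution time to response time under fixed-priority preemption, the sketch below evaluates the textbook mean response time of a preemptive-resume priority queue with Poisson arrivals and exponentially distributed execution times. This is an assumed stand-in, not the model from the paper: the abstract does not state its arrival or service distributions, and the deadline miss ratio it studies is not computed here. The function name `mean_response_times` is hypothetical.

```python
from typing import List, Sequence


def mean_response_times(arrival_rates: Sequence[float],
                        service_rates: Sequence[float]) -> List[float]:
    """Mean response time per priority class, highest priority first.

    Assumes Poisson arrivals (rate lam_k), exponential execution times
    (mean 1/mu_k), and preemptive-resume fixed priorities. With
    sigma_k = rho_1 + ... + rho_k and R_k = sum_{i<=k} lam_i / mu_i**2:

        T_k = ((1/mu_k) * (1 - sigma_k) + R_k)
              / ((1 - sigma_{k-1}) * (1 - sigma_k))
    """
    times: List[float] = []
    sigma_prev = 0.0   # utilization of strictly higher priorities
    residual = 0.0     # R_k: accumulated lam_i * E[S_i^2] / 2
    for lam, mu in zip(arrival_rates, service_rates):
        sigma = sigma_prev + lam / mu
        if sigma >= 1.0:
            raise ValueError("system is overloaded for this priority level")
        residual += lam / mu ** 2  # E[S^2] = 2 / mu^2 for exponential S
        t = ((1.0 / mu) * (1.0 - sigma) + residual) / (
            (1.0 - sigma_prev) * (1.0 - sigma))
        times.append(t)
        sigma_prev = sigma
    return times
```

With two tasks of mean execution time 1, arrival rate 0.2 for the high-priority task and 0.3 for the low-priority task, `mean_response_times([0.2, 0.3], [1.0, 1.0])` gives about 1.25 and 2.5. The high-priority task behaves as if the low-priority task did not exist, while the low-priority task absorbs the interference.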

11.

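Research on Software Optimization Methods for DSP Real-Time Image Processing  Total citations: 2, self-citations: 0, citations by others: 2

Lei Tao, Zhou Jin, Wu Qinzhang 《计算机工程》 (Computer Engineering) 2012,38(14):177-180

To improve the real-time performance of digital signal processor (DSP) software in high-speed image processing systems, optimization methods are proposed at two levels: algorithm and code. Algorithm-level optimization redesigns the implementation flow of the algorithm to make full use of processor resources and to map the algorithm efficiently onto the processor. Code-level optimization uses assembly language to optimize the code of a fixed algorithm so that the loop kernel forms an efficient software pipeline and meets the real-time performance requirements. Experimental results show that both optimization methods increase the processing speed of key modules in the DSP software.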

12.

A scalable HPF implementation of a finite‐volume computational electromagnetics application on a CRAY T3E parallel system

Yi Pan, Joseph J. S. Shang, Minyi Guo 《Concurrency and Computation》 2003,15(6):607-621

The time‐dependent Maxwell equations are one of the most important approaches to describing dynamic or wide‐band frequency electromagnetic phenomena. A sequential finite‐volume, characteristic‐based procedure for solving the time‐dependent, three‐dimensional Maxwell equations has been successfully implemented in Fortran before. Due to its need for a large memory space and high demand on CPU time, it is impossible to test the code for a large array. Hence, it is essential to implement the code on a parallel computing system. In this paper, we discuss an efficient and scalable parallelization of the sequential Fortran time‐dependent Maxwell equations solver using High Performance Fortran (HPF). The background to the project, the theory behind the efficiency being achieved, the parallelization methodologies employed and the experimental results obtained on the Cray T3E massively parallel computing system will be described in detail. Experimental runs show that the execution time is reduced drastically through parallel computing. The code is scalable up to 98 processors on the Cray T3E and has a performance similar to that of an MPI implementation. Based on the experimentation carried out in this research, we believe that a high‐level parallel programming language such as HPF is a fast, viable and economical approach to parallelizing many existing sequential codes which exhibit a lot of parallelism. Copyright © 2003 John Wiley & Sons, Ltd.