首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 189 毫秒
1.
李勇  胡慧俐  杨焕荣 《计算机应用》2014,34(4):1005-1009
数字信号处理软件中循环程序在执行时间上占有很大比例,用指令缓冲器暂存循环代码可以减少程序存储器的访问次数,提高处理器性能。在VLIW处理器指令流水线中增加一个支持循环指令的缓冲器,该缓冲器能够缓存循环程序指令,并以软件流水的形式向功能部件派发循环程序指令。这样循环程序代码只需访存一次而执行多次,大大减少了访存次数。在循环指令运行期间,缓冲器发出信号使程序存储器进入睡眠状态可以降低处理器功耗。典型的应用程序测试表明,使用了循环缓冲后,取指流水线空闲率可达90%以上,处理器整体性能提高10%左右,而循环缓冲的硬件面积开销大约占取指流水线的9%。  相似文献   

2.
软件流水中隐藏存储延迟的方法   总被引:5,自引:2,他引:3  
刘利  李文龙  陈彧  李胜梅  汤志忠 《软件学报》2005,16(10):1833-1841
软件流水是一种重要的指令调度技术,它通过同时执行来自不同循环体的指令来加快循环的执行速度.随着处理机运行速度的逐渐提高,存储访问延迟成为性能提高的瓶颈.为了减轻存储系统影响,软件流水结合了一些存储优化技术,通过隐藏存储延迟来提高性能.提出了一种延迟可预测的模调度算法(foresighted latencymodulo scheduling,简称FLMS),它根据循环的特点来确定load指令延迟.实验结果表明,FLMS算法减少了阻塞时间,提高了程序性能.  相似文献   

3.
设计了一种软件流水循环缓冲,用于存储和派发循环体指令,减少执行循环程序时的访存次数,从而减少访存延迟对性能的影响。在详细研究软件流水和循环展开的基础上,完成了软件流水循环缓冲的设计。所设计的循环缓冲可以存储112条32位指令,用循环专用指令来控制循环程序的执行。对设计进行了模拟验证,并用Design Complier对设计进行了综合。  相似文献   

4.
一、子程序(过程)设计 上一讲讲述了循环程序的设计方法。从循环程序中看到,反复利用一段程序进行计算或处理可以提高工作效率。但有些问题虽然有重复计算的工作,确很难用循环程序来实现。如某个程序中在不同的点上都需要执行一段特定的程序,不能用循环指令来实现,然而可以把这段程序抽出来,用于程序的方法来编程,当程序需要用该段程序时就调用该子程序。执行完该子程序后再返回到主程序。  相似文献   

5.
动态二进制翻译系统是根据程序的动态执行信息来将源机器上的可执行代码翻译成目标机器上的可执行代码.在翻译成中间表示的过程中会产生一些冗余的LOAD指令,为提高代码的执行效率,提出对这些LOAD指令进行冗余删除优化.该优化技术可以使优化效果超过其自身的开销,达到优化的目的.  相似文献   

6.
唐遇星  邓鹍  窦勇  周兴铭 《计算机学报》2007,30(11):1972-1981
分支指令与分支预测失败限制了处理器发掘指令级并行(ILP)的潜力.通过If-conversion或Predicated执行将程序中的控制相关转化为数据相关,能较好地降低分支预测开销.提出一种基于简化Trace结构的动态隐式断言执行机制(Dynamic Implicit Predication,DIP),而早期的相关研究主要集中于由编译器显式为宽发射处理器产生静态Predicated指令.无需编译器或者其他二进制工具的帮助,DIP可以在程序运行过程中识别可以进行断言变换的指令片断,完成指令转换与优化,并在以后的执行中使用优化后的指令Trace.基于SPEC2000模拟测试表明DIP可以有效避免错误的分支预测,提高并行度,单个程序的IPC平均提高10.3%,基准程序的平均加速比可达7.59%.  相似文献   

7.
循环倾斜是程序优化中一种循环变换的手段,它改变空间迭代形式,将循环存在的跨迭代的并行用传统的并行标识出来,使得循环可以并行执行。但是循环倾斜后,并行执行的数据在内存中是离散的,而且每次迭代执行的次数是不一致的。为了更有效地利用SIMD,本文提出一种基于全局数据重组的循环倾斜优化方法。首先分析循环倾斜优化,针对数据离散的问题实现全局数据重组,改善数据局部性,循环易于向量化操作;针对迭代执行次数不一致问题,实现非满载向量操作,使尾循环得以向量执行。最后选择wavefront程序进行测试,优化后,程序计算可以获得平均10.73倍的加速效果。  相似文献   

8.
谓词执行及其关键技术浅析   总被引:1,自引:0,他引:1  
在超标量和VLIW微处理器的设计中,指令间的相关,尤其是控制相关和数据相关,严重限制了指令级并行(ILP)的开发,从而限制了微处理器性能的进一步提高。条件执行技术(Guarded Execution)能够将控制相关转化为数据相关来处理。具体来说,它能够将控制相关于一条分支指令的其他指令转换为数据相关于该分支条件的条件指令。条件指令与常规指令的不同之处在于它含有显式的条件指示符,其语义为:首先计算指令执行条件,如果条件为真,则执行该指令中的操作,否则将其作为空操作处理。条件执行技术实质上是一种程序变换技术,变换后的程序无论对编译优化还是对硬件调度都有很大好处,但需要专门硬件机制的支持。目前只有ARM指令集与IA-64指令集支持条件执行。  相似文献   

9.
面向BWDSP的体系结构分析了字符串与内存处理函数汇编优化方法,基于向量化与软件流水的优化技术,通过利用高效访存指令、能够提升循环执行效率的零开销循环机制、指令重排技术,结合具体功能函数的循环特性,展开针对字符串与内存处理函数的指令级并行性挖掘.实验结果表明,这些库函数的优化效率能够达到硬件平台提供函数性能理论运行时间的1.5倍以下,对BWDSP平台整体性能提升具有重要意义.  相似文献   

10.
软件流水是一种重要的指令调度技术,它通过同时执行来自不同循环迭代的指令来加快循环的执行时间.随着处理器速度和访存速度差距越拉越大,访存指令尤其是cache miss的访存指令日益成为系统性能提高的瓶颈.由于这些指令的延迟不是固定的,如何在软件流水中预测并掩盖这些访存指令的延迟是非常重要的.与前人预测访存延迟的方法不同,引入cache profiling技术,通过动态收集到profile信息来预测访存延迟,并进行适当的调度.当增加模调度循环中的访存指令的延迟时,启动间隔也会随之增大,导致性能不会随之上升.CSMS算法和FLMS算法在尽量不增大启动间隔的情况下,改变访存指令的延迟.改进了CSMS算法和FLMS算法,根据cache profiling的信息来改变访存延迟,所以比前人的方法更为准确.实验表明,新方法可以有效地提高程序性能,对SPEC2000测试程序平均性能提高1%左右,个别例子的性能改进高达11%.  相似文献   

11.
Software pipelining is an efficient method of loop optimization that allows for parallelism of operations related to different loop iterations. Currently, most commercial compilers use loop pipelining methods based on modulo scheduling algorithms. This paper reviews these algorithms and considers algorithmic solutions designed for overcoming the negative effects of software pipelining of loops (a significant growth in code size and increase in the register pressure) as well as methods making it possible to use some hardware features of a target architecture. The paper considers global-scheduling mechanisms allowing one to pipeline loops containing a few basic blocks and loops in which the number of iterations is not known before the execution.  相似文献   

12.
Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: they only handle nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimizes the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape.  相似文献   

13.
Compiler-directed locality optimization techniques are effective in reducing the number of cycles spent in off-chip memory accesses. Recently, methods have been developed that transform memory layouts of data structures at compile-time to improve spatial locality of nested loops beyond current control-centric (loop nest-based) optimizations. Most of these data-centric transformations use a single static (program-wide) memory layout for each array. A disadvantage of these static layout-based locality enhancement strategies is that they might fail to optimize codes that manipulate arrays, which demand different layouts in different parts of the code. We introduce a new approach, which extends current static layout optimization techniques by associating different memory layouts with the same array in different parts of the code. We call this strategy "quasidynamic layout optimization." In this strategy, the compiler determines memory layouts (for different parts of the code) at compile time, but layout conversions occur at runtime. We show that the possibility of dynamically changing memory layouts during the course of execution adds a new dimension to the data locality optimization problem. Our strategy employs a static layout optimizer module as a building block and, by repeatedly invoking it for different parts of the code, it checks whether runtime layout modifications bring additional benefits beyond static optimization. Our experiments indicate significant improvements in execution time over static layout-based locality enhancing techniques.  相似文献   

14.
刘晓娴  赵荣彩  赵捷  徐金龙 《软件学报》2014,25(6):1154-1168
发掘DOACROSS 循环中蕴含的并行性,选择合适的策略将其并行执行,对提升程序的并行性能非常重要.流水并行方式是规则DOACROSS 循环并行的重要方式.自动生成性能良好的流水并行代码是一项困难的工作,并行编译器对程序自动并行时常常对DOACROSS 循环作保守处理,损失了DOACROSS 循环包含的并行性,限制了程序的并行性能.针对上述问题,设计了一种选择计算划分循环层和循环分块层的启发式算法,给出了一个基于流水并行代价模型的循环分块大小计算公式,并使用计数信号量进行并行线程之间的同步,实现了基于OpenMP 的规则DOACROSS 循环流水并行代码的自动生成.通过对有限差分松弛法(finite difference relaxation,简称FDR)的波前(wavefront)循环和时域有限差分法(finite difference time domain,简称FDTD)中典型循环以及程序Poisson,LU 和Jacobi 的测试,算法自动生成的流水并行代码能够在多核处理器上获得明显的性能提升,使用的流水分块大小计算公式能够较为精确地计算出循环流水并行时的最佳分块大小.自动生成的流水并行代码与基于手工选择的最优分块大小的流水并行代码相比,加速比达到手工选择加速比的89%.  相似文献   

15.
Optimizing a computer program is defined as improving the execution time without disturbing the correctness. We show how to use invariants from a proof of correctness in order to change the statement in and around the program's loops. This approach is shown to systematize existing optimization methods, and to sometimes allow stronger optimizations than are possible under the standard transformation approach.  相似文献   

16.
In this work we consider a special optimization problem involved with compiling compound loops (combining nested and consecutive sub-loops) to Verilog. Each sub-loop of the compound loop may require a different optimized hardware configuration (OHC) for optimized execution times. For example, one loop requires at least two memory ports and one multiplier for an optimized execution time, while another loop may require only one memory port but two multipliers, yet one OHC should be selected for both loops. The goal is to compute a minimal OHC which, based on the different heat levels (expected number of iterations) of the sub-loops, is a good compromise between all the conflicting requirements of each sub-loop. Though synthesis of nested loops has been implemented in quite a few systems this aspect has not been considered so far. We avoid the use of time consuming integer linear programming (ILP) techniques and instead use a fast space exploration technique combined with an efficient variant of list scheduling.Another novel aspect of the proposed system is the observation that the real latencies of the hardware units should be considered as variables of the OHC rather than fixed real values as is usually done in high-level synthesis systems. Experimental results show a significant improvement in the OHC without a significant increase in the execution time due to the use of this search procedure.1  相似文献   

17.
Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming can not always achieve full parallelism. Other existing optimization techniques for nested loops also can not always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full-parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms  相似文献   

18.
Software pipelining increases the loop execution throughput by overlapping the execution of successive iterations in a pipelined fashion. For loops with control flows, however, software pipelining is not straightforward because we need to consider the overlap of more than one execution path. Modulo scheduling simply transforms them into straightline loops through if-conversion which, in effect, achieves a fixed, worst-case initiation interval (/spl par/) among all paths. On the other hand, all-path pipelining (APP) and enhanced pipeline scheduling (EPS) can achieve a variable /spl par/ depending on the path that is taken at execution time. Unfortunately, APP concentrates only on the overlap within the same path, entirely losing the overlap between different paths, whereas EPS attempts to overlap all paths together, failing to produce a tight schedule for each individual path, especially when resource constraints are tight. In this paper, we propose a new approach to EPS called split-path EPS (SP-EPS), which first splits each individual path via tail duplication and then performs EPS in a way to guarantee a tight schedule for each path, while producing a competitive cross-path schedule. We also extend SP-EPS to outer loops such that frequent paths that bypass the inner loop are split and then scheduled by SP-EPS. Our experimental results on nontrivial integer benchmarks show that SP-EPS can achieve as much as a geometric mean of 10 percent speedup over EPS when innermost loops are scheduled by SP-EPS, while it can achieve a geometric mean of 11.9 percent speedup when outer loops are also scheduled by SP-EPS.  相似文献   

19.
分析了目前主流采样模拟技术中定长样本的不足,提出了一种基于编译器元数据的采样模拟技术(BigLoopSP).首先利用编译器收集各种可能的周期行为的边界信息作为元数据.然后为了处理程序中大量存在的动态行为,基于编译器产生的元数据结合程序的动态行为进行周期行为的划分和采样点的选取.以此方案划分的变长候选样本能够在保证样本质量的前提下有效地减少所需特征样本的总数.因此比较于定长采样技术SimPoint,BigLoopSP在提高精确性的同时,进一步降低了模拟所需的时间(相对于SimPoint的平均加速比为2.63).  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号