期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

王翠霞韩林刘浩浩《计算机工程与科学》2022,44(12):2111-2119

循环展开是一种常用的编译优化技术,能够有效减少循环开销,提升指令级并行程度和数据局部性,提升循环的执行效能。然而,过度的循环展开会造成指令Cache溢出,增大寄存器压力,循环展开次数太少又会浪费潜在的性能提升机会,因此寻找恰当的展开因子是研究循环展开问题的核心。基于GCC开源编译器,面向循环展开问题开展深入的分析与研究,针对指令Cache和寄存器资源对循环展开的影响,提出了一种基于指令Cache和寄存器压力的循环展开因子计算方法,并在GCC编译器中实现了该计算方法。申威和海光平台上的实验结果显示,相较于目前GCC中存在的其它展开因子计算方法,所提出的方法可以获得更为有效的循环展开因子,提升了程序性能。在SPEC CPU 2006测试集上的平均性能分别提升了2.7%和3.1%,在NPB-3.3.1测试集上的分别为5.4%和6.1%。相似文献

2.

循环展开技术在向量程序中的应用

高伟赵荣彩于海宁张庆花《计算机科学》2016,43(1):226-231, 245

循环展开是一项常用的循环优化技术。当前针对串行程序的循环展开技术已经比较成熟,但是在实际应用中没有针对向量程序进行有效的循环展开。为了解决这个问题,提出了一种面向向量程序的循环展开技术。首先,针对向量寄存器压力和代码膨胀等限制因素,提出了一种自动计算展开因子的CUFVL算法;其次,根据向量循环展开的特点,制定了完全展开策略;最后结合CUFVL算法和完全展开策略,设计了向量循环展开的总体流程。实验结果表明,该方案能够计算出合适的展开因子,进而对向量程序进行适当的循环展开或完全展开,从而有效提升应用程序的性能。相似文献

3.

基于随机决策森林的循环展开方法

王冬赵荣彩高伟李雁冰《计算机工程与设计》2018,(1):199-204

为提高编译器循环展开因子计算的准确性,提出一种基于改进的随机森林模型预测循环展开因子的方法。对传统随机森林模型进行加权的改进,为解决非平衡数据集问题提出基于SMOTE算法的BSC算法。从SPEC2006等测试集中提取近1000个循环并提取特征构成训练集,训练循环展开因子预测模型。生成的模型对于展开因子的预测准确度达81%,与编译器默认的循环展开方法相比,利用预测模型对选定的测试程序循环展开后性能平均提升12%。相似文献

4.

软件流水循环缓冲的设计与实现

陈纪孝李勇《计算机科学》2013,40(4):35-37

设计了一种软件流水循环缓冲,用于存储和派发循环体指令,减少执行循环程序时的访存次数,从而减少访存延迟对性能的影响。在详细研究软件流水和循环展开的基础上,完成了软件流水循环缓冲的设计。所设计的循环缓冲可以存储112条32位指令,用循环专用指令来控制循环程序的执行。对设计进行了模拟验证,并用Design Complier对设计进行了综合。相似文献

5.

利用m4定语言进行Fortran 77循环展开

张林波《数值计算与计算机应用》1998,(1)

51．引言循环展开（do－loopunrolling）做为Fortran程序的一种简单、有效的优化手段早已为人们所知，被广泛应用于Fortran编译系统中．对于向量机而言，通过对简单的循环适当进行展开可增加每次循环中的计算量，有助于充分发挥向量机上多流水线的并发处理能力，往往可以成倍地提高计算程序的浮点性能．而在当前占主导地位的基于PSC微处理芯片的计算机系统（包括微机、工作站、共享内存并行机以及MPP型计算机等），处理器内部普遍采用超标量多流水线结构，可并发执行多条浮点指令，能够达到每秒数亿次的浮点运算峰值性能．对内存的访问… 相似文献

6.

基于扩展逻辑变换系统_μTS证明循环优化正确性

王昌晶《计算机研究与发展》2012,49(9):1863-1873

循环优化对于提高Cache性能、发掘程序的并行性以及减少执行循环的开销都有着重要的作用,证明带循环优化功能的现代编译器的正确性已成为可信编译的一个挑战性的问题.形式化证明一个羽翼丰满的优化编译器本质上是不可行的,可以使用替代的方法,即不是证明优化编译器本身,而是形式化证明每一次循环变换前后编译对象的正确性.提出一种新颖的基于扩展逻辑变换系统μTS来证明循环优化正确性的方法.系统μTS在逻辑变换系统TS的基础上扩展了若干条派生规则,经谓词抽象将源程序与目标程序转换为形式化Radl语言后,使用μTS的派生规则能证明常见循环变换的正确性,如循环融合、循环分配、循环交换、循环反转、循环分裂、循环脱皮、循环调整、循环展开、循环铺盖、循环判断外提、循环不变代码外提等.循环优化可以看作一系列循环变换的组合,从而系统μTS能证明循环优化的正确性.为了支持自动化证明循环优化的正确性并出示证据,进一步提出了一个辅助证明算法.最后通过一个典型实例对这一方法进行了详细的阐述,实际效果表明了该方法的有效性.该方法对设计高可信优化编译器具有重要的指导意义. 相似文献

7.

基于高层次综合的AES算法研究与设计

张望贾佳孟渊白旭《计算机应用》2017,37(5):1341-1346

由于对广泛使用的AES算法的性能要求越来越高,基于软件的密码算法已经越来越难以满足高吞吐量密码破解的需求,因此越来越多的算法利用现场可编程逻辑门阵列（FPGA）平台进行加速。针对AES算法在FPGA硬件上存在的开发复杂度高且开发周期长等问题,采用高层次综合（HLS）设计方法,使用高级程序语言描述并设计AES硬件加速算法。首先利用循环展开等提高运算并行度;其次使用资源平衡技术进行优化,充分利用片上存储和电路资源;最后添加全流水结构,提高整体设计的时钟频率和吞吐量,同时也详细对比分析基准设计、利用结构展开、资源均衡以及流水线优化方法的设计。经过实验表明,在Xilinx xc7z020clg484 FPGA芯片上,最终AES算法的时钟频率最高达到127.06 MHz,而吞吐量达到了16.26 Gb/s,较之基准的AES设计,性能提升了三个数量级。相似文献

8.

向量并行度指导的循环SIMD向量化方法

高伟韩林赵荣彩徐金龙陈超然《软件学报》2017,28(4):925-939

SIMD扩展部件是集成到通用处理器中的加速部件,旨在发掘多媒体和科学计算等领域程序的数据级并行.当前两种基本的向量发掘方法分别是发掘迭代间并行的Loop-based方法和发掘迭代内并行的SLP方法.Loop-aware方法是对SLP方法的改进,其思想是首先通过循环展开将迭代间并行转换为迭代内并行,使循环体内的同构语句条数足够多,再利用SLP方法进行向量发掘.但当循环展开不合法或者并行度低于向量化因子时,Loop-aware方法无法实现程序向量并行性的发掘.因此提出了向量并行度指导的循环向量化方法,依据迭代间并行度、迭代内并行度和向量化因子,构建循环向量化方法选择方案,同时提出不充分向量化方法发掘并行度低于向量化因子的循环向量并行性,最后依据向量并行度对生成的向量循环进行展开.经过标准测试集测试,向量并行度指导的循环SIMD向量化方法比Loop-aware方法识别率提升107.5%,性能提升12.1%. 相似文献

9.

分簇结构谓词以及编译优化框架研究

王向前洪一《计算机应用研究》2018,35(1)

谓词执行是在控制流存在的条件下可以有效挖掘指令级并行性的硬件机制。而在分簇结构上实现谓词机制,可以提高分簇结构上条件的执行效率。本文针对分簇结构展开谓词体系体系结构的研究,提出了分簇结构部分谓词的高效实现方法,以及基于循环展开的分簇结构部分谓词支持框架。实验表明,本文提出的分簇结构部分谓词及编译框架可以很好地提高条件执行程序的执行效率。相似文献

10.

滑动窗口应用循环展开及其数据通路生成

董亚卓刘明政夏飞窦勇《计算机学报》2008,31(6):989-997

滑动窗口广泛应用于图像处理、模式识别和数字信号处理中,它具有数据量大、计算密集等特点.可重构硬件为滑动窗口应用提供了一个灵活高效的实现平台.文中基于一种存储、数据调度模型及其相应的数据通路生成技术,研究循环展开对滑动窗口应用的面积、时钟频率和吞吐率的影响.实验结果表明内层循环展开相对于外层循环展开将带来更大的控制复杂度,增加了对芯片面积的需求,然而外层循环展开需要更多的存储资源保存重用数据;当片内存储模块个数增加到一定规模时,时钟频率将随着循环展开不断降低;不同维度的应用,吞吐率随循环展开提升程度不同. 相似文献

11.

Optimized Unrolling of Nested Loops

Vivek Sarkar 《International journal of parallel programming》2001,29(5):545-581

Loop unrolling is a well known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604). 相似文献

12.

Map Reduce inspired loop mapping for coarse-grained reconfigurable architecture

YIN ShouYi SHAO ShengJia LIU LeiBo WEI ShaoJun 《中国科学:信息科学(英文版)》2014,(12):180-193

Our work investigates how to map loops efficiently onto Coarse-Grained Reconfigurable Architecture （CGRA）. This paper examines the properties of CGRA and builds MapReduce inspired models for the loop parallelization problem. The proposed model has a more detailed performance metric and a more flexible unrolling scheme that can unroll different loop levels with different factors. A Geometric Programming based approach is proposed to resolve the optimization problem of loop parallelization problem. The proposed approach can find the optimal unrolling factor for each level loop, resulting in better parallelization of loops. Experimental results show that the proposed approach achieved up to 44% performance gain compared to the state-of-the-art loop mapping scheme. 相似文献

13.

符号执行中的循环依赖分析方法

下载免费PDF全文

刘杰曹琰魏强彭建山《计算机工程》2012,38(22):24-27

符号执行方法处理循环时存在路径爆炸的问题。为此,提出一种基于归纳变量的循环依赖分析方法。通过识别循环归纳变量及符号表达式,结合边界约束条件生成可达归纳变量分支的路径约束,并采用符号化映射方法分析嵌套循环归纳变量依赖问题,从而在不展开循环的情况下生成覆盖归纳变量分支的测试用例。对开源工具Libxml2进行实验,该方法能发现其中2个while循环所引发的数组访问越界错误。相似文献

14.

Combining Loop Transformations Considering Caches and Scheduling

Michael E. Wolf Dror E. Maydan Ding-Kai Chen 《International journal of parallel programming》1998,26(4):479-503

The performance of modern microprocessors is greatly affected by cache behavior, instruction scheduling, register allocation and loop overhead. High-level loop transformations such as fission, fusion, tiling, interchanging and outer loop unrolling (e.g., unroll and jam) are well known to be capable of improving all these aspects of performance. Difficulties arise because these machine characteristics and these optimizations are highly interdependent. Interchanging two loops might, for example, improve cache behavior but make it impossible to allocate registers in the inner loop. Similarly, unrolling or interchanging a loop might individually hurt performance but doing both simultaneously might help performance. Little work has been published on how to combine these transformations into an efficient and effective compiler algorithm. In this paper, we present a model that estimates total machine cycle time taking into account cache misses, software pipelining, register pressure and loop overhead. We then develop an algorithm to intelligently search through the various, possible transformations, using our machine model to select the set of transformations leading to the best overall performance. We have implemented this algorithm as part of the MIPSPro commercial compiler system. We give experimental results showing that our approach is both effective and efficient in optimizing numerical programs. 相似文献

15.

利用循环分割和循环展开避免Cache代价 总被引：1，自引：0，他引：1

刘利陈彧乔林汤志忠《软件学报》2008,19(9):2228-2242

存储系统与处理器之间的速度差距逐渐变大,为此,cache使用了分级机制,但这也带来了额外的存储延迟(cache代价).提出一种利用循环分割和循环展开相结合避免cache代价的PCPLPU(prevent cache penalty by loop partition-unrolling)算法.实验结果表明,PCPLPU算法能够有效避免循环代价,提高程序性能. 相似文献

16.

Compiler transformations for effectively exploiting a zero overhead loop buffer

Gang‐Ryung Uh Yuhong Wang David Whalley Sanjay Jinturkar Yunheung Paek Vincent Cao Chris Burns 《Software》2005,35(4):393-412

A Zero Overhead Loop Buffer (ZOLB) is an architectural feature that is commonly found in DSP processors. This buffer can be viewed as a compiler managed cache that contains a sequence of instructions that will be executed a specified number of times without incurring any loop overhead. Unlike loop unrolling, a loop buffer can be used to minimize loop overhead without the penalty of increasing code size. In addition, a ZOLB requires relatively little space and power, which are both important considerations for most DSP applications. This paper describes strategies for generating code to effectively use a ZOLB. We have found that many common code improving transformations used by optimizing compilers on conventional architectures can be easily used to (1) allow more loops to be placed in a ZOLB, (2) further reduce loop overhead of the loops placed in a ZOLB, and (3) avoid redundant loading of ZOLB loops. The results given in this paper demonstrate that this architectural feature can often be exploited with substantial improvements in execution time and slight reductions in code size for various signal processing applications. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献