
1.

Su Bogong, Ding Shiyuan. Chinese Journal of Computers, 1989, 12(9): 663-673

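Software pipelining is an effective method for optimizing loops in programs and microprograms. The LURPR algorithm, which software-pipelines loop bodies made up of basic blocks, has already achieved satisfactory results. Building on LURPR, this paper extends software pipelining to loop bodies of arbitrary structure and presents the corresponding GURPR algorithm. GURPR can software-pipeline any loop body that contains irregular entries, conditional exits, branches, nested loops and subroutine calls.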

2.

A VLIW Architecture Supporting Optimal Execution of Multi-Branch Loops

Tang Zhizhong, Zhang Chihong. Journal of Computer Research and Development, 1995, 32(8): 1-9

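This paper first proposes a VLIW architecture model that supports optimal execution of multi-branch loop programs. Based on this model, it then designs URPR-2, a new single-chip architecture intended mainly for digital signal processing and image processing. In this architecture, several branch operations belonging to different paths and different loop bodies can execute in the same cycle, so instruction-level parallelism can be exploited over a wider scope. A mechanism called the pipeline control blackboard is also proposed to support conditional branch operations. URPR-2 can not only execute, at very high speed, loops containing only basic …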

3.

URPR: A Method for Implementing Software Pipelining (total citations: 2; self-citations: 0; citations by others: 2)

Su Bogong, Ding Shiyuan. Chinese Journal of Computers, 1988, (5)

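Software pipelining is an effective method for optimizing loop programs on AP array processors. This paper presents the URPR algorithm, which was developed from the URCR microcode loop compaction algorithm. URPR works in three steps:
1. It unrolls the loop body. The number of copies depends on the degree of data dependence between iterations.
2. It places the unrolled copies one by one.
3. It rerolls the result into a new, optimized loop body.

Preliminary experiments show that URPR outperforms several existing methods.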
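The kind of loop that software pipelining produces can be illustrated at source level. This is not URPR itself, whose unroll, place and reroll steps operate on microcode.

The sketch below overlaps the load, compute and store stages of consecutive iterations. The result has the familiar shape:
- a prologue that fills the pipeline,
- a kernel that executes one stage from each of three iterations,
- an epilogue that drains the pipeline.

The three-stage split and the function names are assumptions made for illustration.

```python
def original_loop(a):
    """Reference loop: each iteration loads, computes, then stores."""
    out = [0] * len(a)
    for i in range(len(a)):
        x = a[i]          # stage 1: load
        y = x * 2 + 1     # stage 2: compute
        out[i] = y        # stage 3: store
    return out


def pipelined_loop(a):
    """Same computation with stages of iterations i, i-1, i-2 overlapped.

    On a VLIW machine that reads all operands before writing results, the
    three kernel operations are independent and could issue in one cycle.
    """
    n = len(a)
    if n < 2:
        return original_loop(a)  # too short to fill the pipeline
    out = [0] * n
    # Prologue: start iterations 0 and 1.
    y = a[0] * 2 + 1      # compute, iteration 0
    x = a[1]              # load, iteration 1
    # Kernel: steady state; on entry y = f(a[i-2]) and x = a[i-1].
    for i in range(2, n):
        out[i - 2] = y    # store, iteration i-2
        y = x * 2 + 1     # compute, iteration i-1
        x = a[i]          # load, iteration i
    # Epilogue: finish the last two iterations.
    out[n - 2] = y
    out[n - 1] = x * 2 + 1
    return out
```

The pipelined version returns the same list as the original for any input length. Only the order in which the stages are issued changes.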

4.

A Cost Model and Decision Framework for Software Pipelining (total citations: 1; self-citations: 0; citations by others: 1)


Li Wenlong, Lin Haibo, Tang Zhizhong. Journal of Software, 2004, 15(7): 1005-1011

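Software pipelining is an important instruction scheduling technique. It improves instruction-level parallelism (ILP) by overlapping the execution of different loop iterations, and modulo scheduling is a widely used class of software pipelining algorithms.

Software pipelining is not a cost-free optimization. For example, it lengthens compilation time and increases register pressure. Because of limits imposed by the architecture, the scheduling algorithm and the program itself, it does not always reach the ideal speedup, and it sometimes even degrades performance.

This paper proposes a cost model for software pipelining that is driven by program characteristics, quantitatively analyzes the overhead of software pipelining under this model, and proposes a dependence-analysis-based …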
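Modulo schedulers usually start from a lower bound on the initiation interval (II):

MII = max(ResMII, RecMII)

- ResMII comes from resource usage.
- RecMII comes from dependence cycles.

This bound is standard in the modulo scheduling literature; it is not the cost model proposed in this paper. The sketch below computes it. It assumes two input formats:
- a dict of per-resource operation counts and unit counts,
- a list of dependence cycles, each given as a (total latency, total iteration distance) pair.

```python
import math


def res_mii(op_counts, units):
    """Resource bound: ceil(operations needing a resource / units of it)."""
    return max(math.ceil(op_counts[r] / units[r]) for r in op_counts)


def rec_mii(cycles):
    """Recurrence bound: a dependence cycle with total latency d and total
    iteration distance p forces II >= ceil(d / p). No cycles means no bound."""
    return max((math.ceil(d / p) for d, p in cycles), default=1)


def mii(op_counts, units, cycles):
    """Lower bound on the initiation interval of a modulo schedule."""
    return max(res_mii(op_counts, units), rec_mii(cycles))
```

Example: 5 ALU operations on 2 ALUs and 3 memory operations on 1 port give ResMII = 3. Cycles of (4, 1) and (7, 2) give RecMII = 4, so MII = 4. Here the dependence cycles, not the resources, limit how tightly iterations can overlap.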

5.

Design and Implementation of the SMS Software Pipelining Scheduling Algorithm

Ye Cheng, Zhu Yi'an, Wang Yunlan. Computer Engineering & Science, 2008, 30(9): 62-65

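Loops are the hot code of a program, so optimizing them effectively can significantly shorten execution time. Software pipelining is a fine-grained loop optimization that exploits instruction-level parallelism in loop bodies. It schedules instructions from consecutive iterations so that they execute in parallel, which improves loop execution efficiency. Experimental data from tests with the Cerngoop package show a clear loop optimization effect.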

6.

An Improved Software Pipelining Framework Addressing Dependence Cycles

Zhang Rengao, Zheng Qilong, Wang Xiangqian, Han Dongke. Computer Engineering and Applications, 2017, 53(17): 65-69

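Software pipelining is a loop scheduling technique used in compiler back-end optimization, and dependence cycles are an important factor limiting it. This paper targets software pipelining failures caused by dependence cycles in loop bodies. It analyzes and processes these cycles and, building on the traditional modulo scheduling framework, proposes an improved software pipelining algorithm. The algorithm introduces multiple components for each register that forms a dependence cycle, which makes it possible to pipeline loops containing reduction variables. Tests on typical algorithms show two results:
- more kinds of loops are pipelined successfully;
- the performance of loop kernels improves by at least 58%.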
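Giving a reduction register several components can be illustrated at source level. A single accumulator creates a loop-carried dependence cycle of distance 1. With k independent partial sums, iteration i depends only on iteration i - k, which relaxes the recurrence bound on II by roughly a factor of k.

This sketch shows the general technique, often called accumulator expansion. It is not the paper's compiler transformation, and the parameter k is illustrative. Integers are used because reassociating a floating-point sum can change rounding.

```python
def sum_single(xs):
    """One accumulator: every addition depends on the previous one."""
    acc = 0
    for x in xs:
        acc += x
    return acc


def sum_split(xs, k=4):
    """Reduction with k partial accumulators (components).

    Iteration i updates parts[i % k], so it depends only on iteration i - k
    and up to k additions can be in flight in a pipelined schedule.
    """
    parts = [0] * k
    for i, x in enumerate(xs):
        parts[i % k] += x
    # Combine the components once, after the loop.
    total = 0
    for p in parts:
        total += p
    return total
```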

7.

A Pipeline Granularity Optimization Algorithm Based on Loop Tiling

Liu Xiaoxian, Zhao Rongcai, Ding Rui, Li Yanbing. Journal of Computer Applications, 2013, 33(8): 2171-2176

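Traditional loop-tiling-based methods for optimizing pipeline granularity break down in a specific situation: the partitioned loop level has many iterations, or each iteration does a lot of work, while only a few parallel threads are available. To address this, the paper proposes a method that reduces pipeline granularity through loop tiling. It derives the optimal pipeline granularity from a cost model of pipelined parallel loops and implements the result as an optimization algorithm for pipeline computation granularity. The method was tested on the wavefront loop of the finite difference relaxation (FDR) method and on typical loops from the finite-difference time-domain (FDTD) method. Compared with traditional ways of choosing pipeline granularity, the proposed algorithm finds better loop tile sizes.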
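A simplified model of pipelined (wavefront) parallel loops shows why tile size matters. It is used here only for illustration and is not the cost model from the paper. The model assumes:
- P threads,
- an inner dimension of N iterations, cut into tiles of b iterations,
- work w per inner iteration,
- synchronization cost s per tile.

The pipeline then needs N/b + P - 1 steps, each costing about b*w + s. Small tiles pay more synchronization, while large tiles lengthen pipeline fill and drain. The sketch finds the best tile size by exhaustive search over the divisors of N. All parameter names are hypothetical.

```python
def pipeline_time(n, p, b, w, s):
    """Estimated time of a p-stage pipelined loop with inner tiles of size b.

    Hypothetical model: (n / b + p - 1) pipeline steps, each b * w + s long.
    n is assumed to be divisible by b.
    """
    tiles = n // b
    return (tiles + p - 1) * (b * w + s)


def best_tile(n, p, w, s):
    """Pick the divisor b of n that minimizes the modeled time."""
    candidates = [b for b in range(1, n + 1) if n % b == 0]
    return min(candidates, key=lambda b: pipeline_time(n, p, b, w, s))
```

With n = 1024, p = 4, w = 1 and s = 50, the best tile is 128, at a modeled time of 11 * 178 = 1958. Both 64 and 256 come out slower.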

8.

A Method for Hiding Memory Latency in Software Pipelining (total citations: 3; self-citations: 2; citations by others: 3)

Liu Li, Li Wenlong, Chen Yu, Li Shengmei, Tang Zhizhong. Journal of Software, 2005, 16(10): 1833-1841

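Software pipelining is an important instruction scheduling technique that speeds up loops by executing instructions from different iterations at the same time. As processor speeds keep rising, memory access latency has become the performance bottleneck. To reduce the impact of the memory system, software pipelining is combined with memory optimizations that improve performance by hiding memory latency.

This paper proposes foresighted latency modulo scheduling (FLMS), a modulo scheduling algorithm with predictable latencies. FLMS determines the latency of each load instruction from the characteristics of the loop. Experimental results show that FLMS reduces stall time and improves program performance.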

9.

MAP: A New Software Pipelining Technique

Liu Xiaolin, Tang Zhizhong. Computer Engineering and Applications, 1992, (8): 35-38

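This paper proposes MAP, an entirely new software pipelining algorithm. MAP uses an intuitive pipelined placement rule that accommodates all kinds of hardware-pipelined operations, and several heuristics that reduce the dependence distance between iterations. With very low algorithmic complexity, it achieves high speedup and resource utilization on VLIW machines of any configuration.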

10.

Register Requirements of Software Pipelining on IA-64 (total citations: 1; self-citations: 0; citations by others: 1)

Lin Haibo, Li Wenlong, Tang Zhizhong. Journal of Computer Research and Development, 2004, 41(1): 22-27

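Software pipelining is one of the main methods for exploiting instruction-level parallelism in loop programs, and IA-64 is an EPIC architecture that supports it. The paper quantitatively analyzes the registers needed by the software-pipelinable loops in the NAS Benchmarks. From this analysis it proposes a heuristic that limits the loop unrolling factor. The heuristic prevents software pipelining from failing for lack of available registers and speeds up application execution.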
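A standard way to estimate the register requirement of a modulo-scheduled loop is MaxLive. It is the largest number of values live at the same time in any cycle of the steady-state kernel, counting the overlapping iterations. When MaxLive exceeds the registers available, the schedule cannot be allocated without spilling.

The sketch below computes MaxLive. It assumes value lifetimes are given as (definition cycle, end cycle) pairs from one iteration's schedule, together with the initiation interval `ii`. It is a generic illustration, not the heuristic proposed in the paper.

```python
def max_live(lifetimes, ii):
    """MaxLive of a modulo schedule.

    lifetimes: list of (start, end) cycles; a value is live in cycles
               start <= t < end of one iteration's schedule.
    ii: initiation interval. A new iteration starts every ii cycles, so
        cycle t of any iteration falls in kernel slot t % ii.
    """
    live = [0] * ii
    for start, end in lifetimes:
        for t in range(start, end):
            live[t % ii] += 1   # one more copy live in this kernel slot
    return max(live)
```

With `ii = 2` and lifetimes (0, 3), (1, 2) and (2, 6), both kernel slots hold 4 live values, so at least 4 registers are needed. Halving the II would roughly double that demand, which is the trade-off a register-aware heuristic has to balance.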

11.

Power-aware register assignment for large register file design

Wann-Yun Shieh, Bo-Syun Wang. The Journal of Supercomputing, 2012, 61(3): 719-742

High-speed microprocessor design is moving toward ever-wider issue architectures to increase instruction-level parallelism. Such architectures need a large register file to reduce register pressure, but a large register file consumes much more power during program execution. In this paper, we first analyze the register requirements of general programs, especially in the parts that account for most of the execution time. Next, we derive a power-aware register assignment algorithm that distributes temporary values with different access frequencies over different register groups. Finally, we design a dynamic voltage scaling (DVS) circuit to reduce the power consumed by infrequently accessed registers. Experimental results show that partitioning where temporary values are stored in a register file does affect the utilization of each register. With a DVS approach, a large register file can therefore save a significant fraction of its power consumption.
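Distributing temporaries by access frequency can be sketched as a greedy partition:
- the most frequently accessed values go to a register group kept at nominal voltage;
- the rest go to a group that a DVS circuit could run at a lower voltage.

The two-group split, the group size and the function name are assumptions for illustration; this is not the algorithm derived in the paper. The sketch also ignores interference between live ranges, which a real register assigner must respect.

```python
def assign_by_frequency(access_counts, hot_group_size):
    """Split temporaries into a 'hot' and a 'cold' register group.

    access_counts: dict mapping a temporary's name to its access count.
    hot_group_size: number of registers in the nominal-voltage group.
    Returns (hot, cold) lists of temporary names, most-accessed first.
    """
    ranked = sorted(access_counts, key=lambda v: access_counts[v], reverse=True)
    return ranked[:hot_group_size], ranked[hot_group_size:]
```

For counts {"t0": 120, "t1": 3, "t2": 57, "t3": 9} and a hot group of two registers, t0 and t2 stay hot. These two values cover 177 of 189 accesses, so the cold group could run at reduced voltage most of the time.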

12.

An Extended Register File Design for Multi-Core Processors


Xiao Ruijin, Quan Heng, Zhang Jiajie, You Kaidi, Ying Yan, Yu Zhiyi. Computer Engineering, 2012, 38(15): 283-285, 289

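Processors have only a limited number of registers available. To address this, the paper proposes an extended register file design for multi-core processors. The hardware uses a multi-bank structure. Communication ports are mapped into the extended register address space, so cores communicate by addressing registers. The paper also introduces a hybrid software configuration scheme that combines low-level instructions with high-level encapsulation, and it improves the software compilation flow. In the evaluation, the scheme doubled the available register file, cut the number of inter-core communication instructions by 50% and improved system throughput.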

13.

Register File Design for an EPIC-Based Simultaneous Multithreading Processor


Huang Caixia. Computer Engineering & Science, 2009, 31(10)

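In the Itanium microprocessor, which embodies the EPIC design philosophy, the register file is managed by the Register Stack Engine (RSE). EPIC hardware is simple, and dynamic simultaneous multithreading (DSMT) readily exploits thread-level parallelism. The EDSMT microarchitecture combines the advantages of both. For it, we propose MTRSE, a mapping-table-based method for managing the register file. The method is compatible with the Itanium architecture, supports simultaneous multithreading and uses register resources more efficiently. Experiments show that with three or four threads, the method improves register utilization by 40%.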

14.

Register spilling via transformed interference equations for PAC DSP architecture

Chung-Ju Wu, Chia-Han Lu, Jenq Kuen Lee. Concurrency and Computation, 2014, 26(3): 779-799

Digital signal processors (DSPs) with very long instruction word (VLIW) data-path architectures are increasingly deployed in embedded devices for multimedia processing. To reduce the power consumption and design cost of VLIW DSP processors, designers are adopting distributed register files and multibank register architectures, which need fewer read and write ports per register file. This presents new challenges for compiler optimization.

This paper addresses the problem of reducing spill code for a VLIW DSP with distributed register files. Spill code produced by register allocation traditionally spills to memory, but a multibank register-file architecture also allows register values to be spilled to other register banks. We present a conceptual framework based on universal and proxy interference graphs, which models register live ranges for spilling to different register banks, and we develop heuristic algorithms on this basis. By heuristically estimating the register pressure of each register file, we treat other register banks as optional spill locations in addition to memory.

We incorporated the proposed schemes into an Open64-based compiler and ran experiments on the Parallel Architecture Core (PAC) VLIW DSP with distributed register files. On DSPStone and MiBench benchmarks that involve spilling, our approach improves performance on average by 7.1% and 21.6%, respectively, compared with always spilling to memory. Copyright © 2013 John Wiley & Sons, Ltd.
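Treating other register banks as spill locations in addition to memory can be sketched as a simple rule. Spill to the bank with the most spare registers, if any bank has room; otherwise spill to memory.

The pressure estimate, bank names and tie-breaking are hypothetical. The paper's heuristics work on universal and proxy interference graphs, which this sketch does not model.

```python
def choose_spill_location(home_bank, pressure, capacity):
    """Pick where to spill a value evicted from home_bank.

    pressure: dict bank -> estimated number of simultaneously live values
    capacity: dict bank -> number of registers in the bank
    Returns the name of another bank with a free register, or "memory".
    """
    candidates = [
        b for b in pressure
        if b != home_bank and pressure[b] < capacity[b]
    ]
    if not candidates:
        return "memory"   # every other bank is full: fall back to memory
    # Prefer the bank with the most spare registers.
    return max(candidates, key=lambda b: capacity[b] - pressure[b])
```

An inter-bank copy is usually much cheaper than a store/load pair, which is the benefit the paper measures against always spilling to memory.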

15.

A Stream Register File for Stream Applications


Ma Chiyuan, Chen Haiyan, Qi Shubo, Chen Shuming, Xiao Rong. Computer Engineering, 2008, 34(18): 263-265

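The memory system is the bottleneck when general-purpose processors handle stream applications. Based on the FT64 stream processor architecture, this paper proposes a design method and a data transfer mechanism for a stream register file aimed at stream applications, and it analyzes the file's role in FT64. The design uses large-capacity, high-bandwidth, virtually multi-ported memory. This confines most stream data accesses to the register file level and reduces pressure on main memory. Experimental results show that the structure suits the needs of stream applications well.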

16.

Design of a High-Performance, Low-Power Processor in 65 nm Technology


Quan Heng, Xiao Ruijin, Ou Peng, You Kaidi, Huang Bei, Yu Zhiyi. Computer Engineering, 2012, 38(19): 250-253

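This paper studies and designs a RISC processor whose high performance and low power are ensured at three levels:
- Architecture: an extended register file improves the locality of data exchange and reduces the number of memory accesses.
- Circuit: dynamic clock gating provides efficient clock control for the multiply/divide unit and the register file.
- Chip back end: the GP and LP process libraries of TSMC 65 nm are analyzed and compared, and a multi-threshold design flow further increases speed and reduces power.

Compared with performance results on other platforms, the processor raises the throughput of the Reed-Solomon (RS) forward error correction decoding algorithm by a factor of 4 to 70.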

17.

A Single-Chip Multiprocessor with a VLIW Architecture

Tang Zhizhong, Zhang Chihong. Journal of Computer Research and Development, 1993, 30(10): 1-8

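This paper presents a high-performance single-chip multiprocessor built on a VLIW (very long instruction word) architecture. The architecture uses a pipelined register file to eliminate data dependences within loop programs, so that programs run at the instruction level with a very high degree of parallelism. Simulation results show that the architecture achieves very high computing speed and a very good price/performance ratio.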
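A pipelined (rotating) register file removes the need to copy or rename values between overlapped iterations. A logical register number is offset by a rotating base that changes every iteration, so the same instruction writes a different physical register in each iteration.

The sketch below models this in the style of IA-64 rotating registers:
- logical register r maps to physical register (r + rrb) mod N;
- rrb is decremented at each loop-closing branch.

The abstract does not say whether this paper's architecture uses exactly this mapping, so treat it as an assumption. The class and function names are hypothetical.

```python
class RotatingRegisterFile:
    """Toy model of a rotating register file with n physical registers."""

    def __init__(self, n):
        self.n = n
        self.regs = [None] * n
        self.rrb = 0  # rotating register base

    def _phys(self, logical):
        return (logical + self.rrb) % self.n

    def write(self, logical, value):
        self.regs[self._phys(logical)] = value

    def read(self, logical):
        return self.regs[self._phys(logical)]

    def rotate(self):
        """Advance one iteration: what was logical r becomes logical r + 1."""
        self.rrb = (self.rrb - 1) % self.n


def pipelined_increment(data, n_regs=8):
    """Two-stage loop: stage 1 loads into r0, and stage 2, one iteration
    later, reads the same value as r1 without any register copy."""
    rf = RotatingRegisterFile(n_regs)
    results = []
    for i in range(len(data) + 1):
        if i >= 1:
            results.append(rf.read(1) + 1)  # value loaded last iteration
        if i < len(data):
            rf.write(0, data[i])            # load for this iteration
        rf.rotate()
    return results
```

Calling `pipelined_increment([10, 20, 30, 40])` yields each element plus one. Every load is written to logical r0, yet earlier values are never overwritten before stage 2 uses them.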

18.

Software and Hardware Techniques to Optimize Register File Utilization in VLIW Architectures (total citations: 1; self-citations: 0; citations by others: 1)

Javier Zalamea, Josep Llosa, Eduard Ayguadé, Mateo Valero. International Journal of Parallel Programming, 2004, 32(6): 447-474

High-performance microprocessors are currently designed to exploit instruction-level parallelism (ILP). The design techniques involved, and the aggressive scheduling techniques used to exploit this ILP, tend to increase the register requirements of loops. This paper reviews hardware and software techniques that alleviate the high register demands of aggressive scheduling heuristics on VLIW cores.

From the software point of view, instruction scheduling can stretch lifetimes and reduce the register pressure. If more registers are required than the architecture provides, some action has to be taken to reduce the pressure, such as injecting spill code, at the cost of some performance degradation.

From the hardware point of view, this degradation could be reduced by including a high-capacity register file, provided it does not hurt the processor design (cycle time, area and power dissipation). Novel register file organizations based on clustering and hierarchy are needed to meet technology constraints. This paper proposes the use of a clustered organization, together with an aggressive instruction scheduling technique that minimizes the negative effect of the limitations imposed by the register file organization.

19.

On the Boosting of Instruction Scheduling by Renaming

Wang L., Yang Ted C. The Journal of Supercomputing, 2001, 19(2): 173-197

Speculative execution is the execution of instructions before it is known whether they should be executed. In speculative execution for instruction-level parallelism (ILP) processors, shadow registers provide a hardware way to protect program semantics from boosted instructions that were predicted incorrectly. In a recent study, Chang and Lai proposed the conjugate register file (CRF), a special register file based on shadow registers, to support multilevel boosting in speculative execution. They also proposed a scheduling heuristic named frequency-driven scheduling to work with the CRF. However, boosting is still constrained, because the register-pair concept forces speculatively produced results to be stored in dedicated locations. Moreover, as advances in hardware raise the potential parallelism to tens of instructions, heavy register demand and register file complexity may well become a serious bottleneck for exploiting ILP.

In this paper, we modify the frequency-driven scheduling algorithm by replacing the hardware CRF with variable renaming during compilation. The new scheduling technique, named LESS, exploits parallelism efficiently with a limited number of registers. Because it benefits ILP without any special hardware support, it can be combined with any other ILP architecture without changing the instruction set architecture (ISA).

Simulation results show that LESS performs better than other existing methods. For example, under an ILP model with an issue rate of 8, speculative execution with LESS increases parallelism by 34%, compared with 18% for the CRF scheme.
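Compile-time renaming gives every new definition a fresh register name. This removes false dependences (write-after-read and write-after-write), so an instruction can be boosted without overwriting a value that is still needed. That is the property LESS relies on.

The sketch makes these assumptions:
- instructions are three-address tuples;
- fresh names are `v0`, `v1`, and so on;
- the code is straight-line.

It also ignores the register limit that the paper's technique must respect.

```python
def rename(instrs):
    """Rename destinations in straight-line code.

    instrs: list of (dest, src1, src2) register names.
    Each definition gets a fresh name v0, v1, ...; sources are rewritten
    to the latest name of the register they read, so only true
    (read-after-write) dependences remain.
    """
    current = {}      # original register -> its most recent new name
    out = []
    for n, (dest, a, b) in enumerate(instrs):
        a = current.get(a, a)   # rewrite sources before redefining dest
        b = current.get(b, b)
        new = f"v{n}"
        current[dest] = new
        out.append((new, a, b))
    return out
```

Consider the sequence `r1 = r2 op r3; r2 = r1 op r4; r1 = r5 op r6`. Before renaming, the third instruction must wait for the first two: it has a write-after-write dependence on r1 with the first instruction and a write-after-read dependence on r1 with the second. After renaming it writes `v2` and depends on neither, so it can move freely.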

20.

Region-based dual-bank register allocation for reduced instruction encoding architectures

Microprocessors and Microsystems, 2017

In embedded systems, small code size is important because of memory constraints. One way to achieve it is to reduce the instruction encoding from 32 bits to 16 bits, as in the ARM THUMB or MIPS-16 architectures. This half-size encoding shortens register operands, so fewer registers are available for register allocation and more spills occur, although invisible registers can be used as spill locations via copies.

We propose restructuring the original register file into two banks, and adding a bank toggle instruction for switching banks plus inter-bank copies between them. We also propose an efficient dual-bank register allocation technique that works on code regions to reduce spills. As a case study, we applied our banked register allocation model to the THUMB architecture. Code size decreased by up to 8% (5.8% on average), while performance improved by up to 11.1% (3.3% on average).

Our results indicate that an embedded CPU offering a reduced encoding gets better register allocation by organizing its register file into two banks than by using the invisible registers for spills.
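With a 16-bit encoding that can name only one bank at a time, every switch between banks costs an explicit toggle instruction. This is why allocating by region, keeping each region within one bank, pays off.

The sketch counts the toggles needed for a straight-line sequence, given the bank each instruction uses. Three parts are hypothetical: the bank labels, the assumption that execution starts in bank 0, and the function name. Inter-bank copies are not modeled.

```python
def count_bank_toggles(banks, start_bank=0):
    """Number of bank-toggle instructions a straight-line sequence needs.

    banks: for each instruction, the register bank (0 or 1) its operands use.
    start_bank: bank active on entry (assumed to be 0).
    """
    toggles = 0
    active = start_bank
    for b in banks:
        if b != active:       # operands live in the other bank: switch
            toggles += 1
            active = b
    return toggles
```

Take six instructions, three per bank. Interleaving them as 0, 1, 0, 1, 0, 1 costs 5 toggles, while grouping each region's values in one bank (0, 0, 0, 1, 1, 1) costs only 1.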