期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Compiler-assisted power optimization for clustered VLIW architectures

Rahul Nagpal Y.N. Srikant 《Parallel Computing》2011,37(1):42-59

Clustered VLIW architectures solve the scalability problem associated with flat VLIW architectures by partitioning the register file and connecting only a subset of the functional units to a register file. However, inter-cluster communication in clustered architectures leads to increased leakage in functional components and a high number of register accesses. In this paper, we propose compiler scheduling algorithms targeting two previously ignored power-hungry components in clustered VLIW architectures, viz., instruction decoder and register file.We consider a split decoder design and propose a new energy-aware instruction scheduling algorithm that provides 14.5% and 17.3% benefit in the decoder power consumption on an average over a purely hardware based scheme in the context of 2-clustered and 4-clustered VLIW machines. In the case of register files, we propose two new scheduling algorithms that exploit limited register snooping capability to reduce extra register file accesses. The proposed algorithms reduce register file power consumption on an average by 6.85% and 11.90% (10.39% and 17.78%), respectively, along with performance improvement of 4.81% and 5.34% (9.39% and 11.16%) over a traditional greedy algorithm for 2-clustered (4-clustered) VLIW machine. 相似文献

2.

Compiler-assisted energy optimization for clustered VLIW processors

Rahul Nagpal Y.N. Srikant 《Journal of Parallel and Distributed Computing》2012

Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving the clock speed, reducing the energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires having high load capacitance which leads to delay in execution and significantly high energy consumption. Inter-cluster communication also introduces many short idle cycles, thereby significantly increasing the overall leakage energy consumption in the functional units. The trend towards miniaturization of devices (and associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse, and limits the usability of clustered architectures in smaller technologies. However, technological advancements now permit the design of interconnects and functional units with varying performance and power modes. In this paper, we propose scheduling algorithms that aggregate the scheduling slack of instructions and communication slack of data values to exploit the low-power modes of functional units and interconnects. Finally, we present a synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects to improves the usability of clustered architectures by achieving better overall energy–performance trade-offs. Even with conservative estimates of the contribution of the functional units and interconnects to the overall processor energy consumption, the proposed combined scheme obtains on average 8% and 10% improvement in overall energy–delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine, respectively. We present a detailed experimental evaluation of the proposed schemes. Our test bed uses the Trimaran compiler infrastructure. 相似文献

3.

A hypergraph-based model for port allocation on multiple-register-file VLIW architectures

Andrea Capitanio Alexandru Nicolau Nikil Dutt 《International journal of parallel programming》1995,23(6):499-513

Multiple-functional-unit architectures allow one to boost performance by simultaneously executing many operations, but technological constraints limit the achievable register-file I/O bandwith and prevent one from fully exploiting the benefits of a large number of units. Dividing the register set into multiple banks can improve the overall I/O bandwidth but determines a nonhomogeneous register space onto which variables must be allocated subject to register-file-port constraining. We propose a hypergraph-based paradigm for modeling competition among variables for port-allocation on multiple-register-file VLIW architectures; by coloring such a hypergraph, we can identify legal allocations of variables to register banks and produce executable code. 相似文献

4.

分簇VLIW结构下利用数据依赖图优化调度的研究

杨旭何虎孙义和《计算机学报》2011,34(1):182-192

应用的需求促使如今的处理器必须尽可能高地利用程序中所存在的指令级并行度,然而,高指令级并行的硬件和指令调度技术会给寄存器资源带来极大的压力.要在单一寄存器堆的情况下,既维持高的指令级并行度,又保持高的运行时钟频率是一件非常困难的事情,这是因为,当指令级并行度足够高时,在单一寄存器堆情况下,寄存器堆访问端口数目的限制会使... 相似文献

5.

面向VLIW结构的寄存器压力敏感表调度算法*

王红梅王敏张铁军单睿侯朝焕《计算机应用研究》2009,26(11):4039-4041

为了改善寄存器压力问题,提出一种寄存器压力敏感的指令调度算法。该算法在传统表调度算法的基础上采用关键路径为优先级函数,并考虑在寄存器压力区域内调整非关键节点的调度时机,在应用程序性能不损失的情况下达到了减小寄存器压力的目的。相似文献

6.

Compiler-assisted leakage-aware loop scheduling for embedded VLIW DSP processors

Meng Wang Author Vitae Author Vitae Duo Liu Author Vitae Author Vitae Zili Shao^{Author Vitae} 《Journal of Systems and Software》2010,83(5):772-785

As feature size shrinks, leakage energy consumption has become an important concern. In this paper, we develop a compiler-assisted instruction-level scheduling technique to reduce leakage energy consumption for applications with loops on VLIW architecture. In the proposed technique, we obtain the schedule with minimum leakage energy from the ones that are generated by repeatedly regrouping a loop based on rotation scheduling and bipartite-matching. We conduct experiments on a set of benchmarks from DSPstone, Mediabench, Netbench, and MiBench based on the power model of the VLIW processors. The results show that our algorithm can achieve significant leakage energy saving compared with the previous work. 相似文献

7.

面向VLIW处理器的分支调度优化算法

时磊吴潇涂登彪程工刘峰余翠玲任彦《计算机工程与应用》2012,48(21):41-44

分支调度是一种有效消除分支指令延迟的指令调度技术,对于提升VLIW类处理器的性能非常重要。提出了一个针对分支延迟槽的指令调度优化算法。该算法面向VLIW体系结构,根据程序依赖图选择合适的候选指令序列;通过建立代价收益模型为分支延迟槽产生一个收益较大的指令调度序列。实验数据表明,分支调度算法可以平均提升12.9%的应用程序性能。相似文献

8.

A Novel instruction stream buffer for VLIW architectures

Jih-Ching Chiu 《Computers & Electrical Engineering》2010,36(1):190-198

The instruction compression mechanism used to solve the drawbacks of traditional very long instruction word (VLIW) architectures often leads to poor code density in the instruction cache, which causes the irregular lengths of long instructions to cross the different cache line. These split long instructions cannot be fetched simultaneously, which creates a bottleneck for VLIW architectures. This paper proposes a buffing mechanism which can slide the split long instruction as a continuous form to offer better efficiency in instruction fetching. This approach helps maintain the behaviors of the software pipeline technology, which schedules iterative instructions to enhance the performance of streaming processing for VLIW architectures. In the proposed mechanism, the instruction stream buffer stores the repeat block completely and suspends as far as possible the cache access to reduce access time. The advantages of repeatedly issuing instructions in the instruction buffer and preventing split long instructions, can substantially improve the performance in fetching instructions. Simulation results show that the mechanism is efficient at the instruction level for the basic DSP/IMG library by improving performance by 35% on average. 相似文献

9.

魂芯分簇VLIW DSP上指令调度的优化

《微型机与应用》2017,(11)

魂芯DSP处理器是一款32 bit静态超标量、分簇结构的、支持SIMD的VLIW处理器。魂芯DSP芯片有4个执行簇和3个内存块,但簇间数据传输和寻址会占用总线带宽。魂芯DSP上每个簇中有大量的计算部件,但是现有的编译器框架中指令调度算法是针对非分簇结构的,无法充分利用魂芯DSP的分簇结构特点,产生出高效的指令级并行代码。根据魂芯处理器架构分簇的特点,提出了在魂芯DSP上进行指令分簇和指令调度的启发式算法,并且在开源Open64编译器框架上进行了实现。实验结果表明,该算法在魂芯DSP编译器上的实现可以显著提高一些在DSP上有着计算密集型程序的性能。相似文献

10.

寄存器堆互连的VLIW结构及其指令调度算法

周志雄何虎杨旭张延军孙义和《计算机学报》2008,31(1):127-132

超长指令字(Very Long Instruction Word,VLIW)处理器一般采用总线互连的多簇结构,每个簇中的功能单元共享一个本地寄存器堆,簇间采用总线传输数据,以避免功能单元增多时,全连通结构的延时、面积和功耗的快速增长;但簇间数据共享时的拷贝和延时,使得处理器在性能上有所下降.文中提出了一种寄存器堆互连的多簇VLIW结构,采用寄存器堆来连接各个簇,从而可以避免簇间数据传输的延时和额外的数据拷贝操作.同时也提出了针对这种结构的指令调度算法,以提高指令调度的性能.实验结果表明,与全连通的VLIW结构相比,寄存器堆互连结构在性能上仅有13%左右的性能下降,代码长度则基本不变;这都优于总线互连的多簇结构. 相似文献

11.

VLIW微处理器特征与编译技术支持

郑飞陆鑫达《微处理机》1996,(3):1-4

VLIW是一种早已出现但一直未能广泛使用而现今又被重新重点研究的微处理器设计思想与技术，它跟超标量技术一样支持每周期执行多条指令，但并行度更高。本文将详细介绍VLIW的概念及其发展历程，讨论VLIW微处理器的特征与所需的编译技术支持，并与超标量微处理器进行比较分析。相似文献

12.

Dynamically Scheduling VLIW Instructions

A. F. de Souza P. Rounce 《Journal of Parallel and Distributed Computing》2000,60(12):184

Very long instruction word (VLIW) machines potentially provide the most direct way to exploit instruction-level parallelism; however, they cannot be used to emulate current general-purpose instruction set architectures. In addition, programs scheduled for a particular implementation of a VLIW model cannot be guaranteed to be binary compatible with other implementations of the same machine model with a different number of functional units or functional units with different latencies. This paper describes an architecture, named dynamically trace scheduled VLIW (DTSVLIW), that can be used to implement machines that execute code of current RISC or CISC instruction set architectures in a VLIW fashion, with backward code compatibility. Preliminary measurements of the DTSVLIW performance, obtained with an execution-driven simulator running the SPECint95 benchmark suite, are also presented. 相似文献

13.

基于整数线性规划的VLIW DSP指令分簇调度

周鹏《计算机应用研究》2022,39(10)

在分簇VLIW DSP上,指令分簇是一项对程序性能有重要影响的编译优化,但现有的指令分簇算法只能处理顺序的程序区域,且难以获得最佳的分簇方案。针对这些问题,提出一种基于整数线性规划的统一指令分簇与指令调度的方法。该方法使用零一决策变量表示函数中指令的分簇、指令的局部调度以及簇间传输指令的全局调度,并将指令之间的依赖关系和对处理器资源的竞争关系构造为线性约束,最终得到一个以最小化函数的估计执行时间为目标的整数线性规划模型。实验结果表明,求解该模型得到的分簇调度方案对程序性能的优化显著强于现有算法,并且求解模型所耗费的时间是可接受的。相似文献

14.

一种动态VLIW调度机制的研究和实现 总被引：2，自引：0，他引：2

下载免费PDF全文

李云照王志英沈立《计算机工程与科学》2008,30(7):90-93

VLIW结构是开发ILP的一种重要手段,其优点是结构规整简单、硬件复杂度低。但是,完全依靠编译器进行指令调度的机制限制了VLIW结构性能的提高。本文提出了一种基于确定指令延迟的动态VLIW调度机制,该机制利用大部分指令执行时间确定的特点,根据运行时信息重新调度指令的执行顺序,以进一步开发ILP。在FPGA上的实验结果表明,该机制具有线性的硬件复杂度。相似文献

15.

消除VLIW结构上的循环体间冗余流相关 总被引：1，自引：1，他引：1

容红波汤志忠《软件学报》2000,11(1):126-132

数据相关是并行处理的基本依据.该文指出,VLIW(very long instruction word)特有的锁步性质使其数据相关性分析具有与众不同的特点.同一体差上的流相关形成一个线序集合,多体差上的特征流相关之间也存在包含关系.据此,提出一种用于VLIW的消除循环体间冗余流相关的方法.该方法是完备的,可以去除所有冗余的体间流相关,从而减轻循环调度的负担.文章给出判定单体差和多体差存在冗余的充分必要条件,以及消除冗余的线性复杂度的算法.这种方法具有普遍意义,可作为VLIW上软件流水和多指令流调度的基础. 相似文献

16.

面向VLIW处理器的GAS汇编器实现

王红梅张铁军王东辉《微计算机应用》2010,31(5)

DSP处理器采用VLIW结构提高了指令级并行度,同时也增加了为其开发汇编器的难度.本文在汇编器GAS(GNV Assemblor)的基础上,讨论了为VLIW结构DSP开发汇编器的关键技术.该技术通过分析汇编指令的串并行信息为DSP产生指令包;通过相关性检查改善了代码膨胀问题,在保证汇编器功能正确的同时,提高了性能. 相似文献

17.

面向嵌入式VLIW处理器的代码压缩技术

杨磊张铁军王东辉《微计算机应用》2010,31(5)

随着嵌入式系统的发展,在性能不断提高的同时,软件代码规模也不断扩大.而超长指令字结构更加引起了代码的膨胀,因此代码压缩技术变得很重要.本文基于自主研发的SDSP处理器核,应用3种压缩编码技术,比较它们压缩的效果,并讨论了包括压缩后地址的重映射以及解压缩结构的整体硬件方案. 相似文献

18.

VLIW处理器的设计与实现

唐骞杨小雪《微型机与应用》2010,29(11)

介绍了基于FPGA实现VLIW微处理器的基本方法,对VLIW微处理器具体划分为5个主要功能模块.依据FPGA的设计思想,采用自顶向下和文本与原理图相结合的流水线方式的设计方法,进行VLIW微处理器的5个模块功能设计,从而最终实现VLIW微处理器的功能,并进行了板级功能验证. 相似文献

19.

基于甚长指令字处理器的启发式手动编码方法

楼东武任俊李志能《计算机工程与应用》2005,41(26):58-60,93

虽然有针对VLIW处理器的复杂编译器,但是通过手动汇编能够更有效地实现这些算法。手动编码是一项易出错,耗时的工作。为了解决这个问题,文章提出了一种手动编码的启发式实现方法,相对于单纯的手动编码,它能够在更短的时间内更有效地实现算法。在德州仪器的VLIW处理器TMS320C6x上,使用这种方法实现了IIR滤波器算法,证实了其有效性。相似文献

20.

Static scheduling for barrier MIMD architectures

Henry G. Dietz Abderrazek Zaafrani Matthew T. O'Keefe 《The Journal of supercomputing》1992,5(4):263-289

In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility.In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant).This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented. 相似文献