共查询到20条相似文献,搜索用时 15 毫秒
1.
Rahul Nagpal Y.N. Srikant 《Parallel Computing》2011,37(1):42-59
Clustered VLIW architectures solve the scalability problem associated with flat VLIW architectures by partitioning the register file and connecting only a subset of the functional units to a register file. However, inter-cluster communication in clustered architectures leads to increased leakage in functional components and a high number of register accesses. In this paper, we propose compiler scheduling algorithms targeting two previously ignored power-hungry components in clustered VLIW architectures, viz., instruction decoder and register file.We consider a split decoder design and propose a new energy-aware instruction scheduling algorithm that provides 14.5% and 17.3% benefit in the decoder power consumption on an average over a purely hardware based scheme in the context of 2-clustered and 4-clustered VLIW machines. In the case of register files, we propose two new scheduling algorithms that exploit limited register snooping capability to reduce extra register file accesses. The proposed algorithms reduce register file power consumption on an average by 6.85% and 11.90% (10.39% and 17.78%), respectively, along with performance improvement of 4.81% and 5.34% (9.39% and 11.16%) over a traditional greedy algorithm for 2-clustered (4-clustered) VLIW machine. 相似文献
2.
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving the clock speed, reducing the energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires having high load capacitance which leads to delay in execution and significantly high energy consumption. Inter-cluster communication also introduces many short idle cycles, thereby significantly increasing the overall leakage energy consumption in the functional units. The trend towards miniaturization of devices (and associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse, and limits the usability of clustered architectures in smaller technologies. However, technological advancements now permit the design of interconnects and functional units with varying performance and power modes. In this paper, we propose scheduling algorithms that aggregate the scheduling slack of instructions and communication slack of data values to exploit the low-power modes of functional units and interconnects. Finally, we present a synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects to improves the usability of clustered architectures by achieving better overall energy–performance trade-offs. Even with conservative estimates of the contribution of the functional units and interconnects to the overall processor energy consumption, the proposed combined scheme obtains on average 8% and 10% improvement in overall energy–delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine, respectively. We present a detailed experimental evaluation of the proposed schemes. Our test bed uses the Trimaran compiler infrastructure. 相似文献
3.
Andrea Capitanio Alexandru Nicolau Nikil Dutt 《International journal of parallel programming》1995,23(6):499-513
Multiple-functional-unit architectures allow one to boost performance by simultaneously executing many operations, but technological
constraints limit the achievable register-file I/O bandwith and prevent one from fully exploiting the benefits of a large
number of units. Dividing the register set into multiple banks can improve the overall I/O bandwidth but determines a nonhomogeneous
register space onto which variables must be allocated subject to register-file-port constraining. We propose a hypergraph-based
paradigm for modeling competition among variables for port-allocation on multiple-register-file VLIW architectures; by coloring
such a hypergraph, we can identify legal allocations of variables to register banks and produce executable code. 相似文献
4.
5.
6.
Meng Wang Author Vitae Author Vitae Duo Liu Author Vitae Author Vitae Zili Shao Author Vitae 《Journal of Systems and Software》2010,83(5):772-785
As feature size shrinks, leakage energy consumption has become an important concern. In this paper, we develop a compiler-assisted instruction-level scheduling technique to reduce leakage energy consumption for applications with loops on VLIW architecture. In the proposed technique, we obtain the schedule with minimum leakage energy from the ones that are generated by repeatedly regrouping a loop based on rotation scheduling and bipartite-matching. We conduct experiments on a set of benchmarks from DSPstone, Mediabench, Netbench, and MiBench based on the power model of the VLIW processors. The results show that our algorithm can achieve significant leakage energy saving compared with the previous work. 相似文献
7.
分支调度是一种有效消除分支指令延迟的指令调度技术,对于提升VLIW类处理器的性能非常重要。提出了一个针对分支延迟槽的指令调度优化算法。该算法面向VLIW体系结构,根据程序依赖图选择合适的候选指令序列;通过建立代价收益模型为分支延迟槽产生一个收益较大的指令调度序列。实验数据表明,分支调度算法可以平均提升12.9%的应用程序性能。 相似文献
8.
Jih-Ching Chiu 《Computers & Electrical Engineering》2010,36(1):190-198
The instruction compression mechanism used to solve the drawbacks of traditional very long instruction word (VLIW) architectures often leads to poor code density in the instruction cache, which causes the irregular lengths of long instructions to cross the different cache line. These split long instructions cannot be fetched simultaneously, which creates a bottleneck for VLIW architectures. This paper proposes a buffing mechanism which can slide the split long instruction as a continuous form to offer better efficiency in instruction fetching. This approach helps maintain the behaviors of the software pipeline technology, which schedules iterative instructions to enhance the performance of streaming processing for VLIW architectures. In the proposed mechanism, the instruction stream buffer stores the repeat block completely and suspends as far as possible the cache access to reduce access time. The advantages of repeatedly issuing instructions in the instruction buffer and preventing split long instructions, can substantially improve the performance in fetching instructions. Simulation results show that the mechanism is efficient at the instruction level for the basic DSP/IMG library by improving performance by 35% on average. 相似文献
9.
10.
超长指令字(Very Long Instruction Word,VLIW)处理器一般采用总线互连的多簇结构,每个簇中的功能单元共享一个本地寄存器堆,簇间采用总线传输数据,以避免功能单元增多时,全连通结构的延时、面积和功耗的快速增长;但簇间数据共享时的拷贝和延时,使得处理器在性能上有所下降.文中提出了一种寄存器堆互连的多簇VLIW结构,采用寄存器堆来连接各个簇,从而可以避免簇间数据传输的延时和额外的数据拷贝操作.同时也提出了针对这种结构的指令调度算法,以提高指令调度的性能.实验结果表明,与全连通的VLIW结构相比,寄存器堆互连结构在性能上仅有13%左右的性能下降,代码长度则基本不变;这都优于总线互连的多簇结构. 相似文献
11.
VLIW是一种早已出现但一直未能广泛使用而现今又被重新重点研究的微处理器设计思想与技术,它跟超标量技术一样支持每周期执行多条指令,但并行度更高。本文将详细介绍VLIW的概念及其发展历程,讨论VLIW微处理器的特征与所需的编译技术支持,并与超标量微处理器进行比较分析。 相似文献
12.
Very long instruction word (VLIW) machines potentially provide the most direct way to exploit instruction-level parallelism; however, they cannot be used to emulate current general-purpose instruction set architectures. In addition, programs scheduled for a particular implementation of a VLIW model cannot be guaranteed to be binary compatible with other implementations of the same machine model with a different number of functional units or functional units with different latencies. This paper describes an architecture, named dynamically trace scheduled VLIW (DTSVLIW), that can be used to implement machines that execute code of current RISC or CISC instruction set architectures in a VLIW fashion, with backward code compatibility. Preliminary measurements of the DTSVLIW performance, obtained with an execution-driven simulator running the SPECint95 benchmark suite, are also presented. 相似文献
13.
在分簇VLIW DSP上,指令分簇是一项对程序性能有重要影响的编译优化,但现有的指令分簇算法只能处理顺序的程序区域,且难以获得最佳的分簇方案。针对这些问题,提出一种基于整数线性规划的统一指令分簇与指令调度的方法。该方法使用零一决策变量表示函数中指令的分簇、指令的局部调度以及簇间传输指令的全局调度,并将指令之间的依赖关系和对处理器资源的竞争关系构造为线性约束,最终得到一个以最小化函数的估计执行时间为目标的整数线性规划模型。实验结果表明,求解该模型得到的分簇调度方案对程序性能的优化显著强于现有算法,并且求解模型所耗费的时间是可接受的。 相似文献
14.
VLIW结构是开发ILP的一种重要手段,其优点是结构规整简单、硬件复杂度低。但是,完全依靠编译器进行指令调度的机制限制了VLIW结构性能的提高。本文提出了一种基于确定指令延迟的动态VLIW调度机制,该机制利用大部分指令执行时间确定的特点,根据运行时信息重新调度指令的执行顺序,以进一步开发ILP。在FPGA上的实验结果表明,该机制具有线性的硬件复杂度。 相似文献
15.
消除VLIW结构上的循环体间冗余流相关 总被引:1,自引:1,他引:1
数据相关是并行处理的基本依据.该文指出,VLIW(very long instruction word)特有的锁步性质使其数据相关性分析具有与众不同的特点.同一体差上的流相关形成一个线序集合,多体差上的特征流相关之间也存在包含关系.据此,提出一种用于VLIW的消除循环体间冗余流相关的方法.该方法是完备的,可以去除所有冗余的体间流相关,从而减轻循环调度的负担.文章给出判定单体差和多体差存在冗余的充分必要条件,以及消除冗余的线性复杂度的算法.这种方法具有普遍意义,可作为VLIW上软件流水和多指令流调度的基础. 相似文献
16.
17.
18.
介绍了基于FPGA实现VLIW微处理器的基本方法,对VLIW微处理器具体划分为5个主要功能模块.依据FPGA的设计思想,采用自顶向下和文本与原理图相结合的流水线方式的设计方法,进行VLIW微处理器的5个模块功能设计,从而最终实现VLIW微处理器的功能,并进行了板级功能验证. 相似文献
19.
虽然有针对VLIW处理器的复杂编译器,但是通过手动汇编能够更有效地实现这些算法。手动编码是一项易出错,耗时的工作。为了解决这个问题,文章提出了一种手动编码的启发式实现方法,相对于单纯的手动编码,它能够在更短的时间内更有效地实现算法。在德州仪器的VLIW处理器TMS320C6x上,使用这种方法实现了IIR滤波器算法,证实了其有效性。 相似文献
20.
Henry G. Dietz Abderrazek Zaafrani Matthew T. O'Keefe 《The Journal of supercomputing》1992,5(4):263-289
In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility.In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant).This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented. 相似文献