共查询到20条相似文献,搜索用时 187 毫秒
1.
本文针对控制流网络处理器固定拓扑结构的限制及指令集并行性开发的不足,将粗粒度数据流设计思想引入到网络处理器体系结构设计中,提出了一种新型粗粒度数据流网络处理器体系结构-DynaNP。DynaNP利用处理引擎(PE)内控制流执行方式获得较高的可编程性,还利用PE间数据流执行方式开发了报文处理中的任务级并行性。为了进一步提高DynaNP的系统流量,面向DynaNP的多核及数据流特性,设计了混合定制硬件加速机制,并详细介绍了实现混合定制硬件加速的关键技术,通过提供统一的混合定制硬件加速接口,可以支持定制指令和协处理器两种典型硬件加速器。 相似文献
2.
网络处理器是一种支持高速报文处理和转发的可编程通信集成电路.作为路由器中的重要组件,网络处理器设计不但强调高性能,还要求足够的灵活性以支持未来的网络协议.针对控制流网络处理器固定拓扑结构及指令级并行性开发方面的不足,采用粗粒度数据流设计思想,提出了一种粗粒度数据流网络处理器体系结构及原型--DynaNP.DynaNP不但可利用处理引擎内控制流执行方式获得高可编程性,还利用处理引擎间数据流执行方式有效开发报文处理中的任务级并行性.此外,DynaNP提供了处理路径动态配置机制,可有效提高系统流量.DynaNP的原型系统基于SoPC技术设计实现.多个PE和功能模块通过片上高速通信网络连接,其中,核心处理引擎采用嵌入式RISC处理器核LEON3实现,并采用指令集扩展技术优化网络协议处理.该原型系统可有效验证粗粒度数据流网络处理器的功能和关键技术. 相似文献
3.
谓词执行是在控制流存在的条件下可以有效挖掘指令级并行性的硬件机制。而在分簇结构上实现谓词机制,可以提高分簇结构上条件的执行效率。本文针对分簇结构展开谓词体系体系结构的研究,提出了分簇结构部分谓词的高效实现方法,以及基于循环展开的分簇结构部分谓词支持框架。实验表明,本文提出的分簇结构部分谓词及编译框架可以很好地提高条件执行程序的执行效率。 相似文献
4.
超长指令字处理器为了提高指令集并行(ILP)往往采用多个功能单元,从而需要多端口的寄存器文件提供支持.但是寄存器文件会随着端口的增多变得更复杂,频率难以提升,成为系统的瓶颈.分簇是解决这一问题的有效手段.分簇在不影响处理器ILP的前提下减少了每簇寄存器文件的端口数目,但对编译器提出了挑战,编译器必须将指令和操作数在簇间进行合理分配才能得到较好的指令级并行.针对分簇超长指令字结构提出了一种基于超块的统一分簇与模调度编译方法.使用超块技术可以增大调度范围以获得更好的ILP,并且可以处理含有控制流的循环体,增加了模调度的适用范围.超块中指令的分簇与模调度则是统一进行的,这将比分阶段进行有更好的优化效果,因为统一进行是从全局的角度寻求优化而非寻求各个阶段局部优化.在YHFT-DSP/700编译器中的实验结果表明,与ITSS算法相比,该算法可以达到较好的优化效果. 相似文献
5.
利用数据预取机制降低块执行模型的访存延迟 总被引:1,自引:0,他引:1
块执行模型通过将串行程序划分成一系列可并行执行的指令块来挖掘应用中潜在的指令级并行性.访存延迟是阻碍块执行模型提高指令级并行性的主要因素之一,而数据预取技术在传统执行模型中可有效降低访存延迟,对块执行模型也同样具有较强的适应性.本文分析了在块执行模型中引入数据预取机制的可行性,并从cache命中率、访存指令的延迟等方面验证了数据预取在块执行模型中的作用,仿真结果表明数据预取可有效降低块执行模型中的访存延迟. 相似文献
6.
谓词执行能使分片式处理器充分利用众多的执行单元,开发指令级并行性.但因此形成的超块也使得分支误预测代价增大,所以提高分支预测器的性能至关重要.本文提出一种基于剖析信息决策的谓词执行技术,该技术利用剖析信息对谓词执行前后的执行周期进行估算,从而对分支的谓词执行进行决策.该技术使分支预测器的命中率提高了0.68%~3.50%,使系统性能提高了1.67%~8.33%.同时,利用select指令表示谓词化指令也消除了重命名阶段寄存器多定义问题. 相似文献
7.
8.
9.
10.
11.
Discovering and exploiting instruction level parallelism in code will be key to future increases in microprocessor performance. What technical challenges must compiler writers meet to better use ILP? Instruction level parallelism allows a sequence of instructions derived from a sequential program to be parallelized for execution on multiple pipelined functional units. If industry acceptance is a measure of importance, ILP has blossomed. It now profoundly influences the design of almost all leading edge microprocessors and their compilers. Yet the development of ILP is far from complete, as research continues to find better ways to use more hardware parallelism over a broader class of applications 相似文献
12.
《Parallel Computing》1999,25(13-14):1741-1783
Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in form of a program representation. Next compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed. 相似文献
13.
程序中大量分支指令的存在,严重制约了体系结构和编译器开发并行性的能力。有效发掘指令级并行性的一个主要挑战是要克服分支指令带来的限制。利用谓词执行可有效地删除分支,将分支指令转换为谓词代码,从而扩大了指令调度的范围并且删除了分支误测带来的性能损失。阐述了基于谓词代码的指令调度、软件流水、寄存器分配、指令归并等编译优化技术。设计并实现了一个基于谓词代码的指令调度算法。实验表明,对谓词代码进行编译优化,能有效提高指令并行度,缩短代码执行时间,提高程序性能。 相似文献
14.
《Parallel and Distributed Systems, IEEE Transactions on》2008,19(7):914-925
This paper presents the Mitosis framework, which is a combined hardware-software approach to speculative multithreading, even in the presence of frequent dependences among threads. Speculative multithreading increases single-threaded application performance by exploiting thread-level parallelism speculatively - that is, executing code in parallel even when the compiler or runtime system cannot guarantee the parallelism exists. The proposed approach is based on predicting/computing thread input values via software, through a piece of code that is added at the beginning of each thread (the pre-computation slice). A pre-computation slice is expected to compute the correct thread input values most of the time, but not necessarily always. This allows aggressive optimization techniques to be applied to the slice to make it very short. This paper focuses on the microarchitecture that supports this execution model. The primary novelty of the microarchitecture is the hardware support for the execution and validation of pre-computation slices. Additionally, this paper presents new architectures for the register file and the cache memory in order to support multiple versions of each variable and allow for efficient roll-back in case of misspeculation. We show that the proposed microarchitecture, together with the compiler support, achieves an average speedup of 2.2 for applications that conventional non-speculative approaches are not able to parallelize at all. 相似文献
15.
新型体系结构概念—虚拟寄存器与并行的指令处理部件 总被引:4,自引:1,他引:3
随着程序对地址空间的需求日益提高,研究者提出了虚拟存储器概念,使程序访问的地址空间免受物理存储器的限制。随着面向寄存器的RISC技术发展以及多发射结构中指令调度的日益重要,我们提出了虚拟寄存器的新概念,使寄存器空间不受物理寄存器堆大小的束缚,有利于指令调度和寄存器重新命名技术,提高指令级并行性ILP。此外,现代新型RISC处理机都着重于加强数据处理部件中的执行并行度,忽略了放在存储器中指令的处理。 相似文献
16.
Wen -Mei W. Hwu Scott A. Mahlke William Y. Chen Pohua P. Chang Nancy J. Warter Roger A. Bringmann Roland G. Ouellette Richard E. Hank Tokuzo Kiyohara Grant E. Haab John G. Holm Daniel M. Lavery 《The Journal of supercomputing》1993,7(1-2):229-248
A compiler for VLIW and superscalar processors must expose sufficient instruction-level parallelism (ILP) to effectively utilize the parallel hardware. However, ILP within basic blocks is extremely limited for control-intensive programs. We have developed a set of techniques for exploiting ILP across basic block boundaries. These techniques are based on a novel structure called thesuperblock. The superblock enables the optimizer and scheduler to extract more ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Superblock optimization and scheduling have been implemented in the IMPACT-I compiler. This implementation gives us a unique opportunity to fully understand the issues involved in incorporating these techniques into a real compiler. Superblock optimizations and scheduling are shown to be useful while taking into account a variety of architectural features. 相似文献
17.
Lori Carter Beth Simon Brad Calder Larry Carter Jeanne Ferrante 《International journal of parallel programming》2000,28(6):563-588
Increases in instruction level parallelism are needed to exploit the potential parallelism available in future wide issue architectures. Predicated execution is an architectural mechanism that increases instruction level parallelism by removing branches and allowing simultaneous execution of multiple paths of control, only committing instructions from the correct path. In order for the compiler to expose and use such parallelism, traditional compiler data-flow and path analysis needs to be extended to predicated code. In this paper, we motivate the need for renaming and for predicates that reflect path information. We present Predicated Static Single Assignment (PSSA) which uses renaming and introduces Full -Path Predicates to remove false dependences and enable aggressive predicated optimization and instruction scheduling. We demonstrate the usefulness of PSSA for Predicated Speculation and Control Height Reduction. These two predicated code optimizations used during instruction scheduling reduce the dependence length of the critical paths through a predicated region. Our results show that using PSSA to enable speculation and control height reduction reduces execution time from 12 to 68%. 相似文献
18.
开发利用ILP(Inst ruction-level Parallelism)是现代高性能处理器取得高性能的关键要素之一。宽发射的超标量处理器、超长指令字处理器和数据流处理器只有在并行执行多条相邻的指令时才能获得较高的性能。数据流处理器的一个关键问题是如何把指令的计算结果高效地播送给目标指令而不用读写集中式寄存器文件。对于每条目标数大于指令所能编码的目标数的指令,编译程序都要插入一棵由MOV指令构成的软件扇出树来把计算结果播送给多条目标指令。为了暴露更多的ILP给硬件执行基底,提出了一种改进的软件扇出树生成算法,本算法根据目标指令的执行概率大小以及目标指令到该指令所在块的出口的关键路径长度来计算目标指令的权值,然后对各个叶子的优先权值进行排序,再根据优先权值的顺序来构造一棵软件扇出树,以便把指令的计算结果播送给多条目标指令。实验结果发现,本算法相对于传统的软件扇出树生成算法其性能有较大的提高。 相似文献
19.
20.
Abstract machines bridge the gap between the high-level of programming languages and the low-level mechanisms of a real machine. The paper proposed a general abstract-machine-based framework (AMBF) to build instruction level parallelism processors using the instruction tagging technique. The constructed processor may accept code written in any (abstract or real) machine instruction set, and produce tagged machine code after data conflicts are resolved. This requires the construction of a tagging unit which emulates the sequential execution of the program using tags rather than actual values. The paper presents a Java ILP processor by using the proposed framework. The Java processor takes advantage of the tagging unit to dynamically translate Java bytecode instructions to RISC-like tag-based instructions to facilitate the use of a general-purpose RISC core and enable the exploitation of instruction level parallelism. We detailed the Java ILP processor architecture and the design issues. Benchmarking of the Java processor using SpecJVM98 and Linpack has shown the overall ILP speedup improvement between 78% and 173%. 相似文献