1.

Qiao Lin (乔林), Liu Zhizhong (刘志忠), Zhang Chihong (张赤红), Su Bogong (苏伯珙) 《Journal of Software (软件学报)》 1999,10(10):02

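SHRTD (software/hardware run-time disambiguation), a method that combines software and hardware to eliminate pointer alias ambiguity at run time, applies to irreversible code. Its code space is bounded, and it has no serious code re-entrancy problem. The paper analyzes the instruction-level parallel speedup of SHRTD in detail, giving the parallel speedup after an address conflict occurs, the average parallel speedup, and the in-probability parallel speedup when address conflicts occur. The three classes of theoretical speedup introduced in the paper are of practical significance for the research and evaluation of instruction-level parallel compilation techniques.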
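
As a rough illustration of how a conflict probability can enter an average speedup, consider the following sketch. The symbols $p$ (probability that an address conflict occurs), $S_{\mathrm{nc}}$ (speedup without a conflict), $S_{\mathrm{c}}$ (speedup after a conflict) and $T_{\mathrm{seq}}$ (sequential execution time) are assumptions introduced here; the abstract does not state the paper's own definitions, which may differ. Averaging execution times rather than speedups gives:

```latex
\[
T_{\mathrm{avg}}
  = (1-p)\,\frac{T_{\mathrm{seq}}}{S_{\mathrm{nc}}}
  + p\,\frac{T_{\mathrm{seq}}}{S_{\mathrm{c}}},
\qquad
S_{\mathrm{avg}}
  = \frac{T_{\mathrm{seq}}}{T_{\mathrm{avg}}}
  = \frac{1}{\dfrac{1-p}{S_{\mathrm{nc}}} + \dfrac{p}{S_{\mathrm{c}}}} .
\]
```

For example, with the assumed values $p = 0.1$, $S_{\mathrm{nc}} = 4$ and $S_{\mathrm{c}} = 2$, the denominator is $0.225 + 0.05 = 0.275$, so $S_{\mathrm{avg}} \approx 3.64$.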

2.

A New Method for Eliminating Pointer Alias Ambiguity at Run Time (total citations: 1; self-citations: 1; citations by others: 0)

Tang Zhizhong (汤志忠), Qiao Lin (乔林), Zhang Chihong (张赤红), Su Bogong (苏伯珙) 《Journal of Software (软件学报)》 1999,10(7):685-689

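The paper proposes SHRTD (software/hardware run-time disambiguation), a new method that combines software and hardware to eliminate pointer alias ambiguity at run time. To delay incorrect run-time memory accesses and their successor operations, SHRTD's functional units execute NOP operations. To keep the execution order of all delayed operations consistent, the order of the functional units that execute NOPs and the number of NOPs are determined at compile time. SHRTD applies to irreversible code; its code space is bounded, and it has no serious code re-entrancy problem. The new method effectively solves the pointer aliasing problem and makes it possible to obtain potential instruction-level parallel speedup.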

3.

A System Design Method Based on Thread Integration

Jiang Shubo (蒋书波), Liu Zhonghui (刘仲辉), Cheng Mingxiao (程明霄) 《Computer Engineering and Design (计算机工程与设计)》 2008,29(6):1380-1383

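Running the tasks of an embedded system in parallel is a basic means of improving system performance. After analyzing the main factors that affect embedded system performance, the paper adopts a thread-based parallel design method for embedded systems and uses instruction-level parallelism to improve performance. It mainly discusses how thread integration is implemented: compilation techniques merge multiple threads in the instruction-level code to achieve task parallelism. The method is applied to the design of a display module for instruments and meters.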

4.

TTA Code Optimization Based on Minimum-Delay Heuristic Search

Wang Zhenghua (王正华), Guo Wei (郭炜), Wei Jizeng (魏继增) 《Computer Engineering (计算机工程)》 2010,36(10):282-284

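To address pipeline conflicts, scheduling deadlock, resource conflicts, and related problems in instruction scheduling during code generation for the transport triggered architecture (TTA), the paper presents a genetic search algorithm model based on minimum delay and integrates software bypassing optimization and dynamic resource allocation optimization into the model. Experimental results show that the algorithm produces high-quality parallel code: for more than 90% of the test cases, the instruction-level parallelism is higher than that obtained by list scheduling.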

5.

Design and Implementation of a Compiler for a Clustered VLIW DSP (total citations: 5; self-citations: 0; citations by others: 5)

Hu Dinglei (胡定磊), Chen Shuming (陈书明), Liu Chunlin (刘春林) 《Journal of Chinese Computer Systems (小型微型计算机系统)》 2006,27(2):348-353

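Very long instruction word (VLIW) is the architecture commonly adopted by high-end DSPs. A VLIW DSP has no hardware mechanism for scheduling or conflict resolution, so its performance depends entirely on the quality of the compiler's optimizations. Based on the retargetable compiler infrastructure IMPACT, an optimizing compiler was designed and implemented for the clustered VLIW DSP YHFT-D4. The paper focuses on the definition of retargeting information, code annotation, support for SIMD instructions, clustered register allocation, the exploitation of instruction-level parallelism, and the resolution of resource conflicts. Experimental results show that the compiler achieves good optimization results.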

6.

Register Renaming in Instruction Scheduling

Zhang Junchao (张军超), Zhang Zhaoqing (张兆庆) 《Computer Engineering (计算机工程)》 2005,31(23):8-10

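Dependences between instructions are the main obstacle that keeps instruction scheduling from being effective and thereby limits instruction-level parallelism. Register renaming is an important technique for resolving control dependences and data dependences. The paper studies and implements a register renaming technique for instruction scheduling. It achieves speedups of about 5% and 3% on 164.gzip and 186.crafty, respectively.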
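
The general idea of register renaming can be sketched as follows. This is a minimal illustration, not the technique implemented in the paper: the instruction encoding `(dest, [srcs])`, the function name `rename_registers`, and the fresh names `p0`, `p1`, ... are all assumptions made for the example. Giving every write a fresh register removes write-after-read and write-after-write dependences, so only true (read-after-write) dependences remain to constrain the scheduler.

```python
from itertools import count


def rename_registers(instrs):
    """Rename every destination to a fresh register (hypothetical helper).

    instrs: list of (dest, [srcs]) tuples over architectural register names.
    Returns a list of the same shape in which each write has a unique name.
    """
    fresh = (f"p{i}" for i in count())  # unbounded supply of new names
    current = {}                        # architectural name -> latest new name
    renamed = []
    for dest, srcs in instrs:
        # Sources read the most recent definition; live-in registers keep
        # their original names because nothing in this block defined them.
        new_srcs = [current.get(s, s) for s in srcs]
        new_dest = next(fresh)
        current[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


program = [
    ("r1", ["r2", "r3"]),  # r1 = r2 op r3
    ("r4", ["r1"]),        # reads r1: true dependence on the first instruction
    ("r1", ["r5"]),        # rewrites r1: WAW with the first, WAR with the second
    ("r6", ["r1"]),        # reads the second definition of r1
]
renamed = rename_registers(program)
for dest, srcs in renamed:
    print(dest, "<-", srcs)
```

After renaming, the third instruction writes `p2` instead of `r1`, so it no longer has to wait for the first two and can be scheduled earlier.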

7.

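Research on Predicate Analysis Techniques in Instruction-Level Parallelism (total citations: 2; self-citations: 0; citations by others: 2)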

Lu Yunzhao (芦运照), Zhang Zhaoqing (张兆庆), Lian Ruiqi (连瑞琦) 《Chinese Journal of Computers (计算机学报)》 2003,26(10):1337-1342

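Predication support is a new feature of the IA-64 architecture. It provides more opportunities to exploit instruction-level parallelism, and it also makes compiler design harder. Predicates are the basis of conditional execution and a new way to increase instruction-level parallelism. The paper presents a predicate analysis technique based on the predicate partition graph, first designed and implemented in ORC (the IA-64 Open Research Compiler), and its application to instruction scheduling. Predicate analysis is used to build a predicate relation database, which the instruction scheduler queries to increase instruction-level parallelism. The paper focuses on constructing the predicate partition graph, the core of the predicate relation database, and on computing and querying predicate relations on top of it. Experimental results show that predicate analysis has a significant optimization effect.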

8.

Research on Code Reordering Techniques in Instruction Cache Optimization

Zhang Dingfei (张定飞), Zhao Kejia (赵克佳), Huang Chun (黄春) 《Computer Engineering and Applications (计算机工程与应用)》 2006,42(7):28-30,68

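Code reordering is an important optimization for raising the instruction cache hit rate and improving program performance. The paper introduces several major code reordering techniques and analyzes and compares them in depth in terms of reordering granularity, the stage at which reordering is performed, the treatment of conflicts, and algorithmic cost.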

9.

Cooperative Global Instruction Scheduling and Register Allocation (total citations: 1; self-citations: 1; citations by others: 0)

Wu Chengyong (吴承勇), Lian Ruiqi (连瑞琦), Zhang Zhaoqing (张兆庆), Qiao Ruliang (乔如良) 《Chinese Journal of Computers (计算机学报)》 2000,23(5):493-499

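Instruction-level parallelism is an important feature of modern high-performance processors, and the compiler is crucial to exploiting the parallel processing capability of such processors. The paper discusses the core problems of instruction-level parallel compilation: global instruction scheduling and register allocation. Against the background of the authors' compiler system for a new explicitly parallel architecture microprocessor, it describes the phase-ordering problem between instruction scheduling and register allocation that arises in the back-end design of such compilers, and the cooperative global instruction scheduling and register allocation method proposed to solve it.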

10.

A Parallel Design Method for DSP Code Based on Critical Paths and Deadlock Freedom

Chu Yaojun (初耀军) 《Computer and Information Technology (电脑与信息技术)》 2009,17(6):56-59,71

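The paper mainly introduces the conventional code development flow for the TMS320C64x DSP. Data dependences and the critical path are used to determine which instructions can execute in parallel and thus to order instruction execution sensibly. A producer-consumer algorithm built on P/V operations solves mutual exclusion on shared resources, so that registers and memory avoid write-write and read-write conflicts between instructions. The banker's algorithm is then used to check resource usage and avoid deadlock. Combining the three yields deadlock-free parallel code and provides an effective method for assembly-language programming on pipelined architectures. The paper argues that the method is effective and practical, and that it provides a basis for executing parallel code quickly and fully.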
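
The banker's algorithm mentioned above centers on a safety check, which can be sketched as follows. This is the textbook form of the check, not the paper's DSP-specific use of it: the function name `is_safe` and the mapping of "processes" to instruction groups and "resources" to registers or functional units are assumptions, since the abstract does not say how the algorithm is applied.

```python
def is_safe(available, allocation, need):
    """Textbook banker's-algorithm safety check (illustrative sketch).

    available:  free units of each resource type
    allocation: allocation[i][j] = units of resource j held by process i
    need:       need[i][j] = units of resource j that process i may still request
    Returns (safe, order), where order is one sequence in which the
    processes can run to completion.
    """
    work = list(available)
    finished = [False] * len(allocation)
    order = []
    progressed = True
    while progressed:
        progressed = False
        for i, (held, wanted) in enumerate(zip(allocation, need)):
            # A process can finish if its remaining need fits in what is free.
            if not finished[i] and all(n <= w for n, w in zip(wanted, work)):
                # On completion it releases everything it holds.
                work = [w + h for w, h in zip(work, held)]
                finished[i] = True
                order.append(i)
                progressed = True
    return all(finished), order


available = [3, 3, 2]
allocation = [[0, 1, 0], [2, 0, 0], [3, 0, 2], [2, 1, 1], [0, 0, 2]]
need = [[7, 4, 3], [1, 2, 2], [6, 0, 0], [0, 1, 1], [4, 3, 1]]
safe, order = is_safe(available, allocation, need)
print(safe, order)
```

A request is granted only if the state that would result still passes this check; otherwise the requester waits, which is how the algorithm avoids deadlock.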

11.

Trace Software Pipelining

Wang Jian, Andreas Krall, M. Anton Ertl 《Journal of Computer Science and Technology (计算机科学技术学报)》 1995,10(6):481-490

Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches. This paper presents a novel global software pipelining technique, called Trace Software Pipelining, targeted to instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines. Trace software pipelining applies a global code scheduling technique to compact the original loop body. The resulting loop is called a trace software pipelined (TSP) code. The trace software pipelined code can be directly executed with special architectural support or can be transformed into a globally software pipelined loop for the current VLIW and superscalar processors. Thus, exploiting parallelism across all iterations of a loop can be completed through compacting the original loop body with any global code scheduling technique. This makes our new technique very promising in practical compilers. Finally, we also present the preliminary experimental results to support our new approach.

12.

Branch effect reduction techniques

Uht A.K. Sindagi V. Somanathan S. 《Computer》1997,30(5):71-81

Branch effects are the biggest obstacle to gaining significant speedups when running general purpose code on instruction level parallel machines. The article presents a survey which compares current branch effect reduction techniques, offering hope for greater gains. We believe this survey is timely because research is bearing much fruit: speedups of 10 or more are being demonstrated in research simulations and may be realized in hardware within a few years. The hardware required for large scale exploitation is great, but the density of transistors per chip is increasing exponentially, with estimates of 50 to 100 million transistors per chip by the year 2000.

13.

Simultaneous multithreading: a platform for next-generation processors

Eggers S.J. Emer J.S. Levy H.M. Lo J.L. Stamm R.L. Tullsen D.M. 《Micro, IEEE》1997,17(5):12-19

Simultaneous multithreading is a processor design which consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater.

14.

Parallelized direct execution simulation of message-passing parallel programs

Dickens P.M. Heidelberger P. Nicol D.M. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(10):1090-1105

As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper, we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

15.

Profile-assisted instruction scheduling

William Y. Chen Scott A. Mahlke Nancy J. Warter Sadun Anik Wen-Mei W. Hwu 《International journal of parallel programming》1994,22(2):151-181

Instruction schedulers for superscalar and VLIW processors must expose sufficient instruction-level parallelism to the hardware in order to achieve high performance. Traditional compiler instruction scheduling techniques typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase instruction-level parallelism for the frequent execution scenarios at the expense of the less frequent ones. Profile information identifies these important execution scenarios in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence. Profile-assisted code scheduling techniques have been incorporated into the IMPACT-I compiler. These techniques are acyclic global scheduling and software pipelining. This paper describes the scheduling algorithms, highlights the modifications required to use profile information, and explains the hardware and compiler support for dealing with hazards that arise from aggressive use of profile information. The effectiveness of these profile-based scheduling techniques is evaluated for a range of superscalar and VLIW processors.

16.

Superspeculative microarchitecture for beyond AD 2000

Lipasti M.H. Shen J.P. 《Computer》1997,30(9):59-66

Based on their research at Carnegie Mellon University, the authors argue for billion-transistor uniprocessors. They divide the important implementation problems into three components: instruction flow, register dataflow, and memory dataflow. They also argue for trace caches and advanced branch prediction. Their article, however, focuses on using massive speculation at all levels to improve performance. They claim that without this much speculation, future processors will be limited by true data dependences, and will be unable to harvest enough instruction-level parallelism (ILP) to improve performance satisfactorily. Their investigations discovered large speedups on code that has traditionally not been amenable to finding ILP.

17.

SoMMA: A software-managed memory architecture for multi-issue processors

《Microprocessors and Microsystems》2020

Embedded processors rely on the efficient use of instruction-level parallelism to answer the performance and energy needs of modern applications. Though improving performance is the primary goal for processors in general, it might lead to a negative impact on energy consumption, a particularly critical constraint for current systems. In this paper, we present SoMMA, a software-managed memory architecture for embedded multi-issue processors that can reduce energy consumption and energy-delay product (EDP), while still providing an increase in memory bandwidth. We combine the use of software-managed memories (SMM) with the data cache, and leverage the lower energy access cost of SMMs to provide a processor with reduced energy consumption and EDP. SoMMA also provides a better overall performance, as memory accesses can be performed in parallel, with no cost in extra memory ports. Compiler-automated code transformations minimize the programmer's effort to benefit from the proposed architecture. The approach shows average speedups of 1.118x and 1.121x, while consuming up to 11% and 12.8% less energy when comparing two modified ρVEX processors and their baselines, at full-system level comparisons. SoMMA also shows reduction of up to 41.5% on full-system EDP, maintaining the same processor area as baseline processors.

18.

A Software Fanout Tree Generation Algorithm Based on Profiling Information and Critical Path Length

Zeng Bin (曾斌), An Hong (安虹), Wang Li (王莉) 《Computer Science (计算机科学)》 2010,37(3):248-252

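Exploiting ILP (instruction-level parallelism) is one of the key factors by which modern high-performance processors achieve their performance. Wide-issue superscalar processors, VLIW processors, and dataflow processors reach high performance only when they execute multiple adjacent instructions in parallel. A key problem for dataflow processors is how to deliver an instruction's result efficiently to its target instructions without reading or writing a centralized register file. For every instruction that has more targets than the instruction can encode, the compiler must insert a software fanout tree built from MOV instructions to deliver the result to the multiple target instructions. To expose more ILP to the hardware execution substrate, the paper proposes an improved software fanout tree generation algorithm. The algorithm computes a weight for each target instruction from its execution probability and from the critical path length between the target instruction and the exit of the block that contains it, sorts the leaves by priority weight, and then builds the software fanout tree in that order. Experimental results show that the algorithm performs considerably better than the traditional software fanout tree generation algorithm.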
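
The weighting-and-ordering idea can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not say how probability and critical path length are combined, so the product used here is an assumption, as are the function name `build_fanout_chain`, the node naming (`producer`, `mov1`, ...), and the skewed chain shape, which was chosen because it places the most urgent targets closest to the producer.

```python
def build_fanout_chain(targets, fanout=2):
    """Build a skewed fanout tree of MOV nodes (hypothetical sketch).

    targets: list of (name, exec_probability, critical_path_len)
    fanout:  number of targets one instruction can encode (assumed >= 2)
    Returns a list of (node, [consumers]); the first node is the producer.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    # Assumed weight: execution probability times critical path length.
    ranked = sorted(targets, key=lambda t: t[1] * t[2], reverse=True)
    names = [t[0] for t in ranked]
    nodes = []
    node = "producer"
    mov_id = 0
    while len(names) > fanout:
        # Serve the most urgent targets directly and forward the rest
        # through one additional MOV instruction.
        direct, names = names[:fanout - 1], names[fanout - 1:]
        mov_id += 1
        mov = f"mov{mov_id}"
        nodes.append((node, direct + [mov]))
        node = mov
    nodes.append((node, names))
    return nodes


targets = [("a", 0.9, 10), ("b", 0.1, 3), ("c", 0.5, 8), ("d", 0.9, 2)]
tree = build_fanout_chain(targets)
for node, consumers in tree:
    print(node, "->", consumers)
```

In this example target `a` has the largest weight and receives the value directly from the producer, while the low-weight target `b` sits at the end of the chain behind two MOV instructions.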