Similar Documents
20 similar documents retrieved.
1.
2.
3.
Research on Key Architectural Technologies for High-Performance General-Purpose Microprocessors   Total citations: 1 (self-citations: 0, by others: 1)
The X processor is a high-performance general-purpose microprocessor independently designed in China and based on the EPIC philosophy. This paper introduces its 8-stage pipeline and the OLSM execution model, which overcome the limitations of the basic EPIC model at very little hardware cost. A multiple-branch prediction structure is designed to support the parallel execution of several branch instructions, while predicated execution reduces the number of branch instructions; a two-level cache memory is designed, the DTD low-power design method is proposed, and speculative execution is used to hide memory access latency. Finally, development trends of high-performance general-purpose microprocessors are discussed.

4.
田祖伟  孙光 《计算机科学》2010,37(5):130-133
The large number of branch instructions in programs severely limits the ability of the architecture and the compiler to exploit parallelism. A major challenge in effectively extracting instruction-level parallelism is overcoming the constraints imposed by branches. Predicated execution can effectively remove branches by converting branch instructions into predicated code, which widens the scope of instruction scheduling and eliminates the performance loss caused by branch mispredictions. This paper describes compiler optimization techniques based on predicated code, including instruction scheduling, software pipelining, register allocation, and instruction merging, and designs and implements an instruction scheduling algorithm for predicated code. Experiments show that compiler optimization of predicated code effectively increases instruction-level parallelism, shortens execution time, and improves program performance.
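As a hedged illustration of the if-conversion that predicated execution relies on (a minimal C sketch, not code from the paper; the function names and the threshold-sum example are assumptions introduced here), the branchy loop below is rewritten so that the condition becomes a predicate value and the update is applied unconditionally, leaving no branch to mispredict and giving the scheduler one straight-line region:

```c
#include <stdint.h>

/* Branchy version: the taken/not-taken pattern depends on the data,
 * so a hardware predictor may mispredict often. */
int64_t sum_if_branchy(const int32_t *a, int n, int32_t threshold) {
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > threshold)    /* conditional branch inside the loop body */
            sum += a[i];
    }
    return sum;
}

/* If-converted version: the condition becomes a predicate value and the
 * update executes every iteration, guarded by the predicate. */
int64_t sum_if_converted(const int32_t *a, int n, int32_t threshold) {
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        int64_t p = (a[i] > threshold);  /* predicate: 0 or 1 */
        sum += p * a[i];                 /* no branch left to mispredict */
    }
    return sum;
}
```

A compiler targeting a predicated ISA would map the guarded update onto predicated or conditional-move instructions rather than the multiply used here for portability.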

5.
As microprocessor designs move towards deeper pipelines and support for multiple instruction issue, steps must be taken to alleviate the negative impact of branch operations on processor performance. One approach is to use branch prediction hardware and perform speculative execution of the instructions following an unresolved branch. Another technique is to eliminate certain branch instructions altogether by translating the instructions following a forward branch into predicate form. Both these techniques are employed in many current processor designs. This paper investigates the relationship between branch prediction techniques and branch predication. In particular, we are interested in how using predication to remove a certain class of poorly predicted branches affects the prediction accuracy of the remaining branches. A variety of existing predication models for eliminating branch operations are presented, and the effect that eliminating branches has on branch prediction schemes ranging from simple prediction mechanisms to the newer more sophisticated branch predictors is studied. We also examine the impact of predication on basic block size, and how the two techniques used together affect overall processor performance.

6.
This paper describes a general code‐improving transformation that can coalesce conditional branches into an indirect jump from a table. Applying this transformation allows an optimizer to exploit indirect jumps for many other coalescing opportunities besides the translation of multiway branch statements. First, dataflow analysis is performed to detect a set of coalescent conditional branches, which are often separated by blocks of intervening instructions. Secondly, several techniques are applied to reduce the cost of performing an indirect jump operation, often requiring the execution of only two instructions on a SPARC. Finally, the control flow is restructured using code duplication to replace the set of branches with an indirect jump. Thus, the transformation essentially provides early resolution of conditional branches that may originally have been some distance from the point where the indirect jump is inserted. The transformation can be frequently applied with often significant reductions in the number of instructions executed, total cache work, and execution time. In addition, we show that with branch target buffer support, indirect jumps improve branch prediction since they cause fewer mispredictions than the set of branches they replaced. Copyright © 1999 John Wiley & Sons, Ltd.
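A source-level analogue of the transformation, offered as a hedged sketch rather than the paper's SPARC-level algorithm (the classify_* functions are hypothetical): a chain of conditional branches testing the same value is coalesced into a single table-driven dispatch, which a compiler lowers to a bounds check plus one indirect jump through a table.

```c
/* Before: a chain of conditional branches on the same variable;
 * each test is a separately predicted branch. */
int classify_chain(int tag) {
    if (tag == 0) return 10;
    if (tag == 1) return 20;
    if (tag == 2) return 30;
    if (tag == 3) return 40;
    return -1;
}

/* After: all the tests are resolved by one table-indexed indirect jump;
 * a compiler lowers a dense switch like this to a bounds check plus a
 * jump through a table, which is the shape the transformation produces
 * directly at the machine level. */
int classify_table(int tag) {
    switch (tag) {
        case 0:  return 10;
        case 1:  return 20;
        case 2:  return 30;
        case 3:  return 40;
        default: return -1;
    }
}
```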

7.
Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence caused by branch instructions leads to underutilization of computational resources, degrading the performance of SIMD architectures. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, unlike graphics applications, general-purpose applications include many branch instructions, resulting in serious GPU performance degradation due to branch divergence. In this paper, we propose a concurrent warp execution (CWE) technique to reduce this performance degradation when executing general-purpose applications by increasing resource utilization. The proposed CWE selects co-warps to activate more threads in the warp, leading to concurrent execution of the combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over PDOM, 91% over DWF) with little hardware overhead.
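As a rough, hedged model of the branch-divergence problem the paper targets (plain C standing in for SIMD hardware; WARP_SIZE, the mask, and the kernel are illustrative assumptions, not the CWE mechanism), the sketch below shows why a warp must step through both sides of a data-dependent branch while an active mask discards part of the work on each side:

```c
#include <stdint.h>

#define WARP_SIZE 8   /* illustrative; real GPUs typically use 32 threads per warp */

/* Scalar model of how a SIMD/warp machine executes a data-dependent branch:
 * every lane steps through BOTH sides, and a per-lane active mask decides
 * whose results are kept.  When the mask is half set, half of the execution
 * slots are wasted on each side -- the divergence loss that warp-combining
 * schemes try to recover. */
void divergent_kernel(const int32_t in[WARP_SIZE], int32_t out[WARP_SIZE]) {
    uint8_t taken_mask = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (in[lane] > 0)
            taken_mask |= (uint8_t)(1u << lane);

    /* "if" side: issued for the whole warp, results kept only where mask = 1 */
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (taken_mask & (1u << lane))
            out[lane] = in[lane] * 2;

    /* "else" side: issued next for the whole warp, kept where mask = 0 */
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (!(taken_mask & (1u << lane)))
            out[lane] = -in[lane];
}
```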

8.
The Gmicro/200, a microprocessor developed as part of Japan's TRON (The Real-time Operating system Nucleus) project, is described. This microprogram-based processor, with a six-stage pipeline, 730,000 transistors, and on-chip caches, will serve in an engineering workstation or a high-speed graphics accelerator system. The authors discuss features of the instruction set; memory management; handling of exceptions, interrupts and traps; and the implementation of the Gmicro/200.

9.
An Accurate Microprocessor Model Incorporating Branch Prediction   Total citations: 3 (self-citations: 0, by others: 3)
In today's deeply pipelined, wide-issue microprocessors, accurate branch prediction is an indispensable technique for achieving high performance. Branch mispredictions waste a large number of clock cycles and prevent out-of-order execution from delivering its potential. The effective performance of a wide-issue microprocessor also depends on the instruction window size and the instruction fetch width. This paper proposes a new, more accurate microprocessor model that accounts for branch prediction and the cycle penalty of branch mispredictions. Based on the statistical rule that instruction execution bandwidth scales as the square root of the number of available instructions in the instruction window, it gives a more precise algorithm describing the relationship among fetch bandwidth, branch prediction accuracy, misprediction penalty, instruction window size, and IPC, and discusses the trade-offs among these parameters and their influence on program IPC. From this, a fetch bandwidth threshold that depends on several microprocessor parameters can be determined, along with the choice of several key parameters in the microprocessor.
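The abstract does not reproduce the model itself; the following is only an illustrative interval-style estimate of the kind of relationship it describes, under the stated square-root law. The symbols F (fetch bandwidth), W (instruction window size), m (mispredictions per instruction), c (misprediction penalty in cycles), and N (committed instructions) are assumptions introduced here, not the paper's notation.

```latex
% Illustrative steady-state estimate (not the paper's exact model):
% the sustainable issue rate is bounded both by the fetch bandwidth F
% and by the square-root law over the window size W,
\[
  \mathrm{IPC}_{\text{steady}} \approx \min\!\bigl(F,\ \sqrt{W}\bigr),
\]
% and, over N committed instructions with m mispredictions per instruction,
% each costing c penalty cycles, the delivered IPC degrades to
\[
  \mathrm{IPC} \approx \frac{N}{\dfrac{N}{\mathrm{IPC}_{\text{steady}}} + N\,m\,c}
             = \frac{\mathrm{IPC}_{\text{steady}}}{1 + m\,c\,\mathrm{IPC}_{\text{steady}}}.
\]
```

The min term reflects the fetch-bandwidth threshold the abstract mentions: raising F beyond roughly the square root of W no longer raises steady-state IPC, while the denominator shows how the misprediction rate and penalty erode whatever the window can supply.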

10.
The speculative execution of threads in a multithreaded architecture, plus the branch prediction used in each thread execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they are actually needed by the program. In this study, we examine how the load instructions executed on what turn out to be incorrectly executed program paths impact memory system performance. We find that incorrect speculation (wrong execution) at the instruction and thread levels provides an indirect prefetching effect for the later correct execution paths and threads. By continuing to execute the mispredicted load instructions even after the instruction- or thread-level control speculation is known to be incorrect, the cache misses observed on the correctly executed paths can be reduced by 16 to 73 percent, with an average reduction of 45 percent. However, we also find that these extra loads can increase the amount of memory traffic and can pollute the cache. We introduce the small, fully associative wrong execution cache (WEC) to eliminate the potential pollution that can be caused by the execution of the mispredicted load instructions. Our simulation results show that the WEC can improve the performance of a concurrent multithreaded architecture by up to 18.5 percent on the benchmark programs tested, with an average improvement of 9.7 percent, due to the reductions in the number of cache misses.

11.
唐遇星  邓鹍  窦勇  周兴铭 《计算机学报》2007,30(11):1972-1981
Branch instructions and branch mispredictions limit a processor's ability to exploit instruction-level parallelism (ILP). Converting control dependences into data dependences through if-conversion or predicated execution is an effective way to reduce branch prediction overhead. This paper proposes Dynamic Implicit Predication (DIP), a dynamic implicit predicated execution mechanism based on a simplified trace structure, whereas earlier related work focused mainly on having the compiler explicitly generate static predicated instructions for wide-issue processors. Without help from the compiler or other binary tools, DIP identifies instruction fragments eligible for predication at run time, performs the instruction transformation and optimization, and uses the optimized instruction traces in subsequent execution. Simulations on SPEC2000 show that DIP effectively avoids branch mispredictions and increases parallelism, raising the IPC of individual programs by 10.3% on average and achieving an average speedup of 7.59% across the benchmarks.

12.
张仕健  胡伟武 《计算机学报》2007,30(10):1674-1680
With the widespread adoption of deep-submicron processes, transient faults have become a major cause of chip failure. This paper proposes a fault-tolerant microarchitecture that inserts redundant instructions after branch instructions, exploiting the processing bandwidth wasted by branch mispredictions to reduce the performance loss caused by redundant execution. Experimental results show that the performance loss of this technique is between 6% and 31%, 21% on average, which is clearly lower than that of the MBI technique and comparable to that of the DIE technique. The technique can detect transient faults occurring at every pipeline stage and recover the processor state, with short fault-detection latency and low hardware overhead, making it well suited to improving the fault tolerance of embedded microprocessors with simple prediction mechanisms.

13.
Traditional predicate optimization techniques are implemented on von Neumann computers; they optimize only the data flow and do not consider the separation of instructions and data found in Harvard architectures. The BWDSP10x is a Harvard architecture with separate instruction and data paths; it supports very long instruction words and provides predicated execution not only for data but also for addresses. This paper proposes a region-based method that supports optimization for both predicate modes: before a comparison is performed, the types of its two operands are examined to decide which predicate mode to apply, so that address comparisons need not be transferred to general-purpose registers. Experimental results show that the optimization significantly saves CPU time and bandwidth, greatly reduces the number of branch instructions, and improves program performance by 28.4%.

14.
By executing two or more threads concurrently, Simultaneous MultiThreading (SMT) architectures are able to exploit both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) from the increased number of in-flight instructions that are fetched from multiple threads. However, due to incorrect control speculation, a significant number of these in-flight instructions are discarded from the pipelines of SMT processors (a direct consequence of these pipelines getting wider and deeper). Although increasing the accuracy of branch predictors may reduce the number of instructions so discarded, the prediction accuracy cannot be easily scaled up, since aggressive branch prediction schemes strongly depend on the predictability inherent in the application programs. In this paper, we present an efficient thread scheduling mechanism for SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling): it is easy to implement and allows an SMT processor to selectively perform speculative execution of threads according to the confidence level of branch predictions, hence preventing wrong-path instructions from being fetched. SAFE-T provides an average reduction of 57.9% in the number of discarded instructions and improves the instructions per cycle (IPC) performance by 14.7% on average over the ICOUNT policy across the multi-programmed workloads we simulate. This paper is an extended version of the paper, “Speculation Control for Simultaneous Multithreading,” which appeared in the Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.
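A minimal C sketch of confidence-driven fetch gating in the spirit of SAFE-T, assuming a per-thread count of unresolved low-confidence branch predictions; the structure, threshold, and function names are illustrative assumptions, not the paper's implementation:

```c
#include <stdbool.h>

#define NUM_THREADS    4
#define CONF_THRESHOLD 3   /* illustrative: max unresolved low-confidence branches */

/* Per-thread speculation state tracked by the front end (illustrative). */
typedef struct {
    int  low_conf_branches;  /* unresolved branches predicted with low confidence */
    bool fetch_enabled;
} thread_state_t;

/* Called each cycle before selecting which threads may fetch: a thread whose
 * in-flight speculation rests on too many low-confidence predictions is
 * throttled, so likely wrong-path instructions never enter the pipeline. */
void throttle_fetch(thread_state_t threads[NUM_THREADS]) {
    for (int t = 0; t < NUM_THREADS; t++) {
        threads[t].fetch_enabled =
            (threads[t].low_conf_branches < CONF_THRESHOLD);
    }
}
```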

15.
A method for an introductory optimization of multithreaded Java programs for execution on clusters of Java Virtual Machines (JVMs) inside desktop grids is presented. It is composed of two stages. In the first stage, a clustering algorithm is applied to extended macro data flow graphs generated from the byte-code compiled for multithreaded Java programs. These graphs account for data and control dependencies in the programs, including conditional branch instructions annotated with branch statistics derived from execution traces for representative sets of data. In the second stage, list scheduling is performed based on the Earliest Task First (ETF) heuristic, in which node mapping onto JVMs accounts for mutually exclusive paths outgoing from conditional branch nodes. The presented object placement optimization algorithm is part of the DG-ADAJ environment.

16.
A dynamic translation system must perform an address translation every time it executes an indirect branch instruction, and this process is one of the main sources of the translator's performance overhead. Translation systems without special hardware support commonly use software prediction to reduce the cost of address translation, but the accuracy of software prediction is low, limiting its benefit to overall translator performance. The low-overhead correlated software prediction (LOCSP) algorithm uses code copies to distinguish the different branching scenarios of the instruction to be predicted: the multiple dynamic execution paths reaching that instruction are separated into several disjoint code-cache copies, each with its own independent prediction chain. Correlated prediction is thus achieved without increasing the dynamic instruction count, significantly improving the accuracy of software prediction. In addition, guided by dynamic profiling, LOCSP applies correlated software prediction only to the hard-to-predict hot indirect branches, further reducing prediction overhead. Experiments show that, compared with plain software prediction, LOCSP raises the average prediction accuracy from 58.9% to 82.2% and reduces the overall performance overhead of the translation system by 19.3% on average and by up to 41.9%, while the static code size increases by only 2.4%.
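For context, a hedged C sketch of the baseline inline software prediction that LOCSP refines (the code-cache table, dispatch_indirect, and lookup_code_cache are hypothetical names; LOCSP's own contribution, duplicating such a stub per incoming path so that each copy keeps an independent prediction, is described in the abstract but not shown here):

```c
#include <stdint.h>
#include <stddef.h>

typedef void (*translated_fn)(void);

/* Hypothetical code cache mapping guest targets to translated host code;
 * a real translator would use a hash table here. */
typedef struct { uint64_t guest; translated_fn host; } cache_entry_t;
static cache_entry_t code_cache[64];
static size_t        code_cache_len;

static translated_fn lookup_code_cache(uint64_t guest_target) {
    for (size_t i = 0; i < code_cache_len; i++)
        if (code_cache[i].guest == guest_target)
            return code_cache[i].host;
    return 0;   /* a real system would translate a new block here */
}

/* Baseline inline software prediction for one translated indirect branch:
 * the last observed guest target and its translation are cached inline,
 * so the common case skips the lookup entirely. */
static uint64_t      predicted_guest = 0;
static translated_fn predicted_host  = 0;

void dispatch_indirect(uint64_t guest_target) {
    if (guest_target == predicted_guest && predicted_host != 0) {
        predicted_host();                        /* prediction hit: cheap */
        return;
    }
    translated_fn host = lookup_code_cache(guest_target);  /* prediction miss */
    if (host == 0)
        return;                                  /* untranslated target */
    predicted_guest = guest_target;              /* update inline prediction */
    predicted_host  = host;
    host();
}
```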

17.
袁平海  曾庆凯 《软件学报》2017,28(10):2583-2598
Return-oriented programming (ROP) is widely used in software exploit attacks to construct attack code. By updating ROP construction techniques, this work demonstrates that Turing-complete pure-ROP attack code is generally realizable within software modules. The difficulty in building functional ROP code lies in implementing conditional branch logic. A deep analysis of the execution context of conditional branch machine instructions reveals that the traditional understanding of these instructions has certain limitations. In fact, existing code contains a small number of conditional branch instructions whose two branch paths both begin with reusable code fragments (gadgets), and these two gadgets fetch the address of the next gadget from different memory locations; code fragments beginning with such conditional branch instructions can therefore help ROP implement conditional branch logic. Such fragments are called if-gadgets. Experiments on Linux and Windows show that if-gadgets are ubiquitous, even in small everyday executables. Experiments on the Binutils suite show that, with if-gadgets, constructing Turing-complete ROP code is much easier than with traditional methods. On mainstream operating systems such as Ubuntu, executables carry no protection mechanisms against ROP attacks by default, so attackers can construct pure-ROP attack code within these software modules to launch attacks. ROP is thus a far more serious threat to system security than previously believed.

18.
Kitahara, T.; Satoh, T. IEEE Micro, 1990, 10(3): 68-75
A high-end microprocessor, the Gmicro/300, based on the TRON architecture specification, is described. In contrast to other RISC (reduced-instruction-set-computer) or CISC (complex-instruction-set-computer) chips, it executes an instruction with a memory operand and a register operand in one clock cycle. Separate cache memories improve performance by more than 13.8%. The Gmicro/300's pipeline structure, its other one-cycle structures, and the effects of using internal caches are discussed.

19.
Binary translation is the main approach to software migration. Dynamic binary translation, constrained by dynamic execution, cannot perform deep optimization and therefore achieves relatively low efficiency, while traditional static binary translation struggles with indirect branches; moreover, most existing optimization methods concentrate on the intermediate-code level and pay little attention to the large number of redundant instructions in the target code. To address this, the paper proposes a static binary translation framework, SQEMU, and, on top of it, an optimization algorithm that removes redundant instructions from the target code. The algorithm analyzes the target code to build an instruction-specific data dependence graph (IDDG) and then uses this graph to combine liveness analysis and peephole optimization, effectively eliminating redundant instructions. Experimental results show that optimizing the target code with this algorithm significantly improves execution efficiency, by up to 42%; overall performance tests show that after optimization translation efficiency improves by about 20% on the nbench suite and about 17% on SPEC CINT2006.
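A minimal sketch of the liveness-based redundancy elimination that the abstract combines with peephole optimization, written over a toy three-address IR; the instr_t layout, register count, and eliminate_dead are assumptions for illustration, not SQEMU's IDDG data structures:

```c
#include <stdbool.h>

#define NUM_REGS 8

/* Tiny three-address IR: dst = src1 op src2 (illustrative, not SQEMU's IR). */
typedef struct {
    int  dst, src1, src2;   /* register numbers, -1 if unused */
    bool has_side_effect;   /* stores, calls, etc. must never be removed */
    bool dead;              /* marked by the pass below */
} instr_t;

/* Backward liveness scan over one straight-line block: an instruction whose
 * destination is not live afterwards (and that has no side effect) defines a
 * value nobody reads, so it is redundant and can be deleted. */
void eliminate_dead(instr_t *code, int n, const bool live_out[NUM_REGS]) {
    bool live[NUM_REGS];
    for (int r = 0; r < NUM_REGS; r++) live[r] = live_out[r];

    for (int i = n - 1; i >= 0; i--) {
        instr_t *ins = &code[i];
        if (!ins->has_side_effect && ins->dst >= 0 && !live[ins->dst]) {
            ins->dead = true;       /* result never used: remove */
            continue;               /* its sources do not become live */
        }
        if (ins->dst  >= 0) live[ins->dst]  = false;  /* killed here */
        if (ins->src1 >= 0) live[ins->src1] = true;   /* used here */
        if (ins->src2 >= 0) live[ins->src2] = true;
    }
}
```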

20.
Predicated execution allows a tiled processor to make full use of its many execution units and exploit instruction-level parallelism. However, the hyperblocks formed in this way also increase the branch misprediction penalty, so improving the branch predictor's performance is critical. This paper proposes a profile-guided predication technique: profiling information is used to estimate the execution cycles before and after predication, and this estimate drives the decision of whether to predicate a branch. The technique raises the branch predictor hit rate by 0.68% to 3.50% and improves system performance by 1.67% to 8.33%. At the same time, representing predicated instructions with select instructions eliminates the multiple-definition problem for registers at the rename stage.
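A hedged C sketch of a profile-driven predication decision of the kind the abstract describes: estimated cycles with and without if-conversion are compared per branch. The cost model, field names, and should_predicate are illustrative assumptions, not the paper's heuristic.

```c
#include <stdbool.h>

/* Illustrative profile data for one branch (not the paper's model). */
typedef struct {
    double taken_prob;         /* fraction of executions in which the branch is taken */
    double mispredict_rate;    /* measured misprediction rate for this branch */
    int    cycles_taken;       /* cycles of the taken-path block */
    int    cycles_fallthru;    /* cycles of the fall-through block */
    int    mispredict_penalty; /* cycles lost per misprediction */
} branch_profile_t;

/* Decide whether to if-convert: predication executes both paths every time,
 * while keeping the branch pays the misprediction penalty at the profiled
 * rate.  Predicate only when the estimated predicated cost is lower. */
bool should_predicate(const branch_profile_t *p) {
    double cost_branch =
        p->taken_prob * p->cycles_taken +
        (1.0 - p->taken_prob) * p->cycles_fallthru +
        p->mispredict_rate * p->mispredict_penalty;

    double cost_predicated = p->cycles_taken + p->cycles_fallthru;

    return cost_predicated < cost_branch;
}
```

In practice the predicated estimate would also charge for the select instructions inserted to merge the two paths; they are omitted here for brevity.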

