Similar literature
 18 similar articles found (search time: 218 ms)
1.
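Chen Haimin, Li Zheng, Wang Ruijiao. 《计算机应用》 (Journal of Computer Applications), 2011, 31(7): 2004-2007
Targeting the specific application environment of an embedded microprocessor with a five-stage pipeline, this paper studies branch prediction in depth and proposes a new branch prediction scheme. The scheme is compatible with cached designs. By extending the instruction bus, it predicts the direction and target address of branch instructions early, in the fetch stage, and saves the instructions and address pointers that might have executed but did not, so that they can be restored when a prediction fails. This reduces the cost of a misprediction while guaranteeing correct execution of the instruction stream. The study shows that the scheme has low hardware overhead, high prediction efficiency, and a low misprediction penalty.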

2.
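An Instruction Pipeline Structure for a RISC Microprocessor   Cited 2 times in total (self-citations: 0; by others: 2)
This paper introduces the instruction pipeline structure of a RISC microprocessor, its operating principle, and the related supporting techniques, including the generation and insertion of HELP instructions, delayed control transfer, and hardware interlocking.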

3.
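A Brief Analysis of Predicated Execution and Its Key Techniques   Cited 1 time in total (self-citations: 0; by others: 1)
In the design of superscalar and VLIW microprocessors, dependences between instructions, especially control dependences and data dependences, severely limit the exploitation of instruction-level parallelism (ILP) and therefore limit further gains in microprocessor performance. Guarded execution (conditional execution) can convert control dependences into data dependences. Specifically, it converts the instructions that are control dependent on a branch instruction into conditional instructions that are data dependent on the branch condition. A conditional instruction differs from a regular instruction in that it carries an explicit condition specifier, with the following semantics: first evaluate the instruction's execution condition; if the condition is true, perform the operation in the instruction; otherwise treat it as a no-op. Guarded execution is in essence a program transformation technique. The transformed program benefits both compiler optimization and hardware scheduling, but it requires dedicated hardware support. According to the paper, only the ARM instruction set and the IA-64 instruction set currently support conditional execution.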
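
The semantics described in this abstract (evaluate the guard, execute if true, otherwise treat the instruction as a no-op) can be illustrated with a small sketch. It is not taken from the paper: the `GuardedOp` class, the `run` interpreter, and the register names `a`, `b`, `p`, `d` are hypothetical and exist only for illustration. The sketch shows a branching function and an if-converted, branch-free sequence that computes the same result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Regs = Dict[str, int]

@dataclass
class GuardedOp:
    """One guarded instruction (hypothetical model): dest = fn(regs) if the guard holds."""
    dest: str
    fn: Callable[[Regs], int]
    guard: Optional[str] = None  # predicate register name; None means "always execute"
    negate: bool = False         # if True, execute when the predicate is false

def run(ops: List[GuardedOp], regs: Regs) -> Regs:
    """Execute a straight-line sequence of guarded ops; there is no branching between ops."""
    regs = dict(regs)
    for op in ops:
        if op.guard is not None:
            cond = bool(regs[op.guard])
            if cond == op.negate:
                continue  # guard is false: the instruction behaves as a no-op
        regs[op.dest] = op.fn(regs)
    return regs

def abs_diff_branching(a: int, b: int) -> int:
    """Original form: the two subtractions are control dependent on the branch."""
    if a > b:
        return a - b
    return b - a

# If-converted form: both subtractions are now data dependent on predicate p.
ABS_DIFF = [
    GuardedOp("p", lambda r: int(r["a"] > r["b"])),                     # p = (a > b)
    GuardedOp("d", lambda r: r["a"] - r["b"], guard="p"),               # (p)  d = a - b
    GuardedOp("d", lambda r: r["b"] - r["a"], guard="p", negate=True),  # (!p) d = b - a
]

def abs_diff_predicated(a: int, b: int) -> int:
    return run(ABS_DIFF, {"a": a, "b": b})["d"]
```

For any pair of integers, `abs_diff_predicated(a, b)` should match `abs_diff_branching(a, b)`. In the predicated form both subtractions are always fetched, and the one whose guard is false is squashed, which is the trade-off the abstract alludes to when it says hardware support is required.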

4.
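A 16-bit RISC embedded microprocessor with a 4-stage pipeline is designed. For branch instructions, the customary delayed-branch technique is not used; instead, additional hardware in the fetch stage implements branches with no delay. Internal forwarding resolves data dependences during instruction execution. A dedicated hardware stack supports nested interrupts and nested calls. The overall system is designed in Verilog HDL, and the instruction set is fairly complete. Simulation on a software platform gives initial evidence that the design is correct.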

5.
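Non-Sequential Instruction Prefetching at Basic-Block Granularity   Cited 1 time in total (self-citations: 0; by others: 1)
Instruction fetch capability has a large effect on microprocessor performance. Instruction prefetching can effectively reduce the instruction cache miss rate, improve the processor's fetch capability, and thereby improve its performance. This paper proposes a non-sequential instruction prefetching technique that is guided by branch instructions and works at basic-block granularity: each prefetch reads one complete basic block into the instruction cache. The method uses a static policy to analyze program behavior, so the hardware needed to implement it has low complexity. Simulation results show that the method effectively improves the instruction cache hit rate.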

6.
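Research on Key Architectural Techniques for High-Performance General-Purpose Microprocessors   Cited 1 time in total (self-citations: 0; by others: 1)
The X processor is a high-performance general-purpose microprocessor designed independently in China and based on the EPIC approach. The paper describes its 8-stage pipeline and the OLSM execution model, which overcome the limitations of the basic EPIC model at very small hardware cost. It designs a multi-branch prediction structure that supports parallel execution of multiple branch instructions and uses predicated execution to reduce the number of branch instructions. It also designs a two-level cache memory, proposes the DTD low-power design method, and hides memory access latency through speculative execution. Finally, it surveys development trends in high-performance general-purpose microprocessors.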

7.
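Hardware Implementation of a Branch Predictor in a 32-bit RISC Microprocessor   Cited 1 time in total (self-citations: 0; by others: 1)
This paper proposes a dynamic branch predictor based on Bi-mode and branch path history and implements it in FPGA hardware in the "龙腾R2" (Longteng R2) microprocessor, designed independently at Northwestern Polytechnical University. The proposed predictor predicts conditional branches accurately and features low latency and low power consumption.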

8.
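Zhang Shijian, Hu Weiwu. 《计算机学报》 (Chinese Journal of Computers), 2007, 30(10): 1674-1680
With the widespread use of deep-submicron processes, transient faults have become the main cause of chip failure. This paper proposes a fault-tolerant microarchitecture that inserts redundant instructions after branch instructions, using the processing bandwidth wasted by branch mispredictions to reduce the performance loss caused by redundant execution. Experimental results show that the performance loss of the technique is between 6% and 31%, with an average of 21%, which is clearly lower than that of the MBI technique and comparable to that of the DIE technique. The technique can detect transient faults occurring in every pipeline stage and recover processor state, has a short fault-detection latency, and needs little hardware overhead, making it well suited to improving the fault tolerance of embedded microprocessors with simple prediction mechanisms.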

9.
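Research and Design of a Dynamic Branch Prediction Mechanism for Embedded Processors   Cited 2 times in total (self-citations: 1; by others: 1)
Huang Wei, Wang Yuyan, Zhang Jianxiong. 《计算机工程》 (Computer Engineering), 2008, 34(21): 163-165
Targeting the specific application environment of embedded processors, this paper improves on the traditional neural-network algorithm and combines it with a customized branch target buffer (BTB) to propose a hybrid dynamic branch prediction mechanism. The mechanism uses global indexing and a custom BTB structure to predict precisely the last branch instruction in loop logic. Experimental results show that the mechanism reduces hardware complexity and improves prediction accuracy.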
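
For readers unfamiliar with neural-network branch prediction, the following is a minimal sketch of a plain perceptron predictor with a global history register. It is not the improved algorithm or the custom BTB from this paper, whose details the abstract does not give. The class name, table size, history length, and training threshold are all illustrative assumptions.

```python
class PerceptronPredictor:
    """Plain perceptron branch predictor (illustrative; not the paper's improved variant)."""

    def __init__(self, history_len: int = 8, table_size: int = 64, threshold: int = 29):
        self.table_size = table_size
        self.threshold = threshold  # assumed training threshold, chosen for illustration
        # One weight vector per table entry: weights[i][0] is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history: +1 means taken, -1 means not taken.
        self.history = [-1] * history_len

    def _output(self, pc: int) -> int:
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        """Predict taken when the perceptron output is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc: int, taken: bool) -> None:
        """Train on a misprediction or a low-confidence output, then shift the history."""
        y = self._output(pc)
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        self.history = self.history[1:] + [t]


def accuracy_on_alternating(steps: int = 300, tail: int = 100, pc: int = 0x40) -> float:
    """Fraction of correct predictions over the last `tail` steps of a T, N, T, N, ... branch."""
    predictor = PerceptronPredictor()
    correct = 0
    for i in range(steps):
        taken = (i % 2 == 0)
        if i >= steps - tail and predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / tail
```

A strictly alternating branch is linearly separable on the most recent history bit, so this sketch should learn it after a short warm-up. Real designs bound the weights to a few bits and pipeline the dot product, which this sketch omits.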

10.
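An Accurate Microprocessor Model for Branch Prediction   Cited 3 times in total (self-citations: 0; by others: 3)
In today's deeply pipelined, wide-issue microprocessors, accurate branch prediction is an indispensable technique for achieving high performance. Branch mispredictions waste a large number of clock cycles and prevent out-of-order execution from delivering its benefit. The effective performance of a wide-issue microprocessor also depends on the instruction window size and the instruction fetch width. This paper proposes a new, more accurate microprocessor model that accounts for branch prediction and the cycle penalty of branch mispredictions. Based on the statistical rule that instruction execution bandwidth is the square root of the number of available instructions in the instruction window, it gives a more accurate algorithm describing the relationship among fetch bandwidth, branch prediction accuracy, misprediction cycle penalty, instruction window size, and IPC. It also discusses the trade-offs among these parameters and their effect on program IPC. From this, one can determine the fetch-bandwidth threshold, which depends on several microprocessor parameters, and choose values for several key parameters of the microprocessor.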
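
The abstract does not reproduce the paper's algorithm, so the sketch below is only a simplified first-order model of how these parameters interact, not the authors' model. It assumes the issue rate is the smaller of the fetch width and the square root of the window size, and that every misprediction adds a fixed number of stall cycles. The function names and the `branch_fraction` parameter are assumptions introduced for illustration.

```python
import math

def steady_state_issue_rate(window_size: int, fetch_width: float) -> float:
    """Issue rate without mispredictions: limited by fetch width or by sqrt(window size)."""
    return min(fetch_width, math.sqrt(window_size))

def estimated_ipc(window_size: int,
                  fetch_width: float,
                  branch_fraction: float,
                  prediction_accuracy: float,
                  mispredict_penalty: float) -> float:
    """First-order IPC estimate (assumed model):

    CPI = 1 / issue_rate + branch_fraction * (1 - accuracy) * penalty
    """
    base_cpi = 1.0 / steady_state_issue_rate(window_size, fetch_width)
    mispredicts_per_instruction = branch_fraction * (1.0 - prediction_accuracy)
    return 1.0 / (base_cpi + mispredicts_per_instruction * mispredict_penalty)
```

Under these assumptions, a 64-entry window cannot sustain more than 8 instructions per cycle, so a fetch width above 8 gives no further benefit, which is one way to read the "fetch-bandwidth threshold" mentioned in the abstract. The model also makes it easy to see how quickly a long misprediction penalty erodes the gain from a wider machine.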

11.
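Research and Implementation of a Dynamic VLIW Scheduling Mechanism   Cited 2 times in total (self-citations: 0; by others: 2)
The VLIW architecture is an important means of exploiting ILP; its advantages are a regular, simple structure and low hardware complexity. However, relying entirely on the compiler for instruction scheduling limits further performance gains. This paper proposes a dynamic VLIW scheduling mechanism based on deterministic instruction latencies. It exploits the fact that most instructions have a fixed execution time and uses run-time information to reschedule the execution order of instructions, exploiting more ILP. Experimental results on an FPGA show that the hardware complexity of the mechanism grows linearly.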

12.
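Memory dependence prediction plays an important role in reducing memory dependence violations and improving microprocessor performance. Traditional dependence predictors have high hardware overhead and are hard to implement. Based on an analysis of the locality of memory dependences, this paper proposes a memory dependence prediction method based on instruction distance. The method exploits the locality, in instruction distance, of the instructions that cause memory dependence violations: it predicts the instruction distance of conflicting instructions and uses it to control when some memory access instructions are issued, greatly reducing the number of violations. Experimental results show that, with a hardware overhead of about 1 KB, the instruction-distance predictor raises the average number of instructions executed per clock cycle by 1.70%, and by up to 5.11%. It thus delivers a worthwhile performance gain at a small hardware cost.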

13.
Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered superscalar microarchitectures. This new class of interconnects has demands and characteristics different from traditional multiprocessor networks. In particular, in a clustered microarchitecture, a low intercluster communication latency is essential for high performance. We propose some point-to-point cluster interconnects and new improved instruction steering schemes. The results show that these point-to-point interconnects achieve much better performance than bus-based ones, and that the connectivity of the network together with effective steering schemes are key for high performance. We also show that these interconnects can be built with simple hardware and achieve a performance close to that of an idealized contention-free model.

14.
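This paper proposes a method, based on propositional calculus, for recovering high-level control structures from binary code. The method formalizes the control dependences between low-level instructions, abstracts them as propositional variables, and propagates and evaluates them along program execution paths. Specific propositional constants in the result are then used to identify the high-level control structures hidden in the low-level code. Test results show that the method detects and recovers loop and branch structures well and can also analyze and recover predicated instructions.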

15.
Current superscalar architectures strongly depend on an instruction issue queue to achieve multiple instruction issue and out-of-order execution. However, the issue queue requires a centralized structure and relies mainly on global broadcast operations to wake up and select the instructions. Therefore, a large issue queue ultimately results in a low clock rate along with a high circuit complexity. In other words, the increasing demands for a larger issue queue correspondingly impose a significant burden on achieving a higher clock speed. This paper discusses our Speculative Pre-Execution Assisted by compileR (SPEAR), a low-complexity issue queue design. SPEAR is designed to manage the small window superscalar architecture more efficiently without increasing the window size. To this end, we have first recognized that the long memory latency is one of the factors that demand a large window, and we aim at achieving early execution of the miss-causing load instructions using another hierarchy of an issue queue. We pre-execute those miss-causing instructions speculatively as an additional prefetching thread. Simulation results show that the SPEAR design achieves performance comparable to or even better than what would be obtained in superscalar architectures with a large issue queue. However, SPEAR is designed with smaller issue queues which consequently can be implemented with low hardware complexity and high clock speed.

16.
Modern microprocessors achieve high application performance at an acceptable level of power dissipation. The reorder buffer is used to commit out-of-order instructions in order. It plays a key role in modern microprocessors because performance improvement techniques rely heavily on aggressive speculation to feed wider issue, out-of-order, and deep pipelines. In terms of the power-to-performance trade-off, the reorder buffer is particularly important. This is because enlarging the reorder buffer achieves high performance, but naive scaling of the conventional reorder buffer architecture can severely increase the complexity and power consumption. In this paper, we propose low-power reorder buffer techniques for contemporary microprocessors. First, the separated reorder buffer (SROB) reduces power dissipation by deferred allocation and early release. The deferred allocation delays the SROB allocation of instructions until all their data dependencies are resolved. Then, the instructions are executed in program order and they are released faster from the SROB. The result of the instruction is written into rename buffers immediately after the execution completes. Then, the result values in the rename buffer are written into the architectural register file at the commit stage. The proposed approaches in this paper provide higher resource utilization and low power consumption.

17.
Current high-end microprocessors achieve high performance as a result of adding more features and therefore increasing complexity. This paper makes the case for a Chip-Multiprocessor based on the Data-Driven Multithreading (DDM-CMP) execution model in order to overcome the limitations of current design trends. Data-Driven Multithreading (DDM) is a multithreading model that effectively hides the communication delay and synchronization overheads. DDM-CMP avoids the complexity of other designs by combining simple commodity microprocessors with a small hardware overhead for thread scheduling and an interconnection network. Preliminary experimental results show that a DDM-CMP chip of the same hardware budget as a high-end commercial microprocessor, clocked at the same frequency, achieves a speedup of up to 18.5 while consuming 78–81% of the power of the commercial chip. Overall, the estimated results for the proposed DDM-CMP architecture show a significant benefit in terms of both speedup and power consumption, making it an attractive architecture for future processors.

18.
The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability for future accesses and often fail when memory accesses do not contain much locality. To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces 19.7% of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC improves the performance by 17.2%.
