Similar literature
19 similar documents found.
1.
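To address the slow instruction retirement caused by instructions occupying the reorder buffer for long periods in superscalar processors, a two-level retirement mechanism based on speculative execution is proposed. The scheme classifies instructions as risky or risk-free according to whether they carry a risk of exceptions or mispredictions, and slims down the reorder buffer: only instructions with exception or misprediction risk are allowed to enter it, and they are retired quickly once the risk is confirmed to be eliminated. The rename registers are separated from the reorder buffer and handle register renaming and out-of-order result write-back. Experimental results show that, with the same hardware resources, a processor based on this scheme outperforms a conventional in-order-retirement processor by more than 28.8% on average.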

2.
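The issue queue is the out-of-order control unit of a superscalar processor and a key component that plays a decisive role in overall processor performance. A dual-port issue queue structure that can effectively improve the performance of out-of-order superscalar processors is proposed. The queue estimates when each instruction can issue from the dependencies among instructions and assigns instructions to different queues. Two issue policies are compared for their effect on performance; the policy that tags the execution pipeline at the input achieves the higher IPC, with a maximum improvement of 10.68%. The effect of the number of issue-queue entries under the same issue policy is also compared: relative to a 24-entry issue queue, a 32-entry issue queue improves IPC by 2% on average and by up to 8.59%.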

3.
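To meet the differing performance and area requirements of processors in the embedded domain, and to address reorder-buffer blocking as well as the throughput imbalance and stalls that arise when reservation stations dispatch long- and short-latency instructions, a parameterized, easily configured pipelined superscalar processor is designed and optimized. By customizing the branch prediction, caches, and arithmetic units in the pipeline, RISC-V instructions are divided into five major classes for processing, execution units of different latencies are arranged in a hybrid cascaded and parallel layout, and, in what serves as the reorder buffer, ...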

4.
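Dynamically reconfigurable caches can reconfigure cache capacity, organization, mapping rules, and other properties at run time, which gives them considerable advantages in resource utilization and energy consumption. Targeting the dynamically varying issue width of very long instruction word (VLIW) processors, this work proposes using that dynamic behavior at run time to drive cache reconfiguration, thereby dynamically splitting or merging processor cores. This differs from the traditional approach of driving cache reconfiguration by cache miss rate. To smooth cache performance under frequent reconfiguration, a transition mechanism is further proposed that lets the cache move smoothly from one configuration to another. Experiments were designed to evaluate the reconfiguration policy; simulation results show that, within 2,000 cycles after reconfiguration, the method lowers the cache miss rate by 16% on average and improves system performance.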

5.
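1 Introduction. In microprocessor design, exploiting instruction-level parallelism (ILP) to improve microprocessor system performance has run into severe limits. Research on superscalar microprocessors with ever wider issue has become extremely complex and of little value. However, if data-level parallelism is exploited alongside instruction-level parallelism, theoretical analysis shows that microprocessor performance can be improved significantly: the equivalent IPC (instructions issued per clock cycle) can improve by a factor of 20 to more than 40 relative to a superscalar microprocessor. Exploiting data-level parallelism in microprocessor system design therefore has important theoretical significance and practical value.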

6.
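A power model is built and implemented according to the architecture of a reconfigurable processor. The model abstracts the processor's circuit-level characteristics, estimates static peak power from architecture-level attributes and process parameters, collects dynamic power statistics with a performance simulator, and implements gating techniques under three conditional-clocking schemes. Compared with a general-purpose superscalar microprocessor, the reconfigurable processor achieves an average speedup of 3.59, while the average growth rate in power is only 1.48. Experiments also show that the simple CC1 gating technique effectively reduces the power consumption and hardware complexity of the reconfigurable system. The model lays a foundation for low-power design of reconfigurable processors and for research on compiler-level low-power optimization.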

7.
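On a hybrid-architecture digital signal processor that supports both superscalar and very long instruction word (VLIW) execution modes, branch prediction is added to the superscalar structure. To limit hardware design complexity while maintaining the branch prediction hit rate, the scheme uses a gshare predictor. The Dhrystone and CoreMark benchmarks, compiled with the Open64 compiler, are run on the completed hardware. Experimental results show that adding branch prediction improves processor performance by 30% to 35%.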
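
For readers unfamiliar with gshare, the following is a minimal Python sketch of the general technique, not the design evaluated in the abstract above. The table size, the 2-bit saturating counters, the `pc >> 2` alignment shift, and the names `GsharePredictor` and `simulate` are all assumptions chosen for illustration.

```python
class GsharePredictor:
    """Illustrative gshare predictor: global history XOR branch address."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # 2-bit counters, start weakly not-taken

    def _index(self, pc):
        # Assumes 4-byte aligned instructions, so the low two PC bits are dropped.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def simulate(trace, index_bits=10):
    """Return prediction accuracy over a list of (pc, taken) pairs."""
    predictor = GsharePredictor(index_bits)
    hits = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            hits += 1
        predictor.update(pc, taken)
    return hits / len(trace)
```

Calling `simulate([(0x400, True)] * 100)` exercises a single always-taken branch: the predictor mispredicts while the history register fills and is correct afterwards. A real design would also have to decide how to handle speculative history updates, which this sketch ignores.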

8.
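Based on the characteristics of instruction likelihood of criticality (LoC) in SPEC2000 programs under different clustered superscalar processor configurations, a design method for a static LoC criticality predictor is proposed. Instruction LoC is studied, and the predictor is designed around its architecture independence and dynamic invariance. Simulation results show that applying the design to a 1×8 clustered superscalar processor raises instructions per cycle by 5.3% on average, outperforming a dynamic LoC predictor.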

9.
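Speculative execution is an effective technique for improving the performance of superscalar processors. This work sets out to analyze the performance potential of superscalar processors with speculative execution.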

10.
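Extending an embedded processor with multimedia processing capability strengthens its handling of multimedia data. Based on the 32-bit Longteng (龙腾) embedded processor, AltiVec and superscalar techniques are studied, and a multimedia coprocessing unit supporting AltiVec is designed for the processor. The unit uses a five-stage pipeline and applies dynamic instruction scheduling to distribute instructions across different pipelines, improving processing performance while maintaining the design frequency. In multimedia benchmark tests the unit achieves an IPC of 1.2, and with the SMIC 0.18 μm process library it runs at 350 MHz. The coprocessing unit improves the performance of the Longteng processor.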

11.
Trace-driven simulation of out-of-order superscalar processors is far from straightforward. The dynamic nature of out-of-order superscalar processors combined with the static nature of traces can lead to large inaccuracies in the results when the traces contain only a subset of executed instructions for trace reduction. In this paper, we describe and comprehensively evaluate the pairwise dependent cache miss model (PDCM), a framework for fast and accurate trace-driven simulation of out-of-order superscalar processors. The model determines how to treat a cache miss with respect to other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between the trace items. Our experimental results demonstrate that a PDCM-based simulator produces highly accurate simulation results (less than 3% error) with fast simulation speeds (62.5× on average) compared with an execution-driven simulator. Moreover, we observed that the proposed simulation method is capable of preserving a processor’s dynamic off-core memory access behavior and accurately predicting the relative performance change when a processor’s low-level memory hierarchy parameters are changed.

12.
Software developers can gain insight into software-hardware interactions by decomposing processor performance into individual cycles-per-instruction components that differentiate cycles consumed in active computation from those spent handling various miss events. Constructing accurate CPI components for out-of-order superscalar processors is complicated, however, because computation and miss event handling overlap. The authors' counter architecture, using an analytical superscalar performance model, handles overlap effects more accurately than existing methods.

13.
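To meet the demand of embedded devices for small area and high performance, a 32-bit synthesizable out-of-order processor based on the open-source RISC-V instruction set is designed. The processor incorporates key techniques such as branch prediction and dependency handling, and supports the RISC-V base integer instructions, multiplication and division, and the compressed instruction set. It uses a three-stage pipeline with in-order single issue, out-of-order execution, and out-of-order write-back, together with a Harvard architecture and the AHB bus protocol, so that instructions and data can be accessed in parallel. Functional verification was completed on an Artix-7 (XC7A35T-L1CSG324I) FPGA development board at a clock frequency of 50 MHz, with a measured power of 7.9 mW. Experimental results show that, when synthesized at the SMIC 110 nm ASIC technology node and compared under the same conditions with processors such as the ARM Cortex-M3, the system reduces area by 64% and power by 0.57 mW, making it suitable for small-area, low-power embedded applications.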

14.
The many devices built on modern microprocessors have drawn increased attention to transient soft errors. Previous strategies for instruction-level temporal redundancy in superscalar out-of-order processors suffer up to 45% performance degradation in certain applications compared to normal execution, because the redundant workload slows down normal execution. Solutions have been proposed that avoid some redundant execution by reusing the results of previously executed instructions, but they still limit instruction-level parallelism and pipeline throughput. In this paper, we propose a novel technique to close the performance gap between instruction-level temporal redundancy and normal execution. We present a set of micro-architectural extensions that implement reliability prediction and integrate it with the issue logic of a dual-instruction-stream superscalar core, and we conduct extensive evaluations to demonstrate how it solves the performance problem. Experiments show that on average it regains nearly 71.13% of the overall IPC loss caused by redundant execution. In general, it offers considerable performance and power efficiency under a high transient error rate.

15.
Billion-transistor processors will be much as they are today, just bigger, faster and wider (issuing more instructions at once). The authors describe the key problems (instruction supply, data memory supply and an implementable execution core) that prevent current superscalar computers from scaling up to 16- or 32-instructions per issue. They propose using out-of-order fetching, multi-hybrid branch predictors and trace caches to improve the instruction supply. They predict that replicated first-level caches, huge on-chip caches and data value speculation will enhance the data supply. To provide a high-speed, implementable execution core that is capable of sustaining the necessary instruction throughput, they advocate a large, out-of-order-issue instruction window (2,000 instructions), clustered (separated) banks of functional units and hierarchical scheduling of ready instructions. They contend that the current uniprocessor model can provide sufficient performance and use a billion transistors effectively without changing the programming model or discarding software compatibility.

16.
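The instruction-level parallelism attainable by out-of-order superscalar processors is increasingly limited; achieving higher instruction parallelism requires more out-of-order execution and control resources. As processor architectures evolve, value prediction can, on top of existing mainstream processor microarchitectures and at lower hardware cost, obtain higher data parallelism and further improve out-of-order execution performance. A context value predictor based on real-history feedback (RH-VTAGE) is proposed. It uses a failure list and a prediction-accuracy table to control the fed-back prediction accuracy of RH-VTAGE and to reduce the pipeline recovery overhead on mispredictions. In addition, a control counter for real-history feedback is added at the final stage of the value predictor, and adaptive confidence control logic is designed that adjusts confidence dynamically and probabilistically for different instruction types. Measured results show that, relative to other predictors, RH-VTAGE brings no significant improvement on integer programs but improves floating-point program performance by up to 31.2%.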
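
To illustrate the general idea of confidence-gated value prediction, here is a minimal Python sketch of a last-value predictor with a probabilistically incremented saturating confidence counter. It is not RH-VTAGE: the abstract does not describe that design in enough detail to reproduce, so the class name `LastValuePredictor`, the parameters `threshold`, `max_conf`, and `inc_prob`, and the reset-on-mismatch policy are all assumptions.

```python
import random


class LastValuePredictor:
    """Illustrative last-value predictor gated by a confidence counter."""

    def __init__(self, threshold=3, max_conf=3, inc_prob=1.0, seed=0):
        self.threshold = threshold   # minimum confidence needed to predict
        self.max_conf = max_conf     # saturation point of the counter
        self.inc_prob = inc_prob     # probability of incrementing on a match
        self.rng = random.Random(seed)
        self.table = {}              # pc -> [last_value, confidence]

    def predict(self, pc):
        """Return the predicted value, or None when confidence is too low."""
        entry = self.table.get(pc)
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None

    def train(self, pc, actual):
        """Update the entry for pc with the value the instruction produced."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0]   # allocate with zero confidence
        elif entry[0] == actual:
            # Probabilistic increment: a lower inc_prob demands a longer
            # run of matches before the predictor is trusted.
            if entry[1] < self.max_conf and self.rng.random() < self.inc_prob:
                entry[1] += 1
        else:
            entry[0] = actual              # learn the new value
            entry[1] = 0                   # and drop all confidence
```

Using a different `inc_prob` per instruction class is one simple way to make confidence harder to earn where a misprediction is costlier; whether RH-VTAGE does anything similar is not stated in the abstract.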

17.
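Out-of-order execution is widely used in modern microprocessor design to improve pipeline performance, but fully out-of-order structures, which both execute and retire out of order, are not common in superscalar processors, and they pose a major challenge for reference-model-based verification of processor correctness. This paper presents a new approach to verifying fully out-of-order processors, starting from the ultimate criterion for correct program behavior: programmer-visible architectural state must change in program order.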

18.
Tremblay, M.; O'Connor, J.M. IEEE Micro, 1996, 16(2): 42-50
UltraSPARC I is a second-generation superscalar processor. It is a high-performance, highly integrated, four-issue superscalar processor based on the SPARC Version 9 64-bit RISC architecture. We have extended the core instruction set to include graphics instructions that provide the most common operations related to two-dimensional image processing; two- and three-dimensional graphics and image compression algorithms; and parallel operations on pixel data with 8-, 16-, and 32-bit components. Additional new memory access instructions support the very high bandwidth requirements typical of graphics and multimedia applications.

19.
Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware.
