18 similar documents found
1.
2.
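An Instruction Pipeline Structure for a RISC Microprocessor  Cited by: 2 (self-citations: 0, citations by others: 2)
Qi Jiayue 《小型微型计算机系统》 (Mini-Micro Systems) 1995,16(10):1-5
This paper presents the instruction pipeline structure of a RISC microprocessor, its operating principles, and the supporting techniques, including the generation and insertion of HELP instructions, delayed control transfer, and hardware interlocks.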
3.
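A Brief Analysis of Predicated Execution and Its Key Techniques  Cited by: 1 (self-citations: 0, citations by others: 1)
In the design of superscalar and VLIW microprocessors, dependences between instructions, especially control and data dependences, severely limit the exploitation of instruction-level parallelism (ILP) and thus further gains in microprocessor performance. Guarded execution (conditional execution) converts control dependences into data dependences. Specifically, it converts instructions that are control-dependent on a branch instruction into conditional instructions that are data-dependent on that branch's condition. A conditional instruction differs from an ordinary instruction in that it carries an explicit condition specifier, with the following semantics: the execution condition is evaluated first; if it is true, the operation in the instruction is executed, otherwise the instruction is treated as a no-op. Guarded execution is essentially a program transformation technique. The transformed program benefits both compiler optimization and hardware scheduling, but it requires dedicated hardware support. At present, only the ARM and IA-64 instruction sets support conditional execution.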
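The transformation described in this abstract (often called if-conversion) can be illustrated with a small sketch. It is not taken from the paper: the function names and the `guarded` helper are hypothetical, and Python only models the semantics of a guarded instruction, not the hardware.

```python
def abs_diff_branch(a, b):
    """Control-dependent form: which assignment runs is decided by a branch."""
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def guarded(pred, op, old):
    """Model of one guarded instruction (hypothetical helper).

    If the guard is true the operation runs; otherwise the instruction
    behaves as a no-op and the destination keeps its old value.
    """
    return op() if pred else old


def abs_diff_predicated(a, b):
    """If-converted form: the branch condition becomes a data input (the guard)."""
    p = a > b                              # evaluate the former branch condition once
    r = 0
    r = guarded(p, lambda: a - b, r)       # (p)  r = a - b
    r = guarded(not p, lambda: b - a, r)   # (!p) r = b - a
    return r
```

Both versions compute the same result. In the second, the two guarded operations no longer depend on a branch, only on the value `p`, which is the control-to-data dependence conversion the abstract refers to.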
4.
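This paper presents the design of a 16-bit embedded RISC microprocessor with a 4-stage pipeline. For branch instructions, it does not use the customary delayed-branch technique; instead, it adds hardware to the instruction-fetch stage to achieve zero-delay branches. Internal forwarding resolves data dependences during instruction execution, and dedicated hardware stacks support nested interrupts and nested calls. The overall architecture is designed in Verilog HDL, and the instruction set is fairly complete. Simulation on a software platform provides preliminary verification that the design is correct.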
5.
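Non-Sequential Instruction Prefetching in Units of Basic Blocks  Cited by: 1 (self-citations: 0, citations by others: 1)
Instruction-fetch capability strongly affects microprocessor performance. Instruction prefetching can effectively reduce the instruction cache miss rate, improving fetch capability and, in turn, performance. This paper proposes a branch-directed, non-sequential instruction prefetching technique that operates on basic blocks: each prefetch reads one complete basic block into the instruction cache. The method uses a static strategy to analyze program behavior, so the hardware it requires is simple. Simulation results show that it effectively improves the instruction cache hit rate.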
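As a rough illustration of prefetching a whole basic block at a branch target, consider the sketch below. It is not the paper's mechanism: the 32-byte line size, the `blocks` table (block start address mapped to block size in bytes, as a static analysis might produce), and the modelling of the instruction cache as a set of line addresses are all assumptions made for the example.

```python
LINE = 32  # assumed cache-line size in bytes


def lines_of(start, size):
    """Line-aligned addresses of every cache line touched by a block."""
    first = start // LINE * LINE
    last = (start + size - 1) // LINE * LINE
    return list(range(first, last + LINE, LINE))


def prefetch_on_branch(icache, blocks, target):
    """Bring the complete basic block at a branch target into the cache.

    icache: set of line addresses currently cached (a simplified model)
    blocks: dict of basic-block start address -> size in bytes (assumed
            to come from static analysis)
    Returns the lines that actually had to be fetched.
    """
    missing = [a for a in lines_of(target, blocks[target]) if a not in icache]
    icache.update(missing)
    return missing
```

For example, a 64-byte block starting at address 0x104 spans three 32-byte lines, so a single prefetch at the branch fetches whichever of those lines are not yet cached.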
6.
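Research on Key Techniques of High-Performance General-Purpose Microprocessor Architecture  Cited by: 1 (self-citations: 0, citations by others: 1)
The X processor is a high-performance general-purpose microprocessor, designed independently in China and based on the EPIC approach. The paper describes its 8-stage pipeline and the OLSM execution model, which overcome the limitations of the basic EPIC model at little hardware cost. It presents a multi-branch prediction structure that supports parallel execution of multiple branch instructions and uses predicated execution to reduce the number of branch instructions. It also presents a two-level cache, proposes the DTD low-power design method, and hides memory access latency through speculative execution. Finally, it discusses development trends for high-performance general-purpose microprocessors.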
7.
8.
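With the widespread adoption of deep-submicron processes, transient faults have become the main cause of chip failure. This paper proposes a fault-tolerant microarchitecture that inserts redundant instructions after branch instructions, using the processing bandwidth wasted on branch mispredictions to reduce the performance loss caused by redundant execution. Experimental results show a performance loss between 6% and 31%, averaging 21%, which is clearly lower than that of the MBI technique and comparable to that of the DIE technique. The technique detects transient faults occurring in any pipeline stage and can restore processor state. It has a short fault-detection latency and a small hardware overhead, which makes it well suited to improving the fault tolerance of embedded microprocessors with simple prediction mechanisms.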
9.
10.
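An Accurate Microprocessor Model with Branch Prediction  Cited by: 3 (self-citations: 0, citations by others: 3)
In today's deeply pipelined, wide-issue microprocessors, accurate branch prediction is an indispensable technique for high performance. Branch mispredictions waste many clock cycles and keep out-of-order execution from reaching its potential. The effective performance of a wide-issue microprocessor also depends on the instruction window size and the instruction fetch width. This paper proposes a new, more accurate microprocessor model that accounts for branch prediction and the cycle penalty of branch mispredictions. Based on the statistical rule that instruction execution bandwidth is the square root of the number of instructions available in the instruction window, it gives a more accurate algorithm describing the relationship among fetch bandwidth, branch prediction accuracy, misprediction cycle penalty, instruction window size, and IPC, and discusses the trade-offs among these parameters and their effect on program IPC. From this, one can determine the fetch-bandwidth threshold, which depends on several microprocessor parameters, and choose values for several key microprocessor parameters.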
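The abstract does not give the algorithm itself, so the following is only a first-order sketch of how the listed parameters might interact, not the paper's model. The square-root rule comes from the abstract. Capping the issue rate at the fetch width, the `branch_fraction` parameter, and the additive penalty term are assumptions introduced for illustration.

```python
import math


def estimate_ipc(window, fetch_width, branch_fraction, accuracy, penalty):
    """Toy IPC estimate (illustrative only, not the paper's algorithm).

    window          -- instructions available in the instruction window
    fetch_width     -- instructions fetched per cycle
    branch_fraction -- fraction of instructions that are branches (assumed input)
    accuracy        -- branch prediction accuracy in [0, 1]
    penalty         -- cycles lost per branch misprediction
    """
    # Square-root rule from the abstract, capped by fetch width (assumption).
    issue_rate = min(fetch_width, math.sqrt(window))
    mispredicts_per_instr = branch_fraction * (1.0 - accuracy)
    # Cycles per instruction: steady-state issue plus amortized penalty.
    cycles_per_instr = 1.0 / issue_rate + mispredicts_per_instr * penalty
    return 1.0 / cycles_per_instr
```

Under these assumptions, widening fetch beyond `sqrt(window)` yields no further gain, which is one simple way to read the abstract's notion of a fetch-bandwidth threshold.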
11.
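VLIW architecture is an important means of exploiting ILP; its advantages are a regular, simple structure and low hardware complexity. However, relying entirely on the compiler for instruction scheduling limits the performance of VLIW architectures. This paper proposes a dynamic VLIW scheduling mechanism based on deterministic instruction latency. Exploiting the fact that most instructions have fixed execution times, it reschedules the execution order of instructions using run-time information to exploit more ILP. Experimental results on an FPGA show that the mechanism has linear hardware complexity.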
12.
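Memory dependence prediction plays an important role in reducing memory dependence violations and improving microprocessor performance. To address the large hardware overhead and poor implementability of traditional dependence predictors, this paper analyzes the locality of memory dependences and proposes a memory dependence prediction method based on instruction distance. The method exploits the locality in instruction distance between instructions that cause memory dependence violations: it predicts the instruction distance of the conflicting instruction and uses it to control when certain memory access instructions issue, greatly reducing the number of violations. Experimental results show that, with a hardware overhead of about 1 KB, the instruction-distance-based predictor raises the average number of instructions executed per clock cycle by 1.70%, and by up to 5.11%. It thus improves microprocessor performance considerably at a small hardware cost.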
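One way to picture a distance-based predictor is the sketch below. It is a guess at the general idea rather than the paper's design: the direct-mapped table, the PC-based index, the 4-byte instruction assumption, and the class and method names are all hypothetical.

```python
class DistancePredictor:
    """Toy memory dependence predictor keyed by load PC (illustrative only).

    Each entry stores the distance, in dynamic instructions, between a load
    and the earlier store it conflicted with the last time it was seen.
    """

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries  # None means "no conflict predicted"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, load_pc, load_seq):
        """Return the sequence number of the store to wait for, or None."""
        distance = self.table[self._index(load_pc)]
        if distance is None:
            return None
        return load_seq - distance

    def train(self, load_pc, load_seq, store_seq):
        """Record the distance after a detected dependence violation."""
        self.table[self._index(load_pc)] = load_seq - store_seq
```

A load with no table entry issues freely. After a violation is recorded, later instances of the same load are held until the store at the predicted distance has executed.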
13.
Parcerisa J.-M. Sahuquillo J. Gonzalez A. Duato J. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(2):130-144
Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered superscalar microarchitectures. This new class of interconnects has demands and characteristics different from traditional multiprocessor networks. In particular, in a clustered microarchitecture, a low intercluster communication latency is essential for high performance. We propose some point-to-point cluster interconnects and new improved instruction steering schemes. The results show that these point-to-point interconnects achieve much better performance than bus-based ones, and that the connectivity of the network together with effective steering schemes are key for high performance. We also show that these interconnects can be built with simple hardware and achieve a performance close to that of an idealized contention-free model.
14.
15.
Current superscalar architectures strongly depend on an instruction issue queue to achieve multiple instruction issue and out-of-order execution. However, the issue queue requires a centralized structure and relies on global broadcast operations to wake up and select instructions. Therefore, a large issue queue ultimately results in a low clock rate along with a high circuit complexity. In other words, the increasing demands for a larger issue queue correspondingly impose a significant burden on achieving a higher clock speed. This paper discusses our Speculative Pre-Execution Assisted by compileR (SPEAR), a low-complexity issue queue design. SPEAR is designed to manage the small window superscalar architecture more efficiently without increasing the window size. To this end, we have first recognized that the long memory latency is one of the factors that demand a large window, and we aim at achieving early execution of the miss-causing load instructions using another hierarchy of an issue queue. We pre-execute those miss-causing instructions speculatively as an additional prefetching thread. Simulation results show that the SPEAR design achieves performance comparable to or even better than what would be obtained in superscalar architectures with a large issue queue. However, SPEAR is designed with smaller issue queues which consequently can be implemented with low hardware complexity and high clock speed.
16.
Modern microprocessors achieve high application performance at an acceptable level of power dissipation. The reorder buffer is used so that instructions executed out of order can be committed in order. The reorder buffer plays a key role in modern microprocessors because performance improvement techniques highly rely on aggressive speculation to feed wider issue, out-of-order, and deep pipelines. In terms of the power-to-performance trade-off, the reorder buffer is particularly important. This is because enlarging the reorder buffer size achieves high performance but naive scaling of the conventional reorder buffer architecture can severely increase the complexity and power consumption. In this paper, we propose low-power reorder buffer techniques for contemporary microprocessors. First, the separated reorder buffer (SROB) reduces power dissipation by deferred allocation and early release. The deferred allocation delays the SROB allocation of instructions until all their data dependencies are resolved. Then, the instructions are executed in program order and they are released faster from the SROB. The result of the instruction is written into rename buffers immediately after the execution completes. Then, the result values in the rename buffer are written into the architectural register file at the commit state. The proposed approaches in this paper provide higher resource utilization and low power consumption.
17.
Pedro Trancoso Paraskevas Evripidou Kyriakos Stavrou Costas Kyriacou 《International journal of parallel programming》2006,34(3):213-235
Current high-end microprocessors achieve high performance as a result of adding more features and therefore increasing complexity. This paper makes the case for a Chip-Multiprocessor based on the Data-Driven Multithreading (DDM-CMP) execution model in order to overcome the limitations of current design trends. Data-Driven Multithreading (DDM) is a multithreading model that effectively hides the communication delay and synchronization overheads. DDM-CMP avoids the complexity of other designs by combining simple commodity microprocessors with a small hardware overhead for thread scheduling and an interconnection network. Preliminary experimental results show that a DDM-CMP chip of the same hardware budget as a high-end commercial microprocessor, clocked at the same frequency, achieves a speedup of up to 18.5 while consuming 78–81% of the power of the commercial chip. Overall, the estimated results for the proposed DDM-CMP architecture show a significant benefit in terms of both speedup and power consumption, making it an attractive architecture for future processors.
18.
Won W. Ro Stephen P. Crago Alvin M. Despain Jean-Luc Gaudiot 《The Journal of supercomputing》2006,38(3):237-259
The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result,
today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional
data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability
for future accesses and often fail when memory accesses do not contain much locality.
To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance
decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated
from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching
processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating
and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors.
The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time.
This is achieved by separating the access-related instructions from the main computation and running them early enough on
the two dedicated processors.
Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling
tools. Our performance results show that the proposed HiDISC model reduces 19.7% of the cache misses and improves the overall
IPC (Instructions Per Cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC
improves the performance by 17.2%.