Similar literature
1.
2.
Two processors that compete in the workstation/server markets are compared. The 62.5-MHz IBM RISC System/6000 Model 580 (RS1) exemplifies a moderate clock rate design. As the highest SPECmark89/MHz system, it can be viewed as maximizing the work performed per cycle. The 133-/200-MHz DEC Alpha processor represents an aggressive clock rate design. At 200 MHz, the Alpha has the highest MHz rate in the market. The authors discuss clock rate goals, how they influence design choices, and performance implications. The primary advantage for the Alpha design appears to be the high clock rate. The RS1 design includes a significant amount of hardware to increase superscalar capability, especially on floating-point codes. RS1 has a significant infinite cache CPI advantage on floating-point applications. Infinite cache CPI for the two designs seems comparable on fixed-point codes.
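The trade-off this abstract describes follows from the standard relation: execution time = instruction count × CPI / clock rate. The sketch below is only an illustration. The two clock rates come from the abstract; the instruction count and CPI values are hypothetical placeholders, and it assumes both machines execute the same number of instructions, which need not hold across different instruction sets.

```python
def execution_time_s(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """Execution time in seconds: instructions * cycles-per-instruction / clock rate."""
    return instruction_count * cpi / clock_hz


def break_even_cpi(cpi_slow_clock: float, slow_hz: float, fast_hz: float) -> float:
    """CPI at which the faster-clocked design only ties the slower-clocked one
    (assumes both execute the same number of instructions)."""
    return cpi_slow_clock * fast_hz / slow_hz


# Clock rates from the abstract; CPI and instruction count are hypothetical.
RS1_HZ = 62.5e6
ALPHA_HZ = 200e6
HYPOTHETICAL_RS1_CPI = 1.0
HYPOTHETICAL_INSTRUCTIONS = 1_000_000

rs1_time = execution_time_s(HYPOTHETICAL_INSTRUCTIONS, HYPOTHETICAL_RS1_CPI, RS1_HZ)
tie_cpi = break_even_cpi(HYPOTHETICAL_RS1_CPI, RS1_HZ, ALPHA_HZ)
alpha_time = execution_time_s(HYPOTHETICAL_INSTRUCTIONS, tie_cpi, ALPHA_HZ)
```

Under these assumptions the 200-MHz design can afford a CPI up to 3.2 times that of the 62.5-MHz design before losing its clock rate advantage, so a per-cycle advantage must be weighed against the clock ratio.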

3.
Current superscalar architectures depend strongly on an instruction issue queue to achieve multiple instruction issue and out-of-order execution. However, the issue queue requires a centralized structure and relies mainly on global broadcast operations to wake up and select instructions. A large issue queue therefore results in a low clock rate and high circuit complexity. In other words, the growing demand for a larger issue queue imposes a significant burden on achieving a higher clock speed. This paper discusses our Speculative Pre-Execution Assisted by compileR (SPEAR), a low-complexity issue queue design. SPEAR is designed to manage a small-window superscalar architecture more efficiently without increasing the window size. To this end, we first recognize that long memory latency is one of the factors that demand a large window, and we aim at early execution of the miss-causing load instructions using another hierarchy of issue queue. We pre-execute those miss-causing instructions speculatively as an additional prefetching thread. Simulation results show that the SPEAR design achieves performance comparable to, or even better than, that of superscalar architectures with a large issue queue. Because SPEAR uses smaller issue queues, however, it can be implemented with low hardware complexity and a high clock speed.

4.
Smith  J.E. Weiss  S. 《Computer》1994,27(6):46-58
A discussion is given on two RISC implementations: from Digital Equipment Corporation, the Alpha 21064, and from IBM/Motorola/Apple, the PowerPC 601. Both are superscalar implementations, that is, they can sustain execution of two or more instructions per clock cycle. Otherwise, these two implementations present vastly different philosophies for achieving high performance. The PowerPC 601 focuses on powerful instructions and great flexibility in processing order, while the Alpha 21064 depends on a very fast clock, with simpler instructions and a more streamlined implementation structure. These two RISC microprocessors exemplify contrasting, but equally valid, implementation philosophies. An overview is given of the instruction sets and the authors emphasize the differences in design: PowerPC uses powerful instructions so that fewer are needed to get the job done; Alpha uses simple instructions so that the hardware can be kept simpler and faster. The authors also discuss the pipelined implementations of the two architectures; again, the contrast is between powerful and simple.

5.
Hsu  P.Y.-T. 《Micro, IEEE》1994,14(2):23-33
Designed to efficiently support large, real-world, floating-point-intensive applications, the TFP (short for Tremendous Floating-Point) microprocessor is a superscalar implementation of the Mips Technologies architecture. This floating-point, computation-oriented processor uses a superscalar machine organization that dispatches up to four instructions each clock cycle to two floating-point execution units, two memory load/store units, and two integer execution units. Its split-level cache structure reduces cache misses by directing integer data references to a 16-Kbyte on-chip cache, while channeling floating-point data references off chip to a 4-Mbyte cache.

6.
The PA7100 CPU, the first precision-architecture, reduced-instruction-set-computer (PA-RISC) architecture implementation to combine an integer core and floating-point coprocessor into a single-chip format, is described. It incorporates superscalar execution and supports clock rates of up to 100 MHz in standard 0.8-μm CMOS. Features such as a flexible primary cache organization and multiprocessing capability allow the device to be scaled to a variety of system applications, price ranges, and performance levels. The microprocessor instruction execution pipeline, cache design, translation look-aside buffer (TLB) for virtual address translation, floating-point unit, and system interface bus are discussed. The design, test, and verification methods used in the development of the PA7100 are reviewed.

7.
The PowerPC 601 microprocessor, the first of a family of processors based on the PowerPC architecture, is described. The general-purpose processor contains a 32-Kbyte cache and a superscalar machine organization that allows dispatch and execution of up to three instructions each clock cycle. The bus interface and storage control mechanisms can be configured for a wide range of system designs, from low-cost desktop personal computers to high-performance multiprocessor systems. The PowerPC architecture, machine organization, chip packaging technology, and performance are discussed.

8.
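The research background and technical approach of TURBO52 are introduced. While remaining backward compatible with the classic 8051 instruction set, TURBO52 improves system performance through a redesigned architecture. The structure of the instruction pipeline is described, including the two-way superscalar organization, branch prediction, dynamic execution, and memory management. Tests running real control-system applications on an FPGA show that, at the same operating frequency, a range of system software runs more than 30 times faster than on the classic 8051, with a peak instruction throughput of two instructions per clock cycle. However, because a three-level memory hierarchy and a data cache have not been implemented, performance gains are limited when operating above 100 MHz.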

9.
This paper analyzes the performance of vector-dominated regions of code in numerical and multimedia applications in a superscalar + vector architecture and compares it with an eight-way superscalar processor. The ability to split a program's execution into scalar and vector regions allows us to show that (1) as expected, the vector unit is much better than the wide-issue superscalar at executing the vector-dominated regions of the code; (2) on the scalar regions, the eight-way superscalar, although better than a four-way superscalar, is clearly not worth the extra complexity in terms of extra transistors and potential cycle-time limitations. Overall, the vector-enhanced superscalar is from 6% to 303% better than an eight-way superscalar. We also present detailed data on the performance of the memory system, which is usually the key limiting factor when running numerical and multimedia applications. We evaluate two additional cache designs that try to alleviate problems created by non-unit stride memory references.

10.
The Metaflow architecture, a unified approach to maximizing the performance of superscalar microprocessors, is introduced. The Metaflow architecture exploits inherent instruction-level parallelism in conventional sequential programs by hardware means, without relying on optimizing compilers. It is based on a unified structure, the DRIS (deferred-scheduling, register-renaming instruction shelf), that manages out-of-order execution and most of the attendant problems. Coupling the DRIS with a speculative-execution mechanism that avoids conditional branch stalls results in performance limited only by inherent instruction-level parallelism and available execution resources. Although presented in the context of superscalar machines, the technique is equally applicable to a superpipelined implementation. Lightning, the first implementation of the Metaflow architecture, which executes the Sparc RISC instruction set, is described.

11.
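Speculative execution is an effective technique for improving the performance of superscalar processors; in order to analyze the performance potential of superscalar processors with speculative execution.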

12.
The IBM RISC System/6000, a superscalar microprocessor, is presented. The architecture of this processor has its instruction set specifically designed for a superscalar machine containing three independent units: branch, fixed-point, and floating-point. The design also emphasizes high-performance floating-point operations. The design principles are to offer maximum overlap of the three functional units, avoid dead cycles, and define instructions that can (for the most part) be completed at a rate of one per cycle. The branch cycle, fixed- and floating-point units, cache management, and performance are described. Benchmark results are given.

13.
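VLIW is a microprocessor design philosophy and technique that emerged long ago but was never widely adopted, and that has now again become a focus of research. Like superscalar techniques, it supports executing multiple instructions per cycle, but with a higher degree of parallelism. This paper describes the concept of VLIW and its history in detail, discusses the characteristics of VLIW microprocessors and the compiler support they require, and compares them with superscalar microprocessors.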

14.
Yeager  K.C. 《Micro, IEEE》1996,16(2):28-41
The Mips R10000 is a dynamic, superscalar microprocessor that implements the 64-bit Mips 4 instruction set architecture. It fetches and decodes four instructions per cycle and dynamically issues them to five fully-pipelined, low-latency execution units. Instructions can be fetched and executed speculatively beyond branches. Instructions graduate in order upon completion. Although execution is out of order, the processor still provides sequential memory consistency and precise exception handling. The R10000 is designed for high performance, even in large, real-world applications with poor memory locality. With speculative execution, it calculates memory addresses and initiates cache refills early. Its hierarchical, nonblocking memory system helps hide memory latency with two levels of set-associative, write-back caches.
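The combination of out-of-order execution with in-order graduation can be pictured as a queue kept in program order. The sketch below is a generic illustration of that idea, not a description of the R10000's actual hardware. The class name `InOrderGraduationQueue` and its methods are hypothetical.

```python
from collections import deque


class InOrderGraduationQueue:
    """Toy model: instructions may complete in any order but retire oldest first."""

    def __init__(self):
        self._entries = deque()  # kept in program order, oldest at the left

    def dispatch(self, tag):
        """Record a newly decoded instruction at the tail, in program order."""
        self._entries.append({"tag": tag, "done": False})

    def complete(self, tag):
        """Mark an instruction as finished executing (may happen out of order)."""
        for entry in self._entries:
            if entry["tag"] == tag:
                entry["done"] = True
                return
        raise KeyError(tag)

    def graduate(self):
        """Retire finished instructions from the head only, preserving program order."""
        graduated = []
        while self._entries and self._entries[0]["done"]:
            graduated.append(self._entries.popleft()["tag"])
        return graduated


queue = InOrderGraduationQueue()
for name in ("i0", "i1", "i2"):
    queue.dispatch(name)

queue.complete("i2")            # youngest instruction finishes first
first = queue.graduate()        # nothing retires: i0 is still pending
queue.complete("i0")
second = queue.graduate()       # only i0 retires: i1 blocks i2
queue.complete("i1")
third = queue.graduate()        # i1 and i2 retire together, in order
```

Because nothing younger than an unfinished instruction ever retires, an exception can be taken at a precise point in program order even though execution itself was reordered.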

15.
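Simultaneous multithreading (SMT) is a technique that allows multiple independent threads to issue multiple instructions each cycle. It exploits the instruction-level and thread-level parallelism that may be available and improves the utilization of limited resources. Building on "Longteng R2" (龙腾R2), a 32-bit superscalar processor developed independently by the Aviation Microelectronics Center of Northwestern Polytechnical University, this article introduces SMT under the constraints that the sizes of the internal structures remain essentially unchanged, no execution functional units are added, and only the necessary modifications are made. Performance is analyzed by simulating different numbers of threads and various thread combinations. Although some factors limit the performance gain, introducing SMT still yields a performance increase of up to about 50%.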

16.
The 21164 is a new quad-issue, superscalar Alpha microprocessor that executes 1.2 billion instructions per second. The 300-MHz, 0.5-μm CMOS chip delivers an estimated 345/505 SPECint92/SPECfp92 performance. The design's high clock rate, low operational latency, and high-throughput/nonblocking memory systems contribute to this performance.
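The figure of 1.2 billion instructions per second matches the peak issue rate implied by the other numbers in this abstract: four instructions per cycle at 300 MHz. A quick check, assuming the figure is a peak rather than a sustained rate (the abstract does not say which):

```python
def peak_instructions_per_second(clock_hz: float, issue_width: int) -> float:
    """Upper bound on throughput: a full issue group every cycle, with no stalls."""
    return clock_hz * issue_width


# 300 MHz and quad issue are taken from the abstract.
peak = peak_instructions_per_second(300e6, 4)
```

Sustained throughput on real code would be lower whenever the issue slots cannot all be filled.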

17.
With just three VLSI parts, Hewlett-Packard's latest workstation class lets designers optimize performance and cost at the system level. Its Hummingbird microprocessor features two-way superscalar execution incorporating two integer units, a floating-point unit, a 1-Kbyte internal instruction cache, an integrated external cache controller, an integrated memory and I/O controller, plus enhancements for little-endian and multimedia applications. Its Artist graphics controller integrates a graphical user interface accelerator, a frame buffer controller, and a video controller on a single chip.

18.
The PowerPC is a new RISC architecture derived from IBM's POWER architecture. The changes made to POWER simplify implementations, increase clock rates, enable a higher degree of superscalar execution, extend the architecture to 64 bits, and add multiprocessor support. For compatibility with existing software, the developers retained POWER's basic instruction set, opcode assignments, and programming model.

19.
Sima  D. 《Micro, IEEE》1997,17(5):28-39
Clearly, instruction issue and execution are closely related: The more parallel the instruction execution, the higher the requirements for the parallelism of instruction issue. Thus, we see the continuous and harmonized increase of parallelism in instruction issue and execution. This article focuses on superscalar instruction issue, tracing the way parallel instruction execution and issue have increased performance. It also spans the design space of instruction issue, identifying important design aspects and available design choices. The article also demonstrates a concise way to represent the design space using DS trees, reviews the most frequently used issue schemes, and highlights trends for each design aspect of instruction issue.

20.
Zeng Bin  An Hong  Wang Li 《计算机科学》 (Computer Science) 2010,37(3):248-252
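Exploiting ILP (instruction-level parallelism) is one of the key factors in the performance of modern high-performance processors. Wide-issue superscalar processors, very long instruction word (VLIW) processors, and dataflow processors achieve high performance only when they execute multiple adjacent instructions in parallel. A key problem for dataflow processors is how to deliver an instruction's result efficiently to its target instructions without reading or writing a centralized register file. For every instruction that has more targets than the instruction format can encode, the compiler must insert a software fanout tree built from MOV instructions to deliver the result to the multiple target instructions. To expose more ILP to the hardware execution substrate, an improved software fanout tree generation algorithm is proposed. The algorithm computes a weight for each target instruction from its execution probability and the length of the critical path from the target instruction to the exit of the block containing it, sorts the leaves by priority weight, and then builds the software fanout tree in priority order so that the result reaches the multiple target instructions. Experimental results show that the algorithm performs considerably better than the traditional software fanout tree generation algorithm.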
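The steps named in this abstract (weight each target, sort by priority, build a tree of MOV instructions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm. The abstract does not give the weight formula, so the product of execution probability and critical path length is assumed. The skewed tree shape, the limit of two encodable targets, and all class and function names are also hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Target:
    name: str
    exec_probability: float     # chance that the target instruction executes
    critical_path_length: int   # cycles from the target to the exit of its block

    @property
    def weight(self) -> float:
        # Assumed formula: the abstract names the two inputs but not how they combine.
        return self.exec_probability * self.critical_path_length


@dataclass
class Mov:
    """Hypothetical MOV node that forwards a value to its children."""
    children: List[Union["Mov", Target]] = field(default_factory=list)


def build_fanout_tree(targets: List[Target], max_targets: int = 2) -> List[Union[Mov, Target]]:
    """Return the direct destinations of the producing instruction."""
    if max_targets < 2:
        raise ValueError("fanout needs at least two encodable targets")
    ordered = sorted(targets, key=lambda t: t.weight, reverse=True)
    if len(ordered) <= max_targets:
        return list(ordered)
    # Keep the highest-priority targets near the root; chain the rest behind one MOV.
    direct = ordered[: max_targets - 1]
    rest = build_fanout_tree(ordered[max_targets - 1:], max_targets)
    return direct + [Mov(children=rest)]


def depth_of(nodes, name: str, depth: int = 1) -> int:
    """Hops from the producer to the named target, or -1 if it is absent."""
    for node in nodes:
        if isinstance(node, Target):
            if node.name == name:
                return depth
        else:
            found = depth_of(node.children, name, depth + 1)
            if found != -1:
                return found
    return -1


tree = build_fanout_tree([
    Target("a", 0.9, 10),
    Target("b", 0.5, 4),
    Target("c", 0.1, 3),
    Target("d", 0.8, 5),
])
```

With these made-up inputs the priority order is a, d, b, c, so target a receives the value directly from the producer while b and c sit behind two MOV instructions. A balanced tree would shorten the worst case at the cost of delaying the most critical target.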
