Similar Documents
20 similar documents found
1.
In our previously published research we identified a class of branches that are very difficult to predict, which we called unbiased branches. Since the overall performance of modern processors is seriously affected by misprediction recovery, these difficult branches in particular are a source of significant performance penalties. Our statistics show that about 28% of branches depend on critical Load instructions; moreover, 5.61% of branches are unbiased and also depend on critical Loads. Similarly, about 21% of branches depend on MUL/DIV instructions, and 3.76% are unbiased and depend on MUL/DIV instructions. These dependences lead to high-penalty mispredictions, which become serious performance obstacles and cause significant time to be wasted executing instructions from wrong paths. The negative impact of (unbiased) branches on overall performance should therefore be attenuated by anticipating the results of long-latency instructions, including critical Loads. At the same time, hiding instructions' long latencies in a pipelined superscalar processor is an important challenge in itself. We developed a superscalar architecture that selectively anticipates the values produced by high-latency instructions. In this work we focus on multiply and division instructions and on loads that miss in the L1 data cache, implementing a dynamic instruction reuse scheme for the MUL/DIV instructions and a simple last-value predictor for the critical Load instructions. Our improved superscalar architecture achieves an average IPC speedup of 3.5% on the integer SPEC 2000 benchmarks and 23.6% on the floating-point benchmarks, with improvements in energy-delay product (EDP) of 6.2% and 34.5%, respectively. We also quantified the impact of our selective instruction reuse and value prediction techniques in a simultaneous multithreaded (SMT) architecture with per-thread reuse buffers and load value prediction tables. Our simulation results show that the best improvements on the SPEC integer applications are obtained with 2 threads: an IPC speedup of 5.95% and an EDP gain of 10.44%. Although we obtained the highest improvements on the SPEC floating-point programs with the enhanced superscalar architecture, the SMT with 3 threads also provides a substantial IPC speedup of 16.51% and an EDP gain of 25.94%.
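A minimal C sketch of the last-value predictor applied to the critical loads above; the table size, tagging scheme, and saturating-confidence policy are illustrative assumptions rather than the paper's configuration, and the dynamic reuse scheme for MUL/DIV would use an analogous table keyed additionally on the operand values.

```c
#include <stdint.h>

#define LVP_ENTRIES 1024              /* illustrative size, not the paper's */

typedef struct {
    uint64_t tag;                     /* PC of the critical load */
    uint64_t last_value;              /* value it produced last time */
    int      confidence;              /* saturating counter gating prediction */
} LVPEntry;

static LVPEntry lvp[LVP_ENTRIES];

/* Predict a critical load's value; returns 1 only when confidence is high,
 * so dependent (possibly unbiased) branches can execute early. */
int lvp_predict(uint64_t pc, uint64_t *pred)
{
    LVPEntry *e = &lvp[pc % LVP_ENTRIES];
    if (e->tag == pc && e->confidence >= 2) {
        *pred = e->last_value;
        return 1;
    }
    return 0;                         /* no prediction: wait for the miss */
}

/* Train with the actual value once the L1 miss resolves; a wrong
 * prediction triggers the usual misspeculation recovery. */
void lvp_update(uint64_t pc, uint64_t value)
{
    LVPEntry *e = &lvp[pc % LVP_ENTRIES];
    if (e->tag != pc)                { e->tag = pc; e->confidence = 0; }
    else if (e->last_value == value) { if (e->confidence < 3) e->confidence++; }
    else                             e->confidence = 0;
    e->last_value = value;
}
```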

2.
As the width of the processor grows, the complexity of a register file (RF) with multiple ports grows more than linearly, leading to longer register access times and higher power consumption. Analysis of SPEC2000 programs reveals that only a small portion of the instructions in a program (16% in integer and 38% in floating-point code) require both source operands. Also, when the programs are executed on an 8-wide processor, only very few (two or fewer) two-source instructions are executed in a cycle for a significant portion of the time (more than 98% for integer and 93% for floating-point code), leading to significant under-utilization of register port bandwidth. In this paper, we propose a novel technique to significantly reduce the number of register ports, with a very minor modification in the select logic so that only a limited number of two-source instructions are issued each cycle. This is achieved with no significant impact on the processor's overall performance. The novelty of the technique is that it is easy to implement and succeeds in reducing the access time, power, and area of the register file without aggravating these factors in any other logic on the chip. With this technique in an 8-wide processor, compared to a conventional 128-entry RF with 16 read ports, a register file for integer programs can be designed with 11 or 10 read ports, as these configurations result in instructions-per-cycle (IPC) degradation of only 0.929% and 3.38%, respectively. This remarkably low degradation in IPC is achieved while reducing the register access time by 9% and 12%, respectively, and power by 35% and 50%, respectively. For FP programs, a register file can be designed with 12 read ports (1.16% IPC loss, 8% less access time, and 28% less power) or with 11 read ports (3.5% IPC loss, 9% less access time, and 35% less power). The paper analyzes the performance of all possible flavors of the proposed technique for register files in both 4-wide and 8-wide processors, and presents the designer with a choice of performance and register-port-complexity combinations.
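A sketch of the select-logic modification described above, assuming a simple array-scan selector; the IQEntry layout and the per-cycle limit are hypothetical, illustrating only how capping two-source issue lets most read ports be removed.

```c
#include <stddef.h>

typedef struct {
    int ready;          /* all source operands available */
    int num_sources;    /* 1 or 2 register sources */
} IQEntry;              /* hypothetical issue-queue entry */

/* Select up to 'width' ready instructions, but allow at most
 * 'two_src_limit' of them to read two register sources this cycle;
 * excess two-source instructions simply wait a cycle.  With the limit
 * set below the issue width, the register file needs far fewer read
 * ports. */
size_t select_insts(const IQEntry *iq, size_t n, size_t width,
                    int two_src_limit, size_t *issued)
{
    size_t count = 0;
    int two_src_used = 0;

    for (size_t i = 0; i < n && count < width; i++) {
        if (!iq[i].ready)
            continue;
        if (iq[i].num_sources == 2) {
            if (two_src_used == two_src_limit)
                continue;             /* defer to the next cycle */
            two_src_used++;
        }
        issued[count++] = i;
    }
    return count;
}
```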

3.
Ren Jian, An Hong, Lu Fang, Liang Bo. Computer Science (计算机科学), 2006, 33(3): 239-243
A simultaneous multithreading (SMT) processor can issue instructions from multiple threads in each clock cycle, greatly increasing the instruction throughput of a superscalar microprocessor. However, the concurrent execution of multiple threads also creates contention for many shared hardware resources. In particular, sharing the branch prediction hardware among threads can noticeably degrade branch prediction accuracy. Studying how branch handling schemes affect the overall performance of an SMT processor is therefore important for guiding SMT processor design. Using an SMT processor simulator, we experimentally evaluate several well-known branch prediction schemes on an SMT configuration in which each thread runs an independent application. We analyze the impact of the prediction schemes on branch prediction accuracy and overall processor performance in both single-threaded and multithreaded scenarios. We conclude that, in such an SMT architecture, giving each thread its own private predictor is a good choice; since each private predictor can be small and simple, this does not incur much hardware overhead.

4.
As microprocessor designs move towards deeper pipelines and support for multiple instruction issue, steps must be taken to alleviate the negative impact of branch operations on processor performance. One approach is to use branch prediction hardware and perform speculative execution of the instructions following an unresolved branch. Another technique is to eliminate certain branch instructions altogether by translating the instructions following a forward branch into predicate form. Both techniques are employed in many current processor designs. This paper investigates the relationship between branch prediction techniques and branch predication. In particular, we are interested in how using predication to remove a certain class of poorly predicted branches affects the prediction accuracy of the remaining branches. A variety of existing predication models for eliminating branch operations are presented, and the effect that eliminating branches has on branch prediction schemes, ranging from simple prediction mechanisms to the newer, more sophisticated branch predictors, is studied. We also examine the impact of predication on basic block size, and how the two techniques used together affect overall processor performance.
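To make the predication idea concrete, here is an illustrative C fragment (not from the paper) showing a forward branch and its if-converted equivalent; on a predicated ISA the second form compiles to a compare plus predicated instructions, leaving no branch for the predictor to mispredict.

```c
/* Branching form: the forward branch depends on data and may be
 * poorly predicted. */
int clamp_branch(int x, int limit)
{
    if (x > limit)        /* forward branch the predictor must guess */
        x = limit;
    return x;
}

/* If-converted form: both paths are folded into data flow.  This
 * typically becomes a compare and a conditional move (or a predicated
 * assignment), so the hard-to-predict branch disappears and no longer
 * pollutes the branch predictor's tables. */
int clamp_predicated(int x, int limit)
{
    int p = (x > limit);  /* predicate: 0 or 1 */
    return p ? limit : x; /* selected without a control-flow transfer */
}
```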

5.
Micro, IEEE, 2000, 20(5): 44-53
In planning the new EPIC (Explicitly Parallel Instruction Computing) architecture, Intel designers wanted to exploit the high level of instruction-level parallelism (ILP) found in application code. To accomplish this goal, they incorporated a powerful set of features such as control and data speculation, predication, register rotation, loop branches, and a large register file. With these features, the compiler plays a crucial role in achieving the overall performance of an IA-64 platform. This paper describes the Electron code generator (ECG), the component of Intel's IA-64 production compiler that maximizes the benefits of these features. The ECG consists of multiple phases. The first phase, translation, converts the optimizer's intermediate representation (ILO) of the program into the ECG IR. Predicate region formation, if-conversion, and compare generation occur in the predication phase. The ECG contains two schedulers: the software pipeliner for targeted cyclic regions and the global code scheduler for all remaining regions. Both schedulers make use of control and data speculation. The software pipeliner also uses rotating registers, predication, and loop branches to generate efficient schedules for integer as well as floating-point loops.

6.
In order to achieve high performance, wide-issue superscalar processors have to fetch a large number of instructions per cycle. Conditional branches are the primary impediment to increasing fetch bandwidth because they are very frequent and can potentially alter the flow of control. To overcome this problem, these processors need to predict the outcomes of multiple branches in a cycle. This paper investigates two control flow prediction schemes that predict the effective outcome of multiple branches with a single prediction. Instead of treating branches as the basic units of prediction, these schemes treat subgraphs of the executed program's control flow graph as the basic units and predict the target of an entire subgraph at a time, thereby allowing the superscalar fetch mechanism to go past multiple branches in a cycle. The first control flow prediction scheme considers sequential block-like subgraphs, and the second considers tree-like subgraphs. Both schemes make a 1-out-of-4 prediction, as opposed to the 1-out-of-2 prediction made by branch-level prediction schemes. The two schemes are evaluated using a MIPS ISA-based 12-way superscalar microarchitecture. Improvements in effective fetch size of approximately 25 percent and 50 percent, respectively, are observed over identical microprocessors that use branch-level prediction. No appreciable difference in prediction accuracy was observed, even though the control flow prediction schemes predict 1-out-of-4 outcomes.
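A toy C sketch of the 1-out-of-4 idea, assuming a direct-mapped table indexed by the subgraph's entry address that remembers up to four exit targets and predicts the exit taken last time; the paper's block-like and tree-like schemes are more elaborate than this last-outcome simplification.

```c
#include <stdint.h>

#define CFP_ENTRIES 4096              /* illustrative table size */

/* Each entry covers one control-flow subgraph and predicts which of its
 * (up to) four exits will be taken: one 1-out-of-4 prediction replaces
 * several 1-out-of-2 branch predictions. */
static struct {
    uint64_t targets[4];              /* exit targets seen so far */
    uint8_t  last_exit;               /* exit taken on the last visit */
} cfp[CFP_ENTRIES];

/* Predict the next fetch address past the whole subgraph in one shot. */
uint64_t cfp_predict(uint64_t entry_pc)
{
    unsigned i = entry_pc % CFP_ENTRIES;
    return cfp[i].targets[cfp[i].last_exit];
}

/* Update once the subgraph's actual exit resolves. */
void cfp_update(uint64_t entry_pc, unsigned exit_taken, uint64_t target)
{
    unsigned i = entry_pc % CFP_ENTRIES;
    cfp[i].last_exit = (uint8_t)(exit_taken & 3);
    cfp[i].targets[exit_taken & 3] = target;
}
```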

7.
Some processors designed for consumer applications, such as graphics processing units (GPUs) and the CELL processor, promise outstanding floating-point performance for scientific applications at commodity prices. However, IEEE single precision is the most precise floating-point data type these processors directly support in hardware. Pairs of native floating-point numbers can be used to represent a base result and a residual term to increase accuracy, but the resulting order-of-magnitude slowdown dramatically reduces the price/performance advantage of these systems. By adding a few simple microarchitectural features, acceptable accuracy can be obtained with relatively little performance penalty. To reduce the cost of native-pair arithmetic, a residual register is used to hold information that would normally have been discarded after each floating-point computation. The residual register dramatically simplifies the code, providing both lower latency and better instruction-level parallelism.
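The residual register captures the rounding error that software native-pair (double-single) arithmetic must otherwise recompute. Below is a sketch of that software cost using Knuth's TwoSum in single precision; the hardware residual register described above would make both s and e available after the single add.

```c
#include <stdio.h>

/* Knuth's TwoSum: s = fl(a + b) and the exact residual e, so that
 * a + b == s + e.  In software the residual costs five extra
 * floating-point operations; a residual register would capture e as a
 * by-product of the one add. */
static void two_sum(float a, float b, float *s, float *e)
{
    *s = a + b;
    float bb = *s - a;
    *e = (a - (*s - bb)) + (b - bb);
}

int main(void)
{
    float s, e;
    two_sum(1.0f, 1e-8f, &s, &e);     /* 1e-8 is below 1 ulp of 1.0f ...  */
    printf("sum = %.9g, residual = %.9g\n", s, e); /* ... but survives in e */
    return 0;
}
```

Note that the trick relies on strict IEEE rounding, so it must not be compiled with value-unsafe optimizations such as -ffast-math.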

8.
A deeply pipelined superscalar processor needs an accurate branch predictor in order to approach its performance potential. Two-level branch predictors have been shown to achieve high prediction accuracy, yet they still suffer a significant number of mispredictions. It has been shown that a number of these mispredictions are due to interference in the pattern history tables. This paper details a method for reducing the amount of pattern history table interference by dynamically identifying some easily predictable branches and inhibiting the pattern history table update for those branches. We show that inhibiting the update in this manner reduces the amount of destructive interference in the global-history variation of the 2-level branch predictor, resulting in significantly improved branch prediction accuracy on the SPEC 95 benchmarks. For example, for a 2-KByte gshare predictor, we eliminate 38% of the mispredictions for the gcc benchmark.
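A compact C sketch of a gshare pattern history table with the update-inhibition hook described above; the table size is a placeholder, and the criterion that sets the inhibit flag (the paper's identification of easily predictable branches) is left to the caller.

```c
#include <stdint.h>

#define PHT_BITS 13
#define PHT_SIZE (1u << PHT_BITS)

static uint8_t  pht[PHT_SIZE];        /* 2-bit saturating counters */
static uint32_t ghist;                /* global history register */

static uint32_t gshare_index(uint32_t pc)
{
    return (pc ^ ghist) & (PHT_SIZE - 1);
}

int gshare_predict(uint32_t pc)
{
    return pht[gshare_index(pc)] >= 2;    /* taken if in the upper half */
}

/* 'inhibit' is set for branches identified as easily predictable;
 * skipping their PHT update leaves the shared entry to the harder
 * branches that alias to the same index, reducing destructive
 * interference.  The global history still advances normally. */
void gshare_update(uint32_t pc, int taken, int inhibit)
{
    if (!inhibit) {
        uint8_t *c = &pht[gshare_index(pc)];
        if (taken && *c < 3)  (*c)++;
        if (!taken && *c > 0) (*c)--;
    }
    ghist = (ghist << 1) | (uint32_t)(taken != 0);
}
```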

9.
A recent SPARC (scalable processor architecture) processor consists of a two-chip configuration containing the TMS390C601 integer unit (IU) and the TMS390C602A floating-point unit (FPU). The second device, an innovative coprocessor that lets the processor execute single- or double-precision floating-point instructions concurrently with IU operations, is described. Dedicated floating-point hardware in the FPU increases system performance. Running at clock periods as short as 20 ns, the chip should deliver 5.5 million double-precision floating-point operations per second on the Linpack benchmark (at a 50-MHz clock rate). The FPU provides single- and double-precision arithmetic functions: addition, subtraction, multiplication, division, square root, compare, and convert. To minimize the math units' latency, the FPU uses a highly parallel architecture with separate math units optimized for additions and multiplications. Traps stop the execution of a program to jump to a software routine for handling data-dependent errors or to execute instructions not implemented in hardware. Benchmark results are presented.

10.
Parallel Computing, 2013, 39(10): 586-602
Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor incorporates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architectures (TTA). However, the ILP regions fail to scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generation, and loop branches, are required to aid the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that can combine VLIW or TTA processing with unified scalar and long-vector computations, as well as efficient SIMD hardware for the real computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). The architecture includes the following features: (1) a simple coprocessor structure with an 8-way TTA; (2) cost-effective SIMD hardware capable of performing floating-point operations; (3) long-vector capabilities built upon the existing SIMD hardware, with a single register file and processor datapath for both scalar operands and vector elements; and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that the MCP can outperform conventional SIMD techniques by an average of 39% in performance for multimedia kernels and 12% for applications.

11.
Accurate instruction fetch and branch prediction is increasingly important in today's superscalar architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. A branch target buffer (BTB) is often used to provide target addresses for taken branches and to predict the destination of indirect jumps. Using a BTB avoids the delay needed to recalculate the destination address and reduces the misfetch penalty. However, an effective branch target buffer can be large and can possibly increase the cycle time of a processor. We propose that a design used in older computers, such as the PDP-8, be used in modern architectures instead of a BTB design. The compiler would pre-compute the branch destination for most branch instructions, allowing the branch information to be stored with the instruction. We consider computing branch destinations at link time and as instructions are fetched into the instruction cache; both alternatives offer similar performance with different advantages. A very small BTB is still useful to predict indirect branches, which cannot be pre-computed. Our results show that the Precomputed-Branch architecture performs better than an architecture using only a BTB, and has significant hardware savings. This is particularly true for larger programs more representative of modern applications.
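A sketch of the residual BTB this scheme still needs: with direct-branch targets precomputed and stored alongside the instructions, the buffer only has to cover indirect jumps, so a few dozen entries suffice. The size and the direct-mapped organization here are assumptions.

```c
#include <stdint.h>

#define BTB_ENTRIES 64                /* "very small" by assumption */

static struct {
    uint32_t tag;                     /* PC of an indirect branch */
    uint32_t target;                  /* its most recent destination */
} btb[BTB_ENTRIES];

/* Only indirect branches are looked up; direct branches read their
 * precomputed target straight from the instruction cache line. */
int btb_lookup(uint32_t pc, uint32_t *target)
{
    unsigned i = (pc >> 2) % BTB_ENTRIES;
    if (btb[i].tag == pc) {
        *target = btb[i].target;
        return 1;
    }
    return 0;                         /* misfetch: fall through and fix up */
}

void btb_update(uint32_t pc, uint32_t target)
{
    unsigned i = (pc >> 2) % BTB_ENTRIES;
    btb[i].tag = pc;
    btb[i].target = target;
}
```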

12.
Sima, D. Micro, IEEE, 2000, 20(5): 70-83
Register renaming is a technique for removing the false data dependencies, write after read (WAR) and write after write (WAW), that occur in straight-line code between the register operands of subsequent instructions. By eliminating the associated precedence requirements in the execution sequence of the instructions, renaming increases the average number of instructions available for parallel execution per cycle, which results in increased IPC (the number of instructions executed per cycle). Identifying and exploring the design space of register renaming leads to a comprehensive understanding of this intricate technique. As this article shows, the design space of register renaming is spanned by four main dimensions: the scope of register renaming, the layout of the rename buffers, the method of register mapping, and the rename rate. Relevant aspects of the design space give rise to eight basic alternatives for register renaming. In addition, the operand fetch policy significantly affects how the processor carries out the rename process, which doubles the eight basic alternatives to 16 possible implementation schemes. The article indicates which basic implementation scheme is used in relevant superscalar processors. As register renaming is usually implemented in conjunction with shelving, the underlying microarchitecture is assumed to employ shelving.
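A minimal C sketch of map-table-based renaming, one of the basic alternatives the article surveys; the register counts are illustrative, and the freeing of physical registers at retirement is elided.

```c
#define ARCH_REGS 32
#define PHYS_REGS 128

static int map_table[ARCH_REGS];      /* architectural -> physical */
static int free_list[PHYS_REGS];
static int free_top;

void rename_init(void)
{
    for (int r = 0; r < ARCH_REGS; r++) map_table[r] = r;
    free_top = 0;
    for (int p = ARCH_REGS; p < PHYS_REGS; p++) free_list[free_top++] = p;
}

/* Rename "rd = rs1 op rs2".  Sources read the current mapping; the
 * destination gets a fresh physical register, so a later write to the
 * same architectural register (WAW) or a pending earlier read of it
 * (WAR) no longer constrains the execution order.  The previous
 * mapping of rd would be returned to the free list when this
 * instruction retires. */
void rename(int rd, int rs1, int rs2, int *prd, int *prs1, int *prs2)
{
    *prs1 = map_table[rs1];
    *prs2 = map_table[rs2];
    *prd  = free_list[--free_top];    /* rename stalls if this runs empty */
    map_table[rd] = *prd;
}
```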

13.
The IBM RISC System/6000, a superscalar microprocessor, is presented. Its instruction set is specifically designed for a superscalar machine containing three independent units: branch, fixed-point, and floating-point. The design also emphasizes high-performance floating-point operation. The design principles are to maximize the overlap of the three functional units, avoid dead cycles, and define instructions that can (for the most part) complete at a rate of one per cycle. The branch cycle, the fixed- and floating-point units, cache management, and performance are described, and benchmark results are given.

14.
Most performance-enhancing mechanisms in current processors, such as branch predictors and prefetchers, rely on program characteristics monitored at the granularity of single instructions. However, many of these characteristics can be obtained at the basic-block level instead. The coarser granularity allows a larger portion of the code to be examined, enabling more accurate profiling and a detailed analysis of the different types of instructions executed within a block. Block-level analysis can therefore be advantageous for performance-enhancing mechanisms, as it allows us to look at how instructions influence each other and thus to detect complex behavior patterns. In this paper, we present the Dynamic Block-Level Execution Profiler (DBLEP), an online mechanism that profiles micro-architectural bottlenecks, such as delinquent memory loads, hard-to-predict branches, and contention for functional units, at the basic-block level. DBLEP provides information that can be used to reduce the impact of these bottlenecks. A prefetch-dropping scheme and a memory controller policy were developed to use the code profiling information provided by DBLEP. By taking advantage of its high profiling accuracy, these mechanisms improve the processor's performance by up to 18.6% (5.3% on average). We show that our mechanism's performance is comparable to mechanisms that work at single-instruction granularity, while using less hardware.
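A sketch of block-level profiling counters and one consumer, assuming a direct-mapped table indexed by the block's start PC; the fields and the prefetch-dropping test are illustrative stand-ins for DBLEP's actual tables and policies.

```c
#include <stdint.h>

#define BLOCKS 8192                   /* illustrative table size */

/* One counter record per basic block, updated at block retirement;
 * profiling whole blocks lets the delinquent loads and hard branches
 * inside a block be seen together. */
static struct {
    uint32_t executions;
    uint32_t miss_loads;              /* delinquent memory loads */
    uint32_t mispredicts;             /* hard-to-predict branches */
} blk[BLOCKS];

void block_retired(uint32_t start_pc, uint32_t miss_loads,
                   uint32_t mispredicts)
{
    unsigned i = (start_pc >> 2) % BLOCKS;
    blk[i].executions++;
    blk[i].miss_loads  += miss_loads;
    blk[i].mispredicts += mispredicts;
}

/* Example consumer: a prefetch-dropping heuristic might skip prefetches
 * issued from blocks that have run often without ever missing. */
int should_drop_prefetch(uint32_t start_pc)
{
    unsigned i = (start_pc >> 2) % BLOCKS;
    return blk[i].executions > 64 && blk[i].miss_loads == 0;
}
```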

15.
By executing two or more threads concurrently, Simultaneous MultiThreading (SMT) architectures are able to exploit both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) from the increased number of in-flight instructions fetched from multiple threads. However, due to incorrect control speculation, a significant number of these in-flight instructions are discarded from the pipelines of SMT processors (a direct consequence of these pipelines getting wider and deeper). Although increasing the accuracy of branch predictors may reduce the number of instructions discarded, prediction accuracy cannot easily be scaled up, since aggressive branch prediction schemes depend strongly on the predictability inherent to the application programs. In this paper, we present an efficient thread scheduling mechanism for SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling): it is easy to implement and allows an SMT processor to selectively perform speculative execution of threads according to the confidence level of branch predictions, hence preventing wrong-path instructions from being fetched. SAFE-T provides an average reduction of 57.9% in the number of discarded instructions and improves instructions-per-cycle (IPC) performance by 14.7% on average over the ICOUNT policy across the multi-programmed workloads we simulate. This paper is an extended version of "Speculation Control for Simultaneous Multithreading," which appeared in the Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.
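A sketch of confidence-driven front-end throttling in the spirit of SAFE-T, assuming a per-thread count of in-flight low-confidence branches and a threshold of one; the actual confidence estimator and thresholds are the paper's and are not reproduced here.

```c
#define NUM_THREADS    4
#define LOW_CONF_LIMIT 1              /* illustrative threshold */

static int low_conf[NUM_THREADS];     /* in-flight low-confidence branches */

/* A thread may fetch only while its unresolved low-confidence branches
 * stay under the limit; throttling it keeps probable wrong-path
 * instructions out of the shared pipeline. */
int may_fetch(int tid)
{
    return low_conf[tid] < LOW_CONF_LIMIT;
}

void on_fetch_low_conf_branch(int tid)
{
    low_conf[tid]++;
}

void on_branch_resolved(int tid, int was_low_conf)
{
    if (was_low_conf && low_conf[tid] > 0)
        low_conf[tid]--;              /* resolution re-enables fetch */
}
```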

16.
Design of a MIPS32 Instruction-Set-Compatible CPU Simulator
Xue Bo, Zhou Yujie. Computer Engineering (计算机工程), 2009, 35(1): 263-265
This paper describes the design of a CPU simulator compatible with the MIPS32 instruction set. The design models the processor's hardware behavior in C and simulates the execution of CPU instructions, implementing all MIPS32 instructions except the floating-point ones. The simulator features a main memory of configurable size, a unified two-way set-associative cache for instructions and data, a built-in branch predictor of configurable type, and an ELF file parser. An application example of the design is also given.
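The abstract's simulator models hardware behavior in C; a minimal sketch of such an interpreter core is shown below, decoding just two MIPS32 instructions (ADDU and BEQ) and omitting the cache, branch predictor, ELF loading, and branch delay slots.

```c
#include <stdint.h>

#define MEM_WORDS 4096                /* memory size is configurable */

static uint32_t mem[MEM_WORDS];       /* word-addressed main memory */
static uint32_t regs[32];
static uint32_t pc;

/* One simulated cycle: fetch, decode, execute.  A full simulator
 * dispatches on every MIPS32 opcode and routes fetches through the
 * unified cache and the branch predictor. */
void step(void)
{
    uint32_t inst = mem[(pc >> 2) % MEM_WORDS];          /* fetch */
    uint32_t op = inst >> 26;
    uint32_t rs = (inst >> 21) & 31, rt = (inst >> 16) & 31;
    uint32_t next_pc = pc + 4;

    if (op == 0 && (inst & 63) == 0x21) {                /* ADDU rd,rs,rt */
        uint32_t rd = (inst >> 11) & 31;
        regs[rd] = regs[rs] + regs[rt];
    } else if (op == 4) {                                /* BEQ rs,rt,off */
        int32_t off = (int16_t)(inst & 0xFFFF);          /* sign-extend */
        if (regs[rs] == regs[rt])
            next_pc = pc + 4 + ((uint32_t)off << 2);     /* delay slot omitted */
    }
    regs[0] = 0;                                         /* $zero stays zero */
    pc = next_pc;
}
```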

17.
Hsu, P. Y.-T. Micro, IEEE, 1994, 14(2): 23-33
Designed to efficiently support large, real-world, floating-point-intensive applications, the TFP (short for Tremendous Floating-Point) microprocessor is a superscalar implementation of the MIPS Technologies architecture. This floating-point, computation-oriented processor uses a superscalar machine organization that dispatches up to four instructions each clock cycle to two floating-point execution units, two memory load/store units, and two integer execution units. Its split-level cache structure reduces cache misses by directing integer data references to a 16-Kbyte on-chip cache while channeling floating-point data references off chip to a 4-Mbyte cache.

18.
Iacobovici, S. Micro, IEEE, 1988, 8(3): 77-87
Two options are presented that were considered for a pipelined interface between a central processing unit (CPU) and a floating-point coprocessor (FPU), along with the CPU recovery mechanisms that provide precise floating-point exceptions for each option. The first option supports parallel execution of both floating-point and integer instructions, while the second pipelines only the execution of floating-point instructions. The second option was chosen for National Semiconductor's 32532/32580 processor cluster because it offers high performance with significantly lower complexity. The 32532 microprocessor features a pipelined slave protocol that hides the CPU-FPU communication overhead for most floating-point instructions by pipelining their execution. A simple recovery mechanism implemented within the CPU maintains the precision of floating-point exceptions. As a result, the 32532 microprocessor delivers very high floating-point performance without sacrificing software compatibility with previous Series 32000 CPU-FPU clusters.

19.
Papamichalis, P., Simar, R., Jr. Micro, IEEE, 1988, 8(6): 13-29
The 320C30 is a fast processor with a large memory space and floating-point arithmetic capabilities. The authors describe the 320C30 architecture in detail, discussing both the internal organization of the device and the external interfaces. They also explain the pipeline structure, address software-related issues and constructs, and examine the development tools and support. Finally, they present examples of applications. Some of the major features of the 320C30 are: a 60-ns cycle time that yields over 16 million instructions per second (MIPS) and over 33 million floating-point operations per second (Mflops); 32-bit data buses and 24-bit address buses for a 16M-word overall memory space; dual-access 4K×32-bit on-chip ROM and 2K×32-bit on-chip RAM; a 64×32-bit program cache; a 32-bit integer/40-bit floating-point multiplier and ALU; eight extended-precision registers, eight auxiliary registers, and 23 control and status registers; generally single-cycle instructions; integer, floating-point, and logical operations; two- and three-operand instructions; an on-chip DMA controller; and fabrication in 1-μm CMOS technology with packaging in a 180-pin package. These features facilitate FIR (finite impulse response) and IIR (infinite impulse response) filtering, telecommunications and speech applications, and graphics and image processing applications.

20.
An Effective Fetch Control Mechanism for Simultaneous Multithreading Processors
A simultaneous multithreading processor greatly improves performance by fetching and executing instructions from multiple running threads in each clock cycle. The accuracy of the branch predictor and the efficiency of the fetch policy are key factors in SMT processor performance. By combining a value-based branch predictor with a fetch policy based on thread progress speed, we propose a new fetch control mechanism. The structure has low hardware overhead and low implementation complexity. Experimental results show that the proposed fetch control mechanism effectively improves processor performance, achieving a speedup of 28% over a conventional fetch control mechanism; this speedup is also higher than those of current fetch control mechanisms based on stream buffers or on branch classification.
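A sketch of the progress-speed half of such a fetch policy, in the style of ICOUNT: each cycle, fetch goes to the thread with the fewest instructions clogging the front end; folding in the abstract's value-based branch-prediction confidence would simply bias this choice. The counter granularity here is an assumption.

```c
#define NUM_THREADS 4

static int front_end_insts[NUM_THREADS]; /* decoded but not yet issued */

/* Pick the thread advancing fastest, i.e. the one with the fewest
 * instructions stalled in the front end; slow threads are starved of
 * fetch slots until their backlog drains. */
int pick_fetch_thread(void)
{
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++)
        if (front_end_insts[t] < front_end_insts[best])
            best = t;
    return best;
}
```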
