Similar Documents
20 similar documents found (search time: 35 ms)
1.
Multi-ported register files help exploit instruction-level and thread-level parallelism, but they also put pressure on area, energy consumption, and access time. Targeting superscalar and SMT processors, this paper presents a method that adds a small Active Value File (AVF) to selectively hold physical register values during their active period (from production to last use). The AVF offloads access pressure from the main register file and reduces its port count; it is simple to implement and has a write-filtering property. It achieves a substantial reduction in energy consumption without affecting clock frequency and with only a small IPC loss.
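As a rough illustration of the write-filtering idea, the trace-driven sketch below (hypothetical: the AVF size, LRU replacement, and the per-register approximation of the active period are assumptions, not details from the paper) counts how many main-register-file writes a small AVF would filter out:

```python
from collections import OrderedDict

def simulate_avf(trace, avf_size=8):
    """trace: ordered list of ('write', reg) / ('read', reg) events.
    A value kept in the AVF until its last use never reaches the main RF;
    only values evicted while still live are spilled (write filtering)."""
    # Approximate each register's active period as extending to its last
    # read anywhere in the trace (a simplification).
    last_use = {}
    for i, (op, reg) in enumerate(trace):
        if op == 'read':
            last_use[reg] = i

    avf = OrderedDict()               # reg -> None, kept in LRU order
    main_rf_writes = 0
    for i, (op, reg) in enumerate(trace):
        if op == 'write':
            if reg in avf:
                avf.move_to_end(reg)
            else:
                if len(avf) >= avf_size:
                    avf.popitem(last=False)   # evict LRU entry while live:
                    main_rf_writes += 1       # it must spill to the main RF
                avf[reg] = None
        else:                          # read
            if reg in avf:
                avf.move_to_end(reg)
                if last_use.get(reg) == i:
                    del avf[reg]       # active period over: filtered write
    return main_rf_writes
```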

2.
In this paper, a scalable scheme for full register bypassing in a modern embedded processor architecture, termed ByoRISC, is presented; the scheme is configurable via register-transfer-level parameters. The register bypassing specification is parameterized by the number of homogeneous register file read and write ports and by the number of pipeline stages of the processor. The performance characteristics (cycle time, chip area) of the proposed technique have been evaluated for FPGA implementations of the synthesizable ByoRISC model. It is shown that a full bypassing network is a viable solution for eliminating data hazards when servicing instructions with multiple read and write operands. While the maximum clock frequency is reduced by 17.9% on average when using full rather than partial forwarding, custom computation more than compensates for this loss, providing cycle speedups of 3.9× to 5.5× and a corresponding execution time speedup of 3.6× for a ByoRISC testbed processor. Individual application speedups of up to 9.4× have also been obtained.
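For intuition about what a full bypassing network does at the behavioral level (a sketch, not ByoRISC's RTL; the youngest-first scan order and data layout are assumptions), each read port may source its operand from any in-flight result whose destination matches, falling back to the register file:

```python
def read_operand(reg, pipeline_results, regfile):
    """Full bypass: scan in-flight results from youngest stage to oldest.
    pipeline_results: list of (dest_reg, value) pairs, youngest first."""
    for dest, value in pipeline_results:
        if dest == reg:
            return value          # forwarded: the data hazard is avoided
    return regfile[reg]           # no in-flight producer: read the RF
```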

3.
肖瑞瑾  权衡  张家杰  尤凯迪  英彦  虞志益 《计算机工程》2012,38(15):283-285,289
To address the limited number of registers available in a processor, an extended register file design for multi-core processors is proposed. The hardware is organized as multiple banks, and inter-core communication ports are mapped into the extended register address space, implementing a register-addressed inter-core communication mechanism. A hybrid software configuration scheme combining low-level instructions with high-level wrappers is introduced, and the compilation flow is adapted accordingly. Evaluation results show that the scheme doubles the number of usable registers, reduces the number of inter-core communication instructions by 50%, and improves system throughput.
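A register-addressed communication mechanism of this general shape can be sketched behaviorally as follows (hypothetical: the 32-register local/extended split and the FIFO channels are illustrative assumptions, not the paper's actual mapping):

```python
class ExtendedRegFile:
    """Register IDs 0..31 are local; IDs 32+ alias inter-core channels,
    so an ordinary register write becomes a send and a read a receive."""
    def __init__(self, channels):
        self.local = [0] * 32
        self.channels = channels           # channel index -> FIFO (list)

    def write(self, reg, value):
        if reg < 32:
            self.local[reg] = value
        else:
            self.channels[reg - 32].append(value)   # send to peer core

    def read(self, reg):
        if reg < 32:
            return self.local[reg]
        return self.channels[reg - 32].pop(0)       # receive (blocking omitted)
```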

4.
Wide-issue, high-frequency processors require not only a low-latency but also a high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for the L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes for directing memory references have been proposed in order to come close to the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration and thus suffer from access conflicts within one cache and unbalanced loads between the caches. This paper observes that if the data references in a program can be grouped into several regions (access regions) that allow parallel accesses, providing a separate small cache, an access region cache, for each region can yield better performance. A register-guided memory reference partitioning approach is proposed; it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as the basic guide for instruction steering. Starting from an initial assignment to a specific access region cache per base register name, a reassignment mechanism captures the access pattern as the program moves across its access regions. In addition, a distribution mechanism adaptively lets access regions extend or shrink among the physical caches to further reduce potential conflicts. Simulations of the SPEC CPU2000 benchmarks show that the semantic-based scheme reduces conflicts effectively and obtains considerable IPC improvement: with 8 access region caches, 25-33% higher IPC is achieved for integer benchmark programs than with a comparable 8-banked cache, while the benefit is smaller, at most 19%, for floating-point benchmark programs.
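The steering rule can be sketched as follows (illustrative only: the modulo initial assignment and the conflict-count threshold are assumptions standing in for the paper's reassignment mechanism):

```python
def steer(base_reg, region_map, conflict_counts, num_caches, threshold=64):
    """Choose an access-region cache from the base register *name* (its
    number in the instruction encoding), not its contents. A register is
    moved to the least-loaded cache once its conflicts pass a threshold."""
    if base_reg not in region_map:
        region_map[base_reg] = base_reg % num_caches       # initial assignment
    cache = region_map[base_reg]
    if conflict_counts.get(cache, 0) > threshold:
        cache = min(range(num_caches), key=lambda c: conflict_counts.get(c, 0))
        region_map[base_reg] = cache                       # reassignment
    return cache
```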

5.
In our previously published research we discovered some very hard-to-predict branches, called unbiased branches. Since the overall performance of modern processors is seriously affected by misprediction recovery, these difficult branches in particular are a source of important performance penalties. Our statistics show that about 28% of branches depend on critical Load instructions; moreover, 5.61% of branches are unbiased and also depend on critical Loads. Similarly, about 21% of branches depend on MUL/DIV instructions, and 3.76% are unbiased and depend on MUL/DIV instructions. These dependences cause high-penalty mispredictions, becoming serious performance obstacles and producing significant performance degradation through the execution of instructions from wrong paths. Therefore, the negative impact of (unbiased) branches on global performance should be attenuated by anticipating the results of long-latency instructions, including critical Loads. On the other hand, hiding instructions' long latencies in a pipelined superscalar processor is an important challenge in itself. We developed a superscalar architecture that selectively anticipates the values produced by high-latency instructions. In this work we focus on multiply, division, and loads that miss in the L1 data cache, implementing a dynamic instruction reuse scheme for the MUL/DIV instructions and a simple last-value predictor for the critical Load instructions. Our improved superscalar architecture achieves an average IPC speedup of 3.5% on the integer SPEC 2000 benchmarks and 23.6% on the floating-point benchmarks, with improvements in energy-delay product (EDP) of 6.2% and 34.5%, respectively. We also quantified the impact of the developed selective instruction reuse and value prediction techniques in a simultaneous multithreaded (SMT) architecture with per-thread reuse buffers and load value prediction tables. Our simulation results showed that the best improvements on the SPEC integer applications are obtained with 2 threads: an IPC speedup of 5.95% and an EDP gain of 10.44%. Although on the SPEC floating-point programs the highest improvements are obtained with the enhanced superscalar architecture, the SMT with 3 threads also provides an important IPC speedup of 16.51% and an EDP gain of 25.94%.
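A last-value predictor of the kind used here for critical Loads is simple to sketch (confidence counters and finite table indexing, which a real design would need, are omitted):

```python
class LastValuePredictor:
    """Predict a critical load's result as the last value it produced."""
    def __init__(self):
        self.table = {}                    # load PC -> last observed value

    def predict(self, pc):
        return self.table.get(pc)          # None means no prediction yet

    def update(self, pc, actual_value):
        self.table[pc] = actual_value      # train on the committed result
```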

6.
High-energy particles in space can easily cause soft errors in the register file (RF). As a critical structure in a processor, the RF often stores data for long periods of time and is read frequently, resulting in a higher probability of spreading corrupted data to other parts of the processor. Triple modular redundancy (TMR) is a common and effective fault-tolerance method that enables multi-bit error correction, but applying full TMR to all registers would cause excessive area and power overheads. Some registers in the RF have little impact on processor reliability, so there is no need to apply TMR to them. This paper designs an efficient strategy that rates the registers in the RF by their vulnerability. Based on the proposed strategy, a new RF fault-tolerance method named Partial-TMR is formulated, which selectively protects the more vulnerable registers against multi-bit errors and improves fault-tolerance efficiency. For the integer RF, Partial-TMR improves soft error correction capability by 24.5% relative to the baseline system and by 3% relative to ParShield; for the floating-point RF, the improvements are 5.17% and 0.58%, respectively. The soft error correction capability of Partial-TMR is slightly lower than that of full TMR, by 1% to 3%, but Partial-TMR significantly cuts the area and power overheads: compared with full TMR, it decreases them by 71.6% and 64.9%, respectively, with little impact on performance. Partial-TMR is thus a more cost-effective fault-tolerance method than ParShield and full TMR.
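The TMR correction step itself is a bitwise majority vote; the one-liner below shows why an upset confined to a single copy, even a multi-bit one, is corrected:

```python
def tmr_vote(a, b, c):
    """Bitwise majority over three copies of a register: each result bit
    is set iff at least two copies agree, so any error pattern confined
    to one copy is voted out."""
    return (a & b) | (a & c) | (b & c)

# Example: copy b suffers a 2-bit upset; the vote still recovers the value.
assert tmr_vote(0b1010, 0b1010 ^ 0b0110, 0b1010) == 0b1010
```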

7.
Register files in superscalar processors typically use multi-ported structures to support wide issue, which poses major challenges to register file speed, power, and area. A 64×64-bit multi-ported register file is designed that completes 8 read and 4 write operations within a single clock cycle. By improving the conventional single-ended read/write storage cell, a single-ended read/write cell combining power gating with floating bit lines is proposed, and all 12 read/write ports use transmission gates to speed up access. Simulated in HSPICE with PTM 90 nm, 65 nm, 45 nm, and 32 nm models, the proposed approach significantly improves register file performance over the conventional single-ended read/write structure: write-1 latency is reduced by more than 32%, total power by more than 45%, and cell stability is also clearly improved.

8.
Sima  D. 《Micro, IEEE》2000,20(5):70-83
Register renaming is a technique to remove false data dependencies, write after read (WAR) and write after write (WAW), that occur in straight-line code between register operands of subsequent instructions. By eliminating the related precedence requirements in the execution sequence of the instructions, renaming increases the average number of instructions available for parallel execution per cycle, which results in increased IPC (number of instructions executed per cycle). Identifying and exploring the design space of register renaming leads to a comprehensive understanding of this intricate technique. As this article shows, the design space of register renaming is spanned by four main dimensions: the scope of register renaming, the layout of the rename buffers, the method of register mapping, and the rename rate. Relevant aspects of the design space give rise to eight basic alternatives for register renaming. In addition, the operand fetch policy significantly affects how the processor carries out the rename process, which doubles the eight basic alternatives to 16 possible implementation schemes. The article indicates which basic implementation scheme is used in relevant superscalar processors. As register renaming is usually implemented in conjunction with shelving, the underlying microarchitecture is assumed to employ shelving.
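A minimal behavioral sketch of the renaming step (one of many organizations the article surveys; free-list allocation and an identity map at reset are assumptions) shows how a fresh destination register removes WAR and WAW dependencies:

```python
class RenameUnit:
    """Map architectural to physical registers at dispatch."""
    def __init__(self, num_arch, num_phys):
        self.map = {a: a for a in range(num_arch)}     # identity at reset
        self.free = list(range(num_arch, num_phys))    # unallocated phys regs

    def rename(self, srcs, dest):
        phys_srcs = [self.map[s] for s in srcs]  # sources read current map
        new_phys = self.free.pop(0)              # fresh destination register:
        self.map[dest] = new_phys                # later writers no longer
        return phys_srcs, new_phys               # clash with earlier readers
        # (reclaiming registers back to self.free is out of scope here)
```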

9.
Simultaneous multithreading (SMT) can execute instructions from different threads in the same clock cycle, exploiting both instruction-level and thread-level parallelism. Explicitly parallel instruction computing (EPIC) focuses on cooperation between the compiler and the hardware. Register file design is critical in high-performance processors, and the register stack and register stack engine are important means of improving its performance. This paper designs and implements a parallel environment comprising the parallel compiler OpenUH and EDSMT, an IA-64-based simultaneous multithreading architecture. Experiments show that the architecture suits most parallel applications; on the NAS parallel benchmarks it achieves an average performance improvement of 12.48% over SMTSIM.

10.
This paper proposes an enhanced method of multiple branch prediction using a per-primary-branch history table. This scheme improves on previous ones based on a single global branch history register by reducing interference among the histories of different branches caused by sharing a single register. It also keeps the prediction of one branch from affecting the predictions of other branches made in the same cycle, allowing independent and parallel prediction of multiple branches. Our experimental results indicate that these features achieve higher prediction accuracy than the previous global history scheme (which is already high) at lower hardware cost (96.1% vs. 95.1% for integer code and 95.7% vs. 94.9% for floating-point code including nasa7, for a given hardware budget of 128K bits). Moreover, the increased prediction accuracy yields better fetch bandwidth in a superscalar machine (7.1 vs. 6.9 instructions per clock cycle for integer code and 11.0 vs. 10.9 for floating-point code).
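The per-primary-branch idea can be sketched as a two-level predictor whose first-level history register is private to each primary branch (table sizes, the 2-bit counters, and the indexing below are illustrative assumptions):

```python
class PerPrimaryBranchPredictor:
    """Each primary branch keeps its own history register, so concurrent
    predictions in one cycle do not interfere through shared history."""
    def __init__(self, hist_bits=8):
        self.hist = {}                         # primary branch PC -> history
        self.pht = {}                          # (pc, history) -> 2-bit counter
        self.mask = (1 << hist_bits) - 1

    def predict(self, pc):
        h = self.hist.get(pc, 0)
        return self.pht.get((pc, h), 1) >= 2   # predict taken if counter >= 2

    def update(self, pc, taken):
        h = self.hist.get(pc, 0)
        c = self.pht.get((pc, h), 1)
        self.pht[(pc, h)] = min(3, c + 1) if taken else max(0, c - 1)
        self.hist[pc] = ((h << 1) | (1 if taken else 0)) & self.mask
```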

11.
张轩  李兆麟 《计算机工程》2007,33(20):248-250
A 6-read, 2-write multi-ported register file of 32×32 bits is implemented using a full-custom design methodology, covering architecture design, circuit design, layout design, simulation and verification, and modeling/library characterization. The read and write ports are mutually independent: within one clock cycle, six 32-bit values can be read out and two 32-bit values written simultaneously. In the circuit implementation, high-speed SCL address decoding and grouped word lines reduce read/write latency. The layout was implemented in a 0.18 µm six-metal P-well CMOS process and passed layout verification and post-layout simulation.

12.
In a recent study, we discovered that many single load/store operations in embedded applications can be parallelized and thus encoded simultaneously in a single-instruction multiple-data instruction, called the multiple load/store (MLS) instruction. In this work, we investigate the problem of utilizing MLS instructions to produce optimized machine code and propose an effective approach to it. Specifically, we formalize the MLS problem, that is, the problem of maximizing the use of MLS instructions with an unlimited register file size. Based on this analysis, we show that the problem can be solved efficiently by translating it into a variant of the problem of finding a maximum weighted path cover in a dynamic weighted graph. To handle the more realistic case of a finite register file, the solution is then extended to take into account the register-sequencing constraints of MLS instructions and the limited register resources of the target processor. We demonstrate the effectiveness of our approach experimentally using a set of benchmark programs. In summary, our approach reduces the number of loads/stores by 13.3% on average compared with the code generated by existing compilers, with a total code size reduction of 3.6%. This code size reduction comes at almost no cost, because the overall increase in compilation time due to our technique remains quite small. Copyright © 2007 John Wiley & Sons, Ltd.
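The path-cover formulation can be prototyped with a greedy heuristic (a sketch only, not the paper's algorithm: nodes stand for load/store operations, edge weights score how profitably two operations would merge into one MLS instruction, and the resulting vertex-disjoint paths become MLS groups):

```python
def greedy_path_cover(nodes, edges):
    """edges: list of (weight, u, v). Grow vertex-disjoint paths by taking
    heavy edges first, rejecting any edge that would give a node degree > 2
    or close a cycle (a union-find structure detects the latter)."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    degree = {n: 0 for n in nodes}
    chosen = []
    for w, u, v in sorted(edges, reverse=True):
        if degree[u] < 2 and degree[v] < 2 and find(u) != find(v):
            parent[find(u)] = find(v)          # join two path fragments
            degree[u] += 1
            degree[v] += 1
            chosen.append((u, v))
    return chosen                              # edges of the path cover
```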

13.
Alsup  M. 《Micro, IEEE》1990,10(3):48-66
The initial members of the 88000 family of high-performance 32-bit microprocessors are the 88100 processor and the 88200 cache and memory management unit (CMMU). The processor manipulates integer and floating-point data and initiates instruction and data memory transactions. The CMMU minimizes the latency of main memory requests by maintaining a cache for data transactions and a cache for memory management translations. A typical system consists of one processor and two identical cache chips, one servicing instruction fetch requests and the other servicing data read and write requests. The overall design process for the 88000 family is described, and the integer instructions are discussed. Decisions made with respect to the processor, cache, and software are examined, and some data on the use of the instruction set by the available compilers and on the efficiency of the cache and memory systems are presented.

14.
Combining vector processing logic with DRAM yields a vector PIM (V-PIM) architecture that can fully exploit the high bandwidth of processing-in-memory structures. The vector register file is a key V-PIM resource: its port count and capacity directly affect the vector processor's frequency and power. Designing a low-power, high-speed, multi-ported vector register file is therefore an important task in vector processor datapath design. This paper describes a design that builds a multi-ported vector register file by cross-interconnecting multiple register banks, each with few ports; experiments show that the multi-bank interleaved vector register file has clear advantages in power and area over a monolithic multi-ported structure.
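Bank interleaving of this sort is easy to model: the sketch below (the reg % num_banks mapping and single-cycle retry policy are assumptions) counts the extra cycles caused by bank port conflicts:

```python
def schedule_accesses(requests, num_banks, ports_per_bank):
    """Map each register access to bank = reg % num_banks and count the
    cycles needed when each bank serves at most ports_per_bank per cycle."""
    cycles = 0
    pending = list(requests)                   # register indices to access
    while pending:
        used = {}                              # bank -> ports used this cycle
        deferred = []
        for reg in pending:
            bank = reg % num_banks
            if used.get(bank, 0) < ports_per_bank:
                used[bank] = used.get(bank, 0) + 1
            else:
                deferred.append(reg)           # port conflict: retry next cycle
        pending = deferred
        cycles += 1
    return cycles
```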

15.
The memory system is the bottleneck when general-purpose processors handle streaming applications. Based on the FT64 stream processor architecture, this paper proposes a stream register file design and data transfer mechanism for streaming applications and analyzes its role in FT64. By using a large-capacity, high-bandwidth, virtually multi-ported memory, most stream data accesses are confined to the register file level, reducing the pressure on main memory. Experimental results show that the structure adapts well to the demands of streaming applications.

16.
To reduce register file power in embedded processors, a delayed register-writeback technique based on fetch throttling is proposed. In embedded processors, conventional delayed-writeback techniques yield little benefit; exploiting the fact that the processor front end runs faster than the back end, fetch throttling is used to amplify the effect of delayed writeback, which not only eliminates a large fraction of unnecessary register file writes but also reduces front-end power. Experimental results on an FPGA platform show that, with no loss of program performance, the technique reduces writes to the integer register file by 35% and accesses to the ICache by 15% for the EEMBC benchmarks, with no additional overhead.
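The write-filtering effect of delayed writeback can be sketched with a small drain buffer (the fixed drain delay and the squash-on-overwrite rule are illustrative assumptions; the paper's fetch-throttling interaction is not modeled):

```python
def delayed_writeback(events, delay=4):
    """events: ('write', reg) or ('tick', None), in program order.
    A buffered write that is overwritten before it drains is squashed and
    never reaches the register file; only drained writes are counted."""
    buffer = {}                       # reg -> cycles left before draining
    rf_writes = 0
    for op, reg in events:
        if op == 'write':
            buffer[reg] = delay       # a rewrite squashes the pending write
        else:                         # one clock tick
            for r in list(buffer):
                buffer[r] -= 1
                if buffer[r] == 0:
                    del buffer[r]
                    rf_writes += 1    # drained to the real register file
    return rf_writes
```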

17.
Hsu  P.Y.-T. 《Micro, IEEE》1994,14(2):23-33
Designed to efficiently support large, real-world, floating-point-intensive applications, the TFP (short for Tremendous Floating-Point) microprocessor is a superscalar implementation of the Mips Technologies architecture. This floating-point, computation-oriented processor uses a superscalar machine organization that dispatches up to four instructions each clock cycle to two floating-point execution units, two memory load/store units, and two integer execution units. Its split-level cache structure reduces cache misses by directing integer data references to a 16-Kbyte on-chip cache, while channeling floating-point data references off chip to a 4-Mbyte cache.

18.
Energy consumption and power dissipation are important concerns in the design of embedded systems, and they will become even more crucial with finer process geometries, higher frequencies, deeper pipelines, and wider-issue designs. In particular, the instruction cache consumes more energy than any other processor module, especially in the commonly used highly associative CAM-based implementations. Two energy-efficient approaches for highly associative CAM-based instruction cache designs are presented: a segmented wordline and a predictor-based instruction fetch mechanism. The latter exploits the fact that, because of taken branches, not all instructions in a given I-cache fetch are used. The proposed Fetch Mask Predictor unit determines which instructions of a cache access will actually be used, so that none of the other instructions need be fetched. Both approaches are evaluated for an embedded 4-wide-issue processor in 100 nm technology. Experimental results show average I-cache energy savings of 48% and overall processor energy savings of 19%.
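A fetch mask predictor can be sketched as a small table that remembers, per fetch block, which slots were used before the last taken branch (the indexing and training policy below are assumptions):

```python
class FetchMaskPredictor:
    """Predict which slots of a fetch block will actually be used; slots
    after a predicted-taken branch are masked off and never fetched."""
    def __init__(self, fetch_width=4):
        self.table = {}                        # block address -> usage mask
        self.full = (1 << fetch_width) - 1     # default: fetch everything

    def predict(self, block_addr):
        return self.table.get(block_addr, self.full)

    def update(self, block_addr, taken_slot):
        # Train: keep slots up to and including the taken branch.
        self.table[block_addr] = (1 << (taken_slot + 1)) - 1
```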

19.
This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions out of order every cycle. By supporting better interthread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deallocation. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they have limited knowledge of when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to: 1) free registers immediately upon their last use, and 2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs.
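Extension 1), freeing a physical register at its last use, can be sketched as follows (the compiler-provided last-use bit per source operand is the proposed extension; the surrounding structures are illustrative):

```python
from collections import namedtuple

# sources: list of (arch_reg, is_last_use) pairs marked by the compiler.
Instr = namedtuple('Instr', 'sources')

def free_on_last_use(instr, rename_map, free_list):
    """Reclaim a physical register immediately at its last use, instead of
    waiting until its mapping is overwritten and the overwriting
    instruction commits (the usual deferred reclamation)."""
    for arch_reg, is_last_use in instr.sources:
        phys = rename_map[arch_reg]       # operand would be read from phys
        if is_last_use:
            free_list.append(phys)        # early deallocation
```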

20.
A recent Sparc (scalable processor architecture) implementation consists of a two-chip configuration containing the TMS390C601 integer unit (IU) and the TMS390C602A floating-point unit (FPU). The second device, an innovative coprocessor that lets the processor execute single- or double-precision floating-point instructions concurrently with IU operations, is described. Dedicated floating-point hardware in the FPU increases system performance. Running at clock periods as small as 20 ns, the chip should deliver 5.5 million double-precision floating-point operations per second on the Linpack benchmark (at a 50-MHz clock rate). The FPU provides single- and double-precision arithmetic functions: addition, subtraction, multiplication, division, square root, compare, and convert. To minimize math unit latency, the FPU uses a highly parallel architecture with separate math units optimized for additions and multiplications. Traps stop the execution of a program to jump to a software routine for handling data-dependent errors or to execute instructions not implemented in hardware. Benchmark results are presented.

