Similar literature
 Found 20 similar articles (search time: 453 ms)
1.
Three-dimensional (3-D) chip stacking technology provides a new approach to address the so-called memory wall problem. Memory-processor chip stacking reduces this memory wall problem, permitting faster clock rates (with suitable processor logic) or permitting multicore access to shared memory using a large number of vertical vias between tiers in the stack, for ultrawide bit path transfer of data and address information to and from various levels of cache. Although a limited amount of parallel access is possible using conventional two-dimensional (2-D) chip memory-processor approaches, 3-D memory-processor stacking greatly extends this to much larger capacity memories. We evaluate high-clock-rate processors as well as shared memory processors with a large number of cores. Various architectural design options to reduce the impact of the memory wall on the processor performance are explored and validated through simulations. Certain architectural features can be implemented in a 3-D chip, such as an ultrawide, ultrashort vertical bus with low parasitic resistance and the elimination of the conventional electrostatic discharge and packaging parasitics required in multiple-package 2-D solutions. The objective is to reduce the clocks per instruction figure of merit for high clock speeds in order to deliver significant performance levels. High-clock-rate processors can be designed with SiGe heterostructure bipolar transistors to obtain processors operating on the order of 16 or 32 GHz.

2.
In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32-bit embedded core AE32000C and synthesize it on a Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.

3.
Current high-end microprocessor designs focus on increasing instruction parallelism and clock frequency at the expense of power dissipation. This paper presents a case study of a different direction, a chip multiprocessor (CMP) with a smaller processor core than a baseline high-end 130-nm 64-bit SPARC server uniprocessor. We demonstrate that the size of the baseline processor core can be reduced by 2/3 using a combination of logical resource reduction and dense custom macros while still delivering about 70% of the TPC-C performance. Circuit speed is traded for power reduction by reducing the power supply from 1.0 to 0.8 V and increasing transistor channel lengths by 12.5% above the minimum. The resulting CMP with six reduced size cores and 4-MB L2 cache is estimated to run at 1.8 GHz while consuming less than 30% of the power compared to the scaled baseline dual-core processor running at 2.4 GHz. The proposed CMP is more than four times higher in TPC/W than the dual-core processor, facilitating the design of high-density servers.

4.
We report the first implementation of the new SPARC V9 64-b instruction set architecture. The HaL processor, called SPARC64, is a ceramic Multi-Chip Module (MCM) that contains one CPU chip, one Memory Management Unit (MMU) chip, and four 64 KB Cache chips. Together, they implement a unique three-level address translation scheme that efficiently supports using virtual addresses spread anywhere in the full 64-b address range. The processor assigns a serial number to each issued instruction to track up to 64 in-progress instructions and can speculatively issue through up to 16 branches. It issues up to 4 instructions per cycle and utilizes superscalar instruction issue, register renaming, and dataflow (potentially out-of-order) execution to fully exploit instruction-level parallelism. The processor maintains a precise-state execution model and commits in order, up to 9 instructions in a cycle. In a HaL R1 system, a production SPARC64 running at 143 MHz has a performance of 230 SPECint92 and 300 SPECfp92 and dissipates 50 W from a 3.3 V supply.

5.
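Research and Implementation of an Embedded Flash CISC/DSP Microprocessor
卢结成, 丁丁, 丁晓兵, 朱少华. 《电子学报》 (Acta Electronica Sinica), 2003, 31(8): 1252-1254
This paper studies an implementation architecture for a new high-performance microprocessor that offers both microcontroller functions and enhanced DSP functions. Under a unified, enhanced CISC instruction set, a microprocessor module based on a Harvard, register-register architecture is integrated with DSP function modules (a single-cycle multiplier/accumulator, a barrel shifter, hardware support for zero-overhead loops and jumps, and a hardware address generator) together with embedded Flash memory and an instruction queue buffer. The CISC/DSP microprocessor is thus realized as a single core under a unified architecture, which effectively improves processor performance. The microprocessor is implemented in a 0.35-μm CMOS process with a chip area of 25 mm². At an operating frequency of 80 MHz, the dynamic power consumption is 425 mW and the peak data processing capability reaches 80 MIPS. The processor core can meet the demand of systems-on-chip (SoC) for high-performance processors.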

6.
A dynamic programming processor with parallel and pipeline architecture is described. A 2-μm CMOS technology was applied to the DP processor, which is composed of 127309 transistors on a 7.17×8.62-mm² die and is housed in an 84-pin PLCC (plastic leaded chip carrier) or PGA (pin grid array) package. The clock frequency is 20 MHz, and the instruction cycle time is 100 ns. Precise electrical simulations permitted the safe use of nonstandard logic and area and power reduction. Implementation of a direct access to all internal registers has proven useful for chip test and software development. A system using one DP processor has given very good results on a wide variety of applications and 0.48% error rate on tests with standard NATO tapes. These results are significantly better than those published for other systems on the same tests.

7.
For mobile intelligent robot applications, an 81.6 GOPS object recognition processor is implemented. Based on an analysis of the target application, the chip architecture and hardware features are decided. The proposed processor aims to support both task-level and data-level parallelism. Ten processing elements are integrated for the task-level parallelism, and a single instruction multiple data (SIMD) instruction is added to exploit the data-level parallelism. The Memory-Centric network-on-chip (NoC) is proposed to support efficient pipelined task execution using the ten processing elements. It also provides coherence and consistency schemes tailored for 1-to-N and M-to-1 data transactions in a task-level pipeline. For further performance gain, the visual image processing memory is also implemented. The chip is fabricated in a 0.18-μm CMOS technology and computes the key-point localization stage of the SIFT object recognition twice as fast as a 2.3 GHz Core 2 Duo processor.

8.
The microarchitecture of the synergistic processor for a cell processor
This paper describes an 11 FO4 streaming data processor in the IBM 90-nm SOI-low-k process. The dual-issue, four-way SIMD processor emphasizes achievable performance per area and power. Software controls most aspects of data movement and instruction flow to improve memory system performance and core performance density. The design minimizes instruction latency while providing for fine grain clock control to reduce power.

9.
A 4-b data processor with a 16-instruction set and 1-kb external-RAM access capability has been designed, fabricated, and tested. Each instruction is processed by a three-stage pipeline of instruction fetch, data fetch, and decode/execute. The chip is operable under a 1-GHz clock, and it has a peak performance of 1 GIPS. The fabrication process is 2.5-μm-rule Nb/AlOx/Nb. An interface circuit to access the all-DC-powered 1-kb external-RAM chip is installed. The AC power is utilized with both polarities in each of the four blocks, thus realizing an eightfold serial power supply. Power consumption is 40 mW. Half of the function tests have been completed at low frequency (10 kHz). Part of the processor operated at 1 GHz.

10.
《IEE Review》2001,47(3):38-40
The author reports on the theory and practice behind an innovative 32 bit RISC processor core, whose architecture can be customised to provide the optimum design solution for processor-based application-specific integrated circuits. In its basic configuration, the ARC processor, the Tangent-A4, is a four-stage pipeline device, with instruction, data and address formats all 32 bit. It also boasts separate instruction and data buses (Harvard architecture), data and instruction caches, and a unique host interface (parallel, JTAG or user defined), giving external devices access to the internal registers and memory.

11.
This article proposes the design and architecture of a dynamically scalable dual-core pipelined processor. The design methodology is core fusion: two independent cores can dynamically morph into a larger processing unit, or they can be used as distinct processing elements, to achieve both high sequential performance and high parallel performance. The processor provides two execution modes. Mode 1 is a multiprogramming mode for executing instruction streams of lower data width, i.e., each core performs 16-bit operations individually. Performance improves in this mode because instructions execute in parallel on both cores, at the cost of area. In mode 2, the two processing cores are coupled and behave like a single processing unit of higher data width, i.e., one that can perform 32-bit operations. Additional core-to-core communication is needed to realise this mode. The mode can be switched dynamically; therefore, the processor can provide multiple functions with a single design. Design and verification of the processor were carried out successfully in Verilog on the Xilinx 14.1 platform. The processor was verified in both simulation and synthesis with the help of test programs. The design is intended to be implemented on a Xilinx Spartan 3E XC3S500E FPGA.
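
The two execution modes described in this abstract can be illustrated with a small behavioural sketch. It is only an assumption about how a fused 32-bit addition might be composed from two 16-bit adders (the low core passes its carry to the high core); the abstract does not describe the actual datapath, and the names `add16`, `split_mode_add`, and `fused_mode_add` are hypothetical.

```python
MASK16 = 0xFFFF  # width of one core's datapath, per the abstract


def add16(a, b, carry_in=0):
    """Model one core's 16-bit adder: return (16-bit sum, carry out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16


def split_mode_add(a_pair, b_pair):
    """Mode 1 (assumed behaviour): each core adds its own 16-bit operands."""
    (a0, a1), (b0, b1) = a_pair, b_pair
    return add16(a0, b0)[0], add16(a1, b1)[0]


def fused_mode_add(a, b):
    """Mode 2 (assumed behaviour): the cores cooperate on one 32-bit add.

    The low core adds the low halves; its carry out is the
    core-to-core communication consumed by the high core.
    """
    low, carry = add16(a, b)
    high, _ = add16(a >> 16, b >> 16, carry)
    return (high << 16) | low
```

In this sketch the only inter-core signal is the carry bit, which is one plausible reading of the "additional core-to-core communication" the abstract mentions.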

12.
The memory hierarchy of high-performance and embedded processors has been shown to be one of the major energy consumers. For example, the Level-1 (L1) instruction cache (I-Cache) of the StrongARM processor accounts for 27% of the power dissipation of the whole chip, whereas the instruction fetch unit (IFU) and the I-Cache of Intel's Pentium Pro processor are the single most important power consuming modules with 14% of the total power dissipation [2]. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip. In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the I-Cache and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, it can effectively eliminate the need for high utilization of the more expensive I-Cache. We propose, implement, and evaluate five techniques for dynamic analysis of the program instruction access behavior, which is then used to proactively guide the access of the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes. Results for these benchmarks indicate that more than 60% of the dissipated energy in the I-Cache subsystem can be saved.

13.
The first implementation of the MAJC architecture achieves high performance by using very long instruction word (VLIW), single instruction multiple data (SIMD), and chip multiprocessing. The chip integrates two processors, a memory controller, two high-speed parallel I/O interfaces, and a PCI controller. The chip, fabricated in a 0.22-μm CMOS process with six layers of copper interconnect, contains 13 million transistors and operates at 500 MHz. It is packaged in a 624-pin ceramic column grid array using flip-chip assembly technology.

14.
A scalable, distributed processor architecture is presented that emphasizes high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of parallel processing on a chip. The architecture is based on a superscalar processor model with a modified Tomasulo scheme that was extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads [simultaneously multi-threaded (SMT)]. Consistent application of fine clustering reduces the cycle time for wire-sensitive building blocks of the processor, like the register file and the scheduling window, and leads to a distributed architecture model, where independent thread processing units, arithmetic logic units, register files and memories are distributed across the chip and communicate with each other through a special network. A special communication protocol replaces broadcasting and associative compare of destination tags in a centralised instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. This makes the performance of the architecture scalable with both the number of function units and the number of thread units without any impact on the processor's cycle time. Performance and scalability of the proposed microarchitecture are demonstrated with critical signal processing kernels from the MPEG-4 video coding standard on a cycle-true simulator.
Tim Niggemeier

15.
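This paper proposes a new thread-level dynamic scheduling model for heterogeneous multicore architectures based on configurable processors. In such heterogeneous multicore systems, each core has its instruction set extended for a particular application. Using the mapping among threads, processor cores, and instruction sets, the scheduler dynamically dispatches each thread to a suitable core, thereby achieving a speedup close to that of a system in which every core has the fully extended instruction set, without a large increase in chip area. The model can also effectively reduce the complexity of the programming model.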
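
The thread-to-core mapping idea in this abstract can be sketched roughly as follows. This is an illustration under assumptions, not the paper's scheduler: the function name `schedule`, the least-loaded tie-breaking rule, and the fallback to any core when no core carries the required extension are all hypothetical choices that the abstract does not specify.

```python
def schedule(threads, cores):
    """Place each thread on a core whose instruction-set extension suits it.

    threads: dict mapping thread name -> name of the extension it benefits from
    cores:   dict mapping core id -> set of extensions that core implements
    Returns a dict mapping thread name -> chosen core id.
    """
    load = {core: 0 for core in cores}  # threads placed on each core so far
    placement = {}
    for thread, ext in threads.items():
        # Cores that implement the extension this thread wants.
        capable = [c for c, exts in cores.items() if ext in exts]
        # Assumption: if no core has the extension, any core may run the thread.
        candidates = capable or list(cores)
        # Assumption: prefer the least-loaded candidate core.
        chosen = min(candidates, key=lambda c: load[c])
        placement[thread] = chosen
        load[chosen] += 1
    return placement
```

For example, with `cores = {0: {"fft"}, 1: {"crypto"}}`, a thread that benefits from the `"crypto"` extension is always placed on core 1, while a thread needing an extension that no core provides goes to whichever core currently has fewer threads.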

16.
This paper describes circuits used to implement the motion video instructions (MVI) in the 550-MHz Alpha 21164PC Microprocessor. The chip is fabricated in a 0.35-μm CMOS process and is the first implementation of the MVI instruction set in the Alpha architecture. The MVI instruction set, coupled with the high performance of the 550-MHz processor, delivers 30 frames/s digital video disk (DVD) playback with stereo-quality audio, enables video teleconferencing at 30 frames/s, and significantly improves the performance of MPEG encode algorithms.

17.
This paper describes a 51.2-GOPS video recognition processor, which achieves real-time multiple processing of in-vehicle video recognition applications in software, while at the same time satisfying power efficiency requirements of an in-vehicle device. The chip integrates 128 RISC microprocessors, each operating at 100 MHz, into a single chip. Hardware configurations of the chip are enhanced for supporting efficient execution of extended C language codes of algorithms based on four basic parallel methods. The results of a benchmark test using a weather-robust lane mark and vehicle detection application show that the processor achieves a four times better performance while it consumes less than 1/20 of peak power consumption compared with a 2.4-GHz general-purpose processor.
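
The headline figures in this abstract can be cross-checked with simple arithmetic. The sketch below only divides the quoted peak throughput by the aggregate clock rate; the function name is hypothetical, and the abstract does not say how the resulting per-cycle figure is achieved in hardware.

```python
def ops_per_core_per_cycle(peak_ops_per_s, num_cores, clock_hz):
    """Peak operations each core must complete per clock cycle."""
    return peak_ops_per_s / (num_cores * clock_hz)


# Figures quoted in the abstract: 51.2 GOPS, 128 cores, 100 MHz each.
result = ops_per_core_per_cycle(51_200_000_000, 128, 100_000_000)
print(result)  # 4.0
```

So the quoted 51.2 GOPS corresponds to four operations per cycle on each of the 128 processors at 100 MHz.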

18.
A 32 bit call-handling processor for an electronic switching system (ESS) capable of a 5.6 MIPS instruction execution rate is discussed. The processor uses a mixed architecture consisting of a reduced instruction set computer (RISC) and a complex instruction set computer (CISC) to economize the instruction execution, and features a four-stage two-way pipeline and local storage for the RISC and writable control storage for the CISC. To obtain reliability, availability, and serviceability, such functions as parity check/generation, microdiagnostic, and matcher have been incorporated within the chip. The chip contains about 160 K transistors within a chip size of 13.2×13.7 mm². A 1.2 μm double-metal CMOS technology has been used. In designing the chip layout, a compromise between manual and automatic placing or routing was adopted which enabled a reasonably short design time.

19.
A programmable 8-b digital signal processor core with an instruction cycle time of 20 ns is developed. A 37.5-mm chip is fabricated by advanced 1.0-μm double-level-metal CMOS technology. This processor has a reconfigurable high-speed data path supporting several multiply/accumulate functions, including 16-tap linear-phase transversal filtering, high-speed adaptive filtering, and eight-point discrete cosine transformation. To provide high-speed operation within the chip, a programmable phase-locked loop circuit is built on the chip. This circuit generates a high-speed clock, which is a multiple of the system clock fed from outside, and is synchronized to the system clock.

20.
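Parallel computing is an important direction in the development of high-performance computing. As the demand for processing capability in signal processing, communications, and other fields keeps growing, parallel development techniques for DSPs have also advanced rapidly. Multi-device parallelism and on-chip multicore approaches can effectively improve processing performance. Compared with a traditional single-core DSP, multicore parallel processing requires multitask parallel design, which makes system design more complex. After examining the key techniques for developing signal processing on an 8-core processor, this paper designs a multicore parallel signal processing mode based on the Round-Robin scheme and discusses multicore synchronization, cache coherence, and parallel task allocation.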
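
Round-Robin distribution of work across cores, as named in this abstract, can be sketched in a few lines. This is a generic illustration rather than the paper's design: the function name `round_robin_assign` and the representation of tasks as list items are assumptions, and synchronization and cache coherence are not modelled.

```python
from itertools import cycle


def round_robin_assign(tasks, num_cores=8):
    """Hand out tasks to cores in strict rotation: 0, 1, ..., num_cores-1, 0, ...

    Returns a dict mapping core index -> list of tasks given to that core.
    The default of 8 cores mirrors the 8-core processor in the abstract.
    """
    assignment = {core: [] for core in range(num_cores)}
    for task, core in zip(tasks, cycle(range(num_cores))):
        assignment[core].append(task)
    return assignment
```

With ten tasks numbered 0 to 9 and eight cores, cores 0 and 1 each receive two tasks (0 and 8, 1 and 9) and the remaining cores receive one each.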
