1.

SPARC64: a 64-b 64-active-instruction out-of-order-execution MCM processor

Williams T. Patkar N. Shen G. 《Solid-State Circuits, IEEE Journal of》1995,30(11):1215-1226

We report the first implementation of the new SPARC V9 64-b instruction set architecture. The HaL processor called SPARC64 is a ceramic Multi-Chip Module (MCM) that contains one CPU chip, one Memory Management Unit (MMU) chip, and four 64 KB Cache chips. Together, they implement a unique three-level address translation scheme that efficiently supports using virtual addresses spread anywhere in the full 64-b address range. The processor assigns a serial number to each issued instruction to track up to 64 in-progress instructions and can speculatively issue through up to 16 branches. It issues up to 4 instructions per cycle and utilizes superscalar instruction issue, register renaming, and dataflow (potentially out-of-order) execution to fully exploit instruction-level parallelism. The processor maintains a precise-state execution model, and commits in-order, up to 9 instructions in a cycle. In a HaL R1 system, a production SPARC64 running at 143 MHz has a performance of 230 SPECint92 and 300 SPECfp92 and dissipates 50 W from a 3.3 V supply.
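
Note: the serial-number tracking and in-order commit described in this abstract can be illustrated with a toy model. The sketch below is not the SPARC64 design; only the three limits (64 in flight, 4 issued per cycle, 9 committed per cycle) come from the abstract. The class name `InFlightWindow`, its methods, and the use of unbounded (non-wrapping) serial numbers are assumptions made for illustration.

```python
from collections import deque

MAX_IN_FLIGHT = 64  # abstract: up to 64 in-progress instructions
MAX_ISSUE = 4       # abstract: up to 4 instructions issued per cycle
MAX_COMMIT = 9      # abstract: up to 9 instructions committed per cycle


class InFlightWindow:
    """Hypothetical model: out-of-order completion, in-order commit."""

    def __init__(self):
        self.next_serial = 0   # assumption: serials never wrap in this toy model
        self.window = deque()  # serial numbers, oldest first
        self.done = set()      # serials that have finished executing

    def issue(self, count):
        """Issue up to `count` instructions this cycle; return their serials."""
        room = MAX_IN_FLIGHT - len(self.window)
        n = min(count, MAX_ISSUE, room)
        serials = list(range(self.next_serial, self.next_serial + n))
        self.window.extend(serials)
        self.next_serial += n
        return serials

    def complete(self, serial):
        """Mark an instruction as executed; completion order is arbitrary."""
        self.done.add(serial)

    def commit(self):
        """Retire finished instructions, but only from the oldest end."""
        retired = []
        while (self.window and len(retired) < MAX_COMMIT
               and self.window[0] in self.done):
            serial = self.window.popleft()
            self.done.discard(serial)
            retired.append(serial)
        return retired
```

In this model an instruction that finishes early stays in the window until every older instruction has also finished, which is what keeps the committed state precise.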

2.

A 200-MHz 64-b dual-issue CMOS microprocessor

Dobberpuhl D.W. Witek R.T. Allmon R. Anglin R. Bertucci D. Britton S. Chao L. Conrad R.A. Dever D.E. Gieseke B. Hassoun S.M.N. Hoeppner G.W. Kuchler K. Ladd M. Leary B.M. Madden L. McLellan E.J. Meyer D.R. Montanaro J. Priore D.A. Rajagopalan V. Samudrala S. Santhanam S. 《Solid-State Circuits, IEEE Journal of》1992,27(11):1555-1567

A 400-MIPS/200-MFLOPS (peak) custom 64-b VLSI CPU is described. The chip is fabricated in a 0.75-μm CMOS technology utilizing three levels of metalization and optimized for 3.3-V operation. The die size is 16.8 mm×13.9 mm and contains 1.68 M transistors. The chip includes separate 8-kbyte instruction and data caches and a fully pipelined floating-point unit (FPU) that can handle both IEEE and VAX standard floating-point data types. It is designed to execute two instructions per cycle among scoreboarded integer, floating-point, address, and branch execution units. Power dissipation is 30 W at 200-MHz operation.

3.

A four-processor building block for SIMD processor arrays

Fisher A.L. Highnam P.T. Rockoff T.E. 《Solid-State Circuits, IEEE Journal of》1990,25(2):369-375

A four-processor chip, for use in processor arrays for image computations, is described. The large degree of data parallelism available in image computations allows dense array implementations where all processors operate under the control of a single instruction stream. An instruction decoder shared by the four processors on the chip minimizes the pin count allocated for global control of the processors. The chip incorporates an interface to an external SRAM (static RAM) for memory expansion without glue chips. The full-custom 2-μm CMOS chip contains 56669 transistors and runs instructions at 10 MHz. Five hundred and twelve 16-b processors and 4 Mbyte of distributed external memory fit on two industry standard cards to yield 5 billion instructions per second peak throughput. As image I/O can overlap perfectly with pixel computation, an array containing 128 of these chips can provide more than 600 16-b operations per pixel on 512×512 images at 30 Hz.
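
Note: the throughput figures in this abstract can be checked with a few lines of arithmetic. The sketch below assumes one instruction per processor per clock cycle, which the abstract implies but does not state; the function names are illustrative only.

```python
def peak_instructions_per_second(chips, processors_per_chip, clock_hz):
    # Assumption: every processor retires one instruction per clock cycle.
    return chips * processors_per_chip * clock_hz


def operations_per_pixel(peak_ips, width, height, frame_rate_hz):
    # Instruction budget available for each pixel of each frame.
    return peak_ips / (width * height * frame_rate_hz)


# Figures taken from the abstract: 128 chips, 4 processors each, 10 MHz.
peak = peak_instructions_per_second(chips=128, processors_per_chip=4, clock_hz=10e6)
per_pixel = operations_per_pixel(peak, width=512, height=512, frame_rate_hz=30)

print(f"{peak:.3g} instructions/s peak")
print(f"{per_pixel:.0f} operations per pixel")
```

Under that assumption the array peaks at 5.12 billion instructions per second, or roughly 651 operations per pixel, which agrees with the abstract's "5 billion" and "more than 600".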

4.

A 300-MHz 115-W 32-b bipolar ECL microprocessor

Jouppi N.P. Boyle P. Dion J. Doherty M.J. Eustace A. Haddad R.W. Mayo R. Menon S. Monier L.M. Stark D. Turrini S. Yang J.L. Hamburgen R. Fitch J.S. Kao R. 《Solid-State Circuits, IEEE Journal of》1993,28(11):1152-1166

A full-custom single-chip bipolar ECL RISC microprocessor was implemented in a 1.0-μm single-poly bipolar technology. This research prototype contains a CPU and on-chip 2-KB instruction and 2-KB data caches. Worst-case power dissipation with a nominal -5.2 V supply is 115 W. The chip has been designed for a worst-case clock frequency of 275 MHz at a nominal supply. The chip verifies a new style of CAD tools developed during the design process, advanced packaging techniques for high-power microprocessors, and VLSI ECL circuit techniques.

5.

A zero-overhead self-timed 160-ns 54-b CMOS divider

Williams T.E. Horowitz M.A. 《Solid-State Circuits, IEEE Journal of》1991,26(11):1651-1661

The authors describe the design of a custom integrated circuit for the arithmetic operation of division. The chip uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches. Internal stages form a ring that cycles without any external signaling. The self-timed control introduces no serial overhead, making the total chip latency equal just the combinational logic delays of the data elements. The ring's data path uses embedded completion encoding and generates the mantissa of a 54-b (floating-point IEEE double-precision) result. Fabricated in 1.2-μm CMOS, the ring occupies 7 mm² and generates a quotient and done indication in 45 to 160 ns, depending on the particular data operands.

6.

A 32-b RISC/DSP microprocessor with reduced complexity (total citations: 2; self-citations: 0; citations by others: 2)

Dolle M. Jhand S. Lehner W. Muller O. Schlett M. 《Solid-State Circuits, IEEE Journal of》1997,32(7):1056-1066

This paper presents a new 32-b reduced instruction set computer/digital signal processor (RISC/DSP) architecture which can be used as a general purpose microprocessor and in parallel as a 16-/32-b fixed-point DSP. This has been achieved by using RISC design principles for the implementation of DSP functionality. A DSP unit operates in parallel to an arithmetic logic unit (ALU)/barrel shifter on the same register set. This architecture provides the fast loop processing, high data throughput, and deterministic program flow absolutely necessary in DSP applications. Besides offering a basis for general purpose and DSP processing, the RISC philosophy offers a higher degree of flexibility for the implementation of DSP algorithms and achieves higher clock frequencies compared to conventional DSP architectures. The integrated DSP unit provides instruction set support for highly specialized DSP algorithms. Subword processing optimized for DSP algorithms has been implemented to provide maximum performance for 16-b data types. While creating a unified base for both application areas, we also minimized transistor count and we reduced complexity by using a short instruction pipeline. A parallelism concept based on a varying number of instruction latency cycles made superscalar instruction execution superfluous.

7.

Low-power design of 8-b embedded CoolRisc microcontroller cores

Piguet C. Masgonty J.-M. Arm C. Durand S. Schneider T. Rampogna F. Scarnera C. Iseli C. Bardyn J.-P. Pache R. Dijkstra E. 《Solid-State Circuits, IEEE Journal of》1997,32(7):1067-1078

Low-power and low-voltage embedded microcontrollers are required more and more for portable applications. Power reduction can be addressed at the software level as well as at the architecture level by seeking to reduce the number of executed instructions for a given task. An 8-b RISC-like pipelined microcontroller family is presented achieving one clock per instruction. It is compared to various architectures of existing 8-b microcontrollers. According to an efficiency model taking into account the architecture as well as the number of registers, the presented 8-b microcontroller cores provide four to ten times better performance than existing microcontrollers. On one hand, the operating frequency can be reduced to execute a given task in the same execution time. On the other hand, delivering 10 MIPS performance, more than 2000 MIPS/W can be achieved at 3 V.
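
Note: the efficiency figure in this abstract translates directly into a power budget and an energy per instruction. The sketch below only rearranges the two numbers quoted (10 MIPS and 2000 MIPS/W); the function names are illustrative, and since the abstract says "more than 2000 MIPS/W", the results are upper bounds.

```python
def power_watts(mips, mips_per_watt):
    # Power needed to sustain `mips` at the given efficiency.
    return mips / mips_per_watt


def energy_per_instruction_joules(mips_per_watt):
    # 1 MIPS/W is one million instructions per joule.
    return 1.0 / (mips_per_watt * 1e6)


core_power = power_watts(mips=10, mips_per_watt=2000)
energy = energy_per_instruction_joules(mips_per_watt=2000)

print(f"{core_power * 1e3:.1f} mW at 10 MIPS")
print(f"{energy * 1e9:.2f} nJ per instruction")
```

At the quoted figures the core draws at most about 5 mW while delivering 10 MIPS, or at most about 0.5 nJ per instruction.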

8.

The design of a fully integrated graphics system

Torrance R.R. Sadler S.P. Lamoureux J.P. Lamarche J. Frank D.J. Leveille F. 《Solid-State Circuits, IEEE Journal of》1988,23(2):368-376

The architecture and design of a CMOS chip implementing a medium-resolution graphics system are described. The chip, requiring no external support logic, outputs analog RGB signals at a 40-MHz pixel rate and directly controls a bit-map video RAM (VRAM) memory array. Scan rates and display formats are completely programmable. Pixels stored in the 1 K×1 K bit map can be any of 16 colors taken from a 4096-color palette. The chip can be directly interfaced to most common microprocessors. A 6.7-MIPS (million-instruction-per-second) internal reduced instruction set computer (RISC) CPU directly implements high-level graphics commands. The chip achieves a maximum draw speed of 10 million pixels/s. Designed in a Lisp machine environment, the 100000-transistor chip is implemented in 1.8-μm CMOS and contains standard cells, RAM, ROM, a color table, and three four-bit current-steered digital-to-analog converters (DACs).

9.

Energy-Efficient Dynamic Instruction Scheduling Logic Through Instruction Grouping

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(6):848-852

Dynamic instruction scheduling logic is quite complex and dissipates significant energy in microprocessors that support superscalar and out-of-order execution. We propose a novel microarchitectural technique to reduce the complexity and energy consumption of the dynamic instruction scheduling logic. The proposed method groups several instructions as a single issue unit and reduces the required number of ports and the size of the structure. This paper describes the microarchitecture mechanisms and shows evaluation results for energy savings and performance. These results reveal that the proposed technique can greatly reduce energy with almost no performance degradation, compared to the conventional dynamic instruction scheduling logic.

10.

A bit-stream digital-to-analog converter with 18-b resolution

Kup B.M.J. Dijkmans E.C. Naus P.J.A. Sneep J. 《Solid-State Circuits, IEEE Journal of》1991,26(12):1757-1763

A two-channel differential 18-b bit-stream digital-to-analog converter implemented in a 2.5-μm BiCMOS process with 7.3-mm² chip area is described. The circuit contains two identical 1-b D-to-A conversion channels on one single chip. Each channel consists of a digital input part, a switched-capacitor D-to-A network, and two high-performance operational amplifiers. Total harmonic distortion plus noise is better than -102 dB. The dynamic range is more than 108 dB, giving a true 18-b resolution. Output operational-amplifier distortion is below 120 dB at full-scale signal. System design considerations, implementation, and measured results of the IC are discussed.

11.

A 6.7-MFLOPS floating-point coprocessor with vector/matrix instructions

《Solid-State Circuits, IEEE Journal of》1989,24(5):1324-1330

An 80-bit floating-point coprocessor which implements 24 vector/matrix instructions and 22 mathematical functions is described. This processor can execute floating-point addition/rounding and pipelined multiplication concurrently, under the control of horizontal-type microinstructions. The SRT division method and CORDIC trigonometrical algorithm are used for a favorable cost/performance implementation. The performance of 6.7 MFLOPS in the vector-matrix multiplication at 20 MHz has been attained by the use of parallel operations. The vector/matrix instruction is about three times faster than conventional add and multiply instructions. The chip has been fabricated in a 1.2-μm double-metal-layer CMOS process containing 433000 transistors on an 11.6 × 14.9-mm² die.
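
Note: the abstract names CORDIC but gives no detail of the chip's implementation. The sketch below is the textbook rotation-mode CORDIC, written in floating point for readability; it is not the coprocessor's microcode. Hardware versions typically use fixed-point shifts and adds in place of the multiplications by powers of two, and the iteration count of 32 and the function name `cordic_sin_cos` are assumptions.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Return (sin, cos) of theta, for theta roughly within [-pi/2, pi/2]."""
    # Elementary rotation angles: atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector; precompute the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Start on the x-axis, pre-scaled so the final vector has unit length.
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotate toward the remaining angle
        shift = 2.0 ** -i             # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    return y, x


s, c = cordic_sin_cos(0.5)
print(s, c)
```

Each iteration contributes roughly one more bit of accuracy, which is why the method suits a cost-sensitive design: it needs only adders, shifters, and a small table of angles.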

12.

A new carry-free division algorithm and its application to a single-chip 1024-b RSA processor

Vandemeulebroecke A. Vanzieleghem E. Denayer T. Jespers P.G.A. 《Solid-State Circuits, IEEE Journal of》1990,25(3):748-756

A carry-free division algorithm is described. It is based on the properties of redundant signed digit (RSD) arithmetic to avoid carry propagation and uses the minimum hardware per bit, i.e. one full adder. Its application to a 1024-b RSA (Rivest, Shamir, and Adleman) cryptographic chip is presented. The features of this new algorithm allowed high performance (8 kb/s for 1024-b words) to be obtained for relatively small area and power consumption (80 mm² in a 2-μm CMOS process and 500 mW at 25 MHz).

13.

A single-chip MPEG-2 codec based on customizable media embedded processor

Ishiwata S. Yamakage T. Tsuboi Y. Shimazawa T. Kitazawa T. Michinaka S. Yahagi K. Takeda H. Oue A. Kodama T. Matsumoto N. Kamei T. Saito M. Miyamori T. Ootomo G. Matsui M. 《Solid-State Circuits, IEEE Journal of》2003,38(3):530-540

A single-chip MPEG-2 MP@ML codec, integrating 3.8M gates on a 72-mm² die, is described. The codec employs a heterogeneous multiprocessor architecture in which six microprocessors with the same instruction set but different customization execute specific tasks such as video and audio concurrently. The microprocessor, developed for digital media processing, provides various extensions such as a very-long-instruction-word coprocessor, digital signal processor instructions, and hardware engines. Making full use of the extensions and optimizing the architecture of each microprocessor based upon the nature of specific tasks, the chip can execute not only MPEG-2 MP@ML video/audio/system encoding and decoding concurrently, but also MPEG-2 MP@HL decoding in real time.

14.

A 2.2 W, 80 MHz superscalar RISC microprocessor

Gerosa G. Gary S. Dietz C. Dac Pham Hoover K. Alvarez J. Sanchez H. Ippolito P. Tai Ngo Litch S. Eno J. Golab J. Vanderschaaf N. Kahle J. 《Solid-State Circuits, IEEE Journal of》1994,29(12):1440-1454

A 28 mW/MHz at 80 MHz structured-custom RISC microprocessor design is described. This 32-b implementation of the PowerPC architecture is fabricated in a 3.3 V, 0.5 μm, 4-level metal CMOS technology, resulting in 1.6 million transistors in a 7.4 mm by 11.5 mm chip size. Dual 8-kilobyte instruction and data caches coupled to a high performance 32/64-b system bus and separate execution units (float, integer, load/store, and system units) result in peak instruction rates of three instructions per clock cycle. Low-power design techniques are used throughout the entire design, including dynamically powered down execution units. Typical power dissipation is kept under 2.2 W at 80 MHz. Three distinct levels of software-programmable, static, low-power operation for system power management are offered, resulting in standby power dissipation from 2 mW to 350 mW. CPU to bus clock ratios of 1×, 2×, 3×, and 4× are implemented to allow control of system power while maintaining processor performance. As a result, workstation level performance is packed into a low-power, low-cost design ideal for notebooks and desktop computers.
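
Note: the power and clock-ratio figures in this abstract can be related with simple arithmetic. The sketch below assumes the 28 mW/MHz figure scales linearly with clock frequency and that the CPU stays at 80 MHz for every bus ratio; neither assumption is stated in the abstract, and the function names are illustrative.

```python
def dissipation_watts(mw_per_mhz, clock_mhz):
    # Assumption: dissipation scales linearly with clock frequency.
    return mw_per_mhz * clock_mhz / 1000.0


def bus_clock_mhz(cpu_clock_mhz, ratio):
    # A CPU-to-bus ratio of N means the bus runs N times slower than the CPU.
    return cpu_clock_mhz / ratio


typical = dissipation_watts(mw_per_mhz=28, clock_mhz=80)
bus_clocks = [bus_clock_mhz(80, ratio) for ratio in (1, 2, 3, 4)]

print(f"{typical:.2f} W at 80 MHz")
print([round(f, 1) for f in bus_clocks])
```

The product comes to 2.24 W, which matches the 2.2 W in the title once the rounding of both quoted figures is taken into account.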

15.

Microarchitectural innovations: boosting microprocessor performance beyond semiconductor technology scaling

Moshovos A. Sohi G.S. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2001,89(11):1560-1575

Semiconductor technology scaling provides faster and more plentiful transistors to build microprocessors, and applications continue to drive the demand for more powerful microprocessors. Weaving the "raw" semiconductor material into a microprocessor that offers the performance needed by modern and future applications is the role of computer architecture. This paper overviews some of the microarchitectural techniques that empower modern high-performance microprocessors. The techniques are classified into: 1) techniques meant to increase the concurrency in instruction processing, while maintaining the appearance of sequential processing and 2) techniques that exploit program behavior. The first category includes pipelining, superscalar execution, out-of-order execution, register renaming, and techniques to overlap memory-accessing instructions. The second category includes memory hierarchies, branch predictors, trace caches, and memory-dependence predictors. The paper also discusses microarchitectural techniques likely to be used in future microprocessors, including data value speculation and instruction reuse, microarchitectures with multiple sequencers and thread-level speculation, and microarchitectural techniques for tackling the problems of power consumption and reliability.

16.

A 24-b 50-ns digital image signal processor

Nakagawa S.-I. Terane H. Matsumura T. Segawa H. Yoshimoto M. Shinohara H. Kato S.-I. Hatanaka M. Ohira H. Kato Y. Iwatsuki M. Tabuchi K. Horiba Y. 《Solid-State Circuits, IEEE Journal of》1990,25(6):1484-1493

A 50-ns digital image signal processor (DISP), an image/video application-specific VLSI chip, is discussed. This chip integrates 538 K transistors and dissipates 1.4 W at a 40-MHz clock. It is based on a 24-b fixed-point architecture with a five-stage pipeline. The DISP features a real-time processing capability realized by an enhanced parallel architecture, video-oriented data processing functions, and an instruction cycle time that is typically 35 ns, and 50 ns at worst. This 50-ns cycle time allows the DISP to execute more than 60 million operations per second (MOPS). High-density 1.0-μm CMOS technology allows numerous on-chip features, including specified resources optimized for image processing. This allows a flexible hardware implementation of various algorithms for picture coding. Several circuit design techniques that are intended to attain a fast instruction cycle are reviewed, including distributed instruction decoding and a hierarchical clocking circuit. The LSI has been designed by the extensive use of a cell-based design method. The processor incorporates a sophisticated testing function compatible with a cell-based design environment.

17.

An 80-MOPS-peak high-speed and low-power-consumption 16-b digital signal processor

Kabuo H. Okamoto M. Tanaka I. Yasoshima H. Marui S. Yamasaki M. Sugimura T. Ueda K. Ishikawa T. Suzuki H. Asahi R. 《Solid-State Circuits, IEEE Journal of》1996,31(4):494-503

This paper describes a 16-b fixed point digital signal processor (DSP), especially its multiply-accumulate (MAC) unit, memories, and instruction set. By adopting a redundant binary multiplier and a variable pipeline structure, this DSP's MAC unit, compared to a conventional MAC unit, consumes about 15% less power and operates 24% faster. Furthermore, its double-speed MAC mechanism can realize twice the performance of a single MAC operation while consuming only 69% more power. By being able to more finely control which portions of memory are activated, the data ROM and data RAM's precharge current was reduced to about 1/8 of the conventional ROM and RAMs. We redesigned the instruction set and reduced its width from 32 b to 24 b based on the analysis of data generated by simulating an application program on our previous DSP. The reduction in instruction width made our on-chip instruction memory size 33% smaller than the previous one. This chip is fabricated with a 0.5-μm double-metal-layer CMOS process and achieves 80-MOPS-peak double speed multiply-accumulate performance.

18.

A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP

Ackland B. Anesko A. Brinthaupt D. Daubert S.J. Kalavade A. Knobloch J. Micca E. Moturi M. Nicol C.J. O'Neill J.H. Othmer J. Sackinger E. Singh K.J. Sweet J. Terman C.J. Williams J. 《Solid-State Circuits, IEEE Journal of》2000,35(3):412-424

An MIMD multiprocessor digital signal-processing (DSP) chip containing four 64-b processing elements (PEs) interconnected by a 128-b pipelined split transaction bus (STBus) is presented. Each PE contains a 32-b RISC core with DSP enhancements and a 64-b single-instruction, multiple-data vector coprocessor with four 16-b MACs and a vector reduction unit. PEs are connected to the STBus through reconfigurable dual-ported snooping L1 cache memories that support shared memory multiprocessing using a modified-MESI data coherency protocol. High-bandwidth data transfers between system memory and on-chip caches are managed in a pipelined memory controller that supports multiple outstanding transactions. An embedded RTOS dynamically schedules multiple tasks onto the PEs. Process synchronization is achieved using cached semaphores. The 200-mm², 0.25-μm CMOS chip operates at 100 MHz and dissipates 4 W from a 3.3-V supply.

19.

A CMOS transistor-only 8-b 4.5-Ms/s pipelined analog-to-digital converter using fully-differential current-mode circuit techniques

Chung-Yu Wu Chih-Cheng Chen Jyh-Jer Cho 《Solid-State Circuits, IEEE Journal of》1995,30(5):522-532

Fully-differential current-mode circuit techniques are developed for the design of a pipelined current-mode analog-to-digital converter (IADC) in the standard CMOS digital processes. In the proposed IADC, the 1-b-per-stage architecture based on the reference nonrestoring algorithm is adopted. Thus large component ratios can be avoided and the linearity errors caused by device mismatches can be minimized. As one of the key subcircuits in the IADC, an offset-canceled high speed differential current comparator (CCMP) is proposed and analyzed. In the CCMP, the subtractions of offsets are performed in the current domain without floating capacitors. Moreover, the other key subcircuit, the current sample-and-hold amplifier (CSHA), is also developed to realize the pipeline architecture. An experimental chip for the proposed IADC has been fabricated in 0.8-μm n-well CMOS technology. Using a single 5-V power supply, the fabricated IADC can be operated at 4.5-Ms/s conversion rate with a signal-to-noise-and-distortion-ratio (SNDR) of 51 dB (effective 8.2-b) for the input signal at 453 kHz. For 8-b resolution, the fabricated IADC can be operated at 4.5-Ms/s conversion rate with both differential nonlinearity (DNL) and integral nonlinearity (INL) below ±0.6 LSB. The power consumption and the active chip area are 16 mW/b and 0.73 mm²/b, respectively.
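
Note: two points in this abstract lend themselves to a short numerical illustration. The first function applies the standard sine-wave relation between SNDR and effective number of bits. The second is an idealized behavioural model of a 1-b-per-stage pipeline using the textbook nonrestoring recurrence (double the residue, then add or subtract the reference). It works in normalized units with a reference of 1 and is an assumption about the general technique, not a model of the authors' current-mode circuit; all function names are illustrative.

```python
def enob(sndr_db):
    """Effective number of bits for a sine-wave input: (SNDR - 1.76) / 6.02."""
    return (sndr_db - 1.76) / 6.02


def one_bit_per_stage_adc(x, stages=8):
    """Ideal pipeline: x is normalized to [-1, 1); returns an unsigned code."""
    code = 0
    residue = x
    for _ in range(stages):
        d = 1 if residue >= 0 else -1          # comparator decision
        code = (code << 1) | (1 if d > 0 else 0)
        residue = 2 * residue - d              # double, then add/subtract the reference
    return code


def reconstruct(code, stages=8):
    """Mid-point of the input interval that maps to `code`."""
    return (2 * code + 1) / 2 ** stages - 1


print(f"ENOB at 51 dB SNDR: {enob(51):.2f} b")
code = one_bit_per_stage_adc(0.3)
print(code, reconstruct(code))
```

The relation gives about 8.18 effective bits for 51 dB, which rounds to the 8.2 b quoted. In the ideal pipeline every stage needs only a comparator and a gain of two, which is consistent with the abstract's remark that large component ratios are avoided.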

20.

A Framework for Power-Gating Functional Units in Embedded Microprocessors

Roy S. Ranganathan N. Katkoori S. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(11):1640-1649

Power gating is a technique commonly used for leakage reduction in integrated circuits. In microprocessors, power gating is implemented by using sleep transistors to selectively deactivate circuit modules that remain idle for sustained periods of time during program execution. In this work, we develop a new framework for power gating the functional units in embedded system microprocessors without degradation in performance. The proposed framework includes an efficient algorithm for idle time estimation, appropriate insertion of sleep instructions within the code, and a method for reactivating the sleeping units only when needed without the use of wakeup instructions. We introduce the notion of loop hierarchy trees (LHTs) to represent the partial ordering of the nested loops within the program. From the control flow graph (CFG) representation of the source program, a forest of LHTs is constructed and is used to identify the maximal subgraphs representing the long idle periods for the functional units. For each subgraph thus identified, a sleep instruction is introduced in the program with a list of corresponding functional units to be deactivated. When an instruction is decoded, the functional units needed for that instruction are automatically activated by the control unit such that the units are ready before the instruction reaches the execute stage. This eliminates the need for wakeup instructions to be inserted into the object code, reducing overhead. In our implementation, the ARM processor architecture was modified and resynthesized to include power gating by developing a CMOS cell library of functional units with the above capabilities. Experimental results are reported for a set of 12 benchmarks chosen from the MiBench suite, which indicate that, on average, our technique reduces the leakage energy in functional units by 31.1% for integer benchmarks and 26.8% for floating-point benchmarks.