首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
A 32-kB cache macro with an experimental reduced instruction set computer (RISC) is realized. A pipelined cache access to realize a cycle time shorter than the cache access time is proposed. A double-word-line architecture combines single-port cells, dual-port cells, and CAM cells into a memory array to improve silicon area efficiency. The cache macro exhibits 9-ns typical clock-to-HIT delay as a result of several circuit techniques, such as a section word-line selector, a dual transfer gate, and 1.0-μm CMOS technology. It supports multitask operation with logical addressing by a selective clear circuit. The RISC includes a double-word load/store instruction using a 64-b bus to fully utilize the on-chip cache macro. A test scheme allows measurement of the internal signal delay. The test device design is based on the unified design rules scalable through multigenerations of process technologies down to 0.8 μm  相似文献   

2.
The design of a 90-nm virtually addressed cache subsystem with separate 32-kB instruction and data caches is described. The circuits and microarchitecture are illustrated, including architecture level trace data validating low-power features and provisions to support snooping while maintaining the latency and power of virtual addressing. Low-power memory management unit design including a translation lookaside buffer with process identifier mapping is also described. Level 1 caches with support for high bandwidth, single cycle 256 bit fill and evict, as well as features for low power are also described. The design approaches are validated through both simulation and experimental results.  相似文献   

3.
《Microelectronics Journal》2015,46(11):1020-1032
This paper describes a new memristor crossbar architecture that is proposed for use in a high density cache design. This design has less than 10% of the write energy consumption than a simple memristor crossbar. Also, it has up to 3 times the bit density of an STT-MRAM system and up to 11 times the bit density of an SRAM architecture. The proposed architecture is analyzed using a detailed SPICE analysis that accounts for the resistance of the wires in the memristor structure. Additionally, the memristor model used in this work has been matched to specific device characterization data to provide accurate results in terms of energy, area, and timing. The proposed memory system was analyzed by modeling two different devices that vary in resistance range and switching time. This system does not require that the memristor devices have inherent diode effects which limit alternate current paths. Therefore this system is capable of utilizing a much broader class of devices.An architectural analysis has also been completed that shows how the memory system may perform as a cache memory. A hybrid cache structure was used to alleviate the long write latencies of memristor devices. This approach consisted of the tag array being made of SRAM cells while the data array was made of the memristor circuit proposed. This hybrid scheme allows multiple reads and writes to concurrently access different sub-arrays within a cache. The performance of these novel memristor based caches was compared to SRAM and STT-MRAM based caches through detailed simulations. The results show that the memristor caches are denser and allow better performance along with lower system power when compared to the STT-MRAM and SRAM caches.  相似文献   

4.
A 1-million transistor 64-b microprocessor has been fabricated using 0.8-μm double-metal CMOS technology. A 40-MIPS (million instructions per second) and 20-MFLOPS (million floating-point operations per second) peak performance at 40 MHz is realized by a self-clocked register file and two translation lookaside buffers (TLBs) with word-line transition detection circuits. The processor contains an integer unit based on the SPARC (scalable processor architecture) RISC (reduced instruction set computer) architecture, a floating-point unit (FPU) which executes IEEE-754 single- and double-precision floating-point operations a 6-KB three-way set-associative physical instruction cache, a 2-KB two-way set-associative physical data cache, a memory management unit that has two TLBs, and a bus control unit with an ECC (error-correcting code) circuit  相似文献   

5.
6.
Coarse-grained reconfigurable architectures (CGRAs) require many processing elements (PEs) and a configuration memory unit (configuration cache) for reconfiguration of its PE array. Although this structure is meant for high performance and flexibility, it consumes significant power. Specially, power consumption by configuration cache is explicit overhead compared to other types of intellectual property (IP) cores. Reducing power is very crucial for CGRA to be more competitive and reliable processing core in embedded systems. In this paper, we propose a reusable context pipelining (RCP) architecture to reduce power-overhead caused by reconfiguration. It shows that the power reduction can be achieved by using the characteristics of loop pipelining, which is a multiple instruction stream, multiple data stream (MIMD)-style execution model. RCP efficiently reduces power consumption in configuration cache without performance degradation. Experimental results show that the proposed approach saves much power even with reduced configuration cache size. Power reduction ratio in the configuration cache and the entire architecture are up to 86.33% and 37.19%, respectively, compared to the base architecture.  相似文献   

7.
This paper presents a low power 16‐bit adiabatic reduced instruction set computer (RISC) microprocessor with efficient charge recovery logic (ECRL) registers. The processor consists of registers, a control block, a register file, a program counter, and an arithmetic and logical unit (ALU). Adiabatic circuits based on ECRL are designed using a 0.35 µm CMOS technology. An adiabatic latch based on ECRL is proposed for signal interfaces for the first time, and an efficient four‐phase supply clock generator is designed to provide power for the adiabatic processor. A static CMOS processor with the same architecture is designed to compare the energy consumption of adiabatic and non‐adiabatic microprocessors. Simulation results show that the power consumption of the adiabatic microprocessor is about 1/3 compared to that of the static CMOS microprocessor.  相似文献   

8.
In current NAND flash design, one of the most challenging issues is reducing peak current consumption (peak ICC), as it leads to peak power drop, which can cause malfunctions in NAND flash memory. This paper presents an efficient approach for reducing the peak ICC of the cache program in NAND flash memory — namely, a program Cache Busy Time (tPCBSY) control method. The proposed tPCBSY control method is based on the interesting observation that the array program current (ICC2) is mainly decided by the bit‐line bias condition. In the proposed approach, when peak ICC2 becomes larger than a threshold value, which is determined by a cache loop number, cache data cannot be loaded to the cache buffer (CB). On the other hand, when peak ICC2 is smaller than the threshold level, cache data can be loaded to the CB. As a result, the peak ICC of the cache program is reduced by 32% at the least significant bit page and by 15% at the most significant bit page. In addition, the program throughput reaches 20 MB/s in multiplane cache program operation, without restrictions caused by a drop in peak power due to cache program operations in a solid‐state drive.  相似文献   

9.
Especially in programmable processors, energy consumption of integrated memories can become a limiting design factor due to thermal dissipation power constraints and limited battery capacity. Consequently, contemporary improvement efforts on memory technologies are focusing more on the energy-efficiency aspects, which has resulted in biased CMOS SRAM cells that increase energy efficiency by favoring one logical value over another. In this paper, xor-masking, a method for exploiting such contemporary low power SRAM memories is proposed to improve the energy-efficiency of instruction fetching. Xor-masking utilizes static program analysis statistics to produce optimal encoding masks to reduce the occurrence of the more energy consuming instruction bit values in the fetched instructions. The method is evaluated on LatticeMico32, a small RISC core popular in ultra low power designs, and on a wide instruction word high performance low power DSP. Compared to the previous “bus invert” technique typically used with similar SRAMs, the proposed method reduces instruction read energy consumption of the LatticeMico32 by up to 13% and 38% on the DSP core.  相似文献   

10.
An intelligent cache based on a distributed architecture that consists of a hierarchy of three memory sections-DRAM (dynamic RAM), SRAM (static RAM), and CAM (content addressable memory) as an on-chip tag-is reported. The test device of the memory core is fabricated in a 0.6 μm double-metal CMOS standard DRAM process, and the CAM matrix and control logic are embedded in the array. The array architecture can be applied to 16-Mb DRAM with less than 12% of the chip overhead. In addition to the tag, the array embedded CAM matrix supports a write-back function that provides a short read/write cycle time. The cache DRAM also has pin compatibility with address nonmultiplexed memories. By achieving a reasonable hit ratio (90%), this cache DRAM provides a high-performance intelligent main memory with a 12 ns(hit)/34 ns(average) cycle time and 55 mA (at 25 MHz) operating current  相似文献   

11.
The architecture and design of a CMOS chip implementing a medium-resolution graphics system are described. The chip, requiring no external support logic, outputs analog RGB signals at a 40-MHz pixel rate and directly controls a bit-map video RAM (VRAM) memory array. Scan rates and display formats are completely programmable. Pixels stored in the 1 K×1 K bit map can be any of 16 colors taken from a 4096-color palette. The chip can be directly interfaced to most common microprocessors. A 6.7-MIPS (million-instruction-per-second) internal reduced instruction set computer (RISC) CPU directly implements high-level graphics commands. The chip achieves a maximum draw speed of 10 million pixels/s. Designed in a Lisp machine environment, the 100000-transistor chip is implemented in 1.8-μm CMOS and contains standard cells, RAM, ROM, a color table, and three four-bit current-steered digital-to-analog converters (DACs)  相似文献   

12.
This paper describes the design and performance of a 64-kbit (65 536 bits) block addressed charge-coupled serial memory. By using the offset-mask charge-coupled device (CCD) electrode structure to obtain a small cell size, and an adaptive system approach to utilize nonzero defect memory chips, the system cost per bit of charge-coupled serial memory can be reduced to provide a solid-state replacement of moving magnetic memories and to bridge the gap between high cost random access memories (RAM's) and slow access magnetic memories. The memory chip is organized as 64K words by 1 bit in 16 blocks of 4 kbits. Each 4-kbit block is organized as a serial-parallel-serial (SPS) array. The chip is fully decoded with write/recirculate control and two-dimensional decoding to permit memory matrix organization with X-Y chip select control. All inputs and the ouput are TTL compatible. Operated at a data rate of 1 MHz, the mean access time is about 2 ms and the average power dissipation is 1 µW/bit. The maximum output data rate is 10 MHz, giving a mean access time of about 200 µs, and an average power dissipation of 10 µW/bit. The memory chip is fabricated using an n-channel polysilicon gate process. Using tolerant design rules (8-µm minimum feature size and ±2-µm alignment tolerance) the CCD cell size is 0.4 mil2and the total chip size is 218 × 235 mil2. The chip is mounted in a 22-pin 400-mil wide ceramic dual in-line package.  相似文献   

13.
In this paper, we describe a procedure for memory design and exploration for low power embedded systems. Our system consists of an instruction cache and a data cache on-chip, and a large memory off-chip. In the first step, we try to reduce the power consumption due to memory traffic by applying memory-optimizing transformations such as loop transformations. Next we use a memory exploration procedure to choose a cache configuration (cache size and line size) that satisfies the system requirements of area, number of cycles and energy consumption. We include energy in the performance metrics, since for different cache configurations, the variation in energy consumption is quite different from the variation in the number of cycles. The memory exploration procedure is very efficient since it exploits the trends in the cycles and energy characteristics to reduce the search space significantly.  相似文献   

14.
A 1 k bit GaAs static RAM with E/D DCFL was designed and successfully fabricated by SAINT. A bit line pull-up was introduced to the design to make higher operation speed by 25 percent and reduce cell array power consumption by 50 percent. The RAM circuit was optimized in the points of a speed, a power, and an operating margin. A minimum address access time of 1.5 ns was measured for a total power dissipation of 369 mW. This performance is the best achieved so far, for practical application in cache or buffer memories.  相似文献   

15.
提出一种超精简处理单元架构。该处理单元基于运算-跳转式单指令处理器体系。使用指令优化和内部总线上加速器,该处理单元能够执行传统算术运算式单指令处理器难于执行的高效位运算以及执行效率较低的数据转移操作。以该处理单元构成的片上大规模并行计算阵列可用于图像处理等局部性强、实时性要求高的计算任务。包含有该处理单元架构的16 16的原型阵列已经在FPGA上实现,性能达30.7GOPS@120MHz,平均功耗39.5mW。  相似文献   

16.
一种低功耗Cache设计技术的研究   总被引:2,自引:0,他引:2  
低功耗、高性能的cache系统设计是嵌入式DSP芯片设计的关键。本文在多媒体处理DSP芯片MD32的设计实践中,提出一种利用读/写缓冲器作为零级cache,减少对数据、指令cache的读/写次数,由于缓冲器读取功耗远远小于片上cache,从而减小cache相关功耗的方法。通过多种多媒体处理测试程序的验证,该技术可减少对指令cache或者数据cache20%~40%的读取次数,以较小芯片面积的增加换取了较大的功耗降低。  相似文献   

17.
《IEE Review》2001,47(3):38-40
The author reports on the theory and practice behind an innovative 32 bit RISC processor core, whose architecture can be customised to provide the optimum design solution for processor-based application-specific integrated circuits. In its basic configuration, the ARC processor, the Tangent-A4, is a four-stage pipeline device, with instructions, data and address formats all 32 bit. It also boasts separate instruction and data buses (Harvard architecture), data and instruction caches, and a unique host interface (parallel, JTAG or user defined), giving external devices access to the internal registers and memory  相似文献   

18.
Many applications would benefit from the availability of large-capacity content addressable memories (CAMs). However, while RAMs, EEPROMs, and other memory types achieve ever-increasing per-chip bit counts, CAMs show little promise of following suit, due primarily to an inherent difficulty in implementing two-dimensional decoding. The serialized operation of most proposed solutions is not acceptable in speed-sensitive environments. In response to the resulting need, this paper describes a fully-parallel (single-clock-cycle) CAM architecture that uses the concept of “preclassification” to realize a second dimension of decoding without compromising throughput. As is typically the case, each CAM entry is used as an index to additional data in a RAM. To achieve improved system integration, the preclassified CAM is merged into the same physical array as its target RAM, and both use the same core cells. Architecture and operation of the resulting novel memory are described, as are two critical-path circuits: the match-line pull-down and the multiple match resolver. The memory circuits, designed in 0.8 μm BiCMOS technology, may be employed in chips as large as 1 Mb, and simulations confirm 37 MHz operation for this capacity. To experimentally verify the feasibility of the architectural and circuit design, an 8 kb test chip was fabricated and found to be fully functional at clock speeds up to 59 MHz, with a power dissipation of 260 mW at 50 MHz  相似文献   

19.
Drastic device shrinking, power supply reduction, increasing complexity and increasing operating speeds affect adversely the reliability of nowadays Integrate Circuits (ICs). In many modern designs, embedded memories occupy the largest part of the die and comprise the large majority of transistors. Furthermore, memories are designed as tight as allowed by the process, and are therefore more prone to failures than other circuits. Error correcting codes (ECCs) are an efficient mean for protecting memories against failures. A major drawback of ECCs is the speed penalty induced by the encoding and decoding circuits. In this paper, we present an architecture enabling implementing ECCs without speed penalty. Furthermore, as the manual implementation of this solution is impractical for complex System-on-Chips (SoCs), we propose an algorithm and a set of generic rules allowing automatic insertion of the delay-free ECCs in any complex architecture at Register Transfer Level (RTL). With respect to a naive insertion in the design of the new architecture, the algorithm enable up to 20 % hardware reduction. The Finite State Machines (FSM) that controls the new ECC architecture is also generated automatically. Experimental evaluations show that the hardware overhead of the speed penalty free ECCs protected memory compared to a standard implementation of ECC protected memory is about 2.5 % with an additional power consumption of 6 %.  相似文献   

20.
This paper describes an on-chip cache, called a separated bit-line unified cache, which minimizes the chip-area cost in high-performance microprocessors. This unified cache has two ports; one for the instruction bus and the other for the data bus. A separated bit-line memory hierarchy architecture realizes memory hierarchy design with only 10%-20% area overhead. The total cache area can be reduced by more than 20%-30% on the average at capacities of larger than 64 KB with the same hit rate as the conventional cache. The cache latency reaches 4.2 ns at a supply voltage of 1 V. Additionally, the cache is physically addressable even if the cache has a large capacity  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号