Similar literature
Found 20 similar documents; search took 62 ms
1.
We present an architecture of decoupled processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. This architecture exploits the more efficient prefetching of decoupled processors, which make use of the parallelism between address computation and application data processing that mainly exists in streaming applications. This benefit, combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access, contributes significantly to increasing the system's performance. The application code is split into two parallel programs. The first runs on the Access processor and computes the addresses of the data in the memory hierarchy. The second processes the application data and runs on the Execute processor, a processor with a limited address space (just the register file addresses). Each transfer of any block in the memory hierarchy up to the Execute processor's register file is controlled by the Access processor and the DMA units. This strongly differentiates this architecture from traditional uniprocessors and existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with uniprocessor architectures with (a) scratch-pad and (b) cache memory hierarchies and (c) the existing decoupled architectures, showing its higher normalized performance. The reason for this gain is the efficiency of data transfer that the scratch-pad memory hierarchy provides, combined with the ability of the decoupled processors to eliminate memory latency by using memory management techniques for transferring data instead of fixed prefetching methods. Experimental results show that performance increases by up to almost 2 times compared to uniprocessor architectures with scratch-pad and by up to 3.7 times compared to those with cache. The proposed architecture achieves this performance without penalties in energy-delay product.
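The split into an address-computing program and a data-processing program can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the authors' design: the names `access_program` and `run_decoupled`, the strided access pattern, and the use of a Python `deque` to stand in for the operand path into the register file are all hypothetical, and real decoupled hardware would run the two sides in parallel.

```python
from collections import deque


def access_program(count, stride):
    """Access side (hypothetical): produces addresses only, never touches data values."""
    for i in range(count):
        yield i * stride


def run_decoupled(main_memory, count, stride):
    """Simulate the two programs one after the other.

    The access side fetches operands and queues them; the execute side
    consumes queued operands and never sees a memory address.
    """
    operand_queue = deque()

    # Access side: runs ahead and moves data toward the execute side.
    for address in access_program(count, stride):
        operand_queue.append(main_memory[address])

    # Execute side: pure data processing on whatever arrives in the queue.
    total = 0
    while operand_queue:
        total += operand_queue.popleft()
    return total


memory = list(range(100))          # toy main memory: value equals address
result = run_decoupled(memory, count=5, stride=3)
```

Here the execute loop contains no address arithmetic at all, which is the property the abstract attributes to the Execute processor.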

2.
A 32-kB cache macro with an experimental reduced instruction set computer (RISC) is realized. A pipelined cache access to realize a cycle time shorter than the cache access time is proposed. A double-word-line architecture combines single-port cells, dual-port cells, and CAM cells into a memory array to improve silicon area efficiency. The cache macro exhibits 9-ns typical clock-to-HIT delay as a result of several circuit techniques, such as a section word-line selector, a dual transfer gate, and 1.0-μm CMOS technology. It supports multitask operation with logical addressing by a selective clear circuit. The RISC includes a double-word load/store instruction using a 64-b bus to fully utilize the on-chip cache macro. A test scheme allows measurement of the internal signal delay. The test device design is based on unified design rules scalable through multiple generations of process technologies down to 0.8 μm.

3.
The system, circuit, layout and device levels of an integrated cache memory (ICM), which includes 32 kbyte DATA memory with typical address to HIT delay of 18 ns and address to DATA delay of 23 ns, are described. The ICM offers the largest memory size and the fastest speed ever reported in a cache memory. The device integrates a 32 kbyte DATA INSTRUCTION memory, a 34 kbit TAG memory, an 8 kbit VALID flag, a 2 kbit least recently used (LRU) flag, comparators, and CPU interface logic circuits on a chip. The inclusion of the DATA memory is crucial in improving system cycle time. The device uses several novel circuit design technologies, including a double-word-line scheme, low-noise flush clear, a low-power comparator, noise immunity, and directly testable memory design. Its newly proposed way-slice architecture increases both flexibility and expandability.

4.
A 1-kbit GaAs static RAM with E/D DCFL was designed and successfully fabricated by SAINT. A bit-line pull-up was introduced into the design to raise operating speed by 25 percent and reduce cell array power consumption by 50 percent. The RAM circuit was optimized for speed, power, and operating margin. A minimum address access time of 1.5 ns was measured for a total power dissipation of 369 mW. This performance is the best achieved so far for practical application in cache or buffer memories.

5.
A 1024-bit ECL RAM with greatly improved speed performance was developed. Typical access time and write cycle time are as short as 7.5 and 10 ns, respectively, under 784 mW of power dissipation, achieving a power and access-time product as small as 5.7 pJ/bit. Novel ECL circuit techniques, especially in address decoder circuits, as well as improved process technologies enabled these high-speed characteristics. The device uses a V-groove isolation process and a shallow emitter diffusion technology with doped polysilicon. It has a memory organization of 256 words by 4 bits, and its main use is as a cache memory. Besides this basic organization, it can also operate as a 512-word by 2-bit or 1024-word by 1-bit memory.
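The quoted power and access-time product can be checked from the other figures in the abstract. The helper below is a hypothetical illustration (the function name and signature are assumptions); it relies only on the unit identity mW × ns = pJ and the stated 784 mW, 7.5 ns, and 1024 bits.

```python
def power_delay_per_bit_pj(power_mw, access_time_ns, bits):
    """Power x access-time product per bit, in picojoules.

    1 mW * 1 ns = 1e-3 W * 1e-9 s = 1e-12 J = 1 pJ, so the product of the
    two inputs is already in pJ for the whole array.
    """
    energy_pj = power_mw * access_time_ns
    return energy_pj / bits


# Figures taken from the abstract: 784 mW, 7.5 ns typical access, 1024 bits.
per_bit = power_delay_per_bit_pj(784, 7.5, 1024)
print(round(per_bit, 1))  # 5.7
```

The result, 5880 pJ over 1024 bits, agrees with the 5.7 pJ/bit stated above.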

6.
The push to embed reliable and low-power memory architectures into modern systems-on-chip is driving the EDA community to develop new design techniques and circuit solutions that can concurrently optimize aging effects due to Negative Bias Temperature Instability (NBTI) and static power consumption due to leakage mechanisms. While recent works have shown how conventional leakage optimization techniques can help mitigate NBTI-induced aging effects on cache memories, in this paper we focus specifically on scratchpad memory (SPM) and present novel software approaches as a means of alleviating the NBTI-induced aging effects. In particular, we demonstrate how intelligent software-directed data allocation strategies can extend the lifetime of partitioned SPMs by distributing the idleness across the memory sub-banks.

7.
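For the NoC architecture, this paper proposes two data compression techniques, referred to as cache compression and compression in the network interface controller (NIC). Measured performance results indicate that compression gives NoC designs advantages in lower network latency, lower power consumption, and improved application performance.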

8.
A new architecture for serial access memory is described that enables a static random access memory (SRAM) to operate in a serial access mode. The design target is to access all memory addresses serially from any starting address with an access time of less than 10 ns. This can be done by an initializing procedure and three new circuit techniques. The initializing procedure is introduced to start the serial operation at an arbitrary memory address. The three circuit techniques eliminate extra delay time caused by internal addressing of column lines, sense amplifiers, word lines, and memory cell blocks. This architecture was successfully implemented in a 4-Mb CMOS SRAM using a 0.6 μm CMOS process technology. The measured serial access time was 8 ns at a single power supply voltage of 3.3 V.

9.
A 4-Mb cache dynamic random access memory (CDRAM), which integrates 16-kb SRAM as a cache memory and 4-Mb DRAM into a monolithic circuit, is described. This CDRAM has a 100-MHz operating cache, a newly proposed fast copy-back (FCB) scheme that realizes a three times faster miss access time than the conventional copy-back method, and maximized mapping flexibility. The process technology is a quad-polysilicon double-metal 0.7-μm CMOS process, which is the same as that used in a conventional 4-Mb DRAM. The chip size of 82.9 mm² is only a 7% increase over the conventional 4-Mb DRAM. The simulated system performance indicated better performance than a conventional cache system with eight times the cache capacity.

10.
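An Access-Demand-Oriented Leakage Power Management Method for Data Caches   Cited by: 1 (self-citations: 0; by others: 1)
Wang Xiaoyin (王箫音), Tong Dong (佟冬), Sun Hanxin (孙含欣), Cheng Xu (程旭). Acta Electronica Sinica (《电子学报》), 2009, 37(2): 362-366
This paper proposes an access-demand-oriented leakage power management method for data caches, which controls data cache activity according to the access demands of memory-access instructions. When no memory-access instruction is found in the pipeline, the entire data cache is kept in an inactive state. When a memory-access instruction is found entering the pipeline, two data cache access control policies, together with a mechanism that dynamically selects between them, capture the access demand of the memory address early in the pipeline and exert fine-grained control over data cache activity. Experimental results show that, on average, the method reduces data cache leakage power by 85.4% while improving processor performance by 4.41%, achieving better results than conventional methods in both power and performance.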

11.
A nibbled-page architecture which can be used to access all column addresses on the selected row address randomly in units of 8 bits at the 100 Mbit/s data rate is discussed. To realize such a high-speed architecture, three key circuit techniques have been developed. An on-chip interleaved circuit has been used for the high-speed serial READ and WRITE operations. Column address prefetch and WE signal prefetch techniques have been introduced to eliminate idle time between 8-bit units. The nibbled-page architecture has been successfully implemented in an experimental 16 Mb DRAM, and 100 Mb/s operation has been achieved. The DRAM with nibbled-page mode is very effective in simplifying the design of high-speed data transfer systems.

12.
256-Mb DRAM circuit technologies characterized by low power and high fabrication yield for file applications are described. The newly proposed and developed circuits are a self-reverse-biasing circuit for word drivers and decoders to suppress the subthreshold current to 3% of the conventional scheme, and a subarray-replacement redundancy technique that doubles chip yield and consequently reduces manufacturing costs. An experimental 256-Mb DRAM has been designed and fabricated by combining the proposed circuit techniques and a 0.25-μm phase-shift optical lithography, and its basic operations are verified. A 0.72-μm² double-cylindrical recessed stacked-capacitor (RSTC) cell is used to ensure a storage capacitance of 25 fF/cell. A typical access time under a 2-V power supply voltage was 70 ns. With the proper device characteristics, the simulated performances of the 256-Mb DRAM operating with a 1.5-V power supply voltage are a data-retention current of 53 μA and an access time of 48 ns.

13.
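Design and Implementation of a Cache Memory   Cited by: 3 (self-citations: 0; by others: 3)
As chip integration density increases, inserting a faster, smaller cache memory as a buffer between the high-speed CPU and the slow main memory resolves the speed mismatch between the two and greatly improves overall microprocessor performance. Starting from the structure and basic theory of caches and combining theory with practice, this paper describes how the cache is implemented in a 32-bit high-performance, low-power embedded microprocessor, discusses each stage from RTL design to layout design, and presents the full-custom circuits and layout of the module.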
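As a reminder of the basic cache structure such a design starts from, the sketch below splits a 32-bit address into tag, set index, and byte offset. The abstract does not give the cache's geometry, so the 32-byte line size, the 256 sets, and the function name `split_address` are purely illustrative assumptions.

```python
def split_address(address, line_bytes=32, num_sets=256):
    """Split an address into (tag, index, offset) for a set-indexed cache.

    line_bytes and num_sets are assumed values and must be powers of two.
    """
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)

    offset = address & (line_bytes - 1)                 # byte within the line
    index = (address >> offset_bits) & (num_sets - 1)   # which set to probe
    tag = address >> (offset_bits + index_bits)         # compared against stored tag
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))  # 0x91a2 0xb3 0x18
```

A lookup would use `index` to select a set, compare `tag` with the stored tags, and use `offset` to pick the bytes from the matching line.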

14.
This quad-issue processor achieves 1-GHz operation through improved dynamic circuit techniques in critical paths and a more extensive on-chip memory system which scales in both bandwidth and latency. Critical logic paths use domino, delayed clocked domino, and logic embedded in dynamic flip-flops for minimum delay. A 64-KB sum-addressed memory data cache combines the address offset add with the cache decode, allowing the average memory latency to scale by more than the clock ratio. Memory bandwidth is improved by using wave pipelined SRAM designs for on-chip caches and a write cache for store traffic. Memory power is controlled without increased latency by use of delayed-reset logic decoders. The chip operates at 1000 MHz and dissipates less than 80 W from a 1.6-V supply. It contains 23 million transistors (12 million in RAM cells) on a 244 mm² die.
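The idea of folding the base-plus-offset add into the decode can be illustrated with the generic carry-free equality test often used to explain sum-addressed memories: each decoder row K checks whether A + B equals K without running a carry chain. The sketch below shows that textbook identity, not the circuit of this chip; the function name and the bit width are assumptions.

```python
def sum_matches_row(a, b, k, width):
    """Return True when (a + b) mod 2**width == k, without a carry chain.

    For each bit, the carry-in that the target k would require is a ^ b ^ k.
    The carry-out each bit would then produce is (a & b) | ((a | b) & ~k).
    The sum matches k exactly when every bit's required carry-in equals the
    carry-out of the bit below it (and bit 0 requires no carry-in).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    k &= mask
    required_carry_in = a ^ b ^ k
    produced_carry_out = (a & b) | ((a | b) & ~k & mask)
    return required_carry_in == ((produced_carry_out << 1) & mask)


# Base 3 plus offset 5 selects row 8 and no other row of a 16-row decoder.
selected = [k for k in range(16) if sum_matches_row(3, 5, k, width=4)]
print(selected)  # [8]
```

Every step is a bitwise operation on neighbouring bits, so in hardware all rows can be evaluated in parallel with constant logic depth instead of waiting for an adder.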

15.
This paper presents several architectures and designs of low-power 4-2 and 5-2 compressors capable of operating at ultra-low supply voltages. These compressor architectures are anatomized into their constituent modules, and different static logic styles based on the same deep submicrometer CMOS process model are used to realize them. Different configurations of each architecture, which include a number of novel 4-2 and 5-2 compressor designs, are prototyped and simulated to evaluate their performance in speed, power dissipation and power-delay product. The newly developed circuits are based on various configurations of the novel 5-2 compressor architecture with the new carry generator circuit, or existing architectures configured with the proposed circuit for the exclusive OR (XOR) and exclusive NOR (XNOR) [XOR-XNOR] module. The proposed new circuit for the XOR-XNOR module eliminates the weak logic on the internal nodes of pass transistors with a pair of feedback PMOS-NMOS transistors. Driving capability has been considered in the design as well as in the simulation setup so that these 4-2 and 5-2 compressor cells can operate reliably in any tree-structured parallel multiplier at very low supply voltages. Two new simulation environments are created to ensure that the performances reflect the realistic circuit operation in the system into which these cells are integrated. Simulation results show that the 4-2 compressor with the proposed XOR-XNOR module and the new fast 5-2 compressor architecture are able to function at supply voltages as low as 0.6 V, and outperform many other architectures including the classical CMOS logic compressors and variants of compressors constructed with various combinations of recently reported superior low-power logic cells.

16.
This paper presents a forward body-biasing (FBB) technique for active and standby leakage power reduction in cache memories. Unlike previous low-leakage SRAM approaches, we include device-level optimization in the design. We utilize super high Vt (threshold voltage) devices to suppress the cache leakage power, while dynamically forward body-biasing only the selected SRAM cells for fast operation. In order to build a super high Vt device, the two-dimensional (2-D) halo doping profile was optimized considering various nanoscale leakage mechanisms. The transition latency and energy overhead associated with FBB were minimized by waking up the SRAM cells ahead of the access and exploiting the general cache access pattern. The combined device-circuit-architecture level techniques offer 64% total leakage reduction and 7.3% improvement in bit line delay compared to a previous state-of-the-art low-leakage SRAM technique. Static noise margin of the proposed SRAM cell is comparable to conventional SRAM cells.

17.
Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. In this paper, we first show that these power reduction techniques can be suboptimal when thermal effects are considered. Then, we propose a thermal-aware cache power-down technique that minimizes the power density of the active parts by turning off alternating rows of memory cells instead of entire banks. The decrease in the power density lowers the temperature, which then exponentially reduces the leakage. Thus, leakage power of the active parts is reduced in addition to the power eliminated from the parts that are turned off. Simulations based on SPEC2000, NetBench, and MediaBench applications in a 70-nm technology show that the proposed thermal-aware architecture can reduce the total energy consumption by 53% compared to a conventional cache, and 14% compared to a cache architecture with a thermal-unaware power reduction scheme. Second, we show a block permutation scheme that can be used during the design of the caches to maximize the distance between blocks with consecutive addresses. Because of spatial locality, blocks with consecutive addresses are likely to be accessed within a short time interval. By maximizing the distance between such blocks, we minimize the power density of the hot spots in the cache, and hence reduce the peak temperature. This, in turn, results in an average leakage power reduction of 8.7% compared to a conventional cache without affecting the dynamic power and the latency. Overall, both of our architectures add no extra run-time penalty compared to the thermal-unaware power reduction schemes, yet they result in a significant reduction in the total energy consumption of a cache.

18.
An 8-bit fully decoded RAM test circuit has been designed and fabricated using enhancement-mode GaAs MESFETs with the LPFL circuit approach. Correct operation of the circuit has been observed for a supply voltage varying from 3.5 to 7 V. An access time of 0.6 ns was measured for a total power consumption of 85 mW under nominal operating conditions. This circuit was used to develop and validate both a design strategy and computer-aided design (CAD) tools oriented towards cache or buffer memories of realistic complexity. It is shown that a performance-optimized 1-kbit RAM exhibiting an access time of 1.1 ns for a power dissipation of 850 mW would be feasible with the present fabrication technology.

19.
A high-speed data transfer scheme between the DRAM and logic parts in merged DRAM logic (MDL) designs is proposed with logically divided DRAM row address mapping. The proposed scheme results in a 20% faster write access and 40% faster read access. It can be used as a general design framework to maximise DRAM access speed in various MDL designs. A test chip has been fabricated in 0.16 μm DRAM technology, and the scheme has been verified in the design of a DRAM L2 cache memory.

20.
An 8-bit fully decoded RAM test circuit has been designed and fabricated using enhancement-mode GaAs MESFETs with the LPFL circuit approach. Correct operation of the circuit has been observed for a supply voltage varying from 3.5 to 7 V. An access time of 0.6 ns was measured for a total power consumption of 85 mW under nominal operating conditions. This circuit was used to develop and validate both a design strategy and computer-aided design (CAD) tools oriented towards cache or buffer memories of realistic complexity. It is shown that a performance-optimized 1-kbit RAM exhibiting an access time of 1.1 ns for a power dissipation of 850 mW would be feasible with the present fabrication technology.
