Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
管茂林  何义  杨乾明  张春元  伍楠 《电子学报》2012,40(7):1379-1385
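To address the burden that VLIW code size places on instruction memory capacity and power consumption in stream architectures, this paper analyzes the instruction characteristics of stream processors and proposes a new field-partitioned VLIW compression technique. On this basis, the paper designs a distributed on-chip instruction memory for the stream architecture and proposes a SIMD pipelined execution mode. Experimental results show that the technique reduces off-chip instruction memory accesses by 38% and lowers the on-chip instruction memory space requirement by about 65%. The distributed instruction memory reduces on-chip instruction memory area by about 37%, which lowers the system area of MASA by 8.92% and reduces instruction memory power consumption by 61%.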

2.
A novel high-speed, low-power architecture for Dual Bit Content Addressable Memory (DB-CAM) is reported in this article. A low-leakage, low-power, high-speed memory has been developed using the DB-CAM architecture, which can store 2 bits in a single CAM block together with Static Random Access Memory (SRAM). The data search operation is performed by the CAM cells, and the SRAMs are used as data storage cells. The output of the SRAM cells depends on the search result of the CAM cells. To make the search operation more precise, a priority detector circuit has been proposed. The new DB-CAM block architecture greatly reduces power consumption, transistor count, and on-chip area. The functionality of these circuits was checked, and performance parameters such as propagation delay and dynamic power consumption were calculated with SPICE Spectre (Cadence) using a standard 90 nm CMOS technology.

3.
Described is a design for high-speed low-power-consumption fully parallel content-addressable memory (CAM) macros for CMOS ASIC applications. The design supports configurations ranging from 64 words by 8 bits to 2048 words by 64 bits and achieves around 7.5-ns search access times in CAM macros on a 0.35-μm 3.3-V standard CMOS ASIC technology. A new CAM cell with a pMOS match-line driver reduces search rush current and power consumption, allowing a NOR-type match-line structure suitable for high-speed search operations. It is also shown that the CAM cell has other advantages that lead to a simple high-speed current-saving architecture. A small signal on the match line is detected by a single-ended sense amplifier which has both high-speed and low-power characteristics and a latch function. The same type of sense amplifier is used for a fast read operation, realizing 5-ns access time under typical conditions. For further current savings in search operations, the precharging of the match line is controlled based on the valid bit status. Also, a dual bit switch with optimized size and control reduces the current. CAM macros of 256×54 configuration on test chips showed 7.3-ns search access time with a power-performance metric of 131 fJ/bit/search under typical conditions.
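The power-performance metric quoted above can be turned into an energy figure per search. The following is a minimal sketch, assuming the metric scales linearly with the number of bits in the array; the function name `search_energy_joules` is hypothetical and not taken from the paper.

```python
FEMTO = 1e-15  # joules per femtojoule


def search_energy_joules(words: int, bits_per_word: int,
                         fj_per_bit_per_search: float) -> float:
    """Energy of one search over the whole array, assuming linear scaling."""
    return words * bits_per_word * fj_per_bit_per_search * FEMTO


# 256 x 54 test-chip configuration at 131 fJ/bit/search (figures from the abstract)
energy = search_energy_joules(256, 54, 131.0)
print(f"{energy * 1e9:.3f} nJ per search")
```

Under that assumption, one search of the 256×54 macro costs a little under 2 nJ.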

4.
This paper demonstrates the first 8-Mb chain ferroelectric RAM (chain FeRAM) with 0.25-μm 2-metal CMOS technology. A small die of 76 mm² and a high average cell/chip area efficiency of 57.4% have been realized by introducing not only chain architecture but also four new techniques: 1) a one-pitch shift cell realizes a small cell size of 5.2 μm²; 2) a new hierarchical wordline architecture reduces row-decoder and plate-driver areas without an extra metal layer; 3) a small-area dummy cell scheme reduces dummy capacitor size to 1/3 of the conventional one; and 4) a new array activation scheme reduces dataline and second amplifier areas. As a result, the chain architecture with these new techniques reduces die size to 65% of that of the conventional FeRAM. Moreover, a ferroelectric capacitor overdrive scheme enables sufficient polarization switching without overbiasing the memory cell array. This scheme lowers the minimum operation voltage by 0.23 V and enables 2.5-V Vdd operation. Thanks to the fast cell plateline drive of the chain architecture, the 8-Mb chain FeRAM has achieved the fastest random access time, 40 ns, and read/write cycle time, 70 ns, at 3.0 V reported so far.

5.
A low-power precomputation-based fully parallel content-addressable memory   (Cited 1 time in total: 0 self-citations, 1 citation by others)
This paper presents a novel VLSI architecture for a fully parallel precomputation-based content-addressable memory (PB-CAM) with low-power, low-cost, and low-voltage features. This design is based on a precomputation approach that not only saves power in the CAM system but also reduces the transistor count and operating voltage of the CAM cell. In addition, the proposed PB-CAM word structure adopts the static pseudo-nMOS circuit design to improve system performance. The whole design was fabricated with the TSMC 0.35-μm single-poly quadruple-metal CMOS process. With a CAM size of 128 words by 30 bits, the measurement results indicate that the proposed circuit works up to 100 MHz with a power consumption of 33 mW at a 3.3-V supply voltage and works up to 30 MHz under a 1.5-V supply voltage.
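The precomputation idea in this abstract can be illustrated with a small software model. It is only a sketch under an explicit assumption: the abstract does not say which parameter is precomputed, so a count of one-bits is used here as a stand-in, and the class and function names are hypothetical.

```python
def ones_count(word: int) -> int:
    """Assumed precomputed parameter: number of 1 bits in the word."""
    return bin(word).count("1")


class PrecomputationCAM:
    """Toy model: compare the small parameter first, then the full word."""

    def __init__(self):
        self.entries = []  # list of (parameter, word)

    def write(self, word: int) -> int:
        self.entries.append((ones_count(word), word))
        return len(self.entries) - 1  # address of the stored word

    def search(self, key: int):
        key_param = ones_count(key)
        matches, full_compares = [], 0
        for addr, (param, word) in enumerate(self.entries):
            if param != key_param:
                continue  # parameter mismatch: full-word comparison is skipped
            full_compares += 1
            if word == key:
                matches.append(addr)
        return matches, full_compares


cam = PrecomputationCAM()
for w in (0b1010, 0b0110, 0b1111, 0b0001):
    cam.write(w)
print(cam.search(0b0110))
```

In this model the saving comes from skipped full-word comparisons; in hardware the analogous saving would be match lines that never need to be activated.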

6.
A new FRAM (ferroelectric RAM) design method utilising a bit-plate parallel cell architecture is presented. This method is effective in reducing the circuit and layout overhead caused by the on-pitch plate control circuitry. It also reduces the power consumption in the memory array. Implementation results for a 512 kb FRAM prototype in a 0.13 μm CMOS technology show that the memory block area in the proposed architecture is 15.6% less than that of a conventional structure.

7.
In this paper, we propose a new architecture for the two-level lossless data compression and decompression algorithm proposed in that combines the PDLZW algorithm and an approximated adaptive Huffman algorithm with dynamic-block exchange (AHDB). In the new architecture, we replace the CAM dictionary set used in the PDLZW algorithm with a CAM-tag-based dictionary set to reduce hardware cost, and the CAM-based ordered list used in the AHDB algorithm with a memory inter-reference (MIR) stage realized by using two SRAMs. The resulting architecture is then implemented based on cell-based libraries with both 0.35-μm 2P4M and 0.18-μm 1P6M process technologies. With the same process technology, the prototyped chip demonstrates that the new architecture not only has better performance, at least 33% improvement, but also occupies less area, only about 44%, and consumes less power, about 50%, in comparison with the architecture proposed in . In addition, the maximum data rate can reach 2 Gbps when realized in the 0.35-μm 2P4M process technology and 4 Gbps when realized in the 0.18-μm 1P6M process technology.

8.
Image compression applications use vector quantization (VQ) for its high compression ratio and image quality. Current VQ hardware employs static rather than dynamic code book generation because the latter, although it offers better image quality, demands intensive computation and correspondingly expensive hardware. This paper describes a VLSI architecture for a real-time dynamic code book generator and encoder of 512×512 images at 30 frames/s. The four-chip 0.8 μm CMOS design implements a tree of Kohonen self-organizing maps and consists of two VQ processors and two image buffer memory chips. The pipelined VQ processor contains a computational core for both code book generation and encoding, and is scalable to processing larger frames.

9.
Chanho Lee 《ETRI Journal》2004,26(1):21-26
This paper proposes a new architecture for a Viterbi decoder with an efficient memory management scheme. The trace-back operation is eliminated in the architecture and the memory storing intermediate decision information can be removed. The elimination of the trace-back operation also reduces the number of operation cycles needed to determine decision bits. The memory size of the proposed scheme is reduced to 1/(5×constraint length) of that of the register exchange scheme, and the throughput is increased up to twice that of the trace-back scheme. A Viterbi decoder complying with the IS-95 reverse link specification is designed to verify the proposed architecture. The decoder has a code rate of 1/3, a constraint length of 9, and a trace-forward depth of 45.

10.
This paper presents a code compression and on-the-fly decompression scheme suitable for coarse-grain reconfigurable technologies. These systems pose further challenges by having an order of magnitude higher memory requirement due to much wider instruction words than typical VLIW/TTA architectures. Current compression schemes are evaluated. A highly efficient and novel dictionary-based lossless compression technique is implemented and compared against a previous implementation for a reconfigurable system. This paper looks at several conflicting design parameters, such as the compression ratio, silicon area, latency, and power consumption. Compression ratios in the range of 0.32 to 0.44 are recorded with the proposed scheme for a given set of test programs. With these test programs, a 60% overall silicon area saving is achieved, even after the decompressor hardware overhead is taken into account. The proposed technique may be applied to any architecture which exhibits common characteristics to the example reconfigurable architecture targeted in this paper.
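To make the term "compression ratio" concrete (compressed size divided by original size, so lower is better), here is a generic dictionary-based sketch. It is not the scheme from the paper: the one-bit flag format, the choice of the most frequent words, and all names are assumptions made for illustration, and the input is assumed to be non-empty.

```python
from collections import Counter
from math import ceil, log2


def dictionary_compress(words, word_bits, dict_size):
    """Replace the most frequent instruction words with short dictionary indices."""
    dictionary = [w for w, _ in Counter(words).most_common(dict_size)]
    index = {w: i for i, w in enumerate(dictionary)}
    index_bits = max(1, ceil(log2(len(dictionary))))
    encoded = []
    compressed_bits = len(dictionary) * word_bits  # the dictionary itself is overhead
    for w in words:
        if w in index:
            encoded.append((1, index[w]))          # flag bit + index
            compressed_bits += 1 + index_bits
        else:
            encoded.append((0, w))                 # flag bit + raw word
            compressed_bits += 1 + word_bits
    ratio = compressed_bits / (len(words) * word_bits)
    return dictionary, encoded, ratio


def dictionary_decompress(dictionary, encoded):
    """Inverse of dictionary_compress: look up indices, pass raw words through."""
    return [dictionary[v] if flag else v for flag, v in encoded]


program = [0xAAAA] * 6 + [0xBBBB] * 3 + [0x1234]
d, enc, ratio = dictionary_compress(program, word_bits=16, dict_size=2)
assert dictionary_decompress(d, enc) == program
print(f"compression ratio: {ratio:.3f}")
```

Counting the dictionary storage in the compressed size mirrors the abstract's point that savings should be reported after overhead is included.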

11.
Describes the architecture and design of a CMOS VLSI chip for data compression and decompression using tree-based codes. The chip, called MARVLE, implements a memory-based architecture for variable length encoding and decoding based on tree-based codes. The architecture is based on an efficient scheme of mapping the tree representing any binary code onto a memory device. A prototype 2-μm CMOS VLSI chip has been designed, verified, and fabricated by the MOSIS facility. The chip has a 512×12 static RAM with an access time of 4 ns and logic circuitry for compression as well as decompression. The chip occupies a silicon area of 6.8 mm×6.9 mm and consists of 49695 transistors. The prototype chip yields a compression rate of 95.2 Mb/s and a decompression rate of 60.6 Mb/s with a clock rate of 83.3 MHz. The VLSI hardware can be used to implement the JPEG baseline compression scheme.
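The idea of mapping a code tree onto a memory device can be sketched in software. The layout below (one memory word per internal node, holding an entry for bit 0 and an entry for bit 1) is an assumption chosen for clarity and is not claimed to be MARVLE's actual memory format; the code is assumed to be prefix-free.

```python
def build_tree_memory(code):
    """Map a prefix-free code (symbol -> bit string) onto a flat memory.

    Each memory word is [entry_for_0, entry_for_1]; an entry is either
    ("node", address) or ("leaf", symbol). Address 0 is the root.
    """
    memory = [[None, None]]
    for symbol, bits in code.items():
        addr = 0
        for bit in bits[:-1]:
            b = int(bit)
            if memory[addr][b] is None:
                memory.append([None, None])
                memory[addr][b] = ("node", len(memory) - 1)
            addr = memory[addr][b][1]
        memory[addr][int(bits[-1])] = ("leaf", symbol)
    return memory


def decode(bits, memory):
    """Decode a bit string with one memory read per input bit."""
    out, addr = [], 0
    for bit in bits:
        kind, value = memory[addr][int(bit)]
        if kind == "leaf":
            out.append(value)
            addr = 0          # restart at the root for the next codeword
        else:
            addr = value      # follow the pointer to the child node
    return out


code = {"a": "0", "b": "10", "c": "11"}
memory = build_tree_memory(code)
print(decode("010110", memory))
```

Because each input bit costs exactly one lookup, decoding time in this model is proportional to the compressed length, independent of the shape of the tree.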

12.
Based on an analysis of typical hybrid-type content addressable memory (CAM) structures, a hybrid-type CAM architecture with lower power consumption and higher stability is proposed. This design changes the connection of an N-type metal-oxide-semiconductor (NMOS) transistor in the control circuit, which greatly reduces the power consumption during comparison by making the match line discharge only to the NMOS threshold voltage. A comparative study of the conventional and proposed hybrid-type CAM architectures was carried out in Semiconductor Manufacturing International Corporation (SMIC) 65 nm complementary metal-oxide-semiconductor (CMOS) technology. Simulation shows that the power consumption of the proposed structure is reduced by 23%. Furthermore, the proposed design also adjusts the match line (ML) discharge path. When the NAND-type block matches and the NOR-type block mismatches, the jitter voltage on the match line is greatly reduced.

13.
Code density is of increasing concern in embedded system design since it reduces the need for the scarce resource memory and also implicitly improves further important design parameters like power consumption and performance. In this paper we introduce a novel, hardware-supported approach. Besides the code, the lookup tables (LUTs) are also compressed, since they can become significant in size if the application is large and/or high compression is desired. Our scheme optimizes the number and size of generated LUTs to improve the compression ratio. To show the efficiency of our approach, we apply it to two compression schemes: "dictionary-based" and "statistical". We achieve an average compression ratio of 48% (already including the overhead of the LUTs). Our scheme is orthogonal to approaches that take particularities of a certain instruction set architecture into account. We have conducted evaluations using a representative set of applications and have applied it to three major embedded processor architectures, namely ARM, MIPS, and PowerPC.

14.
With great scalability potential, nonvolatile magnetoresistive memory with spin-torque transfer (STT) programming has become a topic of great current interest. This paper addresses cell structure design for STT magnetoresistive RAM, content addressable memory (CAM), and ternary CAM (TCAM). We propose a new RAM cell structure design that can realize high-speed and reliable sensing operations in the presence of a relatively poor magnetoresistive ratio, while maintaining low sensing current through magnetic tunneling junctions (MTJs). We further apply the same basic design principle to develop new cell structures for nonvolatile CAM and TCAM. The effectiveness of the proposed RAM, CAM, and TCAM cell structures has been demonstrated by circuit simulation at 0.18 μm CMOS technology.

15.
A high-throughput memory-efficient decoder architecture for low-density parity-check (LDPC) codes is proposed based on a novel turbo decoding algorithm. The architecture benefits from various optimizations performed at three levels of abstraction in system design, namely LDPC code design, decoding algorithm, and decoder architecture. First, the interconnect complexity problem of current decoder implementations is mitigated by designing architecture-aware LDPC codes having embedded structural regularity features that result in a regular and scalable message-transport network with reduced control overhead. Second, the memory overhead problem in current-day decoders is reduced by more than 75% by employing a new turbo decoding algorithm for LDPC codes that removes the multiple check-to-bit message update bottleneck of the current algorithm. A new merged-schedule message-passing algorithm is also proposed that reduces the memory overhead of the current algorithm for low to moderate-throughput decoders. Moreover, a parallel soft-input-soft-output (SISO) message update mechanism is proposed that implements the recursions of the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm in terms of simple "max-quartet" operations that do not require lookup tables and incur negligible loss in performance compared to the ideal case. Finally, an efficient programmable architecture coupled with a scalable and dynamic transport network for storing and routing messages is proposed, and a full-decoder architecture is presented. Simulations demonstrate that the proposed architecture attains a throughput of 1.92 Gb/s for a frame length of 2304 bits, and achieves savings of 89.13% and 69.83% in power consumption and silicon area over the state of the art, with a reduction of 60.5% in interconnect length.

16.
An intelligent cache based on a distributed architecture that consists of a hierarchy of three memory sections, DRAM (dynamic RAM), SRAM (static RAM), and CAM (content addressable memory) as an on-chip tag, is reported. The test device of the memory core is fabricated in a 0.6 μm double-metal CMOS standard DRAM process, and the CAM matrix and control logic are embedded in the array. The array architecture can be applied to 16-Mb DRAM with less than 12% of the chip overhead. In addition to the tag, the array-embedded CAM matrix supports a write-back function that provides a short read/write cycle time. The cache DRAM also has pin compatibility with address-nonmultiplexed memories. By achieving a reasonable hit ratio (90%), this cache DRAM provides a high-performance intelligent main memory with a 12 ns (hit)/34 ns (average) cycle time and 55 mA (at 25 MHz) operating current.

17.
In embedded control applications, system cost and power/energy consumption are key considerations. In such applications, program memory forms a significant part of the chip area. Hence reducing code size reduces the system cost significantly. A significant part of the total power is consumed in fetching instructions from the program memory. Hence reducing instruction fetch power has been a key target for reducing power consumption. To reduce cost and power consumption, embedded systems in these applications use application-specific processors that are fine-tuned to provide better solutions in terms of code density and power consumption. Further fine-tuning to suit each particular application in the targeted class can be achieved through reconfigurable architectures. In this paper, we propose a reconfiguration mechanism, called Instruction Re-map Table, to re-map the instructions to shorter-length code words. Using this mechanism, frequently used sets of instructions can be compressed. This reduces code size and hence the cost. Secondly, we use the same mechanism to target power reduction by encoding frequently used instruction sequences to Gray codes. Such encodings, along with instruction compression, reduce the instruction fetch power. We enhance the Texas Instruments DSP core TMS320C27x to incorporate this mechanism and evaluate the improvements in code size and instruction fetch energy using real-life embedded control application programs as benchmarks. Our scheme reduces the code size by over 10% and the energy consumed by over 40%. *A preliminary version of this paper has appeared in the International Conference on Computer Aided Design (ICCAD-2001), San Jose, CA, November 2001.
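The benefit of Gray-code encoding for fetch power can be seen by counting bit toggles between consecutive code words. The sketch below uses the standard reflected binary Gray code and a simple toggle count as a rough proxy for switching activity; it does not model the paper's Instruction Re-map Table, and the function names are hypothetical.

```python
def binary_to_gray(n: int) -> int:
    """Reflected binary Gray code: consecutive integers differ in exactly one bit."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Invert binary_to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bit_toggles(words) -> int:
    """Total number of bit positions that change between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


sequence = list(range(8))                      # a frequently executed sequence, indexed 0..7
gray_sequence = [binary_to_gray(i) for i in sequence]
print("binary toggles:", bit_toggles(sequence))
print("gray toggles:  ", bit_toggles(gray_sequence))
```

Fewer toggles on the instruction bus generally means less switching activity, which is the effect the abstract's encoding is aiming for.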

18.
A novel static random access memory (SRAM)-based hardware architecture for a longest prefix match (LPM) search scheme is proposed in this paper. The key concept of this architecture is to store the IP prefixes virtually in the forwarding table. This architecture reduces memory consumption by using a two-tier hierarchical SRAM-based memory structure for maintaining the next hop port information. Next hop addresses are kept in a shared global memory called next hop global memory (NHGM), and their links are maintained in another memory, called next hop link memory (NHLM). This reduces memory consumption by approximately 50–62.5% compared to existing SRAM-based schemes. The proposed architecture takes a single memory write cycle to store an IP prefix and a single memory read cycle for an LPM search. However, finding next hop information incurs two memory read cycles due to the hierarchical next hop memory structure. The proposed scheme performs an LPM lookup operation in 1.05–1.31 ns in IPv4 and between 1.05 and 1.6 ns in IPv6. This results in an LPM search throughput of 950 million lookups per second (MLPS) to 760 MLPS in IPv4 and between 620 and 950 MLPS in IPv6. The average search throughput achievable from this architecture is roughly 850 MLPS in IPv4 and 780 MLPS in IPv6. The numerical results show that this architecture significantly reduces memory requirement, power consumption, and transistor-count/bit requirement.
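The quoted throughput figures follow from the lookup latencies if one lookup completes per latency period. A short check, assuming exactly that relationship (the helper name is hypothetical):

```python
def million_lookups_per_second(latency_ns: float) -> float:
    """MLPS for a given per-lookup latency, assuming one lookup per period."""
    return 1e3 / latency_ns  # (1e9 ns/s / latency_ns) / 1e6


for label, latency_ns in [("fastest lookup", 1.05),
                          ("slowest IPv4 lookup", 1.31),
                          ("slowest IPv6 lookup", 1.6)]:
    print(f"{label}: {latency_ns} ns -> "
          f"{million_lookups_per_second(latency_ns):.0f} MLPS")
```

The results land within a few MLPS of the rounded values given in the abstract (950, 760, and 620 MLPS).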

19.
Chain codes are the most size-efficient representations of rasterised binary shapes and contours. This paper considers a new lossless chain code compression method based on move-to-front transform and an adaptive run-length encoding. The former reduces the information entropy of the chain code, whilst the latter compresses the entropy-reduced chain code by coding the repetitions of chain code symbols and their combinations using a variable-length model. In comparison to other state-of-the-art compression methods, the entropy-reduction is highly efficient, and the newly proposed method yields, on average, better compression.
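The two stages named in this abstract can be sketched with their textbook forms. This is a simplified illustration: the run-length stage below is plain, non-adaptive RLE rather than the adaptive variable-length model the paper describes, and the 8-symbol alphabet (as in an 8-direction chain code) is an assumption.

```python
def move_to_front(symbols, alphabet):
    """Replace each symbol by its position in a recency list, then move it to the front."""
    table = list(alphabet)
    out = []
    for s in symbols:
        i = table.index(s)
        out.append(i)
        table.insert(0, table.pop(i))
    return out


def inverse_move_to_front(indices, alphabet):
    """Undo move_to_front by replaying the same table updates."""
    table = list(alphabet)
    out = []
    for i in indices:
        s = table.pop(i)
        out.append(s)
        table.insert(0, s)
    return out


def run_length_encode(values):
    """Collapse repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


chain = [0, 0, 0, 1, 1, 1, 1, 2, 2]           # directions along a contour
mtf = move_to_front(chain, range(8))
assert inverse_move_to_front(mtf, range(8)) == chain
print(mtf)
print(run_length_encode(mtf))
```

After the transform, long straight segments of the contour become runs of zeros, which is what makes the subsequent run-length stage effective.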

20.
A new high-density multiple-valued content-addressable memory (CAM) is proposed to perform highly parallel search operations in a limited chip area. The number of cells in the CAM is reduced by the use of multiple-valued data representation. Moreover, multiple-valued stored data correspond to the threshold voltage of a floating-gate MOS transistor, so that the cell circuit can be designed using only a single transistor. As a result, the cell area of the proposed four-valued CAM is reduced to 14% of that of a conventional dynamic binary CAM, and its performance is about 5.4 times higher than that of the corresponding binary one under a 0.8-μm standard EEPROM technology.
