1.

MOUSETRAP: High-Speed Transition-Signaling Asynchronous Pipelines

Singh M. Nowick S.M. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(6):684-698

An asynchronous pipeline style, called MOUSETRAP, is introduced for high-speed applications. The pipeline uses standard transparent latches and static logic in its datapath, and small latch controllers consisting of only a single gate per pipeline stage. This simple structure is combined with an efficient and highly concurrent event-driven protocol between adjacent stages. Post-layout SPICE simulations of a ten-stage pipeline with a 4-bit wide datapath indicate throughputs of 2.1-2.4 GHz in a 0.18-μm TSMC CMOS process. Similar results were obtained when the datapath width was extended to 16 bits. This performance is competitive even with that of wave pipelines, without the accompanying problems of complex timing and much design effort. Additionally, the new pipeline gracefully and robustly adapts to variable speed environments. The pipeline stages are extended to fork and join structures, to handle more complex system architectures.

2.

Pseudo-Parallel Datapath Structure for Power Optimal Implementation of 128-pt FFT/IFFT for WPAN

J. Mathew K. Maharatna B. R. Jose H. Rahaman D. K. Pradhan 《Circuits, Systems, and Signal Processing》2011,30(4):871-882

An optimal implementation of a 128-Pt FFT/IFFT for low-power IEEE 802.15.3a WPAN using a pseudo-parallel datapath structure is presented, where the 128-Pt FFT is decomposed into 8-Pt and 16-Pt FFTs, and the 16-Pt FFT is decomposed further into 4×4 and 2×8. We analyze the 128-Pt FFT/IFFT architecture for various pseudo-parallel 8-Pt and 16-Pt FFTs and explore an optimum datapath architecture. It is suggested that there exists an optimum degree of parallelism for the given algorithm. The analysis demonstrates that with a modest increase in area one can achieve a significant reduction in power. The proposed architectures complete one parallel-to-parallel (i.e., when all input data are available in parallel and all output data are generated in parallel) 128-point FFT computation in less than 312.5 ns and thereby meet the standard specification. The relative merits and demerits of these architectures have been analyzed from the algorithm as well as the implementation point of view. A detailed power analysis of each of the architectures with a different number of datapaths at block level is described. We found that, from a power perspective, the architecture with eight datapaths is optimum. The core power consumption in the optimum case is 60.6 mW, which is less than half that of the latest reported 128-point FFT design in 0.18-μm technology. Furthermore, a Single Event Upset (SEU) tolerant scheme for registers is also explored. The SEU tolerant scheme does not affect the performance; however, it increases power consumption by about 42 percent. Apart from the low power consumption, the advantages of the proposed architectures include reduced hardware complexity, regular data flow and simple counter-based control.

3.

A 2000-MOPS embedded RISC processor with a Rambus DRAM controller

Suzuki K. Daito M. Inoue T. Nadehara K. Nomura M. Mizuno M. Iima T. Sato S. Fukuda T. Arai T. Kuroda I. Yamashina M. 《Solid-State Circuits, IEEE Journal of》1999,34(7):1010-1021

We have developed a 0.25-μm, 200-MHz embedded RISC processor for multimedia applications. This processor has a dual-issue superscalar datapath that consists of a 32-bit integer unit and a 64-bit single-instruction multiple-data (SIMD) function unit that together have a total of five multiply-adders. An on-chip concurrent Rambus DRAM (C-RDRAM) controller uses interleaved transactions to increase the memory bandwidth of the Rambus channel to 533 Mb/s. The controller also reduces latency by using transaction interleaving and instruction prefetching. A 64-bit, 200-MHz internal bus transfers data among the CPU core, the C-RDRAM, and the peripherals. These high-data-rate channels improve CPU performance because they eliminate a bottleneck in the data supply. The datapath part of this chip was designed using a functional macrocell library that included placement information for leaf cells, which resulted in the SIMD function unit having 68000 transistors per square millimeter.

4.

Power optimization for the datapath of a 32-bit reconfigurable pipelined DSP processor

Han Liang Chen Jie Chen Xiaodong 《Journal of Electronics (China)》2005,22(6):650-657

As circuit scale continues to grow, power consumption receives far more attention than before, especially in large designs. This paper describes experience in optimizing the power consumption of the 16-bit datapath in a 32-bit reconfigurable pipelined Digital Signal Processor (DSP). By holding the old input values and preventing useless switching of the logic blocks on the datapath, the power consumption is substantially lowered. At the same time, by relocating some logic blocks between pipeline stages and employing data-forwarding logic, a better balanced pipeline is achieved that lowers the power consumption of conditional computation instructions at very low timing and area costs. The effectiveness of these power optimization techniques is demonstrated by the experimental results. Finally, some ideas for reducing circuit power consumption are proposed, which are effective and useful in practical designs, especially pipelined ones.

5.

Time redundancy and gate sizing soft error-tolerant based adder design

《Integration, the VLSI Journal》2021

In this paper, we propose an efficient and promising soft error tolerance approach for arithmetic circuits with high performance and low area overhead. The technique is applied to the design of soft error tolerant adders and is based on the use of a fault tolerant C-element, connecting a given adder output to one input of the C-element while connecting a delayed version of that output to the second input. It exploits the variability of the delay of the adder output bits, in which the most significant bits (MSBs) have longer delay than the least significant bits (LSBs), by adding larger delay to the LSBs and smaller delay to the MSBs to guarantee full fault tolerance against the largest pulse width of transient error (soft error) for the available technology with minimum impact on performance. To guarantee fault protection for transistors feeding outputs with smaller added delay, the technique utilizes transistor scaling to ensure that the injected fault pulse width is less than the added delay of the second output of the C-element. Simulation results reveal that the proposed technique outperforms other techniques in terms of failure rate, area overhead, and delay overhead. The evaluation experiments were based on transistor-level simulations using HSPICE to account for temporal masking combined with electrical masking. In comparison to TMR, the technique achieves 100% reliability with 31% reduction in area overhead without impacting performance in the case of a 32-bit adder, and 42% reduction in area overhead and 5% reduction in performance overhead in the case of a 64-bit adder. While our proposed technique achieves area reduction of 4.95% and 9.23% in comparison to CE-based DMR and Feedback-based DMR techniques in the case of a 32-bit adder, it achieves area reduction of 19.58% and 23.24% in the case of a 64-bit adder.

6.

Constant addition with flagged binary adder architectures

Vibhuti Dave Erdal Oruklu 《Integration, the VLSI Journal》2010,43(3):258-267

The goal of this paper is to present architectures that provide the flexibility within a regular adder to augment/decrement the sum of two numbers by a constant. This flexibility adds to the functionality of a regular adder while achieving performance comparable to conventional designs, thereby eliminating the need for a dedicated adder unit to perform similar tasks. This paper presents an adder design that accomplishes three-input addition when the third operand is a constant. This is accomplished by the introduction of flag bits. Such designs are called Enhanced Flagged Binary Adders (EFBA). The paper also examines the effect on the performance of the adder when the operand size is expanded from 16 bits to 32 and 64 bits. A detailed analysis is provided to compare the performance of the new designs with carry-save adders in terms of delay, area and power.

7.

Structural Cryptanalysis of SASAS

Alex Biryukov Adi Shamir 《Journal of Cryptology》2010,23(4):505-518

In this paper we consider the security of block ciphers which contain alternate layers of invertible S-boxes and affine mappings (there are many popular cryptosystems which use this structure, including the winner of the AES competition, Rijndael). We show that a five-layer scheme with 128-bit plaintexts and 8-bit S-boxes is surprisingly weak against what we call a multiset attack, even when all the S-boxes and affine mappings are key dependent (and thus completely unknown to the attacker). We tested the multiset attack with an actual implementation, which required just 2¹⁶ chosen plaintexts and a few seconds on a single PC to find the 2¹⁷ bits of information in all the unknown elements of the scheme.

8.

Energy efficient hybrid adder architecture

《Integration, the VLSI Journal》2015

An energy efficient adder design based on a hybrid carry computation is proposed. Addition takes place by considering the carry as propagating forwards from the LSB and backwards from the MSB. The incidence at a midpoint significantly accelerates the addition. This acceleration, together with the combination of low-cost ripple-carry and carry-chain circuits, yields energy efficiency compared to other adder architectures. The optimal midpoint is analytically formulated and its closed-form expression is derived. To avoid the quadratic RC delay growth in a long carry chain, it is optimally repeated. The adder is enhanced in a tree-like structure for further acceleration. 32, 64 and 128-bit adders targeting 500 MHz and 1 GHz clock frequencies were designed in 65 nm technology. They consumed 11–18% less energy compared to adders generated by a state-of-the-art EDA synthesis tool.

9.

Novel Reversible Design of Advanced Encryption Standard Cryptographic Algorithm for Wireless Sensor Networks

P. Saravanan P. Kalpana 《Wireless Personal Communications》2018,100(4):1427-1458

The quantum of power consumption in wireless sensor nodes plays a vital role in power management, since a larger number of functional elements are integrated in a smaller space and operated at very high frequencies. In addition, variations in power consumption pave the way for power analysis attacks, in which the attacker gains control of the secret parameters involved in the cryptographic implementation embedded in the wireless sensor nodes. Hence, a strong countermeasure is required to provide adequate security in these systems. Traditional digital logic gates are used to build the circuits in wireless sensor nodes, and the primary reason for their power consumption is the absence of the reversibility property in those gates. These irreversible logic gates dissipate power as heat for every bit of information lost. In order to minimize the power consumption and in turn to circumvent the issues related to power analysis attacks, reversible logic gates can be used in wireless sensor nodes. This shifts the focus from power-hungry irreversible gates to potentially powerful circuits based on controllable quantum systems. Reversible logic gates theoretically consume zero power and have an accurate quantum circuit model for practical realizations such as quantum computers and implementations based on quantum-dot cellular automata. One of the key components in wireless sensor nodes is the cryptographic algorithm implementation, which is used to secure the information collected by the sensor nodes. In this work, a novel reversible gate design of the 128-bit Advanced Encryption Standard (AES) cryptographic algorithm is presented. The complete structure of the AES algorithm is designed using combinational logic circuits, which are then mapped to reversible logic circuits. The proposed architectures make use of the Toffoli family of reversible gates. The performance metrics of the proposed designs, such as gate count and quantum cost, are rigorously analyzed with respect to the existing designs and are tabulated. Our proposed reversible design of the AES algorithm shows considerable improvements in the performance metrics when compared to existing designs.

10.

A high-speed energy-efficient 64-bit reconfigurable binary adder

Perri S. Corsonello P. Cocorullo G. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(5):939-943

Datapaths for media signal processing are typically built using programmable computational elements such as adders and multipliers, which can be run-time reconfigured to operate on simple integers with 8, 16, or 32 bits of precision. In this brief, a new high-speed energy-efficient reconfigurable adder for media signal processing is presented. The proposed circuit is based on carry-propagation schemes and can be partitioned to perform one 64-, two 32-, four 16-, and eight 8-bit additions. When the Austria Mikro System (AMS) 0.35 μm 2-poly 3-metal 3.3 V CMOS (CSD) process is used to produce layout, a worst propagation delay of about 4.9 ns and an average energy dissipation of about 181 μW/MHz are obtained.
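
The partitioning described in this abstract (one 64-, two 32-, four 16-, or eight 8-bit additions on the same 64-bit operands) can be modeled in software. The sketch below is only a behavioral illustration in Python, not the circuit from the brief: the function name `partitioned_add` and the masking trick used to block carries at lane boundaries are assumptions made for this example.

```python
def partitioned_add(a: int, b: int, lane_bits: int, total_bits: int = 64) -> int:
    """Add a and b as independent lanes of lane_bits each.

    Carries never cross a lane boundary, so each lane wraps around
    modulo 2**lane_bits. Hypothetical helper for illustration only.
    """
    if lane_bits <= 0 or total_bits % lane_bits != 0:
        raise ValueError("lane_bits must evenly divide total_bits")

    full_mask = (1 << total_bits) - 1

    # Mask with the most significant bit of every lane set.
    msb_mask = 0
    for lane in range(total_bits // lane_bits):
        msb_mask |= 1 << (lane * lane_bits + lane_bits - 1)
    low_mask = full_mask & ~msb_mask

    # Add everything except the lane MSBs: with the MSBs cleared,
    # a carry can reach the MSB position but cannot leave the lane.
    partial = (a & low_mask) + (b & low_mask)

    # Fold the operand MSBs back in with XOR (addition without carry-out).
    return (partial ^ ((a ^ b) & msb_mask)) & full_mask


if __name__ == "__main__":
    x, y = 0xFFFF_FFFF_FFFF_FFFF, 0x0000_0000_0000_0001
    for width in (8, 16, 32, 64):
        print(width, hex(partitioned_add(x, y, width)))
```

With 8-bit lanes only the lowest byte wraps to zero, while with a single 64-bit lane the whole word wraps, which is the behavioral difference a run-time reconfigurable adder exposes.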

11.

Design and performance testing of a 2.29-GB/s Rijndael processor

Verbauwhede I. Schaumont P. Kuo H. 《Solid-State Circuits, IEEE Journal of》2003,38(3):569-572

This contribution describes the design and performance testing of an Advanced Encryption Standard (AES) compliant encryption chip that delivers 2.29 GB/s of encryption throughput at 56 mW of power consumption in a 0.18-μm CMOS standard cell technology. This integrated circuit implements the Rijndael encryption algorithm, at any combination of block lengths (128, 192, or 256 bits) and key lengths (128, 192, or 256 bits). We present the chip architecture and discuss the design optimizations. We also present measurement results that were obtained from a set of 14 test samples of this chip.

12.

A 186-Mvertices/s 161-mW Floating-Point Vertex Processor With Optimized Datapath and Vertex Caches

Chang-Hyo Yu Kyusik Chung Donghyun Kim Seok-Hoon Kim Lee-Sup Kim 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(10):1369-1382

In this paper, a power efficient vertex processor for mobile graphics applications is presented. A four-threaded and four-issue expanded VLIW datapath with a quad-float vertex texture fetcher is proposed by exploiting graphics specific characteristics after evaluation of several candidate architectures. Instruction-level power control methods such as operand sharing and writeback re-allocation, along with operand isolation and gated clocks, result in 40.4% and 82% reductions in energy dissipation and energy-delay product, respectively, compared to the most widely used single-threaded SIMD. The proposed processor with the optimized datapath and vertex caches, implemented in a 0.18-μm 1P4M CMOS process, achieves 186-Mvertices/s geometry performance, which is the best result among the processors that are IEEE-754 compliant.

13.

An area-efficient and low-power 64-point pipeline Fast Fourier Transform for OFDM applications

《Integration, the VLSI Journal》2017

In orthogonal frequency division multiplexing (OFDM) based wireless systems, the Fast Fourier Transform (FFT) is a critical block, as it occupies a large area and consumes considerable power. In this paper, we present area-efficient and low-power 16-bit word-width 64-point radix-2² and radix-2³ pipelined FFT architectures for an OFDM-based IEEE 802.11a wireless LAN baseband. The designs are derived from the radix-2^k algorithm and adopt a Single-Path Delay Feedback (SDF) architecture for hardware implementation. To eliminate the complex multipliers and the read-only memory (ROM) used for internal storage of twiddle factor coefficients, the proposed 64-point FFT employs a Canonical Signed Digit (CSD) complex constant multiplier using adders, multiplexers and shifters. The complex constant multiplier (CCM) is modified using a common sub-expression sharing block that reduces the area of the design. The proposed radix-2² and radix-2³ pipelined FFT architectures are modeled and implemented using TSMC 180 nm CMOS technology with a supply voltage of 1.8 V. The implementation results show that the proposed architectures significantly reduce the hardware cost and power consumption in comparison to existing 64-point FFT architectures.
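
To illustrate why a Canonical Signed Digit (CSD) constant multiplier needs only shifters and adders, the following Python sketch recodes a non-negative integer coefficient into digits from {-1, 0, +1} with no two adjacent nonzero digits, then multiplies by shift-and-add. It is a minimal sketch for a real integer coefficient only; the complex twiddle-factor multiplier and the common sub-expression sharing of the paper are not modeled, and the names `to_csd` and `csd_multiply` are hypothetical.

```python
from __future__ import annotations


def to_csd(n: int) -> list[int]:
    """Return the CSD digits of a non-negative integer, least significant first.

    Each digit is -1, 0, or +1, and no two adjacent digits are both nonzero.
    """
    if n < 0:
        raise ValueError("this sketch handles non-negative coefficients only")
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1; n mod 4 == 3 -> digit -1
            digit = 2 - (n & 3)
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits


def csd_multiply(x: int, coeff: int) -> int:
    """Multiply x by a constant using only shifts, additions and subtractions."""
    acc = 0
    for shift, digit in enumerate(to_csd(coeff)):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc


if __name__ == "__main__":
    print(to_csd(23))           # digits of 23 = 32 - 8 - 1, LSB first
    print(csd_multiply(5, 23))  # same value as 5 * 23
```

Because CSD minimizes the number of nonzero digits, a fixed coefficient such as 23 needs three shifted terms (32 - 8 - 1) instead of the four in its plain binary form (16 + 4 + 2 + 1).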

14.

A CCD line addressable random-access memory (LARAM)

《Solid-State Circuits, IEEE Journal of》1975,10(5):268-272

A novel approach to charge-coupled device (CCD) memory organization has been conceived and implemented in a 16 384-bit memory chip. It utilizes an isoplanar n-channel silicon gate MOS process in conjunction with self-aligned implanted barrier, buried channel CCD technology. The chip is organized in four parallel, identical sections of 32 independent lines with each line 128 bits long. The four sections are controlled in parallel. Any of the 32 lines (the same line in each of the four sections) can be randomly accessed; hence the name, line addressable random-access memory (LARAM). Each line can be brought to a halt at any of its 128 possible positions. Design features and test results of the memory are described.

15.

Optimal Datapath Widths Within Turbo and Viterbi Decoders for High Area- and Energy-Efficiency

Martin Broich Tobias G. Noll 《Journal of Signal Processing Systems》2017,87(3):299-325

Datapath widths in state-of-the-art Turbo and Viterbi decoder implementations depend on estimated upper bounds of the differences of processed metrics. Aiming at the highest area and energy efficiency, this paper presents guidelines for designing Turbo and Viterbi decoder datapaths with minimal widths. This is based on maximum absolute values of branch, state and path metric differences within the Max-Log-MAP and Viterbi decoding algorithms, respectively, applying modulo normalization. The proposed methodology for determining the maximum absolute values covers punctured as well as n-input binary convolutional and Turbo codes, so it accommodates higher-radix add-compare-select operations. Maximum absolute values of metric differences and minimum datapath widths are presented for the 3GPP-LTE, DVB-RCS2 and IEEE-802.16 (WiMAX) compliant Turbo decoders and for the IEEE-802.11 (Wi-Fi), IEEE-802.16 (WiMAX) and 3GPP-LTE compliant Viterbi decoders. In addition, a new dynamic branch-metric saturation scheme is presented, which enables a further datapath width reduction within Turbo decoders. In total, a datapath width reduction of two bits (20 %) is achieved applying radix-4 Max-Log-MAP arithmetic. An overall area-time-energy complexity reduction of 31% is achieved for a soft-input soft-output decoder and of 24% for the LTE Turbo decoder.
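
The link between bounded metric differences and datapath width can be shown with a small model of modulo normalization. The Python sketch below assumes the textbook rule that metrics stored modulo 2^w compare correctly whenever the true difference is smaller in magnitude than 2^(w-1); it does not reproduce the tighter bounds or the saturation scheme derived in the paper, and the helper names `wrap`, `modulo_greater` and `min_width` are hypothetical.

```python
def wrap(value: int, width: int) -> int:
    """Keep only the low `width` bits, as a wrapping hardware register would."""
    return value & ((1 << width) - 1)


def modulo_greater(m1: int, m2: int, width: int) -> bool:
    """Return True if the true metric behind m1 exceeds the one behind m2.

    Assumes both metrics are stored modulo 2**width and that their true
    difference is smaller in magnitude than 2**(width - 1).
    """
    diff = wrap(m1 - m2, width)
    return 0 < diff < (1 << (width - 1))


def min_width(max_abs_diff: int) -> int:
    """Smallest width w satisfying max_abs_diff < 2**(w - 1)."""
    return max_abs_diff.bit_length() + 1


if __name__ == "__main__":
    width = min_width(10)  # metrics never differ by more than 10
    a, b = wrap(1030, width), wrap(1020, width)
    print(width, a, b, modulo_greater(a, b, width))
```

In this model the stored values may wrap around many times, yet the comparison stays correct, so a tighter bound on the maximum metric difference translates directly into fewer datapath bits.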

16.

Merging VLIW and vector processing techniques for a simple, high-performance processor architecture

《Microelectronics Journal》2015,46(7):637-655

This paper proposes a new processor architecture called VVSHP for accelerating data-parallel applications, which are growing in importance and demanding increased performance from hardware. VVSHP merges VLIW and vector processing techniques for a simple, high-performance processor architecture. One key point of VVSHP is the execution of multiple scalar instructions within VLIW and vector instructions on unified parallel execution datapaths. Another key point is to reduce the complexity of VVSHP by designing a two-part register file: (1) shared scalar–vector part with eight-read/four-write ports 64×32-bit registers (64 scalar or 16×4 vector registers) for storing scalar/vector data and (2) vector part with two-read/one-write ports 48 vector-registers, each stores 4×32-bit vector data. Moreover, processing vector data with lengths varying from 1 to 256 represents a key point for reducing the loop overheads. VVSHP can issue up to four scalar/vector operations in each cycle for parallel processing a set of operands and producing up to four results to be written back into VVSHP register file. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to data memory. The design of our proposed VVSHP processor is implemented using VHDL targeting the Xilinx FPGA Virtex-5 and its performance is evaluated.

17.

A 2-μm CMOS 8-MIPS digital processor with parallel processing capability

《Solid-State Circuits, IEEE Journal of》1986,21(5):750-765

A 2-μm CMOS VLSI digital signal processor (DSP) family, the SP50, is described that is capable of eight million instructions per second and up to six concurrent operations in each instruction. Two DSPs, the PCB5010 and PCB5011, have been developed. Both are based on a common architecture which contains two 16-bit data buses, and a 16×16→40-bit multiplier accumulator and 16-bit ALU, both with multiprecision support in hardware. Also implemented are two static data RAMs (128×16 or 256×16), a data ROM (51×16), a 15-word three-port register file, three address computation units, and five serial and parallel I/O interfaces. The data path is controlled by an orthogonal instruction set, using 40-bit microcode words. The controller contains a five-level stack and an instruction repeat register, and can have either on-chip program memory (RAM: 32×40; ROM: 987×40) or off-chip program memory (up to 64K×40). Benchmarks show a two to sixfold improvement in overall performance over its predecessors.

18.

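FPGA Implementation of an Ultra-Compact AES Application-Specific Processor

Zhang Juying Wang Heming 《Microelectronics & Computer》2008,25(4):165-168

An FPGA-based application-specific processor design is proposed. It is an ultra-small-area design for the Advanced Encryption Standard that supports key expansion (currently designed for 128-bit keys), encryption, and decryption. The design uses a full 8-bit datapath width, a novel byte-substitution circuit, and a multiply-accumulate structure, and implements an AES application-specific processor on the smallest Xilinx Spartan II FPGA, the XC2S15, using less than 60% of its resources. With a 70 MHz clock, it achieves an average encryption/decryption throughput of 2.1 Mb/s. It is mainly intended for applications in which low resource usage and low power consumption take priority.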

19.

A BiCMOS thermal head intelligent driver with density controllers for full-tone rendition printers

Tsumura M. Takeuchi R. Shimizu I. 《Solid-State Circuits, IEEE Journal of》1988,23(2):437-441

A 3-μm BiCMOS thermal head driver using pulsewidth modulation dealing with eight-bit density input data (256 gray levels) is described. Circuits composed of 64×8-bit complex counters, which function as eight-line parallel 64-bit shift registers (shift mode) and as 64 counters which have eight bits (count mode) by alteration of their mutual connections according to the mode signals, have been developed. The complex counter controls the output pulse width according to the binary data and the clock intervals (minimum 100 ns). The shift registers can operate using a 20-MHz clock. The driver consists of about 4500 CMOS gates and 128 bipolar transistors in a 2.8-mm×8.8-mm chip size. The breakdown voltage of the bipolar transistor, BV_cbo, is more than 35 V. The driver is especially suited for full-tone rendition printers. Applications of the driver include use in thermal print heads, LED print heads, and LCD print heads.

20.

Scalable Parallel Memory Architectures for Video Coding

Jarno K. Tanskanen Jarkko T. Niittylahti 《The Journal of VLSI Signal Processing》2004,38(2):173-199

Current video compression standards, which process frames macroblock by macroblock, employ several processing functions to achieve the compression. These functions refer to data memory address space in different ways. For example, performing motion estimation and motion compensation often requires data accesses that are not aligned to word boundaries. On the other hand, the Discrete Cosine Transform (DCT) and its inverse (IDCT) for an 8 × 8 block can be performed first for rows and then for columns. Thus, transposition is needed between these two stages. Among other things, a parallel memory architecture can provide a solution for these tasks. In our other paper, we briefly surveyed parallel memory architectures and proposed parallel memory architecture designs for different data path widths for video coding applications. In this paper, we construct video coding function examples that use the proposed parallel data memory efficiently. Furthermore, the performance and implementation cost of the parallel memory architecture are estimated and compared to more conventional memory architectures. The examples are given for different data bus widths (16, 32, 64, and 128 bits). We show that the parallel memory can keep the data path fully utilized in many video coding function implementations. This ensures high-speed operation and full utilization of the processing resources.