
1.

Using bus-based connections to improve field-programmable gate-array density for implementing datapath circuits

Ye A. Rose J. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(5):462-473

As the logic capacity of field-programmable gate arrays (FPGAs) increases, they are increasingly being used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components (called bit-slices) which are connected together by regularly structured signals (called buses), it is possible to utilize datapath regularity in order to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA routing architecture, called the multibit routing architecture, which employs bus-based connections in order to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA routing architectures, the multibit routing architecture can achieve 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area savings of 10%. This paper also empirically determines the best values of several important architectural parameters for the new routing architecture including the most area efficient granularity values and the most area efficient proportion of bus-based connections.
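
The two area figures in this abstract can be related with a quick calculation. The sketch below assumes (this is not stated in the abstract) that only routing area changes between the two architectures, so the overall saving equals the routing saving multiplied by routing's share of total area. The function name is hypothetical.

```python
# Back-of-the-envelope check, assuming only routing area changes:
# overall_saving = routing_saving * routing_fraction_of_total_area

def implied_routing_fraction(routing_saving: float, overall_saving: float) -> float:
    """Return the share of total FPGA area that routing must occupy
    for the given routing saving to produce the given overall saving."""
    if not 0 < routing_saving <= 1:
        raise ValueError("routing_saving must be in (0, 1]")
    return overall_saving / routing_saving

# Figures quoted in the abstract: 14% routing reduction, 10% overall saving.
routing_fraction = implied_routing_fraction(0.14, 0.10)
print(f"Implied routing share of total area: {routing_fraction:.1%}")
```

Under that assumption, routing would account for roughly seven tenths of the total area in the baseline architecture.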

2.

Stateful-NOR based reconfigurable architecture for logic implementation

《Microelectronics Journal》2015,46(6):551-562

Most commercial Field Programmable Gate Arrays (FPGAs) have limitations in terms of density, speed, configuration overhead and power consumption, mostly due to the use of SRAM cells in Look-Up Tables (LUTs), configuration memory and programmable interconnects. Also, hardwired Application Specific Integrated Circuit (ASIC) blocks designed for high-performance arithmetic circuits in FPGAs reduce the area available for reconfiguration. In this paper, we propose a novel generalized hybrid CMOS-memristor architecture using stateful-NOR gates as basic building blocks for implementing logic functions. These logic functions are implemented on memristor nanocrossbar layers, while the CMOS layer is used for selection and connection of memristors. The proposed pipelined architecture combines the features of ASIC, FPGA and microprocessor-based designs. It has high density due to the use of the nanocrossbar layer, and high throughput, especially for arithmetic circuits. The proposed architecture for a three-input, one-output logic block is compared with a conventional LUT-based Configurable Logic Block (CLB) having the same number of inputs and outputs, and shows a 1.82× area saving, a 1.57× speedup and 3.63× lower power consumption. An automation algorithm to implement any logic function using the proposed architecture is also presented.

3.

Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance (cited by: 2; self-citations: 0; citations by others: 2)

《Solid-State Circuits, IEEE Journal of》2009,44(1):49-63

A 65 nm resilient circuit test-chip is implemented with timing-error detection and recovery circuits to eliminate the clock frequency guardband from dynamic supply voltage (V_CC) and temperature variations as well as to exploit path-activation probabilities for maximizing throughput. Two error-detection sequential (EDS) circuits are introduced to preserve the timing-error detection capability of previous EDS designs while lowering clock energy and removing datapath metastability. One EDS circuit is a dynamic transition detector with a time-borrowing datapath latch (TDTB). The other EDS circuit is a double-sampling static design with a time-borrowing datapath latch (DSTB). In comparison to previous EDS designs, TDTB and DSTB redirect the highly complex metastability problem from both the datapath and error path to only the error path, enabling a drastic simplification in managing metastability. From a survey of various EDS circuit options, TDTB represents the lowest clock energy EDS circuit known; DSTB represents the lowest clock energy static-EDS circuit with SER protection known. Error-recovery circuits are introduced to replay failing instructions at lower clock frequency to guarantee correct functionality. Relative to conventional circuits, test-chip measurements demonstrate that resilient circuits enable either 25%-32% throughput gain at equal V_CC or at least 17% V_CC reduction at equal throughput, corresponding to 31%-37% total power reduction.

4.

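FPGA implementation of an ultra-compact dedicated AES processor

Zhang Juying Wang Heming 《微电子学与计算机》 (Microelectronics & Computer) 2008,25(4):165-168

This paper proposes an FPGA-based dedicated processor design. It is an ultra-small-area design for the Advanced Encryption Standard (AES) that supports key expansion (currently designed for 128-bit keys), encryption and decryption. The design uses a datapath that is 8 bits wide throughout, a novel byte-substitution circuit and a multiply-accumulate structure. The dedicated AES processor is implemented on the XC2S15, the smallest Xilinx Spartan II FPGA, using less than 60% of its resources. At a clock frequency of 70 MHz, it reaches an average encryption/decryption throughput of 2.1 Mb/s. It is mainly intended for applications where low resource usage and low power consumption are the priority.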
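
The quoted clock frequency and throughput imply a cycle budget per AES block. The sketch below works it out, assuming that "Mb/s" means 10^6 bits per second and that throughput is counted over 128-bit AES blocks; neither is spelled out in the abstract, and the function name is hypothetical.

```python
AES_BLOCK_BITS = 128  # AES always processes 128-bit blocks

def cycles_per_block(clock_hz: float, throughput_bps: float,
                     block_bits: int = AES_BLOCK_BITS) -> float:
    """Average clock cycles spent per block at the given clock and throughput."""
    blocks_per_second = throughput_bps / block_bits
    return clock_hz / blocks_per_second

# Figures quoted in the abstract: 70 MHz clock, 2.1 Mb/s average throughput.
cycles = cycles_per_block(70e6, 2.1e6)
print(f"About {cycles:.0f} clock cycles per 128-bit block")
```

A budget of a few thousand cycles per block is in line with a small sequential 8-bit datapath that trades speed for area.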

5.

Wave-pipelined intra-chip signaling for on-FPGA communications

Terrence Mak Pete Sedcole 《Integration, the VLSI Journal》2010,43(2):188-201

On-FPGA communication is becoming more problematic as the long interconnection performance is deteriorating in technology scaling. In this paper, we address this issue by proposing a novel wave-pipelined signaling scheme to achieve substantial throughput improvement in FPGAs. A new analytical model capturing the electrical characteristics in FPGA interconnects is presented. Based on the model, throughput and power consumption of a wave-pipelined link have been derived analytically and compared to the conventional synchronous links. Two circuit designs are proposed to realize wave-pipelined link using FPGA fabrics. The proposed approaches are also compared with conventional synchronous and asynchronous pipelining techniques. It is shown that the wave-pipelined approach can achieve up to 5.7 times improvement in throughput and 13% improvement in power consumption versus conventional delay-based on-chip communication schemes. Also, trade-offs between power, throughput and area consumption between the proposed and conventional designs are studied. The wave-pipelining approach provides a new alternative for on-FPGA communications and can potentially become a promising solution to mitigate the future interconnect scaling challenge.

6.

Implementation of a High Throughput Soft MIMO Detector on GPU

Michael Wu Yang Sun Siddharth Gupta Joseph R. Cavallaro 《Journal of Signal Processing Systems》2011,64(1):123-136

Multiple-input multiple-output (MIMO) significantly increases the throughput of a communication system by employing multiple antennas at the transmitter and the receiver. To extract maximum performance from a MIMO system, a computationally intensive search-based detector is needed. To meet the challenge of MIMO detection, typical suboptimal MIMO detectors are ASIC or FPGA designs. We aim to show that a MIMO detector on a graphics processing unit (GPU), a low-cost parallel programmable co-processor, can achieve high throughput and can serve as an alternative to ASIC/FPGA designs. However, careful architecture-aware software design is needed to leverage the performance offered by the GPU. We propose a novel soft MIMO detection algorithm, multi-pass trellis traversal (MTT), and show that we can achieve ASIC/FPGA-like performance and handle different configurations in software on the GPU. The proposed design can be used to accelerate wireless physical layer simulations and to offload MIMO detection processing in wireless testbed platforms.

7.

Optical interconnects for neural and reconfigurable VLSI architectures

Fey D. Erhard W. Gruber M. Jahns J. Bartelt H. Grimm G. Hoppe L. Sinzinger S. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2000,88(6):838-848

The increasing transistor density in very large-scale integrated (VLSI) circuits and the limited pin number in off-chip communication lead to a situation described as the interconnect crisis in microelectronics. Optoelectronic VLSI (OE-VLSI) circuits using short-distance optical interconnects and optoelectronic devices like microlaser, modulator, and detector arrays for optical off-chip sending and receiving offer a technology to overcome this crisis. However, in order to exploit efficiently the potential of thousands of optical off-chip interconnects, an appropriate VLSI architecture is required. We show for the example of neural and reconfigurable VLSI architectures that fine-grain architectures fulfill these requirements. An OE-VLSI circuit realization based on multiple quantum-well modulators functioning as a two-dimensional (2-D) optical input/output (I/O) interface for the chip is presented. Due to the parallel optical interface, an improvement of two to three orders of magnitude in throughput is possible compared to all-electronic solutions. For the optical interconnects, a planar-integrated free-space optical system has been designed, leading to an optical multichip module. Such a system has been fabricated and experimentally characterized. Furthermore, we designed and manufactured fiber arrays, which will be the core element of a convenient test station for the 2-D optoelectronic I/O interface of OE-VLSI circuits.

8.

Optimization and Implementation of a Viterbi Decoder Under Flexibility Constraints

《IEEE transactions on circuits and systems. I, Regular papers》2008,55(8):2411-2422

This paper discusses the impact of flexibility when designing a Viterbi decoder for both convolutional and TCM codes. Different trade-offs have to be considered in choosing the right architecture for the processing blocks and the resulting hardware penalty is evaluated. We study the impact of symbol quantization that degrades performance and affects the wordlength of the rate-flexible trellis datapath. A radix-2-based architecture for this datapath relaxes the hardware requirements on the branch metric and survivor path blocks substantially. The cost of flexibility in terms of cell area and power consumption is explored by an investigation of synthesized designs that provide different transmission rates. Two designs are fabricated in a digital 0.13-μm CMOS process. Based on post-layout simulations, a symbol baud rate of 168 Mbaud/s is achieved in TCM mode, equivalent to a maximum throughput of 840 Mbit/s using a 64-QAM constellation.
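
The symbol rate and throughput quoted at the end of this abstract can be cross-checked. The sketch below derives the number of information bits carried per symbol and compares it with the raw capacity of a 64-QAM symbol. Reading the difference as coding redundancy is an inference from the two numbers, not something the abstract states; the function name is hypothetical.

```python
import math

def info_bits_per_symbol(throughput_bps: float, symbol_rate_baud: float) -> float:
    """Information bits delivered per transmitted symbol."""
    return throughput_bps / symbol_rate_baud

# Figures quoted in the abstract: 840 Mbit/s at 168 Mbaud with 64-QAM.
info_bits = info_bits_per_symbol(840e6, 168e6)
raw_bits = math.log2(64)            # bits a 64-point constellation can label
redundant_bits = raw_bits - info_bits

print(f"{info_bits:.0f} information bits of {raw_bits:.0f} raw bits per symbol")
```

The remaining bit per symbol would be consistent with the redundancy that trellis-coded modulation typically adds.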

9.

Multi-symbol-sliced dynamically reconfigurable Reed-Solomon decoder design based on unified finite-field processing element (cited by: 1; self-citations: 0; citations by others: 1)

Huai-Yi Hsu Jih-Chiang Yeo An-Yeu Wu 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(5):489-500

Reed-Solomon (RS) codes play an important role in providing error correction and data integrity in various communication and storage applications. For high-speed applications, most RS decoders are implemented as dedicated application-specific integrated circuits (ASICs) based on parallel architectures, which can deliver a high data throughput rate. For lower-speed applications, the RS decoding operations are usually performed by fine-grained processing elements (PEs) controlled by a programmable digital signal processing (DSP) core, which provides high flexibility. In this paper, we propose a novel m-PE multi-symbol-sliced (MSS) RS datapath structure. The m-PE RS architecture is a highly scalable design and can be dynamically reconfigured into 1-PE, 2-PE, ..., m/2-PE, and m-PE modes to deliver the necessary data throughput rate. With the help of a gated-clock scheme that turns off the idle PEs, the proposed runtime-configurable ASIC design provides a good tradeoff between data throughput rate and power consumption. Hence, it can save energy and extend the battery life of portable devices. We demonstrate a prototype design with 4 PEs in UMC 0.18-μm CMOS technology. The design can be dynamically reconfigured to operate in 1-PE, 2-PE, and 4-PE modes, with performance of 140 Mb/s at 18.91 mW, 280 Mb/s at 28.77 mW, and 560 Mb/s at 48.47 mW, respectively. Compared with existing RS designs, the proposed m-PE RS decoder has better normalized area/power efficiency than most DSP-type and ASIC-type RS designs. The reconfigurable feature makes our design a good candidate for the error control coding (ECC) unit of the storage system in power-aware portable devices.
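
The three operating points quoted above make the throughput/power tradeoff easy to quantify as energy per decoded bit. The sketch below only rearranges the numbers in the abstract (power divided by throughput); the dictionary layout and function name are illustrative choices, not taken from the paper.

```python
# Operating points quoted in the abstract:
# number of active PEs -> (throughput in bit/s, power in W)
MODES = {
    1: (140e6, 18.91e-3),
    2: (280e6, 28.77e-3),
    4: (560e6, 48.47e-3),
}

def energy_per_bit_pj(throughput_bps: float, power_w: float) -> float:
    """Energy spent per bit, in picojoules (J/bit = W / (bit/s))."""
    return power_w / throughput_bps * 1e12

energy = {pes: energy_per_bit_pj(*point) for pes, point in MODES.items()}

for pes, pj in energy.items():
    print(f"{pes}-PE mode: {pj:.2f} pJ/bit")
```

By this measure the energy per bit falls as more PEs are enabled, while the absolute power rises, which is the tradeoff the reconfigurable modes expose.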

10.

FPGA based fast and high-throughput 2-slow retiming 128-bit AES encryption algorithm

Reza Rezaeian Farashahi Bahram Rashidi Sayed Masoud Sayedi 《Microelectronics Journal》2014

This paper presents a high-throughput digital design of the 128-bit Advanced Encryption Standard (AES) algorithm on FPGA, based on the 2-slow retiming technique. C-slow retiming is a well-known optimization technique for high performance. It can enhance designs with feedback loops and automatically rebalances the registers in the design. C-slow retiming can break the critical path of the design into finer pieces to improve throughput. The difficulty of C-slow retiming on FPGA lies in finding the best register allocation in the datapath, so that as registers are added and relocated the AES architecture stays balanced and the critical path is optimally pipelined. In this paper, the architecture of the AES algorithm is implemented at the gate level with high-speed, breakable structures that suit 2-slow retiming. The MixColumns transformation is implemented with combinational-logic modules for multiplication by the constants 2 and 3. This work has been verified and synthesized using Xilinx ISE 11 for a Virtex-5 XC5VLX85 FPGA. The proposed implementation achieves a throughput of 86 Gb/s and a maximum operating frequency of 671.524 MHz, whereas the highest throughput and the highest operating frequency reported in the literature are 73.737 Gb/s and 576.07 MHz, respectively.
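
The abstract mentions building MixColumns from multiply-by-2 and multiply-by-3 modules. The sketch below is a software model of that standard AES construction over GF(2^8), not the authors' gate-level circuit; the function names are illustrative.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 2 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:          # overflow out of 8 bits: reduce by the AES polynomial
        b ^= 0x11B
    return b & 0xFF

def mul3(b: int) -> int:
    """Multiply a byte by 3 in GF(2^8): 3*b = 2*b XOR b."""
    return xtime(b) ^ b

def mix_column(col: list[int]) -> list[int]:
    """Apply the AES MixColumns matrix (rows are rotations of 2 3 1 1) to one column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

# Commonly used MixColumns test column.
mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
print([hex(b) for b in mixed])
```

Because only shifts and XORs are involved, each output byte maps to a shallow combinational network, which is what makes the transformation easy to pipeline. As a separate sanity check on the reported figures, 128 bits per cycle at 671.524 MHz gives about 85.96 Gb/s, which matches the quoted 86 Gb/s if one block completes per clock cycle.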

11.

Dual-Data Rate Transpose-Memory Architecture Improves the Performance, Power and Area of Signal-Processing Systems

Mohamed El-Hadedy Xinfei Guo Martin Margala Mircea R. Stan Kevin Skadron 《Journal of Signal Processing Systems》2017,88(2):167-184

This paper presents a novel type of high-speed and area-efficient register-based transpose memory architecture enabled by reporting on both edges of the clock. The proposed new architecture, by using the double-edge triggered registers, doubles the throughput and increases the maximum frequency by avoiding some of the combinational circuit used in prior work. The proposed design is evaluated with both FPGA and ASIC flow in 28/32 nm technology. The experimental results show that the proposed memory achieves almost 4X improvement in throughput while consuming 46% less area with the FPGA implementations compared to prior work. For ASIC implementations, it achieves more than 60% area reduction and at least 2X performance improvement while burning 60% less power compared to other register-based designs implemented with the same flow. As an example, a proposed 8X8 transpose memory with 12-bit input/output resolution is able to achieve a throughput of 107.83 Gbps at 647 MHz by taking only 140 slices on a Virtex-7 Xilinx FPGA platform, and achieve a throughput of 88.2 Gbps at 529 MHz by taking 0.024 mm² silicon area for ASIC. The proposed transpose memory is integrated in both 2D-DCT and 2D-IDCT blocks for signal processing applications on the same FPGA platform. The new architecture allows a 3.5X speed-up in performance for the 2D-DCT algorithm, compared to the previous work, while consuming 28% less area, and 2D-IDCT achieves a 3X speed-up while consuming 20% less area.

12.

The Triptych FPGA architecture

Borriello G. Ebeling C. Hauck S.A. Burns S. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1995,3(4):491-501

Field-programmable gate arrays (FPGAs) are an important implementation medium for digital logic. Unfortunately, they currently suffer from poor silicon area utilization due to routing constraints. In this paper we present Triptych, an FPGA architecture designed to achieve improved logic density with competitive performance. This is done by allowing a per-mapping tradeoff between logic and routing resources, and with a routing scheme designed to match the structure of typical circuits. We show that, using manual placement, this architecture yields a logic density improvement of up to a factor of 3.5 over commercial FPGAs, with comparable performance. We also describe Montage, the first FPGA architecture to fully support asynchronous and synchronous interface circuits.

13.

A routing algorithm for FPGAs with time-multiplexed interconnects

Ruiqi Luo Xiaolei Chen Yajun Ha 《半导体学报》 (Journal of Semiconductors) 2020,(2):73-82

Previous studies show that interconnects occupy a large portion of the timing budget and area in FPGAs. In this work, we propose a time-multiplexing technique on FPGA interconnects. In order to fully exploit this interconnect architecture, we propose a time-multiplexed routing algorithm that can actively identify qualified nets and schedule them to multiplexable wires. We validate the algorithm by using the router to implement 20 benchmark circuits to time-multiplexed FPGAs. We achieve a 38% smaller minimum channel width and 3.8% smaller circuit critical path delay compared with the state-of-the-art architecture router when a wire can be time-multiplexed six times in a cycle.

14.

Performance Characterization of AES Datapath Architectures in 90-nm Standard Cell CMOS Technology

Cheng Wang Howard M. Heys 《Journal of Signal Processing Systems》2014,75(3):217-231

In this paper, we characterize the performance of datapath architectures of the Advanced Encryption Standard (AES). These architectures are parameterized by a datapath width of 8, 16, 32, 64, or 128 bits and, for the 128-bit width, an unrolling factor of 1, 2, 5 or 10. Composite field S-boxes are adopted for all the architectures, and shift-register-based ShiftRows and MixColumns components are used for architectures with datapath widths of less than 128 bits. Their performance in terms of area, peak power and average energy is benchmarked using a 90-nm standard cell CMOS technology under a variety of throughput requirements. Through this characterization, the performance trade-offs affected by the architecture parameters are extensively explored. The parameters leading to the best performance are identified. It is found that the 8-bit width datapath, which is conventionally adopted for resource efficient purposes, has the worst energy efficiency and does not result in the minimal peak power among the architectures. As well, the 16, 32 and 64-bit width AES datapath architectures are newly considered or represent improvements over previous work.

15.

An efficient logic emulation system

Varghese J. Butts M. Batcheller J. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1993,1(2):171-174

The Realizer, a logic emulation system that automatically configures a network of field-programmable gate arrays (FPGAs) to implement large digital logic designs, is presented. Logic and interconnect are separated to achieve optimum FPGA utilization. Its interconnection architecture, called the partial crossbar, greatly reduces system-level placement and routing complexity, achieves bounded interconnect delay, scales linearly with pin count, and allows hierarchical expansion to systems with hundreds of thousands of FPGA devices in a fast and uniform way. An actual multiboard system has been built, using 42 Xilinx XC3090 FPGAs for logic. Several designs, including a 32-b CPU datapath, have been automatically realized and operated at speed. They demonstrate very good FPGA utilization. The Realizer has applications in logic verification and prototyping, simulation, architecture development, and special-purpose execution.

16.

Circuits and architectures for field programmable gate array with configurable supply voltage

Lin Y. Fei Li Lei He 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(9):1035-1047

Field programmable gate arrays (FPGAs) with supply voltage (Vdd) programmability have been proposed recently to reduce FPGA power, where the Vdd-level can be customized for FPGA circuit elements and unused circuit elements can be power-gated. In this paper, we first design novel Vdd-programmable and Vdd-gateable interconnect switches with minimal number of configuration SRAM cells. We then evaluate Vdd-programmable FPGA architectures using the new switches. The best architecture in our study uses Vdd-programmable logic blocks and Vdd-gateable interconnects. Compared to the baseline architecture similar to the leading commercial architecture, our best architecture reduces the minimal energy-delay product by 54.39% with 17% more area and 3% more configuration SRAM cells. Our evaluation results also show that LUT size 4 gives the lowest energy consumption, and LUT size 7 leads to the highest performance, both for all evaluated architectures.

17.

An Asynchronous Design for Testability and Implementation in Thin-film Transistor Technology

Chi-Hsuan Cheng James Chien-Mo Li 《Journal of Electronic Testing》2011,27(2):193-201

This paper presents a scan chain design for dual-rail asynchronous circuits. This is a true asynchronous scan chain because no clock is needed even in scan mode. This is a full-scan design for testability (DfT), so only combinational automatic test pattern generation (ATPG) is needed and the fault coverage of generated test patterns is very high. This technique can be applied to various kinds of asynchronous circuits, including pipelines, state machines, and interconnects. Experiments on an 8051 datapath circuit show that the coverage is as high as 99.59%. This technique has been proven to work successfully in 8 μm thin-film transistor (TFT) technology on glass.

18.

From synchronous to GALS: A new architecture for FPGAs

René Gagné Jean Belzile 《Microelectronics Journal》2009,40(11):1657-1666

The conflicting demands for faster and larger designs are increasingly difficult to meet through advances in solid-state technology alone. At some point, it is expected that designers and manufacturers will have to give up the traditional synchronous design methodology for a Globally Asynchronous Locally Synchronous (GALS) one. Such changes imply more synchronization constraints, but also more flexibility. Consequently, this paper proposes a novel Field-Programmable Gate Array (FPGA) architecture that is compatible with existing devices and that can also support GALS designs. The main objective is simple: the proposed architecture must appear unchanged for synchronous design, but it must also include a minimal amount of basic components to prevent metastability for efficient asynchronous communications. Thus, the paper presents the constraint equations required to implement such a circuit. It also presents a pausible clock generator application and simulation results for the proposed architecture. All results demonstrate that with a few additional customized circuits, a standard FPGA cell can become appropriate for GALS methodologies.

19.

A Unified FPGA-Based System Architecture for 2-D Discrete Wavelet Transform

Ishmael Sameen Yoong Choon Chang Mow Song Ng Bok-Min Goi Chee-Pun Ooi 《Journal of Signal Processing Systems》2013,71(2):123-142

This paper presents a novel unified and programmable 2-D Discrete Wavelet Transform (DWT) system architecture, which was implemented using a Field Programmable Gate Array (FPGA)-based Nios II soft-core processor working in combination with custom hardware accelerators generated through high-level synthesis. The proposed system architecture, synthesized on an Altera DE3 Stratix III FPGA board, was developed through an iterative design space exploration methodology using Altera’s C2H compiler. Experimental results show that the proposed system architecture is capable of real-time video processing performance for grayscale image resolutions of up to 1920 × 1080 (1080p) when run on the Altera DE3 board, and it outperforms the existing 2-D DWT architecture implementations known in the literature by a considerable margin in terms of throughput. While the proposed 2-D DWT system architecture satisfies real-time performance constraints, it can also perform both forward and inverse DWT, support a number of popular DWT filters used for image and video compression, and provide architecture programmability in terms of the number of levels of decomposition as well as image width and height. Based on the design principles used to implement the proposed 2-D DWT system architecture, a system design guideline can be formulated for SOC designs which plan to incorporate dedicated 2-D DWT hardware acceleration.

20.

Review of advanced FPGA architectures and technologies

Yang Haigang Zhang Jia Sun Jiabin Yu Le 《电子科学学刊(英文版)》 (Journal of Electronics (China)) 2014,31(5):371-393

The Field Programmable Gate Array (FPGA) is an efficient reconfigurable integrated circuit platform and has become a core signal processing microchip device of digital systems over the last decade. With the rapid development of semiconductor technology, the performance and system integration of FPGA devices have progressed significantly, and at the same time new challenges arise. The design of FPGA architecture is required to evolve to meet these challenges, while also taking advantage of ever-increasing microchip density. This survey reviews the recent development of advanced FPGA architectures, including improvements in programming technologies, logic blocks, interconnects, and embedded resources. Moreover, some important emerging design issues of FPGA architectures, such as novel memory-based FPGAs and 3D FPGAs, are also presented to provide an outlook for future FPGA development.