首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 453 毫秒
1.
Low-Energy Digit-Serial/Parallel Finite Field Multipliers   总被引:5,自引:0,他引:5  
Digit-serial architectures are best suited for systems requiring moderate sample rate and where area and power consumption are critical. This paper presents a new approach for designing digit-serial/parallel finite field multipliers. This approach combines both array-type and parallel multiplication algorithms, where the digit-level array-type algorithm minimizes the latency for one multiplication operation and the parallel architecture inside of each digit cell reduces both the cycle-time as well as the switching activities, hence power consumption. By appropriately constraining the feasible primitive polynomials, the mod p(x) operation involved in finite field multiplication can be performed in a more efficient way. As a result, the computation delay and energy consumption of one finite field multiplication using the proposed digit-serial/parallel architectures are significantly less than of those obtained by folding the parallel semi-systolic multipliers. Furthermore, their energy-delay products are reduced by a even larger percentage. Therefore, the proposed digit-serial/parallel architectures are attractive for both low-energy and high-performance applications.  相似文献   

2.
In an orthogonal frequency division multiplexing (OFDM) based wireless systems, Fast Fourier Transform (FFT) is a critical block as it occupies large area and consumes more power. In this paper, we present an area-efficient and low power 16-bit word-width 64-point radix-22 and radix-23 pipelined FFT architectures for an OFDM-based IEEE 802.11a wireless LAN baseband. The designs are derived from radix-2k algorithm and adopt a Single-Path Delay Feedback (SDF) architecture for hardware implementation. To eliminate the complex multipliers and read-only memory (ROM) which is used for internal storage of twiddle factor coefficients, the proposed 64-point FFT employs a Canonical Signed Digit (CSD) complex constant multiplier using adders, multiplexers and shifters. The complex constant multiplier (CCM) is modified using common sub-expression sharing block that reduces the area of the design. The proposed radix-22 and radix-23 pipelined FFT architectures are modeled and implemented using TSMC 180 nm CMOS technology with a supply voltage of 1.8 V. The implementation results show that the proposed architectures significantly reduces the hardware cost and power consumption in comparison to existing 64-point FFT architectures.  相似文献   

3.
This work presents a novel scalable multiplication algorithm for a type-t Gaussian normal basis (GNB) of GF(2m). Utilizing the basic characteristics of MSD-first and LSD-first schemes with d-bit digit size, the GNB multiplication can be decomposed into n(n + 1) Hankel matrix-vector multiplications. where n = (mt + 1)/d. The proposed scalable architectures for computing GNB multiplication comprise of one d × d Hankel multiplier, four registers and one final reduction polynomial circuit. Using the relationship of the basis conversion from the GNB to the normal basis, we also present the modified scalable multiplier which requires only nk Hankel multiplications, where k = mt/2d if m is even or k = (mt − t + 2)/2d if m is odd. The developed scalable multipliers have the feature of scalability. It is shown that, as the selected digit size d ≥ 8, the proposed scalable architectures have significantly lower time-area complexity than existing digit-serial multipliers. Moreover, the proposed architectures have the features of regularity, modularity, and local interconnection ability. Accordingly, they are well suited for VLSI implementation.  相似文献   

4.
Network-on-Chip (NoC) has been recognized as the new paradigm to interconnect and organize a high number of cores. NoCs address global communication issues in System-on-Chips (SoC) involving communication-centric design and implementation of scalable communication structures evolving application-specific NoC design as a key challenge to modern SoC design. In this paper we present a SystemC customization framework and methodology for automatic design and evaluation of regular and irregular NoC architectures. The presented framework also supports application-specific optimization techniques such as priority assignment, node clustering and buffer sizing. Experimental results show that generated regular NoC architectures achieve an average of 5.5 % lower communication-cost compared to other regular NoC designs while irregular NoCs proved to achieve on average 4.5×higher throughput and 40 % network delay reduction compared to regular mesh topologies. In addition, employing a buffer sizing algorithm we achieve a reduction in network’s power consumption by an average of 45 % for both regular and irregular NoC design flow.  相似文献   

5.
The Discrete Cosine Transform (DCT) is one of the most widely used techniques of transforms in digital signal processing. It is the main algorithm in image and video coding systems. In this paper, we propose an algorithm which generates enhanced Cordic based Loeffler DCT architectures for angle’s precision degrees ranging from 10?1 to 10?7. High level PSNR, area and power estimators have been proposed to make a trade-off between consumption and image quality. An optimal architecture has been retained for its low complexity, low power and high PSNR. The complexity of this architecture is the lowest among the conventional DCT architectures even the BinDCT which is a reference in terms of reduced complexity. The selected architecture has also the closest PSNR to the reference Loeffler-DCT architecture without a substancial loss of power.  相似文献   

6.

The approximate design has emerged as a revolutionary design paradigm to obtain energy efficient digital signal processing cores while exhibiting acceptable accuracy. In different signal processing architectures, multiplier is the prime arithmetic unit and significantly influences the performance of these cores. Therefore, four novel energy efficient rounding based approximate (RBA) multiplier architectures are proposed in this paper. These multipliers first approximate input operands to the nearest power of two values and then achieve multiplication using few adders and shifters. The proposed RBA multipliers significantly reduce implementation complexity and provide higher energy efficiency. Further, a novel reconfigurable rounding based approximate (RRBA) multiplier is proposed to achieve desired performance-quality tradeoff. Further, the performance of proposed RBA and RRBA multipliers is evaluated and analysed over the existing approximate multiplier architectures. The proposed 8-bit RBA0 requires 59.8% (54.7%) reduced area (delay) compared to the existing approximate multiplier. Finally, the efficacy of the proposed multipliers is demonstrated in the application by implementing Gaussian filters embedded with existing and proposed approximate multipliers. The Gaussian filter designed using RBA0 provides 32.5% reduced energy consumption over the filter with existing multiplier.

  相似文献   

7.
The Substitution box (S-Box) forms the core building block of any hardware implementation of the Advanced Encryption Standard (AES) algorithm as it is a non-linear structure requiring multiplicative inversion. This paper presents a full custom CMOS design of S-Box/Inversion S-Box (Inv S-Box) with low power GF (28) Galois Field inversions based on polynomial basis, using composite field arithmetic. The S-Box/Inv S-Box utilizes a novel low power 2-input XOR gate with only six devices to achieve a compact module implemented in 65 nm IBM CMOS technology. The area of the core circuit is only about 288 μm2 as a result of this transistor level optimization. The hardware cost of the S-Box/Inv S-Box is about 158 logic gates equivalent to 948 transistors with a critical path propagation delay of 7.322 ns enabling a throughput of 130 Mega-SubBytes per second. This design indicates a power dissipation of only around 0.09 μW using a 0.8 V supply voltage, and, is suitable for applications such as RFID tags and smart cards which require low power consumption with a small silicon die. The proposed implementation compares favorably with other existing S-Box designs.  相似文献   

8.
This paper presents new time-dependent and time-independent multiplication algorithms over finite fields GF(2m) by employing an interleaved conventional multiplication and a folded technique. The proposed algorithm allows efficient realization of the bit-parallel systolic multipliers. The results show that the proposed time-independent multiplier saves about 54% space complexity as compared to other related multipliers for polynomial and dual bases of GF(2m). The proposed architectures include the features of regularity, modularity and local interconnection. Accordingly, it is well suited for VLSI implementation.  相似文献   

9.
10.
In this paper, we propose some SQuare-RooT (SQRT) Carry SeLect Adder (CSLA) architectures including a high-speed design, a design with the lowest area compared to previous CSLAs, and two hybrid designs. The first proposed architecture is an optimized design of the Binary to Excess-1 Converter (BEC)-based CSLA by employing a new fast and merged add-one and multiplexing circuit. This architecture in addition to attaining much lower area, delay and energy consumption compared to the BEC CSLA, requires almost the same area compared to the best existing CSLA i.e. IRredundant Carry Generation and Selection scheme (IRCGS CSLA) while providing a higher speed. The second proposed CSLA as the lowest-area design is the area-optimized architecture of IRCGS CSLA that exploits a new logic optimization while maintaining its speed. This scheme makes use of a multiplexer-based logic to reduce the number of gates and to achieve a more compact design. In addition, two hybrid CSLAs are proposed by exploiting the benefits of both proposed CSLA architectures. Experimental results show that the hybrid CSLAs lead to lowest area-delay product and energy-delay product among all the proposed and previous designs in a wide range of 8-bit to 128-bit adder size. In fact, 10–48% reduction in area-delay product and 8–65% reduction in energy-delay product are achieved compared to previous designs. Moreover, the hybrid CSLAs outperform the best existing design with respect to all three parameters of area, delay and energy.  相似文献   

11.
Modular multiplication is a fundamental operation in numerous public-key cryptosystems including the RSA method. Increasing popularity of internet e-commerce and other security applications translate into a demand for a scalable performance hardware design framework. Previous scalable hardware methodologies either were not systolic and thus involved performance-degrading, full-word-length broadcasts or were not scalable beyond linear array size. In this paper, these limitations are overcome with the introduction of three classes of scalable-performance modular multiplication architectures based on systolic arrays. Very high clock rates are feasible, since the cells composing the architectures are of bit-level complexity. Architectural methods based on both binary and high-radix modular multiplication are derived. All techniques are constructed to allow additional flexibility for the impact of interconnect delay within the design environment.  相似文献   

12.
A Carry-Select Adder (CSA) is one of the most suitable adders for high-speed applications, but the power and area penalties are greater, because it requires a double Ripple-Carry Adder (RCA) structure corresponding to carry inputs 0 and 1. Current low-power and low-area techniques are not suitable for a standard cell-based design which is one of the widely adopted design methodologies. Our work proposes two simple optimised architectures suitable for standard cell-based designs. A simple decision logic that replaces the RCA for Carry input 1 in a conventional CSA is proposed. One of the proposed architectures reduces power and area significantly with a small delay penalty compared to the existing techniques. Another proposed architecture improves the speed of operation and reduces the power and area considerably. The first one is more suitable for high-speed arithmetic in battery-operated applications where there is a trade-off between speed and power, while the other one is suitable for high-performance applications which also require area and power optimisation. The proposed architectures were implemented in TSMC 0.18um CMOS technology, and compared with conventional Square Root Carry-Select Adders and an existing standard cell-based design.  相似文献   

13.
The optimum architecture design and mapping of QRD-RLS adaptive filters can be achieved through filter architecture selections, look-ahead transformations, and hierarchical pipelining/folding transformations. In this paper, a relaxed annihilation-reordering look-ahead (RARL) architecture is proposed, and shown to be more power and area efficient than pipelined processing architecture which was considered the most area efficient. The filters with this architecture are based on relaxed weight-update through filtering approximation, where a filter tap weight is updated upon arrival of every block of input data, and are speeded up with annihilation-reordering look-ahead transformation. As a result of the computational complexity reduction, this architecture does not change the iteration bound and filter clock frequency, and leads to speed up with linear increase in power consumption, while the pipelined processing architectures result in speedup with quadratic increase in power consumption. Upon hardware mapping, this architecture is also more advantageous to achieve low area designs. Two design examples are presented to illustrate mapping optimization using above transformations. These results are important for mapping designs onto ASICs, FPGAs or parallel computing machines. The results show significant improvements in throughput, power consumption and hardware requirement. It is also interesting to show through mathematics and simulations that the RARL QRD-RLS filters have no performance degradation in terms of convergence rate.  相似文献   

14.
We consider circuit techniques for reducing field-programmable gate-array (FPGA) power consumption and propose a family of new FPGA routing switch designs that are programmable to operate in three different modes: high-speed, low-power, or sleep. High-speed mode provides similar power and performance to traditional FPGA routing switches. In low-power mode, speed is curtailed in order to reduce power consumption. Leakage is reduced by 28%–52% in low-power versus high-speed mode, depending on the particular switch design selected. Dynamic power is reduced by 28%–31% in low-power mode. Leakage power in sleep mode, which is suitable for unused routing switches, is 61%–79% lower than in high-speed mode. Each of the proposed switch designs has a different power/area/speed tradeoff. All of the designs require only minor changes to a traditional routing switch and involve relatively small area overhead, making them easy to incorporate into current commercial FPGAs. The applicability of the new switches is motivated through an analysis of timing slack in industrial FPGA designs. It is observed that a considerable fraction of routing switches may be slowed down (operate in low-power mode), without impacting overall design performance.   相似文献   

15.
ASIC design of a high speed low power circuit for factorial calculation of a number is reported in this paper. The factorial of a number can be calculated using iterative multiplication by incrementing or decrementing process and iterative multiplication can be computed through parallel implementation methodology. Parallel implementation along with Vedic multiplication methodology for calculation of factorial of a number ensures significant reduction in propagation delay and switching power consumption due to reduction of stages in multiplication process, in comparison with the conventionally used Vedic multiplication methodologies like ‘Urdhva-tiryakbyham’ (UT) and ‘Nikhilam Navatascaramam Dasatah’ (NND) based implementation methodology. Transistor level implementation was carried out using spice specter with standard 90 nm CMOS technology and the results were compared with the above mentioned conventional methodologies. The propagation delay for the calculation of 4-bit factorial of a number was only ∼42.13 ns while the power consumption of the same was ∼58.82 mW for a layout area of ∼6 mm2. Improvement in speed was found to be ∼33% and ∼24% while corresponding reduction of power consumption in ∼34.48% and ∼24% for the factorial calculation circuitry in comparison with UT and NND based implementations, respectively.  相似文献   

16.
Recently, the power consumption of integrated circuits has been attracting increasing attention. Many techniques have been studied to improve the power efficiency of digital signal processing units such as fast Fourier transform (FFT) processors, which are popularly employed in both traditional research fields, such as satellite communications, and thriving consumer electronics, such as wireless communications. This paper presents solutions based on parallel architectures for high throughput and power efficient FFT cores. Different combinations of hybrid low‐power techniques are exploited to reduce power consumption, such as multiplierless units which replace the complex multipliers in FFTs, low‐power commutators based on an advanced interconnection, and parallel‐pipelined architectures. A number of FFT cores are implemented and evaluated for their power/area performance. The results show that up to 38% and 55% power savings can be achieved by the proposed pipelined FFTs and parallel‐pipelined FFTs respectively, compared to the conventional pipelined FFT processor architectures.  相似文献   

17.
All modern low power system on a chip (SoC) architectures are equipped with an in-built power management system. Every new system is expected to have more features and lower power consumption, resulting in a continuous demand to improve energy efficiency. To cope up with the ever increasing demand, an active power-aware management verification architecture is necessary to minimize the power consumption. Power reduction techniques include clock-gating, power-gating, multi-voltage, and voltage-frequency scaling. The proposed verification architecture utilizes the Unified Power Format (UPF) 2.1 libraries to achieve early design verification at the Electronic System-Level (ESL) of abstraction. The proposed testbench can verify several designs of different power management schemes. The presented work offers a reduction in power states, CPU time and simulation time as compared to existing techniques. The interactive formal and simulation-based verification methods are used in this paper to remove the simulation artifacts during functional and power co-simulation. Additionally, this paper incorporates functional correctness and power-aware checks for different modules of Design Under Verification (DUV) at Transaction-Level Modeling (TLM).  相似文献   

18.
In this paper, we propose new universal designs of ternary-valued logic (TVL) with high-speed, low-power and full swing output using carbon nanotube FETs (CNTFETs). All of the TVL functions (39 functions) can be implemented in these designs. Ternary value logic is a promising alternative to binary logic due to the reduced integrated circuit (IC) interconnects and chip area. Therefore, a universal design of TVL is a good direction for the future of FPGA design using CNTFET. In this paper, new universal designs of ternary-valued logic based on CNTFETs are proposed and compared with the existing resistive-load CNTFET universal TVL designs. Extensive simulations have been performed in HSPICE to investigate the distribution of power consumption and the delay of the CNTFET-based universal cells due to variations in the supply voltage, the diameter of the CNT, and the room temperature. Simulation results show that the proposed universal TVL designs result in significantly lower power consumption and delay compared with previous resistive-load CNTFET universal TVL implementations.  相似文献   

19.
Variable block-size motion estimation (VBSME) process occupies a major part of computation of an H.264 encoder, which is usually accelerated by bit-parallel hardware architectures with large I/O bit width to meet real-time constrains. However, such kind of architectures increase the area overhead and pin count, and therefore will not be suitable for area-constrained electronic consumer designs such as small portable multimedia devices. This paper addresses this problem by proposing two area efficient least significant bit (LSB) bit-serial architectures with small pin numbers. Both designs take advantage of data reusing technique in different ways for sum of absolute differences (SAD) computation and reading reference pixels, leading to a considerable reduction of memory bandwidth. The first architecture propagates the partial SAD and sum results and broadcasts the reference pixel rows whereas the second design reuse the SAD of small blocks and has a reconfigurable reference buffer leading to a better memory bandwidth when using hardware parallelism. The proposed designs benefit from several optimization techniques including an efficient serial absolute difference architecture, word length reduction by parallelism, bit truncation, mode filtering, and macroblock (MB) level subsampling, which significantly enhance their performances in terms of silicon area, throughput, latency, and power consumption. The first and second designs can support full search VBSME of 720?×?480 video with 30 frames per second (fps), two reference frames, and [?16, 15] search range at a clock frequency of 414 MHz with 29.28 k and 31.5 k gates, respectively.  相似文献   

20.
This paper presents the design and implementation of a hyperelliptic curve cryptography (HECC) coprocessor over affine and projective coordinates, along with measurements of its performance, hardware complexity, and power consumption. We applied several design techniques, including parallelism, pipelining, and loop unrolling, in designing field arithmetic units, group operation units, and scalar multiplication units to improve the performance and power consumption. Our affine and projective coordinate‐based HECC processors execute in 0.436 ms and 0.531 ms, respectively, based on the underlying field GF(289). These results are about five times faster than those for previous hardware implementations and at least 13 times better in terms of area‐time products. Further results suggest that neither case is superior to the other when considering the hardware complexity and performance. The characteristics of our proposed HECC coprocessor show that it is applicable to high‐speed network applications as well as resource‐constrained environments, such as PDAs, smart cards, and so on.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号