1.

Partial product addition in Vedic design-ripple carry adder design fir filter architecture for electro cardiogram (ECG) signal de-noising application

《Microprocessors and Microsystems》2020

Adders are a fundamental building block in circuit design, and their design largely determines overall system performance. Various adder designs are used in VLSI systems and signal processing applications. In signal processing, the Finite Impulse Response (FIR) filter has recently taken on a major role, and its design has been described in several ways in the literature. However, none of the existing works addresses signal de-noising with an efficient multiplier design. This work proposes an 8-bit multiplier that generates partial products using the Urdhva Tiryagbhyam sutra of Vedic mathematics. The 8-bit multiplier is built from four 4 × 4 Vedic multipliers, and the partial products are added using the carry-skip method. The logic levels of a ripple carry adder (RCA) are modified to add the outputs of the Vedic multipliers. The proposed fast FIR filter can effectively remove noise from electrocardiogram (ECG) signals, which is useful in healthcare and biomedical applications. The proposed FIR filter architecture, in which an RCA performs partial product addition in the Vedic design, is termed PPAVD-RCA-FIR and is applied to ECG signal de-noising. Performance is evaluated by computing the Signal to Noise Ratio (SNR), Bit Error Rate (BER) and Mean Square Error (MSE) of the de-noised signal. The results show that the proposed design is about 13.65% faster than a general Vedic multiplier.
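The decomposition described in this abstract (an 8-bit product assembled from four 4 × 4 Vedic products) can be sketched in software. This is only a behavioural illustration: the function names `urdhva_4x4` and `vedic_8x8` are hypothetical, and the plain integer additions stand in for the carry-skip and modified ripple carry adder stages, whose gate-level structure the abstract does not give.

```python
def urdhva_4x4(a: int, b: int) -> int:
    """4 x 4-bit multiply using vertical-and-crosswise column sums."""
    assert 0 <= a < 16 and 0 <= b < 16
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    result, carry = 0, 0
    for col in range(7):  # output columns 0..6
        s = carry
        for i in range(4):
            j = col - i
            if 0 <= j < 4:
                s += a_bits[i] & b_bits[j]  # crosswise bit product
        result |= (s & 1) << col
        carry = s >> 1
    return result | (carry << 7)


def vedic_8x8(a: int, b: int) -> int:
    """8 x 8-bit multiply built from four 4 x 4 blocks."""
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    ll = urdhva_4x4(a_lo, b_lo)
    lh = urdhva_4x4(a_lo, b_hi)
    hl = urdhva_4x4(a_hi, b_lo)
    hh = urdhva_4x4(a_hi, b_hi)
    # Assumption: ordinary '+' models the partial product adder stage.
    return ll + ((lh + hl) << 4) + (hh << 8)
```

In hardware the four 4 × 4 blocks operate in parallel, so the adder stage that combines them is the part of the design that most affects delay.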

2.

Bipartite modular multiplication with twice the bit-length of multipliers

Masayuki Yoshino Katsuyuki Okeya Camille Vuillaume 《International Journal of Information Security》2009,8(1):13-23

This paper presents a new technique to compute 2ℓ-bit bipartite multiplications with ℓ-bit bipartite multiplication units. Low-end devices such as smartcards are usually equipped with crypto-coprocessors for accelerating the heavy computation of modular multiplications; however, security standards such as NIST and EMV have called for extending the bit length of the RSA cryptosystem to resist mathematical attacks, which quickly makes such multipliers outdated. Double-size techniques have therefore been studied over the past decade to extend the life expectancy of these multipliers. This paper proposes new double-size techniques based on multipliers implementing either classical or Montgomery modular multiplications, or even both simultaneously (bipartite modular multiplication), in which case one can potentially compute modular multiplications twice as fast. Furthermore, to obtain a more realistic estimate than other works, this paper considers not only the cost of the multiplication but also the cost of the other arithmetic instructions. In our estimation, the proposal provides comparable results for the classical multiplier and the Montgomery multiplier, and is the only available method for the bipartite multiplier. A preliminary version of this paper was presented at the 12th Australasian Conference on Information Security and Privacy, ACISP’07.

3.

Low-power and high-speed shift-based multiplier for error tolerant applications

《Microprocessors and Microsystems》2017

We propose a new multiplier design that fulfills the need for low-power circuit blocks used in error-tolerant applications on energy-constrained devices. The design trades accuracy for higher speed, lower energy consumption, and lower transistor count. The average relative error of an N-bit multiplier is modeled as a function of N and saturates at a constant (around 17%) as the multiplier width increases. An 8-bit implementation simulated in HSPICE achieved almost 90% energy savings for a random sample of operands as compared to a conventional parallel multiplier. The design is flexible whereby simple variations to the circuit structure lead to a perfectly accurate multiplier. Tests performed on multimedia applications such as JPEG compression showed a promising outcome.

4.

A novel architecture design for VLSI implementation of integer DCT in HEVC standard

Loukil Hassen Masmoudi Nouri 《Multimedia Tools and Applications》2020,79(33-34):23977-23993

This paper presents a novel unified hardware architecture that computes the 4 × 4, 8 × 8, 16 × 16 and 32 × 32 two-dimensional (2-D) integer DCT of the HEVC standard using a single 1-D DCT block, with reduced complexity and hardware resources. Because large HEVC transforms require a huge number of computations, especially multiplications, the paper proposes a modified algorithm that reduces the computational complexity. The goal is to maximize circuit reuse during the computation while preserving the quality of the encoded video. The hardware architecture is described in VHDL and synthesized on an Altera FPGA. Its throughput reaches up to 52 million pixels per second at a clock frequency of 90 MHz. An IP core is presented that uses an embedded video system on a programmable chip (SoPC) to implement and validate the proposed design. Finally, the proposed architecture offers significant advantages in hardware cost and performance compared with related work in the literature. It can be used in real-time ultra-high definition (UHD) TV coding applications.

5.

Novel optimized tree-based stack-type architecture for 2n-bit comparator at nanoscale with energy dissipation analysis

Gudivada A. Arunkumar Sudha Gnanou Florence 《The Journal of supercomputing》2021,77(5):4659-4680

The comparator is an essential building block in many digital circuits, used for example in biometric authentication, data sorting, and exponent comparison in floating-point architectures. Quantum-dot Cellular Automata (QCA) is a recent nanotechnology that overcomes the drawbacks of Complementary Metal Oxide Semiconductor (CMOS) technology. In this paper, a novel area-optimized 2n-bit comparator architecture is proposed. To achieve this objective, 1-bit stack-type and 4-bit tree-based stack-type (TB-ST) comparators are proposed using QCA. Then, two tree-based 4-bit comparator architectures are arranged in two layers to optimize the number of quantum cells and the area of an 8-bit comparator. This design can thus be extended to any 2n-bit comparator. Simulation results of 4-bit and 8-bit comparators using QCADesigner 2.0.3 show that there is a significant improvement in the number of quantum cells and area occupancy. The proposed TB-ST 8-bit comparator uses 2.5 clock cycles and 622 quantum cells with area occupancy of 0.49 µm², which is an improvement of 10.5% and 38%, respectively, compared to existing designs. Scaled to a 32-bit comparator, the proposed architecture requires only 2675 quantum cells in an area of 2.05 µm² with a delay of 3.5 clock cycles, indicating 9.35% and 28.8% improvements, respectively, demonstrating the merit of the proposed architecture. In addition, energy dissipation analysis of the proposed TB-ST 8-bit comparator is simulated with the QCADesigner-E tool, indicating an average energy dissipation reduction of 17.3% compared to existing works.
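The general idea of building a wide comparator from a tree of smaller ones can be sketched at the logic level. This is a generic tree comparator, not the QCA stack-type cell layout of the paper; the names `compare_1bit`, `combine` and `compare_tree` are hypothetical, and the width is assumed to be a power of two.

```python
def compare_1bit(a: int, b: int):
    """Return (greater, equal) flags for two single bits."""
    return (a & (1 - b), 1 - (a ^ b))


def combine(hi, lo):
    """Merge results of a more significant (hi) and less significant (lo) half."""
    greater = hi[0] | (hi[1] & lo[0])  # hi decides, unless hi halves are equal
    equal = hi[1] & lo[1]
    return (greater, equal)


def compare_tree(a: int, b: int, width: int):
    """Compare two width-bit values; returns (a > b, a == b, a < b) as 0/1."""
    assert width > 0 and width & (width - 1) == 0, "width must be a power of two"
    # Leaf level, most significant bit first.
    level = [compare_1bit((a >> i) & 1, (b >> i) & 1)
             for i in range(width - 1, -1, -1)]
    while len(level) > 1:  # one tree layer per pass
        level = [combine(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    greater, equal = level[0]
    return greater, equal, 1 - (greater | equal)
```

Each pass of the loop corresponds to one layer of the tree, so the depth grows with the logarithm of the operand width rather than linearly.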

6.

Current-mode circuit-level technique to design variation-aware nanoscale summing circuit for ultra-low power applications

Guduri Manisha Mehra Rishab Srivastava Pragya Islam Aminul 《Microsystem Technologies》2017,23(9):4045-4056

Prodigious demand for fast, ultra-low-power electronic devices has motivated the search for a circuit style that promises reduced propagation delay (t_p) as well as low power dissipation (PWR). MOS current mode logic (MCML) has emerged as a promising logic style that offers high speed of operation at the expense of acceptable power dissipation. This paper proposes an MCML full adder which employs a load controller circuit. It compares the MCML full adder with a hybrid-CMOS full adder in terms of various design metrics in the superthreshold as well as subthreshold regions. The MCML topology with load controller offers a high speed of operation and low power dissipation in the superthreshold region. The same circuit arrangement, when operated in the subthreshold region, also delivers higher operating speed with ultralow power dissipation compared to its hybrid-CMOS counterpart. Power dissipation analysis shows the MCML-based full adder to be more robust than its hybrid-CMOS counterpart. In particular, the MCML full adder design achieves 3.77× (2.38×) improvement in propagation delay, 10.43× (3.45×) improvement in average power dissipation, 39.43× (8.21×) lower power-delay product (PDP) and 149.07× (19.55×) improvement in energy-delay product (EDP) in the superthreshold (subthreshold) regions of operation at the 16-nm technology node. The above results are also validated using TSMC’s industry-standard 0.18-μm technology model parameters, and a similar trend is observed in the design metrics of the MCML and hybrid-CMOS full adder circuits. In addition, the noise performance of the above-mentioned circuits is analyzed. It is observed that the noise induced by the hybrid-CMOS full adder is about 14× that of the MCML full adder.
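The reported improvement factors can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard definitions PDP = power × delay and EDP = PDP × delay; the helper names are hypothetical. Because the paper reports averaged quantities, the product of the individual ratios is expected to match the reported PDP and EDP figures only approximately.

```python
def pdp_gain(delay_gain: float, power_gain: float) -> float:
    """Improvement in power-delay product implied by delay and power gains."""
    return delay_gain * power_gain


def edp_gain(delay_gain: float, power_gain: float) -> float:
    """Improvement in energy-delay product (PDP x delay)."""
    return pdp_gain(delay_gain, power_gain) * delay_gain


# (delay gain, power gain, reported PDP gain, reported EDP gain) from the abstract
reported = {
    "superthreshold": (3.77, 10.43, 39.43, 149.07),
    "subthreshold": (2.38, 3.45, 8.21, 19.55),
}

for region, (d, p, pdp_ref, edp_ref) in reported.items():
    print(f"{region}: PDP {pdp_gain(d, p):.2f} (reported {pdp_ref}), "
          f"EDP {edp_gain(d, p):.2f} (reported {edp_ref})")
```

Under these assumptions the computed values agree with the reported ones to within about one percent in both operating regions.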

7.

A new systolic multiprocessor architecture for real-time soft tomography algorithms

《Parallel Computing》2016

This paper presents a new systolic multiprocessor architecture for soft tomography algorithms that exploits the intrinsic parallelism and hardware resources available in recent Field Programmable Gate Array (FPGA) architectures. Although soft tomography algorithms such as Electrical Capacitance Tomography (ECT), Magnetic Inductance Tomography (MIT), and Electrical Impedance Tomography (EIT) use different sensors and data acquisition modules, they share common computation requirements: intensive matrix multiplications and fast, frequent memory accesses. A dedicated scalable architecture is suggested to host the Landweber algorithm. It uses the variable bit-width, fixed-point multiplier arrays available in the DSP blocks, which cooperatively perform the partial matrix products with associated Arithmetic and Logic Units (ALUs), together with the distributed memory available in the Stratix V FPGA. The experimental results indicate that 16,949 frames of 32 × 32 pixels can be reconstructed in one second if each matrix element is represented with 18 bits and a clock frequency of 400 MHz is used. This is more than enough for most process imaging applications. In addition, the accuracy of the image reconstruction using 18 bits/operand is found to be acceptable since it exceeds 86%. Accuracy of up to 99% can be achieved if 36 bits/operand are used, which leads to an image reconstruction throughput of 1272 frames/s (for an image size of 32 × 32).
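The Landweber algorithm mentioned above is dominated by matrix-vector products, which is why it maps well onto multiplier arrays. A minimal floating-point sketch of the textbook iteration x ← x + α·Aᵀ(b − A·x) follows; the function names, the step size `alpha` and the zero starting image are assumptions for illustration, and the paper's fixed-point, systolic formulation is not reproduced here.

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def landweber(a, b, alpha, iterations):
    """Iteratively solve a @ x ~= b; alpha must be below 2 / (largest singular value of a)**2."""
    a_t = transpose(a)
    x = [0.0] * len(a[0])  # assumed starting image: all zeros
    for _ in range(iterations):
        residual = [bi - yi for bi, yi in zip(b, mat_vec(a, x))]
        correction = mat_vec(a_t, residual)  # back-projection of the residual
        x = [xi + alpha * ci for xi, ci in zip(x, correction)]
    return x
```

Each iteration costs two matrix-vector products, so frame rate depends directly on how many multiply-accumulate operations the hardware can perform per clock cycle.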

8.

Design of approximate adders and multipliers for error tolerant image processing

《Microprocessors and Microsystems》2020

An adder is the basic computational circuit in digital Very Large Scale Integration (VLSI) design. To improve the design metrics of an adder, Approximate Adders (AAs) have been proposed. These adders have been applied to and analyzed on 8 × 8 Dadda multipliers (DMs). The proposed AAs and Approximate Dadda Multipliers (ADMs) are synthesized in the Cadence Register-Transfer Level (RTL) compiler, and their design metrics are compared across three different technology nodes. Quantitative error characteristics of the AAs and ADMs, such as Error Distance (ED), Error Rate (ER), Pass Rate (PR), Mean Error Distance (MED) and Normalised Error Distance (NED), are computed. Image blending and sharpening have been applied using the AAs and approximate multipliers, respectively, to analyse image quality metrics under the proposed approximate framework.
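The error metrics listed in this abstract can be computed exhaustively for small operand widths. The sketch below uses a simple lower-part-OR adder as a stand-in, since the abstract does not describe the proposed adders; `approx_add` and `error_metrics` are hypothetical names. Two definitions are assumptions as well: PR is taken as 1 − ER, and NED as MED divided by the largest observed ED (other normalizations appear in the literature).

```python
def approx_add(a: int, b: int, k: int = 3) -> int:
    """Stand-in approximate adder: OR the k low bits, add the high bits exactly."""
    mask = (1 << k) - 1
    low = (a | b) & mask                    # carry-free approximation
    high = ((a >> k) + (b >> k)) << k       # exact upper part, no carry-in
    return high | low


def error_metrics(width: int = 8, k: int = 3) -> dict:
    """Exhaustively evaluate the stand-in adder over all width-bit operand pairs."""
    n = 1 << width
    eds = [abs(approx_add(a, b, k) - (a + b)) for a in range(n) for b in range(n)]
    total = len(eds)
    wrong = sum(1 for ed in eds if ed != 0)
    med = sum(eds) / total
    max_ed = max(eds)
    return {
        "ER": wrong / total,                        # fraction of erroneous results
        "PR": 1 - wrong / total,                    # assumed: pass rate = 1 - ER
        "MED": med,                                 # mean error distance
        "NED": med / max_ed if max_ed else 0.0,     # assumed normalization
        "max_ED": max_ed,
    }
```

Calling `error_metrics(8, 3)` sweeps all 65,536 operand pairs; for wider adders a random sample of operands is the usual substitute for an exhaustive sweep.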

9.

TexNN: Fast Texture Encoding Using Neural Networks

S. Pratapa T. Olson A. Chalfin D. Manocha 《Computer Graphics Forum》2019,38(1):328-339

We present a novel deep learning‐based method for fast encoding of textures into current texture compression formats. Our approach uses state‐of‐the‐art neural network methods to compute the appropriate encoding configurations for fast compression. A key bottleneck in the current encoding algorithms is the search step, and we reduce that computation to a classification problem. We use a trained neural network approximation to quickly compute the encoding configuration for a given texture. We have evaluated our approach for compressing the textures for the widely used adaptive scalable texture compression format and evaluate the performance for different block sizes corresponding to 4 × 4, 6 × 6 and 8 × 8. Overall, our method (TexNN) speeds up the encoding computation up to an order of magnitude compared to prior compression algorithms with very little or no loss in the visual quality.

10.

A hardware architecture for real-time image compression using a searchless fractal image coding method

David Jeff Jackson Haichen Ren Xianwei Wu Kenneth G. Ricks 《Journal of Real-Time Image Processing》2007,1(3):225-237

In this paper we present a novel hardware architecture for real-time image compression implementing a fast, searchless iterated function system (SIFS) fractal coding method. In the proposed method and corresponding hardware architecture, domain blocks are fixed to a spatially neighboring area of range blocks in a manner similar to that given by Furao and Hasegawa. A quadtree structure, covering from 32 × 32 blocks down to 2 × 2 blocks, and even to single pixels, is used for partitioning. Coding of 2 × 2 blocks and single pixels is unique among current fractal coders. The hardware architecture contains units for domain construction, zig-zag transforms, range and domain mean computation, and a parallel domain-range match capable of concurrently generating a fractal code for all quadtree levels. With this efficient, parallel hardware architecture, the fractal encoding speed is improved dramatically. Additionally, attained compression performance remains comparable to traditional search-based and other searchless methods. Experimental results, with the proposed hardware architecture implemented on an Altera APEX20K FPGA, show that the fractal encoder can encode a 512 × 512 × 8 image in approximately 8.36 ms operating at 32.05 MHz. Therefore, this architecture is seen as a feasible solution to real-time fractal image compression.

11.

Multi-gate device and summing-circuit co-design robustness studies @ 32-nm technology node

Kumar Amresh Islam Aminul 《Microsystem Technologies》2017,23(9):4099-4109

This paper presents a FinFET-based static 1-bit full adder cell that helps to recover the huge penalty in performance, while staying quite close to the minimum energy point. The proposed design offers higher computing speed (by 7.96×) and lower energy (by 5.86×), lower energy-delay product (EDP) (by 21.08×) at the expense of higher power dissipation (by 1.36×) compared to its MOSFET counterpart. It proves its robustness against process variations by featuring tighter spread in power distribution (by 3.20×), in delay distribution (by 4.70×), in PDP (power-delay product) distribution (by 3.35×) and in EDP distribution (by 3.14×) compared to its MOSFET counterpart. The proposed design achieves these improvements due to employment of new FinFET technology in the full adder design. Multi-gate devices in this technology are less affected by random dopant fluctuation (RDF) and short-channel effects such as threshold voltage rolloff, drain-induced barrier lowering (DIBL), etc. To establish that our design is better this paper analyzes five more 1-bit full adder cells and compares them with the proposed design in terms of power, delay and PDP. We perform simulation using 32-nm Predictive Technology Model (PTM) parameters on SPICE.

12.

Shear effect on dynamic behavior of microcantilever beam with manufacturing process defects

Bourouina Hicham Yahiaoui Réda Yusifli Elmar Benamar Mohammed El Amine Ghoumid Kamal Herlem Guillaume 《Microsystem Technologies》2017,23(7):2537-2542

This paper investigates the shear effect on the dynamic behavior of a thin microcantilever beam with manufacturing process defects. Unlike the Rayleigh beam model (RBM), the Timoshenko beam model (TBM) takes into consideration the shear effect on the resonance frequency. This effect becomes significant for thin microcantilever beams with larger slenderness ratios that are normally encountered in MEMS devices such as sensors. The TBM is presented and analyzed by numerical simulation using the Finite Element Method (FEM) to determine corrective factors for the effect of manufacturing process defects, such as underetching at the clamped end of the microbeam and a nonrectangular cross section. A semi-analytical approach is proposed for extracting the Young’s modulus from 3D FEM simulation with COMSOL Multiphysics software. This model was tested on measurements of a thin chromium microcantilever beam of dimensions 80 × 2 × 0.95 μm³. The final results indicate that correcting for the effect of manufacturing process defects is significant: the corrected value of Young’s modulus, about 280.81 GPa, is very close to the experimental results.

13.

On high-performance parallel decimal fixed-point multiplier designs

《Computers & Electrical Engineering》2014,40(7):2126-2138

High-performance, area and power efficient hardware implementation of decimal multiplication is preferred to slow software simulations in various key scientific and financial applications, where errors caused by converting decimal numbers into their approximate binary representations are unacceptable. This paper presents a parallel architecture for fixed-point 8421-BCD-based decimal multiplication. In essence, it applies a hybrid 8421–5421 recoding scheme to generate partial products, and accumulates them with 8421 carry-lookahead adders organized as a tree structure. In addition, we propose a 4221-BCD-based decimal multiplier that is built upon a novel 4221-BCD full adder; operands of this 4221 multiplier are directly represented in the 4221 BCD. The proposed 16 × 16 decimal multipliers are compared with other best-known decimal multiplier designs with a TSMC 90-nm technology, and the evaluation results show that the proposed 8421–5421 multiplier achieves the lowest delay and area, as well as the highest power efficiency, among all the existing hardware-based BCD multipliers.
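Partial product accumulation in 8421 BCD rests on decimal digit addition, which differs from binary addition by a correction step. The sketch below shows the classic add-6 correction for one digit position at a time; it is a ripple-style illustration with a hypothetical function name, not the carry-lookahead tree or the 8421–5421 recoding used in the paper.

```python
def bcd_add(a_digits, b_digits):
    """Add two decimal numbers given as little-endian lists of 8421 BCD digits (0..9)."""
    length = max(len(a_digits), len(b_digits))
    carry = 0
    out = []
    for i in range(length):
        da = a_digits[i] if i < len(a_digits) else 0
        db = b_digits[i] if i < len(b_digits) else 0
        s = da + db + carry          # binary sum of the two digits, 0..19
        if s > 9:
            s = (s + 6) & 0xF        # add 6 to skip the six unused 4-bit codes
            carry = 1
        else:
            carry = 0
        out.append(s)
    if carry:
        out.append(1)
    return out
```

For example, adding 995 and 7 is written as `bcd_add([5, 9, 9], [7])`, with the least significant digit first in both the inputs and the result.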

14.

FPGA implementation of a run-time configurable NTT-based polynomial multiplication hardware

《Microprocessors and Microsystems》2020

Multiplication of polynomials of large degrees is the predominant operation in lattice-based cryptosystems in terms of execution time. This motivates the study of its fast and efficient implementations in hardware. Also, applications such as those using homomorphic encryption need to operate with polynomials of different parameter sets. This calls for the design of configurable hardware architectures that can support multiplication of polynomials of various degrees and coefficient sizes. In this work, we present the design and an FPGA implementation of a run-time configurable and highly parallelized NTT-based polynomial multiplication architecture, which proves to be effective as an accelerator for lattice-based cryptosystems. The proposed polynomial multiplier can also be used to perform Number Theoretic Transform (NTT) and Inverse NTT (INTT) operations. It supports 6 different parameter sets, which are used in lattice-based homomorphic encryption and/or post-quantum cryptosystems. We also present a hardware/software co-design framework, which provides high-speed communication between the CPU and the FPGA connected by a PCIe standard interface provided by the RIFFA driver [1]. For proof of concept, the proposed polynomial multiplier is deployed in this framework to accelerate the decryption operation of the Brakerski/Fan-Vercauteren (BFV) homomorphic encryption scheme implemented in the Simple Encrypted Arithmetic Library (SEAL) by the Cryptography Research Group at Microsoft Research [2]. In the proposed framework, the polynomial multiplication operation in the decryption of the BFV scheme is offloaded to the accelerator in the FPGA via the PCIe bus, while the rest of the operations in the decryption are executed in software running on an off-the-shelf desktop computer. The hardware part of the proposed framework targets a Xilinx Virtex-7 FPGA device, and the proposed framework achieves a speedup of almost 7× in latency for the offloaded operations compared to their pure software implementations, excluding I/O overhead.
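The principle behind NTT-based polynomial multiplication can be shown with toy parameters: transform both operands, multiply pointwise, and transform back. The sketch below assumes modulus 17, length 8 and root 2 (a primitive 8th root of unity modulo 17), chosen only so the arithmetic is easy to follow; none of these values come from the paper. It uses a quadratic-time transform for clarity and computes the product modulo xⁿ − 1, whereas lattice schemes typically work modulo xⁿ + 1 with much larger parameters.

```python
Q = 17       # toy prime modulus (assumption, not one of the paper's parameter sets)
N = 8        # transform length
OMEGA = 2    # primitive N-th root of unity mod Q: 2**8 % 17 == 1, 2**4 % 17 == 16


def ntt(values, root):
    """Naive O(N^2) number theoretic transform of a length-N coefficient list."""
    return [sum(v * pow(root, i * j, Q) for j, v in enumerate(values)) % Q
            for i in range(N)]


def intt(values):
    """Inverse transform: use the inverse root, then scale by N^-1 mod Q."""
    inv_n = pow(N, Q - 2, Q)          # Fermat inverse, valid because Q is prime
    inv_root = pow(OMEGA, Q - 2, Q)
    return [(x * inv_n) % Q for x in ntt(values, inv_root)]


def poly_mul_cyclic(a, b):
    """Multiply two length-N polynomials modulo (x^N - 1, Q)."""
    fa, fb = ntt(a, OMEGA), ntt(b, OMEGA)
    return intt([(x * y) % Q for x, y in zip(fa, fb)])
```

A hardware design would replace the quadratic-time transform with butterfly stages, which is where the parallelism of an FPGA implementation comes from.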

15.

Fast modular multiplication based on complement representation and canonical recoding

《International Journal of Computer Mathematics》2012,89(13):2871-2879

Modular multiplication is the fundamental operation in implementing circuits for cryptosystems, as the process of encrypting and decrypting a message requires modular exponentiation, which can be decomposed into multiplications. This paper proposes a multiplication method that combines the complement representation method with the canonical recoding technique. Applying both techniques further reduces the number of partial products. Based on these techniques, an algorithm with an efficient multiplication method is proposed. For the multiplication operation, in the average case, the proposed algorithm can reduce the number of k-bit additions from 1/4k+(log (k)/k)+5/2 to 1/6k+(log (k)/k)+5/2, where k is the bit length of the multiplicand and multiplier. Moreover, if the proposed technique is applied to common-multiplicand multiplication, the number of k-bit additions is reduced from 1/2k+2×(log (k)/k)+5 to 1/3k+2×(log (k)/k)+5. The overall computing performance of the multiplication operation can therefore be sped up efficiently.
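Canonical recoding rewrites a multiplier using digits −1, 0 and +1 so that no two adjacent digits are nonzero, which reduces the number of partial products to add or subtract. The sketch below covers only that recoding step with hypothetical function names; the complement representation step of the paper and the modular reduction are not modelled.

```python
def canonical_recode(n: int):
    """Return the canonical signed-digit form of n >= 0, least significant digit first."""
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n % 4)   # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d            # makes n divisible by 4, forcing the next digit to 0
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def multiply_recoded(multiplicand: int, multiplier: int) -> int:
    """Shift-and-add multiplication driven by the recoded multiplier digits."""
    total = 0
    for shift, d in enumerate(canonical_recode(multiplier)):
        if d:  # one addition or subtraction per nonzero digit
            total += d * (multiplicand << shift)
    return total
```

For instance, 7 is recoded as 8 − 1, so multiplying by 7 needs two partial products instead of three.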

16.

A directional and scalable streaming deblocking filter hardware architecture for HEVC decoder

《Microprocessors and Microsystems》2021

In this work, a directional streaming hardware architecture for the Deblocking Filter (DBF) of the High-Efficiency Video Coding (HEVC) decoder is presented. The architecture uses adaptive parallel and pipeline processing strategies for low-power, high-performance applications, including broadcasting and virtual reality. To remove the dependency on neighboring blocks, a restructured block size is used. Because the developed architecture is scalable, 68 × 68 Coding Unit (CU) block processing supports splitting into 36 × 36, 20 × 20 and 12 × 12 blocks. The proposed architecture uses a 3-stage pipeline to complete the 68 × 68 CU block processing of the DBF. In stage 1, the number of micro-pipeline stages varies as 8, 4, 2 and 1 for CU sizes 68 × 68, 36 × 36, 20 × 20 and 12 × 12, respectively. In stage 2 of the main pipeline, the blocks are further processed by a single-cycle parallel edge filter. Each 8 × 8 block is processed for DBF Horizontal Filtering (HF) and Vertical Filtering (VF) simultaneously. During the stage-3 write-back operations, 4 × 8 blocks are stored into memory to reconstruct the frame. The design has been implemented on both a Virtex-6 Field Programmable Gate Array (FPGA) and an Application Specific Integrated Circuit (ASIC) using 180 nm technology. The results show that 68 × 68, 36 × 36 and 20 × 20 CU blocks have higher processing speed with reduced resources of 254K, 31K and 14.7K as compared with the previous works. The proposed architecture supports low power and high processing speed applications because of variable throughput.

17.

CudaFilters: A SignalPlant library for GPU‐accelerated FFT and FIR filtering

Petr Nejedly Filip Plesinger Josef Halamek Pavel Jurak 《Software》2018,48(1):3-9

Signal filtering is one of the essential tasks in signal processing. It may become an extremely time‐consuming process, as in the case of intracranial electroencephalogram recordings (e.g., 30‐min records) with a large number of channels (up to 256) and high sampling frequencies (up to 5 kHz in research related to ultra‐high‐frequency oscillations). The usual way of dealing with time consumption is process parallelization. Moreover, parallelization using a graphics processing unit (GPU) allows further shortening of computing times thanks to the large number of GPU cores. This paper describes “CudaFilters,” a library for GPU‐accelerated finite impulse response (FIR) and fast Fourier transform (FFT) filtering. This library is designed for SignalPlant software, a free tool for signal analysis. The resultant acceleration in computing times was 5× to 40× depending on the task, data, and hardware configuration. The results were also compared to computing speeds in Matlab.
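FIR filtering parallelizes well because every output sample is an independent weighted sum of recent input samples. The sketch below is a plain CPU reference of that computation with a hypothetical function name; it does not use or reproduce the CudaFilters API, which the abstract does not describe.

```python
def fir_filter(signal, taps):
    """Direct-form FIR filter: y[n] = sum over k of taps[k] * signal[n - k]."""
    output = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:              # samples before the start are treated as zero
                acc += h * signal[n - k]
        output.append(acc)              # each y[n] depends only on the inputs
    return output
```

On a GPU the outer loop disappears: one thread can compute one output sample, which is why long multichannel recordings benefit from a large number of cores.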

18.

An iterative logarithmic multiplier

Z. Babić A. Avramović P. Bulić 《Microprocessors and Microsystems》2011,35(1):23-33

Digital signal processing algorithms often rely heavily on a large number of multiplications, which is both time and power consuming. However, there are many practical solutions to simplify multiplication, like truncated and logarithmic multipliers. These methods consume less time and power but introduce errors. Nevertheless, they can be used in situations where a shorter time delay is more important than accuracy. In digital signal processing, these conditions are often met, especially in video compression and tracking, where integer arithmetic gives satisfactory results. This paper presents a simple and efficient multiplier with the possibility to achieve an arbitrary accuracy through an iterative procedure, prior to achieving the exact result. The multiplier is based on the same form of number representation as Mitchell’s algorithm, but it uses different error correction circuits than those proposed by Mitchell. In such a way, the error correction can be done almost in parallel (actually this is achieved through pipelining) with the basic multiplication. The hardware solution involves adders and shifters, so it is not gate and power consuming. The error summary for operands ranging from 8 bits to 16 bits indicates a very low relative error percentage with two iterations only. For the hardware implementation assessment, the proposed multiplier is implemented on the Spartan 3 FPGA chip. For 16-bit operands, the time delay estimation indicates that a multiplier with two iterations can work at a clock frequency of more than 150 MHz, with the maximum relative error being less than 2%.
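One way to realize the idea in this abstract is to write each operand as a power of two plus a residue, N = 2^k + r. The product then splits into three shift-and-add terms plus the product of the two residues, and that last term is the error, which can be approximated again in the same way. The sketch below follows this reading; the function names and the convention that `corrections = 0` means the basic approximation alone are assumptions, not the authors' circuit.

```python
def leading_one(n: int) -> int:
    """Position of the most significant set bit of n > 0."""
    return n.bit_length() - 1


def iterative_log_multiply(n1: int, n2: int, corrections: int) -> int:
    """Approximate n1 * n2 using only shifts and adds, with optional error corrections."""
    product = 0
    for _ in range(corrections + 1):
        if n1 == 0 or n2 == 0:
            break                                   # remaining error is zero: result is exact
        k1, k2 = leading_one(n1), leading_one(n2)
        r1, r2 = n1 - (1 << k1), n2 - (1 << k2)     # residues below the leading one
        # n1 * n2 = 2^(k1+k2) + r1 * 2^k2 + r2 * 2^k1 + r1 * r2; the last term is dropped.
        product += (1 << (k1 + k2)) + (r1 << k2) + (r2 << k1)
        n1, n2 = r1, r2                             # the dropped term becomes the next product
    return product
```

Each correction removes the leading one from both residues, so the approximation never exceeds the true product and becomes exact after at most as many steps as the operand width.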

19.

A Parallel Algorithm for 4×4 DCT

《Journal of Parallel and Distributed Computing》1999,57(2):257-269

By developing a generalized 1D approach and parallel computing algorithm, this paper presents a parallel algorithm design and hardware implementation for the computation of 4×4 DCT. This algorithm sorts all the 2D input pixel data into four groups. Each group is then forwarded to a 1D DCT arithmetic unit to complete the computation. After a few simple additions which are designed to follow the output of 1D DCTs, the computation of 2D DCT is implemented in parallel. Therefore, the efficiency of the algorithm is entirely dependent on the 1D DCT algorithm adopted, and all the existing fast algorithms for 1D DCT can be directly applied to further optimise the algorithm design. The development can also be further extended to compute general 2D DCT by a recursive procedure where the 4×4 DCT algorithm is used as the basic core.

20.

Scalable Unified Transform Architecture for Advanced Video Coding Embedded Systems

Tiago Dias Sebastián López Nuno Roma Leonel Sousa 《International journal of parallel programming》2013,41(2):236-260

A novel high throughput and scalable unified architecture for the computation of the transform operations in video codecs for advanced standards is presented in this paper. This structure can be used as a hardware accelerator in modern embedded systems to efficiently compute all the two-dimensional 4 × 4 and 2 × 2 transforms of the H.264/AVC standard. Moreover, its highly flexible design and hardware efficiency allows it to be easily scaled in terms of performance and hardware cost to meet the specific requirements of any given video coding application. Experimental results obtained using a Xilinx Virtex-5 FPGA demonstrated the superior performance and hardware efficiency levels provided by the proposed structure, which presents a throughput per unit of area relatively higher than other similar recently published designs targeting the H.264/AVC standard. Such results also showed that, when integrated in a multi-core embedded system, this architecture provides speedup factors of about 120× concerning pure software implementations of the transform algorithms, therefore allowing the computation, in real-time, of all the above mentioned transforms for Ultra High Definition Video (UHDV) sequences (4,320 × 7,680 @ 30 fps).
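The 4 × 4 and 2 × 2 transforms of H.264/AVC share one structure, Y = C · X · Cᵀ with a small integer kernel C, which is what makes a unified datapath attractive. The sketch below shows this for the 4 × 4 forward core transform and the 2 × 2 Hadamard transform. The helper names are hypothetical, and the scaling and quantization steps that follow the core transform in the standard are omitted.

```python
# Integer kernel of the H.264/AVC 4 x 4 forward core transform.
CF4 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]

# 2 x 2 Hadamard kernel (used for chroma DC coefficients).
H2 = [
    [1, 1],
    [1, -1],
]


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def transform_2d(block, kernel):
    """Separable 2-D transform: kernel @ block @ kernel^T."""
    return mat_mul(mat_mul(kernel, block), transpose(kernel))
```

Because the kernel entries are only ±1 and ±2, the whole transform needs additions, subtractions and one-bit shifts, with no multipliers.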