共查询到20条相似文献,搜索用时 44 毫秒
1.
High-speed VLSI architectures for the AES algorithm 总被引:1,自引:0,他引:1
Xinmiao Zhang Parhi K.K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2004,12(9):957-967
This paper presents novel high-speed architectures for the hardware implementation of the Advanced Encryption Standard (AES) algorithm. Unlike previous works which rely on look-up tables to implement the SubBytes and InvSubBytes transformations of the AES algorithm, the proposed design employs combinational logic only. As a direct consequence, the unbreakable delay incurred by look-up tables in the conventional approaches is eliminated, and the advantage of subpipelining can be further explored. Furthermore, composite field arithmetic is employed to reduce the area requirements, and different implementations for the inversion in subfield GF(2/sup 4/) are compared. In addition, an efficient key expansion architecture suitable for the subpipelined round units is also presented. Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000 e-8 bg560 device in non-feedback modes, which is faster and is 79% more efficient in terms of equivalent throughput/slice than the fastest previous FPGA implementation known to date. 相似文献
2.
Locally connected VLSI architectures for the Viterbi algorithm 总被引:1,自引:0,他引:1
The Viterbi algorithm is a well-established technique for channel and source decoding in high-performance digital communication systems. Implementations of the Viterbi algorithm on three types of locally connected processor arrays are described. The restriction is motivated by the fact that both the cost and performance metrics of VLSI favour architectures in which on-chip interprocessor communication is localized. Each of the structures presented can accommodate arbitrary alphabet sizes and algorithm memory lengths 相似文献
3.
4.
The real time implementation of an efficient signal compression technique, Vector Quantization (VQ), is of great importance
to many digital signal coding applications. In this paper, we describe a new family of bit level systolic VLSI architectures
which offer an attractive solution to this problem. These architectures are based on a bit serial, word parallel approach
and high performance and efficiency can be achieved for VQ applications of a wide range of bandwidths. Compared with their
bit parallel counterparts, these bit serial circuits provide better alternatives for VQ implementations in terms of performance
and cost. 相似文献
5.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1999,7(1):135-138
This paper presents novel very large scale integration (VLSI) architectures in support of an efficient implementation of Leighton's well-known Columnsort. The designs take advantage of reconfigurable bus architectures enhanced with simple shift switches. Our first main contribution is to show that Columnsort can be partitioned into two components: a hardware scheme involving the task of sorting arrays of small size and a hardware or software scheme that involves simple data movement tasks. Our second main contribution is to demonstrate that the dynamically reconfigurable mesh architecture can be exploited to obtain a small and efficient hardware sorter. The resulting architectures feature high regularity of circuitry, simplicity of control structure, and adaptability. Both theoretical analyses and simulation tests have shown that the proposed VLSI architectures for sorting are superior to existing designs in the context of sorting small and moderate size arrays 相似文献
6.
Mansour M.M. Shanbhag N.R. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(4):627-650
Very large scale integration (VLSI) design methodology and implementation complexities of high-speed, low-power soft-input soft-output (SISO) a posteriori probability (APP) decoders are considered. These decoders are used in iterative algorithms based on turbo codes and related concatenated codes and have shown significant advantage in error correction capability compared to conventional maximum likelihood decoders. This advantage, however, comes at the expense of increased computational complexity, decoding delay, and substantial memory overhead, all of which hinge primarily on the well-known recursion bottleneck of the SISO-APP algorithm. This paper provides a rigorous analysis of the requirements for computational hardware and memory at the architectural level based on a tile-graph approach that models the resource-time scheduling of the recursions of the algorithm. The problem of constructing the decoder architecture and optimizing it for high speed and low power is formulated in terms of the individual recursion patterns which together form a tile graph according to a tiling scheme. Using the tile-graph approach, optimized architectures are derived for the various forms of the sliding-window and parallel-window algorithms known in the literature. A proposed tiling scheme of the recursion patterns, called hybrid tiling, is shown to be particularly effective in reducing memory overhead of high-speed SISO-APP architectures. Simulations demonstrate that the proposed approach achieves savings in area and power in the range of 4.2%-53.1% over state of the art. 相似文献
7.
8.
Masera G. Piccinini G. Roch M.R. Zamboni M. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1999,7(3):369-379
A great interest has been gained in recent years by a new error-correcting code technique, known as “turbo coding”, which has been proven to offer performance closer to the Shannon's limit than traditional concatenated codes. In this paper, several very large scale integration (VLSI) architectures suitable for turbo decoder implementation are proposed and compared in terms of complexity and performance; the impact on the VLSI complexity of system parameters like the state number, number of iterations, and code rate are evaluated for the different solutions. The results of this architectural study have then been exploited for the design of a specific decoder, implementing a serial concatenation scheme with 2/3 and 3/4 codes; the designed circuit occupies 35 mm2, supports a 2 Mb/s data rate, and for a bit error probability of 10-6, yields a coding gain larger than 7 dB, with ten iterations 相似文献
9.
Parhi K.K. Nishitani T. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1993,1(2):191-202
A folded architecture and a digit-serial architecture are proposed for implementation of one- and two-dimensional discrete wavelet transforms. In the one-dimensional folded architecture, the computations of all wavelet levels are folded to the same low-pass and high-pass filters. The number of registers in the folded architecture is minimized by the use of a generalized life time analysis. The converter units are synthesized with a minimum number of registers using forward-backward allocation. The advantage of the folded architecture is low latency and its drawbacks are increased hardware area, less than 100% hardware utilization, and the complex routing and interconnection required by the converters used. These drawbacks are eliminated in the alternate digit-serial architecture at the expense of an increase in the system latency and some constraints on the wordlength. In latency-critical applications, the use of the folded architecture is suggested. If latency is not so critical, the digit-serial architecture should be used. The use of a combined folded and digit-serial architecture is proposed for implementation of two-dimensional discrete wavelet transforms 相似文献
10.
This paper presents the VLSI architectures for three-level correlator design based on 1-m CMOS technology. The architecture performs very high speed, real-time, three-level cross-correlation of signals. Two architectures, one for serial incoming samples of signals (serial data) and the other for stored signal samples (parallel data), are described in the paper. 相似文献
11.
The Haar transform is very useful in many signal and image processing applications where real-time implementation is essential. Three VLSI computing architectures are proposed for fast implementation of the Haar transform. Comparisons on the advantages and disadvantages of the proposed architectures are also presented.<> 相似文献
12.
VLSI architectures for video compression-a survey 总被引:3,自引:0,他引:3
Pirsch P. Demassieux N. Gehrke W. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1995,83(2):220-246
The paper presents an overview on architectures for VLSI implementations of video compression schemes as specified by standardization committees of the ITU and ISO. VLSI implementation strategies are discussed and split into function specific and programmable architectures. As examples for the function oriented approach, alternative architectures for DCT and block matching will be evaluated. Also dedicated decoder chips are included Programmable video signal processors are classified and specified as homogeneous and heterogenous processor architectures. Architectures are presented for reported design examples from the literature. Heterogenous processors outperform homogeneous processors because of adaptation to the requirements of special, subtasks by dedicated modules. The majority of heterogenous processors incorporate dedicated modules for high performance subtasks of high regularity as DCT and block matching. By normalization to a fictive 1.0 μm CMOS process typical linear relationships between silicon area and through-put rate have been determined for the different architectural styles. This relationship indicates a figure of merit for silicon efficiency 相似文献
13.
The paper presents the problem of fault tolerance in VLSI array structures: its aim is to discuss architectures capable of surviving a number of random faults while keeping costs (in terms of added silicon area and of increased processing time) as low as possible. Two different approaches are presented, both based upon introduction of simple patterns of faults and by global reconfiguration techniques (rather than one-to-one substitution of faulty elements by spare ones). Various solutions are compared, and relative performances are discussed in order to determine criteria for selecting the one most suitable to particular applications. 相似文献
14.
Modular, area-efficient VLSI architectures for computing the arithmetic Fourier transform (AFT) are proposed. By suitable design of PEs and I/O sequencing, nonuniform data dependencies in the AFT computation which require nonequidistant inputs and assignment of Mobius function values are resolved. The proposed design employs 2N +1 PEs to compute 2N +1 Fourier coefficients. Each PE has an adder and a fixed amount of local storage, and one PE has a multiplier. I/O with the host is performed using a fixed number of channels. This results in simple PE organization, compared with those needed in known DFT/FFT architectures. The design achieves O (N ) speedup. It uses significantly fewer PEs than designs in the literature and supports real-time applications by allowing continuous sequential input. It can be extended to achieve linear speedup in a fixed size array with 2p +1 PEs, 1⩽p ⩽N 相似文献
15.
The evolution of CORDIC, an iterative arithmetic computing algorithm capable of evaluating various elementary functions using a unified shift-and-add approach, and of CORDIC processors is reviewed. A method to utilize a CORDIC processor array to implement digital signal processing algorithms is presented. The approach is to reformulate existing DSP algorithms so that they are suitable for implementation with an array performing circular or hyperbolic rotation operations. Three categories of algorithm are surveyed: linear transformations, digital filters, and matrix-based DSP algorithms 相似文献
16.
Olaf J. Joeressen Martin Vaupel Heinrich Meyr 《The Journal of VLSI Signal Processing》1994,8(2):169-181
During the last years decoding algorithms that make not only use of soft quantized inputs but also deliver soft decision outputs have attracted considerable attention because additional coding gains are obtainable in concatenated systems. A prominent member of this class of algorithms is the Soft-Output Viterbi Algorithm. In this paper two architectures for high speed VLSI implementations of the Soft-Output Viterbi-Algorithm are proposed and area estimates are given for both architectures. The well known trade-off between computational complexity and storage requirements is played to obtain new VLSI architectures with increased implementation efficiency. Area savings of up to 40% in comparison to straightforward solutions are reported.This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under contract Me 651/12-1. 相似文献
17.
New VLSI architectures for fast convolutional threshold decoders that process soft-quantized channel symbols are presented. The new architectures feature pipelining and parallelism and make it possible to fabricate decoders for data rates up to hundreds of Mbits per second. With these architectures, the data rate is shown to be independent of the memory of the code, implying that fast AAPP (approximate a posteriori probability) decoders can be built for long powerful codes. Furthermore, the architectures are convenient to use with low and high coding rates. Using a typical example it is shown that a soft-decision threshold decoder can provide a substantial coding gain while being less costly to implement than the hard-decision threshold decoder 相似文献
18.
Combining the Residue Number System as a computational tool and VLSI as a fabrication medium promises to provide modular and cost efficient implementation of many digital signal processing algorithms. In this paper, a memory model has been developed. It is a low level model, which is used to derive relationships between the size of each modulus (in the chosen number system), and both chip area and time required for implementing the corresponding look-up tables. The memory model allows the selection of the most efficient layout for memories which do not have power of two dimensions. A set of multi-look-up table modules has been proposed as building block units for implementing digital signal processing algorithms. A procedure has been developed to optimize the area and time of those modules. 相似文献
19.
Flexible VLSI architectures for high-speed 2-D finite-impulse-response (FIR) and infinite-impulse-response (IIR) digital filters are described. Cyclical parallel processing structures for 2-D FIR and IIR digital filtering are derived from the employment of storage elements. The hardware architectures that realize the parallel processing structures are developed. The resulting architectures, which are mainly constructed of three types of standard cells, exhibit a high degree of modularity and regularity, and thus a high suitability for VLSI implementation. The architectures can process 2-D data arrays of arbitrary dimensions in real time or near real time and have higher hardware efficiency and lower implementation cost than the direct-form realization 相似文献
20.
Yong-Jin Jeong Burleson W.P. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1997,5(2):211-217
We present two novel iterative algorithms and their array structures for integer modular multiplication. The algorithms are designed for Rivest-Shamir-Adelman (RSA) cryptography and are based on the familiar iterative Horner's rule, but use precalculated complements of the modulus. The problem of deciding which multiples of the modulus to subtract in intermediate iteration stages has been simplified using simple look-up of precalculated complement numbers, thus allowing a finer-grain pipeline. Both algorithms use a carry save adder scheme with module reduction performed on each intermediate partial product which results in an output in carry-save format. Regularity and local connections make both algorithms suitable for high-performance array implementation in FPGA's or deep submicron VLSI. The processing nodes consist of just one or two full adders and a simple multiplexor. The stored complement numbers need to be precalculated only when the modulus is changed, thus not affecting the performance of the main computation. In both cases, there exists a bit-level systolic schedule, which means the array can be fully pipelined for high performance and can also easily be mapped to linear arrays for various space/time tradeoffs 相似文献