Similar Documents
20 similar documents found.
1.
The Householder transformation is considered desirable among unitary transformations owing to its superior computational efficiency and robust numerical stability. Specifically, the Householder transformation outperforms the Givens rotation and the modified Gram-Schmidt methods in numerical stability under finite-precision implementations, while requiring fewer arithmetic operations. Consequently, QR decomposition based on the Householder transformation is promising for VLSI implementation and real-time, high-throughput modern signal processing. In this paper, a recursive complex Householder transformation (CHT) with a fast initialization algorithm is proposed, and its associated parallel/pipelined architecture is also considered. Then, a CHT-based recursive least-squares algorithm with fast initialization is presented, together with its associated systolic array processing architecture. This work was supported in part by the National Science Council of the R.O.C. under grant NSC80-E-SP-009-01A, and in part by a UC Micro grant and NSF grant NCR-8814407.
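As a point of reference for the transformation the abstract builds on, the sketch below shows a plain real-valued (non-recursive) Householder QR factorization and a least-squares solve in NumPy; it is illustrative only and does not model the paper's complex recursive CHT algorithm or its systolic architecture.

```python
import numpy as np

def householder_qr(A):
    """QR factorization via Householder reflections (real-valued sketch)."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.eye(m)
    R = A.copy()
    for k in range(min(m - 1, n)):
        x = R[k:, k]
        # Reflection vector v maps x onto a multiple of the first unit vector.
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        if np.linalg.norm(v) == 0:
            continue
        v = v / np.linalg.norm(v)
        # Apply the reflector H = I - 2 v v^T to the trailing submatrix.
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, np.triu(R)

# Usage: least-squares solve of A x ~= b through the QR factors.
A = np.random.randn(6, 3)
b = np.random.randn(6)
Q, R = householder_qr(A)
x = np.linalg.solve(R[:3, :3], (Q.T @ b)[:3])
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```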

2.
A class of selection algorithms using binary partition that are very efficient for median and rank-order filtering is considered. A unified discussion of these algorithms is presented. The algorithms have better time-area complexity than sorting methods. Counting, firing, and updating are the three basic steps. A generic structure is proposed to realize these algorithms. They can be implemented by simple and regular modules in VLSI.
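A minimal software analogue of the counting/updating idea, assuming 8-bit unsigned samples: the output value is found by binary partition on bit planes, deciding one bit per pass by counting how many window samples fall below the current candidate. The function names are made up for illustration, and the code does not model the firing step or the VLSI module structure.

```python
def rank_select_bitwise(window, rank, nbits=8):
    """Return the rank-th smallest sample (0-based) by binary partition on bit planes."""
    result = 0
    for bit in range(nbits - 1, -1, -1):
        candidate = result | (1 << bit)
        # Count samples strictly below the candidate at this bit resolution.
        below = sum(1 for s in window if (s >> bit) < (candidate >> bit))
        if below <= rank:           # enough samples at or above the candidate
            result = candidate      # so the rank-th element has this bit set
    return result

def median_filter_1d(signal, width=3):
    """Sliding-window median filter built on the bitwise rank selection."""
    half = width // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [rank_select_bitwise(padded[i:i + width], width // 2)
            for i in range(len(signal))]

print(median_filter_1d([10, 250, 12, 13, 200, 14, 15], width=3))
```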

3.
This paper presents two efficient very large scale integration (VLSI) architectures and a buffer-size optimization for full-search block-matching algorithms. Starting from an overlapped data flow of the search area, both systolic- and semisystolic-array architectural solutions are derived. By exploiting stream memory banks, not only can input/output (I/O) bandwidth be minimized, but processor-element efficiency can also be improved. In addition, the controller structures for both solutions are very straightforward, making them well suited to VLSI implementation for meeting computational requirements. Moreover, by exploring the dependency graph, we focus on reducing the internal buffer size under a minimal I/O bandwidth constraint, deriving guidelines for removing redundant internal buffers and achieving area-efficient VLSI architectures. Simulation results show that, for N=P=16 (N is the reference block size and P is the search range), I/O bandwidth can be reduced by 2.4 times while the buffer size increases by less than 38%. Two prototype chips for N=P=16 have been designed and fabricated. Test results show that the clock rate can reach 90 MHz, implying that more than 87.9 K motion vectors per second can be achieved, meeting the real-time requirements specified in the MPEG-2 MP@ML coding standard.
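As a behavioural reference for what the arrays compute, here is a sketch of exhaustive (full-search) block matching with a sum-of-absolute-differences criterion, using the N and P notation of the abstract; the SAD criterion and the function name are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def full_search(ref_block, search_area, N=16, P=16):
    """Exhaustive block matching: return the motion vector minimizing the SAD.

    ref_block   : N x N block from the current frame
    search_area : (N + 2P) x (N + 2P) window from the reference frame
    """
    best_mv, best_sad = (0, 0), np.inf
    for dy in range(2 * P + 1):
        for dx in range(2 * P + 1):
            cand = search_area[dy:dy + N, dx:dx + N]
            sad = np.abs(ref_block.astype(int) - cand.astype(int)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy - P, dx - P)
    return best_mv, best_sad

# Usage on random data; a real encoder would tile the frame into N x N blocks.
rng = np.random.default_rng(0)
cur = rng.integers(0, 256, (16, 16), dtype=np.uint8)
area = rng.integers(0, 256, (48, 48), dtype=np.uint8)
print(full_search(cur, area))
```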

4.
Execution latency and I/O bandwidth play essential roles in determining the effectiveness and the cost of a parallel hardware implementation of block-matching motion estimation algorithms. Unfortunately, almost all traditional architecture designs, e.g. the two-dimensional mesh-connected systolic array architecture (2DMCSA) and the tree-type structure (TTS), fail to take these two factors into account simultaneously. As a result, they suffer from either large execution latency or huge input bandwidth requirements. The authors propose a family of tree/linear architectures, which efficiently optimise the total implementation cost by combining the merits of the 2DMCSA and the TTS. Moreover, to facilitate hardware designs, the authors present tree-cut techniques and an on-chip buffer design method to meet the computational demands of various video compression applications. The proposed architectures are capable of executing the exhaustive search and other search block-matching algorithms, so they offer relatively flexible and cost-effective hardware solutions for a wide range of video coding systems, including CD-ROM, portable visual communication systems and high-definition TV.

5.
6.
The real-time implementation of an efficient signal compression technique, vector quantization (VQ), is of great importance to many digital signal coding applications. In this paper, we describe a new family of bit-level systolic VLSI architectures which offer an attractive solution to this problem. These architectures are based on a bit-serial, word-parallel approach, and high performance and efficiency can be achieved for VQ applications across a wide range of bandwidths. Compared with their bit-parallel counterparts, these bit-serial circuits provide better alternatives for VQ implementations in terms of performance and cost.
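The sketch below models the bit-serial, word-parallel idea in software under a squared-error distortion assumption: the input vector enters one bit plane per step (all vector components in parallel), and each codeword's inner product is accumulated by shift-and-add while the codeword energies are precomputed. The function names and 8-bit word length are illustrative, not taken from the paper.

```python
import numpy as np

def vq_encode_bit_serial(x, codebook, nbits=8):
    """Nearest-codeword search where the input enters one bit plane per step.

    x        : input vector of unsigned nbits-wide samples
    codebook : K x D array of codewords
    Distortion is compared via -2<x,c> + ||c||^2 (||x||^2 is common to all codewords).
    """
    codebook = np.asarray(codebook, dtype=np.int64)
    energy = (codebook ** 2).sum(axis=1)          # precomputed per codeword
    inner = np.zeros(len(codebook), dtype=np.int64)
    for bit in range(nbits - 1, -1, -1):          # MSB first, one plane per "cycle"
        plane = (np.asarray(x) >> bit) & 1        # word-parallel bit slice of x
        inner = (inner << 1) + codebook @ plane   # shift-and-add accumulation
    scores = energy - 2 * inner                   # equals ||x - c||^2 - ||x||^2
    return int(np.argmin(scores))

cb = np.array([[10, 20, 30, 40], [200, 10, 5, 0], [100, 100, 100, 100]])
print(vq_encode_bit_serial([95, 104, 98, 101], cb))   # -> index 2
```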

7.
This paper presents novel very large scale integration (VLSI) architectures in support of an efficient implementation of Leighton's well-known Columnsort. The designs take advantage of reconfigurable bus architectures enhanced with simple shift switches. Our first main contribution is to show that Columnsort can be partitioned into two components: a hardware scheme involving the task of sorting arrays of small size, and a hardware or software scheme that involves simple data-movement tasks. Our second main contribution is to demonstrate that the dynamically reconfigurable mesh architecture can be exploited to obtain a small and efficient hardware sorter. The resulting architectures feature high regularity of circuitry, simplicity of control structure, and adaptability. Both theoretical analyses and simulation tests have shown that the proposed VLSI architectures for sorting are superior to existing designs in the context of sorting small and moderate-size arrays.
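Since Columnsort is central to the abstract, a compact software reference model may help. It follows the usual eight-step description (sort columns, transpose/untranspose permutations, half-column shift/unshift) and assumes the standard conditions that s divides r and r >= 2(s-1)^2; the NumPy formulation below is only a functional model, not the paper's hardware partitioning.

```python
import numpy as np

def columnsort(data, r, s):
    """Leighton's Columnsort on an r x s array (software reference model).

    Requires s to divide r and, for general correctness, r >= 2*(s-1)**2.
    Elements are kept in column-major order; the result is fully sorted.
    """
    assert len(data) == r * s and r % s == 0
    M = np.array(data, dtype=float).reshape(r, s, order='F')

    def sort_cols(X):
        return np.sort(X, axis=0)

    M = sort_cols(M)                                   # 1. sort columns
    M = M.T.reshape(r, s)                              # 2. "transpose": column-major -> row-major
    M = sort_cols(M)                                   # 3. sort columns
    M = M.reshape(s, r).T                              # 4. untranspose: row-major -> column-major
    M = sort_cols(M)                                   # 5. sort columns
    flat = M.flatten(order='F')                        # 6. shift columns down by r/2
    shifted = np.concatenate([np.full(r // 2, -np.inf), flat,
                              np.full(r // 2, np.inf)]).reshape(r, s + 1, order='F')
    shifted = sort_cols(shifted)                       # 7. sort columns
    flat = shifted.flatten(order='F')[r // 2: r // 2 + r * s]   # 8. unshift
    return flat.reshape(r, s, order='F')

out = columnsort(np.random.permutation(54), r=18, s=3)
print(out.flatten(order='F'))                          # 0..53 in ascending order
```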

8.
Very large scale integration (VLSI) design methodology and the implementation complexities of high-speed, low-power soft-input soft-output (SISO) a posteriori probability (APP) decoders are considered. These decoders are used in iterative algorithms based on turbo codes and related concatenated codes and have shown a significant advantage in error-correction capability over conventional maximum-likelihood decoders. This advantage, however, comes at the expense of increased computational complexity, decoding delay, and substantial memory overhead, all of which hinge primarily on the well-known recursion bottleneck of the SISO-APP algorithm. This paper provides a rigorous analysis of the computational hardware and memory requirements at the architectural level, based on a tile-graph approach that models the resource-time scheduling of the recursions of the algorithm. The problem of constructing the decoder architecture and optimizing it for high speed and low power is formulated in terms of the individual recursion patterns, which together form a tile graph according to a tiling scheme. Using the tile-graph approach, optimized architectures are derived for the various forms of the sliding-window and parallel-window algorithms known in the literature. A proposed tiling scheme of the recursion patterns, called hybrid tiling, is shown to be particularly effective in reducing the memory overhead of high-speed SISO-APP architectures. Simulations demonstrate that the proposed approach achieves savings in area and power in the range of 4.2%-53.1% over the state of the art.

9.
By converting the computation of the full-search vector quantization (FSVQ) algorithm into inner-product operations and applying the Baugh-Wooley algorithm, this paper describes a new and efficient two's-complement-based VLSI architecture for FSVQ. Owing to its regularity and modularity, the architecture can be applied efficiently in VLSI implementations of speech, image, and video coding.
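A short sketch of the algebraic reformulation described above: since ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2 and ||x||^2 is common to all codewords, the full search only needs inner products plus precomputed codeword energies. The Baugh-Wooley two's-complement multiplier is a hardware structure and is not modelled here; function names are illustrative.

```python
import numpy as np

def fsvq_index_inner_product(x, codebook):
    """Full-search VQ encoding with the distance computation recast as inner products.

    Minimizing ||x - c||^2 is equivalent to maximizing <x, c> - ||c||^2 / 2,
    so the per-codeword work reduces to one inner product plus a precomputed bias.
    """
    codebook = np.asarray(codebook, dtype=float)
    bias = 0.5 * (codebook ** 2).sum(axis=1)     # precomputed once per codebook
    scores = codebook @ np.asarray(x, dtype=float) - bias
    return int(np.argmax(scores))

cb = np.array([[0.0, 0.0], [10.0, 10.0], [3.0, 4.0]])
print(fsvq_index_inner_product([2.5, 4.5], cb))   # -> 2 (closest codeword)
```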

10.
Great interest has been attracted in recent years by a new error-correcting code technique, known as "turbo coding", which has been proven to offer performance closer to the Shannon limit than traditional concatenated codes. In this paper, several very large scale integration (VLSI) architectures suitable for turbo decoder implementation are proposed and compared in terms of complexity and performance; the impact on VLSI complexity of system parameters such as the number of states, the number of iterations, and the code rate is evaluated for the different solutions. The results of this architectural study have then been exploited for the design of a specific decoder implementing a serial concatenation scheme with rate-2/3 and rate-3/4 codes; the designed circuit occupies 35 mm², supports a 2 Mb/s data rate, and, for a bit error probability of 10⁻⁶, yields a coding gain larger than 7 dB with ten iterations.

11.
A folded architecture and a digit-serial architecture are proposed for the implementation of one- and two-dimensional discrete wavelet transforms. In the one-dimensional folded architecture, the computations of all wavelet levels are folded onto the same low-pass and high-pass filters. The number of registers in the folded architecture is minimized by a generalized lifetime analysis. The converter units are synthesized with a minimum number of registers using forward-backward allocation. The advantage of the folded architecture is low latency; its drawbacks are increased hardware area, less than 100% hardware utilization, and the complex routing and interconnection required by the converters. These drawbacks are eliminated in the alternative digit-serial architecture, at the expense of increased system latency and some constraints on the wordlength. In latency-critical applications, the use of the folded architecture is suggested; if latency is not critical, the digit-serial architecture should be used. A combined folded and digit-serial architecture is proposed for the implementation of two-dimensional discrete wavelet transforms.
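As a functional reference for what is being folded, the sketch below computes a multilevel 1-D DWT in which every level reuses the same low-pass/high-pass filter pair, which is exactly the sharing a folded architecture exploits; the Haar filter pair and circular extension are illustrative assumptions, not the paper's filter choice.

```python
import numpy as np

def dwt_multilevel(x, h_lo, h_hi, levels):
    """Multilevel 1-D DWT: every level reuses the same low-/high-pass filter pair."""
    x = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        # Circular filtering (correlation form) followed by downsampling by 2.
        lo = np.array([np.dot(h_lo, np.roll(x, -n)[:len(h_lo)]) for n in range(0, len(x), 2)])
        hi = np.array([np.dot(h_hi, np.roll(x, -n)[:len(h_hi)]) for n in range(0, len(x), 2)])
        details.append(hi)
        x = lo                     # approximation feeds the next wavelet level
    return x, details

# Haar filter pair (illustrative choice).
h_lo = np.array([1.0, 1.0]) / np.sqrt(2.0)
h_hi = np.array([1.0, -1.0]) / np.sqrt(2.0)
approx, det = dwt_multilevel(np.arange(16, dtype=float), h_lo, h_hi, levels=3)
print(approx.shape, [d.shape for d in det])    # (2,) [(8,), (4,), (2,)]
```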

12.
This paper presents VLSI architectures for a three-level correlator design based on 1-μm CMOS technology. The architectures perform very high-speed, real-time, three-level cross-correlation of signals. Two architectures, one for serially arriving signal samples (serial data) and the other for stored signal samples (parallel data), are described in the paper.
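Assuming the usual meaning of "three-level", i.e. input signals quantized to {-1, 0, +1} so that each multiply degenerates to a sign select, a behavioural sketch of the correlation could look as follows; the threshold, names and test signal are illustrative.

```python
import numpy as np

def quantize_three_level(x, threshold):
    """Map samples to {-1, 0, +1}: the quantization behind a three-level correlator."""
    return np.where(x > threshold, 1, np.where(x < -threshold, -1, 0)).astype(np.int8)

def three_level_xcorr(a, b, max_lag):
    """Cross-correlation of two three-level sequences for lags 0..max_lag.
    With {-1, 0, +1} inputs each 'multiply' is only a sign select / gate."""
    n = len(a)
    return np.array([np.dot(a[:n - k].astype(int), b[k:].astype(int))
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
s = rng.standard_normal(1024)
x = quantize_three_level(s, 0.5)
y = quantize_three_level(np.roll(s, 3) + 0.1 * rng.standard_normal(1024), 0.5)
print(int(np.argmax(three_level_xcorr(x, y, 8))))   # correlation peak near lag 3
```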

13.
K.J. Ray Liu, Electronics Letters, 1990, 26(23): 1962-1963
The Haar transform is very useful in many signal and image processing applications where real-time implementation is essential. Three VLSI computing architectures are proposed for fast implementation of the Haar transform. Comparisons of the advantages and disadvantages of the proposed architectures are also presented.
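For reference, the fast Haar transform that such architectures implement can be modelled by repeated pairwise averaging and differencing; the orthonormal scaling used below is one common convention, assumed here for illustration.

```python
import numpy as np

def haar_transform(x):
    """Fast Haar transform by repeated pairwise averaging and differencing.
    Input length must be a power of two; the total cost is O(n) operations."""
    out = np.asarray(x, dtype=float).copy()
    n = len(out)
    while n > 1:
        half = n // 2
        pairs = out[:n].reshape(half, 2)
        sums = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
        diffs = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
        out[:half], out[half:n] = sums, diffs
        n = half                       # recurse on the coarse (average) half only
    return out

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
c = haar_transform(x)
print(np.allclose(np.sum(c ** 2), np.sum(x ** 2)))   # orthonormal: energy preserved
```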

14.
VLSI architectures for video compression - a survey
The paper presents an overview of architectures for VLSI implementations of video compression schemes as specified by the standardization committees of the ITU and ISO. VLSI implementation strategies are discussed and split into function-specific and programmable architectures. As examples of the function-oriented approach, alternative architectures for DCT and block matching are evaluated; dedicated decoder chips are also included. Programmable video signal processors are classified into homogeneous and heterogeneous processor architectures, and architectures are presented for design examples reported in the literature. Heterogeneous processors outperform homogeneous processors because of their adaptation to the requirements of special subtasks by dedicated modules. The majority of heterogeneous processors incorporate dedicated modules for highly regular, high-performance subtasks such as DCT and block matching. By normalization to a fictive 1.0 μm CMOS process, typical linear relationships between silicon area and throughput rate have been determined for the different architectural styles; this relationship indicates a figure of merit for silicon efficiency.

15.
This paper presents several techniques for the very large-scale integration (VLSI) implementation of the maximum a posteriori (MAP) algorithm. In general, knowledge about the implementation of the Viterbi (1967) algorithm can be applied to the MAP algorithm. Bounds are derived for the dynamic range of the state metrics which enable the designer to optimize the word length. The computational kernel of the algorithm is the add-MAX* operation, which is the add-compare-select operation of the Viterbi algorithm with an added offset. We show that the critical path of the algorithm can be reduced if the add-MAX* operation is reordered into an offset-add-compare-select operation by adjusting the location of registers. A general scheduling for the MAP algorithm is presented which gives the tradeoffs between computational complexity, latency, and memory size. Some of these architectures eliminate the need for RAM blocks with unusual form factors or can replace the RAM with registers. These architectures are suited to VLSI implementation of turbo decoders.
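The add-MAX* kernel mentioned above is small enough to state directly: MAX* is the Jacobian logarithm, i.e. a compare-select plus a correction offset, and hardware typically approximates the offset with a small lookup table. The sketch below is illustrative; the table size and step are assumptions, not values from the paper.

```python
import math

def max_star(a, b):
    """MAX* = Jacobian logarithm: max(a, b) plus a correction offset.
    This is the compare-select of the Viterbi ACS with an added offset term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def add_max_star(metric_a, gamma_a, metric_b, gamma_b):
    """add-MAX*: add the branch metrics, then combine the two candidate paths."""
    return max_star(metric_a + gamma_a, metric_b + gamma_b)

# Hardware usually replaces the offset with a small lookup table; an 8-entry
# table quantized to steps of 0.5 is assumed here purely for illustration.
OFFSET_LUT = [round(math.log1p(math.exp(-0.5 * i)), 3) for i in range(8)]

def max_star_lut(a, b):
    d = min(int(abs(a - b) / 0.5), len(OFFSET_LUT) - 1)
    return max(a, b) + OFFSET_LUT[d]

print(add_max_star(1.2, 0.3, 0.9, 0.8), max_star_lut(1.5, 1.7))
```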

16.
The paper addresses the problem of fault tolerance in VLSI array structures: its aim is to discuss architectures capable of surviving a number of random faults while keeping costs (in terms of added silicon area and increased processing time) as low as possible. Two different approaches are presented, both based on handling simple patterns of faults through global reconfiguration techniques rather than one-to-one substitution of faulty elements by spare ones. Various solutions are compared, and their relative performances are discussed in order to determine criteria for selecting the one most suitable to particular applications.

17.
This article presents the VLSI design of a configurable RSA public-key cryptosystem supporting 512-bit, 1024-bit and 2048-bit keys based on the Montgomery algorithm, achieving clock-cycle counts comparable to current relevant works but with a smaller die size. We use the binary method for modular exponentiation and adopt the Montgomery algorithm for modular multiplication to simplify the computational complexity, which, together with the systolic-array design concept, effectively lowers the die size. The main architecture of the chip consists of four functional blocks, namely the input/output modules, the registers module, the arithmetic module and the control module. We applied the systolic-array concept to design the RSA encryption/decryption chip using the VHDL hardware description language and verified it using the TSMC/CIC 0.35-μm 1P4M technology. The die area of the 2048-bit RSA chip without DFT is 3.9 × 3.9 mm² (4.58 × 4.58 mm² with DFT). Its average baud rate can reach 10.84 kbps under a 100 MHz clock.
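A word-level software model of the two ingredients named in the abstract, the binary (square-and-multiply) exponentiation method and Montgomery modular multiplication, is sketched below; the Python formulation and parameter names are illustrative and do not reflect the chip's systolic datapath.

```python
def mont_setup(n, width):
    """Precompute Montgomery constants for odd modulus n with R = 2**width."""
    R = 1 << width
    n_inv = pow(n, -1, R)                 # n * n_inv == 1 (mod R), Python 3.8+
    return R, (-n_inv) % R, (R * R) % n   # R, n', R^2 mod n

def mont_mul(a, b, n, R, n_prime, width):
    """Montgomery product: a*b*R^{-1} mod n, with no trial division by n."""
    t = a * b
    m = (t * n_prime) % R
    u = (t + m * n) >> width
    return u - n if u >= n else u

def mont_modexp(base, exp, n, width=2048):
    """Left-to-right binary exponentiation carried out in the Montgomery domain."""
    R, n_prime, r2 = mont_setup(n, width)
    base_m = mont_mul(base % n, r2, n, R, n_prime, width)   # map base into Montgomery form
    acc = R % n                                             # Montgomery form of 1
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, R, n_prime, width)      # square every bit
        if bit == '1':
            acc = mont_mul(acc, base_m, n, R, n_prime, width)   # multiply on a 1 bit
    return mont_mul(acc, 1, n, R, n_prime, width)           # convert back out

# Toy check against Python's built-in pow (real RSA moduli are 512/1024/2048 bits).
n, e, m = 3233, 17, 65            # n = 61 * 53
print(mont_modexp(m, e, n, width=12), pow(m, e, n))
```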

18.
Flexible VLSI architectures for high-speed 2-D finite-impulse-response (FIR) and infinite-impulse-response (IIR) digital filters are described. Cyclical parallel processing structures for 2-D FIR and IIR digital filtering are derived from the employment of storage elements. The hardware architectures that realize the parallel processing structures are developed. The resulting architectures, which are mainly constructed of three types of standard cells, exhibit a high degree of modularity and regularity, and thus a high suitability for VLSI implementation. The architectures can process 2-D data arrays of arbitrary dimensions in real time or near real time and have higher hardware efficiency and lower implementation cost than the direct-form realization.

19.
In recent years, decoding algorithms that not only make use of soft quantized inputs but also deliver soft-decision outputs have attracted considerable attention, because additional coding gains are obtainable in concatenated systems. A prominent member of this class of algorithms is the Soft-Output Viterbi Algorithm. In this paper, two architectures for high-speed VLSI implementations of the Soft-Output Viterbi Algorithm are proposed, and area estimates are given for both architectures. The well-known trade-off between computational complexity and storage requirements is exploited to obtain new VLSI architectures with increased implementation efficiency. Area savings of up to 40% in comparison to straightforward solutions are reported. This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under contract Me 651/12-1.

20.
The evolution of CORDIC, an iterative arithmetic computing algorithm capable of evaluating various elementary functions using a unified shift-and-add approach, and of CORDIC processors is reviewed. A method to utilize a CORDIC processor array to implement digital signal processing algorithms is presented. The approach is to reformulate existing DSP algorithms so that they are suitable for implementation with an array performing circular or hyperbolic rotation operations. Three categories of algorithms are surveyed: linear transformations, digital filters, and matrix-based DSP algorithms.
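A minimal sketch of the circular rotation-mode CORDIC iteration the survey refers to, written in floating-point Python rather than the fixed-point shift-and-add datapath of a real processor; the iteration count and the example are illustrative.

```python
import math

# Precomputed micro-rotation angles and the constant gain of circular CORDIC.
ANGLES = [math.atan(2.0 ** -i) for i in range(16)]
GAIN = 1.0
for i in range(16):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))

def cordic_rotate(x, y, theta, iterations=16):
    """Rotation-mode circular CORDIC: rotate (x, y) by theta using shifts and adds.
    Valid for |theta| within roughly +/- 99.88 degrees (the convergence range)."""
    z = theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0            # steer the residual angle toward zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN                  # undo the accumulated gain (~1.647)

# Example: evaluate cos/sin of 30 degrees by rotating the unit vector (1, 0).
c, s = cordic_rotate(1.0, 0.0, math.radians(30.0))
print(round(c, 4), round(s, 4))                # approx 0.866 0.5
```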
