1.

On supercomputing with systolic/wavefront array processors

《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1984,72(7):867-884

In many scientific and signal processing applications, there are increasing demands for large-volume and/or high-speed computations which call for not only high-speed computing hardware, but also for novel approaches in computer architecture and software techniques in future supercomputers. Tremendous progress has been made on several promising parallel architectures for scientific computations, including a variety of digital filters, fast Fourier transform (FFT) processors, data-flow processors, systolic arrays, and wavefront arrays. This paper describes these computing networks in terms of signal-flow graphs (SFG) or data-flow graphs (DFG), and proposes a methodology of converting SFG computing networks into synchronous systolic arrays or data-driven wavefront arrays. Both one- and two-dimensional arrays are discussed theoretically, as well as with illustrative examples. A wavefront-oriented programming language, which describes the (parallel) data flow in systolic/wavefront-type arrays, is presented. The structural property of parallel recursive algorithms points to the feasibility of a Hierarchical Iterative Flow-Graph Design (HIFD) of VLSI Array Processors. The proposed array processor architectures, we believe, will have significant impact on the development of future supercomputers.

2.

Systolic array processing of the Viterbi algorithm

Chang C.-Y. Yao K. 《IEEE transactions on information theory / Professional Technical Group on Information Theory》1989,35(1):76-86

Results on efficient forms of decoding convolutional codes based on the Viterbi algorithm by using systolic arrays are presented. Various properties of convolutional codes are discussed. A technique called strongly connected trellis decoding is introduced to increase the efficient utilization of all the systolic array processors. Issues dealing with the composite branch metric generation, survivor updating, overall system architecture, throughput rate, and computational overhead ratio are also investigated. The scheme is applicable to both hard and soft decoding of any rate b/n convolutional code. It is shown that as the length of the code becomes large, the systolic Viterbi decoder maintains a regular and general interconnection structure as well as moderate throughput rate gain over the sequential Viterbi decoder.

3.

Control generation in the design of processor arrays

Jürgen Teich Lothar Thiele 《The Journal of VLSI Signal Processing》1991,3(1-2):77-92

The problem of mapping algorithms onto regular arrays has received great attention in the past. Results are available on the mapping of regular algorithms onto systolic or wavefront arrays. On the other hand, many algorithms that can be implemented on parallel architectures are not completely regular but are composed of a set of regular subalgorithms. Recently, a class of configurable processor arrays has been proposed that allows the efficient implementation of piecewise regular algorithms. In contrast to pure systolic or wavefront arrays, they are distinguished by a dynamic configuration structure. The known trajectories, however, cannot be applied to the design of configurable processor arrays because the functions of the processing elements and the interconnection structure are time- and space-dependent. In this paper, a systematic procedure is introduced that allows the efficient design of configurable processor arrays including the specification of the processing elements and the generation of control signals. Control signals are propagated through the processor array. The proposed design trajectory can be used for the design of regular arrays or configurable arrays.

4.

FPGA design and implementation of a low-power systolic array-based adaptive Viterbi decoder (Cited by: 1; self-citations: 0; citations by others: 1)

Man Guo Ahmad M.O. Swamy M.N.S. Chunyan Wang 《IEEE transactions on circuits and systems. I, Regular papers》2005,52(2):350-365

In this paper, by modifying the well-known Viterbi algorithm, an adaptive Viterbi algorithm that is based on strongly connected trellis decoding is proposed. Using this algorithm, the design and a field-programmable gate array implementation of a low-power adaptive Viterbi decoder with a constraint length of 9 and a code rate of 1/2 is presented. In this design, a novel systolic array-based architecture with time multiplexing and arithmetic pipelining for implementing the proposed algorithm is used. It is shown that the proposed algorithm can reduce by up to 70% the average number of ACS computations over that by using the nonadaptive Viterbi algorithm, without degradation in the error performance. This results in lowering the switching activities of the logic cells, with a consequent reduction in the dynamic power. Further, it is shown that the total power consumption in the implementation of the proposed algorithm can be reduced by up to 43% compared to that in the implementation of the nonadaptive Viterbi algorithm, with a negligible increase in the hardware.

5.

Research on an algorithm for ASIC implementation of DTW (Cited by: 3; self-citations: 0; citations by others: 3)

Li Tao, He Qianhua, Wang Qian 《微电子学》 (Microelectronics) 2004,34(3):281-284

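By analyzing the DTW (dynamic time warping) algorithm, a systolic array structure suited to ASIC implementation is proposed. Simulation results show that this parallel VLSI processor array system can compute the weighted matching distance between two templates within N+M-1 clock cycles. Compared with the 3pMN/2 clock cycles required by a serial DTW implementation on a general-purpose processor, the algorithm saves a large amount of computation time.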
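
A minimal sketch of why a wavefront schedule can finish an N-by-M DTW grid in N + M - 1 steps: every cell on one anti-diagonal depends only on cells from earlier anti-diagonals, so a parallel array could update a whole anti-diagonal per clock. The function name `dtw_wavefront`, the absolute-difference local cost, and the three-neighbour step pattern are assumptions made for illustration; the abstract does not specify the authors' weighting or array layout.

```python
def dtw_wavefront(a, b):
    """Return (dtw_distance, wavefront_steps) for two numeric sequences.

    Cells are visited anti-diagonal by anti-diagonal (i + j = k). All cells
    on one anti-diagonal are mutually independent, so hardware could compute
    them simultaneously; this software version just counts the wavefronts.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * m for _ in range(n)]
    steps = 0
    for k in range(n + m - 1):  # one wavefront per anti-diagonal
        for i in range(max(0, k - m + 1), min(n, k + 1)):
            j = k - i
            cost = abs(a[i] - b[j])  # assumed local distance
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    d[i - 1][j] if i > 0 else inf,
                    d[i][j - 1] if j > 0 else inf,
                    d[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            d[i][j] = cost + best
        steps += 1
    return d[n - 1][m - 1], steps
```

For sequences of length 3 and 4 the loop runs 3 + 4 - 1 = 6 wavefronts, whereas a cell-by-cell serial sweep visits all 12 cells one after another.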

6.

A sorter-based architecture for a parallel implementation of communication intensive algorithms

Josef G. Krammer 《The Journal of VLSI Signal Processing》1991,3(1-2):93-103

This paper deals with the parallel execution of algorithms with global and/or irregular data dependencies on a regularly and locally connected processor array. The associated communication problems are solved by the use of a two-dimensional sorting algorithm. The proposed architecture, which is based on a two-dimensional sorting network, offers a high degree of flexibility and allows an efficient mapping of many irregularly structured algorithms. In this architecture a one-dimensional processor array performs all required control and arithmetic operations, whereas the sorter solves complex data transfer problems. The storage capability of the sorting network is also used as memory for data elements. The algorithms for sparse matrix computations, fast Fourier transformation and for the convex hull problem, which are mapped onto this architecture, as well as the simulation of a shared-memory computer show that the utilization of the most complex components, the processors, is O(1).

7.

An Efficient VLSI Architecture for Full-Search Block Matching Algorithms (Cited by: 1; self-citations: 0; citations by others: 1)

Chen-Yi Lee Mei-Cheng Lu 《The Journal of VLSI Signal Processing》1997,15(3):275-282

This paper presents a novel memory-based VLSI architecture for full search block matching algorithms. We propose a semi-systolic array to meet the requirements of high computational complexity, where data communications are handled in two styles: (1) global connections for search data and (2) local connections for partial sum. Data flow is handled by a multiple-port memory bank so that all processor elements function on target data items. Thus hardware efficiency achieved can be up to 100%. Both semi-systolic array structure and related memory management strategies for full-search block matching algorithms are highlighted and discussed in detail in the paper.
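
For orientation, the computation that such an array accelerates can be sketched in software as an exhaustive search over displacements, accumulating a partial sum per candidate. This is a plain sequential reference under assumed conventions (sum of absolute differences as the matching criterion, a hypothetical `full_search` signature, frames as lists of rows); it says nothing about the paper's memory organisation.

```python
def full_search(cur, ref, bx, by, n, p):
    """Exhaustive block matching for the n x n block of `cur` at (bx, by).

    Tries every displacement (dx, dy) with |dx|, |dy| <= p that keeps the
    candidate block inside `ref` and returns (best_sad, dx, dy).
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y0, x0 = by + dy, bx + dx
            if y0 < 0 or x0 < 0 or y0 + n > h or x0 + n > w:
                continue  # candidate block falls outside the reference frame
            sad = 0  # partial sum accumulated pixel by pixel
            for i in range(n):
                for j in range(n):
                    sad += abs(cur[by + i][bx + j] - ref[y0 + i][x0 + j])
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best
```

The inner double loop is the part a processor array would spread across elements: each element handles one candidate displacement and passes its partial sum along local connections.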

8.

Parallel structures for joint channel estimation and data detection over fading channels

Omidi M.J. Gulak P.G. Pasupathy S. 《Selected Areas in Communications, IEEE Journal on》1998,16(9):1616-1629

Joint data and channel estimation for mobile communication receivers can be realized by employing a Viterbi detector along with channel estimators which estimate the channel impulse response. The behavior of the channel estimator has a strong impact on the overall error rate performance of the receiver. Kalman filtering is an optimum channel estimation technique which can lead to significant improvement in the receiver bit error rate (BER) performance. However, a Kalman filter is a complex algorithm and is sensitive to roundoff errors. Square-root implementation methods are required for robustness against numerical errors. Real-time computation of the Kalman estimator in a mobile communication receiver calls for parallel and pipelined structures to take advantage of the inherent parallelism in the algorithm. In this paper different implementation methods are considered for measurement update and time update equations of the Kalman filter. The unit-lower-triangular-diagonal (LD) correction algorithm is used for the time update equations, and systolic array structures are proposed for its implementation. For the overall implementation of joint data and channel estimation, parallel structures are proposed to perform both the Viterbi algorithm and channel estimation. Simulation results show the numerical stability of different implementation techniques and the number of bits required in the digital computations with different estimators.

9.

A recursive design methodology for VLSI: Theory and example

H.C. Yung C.R. Allen H.K.E. Liesenberg D.J. Kinniment 《Integration, the VLSI Journal》1984,2(3):213-225

A novel VLSI (Very Large Scale Integration) methodology based on the hierarchical design of computational and system blocks is presented. The underlying algorithms used are shown to optimise the area-time complexity (AT²) of the computational units and at the system design level. The technique is illustrated for a matrix-matrix multiplication by using an image processing window convolver. This paper describes the performance of the recursive design technique comparing it to a typical systolic array, and demonstrates how data word size and convolution size may be expanded by movement up the architectural hierarchy. A prototype CAD (Computer Aided Design) autolayout program is described which maps directly into the hierarchical design environment. Using such design aids, flexible and correct designs may be generated which offer very simple data flow and highly local interconnection, with high performance.

10.

A systolic array for pyramidal algorithms

Christian Lengauer Jingling Xue 《The Journal of VLSI Signal Processing》1991,3(3):237-257

Pyramidal algorithms manipulate hierarchical representations of data and are used in many image processing applications, for example, in image segmentation and border extraction. We present a systolic array which performs pyramidal algorithms. The array is two-dimensional with one processor per image pixel; the number of steps in its execution is independent of the size of the image. The derivation of the array is governed by a mechanical method whose input is a Pascal-like program. After a number of manual transformations that prepare the program for the method, correct and optimal parallelism is infused mechanically. A processor layout is selected, and the channel connections follow immediately. Supported by an Overseas Research Students Award and a University of Edinburgh Postgraduate Fellowship.

11.

High-speed Viterbi processor: a pipelined block-processing parallel architecture (Cited by: 2; self-citations: 0; citations by others: 2)

Xuan Jianhua, Yao Qingdong 《通信学报》 (Journal on Communications) 1995,16(1):94-100

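This paper proposes a pipelined block-processing parallel Viterbi processor that achieves an LM-fold speedup (M is the number of pipeline stages and L is the block length), providing a new parallel architecture for reaching higher-speed Viterbi processors. It can be built from systolic arrays and is therefore well suited to VLSI implementation.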

12.

Integrated multielement receiver structures for spatially distributed interference channels

《IEEE transactions on information theory / Professional Technical Group on Information Theory》1986,32(2):195-219

The structure and performance of a maximum likelihood (ML) receiver for reception in spatially distributed interference channels when a multielement array capability is available are described. Under the assumption of Gaussian interference, the receiver consists of an optimum ML array processor followed by a sequence estimator implemented using the Viterbi Algorithm. Numerical results are provided to illustrate the symbol error probability performance of this optimum ML receiver. A completely adaptive realization of this scheme is described with performance illustrated through simulation.

13.

2-D DCT systolic array implementation

Ma W. 《Electronics letters》1991,27(3):201-202

The algorithm and architecture of a 2-D systolic array processor for the DCT (discrete cosine transform) are proposed. It is based on the relationship between DCT and cosine DFT and sine DFT. Two systolic architectures of 1-D DCT data and control flow computation are discussed. By use of the main feature of the two systolic 1-D arrays for DCT, a full 2-D systolic DCT array is presented.

14.

Application-specific instruction set processor for SoC implementation of modern signal processing algorithms

Zhaohui Liu Dickson K. McCanny J.V. 《IEEE transactions on circuits and systems. I, Regular papers》2005,52(4):755-765

A novel application-specific instruction set processor (ASIP) for use in the construction of modern signal processing systems is presented. This is a flexible device that can be used in the construction of array processor systems for the real-time implementation of functions such as singular-value decomposition (SVD) and QR decomposition (QRD), as well as other important matrix computations. It uses a coordinate rotation digital computer (CORDIC) module to perform arithmetic operations and several approaches are adopted to achieve high performance including pipelining of the micro-rotations, the use of parallel instructions and a dual-bus architecture. In addition, a novel method for scale factor correction is presented which only needs to be applied once at the end of the computation. This also reduces computation time and enhances performance. Methods are described which allow this processor to be used in reduced dimension (i.e., folded) array processor structures that allow tradeoffs between hardware and performance. The net result is a flexible matrix computational processing element (PE) whose functionality can be changed under program control for use in a wider range of scenarios than previous work. Details are presented of the results of a design study, which considers the application of this decomposition PE architecture in a combined SVD/QRD system and demonstrates that a combination of high performance and efficient silicon implementation are achievable.
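
To make the terms "micro-rotation" and "scale factor correction" concrete, here is a textbook floating-point CORDIC rotation in which the accumulated gain is divided out once after the last micro-rotation. It is only an illustration of the general idea: the function name `cordic_rotate` and the iteration count are assumptions, and the paper's own correction method is not described in the abstract and is not reproduced here.

```python
import math


def cordic_rotate(x, y, angle, iterations=24):
    """Rotate the vector (x, y) by `angle` radians with CORDIC micro-rotations.

    Each micro-rotation uses only a multiply by a power of two (a shift in
    hardware) and an add. Converges for |angle| up to roughly 1.74 rad.
    """
    gain = 1.0
    for i in range(iterations):
        d = 1.0 if angle >= 0 else -1.0  # direction of this micro-rotation
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift
        angle -= d * math.atan(shift)  # remaining angle still to rotate
        gain *= math.sqrt(1.0 + shift * shift)  # growth caused by this step
    # Scale correction applied once, after all micro-rotations.
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, math.pi / 4)` returns a pair close to (0.7071, 0.7071). In hardware the gain is a known constant for a fixed iteration count, which is why a single correction at the end suffices in this textbook form.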

15.

Implementation of a Viterbi Processor for a Digital Communications System with a Time-Dispersive Channel

Frenette N. McLane P. Peppard L. Cotter F. 《Selected Areas in Communications, IEEE Journal on》1986,4(1):160-167

This paper describes the theory, design, and testing of a Viterbi processor for a digital communication system with intersymbol interference over fading time-dispersive channels. The requirement is to implement the Viterbi algorithm for a channel memory of 9 baud at a data rate of 2400 bits/s. The processor is partitioned into three subprocessors corresponding to the correlation, state metric evaluation, and state decision-making operations. For prototype evaluation, each subprocessor is being implemented as a separate chip using 4-5 μm CMOS technology. The architecture, circuit design, and subsystem characterization of the correlator chip are described in some detail. The chip is required to evaluate 1024 state transition metrics in each baud interval (about 400 ns) using a pipeline architecture. Simulation and initial test results verify the correct operation of the chip with an adequate-speed safety margin. The theory of operation and architecture of the state metric chip are described. With off-chip memory for state metric storage, the state transition metrics from the correlator chip are used to determine the winning (optimal) path in the Viterbi trellis and to calculate the corresponding 16-bit state metric for each baud interval. Implementation of the third chip which is required to make a state decision regarding the bit sequence sent is presently being investigated.

16.

Parallel Viterbi algorithm implementation: breaking the ACS-bottleneck

Fettweis G. Meyr H. 《Communications, IEEE Transactions on》1989,37(8):785-790

The central unit of a Viterbi decoder is a data-dependent feedback loop which performs an add-compare-select (ACS) operation. This nonlinear recursion is the only bottleneck for a high-speed parallel implementation. A linear scale solution (architecture) is presented which allows the implementation of the Viterbi algorithm (VA) despite the fact that it contains a data-dependent decision feedback loop. For a fixed processing speed it allows a linear speedup in the throughput rate by a linear increase in hardware complexity. A systolic array implementation is discussed for the add-compare-select unit of the VA. The implementation of the survivor memory is considered. The method for implementing the algorithm is based on its underlying finite state feature. Thus, it is possible to transfer this method to other types of algorithms which contain a data-dependent feedback loop and have a finite state property.
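
The add-compare-select recursion itself can be sketched in a few lines. The sketch below is a generic sequential ACS step, not the parallel architecture of the paper; the name `acs_step`, the predecessor-list trellis encoding, and the callable `branch_metric` are assumptions chosen for illustration.

```python
def acs_step(metrics, transitions, branch_metric):
    """Perform one trellis step of add-compare-select.

    metrics[p]        path metric of state p before the step
    transitions[s]    list of predecessor states of state s
    branch_metric     function (p, s) -> cost of the branch p -> s
    Returns (new_metrics, decisions), where decisions[s] is the surviving
    predecessor of state s.
    """
    new_metrics, decisions = [], []
    for s, preds in enumerate(transitions):
        # add: extend every incoming path by its branch metric
        candidates = [(metrics[p] + branch_metric(p, s), p) for p in preds]
        # compare and select: keep the cheapest extended path
        best, pred = min(candidates)
        new_metrics.append(best)
        decisions.append(pred)
    return new_metrics, decisions
```

The feedback loop is visible in how the function is used: the `new_metrics` returned by one call are the `metrics` argument of the next, so step t+1 cannot start until the selection in step t has finished.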

17.

A CCD programmable signal processor

Chiang A.M. 《Solid-State Circuits, IEEE Journal of》1990,25(6):1510-1517

A generic charge-coupled device (CCD) signal processor that performs 2.8-billion computations per second with a 10-MHz clock rate is described. The device's concept, design, operation, performance, and applications are reviewed. A dynamic range greater than 42 dB has been demonstrated by the device. This processor can be used as a one-dimensional correlator, a two-dimensional matched filter or a two-layer neural net device. The device demonstrates the flexibility and computational power that is possible using CCD technology.

18.

VASS—A VLSI array system synthesizer

Jinn-Wang Yeh Wen-Jiunn Cheng Chein-Wei Jen 《Journal of Signal Processing Systems》1996,12(2):135-158

19.

A Unifying Lattice-Based Approach for the Partitioning of Systolic Arrays via LPGS and LSGP

Karl-Heinz Zimmermann 《The Journal of VLSI Signal Processing》1997,17(1):21-41

Various methods for the synthesis of systolic arrays from signal and image processing algorithms have been developed in the past few years. In this paper, we propose a technique for the partitioning problem, the problem to synthesize systolic arrays whose size does not match the problem size. Our technique generalizes most of the known lattice-based approaches to the partitioning problem and combines the multiprojection method for the synthesis of systolic arrays with the locally sequential-globally parallel (LSGP) and locally parallel-globally sequential (LPGS) partitioning schemes. Starting from (1) a k-dimensional large-size systolic array obtained from a system of n-dimensional uniform recurrences by a space-time transformation and (2) an arbitrary lattice in k-space inducing a partitioning of the array into subarrays, a small-size systolic array with a scalar-valued system clock is constructed via the LSGP or LPGS paradigm. In particular, the allocation function for the small-size array can be written in closed form and the timing function is obtained from timing functions for the subdomains, the set of operations performed by the subarrays, by simple greedy algorithms. In this way, the problem of finding optimal timing functions can in various cases be reduced to finding optimal timing functions for the subdomains. For problems of large size, these greedy algorithms seem to be preferable when compared with existing integer or non-convex programming formulations for finding (sub-)optimal timing functions. We also provide some new results, a necessary and sufficient condition for the existence of counter data flow, a formal relationship between partitionings of processor space and index space of the uniform recurrences in terms of counter data flow, and the structural equivalence between the lattice-based LSGP and LPGS schemes applied to the partitioning of index and processor space.

20.

Systolic architectures for radar CFAR detectors

Hwang J.-N. Ritcey J.A. 《Signal Processing, IEEE Transactions on》1991,39(10):2286-2295

The authors discuss several advances in the evolution of radar CFAR (constant false alarm rate) detectors, from the classical mean-level detector to more recent designs using order statistics, or sorted data values. These algorithms can be implemented by modifying the existing running window order statistic filtering techniques used in signal/image processing. Although the signal processing theory of CFAR detection is well advanced, practical applications lag because of the high throughput required in radar. This intensive computational requirement is unlikely to be met by further advances in VLSI technology alone; it must result from parallel processing techniques. Systolic array architectures are proposed for several important CFAR detectors. Techniques for improving the processor utilization efficiency of the proposed array architectures are also discussed.
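
The two detector families named above differ only in how the noise level is estimated from the running window, which a short sequential sketch can show. Everything specific here is an assumption for illustration: the function name `cfar`, the guard-cell layout, the threshold multiplier `scale`, and the choice of the k-th smallest sample as the order statistic. The systolic mapping itself is not attempted.

```python
def cfar(samples, num_ref, num_guard, scale, k=None):
    """Running-window CFAR detection over a list of non-negative samples.

    num_ref    total reference cells (half on each side of the test cell)
    num_guard  guard cells skipped on each side of the test cell
    scale      threshold multiplier applied to the noise estimate
    k          None for the mean-level detector; otherwise use the k-th
               smallest reference sample (1-based) as the noise estimate
    Returns a list of booleans, one per cell.
    """
    half = num_ref // 2
    detections = []
    for i, value in enumerate(samples):
        left = samples[max(0, i - num_guard - half):max(0, i - num_guard)]
        right = samples[i + num_guard + 1:i + num_guard + 1 + half]
        window = left + right
        if not window:
            detections.append(False)
            continue
        if k is None:
            noise = sum(window) / len(window)  # mean-level estimate
        else:
            noise = sorted(window)[min(k, len(window)) - 1]  # order statistic
        detections.append(value > scale * noise)
    return detections
```

The `sorted(window)` call is the step that dominates the cost of the order-statistic variant and is recomputed for every cell here; running-window order-statistic filters avoid that by updating the sorted window incrementally.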