Similar literature
1.
A 2D array implementation of image segmentation by a directed split and merge procedure is proposed. Parallelism is realized on two levels: one within the split and merge operations, where more than one merge (or split) may proceed concurrently, and the second between the split and merge operations, where several splits may be performed in parallel with merges. Both the split and merge operations are based on nearest neighbor communications between the processing elements (PEs), facilitating low communication costs. The basic arithmetic operations required to perform split and merge are comparison and addition, allowing a simple structure of the PE as well as a hardwired control. A local memory of 512 bytes is sufficient to hold the interim data associated with each PE. A prototype PE has been constructed using 3 μm double-metal CMOS technology. Scaling up to 0.8 μm, it is possible to incorporate 32 PEs on a 5 cm² chip. With sufficiently large PE window sizes, image segmentation can be achieved in linear time.

2.
A prototype filter design is reviewed to underscore the computational problems arising in such designs. A purely systolic-array architecture is presented. This array provides the computational support necessary for filter design. Due to a simple and novel data steering technique, the array is capable of carrying out a number of important matrix operations such as factorization, inversion of factors, and matrix-matrix multiplication. Another interesting attribute is the array's ability to maximally overlap computations of multiphase algorithms. In this study we demonstrate the execution of a dense matrix factorization phase and a factor inversion phase on the array with no need for intraphase or interphase I/O. We show that these phases (which are the backbone of an optimal filtering algorithm) are completed in the optimal count of about n time units. The array employs 2n nn simple processing elements (PEs) that are active every other time unit. It is shown that the functions of two adjacent PEs can be merged and assigned to a single PE, thus maximizing PE utilization. A possible design of a merged PE is given.

3.
This paper describes the design of a processing element (PE) for systolic array applications. The PE, which can be configured as a multiplier-accumulator or an inner product step processor, supports several of the most common systolic algorithms in signal processing and matrix arithmetic. Communication with neighbouring PEs is achieved through 18 on-chip serial links, each operating at 50 Mbits per second. The device incorporates both scan path and built-in self-test features. Integration of test circuitry with the serial communication ports permitted the testability features to be implemented with a total area overhead of under 9%. The 30 k transistor ASIC device is implemented in 2 micron HCMOS gate array technology, packaged in a 48 pin DIP and performs at 10 MFLOPS.

4.
In this paper, VLSI array architectures for matrix inversion are studied. A new binary-coded z-path (Bi-z) CORDIC is developed and implemented to compute the operations required in the matrix inversion using the Givens rotation (GR) based QR decomposition. The Bi-z CORDIC allows both the GR vectoring and rotation modes, as well as division and multiplication, to be executed in a single unified processing element (PE). Hence, a 2D (2 dimensional) array consisting of PEs with different functionalities can be folded into a 1D array to reduce hardware complexity. The Bi-z CORDIC also eliminates the arithmetic complexity of the angle quantization and formation computation that exist in the traditional CORDIC. Two mapping techniques, namely a linear mapping method and an interlaced mapping method, for mapping a 2D matrix inversion array into a 1D array are proposed and developed. Consequently, two corresponding array architectures are designed and implemented. Both architectures use the Bi-z CORDIC in their PEs, and they are designed to be fully scalable and parameterizable in terms of matrix size and data wordlength. The linear mapping method is a straightforward mapping offering simple schedule and control signals. The interlaced mapping method has a more complicated schedule with complex control signals but achieves 100% or near 100% processor utilization for odd- and even-sized matrices, respectively.
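
To make the two CORDIC modes named in this abstract concrete, here is a minimal Python sketch of the traditional floating-point CORDIC, not the Bi-z variant the paper proposes. Vectoring mode rotates a pair (a, b) onto the x-axis and yields the Givens angle; rotation mode applies a given angle to another pair. The function names, the iteration count, and the use of floats instead of fixed-point shifts are all assumptions made for illustration.

```python
import math

ITERATIONS = 32  # assumed precision; hardware would fix this by wordlength

# Elementary angles arctan(2^-i) and the constant gain of the iteration.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_vectoring(x, y):
    """Drive y to zero; return (radius, angle) of the input vector. Assumes x > 0."""
    z = 0.0
    for i in range(ITERATIONS):
        d = -1.0 if y > 0 else 1.0  # pick the direction that shrinks |y|
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]  # accumulate the angle rotated away so far
    return x / GAIN, z


def cordic_rotation(x, y, angle):
    """Rotate (x, y) counterclockwise by angle; valid for |angle| below about 1.74 rad."""
    z = angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0  # pick the direction that shrinks |z|
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN


# Givens step: find the angle that zeroes b in (a, b), then apply the same
# rotation (by -theta) to any other pair taken from the same two rows.
radius, theta = cordic_vectoring(3.0, 4.0)
rotated = cordic_rotation(3.0, 4.0, -theta)
```

In a QR decomposition array, a boundary cell would run the vectoring step and pass the angle to internal cells running the rotation step; sharing one iteration structure between the two modes is what makes a unified PE plausible.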

5.
Modular, area-efficient VLSI architectures for computing the arithmetic Fourier transform (AFT) are proposed. By suitable design of PEs and I/O sequencing, nonuniform data dependencies in the AFT computation which require nonequidistant inputs and assignment of Mobius function values are resolved. The proposed design employs 2N+1 PEs to compute 2N+1 Fourier coefficients. Each PE has an adder and a fixed amount of local storage, and one PE has a multiplier. I/O with the host is performed using a fixed number of channels. This results in simple PE organization, compared with those needed in known DFT/FFT architectures. The design achieves O(N) speedup. It uses significantly fewer PEs than designs in the literature and supports real-time applications by allowing continuous sequential input. It can be extended to achieve linear speedup in a fixed size array with 2p+1 PEs, 1 ≤ p ≤ N.

6.
Describes an LSI adaptive array processor (AAP) for two-dimensional data processing. The AAP contains a large number of one-bit processing elements (PEs) arranged in a square array. The large degree of parallelism and control registers in each PE allow for high speed and flexible operations. High transfer capability is also obtained by a simple inter-PE connection network with hierarchical bypasses. The high applicability to various kinds of data processing is indicated by a matrix multiplication example, utilizing an algorithm similar to a systolic one. An AAP LSI composed of 8×8 PEs with powerful functions has been implemented in a 96.0 mm² chip by using 2 μm Si-gate p-well CMOS technology. A high-speed cycle time of 55 ns, low power dissipation of 1.1 W, and high packing density of 1170 transistors/mm² have been achieved by a skilful manual design. Though the LSI contains as many as 111900 transistors, the design effort has only required one man-year due to cellular array regularity. This LSI is expected to realize a high-performance AAP compactly.

7.
An array processing chip integrating 128 bit-serial processing elements (PEs) on a single die is discussed. Each PE has a 16-function logic unit, a single-bit adder, a 32-b variable-length shift register, and 1 kb of local RAM. Logic in each PE provides the capability to mask PEs individually. A modified grid interconnection scheme allows each PE to communicate with each of its eight nearest neighbors. A 32-b bus is used to transfer data to and from the array in a single cycle. Instruction execution is pipelined, enabling all instructions to be executed in a single cycle. The 1-μm CMOS design contains over 1.1-million transistors on an 11.0-mm×11.7-mm die.

8.
Computing with large die-size graphical processors, which involve huge arrays of identical structures, abounds with challenges in the late CMOS era due to spatial non-idealities arising from chip-to-chip and within-chip variation of MOSFET threshold voltage. In this paper, we propose a software framework using machine learning for in-situ prediction and correction of computation corrupted by threshold voltage variation of transistors. Semi-supervised training is imparted to a fully connected cascade feed-forward (FCCFF) neural network (NN). This FCCFF-NN then creates an accurate spatial map of faulty processing elements (PEs), which are avoided in computing. Besides correcting spatial faults, any transient errors (such as single-event upsets) are also tracked and corrected if the number of affected PEs is large enough to cause noticeable computing errors. For experimental validation, we consider a 256 × 256 PE array. Each PE comprises an add-accumulate-multiply (AAM) block with three 8-bit registers (two for inputs and a third for storing the computed result). One thousand instances of this processor array are created, and PEs in each instance are randomly perturbed with threshold voltage variation. Common image processing operations such as low pass filtering and edge enhancement are performed on each of these 1,000 instances. A fraction of these images (about 10%) is used to train the NN for spatial non-idealities. Based on this training, the NN is able to accurately predict the spatial extremities in 95% of the remaining 90% of the cases. The proposed NN-based error tolerance produces superior quality processed images whose degradation is no longer visually perceptible.

9.
This paper presents the design and implementation of a high performance bi-directional linear systolic array (BLSA) with low-power, reconfigurable processing elements (PEs). The BLSA acts as a hardware accelerator for implementing a broad class of problems which are met in a variety of applications such as digital signal processing, computer graphics, graph algorithms, etc. We define a unique algorithm representation for solving problems such as matrix multiplication, transitive closure, finding the critical path in a graph, finding all-pairs shortest paths in a graph, etc. The algorithm is mapped onto a BLSA with reconfigurable PEs. A clock gating technique is used to minimize the power consumption of a multi-functional PE. The performance of the BLSA is considered from the aspects of power consumption and communication bandwidth. Using the clock gating technique, we achieve a PE power reduction of 85% on average. Communication bandwidth is considered for different numbers of PEs in the BLSA and different operand sizes. The obtained results are in the range of 442 up to 9460 MB/s, i.e. the bandwidth of our design is better for larger array and operand sizes. A lower-power, reconfigurable PE is realized using Xilinx FPGA chips.

10.
The reconfiguration of multipipeline arrays in the presence of both faulty processing elements (PEs) and switching elements (SEs) is addressed. Different fault models are used for the PEs and SEs: a PE can be either fault free or faulty; an SE is modeled using a novel functional approach which relates its switching capabilities to its status. This permits a PE to retain a partial functionality in the presence of a fault. An appropriate transformation of the multipipeline array reconfiguration problem to a maximum flow problem is then presented. The conditions under which this transformation is possible are fully analyzed. A reconfiguration algorithm based on the maximum flow algorithm is presented; the proposed algorithm is optimal as the number of reconfigured pipelines is maximized.

11.
The design of a fault-tolerant rectangular array of processing elements (PEs) is presented in which the reconfiguration is done by means of on-chip distributed logic, without the help of any external host. Spare PEs are included in every column of the array, and faulty PEs are bypassed within a column to facilitate reconfiguration in the presence of faults. Scan paths are used to enhance the testability of the array. PEs are tested locally using near-neighbor comparisons without the need of an external host. Because the interconnections between logical neighbors are short, the speed penalty for reconfiguration is very small. Any amount of redundancy can be incorporated in the array without changing the topology of the scheme or the design of the reconfiguration switches. The scheme is well suited for very large-area, high-density chips and wafer-scale integration. In order to demonstrate the capabilities of the scheme and evaluate its performance, an experimental chip consisting of a 6×4 array was designed, fabricated, and tested. Details of the design and the implementation of the chip are presented. The scheme is also analyzed for yield and area utilization for a range of array sizes and PE survival probabilities.

12.
In this paper, a fully pipelined linear systolic array based on an adjusted Montgomery's algorithm is presented to perform modular multiplication at extremely high speed. The processing element (PE) consists of only 4 full-adders and 14 flip-flops. The three-stage internally pipelined PE results in a very short critical path with only a one-bit full-adder delay. Thus, it can run at a very high cycle rate. The total execution time for an n-bit modular multiplication is 2n + 11 cycles with only (n/2 + 2) PEs. A modular exponentiation based on it takes (3n + 16.5)n cycles on average. Compared with most published VLSI modular multipliers, the hardware complexity is greatly reduced while keeping very high throughput. Therefore it is a good candidate for the arithmetic units used in many public-key cryptosystems, e.g. RSA, Elliptic Curve and so on, especially for embedded applications concerning information security.
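
For orientation, the bit-serial recurrence that such systolic multipliers unroll across PEs can be sketched in a few lines of Python. This is the textbook radix-2 Montgomery multiplication, not the adjusted variant or the pipelined schedule the paper describes; the function name and parameters are hypothetical.

```python
def montgomery_multiply(a, b, modulus, nbits):
    """Return a * b * 2^(-nbits) mod modulus.

    Assumes modulus is odd and a, b < modulus < 2^nbits.
    """
    s = 0
    for i in range(nbits):
        a_i = (a >> i) & 1   # scan the multiplier from the least significant bit
        s += a_i * b         # conditional add of the multiplicand
        if s & 1:            # make the partial sum even by adding the modulus
            s += modulus
        s >>= 1              # exact division by 2
    if s >= modulus:         # single final correction; s stays below 2 * modulus
        s -= modulus
    return s


# Example with an 8-bit odd modulus. The raw result carries a factor 2^-8,
# which one more Montgomery step with R^2 mod N removes.
N, NBITS = 239, 8
raw = montgomery_multiply(123, 77, N, NBITS)
product = montgomery_multiply(raw, pow(2, 2 * NBITS, N), N, NBITS)
```

Each loop iteration needs only a conditional add, a parity check, and a shift, which is why a PE built from a handful of full-adders and flip-flops can suffice in hardware.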

13.
The paper provides proof of the odd symmetry of the vector of weight coefficients obtained on the basis of the least squares criterion in the linear antenna array with linear constraints and desired signal. Pairs of symmetrical elements of such a vector are complex-conjugate to one another. To ensure this property, the vector of constrained parameters (array pattern values in the directions of interest) must be real-valued, but need not be symmetrical. The odd symmetry of the antenna array vectors of input signals and weights makes it possible to develop adaptive algorithms for such an array based on real-valued arithmetic. In this case, such algorithms require two or four times fewer arithmetic operations per iteration than the equivalent number of real-valued arithmetic operations of similar algorithms in complex-valued arithmetic. The paper presents the results of comparative simulation of algorithms in complex-valued and real-valued arithmetic. These results indicate that an adaptive algorithm using real-valued arithmetic ensures a (1.5–2)-fold shorter transient and deeper (by 2–3 dB) valleys in the steady state of the array pattern in the directions of sources of adaptively suppressed interferences as compared to the algorithm using complex-valued arithmetic.

14.
A novel application-specific instruction set processor (ASIP) for use in the construction of modern signal processing systems is presented. This is a flexible device that can be used in the construction of array processor systems for the real-time implementation of functions such as singular-value decomposition (SVD) and QR decomposition (QRD), as well as other important matrix computations. It uses a coordinate rotation digital computer (CORDIC) module to perform arithmetic operations, and several approaches are adopted to achieve high performance, including pipelining of the micro-rotations, the use of parallel instructions and a dual-bus architecture. In addition, a novel method for scale factor correction is presented which only needs to be applied once at the end of the computation. This also reduces computation time and enhances performance. Methods are described which allow this processor to be used in reduced dimension (i.e., folded) array processor structures that allow tradeoffs between hardware and performance. The net result is a flexible matrix computational processing element (PE) whose functionality can be changed under program control for use in a wider range of scenarios than previous work. Details are presented of the results of a design study, which considers the application of this decomposition PE architecture in a combined SVD/QRD system and demonstrates that a combination of high performance and efficient silicon implementation is achievable.

15.
The design of the processing element of GASP, a GaAs supercomputer with a 500-MHz instruction issue rate and 1-GHz subsystem clocks, is presented. The novel, functionally modular, block data flow architecture of GASP is described. The architecture and design of a GASP processing element is then presented. The processing element (PE) is implemented in a hybrid semiconductor module with 152 custom GaAs ICs of eight different types. The effects of the implementation technology on both the system-level architecture and the PE design are discussed. SPICE simulations indicate that parts of the PE are capable of being clocked at 1 GHz, while the rest of the PE uses a 500-MHz clock. The architecture utilizes data flow techniques at a program block level, which allows efficient execution of parallel programs while maintaining reasonably good performance on sequential programs. A simulation study of the architecture indicates that an instruction execution rate of over 30,000 MIPS can be attained with 65 PEs.

16.
A collision detection VLSI processor is proposed in order to achieve ultrahigh-performance processing with an ideal parallel processing scheme. A large number of coordinate transformations and memory accesses to the obstacle memory are fully utilized in the processing algorithm, so that direct collision detection can be executed with a VLSI-oriented regular data flow. The structure of each processing element (PE) is very simple because a PE mainly consists of a COordinate Rotational DIgital Computer (CORDIC) arithmetic unit for the coordinate transformation and memories for the storage of manipulator and obstacle information. When 100 PEs are used for parallel processing, the performance is about 10,000 times faster than that of conventional approaches using a single general-purpose microprocessor.

17.
In H.264/AVC, the motion estimation (ME) routine supports variable block size and involves highly parallel sum of absolute difference (SAD) computations. In this study, we introduce a bit-serial, hybrid-grained processing element (PE) based 2D architecture that has both early termination and intensive data reuse capabilities. PEs operate on most significant bit-first arithmetic for early termination, and the 2D architecture enables on-chip data reuse between neighboring PEs in a bit-by-bit pipelined fashion. Hybrid-grained PEs reduce the hardware overhead of conventional adder tree structures used for implementing the variable block size ME. Our design reduces the gate count by 7x compared to its ASIC counterpart, operates at a comparable frequency while sustaining 30 fps and 60 fps, and outperforms bit-parallel and bit-serial architectures in terms of throughput and performance per gate for various video formats.
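
The SAD search with early termination mentioned in this abstract can be illustrated at word level. The sketch below is a simplified software analogue written for this listing, assuming row-by-row accumulation and a running best score; the paper's design terminates early at the bit level using most significant bit-first arithmetic, which is not modeled here. All names are hypothetical.

```python
def sad_early_exit(block, candidate, best_so_far):
    """Sum of absolute differences between two equally sized 2D blocks.

    Returns None as soon as the partial sum can no longer beat best_so_far.
    """
    total = 0
    for row_a, row_b in zip(block, candidate):
        for p, q in zip(row_a, row_b):
            total += abs(p - q)
        if total >= best_so_far:  # checked once per row to keep the test cheap
            return None
    return total


def best_match(block, candidates):
    """Return (index, sad) of the candidate block with the smallest SAD."""
    best_index, best_sad = None, float("inf")
    for index, candidate in enumerate(candidates):
        sad = sad_early_exit(block, candidate, best_sad)
        if sad is not None:
            best_index, best_sad = index, sad
    return best_index, best_sad


current = [[1, 2], [3, 4]]
search_window = [
    [[0, 0], [0, 0]],  # SAD 10
    [[1, 2], [3, 5]],  # SAD 1
    [[9, 9], [9, 9]],  # abandoned after the first row
]
match = best_match(current, search_window)
```

The earlier a good candidate is found, the more of the remaining candidates are abandoned after a row or two, which is the same saving the hardware obtains by stopping a bit-serial comparison once the leading bits decide the outcome.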

18.
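The paraxial approximation is the root cause of the inherent phase error of the parabolic equation (PE). To select the optimal PE form for a target scenario, the dispersion relations of the six existing PE forms are derived using dispersion analysis, and the influence of refractive index and propagation elevation angle on the phase error of each PE form is evaluated. On this basis, the accuracy of each PE form is analyzed in three typical scenarios: tropospheric radio wave propagation, underwater acoustic wave propagation, and radio wave propagation in forests. ...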

19.
This paper describes a low-power, single-chip video encoder intended for battery-operated portable applications. Design goals are minimizing system power as well as utilized bandwidth, and maximizing system integration. The encoder achieves competitive compression, with convenient bit rate scalability, using a peak power dissipation of several hundred μW on a video stream of 8-bit gray scale, 30 frame/s, and 128×128 demonstration resolution. Compression is performed using wavelet filtering, zero-trees, and arithmetic coding, all integrated on a single chip (3 million transistors, 1 cm², in 0.6 μm CMOS, operating at 500 kHz), with no external memory or control. Results do not include use of motion compensation; however, hooks are included at algorithmic and architectural levels to add motion compensation at the cost of power dissipation a few times higher, and more internal memory. In the absence of motion compensation, temporal correlation is still utilized through the use of simple frame differencing. The architectural centerpiece is a massively parallel, fine granularity SIMD array of processing elements (PEs). A mapping is made between small image blocks (4×4 pixels on the test chip) and PEs, with each PE containing both memory and logic required for its block. These results are obtained by careful coordination of design in a deep vertical manner, ranging from system, algorithmic, architectural, circuit, and layout, and designing simultaneously for all required algorithmic subcomponents.

20.
A VLSI array processor for 16-point FFT   Total citations: 1 (self-citations: 0, citations by others: 1)
An implementation of a two-dimensional array processor for fast Fourier transform (FFT) using a 2-μm CMOS technology is presented. The array processor, which is dedicated to 16-point FFT, implements a 4×4 mesh array of 16 processing elements (PEs) working in parallel. Design considerations in both the chip level and the PE level are examined. A layout design methodology based on bit-slice units (BSUs) results in a very simple design, easy debugging, and a regular interconnection scheme through abutment. It contains about 48,000 transistors on an area of 53.52 mm², excluding the 83-pad area, and operation is on a 15-MHz clock. The array processor performs 24.6 million complex multiplications per second, and computes a 16-point FFT in 3 μs.
