
1.

A simplified design strategy for mapping image processing algorithms on a SIMD torus

Guna Seetharaman 《Theoretical computer science》1995,140(2):319-331

It is proposed to enhance and simplify the programming of a two-dimensional (2-D) torus (and mesh) connected SIMD array of simple processing elements (PEs) by introducing two dedicated communication registers in each PE. A new SIMD algorithm to transpose a matrix using only two buffers at each PE is described. A method is proposed to effectively realize a large number of arbitrary, one-to-one, personalized, and concurrent communications between the PEs by suitably repeating the matrix transpose algorithm. Implementation of several image processing tasks of a shift-variant nature, such as the Hough transform, histograms, and median filters, which involve such communication, is enhanced by this approach. The dynamic behavior of such a SIMD implementation is data independent, unlike that of implementations that employ greedy methods for handling the overall communication. This feature facilitates coordinated use of several independently operating SIMD meshes in a newly emerging computer vision paradigm known as multiview image-sequence analysis (MVISA) for 3-D perception of unstructured dynamic scenes.

2.

Parallel processing approaches to edge relaxation

Eva Leung Xiaobo Li 《Pattern recognition》1988,21(6):547-558

This paper describes several parallel algorithms for image edge relaxation on array processors with different numbers of processing elements (PEs) connected by a mesh or hypercube network. The time complexity of Prager's original edge relaxation scheme is O(N²) per iteration using floating-point operations on a sequential machine, where N² is the number of pixels in the image. The scheme is modified so that no multiplications are employed and only integer operations are required. Moreover, with parallel processing, the time complexity per iteration is reduced to a constant. A time complexity analysis of two parallel algorithms is performed. Although the algorithm on an array processor with 4N² PEs achieves a higher degree of parallelism, the algorithm with N² PEs is preferred. Further modifications to the latter algorithm are made to accommodate fewer PEs.

3.

Residue systolic implementations for neural networks

Dr. C. N. Zhang M. Wang C. C. Tseng 《Neural computing & applications》1995,3(3):149-156

In this work we propose two techniques for improving VLSI implementations of artificial neural networks (ANNs). By making use of two kinds of processing elements (PEs), one dedicated to the basic operations (addition and multiplication) and another to evaluating the activation function, the total time and cost of the VLSI array implementation of ANNs can be decreased by a factor of two compared with previous work. By taking advantage of the residue number system (RNS), the efficiency of each PE can be further increased. Two RNS-based array processor designs are proposed. The first is built from look-up tables, and the second is constructed from binary adders using mixed-radix conversion (MRC), so that the hardware is simple and high-speed operation is obtained. The proposed techniques are general enough to be extended to cover other forms of loading and learning algorithms.
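As a rough illustration of the residue-number-system idea mentioned in this abstract, the sketch below performs a multiply-accumulate independently in each residue channel and decodes the result with mixed-radix conversion. It is a generic textbook construction, not the authors' array design; the moduli and the function names (`to_rns`, `rns_add`, `rns_mul`, `from_rns_mrc`) are assumptions chosen for the example.

```python
# Hypothetical pairwise-coprime moduli; dynamic range is 7 * 11 * 13 = 1001.
MODULI = (7, 11, 13)


def to_rns(x, moduli=MODULI):
    """Encode a non-negative integer as its residues modulo each modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition: no carries propagate between residue channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication, again independent per channel."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns_mrc(residues, moduli=MODULI):
    """Decode with mixed-radix conversion:
    x = d0 + d1*m0 + d2*m0*m1 + ... with 0 <= di < mi."""
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        digits.append(r[i] % m_i)
        # Subtract the digit just found and divide by m_i in every later channel.
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            inv = pow(m_i, -1, m_j)  # modular inverse (Python 3.8+)
            r[j] = ((r[j] - digits[i]) * inv) % m_j
    value, weight = 0, 1
    for d, m in zip(digits, moduli):
        value += d * weight
        weight *= m
    return value


# Neuron-style weighted sum 3*4 + 5*6 + 2*9, computed entirely in RNS.
weights, inputs = (3, 5, 2), (4, 6, 9)
acc = to_rns(0)
for w, x in zip(weights, inputs):
    acc = rns_add(acc, rns_mul(to_rns(w), to_rns(x)))
result = from_rns_mrc(acc)
```

The result is only meaningful while the true sum stays below the product of the moduli (1001 here); a real design would size the moduli set to the expected dynamic range.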

4.

Configuring a wafer-scale two-dimensional array of single-bit processors

Boubekeur A. Patry J.-L. Saucier G. Trilhe J. 《Computer》1992,25(4):29-39

An overview of the ELSA (European large SIMD array) project, which uses a two-level strategy to achieve defect tolerance for wafer-scale architectures implemented in silicon, is presented. The target architecture is a 2-D array of processing elements (PEs) for low-level image processing. An array is divided into subarrays called chips. At the chip level, defect tolerance is provided by an extra column of PEs and bypassing techniques. At the wafer level, a double-rail connection network is used to construct a target array of defect-free chips that is as large and as fast as possible. Its main advantage is being independent of chip defects, as it is controlled from the I/O pads. An algorithm for constructing an optimized two-dimensional array on a wafer containing a given number of defect-free PEs and connections, a method to program the switches for the target architecture found by the algorithm, and software for programming the switches using laser cuts are discussed.

5.

A multipurpose neural processor for machine vision systems (total citations: 1; self-citations: 0; citations by others: 1)

Knopf G.K. Gupta M.M. 《Neural Networks, IEEE Transactions on》1993,4(5):762-777

A multitask neural network is proposed as a plausible visual information processor for performing a variety of real-time operations associated with the early stages of vision. The computational role performed by the processor, named the positive-negative (PN) neural processor, emulates the spatiotemporal information processing capabilities of certain neural activity fields found along the human visual pathway. The state-space model of this visual information processor corresponds to a bilayered two-dimensional array of densely interconnected nonlinear processing elements (PE's). An individual PE represents the neural activity exhibited by a spatially localized subpopulation of excitatory or inhibitory nerve cells. Each PE may receive inputs from an external signal space as well as from itself and the neighboring PE's within the network. The information embedded in the external input data which originates from a video camera or another processor is extracted by the feedforward subnet. The feedback subnet of the PN neural processor generates a variety of transient and steady-state activities. Their various computational roles are applicable to gray level, edge, texture, or color information processing. Computer simulations involving gray level image processing are used to illustrate the versatility of the PN neural processor architecture for machine vision system design.

6.

General-purpose vision chip architecture for real-time machine vision

《Advanced Robotics》2013,27(6):619-627

To solve the I/O bottleneck problem in existing vision systems and to realize versatile processing adaptive to various and changing environments, we propose a new vision chip architecture for applications such as robot vision. The chip has general-purpose processing elements (PEs) with each PE being directly connected to a photo detector (PD) and can implement various visual processing algorithms. We developed and simulated some sample programs for the chip and proved that they can be processed within 1 ms/frame, a rate that is high enough for high-speed visual feedback for robot control. Aiming to complete the chip, we are now developing test chips based on the architecture. The latest design has 8 × 8 PEs and PDs in an area 3.3 mm × 3.0 mm using a 0.8 μm CMOS process.

7.

A digital retina-like low-level vision processor

Mertoguno S. Bourbakis N.G. 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2003,33(5):782-788

This correspondence presents the basic design and the simulation of a low level multilayer vision processor that emulates to some degree the functional behavior of a human retina. This retina-like multilayer processor is the lower part of an autonomous self-organized vision system, called Kydon, that could be used on visually impaired people with a damaged visual cerebral cortex. The Kydon vision system, however, is not presented in this paper. The retina-like processor consists of four major layers, where each of them is an array processor based on hexagonal, autonomous processing elements that perform a certain set of low level vision tasks, such as smoothing and light adaptation, edge detection, segmentation, line recognition and region-graph generation. At each layer, the array processor is a 2D array of k × m hexagonal identical autonomous cells that simultaneously execute certain low level vision tasks. Thus, the hardware design and the simulation at the transistor level of the processing elements (PEs) of the retina-like processor and its simulated functionality with illustrative examples are provided in this paper.

8.

Real-time image processing on a custom computing platform

Athanas P.M. Abbott A.L. 《Computer》1995,28(2):16-25

The authors explore the utility of custom computing machinery for accelerating the development, testing, and prototyping of a diverse set of image processing applications. We chose an experimental custom computing platform called Splash-2 to investigate this approach to prototyping real-time image processing designs. Custom computing platforms are emerging as a class of computers that can provide near application-specific computational performance. We developed a real-time image processing system called VTSplash, based on the Splash-2 general-purpose platform. Splash-2 is an attached processor featuring programmable processing elements (PEs) and communication paths. The Splash-2 system uses arrays of RAM-based field programmable gate arrays (FPGAs), crossbar networks, and distributed memory to accomplish the needed flexibility and performance tasks. Such platforms let designers customize specific operations for function and size, and data paths for individual applications.

9.

An associative processing module for a heterogeneous vision architecture

Storer R. Pout M.R. Thomson A.R. Dagless E.L. Duller A.W.G. Marriott A.P. Hicks P.J. 《Micro, IEEE》1992,12(3):42-55

The heterogeneous vision architecture that satisfies the computing demands of real-time computer vision by providing parallelism in three different forms is described. A pipeline of digital signal processing (DSP) chips initially processes signals. Then a SIMD associative processor array processes images and extracts features, and a MIMD network of transputers processes extracted objects in parallel. The array's VLSI implementation, the processing modes available due to the use of content-addressable memory, and the means of achieving efficient 2-D interprocessor communication in the linear array are described. An application as a vehicle number plate recognition system is presented.

10.

On Achieving Maximum Performance in Time-Varying Arrays

Kulasinghe P. Elamawy A. 《Journal of Parallel and Distributed Computing》1995,31(2)

Several important computationally intensive algorithms can be implemented on special purpose VLSI arrays. A number of such algorithms naturally map onto either heterogeneous arrays or arrays employing PEs with switchable functions, or both. In many cases, such designs are the only known ones for VLSI implementation. Synchronization is generally achieved by assuming that the time required to perform basic PE computations is uniform, although the PEs perform different functions and may change their functions at different algorithmic steps. This simplistic approach may result in significant performance degradation. This paper addresses the properties, performance, and theory of time-varying heterogeneous arrays for the objective of achieving maximum performance. A systematic method for collision avoidance is formally introduced and analyzed. Our approach is based on dynamically balancing a two-level pipelined array through the use of a set of buffers. Another set of buffers is used to guarantee data synchronization. We show that if the initial delays (PE execution times) and the time variances are deterministic, an equivalent time-invariant array can be constructed (in polynomial time) which is optimal in speed. We describe a method for estimating the upper bound on computational time when array time variance is nondeterministic. Our method requires only knowledge of the bounds on initial delays.

11.

Image processing with VLSI

AG Corry DK Arvind GLS Connolly RR Korya IN Parker 《Microprocessors and Microsystems》1983,7(10):482-486

The development and implementation of systems for the more complex real-time image processing and scene understanding tasks, such as robot vision and remote surveillance, calls for faster computation than is possible using the traditional serial computer. The advent of VLSI has made feasible the consideration of more specialized processing architectures, designed to support these data rates, while keeping systems compact and relatively cheap. Two approaches are discussed: the use of a programmable processor array, and the customizing of image processing algorithms in silicon. This paper examines designs based upon each approach in the light of the techniques and constraints of VLSI. In particular we describe in some detail an example of a VLSI parallel array processor, the Grid (GEC rectangular image and data processor), and a number of special-purpose CMOS/SOS chips based on systolic design techniques.

12.

Fault-tolerant reconfiguration techniques for torus processor arrays

Longting Zhu Jigang Wu Guiyuan Jiang Chao Wang 《Computer Engineering & Science (计算机工程与科学)》2015,37(8):1423-1429

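Efficient fault-tolerance techniques are essential for improving the reliability of multiprocessor systems. The torus is an important network structure for connecting multiprocessor arrays, yet fault-tolerant reconfiguration of torus processor arrays has so far not been addressed. Exploiting the particular connectivity of the torus array, the reconfiguration problem is transformed into the problem of finding a maximum independent set on a conflict graph. Each node of the conflict graph represents a replacement scheme for a faulty processor, and each edge indicates that two replacement schemes cannot coexist. For three different distributions of redundant processors, algorithms are designed to generate the conflict graph, to find a maximum independent set, and to build the logical processor array from the independent set, with satisfactory results. Experimental results show that when the array is small or the fault rate is low, the one-row-one-column and cross-shaped distributions of redundant units reconfigure better; as the array size or fault rate grows, the reconfiguration success rate of all three distributions declines, but fault tolerance can be improved by adding redundant units and adjusting their distribution. The experimental results also show that the fault tolerance of torus processor arrays is clearly better than that of mesh processor arrays.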
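The conflict-graph formulation in this abstract can be sketched roughly as follows. The abstract does not say how conflicts are detected or how the independent set is computed, so the conflict rule (two replacement schemes clash when they share a faulty PE or a spare PE) and the greedy minimum-degree heuristic are both assumptions made for illustration; the greedy pass returns an independent set, not necessarily a maximum one. The names `build_conflict_graph` and `greedy_independent_set` are hypothetical.

```python
def build_conflict_graph(candidates):
    """candidates: list of (faulty_pe, spare_pe) replacement schemes.
    Assumed rule: two schemes conflict if they share a faulty PE or a spare."""
    graph = {i: set() for i in range(len(candidates))}
    for i, (f1, s1) in enumerate(candidates):
        for j in range(i + 1, len(candidates)):
            f2, s2 = candidates[j]
            if f1 == f2 or s1 == s2:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def greedy_independent_set(graph):
    """Heuristic: repeatedly take a minimum-degree node and discard its neighbours."""
    remaining = {v: set(nbrs) for v, nbrs in graph.items()}
    chosen = []
    while remaining:
        # Ties are broken by node index so the result is deterministic.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return sorted(chosen)


# Two faulty PEs, three spares, four candidate replacement schemes.
candidates = [("f1", "s1"), ("f1", "s2"), ("f2", "s2"), ("f2", "s3")]
graph = build_conflict_graph(candidates)
selected = greedy_independent_set(graph)
plan = [candidates[i] for i in selected]
```

Each selected scheme repairs a different faulty PE with a different spare, which is what independence in the conflict graph is meant to guarantee under the assumed rule.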

13.

A scalable, real-time, image processing pipeline

Pieter P. Jonker Erwin R. Komen Martin A. Kraaijveld 《Machine Vision and Applications》1995,8(2):110-121

To speed up image processing in the field of robot vision and industrial inspection, a pipeline element that can perform fast cellular logic operations was made. This cellular logic processing element (CLPE) can process binary images with a speed of 100 ns per pixel. The processing element is a CMOS VLSI device. It includes a writable logic array for storing sets of 3 × 3 structuring elements that define the cellular logic operations. This paper describes how such CLPEs can be used for building a pipeline for mixed gray-value processing and cellular logic processing.
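To make the notion of a cellular logic operation defined by a 3 × 3 structuring element more concrete, here is a minimal software sketch of binary erosion, with two such stages chained the way pipeline elements would be. It illustrates the general technique only and says nothing about the CLPE's internal logic; the function name `erode`, the element `FULL`, and the choice to treat pixels outside the image as 0 are assumptions.

```python
def erode(image, selem):
    """Binary erosion with a 3x3 structuring element.
    An output pixel is 1 only if every 1 in the element lies on a 1 in the image.
    Assumption: pixels outside the image border count as 0."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ok = True
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if not selem[dy + 1][dx + 1]:
                        continue  # this neighbour is a "don't care" position
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < h and 0 <= xx < w
                    if not inside or not image[yy][xx]:
                        ok = False
            out[y][x] = 1 if ok else 0
    return out


FULL = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # hypothetical structuring element

image = [[1] * 5 for _ in range(5)]
stage1 = erode(image, FULL)   # first pipeline stage
stage2 = erode(stage1, FULL)  # second stage consumes the first stage's output
```

On the all-ones 5 × 5 image, the first stage keeps the inner 3 × 3 block and the second stage keeps only the centre pixel.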

14.

Analog VLSI systems for image acquisition and fast early vision processing

John L. Wyatt Jr. Craig Keast Mark Seidel David Standley Berthold Horn Tom Knight Charles Sodini Hae-Seung Lee Tomaso Poggio 《International Journal of Computer Vision》1992,8(3):217-230

This article describes a project to design and build prototype analog early vision systems that are remarkably low-power, small, and fast. Three chips are described in detail. A continuous-time CMOS imager and processor chip uses a fully parallel 2-D resistive grid to find an object's position and orientation at 5000 frames/second, using only 30 milliwatts of power. A CMOS/CCD imager and processor chip does high-speed image smoothing and segmentation in a clocked, fully parallel 2-D array. And a chip that merges imperfect depth and slope data to produce an accurate depth map is under development in switched-capacitor CMOS technology.

15.

Solution of dense linear systems on an optimal systolic architecture

Ahmed El-Amawy 《Computers & Electrical Engineering》1987,13(3-4):177-193

The paper presents an optimal systolic array architecture for rapid solution of dense systems of linear equations. The array solves a system of size n × n in 4n + 1 time units including I/O time. Data communications are strictly local and the processing elements (PEs) are simple. The complete three-phase solution algorithm is executed on a single array, employing about 3n²/2 PEs without any need for costly inter-phase I/O. Due to a novel data steering mechanism, the three algorithmic phases are maximally overlapped. Design optimality is established using systolic precedence diagrams. It is also shown that merging the functions of two adjacent PEs into a single PE is possible resulting in maximal PE utilization. An interesting result regarding cascading phase-optimal arrays is obtained.

16.

Functional programming on a dataflow architecture: Applications in real-time image processing

Jocelyn Sérot Georges Quénot Bertrand Zavidovique 《Machine Vision and Applications》1993,7(1):44-56

This paper presents a dataflow functional computer (DFFC) developed at the Etablissement Technique Central de l'Armement (ETCA) and dedicated to real-time image processing. Two types of data-driven processing elements, dedicated respectively to low-level and mid-level processing, are integrated in a regular 3D array. The design of the DFFC relies on a close integration of the dataflow-architecture principles and the functional programming concept. An image processing algorithm, expressed with a syntax similar to that of functional programming (FP), is first converted into a dataflow graph. The nodes of this graph are real-time operators that can be implemented on the physical processors of the dataflow machine. This dataflow graph is then mapped directly onto the processor array. The programming environment includes a complete compilation stream from the FP specification to hardware implementation, along with a global operator database. Apart from being a research tool for real-time image processing, the DFFC may also be used to perform the automatic synthesis of autonomous vision automata from a high-level functional specification. An experimental system, including 1024 low-level custom dataflow processors and 12 T800 transputers, was built and can perform up to 50 billion operations/s. Several image processing algorithms were implemented on this system and run in real time at digital video speed.

17.

Simulation and verification of associative processor arrays

A. W. G. Duller R. Storer 《Parallel Computing》1992,18(12):1403-1414

This work is based on the design of a VLSI processor array comprising single bit processing elements combined with Content Addressable Memory (CAM) [1,2]. The processors are connected in a linear array with 64 currently being combined on a chip. Each processor is linked to 64 bits of data CAM and 4 bits of subset CAM (used for marking subsets of the array for subsequent processing). The architecture is targeted at image applications including pixel based processing as well as higher level symbolic manipulation and incorporates a data shift register linking all of the processing elements which allows data loading and processing to occur concurrently.

The current situation is that an extensive functional simulation package has been written [3] which allows algorithms to be coded and executed on a system which comprises an arbitrary number of array chips together with its controlling hardware. This allows algorithms to be investigated, and tuned to the architecture. A reduced design has been fabricated and the chips are undergoing parametric testing. A full version of the processor array chip will then be produced allowing a complete image system to be tested.

The VLSI design work undertaken so far [2] shows that the blocks which constitute the design can easily be replicated an arbitrary number of times (subject to chip size constraints) to create an application specific CAM array. The need for this type of flexibility has been borne out by the algorithmic work that has been carried out by a number of workers. In order to make the design of application specific arrays possible it is vital that the simulation tools are fast enough to allow adequate testing to be performed on the new design. It is for this reason that the original simulation package, written in C, has been transferred onto a transputer array.

This paper looks at the way in which the simulation is mapped onto the transputers in such a way that an arbitrary number can be used. In addition the problems of verification and validation of the simulator and the VLSI design are addressed. Results are given for a number of different applications which show very encouraging speed-ups. In many ways it has been found that the efficiency with which the simulation can be carried out with a large number of transputers mirrors the efficiency of the processor array in terms of communications overhead.

18.

Orthogonal multiprocessor sharing memory with an enhanced mesh for integrated image understanding

《CVGIP: Image Understanding》1991,53(1):31-45

This paper proposes a new parallel architecture, which has the potential to support low-level image processing as well as intermediate and high-level vision analysis tasks efficiently. The integrated architecture consists of an SIMD mesh of processors enhanced with multiple broadcast buses, an MIMD multiprocessor with orthogonal access buses, and a two-dimensional shared memory array. Low-level image processing is performed on the mesh processor, while intermediate and high-level vision analysis is performed on the orthogonal multiprocessor. The interaction between the two levels is supported by a common shared memory. Concurrent computations and I/O are made possible by partitioning the memory into disjoint spaces so that each processor system can access a different memory space. To illustrate the power of such a two-level system, we present efficient parallel algorithms for a variety of problems from low-level image processing to high-level vision. Representative problems include matrix based computations, histogramming and key counting operations, image component labeling, pyramid computations, Hough transform, pattern clustering, and scene labeling. Through computational complexity analysis, we show that the integrated architecture meets the processing requirements of most image understanding tasks.

19.

Flexible rerouting schemes for reconfiguration of multiprocessor arrays

Guiyuan Jiang Jigang Wu Jizhou Sun Yiyi Gao 《Journal of Parallel and Distributed Computing》2014

In a multiprocessor array, some processing elements (PEs) fail to function normally due to hardware defects or soft faults caused by overheating, overload or occupancy by other running applications. Fault-tolerant reconfiguration reorganizes fault-free PEs into a new regular topology by changing the interconnection among PEs. This paper investigates the problem of constructing a logical array that is as large as possible, with short interconnects, from a physical array with faults. A flexible rerouting scheme is developed to improve the efficiency of utilizing fault-free PEs. Under the scheme, two efficient reconfiguration algorithms are proposed. The first algorithm is able to generate the maximum logical array (MLA) in linear time. The second algorithm reduces the interconnect length of the MLA, and it is capable of producing nearly optimal logical arrays in comparison to the lower bound of the interconnect length, which is also proposed in this paper. Experimental results validate the efficiency of the flexible rerouting schemes and the proposed algorithms. For 128×128 host arrays with 30% unavailable PEs, the proposed approaches improve on existing algorithms by up to 44% in terms of logical array size, while reducing the interconnection redundancy by 49.6%. In addition, the proposed algorithms are more scalable than existing approaches. On host arrays with 50% unavailable PEs, our algorithms can produce logical arrays with harvest over 56% while existing approaches fail to construct a feasible logical array.

20.

Efficient control generation for mapping nested loop programs onto processor arrays

《Journal of Systems Architecture》2007,53(5-6):300-309

Processor array architectures are optimal platforms for computationally intensive applications. Such architectures are characterized by hierarchies of parallelism and memory structures, i.e. processor arrays, apart from different levels of cache, have a large number of processing elements (PEs) where each PE can further contain sub-word parallelism. In order to handle large scale problems, balance local memory requirements with I/O-bandwidth, and use different hierarchies of parallelism and memory, one needs a sophisticated transformation called hierarchical partitioning. Innately the applications are data flow dominant and have almost no control flow, but the application of hierarchical partitioning techniques has the disadvantage of a more complex control flow. In a previous paper, the authors presented for the first time a methodology for the automated control path synthesis for the mapping of partitioned algorithms onto processor arrays. However, the control path contained complex multiplication and division operators. In this paper, we propose a significant extension to the methodology which reduces the hardware cost of the global controller and memory address generators by avoiding these costly operations.
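The general idea of avoiding multiplication and division in address generation can be illustrated with ordinary strength reduction: a tiled address sequence that is naturally written with `//`, `%` and `*` can be produced by counters, comparisons and additions alone. The sketch below is a generic software illustration under assumed parameters (`tile`, `stride`) and hypothetical function names; it is not the control-path synthesis methodology of the paper.

```python
def addresses_with_div(n, tile, stride):
    """Reference form: address = (i // tile) * stride + (i % tile).
    Needs a divider, a modulo unit and a multiplier per address."""
    return [(i // tile) * stride + (i % tile) for i in range(n)]


def addresses_incremental(n, tile, stride):
    """Same sequence using only counters, an equality compare and additions."""
    out = []
    offset = 0  # position inside the current tile
    base = 0    # start address of the current tile
    for _ in range(n):
        out.append(base + offset)
        offset += 1
        if offset == tile:  # tile boundary: reset inner counter, step to next tile
            offset = 0
            base += stride
    return out


reference = addresses_with_div(10, tile=4, stride=16)
incremental = addresses_incremental(10, tile=4, stride=16)
```

Both functions yield the same addresses; the second form maps onto small counters and adders, which is the kind of saving the abstract refers to.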