Similar Literature
20 similar documents found.
1.
It is well known that parallel computers can be used very effectively for image processing at the pixel level, by assigning a processor to each pixel or block of pixels, and passing information as necessary between processors whose blocks are adjacent. This paper discusses the use of parallel computers for processing images at the region level, assigning a processor to each region and passing information between processors whose regions are related. The basic difference between the pixel and region levels is that the regions (e.g. obtained by segmenting the given image) and relationships differ from image to image, and even for a given image, they do not remain fixed during processing. Thus, one cannot use the standard type of cellular parallelism, in which the set of processors and interprocessor connections remain fixed, for processing at the region level. Reconfigurable cellular computers, in which the set of processors that each processor can communicate with can change during a computation, are more appropriate. A class of such computers is described, and general examples are given illustrating how such a computer could initially configure itself to represent a given decomposition of an image into regions, and dynamically reconfigure itself, in parallel, as regions merge or split.
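As a rough, hypothetical illustration of this region-level model (not the machine described in the paper), the Python sketch below keeps one logical node per region in a region-adjacency structure and rewires the neighbour links when two regions merge; all class and function names are invented for the example.

```python
# Hypothetical sketch of region-level reconfiguration: one logical node per
# region, adjacency links between neighbouring regions, and a merge operation
# that rewires links the way a reconfigurable cellular computer would.

class Region:
    def __init__(self, label):
        self.label = label
        self.neighbors = set()   # regions whose areas touch this one

def connect(a, b):
    a.neighbors.add(b)
    b.neighbors.add(a)

def merge(a, b):
    """Merge region b into region a and rewire the adjacency links."""
    for n in b.neighbors - {a}:
        n.neighbors.discard(b)
        connect(a, n)
    a.neighbors.discard(b)
    return a

# Example: three regions where r1 and r2 merge; r3 becomes a neighbour of the union.
r1, r2, r3 = Region("r1"), Region("r2"), Region("r3")
connect(r1, r2)
connect(r2, r3)
merged = merge(r1, r2)
print(sorted(n.label for n in merged.neighbors))   # ['r3']
```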

2.
Most Western Governments (USA, Japan, EEC, etc.) have now launched national programmes to develop computer systems for use in the 1990s. These so-called Fifth Generation computers are viewed as “knowledge” processing systems which support the symbolic computation underlying Artificial Intelligence applications. The major driving force in Fifth Generation computer design is to efficiently support very high level programming languages (i.e. VHLL architecture).

Historically, however, commercial VHLL architectures have been largely unsuccessful. The driving force in computer designs has principally been advances in hardware, which at the present time means architectures to exploit very large scale integration (i.e. VLSI architecture).

This paper examines VHLL architectures and VLSI architectures and their probable influences on Fifth Generation computers. Interestingly, the major problem for both architecture classes is parallelism: how to orchestrate a single parallel computation so that it can be distributed across an ensemble of processors.


3.
An increasing number of parallel computer products are appearing in the market place. Their design motivations and market areas cover a broad spectrum: (i) Transaction Processing Systems, such as Parallel UNIX systems (e.g. SEQUENT Balance), for data processing applications; (ii) Numeric Supercomputers, such as Hypercube systems (e.g. INTEL iPSC), for scientific and engineering applications; (iii) VLSI Architectures, such as parallel microcomputers (e.g. INMOS Transputer), for exploiting very large scales of integration; (iv) High-Level Language Computers, such as Logic machines (e.g. FUJITSU Kabu-Wake), for symbolic computation; and (v) Neurocomputers, such as Connectionist computers (e.g. THINKING MACHINES Connection Machine), for general-purpose pattern matching applications.

This survey paper gives an overview of these novel parallel computers and discusses their likely commercial impact.


4.
Recent developments in electrostatic plotting technology, specifically in areas related to the plotter's speed, density, and color, have pushed the processing requirements for graphics controllers beyond what is possible with traditional approaches. New market segments in the VLSI field, which were opened by the introduction of color, created more demanding applications, exceeding 10 million graphic elements per single E-size drawing. Conventional technology and architecture cannot meet the performance requirements within the given cost constraints. The new RPM (Raster Processing Machine) controller is based on a unique modular pipeline architecture, VLSI implementation of rasterizing routines, and parallel processing for performance and bottleneck bypassing. For simple plots, a single-board processor based on the Motorola 68000 converts random graphic elements to the required raster form. As plot complexity increases, a special bipolar processor with a writable control store is added. For applications covering large graphic constructs with special texture (often needed for color generation), two custom chips on a dedicated VLSI board are added to the system. Multiplicity of processors and buses is possible for additional throughput. The controller's performance, along with the analysis of potential constraints, is also described.

5.
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
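A minimal, hypothetical sketch of the checksum idea behind such fault-tolerant linear algebra (not the paper's generalized Cartesian mapping): a parity block holds the elementwise sum of the data blocks along one dimension, so a single lost block can be rebuilt from the parity and the surviving blocks. The variable names and block sizes are invented for the example.

```python
import numpy as np

# One parity block holds the elementwise sum of the data blocks along a dimension,
# so any single lost block can be rebuilt from the parity and the survivors.
blocks = [np.arange(4, dtype=float).reshape(2, 2) + i for i in range(3)]
parity = sum(blocks)

lost = 1                                   # simulate the failure of block 1
recovered = parity - sum(b for i, b in enumerate(blocks) if i != lost)
assert np.allclose(recovered, blocks[lost])
```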

6.
The widespread use of multicore processors is not a consequence of significant advances in parallel programming. In contrast, multicore processors arise due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernels (e.g., assignment, reduction, recurrence). Such a kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.
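To make the notion of a domain-independent reduction kernel concrete, here is a hypothetical Python sketch (not the paper's intermediate representation or compiler): a sequential loop that a compiler would classify as a reduction, and an equivalent parallel form in which the loop is split into independent chunks whose partial sums are combined. All names are invented for the example.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def sequential_reduction(x):
    # Loop a compiler would classify as a "reduction" kernel, however it is spelled.
    total = 0.0
    for v in x:
        total += v
    return total

def parallel_reduction(x, workers=4):
    # Once the loop is recognized as a reduction, it can be split into independent
    # chunks whose partial sums are combined at the end.
    chunks = np.array_split(x, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(np.sum, chunks)
    return float(sum(partials))

if __name__ == "__main__":
    data = np.random.rand(100_000)
    assert np.isclose(sequential_reduction(data), parallel_reduction(data))
```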

7.
This work is based on the design of a VLSI processor array comprising single-bit processing elements combined with Content Addressable Memory (CAM) [1,2]. The processors are connected in a linear array, with 64 currently being combined on a chip. Each processor is linked to 64 bits of data CAM and 4 bits of subset CAM (used for marking subsets of the array for subsequent processing). The architecture is targeted at image applications, including pixel-based processing as well as higher-level symbolic manipulation, and incorporates a data shift register linking all of the processing elements, which allows data loading and processing to occur concurrently.

An extensive functional simulation package has been written [3] which allows algorithms to be coded and executed on a system comprising an arbitrary number of array chips together with their controlling hardware. This allows algorithms to be investigated and tuned to the architecture. A reduced design has been fabricated and the chips are undergoing parametric testing. A full version of the processor array chip will then be produced, allowing a complete image system to be tested.

The VLSI design work undertaken so far [2] shows that the blocks which constitute the design can easily be replicated an arbitrary number of times (subject to chip size constraints) to create an application-specific CAM array. The need for this type of flexibility has been borne out by the algorithmic work carried out by a number of workers. In order to make the design of application-specific arrays possible, it is vital that the simulation tools are fast enough to allow adequate testing to be performed on the new design. It is for this reason that the original simulation package, written in C, has been transferred onto a transputer array.

This paper looks at the way in which the simulation is mapped onto the transputers such that an arbitrary number of them can be used. In addition, the problems of verification and validation of the simulator and the VLSI design are addressed. Results are given for a number of different applications which show very encouraging speed-ups. In many ways it has been found that the efficiency with which the simulation can be carried out on a large number of transputers mirrors the efficiency of the processor array itself in terms of communications overhead.
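As a loose, hypothetical illustration of the kind of behaviour such a simulator models (not the actual simulation package described above), the sketch below represents each processing element's data CAM as a row of bits plus a subset flag, and performs a masked content-addressable match restricted to the currently marked subset. All names and sizes are chosen only for the example.

```python
import numpy as np

# Much-simplified model of the CAM array: each processing element holds a
# 64-bit data word, and a subset flag marks which elements take part in the
# next operation.
N_PE = 64
data = np.random.randint(0, 2, size=(N_PE, 64), dtype=np.uint8)  # data CAM bits
subset = np.ones(N_PE, dtype=bool)                                # subset CAM plane

def cam_match(pattern, mask):
    """Mark the PEs whose masked data bits equal the pattern, within the current subset."""
    hit = np.all((data == pattern) | ~mask, axis=1)
    return subset & hit

pattern = data[3].copy()          # search for the word stored in PE 3
mask = np.ones(64, dtype=bool)    # compare all 64 bit positions
print(np.flatnonzero(cam_match(pattern, mask)))   # includes PE 3
```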


8.
The development and implementation of systems for the more complex real-time image processing and scene understanding tasks, such as robot vision and remote surveillance, calls for faster computation than is possible using the traditional serial computer. The advent of VLSI has made feasible the consideration of more specialized processing architectures, designed to support these data rates, while keeping systems compact and relatively cheap. Two approaches are discussed: the use of a programmable processor array, and the customizing of image processing algorithms in silicon. This paper examines designs based upon each approach in the light of the techniques and constraints of VLSI. In particular we describe in some detail an example of a VLSI parallel array processor, the Grid (GEC rectangular image and data processor), and a number of special-purpose CMOS/SOS chips based on systolic design techniques.

9.
Real-time target recognition is of major military significance in modern high-technology warfare. A parallel processing system composed of five ADSP21060 DSPs is designed. The system adopts a ring-network-based block processing strategy: the master DSP splits the data into equal-length blocks and distributes them to the slave DSPs for processing. Using a parallel FIR filter algorithm as an example, it is verified that the design significantly outperforms a single-DSP system in computation speed. The design can be applied in military fields with stringent real-time requirements.
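A minimal Python sketch of the block-splitting strategy described above, with worker processes standing in for the slave DSPs. The overlap handling (carrying the previous taps-1 samples into each block) is an assumption needed to make the blocked result match a single-processor FIR; all function names are hypothetical.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def fir_block(block, history, coeffs):
    """Filter one block; 'history' carries the previous taps-1 samples so the
    block boundary is seamless (overlap-save)."""
    x = np.concatenate([history, block])
    full = np.convolve(x, coeffs)                        # causal FIR
    return full[len(history):len(history) + len(block)]

def parallel_fir(signal, coeffs, n_slaves=4):
    blocks = np.array_split(signal, n_slaves)            # master splits into equal blocks
    histories, pos = [], 0
    for b in blocks:
        histories.append(signal[max(0, pos - (len(coeffs) - 1)):pos])
        pos += len(b)
    with ProcessPoolExecutor(max_workers=n_slaves) as pool:
        parts = pool.map(fir_block, blocks, histories, [coeffs] * n_slaves)
    return np.concatenate(list(parts))

if __name__ == "__main__":
    sig = np.random.rand(1024)
    taps = np.array([0.25, 0.5, 0.25])
    ref = np.convolve(sig, taps)[:len(sig)]              # single-processor reference
    assert np.allclose(parallel_fir(sig, taps), ref)
```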

10.
Ray tracing is a well known technique to generate life-like images. Unfortunately, ray tracing complex scenes can require large amounts of CPU time and memory storage. Distributed memory parallel computers with large memory capacities and high processing speeds are ideal candidates to perform ray tracing. However, the computational cost of rendering pixels and patterns of data access cannot be predicted until runtime. To parallelize such an application efficiently on distributed memory parallel computers, the issues of database distribution, dynamic data management and dynamic load balancing must be addressed. In this paper, we present a parallel implementation of a ray tracing algorithm on the Intel Delta parallel computer. In our database distribution, a small fraction of the database is duplicated on each processor, while the remaining part is evenly distributed among groups of processors. In the system, there are multiple copies of the entire database in the memory of groups of processors. Dynamic data management is achieved by an ALRU cache scheme which can exploit image coherence to reduce data movements in ray tracing consecutive pixels. We balance load among processors by distributing subimages to processors in a global fashion based on previous workload requests. The success of our implementation depends crucially on a number of parameters which are experimentally evaluated. © 1997 John Wiley & Sons, Ltd.
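A hypothetical sketch of the global load-balancing idea: using per-subimage costs measured on a previous frame (standing in for "previous workload requests"), tiles are assigned greedily to the currently least-loaded processor. This is a generic longest-processing-time heuristic, not the paper's exact scheme, and every name is invented.

```python
import heapq

def balance_tiles(tile_costs, n_procs):
    """Greedy assignment: give each tile, heaviest first, to the currently
    least-loaded processor, using per-tile costs from a previous frame."""
    heap = [(0.0, p, []) for p in range(n_procs)]   # (load, processor id, tiles)
    heapq.heapify(heap)
    for tile, cost in sorted(tile_costs.items(), key=lambda kv: -kv[1]):
        load, p, tiles = heapq.heappop(heap)
        tiles.append(tile)
        heapq.heappush(heap, (load + cost, p, tiles))
    return {p: tiles for _, p, tiles in heap}

# Example: eight subimages with uneven estimated costs spread over three processors.
costs = {(r, c): 1 + 3 * ((r + c) % 2) for r in range(2) for c in range(4)}
print(balance_tiles(costs, 3))
```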

11.
In this paper we benchmark the performance of the Cray T3D, IBM 9076 SP/1 and Intel Paragon XP/S parallel computers, using implementations of parallel algorithms for the computation of the vector outer-product A = uv^T operation. The vector outer-product operation, although very simple in nature, requires the computation of a large number of floating-point operations and its parallelization induces a great level of communication between the processors. It is thus suited to measure the relative speed of the processor, memory subsystem and network capabilities of a parallel computer. It should not be considered a ‘toy problem’, since it arises in numerical methods in the context of the solution of systems of non-linear equations – still a difficult problem to solve. We present algorithms for both the explicit shared-memory and message-passing programming models together with theoretical computation models for those algorithms. Actual experiments were run on those computers, using Fortran 77 implementations of the algorithms. The results obtained with these experiments show that due to the high degree of communication between the processors one needs a parallel computer with fast communications and carefully implemented data exchange routines. The theoretical computation model allows prediction of the speed-up to be obtained for some problem size on a given number of processors. © 1997 John Wiley & Sons, Ltd.
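A small worked sketch of one natural way to partition A = uv^T (not necessarily the paper's Fortran 77 algorithms): each processor owns a contiguous slice of u, receives all of v, and computes its block of rows independently; the per-element arithmetic is trivial, which is why communication dominates. The function name is invented.

```python
import numpy as np

def outer_by_rows(u, v, n_procs):
    """Each 'processor' p owns a contiguous slice of u and computes the
    corresponding rows of A = u v^T with no further communication."""
    slices = np.array_split(np.arange(len(u)), n_procs)
    local = [np.outer(u[idx], v) for idx in slices]   # per-processor row blocks
    return np.vstack(local)

u, v = np.random.rand(6), np.random.rand(4)
assert np.allclose(outer_by_rows(u, v, 3), np.outer(u, v))
```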

12.
The design and test results for two analog adaptive VLSI processing chips are described. These chips use pulse coded signals for communication between processing nodes and analog weights for information storage. The weight modification rule, implemented on chip, uses concepts developed by E. Oja (1982) and later extended by T. Leen et al. (1989) and T. Sanger (1989). Experimental results demonstrate that the network produces linearly separable outputs that correspond to dominant features of the inputs. Such representations allow for efficient additional neural processing. Part of the adaptation rule also includes a small number of fixed inputs and a variable lateral inhibition mechanism. Experimental results from the first chip show the operation of the function blocks that make up a single processing node. These function blocks include forward transfer function, weight modification, and inhibition. Experimental results from the second chip show the ability of an array of processing elements to extract important features from the input data.
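For reference, the single-unit form of the Oja (1982) learning rule that this family of adaptive circuits builds on is Δw = η·y·(x − y·w) with y = wᵀx; the weight vector converges to the dominant principal component of the inputs. The Python sketch below demonstrates that behaviour on synthetic data; it is a software illustration of the rule, not a model of the chips' pulse-coded analog implementation, and the data and constants are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean 2-D inputs whose dominant principal component lies along [1, 1]/sqrt(2).
M = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.normal(size=(5000, 2)) @ M

w = rng.normal(size=2) * 0.1
eta = 0.005
for x in X:
    y = w @ x                       # unit output
    w += eta * y * (x - y * w)      # Oja's rule: Hebbian term with implicit normalization

w /= np.linalg.norm(w)
print(abs(w @ np.array([1.0, 1.0]) / np.sqrt(2)))   # alignment with the true PC, close to 1
```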

13.
A study has been made of how cost-effectiveness due to the improvement of VLSI technology can apply to a scientific computer system without performance loss. The result is a parallel computer, ADENA (Alternating Direction Edition Nexus Array), with a core consisting of four kinds of VLSI chips, two for processor elements (PEs) and two for the interprocessor network (plus some memory chips). An overview of ADENA and an analysis of its performance are given. The design considerations for the PEs incorporated in ADENA are discussed. The factors that limit performance in a parallel processing environment are analyzed, and the measures employed to improve these factors at the LSI design level are described. The 42.6 sq cm CMOS PEs reach a peak performance of 20 MFLOPS; a 256-PE ADENA has achieved 1.5 GFLOPS, and 300 to 400 MFLOPS for PDE applications.

14.
15.
As the BeiDou-3 navigation satellite system has entered networked operation, the on-board systems demand faster data transfer and computation from the spaceborne computer system. The BeiDou satellites developed by the Chinese Academy of Sciences use an on-board computer system with high computational and data-transfer performance whose core components are all domestically controlled: a hardware environment built around the Loongson 1E high-performance space-grade processor produced by Loongson Technology, with the real-time operating system VxWorks as the software environment. To accommodate the newly upgraded chips of the Loongson 1E series, this paper develops a BSP and a serial-port driver, and configures the device driver management to support the VxBus driver architecture, thereby porting and running VxWorks on the new chips while effectively improving the reliability, portability and independence of the drivers.

16.
Parallel processing patterns and an application framework for seismic data processing
This paper studies parallel computing design patterns (pipeline, fan-out/fan-in, master-slave and hybrid) and an application framework for petroleum seismic data processing. The purpose of both frameworks and patterns is the reuse of successful software designs; a framework can be regarded as a concrete implementation of a class of design patterns. The GRISYS seismic data processing application framework was designed and implemented for these seismic processing patterns. With this framework, a large number of existing serial seismic processing modules can run in parallel on workstation clusters or massively parallel computers without any modification; experiments on a Dawning 2000-II parallel computer achieved very high parallel speed-ups.
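To make the master-slave pattern concrete, here is a hypothetical Python sketch in which an unmodified serial processing function is applied to independent shot gathers by a pool of worker processes. It is a generic illustration of the pattern, not the GRISYS framework's actual interface, and all names are invented.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def serial_module(gather):
    """Stand-in for an existing serial seismic-processing module (here a simple
    amplitude normalization); it knows nothing about parallelism."""
    return gather / (np.abs(gather).mean() + 1e-9)

def master_slave(gathers, module, n_workers=4):
    # Master-slave pattern: the master hands whole gathers to slave processes,
    # each of which runs the unmodified serial module; results return in order.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(module, gathers))

if __name__ == "__main__":
    shots = [np.random.randn(60, 1000) for _ in range(8)]   # eight shot gathers
    processed = master_slave(shots, serial_module)
    print(len(processed), processed[0].shape)
```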

17.
This paper describes the use of a new integrated hybrid programming system which provides equation-oriented specification of continuous system simulation models and automated setup, checkout, and operation of the hybrid computer. The programming system is based upon the APBE compiler originally developed through the National Computing Center in England and the HYTRAN Operations Interpreter (HOI) implemented on EAI hybrid computers. In the new program generation system, both processors have been enhanced to enrich the languages and to provide compatible file processing, so that the compiler generates a complete HYTRAN object program file which is processed on-line by the hybrid system for setup, checkout, and operation of the analog processor. The benefits of this automatic program generation over the manual process include significant reduction of programming time and cost, error-free program setup and operation, and increased hybrid computer productivity.

18.
B. S. Thornton, Automatica, 1971, 7(6): 741-746
For accurate and reliable automatic control, the organisation of computer systems in aircraft, and in other multi-task computer applications, is more likely to be based on a number of “cooperative” computers working in an integrated manner [1] than on a main central computer.

Such a cooperative design should provide for the normal workload of each computer plus checking of the data processing of another computer, and should have the ability to continue operation in case of a malfunction in one processor. To ensure that these requirements are met in a data-driven operating system, the basic system design parameters need careful assessment for optimised system performance, especially when deadlines must be met in some control functions and an order of priorities preserved.

The present paper reports progress on a computer-aided system design of such linked cooperative computers under the above conditions, to determine basic system parameters such as the desirable character rates for the several central processors (CPs) and the optimum cycle length of CP-to-CP transfers.

The method also overcomes objections [2] to some system design procedures by allowing the results of sampled-data theory to be used in the system design: the sampling rate, although calculated separately [3], is incorporated in one of the prime design parameters of the system.


19.
A cluster scheduling algorithm based on task duplication for multiprocessors
The quality of task scheduling is one of the important factors determining the performance of parallel and distributed computer systems. To optimize task scheduling, a new scheduling algorithm is proposed, building on several typical algorithms (such as LG and PPA). On the one hand, the algorithm duplicates qualifying predecessor tasks to shorten the schedule length; on the other hand, it judiciously duplicates other predecessor tasks and merges redundant clusters to reduce the number of processors required. Experiments show that the algorithm outperforms the above typical algorithms in schedule length and number of processors required, and has lower time complexity, which is of some significance for improving the performance of parallel computer systems.
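A toy, hypothetical illustration of why task duplication shortens schedules (not the LG, PPA, or proposed algorithms themselves): the earliest start time of a task drops when a communication-heavy predecessor is re-executed locally, because the message delay on that edge disappears. The numbers and names are made up.

```python
# Toy illustration of task duplication: task T has two predecessors; duplicating
# the communication-heavy predecessor P1 on T's processor removes its edge cost.
pred = {                      # predecessor -> (compute time, communication cost to T)
    "P1": (3, 8),
    "P2": (4, 1),
}

def est(duplicate_on_local):
    start = 0                 # time already consumed on T's processor
    ready = []
    for name, (compute, comm) in pred.items():
        if name in duplicate_on_local:
            start += compute             # re-execute the predecessor locally, before T
            ready.append(start)
        else:
            ready.append(compute + comm)  # remote finish time + message arrival
    return max(ready + [start])

print(est(set()))        # 11: waiting for P1's message dominates
print(est({"P1"}))       # 5: P1 duplicated locally, only P2's message matters
```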

20.
Digital image processing involves a large volume of data computation and requires very high data throughput from the system; parallel processing structures can meet this requirement well. A SIMD parallel multi-DSP digital image processing system is introduced. The system has the advantages of conflict avoidance, continuous processing of image data, simple interprocessor communication and I/O sections, and modular hardware and software.
