1.

Experimental application-driven architecture analysis of an SIMD/MIMD parallel processing system

Bronson E.C. Casavant T.L. Jamieson L.H. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(2):195-205

An experimental analysis of the architecture of an SIMD/MIMD parallel processing system is presented. Detailed implementations of parallel fast Fourier transform (FFT) programs were used to examine the performance of the prototype of the PASM (Partitionable SIMD/MIMD) parallel processing system. Detailed execution-time measurements using specialized timing hardware were made for the complete FFT and for components of SIMD, MIMD, and barrier-synchronized MIMD implementations. The component measurements isolated the effects of floating-point arithmetic operations, interconnection network transfer operations, and program control overhead. The measurements allow an accurate extrapolation of the execution time, speedup, and efficiency of the MIMD, SIMD, and barrier-synchronized MIMD programs to a full 1024-processor PASM system. This constitutes one of the first results of this kind, in which controlled experiments on fixed hardware were used to make comparisons of these fundamental modes of computing. Overall, the experimental results demonstrate the value of mixed-mode SIMD/MIMD computing and its suitability for computationally intensive algorithms such as the FFT.
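
The speedup and efficiency figures mentioned in this abstract follow the standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of processors. A minimal sketch of those definitions is given here; the function names and the timing values in the usage lines are hypothetical and are not taken from the paper.

```python
def speedup(t_serial, t_parallel):
    """Standard speedup: serial execution time over parallel execution time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_processors):
    """Standard efficiency: speedup divided by the number of processors used."""
    return speedup(t_serial, t_parallel) / num_processors


# Hypothetical timings (arbitrary units), for illustration only.
s = speedup(100.0, 12.5)
e = efficiency(100.0, 12.5, 16)
```

An efficiency below 1.0 indicates time lost to overheads such as the network transfers and program control costs that the component measurements in the abstract are said to isolate.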

2.

Large-scale parallel processing systems

《Microprocessors and Microsystems》1987,11(1):3-20

Parallel processing is an area of growing interest to the computer science and engineering communities. This paper is an introduction to some of the concepts involved in the design and use of large-scale parallel systems. Parallel machines that are classified as SIMD (synchronous) and MIMD (asynchronous) systems, composed of a large number of microprocessors, are explored. Parallel algorithms are examined, using image smoothing, recursive doubling and contour tracing as examples. Single stage and multistage networks are discussed. The single stage Cube, PM21, Four Nearest Neighbor and Shuffle-Exchange networks are presented, and the multistage Cube network is described. Case studies of three microprocessor-based systems are given as examples of parallel machine designs, specifically the MPP SIMD machine, the Ultracomputer MIMD system, and the PASM SIMD/MIMD machine.
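
Recursive doubling, one of the example algorithms named in this abstract, can be sketched as follows. This is a generic textbook formulation applied to prefix sums and simulated sequentially; it is not the paper's code, and the function name is an assumption. Each pass of the outer loop stands for one parallel step in which every processing element would update its value at the same time.

```python
def prefix_sums_recursive_doubling(values):
    """Inclusive prefix sums computed in ceil(log2(n)) doubling steps."""
    sums = list(values)
    n = len(sums)
    stride = 1
    while stride < n:
        # Snapshot the previous step so all updates read old values,
        # mimicking simultaneous updates across processing elements.
        prev = list(sums)
        for i in range(stride, n):
            sums[i] = prev[i] + prev[i - stride]
        stride *= 2
    return sums


result = prefix_sums_recursive_doubling([1, 2, 3, 4, 5])
```

The stride doubles on every step, so element i has accumulated all earlier elements once the stride reaches or exceeds the input length.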

3.

Mapping Conjugate Gradient Algorithms for Neutron Diffusion Applications onto SIMD, MIMD, and Mixed-Mode Machines

John John E. So Thomas J. Downar Raghunandan Janardhan Howard Jay Siegel 《International journal of parallel programming》1998,26(2):183-207

The performance of conjugate gradient (CG) algorithms for the solution of the system of linear equations that results from the finite-differencing of the neutron diffusion equation was analyzed on SIMD, MIMD, and mixed-mode parallel machines. A block preconditioner based on the incomplete Cholesky factorization was used to accelerate the conjugate gradient search. The issues involved in mapping both the unpreconditioned and preconditioned conjugate gradient algorithms onto the mixed-mode PASM prototype, the SIMD MasPar MP-1, and the MIMD Intel Paragon XP/S are discussed. On PASM, the mixed-mode implementation outperformed either SIMD or MIMD alone. Theoretical performance predictions were analyzed and compared with the experimental results on the MasPar MP-1 and the Paragon XP/S. Other issues addressed include the impact on execution time of the number of processors used, the effect of the interprocessor communication network on performance, and the relationship of the number of processors to the quality of the preconditioning. Applications studies such as this are necessary in the development of software tools for mapping algorithms onto either a single parallel machine or a heterogeneous suite of parallel machines.

4.

Multiple Quadratic Forms: A Case Study in the Design of Data-Parallel Algorithms

《Journal of Parallel and Distributed Computing》1994,21(1):124-139

Data-parallel implementations of the computationally intensive task of solving multiple quadratic forms (MQFs) have been examined. Coupled and uncoupled parallel methods are investigated, where coupling relates to the degree of interaction among the processors. Also, the impact of partitioning a large MQF problem into smaller non-interacting subtasks is studied. Trade-offs among the implementations for various data-size/machine-size ratios are categorized in terms of complex arithmetic operation counts, communication overhead, and memory storage requirements. Furthermore, the impact on performance of the mode of parallelism used is considered, specifically, SIMD versus MIMD versus SIMD/MIMD mixed-mode. From the complexity analyses, it is shown that none of the algorithms presented in this paper is best for all data-size/machine-size ratios. Thus, to achieve scalability (i.e., good performance as the number of processors available in a machine increases), instead of using a single algorithm, the approach discussed is to have a set of algorithms from which the most appropriate algorithm or combination of algorithms is selected based on the ratio calculated from the scaled machine size. The analytical results have been verified by experiments on the MasPar MP-1 (SIMD), nCUBE 2 (MIMD), and PASM (mixed-mode) prototype.

5.

Parallel Image Correlation: Case Study to Examine Trade-Offs in Algorithm-to-Machine Mappings

Armstrong James B. Maheswaran Muthucumaru Theys Mitchell D. Siegel Howard Jay Nichols Mark A. Casey Kenneth H. 《The Journal of supercomputing》1998,12(1-2):7-35

Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This study analyzes various mappings of image correlation algorithms in SIMD, MIMD, and mixed-mode environments. Experiments were conducted on the Intel Paragon, MasPar MP-1, nCUBE 2, and PASM prototype. The machine features considered in this study include: modes of parallelism, communication/computation ratio, network topology and implementation, SIMD CU/PE overlap, and communication/computation overlap. Performance of an implementation can be enhanced by using algorithmic techniques that match the machine features. Some algorithmic techniques discussed here are additional communication versus redundant computation, data block transfers, and communication/computation overlap. The results presented are applicable to a large class of image processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping the subtasks of an application task, or a set of independent application tasks, onto a heterogeneous suite of parallel machines.

6.

Parallel Implementations of Block-Based Motion Vector Estimation for Video Compression on Four Parallel Processing Systems

Min Tan Janet M. Siegel Howard Jay Siegel 《International journal of parallel programming》1999,27(3):195-225

Parallel algorithms, based on a distributed memory machine model, for an exhaustive search technique for motion vector estimation in video compression are being designed and evaluated. Results from the execution on a 16,384 processor MasPar MP-1 (an SIMD machine), a 140 node Intel Paragon XP/S and a 16 node IBM SP2 (two MIMD machines), and the 16 processor PASM prototype (a partitionable SIMD/MIMD mixed-mode machine) are presented. The trade-offs of using different modes of parallelism (SIMD, SPMD, and mixed-mode) and different data partitioning schemes (the rectangular and stripe subimage methods) are examined. The analytical and experimental results shown in this application study will help practitioners to predict and contrast the performance of different approaches to parallel implementation of this important video compression technique. The results presented are also applicable to a large class of image and video processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping a set of independent application tasks, or the subtasks of a single application task, onto a heterogeneous suite of parallel machines.
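
To illustrate the exhaustive search technique for block-based motion vector estimation that this abstract refers to, here is a minimal sequential sketch. The matching criterion (sum of absolute differences), the function names, and the tiny synthetic frames are all assumptions made for illustration; the abstract does not state which criterion or data layout the authors used, and no parallel partitioning is modeled.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of the current
    frame at (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def best_motion_vector(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, search]^2 that stays inside the
    reference frame and return (dx, dy, cost) with the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + n <= h
                      and 0 <= bx + dx and bx + dx + n <= w)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]


# Synthetic 8 x 8 frames: the current frame is the reference frame
# shifted by 2 pixels in x and 1 pixel in y (clamped at the border).
ref = [[10 * y + x for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]
vector = best_motion_vector(cur, ref, 2, 2, 2, 2)
```

Because every block is searched independently, the work divides naturally among processors by assigning each one a subimage, which is where partitioning choices such as the rectangular and stripe methods named in the abstract come in.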

7.

The Multistage Cube: A Versatile Interconnection Network (total citations: 2; self-citations: 0; citations by others: 2)

Siegel H.J. McMillen R.J. 《Computer》1981,14(12):65-76

The cube network can support both MIMD and SIMD processing in distributed systems. It allows flexible communications in systems like PASM, PUMPS, and the BMD test bed.

8.

A Heterogeneous Mixed-Mode Execution Model for Massively Parallel Systems

《Journal of Parallel and Distributed Computing》1999,56(1):2-16

In this paper, we consider a massively parallel system that is composed of heterogeneous processors, that is, processors with different processing power, and that combines the advantages of the SIMD and MIMD architectures. The heterogeneous mixed-mode (HeMM) execution model is composed of two main components, which operate in the well-known SIMD and MIMD paradigms. The main computing power comes from a component that is composed of a massive number of processors and operates in a data parallel manner. The other component is composed of a few (or even one) fast processors which operate in the MIMD paradigm. The operation of a small number of processors in an MIMD paradigm has been well demonstrated through actual systems. The processors in this component add flexibility to the execution of the parallel programs such that it adjusts to the changing parallelism of the program to enhance the performance. Based on this execution model we analyze the gains in performance that are obtainable by this new system. We show that substantial performance gains can be obtained by using the HeMM system.

9.

Static scheduling for barrier MIMD architectures

Henry G. Dietz Abderrazek Zaafrani Matthew T. O'Keefe 《The Journal of supercomputing》1992,5(4):263-289

In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility. In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant). This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented.

10.

A parallel algorithm for graph matching and its MasPar implementation

Allen R. Cinque L. Tanimoto S. Shapiro L. Yasuda D. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(5):490-501

Search of discrete spaces is important in combinatorial optimization. Such problems arise in artificial intelligence, computer vision, operations research, and other areas. For realistic problems, the search spaces to be processed are usually huge, necessitating long computation times, pruning heuristics, or massively parallel processing. We present an algorithm that reduces the computation time for graph matching by employing both branch-and-bound pruning of the search tree and massively-parallel search of the as-yet-unpruned portions of the space. Most research on parallel search has assumed that a multiple-instruction-stream/multiple-data-stream (MIMD) parallel computer is available. Since massively parallel single-instruction-stream/multiple-data-stream (SIMD) computers are much less expensive than MIMD systems with equal numbers of processors, the question arises as to whether SIMD systems can efficiently handle state-space search problems. We demonstrate that the answer is yes, and in particular, that graph matching has a natural and efficient implementation on SIMD machines.

11.

A model for an intelligent operating system for executing image understanding tasks on a reconfigurable parallel architecture

C. Henry Chu Edward J. Delp Leah H. Jamieson Howard Jay Siegel Francis J. Weil Andrew B. Whinston 《Journal of Parallel and Distributed Computing》1989,6(3)

Parallel processing is one approach to achieving the large computational processing capabilities required by many real-time computing tasks. One of the problems that must be addressed in the use of reconfigurable multiprocessor systems is matching the architecture configuration to the algorithms to be executed. This paper presents a conceptual model that explores the potential of artificial intelligence tools, specifically expert systems, to design an Intelligent Operating System for multiprocessor systems. The target task is the implementation of image understanding systems on multiprocessor architectures. PASM is used as an example multiprocessor. The Intelligent Operating System concepts developed here could also be used to address other problems requiring real-time processing. An example image understanding task is presented to illustrate the concept of intelligent scheduling by the Intelligent Operating System. Also considered is the use of the conceptual model when developing an image understanding system in order to test different strategies for choosing algorithms, imposing execution order constraints, and integrating results from various algorithms.

12.

Data management and control-flow aspects of an SIMD/SPMD parallel language/compiler

Nichols M.A. Siegel H.J. Dietz H.G. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(2):222-234

Features of an explicitly parallel programming language targeted for reconfigurable parallel processing systems, where the machine's N processing elements (PEs) are capable of operating in both the SIMD and SPMD modes of parallelism, are described. The SPMD (single program-multiple data) mode of parallelism is a subset of the MIMD mode where all processors execute the same program. By providing all aspects of the language with an SIMD mode version and an SPMD mode version that are syntactically and semantically equivalent, the language facilitates experimentation with and exploitation of hybrid SIMD/SPMD machines. Language constructs (and their implementations) for data management, data-dependent control-flow, and PE-address-dependent control-flow are presented. These constructs are based on experience gained from programming a parallel machine prototype and are being incorporated into a compiler under development. Much of the research presented is applicable to general SIMD machines and MIMD machines.

13.

On the Complexity of Scheduling MIMD Operations for SIMD Interpretation

《Journal of Parallel and Distributed Computing》1995,29(1):91-95

Programming SIMD hardware to interpret (in parallel) programs and data resident in each PE is a technique for obtaining a cost-effective massively parallel MIMD processing environment. The performance of the synthesized MIMD environment can be greatly improved by using a variable instruction interpreter that delays the interpretation of infrequent operations. In this paper, the process of building a variable instruction interpreter that optimizes an objective function is examined. Two different objective functions are considered, namely, maximizing the total instruction throughput (called Maximal MIMD Instruction Throughput, MMIT) and maximizing overall PE utilization (called Maximal MIMD PE Utilization, MMPU). We show that the decision versions of both the MMIT and MMPU problems are NP-complete.

14.

Functionally reconfigurable general purpose parallel machines and some image processing and pattern recognition applications

Nikola K Kasabov 《Pattern recognition letters》1985,3(3):215-223

Functionally reconfigurable general purpose parallel machines (FRPM) can be reconfigured during operation from SIMD to MIMD mode or vice versa (first aspect) and from one interconnection network to another according to the data storing order (second aspect). General purpose machines are considered in order to obtain an arbitrary data exchange between the processing elements they are built of. A model for describing such interconnection networks is presented. A full-information exchange network is introduced which is programmably reconfigurable into tree, matrix, cube, linear-neighbourhood, and FFT networks. Some schemes for constructing SIMD/MIMD reconfigurable machines are given. The usefulness of FRPM for image processing and pattern recognition is discussed.

15.

Hierarchical multiple-SIMD architecture for image analysis

Graham Nudd Nick Francis Tim Atherton Darren Kerbyson Roger Packwood John Vaudin 《Machine Vision and Applications》1992,5(2):85-103

Real-time image analysis requires the use of massively parallel machines. Conventional parallel machines consist of an array of identical processors organized in either single instruction multiple data (SIMD) or multiple instruction multiple data (MIMD) configurations. Machines of this type generally only operate effectively on parts of the image analysis problem: SIMD on the low-level processing and MIMD on the high-level processing. In this paper we describe the Warwick Pyramid Machine, an architecture consisting of both SIMD and MIMD parts in a multiple-SIMD (MSIMD) organization which can operate effectively at all levels of the image analysis problem.

16.

Minimizing communication in the bitonic sort

Jae-Dong Lee Batcher K.E. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(5):459-474

This paper presents bitonic sorting schemes for special-purpose parallel architectures such as sorting networks and for general-purpose parallel architectures such as SIMD and/or MIMD computers. First, bitonic sorting algorithms for shared-memory SIMD and/or MIMD computers are developed. Shared-memory accesses through the interconnection network of shared memory SIMD and/or MIMD computers can be very time consuming. A scheme is introduced which reduces the number of such accesses. This scheme is based on the parity strategy which is the main idea of the paper. By reducing the communication through the network, a performance improvement is achieved. Second, a recirculating bitonic sorting network is presented, which is composed of one level of N/2 comparators plus an Ω-network of (log N-1) switch levels. This network reduces the cost complexity to O(N log N) compared with the O(N log² N) of the original bitonic sorting network, while preserving the same time complexity. Finally, a simplified multistage bitonic sorting network is presented. For simplifying the interlevel wiring, the parity strategy is used, so N/2 keys are wired straight through the network.
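
For readers unfamiliar with the original bitonic sorting network that this abstract uses as its baseline, a minimal sequential sketch of the textbook algorithm follows. It does not implement the paper's parity strategy or its recirculating network; the function name is an assumption, and the input length is assumed to be a power of two.

```python
def bitonic_sort(keys):
    """Textbook bitonic sort (ascending) for a power-of-two number of keys.

    Each pass of the inner while loop is one compare-exchange stage of the
    network; within a stage, all N/2 comparisons are independent.
    """
    a = list(keys)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


sorted_keys = bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])
```

With N keys the loops run (log N)(log N + 1)/2 stages of N/2 comparisons each, which is the O(N log² N) comparator count the abstract cites for the original network.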

17.

Parallel Labeling of Three-Dimensional Clusters on Networks of Workstations

Felipe Knop Vernon Rego 《Journal of Parallel and Distributed Computing》1998,49(2):117

Cluster algorithms have application in diverse areas, including statistical mechanics of polymer solutions, spin models in physics, and the study of ecological systems. Most parallel cluster labeling algorithms are designed for SIMD and MIMD multiprocessors and based on relaxation methods. We present a parallel 3-D cluster labeling algorithm based on mapping tables, for distributed memory environments. The proposed algorithm focuses on minimizing interprocess communication to enhance execution performance on workstation networks. We implemented the algorithm with the aid of the EcliPSe parallel replication toolkit, exploiting special tree-combining and data reduction features of the system. We report on performance results for experiments conducted on workstation clusters.

18.

An Investigation of Scalable SIMD I/O Techniques with Application to Parallel JPEG Compression

《Journal of Parallel and Distributed Computing》1995,30(2):111-128

The problem inherent with any digital image or digital video system is the large amount of bandwidth required for transmission or storage. This has driven the research area of image compression to develop more complex algorithms that compress images to lower data rates with better fidelity. One approach that can be used to increase the execution speed of these complex algorithms is through the use of parallel processing. In this paper, we address the parallel implementation of the JPEG still-image compression standard on the MasPar MP-1, a massively parallel SIMD computer. We develop two novel byte alignment algorithms which are used to efficiently input and output compressed data from the parallel system, and present results which show real-time performance is possible. We also discuss several applications, such as motion JPEG, that can be used in multimedia systems.

19.

Boundary element analysis on vector and parallel computers

《Computing Systems in Engineering》1994,5(3):239-252

Boundary element analysis (BEA) can be characterized as a numerical technique that generally shifts the computational burden in the analysis toward numerical integration and the solution of nonsymmetric and either dense or blocked sparse systems of algebraic equations. Researchers have explored the concept that the fundamental characteristics of BEA can be exploited to generate effective implementations on vector and parallel computers. In this paper, the results of some of these investigations are discussed. The performance of overall algorithms for BEA on vector supercomputers, massively data parallel single instruction multiple data (SIMD), and relatively fine grained distributed memory multiple instruction multiple data (MIMD) computer systems is described. Some general trends and conclusions are discussed, along with indications of future developments that may prove fruitful in this regard.

20.

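An optimized scheduling algorithm for parallel task graphs

李于锋 莫则尧 肖永浩 熊敏 《计算机工程与科学》 (Computer Engineering & Science) 2019,41(6):955-962

Many complex application problems in scientific and engineering computing require scientific workflow technology. In supercomputing, scientific workflows are often modeled as parallel task graphs, and scheduling these graphs effectively is important for efficient application execution. A scheduling model for parallel task graphs under resource constraints is given. Several optimal scheduling results are given for Fork-Join parallel task graphs. For general parallel task graphs, a new scheduling algorithm is proposed that accounts for the effect of data communication overhead on resource allocation and scheduling performance, and that improves on the existing CPA algorithm in specific cases. Experimental comparison with the commonly used CPR and CPA algorithms verifies that the new algorithm achieves good scheduling results. The proposed scheduling algorithm and the optimal scheduling results can serve as a reference for developing high-performance scheduling functionality in workflow application systems.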