Similar Documents
1.
This paper presents the results of parallelizing a three-dimensional Navier-Stokes solver on a 32K-processor Thinking Machines CM-2, a 128-node Intel iPSC/860, and an 8-processor CRAY Y-MP. The main objective of this work is to study the performance of the flow solver, the INS3D-LU code, on two distributed-memory machines, a massively parallel SIMD machine (CM-2) and a moderately parallel MIMD machine (iPSC/860), and to compare it with its performance on a shared-memory MIMD machine with a small number of processors (Y-MP). The code is based on a Lower-Upper Symmetric Gauss-Seidel implicit scheme for the pseudocompressibility formulation of the three-dimensional incompressible Navier-Stokes equations. The code was rewritten in CM Fortran with shift operations and run on the CM-2 using the slicewise model. It was also rewritten with distributed data and Intel message-passing calls and run on the iPSC/860. Timing results for two grid sizes are presented and analyzed for both 32-bit and 64-bit arithmetic, and the impact of communication and load balancing on the performance of the code is outlined. The results show that reasonable performance can be achieved on these parallel machines; however, the CRAY Y-MP outperforms the CM-2 and iPSC/860 for this particular algorithm.

The author is an employee of Computer Sciences Corporation. This work was funded through NASA Contract NAS 2-12961.

2.
In this article we discuss the detailed implementation of a parallel pseudospectral code for integration of the Navier-Stokes equations on an Intel iPSC/860 hypercube. Issues related to the basic efficient parallelization of the algorithm on a hypercube are discussed, as well as optimization issues specific to the iPSC/860 system. With the combination of optimizations presented, the code runs on a 32-node iPSC/860 system at a speed exceeding that of the fastest implementation on a Cray Y-MP by nearly 25%.

3.
A three-dimensional electromagnetic particle-in-cell code with Monte Carlo collisions (PIC-MCC) is developed for MIMD parallel supercomputers. This code uses a standard relativistic leapfrog scheme, incorporating Monte Carlo calculations, to push plasma particles and to include collisional effects on particle orbits. A local finite-difference time-domain method is used to update the self-consistent electromagnetic fields. The code is implemented using the General Concurrent PIC (GCPIC) algorithm, which uses domain decomposition to divide the computation among the processors; particles must be exchanged between processors as they move among subdomains. Message passing is implemented using the Express Cubix library and PVM. We evaluate the performance of this code on a 512-processor Intel Touchstone Delta, a 512-processor Intel Paragon, and a 256-processor CRAY T3D. A high parallel efficiency exceeding 95% is achieved on all three machines for large problems. We have run PIC-MCC simulations using several hundred million particles with several million collisions per time step. For these large-scale simulations the particle push time achieved is in the range of 90–115 ns/particle/time step, and the collision calculation time is in the range of a few hundred nanoseconds per collision.
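The particle push in such codes advances positions and velocities staggered in time. A minimal non-relativistic, electrostatic sketch of a leapfrog push for a single scalar particle (the relativistic version with magnetic-field rotation and the Monte Carlo collision step are not reproduced; `E_at`, `qm`, and the scalar state are illustrative assumptions):

```python
def leapfrog_push(x, v, E_at, qm, dt, nsteps):
    """Leapfrog integration: v is staggered half a step behind x,
    so each 'kick' uses the field at the current position."""
    for _ in range(nsteps):
        v = v + qm * E_at(x) * dt   # kick: v(t - dt/2) -> v(t + dt/2)
        x = x + v * dt              # drift: x(t) -> x(t + dt)
    return x, v
```

In a domain-decomposed run, the drift step would be followed by a test of whether `x` has left the local subdomain, in which case the particle is packed into a message for the neighboring processor, as the abstract describes.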

4.
Grimshaw, A.S. Computer, 1993, 26(5): 39-51
Mentat, an object-oriented parallel processing system designed to directly address the difficulty of developing architecture-independent parallel programs, is discussed. The Mentat system consists of two components: the Mentat programming language and the Mentat runtime system. The Mentat programming language, which is based on C++, is described. Performance results from implementing the Mentat runtime system on a network of Sun 3 and 4 workstations, the Silicon Graphics Iris, the Intel iPSC/2, and the Intel iPSC/860 are presented.

5.
A platform for biological sequence comparison on parallel computers
We have written two programs for searching biological sequence databases that run on Intel hypercube computers. PSCANLIB compares a single sequence against a sequence library, and PCOMPLIB compares all the entries in one sequence library against a second library. The programs provide a general framework for similarity searching; they include functions for reading in query sequences, search parameters and library entries, and reporting the results of a search. We have isolated the code for the specific function that calculates the similarity score between the query and library sequence; alternative searching algorithms can be implemented by editing two files. We have implemented the rapid FASTA sequence comparison algorithm and the more rigorous Smith-Waterman algorithm within this framework. The PSCANLIB program on a 16-node 80386-based iPSC/2 hypercube can compare a 229 amino acid protein sequence with a 3.4 million residue sequence library in approximately 16 s with the FASTA algorithm. Using the Smith-Waterman algorithm, the same search takes 35 min. The PCOMPLIB program can compare a 0.8 million amino acid protein sequence library with itself in 5.3 min with FASTA on a third-generation 32-node Intel iPSC/860 hypercube.
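The Smith-Waterman recurrence underlying the rigorous search keeps, for each cell, the best local-alignment score ending there, clamped at zero. A compact serial sketch with linear (not affine) gap penalties; the scoring values here are illustrative defaults, not the parameters used by PSCANLIB:

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-10):
    """Best local-alignment score between strings a and b,
    using one previous row of the dynamic-programming matrix."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # max over: restart, diagonal, gap in b, gap in a
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

FASTA gets its speed by first locating short exact word matches and rescoring only the promising diagonal bands with a recurrence like this one, which is why the abstract reports seconds for FASTA versus minutes for full Smith-Waterman.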

6.
The implementation of two compilers for the data-parallel programming language Dataparallel C is described. One compiler generates code for Intel and nCUBE hypercube multicomputers; the other generates code for Sequent multiprocessors. A suite of Dataparallel C programs has been compiled and executed, and their execution times and speedups on the Intel iPSC/2, the nCUBE 3200 and the Sequent Symmetry are presented.

7.
This paper presents a solution to the problem of partitioning the work of sparse matrix factorization among the individual processors of a multiprocessor system. The proposed task assignment strategy is based on the structure of the elimination tree associated with the given sparse matrix. The goal of the task scheduling strategy is to achieve load balancing and a high degree of concurrency among the processors while reducing the amount of processor-to-processor data communication, even for arbitrarily unbalanced elimination trees. This is important because popular fill-reducing ordering methods, such as the minimum degree algorithm, often produce unbalanced elimination trees. Results from the Intel iPSC/2 are presented for various finite-element problems using both nested dissection and minimum degree orderings.

Research supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems Inc.

8.
Sequence Alignment with a PVM System on a Workstation Cluster

Sequence alignment is an important tool in molecular biology. With the volume of DNA data growing rapidly, efficient sequence alignment algorithms have become essential for studying newly discovered sequences. A distributed sequence alignment based on the method of Smith and Waterman has been implemented with a PVM system on a workstation cluster, and has also been run successfully on the Intel iPSC/860 high-performance parallel computer. This distributed Smith-Waterman algorithm serves as the search tool for Internet GRAIL and GENQUEST. This paper describes the implementation and its performance.

9.
Ordering clones from a genomic library into physical maps of whole chromosomes presents a pivotal computational problem in genetics. Previous research has shown the physical mapping problem to be isomorphic to the NP-complete Optimal Linear Arrangement (OLA) problem, for which no polynomial-time algorithm for determining the optimal solution is known. Serial implementations of stochastic global optimization techniques such as simulated annealing yielded very good results but proved computationally intensive. The design, analysis and implementation of coarse-grained parallel MIMD algorithms for simulated annealing on the Intel iPSC/860 hypercube are presented. Data decomposition and control decomposition strategies based on Markov chain decomposition, perturbation methods and problem-specific annealing heuristics are proposed and applied to the physical mapping problem. A suite of parallel algorithms is implemented on an 8-node Intel iPSC/860 hypercube, exploiting the nearest-neighbor communication pattern of the Boolean hypercube topology. Convergence, speedup and scalability characteristics of the various parallel algorithms are analyzed and discussed. Results indicate a deterioration of performance when a single Markov chain of solution states is distributed across multiple processing elements in the Intel iPSC/860 hypercube.
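As a point of reference for the serial baseline, simulated annealing for a linear arrangement problem can be sketched with swap moves and Metropolis acceptance; the cost is the total edge length of the layout. The temperature schedule, move set, and parameters below are illustrative, not those of the paper's parallel Markov-chain variants:

```python
import math
import random

def ola_cost(order, edges):
    """Total edge length when vertices are laid out at integer positions."""
    pos = {v: i for i, v in enumerate(order)}
    return sum(abs(pos[u] - pos[v]) for u, v in edges)

def anneal(vertices, edges, T0=10.0, cooling=0.995, steps=5000, seed=0):
    """Swap-move simulated annealing with geometric cooling."""
    rng = random.Random(seed)
    order = list(vertices)
    cost = ola_cost(order, edges)
    T = T0
    for _ in range(steps):
        i, j = rng.randrange(len(order)), rng.randrange(len(order))
        order[i], order[j] = order[j], order[i]
        new = ola_cost(order, edges)
        if new <= cost or rng.random() < math.exp((cost - new) / T):
            cost = new                               # accept the move
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo swap
        T *= cooling
    return order, cost
```

The paper's data and control decomposition strategies split a chain like this one across processors; the reported performance deterioration arises precisely when a single such Markov chain is distributed over multiple processing elements.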

10.
A mesh-vertex finite volume scheme for solving the Euler equations on triangular unstructured meshes is implemented on a MIMD (multiple instruction/multiple data stream) parallel computer. Three partitioning strategies for distributing the work load onto the processors are discussed, and issues pertaining to the communication costs are also addressed. We find that the spectral bisection strategy yields the best performance. The performance of this unstructured computation on the Intel iPSC/860 compares very favorably with that on a one-processor CRAY Y-MP/1 and an earlier implementation on the Connection Machine.

The authors are employees of Computer Sciences Corporation. This work was funded under contract NAS 2-12961.
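Spectral bisection partitions a mesh using the Fiedler vector, the eigenvector for the second-smallest eigenvalue of the graph Laplacian L = D - A; splitting vertices at the median entry gives two balanced halves that tend to cut few edges. A small dense-matrix sketch, assuming NumPy (production partitioners use sparse Lanczos-type eigensolvers rather than a full eigendecomposition):

```python
import numpy as np

def spectral_bisection(n, edges):
    """Partition vertices 0..n-1 into two halves by the Fiedler vector."""
    # assemble the graph Laplacian L = D - A
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1.0
        L[v, v] += 1.0
        L[u, v] -= 1.0
        L[v, u] -= 1.0
    # eigh returns eigenvalues in ascending order; column 0 is the
    # constant vector for eigenvalue 0, column 1 is the Fiedler vector
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    # split at the median entry to get two equal-sized parts
    return fiedler > np.median(fiedler)
```

Applied recursively, this yields 2^k subdomains, one per processor, which is how the strategy is typically used for distributing an unstructured mesh.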

11.
We discuss a hybrid strategy for implementing global combine operations on distributed-memory Multiple Instruction Multiple Data multicomputers. A theoretical analysis is given, and results from its implementation on the Intel iPSC/860 are reported.
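A standard building block for global combines on a hypercube is recursive doubling: in round k, node i exchanges partial results with node i XOR 2^k, so after log2(p) rounds every node holds the combined value. A single-process sketch of that communication pattern (the paper's hybrid strategy itself is not reproduced here):

```python
def hypercube_combine(values, op):
    """Simulate a recursive-doubling combine on p = 2^d 'nodes'.
    After d rounds every node holds op folded over all inputs."""
    p = len(values)
    assert p & (p - 1) == 0, "node count must be a power of two"
    vals = list(values)
    d = p.bit_length() - 1
    for k in range(d):
        nxt = vals[:]
        for i in range(p):
            # node i pairs with the neighbor differing in bit k
            nxt[i] = op(vals[i], vals[i ^ (1 << k)])
        vals = nxt
    return vals
```

With `op=lambda a, b: a + b` this is an allreduce-sum; any associative, commutative operation works, and every node finishes with the same result.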

12.
We evaluate the basic performance of the Intel iPSC/860 computer, which can have up to 128 Intel i860-based nodes connected together with a hypercube network topology. After a brief overview of the system, we discuss the properties and bottlenecks of the hardware architecture and software environment. Basic memory, scalar and vector performance of a single node is evaluated, and the communication performance and the overlap of computation and communication are analysed.

13.
The MCHF (Multiconfiguration Hartree-Fock) atomic structure package consists of a series of programs that predict a range of atomic properties and communicate information through files. Several of these have now been modified for the distributed-memory environment. On the Intel iPSC/860 the restricted amount of memory and the lack of virtual memory required a redesign of the data organization with large arrays residing on disk. The data structures also had to be modified. To a large extent, data could be distributed among the nodes, but crucial to the performance of the MCHF program was the global information that is needed for an even distribution of the workload. This paper outlines the computational problems that must be solved in an atomic structure calculation and describes the strategies used to distribute both the data and the workload on a distributed-memory system. Performance data are provided for some benchmark calculations on the Intel iPSC/860.

14.
In this article, we study the effects of network topology and load balancing on the performance of a new parallel algorithm for solving triangular systems of linear equations on distributed-memory message-passing multiprocessors. The proposed algorithm employs novel runtime data mapping and workload redistribution methods on a communication network configured as a toroidal mesh. A fully parameterized theoretical model is used to predict the communication behavior of the proposed algorithm relevant to load balancing, and the analytical performance results correctly determine the optimal dimensions of the toroidal mesh, which vary with the problem size, the number of available processors, and the hardware parameters of the machine. Further enhancement to the proposed algorithm is then achieved by redistributing the arithmetic workload at runtime. Our FORTRAN implementation of the proposed algorithm and its enhanced version have been tested on an Intel iPSC/2 hypercube, and the same code is also suitable for executing the algorithm on the iPSC/860 hypercube and the Intel Paragon mesh multiprocessor. The actual timing results support our theoretical findings, and both confirm the very significant impact that a network topology chosen at runtime can have on the computational load distribution, the communication behavior and the overall performance of parallel algorithms.
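The serial dependence that such algorithms must work around is visible in plain forward substitution: entry x[j] must be known before any later row can be updated. A serial sketch annotated with a cyclic column-to-processor map (the `p` "processors" and the work counter are illustrative; the paper's toroidal-mesh mapping and runtime workload redistribution are not modeled):

```python
def forward_subst_cyclic(L, b, p=4):
    """Forward substitution for a lower-triangular system L x = b.
    Columns are dealt cyclically to p 'processors'; work[k] counts the
    divisions processor k would perform under this mapping."""
    n = len(b)
    x = [0.0] * n
    y = list(b)
    work = [0] * p
    for j in range(n):
        owner = j % p                # cyclic column-to-processor map
        work[owner] += 1
        x[j] = y[j] / L[j][j]        # computed on `owner`, then sent on
        for i in range(j + 1, n):
            y[i] -= L[i][j] * x[j]   # updates applied by the row holders
    return x, work
```

In a real message-passing code, the broadcast of each x[j] to the holders of later rows is the communication step whose cost the paper's topology and load-balancing analysis quantifies.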

15.
Chau-Wen Tseng. Software, 1997, 27(7): 763-796
Fortran D is a version of Fortran enhanced with data decomposition specifications. Case studies illustrate strengths and weaknesses of the prototype Fortran D compiler when compiling linear algebra codes and whole programs. Statement groups, execution conditions, inter-loop communication optimizations, multi-reductions, and array kills for replicated arrays are identified as new compilation issues. On the Intel iPSC/860, the output of the prototype Fortran D compiler approaches the performance of hand-optimized code for parallel computations, but needs improvement for linear algebra and pipelined codes. The Fortran D compiler outperforms the CM Fortran compiler (2.1 beta) by a factor of four or more on the TMC CM-5 when not using vector units. Its performance is comparable to the DEC and IBM HPF compilers on an Alpha cluster and SP-2. Better analysis, run-time support, and flexibility are required for the prototype compiler to be useful for a wider range of programs. © 1997 John Wiley & Sons, Ltd.

16.
On most high-performance architectures, data movement is slow compared to floating point (in particular, vector) performance. On these architectures block algorithms have been successful for matrix computations. By considering a matrix as a collection of submatrices (the so-called blocks), one naturally arrives at algorithms that require little data movement. The optimal blocking strategy, however, depends on the computing environment and on the problem parameters. On parallel machines, tradeoffs between individual floating point performance and overall system performance also come into play. Current approaches use fixed-width blocking strategies which are not optimal. This paper presents an adaptive blocking methodology for determining a good blocking strategy systematically. We demonstrate this technique on a block QR factorization routine on a distributed-memory machine. Using timing models for the high-level kernels of the algorithm, we can formulate in a recurrence relation a blocking strategy that avoids adding extra delays along the critical path of the algorithm. This recurrence relation predicts performance well since we base our timing models on observed data, not other simplistic measures. Experiments on the Intel iPSC/1 hypercube show that, in fact, the resulting blocking strategy is as good as any fixed-width blocking strategy, independent of problem size and the number of processors employed. So while we do not know the optimum fixed-width blocking strategy unless we rerun the same problem several times, adaptive blocking provides close to optimum performance in the first run. We also mention how adaptive blocking can result in performance-portable code by automating the generation of the timing models.

This work was partially performed when the author was a graduate student at Cornell University and was supported by the Office of Naval Research under contract N00014-83-K-0640 and by NSF contract DCR 86-02310.
Support at Argonne was provided by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract W-31-109-Eng-38. Computations were performed in part at the facilities of the Cornell Computational Optimization Project, which is supported by the National Science Foundation under contract DMS 87-06133, at the Cornell National Supercomputer Facility, and at the Advanced Research Computing Facility at Argonne National Laboratory.

17.
We present lattice parallelism (LPAR), a programming methodology and software development tool for implementing scientific computations on distributed-memory MIMD multiprocessors. LPAR supports an efficient portable model for managing physically distributed, dynamic data structures in a shared name space. It enables the programmer to manage locality effectively for high performance without becoming involved in low-level details such as communication. LPAR is intended for applications that concentrate computational effort non-uniformly or employ multiple-level representations. We present computational results for two such applications, particle dynamics and multigrid, running on the Intel iPSC/860 and the nCUBE/2. Performance achieved with these portable applications is competitive with highly optimized Fortran running on a single processor of the Cray Y-MP.

18.
In an earlier paper we described how uniformization can be used as the basis of a conservative parallel simulation algorithm for simulating a continuous time Markov chain (CTMC). The fundamental notion is that uniformization permits the calculation (in advance of actually running the simulation) of instants where processors will synchronize, achieving much lower synchronization overhead than is usually attributed to conservative methods. In this paper we extend the idea further, showing how to use uniformization in the context of an optimistic parallel simulation to reduce the frequency of state-saving, schedule intelligently, and eliminate the Global Virtual Time (GVT) calculation. We demonstrate the efficiency of the method by implementation on a 16-processor Intel iPSC/2 and on 256 processors of the Intel Touchstone Delta.
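Uniformization runs a CTMC on a single Poisson clock whose rate Λ is chosen at least as large as every state's total exit rate; at each tick the chain jumps to state j with probability q(i,j)/Λ and otherwise takes a self-loop. Because the tick instants do not depend on the state, they can be generated ahead of time, which is the property that lets processors precompute synchronization instants. A minimal serial sketch (the parallel machinery is omitted; `Q` and `Lam` below are illustrative):

```python
import random

def uniformized_step(state, Q, Lam, rng):
    """One tick of the uniformized chain: jump to j with probability
    Q[state][j] / Lam, otherwise self-loop."""
    u = rng.random()
    acc = 0.0
    for j, rate in enumerate(Q[state]):
        if j == state:
            continue
        acc += rate / Lam
        if u < acc:
            return j
    return state  # self-loop: the tick fires but the state is unchanged

def simulate(Q, x0, Lam, nticks, seed=0):
    """Run nticks Poisson-clock ticks and return the visited states."""
    rng = random.Random(seed)
    s, path = x0, []
    for _ in range(nticks):
        s = uniformized_step(s, Q, Lam, rng)
        path.append(s)
    return path
```

The discrete chain defined by the ticks has transition matrix P = I + Q/Λ, so its stationary distribution matches the CTMC's, which a long run of `simulate` reproduces empirically.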

19.
Realistic simulations of fluid flow in oil reservoirs have been proven to be computationally intensive. In this work, techniques for solving the large sparse systems of linear equations that arise in simulating these processes are developed for parallel computers such as the Intel iPSC/2 and iPSC/860 hypercubes. The solver is based on a combined multigrid and domain decomposition approach. The algorithm uses line corrections solved with a multigrid method, with line Jacobi and block incomplete domain decomposition as an overall preconditioner for a conjugate gradient-like acceleration method, ORTHOMIN(k). This is shown to be a factor of ten faster on a 32-processor hypercube compared to widely used sequential solvers. Three test problems, which include implicit wells and faults, are used to validate the results: the first is based on highly heterogeneous two-phase flow, the second on the SPE Third Comparative Solution, and the third on real production compositional data.

20.
Book review     
Summary: The book is priced at $11.95. I normally would not mention price in a review, but in this case it seems worth noting: because the book is far superior to some that I've seen priced at over $50, and because it would make such a perfect supplement to many courses that cover automated deduction, the fact that its price has been held down so low is particularly significant.

The book offers one man's assessment of where the difficulties lie. What makes it significant is that this man's judgments have proven correct a surprising number of times. Had someone of less stature attempted such a book, I would probably take it a lot less seriously. As it is, I consider the book to be a major contribution, and I recommend it.

This work was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under contract W-31-109-Eng-38.
