首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 78 毫秒
1.
In this article, we study the effects of network topology and load balancing on the performance of a new parallel algorithm for solving triangular systems of linear equations on distributed-memory message-passing multiprocessors. The proposed algorithm employs novel runtime data mapping and workload redistribution methods on a communication network which is configured as a toroidal mesh. A fully parameterized theoretical model is used to predict communication behaviors of the proposed algorithm relevant to load balancing, and the analytical performance results correctly determine the optimal dimensions of the toroidal mesh, which vary with the problem size, the number of available processors, and the hardware parameters of the machine. Further enhancement to the proposed algorithm is then achieved through redistributing the arithmetic workload at runtime. Our FORTRAN implementation of the proposed algorithm as well as its enhanced version has been tested on an Intel iPSC/2 hypercube, and the same code is also suitable for executing the algorithm on the iPSC/860 hypercube and the Intel Paragon mesh multiprocessor. The actual timing results support our theoretical findings, and they both confirm the very significant impact a network topology chosen at runtime can have on the computational load distribution, the communication behaviors and the overall performance of parallel algorithms.  相似文献   

2.
In this article we discuss the detailed implementation of a parallel pseudospectral code for integration of the Navier-Stokes equations on an Intel iPSC/860 Hypercube. Issues related to the basic efficient parallelization of the algorithm on a hypercube are discussed, as well as optimization issues specific to the iPSC/860 system. With the combination of optimizations presented, the code runs on a 32-node iPSC/860 system at a speed exceeding that of the fastest implementation on a Cray YMP by nearly 25%.  相似文献   

3.
The authors describe parallel algorithms for simulating certain continuous time Markov chains, such as those arising in queueing network models of distributed computing systems or communications systems. The algorithms are based on the technique of uniformization. Two variations of a conservative parallel simulation algorithm are presented. In each algorithm, a relatively short presimulation is performed to identify those times, and only those times, at which the simulation algorithm requires processor pairs to synchronize. Speedup studies of the algorithms, performed on a 16-node Intel iPSC/2 hypercube, are presented and discussed  相似文献   

4.
Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is shown to be isomorphic to the classical NP-complete Optimal Linear Arrangement problem. Parallel algorithms for simulated annealing and microcanonical annealing based on Markov chain decomposition are proposed and applied to the problem of chromosome reconstruction via clone ordering. These algorithms are implemented on a cluster of UNIX workstations using the Parallel Virtual Machine (PVM) system. PVM is a software system that permits a heterogeneous collection of networked computers to be viewed by a user's program as a single monolithic parallel computing resource. The parallel algorithms are implemented and tested on clonal data derived from Chromosome IV of the fungus Asperigillus nidulans Perturbation methods and problem-specific annealing heuristics for the parallel simulated annealing and parallel microcanonical annealing algorithms are proposed and described. Convergence, speedup and scalability characteristics of the various parallel algorithms are analyzed and discussed. This revised version was published online in June 2006 with corrections to the Cover Date.  相似文献   

5.
This paper formalizes a general technique to combine different methods in the solution of large systems of nonlinear equations using parallel asynchronous implementations on distributed-memory multiprocessor systems. Such combinations of methods, referred to as team algorithms, are evaluated as a way of obtaining desirable properties of different methods and a sufficient condition for their convergence is derived. The load flow problem of electrical power networks is presented as an example problem that, under certain conditions, has the characteristics to make a team algorithm an appealing choice for its solution. Experimental results of an implementation on an Intel iPSC/860 Hypercube are reported, showing that considerable speedup and robustness can be obtained using team algorithms  相似文献   

6.
《Computers & chemistry》1988,12(1):15-20
In this paper, we discuss the implementation and solution of molecular dynamics polymer models on the Intel iPSC Hypercube parallel computer. We first describe a model problem whose inherent parallelism can be exploited effectively and which is therefore amenable to parallel solution. We next discuss the salient features of the hypercube programs that were used to solve the model problem. Finally, we present results which demonstrate that the efficiencies of the solutions approach 100% as the number of atoms in the polymer and the number of hypercube processors are increased for the problem in question. The results demonstrate that parallel solutions of polymer problems using a hypercube architecture offer potentially significant savings over sequential solutions.  相似文献   

7.
A platform for biological sequence comparison on parallel computers   总被引:1,自引:0,他引:1  
We have written two programs for searching biological sequence databases that run on Intel hypercube computers. PSCANLIB compares a single sequence against a sequence library, and PCOMPLIB compares all the entries in one sequence library against a second library. The programs provide a general framework for similarity searching; they include functions for reading in query sequences, search parameters and library entries, and reporting the results of a search. We have isolated the code for the specific function that calculates the similarity score between the query and library sequence; alternative searching algorithms can be implemented by editing two files. We have implemented the rapid FASTA sequence comparison algorithm and the more rigorous Smith-Waterman algorithm within this framework. The PSCANLIB program on a 16 node iPSC/2 80386-based hypercube can compare a 229 amino acid protein sequence with a 3.4 million residue sequence library in approximately 16 s with the FASTA algorithm. Using the Smith-Waterman algorithm, the same search takes 35 min. The PCOMPLIB program can compare a 0.8 million amino acid protein sequence library with itself in 5.3 min with FASTA on a third-generation 32 node Intel iPSC/860 hypercube.  相似文献   

8.
We evaluate the basic performance of the Intel iPSC/860 computer, which can have up to 128 Intel i860-based nodes connected together with a hypercube network topology. After giving a brief overview of the system, the properties and bottlenecks of the hardware architecture and software environment are discussed. Basic memory, scalar and vector performance of a single node is evaluated, and the communication performance and the overlap of computation and communication are analysed.  相似文献   

9.
Optimizing large join queries that consist of many joins has been recognized as NP-hard. Most of the previous work focuses on a uniprocessor environment. In a multiprocessor, the location of each join adds another dimension to the complexity of the problem. In this paper, we examine the feasibility of exploiting the inherent parallelism in optimizing large join queries on a hypercube multiprocessor. This includes using the multiprocessor not only to answer the large join query but also to optimize it. We propose an algorithm to estimate the cost of a parallel large join plan. Three heuristics are provided for generating an initial solution, which is further optimized by an iterative local-improvement method. The entire process of parallel query optimization and execution is simulated on an Intel iPSC/2 hypercube machine. Our experimental results show that the performance of each heuristic depends on the characteristics of the query  相似文献   

10.
A discussion is presented of two ways of mapping the cells in a two-dimensional area of a chip onto processors in an n-dimensional hypercube such that both small and large cell moves can be applied. Two types of move are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support such a parallel cost evaluation. A novel tree broadcasting strategy is presented for the hypercube that is used extensively in the algorithm for updating cell locations in the parallel environment. A dynamic parallel annealing schedule is proposed that estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control. The performance on an Intel iPSC-2/D4/MX hypercube is reported  相似文献   

11.
Realistic simulations of fluid flow in oil reservoirs have been proven to be computationally intensive. In this work, techniques for solving large sparse systems of linear equations that arise in simulating these processes are developed for parallel computers such as INTEL hypercubes iPSC/2 and iPSC/860. This solver is based on a combined multigrid and domain decomposition approach. The Algorithm uses line corrections solved using a multigrid method, line Jacobi and block incomplete domain decomposition as an overall preconditioner for a conjugate gradient-like acceleration method, ORTHOMIN (k). This is shown to be a factor of ten times faster on a 32-processor hypercube compared to widely used sequential solvers. Three test problems are used to validate the results which include implicit wells and faults: The first is based on highly heterogeneous two-phase flow, the second on the SPE Third Comparative Solution and the third on real production compositional data.  相似文献   

12.
The Gamma database machine project   总被引:4,自引:0,他引:4  
The design of the Gamma database machine and the techniques employed in its implementation are described. Gamma is a relational database machine currently operating on an Intel iPSC/2 hypercube with 32 processors and 32 disk drives. Gamma employs three key technical ideas which enable the architecture to be scaled to hundreds of processors. First, all relations are horizontally partitioned across multiple disk drives, enabling relations to be scanned in parallel. Second, parallel algorithms based on hashing are used to implement the complex relational operators, such as join and aggregate functions. Third, dataflow scheduling techniques are used to coordinate multioperator queries. By using these techniques, it is possible to control the execution of very complex queries with minimal coordination. The design of the Gamma software is described and a thorough performance evaluation of the iPSC/s hypercube version of Gamma is presented  相似文献   

13.
A multilevel algorithm is presented for direct, parallel factorization of the large sparse matrices that arise from finite element and spectral element discretization of elliptic partial differential equations. Incomplete nested dissection and domain decomposition are used to distribute the domain among the processors and to organize the matrix into sections in which pivoting is applied to stabilize the factorization of indefinite equation sets. The algorithm is highly parallel and memory efficient; the efficient use of sparsity in the matrix allows the solution of larger problems as the number of processors is increased, and minimizes computations as well as the number and volume of communications among the processors. The number of messages and the total volume of messages passed during factorization, which are used as measures of algorithm efficiency, are reduced significantly compared to other algorithms. Factorization times are low and speedups high for implementation on an Intel iPSC/860 hypercube computer. Furthermore, the timings for forward and back substitutions are more than an order-of-magnitude smaller than the matrix decomposition times.  相似文献   

14.
A macro package for expressing message passing functions within parallel FORTRAN program is presented. It makes the user program fully portable among all parallel computers where the macros are implemented. The implementation on the Intel iPSC/2 hypercube is discussed in more detail. New message passing primitives have been added to the iPSC/2 operating system, offering the user a broader functionality at no efficiency loss. The full macro set, using these primitives, works with the same performance as the original Intel primitives.  相似文献   

15.
The performance of production programs can be improved by firing multiple rules concurrently at a production cycle. Although considerable amount of research has been done on parallel processing of production programs, the problem of multiple rule firing has not been thoroughly investigated yet. In this paper, we begin by identifying the problems associated with multiple rule firing systems, the compatibility and convergence problems, and present three multiple rule firing models which address them. The rule dependence model addresses the compatibility problem using interrule data dependence analysis. The single-context-multiple-rules (SCMR) model and the multiple-contexts-multiple-rules (MCMR) model address both the compatibility and convergence problems. A production program executed under the SCMR and the MCMR models reaches a solution which is equivalent to the sequential execution. These three multiple rule firing models have been simulated on the RUBIC simulator, and the MCMR model, which has the highest performance, has been implemented on the Intel iPSC/2 hypercube. The simulation and implementation results are reported.  相似文献   

16.
在超立方体上并行仿真BP神经网   总被引:3,自引:0,他引:3  
本文研究在超立方体上并行仿真BP神经网的方法,报告我们在Intel iPCS/860超立方上开发的一个BP神经仿真器。文章着重讨论如何把BP神经网均衡地分配到超立方体的结点以及如何在超立方体上并行实现BP算法等问题。  相似文献   

17.
This paper presents the results of parallelizing a three-dimensional Navier-Stokes solver on a 32K-processor Thinking Machines CM-2, a 128-node Intel iPSC/860, and an 8-processor CRAY Y-MP. The main objective of this work is to study the performance of the flow solver, INS3D-LU code, on two distributed-memory machines, a massively parallel SIMD machine (CM-2) and a moderately parallel MIMD machine (iPSC/860), and compare it with its performance on a shared-memory MIMD machine with a small number of processors (Y-MP). The code is based on a Lower-Upper Symmetric-Gauss-Seidel implicit scheme for the pseudocompressibility formulation of the three-dimensional incompressible Navier-Stokes equations. The code was rewritten in CMFORTRAN with shift operations and run on the CM-2 using the slicewise model. The code was also rewritten with distributed data and Intel message-passing calls and run on the iPSC/860. The timing results for two grid sizes are presented and analyzed using both 32-bit and 64-bit arithmetic. Also, the impact of communication and load balancing on the performance of the code is outlined. The results show that reasonable performance can be achieved on these parallel machines. However, the CRAY Y-MP outperforms the CM-2 and iPSC/860 for this particular algorithm.The author is an employee of Computer Sciences Corporation. This work was funded through NASA Contract NAS 2-12961.  相似文献   

18.
This paper gives hypercube algorithms for some simple problems involving geometric properties of sets of points. The properties considered emphasize aspects of convexity and domination. Efficient algorithms are given for both fine- and medium-grain hypercube computers, including a discussion of implementation, running times and results on an Intel iPSC hypercube, as well as theoretical results. For both serial and parallel computers, sorting plays an important role in geometric algorithms for determining simple properties, often being the dominant component of the running time. Since the time required to sort data on a hypercube computer is still not fully understood, the running times of some of our algorithms for unsorted data are not completely determined. For both the fine- and medium-grain models, we show that faster expected-case running time algorithms are possible for point sets generated randomly. Our algorithms are developed for sets of planar points, with several of them extending to sets of points in spaces of higher dimension.  相似文献   

19.
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, that is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of algorithms′ performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timings results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.  相似文献   

20.
This paper describes an efficient algorithm for the parallel solution of systems of linear equations with a block tridiagonal coefficient matrix. The algorithm comprises a multilevel LU-factorization based on block cyclic reduction and a corresponding solution algorithm.

The paper includes a general presentation of the parallel multilevel LU-factorization and solution algorithms, but the main emphasis is on implementation principles for a message passing computer with hypercube topology. Problem partitioning, processor allocation and communication requirement are discussed for the general block tridiagonal algorithm.

Band matrices can be cast into block tridiagonal form, and this special but important problem is dealt with in detail. It is demonstrated how the efficiency of the general block tridiagonal multilevel algorithm can be improved by introducing the equivalent of two-way Gaussian elimination for the first and the last partitioning and by carefully balancing the load of the processors. The presentation of the multilevel band solver is accompanied by detailed complexity analyses.

The properties of the parallel band solver were evaluated by implementing the algorithm on an Intel iPSC hypercube parallel computer and solving a larger number of banded linear equations using 2 to 32 processors. The results of the evaluation include speed-up over a sequential processor, and the measure values are in good agreement with the theoretical values resulting from complexity analysis. It is found that the maximum asymptotic speed-up of the multilevel LU-factorization using p processors and load balancing is approximated well by the expression (p +6)/4.

Finally, the multilevel parallel solver is compared with solvers based on row and column interleaved organization.  相似文献   


设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号