Similar Literature
20 similar documents found (search time: 328 ms)
1.
In this study, parallel computing technology is applied to the simulation of a wind turbine flow problem. A third-order Roe-type flux-limited splitting based on a pre-conditioning matrix, together with an explicit time-marching method, is used to solve the Navier–Stokes equations. The original FORTRAN code was parallelized with the Message Passing Interface (MPI) library and tested on a 64-CPU IBM SP2 parallel computer. The test results show a significant reduction of computing time in running the model, with a super-linear speed-up achieved for up to 32 IBM SP2 processors. The speed-up is as high as 49 when using 64 IBM SP2 processors. The tests show the promising potential of parallel processing for prompt simulation of current wind turbine problems.
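The abstract gives no code, but the core of an MPI parallelization of an explicit time-marching flow solver is a domain decomposition with ghost-cell exchange between neighbouring ranks. The sketch below is a minimal, hypothetical illustration of that pattern in Python with mpi4py (not the authors' FORTRAN/MPI code); a 1-D diffusion update with periodic neighbours stands in for the actual flux evaluation.

# Minimal sketch of domain decomposition + ghost-cell exchange for an
# explicit time-marching solver (illustrative only, not the paper's code).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 100                        # cells owned by this rank
u = np.zeros(n_local + 2)            # +2 ghost cells
u[1:-1] = rank                       # arbitrary initial data
left, right = (rank - 1) % size, (rank + 1) % size   # periodic neighbours
dt, dx, nu = 1e-3, 1.0, 0.1

for step in range(10):
    # exchange ghost cells with the two neighbours
    u[0] = comm.sendrecv(u[-2], dest=right, source=left)
    u[-1] = comm.sendrecv(u[1], dest=left, source=right)
    # explicit update (a stand-in for the real flux evaluation)
    u[1:-1] += nu * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])

print(rank, u[1:4])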

2.
Computers & Chemistry, 1996, 20(3): 381–384
This paper describes the construction of a scientific parallel computer system using IBM-compatible personal computers as the node processors, and the development of suitable system software which enables applications, such as molecular mechanics and molecular dynamics calculations, to be run on the parallel computer. Performance figures are given for molecular mechanics calculations. A major feature of the host and node processors is the use of multiple, low-cost, interprocessor communication links, which have hardware broadcasting facilities.

3.
4.
A linear-scaling algorithm has been developed to perform large-scale molecular-dynamics (MD) simulations, in which interatomic forces are computed quantum mechanically in the framework of density functional theory. A divide-and-conquer algorithm is used to compute the electronic structure, where the non-additive contribution to the kinetic energy is included with an embedded-cluster scheme. Electronic wave functions are represented on a real-space grid, which is augmented with coarse multigrids to accelerate the convergence of iterative solutions and with adaptive fine grids around atoms to accurately calculate ionic pseudopotentials. Spatial decomposition is employed to implement the hierarchical-grid algorithm on massively parallel computers. A converged solution to the electronic-structure problem is obtained for a 32,768-atom amorphous CdSe system on 512 IBM POWER4 processors. The total energy is well conserved during MD simulations of liquid Rb, showing the applicability of this algorithm to first-principles MD simulations. The parallel efficiency is 0.985 on 128 Intel Xeon processors for a 65,536-atom CdSe system.

5.
We present a parallel implementation of the Bose-Hubbard model, using imaginary-time propagation to find the lowest quantum eigenstate and real-time propagation to simulate quantum dynamics. Scaling issues, the performance of sparse matrix-vector multiplication, and a parallel algorithm for determining the nonzero matrix elements are described. The implementation of imaginary-time propagation yields O(N) linear convergence on a single processor and slightly better than ideal performance on up to 160 processors for a particular problem size. Determining the nonzero matrix elements is intractable with sequential, non-optimized techniques for large problem sizes. We therefore discuss a parallel algorithm that takes advantage of the intrinsic structural characteristics of the Fock-space matrix representation of the Bose-Hubbard Hamiltonian and utilizes a parallel implementation of a Fock-state lookup table to make this task solvable within reasonable time frames. Our parallel algorithm demonstrates near-ideal scaling on thousands of processors. We include results for a matrix of dimension 22.6 million, with 202 million nonzero elements, utilizing 2048 processors.
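To make the imaginary-time-propagation idea concrete: repeatedly applying exp(-H dt), approximated below by a first-order step, and renormalizing drives an arbitrary starting vector toward the lowest eigenstate. This is a serial toy example with a hypothetical random sparse Hamiltonian, not the authors' parallel Bose-Hubbard implementation.

# Toy imaginary-time propagation toward the ground state of a small
# sparse Hamiltonian (serial sketch; the paper's code is parallel and
# operates on the Fock-space matrix of the Bose-Hubbard model).
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n = 200
H = sp.random(n, n, density=0.05, random_state=rng)
H = (H + H.T) * 0.5                      # make it symmetric (Hermitian)

psi = rng.standard_normal(n)
psi /= np.linalg.norm(psi)
dt = 0.01

for _ in range(5000):
    psi = psi - dt * (H @ psi)           # first-order step of exp(-H*dt)
    psi /= np.linalg.norm(psi)           # renormalize after each step

print("estimated ground-state energy:", psi @ (H @ psi))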

6.
A scalable and portable Fortran code is developed to calculate Coulomb interaction potentials of charged particles on parallel computers, based on the fast multipole method. The code has a unique feature to calculate microscopic stress tensors due to the Coulomb interactions, which is useful in constant-pressure simulations and local stress analyses. The code is applicable to various boundary conditions, including periodic boundary conditions in two and three dimensions, corresponding to slab and bulk systems, respectively. The numerical accuracy of the code is tested through comparison of its results with those obtained by the Ewald summation method and by direct calculations. Scalability tests show a parallel efficiency of 0.98 for 512 million charged particles on 512 IBM SP3 processors. The timing results on the IBM SP3 are also compared with those on the IBM SP4.
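The direct pairwise summation used above as an accuracy reference is simple to write down; a minimal, non-periodic O(N²) version in Python might look like the following. Names and units are hypothetical, and the paper's production code is Fortran and uses the fast multipole method for the large systems.

# Direct O(N^2) Coulomb potentials and total energy, useful as a
# small-system accuracy reference for an FMM or Ewald implementation.
import numpy as np

def direct_coulomb(pos, q):
    """pos: (N, 3) positions, q: (N,) charges (Gaussian units, no cutoff)."""
    n = len(q)
    phi = np.zeros(n)                    # potential felt by each particle
    for i in range(n):
        r = np.linalg.norm(pos - pos[i], axis=1)
        r[i] = np.inf                    # exclude self-interaction
        phi[i] = np.sum(q / r)
    energy = 0.5 * np.sum(q * phi)       # each pair counted once
    return phi, energy

rng = np.random.default_rng(1)
pos = rng.random((500, 3)) * 10.0
q = rng.choice([-1.0, 1.0], size=500)
phi, E = direct_coulomb(pos, q)
print("total Coulomb energy:", E)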

7.
A scalable parallel algorithm has been designed to perform multimillion-atom molecular dynamics (MD) simulations, in which first-principles-based reactive force fields (ReaxFF) describe chemical reactions. Environment-dependent bond orders associated with atomic pairs and their derivatives are reused extensively with the aid of linked-list cells to minimize the computation associated with atomic n-tuple interactions (n ≤ 4 explicitly and ≤ 6 due to chain-rule differentiation). These n-tuple computations are made modular, so that they can be reconfigured effectively with a multiple time-step integrator to further reduce the computation time. Atomic charges are updated dynamically with an electronegativity equalization method, by iteratively minimizing the electrostatic energy with the charge-neutrality constraint. The ReaxFF-MD simulation algorithm has been implemented on parallel computers based on a spatial decomposition scheme combined with distributed n-tuple data structures. The measured parallel efficiency of the parallel ReaxFF-MD algorithm is 0.998 on 131,072 IBM BlueGene/L processors for a 1.01 billion-atom RDX system.
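The charge-update step can be illustrated concretely: electronegativity equalization amounts to minimizing a quadratic electrostatic energy in the atomic charges subject to overall charge neutrality. The toy version below solves the constrained problem directly via the KKT system for a handful of atoms with hypothetical parameters, rather than the distributed iterative minimization used in the paper.

# Toy charge-equilibration (electronegativity equalization) step:
# minimize  E(q) = chi.q + 0.5 q.H.q  subject to  sum(q) = 0.
# Direct dense solve for illustration only; all parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 6
pos = rng.random((n, 3)) * 5.0
chi = rng.random(n)                       # electronegativities (hypothetical)
hardness = 10.0 + rng.random(n)           # atomic hardnesses, kept large so H stays well conditioned

r = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
np.fill_diagonal(r, 1.0)                  # dummy value; the diagonal is overwritten below
H = 1.0 / r                               # off-diagonal Coulomb coupling 1/r_ij
np.fill_diagonal(H, hardness)

# KKT system: [H 1; 1^T 0] [q; lambda] = [-chi; 0]
A = np.zeros((n + 1, n + 1))
A[:n, :n] = H
A[:n, n] = 1.0
A[n, :n] = 1.0
b = np.concatenate([-chi, [0.0]])
q = np.linalg.solve(A, b)[:n]
print("charges:", np.round(q, 4), " net charge:", round(q.sum(), 12))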

8.
Parallel Computing, 1997, 22(13): 1883–1896
The aim of the logic programming system PloSys is to offer most Prolog functionalities on distributed-memory parallel computers as well as on networks of personal computers. The PloSys computational model exploits OR-parallelism. It is designed to implement cuts and side-effects while keeping the number of messages exchanged between processors low. It is a multi-sequential model based on copying. Scheduling is centralized, as is the management of Prolog cuts and side-effects. The scheduler maintains up-to-date knowledge of the left-most worker in the scope of each cut. A PloSys prototype has been implemented, and its first results, on an IBM SP-1, are promising.

9.
Parallel computers are having a profound impact on computational science. Recently, highly parallel machines have taken the lead as the fastest supercomputers, a trend that is likely to accelerate in the future. We describe some of these new computers, and issues involved in using them. We present elliptic PDE solutions currently running at 3.8 gigaflops, and an atmospheric dynamics model running at 1.7 gigaflops, on a 65,536-processor computer.

One intrinsic disadvantage of a parallel machine is the need to perform inter-processor communication. It is important to ensure that such communication time is maintained at a small fraction of computation time. We analyze standard multigrid algorithms in two and three dimensions from this point of view, indicating that performance efficiencies in excess of 95% are attainable under suitable conditions on moderately parallel machines. We also demonstrate that such performance is not attainable for multigrid on massively parallel computers, as indicated by an example of poor multigrid efficiency on 65,536 processors. The fundamental difficulty is the inability to keep 65,536 processors busy when operating on very coarse grids.

Most algorithms used for implementing applications on parallel machines have been derived directly from algorithms designed for serial machines. The previously mentioned multigrid example indicates that such ‘parallelized’ algorithms may not always be optimal. Parallel machines open the possibility of finding totally new approaches to solving standard tasks—intrinsically parallel algorithms. In particular, we present a class of superconvergent multiple-scale methods that were motivated directly by massively parallel machines. These methods differ from standard multigrid methods in an intrinsic way, and allow all processors to be used at all times, even when processing on the coarsest grid levels. Their serial versions are not sensible algorithms. The idea that parallel hardware—the Connection Machine in this case—can lead to the discovery of new mathematical algorithms was surprising to us.
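The coarse-grid argument is easy to quantify: once a multigrid level has fewer grid points than processors, most processors necessarily sit idle. The short calculation below uses hypothetical grid sizes (a 1024 x 1024 problem, at most one point per processor) purely to illustrate the effect for a 65,536-processor machine.

# Fraction of 65,536 processors that can be kept busy on each multigrid level
# of a hypothetical 1024 x 1024 2-D problem (illustration of the coarse-grid bottleneck).
P = 65_536
n = 1024
level = 0
while n >= 1:
    busy = min(n * n, P) / P
    print(f"level {level}: grid {n:>4} x {n:<4}  busy fraction = {busy:.4f}")
    n //= 2
    level += 1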


10.
Two parallel computer paradigms available today are multi-core accelerators, such as the Sony/Toshiba/IBM Cell or graphics processing units (GPUs), and massively parallel message-passing machines, such as the IBM Blue Gene (BG). The solution of systems of linear equations is one of the most CPU-intensive steps in engineering and simulation applications and can greatly benefit from the multitude of processing cores and the vectorisation available on today's parallel computers. We parallelise the conjugate gradient (CG) linear equation solver on the Cell Broadband Engine and the IBM Blue Gene/L machine. We perform a scalability analysis of CG on both machines, across 1, 8 and 16 synergistic processing elements on the Cell and 1–32 cores on BG, with heptadiagonal matrices. The results indicate that the multi-core Cell system outperforms the massively parallel BG system by three to four times, owing to the Cell's higher communication bandwidth and accelerated vector processing capability.
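For reference, the conjugate gradient iteration being benchmarked is only a few lines long. The serial sketch below runs CG on a synthetic heptadiagonal (seven-diagonal) matrix with a hypothetical stencil; it is illustrative only, since the paper's implementations target the Cell SPEs and Blue Gene/L cores.

# Serial conjugate-gradient sketch on a synthetic heptadiagonal matrix.
import numpy as np
import scipy.sparse as sp

n = 10_000
offsets = [-100, -10, -1, 0, 1, 10, 100]        # seven diagonals (hypothetical stencil)
A = sp.diags([-1.0, -1.0, -1.0, 8.0, -1.0, -1.0, -1.0], offsets,
             shape=(n, n), format="csr")        # symmetric, diagonally dominant
b = np.ones(n)

def cg(A, b, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

x = cg(A, b)
print("residual norm:", np.linalg.norm(b - A @ x))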

11.
COMLNK is a software system designed to provide the user of a communication link between two processors with facilities for file transport and job submission from one computer to another. It has been implemented for a high-speed communication link between an IBM 360/370 running under OS/MFT, MVT, VS1 or VS2 and a PDP-11 running under DOS. The various record formats which are supported by the operating systems for their file organizations can be handled in a straightforward manner, while conversion options from one format to another are possible. Files may be shipped, listed on the IBM printer, punched on the IBM punch, or submitted to the job queue.

12.
Like many architectural techniques that originated with mainframes, the use of multiple processors in a single computer is becoming popular in workstations and even personal computers. Multiprocessors constitute a significant percentage of recent workstation sales, and highly affordable multiprocessor personal computers are available in local computer stores. Once again, we find ourselves in a familiar situation: hardware is ahead of software. Because of the complexity of parallel programming, multiprocessors today are rarely used to speed up individual applications. Instead, they usually function as cycle-servers that achieve increased system throughput by running multiple tasks simultaneously. Automatic parallelization by a compiler is a particularly attractive approach to software development for multiprocessors, as it enables ordinary sequential programs to take advantage of the multiprocessor hardware without user involvement. This article looks to the future by examining some of the latest research results in automatic parallelization technology.

13.
In this paper we benchmark the performance of the Cray T3D, IBM 9076 SP/1 and Intel Paragon XP/S parallel computers, using implementations of parallel algorithms for the computation of the vector outer-product operation A = uv^T. The vector outer-product operation, although very simple in nature, requires the computation of a large number of floating-point operations, and its parallelization induces a great deal of communication between the processors. It is thus suited to measuring the relative speed of the processor, memory subsystem and network capabilities of a parallel computer. It should not be considered a ‘toy problem’, since it arises in numerical methods in the context of the solution of systems of non-linear equations – still a difficult problem to solve. We present algorithms for both the explicit shared-memory and message-passing programming models, together with theoretical computation models for those algorithms. Actual experiments were run on those computers, using Fortran 77 implementations of the algorithms. The results obtained with these experiments show that, due to the high degree of communication between the processors, one needs a parallel computer with fast communications and carefully implemented data exchange routines. The theoretical computation model allows prediction of the speed-up to be obtained for a given problem size on a given number of processors. © 1997 John Wiley & Sons, Ltd.
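A minimal message-passing version of the outer product can be written by giving each process a block of rows of u together with a full copy of v; each process then forms its block of rows of A = uv^T locally. The mpi4py sketch below is hypothetical and ignores the blocking and load-balancing refinements the paper analyses.

# Row-block parallel outer product A = u v^T: each rank owns a slice of u,
# receives all of v by broadcast, and computes its block of rows of A.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 4096                                 # global vector length (assumed divisible by size)
n_local = n // size

u_local = np.full(n_local, float(rank + 1))      # this rank's slice of u
v = np.arange(n, dtype=float) if rank == 0 else np.empty(n)
comm.Bcast(v, root=0)                            # every rank needs all of v

A_local = np.outer(u_local, v)                   # rows [rank*n_local, (rank+1)*n_local) of A
print(rank, A_local.shape)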

14.
Particle-in-cell simulations often suffer from load imbalance on parallel machines due to the competing requirements of the field-solve and particle-push computations. We propose a new algorithm that balances the two computations independently. The grid for the field-solve computation is statically partitioned. The particles within a processor's sub-domain(s) are dynamically balanced by migrating spatially compact groups of particles from heavily loaded processors to lightly loaded ones as needed. The algorithm has been implemented in the QuickSilver electromagnetic particle-in-cell code. We provide details of the implementation and present performance results for QuickSilver running models with up to a billion grid cells and particles on thousands of processors of a large distributed-memory parallel machine.
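The idea of migrating particles from overloaded to underloaded processors can be sketched independently of the field solve. The toy planner below balances only particle counts; the actual algorithm described above also keeps the migrated groups spatially compact, and all names here are hypothetical.

# Toy dynamic load balancing: plan particle transfers from over-loaded to
# under-loaded ranks until counts are near the average (illustrative only).
def plan_migrations(counts):
    """counts[i] = particles on rank i; return (src, dst, n_particles) transfers."""
    target = sum(counts) // len(counts)
    surplus = [(i, c - target) for i, c in enumerate(counts) if c > target]
    deficit = [(i, target - c) for i, c in enumerate(counts) if c < target]
    moves = []
    while surplus and deficit:
        (s, extra), (d, need) = surplus[-1], deficit[-1]
        n = min(extra, need)
        moves.append((s, d, n))
        extra, need = extra - n, need - n
        surplus[-1], deficit[-1] = (s, extra), (d, need)
        if extra == 0:
            surplus.pop()
        if need == 0:
            deficit.pop()
    return moves

print(plan_migrations([900, 120, 80, 100]))   # [(0, 3, 200), (0, 2, 220), (0, 1, 180)]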

15.
An element-by-element solution algorithm for systems of equations arising from applying the finite element method in solid mechanics was implemented on the Loosely Coupled Array of Processors (LCAP) parallel computer located at IBM Kingston. The element-by-element algorithm has previously been shown to be advantageous over direct solution algorithms for large problems on sequential computers. It also has the advantage that it can be implemented in parallel on machines such as the LCAP in a relatively straightforward manner. The results show that solution speed-up efficiencies of approximately 95% can be readily achieved with this method, with no indication that the speed-up efficiency drops off as more processors are added. The implementation used is applicable to other coarse-grained parallel architectures in addition to the LCAP computer.
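The element-by-element approach avoids assembling a global stiffness matrix: the matrix-vector product needed by an iterative solver is accumulated one element at a time, which is what makes the method easy to distribute across processors. A serial numpy sketch of that product, with hypothetical 1-D two-node elements, follows.

# Element-by-element matrix-vector product y = K x without assembling K:
# loop over elements, gather local dofs, apply the element matrix, scatter-add.
import numpy as np

n_nodes = 11
x = np.arange(n_nodes, dtype=float)
y = np.zeros(n_nodes)

# each element connects nodes (e, e+1); element stiffness of a 1-D bar
k_e = np.array([[1.0, -1.0],
                [-1.0, 1.0]])
connectivity = [(e, e + 1) for e in range(n_nodes - 1)]

for dofs in connectivity:
    idx = list(dofs)
    y[idx] += k_e @ x[idx]          # gather, local multiply, scatter-add

print(y)                            # equals the assembled global stiffness times x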

16.
With load balancing and reduced communication overhead in mind, two parallel genetic algorithms are proposed for parallel machines whose nodes carry unequal loads: an island model with dynamic load balancing and a master-slave model. Both are compared with the basic island model, and both have produced good results in practical use.
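For readers unfamiliar with the island model, the sketch below shows the baseline structure the abstract refers to: several subpopulations evolve independently and periodically exchange their best individuals. It is a serial, hypothetical illustration (no dynamic load balancing, no master-slave scheme, no MPI) of the basic model the two proposed algorithms extend.

# Minimal island-model GA sketch: independent subpopulations with periodic
# migration of each island's best individual (serial illustration only).
import random

def fitness(x):                       # toy objective: maximize -(x - 3)^2
    return -(x - 3.0) ** 2

def tournament(pop):
    a, b = random.sample(pop, 2)
    return a if fitness(a) > fitness(b) else b

def evolve(pop):
    """One generation: tournament selection, blend crossover, Gaussian mutation."""
    return [(tournament(pop) + tournament(pop)) / 2 + random.gauss(0.0, 0.1)
            for _ in pop]

random.seed(0)
islands = [[random.uniform(-10.0, 10.0) for _ in range(20)] for _ in range(4)]

for gen in range(50):
    islands = [evolve(pop) for pop in islands]
    if gen % 10 == 9:                 # migration: each island receives its neighbour's best
        bests = [max(pop, key=fitness) for pop in islands]
        for i, pop in enumerate(islands):
            worst = min(range(len(pop)), key=lambda k: fitness(pop[k]))
            pop[worst] = bests[(i - 1) % len(islands)]

print([round(max(pop, key=fitness), 3) for pop in islands])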

17.
This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed-memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on partitioning the photons equally among the processors; it uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random-number streams. These parallelization techniques are also applicable to other distributed-memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher-energy photon emitters, where explicit modeling of the phantom and collimator is required. For ¹³¹I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
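The parallelization strategy described above (divide the photons evenly among ranks, give each rank an uncorrelated random-number stream, then reduce the results) can be sketched in a few lines. In this hypothetical example, numpy's SeedSequence spawning stands in for SPRNG and a trivial "detection" test stands in for the actual SIMIND photon transport.

# Sketch of the photon-partitioning strategy: each rank simulates its share of
# the photons with an independent random stream, then results are reduced.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_photons_total = 1_000_000
n_local = n_photons_total // size               # equal share of photons per rank

# uncorrelated per-rank streams spawned from one root seed (stands in for SPRNG)
streams = np.random.SeedSequence(12345).spawn(size)
rng = np.random.default_rng(streams[rank])

# stand-in for the actual photon transport: count "detected" photons
detected_local = np.array([np.sum(rng.random(n_local) < 0.05)], dtype="i8")
detected_total = np.zeros(1, dtype="i8")
comm.Reduce(detected_local, detected_total, op=MPI.SUM, root=0)

if rank == 0:
    print("detected photons:", int(detected_total[0]))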

18.
Implementation of parallel reservoir simulation software and its application on domestically developed high-performance computers
This paper mainly describes the application of fine reservoir numerical simulation at the scale of one million grid points on domestically developed high-performance parallel computers and PC cluster systems. For several sets of real million-grid-point data from domestic oil fields, running results obtained in a variety of domestic parallel-machine environments are presented, analyzed and evaluated. On this basis, the key techniques involved in the efficient implementation of parallel reservoir simulation software are discussed, together with the bottlenecks frequently encountered in parallelizing large software packages and possible remedies.

19.
Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because Myrinet is a well-known commercial product and its behavior can be controlled by the software running in the network interfaces (the Myrinet Control Program, MCP). Usually, a Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. We propose the In-Transit Buffer (ITB) mechanism to improve network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like Myrinet, analyzing its behavior on networks with both regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by modifying only the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns, using timing parameters taken from the Myrinet network. Results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network.

20.
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and the IBM SP1 and SP2 parallel computers are documented. Spatially evolving disturbances associated with laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that the PSDNS approach can be parallelized effectively on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate how computational costs change with the number of grid points. As the number of processors is increased, slower-than-linear speed-ups are achieved with optimized (machine-dependent library) routines. This slower-than-linear speed-up results because the computational cost is dominated by the FFT routine, which yields less than ideal speed-ups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52–56 Mflops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a real-world simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32 nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32-node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application.
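The "remapping of the distributed data structure" mentioned above is, in practice, a distributed transpose: each processor exchanges blocks with every other so that the direction being Fourier transformed becomes local. The mpi4py sketch below shows that all-to-all remap for a square 2-D array; it is a hypothetical illustration (the actual PSDNS code is Fortran and three-dimensional) and assumes the array dimension is divisible by the number of ranks.

# Distributed transpose (data remap) of a row-distributed N x N array using
# MPI_Alltoall, the communication pattern behind parallel multidimensional FFTs.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 8 * size
nloc = N // size
# this rank owns rows [rank*nloc, (rank+1)*nloc) of the global array
local = np.arange(rank * nloc * N, (rank + 1) * nloc * N, dtype=float).reshape(nloc, N)

# split the local rows into one (nloc x nloc) block per destination rank
send = np.ascontiguousarray(local.reshape(nloc, size, nloc).swapaxes(0, 1))
recv = np.empty_like(send)
comm.Alltoall(send, recv)

# recv[r] is the block that lived on rank r; transpose each block into place
local_T = recv.transpose(2, 0, 1).reshape(nloc, N)
# local_T now holds rows [rank*nloc, (rank+1)*nloc) of the global transpose
print(rank, local_T[0, :4])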

