Similar literature
1.
We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important being hierarchical reordering of particles within chaining cells, which greatly improves data locality, thereby removing the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes for a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future.

2.
In modern parallel adaptive mesh computations the problem size varies during simulation. In this study we investigate the comparative behavior of four load balancing algorithms when the number of processors is dynamically changed during the lifetime of a multistage parallel computation. The focus is on communication and data movement overheads, total parallel runtime and total resource consumption. We demonstrate the main ideas for the case of six adaptive mesh refinement (AMR) applications with different kinds of growth patterns. The results presented are for a 32-processor Intel cluster connected by Ethernet.

3.
On cc-NUMA multi-processors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement (AMR). The solver is parallelized using OpenMP and the adaptive mesh refinement makes dynamic load balancing necessary. Due to the dynamically changing memory access pattern caused by the runtime adaption, it is a challenging task to achieve a high degree of geographical locality. The main conclusions of the study are: (1) that geographical locality is very important for the performance of the solver, (2) that the performance can be improved significantly using dynamic page migration of misplaced data, (3) that a migrate-on-next-touch directive works well whereas the first-touch strategy is less advantageous for programs exhibiting dynamically changing memory access patterns, and (4) that the overhead for such migration is low compared to the total execution time.

4.
We present a suite of algorithms for migrating Lagrangian data between processors in a parallel environment when the underlying mesh is Eulerian. The collection of algorithms applies to both uniform and adaptive meshes. The algorithms are implemented in, and distributed with, FLASH, a publicly available multiphysics simulation code. Migrating Lagrangian data on an Eulerian mesh is non-trivial because the Eulerian grid points are spatially fixed whereas Lagrangian entities move with the flow of a simulation. Thus, the movement of Lagrangian data cannot use the data migration methods associated with the Eulerian mesh. Additionally, when the mesh is adaptive, as the simulation progresses the grid resolution changes. The resulting regridding process can cause complex Lagrangian data migration. The algorithms presented in this paper describe Lagrangian data movement on a static uniform mesh and on an adaptive octree based block-structured mesh. Some of the algorithms are general enough to be applicable to any block structured mesh, while some others exploit the meta-data and structure of PARAMESH, the adaptive mesh refinement (AMR) package used in FLASH. We also present an analysis of the algorithms' comparative performances in different parallel environments, and different flow characteristics.

5.

In this paper, we present several important details in the process of legacy code parallelization, mostly related to the problem of maintaining the numerical output of a legacy code while obtaining a balanced workload for parallel processing. Since we maintained the non-uniform mesh imposed by the original finite element code, we had to develop a specially designed data distribution among processors so that data restrictions are met in the finite element method. In particular, we introduce a data distribution method that is initially used in shared memory parallel processing and obtain better performance than the previous parallel program version. In addition, this method can be extended to other parallel platforms such as distributed memory parallel computers. We present results including several problems related to performance profiling on different (development and production) parallel platforms. The use of new and old parallel computing architectures leads to different behavior of the same code, which in all cases provides better performance on multiprocessor hardware.


6.
FLASH is a multiphysics multiscale adaptive mesh refinement (AMR) code originally designed for simulation of reactive flows often found in astrophysics. With its wide user base and flexible applications configuration capability, FLASH has a dual task of maintaining scalability and portability in all its solvers. The scalability of fully explicit solvers in the code is tied very closely to that of the underlying mesh. Others such as the Poisson solver based on a multigrid method have more complex scaling behavior. Multigrid methods suffer from processor starvation and dominating communication costs at coarser grids with increase in the number of processors. In this paper, we propose a combination of uniform grid mesh with AMR mesh, and the merger of two different sets of solvers to overcome the scalability limitation of the Poisson solver in FLASH. The principal challenge in the proposed merger is the efficiency of the communication algorithm to map the mesh back and forth between uniform grid and AMR. We present two different parallel mapping algorithms and also discuss results from performance studies of the two implementations. Copyright © 2012 John Wiley & Sons, Ltd.

7.
A parallel algorithm is derived for solving the discrete-ordinates approximation of the neutron transport equation, based on the naturally occurring decoupling between the mesh sweeps in the various discrete directions during each iteration. In addition, the parallel Source Iteration (SI) algorithm, characterized by its coarse granularity and static scheduling, is implemented for the Nodal Integral Method (NIM) into the Parallel Nodal Transport (P-NT) code on Intel's iPSC/2 hypercube. Measured parallel performance for solutions of two test problems is used as evidence of the parallel algorithm's potential for high speedup and efficiency. The measured performance data are also used to develop and validate a parallel performance model for the total, serial, parallel, and global-summation time components per iteration as a function of the spatial mesh size, the problem size (number of mesh cells and angular quadrature order), and the number of utilized processors. The potential for high performance (large speedup at high efficiency) for large problems is explored using the performance model, and it is concluded that present applications in three-dimensional Cartesian geometry will benefit by concurrent execution on parallel computers with up to a few hundred processors. Research sponsored by the U.S. Department of Energy, managed by Martin Marietta Energy Systems, Inc., under contract No. DE-AC05-84OR21400.

8.
In this paper we consider the scalability of parallel space-filling curve generation as implemented through parallel sorting algorithms. Multiple sorting algorithms are studied and results show that space-filling curves can be generated quickly in parallel on thousands of processors. In addition, performance models are presented that are consistent with measured performance and offer insight into performance on still larger numbers of processors. At large numbers of processors, the scalability of adaptive mesh refined codes depends on the individual components of the adaptive solver. One such component is the dynamic load balancer. In adaptive mesh refined codes, the mesh is constantly changing, resulting in load imbalance among the processors and requiring a load-balancing phase. The load balancing may occur often, requiring the load balancer to perform quickly. One common method for dynamic load balancing is to use space-filling curves. Space-filling curves, in particular the Hilbert curve, generate good partitions quickly in serial. However, at tens and hundreds of thousands of processors serial generation of space-filling curves will hinder scalability. In order to avoid this issue we have developed a method that generates space-filling curves quickly in parallel by reducing the generation to integer sorting. Copyright © 2007 John Wiley & Sons, Ltd.
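
The reduction of curve generation to integer sorting mentioned in the abstract above can be illustrated with a minimal serial sketch. It is not the paper's method: it assumes a Morton (Z-order) curve instead of the Hilbert curve because the key is simpler to compute, it uses Python's built-in serial sort where the paper uses parallel sorting, and the names `interleave_bits` and `partition_by_curve` are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Morton (Z-order) key of the integer cell coordinates (x, y)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key


def partition_by_curve(cells, num_parts):
    """Order cells along the curve by sorting their integer keys, then cut the
    ordering into at most num_parts contiguous chunks of near-equal size."""
    ordered = sorted(cells, key=lambda c: interleave_bits(c[0], c[1]))
    chunk = -(-len(ordered) // num_parts)       # ceiling division
    return [ordered[i:i + chunk] for i in range(0, len(ordered), chunk)]


# Hypothetical usage: a 4 x 4 grid of cells split among 4 processors.
grid_cells = [(x, y) for x in range(4) for y in range(4)]
parts = partition_by_curve(grid_cells, 4)
```

Because the curve ordering is just a sort on integer keys, any parallel integer sort could stand in for `sorted` here, which is the property the abstract relies on.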

9.
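Compared with uniformly refined meshes, SAMR meshes can greatly reduce the number of cells and shorten computation time while maintaining the same numerical simulation accuracy. For the numerical simulation of hydrodynamic instabilities in inertial confinement fusion, a two-dimensional multi-material hydrodynamics parallel SAMR application was developed on the JASMIN framework. A compression-implosion model was simulated on several hundred CPU cores; the numerical results and the parallel performance analysis demonstrate the correctness of the application and the high efficiency of its parallel implementation.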

10.
In this paper, an object-oriented framework for numerical analysis of multi-physics applications is presented. The framework is divided into several basic sets of classes that enable the code segments to be built according to the type of problem to be solved. Fortran 2003 was used in the development of this finite element program due to its advantages for scientific and engineering programming and its new object-oriented features. The program was developed with h-type adaptive mesh refinement, and it was tested for several classical cases involving heat transfer, fluid mechanics and structural mechanics. The test cases show that the adaptive mesh is refined only in the localization region where the feature gradient is relatively high. The overall mesh refinement and the h-adaptive mesh refinement were justified with respect to the computational accuracy and the CPU time cost. Both methods can improve the computational accuracy with the refinement of mesh. The overall mesh refinement causes the CPU time cost to greatly increase as the mesh is refined. However, the CPU time cost does not increase very much with the increase of the level of h-adaptive mesh refinement. The CPU time cost can be saved by up to 90%, especially for the simulated system with a large number of elements and nodes.

11.
A parallel electrostatic Poisson's equation solver coupled with parallel adaptive mesh refinement (PAMR) is developed in this paper. The three-dimensional Poisson's equation is discretized using the Galerkin finite element method on a tetrahedral mesh. The resulting matrix equation is then solved through the parallel conjugate gradient method using the non-overlapping subdomain-by-subdomain scheme. A PAMR module is coupled with this parallel Poisson's equation solver to adaptively refine the mesh where the variation of potentials is large. The parallel performance of the parallel Poisson's equation solver is studied by simulating the potential distribution of a CNT-based triode-type field emitter. Results with ~100,000 nodes show that a parallel efficiency of 84.2% is achieved on 32 processors of a PC-cluster system. The field emission properties of a single CNT triode- and tetrode-type field emitter in a periodic cell are computed to demonstrate their potential application in field emission prediction.
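
To put the efficiency figure in the abstract above into perspective, the sketch below converts it to a speedup. It assumes the conventional definition, efficiency = speedup / number of processors, with a single-processor baseline; the paper may use a different baseline, and the function names are hypothetical.

```python
def parallel_efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """Conventional parallel efficiency: speedup divided by processor count."""
    return (t_serial / t_parallel) / processors


def speedup_from_efficiency(efficiency: float, processors: int) -> float:
    """Invert the definition above to recover the speedup."""
    return efficiency * processors


# Under the assumed definition, 84.2% efficiency on 32 processors
# corresponds to a speedup of roughly 27.
implied_speedup = speedup_from_efficiency(0.842, 32)
```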

12.
Structured adaptive mesh refinement (SAMR) methods for the numerical solution of partial differential equations yield highly advantageous ratios for cost/accuracy as compared to methods based on static uniform approximations. These techniques are being effectively used in many domains including computational fluid dynamics, numerical relativity, astrophysics, subsurface modeling, and oil reservoir simulation. Distributed implementations of these methods, however, lead to significant challenges in dynamic data-distribution, load-balancing, and runtime management. This paper presents an application-centric characterization of a suite of dynamic domain-based inverse space-filling curve partitioning techniques for the distributed adaptive grid hierarchies that underlie SAMR applications. The overall goal of this research is to formulate policies required to drive a dynamically adaptive metapartitioner for SAMR grid hierarchies capable of selecting the most appropriate partitioning strategy at runtime based on current application and system state. Such a metapartitioner can significantly reduce the execution time of SAMR applications.

13.
Adaptive mesh refinement (AMR) is a type of multiscale algorithm that achieves high resolution in localized regions of dynamic, multidimensional numerical simulations. One of the key issues related to AMR is dynamic load balancing (DLB), which allows large-scale adaptive applications to run efficiently on parallel systems. In this paper, we present an efficient DLB scheme for structured AMR (SAMR) applications. This scheme interleaves a grid-splitting technique with direct grid movements (e.g., direct movement from an overloaded processor to an underloaded processor), for which the objective is to efficiently redistribute workload among all the processors so as to reduce the parallel execution time. The potential benefits of our DLB scheme are examined by incorporating our techniques into a SAMR cosmology application, the ENZO code. Experiments show that by using our scheme, the parallel execution time can be reduced by up to 57% and the quality of load balancing can be improved by a factor of six, as compared to the original DLB scheme used in ENZO.
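
The idea of direct grid movement from an overloaded to an underloaded processor can be sketched as a simple greedy loop. This is an illustrative assumption, not the scheme from the paper or from ENZO: it omits grid splitting entirely, treats each grid as an indivisible positive workload, and the name `balance` is hypothetical.

```python
def balance(assignment, max_moves=1000):
    """Greedy direct-movement sketch.

    assignment[p] is the list of (positive) grid workloads on processor p.
    Grids are moved in place from the most loaded to the least loaded
    processor for as long as a move reduces the gap between the two.
    Returns the list of moves as (source, destination, workload) tuples.
    """
    moves = []
    for _ in range(max_moves):
        loads = [sum(grids) for grids in assignment]
        src = max(range(len(loads)), key=loads.__getitem__)
        dst = min(range(len(loads)), key=loads.__getitem__)
        gap = loads[src] - loads[dst]
        # Only a grid smaller than the gap lowers the imbalance when moved.
        candidates = [g for g in assignment[src] if g < gap]
        if not candidates:
            break
        grid = max(candidates)
        assignment[src].remove(grid)
        assignment[dst].append(grid)
        moves.append((src, dst, grid))
    return moves


# Hypothetical usage: all work starts on processor 0.
assignment = [[4, 3, 2, 1], [], []]
moves = balance(assignment)
final_loads = [sum(grids) for grids in assignment]
```

A grid that is larger than the remaining gap cannot be moved usefully, which is the situation where a splitting step such as the one the abstract describes would be needed.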

14.
Cosmological simulations of structure and galaxy formation have played a fundamental role in the study of the origin, formation and evolution of the Universe. These studies improved enormously with the use of supercomputers and parallel systems and, recently, grid based systems and Linux clusters. Now we present the new version of the tree N-body parallel code FLY that runs on a PC Linux Cluster using the one-sided communication paradigm of MPI-2 and we show the performance obtained. FLY is included in the Computer Physics Communication Program Library. This new version was developed using the Linux Cluster of CINECA, an IBM Cluster with 1024 Intel Xeon Pentium IV 3.0 GHz processors. The results show that it is possible to run a 64 million particle simulation in less than 15 minutes for each time-step, and that the code scales with the number of processors. This leads us to propose FLY as a code to run very large N-body simulations with more than 10^9 particles with the higher resolution of a pure tree code. The new version of FLY is available at the CPC Program Library, http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0.html [U. Becciani, M. Comparato, V. Antonuccio-Delogu, Comput Phys. Comm. 174 (2006) 605].

15.
I describe a Poisson solver for the adaptive mesh magnetohydrodynamics (MHD) code NIRVANA using ADI techniques (ADI: Alternating Direction Implicit). The solver is fit to the mesh refinement framework of the code and utilizes its special block-structured design. The key part of the method is an algorithm for the intelligent clustering of subgrids which permits the application of numerical methods based on dimensional operator splitting like ADI. Test problems show the convergence of this ansatz.

16.
A code for the direct numerical simulation (DNS) of incompressible flows with one periodic direction has been developed. It provides fairly good performance on both Beowulf clusters and supercomputers. Since the code is fully explicit, from a parallel point of view, the main bottleneck is the Poisson equation. To solve it, a Fourier diagonalization is applied in the periodic direction to decompose the original 3D system into a set of mutually independent 2D systems. Then, different strategies can be used to solve them. In the previous version of the code, which was conceived for low-cost PC clusters with poor network performance, a Direct Schur-complement Decomposition (DSD) algorithm was used to solve them. Such a method, which is very efficient for PC clusters, cannot be used with an arbitrarily large number of processors and mesh sizes, mainly due to the RAM requirements. To overcome this limitation, a new version of the solver is presented in this paper. It is based on the DSD algorithm, which is used as a preconditioner for a Conjugate Gradient method. Numerical experiments showing the scalability and the flexibility of the method on both the MareNostrum supercomputer and a PC cluster with a conventional 100 Mbits/s network are presented and discussed. Finally, illustrative DNS results of an air-filled differentially heated cavity at Ra = 10^11 are also presented.
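
The Fourier diagonalization step described in the abstract above can be written out as a short worked formula. This is a sketch under stated assumptions, not the paper's discretization: it assumes second-order central differences, a uniform spacing \Delta z with N_z points in the periodic direction z, and the symbols L_{xy} (the discrete 2D operator in the remaining directions), \hat{u}_k and \lambda_k are introduced here for illustration.

```latex
% Assumed: second-order central differences, N_z uniform points of spacing \Delta z
% in the periodic direction; L_{xy} is the discrete Laplacian in the other two directions.
\begin{align}
  L_{xy}\,u_j + \frac{u_{j+1} - 2u_j + u_{j-1}}{\Delta z^2} &= f_j,
    & j &= 0,\dots,N_z-1 \quad (u_{N_z} \equiv u_0), \\
  \hat{u}_k &= \sum_{j=0}^{N_z-1} u_j \, e^{-2\pi i\, jk/N_z},
    & k &= 0,\dots,N_z-1, \\
  \left( L_{xy} + \lambda_k I \right) \hat{u}_k &= \hat{f}_k,
    & \lambda_k &= -\frac{4}{\Delta z^2}\,\sin^2\!\left(\frac{\pi k}{N_z}\right).
\end{align}
```

Each wavenumber k yields one 2D system that does not involve any other wavenumber, which is why the 2D systems can be solved independently of one another.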

17.
The application of a variable time-step and unstructured adaptive mesh refinement in the parallel three-dimensional Direct Simulation Monte Carlo (DSMC) method is presented. A variable time-step method using the conservation of particle fluxes (mass, momentum and energy) across the cell interface is implemented to reduce the number of simulated particles and the number of iterations of the transient period towards steady state, without sacrificing the solution accuracy. In addition, a three-dimensional h-refined unstructured adaptive mesh with simple but effective mesh-quality control, obtained from a preliminary parallel DSMC simulation, is used to increase the accuracy of the DSMC solution. The completed code is then applied to compute several external and internal flows, and compared with previous results wherever available.

18.
In parallel adaptive mesh refinement (AMR) computations the problem size can vary significantly during a simulation. The goal here is to explore the performance implications of dynamically varying the number of processors in proportion to the problem size during simulation. An emulator has been developed to assess the effects of this approach on parallel communication, parallel runtime and resource consumption. The computation and communication models used in the emulator are described in detail. Results using the emulator with different AMR strategies are described for a test case. Results show that, for the test case, varying the number of processors reduces the total parallel communications overhead by 16 to 19% on average and improves parallel runtime by 4 to 8%. These results also show that on average resource utilization improves by more than 37%.

19.
20.
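This paper proposes MASA, a 64-bit stream architecture for scientific computing. It features strong locality, parallelism, and the decoupling of memory-access operations from computation, which makes it particularly well suited to compute-intensive parallel applications. Using a cycle-accurate simulator, the authors evaluated the performance of typical fluid dynamics applications on MASA. The results show that MASA running at 500 MHz achieves a speedup of nearly 4x over a 1.6 GHz Itanium 2, confirming the great potential of stream architectures in high-performance computing.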
