首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
作为一种典型的拉格朗日型无网格数值方法,光滑粒子流体动力学(SPH)方法在模拟自由表面流问题时具有天然优势。但是,该方法计算量大、耗时长,为此提出了一种基于粒子分解的SPH并行算法。该算法将所有粒子平均分配到各个进程进行计算,每个时间步通信仅调用一次发送、接收和广播函数,因此易于实现且可扩展性较好。应用该并行算法对二维溃坝流和三维液滴冲击液膜问题进行数值模拟,结果表明:该并行算法能显著减少模拟所消耗的计算时间,有利于进行三维大规模计算问题的数值模拟;当粒子数大于百万时,最大加速比可达30以上。  相似文献   

2.
The fast Fourier transform (FFT) is undoubtedly an essential primitive that has been applied in various fields of science and engineering. In this paper, we present a decomposition method for the parallelization of multi-dimensional FFTs with the smallest communication amounts for all ranges of the number of processes compared to previously proposed methods. This is achieved by two distinguishing features: adaptive decomposition and transpose order awareness. In the proposed method, the FFT data is decomposed based on a row-wise basis that maps the multi-dimensional data into one-dimensional data, and translates the corresponding coordinates from multi-dimensions into one dimension so that the one-dimensional data can be divided and allocated equally to the processes using a block distribution. As a result and different from previous works that have the dimensions of decomposition pre-defined, our method can adaptively decompose the FFT data on the lowest possible dimensions depending on the number of processes. In addition, this row-wise decomposition provides plenty of alternatives in data transpose, and different transpose order results in different amounts of communication. We identify the best transpose orders with the smallest communication amounts for the 3-D, 4-D, and 5-D FFTs by analyzing all possible cases. We also develop a general parallel software package for the most popular 3-D FFT based on our method using the 2-D domain decomposition. Numerical results show good performance and scaling properties of our implementation in comparison with other parallel packages. Given both communication efficiency and scalability, our method is promising in the development of highly efficient parallel packages for the FFT.  相似文献   

3.
The Schur complement domain decomposition method is used for solution of large linear systems. The algorithm is based on the subdivision of the domain into smaller ones and the solution of those sub-domains independently. Regarding water distribution systems modeling, the hydraulic simulation could be formulated as a sequence of systems of linear equations. Therefore, this paper utilizes the domain decomposition method to accelerate the simulation process further. The method is evaluated using a large scale real-world system with 63,616 junctions and 64,200 pipes as case study. The case study shows that the methodology could improve the performance of hydraulic simulation app. by a factor of 8 without losing accuracy at a suitable level of domain decomposition. Although the optimal level of decomposition is case specific, considerable speedup might still be achievable by decomposing a large system into only a few subsystems.  相似文献   

4.
为提高接触问题并行计算的效率,分析内力计算和接触计算过程的并行性,提出基于边权约束法构造接触多约束图的方法,对比和分析多约束图剖分算法和双重区域剖分算法的负载平衡和通信性能.数值实验表明,在典型二维模型中多约束图剖分算法的负载平衡性能略低于双重区域剖分算法,但仍可将负载不平衡度控制在较好的范围内,简化并行计算的通信过程,减少总通信量并降低动态通信量比例.  相似文献   

5.
非结构网格上求解中子输运方程的并行流水线Sn扫描算法   总被引:11,自引:4,他引:7  
间断有限元离散纵标方法(Sn)是广泛应用于求解高维非定常中子输运方程的数值方法,它涉及几何网格空间、速度相空间和中子能群的离散,计算量很大.该文基于非结构网格,提出了基于区域分解的并行流水线Sn扫描算法,通过设计具有不同内在并行度和通信面体比的区域分解方法和队列插入算法,对两个不同物理模型,分别使用两台并行机的92个和256个CPU,获得72倍和78倍以上的加速.可扩展性能分析表明,算法的性能非常依赖于并行机的点对点通信延迟.  相似文献   

6.
A parallel molecular dynamics simulation method, designed for large-scale problems, employing dynamic spatial domain decomposition for short-ranged molecular interactions is proposed. In this parallel cellular molecular dynamics (PCMD) simulation method, the link-cell data structure is used to reduce the searching time required for forming the cut-off neighbor list as well as for domain decomposition, which utilizes the multi-level graph-partitioning technique. A simple threshold scheme (STS), in which workload imbalance is monitored and compared with some threshold value during the runtime, is proposed to decide the proper time for repartitioning the domain. The simulation code is implemented and tested on the memory-distributed parallel machine, e.g., PC-cluster system. Parallel performance is studied using approximately one million L-J atoms in the condensed, vaporized and supercritical states. Results show that fairly good parallel efficiency at 49 processors can be obtained for the condensed and supercritical states (∼60%), while it is comparably lower for the vaporized state (∼40%).  相似文献   

7.
在分析已有的演化计算并行化实现策略的基础上,构造一种基于思维进化与空间分解并行策略的演化算法(MLSP-PEA).将思维进化与空间分解技术相结合,采用趋同与异化操作,获得了较好的解精度以及可扩展性.自强3000上的实验结果表明,在处理多维函数优化问题时,MLSP-PEA与基于空间分解并行策略的演化算法(SP-PEA)相比具有更好的解精度以及更快的收敛速度.  相似文献   

8.
In this paper, an efficient unstructured mesh calculation method in an OpenMP parallel computation using multi-core processor is proposed. This is a new domain decomposition method with two characteristics. The first characteristic is to define the size of the sub-block in the computation domain by the size of the cache memory in each core. The second one is to reduce idle time by distributing a defined sub-block for each core appropriately. Using the proposed method, a computation on compressible flow around a plane was able to achieve speed-up more than about 20% in comparison with a conventional method.  相似文献   

9.
Evolvable hardware (EHW) refers to an automatic circuit design approach, which employs evolutionary algorithms (EAs) to generate the configurations of the programmable devices. The scalability is one of the main obstacles preventing EHW from being applied to real-world applications. Several techniques have been proposed to overcome the scalability problem. One of them is to decompose the whole circuit into several small evolvable sub-circuits. However, current techniques for scalability are mainly used to evolve combinational logic circuits. In this paper, in order to decompose a sequential logic circuit, the state decomposition, output decomposition and input decomposition are united as a three-step decomposition method (3SD). A novel extrinsic EHW system, namely 3SD–ES, which combines the 3SD method with the (μ, λ) ES (evolution strategy), is proposed, and is used for the evolutionary designing of larger sequential logic circuits. The proposed extrinsic EHW system is tested extensively on sequential logic circuits taken from the Microelectronics Center of North Carolina (MCNC) benchmark library. The results demonstrate that 3SD–ES has much better performance in terms of scalability. It enables the evolutionary designing of larger sequential circuits than have ever been evolved before.  相似文献   

10.
The lattice Boltzmann method is an important technique for the numerical solution of partial differential equations because it has nearly ideal scalability on parallel computers for many applications. However, to achieve the scalability and speed potential of the lattice Boltzmann technique, the issues of data reusability in cache‐based computer architectures must be addressed. Utilizing the two‐dimensional diffusion equation, , this paper examines cache optimization for the lattice Boltzmann method in both serial and parallel implementations. In this study, speedups due to cache optimization were found to be 1.9–2.5 for the serial implementation and 3.6–3.8 for the parallel case in which the domain decomposition was optimized for stride‐one access. In the parallel non‐cached implementation, the method of domain decomposition (horizontal or vertical) used for parallelization did not significantly affect the compute time. In contrast, the cache‐based implementation of the lattice Boltzmann method was significantly faster when the domain decomposition was optimized for stride‐one access. Additionally, the cache‐optimized lattice Boltzmann method in which the domain decomposition was optimized for stride‐one access displayed superlinear scalability on all problem sizes as the number of processors was increased. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

11.
图像的形态学卡通纹理分解与图像层级分解是近年来图像分析领域的研究热点.针对现有图像分解算法求解过程复杂且不易对图像层级分解的问题,提出了一种基于算子的快速图像分解算法.在该方法中,由局部奇异线性算子来刻画图像的纹理成分,剩余图像信号则被描述为卡通成分;同时通过引入自适应参数调节的方法来实现对剩余信号的分解,进而实现图像的层级分解.其中,自适应参数由图像局部全变差变化率来确定.实验结果表明,该算法具有较好的运行效率,同时具有较好的分解视觉效果.  相似文献   

12.
With tens of petaflops supercomputers already in operation and exaflops machines expected to appear within the next 10 years, efficient parallel computational methods are required to take advantage of such extreme-scale machines. In this paper, we present a three-dimensional domain decomposition scheme for enabling large-scale electronic structure calculations based on density functional theory (DFT) on massively parallel computers. It is composed of two methods: (i) the atom decomposition method and (ii) the grid decomposition method. In the former method, we develop a modified recursive bisection method based on the moment of inertia tensor to reorder the atoms along a principal axis so that atoms that are close in real space are also close on the axis to ensure data locality. The atoms are then divided into sub-domains depending on their projections onto the principal axis in a balanced way among the processes. In the latter method, we define four data structures for the partitioning of grid points that are carefully constructed to make data locality consistent with that of the clustered atoms for minimizing data communications between the processes. We also propose a decomposition method for solving the Poisson equation using the three-dimensional FFT in Hartree potential calculation, which is shown to be better in terms of communication efficiency than a previously proposed parallelization method based on a two-dimensional decomposition. For evaluation, we perform benchmark calculations with our open-source DFT code, OpenMX, paying particular attention to the O(N)O(N) Krylov subspace method. The results show that our scheme exhibits good strong and weak scaling properties, with the parallel efficiency at 131,072 cores being 67.7% compared to the baseline of 16,384 cores with 131,072 atoms of the diamond structure on the K computer.  相似文献   

13.
To achieve scalable parallel performance in molecular dynamics simulations, we have modeled and implemented several dynamic spatial domain decomposition algorithms. The modeling is based upon the bulk synchronous parallel architecture model (BSP), which describes supersteps of computation, communication, and synchronization. Using this model, we have developed prototypes that explore the differing costs of several spatial decomposition algorithms and then use this data to drive implementation of our molecular dynamics simulator,Sigma. The parallel implementation is not bound to the limitations of the BSP model, allowing us to extend the spatial decomposition algorithm. For an initial decomposition, we use one of the successful decomposition strategies from the BSP study and then subsequently use performance data to adjust the decomposition, dynamically improving the load balance. The motivating reason to use historical performance data is that the computation to predict a better decomposition increases in cost with the quality of prediction, while the measurement of past work often has hardware support, requiring only a slight amount of work to modify the decomposition for future simulation steps. In this paper, we present our adaptive spatial decomposition algorithms, the results of modeling them with the BSP, the enhanced spatial decomposition algorithm, and its performance results on computers available locally and at the national supercomputer centers.  相似文献   

14.
A domain decomposition algorithm for molecular dynamics simulation of atomic and molecular systems with arbitrary shape and non-periodic boundary conditions is described. The molecular dynamics program uses cell multipole method for efficient calculation of long range electrostatic interactions and a multiple time step method to facilitate bigger time steps. The system is enclosed in a cube and the cube is divided into a hierarchy of cells. The deepest level cells are assigned to processors such that each processor has contiguous cells and static load balancing is achieved by redistributing the cells so that each processor has approximately same number of atoms. The resulting domains have irregular shape and may have more than 26 neighbors. Atoms constituting bond angles and torsion angles may straddle more than two processors. An efficient strategy is devised for initial assignment and subsequent reassignment of such multiple-atom potentials to processors. At each step, computation is overlapped with communication greatly reducing the effect of communication overhead on parallel performance. The algorithm is tested on a spherical cluster of water molecules, a hexasaccharide and an enzyme both solvated by a spherical cluster of water molecules. In each case a spherical boundary containing oxygen atoms with only repulsive interactions is used to prevent evaporation of water molecules. The algorithm shows excellent parallel efficiency even for small number of cells/atoms per processor.  相似文献   

15.
结合QR分解的迭代检测算法与连续干扰消除思想提出了一种新型的QR迭代检测算法。该算法充分利用最后检测层分集增益最高、性能最优的特点,在每一次QR分解之后,仅保留最后检测层的判决,在接收信号中消除已判决信号的干扰,并将信道矩阵中已判决信号的列删除,降低信道矩阵列的维数后,进行下一次QR分解,直到所有层的信号都检测出来。分析表明,新型QR迭代检测算法复杂度大约为连续干扰消除算法的1/8,约为传统迭代检测算法的1/2。仿真试验表明,对称系统中新型QR迭代检测算法性能与传统迭代检测算法基本保持一致,都要优于连续干扰消除算法。  相似文献   

16.
《国际计算机数学杂志》2012,89(1-2):123-136
This is a framework of the domain decomposition method (DDM) for solving PDEs on parallel computers. Three types of DDM: DDM with overlapping, DDM without overlapping and DDM with fictitious components are discussed in a uniform framework.  相似文献   

17.
位姿图优化(pose graph optimization, PGO)是计算机视觉领域中广泛应用的高维非凸优化算法,很难直接求解,主要依赖于迭代技术,对初始值的质量要求较高,在实践中很难得到保证。针对位姿图优化问题进行了研究,提出了基于特征分解的位姿图简单封闭解算法,该算法首先对PGO问题的最大似然估计进行半定松弛,然后将其转换为特征分解问题,并利用数据的稀疏性设计了改进的模型降阶方法进行求解,进一步提高了算法的计算速度。算法具有可伸缩性、计算成本低和精度高等优点。最后,在模拟和真实的位姿图数据集上进行实验评估,结果表明在不影响精度的情况下,该算法可以快速地进行位姿图优化。  相似文献   

18.
A scalable parallel algorithm has been designed to study long-time dynamics of many-atom systems based on the nudged elastic band method, which performs mutually constrained molecular dynamics simulations for a sequence of atomic configurations (or states) to obtain a minimum energy path between initial and final local minimum-energy states. A directionally heated nudged elastic band method is introduced to search for thermally activated events without the knowledge of final states, which is then applied to an ensemble of bands in a path ensemble method for long-time simulation in the framework of the transition state theory. The resulting molecular kinetics (MK) simulation method is parallelized with a space-time-ensemble parallel nudged elastic band (STEP-NEB) algorithm, which employs spatial decomposition within each state, while temporal parallelism across the states within each band and band-ensemble parallelism are implemented using a hierarchy of communicator constructs in the Message Passing Interface library. The STEP-NEB algorithm exhibits good scalability with respect to spatial, temporal and ensemble decompositions on massively parallel computers. The MK simulation method is used to study low strain-rate deformation of amorphous silica.  相似文献   

19.
Fast cell-based decomposition and applications to solid modeling   总被引:1,自引:0,他引:1  
As opposed to constructing a complex solid model by adding simple primitives, maximal volume decomposition is the method to decompose a solid model into simpler volumes whose union is the solid model itself. A maximal volume of a solid model is the large and simple volume that does not have a concave edge and that is not contained by any other such volume. In spite of the usefulness in solid modeling such as recognition of intersecting machining features, the scalability of the method has been a problem for practical applications. The two issues in the problem—the global effect of face extension and the heavy computational load for cell collection—should be addressed. In this paper, a volume decomposition method that handles these two issues effectively is presented. This method called fast volume decomposition decomposes a solid model significantly fast with the localized face extension and the cell collection using the seed cell.  相似文献   

20.
面向大规模可视数据的高速绘制问题,提出了一种基于区域分解的并行动态LOD(level-of-detail,层次细节模型)构建算法。算法首先改进了传统的渐进网格方法,实现了基于二次误差测度网格简化算法的渐进网格方法;接着提出了一种基于模型包围盒的区域分解算法,实现了原始模型的自适应区域分解;在每个子区域上,并行地执行渐进网格方法,实现了模型的并行动态LOD构建。实验结果表明,该算法可生成高质量的LOD模型,具备理想的加速比和可扩放性;与串行算法相比,该算法有效地提高了算法的执行效率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号