
1.

Fan Peiqin, Liu Xiaoyan, Guo Wuhong, Cui Baolong 《Computer Engineering & Science》 2020, 42(3): 404-410

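Underwater acoustic propagation models are computationally expensive and struggle to meet the demand for real-time, fine-grained underwater acoustic propagation information. To address this, a hybrid parallel computing method for the WKBZ normal-mode model was developed using MPI+OpenMP hybrid parallel programming, achieving two-level hybrid parallel computation of the underwater sound field. By passing messages between nodes and sharing memory within nodes, the method overcomes both the high communication overhead of the MPI programming model and the poor scalability of the OpenMP programming environment, and largely solves the problem of fast underwater acoustic propagation computation. Test results show that the method makes good use of the inter-node and intra-node multi-level parallelism of SMP clusters, exploits the respective strengths of the message-passing and shared-memory programming models, greatly reduces the time spent on communication between MPI processes, and effectively improves program scalability and parallel efficiency.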
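The two-level scheme can be pictured as a nested block partition of independent work items (for example, the frequencies or receiver ranges of a WKBZ normal-mode run): the outer split plays the role of MPI processes on different nodes, the inner split that of OpenMP threads sharing one node's memory. The sketch below is a hypothetical illustration, not the authors' code; `block_range` and `two_level_partition` are invented names, and ranks and threads are plain integers rather than real MPI or OpenMP handles.

```python
def block_range(n_items, n_parts, part):
    """Half-open range [start, stop) owned by `part` when n_items are split
    into n_parts contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_parts)
    # The first `extra` parts receive one additional item each.
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def two_level_partition(n_items, n_ranks, n_threads):
    """Map (rank, thread) -> list of global item indices.

    Level 1 (stand-in for MPI processes): split all items across ranks/nodes.
    Level 2 (stand-in for OpenMP threads): split each rank's block across threads.
    """
    layout = {}
    for rank in range(n_ranks):
        r_start, r_stop = block_range(n_items, n_ranks, rank)
        for thread in range(n_threads):
            t_start, t_stop = block_range(r_stop - r_start, n_threads, thread)
            layout[(rank, thread)] = list(range(r_start + t_start, r_start + t_stop))
    return layout


if __name__ == "__main__":
    for key, items in sorted(two_level_partition(10, n_ranks=2, n_threads=2).items()):
        print(key, items)
    # (0, 0) [0, 1, 2]
    # (0, 1) [3, 4]
    # (1, 0) [5, 6, 7]
    # (1, 1) [8, 9]
```

In a real hybrid code the outer block would typically be selected from the rank returned by `MPI_Comm_rank`, and the inner split would be a `#pragma omp parallel for` loop over that block, so only the outer level exchanges messages.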

2.

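Research and Analysis of Parallel Programming Models Based on SMP Cluster Systems (total citations: 4; self-citations: 1; citations by others: 4)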

Song Wei, Song Yu 《Computer Technology and Development》 2007, 17(2): 164-168

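Parallel computing is one of the major directions in the development of computer technology, and SMP systems and clusters are the current mainstream parallel architectures. Parallel programming today mainly uses MPI, based on the message-passing model, and OpenMP, based on the shared-memory model; each has its own characteristics and range of applicability. This paper analyzes the characteristics of SMP clusters, MPI, and OpenMP, and presents a feasible approach to hybrid MPI+OpenMP programming on SMP cluster systems.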

3.

Application of the MPI+TBB Hybrid Parallel Programming Model in Molecular Dynamics

Bai Mingze, Zhao Wenhui, Dou Yusheng, Sun Shixin, Wen Di 《Application Research of Computers》 2012, 29(5): 1772-1774

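To speed up molecular dynamics simulation on symmetric multiprocessing (SMP) clusters, the MPI+TBB hybrid parallel programming model is introduced into parallel molecular dynamics methods. Based on this model, a hybrid parallel algorithm is designed and implemented in the molecular dynamics package LAMMPS: across nodes, MPI with spatial decomposition provides process-level parallelism; within nodes, TBB with critical sections provides thread-level parallelism. Tests on an SMP cluster show that, for large systems and large node counts, the method clearly reduces communication time and raises the speedup by 45% over the pure MPI model. The results indicate that the MPI+TBB hybrid model benefits parallel molecular dynamics simulation with a clear gain in efficiency.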

4.

Research on Parallel Programming Models Based on SMP Cluster Systems

Tian Yuexin 《Fujian Computer》 2008, (2): 49-50

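Parallel computing is one of the major directions in the development of computer technology, and SMP systems and clusters are the current mainstream parallel architectures. Parallel programming today mainly uses MPI, based on the message-passing model, and OpenMP, based on the shared-memory model; each has its own characteristics and range of applicability. This paper analyzes the characteristics of SMP clusters, MPI, and OpenMP, and presents a feasible approach to hybrid MPI+OpenMP programming on SMP cluster systems.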

5.

Research and Analysis of Parallel Programming Models Based on SMP Cluster Systems

Song Wei, Song Yu 《Microcomputer Development》 2007, 17(2): 164-167

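Parallel computing is one of the major directions in the development of computer technology, and SMP systems and clusters are the current mainstream parallel architectures. Parallel programming today mainly uses MPI, based on the message-passing model, and OpenMP, based on the shared-memory model; each has its own characteristics and range of applicability. The characteristics of SMP clusters, MPI, and OpenMP are analyzed, and a feasible approach to hybrid MPI+OpenMP programming on SMP cluster systems is presented.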

6.

SMP Cluster: How to Exploit Two-Level Parallelism (total citations: 3; self-citations: 1; citations by others: 3)


Wang Tao, Li Xiaoming 《Computer Engineering & Science》 2002, 24(4): 78-80

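Starting from the underlying Linux operating system, this paper examines two different parallel implementation mechanisms within an SMP system: the thread model (and the OpenMP model), representing the shared-memory model, and the MPI model, representing the message-passing model. By analyzing how inter-node and intra-node parallelism should be combined, it concludes that, weighing both efficiency and ease of use, programs on a Linux SMP cluster should be written directly in MPI, using an MPI implementation that communicates through shared memory within a node.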

7.

Parallel Algorithm for a Pseudo-Particle Model of Channel Flow (total citations: 1; self-citations: 1; citations by others: 0)

Yi Feng, Guo Li, Wang Limin, Wang Xiaowei, Ge Wei 《Computers and Applied Chemistry》 2005, 22(9): 707-710

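Treating the fluid as discrete particles and applying the hard-sphere pseudo-particle model to study flow phenomena in channel flow is, like molecular dynamics simulation algorithms, an effective way to investigate the mechanisms of channel flow. To enable large-scale simulation, this paper parallelizes the model's serial program using domain decomposition and the message-passing programming model. A one-dimensional partition with one-way data transfer simplifies the parallel algorithm, and an alternating search method prevents the order of hard-sphere collisions from affecting the results. Example computations on a scalable cluster system, compared against the serial program, verify the correctness of the parallel program and show that the parallel algorithm achieves high parallel efficiency.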

8.

Key Techniques in Parallelizing Numerical Simulation of Chemical-Flooding Reservoirs

Chang Xiaodong, Hu Changjun, Li Yonghong 《Microcomputer Information》 2007, 23(28): 249-251

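This paper describes the key techniques for parallelizing UTCHEM, a numerical simulator for compound chemical flooding, on SMP-CLUSTER, a distributed-memory multicomputer parallel system. The parallel model follows the single program, multiple data (SPMD) programming model and uses domain decomposition to split the whole solution domain into subdomains, so that multiple compute nodes solve a single simulation problem simultaneously. The nodes exchange the shared data in overlapping regions by message passing to coordinate their computations. So far only the solution of the pressure equation system has been parallelized. Test results show good parallel efficiency.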
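Exchanging the shared data of overlapping regions is the classic halo (ghost-cell) pattern of SPMD domain decomposition. The sketch below simulates it serially, assuming a 1D Poisson-type equation solved with Jacobi sweeps as a stand-in for the pressure system; UTCHEM's actual discretization and solver are not described in the abstract, and the function names are hypothetical. Each subdomain keeps one ghost cell per side and refreshes it from its neighbours before every sweep, which is the data an MPI message would carry.

```python
def jacobi_serial(u, f, h, sweeps):
    """Reference: Jacobi sweeps for -u'' = f on a 1D grid with fixed end values."""
    u = list(u)
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
        u = new
    return u


def jacobi_decomposed(u, f, h, sweeps, n_sub):
    """The same sweeps with the interior split into n_sub subdomains (SPMD style).

    Each subdomain holds its owned cells plus one overlap (ghost) cell per side.
    Before every sweep the ghosts are refreshed from the neighbours' edge cells;
    in an MPI code each refresh would be a message exchange between nodes.
    """
    n = len(u)
    if not 1 <= n_sub <= n - 2:
        raise ValueError("need 1 <= n_sub <= number of interior cells")
    base, extra = divmod(n - 2, n_sub)
    bounds, start = [], 1
    for p in range(n_sub):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))  # owned global indices [start, stop)
        start += size
    local = [list(u[s - 1:e + 1]) for s, e in bounds]  # ghost + owned + ghost
    for _ in range(sweeps):
        # Halo exchange: copy each neighbour's edge cell into our ghost cell.
        for p in range(n_sub):
            if p > 0:
                local[p][0] = local[p - 1][-2]
            if p < n_sub - 1:
                local[p][-1] = local[p + 1][1]
        # Independent local sweeps over owned cells only.
        for p, (s, _) in enumerate(bounds):
            old = local[p]
            new = old[:]
            for j in range(1, len(old) - 1):
                new[j] = 0.5 * (old[j - 1] + old[j + 1] + h * h * f[s + j - 1])
            local[p] = new
    # Gather the owned cells back into one global array.
    result = [u[0]]
    for part in local:
        result.extend(part[1:-1])
    result.append(u[-1])
    return result
```

Because the ghosts are refreshed before every sweep, the decomposed run reproduces the serial iteration; a wider overlap refreshed less often would trade extra redundant computation for fewer messages.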

9.

Optimization and Parallelization of Molecular Dynamics Simulation (total citations: 2; self-citations: 1; citations by others: 2)

Zhang Qinyong, Jiang Hongchuan, Liu Cuihua 《Application Research of Computers》 2005, 22(8): 84-85

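The algorithmic and computational characteristics of molecular dynamics simulation are analyzed, and the serial program is optimized and adapted for parallelization. The simulated system is divided by domain decomposition, with partial overlap regions retained between compute nodes; using the message-passing MPI platform, the program is parallelized on a scalable cluster and achieves a parallel efficiency above 90%.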

10.

Sparse Signal Decomposition Based on MPI Parallel Computing


Liu Hao, Yang Hui, Yin Zhongke, Wang Jianying 《Computer Engineering》 2008, 34(12): 19-21

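Building on a study of sparse signal decomposition theory and its most commonly used algorithm, matching pursuit (MP), and targeting MP's excessive computational cost, this paper proposes performing sparse signal decomposition on a parallel computing system. Using 8 PCs, MPI message passing, and 100 Mbit/s Fast Ethernet as the interconnect, a Beowulf parallel computing system is built, on which the MP algorithm is implemented as a parallel program. Measurements show high parallel efficiency: decomposition time drops from about 75 min on a single machine to about 11 min on 8 machines, greatly increasing the speed of sparse signal decomposition.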
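From the reported times, the speedup is S = 75/11 ≈ 6.8 on 8 machines, a parallel efficiency of about 6.8/8 ≈ 85%. The expensive step in MP is finding, at each iteration, the dictionary atom with the largest inner product with the residual; that search splits naturally across machines, each scanning a slice of the dictionary and reporting its local best, after which a global max-reduction picks the winner. The sketch below simulates that scheme serially; the small random dictionary, the function names, and the slice-per-worker layout are assumptions for illustration, not details taken from the paper.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_best(residual, dictionary, start, stop):
    """Work of one worker: the atom in its slice with the largest |<residual, atom>|."""
    best_k, best_c = None, 0.0
    for k in range(start, stop):
        c = dot(residual, dictionary[k])
        if best_k is None or abs(c) > abs(best_c):
            best_k, best_c = k, c
    return best_k, best_c


def matching_pursuit(signal, dictionary, n_iter, n_workers):
    """MP with the atom search split into n_workers dictionary slices.

    The per-slice winners are combined by a max over |coefficient|, the role an
    MPI max-location reduction would play across machines. Atoms must be unit-norm.
    """
    residual = list(signal)
    picks = []
    n = len(dictionary)
    for _ in range(n_iter):
        candidates = []
        for w in range(n_workers):
            start, stop = w * n // n_workers, (w + 1) * n // n_workers
            if start < stop:
                candidates.append(local_best(residual, dictionary, start, stop))
        k, c = max(candidates, key=lambda kc: abs(kc[1]))
        picks.append((k, c))
        # Remove the selected atom's contribution from the residual.
        residual = [r - c * g for r, g in zip(residual, dictionary[k])]
    return picks, residual


if __name__ == "__main__":
    random.seed(0)
    atoms = [normalize([random.gauss(0.0, 1.0) for _ in range(16)]) for _ in range(40)]
    x = [random.gauss(0.0, 1.0) for _ in range(16)]
    picks, r = matching_pursuit(x, atoms, n_iter=5, n_workers=8)
    print([k for k, _ in picks], dot(r, r) / dot(x, x))
```

Splitting the dictionary does not change which atom wins, so the parallel and serial runs select the same atoms; only the search work is divided.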

11.

Achieving Scalable Parallel Molecular Dynamics Using Dynamic Spatial Domain Decomposition Techniques

Lars Nyland, Jan Prins, Ru Huai Yun, Jan Hermans, Hye-Chung Kum, Lei Wang 《Journal of Parallel and Distributed Computing》 1997, 47(2): 129

To achieve scalable parallel performance in molecular dynamics simulations, we have modeled and implemented several dynamic spatial domain decomposition algorithms. The modeling is based upon the bulk synchronous parallel architecture model (BSP), which describes supersteps of computation, communication, and synchronization. Using this model, we have developed prototypes that explore the differing costs of several spatial decomposition algorithms and then use this data to drive the implementation of our molecular dynamics simulator, Sigma. The parallel implementation is not bound to the limitations of the BSP model, allowing us to extend the spatial decomposition algorithm. For an initial decomposition, we use one of the successful decomposition strategies from the BSP study and then use performance data to adjust the decomposition, dynamically improving the load balance. The motivating reason to use historical performance data is that the computation to predict a better decomposition increases in cost with the quality of prediction, while the measurement of past work often has hardware support, requiring only a small amount of work to modify the decomposition for future simulation steps. In this paper, we present our adaptive spatial decomposition algorithms, the results of modeling them with the BSP, the enhanced spatial decomposition algorithm, and its performance results on computers available locally and at the national supercomputer centers.
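In the BSP model the cost of one superstep is usually written as w + g·h + l, where w is the largest local computation time, h the largest number of words any processor sends or receives, g the per-word communication cost, and l the barrier cost. The sketch below pairs that formula with a simple form of the history-driven rebalancing described above: per-cell times measured over recent steps are used to move 1D slab boundaries so that each processor gets about the same measured work. The 1D slab layout, the function names, and the prefix-sum cut rule are assumptions; Sigma's actual decomposition strategy is not specified in the abstract.

```python
def bsp_superstep_cost(work, h_words, g, l):
    """BSP cost of one superstep: slowest local work + g * largest h-relation + barrier l."""
    return max(work) + g * max(h_words) + l


def rebalance(cell_times, n_procs):
    """Cut a 1D row of cells into n_procs contiguous slabs of ~equal *measured* time.

    cell_times[i] is the time recorded for spatial cell i over recent steps.
    A slab may be empty if one cell alone outweighs a whole share.
    """
    total = sum(cell_times)
    cuts, acc, p = [], 0.0, 1
    for i, t in enumerate(cell_times):
        acc += t
        # Cut after cell i each time the running time passes the next p/n_procs share.
        while p < n_procs and acc >= total * p / n_procs:
            cuts.append(i + 1)
            p += 1
    edges = [0] + cuts + [len(cell_times)]
    return list(zip(edges[:-1], edges[1:]))


def loads(cell_times, bounds):
    """Measured work per processor for a given decomposition."""
    return [sum(cell_times[s:e]) for s, e in bounds]


if __name__ == "__main__":
    times = [5, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # one crowded cell on the left
    naive = [(0, 5), (5, 10)]  # equal number of cells per processor
    tuned = rebalance(times, 2)
    for bounds in (naive, tuned):
        print(bounds, loads(times, bounds), bsp_superstep_cost(loads(times, bounds), [4, 4], 0.5, 1))
```

With the skewed timings above, the equal-cell split gives loads of 9 and 5, while the measured-time split gives 7 and 7, lowering the modeled superstep cost because w is a maximum over processors.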

12.

Elimination of the computational broadcast in systolic arrays: an application to the QR decomposition algorithm

《International Journal of Computer Mathematics》 2012, 89(4): 449-469

Two types of broadcast in algorithms are identified: (1) a data broadcast, where one data value is used in more than one computation, and (2) a computational broadcast, where one variable is computed in more than one computation. When a processor array implementation in VLSI technology is desired, both types of broadcast should preferably be eliminated.

When the algorithm computes only one variable value for each index vector, the computational broadcast can be eliminated in a straightforward manner by introducing counter values, resulting in single assignment code. However, the case in which the algorithm computes two or more variable values, each specified by a different computational broadcast, has not been considered; as far as is known, it has only been solved heuristically, by deriving localized algorithms in single assignment code.

In this paper we define this problem in terms of a system of affine recurrence equations and analyze the data dependencies introduced. We then present a synthesis procedure that eliminates the computational broadcast, together with a few implementation examples. The QR decomposition algorithm is also presented in localized single assignment code obtained with the proposed method, and several different parallel implementations are discussed.
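For the simplest case mentioned above, one variable per index vector, both kinds of broadcast can be removed by giving every update its own indexed instance. In the matrix-vector product below, `y[i]` is overwritten n times (a computational broadcast) and `x[j]` is read by every row (a data broadcast); the single assignment version introduces a counter index so each partial sum is written once, and passes `x[j]` from row to row instead of broadcasting it, so every dependence links nearest neighbours as a systolic array requires. This is only the textbook one-variable illustration with hypothetical function names; the paper's procedure for several broadcast variables via affine recurrence equations is not reproduced here.

```python
def matvec_broadcast(a, x):
    """Original form: y[i] is rewritten n times and x[j] is read by every row i."""
    m, n = len(a), len(x)
    y = [0.0] * m
    for i in range(m):
        for j in range(n):
            y[i] = y[i] + a[i][j] * x[j]
    return y


def matvec_single_assignment(a, x):
    """Localized single assignment form: each variable instance is written exactly once.

    ys[i][j]: partial sum of row i after j terms (the added counter index j).
    xs[i][j]: copy of x[j] handed from row i-1 to row i instead of being broadcast.
    """
    m, n = len(a), len(x)
    ys = [[0.0] * (n + 1) for _ in range(m)]  # ys[i][0] = 0.0 is the initial value
    xs = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            xs[i][j] = x[j] if i == 0 else xs[i - 1][j]  # depends only on (i-1, j)
            ys[i][j + 1] = ys[i][j] + a[i][j] * xs[i][j]  # depends only on (i, j)
    return [ys[i][n] for i in range(m)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    print(matvec_single_assignment(a, [1, -1]))  # [-1.0, -1.0, -1.0]
```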

13.

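Accelerating Non-Bonded Force Computation in Molecular Dynamics Simulation with GPUs (total citations: 1; self-citations: 0; citations by others: 1)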

Wu Qiang, Yang Canqun, Ge Zhen, Chen Juan 《Computer Engineering & Science》 2009, 31(Z1)

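In molecular dynamics (MD) simulation, computing non-bonded forces takes a large share of the run time. This paper proposes two double-precision algorithms, based on CUDA and Brook+, that compute non-bonded forces on two mainstream GPUs from NVIDIA and AMD respectively, using the GPU's computing power to accelerate the whole MD program. The algorithms partition the MD tasks and use domain decomposition to map the non-bonded force computation onto the GPU's compute cores; two optimizations tailored to the two GPUs are proposed: shared memory within thread blocks and minimization of the data set. Performance tests show that, compared with a single core of a 2.6 GHz Intel Xeon CPU, a high-speed particle collision simulation with 432,000 particles runs 6.5 times faster on a system with an NVIDIA Tesla C1060 and 4.8 times faster on a system with an AMD HD4870.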
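A common GPU mapping for non-bonded forces assigns one thread to each particle i; the thread loops over the other particles and accumulates only the force on i, so no two threads write the same location. The sketch below writes that per-particle kernel body as a plain Python function, assuming a Lennard-Jones potential U = 4ε[(σ/r)^12 − (σ/r)^6] with a cutoff; the potential, the parameter names (`eps`, `sigma`, `rc`), and the all-pairs loop are assumptions, since the abstract gives neither the force field nor the exact CUDA or Brook+ kernels.

```python
def force_on_particle(i, pos, eps=1.0, sigma=1.0, rc=2.5):
    """Body of a hypothetical one-thread-per-particle kernel: total LJ force on particle i."""
    xi, yi, zi = pos[i]
    fx = fy = fz = 0.0
    rc2 = rc * rc
    for j, (xj, yj, zj) in enumerate(pos):
        if j == i:
            continue
        dx, dy, dz = xi - xj, yi - yj, zi - zj
        r2 = dx * dx + dy * dy + dz * dz
        if r2 >= rc2:
            continue  # beyond the cutoff: no contribution
        s2 = sigma * sigma / r2
        s6 = s2 * s2 * s2
        # (-dU/dr) / r = 24 eps (2 (sigma/r)^12 - (sigma/r)^6) / r^2
        coef = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2
        fx += coef * dx
        fy += coef * dy
        fz += coef * dz
    return fx, fy, fz


def all_forces(pos, **params):
    """'Launch' the kernel for every particle; on a GPU these calls would run concurrently."""
    return [force_on_particle(i, pos, **params) for i in range(len(pos))]


if __name__ == "__main__":
    print(all_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]))
    # [(-24.0, 0.0, 0.0), (24.0, 0.0, 0.0)]
```

Each pair is evaluated twice (once by each particle's thread), which doubles the arithmetic but avoids write conflicts; that trade-off is typical of simple GPU force kernels.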

14.

A Practical Parallel Computation Model (total citations: 11; self-citations: 0; citations by others: 11)

Ji Yongchang, Ding Weiqun, Chen Guoliang, An Hong 《Chinese Journal of Computers》 2001, 24(4): 437-441

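For today's popular workstation cluster environments and various parallel machine systems, this paper proposes a practical parallel computation model: the Nondedicated Heterogeneous Barrier LogGP (NHBL) model, based on LogGP. It aims to capture how the heterogeneous and non-dedicated nature of network-of-workstations (NOW) environments affects the design and analysis of parallel algorithms. The NHBL model is then used to analyze the cost of the PSRS algorithm on NHPCC-Cluster, the workstation cluster of the National High Performance Computing Center (Hefei), and on the Dawning-1000 MPP, and the analysis is validated against measured results.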
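Under LogGP, the time to send one k-byte message is commonly modeled as o + (k − 1)G + L + o (send overhead, per-byte gap for long messages, network latency, receive overhead). PSRS (parallel sorting by regular sampling), the algorithm analyzed above, runs in four phases: each processor sorts its block and takes regular samples; the samples are sorted and p − 1 pivots chosen; every block is split at the pivots and the pieces are exchanged all-to-all; each processor merges what it receives. The sketch below simulates those phases serially, with lists standing in for processors; the sampling and pivot positions follow one common textbook variant and are not necessarily the exact choices the authors analyzed.

```python
import bisect
import heapq


def psrs(data, p):
    """Serial simulation of PSRS on p 'processors'; returns each processor's sorted output."""
    n = len(data)
    if not 1 <= p <= n:
        raise ValueError("need 1 <= p <= len(data)")
    # Phase 1: each processor sorts its block and takes p regularly spaced samples.
    blocks = [sorted(data[j * n // p:(j + 1) * n // p]) for j in range(p)]
    samples = []
    for b in blocks:
        m = len(b)
        samples.extend(b[i * m // p] for i in range(p))
    # Phase 2: sort the p*p samples and choose p-1 pivots at regular positions.
    samples.sort()
    pivots = [samples[k * p + p // 2 - 1] for k in range(1, p)]
    # Phase 3: split every block at the pivots; piece k goes to processor k
    # (the all-to-all exchange whose messages a LogGP-style model would cost).
    incoming = [[] for _ in range(p)]
    for b in blocks:
        cuts = [0] + [bisect.bisect_right(b, pv) for pv in pivots] + [len(b)]
        for k in range(p):
            incoming[k].append(b[cuts[k]:cuts[k + 1]])
    # Phase 4: each processor merges its sorted pieces; outputs are ordered by rank.
    return [list(heapq.merge(*pieces)) for pieces in incoming]


if __name__ == "__main__":
    parts = psrs([(i * 37) % 101 for i in range(200)], 4)
    print([len(part) for part in parts])
```

Concatenating the per-processor outputs in rank order yields the fully sorted sequence, because every element in piece k lies between pivots k − 1 and k.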

15.

Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics

《Journal of Molecular Graphics and Modelling》 2013

Reactive force field (ReaxFF), a recent and novel bond order potential, allows reactive molecular dynamics (ReaxFF MD) simulations to model larger and more complex molecular systems involving chemical reactions than computation-intensive quantum mechanical methods can. However, ReaxFF MD can be approximately 10–50 times slower than classical MD because of its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and a time-step one order of magnitude smaller than that of classical MD, all of which pose significant computational challenges to reaching spatio-temporal scales of nanometers and nanoseconds. Recent advances in graphics processing units (GPUs) provide not only highly favorable performance for GPU-enabled MD programs compared with CPU implementations but also an opportunity to cope with the computing-power and memory demands that ReaxFF MD imposes on computer hardware. In this paper, we present the algorithms of GMD-Reax, the first GPU-enabled ReaxFF MD program, whose performance significantly surpasses that of CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with an NVIDIA C2050 GPU for coal pyrolysis simulation systems with 1378 to 27,283 atoms. GMD-Reax achieved speedups of up to 12 times over Duin et al.'s FORTRAN codes in LAMMPS on 8 CPU cores and 6 times over the LAMMPS C codes based on PuReMD, in terms of simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploring very complex molecular reactions via ReaxFF MD simulation on desktop workstations.

16.

Implementation of a Parallel Spatial Interpolation Algorithm on Graphics Processing Units


Zhao Yanwei, Cheng Zhenlin, Dong Hui, Fang Jinyun 《Journal of Image and Graphics》 2012, 17(4): 575-581

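Spatial interpolation is a computationally complex and time-consuming operation in geographic information system (GIS) spatial analysis and therefore cannot meet real-time requirements. As the floating-point capability of graphics processing units (GPUs) has risen sharply, general-purpose GPU computing has become a research focus for complex computations in GIS, offering a good opportunity to make some traditionally inefficient algorithms run in real time. Exploiting the GPU's strengths in parallel computing, the inverse distance weighting (IDW) interpolation algorithm is mapped onto the Compute Unified Device Architecture (CUDA) parallel programming architecture. First, a two-level index is built on the GPU to partition the computation hierarchy appropriately; then a multithreaded blocking strategy performs the interpolation in parallel. Experiments show that the interpolation error relative to the CPU method stays on the order of 10⁻⁶, and that with large interpolation radii and large amounts of data the algorithm achieves speedups above 40×, demonstrating the correctness and efficiency of the method.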
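Inverse distance weighting estimates the value at a query point as a weighted mean of nearby samples, with weights w_i = 1/d_i^p. Every query point is computed independently, which is why IDW maps well onto one GPU thread per output grid node. The sketch below is a plain Python version of that per-point computation with a fixed search radius; the power p = 2, the radius rule, and the exact-hit shortcut are common choices assumed here, and the paper's two-level GPU index is not reproduced.

```python
import math


def idw(query, samples, radius, power=2.0):
    """IDW estimate at `query` from (x, y, value) samples lying within `radius`.

    Returns None when no sample falls inside the search radius.
    """
    qx, qy = query
    num = den = 0.0
    for x, y, v in samples:
        d = math.hypot(x - qx, y - qy)
        if d == 0.0:
            return v  # query coincides with a sample: return its value directly
        if d <= radius:
            w = 1.0 / d ** power
            num += w * v
            den += w
    return num / den if den > 0.0 else None


def idw_grid(xs, ys, samples, radius, power=2.0):
    """Interpolate every grid node independently (one thread per node in a CUDA version)."""
    return [[idw((x, y), samples, radius, power) for x in xs] for y in ys]


if __name__ == "__main__":
    pts = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
    print(idw_grid([0.0, 0.5, 1.0, 2.0], [0.0], pts, radius=5.0))
```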

17.

A Performance Optimization Model for Molecular Dynamics Simulation Based on an Improved Cell-Linked-List Algorithm

Jin Mingcan, Hu Changjun, Li Jianjiang, Miao Qingsong 《Computer Science》 2013, 40(2): 12-15

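In the improved cell-linked-list algorithm, reducing the cell size lowers both the communication volume and the number of interparticle distance computations, but increases the number of neighbor cells. The multi-cell molecular dynamics algorithm is a parallel algorithm widely used in molecular dynamics simulation. By applying the basic idea of the improved cell-linked-list algorithm to the multi-cell MD algorithm, this paper derives a performance evaluation model for MD simulation and, from it, proposes an optimization model to accelerate MD simulation. Experimental results show that the cell size chosen by the optimization model improves the performance of the MD simulation program.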
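The trade-off above can be made concrete. With cutoff r_c and cells of side r_c/m, a particle must visit the (2m + 1)^3 surrounding cells (27 for m = 1, 125 for m = 2), but the searched volume shrinks from (3r_c)^3 = 27r_c^3 to (5r_c/2)^3 ≈ 15.6r_c^3, so fewer candidate pairs need a distance test at the price of visiting more cells. The sketch below builds a linked-cell structure in a periodic cubic box with a configurable m and counts the distance evaluations; the periodic cube, the function names, and the counting are illustrative assumptions, not the authors' performance model.

```python
import itertools


def build_cells(pos, box, rc, m):
    """Bin particles (coordinates in [0, box)) into cubic cells of side >= rc/m."""
    nc = int(box // (rc / m))  # cells per dimension
    if nc < 2 * m + 1:
        raise ValueError("box too small for this cutoff and m")
    side = box / nc
    cells = {}
    for idx, (x, y, z) in enumerate(pos):
        key = (int(x // side) % nc, int(y // side) % nc, int(z // side) % nc)
        cells.setdefault(key, []).append(idx)
    return cells, nc


def count_pairs(pos, box, rc, m):
    """Return (pairs closer than rc, distance evaluations) using a linked-cell search."""
    cells, nc = build_cells(pos, box, rc, m)
    offsets = list(itertools.product(range(-m, m + 1), repeat=3))  # (2m+1)^3 cells
    rc2, found, evaluated = rc * rc, 0, 0
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in offsets:
            other = cells.get(((cx + dx) % nc, (cy + dy) % nc, (cz + dz) % nc), [])
            for i in members:
                for j in other:
                    if j <= i:
                        continue  # count each unordered pair once
                    evaluated += 1
                    d = [pos[i][k] - pos[j][k] for k in range(3)]
                    d = [c - box * round(c / box) for c in d]  # minimum image
                    if d[0] ** 2 + d[1] ** 2 + d[2] ** 2 < rc2:
                        found += 1
    return found, evaluated
```

Both choices of m find the same pairs; the difference lies only in how many candidate distances are computed versus how many cells are traversed, which is the balance a performance model of this kind would tune.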

18.

Adaptive Parallel Computation for Monte Carlo Simulation of Time-Dependent Particle Transport

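Deng Li, Yuan Guoxing, Huang Zhengfeng, Xu Haiyan, Wang Ruihong, Li Shu 《Journal on Numerical Methods and Computer Applications》 2003, 24(2): 111-115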

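§1. Introduction. Solving the Boltzmann equation by Monte Carlo simulation (hereafter MC) with continuous cross sections and exact angular distributions yields ideal results. However, its high computational cost is MC's greatest drawback relative to other methods, and parallel computing with high speedup is a feasible way to overcome it.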

19.

Algorithm optimization in molecular dynamics simulation

Di-Bao Wang, Fei-Bin Hsiao 《Computer Physics Communications》 2007, 177(7): 551-559

Establishing the neighbor list to efficiently calculate the inter-atomic forces consumes the majority of computation time in molecular dynamics (MD) simulation. Several algorithms have been proposed in recent years to improve computational efficiency for short-range interactions, although an optimized numerical algorithm has not been provided. Based on a rigorous definition of the Verlet radius with respect to temperature and list-updating interval in MD simulation, this paper develops an estimation formula for the computation time of each MD algorithm so as to find the optimal performance of each algorithm. With the proposed formula, the best algorithm can be chosen according to the total number of atoms, the average system density, and the average system temperature of the MD simulation. It is shown that the Verlet Cell-linked List (VCL) algorithm outperforms the other algorithms for systems with a large number of atoms. Furthermore, a generalized VCL algorithm optimized with respect to list-updating interval and cell-dividing number is analyzed and verified to reduce computation time by 30–60% in an MD simulation of a two-dimensional lattice system. Owing to the similarity of the underlying problems, the analysis in this study can be extended to other many-particle systems.
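A Verlet neighbor list stores, for each atom, all atoms within the Verlet radius r_v = r_c + r_s, where r_s is a "skin". The list can be reused for several steps as long as no atom has moved more than r_s/2 since the last rebuild: then no pair's separation can have shrunk by more than r_s, so no pair outside r_v can have come within r_c. The sketch below shows a brute-force O(N²) list build and that displacement-based rebuild test in a non-periodic 3D setting; the function names are hypothetical, and the VCL algorithm described above builds the list with cells instead, choosing r_v and the update interval from temperature, which is not reproduced here.

```python
import math


def build_verlet_list(pos, rc, skin):
    """Pairs (i, j), i < j, closer than rc + skin (brute force, non-periodic)."""
    rv2 = (rc + skin) ** 2
    pairs = []
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            if sum((a - b) ** 2 for a, b in zip(pos[i], pos[j])) < rv2:
                pairs.append((i, j))
    return pairs


def needs_rebuild(pos, pos_at_build, skin):
    """True once any atom has moved more than skin/2 since the list was built."""
    max_disp = max(math.dist(p, q) for p, q in zip(pos, pos_at_build))
    return max_disp > 0.5 * skin


def pairs_within_cutoff(pos, pairs, rc):
    """The force loop scans only listed pairs but still applies the exact cutoff test."""
    return [(i, j) for i, j in pairs if math.dist(pos[i], pos[j]) < rc]


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.9, 0.0, 0.0), (5.0, 0.0, 0.0)]
    vlist = build_verlet_list(atoms, rc=2.5, skin=0.5)
    print(vlist)  # [(0, 1), (0, 2), (1, 2), (2, 3)]
    print(pairs_within_cutoff(atoms, vlist, 2.5))  # [(0, 1), (1, 2), (2, 3)]
```

A larger skin lets the list survive more steps but makes every force evaluation scan more pairs, which is the trade-off behind tuning the list-updating interval.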