
1.

Accelerating dissipative particle dynamics with multiple GPUs

Sibo Wang Junbo Xu Hao Wen 《Computer Physics Communications》2013

In this paper, dissipative particle dynamics (DPD) simulation is implemented on multiple GPUs using NVIDIA’s Compute Unified Device Architecture (CUDA). Data communication between GPUs is handled with POSIX threads. Compared with the single-GPU implementation, this implementation provides faster computation and more storage space, so that simulations can be performed on significantly larger systems. In the benchmark, the performance of the GPUs is compared with that of Material Studio running on a single CPU core. A speedup of more than 90x is achieved by using three C2050 GPUs to simulate an 80×80×80 system. The implementation is applied to the study of the dispersancy of lubricant succinimide dispersants. A series of simulations is performed on lubricant–soot–dispersant systems to study the factors that affect dispersancy, including concentration and interaction with the lubricant, and the simulation results agree with the study in our present work.

2.

Multi-core parallelization of the Schönhage-Strassen algorithm for large integer multiplication

赵玉文 刘芳芳 蒋丽娟 杨超 《软件学报》(Journal of Software) 2018,29(12):3604-3613

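The Schönhage-Strassen algorithm (SSA), which is based on the number-theoretic transform, is one of the fastest and most widely used large-integer multiplication algorithms in practice. This paper first analyzes the principles of SSA in detail and then carries out a careful, fine-grained parallel optimization of SSA on multi-core platforms. The parallel scheme is implemented on top of the open-source big-integer library GMP and is verified and tested on an Intel x86 platform. In the tests, the maximum speedup with 8 threads reaches 6.59, and the average speedup is 6.41. The scalability of the parallel scheme is tested on an Inspur TS850 server; the experimental results show that the scheme scales well, with a maximum speedup of 21.42.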
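
To make the idea of multiplication via a number-theoretic transform concrete, here is a minimal Python sketch. It is not SSA itself and not the paper's GMP-based code: it assumes a fixed prime modulus (998244353 with primitive root 3, a common illustrative choice) and decimal digits as coefficients, whereas SSA works recursively in rings modulo 2^n + 1. The names `ntt` and `multiply` are hypothetical.

```python
MOD = 998244353  # assumed NTT-friendly prime (119 * 2**23 + 1)
ROOT = 3         # a primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative number-theoretic transform; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)  # modular inverse of the root
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def multiply(x, y):
    """Multiply two non-negative integers by convolving their decimal digits."""
    xs = [int(d) for d in reversed(str(x))]
    ys = [int(d) for d in reversed(str(y))]
    n = 1
    while n < len(xs) + len(ys):
        n <<= 1
    fa = ntt(xs + [0] * (n - len(xs)))
    fb = ntt(ys + [0] * (n - len(ys)))
    conv = ntt([p * q % MOD for p, q in zip(fa, fb)], invert=True)
    # Carry propagation turns convolution coefficients back into digits.
    digits, carry = [], 0
    for c in conv:
        carry += c
        digits.append(carry % 10)
        carry //= 10
    while carry:
        digits.append(carry % 10)
        carry //= 10
    return int("".join(str(d) for d in reversed(digits)))
```

The butterflies inside one transform stage are independent of each other, which is the kind of fine-grained parallelism a multi-core version can exploit.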

3.

The aggregation and diffusion of asphaltenes studied by GPU-accelerated dissipative particle dynamics

Sibo Wang Junbo Xu Hao Wen 《Computer Physics Communications》2014

Heavy crude oil consists of thousands of compounds, many of which have large molecular weights and complex structures. Studying the aggregation and diffusion behavior of asphaltenes can facilitate the understanding of heavy crude oil. In previous studies, the fused aromatic rings were treated as rigid bodies so that dissipative particle dynamics (DPD) integrated with the quaternion method could be used to study asphaltene systems. In this work, DPD integrated with the quaternion method is implemented on graphics processing units (GPUs). Compared with the serial program, a speedup of tens of times can be achieved when simulations are performed on a single GPU. Using multiple GPUs provides faster computation and more storage space for simulations of significantly larger systems. With large systems, simulations of the asphaltene–toluene system at extremely dilute concentrations can be performed. The diffusion coefficients determined for asphaltenes are similar to those found in experimental studies. Finally, the aggregation behavior of asphaltenes in heptane was investigated, and the simulation results agreed with the modified Yen model. Monomers, nanoaggregates and clusters were observed in the simulations at different concentrations.

4.

CPU–GPU hybrid parallel strategy for cosmological simulations

Yueqing Wang Yong Dou Song Guo Yuanwu Lei Dan Zou 《Concurrency and Computation》2014,26(3):748-765

Gadget is a simulation application for N‐body and smoothed particle hydrodynamics problems in cosmology, and it is widely applied in solving a series of cosmological problems. N‐body simulation focuses on the motion of N interacting particles, and smoothed particle hydrodynamics is a fluid simulation algorithm that studies the movement of fluid through particle simulation. Most scholars focus their attention on accelerating Gadget on multi‐core CPU or graphics processing unit (GPU) platforms. However, these research activities failed to achieve CPU–GPU hybrid computing, which resulted in a tremendous waste of CPU computing resources. In this paper, we propose a CPU–GPU hybrid parallel strategy to accelerate Gadget‐2, a massively parallel structure formation code for cosmological simulations. This strategy uses both the CPU and the GPU to calculate the short‐range force. To balance the CPU and GPU workloads, a dynamic task allocation scheme is proposed according to the difference in computational performance between the CPU and the GPU. Experimental results showed that our CPU–GPU hybrid parallel strategy achieved an overall speedup factor of 18.6 and a partial speedup factor for short‐range force calculation of 28.35 compared with a single‐core CPU implementation for particle counts on the order of millions. Moreover, compared with a GPU platform that contained 12 CPU cores and one GPU, our hybrid parallel strategy obtained overall and partial speedup gains of 6% and 20%, respectively. Furthermore, the hybrid strategy scales well: its performance improves as the problem size increases. However, the strategy also has a limitation: the performance gain diminishes if the ratio of CPU cores to GPU cards decreases. Finally, in our hybrid strategy, CPU utilization improved by 17.14% or more. Copyright © 2013 John Wiley & Sons, Ltd.
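
The abstract does not spell out the dynamic task allocation scheme, so the following Python sketch only illustrates the general idea of splitting work in proportion to measured throughput. The function names, the exponential smoothing, and the example rates are all assumptions, not details of Gadget‐2 or of the paper.

```python
def split_work(n_tasks, cpu_rate, gpu_rate):
    """Split n_tasks between CPU and GPU in proportion to their throughput.

    cpu_rate and gpu_rate are measured in tasks per second (hypothetical units).
    Returns (cpu_tasks, gpu_tasks).
    """
    gpu_tasks = round(n_tasks * gpu_rate / (cpu_rate + gpu_rate))
    return n_tasks - gpu_tasks, gpu_tasks


def update_rate(old_rate, tasks_done, elapsed_seconds, smoothing=0.5):
    """Blend the previous rate estimate with the rate observed in the last step."""
    measured = tasks_done / elapsed_seconds
    return smoothing * old_rate + (1.0 - smoothing) * measured


# Example with made-up numbers: the GPU is three times faster than the CPU.
cpu_tasks, gpu_tasks = split_work(1000, cpu_rate=1.0, gpu_rate=3.0)
```

Re-estimating the rates after every time step lets the split follow changes in the workload, so that both devices finish each step at about the same time.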

5.

Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU

Georgios Kalantzis Hidenobu Tachibana 《Computer methods and programs in biomedicine》2014

For microdosimetric calculations, event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is their extensive computational time requirement. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphics processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both the GPU and the multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single-threaded MC code on the CPU. Performance was compared in terms of speedup for a set of benchmark cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further speedup improvement of up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU–GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy.

6.

Derivation of the multisymplectic Crank–Nicolson scheme for the nonlinear Schrödinger equation

Wenjun Cai Yushun Wang Yongzhong Song 《Computer Physics Communications》2014

The Crank–Nicolson scheme, as well as its modified schemes, is widely used in numerical simulations of the nonlinear Schrödinger equation. In this paper, we prove the multisymplecticity and symplecticity of this scheme. First, we reconstruct the scheme by the concatenating method and present the corresponding discrete multisymplectic conservation law. Based on the discrete variational principle, we derive a new variational integrator that is equivalent to the Crank–Nicolson scheme, and thereby prove the multisymplecticity again from the Lagrangian framework. Symplecticity comes from the proper discrete Hamiltonian structure and symplectic integration in time. We also analyze the stability and convergence of this scheme, including the discrete mass conservation law. Numerical experiments are presented to verify the efficiency and invariant-preserving property of this scheme. Comparisons with the multisymplectic Preissmann scheme are made to show the superiority of this scheme.

7.

A stochastic Trotter integration scheme for dissipative particle dynamics

《Mathematics and computers in simulation》2007,73(2-6):190-194

In this article we show in detail the derivation of an integration scheme for the dissipative particle dynamics (DPD) model using the stochastic Trotter formula [G. De Fabritiis, M. Serrano, P. Español, P.V. Coveney, Phys. A 361 (2006) 429]. We explain some subtleties due to the stochastic character of the equations and exploit analyticity in some interesting parts of the dynamics. The DPD–Trotter integrator shows no spurious spatial correlations in the radial distribution function for an ideal gas equation of state. We also compare our numerical integrator to other available DPD integration schemes.

8.

Multi-scale simulations of plasma with iPIC3D

Stefano Markidis Giovanni Lapenta Rizwan-uddin 《Mathematics and computers in simulation》2010

The implicit Particle-in-Cell method for the computer simulation of plasma, and its implementation in a three-dimensional parallel code called iPIC3D, are presented. The implicit integration in time of the Vlasov–Maxwell system removes the numerical stability constraints and enables kinetic plasma simulations at magnetohydrodynamic time scales. Simulations of magnetic reconnection in plasma are presented to show the effectiveness of the algorithm.

9.

Distance‐based undirected formations of single‐integrator and double‐integrator modeled agents in n‐dimensional space

Kwang‐Kyo Oh Hyo‐Sung Ahn 《International Journal of Robust and Nonlinear Control》2014,24(12):1809-1820

We study the local asymptotic stability of undirected formations of single‐integrator and double‐integrator modeled agents based on interagent distance control. First, we show that n‐dimensional undirected formations of single‐integrator modeled agents are locally asymptotically stable under a gradient control law. The stability analysis in this paper reveals that the local asymptotic stability does not require the infinitesimal rigidity of the formations. Second, on the basis of the topological equivalence of a dissipative Hamiltonian system and a gradient system, we show that the local asymptotic stability of undirected formations of double‐integrator modeled agents in n‐dimensional space is achieved under a gradient‐like control law. Simulation results support the validity of the stability analysis. Copyright © 2013 John Wiley & Sons, Ltd.

10.

A superlinear speedup region for matrix multiplication

Marjan Gusev Sasko Ristov 《Concurrency and Computation》2014,26(11):1847-1868

Modern processors are built on a multicore architecture with an increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, the last-level cache (e.g., the L3 cache) is shared among several or all cores, and each core has private lower-level caches (e.g., L1 and L2 caches). Superlinear speedup is possible for a matrix multiplication algorithm executed on a shared-memory multiprocessor because of the existence of a superlinear region. This is a region where the cache requirements for matrix storage cause more cache misses in sequential execution than in parallel execution. This paper shows theoretically and experimentally that there is a region where superlinear speedup can be achieved. We provide a theoretical proof of the existence of superlinear speedup and determine the boundaries of the region where it can be achieved. The experiments confirm our theoretical results. These results will therefore have an impact on future software development and on the exploitation of parallel hardware based on a shared-memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd.
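
For reference, the standard definitions behind the term "superlinear" can be written as a short Python sketch: speedup is serial time divided by parallel time, efficiency is speedup divided by the number of cores, and speedup is superlinear when it exceeds the core count. The function names and the timings in the example are made up for illustration and are not taken from the paper.

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial to parallel execution time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_cores):
    """Speedup per core; values above 1.0 indicate superlinear speedup."""
    return speedup(t_serial, t_parallel) / n_cores


def is_superlinear(t_serial, t_parallel, n_cores):
    """True when the measured speedup exceeds the number of cores used."""
    return speedup(t_serial, t_parallel) > n_cores


# Hypothetical timings: 12.0 s on one core, 1.0 s on eight cores.
example = is_superlinear(12.0, 1.0, 8)
```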

11.

A GPU-accelerated method for high-accuracy digital ground modeling

闫长青 岳天祥 《计算机工程与应用》(Computer Engineering and Applications) 2012,48(22):22-27

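High accuracy surface modeling (HASM), developed on the basis of surface theory, can produce high-accuracy digital elevation models. However, it requires solving the large-scale linear systems that arise from discretizing partial differential equations, and the heavy computational cost severely limits its use for simulating large datasets. Meanwhile, advances in modern GPU technology have made GPUs increasingly common for accelerating general-purpose computation. To speed up HASM simulation, high-accuracy surface modeling is combined with general-purpose GPU computing, and a GPU-accelerated high-accuracy surface modeling method is proposed. The finite-difference discretization in the HASM process and the solution of the resulting large-scale linear system are each decomposed for the GPU. Using the conjugate gradient (CG) and preconditioned conjugate gradient (PCG) methods, the solve is broken into independent tasks that can be processed in parallel. A large number of threads run concurrently, each executing one independent task, which makes full use of the general-purpose computing power of modern GPUs. The parallel algorithm is implemented with NVIDIA's Compute Unified Device Architecture (CUDA) on an NVIDIA Quadro 2000 GPU. Numerical experiments and a digital elevation model (DEM) simulation for a real project area were carried out with the algorithm. The results show that, at the same surface-simulation accuracy, the GPU-accelerated HASM method achieves a speedup of more than one order of magnitude over the traditional CPU method.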
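
As a reminder of what the conjugate gradient method does, here is a minimal, unpreconditioned Python sketch for a small symmetric positive-definite system. It is a serial textbook version, not the paper's CUDA code; the helper names and the 2×2 example matrix are assumptions for illustration. The dot products and the matrix-vector product are the steps whose element-wise work is independent and can therefore be spread across many threads.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product; each row can be computed independently."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small illustrative system (not from the paper); exact solution is [1/11, 7/11].
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```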

12.

Parallelization of the Mean Shift image segmentation algorithm

李宏益 吴素萍 《中国图象图形学报》(Journal of Image and Graphics) 2013,18(12):1610-1619

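Image segmentation is a major application area of high-performance parallel computing; the time complexity of its algorithms and their real-time requirements demand continual improvement in both computer hardware and parallel algorithms. Mean Shift is a classic algorithm in image segmentation. It needs no prior knowledge, is an unsupervised segmentation process, and is widely used in practical image segmentation. This paper uses TBB (Threading Building Blocks) and CUDA (Compute Unified Device Architecture) to parallelize the Mean Shift algorithm for multi-core CPUs and GPUs (graphics processing units). It first identifies Mean Shift clustering as the most time-consuming part of the segmentation process, then parallelizes the clustering with TBB and with CUDA, and compares the two parallel approaches. The experimental results show that both parallel methods achieve good speedups, that the speedup grows with image size and with the bandwidth parameter, and that the TBB-based speedup stably approaches the number of cores.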
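
The clustering step that dominates the runtime can be illustrated with a minimal one-dimensional Mean Shift sketch in Python. It assumes a flat (uniform) kernel and scalar data, whereas image segmentation works on joint spatial and color features; the function name and sample points are hypothetical. Each starting point is shifted independently of the others, which is why the outer loop lends itself to a parallel-for or to one GPU thread per point.

```python
def mean_shift_1d(points, bandwidth, max_iter=100, tol=1e-6):
    """Move every point to the mode of its neighborhood using a flat kernel."""
    modes = []
    for x in points:  # iterations of this loop are independent of each other
        for _ in range(max_iter):
            neighbors = [p for p in points if abs(p - x) <= bandwidth]
            if not neighbors:  # defensive guard against rounding edge cases
                break
            new_x = sum(neighbors) / len(neighbors)
            converged = abs(new_x - x) < tol
            x = new_x
            if converged:
                break
        modes.append(x)
    return modes


# Two well-separated groups of made-up samples.
samples = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
modes = mean_shift_1d(samples, bandwidth=1.0)
```

Points whose modes coincide (up to a tolerance) belong to the same cluster; a larger bandwidth enlarges every neighborhood and so increases the work per iteration.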

13.

Solving very large instances of the scheduling of independent tasks problem on the GPU

Frédéric Pinel Bernabé Dorronsoro Pascal Bouvry 《Journal of Parallel and Distributed Computing》2013

In this paper, we present two new parallel algorithms to solve large instances of the scheduling of independent tasks problem. First, we describe a parallel version of the Min–min heuristic. Second, we present GraphCell, an advanced parallel cellular genetic algorithm (CGA) for the GPU. Two new generic recombination operators that take advantage of the massive parallelism of the GPU are proposed for GraphCell. A speedup study shows the high performance of the parallel Min–min algorithm on the GPU versus several CPU versions of the algorithm (both sequential and parallel using multiple threads). GraphCell improves state-of-the-art solutions, especially for larger problems, and it provides an alternative to our GPU Min–min heuristic when more accurate solutions are needed, at the expense of an increased runtime.

14.

An OpenCL implementation for the solution of the time-dependent Schrödinger equation on GPUs and CPUs

Cathal Ó Broin L.A.A. Nikolopoulos 《Computer Physics Communications》2012,183(10):2071-2080

Open Computing Language (OpenCL) is a parallel processing language that is ideally suited for running parallel algorithms on Graphics Processing Units (GPUs). In the present work we report on the development of a generic parallel single-GPU code for the numerical solution of a system of first-order ordinary differential equations (ODEs) based on the OpenCL model. We have applied the code to the Time-Dependent Schrödinger Equation of atomic hydrogen in a strong laser field and studied its performance on NVIDIA and AMD GPUs against the serial performance on a CPU. We found excellent scalability and a significant speedup of the GPU over the CPU device. The speedup in the benchmark tended towards a value of about 40, with significant speedups expected against multi-core CPUs. Furthermore, though we do not present the detailed benchmarks here, we have also achieved speedup values of around 75 by performing a slight optimization of the described algorithm.

15.

Parallelizing a powerful Monte Carlo method for electron beam dose calculations

W. Volken P. Schwab P.G. Kropf H. Neuenschwander 《Concurrency and Computation》1996,8(6):489-498

The Macro Monte Carlo (MMC) method has been developed to improve the speed of traditional Monte Carlo calculations without significant loss in accuracy. MMC runs about one order of magnitude faster for clinically relevant irradiation situations. For routine use in a clinical environment, a further speedup is necessary. The MMC code was therefore parallelized and implemented on PowerPC-based Parsytec PowerXplorer and GC Power Plus systems. The performance of the parallel code is presented and compared to the sequential implementations on standard hardware platforms. Almost linear speedup is achieved for the parallel sections of the code. Furthermore, the performance of the interprocessor communication based on a virtual topology is demonstrated for the two different machine architectures.

16.

Parallel Implementation of 2-D Telegraphic Equation on MPI/PVM Cluster

Simon Uzezi Ewedafe Rio Hirowati Shariffudin 《International journal of parallel programming》2011,39(2):202-231

In this paper, a parallel implementation of the Iterative Alternating Direction Explicit method by D’Yakonov (IADE-DY) to solve the 2-D telegraphic problem on a distributed system using the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM) is presented. The program is parallelized with a domain decomposition strategy, and a Single Program Multiple Data (SPMD) model is employed for the implementation. The implementation is discussed in terms of parallel performance strategies and analysis. The model enhances the overlap of communication and computation to avoid unnecessary synchronization; hence, the method yields significant speedup. The speedup improvement observed in the tables as the mesh size increases is in the range of 5–10%. The improvement is demonstrated in a number of tables and figures from our experiments. We present some analyses that are helpful for speedup and efficiency. It is concluded that the efficiency depends strongly on the grid size, the number of blocks and the number of processors for both MPI and PVM. Different strategies to improve the computational efficiency are proposed.

17.

Development of a parallel implicit solver of fluid modeling equations for gas discharges

Chieh-Tsan Hung Yuan-Ming Chiu Feng-Nan Hwang Jong-Shinn Wu 《Computer Physics Communications》2011,(1):161-163

A parallel fully implicit PETSc-based fluid modeling equations solver for simulating gas discharges is developed. Fluid modeling equations include: the neutral species continuity equation, the charged species continuity equation with drift-diffusion approximation for mass fluxes, the electron energy density equation, and Poisson's equation for electrostatic potential. Except for Poisson's equation, all model equations are discretized by the fully implicit backward Euler method as a time integrator, and finite differences with the Scharfetter–Gummel scheme for mass fluxes on the spatial domain. At each time step, the resulting large sparse algebraic nonlinear system is solved by the Newton–Krylov–Schwarz algorithm. A 2D-GEC RF discharge is used as a benchmark to validate our solver by comparing the numerical results with both the published experimental data and the theoretical prediction. The parallel performance of the solver is investigated.

18.

Estimating the speedup in parallel parsing

Sarkar D. Deo N. 《IEEE transactions on pattern analysis and machine intelligence》1990,16(7):677-683

A method for estimating the speedup for asynchronous bottom-up parallel parsing is presented. Two models for bottom-up parallel parsing are proposed, and the speedup for each of the two models is estimated. The speedup obtained for model A is very close to the simulation result already available in the literature; however, the model is restrictive because it can only communicate with its immediate left and right neighbors. This increases the processor coordination and interprocessor communication times. Model B, while showing a greater speedup, is expensive to construct when the number of processors is large.

19.

Consensus with guaranteed convergence rate of high-order integrator agents in the presence of time-varying delays

H.J. Savino L.C.A. Pimenta 《International journal of systems science》2016,47(10):2475-2486

This paper studies the consensus problem in directed networks of agents with high-order integrator dynamics and fixed topology. Non-uniform time-varying delays are considered in the agents' control laws for each interaction between agents and their neighbours. Based on Lyapunov–Krasovskii stability theory and algebraic graph theory, sufficient conditions, in terms of linear matrix inequalities, are given to verify whether consensus is achieved with a guaranteed exponential convergence rate. The efficiency of the proposed method is verified by numerical simulations. The simulations reveal that the conditions established in this work outperformed similar existing ones in all numerical tests performed in this paper.

20.

Parallel multi‐level 2D‐DWT on CUDA GPUs and its application in ring artifact removal

Leqing Zhu Yadong Zhou Daxing Zhang Dadong Wang Huiyan Wang Xun Wang 《Concurrency and Computation》2015,27(17):5188-5202

This paper presents two schemes for the parallel 2D discrete wavelet transform (DWT) on Compute Unified Device Architecture graphics processing units. In the first scheme, the image and filter are transformed to the spectral domain by using the Fast Fourier Transform (FFT), multiplied, and then transformed back to the space domain by using the inverse FFT. In the second scheme, the image pixels are convolved directly with the filters. Because there is no data dependence, the convolutions for data points at different positions can be executed concurrently. To reduce data transfer, the boundary extension and down‐sampling are processed during the data loading stage, and transposing is completed implicitly during data storage. A similar technique is adopted when parallelizing the inverse 2D DWT. To further speed up data access, the filter coefficients are stored in constant memory. We have parallelized the 2D DWT for dozens of wavelet types and achieved a speedup factor of over 380 compared with the CPU version. We applied the parallel 2D DWT in a ring artifact removal procedure; the execution was nearly 200 times faster than the CPU version. The experimental results showed that the proposed parallel 2D DWT on graphics processing units can significantly improve the performance for a wide variety of wavelet types and is promising for various applications. Copyright © 2015 John Wiley & Sons, Ltd.
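
The direct-convolution scheme can be illustrated in its simplest form with a single-level, one-dimensional Haar transform in Python. This is a serial sketch under simplifying assumptions (Haar filters only, even-length input, no boundary extension), not the paper's CUDA implementation; the function names are hypothetical. Every output coefficient depends only on the input samples, so all coefficients could be computed concurrently, and the down-sampling is folded into the indexing rather than done as a separate pass.

```python
import math


def haar_dwt(signal):
    """Single-level Haar DWT of an even-length sequence.

    Returns (approximation, detail) coefficient lists, each half as long
    as the input. Filtering and down-sampling by two happen in one step.
    """
    s = 1.0 / math.sqrt(2.0)
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) * s for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) * s for i in range(half)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt by up-sampling and recombining the two bands."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out


# Made-up sample row; a 2D transform applies the same step to rows, then columns.
row = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_dwt(row)
restored = haar_idwt(approx, detail)
```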