Similar literature
 20 similar documents found (search time: 15 ms)
1.
Graphics processing units (GPUs) suitable for general-purpose numerical computation are now available with performance in excess of 1 teraflops, one to two orders of magnitude faster than conventional desktop CPUs. Monte Carlo particle transport algorithms are ideally suited to parallel processing architectures and so are good candidates for acceleration using a GPU. We have developed a general-purpose code that computes the transport of high-energy (>1 keV) photons through arbitrary 3-dimensional geometry models, simulates their physical interactions, and performs tallying and variance reduction. We describe a new algorithm, the particle-per-block technique, that provides a good match with the underlying GPU multiprocessor hardware design. Benchmarking against an existing CPU-based simulation running on a single core of a commodity desktop CPU demonstrates that our code can accurately model X-ray transport, with an approximately 35-fold speed-up.

2.
3.
GPU (CUDA C) and multi-core (OpenMP) versions of the Reaction Ensemble Monte Carlo (REMC) method are presented. The REMC algorithm is a powerful tool for investigating the equilibrium behavior of chemically reacting systems under highly non-ideal conditions. Both the GPU and the multi-core versions of the code are particularly efficient when the total potential energy of the system must be calculated, as in constant-pressure systems. Results obtained for a helium plasma at high pressure show differences between the real and ideal cases.

4.
Statistical tests are often performed to discover which experimental variables are reacting to specific treatments. Time-series statistical models usually require the researcher to make assumptions about the distribution of measured responses, and these assumptions may not hold. Randomization tests can be applied to data in order to generate null distributions non-parametrically. However, large numbers of randomizations are required for the precise p-values needed to control false discovery rates. When testing tens of thousands of variables (genes, chemical compounds, or otherwise), significant q-value cutoffs can be extremely small (on the order of 10^-5 to 10^-8). This requires high-precision p-values, which in turn require large numbers of randomizations. The NVIDIA® Compute Unified Device Architecture® (CUDA®) platform for general-purpose computing on graphics processing units (GPGPU) was used to implement an application that performs high-precision randomization tests via Monte Carlo sampling for quickly screening custom test statistics in experiments with large numbers of variables, such as microarrays, next-generation sequencing read counts, chromatographic signals, or other abundance measurements. The software has been shown to achieve a speedup of more than 12-fold on a graphics processing unit (GPU) compared with a powerful central processing unit (CPU). The main limitation is concurrent random access to shared memory on the GPU. The software is available from the authors.
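The link between p-value precision and the number of randomizations can be made concrete with a small sketch. The Python code below is a plain CPU illustration, not the authors' CUDA application: the function names, the test statistic (absolute difference in group means), and the add-one estimator are all assumptions chosen for the example. With B randomizations, the add-one estimate can never fall below 1/(B + 1), so resolving p-values near 10^-8 needs on the order of 10^8 randomizations per variable.

```python
import random


def permutation_p_value(treated, control, n_randomizations, seed=0):
    """Monte Carlo randomization test (hypothetical helper).

    The statistic is the absolute difference in group means; any custom
    statistic could be substituted.
    """
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n_treated = len(treated)

    def statistic(values):
        a = values[:n_treated]
        b = values[n_treated:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = statistic(pooled)
    hits = 0
    for _ in range(n_randomizations):
        rng.shuffle(pooled)  # random relabelling of the observations
        if statistic(pooled) >= observed:
            hits += 1
    # Add-one correction: the observed labelling counts as one randomization.
    return (hits + 1) / (n_randomizations + 1)


def min_resolvable_p(n_randomizations):
    """Smallest p-value the add-one estimator can report."""
    return 1.0 / (n_randomizations + 1)
```

Each randomization is independent of the others, which is why this kind of loop is a natural candidate for one-thread-per-randomization execution on a GPU.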

5.
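A GPU-accelerated implementation of the Gaussian blur algorithm based on OpenACC   Total citations: 1 (self-citations: 0, by others: 1)
To address the coding complexity and low efficiency of GPU acceleration through low-level APIs, this paper uses the middle-layer OpenACC acceleration technique to rewrite traditional serial code, with the aim of improving development efficiency and simplifying the code. Taking the traditional serial Gaussian blur algorithm as the target, the paper adds OpenACC directives to it, proposes a GPU-accelerated algorithm based on OpenACC directives, and analyzes and explains the algorithm flow. A comparison with native CUDA and serial Gaussian blur results shows that, as the number of processed pixels increases, the performance of the serial Gaussian blur changes exponentially, whereas that of CUDA and OpenACC changes linearly. The results show that, without changing the structure of the original non-parallel code, adding efficient OpenACC directives yields image-processing quality and performance close to those of CUDA, with higher code development efficiency than CUDA.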

6.
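Sparse matrix-vector multiplication (SpMV) is an important problem in solving sparse linear systems, but because the nonzero elements are sparse, the computational density is low and the computation is inefficient. To handle the irregularities of sparse matrices, SpMV can use a hybrid storage format, which improves the compression efficiency for sparse matrices and widens the range of applicability. HYB is a widely used hybrid compressed format with fairly stable performance. As GPU parallel computing has become common and CPUs have become increasingly multi-core, building heterogeneous parallel computing systems from GPUs and multi-core CPUs has gained wide acceptance. Based on the storage characteristics of the ELL and COO parts of the HYB format, the two parts of the data are assigned separately to the CPU and the GPU for cooperative parallel computation, which makes full use of the computing resources of both processors and exploits their respective computational characteristics, improving the utilization of computing resources. Based on an analysis of the characteristics of the CPU+GPU heterogeneous computing model, data partitioning and sharing for the hybrid format are optimized, which better exploits the advantages of the heterogeneous computing environment and improves computing performance.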
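The HYB split described above can be sketched in a few lines. The Python code below is an illustrative assumption, not the paper's implementation: the function names and the `ell_width` parameter are hypothetical, and both parts run sequentially on the CPU. It only shows that the ELL part (a fixed number of nonzeros per row, padded) and the COO part (the overflow entries) can be multiplied independently and their partial results summed, which is what makes it possible to hand the two parts to different processors.

```python
def to_hyb(dense, ell_width):
    """Split a dense matrix into an ELL part and a COO overflow part."""
    ell_cols, ell_vals, coo = [], [], []
    for i, row in enumerate(dense):
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        head, tail = nonzeros[:ell_width], nonzeros[ell_width:]
        pad = ell_width - len(head)
        # ELL rows all have the same length; -1 marks a padding slot.
        ell_cols.append([j for j, _ in head] + [-1] * pad)
        ell_vals.append([v for _, v in head] + [0.0] * pad)
        # Entries beyond ell_width go to the coordinate (COO) list.
        coo.extend((i, j, v) for j, v in tail)
    return ell_cols, ell_vals, coo


def spmv_ell(ell_cols, ell_vals, x):
    """y = A_ell * x for the regular, padded part."""
    y = [0.0] * len(ell_cols)
    for i, (cols, vals) in enumerate(zip(ell_cols, ell_vals)):
        for j, v in zip(cols, vals):
            if j >= 0:
                y[i] += v * x[j]
    return y


def spmv_coo(coo, x, n_rows):
    """y = A_coo * x for the irregular overflow part."""
    y = [0.0] * n_rows
    for i, j, v in coo:
        y[i] += v * x[j]
    return y


def spmv_hyb(dense, x, ell_width):
    """Multiply both parts independently, then add the partial results."""
    ell_cols, ell_vals, coo = to_hyb(dense, ell_width)
    y_ell = spmv_ell(ell_cols, ell_vals, x)
    y_coo = spmv_coo(coo, x, len(dense))
    return [a + b for a, b in zip(y_ell, y_coo)]
```

For example, `spmv_hyb([[1, 0, 2, 3], [0, 4, 0, 0], [5, 6, 7, 8]], [1, 2, 3, 4], 2)` keeps two nonzeros per row in the ELL part and moves three entries to the COO part.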

7.
We have developed a high-performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language to improve code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared- and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures, including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors, and a 200 dual-processor HP cluster. For large problem sizes, which are limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow the study of higher particle energies with more accurate physical models, and improve statistics because more particle tracks can be simulated with a short response time.

8.
In this paper, we demonstrate a speedup opportunity for Monte Carlo simulation on the graphics processing unit architecture, with a financial application. We reduce the volume of random numbers actually generated by replacing part of the generation phase with shuffling, using Compute Unified Device Architecture's built‐in shuffle instructions. We study various shuffling patterns and durations and select the best among them with regard to induced correlation, using the Granger causality test. We then study the accuracy and variance of the results achieved by our general‐purpose computing on graphics processing unit shuffled Monte Carlo, which halves the computational time while the error remains marginal. Copyright © 2015 John Wiley & Sons, Ltd.

9.
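Comparison and analysis of several matrix multiplication methods on CPU and GPU   Total citations: 1 (self-citations: 0, by others: 1)
This paper describes three implementations of matrix multiplication on the CPU and four CUDA-based implementations on the GPU, and analyzes why the high-performance methods are fast. Their common feature is that they organize and use data sensibly, which effectively reduces memory-access overhead and greatly increases the speed of the algorithm. The best CPU implementation is more than 200 times faster than the ordinary algorithm, and the best GPU implementation is about 6 times faster than the best CPU implementation.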

10.
In wireless communication, the Viterbi decoding algorithm (VDA) is one of the most popular channel decoding algorithms and is widely used in WLAN, WiMAX, and 3G communications. However, the throughput of the Viterbi decoder is constrained by the convolutional characteristic. Recently, the three‐point VDA (TVDA) was proposed to solve this problem. In TVDA, the whole procedure can be divided into three phases: the forward, trace‐back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on the multi‐core CPU, graphics processing unit (GPU), and field programmable gate array (FPGA). We demonstrate approaches that fully exploit its performance potential on CPU, GPU, and FPGA computing platforms. For CPU platforms, we apply two optimization methods, single instruction multiple data and multithreading, to gain over 145× speedup over the naive CPU version on a quad‐core CPU platform. For GPU platforms, we propose a combination of cached memory optimization, coalesced global memory accesses, a codeword packing scheme, and asynchronous data transition, achieving a throughput of 404.65 Mbps, a 12× speedup over initial GPU versions on an NVIDIA GeForce GTX580 card, and a 7× speedup over an Intel quad‐core CPU i5‐2300, with both devices from the same manufacturing year and both fully optimized. In addition, for FPGA platforms, we customize a radix‐4 pipelined architecture for the TVDA in a 45‐nm FPGA chip from Xilinx (XC6VLX760). At a 209.15‐MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we discuss the performance evaluation and efficiency comparison of different flexible architectures for real‐time Viterbi decoding in terms of decoding throughput, power consumption, optimization schemes, programming costs, and price costs. Copyright © 2013 John Wiley & Sons, Ltd.

11.
Processor evolution has reached a critical point at which it will soon be impossible to increase the clock frequency much further. Processor designers such as Motorola, Intel and IBM have all realised that the only way to improve the FLOP/Watt ratio is to develop multi-core devices. One of the most current examples of multi-core processors is the new Sony/Toshiba/IBM Cell/B.E. multi-core processor. Because they are so well suited to running in parallel, Monte Carlo methods are often considered embarrassingly parallel. This paper describes how a common Monte Carlo based financial simulation can be calculated in parallel using the Cell/B.E. multi-core processor. The measured performance with the achieved multi-core speed-up is also presented. With the growing availability of this technology, financial simulations can now be performed in a fraction of the time they used to take. This can also be achieved with a limited power and volume budget using commercially available technology. The main challenge with multi-core devices is clearly programmability. The work presented here describes how this challenge could be dealt with. A basic MPI library has been developed to handle the partitioning and communication of data. The thread creation follows a POSIX thread creation model. MPI together with POSIX makes the application portable between various multi-processor systems and multi-core devices. The conclusions indicate that a function-offload MPI implementation on the Cell/B.E. multi-core processor can efficiently be used to speed up the Monte Carlo solution of financial simulations. The conclusions are also applicable to other situations where an algorithm can be easily parallelized.

12.
Quantum Monte Carlo methods enable us to determine the ground-state properties of atomic or molecular clusters. Here, we present a reconfigurable computing architecture using Field Programmable Gate Arrays (FPGAs) to accelerate two computationally intensive kernels of a Quantum Monte Carlo (QMC) application applied to N-body systems. We focus on two key kernels of the QMC application: acceleration of potential energy and wave function calculations. We compare the performance of our application on two reconfigurable platforms. Firstly, we use a dual-processor 2.4 GHz Intel Xeon augmented with two reconfigurable development boards consisting of Xilinx Virtex-II Pro FPGAs. Using this platform, we achieve a speedup of 3× over a software-only implementation. Following this, the chemistry application is ported to the Cray XD1 supercomputer equipped with Xilinx Virtex-II Pro and Virtex-4 FPGAs. The hardware-accelerated application on one node of the high performance system equipped with a single Virtex-4 FPGA yields a speedup of approximately 25× over the serial reference code running on one node of the dual-processor dual-core 2.2 GHz AMD Opteron. This speedup is mainly attributed to the use of pipelining, the use of fixed-point arithmetic for all calculations and the fine-grained parallelism using FPGAs. We can further enhance the performance by operating multiple instances of our design in parallel.

13.
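With the surge in industrial computing demand, computational fluid dynamics (CFD) is paying increasing attention to computational efficiency. Starting from their in-house Navier-Stokes solver, the authors introduce a multigrid convergence-acceleration algorithm and combine it with the NVIDIA GPU computing platform, accelerating CFD from both the numerical-method side and the high-performance-computing side. Tests on numerical acceleration cases show that the multigrid-based GPU solver achieves a speedup of more than 45 times in double precision relative to the CPU version of the code.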

14.
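Parallelization of the Monte Carlo neutron-photon transport code MCNP   Total citations: 2 (self-citations: 0, by others: 2)
1. Introduction. With the advent of parallel computers, parallel algorithms and parallel systems have also developed continuously, for example PVM (Parallel Virtual Machine), SMP (Shared Memory Processors), MPI (Message Passing Interface), and HPF (High Performance Fortran). These parallel systems work on essentially the same principles and differ mainly in their parallel instructions and data-transfer methods. Among them, PVM and MPI are highly general, small in system size, easy to use, and highly portable, and they are comparatively easy to install, test, program, and implement; they are currently internationally recognized…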

15.
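With the substantial increase in GPU (graphics processing unit) performance and the development of GPU programmability, GPU-based ray tracing has gradually become a research hotspot. Because ray tracing is computationally expensive, this paper analyzes the basic principles of the ray tracing algorithm and implements it in NVIDIA's CUDA (Compute Unified Device Architecture) environment, using a uniform grid as the acceleration structure. Experimental results show that this computing model runs faster overall than traditional CPU-based ray tracing, and that the GPU is well suited to high-density data computation.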

16.
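On a cluster system with a distributed-memory architecture, a parallel Monte Carlo algorithm was designed and implemented using the portable Message Passing Interface (MPI) with C language bindings, which effectively solves the problems of heavy computation and the long execution time of the serial algorithm. Based on an analysis of the communication time overhead between cluster nodes, the parallel Monte Carlo algorithm was improved with a master-slave programming model, which achieves load balancing, raises the utilization of the cluster's processors, and further shortens the execution time.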
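The master-slave pattern with load balancing can be illustrated without MPI. The Python sketch below swaps in a standard-library thread pool for the paper's MPI/C cluster code, so it shows only the structure and not the performance: the master splits the sample budget into chunks, idle workers pull the next chunk, and the master sums the partial counts. The function names, the pi-estimation workload, and the per-chunk seeding are assumptions made for the example; in particular, seeding by chunk index is a simple stand-in for a proper parallel random number generator.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_chunks(total, chunk_size):
    """Master side: split the sample budget into work units."""
    return [min(chunk_size, total - start)
            for start in range(0, total, chunk_size)]


def count_hits(task):
    """Worker side: count random points that fall inside the unit circle."""
    chunk_id, n_samples = task
    rng = random.Random(chunk_id)  # one stream per chunk (simplification)
    return sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)


def estimate_pi(total, chunk_size, n_workers):
    """Hand chunks to whichever worker is free, then reduce the results."""
    tasks = list(enumerate(make_chunks(total, chunk_size)))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        hits = sum(pool.map(count_hits, tasks))
    return 4.0 * hits / total
```

Using many small chunks rather than one fixed share per worker is what gives the load balancing: a slow worker simply processes fewer chunks. Because each chunk carries its own seed, the estimate does not depend on which worker ran which chunk.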

17.
Many highly developed Monte Carlo tools for the evaluation of cross sections based on tree matrix elements exist and are used by experimental collaborations in high energy physics. As the evaluation of one-loop matrix elements has recently been undergoing enormous progress, the combination of one-loop matrix elements with existing Monte Carlo tools is on the horizon. This would lead to phenomenological predictions at the next-to-leading order level. This note summarises the discussion of the next-to-leading order multi-leg (NLM) working group on this issue, which took place during the workshop on Physics at TeV Colliders at Les Houches, France, in June 2009. The result is a proposal for a standard interface between Monte Carlo tools and one-loop matrix element programs. Dedicated to the memory of, and in tribute to, Thomas Binoth, who led the effort to develop this proposal for Les Houches 2009. Thomas led the discussions, set up the subgroups, collected the contributions, and wrote and edited this paper. He made a promise that the paper would be on the arXiv the first week of January, and we are faithfully fulfilling his promise. In his honour, we would like to call this the Binoth Les Houches Accord.

18.
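A new optimization method for loan portfolio decisions is presented; the model uses CVaR, which better reflects the credit risk characteristics of a loan portfolio, as the risk measure. Because historical data for individual loans are hard to obtain in practice, a Monte Carlo simulation method based on the MATLAB language is given, so that the model can be solved efficiently with linear programming techniques. Finally, an example is given.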
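How CVaR is read off simulated losses can be shown with a minimal sketch. The Python code below is an assumption-laden illustration and not the paper's MATLAB model: it supposes independent loan defaults with total loss of the exposure on default, uses hypothetical function names, and computes an empirical VaR and CVaR for a fixed portfolio instead of solving the linear program that optimizes the portfolio weights.

```python
import random


def simulate_losses(exposures, default_probs, n_scenarios, seed=0):
    """Monte Carlo scenarios of portfolio loss.

    Assumes independent defaults and loss equal to the full exposure.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n_scenarios):
        loss = sum(exposure
                   for exposure, prob in zip(exposures, default_probs)
                   if rng.random() < prob)
        losses.append(loss)
    return losses


def var_cvar(losses, alpha):
    """Empirical VaR (alpha-quantile) and CVaR (mean of the tail beyond it)."""
    ordered = sorted(losses)
    k = min(int(alpha * len(ordered)), len(ordered) - 1)
    var = ordered[k]
    tail = ordered[k:]  # the worst (1 - alpha) share of scenarios
    return var, sum(tail) / len(tail)
```

For instance, `var_cvar(simulate_losses([100.0, 50.0, 25.0], [0.1, 0.2, 0.3], 5000), 0.75)` returns the loss level at the 75% quantile together with the average loss over the scenarios at or beyond it. Because CVaR averages the whole tail, it is never smaller than VaR at the same confidence level.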

19.
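For complex equipment supported by a three-level maintenance organization, the equipment's operation and maintenance process is analyzed, a state-transition diagram of operation and maintenance over the equipment's entire service life is given, and a model relating the maintenance period to the average availability over the entire service life is established. Using Monte Carlo simulation with a worked example, the optimal maintenance period that maximizes the average availability is obtained, which demonstrates the applicability and sensitivity of the model and can provide a basis for equipment maintenance decisions.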

20.