期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

张祎晨黄铁军《中国图象图形学报》2023,28(2):358-371

树突对大脑神经元实现不同的信息处理功能有着重要作用。精细神经元模型是一种对神经元树突以及离子通道的信息处理过程进行精细建模的模型,可以帮助科学家在实验条件的限制之外探索树突信息处理的特性。由精细神经元组成的精细神经网络模型可通过仿真对大脑的信息处理过程进行模拟,对于理解树突的信息处理机制、大脑神经网络功能背后的计算机理具有重要作用。然而,精细神经网络仿真需要进行大量计算,如何对精细神经网络进行高效仿真是一个具有挑战的研究问题。本文对精细神经网络仿真方法进行梳理,介绍了现有主流仿真平台与核心仿真算法,以及可进一步提升仿真效率的高效仿真方法。将具有代表性的高效仿真方法按照发展历程以及核心思路分为网络尺度并行方法、神经元尺度并行方法以及基于GPU(graphics processing unit)的并行仿真方法3类。对各类方法的核心思路进行总结,并对各类方法中代表性工作的细节进行分析介绍。随后对各类方法所具有的优劣势进行分析对比,对一些经典方法进行总结。最后根据高效仿真方法的发展趋势,对未来研究工作进行展望。相似文献

2.

Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics

《Journal of molecular graphics & modelling》2013

Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10–50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its one order smaller time-step than the classical MD, all of which pose significant computational challenges in simulation capability to reach spatio-temporal scales of nanometers and nanoseconds. The very recent advances of graphics processing unit (GPU) provide not only highly favorable performance for GPU enabled MD programs compared with CPU implementations but also an opportunity to manage with the computing power and memory demanding nature imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times faster than Duin et al.’s FORTRAN codes in Lammps on 8 CPU cores and 6 times faster than the Lammps’ C codes based on PuReMD in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploiting very complex molecular reactions via ReaxFF MD simulation on desktop workstations. 相似文献

3.

Parallelization of a color-entropy preprocessed Chan–Vese model for face contour detection on multi-core CPU and GPU

《Parallel Computing》2015

Face tracking is an important computer vision technology that has been widely adopted in many areas, from cell phone applications to industry robots. In this paper, we introduce a novel way to parallelize a face contour detecting application based on the color-entropy preprocessed Chan–Vese model utilizing a total variation G-norm. This particular application is a complicated and unsupervised computational method requiring a large amount of calculations. Several core parts therein are difficult to parallelize due to heavily correlated data processing among iterations and pixels.We develop a novel approach to parallelize the data-dependent core parts and significantly improve the runtime performance of the model computation. We implement the parallelized program on OpenCL for both multi-core CPU and GPU. For 640 * 480 input images, the parallelized program on a NVidia GTX970 GPU, a NVidia GTX660 GPU, and an AMD FX8530 8-core CPU is on average 18.6, 12.0 and 4.40 times faster than its single-thread C version on the AMD FX8530 CPU, respectively. Some parallelized routines have much higher performance improvement compared to the whole program. For instance, on the NVidia GTX970 GPU, the parallelized entropy filter routine is on average 74.0 times faster than its single-thread C version on the AMD FX8530 8-core CPU. We discuss the parallelization methodologies in detail, including the scalability, thread models, as well as synchronization methods for both multi-core CPU and GPU. 相似文献

4.

基于CUDA的大规模群体行为实时仿真并行实现及优化

贺毅辉叶晨刘志忠彭伟《计算机应用》2012,32(9):2466-2469

群体仿真中个体从环境中查找相关对象时会导致较高的时间复杂度。要使大规模群体能够实时仿真,必须降低模型运算的时间复杂度或者提高计算平台的能力。通过对Biods模型为典型案例进行研究,提出一种基于统一计算架构(CUDA)的大规模群体行为实时仿真并行实现及优化的方法。实现中将个体与GPU逻辑线程一一对应,通过将仿真环境离散化来提高相关个体查找的效率,通过并行化基数排序法将个体信息组织成具有空间局部性的数组,提高图形处理器(GPU)内存带宽的利用率。通过实验验证了该方法将仿真个体的数量提升到CPU方法的约7.3倍。相似文献

5.

Generating parallel quasirandom sequences via randomization

Hongmei Chi Edward L. Jones 《Journal of Parallel and Distributed Computing》2007

Quasi-Monte Carlo (QMC) methods are now widely used in scientific computation, especially in estimating integrals over multidimensional domains. One advantage of QMC is that it is easy to parallelize applications, and so the success of any parallel QMC application depends crucially on the quality of parallel quasirandom sequences used. Much of the recent work dealing with parallel QMC methods has been aimed at splitting a single quasirandom sequence into many subsequences. In contrast with this perspective to concentrate on breaking one sequence up, this paper proposes an alternative approach to generating parallel sequences for QMC. This method generates parallel sequences of quasirandom numbers via scrambling. The exact meaning of scrambling depends on the type of parallel quasirandom numbers. In general, we seek to randomize the generator matrix for each quasirandom number generator. Specifically, this paper will discuss how to parallelize the Halton sequence via scrambling. The proposed scheme for generating parallel random number streams is especially good for heterogeneous and unreliable computing environments. 相似文献

6.

GPU parallelization of the sequential matrix diagonalization algorithm and its application to high-dimensional data

Manuel Carcenac Soydan Redif Server Kasap 《The Journal of supercomputing》2017,73(8):3603-3634

This paper presents the parallelization on a GPU of the sequential matrix diagonalization (SMD) algorithm, a method for diagonalizing polynomial covariance matrices, which is the most recent technique for polynomial eigenvalue decomposition. We first parallelize with CUDA the calculation of the polynomial covariance matrix. Then, following a formal transformation of the polynomial matrix multiplication code—extensively used by SMD—we insert in this code the cublasDgemm function of CUBLAS library. Furthermore, a specialized cache memory system is implemented within the GPU to greatly limit the PC-to-GPU transfers of slices of polynomial matrices. The resulting SMD code can be applied efficiently over high-dimensional data. The proposed method is verified using sequences of images of airplanes with varying spatial orientation. The performance of the parallel codes for polynomial covariance matrix generation and SMD is evaluated and reveals speedups of up to 161 and 67, respectively, relative to sequential execution on a PC. 相似文献

7.

Range query processing on single and multi GPU environments

Ricardo J. Barrientos José I. Gómez Christian Tenllado Manuel Prieto Matias Mauricio Marin 《Computers & Electrical Engineering》2013

Metric-space similarity search has proven suitable in a number of application domains such as multimedia retrieval and computational biology to name a few. These applications usually work on very large databases that are often indexed to speed-up on-line searches. To achieve efficient throughput, it is essential to exploit the intrinsic parallelism in the respective search query processing algorithms. Many strategies have been proposed in the literature to parallelize these algorithms either on shared or distributed memory multiprocessor systems. Lately, GPUs have been used to implement brute-force parallel search strategies instead of using index data structures. Indexing poses difficulties when it comes to achieve efficient exploitation of GPU resources. In this paper we propose single and multi GPU metric space techniques that efficiently exploit GPU tailored index data structures for parallel similarity search in large databases. The experimental results show that our proposal outperforms previous index-based sequential and OpenMP search strategies. 相似文献

8.

量子线路模拟器QuEST在多GPU平台上的性能优化

张亮常旭秦志楷沈立《计算机工程与科学》2021,43(1):17-23

在当前量子计算的研究中,量子线路模拟器作为重要的研究工具,一直受到研究者们的高度重视。QuEST是一款开源的通用量子线路模拟器,能在单个CPU结点、多个CPU结点和单个GPU等多种测试平台上灵活运行。量子线路模拟固有的并行性使其非常适合在GPU上运行,并能获得较大的性能加速。但是其缺点在于所消耗的内存空间巨大,单个GPU受显存容量限制,无法模拟具有更多量子位的量子系统。设计并实现了多GPU版本的QuEST模拟器,解决了单个GPU显存不足的问题,能够使用多个GPU模拟更多的量子位。而且,与单CPU版本相比可获得7~9倍的性能加速,与多CPU版本相比可获得3倍的性能加速。相似文献

9.

基于近似最近邻搜索的并行光流计算

下载免费PDF全文

杨昕欣姜精萍《计算机工程与应用》2018,54(18):201-207

Barnes近似最近邻算法是当前匹配性能优秀的近似块匹配算法,将其应用于稠密光流的计算中,并与OpenCV中实现的两种稠密光流算法进行对比。针对Barnes算法不易并行化的不足,对Barnes算法中的传播过程进行修改,使其易于在GPU上实现并行加速。实验表明,经并行加速后的光流算法比原算法快两倍以上,而在精确度上与原算法接近,并且都优于OpenCV实现的两种稠密光流算法。相似文献

10.

面向GPU平台的并行结构化稀疏三角方程组求解器

下载免费PDF全文

陈道琨杨超刘芳芳马文静《软件学报》2023,34(11):4941-4951

稀疏三角线性方程组求解(SpTRSV)是预条件子部分的重要操作, 其中结构化SpTRSV问题, 在以迭代方法求解偏微分方程组的科学计算程序中, 是一种较为常见的问题类型, 而且通常是科学计算程序的需要解决的一个性能瓶颈. 针对GPU平台, 目前以CUSPARSE为代表的商用GPU数学库, 采用分层调度(level-scheduling)方法并行化SpTRSV操作. 该方法不仅预处理耗时较长, 而且在处理结构化SpTRSV问题时会出现较为严重GPU线程闲置问题. 针对结构化SpTRSV问题, 提出一种面向结构化SpTRSV问题的并行算法. 该算法利用结构化SpTRSV问题的特殊非零元分布规律进行任务划分, 避免对输入问题的非零元结构进行预处理分析. 并对现有分层调度方法的逐元素处理策略进行改进, 在有效缓解GPU线程闲置问题的基础上, 还隐藏了部分矩阵非零元素的访存延迟. 还根据算法的任务划分特点, 采用状态变量压缩技术, 显著提高算法状态变量操作的缓存命中率. 在此基础上, 还结合谓词执行等GPU硬件特性, 对算法实现进行全面的优化. 所提算法在NVIDIA V100 GPU上的实测性能, 相比CUSPARSE平均有2.71倍的加速效果, 有效访存带宽最高可达225.2 GB/s. 改进后的逐元素处理策略, 配合针对GPU硬件的一系列调优手段, 优化效果显著, 将算法的有效访存带宽提高了约1.15倍. 相似文献

11.

Using Entanglement in Quantum Multi-Prover Interactive Proofs

Julia Kempe Hirotada Kobayashi Keiji Matsumoto Thomas Vidick 《Computational Complexity》2009,18(2):273-307

相似文献

12.

基于OPS的计算流体力学软件多平台自动并行

王巍车永刚徐传福王正华《计算机工程与科学》2021,43(5):773-781

当前高性能计算机体系结构呈现多样性特征,给并行应用软件开发带来巨大挑战.采用领域特定语言OPS对高阶精度计算流体力学软件HNSC进行面向多平台的并行化,使用OPS API实现了代码的重构,基于OPS前后端自动生成了纯M PI、OpenM P、M PI+OpenM P和M PI+CUDA版本的可执行程序.在一个配有2块Intel Xeon CPU E5-2660 V3 CPU和1块NVIDIA Tesla K80 GPU的服务器上的性能测试表明,基于O PS自动生成的并行代码性能与手工并行代码的性能可比甚至更优,并且O PS自动生成的GPU并行代码相对于其CPU并行代码有明显的性能加速.测试结果说明,使用OPS等领域特定语言进行面向多平台的计算流体力学并行软件开发是一种可行且高效的途径. 相似文献

13.

Parallel construction of classification trees on a GPU

D. Strnad A. Nerat 《Concurrency and Computation》2016,28(5):1417-1436

Algorithms for constructing tree‐based classifiers are aimed at building an optimal set of rules implicitly described by some dataset of training samples. As the number of samples and/or attributes in the dataset increases, the required construction time becomes the limiting factor for interactive or even functional use. The problem is emphasized if tree derivation is part of an iterative optimization method, such as boosting. Attempts to parallelize the construction of classification trees have therefore been presented in the past. This paper describes a parallel method for binary classification tree construction implemented on a graphics processing unit (GPU) using compute unified device architecture (CUDA). By employing node‐level, attribute‐level, and split‐level parallelism, the task parallel and data parallel sections of tree induction are mapped to the architecture of a modern GPU. The GPU‐based solution is compared with the sequential and multi‐threaded CPU versions on public access datasets, and it is shown that an order of magnitude acceleration can be achieved on this data‐intensive task using inexpensive commodity hardware. The influence of dataset characteristics on the efficiency of parallel implementation is also analyzed. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

14.

GPU并行计算集群上的LAMMPS分子动力学模拟性能测试

李伯杨聂峰光李晓霞郭力《计算机与应用化学》2011,28(10)

近年来GPU作为一种具有极强运算能力的多核处理器,得到了快速的发展,成为高性能计算领域的主要发展方向。各种分子动力学模拟的主流软件也纷纷使用GPU技术,其中LAMMPS较早地开发出了通用的并行GPU版本。本文利用nVIDIA公司最新Femi架构的Tesla C2050 GPU搭建了小型的基于LAMMPS的分子动力学模拟GPU并行计算集群,通过氩原子熔化的算例对集群性能进行了测试,测试的内容包括CPU集群、单节点单GPU、单节点多GPU以及多节点GPU集群。比较了各种情况的加速倍数并对造成性能改变的原因进行了讨论,分析了用于MD模拟的GPU并行计算集群性能的瓶颈所在,提出可能的解决方法,搭建集群时,充分考虑PCI总线的承受能力,对于集群效率的提高有很大好处。测试结果表明,集群的性能较高,相对于以往的单机以及CPU集群,计算的规模大大提高了,加速比也在20倍以上。可以预测,在未来一段时间内,多GPU并行是分子动力学模拟的发展方向。相似文献

15.

Parallel multi-objective Ant Programming for classification using GPUs

Alberto Cano Juan Luis Olmo Sebastián Ventura 《Journal of Parallel and Distributed Computing》2013

Classification using Ant Programming is a challenging data mining task which demands a great deal of computational resources when handling data sets of high dimensionality. This paper presents a new parallelization approach of an existing multi-objective Ant Programming model for classification, using GPUs and the NVIDIA CUDA programming model. The computational costs of the different steps of the algorithm are evaluated and it is discussed how best to parallelize them. The features of both the CPU parallel and GPU versions of the algorithm are presented. An experimental study is carried out to evaluate the performance and efficiency of the interpreter of the rules, and reports the execution times and speedups regarding variable population size, complexity of the rules mined and dimensionality of the data sets. Experiments measure the original single-threaded and the new multi-threaded CPU and GPU times with different number of GPU devices. The results are reported in terms of the number of Giga GP operations per second of the interpreter (up to 10 billion GPops/s) and the speedup achieved (up to 834× vs CPU, 212× vs 4-threaded CPU). The proposed GPU model is demonstrated to scale efficiently to larger datasets and to multiple GPU devices, which allows the expansion of its applicability to significantly more complicated data sets, previously unmanageable by the original algorithm in reasonable time. 相似文献

16.

Efficient finite impulse response filters in massively-parallel recursive systems

André Maximo 《Journal of Real-Time Image Processing》2016,12(3):603-611

This paper presents strategies to massively parallelize complete recursive systems. Each algorithm handles systems with feedforward and feedback coefficients allowing to compute high-complexity filtering operators. The final algorithm is linear in time and memory, exposes a high number of parallel tasks, and it is implemented on graphics processing units, i.e. GPUs. The key to the final algorithm is the derivation of closed-form formulas to combine both non-recursive and recursive linear filters, based on an efficient state-of-the-art block-based strategy. Applications to early vision are considered in this work, hence the GPU implementation runs on images computing an approximation of the Gaussian filter and its first and second derivatives. Finally, comparison results are given showing that this work outperforms prior state-of-the-art algorithms, enabling it to achieve real-time image filtering on ultra-high-definition videos. 相似文献

17.

面向众核GPU加速系统的网络编码并行化及优化

唐绍华《计算机工程与应用》2014,50(21):79-84

网络编码允许网络节点在数据存储转发的基础上参与数据处理,已成为提高网络吞吐量、均衡网络负载和提高网络带宽利用率的有效方法,但是网络编码的计算复杂性严重影响了系统性能。基于众核GPU加速的系统可以充分利用众核GPU强大的计算能力和有效利用GPU的存储层次结构来优化加速网络编码。基于CUDA架构提出了以片段并行的技术来加速网络编码和基于纹理Cache的并行解码方法。利用提出的方法实现了线性随机编码,同时结合体系结构对其进行优化。实验结果显示,基于众核GPU的网络编码并行化技术是行之有效的,系统性能提升显著。相似文献

18.

基于GPU的非牛顿流体自由表面绘制方法 总被引：1，自引：0，他引：1

下载免费PDF全文

蒋杰应龙杨冰吴玲达《计算机工程与应用》2007,43(18):19-23

提出一种基于GPU的非牛顿流体自由表面绘制方法。首先,分析了非牛顿流体的物理模型,将流体的运动规律用合理的数学表达式进行描述;其次,针对非牛顿流体的特点设计了合理的可视模型,提出了流体运动及自由表面的绘制方法,并且设计了相应的GPU实现算法;最后的实验证明了算法在合理的时间内能完全逼真的对非牛顿流体的自由表面进行真实的再现。算法充分吸收了以往方法的优点,采用了合理的数学模型,并利用GPU的运算特性实现了非牛顿流体自由表面的绘制,在绘制效果和效率上较以往算法都有较大改进。相似文献

19.

一种基于GPU的粒子系统 总被引：2，自引：0，他引：2

许楠郝爱民王莉莉《计算机工程与应用》2006,42(19):77-79,139

粒子系统在当今不定形物体仿真中已经得到广泛的应用,但是普通的粒子系统在实时仿真中,粒子总数最多只能达到10000个左右,其中瓶颈在于粒子数据从主处理器到图形硬件的传输和CPU的并行处理能力。文章研究并实现了一种完全基于图形硬件(GPU)的粒子系统,利用GPU的多通道并行处理功能,提高处理速度,可以很大程度地增加粒子系统实时仿真应用中的粒子数量,从而可以提高虚拟环境的逼真程度。实验证明基于GPU的粒子系统的实时性能远远高于普通粒子系统。相似文献

20.

基于GPU的流体动力学模拟 总被引：2，自引：0，他引：2

杨冰蒋杰应龙吴玲达《计算机工程与应用》2007,43(11):7-10

提出了一种基于GPU的流体动力学可视化方法。首先,分析了流体动力学的物理模型,用合理的数学表达式表述了该模型,并且给出了求解方法;其次,设计了在GPU上实现流体动力学模拟的算法,既模拟出逼真的运动效果又控制了算法的复杂度;最后的实验证明了该文算法在执行效率上较以往基于CPU的算法有很大的提高,并且模拟的结果逼真、可信。算法充分吸收了以往方法的优点,对流体动力学的可视化的细节采用了最优的物理模型及快速的数值解法,具有较强的稳健性和创新性。相似文献