1.

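魏洪昌 朱正东 董小社 宁洁 《计算机工程与应用》 (Computer Engineering and Applications) 2016,52(21):30-35

To address insufficient optimization after applications are ported to CPU-GPU heterogeneous parallel systems, the paper proposes a method that combines asymptotic-fitting optimization with source-to-source compilation. The method takes a C program annotated with directives, translates it to CUDA, analyzes the translated program in multiple passes, and then completes source-to-source compilation and optimization automatically, guided by the characteristics of the source program and by hardware information. A prototype system was built on this method. Functional and performance tests of the prototype in different environments show that the generated CUDA target programs are functionally equivalent to the C source programs while running substantially faster. A comparison with CUDA benchmark programs shows that the target programs clearly outperform programs generated by other source-to-source translators.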

2.

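Performance analysis of vector dot product implementations on the GPU

郭雷 刘进锋 《计算机工程与应用》 (Computer Engineering and Applications) 2012,48(2):201-202

CUDA is a relatively simple technology for general-purpose computing on GPUs. The paper studies several CUDA-based vector dot product algorithms on the GPU and compares and analyzes the performance of each. Experiments show that the fastest GPU algorithm is about 7 times faster than the CPU algorithm.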
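
The abstract does not say which dot product variants were compared. As a rough illustration only, the sketch below simulates in plain Python one common GPU pattern: each block multiplies its slice element by element, reduces the products with a tree reduction, and the per-block partial sums are added at the end. The function names and the `block_size` parameter are assumptions made for this example, not taken from the paper.

```python
def block_partial_sums(a, b, block_size):
    """Return one partial dot product per block, mimicking a GPU thread block."""
    # The tree reduction below halves the active range each step,
    # so the (hypothetical) block size must be a power of two.
    if block_size < 1 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    partials = []
    for start in range(0, len(a), block_size):
        # Each simulated thread multiplies one pair of elements.
        products = [x * y for x, y in zip(a[start:start + block_size],
                                          b[start:start + block_size])]
        # Pad the last block with zeros so every block is full.
        products += [0.0] * (block_size - len(products))
        # Tree reduction: the first `stride` slots each absorb a partner slot.
        stride = block_size // 2
        while stride > 0:
            for i in range(stride):
                products[i] += products[i + stride]
            stride //= 2
        partials.append(products[0])
    return partials


def dot(a, b, block_size=8):
    """Dot product computed as the sum of per-block partial results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(block_partial_sums(a, b, block_size))
```

On a real GPU the inner `for i in range(stride)` loop would run as parallel threads with a barrier between steps; here it runs serially, so the sketch shows only the data flow, not the performance.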

3.

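ASUS ESC1000 delivers desktop supercomputing

《CAD/CAM与制造业信息化》 (CAD/CAM and Manufacturing Informatization) 2010,(1):91-91

CUDA took less than two years from its launch to reach the major application areas, and this disruptive technology, much praised by developers, is spreading at a remarkable pace. CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture introduced by NVIDIA. It puts the GPU's parallel computing power to full use on complex computational problems. Developers can write programs in C for CUDA-capable GPUs and build data-intensive computing solutions on them.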

4.

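Research and implementation of a CUDA-based parallel particle swarm optimization algorithm

陈风 田雨波 杨敏 《计算机科学》 (Computer Science) 2014,41(9):263-268

When graphics processing units (GPUs) are used to accelerate particle swarm optimization (PSO), published work often inflates the speedup by degrading the performance of the CPU-side PSO. To compare GPU-PSO and CPU-PSO fairly, the paper proposes "effective speedup" as the performance metric. The evaluation method does not require the CPU and GPU runs to use the same number of particles. Instead, it compares the GPU parallel algorithm with the best CPU serial algorithm, using the time to converge to a target accuracy as the criterion. Numerical simulations on several benchmark functions were run under the Compute Unified Device Architecture (CUDA). The results show that greatly increasing the number of particles on the GPU speeds up convergence to the target accuracy, giving an effective speedup of more than 10x over CPU-PSO.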
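
The abstract gives no formula for effective speedup. One plausible reading, assumed here, is the ratio of the time the best CPU serial run needs to reach the target accuracy to the time the GPU run needs to reach the same accuracy. The helper names and the shape of the convergence history are hypothetical.

```python
def time_to_target(history, target):
    """Return the elapsed time at which best fitness first reaches `target`.

    `history` is a list of (elapsed_seconds, best_fitness) pairs in time
    order for a minimization problem. Returns None if never reached.
    """
    for elapsed, best in history:
        if best <= target:
            return elapsed
    return None


def effective_speedup(cpu_time_to_target, gpu_time_to_target):
    """Assumed definition: CPU time-to-target divided by GPU time-to-target."""
    if cpu_time_to_target <= 0 or gpu_time_to_target <= 0:
        raise ValueError("times must be positive")
    return cpu_time_to_target / gpu_time_to_target
```

Under this reading the particle counts on the two sides can differ, because only the time to reach the shared accuracy target enters the ratio.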

5.

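GPU acceleration of the two-dimensional diffusion equation  Cited by: 1 (self-citations: 0, other citations: 1)

董廷星 王龙 迟学斌 《计算机工程与科学》 (Computer Engineering & Science) 2009,31(11)

In recent years GPUs have drawn attention because their floating-point performance exceeds that of CPUs, and NVIDIA's CUDA architecture has made general-purpose computing on GPUs practical. The paper ports the two-dimensional diffusion equation, a benchmark problem in computational fluid dynamics, to the GPU using two approaches: global memory and texture memory. The results show a 34x speedup once the grid reaches the order of a million points.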
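
The abstract does not state the discretization. As a sketch under that caveat, the code below assumes the standard form u_t = D (u_xx + u_yy) on a uniform grid and one explicit forward-time, centered-space update with fixed boundary values. The name `diffusion_step` and the parameter `alpha = D * dt / h**2` are choices made for this example.

```python
def diffusion_step(u, alpha):
    """Advance a 2D field one explicit time step with a 5-point stencil.

    `u` is a list of rows; `alpha` is D*dt/h**2 (this explicit scheme is
    stable for alpha <= 0.25). Boundary cells are left unchanged.
    """
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]  # copy so every update reads old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbours = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
            new[i][j] = u[i][j] + alpha * (neighbours - 4.0 * u[i][j])
    return new
```

Every interior cell depends only on the previous time level, which is why this kind of update maps naturally to one GPU thread per grid point.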

6.

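Parallelization and optimization of network coding for many-core GPU-accelerated systems

唐绍华 《计算机工程与应用》 (Computer Engineering and Applications) 2014,50(21):79-84

Network coding lets network nodes take part in data processing in addition to storing and forwarding data. It has become an effective way to raise network throughput, balance network load, and improve bandwidth utilization, but its computational complexity seriously degrades system performance. A many-core GPU-accelerated system can use the GPU's computing power and memory hierarchy to accelerate network coding. Based on the CUDA architecture, the paper proposes a fragment-parallel technique for accelerating network coding and a parallel decoding method based on the texture cache. Random linear coding was implemented with these methods and optimized for the architecture. Experimental results show that parallelizing network coding on many-core GPUs is effective and improves system performance significantly.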

7.

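An introduction to the CUDA software and hardware environment  Cited by: 1 (self-citations: 0, other citations: 1)

《程序员》 (Programmer) 2008,(3):36-37

CUDA is a development environment for GPU computing. It is a new software and hardware architecture that treats the GPU as a data-parallel computing device and allocates and manages the computations it performs. Under CUDA, these computations no longer have to be mapped onto a graphics API (OpenGL or Direct3D), as they did in earlier GPGPU approaches, so the barrier to entry for developers is much lower. CUDA's GPU programming language is based on standard C, so anyone with a C background can readily develop CUDA applications.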

8.

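Accelerating molecular dynamics simulation with multi-core CPUs and GPUs

林江宏 林锦贤 吕暾 《计算机应用》 (Journal of Computer Applications) 2011,31(3):843-847

A protein molecular dynamics simulation program based on the AMBER force field was implemented with OpenMP and the Compute Unified Device Architecture (CUDA) on a heterogeneous parallel architecture of multi-core central processing units (CPUs) and graphics processing units (GPUs). The program is partitioned into CPU single-threaded, CPU multi-threaded, and GPU multi-threaded parts so that the machine's processing power is used efficiently. Performance tests show that, relative to optimized serial CPU computation, the multi-core CPU-GPU heterogeneous model has a strong performance advantage. In particular, moving the force computation, which accounts for 90% of the program's execution time, to the GPU yields a speedup of up to 12x.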

9.

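An introduction to the CUDA software and hardware environment

《Internet》2008,(3):36-37

CUDA is a development environment for GPU computing. It is a new software and hardware architecture that treats the GPU as a data-parallel computing device and allocates and manages the computations it performs. Under CUDA, these computations no longer have to be mapped onto a graphics API (OpenGL or Direct3D), as they did in earlier GPGPU approaches, so the barrier to entry for developers is much lower. CUDA's GPU programming language is based on standard C, so anyone with a C background can readily develop CUDA applications.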

10.

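Research on a GPU-based, lock-based parallel method for unstructured mesh generation

蔡云龙 肖素梅 齐龙 《计算机工程与应用》 (Computer Engineering and Applications) 2014,50(6):56-60

Unstructured mesh generation has shortcomings in both running time and memory use. The paper proposes a new method, named GPU-PDMG, a GPU-parallel unstructured mesh generation technique based on the CUDA architecture. It combines the GPU's high-speed parallel computing with the advantages of Delaunay triangulation and, using the CUDA programming model on NVIDIA GPUs, develops a locking technique for parallel region partitioning. Test cases including the NACA0012 airfoil and a multi-element airfoil are used to analyze the method's speedup and efficiency and to evaluate its computational performance. The experimental results show that GPU-PDMG is faster than existing CPU algorithms and improves efficiency while preserving mesh quality.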

11.

CUDA optimization and implementation of a motion estimation search algorithm

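陈佐 陈汉 季加良 《计算机工程与应用》 (Computer Engineering and Applications) 2010,46(32):171-176

Motion estimation search is the most computation-heavy and time-consuming part of H.264 compression coding. Applying the parallel optimization approach of graphics processors, the paper studies how to parallelize the motion estimation search algorithm GEA (global elimination algorithm) on the CUDA computing platform, and discusses in detail the key technical issues of parallel design, data processing, and result feedback. The running efficiency of the algorithm is then compared and analyzed using experimental data. The experimental results show that GEA motion search performs significantly better on the GPU than on the CPU.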

12.

SafeGPU: Contract- and library-based GPGPU for object-oriented languages

《Computer Languages, Systems and Structures》2017

Using GPUs as general-purpose processors has revolutionized parallel computing by providing, for a large and growing set of algorithms, massive data-parallelization on desktop machines. An obstacle to their widespread adoption, however, is the difficulty of programming them and the low-level control of the hardware required to achieve good performance. This paper proposes a programming approach, SafeGPU, that aims to make GPU data-parallel operations accessible through high-level libraries for object-oriented languages, while maintaining the performance benefits of lower-level code. The approach provides data-parallel operations for collections that can be chained and combined to express compound computations, with data synchronization and device management all handled automatically. It also integrates the design-by-contract methodology, which increases confidence in functional program correctness by embedding executable specifications into the program text. We present a prototype of SafeGPU for Eiffel, and show that it leads to modular and concise code that is accessible for GPGPU non-experts, while still providing performance comparable with that of hand-written CUDA code. We also describe our first steps towards porting it to C#, highlighting some challenges, solutions, and insights for implementing the approach in different managed languages. Finally, we show that runtime contract-checking becomes feasible in SafeGPU, as the contracts can be executed on the GPU.

13.

An efficient CUDA implementation technique for the RSA algorithm  Cited by: 1 (self-citations: 1, other citations: 0)

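孙迎红 童元满 王志英 《计算机工程与应用》 (Computer Engineering and Applications) 2011,47(2):84-87

CUDA (Compute Unified Device Architecture), a computing architecture that supports general-purpose computing on GPUs, is widely used for large-scale data-parallel computation. RSA is a compute-intensive public-key cryptographic algorithm. The paper presents an efficient parallel implementation of RSA based on CUDA. Its key idea is to launch a large number of independent, concurrent Montgomery modular multiplication threads, and it details the thread organization, the data storage structures, and the shared-memory-based performance optimizations. Using this implementation, the computational performance and throughput of RSA were measured on a GPU. The experimental results show that the CUDA implementation achieves a speedup of more than 40x over a general-purpose CPU implementation of RSA.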
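
For readers unfamiliar with Montgomery modular multiplication, the following serial Python sketch shows the textbook reduction and a square-and-multiply exponentiation built on it. It is not the paper's CUDA thread layout; the function names are invented for this example, and `pow(n, -1, r)` assumes Python 3.8 or later.

```python
def montgomery_setup(n, bits):
    """Precompute constants for an odd modulus n < R, where R = 2**bits."""
    r = 1 << bits
    if n <= 1 or n % 2 == 0 or n >= r:
        raise ValueError("n must be odd, greater than 1, and below 2**bits")
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime = -1 (mod R)
    r2 = (r * r) % n                # used to convert into Montgomery form
    return n_prime, r2


def mont_mul(a, b, n, bits, n_prime):
    """Return a * b * R^-1 mod n for 0 <= a, b < n, without dividing by n."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # chosen so t + m*n is divisible by R
    u = (t + m * n) >> bits            # exact division by R
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Compute base**exp mod n with Montgomery square-and-multiply."""
    bits = n.bit_length()
    n_prime, r2 = montgomery_setup(n, bits)
    x = mont_mul(base % n, r2, n, bits, n_prime)  # base in Montgomery form
    acc = mont_mul(1, r2, n, bits, n_prime)       # 1 in Montgomery form
    while exp > 0:
        if exp & 1:
            acc = mont_mul(acc, x, n, bits, n_prime)
        x = mont_mul(x, x, n, bits, n_prime)
        exp >>= 1
    return mont_mul(acc, 1, n, bits, n_prime)     # leave Montgomery form
```

Each `mont_mul` call uses only multiplications, masks, and shifts, so many independent calls can run side by side, which is the property the abstract's concurrent-thread design relies on.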

14.

CU++: an object oriented framework for computational fluid dynamics applications using graphics processing units

Dominic D. J. Chandar Jayanarayanan Sitaraman Dimitri Mavriplis 《The Journal of supercomputing》2014,67(1):47-68

The application of graphics processing units (GPU) to solve partial differential equations is gaining popularity with the advent of improved computer hardware. Various lower level interfaces exist that allow the user to access GPU specific functions. One such interface is NVIDIA’s Compute Unified Device Architecture (CUDA) library. However, porting existing codes to run on the GPU requires the user to write kernels that execute on multiple cores, in the form of Single Instruction Multiple Data (SIMD). In the present work, a higher level framework, termed CU++, has been developed that uses object oriented programming techniques available in C++ such as polymorphism, operator overloading, and template meta programming. Using this approach, CUDA kernels can be generated automatically during compile time. Briefly, CU++ allows a code developer with just C/C++ knowledge to write computer programs that will execute on the GPU without any knowledge of specific programming techniques in CUDA. This approach is tremendously beneficial for Computational Fluid Dynamics (CFD) code development because it mitigates the necessity of creating hundreds of GPU kernels for various purposes. In its current form, CU++ provides a framework for parallel array arithmetic, simplified data structures to interface with the GPU, and smart array indexing. An implementation of heterogeneous parallelism, i.e., utilizing multiple GPUs to simultaneously process a partitioned grid system with communication at the interfaces using Message Passing Interface (MPI) has been developed and tested.

15.

Research on bit-parallel multi-pattern string matching on the GPU  Cited by: 1 (self-citations: 0, other citations: 1)

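赵光南 吴承荣 《计算机工程》 (Computer Engineering) 2011,37(14):265-267

Graphics processing units (GPUs) combine strong performance on simple, uniform operations with a highly parallel architecture. Given these characteristics, the multi-pattern string matching algorithm M-BNDM, which is based on bit-parallel techniques, was ported to the GPU and optimized. The input data are preprocessed so that string matching reduces to bit operations better suited to CUDA. The factors that affect the performance of the CUDA-based parallel string matching algorithm are analyzed. Experimental results show that the algorithm achieves a speedup of more than tenfold over the equivalent CPU algorithm.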
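
M-BNDM itself is not described in the abstract. To show what "reducing string matching to bit operations" can look like, the sketch below uses the simpler single-pattern Shift-And algorithm, a related bit-parallel technique and not the paper's method. The function name is an assumption for this example.

```python
def shift_and_search(text, pattern):
    """Return the start positions of `pattern` in `text` using Shift-And."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    # Preprocessing: bit i of masks[ch] is set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0              # bit i set: pattern[:i+1] matches ending here
    hits = []
    for pos, ch in enumerate(text):
        # Extend every partial match by one character in a single step.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits
```

The per-character work is one shift, one OR, and one AND regardless of how many partial matches are active, which is the kind of uniform operation the abstract says suits the GPU.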

16.

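gAC: a high-performance AC algorithm on the GPU

陈虎 彭江锋 施少怀 《计算机工程与应用》 (Computer Engineering and Applications) 2012,48(12):43-48

String matching is one of the most widely studied problems in computer science and has become a core operation in fields such as information retrieval and computational biology. Limited by CPU computing power and memory bandwidth, however, traditional serial string matching algorithms are hard to speed up further. GPUs offer much greater computing power and memory bandwidth and have delivered excellent results in many applications. gAC is a GPU-based parallel AC algorithm. Targeting the GPU's SIMT (Single-Instruction Multiple-Thread) execution and coalesced memory access, it applies optimizations such as reducing conditional branches and coalescing accesses to global memory. On a C1060 GPU it reaches a string scanning speed of 51 Gb/s, a 28x improvement over the CPU-based serial algorithm.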
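
Assuming "AC" here refers to the Aho-Corasick multi-pattern matching algorithm, the serial Python sketch below shows the automaton that a GPU version would scan text with. It illustrates the baseline algorithm only; none of gAC's GPU optimizations are reproduced, and the function names are chosen for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the given patterns."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every occurrence in `text`."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((pos - len(p) + 1, p))
    return hits
```

The `while` loop that follows failure links is a data-dependent branch, the kind of control flow the abstract says gAC works to reduce on SIMT hardware.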

17.

Two-phase execution of binary applications on CPU/GPU machines

Erzhou Zhu Ruhui Ma Yang Hou Yindong Yang Feng Liu Haibing Guan 《Computers & Electrical Engineering》2014

High computational power of GPUs (Graphics Processing Units) offers a promising accelerator for general-purpose computing. However, the need for dedicated programming environments has made the usage of GPUs rather complicated, and a GPU cannot directly execute binary code of a general-purpose application. This paper proposes a two-phase virtual execution environment (GXBIT) for automatically executing general-purpose binary applications on CPU/GPU architectures. GXBIT incorporates two execution phases. The first phase is responsible for extracting parallel hot spots from the sequential binary code. The second phase is responsible for generating the hybrid executable (both CPU and GPU instructions) for execution. This virtual execution environment works well for any applications that run repeatedly. The performance of generated CUDA (Compute Unified Device Architecture) code from GXBIT on a number of benchmarks is close to 63% of the hand-tuned GPU code. It also achieves much better overall performance than the native platforms.

18.

A convolutional neural network recognition algorithm based on CUDA

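张佳康 陈庆奎 《计算机工程》 (Computer Engineering) 2010,36(15):179-181

To examine how well the GPU, a stream-processor device with high floating-point capability, suits neural networks, the paper proposes a parallel recognition algorithm for convolutional neural networks. It uses Compute Unified Device Architecture (CUDA) technology, defines parallel data structures on top of it, and describes the mechanism for mapping computational tasks to CUDA. Experimental results show that the parallel recognition algorithm on a GPU with the GTX200 hardware architecture reaches a peak average floating-point throughput nearly 60 times that of the serial algorithm on the CPU, making it better suited to neural network applications.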