Similar Documents (20 results found)
1.
Research on an Object-Oriented Framework Supporting Multiple Parallel Computing Models
To support parallel programming, almost all programming languages support a single high-level parallel computing model by providing parallelism and synchronous communication mechanisms, for example Ada's tasks and rendezvous or Java's threads and synchronized methods. Clearly, such languages can support only one high-level parallel computing model. Although the single-model approach is simple and effective for some applications, real-world problems are often too complex to be solved entirely with a single model. This paper addresses the problem with object-oriented language mechanisms and framework techniques. By analyzing what the high-level parallel computing models in existing languages have in common, several new object-oriented language mechanisms are proposed. On this basis, the concept of a parallel object-oriented framework is introduced, and the paper discusses how it can be used to express and apply …

2.
Hash tables, known for their O(1) access time complexity, are a class of algorithms and data structures that provide efficient access to large-scale data and are adopted by a wide range of big-data applications, for example workloads and scenarios in emerging high-performance computing (HPC) and database domains. As the hardware performance of GPUs, used as high-performance coprocessors, keeps improving, parallel hash-table optimization for GPU environments has attracted a growing body of research. Existing GPU hash-table optimizations and solutions focus on exploiting the GPU's massive threading and high memory bandwidth to improve highly concurrent transaction processing and fast access to key-value pairs. However, existing work on GPU hash-table structures generally neglects effective management of GPU resources and does not address how to fully utilize GPU thread and device-memory resources. Meanwhile, the limited size of GPU device memory restricts the space available for storing the hash-table structure, so larger hash tables cannot be accommodated. Therefore, the scalability and performance of hash-table methods in GPU environments still face technical challenges. This paper proposes and designs a GPU-oriented hash-table technique, named Starfish, that can handle large-scale concurrent transactions. Starfish introduces a new "swap layer" technique based on asynchronous GPU streams to support dynamic hash tables beyond GPU device memory while preserving the performance of the GPU hash-table indexing method. To address the hash… caused by accesses from the GPU's massive number of threads …
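The Starfish design and its swap layer are not public in this abstract, so the following is only a minimal sketch of the general technique it builds on: a massively threaded lookup kernel over an open-addressing (linear-probing) table held in device memory. All names (probe_table, EMPTY_KEY, and so on) are illustrative assumptions, not Starfish's API.

#include <cstdint>

// Illustrative sketch only: one GPU thread per query, linear probing in a
// device-resident open-addressing table. Not the Starfish implementation.
constexpr uint32_t EMPTY_KEY = 0xFFFFFFFFu;
constexpr uint32_t NOT_FOUND = 0xFFFFFFFFu;

__global__ void probe_table(const uint32_t* keys, const uint32_t* vals,
                            uint32_t capacity,
                            const uint32_t* queries, uint32_t* results,
                            uint32_t num_queries) {
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_queries) return;
    uint32_t q = queries[i];
    uint32_t slot = q % capacity;               // simple modulo hash
    for (uint32_t probe = 0; probe < capacity; ++probe) {
        uint32_t k = keys[slot];
        if (k == q)         { results[i] = vals[slot]; return; }
        if (k == EMPTY_KEY) { results[i] = NOT_FOUND;  return; }
        slot = (slot + 1) % capacity;           // linear probing
    }
    results[i] = NOT_FOUND;
}

A host would launch it as, for example, probe_table<<<(num_queries + 255) / 256, 256>>>(...) over arrays allocated with cudaMalloc, one lookup per thread.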

3.
Network application services, especially e-banking and e-commerce, rely on data encryption for secure communication, and many application servers face the challenge of performing large amounts of computationally intensive encryption. CUDA (Compute Unified Device Architecture) is a platform for parallel and general-purpose computing on GPUs that can use existing graphics-card resources to improve encryption performance at low cost. A parallel CUDA implementation of AES (Advanced Encryption Standard) was developed on an Nvidia GeForce G210 graphics card, and a serial AES implementation on an AMD Athlon 7850. The parallel AES algorithm avoids synchronization and communication among threads within the same thread block, improving GPU acceleration; its speedup is 2.66 to 3.34 times higher than that of Manavski's parallel AES-128 algorithm. A performance model of the parallel AES algorithm is explored for large data volumes (up to 32 MB), and the acceleration is analyzed from the perspective of speedup efficiency for the first time. On a 16-core GPU, the parallel AES algorithm reaches a speedup of up to 15.83 and a speedup efficiency of 99.898%.
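The structural point of that abstract, avoiding intra-block synchronization by giving every thread an independent 16-byte block, can be sketched as below. The cipher round function is a placeholder, not real AES, and all names are illustrative assumptions rather than the paper's code.

#include <cstdint>

// Sketch of the thread organization only: one thread encrypts one independent
// 16-byte block (ECB/CTR-style), so no intra-block __syncthreads() or
// inter-thread communication is needed. encrypt_block() is a placeholder for a
// real AES round implementation, which is omitted here.
__device__ void encrypt_block(const uint8_t* in, uint8_t* out,
                              const uint8_t* round_keys) {
    for (int i = 0; i < 16; ++i)                // placeholder, NOT real AES
        out[i] = in[i] ^ round_keys[i];
}

__global__ void aes_like_kernel(const uint8_t* plaintext, uint8_t* ciphertext,
                                const uint8_t* round_keys, size_t num_blocks) {
    size_t b = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (b >= num_blocks) return;
    encrypt_block(plaintext + 16 * b, ciphertext + 16 * b, round_keys);
}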

4.
Task granularity is a key factor in the performance of task-parallel programs, and since the optimal granularity may differ across applications, an adaptive task-granularity strategy supporting OpenMP 3.0 is proposed for the heterogeneous multi-core Cell processor. The strategy first generates tasks breadth-first until all threads are saturated; afterwards, when a thread has finished its own tasks and becomes idle, new tasks are created by backtracking to the earliest node in a busy thread's task tree from which tasks can still be spawned, so that the idle thread can steal and execute them. This not only maximizes the granularity of the generated tasks but also effectively resolves load imbalance. Experiments on a Cell processor show that, compared with sequential execution, the adaptive task-granularity strategy achieves speedups of 4.1 to 7.2; it outperforms the existing Tascell and AdaptiveTC schemes and exhibits good scalability for most applications.

5.
Efficient CUDA Implementation Techniques for the RSA Algorithm
CUDA (Compute Unified Device Architecture), a computing architecture that supports general-purpose computing on GPUs, has been widely applied to large-scale data-parallel computation. RSA is a computationally intensive public-key cryptographic algorithm. This paper presents efficient CUDA-based parallelization techniques for RSA, whose key idea is to introduce a large number of independent, concurrent Montgomery modular-multiplication threads; it also details the thread organization, data storage structures, and shared-memory-based performance optimizations. Using this CUDA implementation, the computational performance and throughput of RSA were measured on a GPU. Experimental results show that, compared with a general-purpose CPU implementation of RSA, the CUDA implementation achieves a speedup of more than 40 times.
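The key idea above, many independent Montgomery modular-multiplication threads, can be illustrated with a minimal single-word sketch. Real RSA uses multi-word (1024/2048-bit) operands and the shared-memory optimizations the paper describes, which are omitted; the names below are illustrative, not the paper's code.

#include <cstdint>

// Single-word (32-bit) Montgomery multiplication: returns a*b*R^-1 mod n, with
// R = 2^32, n odd, n_prime = -n^-1 mod 2^32, and operands already in Montgomery form.
__device__ uint32_t montmul32(uint32_t a, uint32_t b, uint32_t n, uint32_t n_prime) {
    uint64_t t = (uint64_t)a * b;
    uint32_t t_lo = (uint32_t)t, t_hi = (uint32_t)(t >> 32);
    uint32_t m = t_lo * n_prime;                        // mod 2^32
    uint64_t mn = (uint64_t)m * n;
    uint32_t mn_lo = (uint32_t)mn, mn_hi = (uint32_t)(mn >> 32);
    uint32_t carry = (uint32_t)(((uint64_t)t_lo + mn_lo) >> 32);
    uint64_t u = (uint64_t)t_hi + mn_hi + carry;        // (t + m*n) / 2^32
    if (u >= n) u -= n;
    return (uint32_t)u;
}

// One independent Montgomery multiplication per thread, mirroring the abstract's
// "large number of independent, concurrent Montgomery modular-multiplication threads".
__global__ void montmul_batch(const uint32_t* a, const uint32_t* b, uint32_t* out,
                              uint32_t n, uint32_t n_prime, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) out[i] = montmul32(a[i], b[i], n, n_prime);
}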

6.
The gene expression profile is partitioned into blocks that match the thread-structure characteristics of GPU parallel computing; a two-level parallel scheme is designed according to these characteristics, and the texture cache is used for efficient memory access. Based on the CPU L2 cache capacity, the basic blocks are further subdivided into sub-blocks to improve the cache hit rate; data prefetching reduces the number of memory accesses, and thread binding reduces thread migration between cores. The gene mutual-information computation is distributed between the CPU and the GPU according to their respective computing power so as to balance the load. On top of a newly designed threshold-computation algorithm, a memory-efficient CPU/GPU parallel algorithm for constructing global gene regulatory networks is designed and implemented. Experimental results show that, compared with existing algorithms, the proposed algorithm achieves a clearer speedup and can construct larger global gene regulatory networks.

7.
张健浪 《个人电脑》2009,15(7):102-106
Thanks to its highly parallel computing architecture, the GPU far outperforms the CPU at stream processing, with floating-point performance often more than ten times that of a CPU of the same class. This trend will continue, and with supporting software environments the GPU's range of applications will extend to every domain that involves heavy floating-point workloads. Its general-purpose capability will keep growing, and its standing within the computer will keep rising, to the point of rivaling the CPU.

8.
Most parallel processing systems for remote-sensing image data rely on foreign commercial products, while domestically developed parallel computing systems lack the workflow support and parallel performance needed for production at scale. To address this, a workflow-driven, high-performance distributed parallel processing platform architecture is proposed, based on Hadoop's HDFS and MapReduce cluster parallelism, cooperative CPU/GPU parallel processing, memory mapping, BMP, and related techniques. Experimental results show that the mixed-granularity parallel architecture, combining a workstation cluster with parallelism inside each workstation, improves the platform's parallel processing performance and provides an independently developed solution for batch production of massive remote-sensing image data products.

9.
As the graphics processing unit (GPU) has evolved from a device used only for graphics rendering into a general-purpose parallel computing platform (GPGPU), its computational power has grown steadily. Building on a study of the GPGPU architecture, this paper investigates thread scheduling for GPGPU parallel computing, explains the principles of GPU thread scheduling, and exposes the shortcomings of the SIMT scheduling model. Through formula derivation, it describes the relationship between system power consumption and operating frequency.
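The abstract does not reproduce the derivation; a standard CMOS dynamic-power model commonly used for such derivations (an assumption here, not necessarily the paper's exact formula) is:

\[
P_{\text{dyn}} = \alpha\, C\, V_{dd}^{2}\, f, \qquad V_{dd} \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3},
\]

where \(\alpha\) is the switching activity factor, \(C\) the switched capacitance, \(V_{dd}\) the supply voltage, and \(f\) the operating frequency; under DVFS the sustainable voltage scales roughly with frequency, so dynamic power grows roughly cubically with frequency.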

10.
张珩  崔强  侯朋朋  武延军  赵琛 《软件学报》2020,31(4):1225-1239
In complex network theory, core decomposition is one of the most fundamental methods for measuring the "importance" of network nodes and analyzing core subgraphs. It is widely used in applications such as user behavior analysis in social networks, visualization of complex networks, and static code analysis of large software systems. As the scale and complexity of graph data grow, existing parallel core-decomposition algorithms designed for multi-core CPUs can no longer meet the high-performance computing demands of large data volumes, owing to the limits on CPU core counts and memory bandwidth, which seriously hampers complex-network analysis. General-purpose GPUs provide highly parallel computation with more than ten thousand threads and memory bandwidth above 100 GB/s, and have been widely used for efficient parallel analysis of large-scale graph data, for example breadth-first traversal and shortest-path algorithms. To achieve more efficient core decomposition, two parallel strategies for core decomposition of complex networks on GPU platforms are proposed. The first, RLCore, is based on graph traversal: it uses the GPU's highly concurrent computing power to traverse the graph bottom-up, iteratively assigning each node to its core level. The second, ESCore, is based on local convergence: each node repeatedly aggregates the current values of its neighbors and updates its own value until convergence. Compared with RLCore, ESCore greatly reduces the synchronization overhead of GPU threads updating the same node during traversal, while its number of iterations depends on the convergence rate. Experiments on real network graph data show that the two strategies substantially outperform existing methods in efficiency and scalability, reaching up to 33.6 times the performance of a single-threaded algorithm and an edge-traversal throughput (TEPS) of 4.06 million edges per second; per iteration, ESCore executes more efficiently than RLCore.
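A minimal sketch of one round of a convergence-style core update in the spirit of ESCore (not the paper's code) is shown below: every node recomputes its core estimate as the h-index of its neighbours' current estimates, i.e. the largest h such that at least h neighbours have an estimate of at least h. The CSR layout and all names are assumptions for illustration.

// One ESCore-like round over a CSR graph (row_ptr, col_idx). Estimates start at the
// node degrees; the host swaps est_in/est_out and repeats until *changed stays 0,
// at which point est holds the k-core numbers.
__global__ void escore_round(const int* row_ptr, const int* col_idx,
                             const int* est_in, int* est_out,
                             int num_nodes, int* changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_nodes) return;
    int begin = row_ptr[v], end = row_ptr[v + 1];
    int degree = end - begin;
    int h = min(est_in[v], degree);        // estimates only decrease over rounds
    for (; h > 0; --h) {                   // O(deg^2) sketch; real codes use counting
        int cnt = 0;
        for (int e = begin; e < end; ++e)
            if (est_in[col_idx[e]] >= h) ++cnt;
        if (cnt >= h) break;
    }
    est_out[v] = h;
    if (h != est_in[v]) atomicExch(changed, 1);
}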

11.
Hybrid CPU/GPU clusters have recently drawn much attention in high-performance computing because of their excellent execution performance and energy efficiency. Many supercomputing sites in the newest TOP500 and Green500 lists are built from hybrid CPU/GPU clusters instead of CPU-only clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high that most users hesitate to move to this new cluster computing platform. To address this problem, this paper proposes BigGPU, a distributed PTX virtual machine for heterogeneous clusters. As its name suggests, this virtual machine is physically a distributed system that re-compiles and executes PTX code in parallel by aggregating the CPUs and GPUs available in a computational cluster. With the support of this virtual machine, users can regard a hybrid CPU/GPU cluster as a single large-scale GPU. Consequently, they can develop applications using only CUDA, without combining MPI and multithreading APIs, while still using distributed CPUs and GPUs simultaneously to solve the same problem. Moreover, they need not handle load balancing among heterogeneous processors or the device-memory and thread-configuration constraints of physical GPUs, because BigGPU supports a large-scale virtual device-memory space and thread configuration. We also evaluate the execution performance of BigGPU; our experimental results show that it can effectively exploit the computational power of CPUs and GPUs to enhance the execution performance of users' CUDA programs.

12.
Graphics processing units (GPUs) have become popular high-performance computing platforms for a wide range of applications. The trend of processing graph structures on modern GPUs has also attracted increasing interest in recent years. This article reviews research on adapting the massively parallel architecture of GPUs to accelerate fundamental graph operations. Despite their merits, factors such as the unique architecture of GPUs, limited programming models, and the irregular structure of graphs prevent GPU implementations from achieving high performance. Thus, this survey also discusses challenges and optimization techniques used by recent studies to fully utilize the GPU's capability. A categorization of the existing research works is also presented based on the specific issues they attempted to solve.

13.
Implementation and Optimization of Stencil Operations on GPUs
With the rapid development of GPUs, using them to accelerate scientific computing applications has become an inevitable trend. This paper extracts Rprj3 and Interp, two representative stencil-rich subroutines of Mgrid in SPEC2000, and ports them to an AMD GPU using the Brook+ language. Using the thread-tuning mechanism provided by Brook+, we implemented program versions with different thread granularities, analyzed why the speedups differ, and summarized what thread-granularity tuning implies for porting stencil programs. Using an AMD Radeon HD 4870 GPU as the experimental platform and comparing against an Intel Xeon E5405 CPU, we found that at the largest problem size Rprj3 achieves a 5.37x speedup over the CPU version and Interp achieves 12.8x.
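The paper itself uses Brook+ on AMD hardware, which is not shown here; as an illustration of the thread-granularity knob it tunes, the CUDA sketch below lets each thread compute ELEMS_PER_THREAD output points of a simple 1-D 3-point stencil. Rprj3 and Interp are 3-D multigrid restriction/interpolation operators, so this is only the shape of the granularity experiment, not their actual code.

// Illustrative CUDA stencil with a compile-time work-per-thread (granularity) knob.
template <int ELEMS_PER_THREAD>
__global__ void stencil3(const float* in, float* out, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS_PER_THREAD;
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k;
        if (i >= 1 && i < n - 1)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}
// Launch with e.g. stencil3<4><<<(n / 4 + 255) / 256, 256>>>(d_in, d_out, n);
// varying ELEMS_PER_THREAD changes the work per thread and hence the speedup.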

14.
Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious GPU performance degradation due to branch divergence. In this paper, we propose the concurrent warp execution (CWE) technique to reduce the performance degradation of the GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over PDOM, 91% over DWF) with little hardware overhead.
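CWE is a hardware scheduling technique and cannot be expressed in user code; the small CUDA kernel below only illustrates the branch divergence it targets. Threads of the same warp take different paths depending on their data, so under SIMT the two paths are serialized and part of the warp's lanes sit idle in each one.

__global__ void divergent_kernel(const int* data, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (data[i] % 2 == 0)          // even values take one path...
        out[i] = data[i] * 2;
    else                           // ...odd values the other: the warp executes both
        out[i] = data[i] + 1;
}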

15.
In light of GPUs' powerful floating-point operation capacity, heterogeneous parallel systems incorporating general-purpose CPUs and GPUs have become a highlight in the research field of high performance computing (HPC). However, due to the complexity of programming on GPUs, porting a large number of existing scientific computing applications to heterogeneous parallel systems remains a big challenge. The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing. To effectively inherit existing OpenMP applications and reduce the transplant cost, we extend OpenMP with a group of compiler directives, which explicitly divide tasks among the CPU and the GPU, and map time-consuming computing fragments to run on the GPU, thus dramatically simplifying the transplantation. We have designed and implemented MPtoStream, a compiler of the extended OpenMP for AMD's stream processing GPUs. Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedups ranging from 3.1 to 17.3 on a heterogeneous system, incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU, over the execution on the Xeon CPU alone.

16.
Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters
Nowadays, NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to an MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.
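A minimal sketch of the loop-partitioning idea follows: each MPI rank takes a contiguous chunk of the iteration space and processes it with CUDA on its local GPU. The OpenMP layer inside each node and the C1060/S1070 specifics from the paper are omitted, and all names are illustrative assumptions (the caller is assumed to have run MPI_Init and selected a device).

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void saxpy(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Each MPI rank handles one chunk of the global loop on its local GPU.
void hybrid_saxpy(float* x, float* y, float a, int n_total) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int chunk = (n_total + size - 1) / size;
    int begin = rank * chunk;
    int n = n_total - begin;
    if (n > chunk) n = chunk;
    if (n <= 0) return;

    float *dx, *dy;
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&dy, n * sizeof(float));
    cudaMemcpy(dx, x + begin, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y + begin, n * sizeof(float), cudaMemcpyHostToDevice);
    saxpy<<<(n + 255) / 256, 256>>>(dx, dy, a, n);
    cudaMemcpy(y + begin, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy);
}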

17.
Graphics Processing Units (GPUs) are widely used in high performance computing, due to their high computational power and high performance per Watt. However, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This not only affects performance, but also power consumption. The most common way to utilize a GPU cluster is a hybrid model, in which the GPU is used to accelerate the computation, while the CPU is responsible for the communication. This approach always requires a dedicated CPU thread, which consumes additional CPU cycles and therefore increases the power consumption of the complete application. In recent work we have shown that the GPU is able to control the communication independently of the CPU. However, there are several problems with GPU-controlled communication. The main problem is intra-GPU synchronization, since GPU blocks are non-preemptive. Therefore, the use of communication requests within a GPU can easily result in a deadlock. In this work we show how dynamic parallelism solves this problem. GPU-controlled communication in combination with dynamic parallelism allows keeping the control flow of multi-GPU applications on the GPU and bypassing the CPU completely. Using other in-kernel synchronization methods results in massive performance losses, due to the forced serialization of the GPU thread blocks. Although the performance of applications using GPU-controlled communication is still slightly worse than the performance of hybrid applications, we will show that performance per Watt increases by up to 10% while still using commodity hardware.
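The GPU-controlled communication layer in that work relies on specialized RDMA support and cannot be sketched here; the fragment below only illustrates CUDA dynamic parallelism itself, the mechanism the abstract uses to keep a multi-step control flow on the GPU. It assumes compute capability 3.5 or newer and relocatable device code (nvcc -rdc=true); all names are illustrative.

// A parent kernel launches child grids from the device, so successive steps can
// run without returning control to the CPU. Device-side launches issued by the
// same thread into the same (default) stream execute in order, and the parent
// grid finishes only after all of its child grids have completed.
__global__ void child(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent(float* data, int n, int steps) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (int s = 0; s < steps; ++s)
            child<<<(n + 255) / 256, 256>>>(data, n);   // device-side launch
    }
}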

18.
Graphics processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them more accessible to scientific programmers. We discuss the application of GPU programming to two significantly different paradigms: regular mesh field equations with unusual boundary conditions and graph analysis algorithms. The differing optimization techniques required for these two paradigms cover many of the challenges faced when developing GPU applications. We discuss the relevance of these application paradigms to simulation engines and games. GPUs were aimed primarily at the accelerated graphics market, but since this is often closely coupled to advanced game products, it is interesting to speculate about the future of fully integrated accelerator hardware for both visualization and simulation combined. As well as reporting the speed-up performance on selected simulation paradigms, we discuss suitable data-parallel algorithms and present code examples for exploiting GPU features like large numbers of threads and localized texture memory. We find a surprising variation in the performance that can be achieved on GPUs for our applications and discuss how these findings relate to past known effects in parallel computing such as memory-speed-related super-linear speed-up.

19.
The high computational power of GPUs (Graphics Processing Units) makes them a promising accelerator for general-purpose computing. However, the need for dedicated programming environments has made the use of GPUs rather complicated, and a GPU cannot directly execute the binary code of a general-purpose application. This paper proposes a two-phase virtual execution environment (GXBIT) for automatically executing general-purpose binary applications on CPU/GPU architectures. GXBIT incorporates two execution phases. The first phase is responsible for extracting parallel hot spots from the sequential binary code. The second phase is responsible for generating the hybrid executable (both CPU and GPU instructions) for execution. This virtual execution environment works well for applications that run repeatedly. The performance of the CUDA (Compute Unified Device Architecture) code generated by GXBIT on a number of benchmarks reaches about 63% of that of hand-tuned GPU code. It also achieves much better overall performance than the native platforms.

20.
The parallel preconditioned conjugate gradient method (CGM) is used in many applications of scientific computing and often has a critical impact on their performance and energy consumption. This article investigates the energy-aware execution of the CGM on multi-core CPUs and GPUs used in an adaptive FEM. Based on experiments, an application-specific execution-time and energy model is developed. The model considers the execution speed of the CPU and the GPU, their electrical power, voltage and frequency scaling, and the energy consumption of the memory, as well as the time and energy needed for transferring data between main memory and GPU memory. The model makes it possible to predict how to distribute the data to the processing units to achieve the most energy-efficient execution: the execution might use the CPU only, the GPU only, or both simultaneously with a dynamic and adaptive collaboration scheme. The dynamic collaboration enables an execution that minimises the execution time. By measuring execution times for every FEM iteration, the data distribution is adapted automatically to changing properties, e.g. the data sizes.
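The abstract does not reproduce the model's equations; a generic form such a CPU/GPU time-and-energy model often takes (an illustrative assumption, not the authors' exact formulation) is:

\[
T(\beta) = \max\bigl(T_{\mathrm{CPU}}(\beta N),\; T_{\mathrm{GPU}}((1-\beta)N) + T_{\mathrm{transfer}}\bigr),
\]
\[
E(\beta) = P_{\mathrm{CPU}}\,T_{\mathrm{CPU}}(\beta N) + P_{\mathrm{GPU}}\,T_{\mathrm{GPU}}((1-\beta)N) + P_{\mathrm{mem}}\,T(\beta) + E_{\mathrm{transfer}},
\]

where \(N\) is the problem size and \(\beta\) the fraction of data assigned to the CPU; the distribution is then chosen to minimise \(T(\beta)\) or \(E(\beta)\), depending on whether time or energy is the objective.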
