期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

dOpenCL: Towards uniform programming of distributed heterogeneous multi-/many-core systems

Philipp Kegel Michel Steuwer Sergei Gorlatch 《Journal of Parallel and Distributed Computing》2013

Modern computer systems become increasingly distributed and heterogeneous by comprising multi-core CPUs, GPUs, and other accelerators. Current programming approaches for such systems usually require the application developer to use a combination of several programming models (e.g., MPI with OpenCL or CUDA) in order to exploit the system’s full performance potential. In this paper, we present dOpenCL (distributed OpenCL)—a uniform approach to programming distributed heterogeneous systems with accelerators. dOpenCL allows the user to run unmodified existing OpenCL applications in a heterogeneous distributed environment. We describe the challenges of implementing the OpenCL programming model for distributed systems, as well as its extension for running multiple applications concurrently. Using several example applications, we compare the performance of dOpenCL with MPI + OpenCL and standard OpenCL implementations. 相似文献

2.

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

José M. Andión Manuel Arenaz François Bodin Gabriel Rodríguez Juan Touriño 《International journal of parallel programming》2016,44(3):620-643

The use of GPUs for general purpose computation has increased dramatically in the past years due to the rising demands of computing power and their tremendous computing capacity at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving place to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer complicating its exploitation. This paper presents a new technique to automatically rewrite sequential programs into a parallel counterpart targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the simple-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs. 相似文献

3.

Generating data transfers for distributed GPU parallel programs

F. Silber-Chaussumier A. Muller R. Habel 《Journal of Parallel and Distributed Computing》2013

Nowadays, high performance applications exploit multiple level architectures, due to the presence of hardware accelerators like GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node between the CPU and the accelerators and between computing nodes. We consider the case where the intra-node parallelism is handled with HMPP compiler directives and message-passing programming with MPI is used to program the inter-node communications. This way of programming on such an heterogeneous architecture is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into an heterogeneous HMPP + MPI exploiting multiple GPUs located on different computing nodes. 相似文献

4.

Mapping of option pricing algorithms onto heterogeneous many-core architectures

Shuai Zhang Zhao Wang Ying Peng Bertil Schmidt Weiguo Liu 《The Journal of supercomputing》2017,73(9):3715-3737

The rapid development of technologies and applications in recent years poses high demands and challenges for high-performance computing. Because of their competitive performance/price ratio, heterogeneous many-core architectures are widely used in high-performance computing areas. GPU and Xeon Phi are two popular general-purpose many-core accelerators. In this paper, we demonstrate how heterogeneous many-core architectures, powered by multi-core CPUs, CUDA-enabled GPUs and Xeon Phis can be used as an efficient computational platform to accelerate popular option pricing algorithms. In order to make full use of the compute power of this architecture, we have used a hybrid computing model which consists of two types of data parallelism: worker level and device level. The worker level data parallelism uses a distributed computing infrastructure for task distribution, while the device level data parallelism uses both the multi-core CPUs and many-core accelerators for fast option pricing calculation. Experiments show that our implementations achieve good performance and scalability on this architecture and also outperform other state-of-the-art GPU-based solutions for Monte Carlo European/American option pricing and BSDE European option pricing. 相似文献

5.

基于代理的网格计算中间件 总被引：11，自引：0，他引：11

陈亚玲桂小林王庆江钱德沛《计算机研究与发展》2003,40(12):1806-1810

WADE系统是基于代理技术实现的一个可屏蔽异构和分布性的动态自适应的校园计算网格，提出了基于代理技术在校园网络内实现并行计算的方法，详细论述了基于代理的网格计算中间件的体系结构和主要模块功能，阐述了利用代理实现异构编译、协同计算的过程，给出了代理的Java实现方法，利用软件代理实现网格计算中间件，可以解决异构计算平台下多种并行编程环境的协同计算问题，为用户提供统一的服务接口，这将大大增强系统的可用性。相似文献

6.

Data-intensive document clustering on graphics processing unit (GPU) clusters 总被引：1，自引：0，他引：1

Yongpeng ZhangAuthor VitaeFrank MuellerAuthor Vitae Xiaohui CuiAuthor Vitae 《Journal of Parallel and Distributed Computing》2011,71(2):211-224

Document clustering is a central method to mine massive amounts of data. Due to the explosion of raw documents generated on the Internet and the necessity to analyze them efficiently in various intelligent information systems, clustering techniques have reached their limitations on single processors. Instead of single processors, general-purpose multi-core chips are increasingly deployed in response to diminishing returns in single-processor speedup due to the frequency wall, but multi-core benefits only provide linear speedups while the number of documents in the Internet is growing exponentially. Accelerating hardware devices represent a novel promise for improving the performance for data-intensive problems such as document clustering. They offer more radical designs with a higher level of parallelism but adaptation to novel programming environments.In this paper, we assess the benefits of exploiting the computational power of graphics processing units (GPUs) to study two fundamental problems in document mining, namely to calculate the term frequency-inverse document frequency (TF-IDF) and cluster a large set of documents. We transform traditional algorithms into accelerated parallel counterparts that can be efficiently executed on many-core GPU architectures. We assess our implementations on various platforms, ranging from stand-alone GPU desktops to Beowulf-like clusters equipped with contemporary GPU cards. We observe at least one order of magnitude speedups over CPU-only desktops and clusters. This demonstrates the potential of exploiting GPU clusters to efficiently solve massive document mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge. 相似文献

7.

MPtostream:an OpenMP compiler for CPU-GPU heterogeneous parallel systems

《中国科学:信息科学(英文版)》2012,(9):1961-1971

In light of GPUs’ powerful floating-point operation capacity,heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing(HPC).However,due to the complexity of programming on GPUs,porting a large number of existing scientific computing applications to the heterogeneous parallel systems remains a big challenge.The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing.To effectively inherit existing OpenMP applications and reduce the transplant cost,we extend OpenMP with a group of compiler directives,which explicitly divide tasks among the CPU and the GPU,and map time-consuming computing fragments to run on the GPU,thus dramatically simplifying the transplantation.We have designed and implemented MPtoStream,a compiler of the extended OpenMP for AMD’s stream processing GPUs.Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system,incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU,over the execution on the Xeon CPU alone. 相似文献

8.

Implementing molecular dynamics on hybrid high performance computers – short range forces

W. Michael Brown Peng Wang Steven J. Plimpton Arnold N. Tharrington 《Computer Physics Communications》2011,182(4):898-911

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS, however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores. 相似文献

9.

Towards a performance-portable description of geometric multigrid algorithms using a domain-specific language

《Journal of Parallel and Distributed Computing》2014,74(12):3191-3201

相似文献

10.

一种面向异构计算的结构化并行编程框架 总被引：1，自引：0，他引：1

李安民计卫星廖心怡高建花谈兆年王一拙石峰《计算机工程与科学》2019,41(3):424-432

随着人工智能时代的到来,异构计算在深度学习、科学计算等领域发挥着越来越重要的作用。目前异构计算系统在应用上的瓶颈之一在于缺少高效的软件开发框架,已有的OpenCL、CUDA等支持GPU、DSP及FPGA的编程框架基于C/C++语言和传统的并行编程方法,导致软件开发效率较低,软件推理和调试困难,难以灵活处理计算设备之间的协作和调度。提出一种面向异构计算平台的基于脚本语言的结构化并行编程框架,提供结构化的并行编程接口,支持计算任务到异构计算设备的映射,便于并行程序的推理和验证。设计并实现了基于遗传算法的结构化调度算法,充分利用异构计算系统的计算能力,提高了异构计算系统的软件开发效率。实验结果表明,提出的编程框架在CPU+GPU平台上实现了相对于单处理器1.5到2.5倍的加速比。相似文献