期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A new era in scientific computing: Domain decomposition methods in hybrid CPU–GPU architectures

M. Papadrakakis G. Stavroulakis A. Karatarakis 《Computer Methods in Applied Mechanics and Engineering》2011,200(13-16):1490-1508

Recent advances in graphics processing units (GPUs) technology open a new era in high performance computing. Applications of GPUs to scientific computations are attracting a lot of attention due to their low cost in conjunction with their inherently remarkable performance features and the recently enhanced computational precision and improved programming tools. Domain decomposition methods (DDM) constitute today an important category of methods for the solution of highly demanding problems in simulation-based applied science and engineering. Among them, dual domain decomposition methods have been successfully applied in a variety of problems in both sequential as well as in parallel/distributed processing systems. In this work, we demonstrate the implementation of the FETI method to a hybrid CPU–GPU computing environment. Parametric tests on implicit finite element structural mechanics benchmark problems revealed the tremendous potential of this type of hybrid computing environment as a result of the full exploitation of multi-core CPU hardware resources and the intrinsic software and hardware features of the GPUs as well as the numerical properties of the solution method. 相似文献

2.

Mapping of option pricing algorithms onto heterogeneous many-core architectures

Shuai Zhang Zhao Wang Ying Peng Bertil Schmidt Weiguo Liu 《The Journal of supercomputing》2017,73(9):3715-3737

The rapid development of technologies and applications in recent years poses high demands and challenges for high-performance computing. Because of their competitive performance/price ratio, heterogeneous many-core architectures are widely used in high-performance computing areas. GPU and Xeon Phi are two popular general-purpose many-core accelerators. In this paper, we demonstrate how heterogeneous many-core architectures, powered by multi-core CPUs, CUDA-enabled GPUs and Xeon Phis can be used as an efficient computational platform to accelerate popular option pricing algorithms. In order to make full use of the compute power of this architecture, we have used a hybrid computing model which consists of two types of data parallelism: worker level and device level. The worker level data parallelism uses a distributed computing infrastructure for task distribution, while the device level data parallelism uses both the multi-core CPUs and many-core accelerators for fast option pricing calculation. Experiments show that our implementations achieve good performance and scalability on this architecture and also outperform other state-of-the-art GPU-based solutions for Monte Carlo European/American option pricing and BSDE European option pricing. 相似文献

3.

GPU-based Acceleration of System-level Design Tasks

Unmesh D. Bordoloi Samarjit Chakraborty 《International journal of parallel programming》2010,38(3-4):225-253

Many system-level design tasks (e.g., high-level timing analysis, hardware/software partitioning and design space exploration) involve computational kernels that are intractable (usually NP-hard). As a result, they involve high running times even for mid-sized problems. In this paper we explore the possibility of using commodity graphics processing units (GPUs) to accelerate such tasks that commonly arise in the electronic design automation (EDA) domain. We demonstrate this idea via two detailed case studies. The first explores the possibility of using GPUs to speedup standard schedulability analysis problems. The second proposes a GPU-based engine for a general hardware/software design space exploration problem. Not only do these problems commonly arise in the embedded systems domain, their computational kernels turn out to be variants of a combinatorial optimization problem—viz., the knapsack problem—that lies at the heart of several EDA applications. Experimental results show that our GPU-based implementations offer very attractive speedups for the computational kernels (up to 100×), and speedups of up to 17× for the full problem. In contrast to ASIC/FPGA-based accelerators—given that even low-end desktop and notebook computers are now equipped with GPUs—our solution involves no extra hardware cost. Although recent research has shown the benefits of using GPUs for a variety of non-graphics applications (e.g., in databases and bioinformatics), harnessing the parallelism of GPUs to accelerate problems from the EDA domain has not been sufficiently explored so far. We believe that our results and the generality of the core problem that we address will motivate researchers from this community to explore the possibility of using GPUs for a wider variety of problems from the EDA domain. 相似文献

4.

Containerization technologies: taxonomies,applications and challenges

Bentaleb Ouafa Belloum Adam S. Z. Sebaa Abderrazak El-Maouhab Aouaouche 《The Journal of supercomputing》2022,78(1):1144-1181

Modern scientific research challenges require new technologies, integrated tools, reusable and complex experiments in distributed computing infrastructures. But above all, computing power for efficient data processing and analyzing. Containers technologies have emerged as a new paradigm to address such intensive scientific applications problems. Their easy deployment in a reasonable amount of time and the few required computational resource make them more suitable. Containers are considered light virtualization solutions. They enable performance isolation and flexible deployment of complex, parallel, and high-performance systems. Moreover, they gained popularity to modernize and migrate scientific applications in computing infrastructure management. Additionally, they reduce computational time processing. In this paper, we first give an overview of virtualization and containerization technologies. We discuss the taxonomies of containerization technologies of the literature, and then we provide a new one that covers and completes those proposed in the literature. We identify the most important application domains of containerization and their technological progress. Furthermore, we discuss the performance metrics used in most containerization techniques. Finally, we point out research gaps in the related aspects of containerization technology that require more research.

相似文献

5.

Evaluation of performance and architectural efficiency of FPGAs and GPUs in the 40 and 28 nm generations for algorithms in 3D ultrasound computer tomography

Matthias Birk Matthias Balzer Nicole V. Ruiter Juergen Becker 《Computers & Electrical Engineering》2014

In heterogeneous computing, application developers have to identify the best-suited target platform from a variety of alternatives. In this work, we compare performance and architectural efficiency of Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) for two algorithms taken from a novel medical imaging method named 3D ultrasound computer tomography. From the 40 nm and 28 nm generations, we use top-notch devices and those with similar power consumption values. For our two benchmark algorithms from the signal processing and imaging domain, the results show that if power consumption is not considered, the GPU and FPGA from the 40nm generation give both, a similar performance and efficiency per transistor. In the 28 nm process, in contrast, the FPGA is superior to its GPU counterpart by 86% and 39%, depending on the algorithm. If power is limited, FPGAs outperform GPUs in each investigated case by at least a factor of four. 相似文献

6.

A Distributed PTX Virtual Machine on Hybrid CPU/GPU Clusters

《Journal of Systems Architecture》2016

Hybrid CPU/GPU cluster recently has drawn lots of attention from high performance computing because of excellent execution performance and energy efficiency. Many supercomputing sites in the newest TOP 500 and Green 500 are built by hybrid CPU/GPU clusters instead of CPU clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high such that most of users usually hesitate to move toward to this new cluster computing platform. To resolve this problem, we propose a distributed PTX virtual machine called BigGPU on heterogeneous clusters in this paper. As named, this virtual machine physically is a distributed system which is aimed at parallel re-compiling and executing the PTX codes by aggregating CPUs and GPUs available in a computational cluster. With the support of this virtual machine, users can regard a hybrid CPU/GPU as a single large-scale GPU. Consequently, they can develop applications by using only CUDA without combining MPI and multithreading APIs while can simultaneously use distributed CPUs and GPUs for resolving the same problem. Moreover, they need not handle the problem of load balance among heterogeneous processors and the constraints of device memory and thread configuration existing in physical GPUs because BigGPU supports large-scale virtual device memory space and thread configuration. On the other hand, we have evaluated the execution performance of BigGPU in this paper. Our experimental results have shown that BigGPU indeed can effectively exploit the computational power of CPUs and GPUs for enhancing the execution performance of user's CUDA programs. 相似文献

7.

Algorithmic performance studies on graphics processing units

Olaf Schenk Matthias Christen Helmar Burkhart 《Journal of Parallel and Distributed Computing》2008

We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix–matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip initially architectured for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operation is demonstrated resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications. 相似文献

8.

Analyzing GPU-controlled communication with dynamic parallelism in terms of performance and energy

《Parallel Computing》2016

Graphic Processing Units (GPUs) are widely used in high performance computing, due to their high computational power and high performance per Watt. However, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This not only affects performance, but also power consumption. The most common way to utilize a GPU cluster is a hybrid model, in which the GPU is used to accelerate the computation, while the CPU is responsible for the communication. This approach always requires a dedicated CPU thread, which consumes additional CPU cycles and therefore increases the power consumption of the complete application. In recent work we have shown that the GPU is able to control the communication independently of the CPU. However, there are several problems with GPU-controlled communication. The main problem is intra-GPU synchronization, since GPU blocks are non-preemptive. Therefore, the use of communication requests within a GPU can easily result in a deadlock. In this work we show how dynamic parallelism solves this problem. GPU-controlled communication in combination with dynamic parallelism allows keeping the control flow of multi-GPU applications on the GPU and bypassing the CPU completely. Using other in-kernel synchronization methods results in massive performance losses, due to the forced serialization of the GPU thread blocks. Although the performance of applications using GPU-controlled communication is still slightly worse than the performance of hybrid applications, we will show that performance per Watt increases by up to 10% while still using commodity hardware. 相似文献

9.

Optimizing convolution operations on GPUs using adaptive tiling

《Future Generation Computer Systems》2014

The research domain of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia data. High-performance computing techniques are necessary to satisfy the ever increasing computational demands of MMCA applications. The introduction of Graphics Processing Units (GPUs) in modern cluster systems presents application developers with a challenge. While GPUs are well known to be capable of providing significant performance improvements, the programming complexity vastly increases. To this end, we have extended a user transparent parallel programming model for MMCA, named Parallel-Horus, to allow the execution of compute intensive operations on the GPUs present in the cluster. The most important class of operations in the MMCA domain are convolutions, which are typically responsible for a large fraction of the execution time. Existing optimization approaches for CUDA kernels in general as well as those specific to convolution operations are too limited in both performance and flexibility. In this paper, we present a new optimization approach, called adaptive tiling, to implement a highly efficient, yet flexible, library-based convolution operation for modern GPUs. To the best of our knowledge, our implementation is the most optimized and best performing implementation of 2D convolution in the spatial domain available to date. 相似文献

10.

LBM算法在GPU组中的应用

王鹏封卫兵《计算机工程与设计》2011,32(12):4237-4240

为提高大规模并行计算的并行效率,充分发挥CPU与GPU的功能特点,特别是体现GPU强大的运算能力,提出了用消息传递接口(MPI)将一组GPU连接起来。使GPU通用计算与计算流体力学中的LBM(latticeBoltzmannmethod)算法相结合。根据GPU通用计算与LBM算法的原理,使MPI作为计算分配的机制,CUDA(compute unified device architecture)作为主要的计算执行引擎,建立支持CUDA的GPU集群,在集群上对LBM算法中的D2Q9模型进行二维方腔流数值模拟。实验结果表明,利用GPU组模拟与CPU模拟结果一致,更充分发挥了GPU的计算能力,提高了并行效率。相似文献

11.

Highly interactive computational steering for coupled 3D flow problems utilizing multiple GPUs

Jan Linxweiler Manfred Krafczyk Jonas Tölke 《Computing and Visualization in Science》2010,13(7):299-314

Most computational fluid dynamics (CFD) simulations require massive computational power which is usually provided by traditional High Performance Computing (HPC) environments. Although interactivity of the simulation process is highly appreciated by scientists and engineers, due to limitations of typical HPC environments, present CFD simulations are usually executed non interactively. A recent trend is to harness the parallel computational power of graphics processing units (GPUs) for general purpose applications. As an alternative to traditional massively parallel computing, GPU computing has also gained popularity in the CFD community, especially for its application to the lattice Boltzmann method (LBM). For instance, Tölke and others presented very efficient implementations of the LBM for 2D as well as 3D space (Toelke J, in Comput Visual Sci. (2008); Toelke J and Krafczk M, in Int J Comput Fluid Dyn 22(7): 443–456 (2008)). In this work we motivate the use of GPU computing to facilitate interactive CFD simulations. In our approach, the simulation is executed on multiple GPUs instead of traditional HPC environments, which allows the integration of the complete simulation process into a single desktop application. To demonstrate the feasibility of our approach, we show a fully bidirectional fluid-structure-interaction for self induced membrane oscillations in a turbulent flow. The efficiency of the approach allows a 3D simulation close to realtime. 相似文献

12.

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

José M. Andión Manuel Arenaz François Bodin Gabriel Rodríguez Juan Touriño 《International journal of parallel programming》2016,44(3):620-643

The use of GPUs for general purpose computation has increased dramatically in the past years due to the rising demands of computing power and their tremendous computing capacity at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving place to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer complicating its exploitation. This paper presents a new technique to automatically rewrite sequential programs into a parallel counterpart targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the simple-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs. 相似文献

13.

Android^\mathrm{TM} development and performance analysis

Alejandro Acosta Francisco Almeida 《The Journal of supercomputing》2014,70(2):649-659

相似文献

14.

The emergence of cellular computing

Sipper M. 《Computer》1999,32(7):18-26

The von Neumann architecture-which is based upon the principle of one complex processor that sequentially performs a single complex task at a given moment-has dominated computing technology for the past 50 years. Recently however, researchers have begun exploring alternative computational systems based on entirely different principles. Although emerging from disparate domains, the work behind these systems shares a common computational philosophy, which the author calls cellular computing. This philosophy promises to provide new means for doing computation more efficiently-in terms of speed, cost, power dissipation, information storage, and solution quality. Simultaneously, cellular computing offers the potential of addressing much larger problem instances than previously possible, at least for some application domains. Cellular computing has attracted increasing research interest. Work in this field has produced results that hold prospects for a bright future. Yet questions must be answered before cellular computing can become a mainstream paradigm. What classes of computational tasks are most suited to it? How do we match the specific properties and behaviors of a given model to a suitable class of problems? At its heart, cellular computing consists of three principles: simplicity, vast parallelism, and locality 相似文献

15.

基于延迟隐藏因子的GPU 计算模型

袁良张云泉龙国平王可张先轶《软件学报》2010,21(Z1):251-262

近年来在生物计算,科学计算等领域成功地应用了GPU 加速计算并获得了较高加速比.然而在GPU 上编程和调优过程非常繁琐,为此,研究人员提出了许多提高编程效率的编程模型和编译器,以及指导程序优化的计算模型,在一定程度上简化了GPU上的算法设计和优化,但是已有工作都存在一些不足.针对GPU低延迟高带宽的特性,提出了基于延迟隐藏因子的GPU 计算模型,模型提取算法隐藏延迟的能力,以指导算法优化.利用3 种矩阵乘算法进行实测与模型预测,实验结果表明,在简化模型的情况下,平均误差率为0.19. 相似文献

16.

面向边缘计算的嵌入式FPGA卷积神经网络构建方法

卢冶陈瑶李涛蔡瑞初宫晓利《计算机研究与发展》2018,55(3):551-562

当前,高计算消耗的应用和服务逐渐从集中式云计算中心向网络边缘的嵌入式环境迁移,FPGA因其灵活性和高能效特性,使其在边缘计算的嵌入式系统中得到广泛的应用.传统的FPGA卷积神经网络构造方法存在设计周期长和优化空间小等缺点,无法有效探索硬件加速器的设计空间,在网络边缘的的嵌入式环境下尤为明显.针对该问题,提出一种面向边缘计算的嵌入式FPGA平台卷积神经网络通用的构建方法.通过设计卷积神经网络函数中的网络层间可复用的加速器核心,以少量硬件资源实现性能优化的卷积神经网络硬件;通过拓展设计、缓存优化及数据流优化等技术,实现HLS设计优化;利用该方法在嵌入式FPGA平台上构建相应卷积神经网络,实验结果表明:优化后的网络模型在与Xeon E5-1620 CPU和GTX Titan GPU相比时,在功耗与性能方面具有一定优势,适合应用于边缘计算环境中. 相似文献

17.

Accelerating floating-point fitness functions in evolutionary algorithms: a FPGA-CPU-GPU performance comparison

Juan A. Gomez-Pulido Miguel A. Vega-Rodriguez Juan M. Sanchez-Perez Silvio Priem-Mendes Vitor Carreira 《Genetic Programming and Evolvable Machines》2011,12(4):403-427

Many large combinatorial optimization problems tackled with evolutionary algorithms often require very high computational times, usually due to the fitness evaluation. This fact forces programmers to use clusters of computers, a computational solution very useful for running applications of intensive calculus but having a high acquisition price and operation cost, mainly due to the Central Processing Unit (CPU) power consumption and refrigeration devices. A low-cost and high-performance alternative comes from reconfigurable computing, a hardware technology based on Field Programmable Gate Array devices (FPGAs). The main objective of the work presented in this paper is to compare implementations on FPGAs and CPUs of different fitness functions in evolutionary algorithms in order to study the performance of the floating-point arithmetic in FPGAs and CPUs that is often present in the optimization problems tackled by these algorithms. We have taken advantage of the parallelism at chip-level of FPGAs pursuing the acceleration of the fitness functions (and consequently, of the evolutionary algorithms) and showing the parallel scalability to reach low cost, low power and high performance computational solutions based on FPGA. Finally, the recent popularity of GPUs as computational units has moved us to introduce these devices in our performance comparisons. We analyze performance in terms of computation times and economic cost. 相似文献

18.

Exploiting GPUs with the Super Instruction Architecture

Nakul Jindal Victor Lotrich Erik Deumens Beverly A. Sanders 《International journal of parallel programming》2016,44(2):309-324

The Super Instruction Architecture (SIA) is a parallel programming environment designed for problems in computational chemistry involving complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays which are typically very large. The SIA consists of a domain specific programming language, Super Instruction Assembly Language (SIAL), and its runtime system, Super Instruction Processor. An important feature of SIAL is that algorithms are expressed in terms of blocks (or tiles) of multidimensional arrays rather than individual floating point numbers. In this paper, we describe how the SIA was enhanced to exploit GPUs, obtaining speedups ranging from two to nearly four for computational chemistry calculations, thus saving hours of elapsed time on large-scale computations. The results provide evidence that the “programming-with-blocks” approach embodied in the SIA will remain successful in modern, heterogeneous computing environments. 相似文献

19.

EESSI: A cross-platform ready-to-use optimised scientific software stack

Bob Dröge Victor Holanda Rusu Kenneth Hoste Caspar van Leeuwen Alan O'Cais Thomas Röblitz 《Software》2023,53(1):176-210

Getting scientific software installed correctly and ensuring it performs well has been a ubiquitous problem for several decades now, which is compounded currently by the changing landscape of computational science with the (re-)emergence of different microprocessor families, and the expansion to additional scientific domains like artificial intelligence and next-generation sequencing. The European Environment for Scientific Software Installations (EESSI) project aims to provide a ready-to-use stack of scientific software installations that can be leveraged easily on a variety of platforms, ranging from personal workstations to cloud environments and supercomputer infrastructure, without making compromises with respect to performance. In this article, we provide a detailed overview of the project, highlight potential use cases, and demonstrate that the performance of the provided scientific software installations can be competitive with system-specific installations. 相似文献

20.

Interactive Workspaces 总被引：1，自引：0，他引：1

Johanson B. Winograd T. Fox A. 《Computer》2003,36(4):99-101

We are rapidly entering a world in which people equip themselves with a small constellation of mobile devices and pass through environments rich in embedded technology. However, we are still a long way from harnessing the power of this technology in a way that seamlessly and invisibly assists users in their day-to-day activities. Bringing pervasive computing to maturity requires innovations in systems integration and human-computer interaction (HCI). To investigate these challenges, Stanford University's Interactive Workspaces project (http://iwork.stanford.edu/) is exploring team-based collaboration in technology-augmented environments. These workspaces are designed to allow groups of five to ten people to work together using computing and interaction devices on many scales. 相似文献