期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

Scott Rostrup Hans De Sterck 《Computer Physics Communications》2010,181(12):2164-2179

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary

Program title: SWsolverCatalogue identifier: AEGY_v1_0Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.htmlProgram obtainable from: CPC Program Library, Queen's University, Belfast, N. IrelandLicensing provisions: GPL v3No. of lines in distributed program, including test data, etc.: 59 168No. of bytes in distributed program, including test data, etc.: 453 409Distribution format: tar.gzProgramming language: C, CUDAComputer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.Operating system: LinuxHas the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.RAM: Tested on Problems requiring up to 4 GB per compute node.Classification: 12External routines: MPI, CUDA, IBM Cell SDKNature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.Additional comments: Sub-program numdiff is used for the test run. 相似文献

2.

Fast parallel solver for the levelset equations on unstructured meshes

Zhisong Fu Sergey Yakovlev Robert M. Kirby Ross T. Whitaker 《Concurrency and Computation》2015,27(7):1639-1657

相似文献

3.

LDMBL: An architecture for reducing code duplication in heavyweight binary instrumentations

下载免费PDF全文

Behnam Momeni Mehdi Kharrazi 《Software》2018,48(9):1642-1659

Emergence of instrumentation frameworks has vastly contributed to the software engineering practices. As the instrumentation use cases become more complex, complexity of instrumenting programs also increases, leading to a higher risk of software defects, increased development time, and decreased maintainability. In security applications such as symbolic execution and taint analysis, which need to instrument a large number of instruction types, this complexity is prominent. This paper presents an architecture based on the Pin binary instrumentation framework to abstract the low‐level OS and hardware‐dependent implementation details, facilitate code reuse in heavyweight instrumentation use cases, and improve instrumenting program development time. Instructions of x86 and x86‐64 hardware architectures are formally categorized using the Z language based on the Pin framework API. This categorization is used to automate the instrumentation phase on the basis of a configuration list. Furthermore, instrumentation context data such as register data are modeled in an object‐oriented scheme. This makes it possible to focus the instrumenting program development time on writing the essential analysis logics while access to low‐level OS and hardware dependencies are streamlined. The proposed architecture is evaluated by instrumenting 135 instruction types in a concrete symbolic execution engine, resulting in a reduction of the instrumenting program size by 59.7%. Furthermore, performance overhead measure against the SPEC CINT2006 programs is limited to 8.7%. 相似文献

4.

GPU implementation of a parallel two‐list algorithm for the subset‐sum problem

Lanjun Wan Kenli Li Jing Liu Keqin Li 《Concurrency and Computation》2015,27(1):119-145

The subset‐sum problem is a well‐known non‐deterministic polynomial‐time complete (NP‐complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two‐list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU–GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide‐and‐conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector‐based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation. Copyright © 2014 John Wiley & Sons, Ltd. 相似文献

5.

Performance analysis of distributed symmetric sparse matrix vector multiplication algorithm for multi‐core architectures

Dossay Oryspayev Hasan Metin Aktulga Masha Sosonkina Pieter Maris James P. Vary 《Concurrency and Computation》2015,27(17):5019-5036

Sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi‐core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We study important features of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the ‘CPU core hours’ metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology‐aware mapping heuristic using simplified network load model. We have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the ‘CPU core hours’ metric and significantly reduces data movement overheads. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

6.

CPU–GPU hybrid parallel strategy for cosmological simulations

Yueqing Wang Yong Dou Song Guo Yuanwu Lei Dan Zou 《Concurrency and Computation》2014,26(3):748-765

Gadget is a simulation application for N‐body and smoothed particle hydrodynamics problems in cosmology, and it is widely applied in solving series of cosmological problems. N‐body focuses on the motion of the interaction of N particles, and smoothed particle hydrodynamics is a fluid simulation algorithm that studies the movement of fluid through particle simulation. Most scholars focus their attention on accelerating Gadget on multi‐core CPU or graphics processing units (GPUs) platforms. However, these research activities failed to achieve CPU–GPU hybrid computing, which resulted in tremendous waste of CPU computing resources. In this paper, we propose a CPU–GPU hybrid parallel strategy to accelerate Gadget‐2, a massively parallel structure formation code for cosmological simulations. This strategy uses CPU and GPU to process the calculation of short‐range force. To ensure CPU and GPU workload balance, a dynamic task allocation scheme is proposed according to the computational performance difference between the CPU and GPU. Experimental results showed that our CPU–GPU hybrid parallel strategy achieved an overall speedup factor of 18.6 and a partial speedup factor for short‐range force calculation of 28.35 compared with a single‐core CPU implementation for particles in million‐size magnitudes. Moreover, compared with a GPU platform that contained 12 CPU cores and one GPU, our hybrid parallel strategy obtained overall speedup and partial speedup factors of 6% and 20%, respectively. Furthermore, the scalability of the hybrid strategy is very fine – its performance will be enhanced when the problem scale is increasing. However, this strategy also has its limitation that the performance enhancement will be decreasing if the ratio(the number of CPU cores divides that of the GPU cards) reduces. Finally, in our hybrid strategy, the CPU coefficient of utilization improved by 17.14% or better. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

7.

Achieving energy efficiency during collective communications

Vaibhav Sundriyal Masha Sosonkina Zhao Zhang 《Concurrency and Computation》2013,25(15):2140-2156

Energy consumption has become a major design constraint in modern computing systems. With the advent of petaflops architectures, power‐efficient software stacks have become imperative for scalability. Techniques such as dynamic voltage and frequency scaling (called DVFS) and CPU clock modulation (called throttling) are often used to reduce the power consumption of the compute nodes. To avoid significant performance losses, these techniques should be used judiciously during parallel application execution. For example, its communication phases may be good candidates to apply the DVFS and CPU throttling without incurring a considerable performance loss. They are often considered as indivisible operations although little attention is being devoted to the energy saving potential of their algorithmic steps. In this work, two important collective communication operations, all‐to‐all and allgather, are investigated as to their augmentation with energy saving strategies on the per‐call basis. The experiments prove the viability of such a fine‐grain approach. They also validate a theoretical power consumption estimate for multicore nodes proposed here. While keeping the performance loss low, the obtained energy savings were always significantly higher than those achieved when DVFS or throttling were switched on across the entire application run. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

8.

Performance evaluation of the SX‐6 vector architecture for scientific computations

Leonid Oliker Andrew Canning Jonathan Carter John Shalf David Skinner Stphane Ethier Rupak Biswas Jahed Djomehri Rob Van der Wijngaart 《Concurrency and Computation》2005,17(1):69-93

The growing gap between sustained and peak performance for scientific applications is a well‐known problem in high‐performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX‐6 vector processor, and compares it against the cache‐based IBM Power3 and Power4 superscalar architectures, across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines many low‐level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Overall results demonstrate that the SX‐6 achieves high performance on a large fraction of our application suite and often significantly outperforms the cache‐based architectures. However, certain classes of applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX‐6 effectively. Copyright © 2005 John Wiley & Sons, Ltd. 相似文献

9.

一个高可移植性的轻量级x86模拟器

曹欢寅张妍《计算机系统应用》2011,20(5):101-104,143

介绍一个可以在多种处理器体系结构上运行的轻量级x86模拟器PIT(Portable x86 Instruction Translator)."动态二进制指令翻译"足一个可以让一种机器的指令运行在另一种机器上的技术.PIT采用了可移植的动态二进制指令翻译技术,可以在多种体系结构(包括x86,PowerPC,ARM,Spa... 相似文献

10.

Parallel solution of the subset‐sum problem: an empirical study

Saniyah S. Bokhari 《Concurrency and Computation》2012,24(18):2241-2254

The subset‐sum problem is a well‐known NP‐complete combinatorial problem that is solvable in pseudo‐polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures, that is, a 128‐processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16‐processor IBM x3755 shared memory machine, and a 240‐core NVIDIA FX 5800 graphics processing unit (GPU). We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word‐level locking that is available on this architecture. For the other two machines, we present an alternating word algorithm that can implement an efficient solution. Our results show that the GPU performs well for problems whose tables fit within the device memory. Because GPUs typically have memories in the order of 10GB, such architectures are best for small problem sizes that have tables of size approximately 10¹⁰. The IBM x3755 performs very well on medium‐sized problems that fit within its 64‐GB memory but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 10¹¹ bits. The Cray XMT shows very good scaling for large problems and demonstrates sustained performance as the problem size increases. However, this machine has poor scaling for small problem sizes; it performs best for problem sizes of 10¹² bits or more. The results in this paper illustrate that the subset‐sum problem can be parallelized well on all three architectures, albeit for different ranges of problem sizes. The performance of these three machines under varying problem sizes show the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

11.

Emerging Architectures Enable to Boost Massively Parallel Data Mining Using Adaptive Sparse Grids

Alexander Heinecke Dirk Pflüger 《International journal of parallel programming》2013,41(3):357-399

Gaining knowledge out of vast datasets is a main challenge in data-driven applications nowadays. Sparse grids provide a numerical method for both classification and regression in data mining which scales only linearly in the number of data points and is thus well-suited for huge amounts of data. Due to the recursive nature of sparse grid algorithms and their classical random memory access pattern, they impose a challenge for the parallelization on modern hardware architectures such as accelerators. In this paper, we present the parallelization on several current task- and data-parallel platforms, covering multi-core CPUs with vector units, GPUs, and hybrid systems. We demonstrate that a less efficient implementation from an algorithmical point of view can be beneficial if it allows vectorization and a higher degree of parallelism instead. Furthermore, we analyze the suitability of parallel programming languages for the implementation. Considering hardware, we restrict ourselves to the x86 platform with SSE and AVX vector extensions and to NVIDIA’s Fermi architecture for GPUs. We consider both multi-core CPU and GPU architectures independently, as well as hybrid systems with up to 12 cores and 2 Fermi GPUs. With respect to parallel programming, we examine both the open standard OpenCL and Intel Array Building Blocks, a recently introduced high-level programming approach, and comment on their ease of use. As the baseline, we use the best results obtained with classically parallelized sparse grid algorithms and their OpenMP-parallelized intrinsics counterpart (SSE and AVX instructions), reporting both single and double precision measurements. The huge data sets we use are a real-life dataset stemming from astrophysics and artificial ones, all of which exhibit challenging properties. In all settings, we achieve excellent results, obtaining speedups of up to 188 × using single precision on a hybrid system. 相似文献

12.

Ternary Sparse Matrix Representation for Volumetric Mesh Subdivision and Processing on GPUs

下载免费PDF全文

J. S. Mueller‐Roemer C. Altenhofen A. Stork 《Computer Graphics Forum》2017,36(5):59-69

In this paper, we present a novel volumetric mesh representation suited for parallel computing on modern GPU architectures. The data structure is based on a compact, ternary sparse matrix storage of boundary operators. Boundary operators correspond to the first‐order top‐down relations of k‐faces to their (k ? 1)‐face facets. The compact, ternary matrix storage format is based on compressed sparse row matrices with signed indices and allows for efficient parallel computation of indirect and bottom‐up relations. This representation is then used in the implementation of several parallel volumetric mesh algorithms including Laplacian smoothing and volumetric Catmull‐Clark subdivision. We compare these algorithms with their counterparts based on OpenVolumeMesh and achieve speedups from 3× to 531×, for sufficiently large meshes, while reducing memory consumption by up to 36%. 相似文献

13.

Development of a multiplier method for dynamic response optimization problems

A. I. Chahande J. S. Arora 《Structural and Multidisciplinary Optimization》1993,6(2):69-78

The multiplier method is studied for optimum design of mechanical and structural systems subjected to dynamic loads. Certain key parameters in the algorithm are identified and extensive numerical experiments are conducted to see their effect on the performance of the method. Several mathematical programming problems, and static and dynamic response structural design problems are used to evaluate the method. Some new numerical procedures are proposed and evaluated to improve performance of the method. As a result of this study, a better understanding of the multiplier method has been achieved, and the effect of various parameters and procedures of the algorithm is better understood.Notation Number of equality constraints - COST Cost function value at the solution point - CPU Total CPU time on DN10000 - f(x) Cost function - g(x) Constraint vector of dimension m×1 - IFAIL Number of failed problems - A parameter used in the algorithm - L-BFGS Unconstrained minimization program that uses limited memory BFGS method - m Total number of constraints - n Number of design variables - v Number of degrees of freedom - NF Average number of function evaluations - NG Average number of gradient evaluations - NIT Average number of unconstrained minimizations - Parameter vector of dimension m×1 used in the definition of augmented Lagrange functional - r Penalty parameter vector of dimension m×1 - TRDDB Unconstrained minimization program that computes trust region step using the double dogleg method - u Lagrange multiplier vector of dimension m×1 - x Design variable vector of dimension n×1 - x _i Lower bound onx _i - x _ui Upper bound onx _i - (x) Augmented Lagrangian - P(x) Penalty function 相似文献

14.

Designing 2D Vector Fields of Arbitrary Topology

H.Theisel 《Computer Graphics Forum》2002,21(3):595-604

We introduce a scheme of control polygons to design topological skeletons for vector fields of arbitrary topology. Based on this we construct piecewise linear vector fields of exactly the topology specified by the control polygons. This way a controlled construction of vector fields of any topology is possible. Finally we apply this method for topology‐preserving compression of vector fields consisting of a simple topology. 相似文献

15.

Fast Fluid Simulations with Sparse Volumes on the GPU

下载免费PDF全文

Kui Wu Nghia Truong Cem Yuksel Rama Hoetzlein 《Computer Graphics Forum》2018,37(2):157-167

We introduce efficient, large scale fluid simulation on GPU hardware using the fluid‐implicit particle (FLIP) method over a sparse hierarchy of grids represented in NVIDIA^® GVDB Voxels. Our approach handles tens of millions of particles within a virtually unbounded simulation domain. We describe novel techniques for parallel sparse grid hierarchy construction and fast incremental updates on the GPU for moving particles. In addition, our FLIP technique introduces sparse, work efficient parallel data gathering from particle to voxel, and a matrix‐free GPU‐based conjugate gradient solver optimized for sparse grids. Our results show that our method can achieve up to an order of magnitude faster simulations on the GPU as compared to FLIP simulations running on the CPU. 相似文献

16.

A General Multi-step Algorithm for Voxel Traversing Along a Line

Y. K. Liu H. Y. Song B. &#;alik 《Computer Graphics Forum》2008,27(1):73-80

Traversing voxels along a three dimensional (3D) line is one of the most fundamental algorithms for voxel‐based applications. This paper presents a new 6‐connectivity integer algorithm for this task. The proposed algorithm accepts voxels having different sizes in x, y and z directions. To explain the idea of the proposed approach, a 2D algorithm is firstly considered and then extended in 3D. This algorithm is a multi‐step as up to three voxels may be added in one iteration. It accepts both integer and floating‐point input. The new algorithm was compared to other popular voxel traversing algorithms. Counting the number of arithmetic operations showed that the proposed algorithm requires the least amount of operations per traversed voxel. A comparison of spent CPU time using either integer or floating‐point arithmetic confirms that the proposed algorithm is the most efficient. This algorithm is simple, and in compact form which also makes it attractive for hardware implementation. 相似文献

17.

Exploring the performance of massively multithreaded architectures

Shahid Bokhari Joel Saltz 《Concurrency and Computation》2010,22(5):588-616

We present a new scheme for evaluating the performance of multithreaded computers and demonstrate its application to the Cray MTA‐2 and XMT supercomputers. Our scheme is based on the concept of clock cycles per element, ${\cal C}$, plotted against both problem size and the number of processors. This scheme clearly shows if an implementation has achieved its asymptotic efficiency and is more general than (but includes) the commonly used speedup metric. It permits the discovery of any imperfections in both the software as well as the hardware, and is expected to permit a unified comparison of many different parallel architectures. Measurements on a number of well‐known parallel algorithms, ranging from matrix multiply to quicksort, are presented for the MTA‐2 and XMT and highlight some interesting differences between these machines. The performance of sequence alignment using dynamic programming is evaluated on the MTA‐2, XMT, IBM x3755 and SGI Altix 350 and provides a useful comparison of the capabilities of the Cray machines with more conventional shared memory architectures. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

18.

Fiuncho: a program for any-order epistasis detection in CPU clusters

Ponte-Fernández Christian González-Domínguez Jorge Martín María J. 《The Journal of supercomputing》2022,78(13):15338-15357

Epistasis can be defined as the statistical interaction of genes during the expression of a phenotype. It is believed that it plays a fundamental role in gene expression, as individual genetic variants have reported a very small increase in disease risk in previous Genome-Wide Association Studies. The most successful approach to epistasis detection is the exhaustive method, although its exponential time complexity requires a highly parallel implementation in order to be used. This work presents Fiuncho, a program that exploits all levels of parallelism present in x86_64 CPU clusters in order to mitigate the complexity of this approach. It supports epistasis interactions of any order, and when compared with other exhaustive methods, it is on average 358, 7 and 3 times faster than MDR, MPI3SNP and BitEpi, respectively.

相似文献

19.

Gauging intelligence of unmanned vehicle systems–an IGVC perspective

Ka C. Cheok G. Edzko Smid Gerald R. Lane William G. Agnew Asif Khan 《野外机器人技术杂志》2004,21(8):405-417

This paper offers an approach for defining and gauging the intelligence of an unmanned vehicle system. The meaning of “intelligence,” as in natural intelligence and machine intelligence, varies greatly depending on the context in which the word is used. The paper describes certain basic expectations associated with being intelligent in general, and proceeds to present three possible means of gauging the intelligent behavior of an unmanned vehicle system (UVS). The first is a qualitative perception that is based on an extension of the Turin's test for perceiving an unsuspecting UVS behavior as that of a person. The second is a quantitative measure where task‐specific intelligence performance of a smart system is evaluated. Examples of task‐specific intelligence and how they are gauged are presented in the context of the Intelligent Ground Vehicle Competition. The third is the comparative scale that gauges the difficulty of challenges against the intelligent skills of humans. For example, an experiment revealed that it requires the mind of at least a four‐year‐old child to successfully navigate an autonomous navigation course. Alternative architectures for intelligent unmanned vehicle systems were also presented, including complementing machine intelligence with human supervision via telematics and information technology. © 2004 Wiley Periodicals, Inc. 相似文献

20.

Gibraltar: A Reed‐Solomon coding library for storage applications on programmable graphics processors

Matthew L. Curry Anthony Skjellum H. Lee Ward Ron Brightwell 《Concurrency and Computation》2011,23(18):2477-2495

Reed–Solomon coding is a method for generating arbitrary amounts of erasure correction information from original data via matrix–vector multiplication in finite fields. Previous work has shown that modern CPUs are not well‐matched to this type of computation, requiring applications that depend on Reed–Solomon coding at high speeds (such as high‐performance storage arrays) to use hardware implementations. This work demonstrates that high performance is possible with current cost‐effective graphics processing units across a wide range of operating conditions and describes how performance will likely evolve in similar architectures. It describes the characteristics of the graphics processing unit architecture that enable high‐speed Reed–Solomon coding. A high‐performance practical library, Gibraltar, has been prototyped that performs Reed–Solomon coding on graphics processors in a manner suitable for storage arrays, along with applications with similar data resiliency needs. This library enables variably resilient erasure correcting codes to be used in a broad range of applications. Its performance is compared with that of a widely available CPU implementation, and a rationale for its API is presented. Its practicality is demonstrated through a usage example. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献