期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Adaptable Particle-in-Cell algorithms for graphical processing units

Viktor K. Decyk Tajendra V. Singh 《Computer Physics Communications》2011,(3):641-648

We developed new parameterized Particle-in-Cell algorithms and data structures for emerging multi-core and many-core architectures. Four parameters allow tuning of this PIC code to different hardware configurations. Particles are kept ordered at each time step. The first application of these algorithms is to NVIDIA graphical processing units, where speedups of about 15–25 compared to an Intel Nehalem processor were obtained for a simple 2D electrostatic code. Electromagnetic codes are expected to get higher speedups due to their greater computational intensity. 相似文献

2.

Computing resultants on Graphics Processing Units: Towards GPU-accelerated computer algebra

Pavel Emeliyanenko 《Journal of Parallel and Distributed Computing》2013

In this article we report on our experience in computing resultants of bivariate polynomials on Graphics Processing Units (GPU). Following the outline of Collins’ modular approach [6], our algorithm starts by mapping the input polynomials to a finite field for sufficiently many primes m

m

. Next, the GPU algorithm evaluates the polynomials at a number of fixed points x∈_Z_m

x \in Z_{m}

, and computes a set of univariate resultants for each modular image. Afterwards, the resultant is reconstructed using polynomial interpolation and Chinese remaindering. The GPU returns resultant coefficients in the form of Mixed Radix (MR) digits. Finally, large integer coefficients are recovered from the MR representation on the CPU. All computations performed by the algorithm (except for, partly, Chinese remaindering) are outsourced to the graphics processor thereby minimizing the amount of work to be done on the host machine. The main theoretical contribution of this work is the modification of Collins’ modular algorithm using the methods of matrix algebra to make an efficient realization on the GPU feasible. According to the benchmarks, our algorithm outperforms a CPU-based resultant algorithm from 64-bit Maple 14 by a factor of 100. 相似文献

3.

Parallelization and optimization of electrostatic Particle-in-Cell/Monte-Carlo Coupled codes as applied to RF discharges

Hong-Yu Wang Wei Jiang 《Computer Physics Communications》2009,180(8):1305-1314

We have developed a full paralleled 2D electrostatic Particle-in-Cell/Monte-Carlo Coupled (PIC-MCC) code for capacitively coupled plasma (CCP) simulations. In this code, we distributed the grid between processors along radial direction, and Poisson equation is solved accordingly paralleled. We applied a couple of numerical accelerating technologies: paralleled fast Poisson solver, assembler pushing code, particle sorting and so on. Theoretical analysis and numerical benchmark showed that this parallel framework had good efficiency and scalability. The framework of the code and the optimization technologies and algorithms are discussed, benchmarks and simulation results are also shown. 相似文献

4.

Programming methodology and performance issues for advanced computer architectures

Jack J. Dongarra Danny C. Sorensen Kathryn Connolly Jim Patterson 《Parallel Computing》1988,8(1-3):41-58

This paper will describe some recent attempts to construct transportable numerical software for high-performance computers. Restructuring algorithms in terms of simple linear algebra modules is reviewed. This technique has proved very succesful in obtaining a high level of transportability without severe loss of performance on a wide variety of both vector and parallel computers. The use of modules to encapsulate parallelism and reduce the ratio of data movement to floating-point operations has been demonstrably effective for regular problems such as those found in dense linear algebra. In other situations it may be necessary to express explicitly parallel algorithms. We also present a programming methodology that is useful for constructing new parallel algorithms which require sophisticated synchronization at a large grain level. We describe the SCHEDULE package which provides an environment for developing and analyzing explicitly parallel programs in FORTRAN which are portable. This package now includes a preprocessor to achieve complete portability of user level code and also a graphics post processor for performance analysis and debugging. We discuss details of porting both the SCHEDULE package and user code. Examples from linear algebra, and partial differential equations are used to illustrate the utility of this approach. 相似文献

5.

A Systolic Genetic Search for reducing the execution cost of regression testing

《Applied Soft Computing》2016

相似文献

6.

Designing OP2 for GPU architectures

M.B. Giles G.R. Mudalige B. Spencer C. Bertolli I. Reguly 《Journal of Parallel and Distributed Computing》2013

OP2 is an “active” library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient code targeting contemporary GPU platforms. In this we focus on some of the software architecture design choices and low-level optimizations to maximize performance on NVIDIA’s Fermi architecture GPUs. The performance impact of these design choices is quantified on two NVIDIA GPUs (GTX560Ti, Tesla C2070) using the end-to-end performance of an industrial representative CFD application developed using the OP2 API. Results show that for each system, a number of key configuration parameters need to be set carefully in order to gain good performance. Utilizing a recently developed auto-tuning framework, we explore the effect of these parameters, their limitations and insights into optimizations for improved performance. 相似文献

7.

基于GPU的现代并行优化算法

张庆科杨波王琳朱福祥《计算机科学》2012,39(4):304-311

针对现代优化算法在处理相对复杂问题中所面临的求解时间复杂度较高的问题,引入基于GPU的并行处理解决方法。首先从宏观角度阐释了基于计算统一设备架构CUDA的并行编程模型,然后在GPU环境下给出了基于CUDA架构的5种典型现代优化算法(模拟退火算法、禁忌搜索算法、遗传算法、粒子群算法以及人工神经网络)的并行实现过程。通过对比分析在不同环境下测试的实验案例统计结果,指出基于GPU的单指令多线程并行优化策略的优势及其未来发展趋势。相似文献

8.

CUDA体系结构上的一层浅水系统的模拟

张哲《微型机与应用》2012,31(10):85-88

对于使用支持NVIDACUDA程序设计模型的GPU的二维一层浅水系统,给出了如何加速平衡性良好的有限体积模式的数值解,同时给出并实现了在单双浮点精度下使用CUDA模型利用潜在数据并行的算法。数值实验表明,CUDA体系结构的求解程序比CPU并行实现求解程序高效。相似文献

9.

GPU-based parallel algorithms for sparse nonlinear systems

V. Galiano H. Migallón V. Migallón J. Penadés 《Journal of Parallel and Distributed Computing》2012

In this work we describe some parallel algorithms for solving nonlinear systems using CUDA (Compute Unified Device Architecture) over a GPU (Graphics Processing Unit). The proposed algorithms are based on both the Fletcher–Reeves version of the nonlinear conjugate gradient method and a polynomial preconditioner type based on block two-stage methods. Several strategies of parallelization and different storage formats for sparse matrices are discussed. The reported numerical experiments analyze the behavior of these algorithms working in a fine grain parallel environment compared with a thread-based environment. 相似文献

10.

Parallel geometric algorithms on a mesh-connected computer

C. S. Jeong D. T. Lee 《Algorithmica》1990,5(1):155-177

We show that a number of geometric problems can be solved on a n × n mesh-connected computer (MCC) inO(n) time, which is optimal to within a constant factor, since a nontrivial data movement on an MCC requires (n) time. The problems studied here include multipoint location, planar point location, trapezoidal decomposition, intersection detection, intersection of two convex polygons, Voronoi diagram, the largest empty circle, the smallest enclosing circle, etc. TheO(n) algorithms for all of the above problems are based on the classical divide-and-conquer problem-solving strategy.This work was supported in part by the National Science Foundation under Grant DCR 8420814. A preliminary version was presented in the 1987 FJCC, Dallas, TX. 相似文献

11.

Parallel design for error-resilient entropy coding algorithm on GPU

Yuan Dai Yong Fang Dongjian He Bormin Huang 《Journal of Parallel and Distributed Computing》2013

The error-resilient entropy coding (EREC) algorithm is an effective method for combating error propagation at low cost in many compression methods using variable-length coding (VLC). However, the main drawback of the EREC is its high complexity. In order to overcome this disadvantage, a parallel EREC is implemented on a graphics processing unit (GPU) using the NVIDIA CUDA technology. The original EREC is a finer-grained parallel at each stage which brings additional communication overhead. To achieve high efficiency of parallel EREC, we propose partitioning the EREC (P-EREC) algorithm, which splits variable-length blocks into groups and then every group is coded using the EREC separately. Each GPU thread processes one group so as to make the EREC coarse-grained parallel. In addition, some optimization strategies are discussed in order to obtain higher performance using the GPU. In the case that the variable-length data blocks are divided into 128 groups (256 groups, resp.), experimental results show that the parallel P-EREC achieves 32×

32 \times

to 123×

123 \times

(54×

54 \times

to 350×

350 \times

, resp.) speedup over the original C code of EREC compiled with the _O₂

O_{2}

optimization option. Higher speedup can even be obtained with more groups. Compared to the EREC, the P-EREC not only achieves a good speedup performance, but it also slightly improves the resilience of the VLC bit-stream against burst or random errors. 相似文献

12.

高光谱遥感蚀变填图SCM并行算法设计与实现

程宾洋王茂芝罗耀华郭科《软件》2012,(8):144-146

由于空间和波谱分辨率的不断提高,高光谱遥感影像的海量数据特性导致高光谱遥感影像并行处理成为遥感影像处理技术的发展趋势。本文基于CUDA和GPU环境,设计并实现了高光谱遥感蚀变填图的SCM并行算法。实验结果表明,并行加速比可达到25,SCM并行算法能有效改善算法性能。相似文献

13.

Parallel algorithms for arrangements

R. Anderson P. Beanie E. Brisson 《Algorithmica》1996,15(2):104-125

We give the first efficient parallel algorithms for solving the arrangement problem. We give a deterministic algorithm for the CREW PRAM which runs in nearly optimal bounds ofO (logn log^* n) time andn ²/logn processors. We generalize this to obtain anO (logn log^* n)-time algorithm usingn ^d/logn processors for solving the problem ind dimensions. We also give a randomized algorithm for the EREW PRAM that constructs an arrangement ofn lines on-line, in which each insertion is done in optimalO (logn) time usingn/logn processors. Our algorithms develop new parallel data structures and new methods for traversing an arrangement.This work was supported by the National Science Foundation, under Grants CCR-8657562 and CCR-8858799, NSF/DARPA under Grant CCR-8907960, and Digital Equipment Corporation. A preliminary version of this paper appeared at the Second Annual ACM Symposium on Parallel Algorithms and Architectures [3]. 相似文献

14.

Efficient parallel algorithms for path problems in directed graphs

Joan M. Lucas Marian Gunsher Sackrowitz 《Algorithmica》1992,7(1):631-648

In this paper we describe a technique for finding efficient parallel algorithms for problems on directed graphs that involve checking the existence of certain kinds of paths in the graph. This technique provides efficient algorithms for finding dominators in flow graphs, performing interval and loop analysis on reducible flow graphs, and finding the feedback vertices of a digraph. Each of these algorithms takesO(log² n) time using the same number of processors needed for fast matrix multiplication. All of these bounds are for an EREW PRAM. 相似文献

15.

Cellular matrix model for parallel combinatorial optimization algorithms in Euclidean plane

《Applied Soft Computing》2017

We propose a parallel computation model, called cellular matrix model (CMM), to address large-size Euclidean graph matching problems in the plane. The parallel computation takes place by partitioning the plane into a regular grid of cells, each cell being affected to a single processor. Each processor operates on local data, starting from its cell location and extending its search to the neighborhood cells in a spiral search way. In order to deal with large-size problems, memory size and processor number are fixed as O(N), where N is the problem size. Then one key point is that closest point searching in the plane is performed in O(1) expected time for uniform or bounded distribution, for each processor independently. We define a generic loop that models the parallel projection between graphs and their matching, as executed by the many cells at a given level of computation granularity. To illustrate its efficacy and versatility, we apply the CMM, on GPU platforms, to two problems in image processing: superpixel segmentation and stereo matching energy minimization. Firstly, we propose an extended version of the well-known SLIC superpixel segmentation algorithm, which we call SPASM algorithm, by using a parallel 2D self-organizing map instead of k-means algorithm. Secondly, we investigate the idea of distributed variable neighborhood search, and propose a parallel search heuristic, called distributed local search (DLS), for global energy minimization of stereo matching problem. We evaluate the approach with regards to the state-of-the-art graph cut and belief propagation algorithms. For each problem, we argue that the parallel GPU implementation provides new competitive quality/time trade-offs, with substantial acceleration factors as the problem size increases. 相似文献

16.

Designing fast LTL model checking algorithms for many-core GPUs

Jiří BarnatAuthor Vitae Petr BauchAuthor VitaeLuboš BrimAuthor Vitae Milan Češka 《Journal of Parallel and Distributed Computing》2012

Recent technological developments made various many-core hardware platforms widely accessible. These massively parallel architectures have been used to significantly accelerate many computation demanding tasks. In this paper, we show how the algorithms for LTL model checking can be redesigned in order to accelerate LTL model checking on many-core GPU platforms. Our detailed experimental evaluation demonstrates that using the NVIDIA CUDA technology results in a significant speedup of the verification process. Together with state space generation based on shared hash-table and DFS exploration, our CUDA accelerated model checker is the fastest among state-of-the-art shared memory model checking tools. 相似文献

17.

Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU

George Stantchev William Dorland Nail Gumerov 《Journal of Parallel and Distributed Computing》2008

Particle-In-Cell (PIC) methods have been widely used for plasma physics simulations in the past three decades. To ensure an acceptable level of statistical accuracy relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular we focus on fast algorithms for the performance bottleneck operation of Particle-To-Grid interpolation. 相似文献

18.

Parallel multi-dimensional range query processing with R-trees on GPU

Jinwoong KimAuthor Vitae Sul-Gi KimAuthor VitaeBeomseok Nam 《Journal of Parallel and Distributed Computing》2013

The general purpose computing on graphics processing unit (GP-GPU) has emerged as a new cost effective parallel computing paradigm in high performance computing research that enables large amount of data to be processed in parallel. Large scale scientific data intensive applications have been playing an important role in modern high performance computing research. A common access pattern into such scientific data analysis applications is multi-dimensional range query, but not much research has been conducted on multi-dimensional range query on the GPU. Inherently multi-dimensional indexing trees such as R-Trees are not well suited for GPU environment because of its irregular tree traversal. Traversing irregular tree search path makes it hard to maximize the utilization of massively parallel architectures. In this paper, we propose a novel MPTS (Massively Parallel Three-phase Scanning) R-tree traversal algorithm for multi-dimensional range query, that converts recursive access to tree nodes into sequential access. Our extensive experimental study shows that MPTS R-tree traversal algorithm on NVIDIA Tesla M2090 GPU consistently outperforms traditional recursive R-trees search algorithm on Intel Xeon E5506 processors. 相似文献

19.

Parallel solution of American option derivatives on GPU clusters

Lilia Ziane Khodja Ming Chau Raphaël Couturier Jacques Bahi Pierre Spitéri 《Computers & Mathematics with Applications》2013,65(11):1830-1848

This paper deals with the numerical solution of financial applications, more specifically the computation of American option derivatives modeled by nonlinear boundary values problems. In such applications we have to solve large-scale algebraic systems. We concentrate on synchronous and asynchronous parallel iterative algorithms carried out on CPU and GPU networks. The properties of the operators arising in the discretized problem ensure the convergence of the parallel iterative synchronous and asynchronous algorithms. Computational experiments performed on CPU and GPU networks are presented and analyzed. 相似文献

20.

基于GPU的精确串匹配算法综述

张春燕谭建龙刘燕兵郭莉《计算机应用研究》2016,33(7)

精确串匹配是计算机领域的一个经典问题。在大数据时代,海量的数据给串匹配问题带来巨大的挑战。当前,GPU的应用得到学术界和工业界的广泛关注。近年,基于GPU的串匹配算法研究已成为学术界的焦点。为展示近年的研究,本文综述了基于GPU的精确串匹配技术,针对不同的算法和GPU架构介绍精确串匹配技术在GPU上的改进：不同算法的改进具有差异性,研究时需扩展具体算法,并比较上述算法的优缺点。最后对评测指标进行介绍,展望其发展趋势。相似文献