Found 20 similar documents (search time: 15 ms)
We developed new parameterized Particle-in-Cell (PIC) algorithms and data structures for emerging multi-core and many-core architectures. Four parameters allow this PIC code to be tuned to different hardware configurations. Particles are kept ordered at each time step. The first application of these algorithms is to NVIDIA graphics processing units, where speedups of about 15–25× over an Intel Nehalem processor were obtained for a simple 2D electrostatic code. Electromagnetic codes are expected to achieve higher speedups due to their greater computational intensity.

In this article we report on our experience in computing resultants of bivariate polynomials on Graphics Processing Units (GPUs). Following the outline of Collins’ modular approach [6], our algorithm starts by mapping the input polynomials to a finite field for sufficiently many primes $m$. Next, the GPU algorithm evaluates the polynomials at a number of fixed points $x \in \mathbb{Z}_m$ and computes a set of univariate resultants for each modular image. Afterwards, the resultant is reconstructed using polynomial interpolation and Chinese remaindering. The GPU returns resultant coefficients in the form of Mixed Radix (MR) digits. Finally, large integer coefficients are recovered from the MR representation on the CPU. All computations performed by the algorithm (except for, partly, Chinese remaindering) are outsourced to the graphics processor, thereby minimizing the amount of work to be done on the host machine. The main theoretical contribution of this work is the modification of Collins’ modular algorithm using the methods of matrix algebra to make an efficient realization on the GPU feasible. According to the benchmarks, our algorithm outperforms a CPU-based resultant algorithm from 64-bit Maple 14 by a factor of 100.
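The final reconstruction step described in the abstract above (Chinese remaindering that yields mixed-radix digits, followed by recovery of a large integer) can be illustrated with a minimal sequential sketch. It uses Garner's algorithm, a standard way to obtain mixed-radix digits from residues; the abstract does not say which variant the authors use, so this choice, the function names, and the small moduli are all assumptions made for illustration. Python 3.8+ is assumed for `pow(a, -1, m)` and `math.prod`.

```python
from math import prod

def mixed_radix_digits(residues, moduli):
    """Garner's algorithm (illustrative, not the paper's GPU code).

    Returns digits d with x = d[0] + d[1]*m0 + d[2]*m0*m1 + ...
    and 0 <= d[i] < moduli[i], given x mod moduli[i] for pairwise
    coprime moduli.
    """
    digits = []
    for i, (r, m) in enumerate(zip(residues, moduli)):
        # Evaluate the partial mixed-radix sum modulo m.
        acc, weight = 0, 1
        for j in range(i):
            acc = (acc + digits[j] * weight) % m
            weight = (weight * moduli[j]) % m
        # Solve acc + d * weight == r (mod m) for the next digit d.
        digits.append(((r - acc) * pow(weight, -1, m)) % m)
    return digits

def from_mixed_radix(digits, moduli):
    """Recover the integer from its mixed-radix digits (the host-side step)."""
    value, weight = 0, 1
    for d, m in zip(digits, moduli):
        value += d * weight
        weight *= m
    return value

def recover_signed(residues, moduli):
    """Map the reconstructed value into the symmetric range around zero."""
    value = from_mixed_radix(mixed_radix_digits(residues, moduli), moduli)
    modulus = prod(moduli)
    return value - modulus if value > modulus // 2 else value
```

Only small modular operations are needed to produce the digits; arbitrary-precision arithmetic appears solely in `from_mixed_radix`, which is consistent with leaving that step to the CPU.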

We have developed a fully parallel 2D electrostatic Particle-in-Cell/Monte-Carlo Coupled (PIC-MCC) code for capacitively coupled plasma (CCP) simulations. In this code, the grid is distributed among processors along the radial direction, and the Poisson equation is solved in parallel accordingly. We applied several numerical acceleration techniques: a parallel fast Poisson solver, a particle-pushing routine written in assembly, particle sorting, and so on. Theoretical analysis and numerical benchmarks showed that this parallel framework has good efficiency and scalability. The framework of the code and the optimization techniques and algorithms are discussed, and benchmarks and simulation results are presented.

This paper describes some recent attempts to construct transportable numerical software for high-performance computers. Restructuring algorithms in terms of simple linear algebra modules is reviewed. This technique has proved very successful in obtaining a high level of transportability without severe loss of performance on a wide variety of both vector and parallel computers. The use of modules to encapsulate parallelism and reduce the ratio of data movement to floating-point operations has been demonstrably effective for regular problems such as those found in dense linear algebra. In other situations it may be necessary to express parallel algorithms explicitly. We also present a programming methodology that is useful for constructing new parallel algorithms which require sophisticated synchronization at a large grain level. We describe the SCHEDULE package, which provides an environment for developing and analyzing explicitly parallel, portable programs in FORTRAN. This package now includes a preprocessor to achieve complete portability of user-level code and a graphics postprocessor for performance analysis and debugging. We discuss details of porting both the SCHEDULE package and user code. Examples from linear algebra and partial differential equations are used to illustrate the utility of this approach.

OP2 is an “active” library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient code targeting contemporary GPU platforms. We focus on some of the software architecture design choices and low-level optimizations to maximize performance on NVIDIA’s Fermi architecture GPUs. The performance impact of these design choices is quantified on two NVIDIA GPUs (GTX560Ti, Tesla C2070) using the end-to-end performance of an industrially representative CFD application developed using the OP2 API. Results show that for each system, a number of key configuration parameters need to be set carefully in order to gain good performance. Utilizing a recently developed auto-tuning framework, we explore the effect of these parameters, their limitations, and insights into optimizations for improved performance.

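To address the high time complexity that modern optimization algorithms face when solving relatively complex problems, a GPU-based parallel processing approach is introduced. The paper first explains, at a high level, the parallel programming model based on the Compute Unified Device Architecture (CUDA), and then presents CUDA-based parallel implementations on the GPU of five typical modern optimization algorithms (simulated annealing, tabu search, genetic algorithms, particle swarm optimization, and artificial neural networks). By comparing and analyzing the statistical results of experimental cases tested in different environments, it points out the advantages of the GPU-based single-instruction, multiple-thread parallel optimization strategy and its future development trends.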

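Zhang Zhe (张哲). 《微型机与应用》 (Microcomputer & Its Applications), 2012, 31(10): 85-88
For one-layer two-dimensional shallow water systems on GPUs that support the NVIDIA CUDA programming model, this paper shows how to accelerate the numerical solution of a well-balanced finite volume scheme, and presents and implements algorithms that exploit the potential data parallelism using the CUDA model in both single and double floating-point precision. Numerical experiments show that the CUDA-based solver is more efficient than a parallel CPU implementation of the solver.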

In this work we describe some parallel algorithms for solving nonlinear systems using CUDA (Compute Unified Device Architecture) on a GPU (Graphics Processing Unit). The proposed algorithms are based on both the Fletcher–Reeves version of the nonlinear conjugate gradient method and a polynomial-type preconditioner based on block two-stage methods. Several parallelization strategies and different storage formats for sparse matrices are discussed. The reported numerical experiments analyze the behavior of these algorithms in a fine-grained parallel environment compared with a thread-based environment.

We show that a number of geometric problems can be solved on an $n \times n$ mesh-connected computer (MCC) in $O(n)$ time, which is optimal to within a constant factor, since a nontrivial data movement on an MCC requires $\Omega(n)$ time. The problems studied here include multipoint location, planar point location, trapezoidal decomposition, intersection detection, intersection of two convex polygons, Voronoi diagram, the largest empty circle, the smallest enclosing circle, etc. The $O(n)$ algorithms for all of the above problems are based on the classical divide-and-conquer problem-solving strategy. This work was supported in part by the National Science Foundation under Grant DCR 8420814. A preliminary version was presented in the 1987 FJCC, Dallas, TX.

The error-resilient entropy coding (EREC) algorithm is an effective method for combating error propagation at low cost in many compression methods using variable-length coding (VLC). However, the main drawback of the EREC is its high complexity. In order to overcome this disadvantage, a parallel EREC is implemented on a graphics processing unit (GPU) using the NVIDIA CUDA technology. The original EREC is fine-grained parallel at each stage, which brings additional communication overhead. To achieve high efficiency of the parallel EREC, we propose the partitioning EREC (P-EREC) algorithm, which splits variable-length blocks into groups, each of which is then coded separately using the EREC. Each GPU thread processes one group so as to make the EREC coarse-grained parallel. In addition, some optimization strategies are discussed in order to obtain higher performance using the GPU. When the variable-length data blocks are divided into 128 groups (256 groups, resp.), experimental results show that the parallel P-EREC achieves 32× to 123× (54× to 350×, resp.) speedup over the original C code of EREC compiled with the O2 optimization option. Higher speedup can even be obtained with more groups. Compared to the EREC, the P-EREC not only achieves good speedup, but also slightly improves the resilience of the VLC bit-stream against burst or random errors.

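Cheng Binyang (程宾洋), Wang Maozhi (王茂芝), Luo Yaohua (罗耀华), Guo Ke (郭科). 《软件》 (Software), 2012, (8): 144-146
As spatial and spectral resolution continue to increase, the massive data volume of hyperspectral remote sensing imagery has made parallel processing a development trend in remote sensing image processing. Based on CUDA and a GPU environment, this paper designs and implements a parallel SCM algorithm for hyperspectral remote sensing alteration mapping. Experimental results show that the parallel speedup can reach 25 and that the parallel SCM algorithm effectively improves performance.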

We give the first efficient parallel algorithms for solving the arrangement problem. We give a deterministic algorithm for the CREW PRAM which runs in nearly optimal bounds of $O(\log n \log^* n)$ time and $n^2/\log n$ processors. We generalize this to obtain an $O(\log n \log^* n)$-time algorithm using $n^d/\log n$ processors for solving the problem in $d$ dimensions. We also give a randomized algorithm for the EREW PRAM that constructs an arrangement of $n$ lines on-line, in which each insertion is done in optimal $O(\log n)$ time using $n/\log n$ processors. Our algorithms develop new parallel data structures and new methods for traversing an arrangement. This work was supported by the National Science Foundation, under Grants CCR-8657562 and CCR-8858799, NSF/DARPA under Grant CCR-8907960, and Digital Equipment Corporation. A preliminary version of this paper appeared at the Second Annual ACM Symposium on Parallel Algorithms and Architectures [3].

In this paper we describe a technique for finding efficient parallel algorithms for problems on directed graphs that involve checking the existence of certain kinds of paths in the graph. This technique provides efficient algorithms for finding dominators in flow graphs, performing interval and loop analysis on reducible flow graphs, and finding the feedback vertices of a digraph. Each of these algorithms takes $O(\log^2 n)$ time using the same number of processors needed for fast matrix multiplication. All of these bounds are for an EREW PRAM.

We propose a parallel computation model, called the cellular matrix model (CMM), to address large-size Euclidean graph matching problems in the plane. The parallel computation takes place by partitioning the plane into a regular grid of cells, each cell being assigned to a single processor. Each processor operates on local data, starting from its cell location and extending its search to the neighboring cells in a spiral fashion. In order to deal with large-size problems, memory size and processor number are fixed as O(N), where N is the problem size. One key point is then that closest point searching in the plane is performed in O(1) expected time for a uniform or bounded distribution, for each processor independently. We define a generic loop that models the parallel projection between graphs and their matching, as executed by the many cells at a given level of computation granularity. To illustrate its efficacy and versatility, we apply the CMM, on GPU platforms, to two problems in image processing: superpixel segmentation and stereo matching energy minimization. Firstly, we propose an extended version of the well-known SLIC superpixel segmentation algorithm, which we call the SPASM algorithm, by using a parallel 2D self-organizing map instead of the k-means algorithm. Secondly, we investigate the idea of distributed variable neighborhood search, and propose a parallel search heuristic, called distributed local search (DLS), for global energy minimization of the stereo matching problem. We evaluate the approach against the state-of-the-art graph cut and belief propagation algorithms. For each problem, we argue that the parallel GPU implementation provides new competitive quality/time trade-offs, with substantial acceleration factors as the problem size increases.
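The closest-point search over a regular grid of cells mentioned in the abstract above can be sketched sequentially as follows. This is only an illustration of the general idea (bucket the points by cell, then visit rings of cells of growing radius around the query), not the authors' GPU implementation; the function names, the `max_radius` parameter, and the ring-by-ring visiting order are assumptions. Python 3.8+ is assumed for `math.dist`.

```python
import math
from collections import defaultdict

def build_grid(points, cell_size):
    """Bucket point indices by the grid cell that contains each point."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell].append(idx)
    return grid

def closest_point(query, points, grid, cell_size, max_radius):
    """Search rings of cells around the query's cell, nearest ring first.

    Returns (index, distance), or (None, inf) if no point lies within
    max_radius rings of the query's cell.
    """
    cx = math.floor(query[0] / cell_size)
    cy = math.floor(query[1] / cell_size)
    best_idx, best_dist = None, math.inf
    for r in range(max_radius + 1):
        # Every point in ring r is farther than (r - 1) * cell_size away,
        # so once the best distance is within that bound we can stop.
        if best_dist <= (r - 1) * cell_size:
            break
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                if max(abs(dx), abs(dy)) != r:
                    continue  # keep only the cells on the ring's border
                for idx in grid.get((cx + dx, cy + dy), ()):
                    d = math.dist(query, points[idx])
                    if d < best_dist:
                        best_idx, best_dist = idx, d
    return best_idx, best_dist
```

When points are spread roughly uniformly and the cell size is chosen so that each cell holds a bounded number of points, the search usually ends after the first few rings, which matches the O(1) expected time stated in the abstract.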

Recent technological developments have made various many-core hardware platforms widely accessible. These massively parallel architectures have been used to significantly accelerate many computationally demanding tasks. In this paper, we show how the algorithms for LTL model checking can be redesigned in order to accelerate LTL model checking on many-core GPU platforms. Our detailed experimental evaluation demonstrates that using the NVIDIA CUDA technology results in a significant speedup of the verification process. Together with state space generation based on a shared hash table and DFS exploration, our CUDA-accelerated model checker is the fastest among state-of-the-art shared-memory model checking tools.

Particle-In-Cell (PIC) methods have been widely used for plasma physics simulations in the past three decades. To ensure an acceptable level of statistical accuracy, relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular, we focus on fast algorithms for the performance bottleneck operation of Particle-To-Grid interpolation.
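As a rough illustration of the Particle-To-Grid interpolation named in the abstract above, the sketch below deposits particle charge onto a 1D periodic grid with linear (cloud-in-cell) weighting. The weighting scheme, the 1D periodic setup, and all names are assumptions chosen for brevity; the abstract does not specify the scheme the authors use, and this sequential loop is not their GPU algorithm.

```python
import math

def deposit_charge_cic(positions, charge, num_cells, dx):
    """Deposit equal-charge particles onto a periodic 1D grid.

    Each particle splits its charge linearly between the two grid
    nodes that bracket it. Returns the charge density at each node.
    """
    rho = [0.0] * num_cells
    for x in positions:
        s = x / dx                 # position in units of the grid spacing
        i = math.floor(s)          # index of the node to the left
        frac = s - i               # fractional distance past that node
        rho[i % num_cells] += charge * (1.0 - frac) / dx
        rho[(i + 1) % num_cells] += charge * frac / dx
    return rho
```

In this scatter pattern, several particles may add to the same grid node. A direct parallel version with one thread per particle therefore needs atomic updates or particles grouped by cell, which is one plausible reason this step is a bottleneck on GPUs.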

General-purpose computing on graphics processing units (GP-GPU) has emerged as a new cost-effective parallel computing paradigm in high performance computing research that enables large amounts of data to be processed in parallel. Large-scale, data-intensive scientific applications have been playing an important role in modern high performance computing research. A common access pattern in such scientific data analysis applications is the multi-dimensional range query, but not much research has been conducted on multi-dimensional range queries on the GPU. Inherently multi-dimensional indexing trees such as R-trees are not well suited to the GPU environment because of their irregular tree traversal. Traversing an irregular tree search path makes it hard to maximize the utilization of massively parallel architectures. In this paper, we propose a novel MPTS (Massively Parallel Three-phase Scanning) R-tree traversal algorithm for multi-dimensional range queries that converts recursive access to tree nodes into sequential access. Our extensive experimental study shows that the MPTS R-tree traversal algorithm on an NVIDIA Tesla M2090 GPU consistently outperforms the traditional recursive R-tree search algorithm on Intel Xeon E5506 processors.

This paper deals with the numerical solution of financial applications, more specifically the computation of American option derivatives modeled by nonlinear boundary value problems. Such applications require the solution of large-scale algebraic systems. We concentrate on synchronous and asynchronous parallel iterative algorithms carried out on CPU and GPU networks. The properties of the operators arising in the discretized problem ensure the convergence of the synchronous and asynchronous parallel iterative algorithms. Computational experiments performed on CPU and GPU networks are presented and analyzed.

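Exact string matching is a classic problem in computer science. In the era of big data, massive data volumes pose great challenges to string matching. GPUs have attracted wide attention from both academia and industry, and in recent years GPU-based string matching algorithms have become a focus of academic research. To present this recent work, this paper surveys GPU-based exact string matching techniques and describes how exact string matching has been improved on the GPU for different algorithms and GPU architectures: the improvements differ from algorithm to algorithm, so research must extend each specific algorithm, and the advantages and disadvantages of these algorithms are compared. Finally, evaluation metrics are introduced and future development trends are discussed.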
