1.
Audrey Delévacq, Pierre Delisle, Marc Gravel, Michaël Krajecki. Journal of Parallel and Distributed Computing, 2013
The purpose of this paper is to propose effective parallelization strategies for the Ant Colony Optimization (ACO) metaheuristic on Graphics Processing Units (GPUs). The Max–Min Ant System (MMAS) algorithm augmented with 3-opt local search is used as a framework for implementing the parallel ants and multiple ant colonies general parallelization approaches. The four resulting GPU algorithms are extensively evaluated and compared on both speedup and solution quality on a state-of-the-art Fermi GPU architecture. A rigorous effort is made to keep the parallel algorithms true to the original MMAS applied to the Traveling Salesman Problem. We report speedups of up to 23.60× with solution quality similar to the original sequential implementation. With the intent of providing a parallelization framework for ACO on GPUs, a comparative experimental study highlights the performance impact of ACO parameters, GPU technical configuration, memory structures and parallelization granularity.
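The "parallel ants" strategy in this entry maps naturally onto CUDA: each thread constructs one tour independently. Below is a minimal sketch of that idea under assumed simplifications (dense pheromone and heuristic matrices, a compile-time city bound, plain roulette-wheel selection); the names and launch shape are illustrative, not the paper's MMAS implementation.

```cuda
// Minimal sketch of the "parallel ants" approach: one thread builds one tour.
// MAX_CITIES, buildTours, and the launch shape are illustrative assumptions.
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define MAX_CITIES 256  // assumed compile-time bound on instance size (n <= MAX_CITIES)

__global__ void buildTours(const float* pher,   // pheromone matrix, n x n
                           const float* heur,   // heuristic 1/d(i,j), n x n
                           int* tours, int n, int numAnts,
                           float beta, unsigned seed)
{
    int ant = blockIdx.x * blockDim.x + threadIdx.x;
    if (ant >= numAnts) return;
    curandState rng;
    curand_init(seed, ant, 0, &rng);

    bool visited[MAX_CITIES] = {false};
    int cur = ant % n;                  // spread starting cities over the ants
    visited[cur] = true;
    tours[ant * n] = cur;

    for (int step = 1; step < n; ++step) {
        // Roulette-wheel selection over unvisited cities, weighted by
        // pheromone * heuristic^beta, as in classical ACO tour construction.
        float total = 0.0f;
        for (int j = 0; j < n; ++j)
            if (!visited[j])
                total += pher[cur * n + j] * powf(heur[cur * n + j], beta);
        float r = curand_uniform(&rng) * total;
        int next = -1;
        for (int j = 0; j < n; ++j) {
            if (visited[j]) continue;
            r -= pher[cur * n + j] * powf(heur[cur * n + j], beta);
            next = j;                   // always lands on some unvisited city
            if (r <= 0.0f) break;
        }
        visited[next] = true;
        tours[ant * n + step] = next;
        cur = next;
    }
}
// Launch sketch: buildTours<<<(numAnts + 127) / 128, 128>>>(...);
// pheromone evaporation/deposit and 3-opt would follow in separate kernels.
```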
2.
Juan Gómez-Luna, José María González-Linares, José Ignacio Benavides, Nicolás Guil. Journal of Parallel and Distributed Computing, 2012
Graphics Processing Units (GPUs) have rapidly established themselves as general-purpose coprocessors in high performance computing applications since the launch of the Compute Unified Device Architecture (CUDA). However, they suffer an inherent performance bottleneck: communication between the two separate address spaces (the main memory of the CPU and the memory of the GPU) is unavoidable. The CUDA Application Programming Interface (API) provides asynchronous transfers and streams, which permit a staged execution, as a way to overlap communication and computation. Nevertheless, there is no precise way to estimate the improvement achievable through overlapping, nor a rule to determine the optimal number of stages or streams into which the computation should be divided. In this work, we present a methodology for modeling the performance of asynchronous data transfers with CUDA streams on different GPU architectures. We illustrate this methodology by deriving performance expressions for two consumer graphics architectures from recent generations. These models let programmers estimate the optimal number of streams into which the computation on the GPU should be broken up in order to obtain the highest performance improvements. Finally, we have validated our performance models, with successful results, on three applications based on codes from the CUDA Software Development Kit (SDK).
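The staged execution this entry models is easiest to see in code. Here is a minimal, self-contained sketch with a placeholder element-wise kernel and four streams; the chunk count plays the role of the number of stages the paper's models would choose.

```cuda
// Sketch of staged execution with CUDA streams: the input is split into
// nStreams chunks, and each chunk's H2D copy, kernel, and D2H copy are issued
// in its own stream so transfers overlap with computation on other chunks.
#include <cuda_runtime.h>

__global__ void kernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // placeholder computation
}

int main() {
    const int N = 1 << 22, nStreams = 4, chunk = N / nStreams;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned host memory is required
    cudaMalloc(&d, N * sizeof(float));      // for truly asynchronous transfers
    for (int i = 0; i < N; ++i) h[i] = (float)i;

    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < nStreams; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        kernel<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```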
3.
We have designed Particle-in-Cell algorithms for emerging architectures. These algorithms share a common approach based on fine-grained tiles, but differ in implementation depending on the architecture. On the GPU, there were two different implementations, one with atomic operations and one with no data collisions, using CUDA C and Fortran. Speedups of up to about 50× compared to a single core of the Intel i7 processor have been achieved. There was also an implementation for traditional multi-core processors using OpenMP, which achieved high parallel efficiency. We believe that this approach should also work for other emerging designs such as the Intel Xeon Phi coprocessor from the Intel MIC architecture.
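For the atomic variant mentioned here, the core operation is the particle-to-grid scatter. A minimal 1D sketch with linear (cloud-in-cell) weighting follows; the names and the periodic boundary are assumptions, not the paper's code.

```cuda
// Sketch of atomic charge deposition (particle-to-grid scatter) with linear
// cloud-in-cell weighting in 1D. Concurrent writes to shared grid points are
// serialized by atomicAdd; names are illustrative.
#include <cuda_runtime.h>

__global__ void depositCharge(const float* x,  // particle positions in [0, nx)
                              float* rho,      // charge density grid, nx cells
                              int np, int nx, float q)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= np) return;
    int   i = (int)x[p];        // left grid point of the particle's cell
    float w = x[p] - (float)i;  // fractional distance to that point
    atomicAdd(&rho[i],            q * (1.0f - w));  // collision-safe scatter
    atomicAdd(&rho[(i + 1) % nx], q * w);           // assumed periodic boundary
}
// Launch sketch: depositCharge<<<(np + 255) / 256, 256>>>(d_x, d_rho, np, nx, q);
```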
4.
The Graphics Processing Unit (GPU) is a powerful tool for parallel computing. In recent years the performance and capabilities of GPUs have increased, and the Compute Unified Device Architecture (CUDA), a parallel computing architecture, has been developed by NVIDIA to harness this performance for general purpose computations. Here we show for the first time a possible application of GPUs to environmental studies serving as a basis for decision-making strategies. A stochastic Lagrangian particle model has been developed on CUDA to estimate the transport and transformation of radionuclides from a single point source during an accidental release. Our results show that the parallel implementation achieves typical speedups on the order of 80-120× compared to a single-threaded CPU implementation on a 2.33 GHz desktop computer. Only very small differences have been found between the results of the GPU and CPU simulations, comparable to the effect of stochastic transport phenomena in the atmosphere. The high speedup, with no additional cost to maintain this parallel architecture, could lead to wide usage of GPUs for diverse environmental applications in the near future.
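A stochastic Lagrangian particle step is embarrassingly parallel: each thread advances one particle by mean advection plus a Gaussian random displacement. The sketch below assumes a constant mean wind and diffusion coefficient and illustrative names; a real dispersion model would interpolate meteorological fields at each particle's position, and the RNG states are assumed to be initialized once by a setup kernel calling curand_init.

```cuda
// Sketch of one time step of a stochastic Lagrangian transport model:
// advection by the mean wind plus a Gaussian random step for turbulent
// diffusion. Field names and constants are assumptions.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void stepParticles(float* x, float* y, float* z,
                              curandState* rng,              // pre-initialized
                              float ux, float uy, float uz,  // mean wind
                              float K,                       // diffusion coeff.
                              float dt, int np)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= np) return;
    curandState s = rng[p];           // per-particle RNG state
    float c = sqrtf(2.0f * K * dt);   // std. dev. of the random displacement
    x[p] += ux * dt + c * curand_normal(&s);
    y[p] += uy * dt + c * curand_normal(&s);
    z[p] += uz * dt + c * curand_normal(&s);
    if (z[p] < 0.0f) z[p] = -z[p];    // reflect particles at the ground
    rng[p] = s;
}
```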
5.
6.
7.
We present efficient parallel algorithms for using a pyramid computer to determine convexity properties of digitized black/white pictures and labeled figures. Algorithms are presented for deciding convexity, identifying extreme points of convex hulls, and using extreme points in a variety of fashions. For a pyramid computer with a base of n simple processing elements arranged in an n^(1/2) × n^(1/2) square, the running times of the algorithms range from Θ(log n) to find the extreme points of a convex figure in a digitized picture, to Θ(n^(1/6)) to find the diameter of a labeled figure, Θ(n^(1/4) log n) to find the extreme points of every figure in a digitized picture, and Θ(n^(1/2)) to find the extreme points of every labeled set of processing elements. Our results show that the pyramid computer can be used to obtain efficient solutions to nontrivial problems in image analysis. We also show the sensitivity of efficient pyramid-computer algorithms to the rate at which essential data can be compressed. Finally, we show that a wide variety of techniques are needed to make full and efficient use of the pyramid architecture. This research was partially supported by National Science Foundation Grants MCS-83-01019, DCR-8507851, DCR-8608640, IRI-8800514 and an Incentives for Excellence Award from the Digital Equipment Corporation.
8.
A parallel implementation via CUDA of the dynamic programming method for the knapsack problem on an NVIDIA GPU is presented. A GTX 260 card with 192 cores (1.4 GHz) is used for the computational tests, and the processing times of the parallel code are compared with those of the sequential code on a CPU with an Intel Xeon 3.0 GHz processor. The results show a speedup factor of 26 for large problem instances. Furthermore, in order to limit communication between the CPU and the GPU, a compression technique is presented which significantly decreases memory occupancy.
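The data-parallel structure being exploited is that, for a fixed item, every capacity entry of the new dynamic programming row depends only on the previous row, so one thread can compute one entry. A minimal sketch of that recurrence (names assumed; the paper's compression technique is omitted):

```cuda
// Sketch of the data-parallel 0/1 knapsack recurrence: for item k,
// next[c] = max(prev[c], prev[c - w_k] + v_k). Entries of the new row are
// independent, so each thread computes one capacity value c.
#include <cuda_runtime.h>

__global__ void dpRow(const int* prev, int* next,
                      int weight, int value, int capacity)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c > capacity) return;
    int best = prev[c];                  // item k not taken
    if (c >= weight)                     // item k taken, if it fits
        best = max(best, prev[c - weight] + value);
    next[c] = best;
}

// Host loop (sketch): one launch per item, then swap the row buffers.
// for (int k = 0; k < nItems; ++k) {
//     dpRow<<<(capacity + 256) / 256, 256>>>(d_prev, d_next, w[k], v[k], capacity);
//     std::swap(d_prev, d_next);
// }
```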
9.
We present a new technique for simulating retinal image formation by tracing a large number of rays from objects in three dimensions as they pass through the optical apparatus of the eye to the retina. Simulating human optics is useful for understanding basic questions of vision science and for studying vision defects and their corrections. Because of the complexity of computing such simulations accurately, most previous efforts used simplified analytical models of the normal eye. This makes them less effective at modeling vision disorders associated with abnormal shapes of the ocular structures, which are hard to represent precisely with analytical surfaces. We have developed a computer simulator that can handle ocular structures of arbitrary shape, for instance represented by polygon meshes. Topographic and geometric measurements of the cornea, lens, and retina from keratometer or medical imaging data can be integrated for individualized examination. We use parallel processing on modern Graphics Processing Units (GPUs) to efficiently compute retinal images by tracing millions of rays; a stable retinal image can be generated within minutes. We simulated depth of field, accommodation, and chromatic aberrations, as well as astigmatism and its correction. We also show an application of the technique to patient-specific vision correction by incorporating geometric models of the orbit reconstructed from clinical medical images.
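At each ocular surface, such a ray tracer must bend every ray according to Snell's law. The device-function sketch below shows that per-surface operation in vector form; the helper names are ours and this is only an illustration of the principle, not the paper's code.

```cuda
// Sketch of per-surface refraction via the vector form of Snell's law,
// the operation applied to each ray at each ocular interface.
#include <cuda_runtime.h>

__device__ float3 add3(float3 a, float3 b) { return make_float3(a.x + b.x, a.y + b.y, a.z + b.z); }
__device__ float3 mul3(float3 a, float s)  { return make_float3(a.x * s, a.y * s, a.z * s); }
__device__ float  dot3(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// d: unit incident direction; nrm: unit surface normal facing the incoming
// ray; eta = n1 / n2, the ratio of refractive indices across the interface.
// Returns false on total internal reflection.
__device__ bool refractRay(float3 d, float3 nrm, float eta, float3* out)
{
    float c = -dot3(d, nrm);                      // cos(theta_i), c > 0
    float k = 1.0f - eta * eta * (1.0f - c * c);  // 1 - sin^2(theta_t)
    if (k < 0.0f) return false;                   // total internal reflection
    // t = eta * d + (eta * cos(theta_i) - cos(theta_t)) * n
    *out = add3(mul3(d, eta), mul3(nrm, eta * c - sqrtf(k)));
    return true;
}
```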
10.
Particle-In-Cell (PIC) methods have been widely used for plasma physics simulations over the past three decades. To ensure an acceptable level of statistical accuracy, relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed-memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular, we focus on fast algorithms for the performance bottleneck operation of Particle-To-Grid interpolation.
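One common way to make Particle-To-Grid interpolation fast without atomic contention is to sort particles by cell and let each thread own one grid point, gathering only from the cells that can contribute to it. The 1D linear-weighting sketch below illustrates that collision-free pattern; the pre-sorting step and all names are assumptions, not necessarily the paper's algorithm.

```cuda
// Sketch of collision-free Particle-To-Grid interpolation: particles are
// pre-sorted by cell (cellStart[i]..cellStart[i+1] indexes cell i's
// particles), and one thread owns one grid point, so no atomics are needed.
#include <cuda_runtime.h>

__global__ void gatherCharge(const float* x,        // positions, sorted by cell
                             const int* cellStart,  // nx+1 prefix offsets
                             float* rho, int nx, float q)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid point i
    if (i >= nx) return;
    float sum = 0.0f;
    // With linear weighting, grid point i only receives contributions from
    // particles located in cells i-1 and i.
    for (int cell = max(i - 1, 0); cell <= min(i, nx - 1); ++cell) {
        for (int p = cellStart[cell]; p < cellStart[cell + 1]; ++p) {
            float w = 1.0f - fabsf(x[p] - (float)i);  // linear hat function
            if (w > 0.0f) sum += q * w;
        }
    }
    rho[i] = sum;   // exclusive write: no data collisions
}
```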
11.
Lilia Ziane Khodja, Ming Chau, Raphaël Couturier, Jacques Bahi, Pierre Spitéri. Computers & Mathematics with Applications, 2013, 65(11): 1830-1848
This paper deals with the numerical solution of financial applications, more specifically the computation of American option derivatives modeled by nonlinear boundary value problems. In such applications we have to solve large-scale algebraic systems. We concentrate on synchronous and asynchronous parallel iterative algorithms carried out on CPU and GPU networks. The properties of the operators arising in the discretized problem ensure the convergence of both the synchronous and the asynchronous parallel iterative algorithms. Computational experiments performed on CPU and GPU networks are presented and analyzed.
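Discretized American option problems of this kind lead to an obstacle constraint on top of a linear system. As a rough illustration of the per-iteration work each processor performs, here is a sketch of one synchronous projected-Jacobi sweep for a 1D tridiagonal system with constant stencil coefficients; all names and coefficients are assumptions, and the asynchronous variants studied in the paper would let sub-domains iterate without waiting for their neighbors' latest values.

```cuda
// Sketch of one synchronous projected-Jacobi sweep for an obstacle problem
// (solve A u = b subject to u >= payoff), with a 1D tridiagonal A given by
// constant stencil coefficients a_low, a_diag, a_up. Names are assumptions.
#include <cuda_runtime.h>

__global__ void projectedJacobi(const float* u, float* uNew,
                                const float* b, const float* payoff,
                                float a_low, float a_diag, float a_up, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;   // boundary values copied on the host
    float r = b[i] - a_low * u[i - 1] - a_up * u[i + 1];
    float v = r / a_diag;               // unconstrained Jacobi update
    uNew[i] = fmaxf(v, payoff[i]);      // project onto the obstacle u >= payoff
}
// Host loop (sketch): launch repeatedly, swapping u/uNew, until the update
// norm falls below a tolerance.
```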
12.
Sergio Orts, Jose Garcia-Rodriguez, Diego Viejo, Miguel Cazorla, Vicente Morell. Journal of Parallel and Distributed Computing, 2012
Self-organising neural models have the ability to provide a good representation of the input space. In particular, the Growing Neural Gas (GNG) is a suitable model because of its flexibility, rapid adaptation and excellent quality of representation. However, this type of learning is time-consuming, especially for high-dimensional input data. Since real applications often work under time constraints, it is necessary to adapt the learning process so that it completes within a predefined time. This paper proposes a Graphics Processing Unit (GPU) parallel implementation of the GNG with the Compute Unified Device Architecture (CUDA). In contrast to existing algorithms, the proposed GPU implementation accelerates the learning process while keeping a good quality of representation. Comparative experiments using iterative, parallel and hybrid implementations are carried out to demonstrate the effectiveness of the CUDA implementation. The results show that GNG learning with the proposed implementation achieves a speed-up of 6× compared with the single-threaded CPU implementation. The GPU implementation has also been applied to a real application with time constraints, accelerating 3D scene reconstruction for egomotion, in order to validate the proposal.
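The step that dominates GNG learning is finding the winning neuron for each input sample, which parallelizes as a distance computation plus a min-reduction. A sketch of that inner loop under assumed names (block size a power of two; the host finishes the reduction across blocks, and the second-best unit GNG also needs would take a second pass or a pair-wise reduction):

```cuda
// Sketch of the GNG winner search: each thread computes one squared distance,
// then a shared-memory min-reduction finds the per-block winner.
#include <cuda_runtime.h>
#include <cfloat>

__global__ void nearestNeuron(const float* w,      // neuron weights, m x dim
                              const float* sample, // input vector, dim
                              float* blockMin, int* blockArg,
                              int m, int dim)
{
    extern __shared__ float sdist[];            // blockDim.x floats...
    int* sidx = (int*)&sdist[blockDim.x];       // ...followed by blockDim.x ints
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    float d = FLT_MAX;
    if (j < m) {
        d = 0.0f;
        for (int k = 0; k < dim; ++k) {
            float diff = w[j * dim + k] - sample[k];
            d += diff * diff;
        }
    }
    sdist[threadIdx.x] = d;
    sidx[threadIdx.x] = j;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // min-reduction
        if (threadIdx.x < s && sdist[threadIdx.x + s] < sdist[threadIdx.x]) {
            sdist[threadIdx.x] = sdist[threadIdx.x + s];
            sidx[threadIdx.x] = sidx[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {               // host reduces the per-block results
        blockMin[blockIdx.x] = sdist[0];
        blockArg[blockIdx.x] = sidx[0];
    }
}
// Launch with blockDim.x * (sizeof(float) + sizeof(int)) bytes of shared memory.
```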
13.
A new parallel computing environment, PVMonWin32, is introduced. Its communication layer, application programming interface, and usage and configuration are discussed, and results from two experiments over high-speed Ethernet are given.
14.
We study how to implement local search efficiently on data-parallel accelerators such as Graphics Processing Units. The Distance-constrained Capacitated Vehicle Routing Problem, a computationally very hard discrete optimization problem with high industrial relevance, is the selected vehicle for our investigations. More precisely, we investigate local search with the Best Improving strategy for the 2-opt and 3-opt operators on a giant-tour representation. Resource extension functions are used for constant-time move evaluation. Using CUDA, a basic implementation called The Benchmark Version has been developed and deployed on a Fermi-architecture Graphics Processing Unit. Both neighborhood setup and evaluation are performed entirely on the device. The Benchmark Version is the initial step of an incremental improvement process in which a number of important implementation aspects have been systematically studied. Ten well-known test instances from the literature are used in computational experiments, and profiling tools are used to identify bottlenecks. In the final version, the device is fully saturated, given a large enough problem instance, and a speedup of almost an order of magnitude relative to The Benchmark Version is observed. We conclude that, with some effort, local search may be implemented very efficiently on Graphics Processing Units. Our experiments show, however, that maximum efficiency requires a neighborhood cardinality of at least one million. Full exploration of a billion neighbors takes a few seconds and may be deemed too expensive with current technology. Reducing neighborhoods through filtering is an obvious remedy. Experiments on simple models of neighborhood filtering indicate, however, that the speedup effect is limited on data-parallel accelerators. We believe these insights are valuable in the design of new metaheuristics that fully utilize modern, heterogeneous processors.
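The massively parallel part of Best Improving 2-opt is scoring every candidate move; each thread can evaluate one edge pair independently against a precomputed distance matrix, after which a reduction picks the best improving move. A sketch of that evaluation kernel under assumed names (the delta buffer is pre-zeroed, the tour's closing edge is ignored for brevity, and this is not the paper's resource-extension-function implementation):

```cuda
// Sketch of data-parallel 2-opt move evaluation on a distance matrix:
// the move cuts edges (tour[i],tour[i+1]) and (tour[j],tour[j+1]) and
// reconnects as (tour[i],tour[j]) and (tour[i+1],tour[j+1]).
#include <cuda_runtime.h>

__global__ void eval2opt(const int* tour, const float* dist,
                         float* delta,   // n*n buffer, pre-zeroed
                         int n)
{
    int i = blockIdx.y;                             // first cut edge (i, i+1)
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // second cut edge (j, j+1)
    if (j <= i + 1 || j >= n - 1) return;           // need non-adjacent edges

    int a = tour[i], b = tour[i + 1];
    int c = tour[j], d = tour[j + 1];
    float gain = dist[a * n + b] + dist[c * n + d]   // removed edges
               - dist[a * n + c] - dist[b * n + d];  // added edges
    delta[i * n + j] = gain;    // positive gain == improving move
}
// Launch sketch: dim3 grid((n + 255) / 256, n); eval2opt<<<grid, 256>>>(...);
// a max-reduction over delta then selects the Best Improving move.
```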
15.
A performance study of general-purpose applications on graphics processors using CUDA
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Kevin Skadron. Journal of Parallel and Distributed Computing, 2008
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidth. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with the multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and more readily support a larger body of applications.
16.
17.
The use of atomistic methods, such as the Continuous Cellular Automaton (CCA), is currently regarded as a computationally efficient and experimentally accurate approach for simulating the anisotropic etching of various substrates in the manufacture of Micro-electro-mechanical Systems (MEMS). However, when the features of the chemical process are modified, a time-consuming calibration process is needed to transform the new macroscopic etch rates into a corresponding set of atomistic rates. Furthermore, changing the substrate requires a labor-intensive effort to reclassify most atomistic neighborhoods. In this context, the Level Set (LS) method provides an alternative approach in which the macroscopic forces affecting the front evolution are applied directly at the discrete level, thus avoiding the need for reclassification and/or calibration. Correspondingly, we present a fully operational Sparse Field Method (SFM) implementation of the LS approach, discussing the algorithm in detail and providing a thorough characterization of the computational cost and simulation accuracy, including a comparison with the most recent CCA model. We conclude that the SFM implementation achieves accuracy similar to the CCA method with fewer fluctuations in the etch front, while requiring roughly 4 times less memory. Although SFM can be up to 2 times slower than CCA when simulating anisotropic etchants, it can also be up to 10 times faster than CCA for isotropic etchants. In addition, we present a parallel, GPU-based implementation (gSFM) and compare it to an optimized, multicore CPU version (cSFM), demonstrating that the SFM algorithm can be successfully parallelized and the simulation times consequently reduced while preserving simulation accuracy. Although modern multicore CPUs provide an acceptable option, the massively parallel architecture of modern GPUs is more suitable, with gSFM running up to 7.4 times faster than cSFM.
18.
We develop a novel approach for computing the circle Hough transform entirely on graphics hardware (GPU). A primary role is assigned to the vertex processors and the rasterizer, overshadowing the traditional foreground of pixel processors and enhancing parallel processing. Resources like the vertex cache and blending units are studied too, with our set of optimizations leading to extraordinary peak gain factors exceeding 358× over a typical CPU execution. Software optimizations, like the use of precomputed tables or gradient information, and hardware improvements, like hyperthreading and multicore processors, are explored on CPUs as well. Overall, the GPU exhibits better scalability and much greater parallel performance, making it a solid alternative for computing the classical circle Hough transform versus optimized methods run on emerging multicore architectures.
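The voting idea underlying the circle Hough transform is compact enough to sketch. Note that this paper maps the work onto the vertex processors and rasterizer of the graphics pipeline; the sketch below is instead a plain CUDA compute formulation of the same voting scheme, for a fixed radius and with illustrative names.

```cuda
// Sketch of circle Hough voting for a fixed radius r: each thread takes one
// detected edge pixel and votes for every candidate center on the circle of
// radius r around it, accumulating votes with atomicAdd.
#include <cuda_runtime.h>
#include <math_constants.h>

__global__ void houghVote(const int* edgeX, const int* edgeY, int nEdges,
                          unsigned int* acc,   // accumulator, w x h, zeroed
                          int w, int h, float r, int nAngles)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nEdges) return;
    for (int k = 0; k < nAngles; ++k) {
        float t = 2.0f * CUDART_PI_F * k / nAngles;
        int cx = edgeX[e] + (int)lrintf(r * cosf(t));
        int cy = edgeY[e] + (int)lrintf(r * sinf(t));
        if (cx >= 0 && cx < w && cy >= 0 && cy < h)
            atomicAdd(&acc[cy * w + cx], 1u);   // one vote per candidate center
    }
}
// Peaks in acc after the launch correspond to detected circle centers; using
// gradient information, as the entry mentions, would shrink the voting arc.
```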