首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz’s, Cimmino’s, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are preformed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method.  相似文献   

2.
Until now, most results reported for parallelism in production systems (rulebased systems) have been simulation results-very few real parallel implementations exist. In this paper, we present results from our parallel implementation of OPS5 on the Encore multiprocessor. The implementation exploits very finegrained parallelism to achieve significant speed-ups. For one of the applications, we achieve 12.4 fold speed-up using 13 processes. Our implementation is also distinct from other parallel implementations in that we parallelize a highly optimized C-based implementation of OPS5. Running on a uniprocessor, our C-based implementation is 10–20 times faster than the standard lisp implementation distributed by Carnegie Mellon University. In addition to presenting the performance numbers, the paper discusses the details of the parallel implementation-the data structures used, the amount of contention observed for shared data structures, and the techniques used to reduce such contention.  相似文献   

3.
In this paper we present two algorithms for the parallel solution of first-order linear recurrences, We show that the algorithms can be used to efficiently solve both scalar and blocked versions of the problem on vector and SIMD architectures. The first algorithm is a parallel approach whose resulting code can be explicitly vectorized, making it suitable for efficient execution on vector architectures such as the Cray 2. The second algorithm is a modified recursive approach designed to reduce the communication overhead encountered in SIMD architectures such as the Connection Machine 2 (CM-2). We present the performance exhibited by the parallel algorithm implementations on the Cray 2 and CM-2 for both scalar and blocked versions of the recurrence problem.  相似文献   

4.
5.
Numerous applications such as stock market or medical information systems require that both historical and current data be logically integrated into a temporal database. The underlying access method must support different forms of “time-travel” queries, the migration of old record versions onto inexpensive archive media, and high insertion and update rates. This paper presents an access method for transaction-time temporal data, called the log-structured history data access method (LHAM) that meets these demands. The basic principle of LHAM is to partition the data into successive components based on the timestamps of the record versions. Components are assigned to different levels of a storage hierarchy, and incoming data is continuously migrated through the hierarchy. The paper discusses the LHAM concepts, including concurrency control and recovery, our full-fledged LHAM implementation, and experimental performance results based on this implementation. A detailed comparison with the TSB-tree, both analytically and based on experiments with real implementations, shows that LHAM is highly superior in terms of insert performance, while query performance is in almost all cases at least as good as for the TSB-tree; in many cases it is much better. Received March 4, 1999 / Accepted September 28, 1999  相似文献   

6.
NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide ease of use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number of threads per block configuration. The results have revealed relevant performance improvement solely by changing the number of threads per block, showing performance improvements from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. The performance results have shown up to 267% improvements over the best benchmarks versions available. We also observe the best and worst design choices, concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior.  相似文献   

7.
Three‐dimensional curve skeletons are a very compact representation of three‐dimensional objects with many uses and applications in fields such as computer graphics, computer vision, and medical imaging. An important problem is that the calculation of the skeleton is a very time‐consuming process. Thinning is a widely used technique for calculating the curve skeleton because of the properties it ensures and the ease of implementation. In this paper, we present parallel versions of a thinning algorithm for efficient implementation in both graphics processing units and multicore CPUs. The parallel programming models used in our implementations are Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). The speedup achieved with the optimized parallel algorithms for the graphics processing unit achieves 106.24x against the CPU single‐process version and more than 19x over the CPU multithreaded version. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

8.
Parallel finite-element computation of 3D flows   总被引:5,自引:0,他引:5  
The authors describe their work on the massively parallel finite-element computation of compressible and incompressible flows with the CM-200 and CM-5 Connection Machines. Their computations are based on implicit methods, and their parallel implementations are based on the assumption that the mesh is unstructured. Computations for flow problems involving moving boundaries and interfaces are achieved by using the deformable-spatial-domain/stabilized-space-time method. Using special mesh update schemes, the frequency of remeshing is minimized to reduce the projection errors involved and also to make parallelizing the computations easier. This method and its implementation on massively parallel supercomputers provide a capability for solving a large class of practical problems involving free surfaces, two-liquid interfaces, and fluid-structure interactions  相似文献   

9.
Particle swarm optimization (PSO), like other population-based meta-heuristics, is intrinsically parallel and can be effectively implemented on Graphics Processing Units (GPUs), which are, in fact, massively parallel processing architectures. In this paper we discuss possible approaches to parallelizing PSO on graphics hardware within the Compute Unified Device Architecture (CUDA™), a GPU programming environment by nVIDIA™ which supports the company’s latest cards. In particular, two different ways of exploiting GPU parallelism are explored and evaluated. The execution speed of the two parallel algorithms is compared, on functions which are typically used as benchmarks for PSO, with a standard sequential implementation of PSO (SPSO), as well as with recently published results of other parallel implementations. An in-depth study of the computation efficiency of our parallel algorithms is carried out by assessing speed-up and scale-up with respect to SPSO. Also reported are some results about the optimization effectiveness of the parallel implementations with respect to SPSO, in cases when the parallel versions introduce some possibly significant difference with respect to the sequential version.  相似文献   

10.

Parallel implementations of swarm intelligence algorithms such as the ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that later on will be converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high performance code with similar execution times when compared to low-level implementations.

  相似文献   

11.
An efficient parallel implementation of a nonparaxial beam propagation method for the numerical study of the nonlinear Helmholtz equation is presented. Our solution focuses on minimizing communication and computational demands of the method which are dependent on a nonparaxiality parameter. Performance tests carried out on different types of parallel systems behave according theoretical predictions and show that our proposal exhibits a better behavior than those solutions based on the use of conventional parallel fast Fourier transform implementations. The application of our design is illustrated in a particularly demanding scenario: the study of dark solitons at interfaces separating two defocusing Kerr media, where it is shown to play a key role.  相似文献   

12.
数字脊波变换的实现与一种改进方法   总被引:5,自引:1,他引:4  
脊波变换作为一种新的连续空间中函数的多尺度表示方法,其离散变换形式仍然有许多问题有待解决.目前大多将离散脊波变换形式看做Radon变换与小波变换的复合变换形式,进而对其分步进行处理.利用计算机图形学中的Bresenham算法思想,使得在实现Radon变换的过程中提高了变换的效率.与先前的最近邻方法相比,快速准确,并可完全重构.数值实验显示,与Zp^2方法实现的脊波变换相比较,利用此方法生成的图像重构、压缩、去噪效果都有显著提高,为进一步的研究工作奠定了基础.  相似文献   

13.
In this paper we study the impact of the simultaneous exploitation of data‐ and task‐parallelism, so called mixed‐parallelism, on the Strassen and Winograd matrix multiplication algorithms. This work takes place in the context of Grid computing and, in particular, in the Client–Agent(s)–Server(s) model, where data can already be distributed on the platform. For each of those algorithms, we propose two mixed‐parallel implementations. The former follows the phases of the original algorithms while the latter has been designed as the result of a list scheduling algorithm. We give a theoretical comparison, in terms of memory usage and execution time, between our algorithms and classical data‐parallel implementations. This analysis is corroborated by experiments. Finally, we give some hints about heterogeneous and recursive versions of our algorithms. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

14.
The paper deals with the parallelization of Delaunay triangulation algorithms, giving more emphasis to pratical issues and implementation than to theoretical complexity. Two parallel implementations are presented. The first one is built on De Wall, an Ed triangulator based on an original interpretation of the divide & conquer paradigm. The second is based on an incremental construction algorithm. The parallelization strategies are presented and evaluated. The target parallel machine is a distributed computing environment, composed of coarse grain processing nodes. Results of first implementations are reported and compared with the performance of the serial versions running on a Unix workstation.  相似文献   

15.
16.
Recent development in Graphics Processing Units (GPUs) has enabled inexpensive high performance computing for general-purpose applications. Compute Unified Device Architecture (CUDA) programming model provides the programmers adequate C language like APIs to better exploit the parallel power of the GPU. Data mining is widely used and has significant applications in various domains. However, current data mining toolkits cannot meet the requirement of applications with large-scale databases in terms of speed. In this paper, we propose three techniques to speedup fundamental problems in data mining algorithms on the CUDA platform: scalable thread scheduling scheme for irregular pattern, parallel distributed top-k scheme, and parallel high dimension reduction scheme. They play a key role in our CUDA-based implementation of three representative data mining algorithms, CU-Apriori, CU-KNN, and CU-K-means. These parallel implementations outperform the other state-of-the-art implementations significantly on a HP xw8600 workstation with a Tesla C1060 GPU and a Core-quad Intel Xeon CPU. Our results have shown that GPU + CUDA parallel architecture is feasible and promising for data mining applications.  相似文献   

17.
In this paper we focus on Jacobi like resolution of the eigenproblem for a real symmetric matrix from a parallel performance point of view: we try to optimize the algorithm working on the communication intensive part of the code. We discuss several parallel implementations and propose an implementation which overlaps the communications by the computations to reach a better efficiency. We show that the overlapping implementation can lead to significant improvements. We conclude by presenting our future work.  相似文献   

18.
19.
The level-set method, a technique for the computation of evolving interfaces, is a solution commonly used to segment images and volumes in medical applications. GPUs have become a commodity hardware with hundreds of cores that can execute thousands of threads in parallel, and they are nowadays ideal platforms to execute computational intensive tasks, such as the 3D level-set-based segmentation, in real time. In this paper, we propose two GPU implementations of the level-set-based segmentation method called Fast Two-Cycle. Our proposals perform computations in independent domains called tiles and modify the structure of the original algorithm to better exploit the features of the GPU. The implementations were tested with real images of brain vessels and a synthetic MRI image of the brain. Results show that they execute faster than a CPU-sequential implementation of the same method, without any significant loss of the segmentation quality and without requiring distributed parallel computer infrastructures.  相似文献   

20.
In this work, a practical implementation of two parallel thinning algorithms on a multicomputer system are described. The solution has been conceived for a multiprocessor using the SPMD (single program multiple data) programming model. Our main goal is intended to describe our experiences on data partition/distribution among processors for parallel thinning algorithms as a representative type of algorithm where communications take place between neighbor processors and the work load for each processor depends on the input data. It will be shown how the efficiency of the parallel implementation can be improved through the application of a preprocess. This preprocess is based on the analysis of the work load balance. An analysis of the communication cost is also made. Although the results shown here are concerned with the implementations of two parallel thinning algorithms we think that our proposal about data distribution is general and useful for a wide set of algorithms in the field of image processing. Copyright © 2000 John Wiley & Sons, Ltd.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号