20 similar documents found; search took 0 ms
1.
A GPU-Based Parallel Computing Method for Large-Scale Topology Optimization Problems
To address the massive computational cost and low efficiency of large-scale topology optimization, a parallel topology optimization method based on the graphics processing unit (GPU) is designed and implemented. Bi-directional Evolutionary Structural Optimization (BESO) serves as the underlying optimization algorithm, and a node-based conjugate gradient method is used to solve the finite element equations. By studying the original serial algorithm and exploiting the computational characteristics of the GPU, the entire iterative process is parallelized. The method is programmed with the Compute Unified Device Architecture (CUDA), and two parallel strategies, element-based and node-based, are proposed. The mathematical libraries shipped with CUDA are used throughout, ensuring the stability and usability of the program. Numerical examples show that the parallel method is stable and efficient: with identical optimization results, a GTX 580 graphics card achieves a substantial speedup.
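The node-based conjugate gradient solve at the core of each BESO iteration can be illustrated with a serial reference implementation; a minimal sketch for a symmetric positive-definite system, with the per-node GPU mapping noted only in a comment (this is not the paper's CUDA code).

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A.

    Serial reference for the node-based CG solve; in the GPU version
    each node's rows of the stiffness matrix would be handled by its
    own thread."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # converged: residual small enough
            break
        p = r + (rs_new / rs_old) * p  # new conjugate search direction
        rs_old = rs_new
    return x
```

For an SPD system the iteration converges in at most as many steps as the matrix dimension, which is why CG pairs well with the sparse finite element matrices of topology optimization.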
2.
Jinwoong Kim, Sul-Gi Kim, Beomseok Nam 《Journal of Parallel and Distributed Computing》2013
General-purpose computing on graphics processing units (GP-GPU) has emerged as a new cost-effective parallel computing paradigm in high performance computing research that enables large amounts of data to be processed in parallel. Large-scale, data-intensive scientific applications play an important role in modern high performance computing research. A common access pattern in such scientific data analysis applications is the multi-dimensional range query, but little research has been conducted on multi-dimensional range queries on the GPU. Inherently multi-dimensional indexing trees such as R-trees are not well suited to the GPU environment because of their irregular tree traversal: traversing an irregular search path makes it hard to maximize the utilization of massively parallel architectures. In this paper, we propose a novel MPTS (Massively Parallel Three-phase Scanning) R-tree traversal algorithm for multi-dimensional range queries that converts recursive access to tree nodes into sequential access. Our extensive experimental study shows that the MPTS R-tree traversal algorithm on an NVIDIA Tesla M2090 GPU consistently outperforms the traditional recursive R-tree search algorithm on Intel Xeon E5506 processors.
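The conversion of recursive traversal into sequential access can be illustrated in one dimension; a minimal serial sketch, not the paper's three-phase GPU kernel, assuming leaf entries are already sorted by lower bound (all names are illustrative).

```python
def range_query_scan(leaf_entries, q_lo, q_hi):
    """Scan-based range query over sorted leaf entries (1D sketch).

    leaf_entries: list of (lo, hi) intervals sorted by lo.
    Instead of descending a tree recursively, the leaf array is read
    sequentially until no later entry can overlap the query -- the
    regular, coalescing-friendly access pattern MPTS exploits."""
    results = []
    for i, (lo, hi) in enumerate(leaf_entries):
        if lo > q_hi:       # sorted by lo: nothing further can overlap
            break
        if hi >= q_lo:      # interval overlaps [q_lo, q_hi]
            results.append(i)
    return results
```

On a GPU, the sequential scan would be split across threads in fixed-size chunks, which is far easier to load-balance than an irregular recursive descent.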
3.
4.
The error-resilient entropy coding (EREC) algorithm is an effective method for combating error propagation at low cost in many compression methods that use variable-length coding (VLC). However, the main drawback of the EREC is its high complexity. To overcome this disadvantage, a parallel EREC is implemented on a graphics processing unit (GPU) using NVIDIA's CUDA technology. A direct parallelization of the original EREC is fine-grained at each stage, which brings additional communication overhead. To achieve high parallel efficiency, we propose the partitioned EREC (P-EREC) algorithm, which splits the variable-length blocks into groups and then codes each group using the EREC separately. Each GPU thread processes one group, making the EREC coarse-grained parallel. In addition, some optimization strategies are discussed in order to obtain higher performance on the GPU. When the variable-length data blocks are divided into 128 groups (resp. 256 groups), experimental results show that the parallel P-EREC achieves a 32x to 123x (resp. 54x to 350x) speedup over the original C code of the EREC compiled with the O2 optimization option. Even higher speedups can be obtained with more groups. Compared to the EREC, the P-EREC not only achieves good speedup but also slightly improves the resilience of the VLC bit-stream against burst or random errors.
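The slot-filling idea behind the EREC, which P-EREC applies independently within each group, can be sketched as bit bookkeeping; a simplified stage-offset scheme, not the original algorithm's exact offset sequence, with illustrative names throughout.

```python
def erec_place(block_lens):
    """Minimal EREC slot-filling sketch (bit bookkeeping only).

    Distributes n variable-length blocks into n equal-sized slots so
    that every block's starting position is implicitly known to the
    decoder -- the property that limits error propagation. In P-EREC
    this routine runs once per group, one GPU thread per group."""
    n = len(block_lens)
    slot = -(-sum(block_lens) // n)       # ceil division: slot size
    remaining = list(block_lens)          # bits of each block left to place
    free = [slot] * n                     # free bits in each slot
    placed = [[] for _ in range(n)]       # (block, bits) placed per slot
    for k in range(n):                    # stage k searches slot i + k
        for i in range(n):
            j = (i + k) % n
            if remaining[i] > 0 and free[j] > 0:
                bits = min(remaining[i], free[j])
                placed[j].append((i, bits))
                remaining[i] -= bits
                free[j] -= bits
    return placed, remaining
```

Decoding reverses the same deterministic search, so a corrupted block desynchronizes only later stages rather than the whole bit-stream.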
5.
Audrey Delévacq Pierre Delisle Marc Gravel Michaël Krajecki 《Journal of Parallel and Distributed Computing》2013
The purpose of this paper is to propose effective parallelization strategies for the Ant Colony Optimization (ACO) metaheuristic on Graphics Processing Units (GPUs). The Max–Min Ant System (MMAS) algorithm augmented with 3-opt local search is used as a framework for the implementation of the parallel ants and multiple ant colonies general parallelization approaches. The four resulting GPU algorithms are extensively evaluated and compared on both speedup and solution quality on a state-of-the-art Fermi GPU architecture. A rigorous effort is made to keep parallel algorithms true to the original MMAS applied to the Traveling Salesman Problem. We report speedups of up to 23.60 with solution quality similar to the original sequential implementation. With the intent of providing a parallelization framework for ACO on GPUs, a comparative experimental study highlights the performance impact of ACO parameters, GPU technical configuration, memory structures and parallelization granularity.
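The trail-limit mechanism that distinguishes MMAS from the basic Ant System can be sketched as a single pheromone update; a minimal version with illustrative parameter values (evaporation rate and trail bounds are not the paper's).

```python
def mmas_update(tau, best_tour, best_len, rho=0.02, tau_min=0.01, tau_max=1.0):
    """One Max-Min Ant System pheromone update (sketch).

    Evaporates all trails, reinforces only the best tour's edges,
    then clamps every trail into [tau_min, tau_max] -- the bound that
    prevents premature convergence in MMAS. tau is a dict keyed by
    undirected edges (i, j) with i < j."""
    deposit = 1.0 / best_len
    for edge in tau:
        tau[edge] *= (1.0 - rho)                        # evaporation
    m = len(best_tour)
    tour_edges = {tuple(sorted((best_tour[k], best_tour[(k + 1) % m])))
                  for k in range(m)}
    for edge in tour_edges:
        tau[edge] = tau.get(edge, 0.0) + deposit        # reinforcement
    for edge in tau:
        tau[edge] = min(tau_max, max(tau_min, tau[edge]))  # clamp to bounds
    return tau
```

In a GPU implementation the evaporation and clamping loops are embarrassingly parallel over edges, which is one reason MMAS maps well to CUDA.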
6.
Young In Yeo Tianyun Ni Ashish Myles Vineet Goel Jörg Peters 《The Visual computer》2009,25(8):757-769
For use in real-time applications, we present a fast algorithm for converting a quad mesh to a smooth, piecewise polynomial surface on the Graphics Processing Unit (GPU). The surface has well-defined normals everywhere and closely mimics the shape of Catmull–Clark subdivision surfaces. It consists of bicubic splines wherever possible, and a new class of patches—c-patches—where a vertex has a valence different from 4. The algorithm fits well into parallel streams so that meshes with 12,000 input quads, of which 60% have one or more non-4-valent vertices, are converted, evaluated and rendered with 9×9 resolution per quad at 50 frames per second. The GPU computations are ordered so that evaluation avoids pixel dropout.
7.
Qi Li, Raied Salman, Erik Test, Robert Strack, Vojislav Kecman 《Journal of Parallel and Distributed Computing》2013
The Support Vector Machine (SVM) is an efficient machine learning tool with high accuracy. However, to achieve the best accuracy, n-fold cross validation is commonly used to identify the best hyperparameters for the SVM. This becomes a weak point of the SVM because of the extremely long training time across the various hyperparameters of different kernel functions. In this paper, a novel parallel SVM training implementation is proposed that accelerates the cross validation procedure by running multiple training tasks simultaneously on a Graphics Processing Unit (GPU). All of these tasks, with different hyperparameters, share the same cache memory that stores the kernel matrix of the support vectors, which greatly reduces redundant computations of kernel values across the training tasks. Since the computation of kernel values is the most time-consuming operation in SVM training, the total cost of the cross validation procedure decreases significantly. Experimental tests indicate that the time cost of the multitask cross validation training is very close to that of the slowest task trained alone. Comparison tests show that the proposed method is 10 to 100 times faster than the state-of-the-art LIBSVM tool.
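The shared-cache idea can be illustrated with a memoized kernel; a minimal sketch assuming an RBF kernel and illustrative class and method names, not the paper's GPU cache layout.

```python
import math

class KernelCache:
    """Kernel values memoized once and shared by several training
    tasks (sketch). For a fixed gamma, cross-validation tasks that
    differ only in other hyperparameters (e.g. the SVM cost C) need
    the same kernel entries, so computing each entry once mirrors the
    shared-cache scheme described above."""
    def __init__(self, X, gamma):
        self.X, self.gamma = X, gamma
        self.cache = {}
        self.hits = self.misses = 0

    def k(self, i, j):
        key = (min(i, j), max(i, j))   # the kernel is symmetric
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            d2 = sum((a - b) ** 2 for a, b in zip(self.X[i], self.X[j]))
            self.cache[key] = math.exp(-self.gamma * d2)
        return self.cache[key]
```

A second task requesting the same entry pays only a lookup, which is the source of the multitask speedup when kernel evaluation dominates training time.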
8.
Control of autonomous systems subject to stochastic uncertainty is a challenging task. In guided airdrop applications, random wind disturbances play a crucial role in determining landing accuracy and terrain avoidance. This paper describes a stochastic parafoil guidance system which couples uncertainty propagation with optimal control to protect against wind and parameter uncertainty in the presence of impact area obstacles. The algorithm uses real-time Monte Carlo simulation performed on a graphics processing unit (GPU) to evaluate robustness of candidate trajectories in terms of delivery accuracy, obstacle avoidance, and other considerations. Building upon prior theoretical developments, this paper explores performance of the stochastic guidance law compared to standard deterministic guidance schemes, particularly with respect to obstacle avoidance. Flight test results are presented comparing the proposed stochastic guidance algorithm with a standard deterministic one. Through a comprehensive set of simulation results, key implementation aspects of the stochastic algorithm are explored including tradeoffs between the number of candidate trajectories considered, algorithm runtime, and overall guidance performance. Overall, simulation and flight test results demonstrate that the stochastic guidance scheme provides a more robust approach to obstacle avoidance while largely maintaining delivery accuracy.
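The Monte Carlo screening of candidate trajectories can be caricatured in one dimension; a hedged serial sketch with an illustrative cost model (miss distance plus an obstacle penalty), not the paper's parafoil dynamics or GPU-parallel sampler.

```python
import random

def most_robust_aim(candidates, obstacle, sigma=1.0, n_samples=2000, seed=0):
    """Monte Carlo screening of candidate aim points (1D sketch).

    Each candidate landing point is perturbed by Gaussian 'wind';
    samples that fall inside the obstacle interval incur a large
    penalty, otherwise the cost is miss distance from the target at
    0. The candidate with the lowest average cost wins -- the GPU
    version evaluates these samples in parallel per candidate."""
    rng = random.Random(seed)
    lo, hi = obstacle

    def expected_cost(aim):
        total = 0.0
        for _ in range(n_samples):
            x = aim + rng.gauss(0.0, sigma)          # wind perturbation
            total += 100.0 if lo <= x <= hi else abs(x)
        return total / n_samples

    return min(candidates, key=expected_cost)
```

Note the deterministic aim point closest to the target is not chosen when the obstacle sits on top of it: robustness, not nominal accuracy, drives the selection.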
9.
A parallel molecular dynamics simulation method designed for large-scale problems, employing dynamic spatial domain decomposition for short-ranged molecular interactions, is proposed. In this parallel cellular molecular dynamics (PCMD) simulation method, the link-cell data structure is used to reduce the search time required for forming the cut-off neighbor list as well as for domain decomposition, which utilizes a multi-level graph-partitioning technique. A simple threshold scheme (STS), in which workload imbalance is monitored and compared with a threshold value at runtime, is proposed to decide the proper time to repartition the domain. The simulation code is implemented and tested on a memory-distributed parallel machine, e.g., a PC-cluster system. Parallel performance is studied using approximately one million L-J atoms in the condensed, vaporized and supercritical states. Results show that fairly good parallel efficiency on 49 processors can be obtained for the condensed and supercritical states (∼60%), while it is comparatively lower for the vaporized state (∼40%).
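The link-cell structure mentioned above can be sketched in two dimensions; a minimal non-periodic version with illustrative names, omitting the dynamic domain decomposition and threshold scheme.

```python
from collections import defaultdict
from itertools import product

def linked_cell_pairs(positions, box, cutoff):
    """Link-cell neighbor search sketch (2D, non-periodic box).

    Particles are binned into cells with edge length >= cutoff, so a
    particle's neighbors can only lie in its own or an adjacent cell,
    replacing the O(N^2) all-pairs distance check with a local one."""
    ncell = max(1, int(box / cutoff))
    size = box / ncell
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        cx = min(int(x / size), ncell - 1)
        cy = min(int(y / size), ncell - 1)
        cells[(cx, cy)].append(idx)
    pairs = set()
    for (cx, cy), members in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            neighbors = cells.get((cx + dx, cy + dy), ())
            for i in members:
                for j in neighbors:
                    if i < j:                       # count each pair once
                        d2 = (positions[i][0] - positions[j][0]) ** 2 \
                           + (positions[i][1] - positions[j][1]) ** 2
                        if d2 <= cutoff * cutoff:
                            pairs.add((i, j))
    return pairs
```

The same cell grid doubles as the unit of spatial decomposition when the domain is partitioned across processors.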
10.
Parallel implementation of simulated annealing to reproduce multiple-point statistics
Oscar Peredo 《Computers & Geosciences》2011,37(8):1110-1121
This paper shows an innovative implementation of simulated annealing in the context of parallel computing. Details regarding the use of parallel computing through a cluster of processors, as well as the implementation decisions, are provided. Simulated annealing is presented aiming at the generation of stochastic realizations of categorical variables reproducing multiple-point statistics. The procedure starts with the use of a training image to determine the frequencies of occurrence of particular configurations of nodes and values. These frequencies are used as target statistics that must be matched by the stochastic images generated with the algorithm. The simulation process considers an initial random image of the spatial distribution of the categories. Nodes are perturbed randomly and after each perturbation the mismatch between the target statistics and the current statistics of the image is calculated. The perturbation is accepted if the statistics are closer to the target, or conditionally rejected if not, based on the annealing schedule. The simulation was implemented using parallel processes with C++ and MPI. The message passing scheme was implemented using a speculative computation framework, by which, prior to making the decision of acceptance or rejection of a proposed perturbation, processes already start calculating the next possible perturbation at a second level: one as if the perturbation on level one is accepted, and another as if it is rejected. Additional levels can start their calculation as well, conditional on the second-level processes. Once a process reaches a decision as to whether to accept or reject the suggested perturbation, all processes within the branch incompatible with that decision are dropped.
This allows a speedup of up to log_n(p+1), where n is the number of categories and p the number of simultaneously active processes. Examples are provided to demonstrate the improvements and speedups that can be achieved.
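The accept/reject decision that the speculative branches precompute can be written compactly; a minimal sketch of the Metropolis rule, with the mismatch-to-target objective abstracted to a scalar cost (names are illustrative).

```python
import math
import random

def anneal_step(current_cost, proposed_cost, temperature, rng):
    """Metropolis acceptance rule at the heart of simulated annealing
    (sketch). A perturbation that moves the image statistics closer
    to the target (lower cost) is always accepted; a worsening one is
    accepted with probability exp(-delta/T), which shrinks as the
    annealing schedule lowers T."""
    delta = proposed_cost - current_cost
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)
```

Because the outcome is binary, the speculative framework can launch both the "accepted" and "rejected" continuations before this function returns, then discard the incompatible branch.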
11.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.
Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the Shallow Water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
12.
The ordinary differential equations of Newtonian dynamics are used in atomic simulations with the method of molecular dynamics. The basic issues are surveyed and standard algorithms are described. Several algorithmic variants are discussed. Some advanced ideas relating to parallel computation are considered.
13.
14.
As a fundamental computational kernel, QR decomposition is widely used in image processing, signal processing, communication engineering, and many other fields. Traditional parallel QR decomposition algorithms exploit only the data-level parallelism in the computation. Based on an analysis of the fast Givens rotation decomposition, a multi-level parallel algorithm is proposed that exploits both task-level and data-level parallelism, making it well suited to massively parallel processors such as the graphics processing unit (GPU). Moreover, the GPU-based parallel QR decomposition can serve as a building block called directly by the many applications running on the GPU platform. Experimental results show that, compared with an OpenMP implementation on the CPU, the GPU-based multi-level parallel algorithm achieves more than a 5x speedup, and a singular value decomposition (SVD) application calling the QR module achieves more than a 3x speedup.
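The row-pair structure that exposes parallelism in Givens-based QR can be seen in a serial reference; a sketch using classical (not fast) Givens rotations, so the square roots that the paper's fast variant avoids are still taken here.

```python
import numpy as np

def givens_qr(A):
    """QR decomposition via classical Givens rotations (sketch).

    Each rotation zeroes one subdiagonal entry and touches only two
    rows, so rotations acting on disjoint row pairs are independent
    -- the task-level parallelism a GPU scheduler can exploit."""
    m, n = A.shape
    R = A.astype(float)        # working copy, becomes upper triangular
    Q = np.eye(m)              # accumulates the orthogonal factor
    for j in range(n):
        for i in range(m - 1, j, -1):      # zero column j from the bottom up
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])     # 2x2 rotation on rows i-1, i
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
    return Q, R
```

Because each step only mixes two rows, rotations for different columns can proceed in a wavefront pattern, which is the basis of the multi-level schedule described in the abstract.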
15.
Heterogeneous systems whose nodes contain more than one type of computation unit, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we develop a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extensions on CPUs, and multiple CUDA threads on GPUs. Using this hierarchy of parallelism with optimizations such as intra-node communication hiding and memory optimizations on both CPUs and GPUs, we have implemented and evaluated an MD simulation on the petascale heterogeneous supercomputer TH-1A. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and benefit from the optimizations.
16.
A highly desired part of the synthetic biology toolbox is an embedded chemical microcontroller, capable of autonomously following a logic program specified by a set of instructions, and interacting with its cellular environment. Strategies for incorporating logic in aqueous chemistry have focused primarily on implementing components, such as logic gates, that are composed into larger circuits, with each logic gate in the circuit corresponding to one or more molecular species. With this paradigm, designing and producing new molecular species is necessary to perform larger computations. An alternative approach begins by noticing that chemical systems on the small scale are fundamentally discrete and stochastic. In particular, the exact molecular counts of each molecular species present are an intrinsically available form of information. This might appear to be a very weak form of information, perhaps quite difficult for computations to utilize. Indeed, it has been shown that error-free Turing-universal computation is impossible in this setting. Nevertheless, we show a design of a chemical computer that achieves fast and reliable Turing-universal computation using molecular counts. Our scheme uses only a small number of different molecular species to do computation of arbitrary complexity. The total probability of error of the computation can be made arbitrarily small (but not zero) by adjusting the initial molecular counts of certain species. While physical implementations would be difficult, these results demonstrate that molecular counts can be a useful form of information for small molecular systems such as those operating within cellular environments.
17.
W. Michael Brown Peng Wang Steven J. Plimpton Arnold N. Tharrington 《Computer Physics Communications》2011,182(4):898-911
The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS, however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores.
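Short-range force kernels of the kind discussed above typically evaluate a cutoff Lennard-Jones interaction per pair; a minimal scalar sketch in reduced units (parameter values are illustrative, and this is not LAMMPS code).

```python
def lj_force(r2, epsilon=1.0, sigma=1.0, cutoff=2.5):
    """Lennard-Jones force over distance, F(r)/r, for squared
    separation r2 (sketch).

    Returning F/r from r^2 avoids a square root in the inner loop,
    the usual trick in short-range MD force kernels; pairs beyond
    the cutoff contribute nothing."""
    if r2 >= cutoff * cutoff:
        return 0.0
    inv_r2 = sigma * sigma / r2
    inv_r6 = inv_r2 ** 3                    # (sigma/r)^6
    # F/r = 24*eps*[2*(sigma/r)^12 - (sigma/r)^6] / r^2
    return 24.0 * epsilon * inv_r6 * (2.0 * inv_r6 - 1.0) / r2
```

Each pair evaluation is independent, so the same kernel body maps directly onto one accelerator thread per pair or per atom.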
18.
Theodoros Athanasiadis Ioannis Fudos 《Computers & Graphics》2011,35(3):569-579
Mesh parameterization is central to a broad spectrum of applications. In this paper, we present a novel approach to spherical mesh parameterization based on an iterative quadratic solver that is efficiently parallelizable on modern massively parallel architectures. We present an extensive analysis of performance results on both GPU and multicore architectures. We introduce a number of heuristics that exploit various system characteristics of the underlying architectures to speed up the parallel realization of our algorithms. Furthermore, we demonstrate the applicability of our approach to real-time feature detection, mesh decomposition and similarity-based 3D object retrieval. Finally, we offer visual results and a demonstration video.
19.
Classification using Ant Programming is a challenging data mining task which demands a great deal of computational resources when handling data sets of high dimensionality. This paper presents a new parallelization approach to an existing multi-objective Ant Programming model for classification, using GPUs and the NVIDIA CUDA programming model. The computational costs of the different steps of the algorithm are evaluated, and how best to parallelize them is discussed. The features of both the parallel CPU and GPU versions of the algorithm are presented. An experimental study evaluates the performance and efficiency of the rule interpreter, reporting execution times and speedups with respect to population size, complexity of the mined rules and dimensionality of the data sets. Experiments measure the original single-threaded times and the new multi-threaded CPU and GPU times with different numbers of GPU devices. The results are reported in terms of the interpreter's GP operations per second (up to 10 billion GPops/s) and the speedup achieved (up to 834x vs. single-threaded CPU, 212x vs. 4-threaded CPU). The proposed GPU model is shown to scale efficiently to larger datasets and to multiple GPU devices, which extends its applicability to significantly more complex data sets, previously unmanageable by the original algorithm in reasonable time.
20.
Lilia Ziane Khodja Ming Chau Raphaël Couturier Jacques Bahi Pierre Spitéri 《Computers & Mathematics with Applications》2013,65(11):1830-1848
This paper deals with the numerical solution of financial applications, more specifically the computation of American option derivatives modeled by nonlinear boundary value problems. In such applications we have to solve large-scale algebraic systems. We concentrate on synchronous and asynchronous parallel iterative algorithms carried out on CPU and GPU networks. The properties of the operators arising in the discretized problem ensure the convergence of the parallel synchronous and asynchronous iterative algorithms. Computational experiments performed on CPU and GPU networks are presented and analyzed.