Similar Literature
20 similar documents retrieved.
1.
Interactive isosurface visualisation has been made possible by mapping algorithms to GPU architectures. However, current state-of-the-art isosurfacing algorithms usually consume large amounts of GPU memory owing to the additional acceleration structures they require. As a result, the continued limitations on available GPU memory mean that they are unable to deal with the larger datasets that are now increasingly prevalent. This paper proposes a new parallel isosurface-extraction algorithm that exploits the blocked organisation of the parallel threads found in modern many-core platforms to achieve fast isosurface extraction and reduce the associated memory requirements. This is achieved by optimising thread co-operation within thread-blocks and reducing redundant computation; ultimately, an indexed triangular mesh can be produced. Experiments have shown that the proposed algorithm is much faster (up to 10×) than state-of-the-art GPU algorithms and has a much smaller memory footprint, enabling it to handle much larger datasets (up to 64×) on the same GPU.
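The block-level cooperation the abstract describes can be pictured with a short sketch (hypothetical kernel and helper names; the paper's actual case-table logic and mesh indexing are not reproduced): each thread classifies one cell, the block scans the per-cell triangle counts in shared memory, and a single atomicAdd per block reserves space in the global output instead of one atomic per thread.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

// Placeholder: a real marching-cubes kernel would consult the case table.
__device__ int countTriangles(const float *volume, int cell, float isovalue) {
    return volume[cell] > isovalue ? 1 : 0;  // stand-in for the case table
}

__global__ void extractCompacted(const float *volume, int numCells,
                                 float isovalue, int *cellOffsets,
                                 int *globalCount) {
    __shared__ int scan[BLOCK];
    __shared__ int blockBase;

    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    int n = (cell < numCells) ? countTriangles(volume, cell, isovalue) : 0;

    // Inclusive scan over the block's triangle counts (Hillis-Steele).
    scan[threadIdx.x] = n;
    __syncthreads();
    for (int d = 1; d < BLOCK; d <<= 1) {
        int v = (threadIdx.x >= d) ? scan[threadIdx.x - d] : 0;
        __syncthreads();
        scan[threadIdx.x] += v;
        __syncthreads();
    }
    int myEnd = scan[threadIdx.x];       // inclusive prefix sum
    int blockTotal = scan[BLOCK - 1];

    // One atomic per block instead of one per thread.
    if (threadIdx.x == 0) blockBase = atomicAdd(globalCount, blockTotal);
    __syncthreads();

    if (cell < numCells) cellOffsets[cell] = blockBase + myEnd - n;
    // A second pass (not shown) writes vertices/indices at these offsets.
}
```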

2.
This paper presents a Graphics Processing Unit (GPU)-based implementation of the Bellman-Ford (BF) routing algorithm using NVIDIA’s Compute Unified Device Architecture (CUDA). In the proposed GPU-based approach, multiple threads run concurrently over numerous streaming processors in the GPU to dynamically update routing information. Instead of computing individual vertex distances one by one, many threads update a large number of vertex distances concurrently, with each vertex distance handled by a single thread. This paper compares the performance of the GPU-based approach to an equivalent CPU implementation while varying the number of vertices. Experimental results show that the proposed GPU-based approach outperforms the equivalent sequential CPU implementation in execution time by exploiting the massive parallelism inherent in the BF routing algorithm. In addition, the reduction in energy consumption (about 99%) achieved by using the GPU reflects the broader merits of deploying GPUs across the landscape of IP routing for emerging multimedia communications.
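A minimal sketch of the one-thread-per-vertex mapping described above (illustrative names; a CSR edge layout is an assumption):

```cuda
#include <cuda_runtime.h>
#include <climits>

// One thread per vertex: relax all incoming edges of vertex v.
__global__ void bfRelax(const int *rowPtr, const int *srcIdx, const int *weight,
                        const int *distIn, int *distOut, int numVertices,
                        int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVertices) return;

    int best = distIn[v];
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int du = distIn[srcIdx[e]];
        if (du != INT_MAX && du + weight[e] < best) best = du + weight[e];
    }
    distOut[v] = best;
    if (best != distIn[v]) atomicExch(changed, 1);
}
// Host side (not shown): ping-pong distIn/distOut for up to |V|-1 rounds,
// stopping early once *changed stays 0.
```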

3.
Design and Implementation of a Parallel Cuckoo Search Algorithm Based on CUDA
The cuckoo search (CS) algorithm is a metaheuristic developed in recent years that has been successfully applied to a variety of optimization problems. To address the long computation times CS incurs on large-scale, complex problems, this paper proposes a parallel cuckoo search algorithm based on the Compute Unified Device Architecture (CUDA). The parallel implementation combines task parallelism with data parallelism: thread blocks on the graphics processing unit (GPU) are mapped to individual cuckoos, and threads within a block are mapped to the dimensions of each individual, so that nest-position updates, fitness evaluation, nest rebuilding, and best-individual selection are all performed in parallel. The entire iterative optimization loop runs on the GPU, which reduces CPU-GPU communication overhead. Simulation experiments on four classic benchmark functions show that, with equivalent convergence, the CUDA-based parallel CS algorithm achieves speedups of up to 110× over the standard CS algorithm.
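A sketch of the block-per-individual, thread-per-dimension mapping described above (illustrative names; the Lévy-flight step is simplified here to a scaled normal draw, and RNG states are assumed to be initialized with curand_init on the host side):

```cuda
#include <curand_kernel.h>

// Block = one cuckoo/nest, thread = one dimension of its position.
// The true Levy-flight step size is simplified to a scaled normal draw here.
__global__ void updateNests(float *pos, const float *best, curandState *rng,
                            int dim, float alpha) {
    if (threadIdx.x >= dim) return;
    int i = blockIdx.x * dim + threadIdx.x;   // this individual's component
    curandState s = rng[i];
    float step = alpha * curand_normal(&s);   // stand-in for Levy flight
    pos[i] += step * (pos[i] - best[threadIdx.x]);
    rng[i] = s;
}
```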

4.
The GPU (Graphics Processing Unit) has recently become one of the most power-efficient processors in embedded and many other environments, and has been integrated into more and more SoCs (System on Chip). Modern GPUs thus play a very important role in power-aware computing. Strongly Connected Component (SCC) decomposition is a fundamental graph algorithm with wide applications in model checking, electronic design automation, social network analysis, and other fields. GPUs have been shown to have great potential in accelerating many types of computations, including graph algorithms. Recent work has demonstrated the feasibility of GPU SCC decomposition, but existing implementations are inefficient because they do not sufficiently account for the distinctive GPU programming model, leading to poor performance on irregular and sparse graphs. This paper presents a new GPU SCC decomposition algorithm that focuses on full utilization of contemporary embedded and desktop GPU architectures. In particular, a subgraph numbering scheme is proposed to facilitate the safe and efficient management of subgraph IDs and to serve as the basis of efficient source selection. Furthermore, we adopt a multi-source partition procedure that greatly reduces the recursion depth and use a vertex labeling approach that highly optimizes GPU memory access. The evaluation results show that the proposed approach achieves up to 41× speedup over Tarjan’s algorithm, one of the most efficient sequential SCC decomposition algorithms, and up to 3.8× speedup over previous GPU algorithms.
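The forward/backward reachability closures at the heart of GPU SCC decomposition are typically computed by iterating a propagation kernel to a fixed point; a hedged sketch (illustrative names; a reverse-CSR layout is an assumption, and the paper's subgraph numbering is not shown):

```cuda
// Pull-style propagation over the reverse CSR (colIdx lists in-neighbors):
// v joins the forward closure once any in-neighbor is already in it.
__global__ void fwdStep(const int *rowPtr, const int *colIdx, int n,
                        unsigned char *reached, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || reached[v]) return;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
        if (reached[colIdx[e]]) { reached[v] = 1; *changed = 1; break; }
}
// Host loops fwdStep until no change; the SCC around the pivot is the
// intersection of the forward and (symmetric) backward closures.
```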

5.
In this paper we describe a new parallel Frequent Itemset Mining algorithm called “Frontier Expansion.” The implementation is optimized to achieve high performance on a heterogeneous platform consisting of a shared-memory multiprocessor and multiple Graphics Processing Unit (GPU) coprocessors. Frontier Expansion is an improved data-parallel algorithm derived from the Equivalent Class Clustering (Eclat) method, in which a partial breadth-first search is utilized to exploit maximum parallelism while being constrained by the available memory capacity. In our approach, the vertical transaction lists are represented using a “bitset” representation and operated on with wide bitwise operations across multiple threads on a GPU. We evaluate our approach using four NVIDIA Tesla GPUs and observe a 6–30× speedup relative to state-of-the-art sequential Eclat and FPGrowth implementations executed on a multicore CPU.
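A sketch of the core bitset operation described above (illustrative names; a real implementation would reduce the partial support counts hierarchically rather than with a global atomic):

```cuda
// Intersect two vertical tid-list bitsets and count support with popcount.
// One thread per 32-bit word.
__global__ void intersectSupport(const unsigned *listA, const unsigned *listB,
                                 unsigned *listOut, int numWords, int *support) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= numWords) return;
    unsigned both = listA[w] & listB[w];   // transactions containing both items
    listOut[w] = both;
    if (both) atomicAdd(support, __popc(both));
}
```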

6.
This paper proposes a multi-cue face-tracking algorithm with a supporting framework that uses parallel multi-core processing and one Graphics Processing Unit (GPU). Owing to illumination and partial-occlusion problems, face tracking usually cannot work stably from a single cue. To address these problems, we first combine three different visual cues (color histogram, edge orientation histogram, and wavelet features) under a particle-filter framework to considerably improve tracking performance. Furthermore, an online updating strategy makes the algorithm adaptive to illumination changes and slight face rotations. However, the computational cost grows considerably with the number of particles and visual cues. To handle the large computation cost introduced by the multi-cue strategy, and to achieve real-time response, we explored two parallel computing techniques to speed up the tracking process, especially the most computation-intensive observation steps: a multi-core parallel algorithm with a MapReduce thread model, and a GPU-based approach in which the feature-matching and particle-weight computations are moved into the GPU kernel. The results demonstrate that the proposed face-tracking algorithm works robustly with cluttered backgrounds and differing illumination; the multi-core parallel scheme runs 2–6 times faster than the corresponding sequential algorithms, and the GPU parallel and co-processing schemes achieve a greater speedup (8×–12×).

7.
We present initial performance comparisons for the Intel many integrated core (MIC), Sandy Bridge (SB), and graphical processing unit (GPU) architectures. A 1D explicit electrostatic particle-in-cell code is used to simulate a two-stream instability in plasma. We compare the computation times for various numbers of cores/threads and compiler options. The parallelization is implemented via OpenMP with a maximum thread count of 128. Parallelization and vectorization on the GPU are achieved by modifying the code syntax for compatibility with CUDA. We assess the speedup due to various auto-vectorization and optimization-level compiler options. Our results show that the MIC is several times slower than SB for a single thread, and becomes faster than SB as the number of cores increases with vectorization switched on. The compute times for the GPU are consistently about six to seven times shorter than those for the MIC. Compared with SB, the GPU is about two times faster for a single thread and about an order of magnitude faster for 128 threads. The net speedups for the MIC and the GPU, however, are almost the same. An initial attempt to offload parts of the code to the MIC coprocessor shows that there is an optimal number of threads at which the speedup reaches a maximum.

8.
CPU/FPGA hybrid architectures are a common structure for reconfigurable computing. To simplify the use of the FPGA in such hybrid architectures, this paper proposes a hardware-thread approach and designs an execution mechanism for hardware threads, so that reconfigurable resources are used in the form of hardware threads. Software and hardware threads can then execute in parallel, communicating through shared data memory: the compute-intensive parts of a program run as hardware threads on the FPGA, while the control-intensive parts run as software threads on the CPU. Experiments on a hybrid-architecture platform simulated with Simics show that, after restructuring DES, MD5SUM, and merge sort into combined software/hardware multithreaded versions, an average execution speedup of 2.30 is achieved, effectively exploiting the computing capability of the CPU/FPGA hybrid architecture.

9.
In this paper, a graphics processing unit (GPU)-accelerated particle filtering algorithm is presented, along with a novel resampling technique. The aim is to mitigate particle impoverishment as well as computational burden, problems commonly associated with classical (systematic) resampled particle filtering. The proposed algorithm employs an a priori space-dependent distribution in addition to the likelihood, and is hence christened the dual-distribution-dependent (D3) resampling method. Simulation results exhibit lower root mean square error (RMSE) than systematic resampling. D3 resampling is shown to improve particle diversity after each iteration, thereby improving the overall quality of estimation. However, the computational burden is significantly increased owing to additional computations within the newly formulated resampling framework. To obtain parallel speedup, we introduce a CUDA version of the proposed method for acceleration by the GPU. The GPU programming model is detailed in the context of this paper. Implementation issues are discussed, along with an illustration of the empirical computational efficiency obtained by executing the CUDA code on a Quadro 2000 GPU. The GPU-enabled code achieves speedups of 3× and 4× over the sequential executions of the systematic and D3 resampling methods, respectively. Performance in terms of both RMSE and running time is elaborated for different choices of threads per block. In this context, we further introduce a cost-to-performance metric (CPM) for assessing the algorithmic efficiency of the estimator, combining quality of estimation and running time into a unified assessment parameter. CPM values for the estimators obtained from the different choices of threads per block are determined, and a final value of the parameter is selected to yield an effective overall estimator.
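The paper's D3 formulation is not reproduced here; for reference, a sketch of the classical systematic-resampling baseline it is compared against, with one thread per output particle (assumes the inclusive prefix sum of normalized weights, cdf, has already been computed):

```cuda
// Systematic resampling: output particle i selects the ancestor whose
// CDF interval contains (i + u) / N, found by binary search.
__global__ void systematicResample(const float *cdf, const float *particlesIn,
                                   float *particlesOut, int n, float u) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float target = (i + u) / n;          // u ~ Uniform(0,1), one per sweep
    int lo = 0, hi = n - 1;
    while (lo < hi) {                    // first index with cdf >= target
        int mid = (lo + hi) / 2;
        if (cdf[mid] < target) lo = mid + 1; else hi = mid;
    }
    particlesOut[i] = particlesIn[lo];   // 1-D state for brevity
}
```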

10.
Speculative multithreading (SpMT) is a thread-level automatic parallelization technique that partitions sequential programs into multiple threads to be executed in parallel. This paper presents different thread-partitioning strategies for non-loops and loops. For non-loops, we propose a cost estimation based on the combined run-time effects of various speculation factors to predict the resulting performance of candidate threads and guide thread partitioning. For loops, we parallelize all profitable loops that can potentially offer additional performance benefits by multilevel spawning in loop bodies, loop iterations, and inner loops. We then select a proper thread boundary located in front of the loop-branch instruction to reduce invalid spawned threads that waste core resources. Experimental results show that the proposed approach obtains a significant increase in speedup, with the Olden benchmarks reaching an average performance improvement of 6.62%.

11.
In this paper, we present two new parallel algorithms to solve large instances of the scheduling of independent tasks problem. First, we describe a parallel version of the Min–min heuristic. Second, we present GraphCell, an advanced parallel cellular genetic algorithm (CGA) for the GPU. Two new generic recombination operators that take advantage of the massive parallelism of the GPU are proposed for GraphCell. A speedup study shows the high performance of the parallel Min–min algorithm on the GPU versus several CPU versions of the algorithm (both sequential and parallel using multiple threads). GraphCell improves state-of-the-art solutions, especially for larger problems, and provides an alternative to our GPU Min–min heuristic when more accurate solutions are needed, at the expense of increased runtime.

12.
The most commonly used approach for solving reaction–diffusion systems relies upon stencil computations. Although stencil computations feature low compute intensity, they place high demands on memory bandwidth. Fortunately, GPU computing allows the heavy reliance of stencil computations on neighboring data points to be exploited to significantly increase simulation speeds by reducing these memory-bandwidth demands. A review of previously published work shows that a wide variety of efforts have been made to optimize NVIDIA CUDA-based stencil computations. However, a critical aspect contributing to algorithm performance is commonly glossed over: the halo-region loading technique utilized in conjunction with a given spatial blocking technique. This paper presents an in-depth examination of this aspect and the associated single-iteration performance impacts when using symmetric, nearest-neighbor 19-point stencils. This is accomplished by closely examining how the simulated space is partitioned into thread blocks and the balance between memory accesses, divergence, and computing threads. The resulting optimization strategy for accelerating 3-dimensional reaction–diffusion simulations offers up to 2.45× speedup for single-precision floating-point numbers relative to the GPU-based speedups reported in the previously published work that this paper directly extends. Relative to our multithreaded CPU-based implementation, the resulting optimization strategy offers up to 8.69× speedup for single-precision floating-point numbers.
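The halo-loading question the paper studies can be illustrated with a deliberately simplified 2-D, 5-point variant (the paper uses 3-D, 19-point stencils; names are illustrative): interior values are staged in a shared-memory tile, and the block's edge threads perform the extra halo loads.

```cuda
#define TILE 16

// 2-D 5-point stencil with a shared-memory tile plus a 1-cell halo.
// Assumes nx and ny are multiples of TILE so every thread is in-bounds;
// launch with dim3(TILE, TILE) threads per block.
__global__ void stencil5(const float *in, float *out, int nx, int ny) {
    __shared__ float tile[TILE + 2][TILE + 2];
    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;

    tile[ly][lx] = in[gy * nx + gx];
    // Edge threads of the block fetch the halo (clamped at domain borders).
    if (threadIdx.x == 0)        tile[ly][0]        = in[gy * nx + max(gx - 1, 0)];
    if (threadIdx.x == TILE - 1) tile[ly][TILE + 1] = in[gy * nx + min(gx + 1, nx - 1)];
    if (threadIdx.y == 0)        tile[0][lx]        = in[max(gy - 1, 0) * nx + gx];
    if (threadIdx.y == TILE - 1) tile[TILE + 1][lx] = in[min(gy + 1, ny - 1) * nx + gx];
    __syncthreads();

    out[gy * nx + gx] = 0.2f * (tile[ly][lx] + tile[ly][lx - 1] + tile[ly][lx + 1] +
                                tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```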

13.
Implementations of relational operators on GPUs have resulted in order-of-magnitude speedups compared to their multicore CPU counterparts. Here we focus on the efficient implementation of string-matching operators common in SQL queries. Owing to different architectural features, the optimal algorithm for CPUs might be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so it is not feasible to keep the working set of all threads in cache in a naive implementation. In GPUs the unit of execution is a group of threads, and in the presence of loops and branches, threads in a group have to follow the same execution path; if some threads diverge, the different paths are serialized. We study the cache-memory efficiency of single- and multi-pattern string-matching algorithms for conventional and pivoted string layouts in GPU memory. We evaluate memory efficiency in terms of access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in both raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory-access pattern makes it amenable to several GPU optimizations.
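A sketch of a one-thread-per-string KMP matcher over a conventional fixed-width row layout (illustrative names; the pivoted layout and the paper's multi-step divergence-reduction scheme are not shown):

```cuda
// Each thread scans one fixed-width row for the pattern with KMP.
// fail[] is the standard KMP failure table, precomputed on the host:
// fail[j] = length of the longest proper border of pattern[0..j].
__global__ void kmpMatch(const char *rows, int rowLen, int numRows,
                         const char *pattern, const int *fail, int m,
                         int *matchCount) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= numRows) return;
    const char *s = rows + (size_t)r * rowLen;
    int k = 0;  // current length of matched prefix
    for (int i = 0; i < rowLen; ++i) {
        while (k > 0 && s[i] != pattern[k]) k = fail[k - 1];
        if (s[i] == pattern[k]) ++k;
        if (k == m) { atomicAdd(matchCount, 1); k = fail[k - 1]; }
    }
}
```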

14.
General-purpose computation on the graphics processing unit (GPU) is rapidly entering various scientific and engineering fields, and many applications are being ported to GPUs for better performance. Various optimizations, frameworks, and tools are being developed for effective GPU programming. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called kernel coalesce that further enhances GPU performance and also optimizes CPU-to-GPU communication time. With the kernel coalesce methods proposed in this paper, kernel launch overheads are reduced by coalescing concurrent kernels, and data transfers are reduced in cases where intermediate data are generated and used among kernels. Computation optimization on the device (GPU) is performed by tuning the number of blocks and threads launched to the architecture. The block-level kernel coalesce method yields a prominent performance improvement on devices without support for concurrent kernels. The thread-level kernel coalesce method is better than the block-level method when the design of the grid structure (i.e., the number of blocks and threads) is not optimal for the device architecture and leads to underutilization of device resources. Both methods perform similarly when the number of threads per block is approximately the same in the different kernels and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. The multi-clock-cycle kernel coalesce method can be chosen if the programmer wants to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device. If the kernels have lightweight thread computations, the multi-clock-cycle method gives better performance than the thread- and block-level methods. If the kernels to be coalesced combine compute-intensive and memory-intensive kernels, warp interleaving gives higher device occupancy and improves performance. For micro-benchmark 1 considered in this paper, the multi-clock-cycle kernel coalesce method resulted in 10–40% and 80–92% improvement compared with separate kernel launches, without and with shared input and intermediate data among the kernels, respectively, on a Fermi-architecture device (GTX 470). A nearest-neighbor (NN) kernel from the Rodinia benchmark coalesced with itself using the thread-level kernel coalesce method and warp interleaving gives 131.9% and 152.3% improvement compared with separate kernel launches, and 39.5% and 36.8% improvement compared with the block-level kernel coalesce method, respectively.
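The block-level coalesce idea can be sketched as follows (kernels A and B are illustrative stand-ins): two independent kernels are fused into a single launch, with blockIdx.x deciding which body each block executes, so one launch overhead replaces two.

```cuda
__device__ void bodyA(float *a, int n, int i) { if (i < n) a[i] *= 2.0f; }
__device__ void bodyB(float *b, int n, int i) { if (i < n) b[i] += 1.0f; }

// Block-level coalesce: blocks [0, blocksA) run A, the rest run B,
// replacing two kernel launches (and their overheads) with one.
__global__ void coalesced(float *a, int na, float *b, int nb, int blocksA) {
    if (blockIdx.x < blocksA) {
        bodyA(a, na, blockIdx.x * blockDim.x + threadIdx.x);
    } else {
        bodyB(b, nb, (blockIdx.x - blocksA) * blockDim.x + threadIdx.x);
    }
}
```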

15.
Recent multi-core designs have migrated from Symmetric Multi-Processing to cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. In this paper we discuss performance issues that arise when designing parallel Finite Element programs for a 64-core ccNUMA computer and explore solutions to these issues. We first present an overview of the computer architecture and show that highly parallel code that does not take into account the system's memory organization scales poorly, achieving only 2.8× speedup when running with 64 threads. We then discuss how we identified the sources of overhead and evaluate three possible solutions. We show that the first solution does not require the application's code to be modified, but achieves only 10.6× speedup. The second solution scales up to 30.9×, but requires the programmer to manually schedule threads, allocate related data on local CPUs and memory banks, and rely on ccNUMA-aware libraries that are not portable across operating systems. We also propose and evaluate "copy-on-thread", an alternative solution that scales up to 25.5× without relying on specialized libraries or requiring specific data allocation and thread scheduling. Finally, we argue that the reported issues only arise for large data sets and conclude the paper with recommendations to help programmers design algorithms and programs that perform well on this kind of machine.

16.
In recent years, graphics processing unit (GPU)-accelerated intelligent algorithms have been widely utilized for solving combinatorial optimization problems, which are NP-hard. These algorithms involve a common operation, namely reduction, in which the best-suited candidate solution in the neighborhood is selected. As one of the main procedures, it is necessary to optimize reduction on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of implementation at the warp level, which better matches the characteristics of current GPU architectures. First, vectorized accesses are utilized to improve global-memory performance. Second, at the level of thread-block reduction, an enhanced warp-based reduction in shared memory is presented to form partial results. Third, the number of thread blocks is configured by maximizing the thread-block size subject to the maximum number of threads per streaming multiprocessor (SM) on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs, showing better performance than previous methods.
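A sketch of the warp-level reduction pattern this line of work builds on (a common formulation, not necessarily the paper's exact variant; requires CUDA 9+ for __shfl_down_sync):

```cuda
// Warp-level sum: each lane adds the value held by the lane 'offset' above,
// so after log2(32) steps lane 0 holds the warp's total.
__device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Block reduction built on warp reductions, with one shared slot per warp.
__global__ void reduceSum(const float *in, int n, float *out) {
    __shared__ float warpSums[32];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0) warpSums[threadIdx.x >> 5] = v;
    __syncthreads();

    if (threadIdx.x < 32) {                 // first warp reduces the partials
        int numWarps = (blockDim.x + 31) >> 5;
        v = (threadIdx.x < numWarps) ? warpSums[threadIdx.x] : 0.0f;
        v = warpReduceSum(v);
        if (threadIdx.x == 0) atomicAdd(out, v);
    }
}
```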

17.
Peng, Lizhi; Zhang, Haibo; Hassan, Houcine; Chen, Yuehui; Yang, Bo. The Journal of Supercomputing, 2019, 75(6): 2930-2949.

The data gravitation-based classification (DGC) model, a new classification model inspired by a physical law, has been demonstrated to be effective for both standard and imbalanced classification tasks. However, owing to the large scale of gravitational computation during the feature-weighting process, DGC suffers from high computational complexity, especially for large data sets. In this paper, we address the problem of speeding up gravitational computation using the graphics processing unit (GPU). We design a GPU parallel algorithm, namely GPU-DGC, to accelerate the feature-weighting process of the DGC model. Our GPU-DGC model distributes the gravitational computing process to parallel GPU threads so that gravitation is computed simultaneously. We use 25 open classification data sets to evaluate the parallel performance of our algorithm. The relationship between the speedup ratio and the number of GPU threads is discovered and discussed based on empirical studies. The experimental results show the effectiveness of GPU-DGC, with a maximum speedup of 87× over the serial DGC; its sensitivity to the number of GPU threads is also explored in the empirical studies.
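A hedged sketch of how gravitational computation might be distributed across threads, one test instance per thread (illustrative names and a simplified force formula with unit masses; the paper's exact DGC formulation may differ):

```cuda
// Each thread accumulates the "gravitation" one test instance receives
// from all training instances. w[] holds the feature weights being
// evaluated; the actual DGC formula may differ in detail.
__global__ void gravitation(const float *train, int nTrain,
                            const float *test, int nTest, const float *w,
                            int dim, float *g) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nTest) return;
    float sum = 0.0f;
    for (int j = 0; j < nTrain; ++j) {
        float d2 = 1e-12f;                  // avoid division by zero
        for (int k = 0; k < dim; ++k) {
            float diff = w[k] * (test[t * dim + k] - train[j * dim + k]);
            d2 += diff * diff;
        }
        sum += 1.0f / d2;                   // F = m1*m2 / d^2 with unit masses
    }
    g[t] = sum;
}
```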

18.
Speeding up the evaluation phase of GP classification algorithms on GPUs
The efficiency of evolutionary algorithms has become a widely studied problem, since it is one of the major weaknesses of these algorithms. Specifically, when these algorithms are employed for classification, the computational time they require grows excessively as problem complexity increases. This paper proposes an efficient, scalable, and massively parallel evaluation model using the NVIDIA CUDA GPU programming model to speed up the fitness-calculation phase and greatly reduce computational time. Experimental results show that our model significantly reduces computational time compared to the sequential approach, reaching a speedup of up to 820×. Moreover, the model is able to scale to multiple GPU devices and can be easily extended to any evolutionary algorithm.

19.
To meet the real-time requirements of large-scale compressed-sensing reconstruction, this paper explores GPU-based acceleration methods and an implementation of the Orthogonal Matching Pursuit (OMP) algorithm. To reduce the high latency of CPU-GPU transfers, the entire iterative process of OMP is moved onto the GPU for parallel execution. On the GPU side, the CUDA program is restructured according to the access characteristics of global memory so that memory accesses satisfy the coalescing conditions, reducing access latency. In addition, based on the resource constraints of the streaming multiprocessors (SMs), the shared-memory allocation per SM is increased, and the thread access pattern is redesigned to reduce bank conflicts and improve memory throughput. Tests on an NVIDIA Tesla K20Xm GPU and an Intel(R) E5-2650 CPU show that the time-consuming projection and weight-update modules achieve speedups of 32× and 46×, respectively, and the algorithm as a whole achieves a 34× speedup.
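The coalescing condition mentioned above can be made concrete with the projection step c = Phi^T * r (illustrative names): storing Phi row-major lets the threads of a warp, one per column, read consecutive addresses at every row.

```cuda
// Projection step of OMP: c = Phi^T * r. Phi is stored row-major so that,
// at each row i, the threads of a warp (one thread per column j) read
// consecutive addresses Phi[i*n + j], a fully coalesced access pattern.
__global__ void project(const float *phi, const float *r, float *c,
                        int m, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    if (j >= n) return;
    float dot = 0.0f;
    for (int i = 0; i < m; ++i)
        dot += phi[i * n + j] * r[i];
    c[j] = dot;
}
```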

20.
张硕, 何发智, 周毅, 鄢小虎. 《计算机应用》, 2016, 36(12): 3274-3279.
This paper studies improvements to the parallel particle swarm optimization (PSO) algorithm on the graphics processing unit (GPU) based on the Compute Unified Device Architecture (CUDA). Given the characteristics of the CUDA hardware architecture, blocks execute serially, while the warp is the basic unit scheduled and executed by a streaming multiprocessor (SM). To fully exploit the parallelism of the threads within a block, an adaptive-warp GPU-parallel PSO algorithm is proposed: particle dimensions are mapped to threads; exploiting the GPU's warp-level parallelism, each particle is adaptively mapped to one or more warps according to its dimensionality; and one or more particles are adaptively mapped to each block. Compared with the existing coarse-grained parallel method (mapping each particle to a thread) and fine-grained parallel method (mapping each particle to a block), experimental results show that the proposed method increases the speedup over the CPU by up to 40.
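A sketch of the dimension-per-thread, warp-per-particle mapping described above, for the case dim <= 32 (illustrative names; pBest/gBest maintenance and the adaptive multi-warp case are omitted):

```cuda
#include <curand_kernel.h>

// One warp per particle (dim <= 32 here), one thread per dimension.
// Assumes blockDim.x is a multiple of 32 and rng states are initialized
// on the host with curand_init.
__global__ void psoUpdate(float *x, float *v, const float *pBest,
                          const float *gBest, curandState *rng,
                          int dim, float wI, float c1, float c2) {
    int particle = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    int d = threadIdx.x & 31;             // lane index = dimension
    if (d >= dim) return;
    int i = particle * dim + d;
    curandState s = rng[i];
    float r1 = curand_uniform(&s), r2 = curand_uniform(&s);
    v[i] = wI * v[i] + c1 * r1 * (pBest[i] - x[i])
                     + c2 * r2 * (gBest[d] - x[i]);
    x[i] += v[i];
    rng[i] = s;
}
```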
