Similar Documents
20 similar documents found (search time: 93 ms)
1.
Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and concerns about low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for the GPU and propose a queue-based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree-based parallel method for multi-GPU systems. These approaches increase GPU utilization, resulting in a substantial reduction in computation time. Comparisons are made with existing parallel solvers, using problems arising from practical applications. The experimental results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12-core node, we obtained speedups in the range 1.59× to 2.31× using one GPU and 1.80× to 3.21× using two GPUs. Relative to a state-of-the-art solver based on the supernodal method for a CPU-GPU heterogeneous platform, we obtained speedups in the range 1.52× to 2.30× using one GPU and 2.15× to 2.76× using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd.
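The abstract gives no code; as a rough host-side sketch of the queue-based idea, the snippet below drains a queue of ready supernode tasks and maps each onto a dense Cholesky (POTRF) call through cuSOLVER. The SupernodeTask layout and the single-queue discipline are illustrative assumptions, not the paper's actual design.

```cuda
// Hedged sketch: queue-based scheduling of supernodal Cholesky tasks on a GPU.
// Assumes each supernode's dense diagonal block is already packed in GPU memory.
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <queue>

struct SupernodeTask {          // hypothetical task descriptor
    double *block;              // device pointer to the packed dense block
    int     n;                  // order of the diagonal block
    int     lda;                // leading dimension
};

void factorizeQueue(std::queue<SupernodeTask> &ready, cusolverDnHandle_t solver) {
    int lwork = 0, *devInfo = nullptr;
    cudaMalloc(&devInfo, sizeof(int));
    while (!ready.empty()) {
        SupernodeTask t = ready.front(); ready.pop();
        // Dense Cholesky of the supernode's diagonal block (LAPACK POTRF).
        cusolverDnDpotrf_bufferSize(solver, CUBLAS_FILL_MODE_LOWER,
                                    t.n, t.block, t.lda, &lwork);
        double *work = nullptr;
        cudaMalloc(&work, sizeof(double) * lwork);
        cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER,
                         t.n, t.block, t.lda, work, lwork, devInfo);
        cudaFree(work);
        // A real scheduler would now enqueue the TRSM/GEMM updates toward the
        // supernode's ancestors in the elimination tree, where most work lies.
    }
    cudaFree(devInfo);
}
```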

2.
Empty-space skipping is an essential acceleration technique for volume rendering. Image-order empty-space skipping is not well suited to GPU implementation, since it must perform checks on essentially a per-sample basis, as in kd-tree traversal; this leads to a great deal of divergent branching at runtime, which is very expensive in a modern GPU pipeline. In contrast, object-order empty-space skipping is extremely fast on a GPU and has negligible overhead compared with approaches without empty-space skipping, since it employs the hardware rasterisation unit. However, previous object-order algorithms have been able to skip only exterior empty space, not the interior empty space that lies inside or between volume objects. In this paper, we address these issues by proposing a multi-layer depth-peeling approach that obtains all of the depth layers of the tight-fitting bounding geometry of the isosurface in a single rasterising pass. Our approach can peel up to thousands of layers while maintaining 32-bit floating-point accuracy, which was not previously possible. By raytracing only the valid ray segments between each consecutive pair of depth layers, we can skip both interior and exterior empty space efficiently. In comparisons with three state-of-the-art GPU isosurface rendering algorithms, this technique achieved much faster rendering across a variety of data sets.
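As a rough illustration (not the paper's code) of "raytracing only the valid ray segments", here is a hedged CUDA sketch in which each pixel's peeled depth layers are consumed pairwise as entry/exit points; the per-pixel layer packing and the placeholder sampling are assumptions.

```cuda
// Hedged sketch: ray marching restricted to the valid segments between
// consecutive peeled depth layers. Depths are assumed packed per pixel as
// [enter0, exit0, enter1, exit1, ...]; "accum" is a placeholder output.
__global__ void raymarchSegments(const float *layerDepth, // peeled depths
                                 const int   *layerCount, // layers per pixel
                                 int maxLayers, int width, int height,
                                 float step, float *accum) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int pix = y * width + x;
    int pairs = layerCount[pix] / 2;       // entry/exit pairs
    float result = 0.0f;
    for (int i = 0; i < pairs; ++i) {
        float tEnter = layerDepth[pix * maxLayers + 2 * i];
        float tExit  = layerDepth[pix * maxLayers + 2 * i + 1];
        // Sample only inside occupied space; the gaps between pairs — the
        // interior and exterior empty space — are never touched.
        for (float t = tEnter; t < tExit; t += step)
            result += 1.0f;                // placeholder for real sampling/shading
    }
    accum[pix] = result;
}
```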

3.
Task parallelism is an attractive approach to automatically load balancing the computation in a parallel system and adapting to the dynamism such systems exhibit. Exploiting task parallelism through work stealing has been extensively studied in shared- and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address-space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work-stealing algorithm for CPU-GPU systems, taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations, as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain. Copyright © 2016 John Wiley & Sons, Ltd.
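To make the work-stealing design concrete, below is a minimal host-side sketch under stated assumptions: each worker owns a double-ended queue, owners pop from the front, thieves steal from the back, and task size steers CPU versus GPU placement. Task, runOnCPU, runOnGPU, and SIZE_CUTOFF are hypothetical names, not from the paper.

```cuda
// Hedged sketch of distributed work stealing for a hybrid CPU-GPU node.
#include <deque>
#include <mutex>
#include <vector>

struct Task { size_t size = 0; };                 // problem size drives placement

static void runOnCPU(const Task &) {}             // placeholder executors
static void runOnGPU(const Task &) {}

constexpr size_t SIZE_CUTOFF = 1 << 20;           // assumed CPU/GPU crossover point

struct Worker {
    std::deque<Task> q;
    std::mutex m;
    bool pop(Task &t) {                           // owner takes from the front
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        t = q.front(); q.pop_front(); return true;
    }
    bool steal(Task &t) {                         // thieves take from the back
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        t = q.back(); q.pop_back(); return true;
    }
};

void workerLoop(Worker &me, std::vector<Worker *> &victims, bool hasGPU) {
    Task t;
    for (;;) {
        bool got = me.pop(t);
        for (size_t i = 0; !got && i < victims.size(); ++i)
            got = (victims[i] != &me) && victims[i]->steal(t);
        if (!got) break;                          // no work left anywhere
        // Large tasks amortize the CPU-GPU transfer cost; small ones stay local.
        if (hasGPU && t.size >= SIZE_CUTOFF) runOnGPU(t); else runOnCPU(t);
    }
}
```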

4.
Membrane systems are parallel distributed computing models that are used in a wide variety of areas. Simulating membrane systems on a sequential machine forfeits the parallelism that is central to Membrane Computing. In this paper, an innovative classification algorithm based on a weighted network is introduced, and two new algorithms are proposed for simulating membrane system models on a Graphics Processing Unit (GPU). Communication and synchronization between threads and thread blocks in a GPU are time-consuming operations. Previous studies assigned dependent objects to different threads, which increases the need for inter-thread communication and therefore reduces performance; they likewise assigned dependent membranes to different thread blocks, requiring inter-block communication and reducing performance further. The speedup of the proposed algorithm on a GPU that classifies dependent objects using a sequential approach, for example with 512 objects per membrane, was 82×, while for the previous approach (Algorithm 1) it was 8.2×. For a membrane system with high dependency among membranes, the speedup of the second proposed algorithm (Algorithm 3) was 12×, while for the previous approach (Algorithm 1) and the first proposed algorithm (Algorithm 2), which assign each membrane to one thread block, it was 1.8×.

5.
This paper proposes a multi-cue-based face-tracking algorithm with a supporting framework that uses parallel multi-core processing and one Graphics Processing Unit (GPU). Because of illumination and partial-occlusion problems, face tracking usually cannot work stably on a single cue. Addressing these problems, we first combine three different visual cues (color histogram, edge orientation histogram, and wavelet feature) under the framework of particle filters to considerably improve tracking performance. Furthermore, an online updating strategy makes the algorithm adaptive to illumination changes and slight face rotations. However, computational efficiency decreases considerably as the number of particles and visual cues grows. To handle the large computation cost introduced by the multi-cue strategy, we explored two parallel computing techniques to speed up the tracking process, especially the most computation-intensive observation steps, and thereby achieve real-time responses. One is a multi-core-based parallel algorithm with a MapReduce thread model; the other is a GPU-based speedup approach that moves feature matching and particle weight computation into GPU kernels. The results demonstrate that the proposed face-tracking algorithm works robustly with cluttered backgrounds and differing illuminations; the multi-core parallel scheme increases speed by 2–6 times compared with the corresponding sequential algorithms. Furthermore, the GPU parallel scheme and co-processing scheme achieve an even greater speed increase (8×–12×) compared with the corresponding sequential algorithms.
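The observation step the abstract calls most computation-intensive maps naturally to one thread per particle. The kernel below is a hedged sketch of that mapping; the precomputed per-cue distances and the fusion weights (wColor, wEdge, wWavelet) stand in for the paper's cues and are assumptions.

```cuda
// Hedged sketch: per-particle observation weighting for a multi-cue
// particle filter. One thread scores one particle.
__global__ void weighParticles(const float *dColor,   // color-histogram distance
                               const float *dEdge,    // edge-orientation distance
                               const float *dWavelet, // wavelet-feature distance
                               float *weight, int nParticles,
                               float wColor, float wEdge, float wWavelet,
                               float sigma) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;
    // Fuse the three cue distances, then map the result to a likelihood.
    float d = wColor * dColor[i] + wEdge * dEdge[i] + wWavelet * dWavelet[i];
    weight[i] = expf(-d * d / (2.0f * sigma * sigma));
}
```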

6.
In order to deal with the common trend of size increase in volumetric datasets, research in isosurface extraction has in the past few years focused on related aspects such as surface simplification and load-balanced parallel algorithms. We present a parallel, block-wise extension of the tandem algorithm [Attali D, Cohen-Steiner D, Edelsbrunner H. Extraction and simplification of iso-surfaces in tandem. In: SGP '05: Proceedings of the third Eurographics symposium on Geometry processing. Aire-la-Ville, Switzerland: Eurographics Association; 2005. p. 139-148], which simplifies an isosurface on the fly as it is extracted. Our approach minimizes overall memory consumption using an appropriate block splitting and merging strategy, along with a component dumping mechanism that drastically reduces the amount of memory needed for particular datasets such as those encountered in geophysics. As soon as they are detected, surface components are migrated to disk along with a meta-data index (oriented bounding box, volume, etc.) that enables further improved exploration scenarios (for instance, small-component removal or selection of components with a particular orientation). For ease of implementation, we carefully describe a master-and-worker algorithm architecture that clearly separates the four required basic tasks. We show several results of our parallel algorithm applied to a geophysical dataset of size 7000×1600×2000.

7.
In this paper, we present a novel volumetric mesh representation suited for parallel computing on modern GPU architectures. The data structure is based on a compact, ternary sparse matrix storage of boundary operators. Boundary operators correspond to the first-order top-down relations of k-faces to their (k − 1)-face facets. The compact, ternary matrix storage format is based on compressed sparse row matrices with signed indices and allows for efficient parallel computation of indirect and bottom-up relations. This representation is then used in the implementation of several parallel volumetric mesh algorithms, including Laplacian smoothing and volumetric Catmull-Clark subdivision. We compare these algorithms with their counterparts based on OpenVolumeMesh and achieve speedups from 3× to 531× for sufficiently large meshes, while reducing memory consumption by up to 36%.
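A minimal sketch of a compressed sparse row boundary operator with signed indices follows; the exact encoding here (one-based indices whose sign carries facet orientation) is an assumption for illustration, not taken from the paper.

```cuda
// Hedged sketch: a CSR-stored boundary operator where the sign of each
// column index encodes the orientation of a (k-1)-face facet.
#include <cstdlib>
#include <utility>
#include <vector>

struct BoundaryCSR {
    std::vector<int> rowPtr;   // rowPtr[f]..rowPtr[f+1]: facets of k-face f
    std::vector<int> facet;    // signed, one-based facet indices
};

// Iterate over the (k-1)-face facets of face `f`, recovering index and sign.
inline void facetsOf(const BoundaryCSR &B, int f,
                     std::vector<std::pair<int, int>> &out) {
    out.clear();
    for (int i = B.rowPtr[f]; i < B.rowPtr[f + 1]; ++i) {
        int s = B.facet[i];
        out.push_back({std::abs(s) - 1, s > 0 ? +1 : -1});
    }
}
```

One-based storage is what makes the sign usable: index 0 has no negative, so every entry can carry an orientation bit without a separate array.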

8.
Interactive isosurface extraction has recently become possible through successful efforts to map algorithms such as Marching Cubes (MC) and Marching Tetrahedra (MT) to modern Graphics Processing Unit (GPU) architectures. Other isosurfacing algorithms, however, are not so easily portable to GPUs, either because they involve more complex operations or because they are not based on discrete case tables, as most marching techniques are. In this paper, we revisit the Dual Contouring (DC) and Macet isosurface extraction algorithms and propose, respectively: (i) a novel, efficient and parallelizable version of Dual Contouring and (ii) a set of GPU modules which extend the original Marching Cubes algorithm. Similar to marching methods, our novel technique is based on a case table, which allows for a very efficient GPU implementation. In addition, we enumerate and evaluate several alternatives for implementing efficient contouring algorithms on the GPU, and present the trade-offs among all approaches. Finally, we validate the efficiency and quality of the tessellations produced by all these alternatives.
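The case-table property the abstract highlights boils down to a branch-light per-cell index computation. Below is a hedged sketch of that step for Marching Cubes; the grid layout and one-thread-per-cell mapping are illustrative assumptions.

```cuda
// Hedged sketch: computing the Marching Cubes case index for each cell,
// the table-lookup step that makes case-table methods GPU-friendly.
__device__ inline float sampleAt(const float *f, int nx, int ny,
                                 int i, int j, int k) {
    return f[(k * ny + j) * nx + i];               // x-fastest linear layout
}

__global__ void mcCaseIndex(const float *field, int nx, int ny, int nz,
                            float iso, unsigned char *caseIdx) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x >= nx - 1 || y >= ny - 1 || z >= nz - 1) return;
    unsigned char idx = 0;                         // one bit per corner below iso
    if (sampleAt(field, nx, ny, x,     y,     z    ) < iso) idx |= 1;
    if (sampleAt(field, nx, ny, x + 1, y,     z    ) < iso) idx |= 2;
    if (sampleAt(field, nx, ny, x + 1, y + 1, z    ) < iso) idx |= 4;
    if (sampleAt(field, nx, ny, x,     y + 1, z    ) < iso) idx |= 8;
    if (sampleAt(field, nx, ny, x,     y,     z + 1) < iso) idx |= 16;
    if (sampleAt(field, nx, ny, x + 1, y,     z + 1) < iso) idx |= 32;
    if (sampleAt(field, nx, ny, x + 1, y + 1, z + 1) < iso) idx |= 64;
    if (sampleAt(field, nx, ny, x,     y + 1, z + 1) < iso) idx |= 128;
    caseIdx[(z * (ny - 1) + y) * (nx - 1) + x] = idx;  // feeds the 256-entry table
}
```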

9.
郭勇, 顾力栩. Computer Engineering (计算机工程), 2007, 33(20): 222-224
This paper proposes a novel isosurface extraction algorithm based on the ray-isosurface intersection method, which extracts the isosurface by simulating the working process of a 3D scanner. The algorithm uses the parallel computing power of the GPU to carry out the main computation-intensive steps; the results are stored as points and rendered with the iso-splatting technique. By controlling the size of the buffers used by the GPU and, through the simulated 3D-scanner principle, avoiding extraction of invisible parts of the isosurface, the algorithm achieves high-quality rendering and a high-FPS interactive environment. Comparisons of rendering quality and FPS against traditional algorithms on several datasets demonstrate the algorithm's advantages.

10.
This paper presents a fast, high-quality, GPU-based isosurface rendering pipeline for implicit surfaces defined by a regular volumetric grid. GPUs are designed primarily for use with polygonal primitives rather than volume primitives, but here we treat each volume cell directly as a single rendering primitive by designing a vertex program and fragment program on a commodity GPU. Compared with previous raycasting methods, ours has a more cache-friendly memory footprint and better coherence between multiple parallel SIMD processors. Furthermore, we extend and speed up our approach by introducing a new view-dependent sorting algorithm that takes advantage of the GPU's early-z-culling feature to gain a significant performance speedup. As an additional benefit, this sorting algorithm makes rendering of multiple transparent isosurfaces available almost for free. Finally, we demonstrate the effectiveness and quality of our techniques in several real-time rendering scenarios and include analysis and comparisons with previous work.

11.
We present an efficient Graphics Processing Unit (GPU)-based implementation of the Projected Tetrahedra (PT) algorithm. By reducing most of the CPU-GPU data transfer, the algorithm achieves interactive frame rates (up to 2.0 M Tets/s) on current graphics hardware. Since no topology information is stored, it requires substantially less memory than recent interactive ray casting approaches. The method uses a two-pass GPU approach with two fragment shaders. This work also includes extended volume inspection capabilities, supporting interactive transfer function editing and isosurface highlighting using a Phong illumination model.

12.
Parallel generation of architecture on the GPU
In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are limited in the kinds of shapes they allow, in the amount of parallelism they can exploit, or both, our method supports state-of-the-art procedural modeling, including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce the inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than standard CPU-based rule evaluation, while offering equal expressive power. In comparison with the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity.

13.
Sorting is a very important task in computer science and becomes a critical operation for programs that make heavy use of sorting algorithms. General-purpose computing on Graphics Processing Units (GPUs) has been successfully used to parallelize some sorting algorithms. Two GPU-based implementations of quicksort have been presented in the literature: GPU-quicksort, a compute unified device architecture (CUDA) iterative implementation, and the CUDA dynamic parallel (CDP) quicksort, a recursive implementation provided by NVIDIA Corporation. We propose CUDA-quicksort, an iterative GPU-based implementation of the sorting algorithm. CUDA-quicksort was designed starting from GPU-quicksort. Unlike GPU-quicksort, it uses atomic primitives to perform inter-block communications while ensuring optimized access to the GPU memory. Experiments performed on six sorting benchmark distributions show that CUDA-quicksort is up to four times faster than GPU-quicksort and up to three times faster than CDP-quicksort. An in-depth performance analysis of CUDA-quicksort and GPU-quicksort shows that the main improvement is related to the optimized GPU memory access rather than to the use of atomic primitives. Moreover, in order to assess the advantages of CUDA dynamic parallelism, we implemented a recursive version of CUDA-quicksort. Experimental results show that CUDA-quicksort is faster than the CDP-quicksort provided by NVIDIA, with the better performance achieved by the iterative implementation. Copyright © 2015 John Wiley & Sons, Ltd.
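The abstract does not show the partitioning scheme, so here is a hedged sketch of the kind of atomic-counter step an iterative GPU quicksort performs: each thread claims an output slot for its element with atomicAdd/atomicSub. The cursor layout is an assumption, not the paper's optimized memory scheme.

```cuda
// Hedged sketch: one quicksort partition pass with inter-block coordination
// via atomic primitives. Host initializes *leftCursor = 0, *rightCursor = n.
__global__ void partitionStep(const int *in, int *out, int n, int pivot,
                              int *leftCursor, int *rightCursor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int v = in[i];
    if (v < pivot) {
        int dst = atomicAdd(leftCursor, 1);        // claim a slot from the left
        out[dst] = v;
    } else {
        int dst = atomicSub(rightCursor, 1) - 1;   // claim a slot from the right
        out[dst] = v;
    }
}
```

After the pass, [0, *leftCursor) holds the below-pivot elements and [*leftCursor, n) the rest, and the host (or a next kernel) recurses on both halves.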

14.
Support Vector Machine (SVM) regression is an important technique in data mining. SVM training is expensive, and its cost is dominated by: (i) the kernel value computation, and (ii) a search operation which finds extreme training data points for adjusting the regression function in every training iteration. Existing training algorithms for SVM regression are not scalable to large datasets because: (i) each training iteration repeatedly performs expensive kernel value computations, which is inefficient and requires holding the whole training dataset in memory; (ii) the search operation used in each training iteration considers the whole search space, which is very expensive. In this article, we significantly improve the scalability and efficiency of SVM regression by exploiting the high performance of Graphics Processing Units (GPUs) and solid state drives (SSDs). Our key ideas are as follows. (i) To reduce the cost of repeated kernel value computations and avoid holding the whole training dataset in GPU memory, we precompute all the kernel values and store them in CPU memory extended by the SSD; together with an efficient strategy for reading the precomputed kernel values, reusing them with efficient retrieval is much faster than computing them on-the-fly. This also lifts the restriction that the training dataset must fit into GPU memory, and hence makes our algorithm scalable to large datasets, especially those with very high dimensionality. (ii) To enhance the performance of the frequently used search operation, we design an algorithm that minimizes the search space and the number of accesses to GPU global memory; this optimized search algorithm also avoids branch divergence (one cause of poor performance) among GPU threads to achieve high utilization of GPU resources. Our proposed techniques together form a scalable solution to SVM regression, which we call SIGMA. Our extensive experimental results show that SIGMA is highly efficient and can handle very large datasets which the state-of-the-art GPU-based algorithm cannot. On datasets of a size that the state-of-the-art GPU-based algorithm can handle, SIGMA consistently outperforms it by an order of magnitude and achieves up to an 86× speedup.
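The precomputation idea in (i) amounts to filling the kernel matrix row by row on the GPU and shipping rows out to the CPU/SSD store. A hedged sketch for an RBF kernel follows; gamma, the row-major layout, and the row-at-a-time granularity are illustrative assumptions.

```cuda
// Hedged sketch: precompute one row of an RBF kernel matrix on the GPU.
// K[j] = exp(-gamma * ||x_row - x_j||^2); the row is later copied to the
// CPU-memory/SSD-backed store described in the abstract.
__global__ void rbfKernelRow(const float *X, int n, int dim, int row,
                             float gamma, float *K) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float d2 = 0.0f;
    for (int k = 0; k < dim; ++k) {
        float diff = X[row * dim + k] - X[j * dim + k];
        d2 += diff * diff;
    }
    K[j] = expf(-gamma * d2);
}
```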

15.
With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well-known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S-W) algorithm, as an example, and demonstrate approaches that fully exploit its performance potential on CPU, GPU, and field-programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data (SIMD) and multithreading, with compiler options, to gain over 70× speedups over naive CPU versions on quad-core CPU platforms. For GPU platforms, we propose a combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50× speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12× speedups, rather than the 100× reported by some studies, over the Intel quad-core CPU Q9400 under the same manufacturing technology, with both fully optimized. In addition, for FPGA platforms, we customize a linear systolic array for the S-W algorithm in a 45-nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. At only a 133 MHz clock rate, the FPGA platform reaches the highest performance and is the most power-efficient platform, using only 25 W compared with 190 W for the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.
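For readers unfamiliar with S-W, the core is the local-alignment recurrence H(i,j) = max(0, H(i-1,j-1)+s(a_i,b_j), H(i-1,j)-gap, H(i,j-1)-gap), whose anti-diagonals are independent and hence parallel. The kernel below is a bare sketch of one diagonal step with a linear gap penalty; the paper's tiling, coalescing, and unrolling tricks are omitted, and the layout is an assumption.

```cuda
// Hedged sketch: one anti-diagonal step of Smith-Waterman. H is a
// (la+1) x (lb+1) row-major matrix, zero-initialized on the host, which
// launches d = 2 .. la+lb with a sync between diagonals.
__global__ void swDiagonal(const char *a, const char *b, int la, int lb,
                           int d, int gap, int match, int mismatch, int *H) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // row index, 1-based
    int j = d - i;                                      // cells on diagonal: i + j = d
    if (i > la || j < 1 || j > lb) return;
    int w = lb + 1;                                     // row stride of H
    int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
    int h = H[(i - 1) * w + (j - 1)] + s;               // diagonal move
    h = max(h, H[(i - 1) * w + j] - gap);               // gap in b
    h = max(h, H[i * w + (j - 1)] - gap);               // gap in a
    H[i * w + j] = max(h, 0);                           // local-alignment floor
}
```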

16.
This paper proposes a fast surface-rendering method for medical images that combines the features of the new-generation GeForce 8800 graphics processor with the Marching Cubes (MC) algorithm. The batch-processing capability of the geometry shader is used to extract the isosurface in each cube and generate triangle strips; the resulting triangles are then rendered in the fragment shader with the Phong illumination model. Both modeling and display are performed entirely on the GPU, with little dependence on the CPU. Experiments show that, while preserving rendering quality, the method achieves real-time modeling of 512×512×400 CT data on an ordinary PC platform, which makes it highly practical.

17.
Zippy: A Framework for Computation and Visualization on a GPU Cluster
Due to its high performance/cost ratio, a GPU cluster is an attractive platform for large-scale general-purpose computation and visualization applications. However, the programming model for high-performance general-purpose computation on GPU clusters remains a complex problem. In this paper, we introduce the Zippy framework, a general and scalable solution to this problem. It abstracts GPU cluster programming with a two-level parallelism hierarchy and a non-uniform memory access (NUMA) model. Zippy preserves the advantages of both the message-passing and shared-memory models. It employs global arrays (GA) to simplify communication, synchronization, and collaboration among multiple GPUs. Moreover, it exposes data locality to the programmer for optimal performance and scalability. We present three example applications developed with Zippy: sort-last volume rendering, Marching Cubes isosurface extraction and rendering, and lattice Boltzmann flow simulation with online visualization. They demonstrate that Zippy can ease the development and integration of parallel visualization, graphics, and computation modules on a GPU cluster.

18.
The GPU (Graphics Processing Unit) has recently become one of the most power-efficient processors in embedded and many other environments, and has been integrated into more and more SoCs (Systems on Chip); thus, modern GPUs play a very important role in power-aware computing. Strongly Connected Component (SCC) decomposition is a fundamental graph algorithm with wide applications in model checking, electronic design automation, social network analysis, and other fields. GPUs have been shown to have great potential for accelerating many types of computations, including graph algorithms. Recent work has demonstrated the plausibility of GPU SCC decomposition, but the implementation is inefficient due to insufficient consideration of the distinctive GPU programming model, which leads to poor performance on irregular and sparse graphs. This paper presents a new GPU SCC decomposition algorithm that focuses on full utilization of contemporary embedded and desktop GPU architectures. In particular, a subgraph numbering scheme is proposed to facilitate safe and efficient management of subgraph IDs and to serve as the basis of efficient source selection. Furthermore, we adopt a multi-source partition procedure that greatly reduces the recursion depth, and use a vertex labeling approach that highly optimizes GPU memory access. The evaluation results show that the proposed approach achieves up to 41× speedup over Tarjan's algorithm, one of the most efficient sequential SCC decomposition algorithms, and up to 3.8× speedup over previous GPU algorithms.
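GPU SCC decomposition in this line of work builds on forward-backward reachability: the SCC containing a pivot is the intersection of the vertex sets reachable from it in the graph and in its transpose. A hedged sketch of the level-synchronous reachability step follows; the CSR layout and integer frontier flags are illustrative assumptions, not the paper's labeling scheme.

```cuda
// Hedged sketch: one level-synchronous reachability step, the building
// block of forward-backward SCC decomposition. The host clears
// nextFrontier and *changed before each step and loops until no change.
__global__ void bfsStep(const int *rowPtr, const int *colIdx, int nVerts,
                        const int *frontier, int *nextFrontier,
                        int *visited, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVerts || !frontier[v]) return;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int u = colIdx[e];
        if (atomicExch(&visited[u], 1) == 0) {  // first visit wins
            nextFrontier[u] = 1;
            *changed = 1;
        }
    }
}
```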

19.
In this paper, we propose a method for accelerating brain extraction from cerebral MRI volumes using the compute unified device architecture (CUDA) on multi-core graphics processing units (GPUs). The algorithm is based on the well-known brain extraction method, the Brain Extraction Tool (BET). To significantly reduce the computational time for real-time processing, the algorithm runs in parallel: one GPU thread is assigned to calculate the new position of one vertex on the brain surface, and all the vertices of the brain surface on one slice are processed in the same thread block, so the positions of all vertices on the brain's surface can be updated at the same time. Experiments showed the computational time of this parallel method was less than one second, far less than that of the standard BET. A slice-by-slice scheme was also used to improve the accuracy of our algorithm, and both the results and the running time are satisfactory.
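The one-thread-per-vertex mapping the abstract describes is simple to sketch; the actual BET update forces (surface smoothing plus intensity-driven terms from the MRI volume) are summarized here by a precomputed placeholder displacement, which is an assumption.

```cuda
// Hedged sketch: one GPU thread updates one brain-surface vertex.
// disp stands in for the BET force terms, which in the real algorithm
// combine smoothing and local MRI intensity information.
__global__ void updateVertices(const float3 *pos, const float3 *disp,
                               float3 *newPos, int nVerts) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVerts) return;
    newPos[v].x = pos[v].x + disp[v].x;
    newPos[v].y = pos[v].y + disp[v].y;
    newPos[v].z = pos[v].z + disp[v].z;
}
```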

20.
As neuroimaging algorithms and technology continue to grow in complexity and image resolution faster than CPU performance, data-parallel computing methods will become increasingly important. The high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms on the CUDA architecture. For compute-bound algorithms, register pressure is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and a thread configuration that maximizes occupancy with a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved by reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache behavior is optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared with CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm.
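The "variable reuse via shared memory" strategy is easiest to see in a toy stencil: each value is staged once in shared memory and then read by several threads instead of being refetched from global memory. A minimal hedged sketch, with an assumed fixed block size, follows.

```cuda
// Hedged sketch: shared-memory staging and reuse for a 1D three-point
// moving average. Launch with blockDim.x == BLOCK.
#define BLOCK 256

__global__ void movingAvg(const float *in, float *out, int n) {
    __shared__ float s[BLOCK + 2];                  // tile plus one halo cell per side
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + 1;
    s[l] = (g < n) ? in[g] : 0.0f;                  // each thread stages one value
    if (threadIdx.x == 0)
        s[0] = (g > 0) ? in[g - 1] : 0.0f;          // left halo
    if (threadIdx.x == blockDim.x - 1)
        s[BLOCK + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;  // right halo
    __syncthreads();
    if (g < n)                                      // three reads now hit shared memory
        out[g] = (s[l - 1] + s[l] + s[l + 1]) / 3.0f;
}
```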
