首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We present a GPU‐based streaming algorithm to perform high‐resolution and accurate cloth simulation. We map all the components of cloth simulation pipeline, including time integration, collision detection, collision response, and velocity updating to GPU‐based kernels and data structures. Our algorithm perform intra‐object and inter‐object collisions, handles contacts and friction, and is able to accurately simulate folds and wrinkles. We describe the streaming pipeline and address many issues in terms of obtaining high throughput on many‐core GPUs. In practice, our algorithm can perform high‐fidelity simulation on a cloth mesh with 2M triangles using 3GB of GPU memory. We highlight the parallel performance of our algorithm on three different generations of GPUs. On a high‐end NVIDIA Tesla K20c, we observe up to two orders of magnitude performance improvement as compared to a single‐threaded CPU‐based algorithm, and about one order of magnitude improvement over a 16‐core CPU‐based parallel implementation.  相似文献   

2.
We present novel parallel algorithms for collision detection and separation distance computation for rigid and deformable models that exploit the computational capabilities of many‐core GPUs. Our approach uses thread and data parallelism to perform fast hierarchy construction, updating, and traversal using tight‐fitting bounding volumes such as oriented bounding boxes (OBB) and rectangular swept spheres (RSS). We also describe efficient algorithms to compute a linear bounding volume hierarchy (LBVH) and update them using refitting methods. Moreover, we show that tight‐fitting bounding volume hierarchies offer improved performance on GPU‐like throughput architectures. We use our algorithms to perform discrete and continuous collision detection including self‐collisions, as well as separation distance computation between non‐overlapping models. In practice, our approach (gProximity) can perform these queries in a few milliseconds on a PC with NVIDIA GTX 285 card on models composed of tens or hundreds of thousands of triangles used in cloth simulation, surgical simulation, virtual prototyping and N‐body simulation. Moreover, we observe more than an order of magnitude performance improvement over prior GPU‐based algorithms.  相似文献   

3.
4.
A software framework taking advantage of parallel processing capabilities of CPUs and GPUs is designed for the real‐time interactive cutting simulation of deformable objects. Deformable objects are modelled as voxels connected by links. The voxels are embedded in an octree mesh used for deformation. Cutting is performed by disconnecting links swept by the cutting tool and then adaptively refining octree elements near the cutting tool trajectory. A surface mesh used for visual display is reconstructed from disconnected links using the dual contour method. Spatial hashing of the octree mesh and topology‐aware interpolation of distance field are used for collision. Our framework uses a novel GPU implementation for inter‐object collision and object self collision, while tool‐object collision, cutting and deformation are assigned to CPU, using multiple threads whenever possible. A novel method that splits cutting operations into four independent tasks running in parallel is designed. Our framework also performs data transfers between CPU and GPU simultaneously with other tasks to reduce their impact on performances. Simulation tests show that when compared to three‐threaded CPU implementations, our GPU accelerated collision is 53–160% faster; and the overall simulation frame rate is 47–98% faster.  相似文献   

5.
Graphical processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them more accessible to scientific programmers. We discuss the application of GPU programming to two significantly different paradigms—regular mesh field equations with unusual boundary conditions and graph analysis algorithms. The differing optimization techniques required for these two paradigms cover many of the challenges faced when developing GPU applications. We discuss the relevance of these application paradigms to simulation engines and games. GPUs were aimed primarily at the accelerated graphics market but since this is often closely coupled to advanced game products it is interesting to speculate about the future of fully integrated accelerator hardware for both visualization and simulation combined. As well as reporting the speed‐up performance on selected simulation paradigms, we discuss suitable data‐parallel algorithms and present code examples for exploiting GPU features like large numbers of threads and localized texture memory. We find a surprising variation in the performance that can be achieved on GPUs for our applications and discuss how these findings relate to past known effects in parallel computing such as memory speed‐related super‐linear speed up. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

6.
Collective behaviour of winged insects is a wondrous and familiar phenomenon in the real world. In this paper, we introduce a highly efficient field‐based approach to simulate various insect swarms. Its core idea is to construct a smooth yet noise‐aware governing velocity field that can be further decomposed into two sub‐fields: (i) a divergence‐free curl‐noise field to model noise‐induced movements of individual insects in a swarm, and (ii) an enhanced global velocity field to control navigational paths in a complex environment along which all the insects in a swarm fly. Through simulation experiments and comparisons with existing crowd simulation approaches, we demonstrate that our approach is effective to simulate various insect swarm behaviours including aggregation, positive phototaxis, sedation, mass‐migrating, and so on. Besides its high efficiency, our approach is very friendly to parallel implementation on GPUs (e.g. the speedup achieved through GPU acceleration is higher than 50 if the number of simulated insects is more than 10 000 on an off‐the‐shelf computer). Our approach is the first multi‐agent modelling system that introduces curl‐noise into agents' velocity field and uses its non‐scattering nature to maintain non‐colliding movements in 3D crowd simulation.  相似文献   

7.
We present a novel, hybrid parallel continuous collision detection (HPCCD) method that exploits the availability of multi‐core CPU and GPU architectures. HPCCD is based on a bounding volume hierarchy (BVH) and selectively performs lazy reconstructions. Our method works with a wide variety of deforming models and supports self‐collision detection. HPCCD takes advantage of hybrid multi‐core architectures – using the general‐purpose CPUs to perform the BVH traversal and culling while GPUs are used to perform elementary tests that reduce to solving cubic equations. We propose a novel task decomposition method that leads to a lock‐free parallel algorithm in the main loop of our BVH‐based collision detection to create a highly scalable algorithm. By exploiting the availability of hybrid, multi‐core CPU and GPU architectures, our proposed method achieves more than an order of magnitude improvement in performance using four CPU‐cores and two GPUs, compared to using a single CPU‐core. This improvement results in an interactive performance, up to 148 fps, for various deforming benchmarks consisting of tens or hundreds of thousand triangles.  相似文献   

8.
http://gamma.cs.unc.edu/BSC/ We present a realtime and reliable continuous collision detection (CCD) algorithm between triangulated models that exploits the floating point hardware capability of current CPUs and GPUs. Our formulation is based on Bernstein Sign Classification that takes advantage of the geometry properties of Bernstein basis and Bézier curves to perform Boolean collision queries. We derive tight numerical error bounds on the computations and employ those bounds to design an accurate algorithm using finite‐precision arithmetic. Compared with prior floatingpoint CCD algorithms, our approach eliminates all the false negatives and 90–95% of the false positives. We integrated our algorithm (TightCCD) with physically‐based simulation system and observe speedups in collision queries of 5–15X compared with prior reliable CCD algorithms. Furthermore, we demonstrate its benefits in terms of improving the performance or robustness of cloth simulation systems.  相似文献   

9.
We present a new hybrid CPU/GPU collision detection technique for rigid and deformable objects based on spatial subdivision. Our approach efficiently exploits the massive computational capabilities of modern CPUs and GPUs commonly found in off‐the‐shelf computer systems. The algorithm is specifically tailored to be highly scalable on both the CPU and the GPU sides. We can compute discrete and continuous external and self‐collisions of non‐penetrating rigid and deformable objects consisting of many tens of thousands of triangles in a few milliseconds on a modern PC. Our approach is orders of magnitude faster than earlier CPU‐based approaches and up to twice as fast as the most recent GPU‐based techniques.  相似文献   

10.
Most popular methods in cloth rendering rely on volumetric data in order to model complex optical phenomena such as sub‐surface scattering. These approaches are able to produce very realistic illumination results, but their volumetric representations are costly to compute and render, forfeiting any interactive feedback. In this paper, we introduce a method based on the Graphics Processing Unit (GPU) for voxelization and visualization, suitable for both interactive and offline rendering. Recent features in the OpenGL model, like the ability to dynamically address arbitrary buffers and allocate bindless textures, are combined into our pipeline to interactively voxelize millions of polygons into a set of large three‐dimensional (3D) textures (>109 elements), generating a volume with sub‐voxel accuracy, which is suitable even for high‐density woven cloth such as linen.  相似文献   

11.
Ray‐based simulations have been shown to generate impressively realistic ultrasound images in interactive frame rates. Recent efforts used GPU‐based surface raytracing to simulate complex ultrasound interactions such as multiple reflections and refractions. These methods are restricted to perfectly specular reflections (i.e. following only a single reflective/refractive ray), whereas real tissue exhibits roughness of varying degree at tissue interfaces, causing partly diffuse reflections and refractions. Such surface interactions are significantly more complex and can in general not be handled by conventional deterministic raytracing approaches. However, these can be efficiently computed by Monte‐Carlo sampling techniques, where many ray paths are generated with respect to a probability distribution. In this paper, we introduce Monte‐Carlo raytracing for ultrasound simulation. This enables the realistic simulation of ultrasound‐tissue interactions such as soft shadows and fuzzy reflections. We discuss how to properly weight the contribution of each ray path in order to simulate the behaviour of a beamformed ultrasound signal. Tracing many individual rays per transducer element is easily parallelizable on modern GPUs, as opposed to previous approaches based on recursive binary raytracing. We further propose a significant performance optimization based on adaptive sampling.  相似文献   

12.
The use of Graphics Processing Units (GPUs) for high‐performance computing has gained growing momentum in recent years. Unfortunately, GPU‐programming platforms like Compute Unified Device Architecture (CUDA) are complex, user unfriendly, and increase the complexity of developing high‐performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU‐GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high‐level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high‐level abstraction for stencil programming on heterogeneous CPU‐GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks (Intel Corporation, Santa Clara, CA, USA) and NVIDIA CUDA (Nvidia Corporation, Santa Clara, CA, USA). In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU‐only and GPU‐only parallel applications, respectively. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

13.
Many high performance computing applications require computing both sparse matrix‐vector product (SMVP) and sparse matrix‐transpose vector product (SMTVP) for better overall performance. Under such a circumstance, it is critical to maintain a similarly high throughput for these two computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows computing both problems on multi‐core CPUs with nearly identical throughputs. On the other hand, a direct porting of CSB to graphics processing units (GPUs), which have been recently recognized as a powerful general purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (eCSB), to minimize the throughput gap between SMVP and SMTVP computations on GPUs, while at the same time enable a high computing throughput. We also use a hybrid storage format to store elements in each block, which can be selected dynamically at runtime. Experimental results show that the proposed techniques implemented on a Kepler GPU delivers similar throughput on both SMVP and SMTVP and the throughput is up to 13 times faster than that of the CPU‐based CSB implementation. In addition, our eCSB procedure outperforms the previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, and we validate the effectiveness of eCSB by means of wall‐clock time of bi‐conjugate gradient algorithm; our eCSB is 25% faster than Compressed Sparse Rows (CSR) and 6% faster than HYB, respectively. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

14.
We present an efficient algorithm for object‐space proximity queries between multiple deformable triangular meshes. Our approach uses the rasterization capabilities of the GPU to produce an image‐space representation of the vertices. Using this image‐space representation, inter‐object vertex‐triangle distances and closest points lying under a user‐defined threshold are computed in parallel by conservative rasterization of bounding primitives and sorted using atomic operations. We additionally introduce a similar technique to detect penetrating vertices. We show how mechanisms of modern GPUs such as mipmapping, Early‐Z and Early‐Stencil culling can optimize the performance of our method. Our algorithm is able to compute dense proximity information for complex scenes made of more than a hundred thousand triangles in real time, outperforming a CPU implementation based on bounding volume hierarchies by more than an order of magnitude.  相似文献   

15.
《Parallel Computing》2014,40(5-6):86-99
Simulation of in vivo cellular processes with the reaction–diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.  相似文献   

16.
In this paper, we present the first algorithm for progressive sampling of 3D surfaces with blue noise characteristics that runs entirely on the GPU. The performance of our algorithm is comparable to state‐of‐the‐art GPU Poisson‐disk sampling methods, while additionally producing ordered sequences of samples where every prefix exhibits good blue noise properties. The basic idea is, to reduce the 3D sampling domain to a set of 2.5D images which we sample in parallel utilizing the rasterization hardware of current GPUs. This allows for simple visibility‐aware sampling that only captures the surface as seen from outside the sampled object, which is especially useful for point‐based level‐of‐detail rendering methods. However, our method can be easily extended for sampling the entire surface without changing the basic algorithm. We provide a statistical analysis of our algorithm and show that it produces good blue noise characteristics for every prefix of the resulting sample sequence and analyze the performance of our method compared to related state‐of‐the‐art sampling methods.  相似文献   

17.
We present a novel hierarchical grid based method for fast collision detection (CD) for deformable models on GPU architecture. A two‐level grid is employed to accommodate the non‐uniform distribution of practical scene geometry. A bottom‐to‐top method is implemented to assign the triangles into the hierarchical grid without any iteration while a deferred scheme is introduced to efficiently update the data structure. To address the issue of load balancing, which greatly influences the performance in SIMD parallelism, a propagation scheme which utilizes a parallel scan and a segmented scan is presented, distributing workloads evenly across all concurrent threads. The proposed method supports both discrete collision detection (DCD) and continuous collision detection (CCD) with self‐collision. Some typical benchmarks are tested to verify the effectiveness of our method. The results highlight our speedups over prior algorithms on different commodity GPUs.  相似文献   

18.
We present a novel approach to ray tracing execution on commodity graphics hardware using CUDA. We decompose a standard ray tracing algorithm into several data‐parallel stages that are mapped efficiently to the massively parallel architecture of modern GPUs. These stages include: ray sorting into coherent packets, creation of frustums for packets, breadth‐first frustum traversal through a bounding volume hierarchy for the scene, and localized ray‐primitive intersections. We utilize the well known parallel primitives scan and segmented scan in order to process irregular data structures, to remove the need for a stack, and to minimize branch divergence in all stages. Our ray sorting stage is based on applying hash values to individual rays, ray stream compression, sorting and decompression. Our breadth‐first BVH traversal is based on parallel frustum‐bounding box intersection tests and parallel scan per each BVH level. We demonstrate our algorithm with area light sources to get a soft shadow effect and show that our concept is reasonable for GPU implementation. For the same data sets and ray‐primitive intersection routines our pipeline is ~3x faster than an optimized standard depth first ray tracing implemented in one kernel.  相似文献   

19.
Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue‐based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree‐based parallel method for multi‐GPU system. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experiment results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12‐core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state‐of‐the‐art solver based on supernodal method for CPU‐GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

20.
We present a novel method for massively parallel hierarchical scene processing on the GPU, which is based on sequential decomposition of the given hierarchical algorithm into small functional blocks. The computation is fully managed by the GPU using a specialized task pool which facilitates synchronization and communication of processing units. We present two applications of the proposed approach: construction of the bounding volume hierarchies and collision detection based on divide‐and‐conquer ray tracing. The results indicate that using our approach we achieve high utilization of the GPU even for complex hierarchical problems which pose a challenge for massive parallelization. The results indicate that using our approach we achieve high utilization of the GPU even for complex hierarchical problems which pose a challenge for massive parallelization.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号