期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Speeding Up Cycle Based Logic Simulation Using Graphics Processing Units

Alper Sen Baris Aksanli Murat Bozkurt 《International journal of parallel programming》2011,39(5):639-661

Verification has grown to dominate the cost of electronic system design, consuming about 60% of design effort. Among several verification techniques, logic simulation remains the major verification technique. Speeding up logic simulation results in great savings and shorter time-to-market. We parallelize logic simulation using Graphics Processing Units (GPUs). In the past, GPUs were special-purpose application accelerators, suitable only for conventional graphics applications. The new generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. We develop a parallel cycle-based logic simulation algorithm that uses And Inverter Graphs (AIGs) as design representations. AIGs have proven to be an effective representation for various design automation applications, and we obtain similar benefits for speeding up logic simulation. We develop two clustering algorithms that partition the gates in the designs into independent blocks. Our algorithms exploit the massively parallel GPU architecture featuring thousands of concurrent threads, fast memory, and memory coalescing for optimizations. We demonstrate up-to 5x and 21x speedups on several benchmarks using our simulation system with the first and second clustering algorithms, respectively. Our work ultimately results in significant reduction in the overall design cycle. 相似文献

2.

Parallel probabilistic relaxation labelling based on Markov random fields for spectral-spatial hyperspectral image classification

Brajesh Kumar Onkar Dikshit 《International journal of remote sensing》2016,37(18):4356-4379

The large volume of data and computational complexity of algorithms limit the application of hyperspectral image classification to real-time operations. This work addresses the use of different parallel processing techniques to speed up the Markov random field (MRF)-based method to perform spectral-spatial classification of hyperspectral imagery. The Metropolis relaxation labelling approach is modified to take advantage of multi-core central processing units (CPUs) and to adapt it to massively parallel processing systems like graphics processing units (GPUs). The experiments on different hyperspectral data sets revealed that the implementation approach has a huge impact on the execution time of the algorithm. The results demonstrated that the modified MRF algorithm produced classification accuracy similar to conventional methods with greatly improved computational performance. With modern multi-core CPUs, good computational speed-up can be achieved even without additional hardware support. The CPU-GPU hybrid framework rendered the otherwise computationally expensive approach suitable for time-constrained applications. 相似文献

3.

New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance

ThienLuan Ho Seung-Rohk Oh HyunJin Kim 《The Journal of supercomputing》2018,74(5):1815-1834

This paper proposes new algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. Fixed-length approximate string matching and approximate circular string matching are special cases of approximate string matching and have numerous direct applications in bioinformatics and text searching. Firstly, a counter-vector-mismatches (CVM) algorithm is proposed to solve fixed-length approximate string matching with k-mismatches. The development of CVM algorithm is based on the parallel summation of counters located in the same machine word. Secondly, a parallel counter-vector-mismatches (PCVM) algorithm is proposed to accelerate CVM algorithm in parallel. The PCVM algorithm is integrated into two-level parallelisms that exploit not only word-level parallelism but also data parallelism via parallel environments such as multi-core processors and graphics processing units (GPUs). In the particular case of adopting GPUs, a shared-mem parallel counter-vector-mismatches (PCVMsmem) scheme can be implemented from PCVM algorithm. The PCVMsmem scheme can exploit the memory model of GPUs to optimize performance of PCVM algorithm. Finally, this paper shows several methods to adopt CVM and PCVM algorithms in case the input pattern is in circular structure. In the experiments with real DNA packages, our proposed algorithms and scheme work greatly faster than previous bit-vector-mismatches and parallel bit-vector-mismatches algorithms. 相似文献

4.

Computation on programmable graphics hardware 总被引：1，自引：0，他引：1

Goodnight N. Wang R. Humphreys G. 《Computer Graphics and Applications, IEEE》2005,25(5):12-15

GPUs have evolved into powerful and flexible streaming processors with fully programmable floating-point pipelines and tremendous aggregate computational power and memory bandwidth. With these advances, modern GPUs can now perform more functions than the specific graphics computations for which they were designed. This article describes approaches to using GPU processing power to accelerate traditionally CPU-based tasks. We discuss some important characteristics of algorithms that make them good candidates for GPU acceleration. We discuss a specific GPU image-processing application that is a common postprocess for many physically based rendering systems. 相似文献

5.

Particle filtering on GPU architectures for manufacturing applications

《Computers in Industry》2015

Particle filters are nonlinear estimators that can be used to detect anomalies in manufacturing processes. Although promising, their high computational cost often prevents their implementation in real-time applications. Recently, the introduction of graphics processing units (GPUs) has enabled the acceleration of computationally intensive processes with their massive parallel capabilities. This article presents the acceleration of the particle filter and the auxiliary particle filter, two of the most important particle methods, on a GPU using NVIDIA CUDA technology. This is illustrated via simulation for a remelting process where the accelerated algorithms return accurate estimates while still being two orders of magnitude faster than the physical process even for calculations that involve millions of particles. 相似文献

6.

基于流的实时碰撞检测算法 总被引：21，自引：0，他引：21

范昭炜万华根高曙明《软件学报》2004,15(10):1505-1514

实时碰撞检测是计算机图形应用中不可或缺的问题之一,复杂物体间的实时碰撞检测至今仍未能得以很好的解决.高性能可编程图形硬件的出现,正在改变着通用计算仅能由CPU完成的传统观念.探索性地采用了可编程图形硬件来解决复杂物体间的实时碰撞检测问题.通过将两个任意物体间的碰撞检测计算映射到图形硬件以有效利用图形硬件的并行架构,由实时绘制过程快速产生碰撞检测结果.为此,算法首先将碰撞检测问题转化为一组线段集合与三角形的求交问题,以实现碰撞检测算法向可编程图形硬件的迁移.在对算法复杂度进行理性分析的基础上,给出了两种有效的优化技术以提升算法效率.实验结果表明,与现有的图像空间碰撞检测算法相比,该算法在效率、精确性和实用性方面具有明显优势. 相似文献

7.

Scientific computation for simulations on programmable graphics hardware

Robert Strzodka Michael Doggett Andreas Kolb 《Simulation Modelling Practice and Theory》2005,13(8):667-680

Graphics processor units (GPUs) have emerged as powerful parallel processors in recent years. Although floating point computations and high level programming languages are now available, the efficient use of the enormous computing power of GPUs still requires a significant amount of graphics specific knowledge.The paper explains how to use GPUs for scientific computations without graphics specific terminology. It offers an algorithmic view on GPUs with comparisons to cache aware and parallel programming of CPUs. Two typical simulation techniques, namely grid based and particle based methods are discussed. 相似文献

8.

Supernodal sparse Cholesky factorization on graphics processing units

Dan Zou Yong Dou Song Guo Rongchun Li Lin Deng 《Concurrency and Computation》2014,26(16):2713-2726

Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue‐based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree‐based parallel method for multi‐GPU system. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experiment results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12‐core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state‐of‐the‐art solver based on supernodal method for CPU‐GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

9.

Parallel quantum-inspired evolutionary algorithms for community detection in social networks

《Applied Soft Computing》2017

The world around us may be viewed as a network of entities interconnected via their social, economic, and political interactions. These entities and their interactions form a social network. A social network is often modeled as a graph whose nodes represent entities, and edges represent interactions between these entities. These networks are characterized by the collective latent behavior that does not follow trivially from the behaviors of the individual entities in the network. One such behavior is the existence of hierarchy in the network structure, the sub-networks being popularly known as communities. Discovery of the community structure in a social network is a key problem in social network analysis as it refines our understanding of the social fabric. Not surprisingly, the problem of detecting communities in social networks has received substantial attention from the researchers.In this paper, we propose parallel implementations of recently proposed community detection algorithms that employ variants of the well-known quantum-inspired evolutionary algorithm (QIEA). Like any other evolutionary algorithm, a quantum-inspired evolutionary algorithm is also characterized by the representation of the individual, the evaluation function, and the population dynamics. However, individual bits called qubits, are in a superposition of states. As chromosomes evolve individually, the quantum-inspired evolutionary algorithms (QIEAs) are intrinsically suitable for parallelization.In recent years, programmable graphics processing units — GPUs, have evolved into massively parallel environments with tremendous computational power. NVIDIA® compute unified device architecture (CUDA®) technology, one of the leading general-purpose parallel computing architectures with hundreds of cores, can concurrently run thousands of computing threads. The paper proposes novel parallel implementations of quantum-inspired evolutionary algorithms in the field of community detection on CUDA-enabled GPUs.The proposed implementations employ a single-population fine-grained approach that is suited for massively parallel computations. In the proposed approach, each element of a chromosome is assigned to a separate thread. It is observed that the proposed algorithms perform significantly better than the benchmark algorithms. Further, the proposed parallel implementations achieve significant speedup over the serial versions. Due to the highly parallel nature of the proposed algorithms, an increase in the number of multiprocessors and GPU devices may lead to a further speedup. 相似文献

10.

基于图形处理器的数据流快速聚类 总被引：17，自引：1，他引：16

曹锋周傲英《软件学报》2007,18(2):291-302

在数据流环境下,聚类算法不仅需要有较高的聚类质量,同时需要有实时处理速度.因而,提出了一类基于图形处理器(graphics processing unit,简称GPU)的快速聚类方法,包括基于K-means的基本聚类方法、基于GPU的数据流聚类以及数据流簇进化分析方法.这些方法的共同特点是充分利用了GPU强大的处理能力和流水线特性.与以往具有独立框架的数据流聚类算法不同,这些基于GPU的聚类算法具有同一框架和多种聚类分析功能,为数据流聚类分析提供了统一的平台.从分析可知,数据流聚类分析的核心操作实际上就是距离计算和比较.基于这一认识,利用GPU的子素向量处理功能进行距离计算.性能验证实验是在配有Pentium IV 3.4G CPU和NVIDIA GeForce 6800 GT显卡的PC上进行的.综合分析和实验结果表明,基于GPU的数据流聚类算法比传统的CPU算法平均快7倍,从而为高速数据流应用提供了良好的支持. 相似文献

11.

High performance evaluation of evolutionary-mined association rules on GPUs

Alberto Cano José María Luna Sebastián Ventura 《The Journal of supercomputing》2013,66(3):1438-1461

Association rule mining is a well-known data mining task, but it requires much computational time and memory when mining large scale data sets of high dimensionality. This is mainly due to the evaluation process, where the antecedent and consequent in each rule mined are evaluated for each record. This paper presents a novel methodology for evaluating association rules on graphics processing units (GPUs). The evaluation model may be applied to any association rule mining algorithm. The use of GPUs and the compute unified device architecture (CUDA) programming model enables the rules mined to be evaluated in a massively parallel way, thus reducing the computational time required. This proposal takes advantage of concurrent kernels execution and asynchronous data transfers, which improves the efficiency of the model. In an experimental study, we evaluate interpreter performance and compare the execution time of the proposed model with regard to single-threaded, multi-threaded, and graphics processing unit implementation. The results obtained show an interpreter performance above 67 billion giga operations per second, and speed-up by a factor of up to 454 over the single-threaded CPU model, when using two NVIDIA 480 GTX GPUs. The evaluation model demonstrates its efficiency and scalability according to the problem complexity, number of instances, rules, and GPU devices. 相似文献

12.

Exploiting graphical processing units for data‐parallel scientific applications

A. Leist D. P. Playne K. A. Hawick 《Concurrency and Computation》2009,21(18):2400-2437

Graphical processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them more accessible to scientific programmers. We discuss the application of GPU programming to two significantly different paradigms—regular mesh field equations with unusual boundary conditions and graph analysis algorithms. The differing optimization techniques required for these two paradigms cover many of the challenges faced when developing GPU applications. We discuss the relevance of these application paradigms to simulation engines and games. GPUs were aimed primarily at the accelerated graphics market but since this is often closely coupled to advanced game products it is interesting to speculate about the future of fully integrated accelerator hardware for both visualization and simulation combined. As well as reporting the speed‐up performance on selected simulation paradigms, we discuss suitable data‐parallel algorithms and present code examples for exploiting GPU features like large numbers of threads and localized texture memory. We find a surprising variation in the performance that can be achieved on GPUs for our applications and discuss how these findings relate to past known effects in parallel computing such as memory speed‐related super‐linear speed up. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

13.

GPU-accelerated simulations of mass-action kinetics models with cupSODA

Marco S. Nobile Paolo Cazzaniga Daniela Besozzi Giancarlo Mauri 《The Journal of supercomputing》2014,69(1):17-24

In the last years, graphics processing units (GPUs) witnessed ever growing applications for a wide range of computational analyses in the field of life sciences. Despite its large potentiality, GPU computing risks remaining a niche for specialists, due to the programming and optimization skills it requires. In this work we present cupSODA, a simulator of biological systems that exploits the remarkable memory bandwidth and computational capability of GPUs. cupSODA allows to efficiently execute in parallel large numbers of simulations, which are usually required to investigate the emergent dynamics of a given biological system under different conditions. cupSODA works by automatically deriving the system of ordinary differential equations from a reaction-based mechanistic model, defined according to the mass-action kinetics, and then exploiting the numerical integration algorithm, LSODA. We show that cupSODA can achieve a \(86 \times \) speedup on GPUs with respect to equivalent executions of LSODA on the CPU. 相似文献

14.

Cache-efficient numerical algorithms using graphics hardware

《Parallel Computing》2007,33(10-11):663-684

We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth on GPUs. We further improve the performance of numerical algorithms by accounting for the same relative memory address accesses performed at data elements in nested loops. Based on the similarity of memory accesses performed at the data elements in the input array, we decompose the input arrays into sub-arrays with similar memory access patterns and execute on the sub-arrays for faster execution. Our approach achieves high memory performance on GPUs by tiling the computation and thereby improving the cache-efficiency. Overall, our formulation for GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer. This makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe 2–10× improvement in performance. 相似文献

15.

Unmixing-based content retrieval system for remotely sensed hyperspectral imagery on GPUs

Jorge Sevilla Sergio Bernabe Antonio Plaza 《The Journal of supercomputing》2014,70(2):588-599

This paper presents a new unmixing-based retrieval system for remotely sensed hyperspectral imagery. The need for this kind of system is justified by the exponential growth in the volume and number of remotely sensed data sets from the surface of the Earth. This is particularly the case for hyperspectral images, which comprise hundreds of spectral bands at different (almost contiguous) wavelength channels. To deal with the high computational cost of extracting the spectral information needed to catalog new hyperspectral images in our system, we resort to efficient implementations of spectral unmixing algorithms on commodity graphics processing units (GPUs). Spectral unmixing is a very popular approach for interpreting hyperspectral data with sub-pixel precision. This paper particularly focuses on the design of the proposed framework as a web service, as well as on the efficient implementation of the system on GPUs. In addition, we present a comparison of spectral unmixing algorithms available in the system on both CPU and GPU architectures. 相似文献

16.

A computational comparison of scaling techniques for linear optimization problems on a graphical processing unit

Nikolaos Ploskas 《国际计算机数学杂志》2015,92(2):319-336

Preconditioning techniques are important in solving linear problems, as they improve their computational properties. Scaling is the most widely used preconditioning technique in linear optimization algorithms and is used to reduce the condition number of the constraint matrix, to improve the numerical behavior of the algorithms and to reduce the number of iterations required to solve linear problems. Graphical processing units (GPUs) have gained a lot of popularity in the recent years and have been applied for the solution of linear optimization problems. In this paper, we review and implement ten scaling techniques with a focus on the parallel implementation of them on GPUs. All these techniques have been implemented under the MATLAB and CUDA environment. Finally, a computational study on the Netlib set is presented to establish the practical value of GPU-based implementations. On average the speedup gained from the GPU implementations of all scaling methods is about 7×. 相似文献

17.

Multi-physics bi-directional evolutionary topology optimization on GPU-architecture

Munk David J. Kipouros Timoleon Vio Gareth A. 《Engineering with Computers》2019,35(3):1059-1079

Topology optimization has proven to be viable for use in the preliminary phases of real world design problems. Ultimately, the restricting factor is the computational expense since a multitude of designs need to be considered. This is especially imperative in such fields as aerospace, automotive and biomedical, where the problems involve multiple physical models, typically fluids and structures, requiring excessive computational calculations. One possible solution to this is to implement codes on massively parallel computer architectures, such as graphics processing units (GPUs). The present work investigates the feasibility of a GPU-implemented lattice Boltzmann method for multi-physics topology optimization for the first time. Noticeable differences between the GPU implementation and a central processing unit (CPU) version of the code are observed and the challenges associated with finding feasible solutions in a computational efficient manner are discussed and solved here, for the first time on a multi-physics topology optimization problem. The main goal of this paper is to speed up the topology optimization process for multi-physics problems without restricting the design domain, or sacrificing considerable performance in the objectives. Examples are compared with both standard CPU and various levels of numerical precision GPU codes to better illustrate the advantages and disadvantages of this implementation. A structural and fluid objective topology optimization problem is solved to vary the dependence of the algorithm on the GPU, extending on the previous literature that has only considered structural objectives of non-design dependent load problems. The results of this work indicate some discrepancies between GPU and CPU implementations that have not been seen before in the literature and are imperative to the speed-up of multi-physics topology optimization algorithms using GPUs.

相似文献

18.

Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach 总被引：2，自引：1，他引：1

下载免费PDF全文

Gabriel Falcã o Student Member IEEE Shinichi Yamagiwa Member IEEE Vitor Silva and Leonel Sousa Member ACM Senior Member IEEE 《计算机科学技术学报》2009,24(5):913-924

Low-Density Parity-Check (LDPC) codes are powerful error correcting codes adopted by recent communication standards. LDPC decoders are based on belief propagation algorithms, which make use of a Tanner graph and very intensive message-passing computation, and usually require hardware-based dedicated solutions. With the exponential increase of the computational power of commodity graphics processing units (GPUs), new opportunities have arisen to develop general purpose processing on GPUs. This paper proposes the use of GPUs for implementing flexible and programmable LDPC decoders. A new stream-based approach is proposed, based on compact data structures to represent the Tanner graph. It is shown that such a challenging application for stream-based computing, because of irregular memory access patterns, memory bandwidth and recursive flow control constraints, can be efficiently implemented on GPUs. The proposal was experimentally evaluated by programming LDPC decoders on GPUs using the Caravela platform, a generic interface tool for managing the kernels' execution regardless of the GPU manufacturer and operating system. Moreover, to relatively assess the obtained results, we have also implemented LDPC decoders on general purpose processors with Streaming Single Instruction Multiple Data (SIMD) Extensions. Experimental results show that the solution proposed here efficiently decodes several codewords simultaneously, reducing the processing time by one order of magnitude. 相似文献

19.

Experiences with implementing parallel discrete-event simulation on GPU

Sang Janche Lee Che-Rung Rego Vernon King Chung-Ta 《The Journal of supercomputing》2019,75(8):4132-4149

The Journal of Supercomputing - Modern graphics processing units (GPUs) offer much more computational power than recent CPUs by providing a vast number of simple, data-parallel, multithreaded... 相似文献

20.

Algorithmic performance studies on graphics processing units

Olaf Schenk Matthias Christen Helmar Burkhart 《Journal of Parallel and Distributed Computing》2008

We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix–matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip initially architectured for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operation is demonstrated resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications. 相似文献