期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Efficient GPU algorithms for parallel decomposition of graphs into strongly connected and maximal end components

Anton Wijs Joost-Pieter Katoen Dragan Bošnački 《Formal Methods in System Design》2016,48(3):274-300

相似文献

2.

Massively parallelized replica-exchange simulations of polymers on GPUs

Jonathan Gross Wolfhard Janke Michael Bachmann 《Computer Physics Communications》2011,(8):1638-1644

We discuss the advantages of parallelization by multithreading on graphics processing units (GPUs) for parallel tempering Monte Carlo computer simulations of an exemplified bead-spring model for homopolymers. Since the sampling of a large ensemble of conformations is a prerequisite for the precise estimation of statistical quantities such as typical indicators for conformational transitions like the peak structure of the specific heat, the advantage of a strong increase in performance of Monte Carlo simulations cannot be overestimated. Employing multithreading and utilizing the massive power of the large number of cores on GPUs, being available in modern but standard graphics cards, we find a rapid increase in efficiency when porting parts of the code from the central processing unit (CPU) to the GPU. 相似文献

3.

Efficient decomposition and performance of parallel PDE,FFT, Monte Carlo simulations,simplex, and Sparse solvers

Zarka Cvetanovic Edward G. Freedman Charles Nofsinger 《The Journal of supercomputing》1991,5(2-3):219-238

In this paper, we describe the decomposition of six algorithms: two partial differential equations (PDE) solvers (successive over-relaxation [SOR] and alternating direction implicit [ADI]), fast Fourier transform (FFT), Monte Carlo simulations, Simplex linear programming, and Sparse solvers. The algorithms were selected not only because of their importance in scientific applications, but also because they represent a variety of computational (structured to irregular) and communication (low to high) requirements. We present the performance results of these algorithms on two shared-memory VAX/VMS^TM1 multiprocessor prototypes: the VAX 6300 series with up to 8 processors and the M31 with up to 22 processors. We demonstrate that by efficient decomposition it is possible to achieve high performance for all algorithms on both prototypes. We describe the efficient decomposition techniques applied to optimize the performance of parallel algorithms. Also, we discuss the performance implications due to different cache designs on two multiprocessors.An earlier version of this paper was presented at Supercomputing '90.At the time of writing, all three authors were with Digital Equipment Corporation, VMS Systems and Servers Group. 相似文献

4.

基于GPU的实时超分辨率算法实现

章拓王知衍《广东电脑与电讯》2009,(3)

高分辨率显示设备的发展意味着需要高分辨率的图象与之匹配。本文通过GPU,实现了一种实时超分辨率,使分辨率较低的视频资料在高分辨率显示设备上有较好的显示效果。相似文献

5.

Adaptable Particle-in-Cell algorithms for graphical processing units

Viktor K. Decyk Tajendra V. Singh 《Computer Physics Communications》2011,(3):641-648

We developed new parameterized Particle-in-Cell algorithms and data structures for emerging multi-core and many-core architectures. Four parameters allow tuning of this PIC code to different hardware configurations. Particles are kept ordered at each time step. The first application of these algorithms is to NVIDIA graphical processing units, where speedups of about 15–25 compared to an Intel Nehalem processor were obtained for a simple 2D electrostatic code. Electromagnetic codes are expected to get higher speedups due to their greater computational intensity. 相似文献

6.

Parallel algorithms for the connected components and minimal spanning tree problems

Dhruva Nath S.N. Maheshwari 《Information Processing Letters》1982,14(1):7-11

相似文献

7.

An effective and efficient parallel approach for random graph generation over GPUs

Stéphane Bressan Alfredo Cuzzocrea Panagiotis Karras Xuesong Lu Sadegh Heyrani Nobari 《Journal of Parallel and Distributed Computing》2013

The widespread usage of random graphs has been highlighted in the context of database applications for several years. This because such data structures turn out to be very useful in a large family of database applications ranging from simulation to sampling, from analysis of complex networks to study of randomized algorithms, and so forth. Amongst others, Erd?s–Rényi _Γ_v_,_p

Γ_{v, p}

is the most popular model to obtain and manipulate random graphs. Unfortunately, it has been demonstrated that classical algorithms for generating Erd?s–Rényi based random graphs do not scale well in large instances and, in addition to this, fail to make use of the parallel processing capabilities of modern hardware. Inspired by this main motivation, in this paper we propose and experimentally assess a novel parallel algorithm for generating random graphs under the Erd?s–Rényi model that is designed and implemented in a Graphics Processing Unit (GPU), called PPreZER. We demonstrate the nice amenities due to our solution via a succession of several intermediary algorithms, both sequential and parallel, which show the limitations of classical approaches and the benefits due to the PPreZER algorithm. Finally, our comprehensive experimental assessment and analysis brings to light a relevant average speedup gain of PPreZER over baseline algorithms. 相似文献

8.

Parallel multi-objective Ant Programming for classification using GPUs

Alberto Cano Juan Luis Olmo Sebastián Ventura 《Journal of Parallel and Distributed Computing》2013

Classification using Ant Programming is a challenging data mining task which demands a great deal of computational resources when handling data sets of high dimensionality. This paper presents a new parallelization approach of an existing multi-objective Ant Programming model for classification, using GPUs and the NVIDIA CUDA programming model. The computational costs of the different steps of the algorithm are evaluated and it is discussed how best to parallelize them. The features of both the CPU parallel and GPU versions of the algorithm are presented. An experimental study is carried out to evaluate the performance and efficiency of the interpreter of the rules, and reports the execution times and speedups regarding variable population size, complexity of the rules mined and dimensionality of the data sets. Experiments measure the original single-threaded and the new multi-threaded CPU and GPU times with different number of GPU devices. The results are reported in terms of the number of Giga GP operations per second of the interpreter (up to 10 billion GPops/s) and the speedup achieved (up to 834× vs CPU, 212× vs 4-threaded CPU). The proposed GPU model is demonstrated to scale efficiently to larger datasets and to multiple GPU devices, which allows the expansion of its applicability to significantly more complicated data sets, previously unmanageable by the original algorithm in reasonable time. 相似文献

9.

Estimating the number of connected components in sublinear time

Petra Berenbrink Bruce KrayenhoffFrederik Mallmann-Trenn 《Information Processing Letters》2014

相似文献

10.

Particle-field decomposition and domain decomposition in parallel particle-in-cell beam dynamics simulation

Ji Qiang Xiaoye Li 《Computer Physics Communications》2010,181(12):2024-2034

Particle-in-cell (PIC) simulation is widely used in many branches of physics and engineering. In this paper, we give an analysis of the particle-field decomposition method and the domain decomposition method in parallel particle-in-cell beam dynamics simulation. The parallel performance of the two decomposition methods was studied on the Cray XT4 and the IBM Blue Gene/P Computers. The domain decomposition method shows better scalability but is slower than the particle-field decomposition in most cases (up to a few thousand processors) for macroparticle dominant applications. The particle-field decomposition method also shows less memory usage than the domain decomposition method due to its use of perfect static load balance. For applications with a smaller ratio of macroparticles to grid points, the domain decomposition method exhibits better scalability and faster speed. Application of the particle-field decomposition scheme to high-resolution macroparticle-dominant parallel beam dynamics simulation for a future light source linear accelerator is presented as an example. 相似文献

11.

Efficient parallel algorithms for path problems in directed graphs

Joan M. Lucas Marian Gunsher Sackrowitz 《Algorithmica》1992,7(1):631-648

In this paper we describe a technique for finding efficient parallel algorithms for problems on directed graphs that involve checking the existence of certain kinds of paths in the graph. This technique provides efficient algorithms for finding dominators in flow graphs, performing interval and loop analysis on reducible flow graphs, and finding the feedback vertices of a digraph. Each of these algorithms takesO(log² n) time using the same number of processors needed for fast matrix multiplication. All of these bounds are for an EREW PRAM. 相似文献

12.

基于GPU的大规模拓扑优化问题并行计算方法 总被引：1，自引：0，他引：1

韩琪蔡勇《计算机仿真》2015,32(4):221-226,304

针对进行大规模拓扑优化问题计算量庞大且计算效率低的问题,设计并实现了一种基于图形处理器(GPU)的并行拓扑优化方法.采用双向渐进结构拓扑优化(BESO)为基础优化算法,采用一种基于节点计算的共轭梯度求解方法用于有限元方程组求解.通过对原串行算法的研究,并结合GPU的计算特点,实现了迭代过程全流程的并行计算.上述方法的程序设计和编写采用统一计算架构(CUDA),提出了基于单元和基于节点的两种并行策略.编写程序时充分使用CUDA自带的各种数学运算库,保证了程序的稳定性和易用性.数值算例证明,并行计算方法稳定并且高效,在优化结果一致的前提下,采用GTX580显卡可以取得巨大的计算加速比. 相似文献

13.

Efficient utilization of launched threads on GPUs: The spherical harmonic transform as a case study

Feng-shun Lu Jun-qiang Song Wang-qun Lin Yu-fei Pang Kai-jun Ren Pei-chang Shi 《Computer Physics Communications》2013

Maximum utilization of hardware resources is crucial to leverage the enormous computational power of graphics processing units (GPUs). However, there lacks an effective metric to denote whether the launched threads are kept busy. To address this issue, we propose a metric called ETU to describe the efficiency of threads utilization. First, we execute several CUDA-SDK sample codes, with(out) double precision arithmetic, on two generations of GPUs so as to perform a preliminary validation of the ETU metric. Taking the spherical harmonic transform as an example, we then give two GPU implementations for Legendre transforms and check the relationship between ETU and application performance. Experimental results show that applications with larger ETU can usually achieve better performance, which is more accurate than occupancy proposed by NVIDIA. Finally, we select the GPU implementations with better performance to accelerate Legendre transforms in STSWM, which is a spectral transform shallow water model. 相似文献

14.

Efficient parallel processing with spin-wave nanoarchitectures

Mary M. Eshaghian-Wilner Shiva Navab 《The Journal of supercomputing》2009,49(2):248-267

In this paper, we study the algorithm design aspects of three newly developed spin-wave architectures. The architectures are capable of simultaneously transmitting multiple signals using different frequencies, and allow for concurrent read/write operations. Using such features, we show a number of parallel and fault-tolerant routing schemes and introduce a set of generic parallel processing techniques that can be used for design of fast algorithms on these spin-wave architectures. We also present a set of application examples to illustrate the operation of the proposed generic parallel techniques.

Mary M. Eshaghian-WilnerEmail:

相似文献

15.

Accelerating Advanced MRI Reconstructions on GPUs 总被引：1，自引：0，他引：1

Stone SS Haldar JP Tsao SC Hwu WM Sutton BP Liang ZP 《Journal of Parallel and Distributed Computing》2008,68(10):1307-1318

Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA's Quadro FX 5600. The reconstruction of a 3D image with 128(3) voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%. 相似文献

16.

Recognition of circular patterns on GPUs: Performance analysis and contributions

Antonio Ruiz Nicolás Guil Manuel Ujaldón 《Journal of Parallel and Distributed Computing》2008

We develop a novel approach for computing the circle Hough transform entirely on graphics hardware (GPU). A primary role is assigned to vertex processors and the rasterizer, overshadowing the traditional foreground of pixel processors and enhancing parallel processing. Resources like the vertex cache or blending units are studied too, with our set of optimizations leading to extraordinary peak gain factors exceeding 358x over a typical CPU execution. Software optimizations, like the use of precomputed tables or gradient information and hardware improvements, like hyperthreading and multicores are explored on CPUs as well. Overall, the GPU exhibits better scalability and much greater parallel performance to become a solid alternative for computing the classical circle Hough transform versus those optimal methods run on emerging multicore architectures. 相似文献

17.

一种基于GPU的并行算法功耗评估方法

王卓薇程良伦赵武清《计算机科学》2013,40(11):23-28

随着软件和硬件的不断发展,图形处理器(GPUs)已经广泛用于通用计算领域,并作为加速器来协助CPU加速程序的运行。为了追求高性能,GPU往往包含成百上千个核心运算单元,高密度的计算资源使其在性能远高于CPU的同时功耗也高于CPU,因此功耗问题已经成为制约GPU发展的重要问题之一。分析了并行程序在GPU上运行时消耗的功耗,提出了并行算法在GPU上运行的功耗评估方法,接着通过并行前缀求和算法对该方法进行了详细的论述与分析。在实验部分通过稀疏矩阵向量乘算法的实际应用对该方法的正确性以及敏感性进行了证明与分析。结果表明,对于给定的程序,在满足性能要求的前提下,最优线程块数、存储访问方式以及任务分配顺序是影响系统功耗的关键因素。相似文献

18.

Efficient parallel term matching and anti-unification

Arthur L. Delcher Simon Kasif 《Journal of Automated Reasoning》1992,9(3):391-406

In this paper we present optimal processor x time parallel algorithms for term matching and anti-unification of terms represented as trees. Term matching is the special case of unification in which one of the terms is restricted to contain no variables. It has wide applicability to logic programming, term rewriting systems and symbolic pattern matching. Anti-unification is the dual problem of unification in which one computes the most specific generalization of two terms. It has application to inductive inference and theorem proving. Our algorithms run in O(log² N) time using N/log² N processors on a shared-memory model of computation that prohibits simultaneous reads or writes (EREW PRAM). These algorithms are the first polylogarithmic-time EREW algorithms with a processor x time product of the same order as that of their sequential counterparts, thereby permitting optimal speed-ups using any number of processors up to N/log² N. We also use the techniques developed in the paper to provide an N/log N-processor, O(log N)-time algorithm for a shared-memory model that allows both simultaneous reads and simultaneous writes (CRCW PRAM).Supported by NSF Grant IRI-88-09324 and NSF/DARPA Grant CCR-8908092. 相似文献

19.

A 2-approximation NC algorithm for connected vertex cover and tree cover

Toshihiro Fujito Takashi Doi 《Information Processing Letters》2004,90(2):59-63

The connected vertex cover problem is a variant of the vertex cover problem, in which a vertex cover is additional required to induce a connected subgraph in a given connected graph. The problem is known to be NP-hard and to be at least as hard to approximate as the vertex cover problem is. While several 2-approximation NC algorithms are known for vertex cover, whether unweighted or weighted, no parallel algorithm with guaranteed approximation is known for connected vertex cover. Moreover, converting the existing sequential 2-approximation algorithms for connected vertex cover to parallel ones results in RNC algorithms of rather high complexity at best.In this paper we present a 2-approximation NC (and RNC) algorithm for connected vertex cover (and tree cover). The NC algorithm runs in O(log²n) time using O(Δ²(m+n)/logn) processors on an EREW-PRAM, while the RNC algorithm runs in O(logn) expected time using O(m+n) processors on a CRCW-PRAM, when a given graph has n vertices and m edges with maximum vertex degree of Δ. 相似文献

20.

基于GPU的轮廓提取算法的并行计算方法研究

柴志雷张圆蒲《计算机应用研究》2015,(2):630-634

为解决高质量的轮廓提取算法计算复杂、实时性差的问题,基于GPU并行计算架构提出了一种针对高质量的轮廓提取算法——Pb(probability boundary,概率轮廓)提取算法的高效并行计算方法。重点讨论了如何利用多计算单元加速计算最耗时的梯度计算部分。详细介绍了多方向直方图并行统计机制及χ2并行计算中访存冲突避免机制。对比实验表明,在GPU上基于该并行方法的轮廓提取相比传统CPU方式具有明显加速效果,且随着图像分辨率变大,加速效果更加明显,例如图像大小为1024×1024时可获得160倍的加速;此外,基于伯克利标准测试集验证了该并行方法可保持原有算法的计算准确度。为大规模图像数据智能分析中的轮廓提取提供了快速、实时的计算方法。相似文献