期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

GPU加速的图像匹配技术 总被引：1，自引：0，他引：1

厉旭杰《计算机工程与应用》2012,48(2):173-176

传统的模板图像匹配算法,匹配速度较慢。应用GPU通用高性能编程技术实现了一种加速图像匹配算法的新方法。应用CUDA编程技术对图像匹配算法进行并行化改造。采用了四种不同的存储方案,在第四种存储方案中获得了43.5倍的加速比,并对四种不同的存储方案的性能进行了深入研究。相似文献

2.

一种使用GPU加速地震叠前时间偏移的方法

张清谢海波赵开勇吴庆陈维王狮虎迟旭光褚晓文《微型机与应用》2011,30(10):87-90

应用GPU通用高性能编程技术实现一种加速地震叠前时间偏移的新方法.该技术是地震勘探处理的常规流程,其核心算法具有计算密集、数据独立性强、并行性高等特点.通过性能剖析获得其计算热点,通过CUDA技术对其进行并行化改造,并利用CUDA的流技术实现CPU到GPU的异步传输.通过集群环境下的性能测试,应用GPU并行化的PSTM程序可明显缩短运行时间. 相似文献

3.

GPGPU编程技术初探

林茂董玉敏邹杰杨敏张晋楠《电脑编程技巧与维护》2010,(2):15-17,23

伴随着GPGPU计算技术的不断发展,HPC高性能计算系统体系结构正在悄然发生着一场变革,这场变革为高性能计算发展提供了一个新的方向、CUDA是NIVIDIA公司提供的利用GPGPU进行并行运算应用开发的一套C语言编程平台,通过它可以利用特定显卡的高性能运算能力进行一些大规模高性能计算,有效提升计算机系统的使用效率,本文主要介绍GPU发展现状以及如何利用CUDA编程技术进行并行运算软件开发．相似文献

4.

一种基于GPGPU的SIFT加速算法 总被引：4，自引：1，他引：3

杨天天鲁云萍张为华《电子技术应用》2015,41(1)

SIFT是目前应用最广泛的基于局部特征的图像特征提取算法之一,针对其运行速度制约其应用范围的问题,提出在图像处理器(GPGPU)上设计并实现将算法各核心模块映射到GPGPU的计算单元并针对GPUPU特性进行优化的SIFT并行加速算法。测试结果表明,基于GPGPU的SIFT并行算法相比于原始串行版本达到了118.2倍的加速,吞吐量达到了76.86图片/s,相比于已有的技术获得了明显的性能提升。相似文献

5.

利用GPGPU进行快速稀疏磁共振数据重建

下载免费PDF全文

王聪冯衍秋《计算机工程与应用》2011,47(17):203-206

利用GPGPU（General Purpose GPU）强大的并行处理能力,基于NVIDIA CUDA框架对已有的稀疏磁共振（Sparse MRI）重建算法进行了并行化改造,使其能够适应实际应用的要求。稀疏磁共振成像的重建算法包含大量的浮点运算,计算耗时严重,难以应用于实际,必须对其进行加速和优化。实验结果显示,NVIDIA GTX275 GPU使运算时间从4分多钟缩短到3.4秒左右,与Intel Q8200 CPU相比,达到了76倍的加速。相似文献

6.

Postgres数据管理系统查询优化的研究与实现

李艳彬刘波朱薇薇《计算机工程与应用》2002,38(21):205-206

查询优化是传统和并行数据库管理系统中的重要组成部分。该文通过介绍传统和并行数据库的查询优化技术,对Postges数据库中查询优化模块的工作流程及主要实现算法进行了分析,并对其进一步并行化提出了思考性建议。相似文献

7.

OVALS海洋资料同化系统并行计算研究

下载免费PDF全文

卢风顺宋君强朱小谦《计算机工程与科学》2010,32(1):113-116

海洋数值预报技术的发展与高性能计算密切相关。为提高OVALS海洋资料同化系统的时效性,本文实现了OVALS系统的并行化。在温盐资料同化模块并行化过程中,本文提出了层优先处理器划分算法,并研究了基于该算法的并行I/O、全局通讯等实现方法;在高度计资料同化模块并行化过程中,设计实现了基于预处理的非规则区域分解算法,较好地实现了OVALS并行计算负载平衡。数值实验表明,OVALS并行系统在36并行规模下取得了17.45的并行加速比。相似文献

8.

模糊C均值聚类算法的并行化研究

张建强郑晓薇吴华平《微型机与应用》2010,29(23)

使用Intel Parallel Amplifier高性能工具,针对模糊C均值聚类算法在多核平台的性能问题,找出串行程序的热点和并发性,提出并行化设计方案.基于Intel并行库TBB(线程构建模块)和OpenMP运行时库函数,对多核平台下的串行程序进行循环并行化和任务分配的并行化设计. 相似文献

9.

基于块排序索引的生物序列局部比对查询技术

李永光王镝王国仁马宜菲《计算机科学》2005,32(12):159-163

生物数据库中的查询是在生物序列数据集中查找与输入查询序列相似的目标,目前的一些流行工具如BLAST等,是利用启发式算法来提高查询的速度。然而,这些启发式算法无法找到所有的满足要求的结果,而一些精确算法,如动态规划算法,却需要非常高昂的代价。最近,一种新的技术,QASIS,提出了在后缀树的遍历中使用动态规划的精确查找算法,其性能与BLAST相当。但是它的主要缺点就是后缀树这种索引结构需要巨大的空间开销。本文采用基于无损压缩的块排序结构来索引超常的生物序列,减小索引的存储空间开销,有效地减少动态规划算法的计算代价。实验结果表明基于块排序索引的算法在性能方面优于OASIS算法。相似文献

10.

一种基于FPGA加速的高性能数据解压方法

刘谱光魏子令黄成龙陈曙晖《计算机学报》2023,(12):2687-2704

在数据库、深度学习、高效存储等数据读取性能敏感的应用场景中，数据解压性能对上层应用的服务质量有着重要影响.LZ4无损数据压缩算法具备高速解压特性，因此被广泛应用在高速解压场景中，但其运行需要消耗大量CPU资源.为减少LZ4数据解压开销，学界和业界提出了基于FPGA的LZ4数据解压加速方法 .但现有方法大多采用逐字节顺序处理的计算模式，导致并行度和吞吐率存在较大不足.因此，设计实现高性能LZ4数据解压加速方法成为当前研究亟需解决的关键问题.以LZ4解压的高性能加速为目标，本文研究从多层次对LZ4解压进行并行加速设计，提出了一种基于FPGA加速的高性能LZ4数据解压方法 .首先，本方法研究对LZ4序列解析过程进行并行化改进，设计实现了一个基于多字段并行解析方法的并行化序列解析器，将吞吐率从每周期单字节扩展到每周期多字节.此外，本方法对序列解析器中的高时延长度字段解析逻辑进行优化改进，设计了基于二分法的最大匹配长度快速解析方法，显著减小序列解析器的关键路径时延，使得改进后的设计时钟频率比改进前提高了约21%.其次，基于并行化序列解析器，本方法设计实现了一个高性能数据解压引擎.该引擎将序列解析... 相似文献

11.

Auto-tuning for GPGPU applications using performance and energy model

《Journal of Systems Architecture》2016

The general-purpose graphic processing unit (GPGPU) is a popular accelerator for general applications such as scientific computing because the applications are massively parallel and the significant power of parallel computing inheriting from GPUs. However, distributing workload among the large number of cores as the execution configuration in a GPGPU is currently still a manual trial-and-error process. Programmers try out manually some configurations and might settle for a sub-optimal one leading to poor performance and/or high power consumption. This paper presents an auto-tuning approach for GPGPU applications with the performance and power models. First, a model-based analytic approach for estimating performance and power consumption of kernels is proposed. Second, an auto-tuning framework is proposed for automatically obtaining a near-optimal configuration for a kernel computation. In this work, we formulated that automatically finding an optimal configuration as the constraint optimization and solved it using either simulated annealing (SA) or genetic algorithm (GA). Experiment results show that the fidelity of the proposed models for performance and energy consumption are 0.86 and 0.89, respectively. Further, the optimization algorithms result in a normalized optimality offset of 0.94% and 0.79% for SA and GA, respectively. 相似文献

12.

Parallelization of Full Search Motion Estimation Algorithm for Parallel and Distributed Platforms

Eduarda Monteiro Bruno Vizzotto Cláudio Diniz Marilena Maule Bruno Zatt Sergio Bampi 《International journal of parallel programming》2014,42(2):239-264

This work presents an efficient method to map the Full Search algorithm for Motion Estimation (ME) onto General Purpose Graphic Processing Unit (GPGPU) architectures using Compute Unified Device Architecture (CUDA) programming model. Our method jointly exploits the massive parallelism available in current GPGPU devices and the parallelism potential of Full Search algorithm. Our main goal is to evaluate the feasibility of video codecs implementation using GPGPUs and its advantages and drawbacks compared to other platforms. Therefore, for comparison reasons, three solutions were developed using distinct programming paradigms for distinct underlying hardware architectures: (i) a sequential solution for general-purpose processor (GPP); (ii) a parallel solution for multi-core GPP using OpenMP library; (iii) a distributed solution for cluster/grid machines using Message Passing Interface (MPI) library. The CUDA-based solution for GPGPUs achieves speed-up compatible to the indicated by the theoretical model for different search areas. Our GPGPU Full Search Motion Estimation provides 2×, 20× and 1664× speed-up when compared to MPI, OpenMP and sequential implementations, respectively. Compared to state-of-the-art, our solution reaches up to 17× speed-up. 相似文献

13.

Accelerating wildfire susceptibility mapping through GPGPU

Salvatore Di Gregorio Giuseppe Filippone William Spataro Giuseppe A. Trunfio 《Journal of Parallel and Distributed Computing》2013

In the field of wildfire risk management the so-called burn probability maps (BPMs) are increasingly used with the aim of estimating the probability of each point of a landscape to be burned under certain environmental conditions. Such BPMs are usually computed through the explicit simulation of thousands of fires using fast and accurate models. However, even adopting the most optimized algorithms, the building of simulation-based BPMs for large areas results in a highly intensive computational process that makes mandatory the use of high performance computing. In this paper, General-Purpose Computation with Graphics Processing Units (GPGPU) is applied, in conjunction with a wildfire simulation model based on the Cellular Automata approach, to the process of BPM building. Using three different GPGPU devices, the paper illustrates several implementation strategies to speedup the overall mapping process and discusses some numerical results obtained on a real landscape. 相似文献

14.

三维Navier-Stokes方程分步法的并行算法在异构平台上实现初探

徐莹徐磊姜恺《计算机工程与科学》2012,34(9):33-39

本文选取了三维不可压缩流动方程的分步法(fractional-step method),其中动量方程使用BiCGSTAB算法进行迭代求解,而压力泊松方程使用Fourier变换法进行直接求解。本文研究该算法在集群平台上的并行算法,从区域分解入手,分析一维、两维、三维区域划分三种情况下,各并行处理器上的计算量与通讯量,根据分析结果使用两维区域分解。分析BiCGSTAB算法和泊松Fourier变换法在GPGPU异构平台上的移植方法。最后,本文分析了BiCGSTAB和泊松方程Fourier变换法两种算法在CPU集群和GPGPU异构平台上的并行性能结果。相似文献

15.

Parallel Optimization of Sparse Portfolios with AR-HMMs

I. Róbert Sipos Attila Ceffer János Levendovszky 《Computational Economics》2017,49(4):563-578

In this paper we optimize mean reverting portfolios subject to cardinality constraints. First, the parameters of the corresponding Ornstein–Uhlenbeck (OU) process are estimated by auto-regressive Hidden Markov Models (AR-HMM), in order to capture the underlying characteristics of the financial time series. Portfolio optimization is then performed by maximizing the return achieved with a predefined probability instead of optimizing the predictability parameter, which provides more profitable portfolios. The selection of the optimal portfolio according to the goal function is carried out by stochastic search algorithms. The presented solutions satisfy the cardinality constraint in terms of providing a sparse portfolios which minimize the transaction costs (and, as a result, maximize the interpretability of the results). In order to use the method for high frequency trading (HFT) we utilize a massively parallel GPGPU architecture. Both the portfolio optimization and the model identification algorithms are successfully tailored to be running on GPGPU to meet the challenges of efficient software implementation and fast execution time. The performance of the new method has been extensively tested both on historical daily and intraday FOREX data and on artificially generated data series. The results demonstrate that a good average return can be achieved by the proposed trading algorithm in realistic scenarios. The speed profiling has proven that GPGPU is capable of HFT, achieving high-throughput real-time performance. 相似文献

16.

A regression‐based performance prediction framework for synchronous iterative algorithms on general purpose graphical processing unit clusters

Vivek K. Pallipuram Melissa C. Smith Nimisha Raut Xiaoyu Ren 《Concurrency and Computation》2014,26(2):532-560

Heterogeneous performance prediction models are valuable tools to accurately predict application runtime, allowing for efficient design space exploration and application mapping. The existing performance models require intricate system architecture knowledge, making the modeling task difficult. In this research, we propose a regression‐based performance prediction framework for general purpose graphical processing unit (GPGPU) clusters that statistically abstracts the system architecture characteristics, enabling performance prediction without detailed system architecture knowledge. The regression‐based framework targets deterministic synchronous iterative algorithms using our synchronous iterative GPGPU execution model and is broken into two components: the computation component that models the GPGPU device and host computations and the communication component that models the network‐level communications. The computation component regression models use algorithm characteristics such as the number of floating‐point operations and total bytes as predictor variables and are trained using several small, instrumented executions of synchronous iterative algorithms that include a range of floating‐point operations‐to‐byte requirements. The regression models for network‐level communications are developed using micro‐benchmarks and employ data transfer size and processor count as predictor variables. Our performance prediction framework achieves prediction accuracy over 90% compared with the actual implementations for several tested GPGPU cluster configurations. The end goal of this research is to offer the scientific computing community, an accurate and easy‐to‐use performance prediction framework that empowers users to optimally utilize the heterogeneous resources. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

17.

基于CUDA的IDEA密钥生成方案的设计与实现

刘景瑞《网络安全技术与应用》2014,(9):92-92

本文首先对GPGPU模型CUDA进行了简单的介绍,描述了IDEA密码体制加、解密密钥生成过程,最后通过使用CUDA架构在GPU上实现了IDEA密码体制加、解密密钥的生成过程. 相似文献

18.

Reconstructing permutation table to improve the Tabu Search for the PFSP on GPU

Kai-Cheng Wei Xue Sun Hsun Chu Chao-Chin Wu 《The Journal of supercomputing》2017,73(11):4711-4738

General-purpose computing on graphics processing unit (GPGPU) has been adopted to accelerate the running of applications which require long execution time in various problem domains. Tabu Search belonging to meta-heuristics optimization has been used to find a suboptimal solution for NP-hard problems within a more reasonable time interval. In this paper, we have investigated in how to improve the performance of Tabu Search algorithm on GPGPU and took the permutation flow shop scheduling problem (PFSP) as the example for our study. In previous approach proposed recently for solving PFSP by Tabu Search on GPU, all the job permutations are stored in global memory to successfully eliminate the occurrences of branch divergence. Nevertheless, the previous algorithm requires a large amount of global memory space, because of a lot of global memory access resulting in system performance degradation. We propose a new approach to address the problem. The main contribution of this paper is an efficient multiple-loop struct to generate most part of the permutation on the fly, which can decrease the size of permutation table and significantly reduce the amount of global memory access. Computational experiments on problems according with benchmark suite for PFSP reveal that the best performance improvement of our approach is about 100%, comparing with the previous work. 相似文献

19.

基于CUDA的BP算法并行化与实例验证

孙香玉冯百明杨鹏斐《计算机工程与应用》2013,(23):31-34,51

CUDA是应用较广的GPU通用计算模型,BP算法是目前应用最广泛的神经网络模型之一。提出了用CUDA模型并行化BP算法的方法。用该方法训练BP神经网络,训练开始前将数据传到GPU,训练开始后计算隐含层和输出层的输入输出和误差,更新权重和偏倚的过程都在GPU上实现。将该方法用于手写数字图片训练练实验,与在四核CPU上的训练相比,加速比为6．12～8．17。分别用在CPU和GPU上训练得到的结果识别相同的测试集图片,GPU上的训练结果对图片的识别率比CPU上的高0．05％～0．22％。相似文献