1.

马车平曾理《计算机工程与应用》2008,44(7):227-230

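Three-dimensional cone-beam CT image reconstruction is computationally expensive, and a pure-software (CPU-only) implementation takes a long time. To make full use of the parallel processing capability of the graphics processing unit (GPU) and to improve data-transfer efficiency, this paper studies a method that uses GPU multitexturing to accelerate FDK reconstruction for 3D cone-beam CT. The method uses multitexture mapping to speed up backprojection, reduce intermediate data storage, and reduce the number of floating-point accumulations; it uses the vertex color channel to perform distance weighting, and uses an extension method to increase the number of texture units that backproject in parallel, thereby increasing reconstruction speed. Experiments show that, on an ordinary PC, GPU backprojection for a 256³ image completes within 10 s while keeping 16-bit floating-point data precision. Compared with CPU-only reconstruction, the GPU method achieves a high speedup.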

2.

Fast Backprojection on the Tesla Platform

张峰陆利忠闫镔李磊《计算机工程》2011,37(10):275-277

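Backprojection is the most computationally intensive and time-consuming part of cone-beam CT reconstruction algorithms and is the bottleneck that limits reconstruction speed. To address this, a voxel-driven method is used under the Compute Unified Device Architecture (CUDA) model to implement parallel backprojection (BP) on the Tesla platform, and the memory accesses and math instructions of the BP computation are optimized. Reconstruction results on real CT data show that the method runs 198 times as fast as the serial CPU program, is efficient, and is easy to implement.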
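The voxel-driven idea in this abstract can be illustrated with a much simpler geometry. The following is a minimal sketch, assuming a 2D parallel-beam sinogram that has already been filtered, a detector with unit spacing centred on the rotation axis, and plain Python instead of CUDA; it is not the cone-beam kernel from the paper, and the function name `backproject` is hypothetical.

```python
import math

def backproject(sinogram, angles, size):
    """Pixel-driven backprojection of a 2D parallel-beam sinogram.

    sinogram[i][k] is detector sample k of the projection taken at angles[i].
    Assumptions (not from the paper): unit detector spacing, detector centred
    on the rotation axis, angles evenly spaced over [0, pi).
    """
    n_det = len(sinogram[0])
    det_centre = (n_det - 1) / 2.0
    img_centre = (size - 1) / 2.0
    d_theta = math.pi / len(angles)
    image = [[0.0] * size for _ in range(size)]
    # Each pixel is computed independently of all others, which is why a GPU
    # version can assign one thread per pixel (or per voxel in 3D).
    for row in range(size):
        for col in range(size):
            x = col - img_centre
            y = img_centre - row
            acc = 0.0
            for proj, theta in zip(sinogram, angles):
                # Detector coordinate hit by the ray through (x, y).
                t = x * math.cos(theta) + y * math.sin(theta) + det_centre
                k = math.floor(t)
                if 0 <= k < n_det - 1:
                    frac = t - k
                    # Linear interpolation between neighbouring detector bins.
                    acc += (1.0 - frac) * proj[k] + frac * proj[k + 1]
            image[row][col] = acc * d_theta
    return image
```

The outer loops over `row` and `col` carry no dependencies between iterations, so they are the part that would be distributed over GPU threads; the inner loop over projections is the accumulation that each thread performs on its own.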

3.

Design of a Relay Protection Tester Based on Dual CPUs

刘红海侯向华蒋云良《计算机工程》2010,36(9):257-259

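A relay protection tester based on the fast Fourier transform (FFT) algorithm is designed. The system uses a dual-CPU structure: the upper-level processor is an MCU, used mainly for the human-machine interface and parameter input, and the lower-level processor is a DSP, used mainly for FFT computation. The two processors communicate through shared memory, a form of parallel communication. By applying a message control mechanism, they can effectively reduce the number of interrupts during communication and improve the real-time performance of testing.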

4.

Implementation of a Parallel Computing Platform Based on FPGA+PCI

米文罡朱文贵李亚麟徐佩霞《电子技术应用》2006,32(6):103-105

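This paper introduces a high-speed computing platform based on the PCI bus and multiple parallel FPGAs. The FPGA+PCI boards use an ordinary PC as the CPU and are interconnected through the PCI bus, forming a parallel, high-speed, general-purpose digital computing platform. Algorithms are written in VHDL; the platform can be used for encryption and decryption, high-speed digital signal processing, and similar applications, with speed equivalent to several PCs computing in parallel.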

5.

Non-Local Means Super-Resolution Reconstruction Algorithm for Image Sequences and Its GPU Implementation

李瑶孙涛黄驰张子健《计算机应用研究》2016,33(7)

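To address the over-smoothing of edge regions in images reconstructed by the non-local means (NLM) algorithm for image-sequence super-resolution, a locally parameter-adaptive improvement is proposed. The whole image is first divided into sub-blocks, and the filter parameter for each sub-block is then computed from its average pixel information; this helps reduce the loss of high-frequency information caused by applying a single filter parameter to the whole image. Experimental results show that, compared with the classical NLM reconstruction algorithm, the improved algorithm produces images with sharper contours and edges and more legible characters. The reconstruction program is implemented on a CPU/GPU platform; the GPU-parallelized program achieves a speedup of up to 30 times over the single-CPU program, significantly shortening computation time and making the algorithm more practical for real-world use.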

6.

Compressed-Sensing Auto-Calibrating Parallel Imaging Reconstruction Algorithm

张久明郭树旭王淼石钟菲《计算机应用》2014,34(5):1491-1493

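A joint sparsity model is proposed for parallel magnetic resonance imaging reconstruction and combined with a new soft-thresholding function to improve reconstructed image quality. First, calibration data are used to generate the reconstruction kernel and reconstruct the unsampled data points; then the joint sparsity model and the new soft-thresholding function are applied to the image data of each coil; finally, an improved projection onto convex sets (POCS) algorithm reconstructs the compressed-sensing parallel MR image. For a simulated image and a brain image, the improved algorithm reduces the normalized root-mean-square error (nRMSE) of the reconstruction by 23% and 9%, respectively, relative to the original algorithm at an acceleration factor of 4. Experimental results show that the improved algorithm clearly improves the accuracy of parallel imaging reconstruction when the acceleration factor is large.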
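The abstract does not define its new soft-thresholding function, so the sketch below shows only the classical soft-thresholding operator that such variants usually start from. The function name `soft_threshold` and the threshold parameter `lam` are illustrative assumptions, not names from the paper.

```python
def soft_threshold(z, lam):
    """Classical soft-thresholding (shrinkage) of a real or complex value.

    Values with magnitude <= lam are set to zero; larger values keep their
    sign (or phase, for complex input) and have their magnitude reduced by lam.
    This is the textbook operator, not the modified one proposed in the paper.
    """
    mag = abs(z)
    if mag <= lam:
        return 0.0
    return z * (1.0 - lam / mag)

def shrink_coefficients(coeffs, lam):
    """Apply soft-thresholding to every coefficient in a flat list."""
    return [soft_threshold(c, lam) for c in coeffs]
```

Because MR image data are complex-valued, the operator shrinks the magnitude and leaves the phase untouched; for real input it reduces to the familiar sign-preserving form.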

7.

A Two-Level-Granularity Parallel Strategy for the PMVS Algorithm Using CPU Multithreading and the GPU

刘金硕江庄毅徐亚渤邓娟章岚昕《计算机科学》2017,44(2):296-301

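The PMVS (Patch-based Multi-View Stereo) 3D reconstruction algorithm is widely used for 3D scene reconstruction from UAV aerial imagery. To address its heavy computation and high time complexity, a two-level-granularity parallel strategy combining CPU multithreading and the GPU (Multithread and GPU Parallel Schema, MGPS) is proposed. The method comprises a GPU-based parallel design of feature extraction and patch expansion in PMVS, and a task-allocation mechanism that distributes multiple images between the GPU and the CPU so that total running time is minimized when some tasks run on the CPU with multithreading and the rest run in parallel on the GPU. Experiments use a high-performance server with a 24-core CPU and an NVIDIA Tesla K20 GPU to reconstruct 16 UAV images at a resolution of 4081×2993. Results show that, compared with serial PMVS, MGPS-based PMVS achieves a speedup of about 4 times, with feature extraction accelerated by up to 13 times and computational error within 10%, giving a more efficient PMVS 3D reconstruction. MGPS-based PMVS can also be applied to cultural heritage preservation, medical image processing, virtual reality, and other fields.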

8.

A Filtering Parallelization Method for the GPU-Accelerated Convolution Backprojection Algorithm

《传感器与微系统》2019,(4):69-72

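When the image to be reconstructed is large and real-time requirements are strict, convolution backprojection (CBP) reconstruction is slow and falls short of the desired speed. To address this, the principle of the CBP algorithm is studied in depth, the storage layout of projection data in the graphics processing unit (GPU) is optimized, and the parallelism available in the filtering stage is analyzed and exploited; the filtering operations are then parallelized, yielding a parallel filtering method. Simulation experiments in MATLAB show that, while preserving the accuracy and sharpness of the reconstructed image, the proposed method considerably improves the speedup of the filtering stage relative to serial convolution.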

9.

GPU-Based Edge Detection Algorithms (cited by: 1; self-citations: 0; citations by others: 1)

张楠王建立王鸣浩《计算机科学》2010,37(1):265-267

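Edge detection is a highly parallel, computationally intensive algorithm, and traditional CPU processing struggles to meet real-time requirements. Given the compute-intensive nature of image edge detection, and based on an analysis of common edge detection algorithms, a GPU (Graphics Processing Unit) implementation of image edge detection is proposed using the CUDA (Compute Unified Device Architecture) hardware and software architecture. The paper first introduces the architectural basis of the GPU's highly parallel computation and ports two representative edge detection algorithms, Roberts and Sobel, to the GPU. It then compares a CPU and a GPU of the same price at the time, using multiple images of different resolutions as test data to compare the computational efficiency of the two implementations. Results show that, compared with the CPU implementation of the same algorithm, the GPU implementation produces the same output and improves computational efficiency by up to more than 17 times, demonstrating the GPU's considerable potential in practical digital image processing.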
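For readers unfamiliar with the Sobel operator named in this abstract, here is a minimal CPU-side sketch in plain Python. It uses the standard 3×3 Sobel kernels; the function name `sobel_magnitude` and the choice to leave border pixels at zero are assumptions made for illustration, and the paper's CUDA kernels are not reproduced here.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def sobel_magnitude(image):
    """Return the Sobel gradient magnitude of a 2D list of grey values.

    Border pixels are left at 0.0 (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    # Every output pixel depends only on its own 3x3 neighbourhood, so the
    # two outer loops are what a GPU version would spread across threads.
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0.0
            for dr in range(3):
                for dc in range(3):
                    p = image[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * p
                    gy += SOBEL_Y[dr][dc] * p
            out[r][c] = math.hypot(gx, gy)
    return out
```

On a 4×4 image whose left two columns are 0 and right two columns are 10, the two interior columns both straddle the step, so the interior magnitudes are 40.0 and the border stays at 0.0.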

10.

Research on a GPU-Based Gabor Feature Extraction Algorithm for Face Images

潘峥嵘李伟池《计算机与数字工程》2013,41(4)

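To address the weak real-time performance of traditional Gabor wavelet computation in face image feature extraction, this paper proposes a Gabor wavelet feature extraction method based on GPU parallel computing. The proposed algorithm performs the convolution of Gabor wavelets with the face image in parallel on the GPU (Graphics Processing Unit), using the CUDA (Compute Unified Device Architecture) programming model and a multithreaded parallel mapping. Comparative experiments against the traditional Gabor wavelet face feature extraction algorithm show that the proposed method is nearly 12 times faster than the CPU implementation, providing effective technical support for real-time face feature extraction.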

11.

Parallelizing and optimizing large‐scale 3D multi‐phase flow simulations on the Tianhe‐2 supercomputer

Dali Li Chuanfu Xu Yongxian Wang Zhifang Song Min Xiong Xiang Gao Xiaogang Deng 《Concurrency and Computation》2016,28(5):1678-1692

The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extending temporal range require massive high-performance computing (HPC) resources, thus motivating us to port it onto modern many-core heterogeneous supercomputers like Tianhe-2. Although many-core accelerators such as graphics processing units and Intel MIC have a dramatic advantage in floating-point performance and power efficiency over CPUs, they also make it challenging to parallelize and optimize computational fluid dynamics codes on large-scale heterogeneous systems. In this paper, we parallelize and optimize the open source 3D multi-phase LBM code openlbmflow on the Intel Xeon Phi (MIC) accelerated Tianhe-2 supercomputer using a hybrid and heterogeneous MPI+OpenMP+Offload+single instruction, multiple data (SIMD) programming model. With cache blocking and SIMD-friendly data structure transformation, we dramatically improve the SIMD and cache efficiency for the single-thread performance on both CPU and Phi, achieving a speedup of 7.9X and 8.8X, respectively, compared with the baseline code. To make CPUs and Phi processors collaborate efficiently, we propose a load-balance scheme to distribute workloads among the two CPUs and three Phi processors within a node and use an asynchronous model to overlap the collaborative computation and communication as far as possible. The collaborative approach with two CPUs and three Phi processors improves the performance by around 3.2X compared with the CPU-only approach. Scalability tests show that openlbmflow can achieve a parallel efficiency of about 60% on 2048 nodes, with about 400K cores in total. To the best of our knowledge, this is the largest scale CPU-MIC collaborative LBM simulation for 3D multi-phase flow problems. Copyright © 2015 John Wiley & Sons, Ltd.

12.

Research on a GPU-Accelerated Cone-Beam CT Reconstruction Algorithm

张宾张正强王洪凯《计算机工程与应用》2019,55(4):208-213

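Cone-beam computed tomography (CBCT) offers fast acquisition and high spatial resolution and has attracted wide attention in biomedicine and other fields. However, serially processing the massive projection data of CBCT reconstruction on a CPU is very time-consuming and cannot meet real-time requirements. The development of GPUs makes parallel acceleration of CBCT reconstruction possible. The FDK algorithm is improved by exploiting the periodicity of trigonometric functions, and 12 projections are computed simultaneously in parallel on the GPU. Experimental results show that, compared with the traditional CPU-based reconstruction algorithm, the GPU-based CBCT reconstruction algorithm increases reconstruction speed by more than 310 times while preserving image quality.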

13.

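An "Enlarge the Shoe to Fit the Foot" (扩履适足) Improvement of the Backfilling Scheduling Algorithm (cited by: 2; self-citations: 0; citations by others: 2)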

付云虹白树仁方俊《计算机工程与科学》2006,28(9):94-96

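Among the many parallel job scheduling algorithms, Backfilling is widely regarded as effective at improving CPU utilization. Building on the FCFS algorithm, it backfills smaller jobs from the queue onto idle CPUs to raise CPU utilization. However, when the number of idle CPUs still cannot satisfy the backfill requirements of the small jobs, some CPUs remain idle, so utilization cannot be improved further. For parallel computer systems with a shared-memory architecture, this paper proposes an improved algorithm based on Backfilling, called "enlarge the shoe to fit the foot" (扩履适足). Based on the CPU utilization of the running jobs, the algorithm dynamically adjusts the number of CPUs assigned to running jobs to enlarge the CPU space available for backfilling, so that jobs that Backfilling could not backfill can run. This remedies the shortcoming of Backfilling and greatly improves the CPU utilization of shared-memory parallel computer systems.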
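The baseline behaviour that this abstract criticises can be sketched in a few lines. The following is a simplified illustration, assuming jobs are described only by a name and a CPU count; production backfilling schedulers also use runtime estimates and reservations so that the blocked head job is not delayed, which this sketch omits. The function name `schedule_step` is hypothetical, and the paper's improved algorithm is not implemented here.

```python
def schedule_step(queue, free_cpus):
    """Start as many queued jobs as fit on the currently free CPUs.

    queue is a list of (name, cpus_needed) in arrival order. Jobs are started
    first-come-first-served until one does not fit; later jobs that still fit
    are then backfilled past it. Returns (started, waiting, idle_cpus).
    """
    started, waiting = [], []
    blocked = False  # becomes True once some earlier job could not start
    for name, need in queue:
        if need <= free_cpus:
            started.append((name, "backfill" if blocked else "fcfs"))
            free_cpus -= need
        else:
            blocked = True
            waiting.append((name, need))
    return started, waiting, free_cpus
```

With 8 free CPUs and the queue `[("A", 4), ("B", 8), ("C", 2), ("D", 3)]`, job A starts in order, C is backfilled past B, and D still does not fit, leaving 2 CPUs idle. That leftover idle capacity is the situation the abstract describes as the limit of plain Backfilling.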

14.

OpenCL-based optimization methods for utilizing forward DCT and quantization of image compression on a heterogeneous platform

Nasser Alqudami Shin-Dug Kim 《Journal of Real-Time Image Processing》2016,12(2):219-235

Recent computer systems and handheld devices are equipped with high computing capability, such as general-purpose GPUs (GPGPUs) and multi-core CPUs. Utilizing such resources for computation has become a general trend, making their availability an important issue for the real-time aspect. Discrete cosine transform (DCT) and quantization are two major operations in image compression standards that require complex computations. In this paper, we develop an efficient parallel implementation of the forward DCT and quantization algorithms for JPEG image compression using Open Computing Language (OpenCL). This OpenCL-based parallel implementation utilizes a multi-core CPU and a GPGPU to perform DCT and quantization computations. We demonstrate the capability of this design via two proposed working scenarios. The proposed approach also applies certain optimization techniques to improve the kernel execution time and data movements. We developed an optimal OpenCL kernel for a particular device using device-based optimization factors, such as thread granularity, work-item mapping, workload allocation, and vector-based memory access. We evaluated the performance in a heterogeneous environment, finding that the proposed parallel implementation was able to speed up the execution time of the DCT and quantization by factors of 7.97 and 8.65, respectively, obtained from 1024 × 1024 and 2084 × 2048 image sizes in 4:4:4 format.
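The two operations this abstract parallelizes can be written down directly for a single 8×8 block. The sketch below is a straightforward, unoptimized reference in plain Python, assuming the usual JPEG form of the 2D DCT-II; the function names `dct_8x8` and `quantize` are hypothetical, and the uniform table in the usage note is a placeholder rather than a standard JPEG quantization table.

```python
import math

def dct_8x8(block):
    """Forward 2D DCT-II of one 8x8 block (direct O(n^4) evaluation)."""
    def scale(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * scale(u) * scale(v) * s
    return out

def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round to an integer.

    Python's round() rounds exact halves to the nearest even integer, which
    can differ from the rounding rule of a particular encoder.
    """
    return [[round(coeffs[u][v] / table[u][v]) for v in range(8)]
            for u in range(8)]
```

For a constant block of value 16, only the DC coefficient is non-zero and equals 128; dividing by a uniform placeholder table of 16 gives a quantized DC value of 8. Each 8×8 block is independent of the others, which is what makes the per-block work a natural unit for OpenCL work-groups.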

15.

An image division approach for volume ray casting in multi-threading environment

Sukhyun Lim Daesung Lee Byeong-Seok Shin 《Multimedia Tools and Applications》2014,68(2):211-223

For efficient parallel volume ray casting on recent multi-core CPUs, we propose an image-ordered approach that uses a cost function to allocate tasks evenly across processing nodes. For the first frame, we divide the image space evenly and compute a cost function. From the next frame onward, exploiting frame coherence, we divide the image space unevenly using the cost function computed for the previous frame. Conventional image-ordered parallel approaches have focused on dividing and compositing volume datasets. However, the divisions and accumulations are negligible for recent multi-core CPUs because they are performed inside one physical CPU. As a result, we can reduce the rendering time without degrading image quality by applying a cost function that reflects all time-consuming steps of volume ray casting.
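The uneven division described in this abstract can be illustrated with a simple greedy split. The sketch below assumes the image is divided into horizontal bands of rows and that a per-row cost from the previous frame (for example, measured rendering time) is available; both are assumptions of this illustration, since the abstract does not specify the division unit or the cost function. The function name `split_rows_by_cost` is hypothetical.

```python
def split_rows_by_cost(row_costs, n_workers):
    """Split image rows into n_workers contiguous bands of similar total cost.

    row_costs[i] is the cost recorded for row i in the previous frame.
    Returns a list of (start, end) row ranges, end exclusive. The split is
    greedy: a band is closed as soon as the running cost reaches that band's
    cumulative share of the total. A band can be empty if almost all of the
    cost sits in the last rows.
    """
    total = float(sum(row_costs))
    bands, start, running = [], 0, 0.0
    for row, cost in enumerate(row_costs):
        running += cost
        share = total * (len(bands) + 1) / n_workers
        if len(bands) < n_workers - 1 and running >= share:
            bands.append((start, row + 1))
            start = row + 1
    bands.append((start, len(row_costs)))
    return bands
```

With uniform costs the bands come out equal in size, which matches the even division used for the first frame. With costs `[1, 1, 1, 1, 1, 1, 3, 3]` and two workers, the split is six rows against two rows, each band carrying a cost of 6.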

16.

MPtoStream: an OpenMP compiler for CPU-GPU heterogeneous parallel systems

《中国科学:信息科学(英文版)》2012,(9):1961-1971

In light of GPUs' powerful floating-point operation capacity, heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing (HPC). However, due to the complexity of programming on GPUs, porting a large number of existing scientific computing applications to the heterogeneous parallel systems remains a big challenge. The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing. To effectively inherit existing OpenMP applications and reduce the transplant cost, we extend OpenMP with a group of compiler directives, which explicitly divide tasks among the CPU and the GPU, and map time-consuming computing fragments to run on the GPU, thus dramatically simplifying the transplantation. We have designed and implemented MPtoStream, a compiler of the extended OpenMP for AMD's stream processing GPUs. Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system, incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU, over the execution on the Xeon CPU alone.

17.

Parallel computing of 3D smoking simulation based on OpenCL heterogeneous platform

Zhiyong Yuan Weixin Si Xiangyun Liao Zhaoliang Duan Yihua Ding Jianhui Zhao 《The Journal of supercomputing》2012,61(1):84-102

Open Computing Language (OpenCL) is an open, royalty-free standard for general-purpose parallel programming across Central Processing Units (CPUs), Graphics Processing Units (GPUs), and other processors. This paper uses OpenCL to implement real-time smoking simulation in a virtual surgery training simulation system. First, Computational Fluid Dynamics (CFD) is adopted to construct the real-time smoking simulation model based on the Navier-Stokes (N-S) equations of an incompressible fluid under normal temperature and pressure. Then we propose a parallel computing technique based on OpenCL to carry out the parallel computation of the smoking simulation model on the CPU and the GPU, respectively. Finally, we render the smoke in real time using a three-dimensional (3D) texture volume rendering method. Experimental results show that the proposed parallel computing technique achieves satisfactory image quality and rendering rate on both the CPU and the GPU.

18.

A Distributed PTX Virtual Machine on Hybrid CPU/GPU Clusters

《Journal of Systems Architecture》2016

Hybrid CPU/GPU clusters have recently drawn much attention in high performance computing because of their excellent execution performance and energy efficiency. Many supercomputing sites in the newest TOP 500 and Green 500 lists are built from hybrid CPU/GPU clusters instead of CPU clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high that most users hesitate to move to this new cluster computing platform. To resolve this problem, we propose in this paper a distributed PTX virtual machine called BigGPU for heterogeneous clusters. As its name suggests, this virtual machine is physically a distributed system that re-compiles and executes PTX code in parallel by aggregating the CPUs and GPUs available in a computational cluster. With the support of this virtual machine, users can regard a hybrid CPU/GPU cluster as a single large-scale GPU. Consequently, they can develop applications using only CUDA, without combining MPI and multithreading APIs, while simultaneously using distributed CPUs and GPUs to solve the same problem. Moreover, they need not handle load balancing among heterogeneous processors or the constraints on device memory and thread configuration that exist in physical GPUs, because BigGPU supports a large-scale virtual device memory space and thread configuration. We have also evaluated the execution performance of BigGPU. Our experimental results show that BigGPU can indeed effectively exploit the computational power of CPUs and GPUs to enhance the execution performance of users' CUDA programs.

19.

Design and Implementation of an FPGA-Based High-Speed Image Compression Encoder

姚远张晓林《电子技术应用》2009,35(7)

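To solve the problem of real-time compression of high-resolution remote sensing and medical images, a listless wavelet zerotree compression algorithm suited to FPGA implementation is proposed; by splitting the work into a preprocessing stage and a main processing stage, it realizes a parallel pipelined coding structure. The algorithm was verified on Altera's DE3 development platform, achieving a throughput of 200 Mpixel/s, which supports real-time encoding of 4096×2048 grayscale images at 25 frames/s.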

20.

Simulation Study of Super-Resolution Image Reconstruction (cited by: 9; self-citations: 0; citations by others: 9)

刘良云李英才相里斌《中国图象图形学报》2001,6(7):629-635

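When a CCD camera images scenes rich in spatial frequencies, the limited CCD pixel size leads to low image resolution and sometimes severe aliasing, especially in infrared cameras. Super-resolution image reconstruction uses the redundant information in multiple repeated frames to reconstruct a super-resolution image, eliminating or reducing aliasing. This paper studies the key techniques, including accurate estimation of image micro-displacement and micro-rotation angle, the camera model, and the super-resolution image reconstruction algorithm; designs a sequential-subset conjugate gradient optimization algorithm; and presents an algorithm and results that increase resolution by a factor of 5. The technique is highly relevant to spaceborne and airborne image fusion (particularly for images acquired by infrared staring imaging systems), and could raise the resolution of aerospace or aerial images by roughly 2 to 5 times.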