首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Sliding Window Operations (SWOs) are widely used in image processing applications. They often have to be performed repeatedly across the target image, which can demand significant computing resources when processing large images with large windows. In applications in which real-time performance is essential, running these filters on a CPU often fails to deliver results within an acceptable timeframe. The emergence of sophisticated graphic processing units (GPUs) presents an opportunity to address this challenge. However, GPU programming requires a steep learning curve and is error-prone for novices, so the availability of a tool that can produce a GPU implementation automatically from the original CPU source code can provide an attractive means by which the GPU power can be harnessed effectively. This paper presents a GPU-enabled programming model, called GSWO, which can assist GPU novices by converting their SWO-based image processing applications from the original C/C++ source code to CUDA code in a highly automated manner. This model includes a new set of simple SWO pragmas to generate GPU kernels and to support effective GPU memory management. We have implemented this programming model based on a CPU-to-GPU translator (C2GPU). Evaluations have been performed on a number of typical SWO image filters and applications. The experimental results show that the GSWO model is capable of efficiently accelerating these applications, with improved applicability and a speed-up of performance compared to several leading CPU-to-GPU source-to-source translators.  相似文献   

2.
Recently graphic processing units (GPUs) are rising as a new vehicle for high-performance, general purpose computing. It is attractive to unleash the power of GPU for Electronic Design Automation (EDA) computations to cut the design turn-around time of VLSI systems. EDA algorithms, however, generally depend on irregular data structures such as sparse matrix and graphs, which pose major challenges for efficient GPU implementations. In this paper, we propose high-performance GPU implementations for a set of important irregular EDA computing patterns including sparse matrix, graph algorithms and message-passing algorithms. In the sparse matrix domain, we solve a core problem, sparse-matrix vector product (SMVP). On a wide range of EDA problem instances, our SMVP implementation outperforms all prior work and achieves a speedup up to 50× over the CPU baseline implementation. The GPU based SMVP procedure is applied to successfully accelerate two core EDA computing engines, timing analysis and linear system solution. In the graph algorithm domain, we developed a SMVP based formulation to efficiently solve the breadth-first search (BFS) problem on GPUs. We also developed efficient solutions for two message-passing algorithms, survey propagation (SP) based SAT solution and a register-transfer level (RTL) simulation. Our results prove that GPUs have a strong potential to accelerate EDA computing through designing GPU-friendly algorithms and/or re-organizing computing structures of sequential algorithms.  相似文献   

3.
GPU Computing   总被引:9,自引:0,他引:9  
The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications.  相似文献   

4.
The next generation DVB-T2, DVB-S2, and DVB-C2 standards for digital television broadcasting specify the use of low-density parity-check (LDPC) codes with codeword lengths of up to 64800 bits. The real-time decoding of these codes on general purpose computing hardware is useful for completely software defined receivers, as well as for testing and simulation purposes. Modern graphics processing units (GPUs) are capable of massively parallel computation, and can in some cases, given carefully designed algorithms, outperform general purpose CPUs (central processing units) by an order of magnitude or more. The main problem in decoding LDPC codes on GPU hardware is that LDPC decoding generates irregular memory accesses, which tend to carry heavy performance penalties (in terms of efficiency) on GPUs. Memory accesses can be efficiently parallelized by decoding several codewords in parallel, as well as by using appropriate data structures. In this article we present the algorithms and data structures used to make log-domain decoding of the long LDPC codes specified by the DVB-T2 standard??at the high data rates required for television broadcasting??possible on a modern GPU. Furthermore, we also describe a similar decoder implemented on a general purpose CPU, and show that high performance LDPC decoders are also possible on modern multi-core CPUs.  相似文献   

5.
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~?24 ms using an NVidia GeForce 8800 GTX and in ~?2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.  相似文献   

6.
Graphics Processing Units for Handhelds   总被引:2,自引:0,他引:2  
During the past few years, mobile phones and other handheld devices have gone from only handling dull text-based menu systems to, on an increasing number of models, being able to render high-quality three-dimensional graphics at high frame rates. This paper is a survey of the special considerations that must be taken when designing graphics processing units (GPUs) on such devices. Starting off by introducing desktop GPUs as a reference, the paper discusses how mobile GPUs are designed, often with power consumption rather than performance as the primary goal. Lowering the bus traffic between the GPU and the memory is an efficient way of reducing power consumption, and therefore some high-level algorithms for bandwidth reduction are presented. In addition, an overview of the different APIs that are used in the handheld market to handle both two-dimensional and three-dimensional graphics is provided. Finally, we present our outlook for the future and discuss directions of future research on handheld GPUs.  相似文献   

7.
Applications of virtual reality (VR) and augmented reality (AR) in medicine require real-time visualization and modeling of large three-dimensional data sets. Consequently, these applications require powerful computation, extensive high-bandwidth memory, and fast communication links. In the past, the manufacturers of medical imaging equipment produced their own special-purpose proprietary hardware for image processing and solid graphics. Due to the developments in computer hardware in general and in graphics accelerators in particular, there is a trend toward replacing the proprietary hardware off-the-shelf (OTS) equipment. Computer graphics itself has advanced in its quest for realism. Generic algorithms such as shading, texture mapping, and volume rendering have been developed to meet the resultant ever increasing requirements. Advances in both the OTS CPU and graphics hardware have enabled real-time implementations of these algorithms, thereby facilitating many of the medical VR/AR applications used today. The development of graphics libraries such as OpenGL has also been an important factor. These libraries provide an underlying portable software platform that optimizes the utilization of the available graphics hardware. OpenGL has become a standard graphics application programming interface, particularly for graphics-intensive applications, and more and more OTS systems provide hardware implementations of OpenGL commands. The review paper follows the evolution of these technologies and examines their crucial role in enabling the appearance of the current VR/AR applications in medicine and provides a look at current trends and future possibilities  相似文献   

8.
真三维活动视频数据的优化研究   总被引:1,自引:0,他引:1  
江寅川  袁杰 《现代电子技术》2012,35(8):116-119,126
提出了一种基于点阵的真三维视频显示技术,该系统利用LED为单元节点组成三维空间阵列,用于显示真三维活动影像。由于数据量巨大,为了加快处理速度,利用CUDA编程模型对计算过程进行优化,把处理过程中可以并行计算的部分交由GPU执行。先把要处理的视频数据传到内存中,由CPU进行一些预处理,然后传到显存,由GPU对视频运动过程等进行处理,处理完后再传到内存,由CPU进行一些后续处理,最终把处理后的数据传出加以显示或存储。通过比较仅由CPU处理与用GPU优化后的计算时间,发现优化后计算速度比优化前快了几十到几百倍,而且数据量越大,优化效果越好,核心多的GPU所得到的加速比大,最后在实验部分给出了用OpenGL仿真的结果。  相似文献   

9.
图形处理器协同运算的视频处理架构   总被引:1,自引:0,他引:1  
多媒体视频处理的任务繁重,计算量大,很多算法无法在仅使用一颗CPU的条件下达到实时处理的速度。设计一套图形处理器协同运算的视频处理架构,它采用图形处理器与中央处理器配合,共同完成视频计算的任务。这种架构可以大大加速处理速度,并减轻中央处理器的负担。  相似文献   

10.
星图配准是星图处理应用中的一个重要步骤,因此星图配准的速度直接影响了星图处理的整体速度.近几年来,图形处理器(GPU)在通用计算领域得到快速的发展.结合GPU在通用计算领域的优势与星图配准面临的处理速度的问题,研究了基于GPU加速处理星图配准的算法.在已有配准算法的基础上,根据算法特点提出了相应的GPU并行设计模型,利用CUDA编程语言进行仿真实验.实验结果表明:相较于传统基于CPU的配准算法,基于GPU的并行设计模型同样达到了配准要求,且配准速度的加速比达到29.043倍.  相似文献   

11.
We describe the implementation of monaural audio source separation algorithms in our toolkit openBliSSART (Blind Source Separation for Audio Recognition Tasks). To our knowledge, it provides the first freely available C+ + implementation of Non-Negative Matrix Factorization (NMF) supporting the Compute Unified Device Architecture (CUDA) for fast parallel processing on graphics processing units (GPUs). Besides integrating parallel processing, openBliSSART introduces several numerical optimizations of commonly used monaural source separation algorithms that reduce both computation time and memory usage. By illustrating a variety of use-cases from audio effects in music processing to speech enhancement and feature extraction, we demonstrate the wide applicability of our application framework for a multiplicity of research and end-user applications. We evaluate the toolkit by benchmark results of the NMF algorithms and discuss the influence of their parameterization on source separation quality and real-time factor. In the result, the GPU parallelization in openBliSSART introduces double-digit speedups with respect to conventional CPU computation, enabling real-time processing on a desktop PC even for high matrix dimensions.  相似文献   

12.
Maximum intensity projections (MIPs) are an important visualization technique for angiographic data sets. Efficient data inspection requires frame rates of at least five frames per second at preserved image quality. Despite the advances in computer technology, this task remains a challenge. On the one hand, the sizes of computed tomography and magnetic resonance images are increasing rapidly. On the other hand, rendering algorithms do not automatically benefit from the advances in processor technology, especially for large data sets. This is due to the faster evolving processing power and the slower evolving memory access speed, which is bridged by hierarchical cache memory architectures. In this paper, we investigate memory access optimization methods and use them for generating MIPs on general-purpose central processing units (CPUs) and graphics processing units (GPUs), respectively. These methods can work on any level of the memory hierarchy, and we show that properly combined methods can optimize memory access on multiple levels of the hierarchy at the same time. We present performance measurements to compare different algorithm variants and illustrate the influence of the respective techniques. On current hardware, the efficient handling of the memory hierarchy for CPUs improves the rendering performance by a factor of 3 to 4. On GPUs, we observed that the effect is even larger, especially for large data sets. The methods can easily be adjusted to different hardware specifics, although their impact can vary considerably. They can also be used for other rendering techniques than MIPs, and their use for more general image processing task could be investigated in the future.  相似文献   

13.
The second generation terrestrial TV broadcasting standard from the digital video broadcasting (DVB) project, DVB-T2, has recently been standardized. In this article we perform a complexity analysis of our software defined implementation of the modulator/demodulator parts of a DVB-T2 transmitter and receiver. First we describe the various stages of a DVB-T2 modulator and demodulator, as well as how they have been implemented in our system. We then perform an analysis of the computational complexity of each signal processing block. The complexity analysis is performed in order to identify the blocks that are not feasible to run in realtime on a general purpose processor. Furthermore, we discuss implementing these computationally heavy blocks on other architectures, such as GPUs (graphics processing units) and FPGAs (field-programmable gate arrays), that would still allow them to be implemented in software and thus be easily reconfigurable.  相似文献   

14.
为了满足机载显示器图形生成系统的小尺寸、低功耗需求,提出了一种基于龙芯2K1000的全国产图形生成与处理方案。该方案以2K1000为核心构建硬件平台,使用2K1000内部集成的中央处理器、图形处理器、显示控制器,以及外部的内存实现图形的计算生成和双缓存显示,配合国产可编程逻辑器件实现图形与外视频的叠加显示及机载通信总线扩展。软件上采用航空专用的国产天脉操作系统,基于天脉操作系统设计了关键的图形显示驱动、帧存驱动、显示控制器驱动。实验结果表明,在输出1 024 pixel×768 pixel分辨率显示时,典型机载图形画面帧率达到35 frame/s,整体功耗约10 W。该方案扩展性强、功耗低,满足实时显示需求,在机载显示器领域有着广泛的应用空间。  相似文献   

15.
GPUs can provide powerful computing ability especially for data parallel applications, such as video/image processing applications. However, the complexity of GPU system makes the optimization of even a simple algorithm difficult. Different optimization methods on a GPU often lead to different performances. The matrix–vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the implementations of GEMV in CUBLAS or MAGMA are not efficient, especially for small or fat matrix. In this paper, we propose a novel register blocking method to optimize GEMV on GPU architecture. This new method has three advantages. First, instead of using only one thread, we use a warp to compute an element of vector y so that the method can exploit the highly parallel GPU architecture. Second, the register blocking method is used to reduce the requirement of off-chip memory bandwidth. At last, the memory access order is elaborately arranged for the threads in one warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes. The performance of the register blocking method with different block sizes is also evaluated in the experiment. Experiment results show that the new method can achieve very high speedup for small square matrices and fat matrices compared to CUBLAS or MAGMA, and can also achieve higher performance for large square matrices.  相似文献   

16.
This paper describes the architecture, functionality, and design of NX-2700, a digital television and media processor chip from Philips Semiconductors. The NX-2700 is the second generation of an architectural family of programmable multimedia processors targeted at the digital television (DTV) markets, including the United States Advanced Television Systems Committee (ATSC) DTV-standard-based applications. The chip not only supports all of the 18 ATSC formats from standard-definition to wide-angle, high-definition video, but has also the power to handle high-definition television (HDTV) video and audio source decoding (high-level MPEG-5 AC-3 and ProLogic audio, closed captioning, etc.) as well as the flexibility to process advanced interactive services. NX-2700 is a programmable processor with a very powerful, general-purpose very long instruction word (VLIW) central processing unit (CPU) core that implements many nontrivial multimedia algorithms, coordinates all on-chip activities, and runs a small real-time operating system. The CPU core, aided by an array of peripheral devices (multimedia coprocessors and input-output units) and high-performance buses, facilitates concurrent processing of audio, video, graphics, and communication-data  相似文献   

17.
Including multiple cores on a single chip has become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with hundreds of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. In addition, memory access will continue to be an important limiting factor on achieving performance, and heterogeneous systems may make use of cores with varying capabilities and performance characteristics. An appropriate programming model can address scalability and can expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores. We survey and evaluate a range of multicore processor architectures and programming models with a focus on GPUs and the Cell BE processor. These processors have a large number of cores and are available to consumers today, but the scalable programming models developed for them are also applicable to current and future multicore CPUs.  相似文献   

18.

Recent advances in general-purpose graphics processing units (GPGPUs) have resulted in massively parallel hardware that is widely available to achieve high performance in desktop, notebook, and even mobile computer systems. While multicore technology has become the norm of modern computers, programming such systems requires the understanding of underlying hardware architecture and hence posts a great challenge for average programmers, who might be professionals in specific domains, but not experts in parallel programming. This paper presents a GUI tool called GPUBlocks that can facilitate parallel programming on multicore computer systems. GPUBlocks is developed based on the OpenBlocks framework, an extendable tool for graphical programming, to construct the GUI-based programming environment for CUDA and OpenCL parallel computing platforms. Programmers simply need to drag-n-drop blocks, fill the fields of the blocks, and connect them according to array or matrix computations that are specified by algorithms. GPUBlocks can then translate block-based code to CUDA or OpenCL programs. Furthermore, a couple of optimization constructs have also been offered for rapid program optimization. Experimental results have shown that the generated CUDA and OpenCL programs can achieve reasonable speedups on GPUs. Consequently, GPUBlocks can be used as a tool for fast prototyping of GPU applications or a platform for educational parallel programming.

  相似文献   

19.
A 250-MHz single-chip multiprocessor, which can implement multichannel decoding, encoding, and transcoding of various audio and video standards, was fabricated using 0.25-μm CMOS technology and consumes 2.38 W at 2.5 V. The multiprocessor integrates four processors and 64-kB shared level-2 cache and exploits coarse-grained parallelism inherent in audio and video signal processing with multithreaded programming. Three coprocessors and scratch-pad memory have been added to each processing element and perform subword parallel processing, background data transfer, and bitstream processing for audio and video signal processing. Useful-skew and clock gating have been utilized to achieve high-speed operation and low power consumption. Consequently, the multiprocessor achieves MPEG2 (MP@HL) video decoding at 20 frames/s  相似文献   

20.
The availability of international standards based video codecs for inexpensive platforms is reducing the cost of desk-top video systems to a level which makes their wide deployment possible. The current systems, however, are tuned for processing a single video stream only and their use in multiple video applications, such as continuous presence video teleconferencing and digital picture-in-picture, is difficult and inefficient. In this paper, we discuss two systems for efficient generation and display of multi-resolution, multiple compressed video material. Based on simple additions to standard codecs, both systems are capable of generating multiple bit streams corresponding to different resolutions of a single video source and they can decode such streams from multiple sources and display the resulting video sequences simultaneously. Both systems provide multiple video functionality to standard codecs with almost no extra memory or processing power requirements. Also, they make it possible for the users to customize the placement and size of their video windows, just as text and graphics windows can be manipulated.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号