Similar Documents
1.
The next generation DVB-T2, DVB-S2, and DVB-C2 standards for digital television broadcasting specify the use of low-density parity-check (LDPC) codes with codeword lengths of up to 64800 bits. The real-time decoding of these codes on general purpose computing hardware is useful for completely software defined receivers, as well as for testing and simulation purposes. Modern graphics processing units (GPUs) are capable of massively parallel computation, and can in some cases, given carefully designed algorithms, outperform general purpose CPUs (central processing units) by an order of magnitude or more. The main problem in decoding LDPC codes on GPU hardware is that LDPC decoding generates irregular memory accesses, which tend to carry heavy performance penalties (in terms of efficiency) on GPUs. Memory accesses can be efficiently parallelized by decoding several codewords in parallel, as well as by using appropriate data structures. In this article we present the algorithms and data structures used to make log-domain decoding of the long LDPC codes specified by the DVB-T2 standard, at the high data rates required for television broadcasting, possible on a modern GPU. Furthermore, we also describe a similar decoder implemented on a general purpose CPU, and show that high performance LDPC decoders are also possible on modern multi-core CPUs.
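The codeword-batching idea can be sketched outside GPU code: if the channel log-likelihood ratios of B codewords are stored side by side, every message update in a min-sum decoder becomes a wide, contiguous operation over the whole batch. A minimal NumPy sketch using a hypothetical toy parity-check matrix (far smaller than any DVB-T2 code):

```python
import numpy as np

# Toy parity-check matrix (rows = checks, cols = bits); a hypothetical
# example for illustration, not a DVB-T2 code.
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1]], dtype=np.int8)

def minsum_decode_batch(llr, H, iters=10):
    """Min-sum LDPC decoding of B codewords at once.

    llr: (B, n) channel LLRs (positive = bit 0). Every update below is
    vectorized over the batch dimension -- the layout trick that turns
    irregular per-codeword accesses into wide contiguous ones.
    """
    B, n = llr.shape
    m = H.shape[0]
    c2v = np.zeros((m, n, B))              # check-to-variable messages
    for _ in range(iters):
        total = llr.T + c2v.sum(axis=0)    # (n, B) posterior LLRs
        v2c = (total[None, :, :] - c2v) * H[:, :, None]
        for j in range(m):
            idx = np.flatnonzero(H[j])
            msgs = v2c[j, idx]             # (deg, B)
            sgn = np.where(msgs < 0, -1.0, 1.0)
            prod_sgn = sgn.prod(axis=0)
            mag = np.abs(msgs)
            srt = np.sort(mag, axis=0)
            m1, m2 = srt[0], srt[1]        # two smallest magnitudes
            for t, i in enumerate(idx):
                # min over the other edges: m2 where this edge holds the min
                other_min = np.where(mag[t] == m1, m2, m1)
                c2v[j, i] = prod_sgn * sgn[t] * other_min
    total = llr.T + c2v.sum(axis=0)
    return (total < 0).astype(np.int8).T   # (B, n) hard decisions
```

On a GPU the batch dimension maps naturally onto adjacent threads, so the same layout yields coalesced memory accesses.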

2.
Recently, programming tools have become available to researchers and scientists that allow the use of video cards for general-purpose calculations in computational electromagnetics applications. Over the past few years, developments in the field of graphics processing units (GPUs) for video cards have vastly outpaced their general central processing unit (CPU) counterparts. As specifically applied to vector mathematical operations, the newest generation of GPUs can generally outperform current CPU architectures by a wide margin. With the addition of large onboard memory units with significantly higher memory bandwidth than that of the main system, graphics cards can be utilized as highly efficient vector mathematics coprocessors. In the past, this power has been harnessed by writing low-level assembly code for the video cards. Recently, new tools have become available to make programming possible in high-level languages. By formulating proper procedures to realize general vector computations on the GPU, it is possible to increase the processing power available by at least an order of magnitude compared to the current generation of CPUs.

3.
Discrete Fourier transform on multicore
This article gives an overview on the techniques needed to implement the discrete Fourier transform (DFT) efficiently on current multicore systems. The focus is on Intel-compatible multicores, but we also discuss the IBM Cell and, briefly, graphics processing units (GPUs). The performance optimization is broken down into three key challenges: parallelization, vectorization, and memory hierarchy optimization. In each case, we use the Kronecker product formalism to formally derive the necessary algorithmic transformations based on a few hardware parameters. Further code-level optimizations are discussed. The rigorous nature of this framework enables the complete automation of the implementation task as shown by the program generator, Spiral. Finally, we show and analyze DFT benchmarks of the fastest libraries available for the considered platforms.
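The Kronecker product formalism mentioned above can be checked numerically on a tiny case: the radix-2 Cooley-Tukey step writes DFT_4 as a product of two Kronecker products, a twiddle diagonal, and a stride permutation. A small NumPy verification of this standard identity (not code from the article):

```python
import numpy as np

# Radix-2 Cooley-Tukey as a Kronecker-product factorization (n = 4):
#   DFT_4 = (DFT_2 kron I_2) . T . (I_2 kron DFT_2) . L
# where T is the twiddle diagonal and L the stride-2 (even/odd) permutation.
F2 = np.array([[1, 1], [1, -1]], dtype=complex)
I2 = np.eye(2)
T = np.diag([1, 1, 1, np.exp(-2j * np.pi / 4)])   # twiddle factors
L = np.eye(4)[[0, 2, 1, 3]]                       # picks x0, x2, x1, x3

DFT4 = np.kron(F2, I2) @ T @ np.kron(I2, F2) @ L

x = np.arange(4, dtype=complex)
assert np.allclose(DFT4 @ x, np.fft.fft(x))
```

Rewriting each factor for a given core count and vector width is exactly the kind of transformation Spiral derives automatically.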

4.
An illumination adjustable image (IAI) contains a large number of pre-recorded reference images captured under various lighting directions, and thus describes the appearance of a scene illuminated from those directions. Synthesized images of the scene under complex lighting conditions are generated from the reference images. This paper presents practical, real-time rendering methods for IAIs based on spherical Gaussian kernel functions (SGKFs). The lighting property of an IAI is represented by a small number of lightmaps. With these lightmaps, consumer-level graphics processing units (GPUs) can be used to perform the rendering process. Rendering methods for a directional light source, a point light source and a slide projector are discussed. Compared with the conventional spherical harmonic (SH) approach, the proposed SGKF approach offers similar distortion performance but consumes less graphics memory and renders faster.

5.
Novel algorithmic features of multimedia applications and advances in VLSI technologies are the driving forces behind new multimedia signal processors. We propose an architecture platform that provides high performance and flexibility while requiring less external I/O and memory access. It comprises array processors used as hardware accelerators and RISC cores used as the basis of the programmable processor. This hierarchical and scalable architecture style facilitates the hardware-software codesign of multimedia signal processing circuits and systems. While some control-intensive functions can be implemented on the programmable CPUs, computation-intensive functions can rely on the hardware accelerators. To compile multimedia algorithms, we also present an operation placement and scheduling scheme suitable for the proposed architectural platform. Our scheme addresses data reusability and exploits local communication in order to avoid the memory/communication bandwidth bottleneck, which leads to faster program execution. Our method shows promising performance: a linear speed-up of 16 times can be achieved for the block-matching motion estimation algorithm and the true motion tracking algorithm, which underlie many multimedia applications (e.g., MPEG-2 and MPEG-4).
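Block-matching motion estimation, the benchmark cited above, is an exhaustive search over candidate displacements minimizing a sum of absolute differences. A scalar Python sketch with hypothetical parameters (block size, search radius), not the paper's accelerated implementation:

```python
import numpy as np

def block_match(ref, cur, bx, by, bsize=4, radius=2):
    """Full-search block matching: find the displacement in `ref` that
    best matches the block of `cur` at (by, bx), minimizing the sum of
    absolute differences (SAD)."""
    block = cur[by:by + bsize, bx:bx + bsize].astype(int)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                continue
            cand = ref[y:y + bsize, x:x + bsize].astype(int)
            sad = np.abs(cand - block).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

Every candidate reuses most of the pixels of its neighbors, which is precisely the data reusability and local communication the proposed placement scheme exploits.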

6.
Single GPU scaling is unable to keep pace with the soaring demand for high throughput computing. As such, executing an application on multiple GPUs connected through an off-chip interconnect will become an attractive option to explore. However, much current code is written for a single GPU system. Porting such code for execution on multiple GPUs is a difficult task. In particular, it requires programmer effort to determine how data is partitioned across the multiple GPU cards and then to launch each thread block on the card holding most of the data it accesses; otherwise, cross-card data movement becomes an expensive operation. In this work we explore hardware support to efficiently parallelize a single GPU code for execution on multiple GPUs. In particular, our approach focuses on minimizing the number of remote memory accesses across the off-chip network without burdening the programmer with data partitioning and workload assignment. We propose a data-location aware thread block scheduler that schedules each thread block on the GPU that holds most of its input data. The scheduler exploits the well-known observation that GPU workloads tend to launch a kernel multiple times iteratively to process large volumes of data, and that the memory accesses of a thread block across different iterations of a kernel launch exhibit correlated behavior. Our data-location aware scheduler exploits this predictability to track the memory access affinity of each thread block to a specific GPU card, and stores this information to make scheduling decisions for future iterations. To further reduce the number of remote accesses, we propose a hybrid mechanism that migrates or copies pages between the memories of the GPUs based on their access behavior, so that most memory accesses are to local GPU memory.
Over an architecture consisting of two GPUs, our proposed schemes improve performance by 1.55× compared to single GPU execution across the widely used Rodinia [17], Parboil [18], and Graph [23] benchmarks.

7.
Time-resolved three-dimensional (3D) echocardiography generates four-dimensional (3D+time) data sets that bring new possibilities to clinical practice. The image quality of four-dimensional (4D) echocardiography is, however, regarded as poorer than that of conventional echocardiography, where time-resolved 2D imaging is used. Advanced image filtering methods can be used to improve image quality, but at the cost of heavy data processing. The recent development of graphics processing units (GPUs) enables highly parallel general-purpose computations that considerably reduce the computational time of advanced image filtering methods. In this study, multidimensional adaptive filtering of 4D echocardiography was performed using GPUs. Filtering was done using multiple kernels implemented in OpenCL (Open Computing Language) working on multiple subsets of the data. Our results show a substantial speed increase of up to 74 times, resulting in a total filtering time of less than 30 s on a common desktop. This implies that advanced adaptive image processing can be accomplished in conjunction with a clinical examination. Since the presented GPU method scales linearly with the number of processing elements, we expect it to continue scaling with the expected future increases in the number of processing elements. This should be contrasted with the growth in data set sizes expected in the near future from further improvements in ultrasound probes and measuring devices. It is concluded that GPUs facilitate the use of demanding adaptive image filtering techniques that in turn enhance 4D echocardiographic data sets. The presented general methodology for implementing parallelism on GPUs is also applicable to other medical modalities that generate multidimensional data.

8.
Image-based rendering (IBR) systems create photorealistic views of complex 3D environments by resampling large collections of images captured in the environment. The quality of the resampled images increases significantly with image capture density. Thus, a significant challenge in interactive IBR systems is to provide both fast image access along arbitrary viewpoint paths and efficient storage of large image data sets. We describe a spatial image hierarchy combined with an image compression scheme that meets the requirements of interactive IBR walkthroughs. By using image warping and exploiting image coherence over the image capture plane, we achieve compression performance similar to traditional motion-compensated schemes, e.g., MPEG, yet allow image access along arbitrary paths. Furthermore, by exploiting graphics hardware for image resampling, we can achieve interactive rates during IBR walkthroughs.

9.
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of their inefficient strided memory access patterns, which need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory-access-efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform's architecture, particularly the memory subsystem, to fully realize their performance and energy efficiency potential. Using the Kronecker product formalism, we integrate our optimizations into the Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral-generated Pareto-optimal designs can achieve close to the theoretical peak performance of the targeted platform, offering 6× and 6.5× improvements in system performance and power efficiency, respectively, over conventional row-column FFT algorithms.
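The row-column baseline that the paper compares against is simply one pass of 1-D FFTs per dimension; the second pass walks memory with a large stride, which is exactly the access pattern the block data layouts are designed to avoid. A NumPy illustration of the standard decomposition (not the paper's code):

```python
import numpy as np

def fft2_row_column(a):
    """Row-column 2D FFT: 1-D FFTs over rows, then over columns.
    The column pass is the strided (DRAM-unfriendly) one."""
    rows = np.fft.fft(a, axis=1)       # unit-stride pass
    return np.fft.fft(rows, axis=0)    # strided pass
```

Because the 2-D DFT is separable, the result matches a library 2-D FFT exactly; the restructured algorithms change only the data movement, not the transform.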

10.
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using an NVidia GeForce GTX 580.
In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.

11.
With the continuous evolution of semiconductor process technology, it is now possible to integrate tens or hundreds of processors in a single chip to make a multiprocessor system-on-chip (MPSoC), or multicore platform. There are many dual- and quad-core CPUs and 100+-core graphics processing units (GPUs) on the desktop computer market, and many MPSoC solutions in the embedded computing market. A key benefit of multicore platforms is scalability in performance and power.

12.
Optimal bounding ellipsoid (OBE) algorithms comprise a class of novel recursive identification methods for affine-in-parameters system and signal models. OBE algorithms exploit a priori knowledge of bounds on model disturbances to derive a monotonically nonincreasing set of solutions that are feasible in light of the observations and the model structure. In turn, these sets admit criteria for ascertaining the information content of incoming observations, obviating the expense of updating when data are redundant. Relative to classical recursive methods for this task, OBE algorithms are efficient, robust, and exhibit remarkable tracking ability, rendering them particularly attractive for real-time signal processing and control applications. After placing the OBE algorithms in the hierarchy of the broader set-membership identification methods, this article introduces the underlying set-theoretic concepts, compares and contrasts the various published OBE algorithms including the motivation for each development, then concludes with some illustrations of OBE algorithm performance. More recent work on the use of OBE processing in filtering tasks is also included in the discussion. The paper is a survey of a broad and evolving topic, and extensive references to further information are included. J.R.D. was supported in part by the National Science Foundation under Cooperative Agreement no. IBIS-9817485 and under grant no. MIP-9016734. Y.F.H. was supported in part by the NSF under grant no. 97-05173. Opinions, findings, or recommendations expressed are those of the authors and do not necessarily reflect the views of the NSF.

13.
GPUs provide powerful computing ability, especially for data-parallel applications such as video/image processing. However, the complexity of GPU systems makes optimizing even a simple algorithm difficult, and different optimization methods on a GPU often lead to very different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the implementations of GEMV in CUBLAS and MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on GPU architectures. This new method has three advantages. First, instead of using only one thread, we use a warp to compute each element of the vector y, so that the method can exploit the highly parallel GPU architecture. Second, register blocking is used to reduce the required off-chip memory bandwidth. Finally, the memory access order is carefully arranged for the threads in one warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes, and the performance of the register blocking method with different block sizes is also evaluated experimentally. Experimental results show that the new method achieves very high speedups for small square matrices and fat matrices compared to CUBLAS and MAGMA, and also achieves higher performance for large square matrices.
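The warp-per-element idea can be imitated in plain NumPy: each of 32 "lanes" accumulates a strided partial dot product (consecutive lanes touching consecutive elements of the row, i.e. coalesced access), followed by a warp-wide reduction. A sketch of the access pattern as described, not actual GPU code:

```python
import numpy as np

WARP = 32  # threads per warp

def gemv_warp_per_row(A, x):
    """Compute y = A @ x with one 'warp' per element of y: lane k of the
    warp reads elements k, k+32, k+64, ... of the row (so adjacent lanes
    read adjacent memory), then partial sums are reduced."""
    m, n = A.shape
    y = np.zeros(m)
    for row in range(m):
        partial = np.zeros(WARP)
        for lane in range(WARP):
            partial[lane] = A[row, lane::WARP] @ x[lane::WARP]
        y[row] = partial.sum()          # warp-wide reduction
    return y
```

For a fat matrix (few long rows) this gives every warp plenty of work, which is why the warp-per-element mapping helps exactly where one-thread-per-element CUBLAS-style mappings struggle.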

14.
We present an architecture of decoupled processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. This architecture exploits the more efficient pre-fetching of decoupled processors, which make use of the parallelism between address computation and application data processing that mainly exists in streaming applications. This benefit, combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access, contributes significantly to increasing the system's performance. The application code is split into two parallel programs: the first runs on the Access processor and computes the addresses of the data in the memory hierarchy; the second processes the application data and runs on the Execute processor, a processor whose address space is limited to the register file addresses. Each transfer of any block in the memory hierarchy, up to the Execute processor's register file, is controlled by the Access processor and the DMA units. This strongly differentiates the architecture from traditional uniprocessors and from existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with uniprocessor architectures with (a) scratch-pad and (b) cache memory hierarchies, and with (c) existing decoupled architectures, showing higher normalized performance. The reason for this gain is the efficient data transfer that the scratch-pad memory hierarchy provides, combined with the ability of decoupled processors to hide memory latency by using memory management techniques for transferring data instead of fixed prefetching methods. Experimental results show that performance is increased by up to almost 2 times compared to uniprocessor architectures with scratch-pad memory and up to 3.7 times compared to those with caches. The proposed architecture achieves this performance without penalties in energy-delay product costs.

15.
Including multiple cores on a single chip has become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with hundreds of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. In addition, memory access will continue to be an important limiting factor on achieving performance, and heterogeneous systems may make use of cores with varying capabilities and performance characteristics. An appropriate programming model can address scalability and can expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores. We survey and evaluate a range of multicore processor architectures and programming models with a focus on GPUs and the Cell BE processor. These processors have a large number of cores and are available to consumers today, but the scalable programming models developed for them are also applicable to current and future multicore CPUs.

16.
We have developed a memory access reduced VLSI chip for 5,000 word speaker-independent continuous speech recognition. This chip employs a context-dependent HMM (hidden Markov model) based speech recognition algorithm, and contains parallel and pipelined hardware units for emission probability computation and Viterbi beam search. To maximize the performance, we adopted several memory access reduction techniques such as sub-vector clustering and multi-block processing for the emission probability computation. We also employed a custom DRAM controller for efficient access of consecutive data. Moreover, we analyzed the access pattern of data to minimize the internal SRAM size while maintaining high performance. The experimental results show that the implemented system performs speech recognition 2.4 and 1.8 times faster than real-time utilizing 32-bit DDR SDRAM and SDR SDRAM, respectively.
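Viterbi beam search, the second hardware unit above, extends paths only from the best-scoring states of the previous frame rather than from all states. A toy Python sketch over a hypothetical two-state HMM (illustrating the pruning idea, not the chip's algorithm, which prunes large context-dependent HMM state sets by score threshold):

```python
import numpy as np

def viterbi_beam(log_init, log_trans, log_emit, obs, beam=2):
    """Viterbi decoding with beam pruning: only the `beam` best-scoring
    states of the previous frame may extend paths into the current one."""
    n_states = len(log_init)
    scores = log_init + log_emit[:, obs[0]]
    back = []
    for t in range(1, len(obs)):
        keep = np.argsort(scores)[-beam:]          # surviving states
        new_scores = np.empty(n_states)
        arg = np.empty(n_states, dtype=int)
        for j in range(n_states):
            cand = scores[keep] + log_trans[keep, j]
            best = int(np.argmax(cand))
            arg[j] = keep[best]
            new_scores[j] = cand[best] + log_emit[j, obs[t]]
        back.append(arg)
        scores = new_scores
    # Backtrack the best surviving path.
    path = [int(np.argmax(scores))]
    for arg in reversed(back):
        path.append(int(arg[path[-1]]))
    return path[::-1]
```

Pruning keeps the per-frame work (and hence the memory traffic for transition and emission parameters) roughly constant regardless of vocabulary size.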

17.
Hardware design of an embedded system based on dual PowerPC 7447A processors
As radar data and signal processing demands continue to climb, the processing capability of traditional radar digital processing systems has become insufficient, making it necessary to increase the processing power of each processing unit in the system. To this end, we design a general-purpose processing-unit hardware platform based on the CPCI standard bus and dual high-performance PowerPC 7447A processors, and describe the design of several of its functional units. The hardware platform consists of dual processing nodes, dual PMC interfaces and a CPCI bus interface; the PCI bus is used for local interconnection and the CPCI bus for external connections. The platform offers strong data processing capability, good extensibility, wide applicability and easy maintenance, and has considerable practical value.

18.
The Speeded Up Robust Features (SURF) algorithm is an image interest-point detection and matching method that is widely used in computer vision. The Open Computing Language (OpenCL) provides a framework for writing parallel programs on heterogeneous architectures, including GPUs, CPUs and other types of processors. This paper describes how the SURF algorithm can be optimized and implemented on general-purpose GPUs using OpenCL. Several of the optimization methods, such as the configuration of kernel threads and the use of local memory, are compared and discussed in detail. The resulting OpenCL version of the algorithm achieves a speedup of at least 21 times on an NVidia GTX 260 over the original CPU version running on an Intel Dual-Core E5400 2.7 GHz processor.

19.
A heterogeneous multicore system-on-chip (SoC) has been developed for high-definition (HD) multimedia applications that require secure digital rights management (DRM). The SoC integrates three types of processors: two special-purpose accelerators for ciphering and high-resolution video decoding, one general-purpose accelerator (MX), and three CPUs. In this way, the SoC achieves high performance and low power consumption with hardware customized for video processing applications, which process large amounts of data. To achieve secure data control, hardware memory management and software system virtualization are adopted; the security of the system results from the cooperation between the hardware and software. Furthermore, a highly tamper-resistant system is provided by our system-in-package (SiP), in which DDR2 SDRAMs and a flash memory containing confidential information are enclosed in one package. This secure multimedia processor provides a solution for protecting content and for safely delivering sensitive information when processing billing transactions that involve digital content delivery. The SoC was implemented in a 90 nm generic CMOS technology.
