1.

CPU–GPU hybrid parallel strategy for cosmological simulations

Yueqing Wang Yong Dou Song Guo Yuanwu Lei Dan Zou 《Concurrency and Computation》2014,26(3):748-765

Gadget is a simulation application for N‐body and smoothed particle hydrodynamics problems in cosmology, and it is widely applied to a range of cosmological problems. N‐body simulation models the motion of N interacting particles, and smoothed particle hydrodynamics is a fluid simulation algorithm that studies the movement of fluid through particle simulation. Most researchers have focused on accelerating Gadget on multi‐core CPU or graphics processing unit (GPU) platforms. However, these efforts did not achieve CPU–GPU hybrid computing, which wastes a large share of CPU computing resources. In this paper, we propose a CPU–GPU hybrid parallel strategy to accelerate Gadget‐2, a massively parallel structure formation code for cosmological simulations. This strategy uses both the CPU and the GPU to calculate the short‐range force. To balance the CPU and GPU workloads, a dynamic task allocation scheme is proposed according to the computational performance difference between the CPU and GPU. Experimental results showed that our CPU–GPU hybrid parallel strategy achieved an overall speedup factor of 18.6 and a partial speedup factor for short‐range force calculation of 28.35 compared with a single‐core CPU implementation for particle counts on the order of millions. Moreover, compared with a GPU platform that contained 12 CPU cores and one GPU, our hybrid parallel strategy improved the overall and partial speedup by 6% and 20%, respectively. Furthermore, the hybrid strategy scales well: its performance improves as the problem size increases. However, the strategy has a limitation: the performance gain shrinks as the ratio of the number of CPU cores to the number of GPU cards decreases. Finally, in our hybrid strategy, CPU utilization improved by 17.14% or more. Copyright © 2013 John Wiley & Sons, Ltd.
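
The abstract above does not spell out its dynamic task allocation scheme, so the following is only a minimal sketch of one common way to balance work by measured performance. The function names `split_work` and `update_rate`, the smoothing factor, and the proportional rule itself are assumptions made for illustration, not the authors' method.

```python
def split_work(num_items, cpu_rate, gpu_rate):
    """Split num_items between CPU and GPU in proportion to their rates.

    cpu_rate and gpu_rate are measured throughputs (items per second).
    Hypothetical helper: the proportional rule is an assumption.
    Returns a (cpu_items, gpu_items) tuple that sums to num_items.
    """
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    gpu_items = round(num_items * gpu_share)
    return num_items - gpu_items, gpu_items


def update_rate(old_rate, items_done, seconds, smoothing=0.5):
    """Blend the previous rate estimate with the latest measurement.

    Re-estimating the rates after every step is what makes the
    allocation dynamic rather than fixed in advance.
    """
    measured = items_done / seconds
    return smoothing * old_rate + (1.0 - smoothing) * measured
```

With rates re-estimated after each time step, the next call to `split_work` shifts particles toward whichever device finished its share faster.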

2.

Implementation and optimization of GPU‐based parallel one‐step leapfrog ADI‐FDTD for far‐field scattering problems

Bin Zou Shuo Liu Lamei Zhang 《International Journal of RF and Microwave Computer‐Aided Engineering》2020,30(10)

The one‐step leapfrog alternating‐direction‐implicit finite‐difference time‐domain (ADI‐FDTD) method, free from the Courant‐Friedrichs‐Lewy (CFL) stability condition and sub‐step computations, is efficient when dealing with fine grid problems. However, solving the numerous tridiagonal systems still imposes a great computational burden and makes the method hard to execute in parallel. In this paper, we propose an efficient graphics processing unit (GPU)‐based parallel implementation of the one‐step leapfrog ADI‐FDTD for far‐field EM scattering simulation of objects. We present and analyze the calculation area division and thread allocation schemes, and propose a data layout transformation of the z components to achieve a better memory access pattern, which is a key factor affecting GPU execution efficiency. A simulation experiment is carried out to verify the accuracy and efficiency of the GPU‐based implementation. The simulation results show good agreement between the proposed one‐step leapfrog ADI‐FDTD method and Yee's FDTD in solving the far‐field scattering problem, and large performance gains were obtained when the method was accelerated on the GPU.

3.

Graphics processing unit‐accelerated bounding for branch‐and‐bound applied to a permutation problem using data access optimization

N. Melab I. Chakroun A. Bendjoudi 《Concurrency and Computation》2014,26(16):2667-2683

Branch‐and‐bound (B&B) algorithms are attractive methods for solving combinatorial optimization problems to optimality using an implicit enumeration of a dynamically built tree‐based search space. Nevertheless, they are time‐consuming when dealing with large problem instances. Therefore, pruning tree nodes (subproblems) is traditionally used as a powerful mechanism to reduce the size of the explored search space. Pruning requires performing the bounding operation, which consists of applying a lower bound function to the subproblems generated during the exploration process. Preliminary experiments performed on the Flow‐Shop scheduling problem (FSP) have shown that the bounding operation consumes over 98% of the execution time of the B&B algorithm. In this paper, we investigate the use of graphics processing unit (GPU) computing as a major complementary way to speed up the search. We revisit the design and implementation of the parallel bounding model on GPU accelerators. The proposed approach enables data access optimization. Extensive experiments have been carried out on well‐known FSP benchmarks using an Nvidia Tesla C2050 GPU card. Compared to a CPU‐based single‐core execution using an Intel Core i7‐970 processor without GPU, speedups of more than 100 times are achieved for large problem instances. At an equivalent peak performance, GPU‐accelerated B&B is twice as fast as its multi‐core counterpart. Copyright © 2013 John Wiley & Sons, Ltd.

4.

A GPU‐based Streaming Algorithm for High‐Resolution Cloth Simulation

Min Tang Ruofeng Tong Rahul Narain Chang Meng Dinesh Manocha 《Computer Graphics Forum》2013,32(7):21-30

We present a GPU‐based streaming algorithm to perform high‐resolution and accurate cloth simulation. We map all the components of the cloth simulation pipeline, including time integration, collision detection, collision response, and velocity updating, to GPU‐based kernels and data structures. Our algorithm handles intra‐object and inter‐object collisions, contacts, and friction, and is able to accurately simulate folds and wrinkles. We describe the streaming pipeline and address many issues in terms of obtaining high throughput on many‐core GPUs. In practice, our algorithm can perform high‐fidelity simulation on a cloth mesh with 2M triangles using 3GB of GPU memory. We highlight the parallel performance of our algorithm on three different generations of GPUs. On a high‐end NVIDIA Tesla K20c, we observe up to two orders of magnitude performance improvement as compared to a single‐threaded CPU‐based algorithm, and about one order of magnitude improvement over a 16‐core CPU‐based parallel implementation.

5.

High‐speed parallel implementations of the rainbow method based on perfect tables in a heterogeneous system

Jung Woo Kim Jungjoo Seo Jin Hong Kunsoo Park Sung‐Ryul Kim 《Software》2015,45(6):837-855

The computing power of graphics processing units (GPUs) has increased rapidly, and there has been extensive research on general‐purpose computing on GPUs (GPGPU) for cryptographic algorithms such as RSA, Elliptic Curve Cryptosystem (ECC), NTRU, and Advanced Encryption Standard. With the rise of GPGPU, commodity computers have become complex heterogeneous GPU+CPU systems. This new architecture poses new challenges and opportunities in high‐performance computing. In this paper, we present high‐speed parallel implementations of the rainbow method based on perfect tables, which is known as the most efficient time‐memory trade‐off, in the heterogeneous GPU+CPU system. We give a complete analysis of the effect of multiple checkpoints on reducing the cost of false alarms and take advantage of it for load balancing between GPU and CPU. For GTX460, our implementation is about 1.86 and 3.25 times faster than other GPU‐accelerated implementations, RainbowCrack and Cryptohaze, respectively, and for GTX580, 1.53 and 2.40 times faster. Copyright © 2014 John Wiley & Sons, Ltd.

6.

Efficient graphics processing unit based layered decoders for quasicyclic low‐density parity‐check codes

Rongchun Li Yong Dou Dan Zou Shi Wang Ying Zhang 《Concurrency and Computation》2015,27(1):29-46

Since the layered low‐density parity‐check (LDPC) decoding algorithm was proposed, one can exploit the diversity gain to achieve performance comparable to that of traditional two‐phase message passing (TPMP) decoding, but with decoding convergence about twice as fast. To reduce the decoding time of the layered LDPC decoder, a graphics processing unit (GPU) is exploited as the modem processor so that the decoding procedure can be processed in parallel using numerous GPU threads. In this paper, we present the parallel algorithms and efficient implementations on the GPU for two different layered message passing schemes, row‐layered and column‐layered decoding. In the experiments, the quasicyclic LDPC codes for WiFi (802.11n) and WiMAX (802.16e) are decoded by the proposed layered LDPC decoders. The experimental results show that our decoder has good bit error ratio (BER) performance comparable to that of the TPMP decoder. The peak throughput is 712 Mbps, which is about two orders of magnitude higher than that of the CPU implementation and comparable to dedicated hardware solutions. Compared to the fastest existing GPU‐based implementation, the presented decoder achieves a performance improvement of 2.3 times. Copyright © 2013 John Wiley & Sons, Ltd.

7.

Combinatorial Bidirectional Path‐Tracing for Efficient Hybrid CPU/GPU Rendering

Anthony Pajot Loïc Barthe Mathias Paulin Pierre Poulin 《Computer Graphics Forum》2011,30(2):315-324

This paper presents a reformulation of bidirectional path‐tracing that adequately divides the algorithm into processes efficiently executed in parallel on both the CPU and the GPU. We thus benefit from high‐level optimization techniques such as double buffering, batch processing, and asynchronous execution, as well as from the exploitation of most of the CPU, GPU, and memory bus capabilities. Our approach, while avoiding pure GPU implementation limitations (such as limited complexity of shaders, light or camera models, and processed scene data sets), is more than ten times faster than standard bidirectional path‐tracing implementations, leading to performance suitable for production‐oriented rendering engines.

8.

Volume and Isosurface Rendering with GPU‐Accelerated Cell Projection

R. Marroquim A. Maximo R. Farias C. Esperança 《Computer Graphics Forum》2008,27(1):24-35

We present an efficient Graphics Processing Unit (GPU)‐based implementation of the Projected Tetrahedra (PT) algorithm. By reducing most of the CPU–GPU data transfer, the algorithm achieves interactive frame rates (up to 2.0 M Tets/s) on current graphics hardware. Since no topology information is stored, it requires substantially less memory than recent interactive ray casting approaches. The method uses a two‐pass GPU approach with two fragment shaders. This work includes extended volume inspection capabilities by supporting interactive transfer function editing and isosurface highlighting using a Phong illumination model.

9.

Efficient Image‐Based Proximity Queries with Object‐Space Precision

T. Morvan M. Reimers E. Samset 《Computer Graphics Forum》2012,31(1):62-74

We present an efficient algorithm for object‐space proximity queries between multiple deformable triangular meshes. Our approach uses the rasterization capabilities of the GPU to produce an image‐space representation of the vertices. Using this image‐space representation, inter‐object vertex‐triangle distances and closest points lying under a user‐defined threshold are computed in parallel by conservative rasterization of bounding primitives and sorted using atomic operations. We additionally introduce a similar technique to detect penetrating vertices. We show how mechanisms of modern GPUs such as mipmapping, Early‐Z and Early‐Stencil culling can optimize the performance of our method. Our algorithm is able to compute dense proximity information for complex scenes made of more than a hundred thousand triangles in real time, outperforming a CPU implementation based on bounding volume hierarchies by more than an order of magnitude.

10.

Bandwidth‐aware divisible task scheduling for cloud computing

Weiwei Lin Chen Liang James Z. Wang Rajkumar Buyya 《Software》2014,44(2):163-174

Task scheduling is a fundamental issue in achieving high efficiency in cloud computing. However, designing and implementing an efficient scheduling algorithm is a big challenge, as the general scheduling problem is NP‐complete. Most existing task‐scheduling methods for cloud computing only consider task resource requirements for CPU and memory, without considering bandwidth requirements. To obtain better performance, in this paper we propose a bandwidth‐aware algorithm for divisible task scheduling in cloud‐computing environments. A nonlinear programming model for the divisible task‐scheduling problem under the bounded multi‐port model is presented. By solving this model, the optimized allocation scheme that determines the proper number of tasks assigned to each virtual resource node is obtained. On the basis of the optimized allocation scheme, a heuristic algorithm for divisible load scheduling, called the bandwidth‐aware task‐scheduling (BATS) algorithm, is proposed. The performance of the algorithm is evaluated using the CloudSim toolkit. Experimental results show that the proposed algorithm (BATS) performs better than the fair‐based task‐scheduling algorithm, the bandwidth‐only task‐scheduling algorithm, and the computation‐only task‐scheduling algorithm. Copyright © 2012 John Wiley & Sons, Ltd.

11.

GPU‐accelerated finite‐difference time‐domain method for dielectric media based on CUDA

Ximin Wang Song Liu Xuan Li Shuangying Zhong 《International Journal of RF and Microwave Computer‐Aided Engineering》2016,26(6):512-518

The simulation of electromagnetic (EM) wave propagation in dielectric media is presented using a Compute Unified Device Architecture (CUDA) implementation of the finite‐difference time‐domain (FDTD) method on a graphics processing unit (GPU). The FDTD formulation in dielectric media is derived in detail, and the GPU‐accelerated FDTD method based on the CUDA programming model is described in a flowchart. The accuracy and speedup of the presented CUDA‐implemented FDTD method are validated by numerical simulation, on the GPU, of EM waves propagating from free space into lossless and lossy dielectric media, and by comparison with the original FDTD method on the CPU. The comparison of the numerical results demonstrates that the CUDA‐implemented FDTD method on the GPU obtains better application speedup performance with reasonable accuracy. © 2016 Wiley Periodicals, Inc. Int J RF and Microwave CAE 26:512–518, 2016.

12.

Performance evaluation of a 3D multi-view-based particle filter for visual object tracking using GPUs and multicore CPUs

David Concha Raúl Cabido Juan José Pantrigo Antonio S. Montemayor 《Journal of Real-Time Image Processing》2018,15(2):309-327

This paper presents a deep and extensive performance analysis of the particle filter (PF) algorithm for a very compute-intensive 3D multi-view visual tracking problem. We compare different implementations and parameter settings of the PF algorithm on a CPU platform, taking advantage of the multithreading capabilities of modern processors, and on a graphics processing unit (GPU) platform, using the NVIDIA CUDA computing environment as the development framework. We extend our experimental study to each individual stage of the PF algorithm, and evaluate the quality versus performance trade-off among different ways to design these stages. We have observed that the GPU platform performs better than the multithreaded CPU platform when handling a large number of particles, but we also demonstrate that hybrid CPU/GPU implementations can run almost as fast as GPU-only solutions.

13.

GPU accelerated sparse matrix‐vector multiplication and sparse matrix‐transpose vector multiplication

Yuan Tao Yangdong Deng Shuai Mu Zhenzhong Zhang Mingfa Zhu Limin Xiao Li Ruan 《Concurrency and Computation》2015,27(14):3771-3789

Many high performance computing applications require computing both the sparse matrix‐vector product (SMVP) and the sparse matrix‐transpose vector product (SMTVP) for better overall performance. Under such circumstances, it is critical to maintain a similarly high throughput for these two computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows computing both problems on multi‐core CPUs with nearly identical throughputs. On the other hand, a direct porting of CSB to graphics processing units (GPUs), which have recently been recognized as a powerful general purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (eCSB), to minimize the throughput gap between SMVP and SMTVP computations on GPUs while at the same time enabling a high computing throughput. We also use a hybrid storage format to store elements in each block, which can be selected dynamically at runtime. Experimental results show that the proposed techniques implemented on a Kepler GPU deliver similar throughput on both SMVP and SMTVP, and the throughput is up to 13 times higher than that of the CPU‐based CSB implementation. In addition, our eCSB procedure outperforms the previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, respectively. We validate the effectiveness of eCSB by means of the wall‐clock time of the bi‐conjugate gradient algorithm; our eCSB is 25% faster than Compressed Sparse Rows (CSR) and 6% faster than HYB. Copyright © 2014 John Wiley & Sons, Ltd.
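
To make the two computing patterns in this abstract concrete, here is a minimal sequential sketch that computes both products from a single CSR encoding. It uses plain CSR rather than CSB or eCSB, whose layouts the abstract does not describe; the function names and argument order are assumptions made for illustration.

```python
def csr_matvec(n_rows, indptr, indices, data, x):
    """Compute y = A x for a matrix A stored in CSR form.

    Each output entry y[i] is gathered from one row, so rows can be
    processed independently.
    """
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y


def csr_rmatvec(n_rows, n_cols, indptr, indices, data, x):
    """Compute y = A^T x from the same CSR arrays, without transposing.

    Each row now scatters into several entries of y, so different rows
    may write to the same output position.
    """
    y = [0.0] * n_cols
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            y[indices[k]] += data[k] * x[i]
    return y
```

The gather in `csr_matvec` and the scatter in `csr_rmatvec` show why the two products tend to behave differently on parallel hardware when only one storage format is kept: concurrent rows never collide in the first case but can collide in the second.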

14.

PSkel: A stencil programming framework for CPU‐GPU systems

Alyson D. Pereira Luiz Ramos Luís F. W. Góes 《Concurrency and Computation》2015,27(17):4938-4953

The use of Graphics Processing Units (GPUs) for high‐performance computing has gained growing momentum in recent years. Unfortunately, GPU‐programming platforms like Compute Unified Device Architecture (CUDA) are complex, user unfriendly, and increase the complexity of developing high‐performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU‐GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high‐level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high‐level abstraction for stencil programming on heterogeneous CPU‐GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks (Intel Corporation, Santa Clara, CA, USA) and NVIDIA CUDA (Nvidia Corporation, Santa Clara, CA, USA). In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU‐only and GPU‐only parallel applications, respectively. Copyright © 2015 John Wiley & Sons, Ltd.

15.

CUDA‐quicksort: an improved GPU‐based implementation of quicksort

Emanuele Manca Andrea Manconi Alessandro Orro Giuliano Armano Luciano Milanesi 《Concurrency and Computation》2016,28(1):21-43

Sorting is a very important task in computer science and becomes a critical operation for programs making heavy use of sorting algorithms. General‐purpose computing on Graphics Processing Units (GPUs) has been successfully used to parallelize some sorting algorithms. Two GPU‐based implementations of quicksort have been presented in the literature: GPU‐quicksort, a compute‐unified device architecture (CUDA) iterative implementation, and the CUDA dynamic parallel (CDP) quicksort, a recursive implementation provided by NVIDIA Corporation. We propose CUDA‐quicksort, an iterative GPU‐based implementation of the sorting algorithm. CUDA‐quicksort has been designed starting from GPU‐quicksort. Unlike GPU‐quicksort, it uses atomic primitives to perform inter‐block communications while ensuring an optimized access to the GPU memory. Experiments performed on six sorting benchmark distributions show that CUDA‐quicksort is up to four times faster than GPU‐quicksort and up to three times faster than CDP‐quicksort. An in‐depth analysis of the performance of CUDA‐quicksort and GPU‐quicksort shows that the main improvement is related to the optimized GPU memory access rather than to the use of atomic primitives. Moreover, in order to assess the advantages of using CUDA dynamic parallelism, we implemented a recursive version of CUDA‐quicksort. Experimental results show that CUDA‐quicksort is faster than the CDP‐quicksort provided by NVIDIA, with better performance achieved using the iterative implementation. Copyright © 2015 John Wiley & Sons, Ltd.
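
The iterative versus recursive distinction drawn in this abstract can be illustrated on a CPU with a short sequential sketch in which an explicit stack of pending ranges replaces recursive calls. This is not the CUDA‐quicksort algorithm itself, which partitions in parallel across thread blocks; the function name and the middle-element pivot are assumptions made for illustration.

```python
def quicksort_iterative(values):
    """Sort a sequence with quicksort, using an explicit stack of ranges.

    Hypothetical illustration: each stack entry is an inclusive (lo, hi)
    range still to be partitioned, so no recursion is needed.
    """
    a = list(values)              # work on a copy, leave the input intact
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue              # ranges of length 0 or 1 are sorted
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:             # Hoare-style partition around the pivot
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))     # left part: elements <= pivot
        stack.append((i, hi))     # right part: elements >= pivot
    return a
```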

16.

CPU–GPU Parallel Framework for Real‐Time Interactive Cutting of Adaptive Octree‐Based Deformable Objects

Shiyu Jia Weizhong Zhang Xiaokang Yu Zhenkuan Pan 《Computer Graphics Forum》2018,37(1):45-59

A software framework taking advantage of the parallel processing capabilities of CPUs and GPUs is designed for the real‐time interactive cutting simulation of deformable objects. Deformable objects are modelled as voxels connected by links. The voxels are embedded in an octree mesh used for deformation. Cutting is performed by disconnecting links swept by the cutting tool and then adaptively refining octree elements near the cutting tool trajectory. A surface mesh used for visual display is reconstructed from disconnected links using the dual contour method. Spatial hashing of the octree mesh and topology‐aware interpolation of the distance field are used for collision. Our framework uses a novel GPU implementation for inter‐object collision and object self collision, while tool‐object collision, cutting and deformation are assigned to the CPU, using multiple threads whenever possible. A novel method that splits cutting operations into four independent tasks running in parallel is designed. Our framework also performs data transfers between CPU and GPU simultaneously with other tasks to reduce their impact on performance. Simulation tests show that when compared to three‐threaded CPU implementations, our GPU accelerated collision is 53–160% faster, and the overall simulation frame rate is 47–98% faster.

17.

Parallel solution of the subset‐sum problem: an empirical study

Saniyah S. Bokhari 《Concurrency and Computation》2012,24(18):2241-2254

The subset‐sum problem is a well‐known NP‐complete combinatorial problem that is solvable in pseudo‐polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures, that is, a 128‐processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16‐processor IBM x3755 shared memory machine, and a 240‐core NVIDIA FX 5800 graphics processing unit (GPU). We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word‐level locking that is available on this architecture. For the other two machines, we present an alternating word algorithm that can implement an efficient solution. Our results show that the GPU performs well for problems whose tables fit within the device memory. Because GPUs typically have memories in the order of 10GB, such architectures are best for small problem sizes that have tables of size approximately 10¹⁰. The IBM x3755 performs very well on medium‐sized problems that fit within its 64‐GB memory but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 10¹¹ bits. The Cray XMT shows very good scaling for large problems and demonstrates sustained performance as the problem size increases. However, this machine has poor scaling for small problem sizes; it performs best for problem sizes of 10¹² bits or more. The results in this paper illustrate that the subset‐sum problem can be parallelized well on all three architectures, albeit for different ranges of problem sizes. The performance of these three machines under varying problem sizes shows the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd.
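
The pseudo‐polynomial dynamic program mentioned in this abstract can be sketched sequentially as follows. This sketch keeps only the current row of the table, packed as the bits of one Python integer, and it is not the paper's parallel or alternating word algorithm; the function name and the assumption of non‐negative integer sizes are choices made for illustration.

```python
def subset_sum(sizes, target):
    """Return True if some subset of sizes adds up exactly to target.

    Assumes sizes are non-negative integers. Bit s of `reachable` is 1
    when some subset of the items seen so far sums to s.
    """
    if target < 0:
        return False
    reachable = 1                        # only the empty sum 0 at first
    mask = (1 << (target + 1)) - 1       # ignore sums above the target
    for size in sizes:
        # Either skip the item (old bits) or take it (bits shifted up).
        reachable |= (reachable << size) & mask
    return bool((reachable >> target) & 1)
```

Each item updates a row of `target + 1` bits, so the bit-level work is proportional to the number of items times the target, which matches the table size described in the abstract when the target is on the order of the sum of the sizes.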

18.

Work stealing for GPU‐accelerated parallel programs in a global address space framework

Humayun Arafat James Dinan Sriram Krishnamoorthy Pavan Balaji P. Sadayappan 《Concurrency and Computation》2016,28(13):3637-3654

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed‐memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task‐parallel programs executed on hybrid distributed‐memory CPU‐graphics processing unit (GPU) systems in a global‐address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU‐GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state‐of‐the‐art CCSD(T) application module from the computational chemistry domain. Copyright © 2016 John Wiley & Sons, Ltd.

19.

ManyLoDs: Parallel Many‐View Level‐of‐Detail Selection for Real‐Time Global Illumination

Matthias Hollander Tobias Ritschel Elmar Eisemann Tamy Boubekeur 《Computer Graphics Forum》2011,30(4):1233-1240

Level‐of‐Detail structures are a key component for scalable rendering. Built from raw 3D data, these structures are often defined as Bounding Volume Hierarchies, providing coarse‐to‐fine adaptive approximations that are well‐adapted for many‐view rasterization. Here, the total number of pixels in each view is usually low, while the cost of choosing the appropriate LoD for each view is high. This task represents a challenge for existing GPU algorithms. We propose ManyLoDs, a new GPU algorithm to efficiently compute many LoDs from a Bounding Volume Hierarchy in parallel by balancing the workload within and among LoDs. Our approach is not specific to a particular rendering technique, can be used on lazy representations such as polygon soups, and can handle dynamic scenes. We apply our method to various many‐view rasterization applications, including Instant Radiosity, Point‐Based Global Illumination, and reflection/refraction mapping. For each of these, we achieve real‐time performance in complex scenes at high resolutions.

20.

Efficient parallel implementation of three‐point Viterbi decoding algorithm on CPU, GPU, and FPGA

Rongchun Li Yong Dou Dan Zou 《Concurrency and Computation》2014,26(3):821-840

In wireless communication, the Viterbi decoding algorithm (VDA) is one of the most popular channel decoding algorithms and is widely used in WLAN, WiMAX, or 3G communications. However, the throughput of the Viterbi decoder is constrained by the convolutional characteristic. Recently, the three‐point VDA (TVDA) was proposed to solve this problem. In TVDA, the whole procedure can be divided into three phases: the forward, trace‐back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on the multi‐core CPU, graphics processing unit (GPU), and field programmable gate array (FPGA). We demonstrate approaches that fully exploit its performance potential on CPU, GPU, and FPGA computing platforms. For CPU platforms, we apply two optimization methods, single instruction multiple data and multithreading, to gain over 145× speedup over the naive CPU version on a quad‐core CPU platform. For GPU platforms, we propose the combination of cached memory optimization, coalesced global memory accesses, a codeword packing scheme, and asynchronous data transition, achieving a throughput of 404.65 Mbps, a 12× speedup over initial GPU versions on an NVIDIA GeForce GTX580 card, and a 7× speedup over an Intel quad‐core CPU i5‐2300 from the same manufacturing year, with both platforms fully optimized. In addition, for FPGA platforms, we customize a radix‐4 pipelined architecture for the TVDA in a 45‐nm FPGA chip from Xilinx (XC6VLX760). At a 209.15‐MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we also discuss the performance evaluation and efficiency comparison of different flexible architectures for real‐time Viterbi decoding in terms of decoding throughput, power consumption, optimization schemes, programming costs, and price costs. Copyright © 2013 John Wiley & Sons, Ltd.