Similar documents
20 similar documents retrieved (search time: 718 ms)
1.
This paper proposes an efficient construction scheme for bounding volume hierarchies based on a complete tree. The construction is up to 4× faster than the binned surface area heuristic and offers competitive ray traversal performance. It is fully parallelized on x86 CPU architectures, taking advantage of the eight-wide vector units and exploiting the Advanced Vector Extensions available on current x86 CPUs. Additionally, this work presents a clustering algorithm for grouping primitives that can be computed in linear time O(n). Furthermore, the construction uses the graphics processing unit to perform intensive operations efficiently. Copyright © 2016 John Wiley & Sons, Ltd.
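The abstract only states that the primitive clustering runs in linear time; below is a minimal host-side sketch of one common O(n) grouping strategy (quantizing primitive centroids into a coarse uniform grid). The Prim layout, the GRID resolution, and the grid-based rule itself are illustrative assumptions, not the paper's method.

```cuda
// Hypothetical sketch: linear-time O(n) grouping of primitives by quantizing
// their centroids into a coarse 3D grid. The paper's actual clustering rule is
// not reproduced; Prim and GRID are assumptions for illustration only.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Prim { float cx, cy, cz; };          // primitive centroid (assumed layout)

std::unordered_map<uint32_t, std::vector<int>>
clusterPrimitives(const std::vector<Prim>& prims,
                  float minX, float minY, float minZ, float cell) {
    const int GRID = 64;                    // assumed grid resolution per axis
    std::unordered_map<uint32_t, std::vector<int>> clusters;
    for (int i = 0; i < (int)prims.size(); ++i) {   // single O(n) pass
        int gx = std::min(GRID - 1, std::max(0, int((prims[i].cx - minX) / cell)));
        int gy = std::min(GRID - 1, std::max(0, int((prims[i].cy - minY) / cell)));
        int gz = std::min(GRID - 1, std::max(0, int((prims[i].cz - minZ) / cell)));
        uint32_t key = (uint32_t)gx | ((uint32_t)gy << 8) | ((uint32_t)gz << 16);
        clusters[key].push_back(i);         // primitives in the same cell form a cluster
    }
    return clusters;
}
```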

2.
Face tracking is an important computer vision technology that has been widely adopted in many areas, from cell phone applications to industrial robots. In this paper, we introduce a novel way to parallelize a face contour detection application based on the color-entropy preprocessed Chan–Vese model utilizing a total variation G-norm. This particular application is a complicated and unsupervised computational method requiring a large amount of calculation. Several core parts are difficult to parallelize because of heavily correlated data processing among iterations and pixels. We develop a novel approach to parallelize these data-dependent core parts and significantly improve the runtime performance of the model computation. We implement the parallelized program in OpenCL for both multi-core CPU and GPU. For 640 × 480 input images, the parallelized program on an NVIDIA GTX970 GPU, an NVIDIA GTX660 GPU, and an AMD FX8530 8-core CPU is on average 18.6, 12.0, and 4.40 times faster, respectively, than its single-thread C version on the AMD FX8530 CPU. Some parallelized routines show a much higher improvement than the whole program; for instance, on the NVIDIA GTX970 GPU, the parallelized entropy filter routine is on average 74.0 times faster than its single-thread C version on the AMD FX8530 8-core CPU. We discuss the parallelization methodologies in detail, including scalability, thread models, and synchronization methods for both multi-core CPU and GPU.
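The entropy filter is the routine with the largest reported speedup; the CUDA sketch below shows a generic per-pixel local-entropy kernel of the kind such a filter computes. The window radius R, the 8-bit grayscale layout, and the 256-bin per-thread histogram are assumptions, not the paper's OpenCL implementation.

```cuda
// Hypothetical per-pixel local entropy kernel over a (2*R+1)^2 window.
// img is an 8-bit grayscale buffer of width w and height h; R is assumed.
#include <cuda_runtime.h>

__global__ void entropyFilter(const unsigned char* img, float* out,
                              int w, int h, int R) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int hist[256] = {0};                           // per-thread local histogram
    int count = 0;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);   // clamp at the image border
            int yy = min(max(y + dy, 0), h - 1);
            hist[img[yy * w + xx]]++;
            count++;
        }

    float entropy = 0.0f;
    for (int i = 0; i < 256; ++i)
        if (hist[i] > 0) {
            float p = (float)hist[i] / count;
            entropy -= p * __log2f(p);             // Shannon entropy in bits
        }
    out[y * w + x] = entropy;
}
```

A typical launch would use a 16 × 16 thread block and a grid sized to cover the image.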

3.
Direct volume visualization is an important method in many areas, including computational fluid dynamics and medicine. Achieving interactive rates for direct volume rendering of large unstructured volumetric grids is a challenging problem, but parallelizing direct volume rendering algorithms can help achieve this goal. Using the Compute Unified Device Architecture (CUDA), we propose a GPU-based volume rendering algorithm that is itself based on a cell projection-based ray-casting algorithm designed for CPU implementations. We also propose a multicore parallelized version of the cell-projection algorithm using OpenMP. In both algorithms, we favor image quality over rendering speed. Our algorithm has a low memory footprint, allowing us to render large datasets, and it supports progressive rendering. We compared the GPU implementation with the serial and multicore implementations and observed significant speed-ups that, together with progressive rendering, enable interactive rates for large datasets.

4.
In this paper we develop and implement a fast algorithm for sparse assembly, the fundamental operation that creates a compressed matrix from raw index data. Since it is often a demanding and sometimes critical operation, a highly efficient implementation is of interest. We show how to achieve this and, moreover, how our implementation can be parallelized to exploit the power of modern multicore computers. Our freely available, fully Matlab-compatible code achieves a speedup of about 5× on a typical 6-core machine and 10× on a dual-socket 16-core machine compared with the built-in serial implementation.
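As a concrete reference point, here is a minimal host-side (plain C++) sketch of the operation being accelerated: accumulating COO triplets (i, j, v) into CSR form while summing duplicate entries, as Matlab's sparse() semantics require. It is the serial baseline only; the paper's parallelization strategy is not reproduced.

```cuda
// Host-side sketch: assemble a CSR matrix from COO triplets, summing
// duplicates as Matlab's sparse() does. Serial baseline for illustration.
#include <map>
#include <vector>

struct CSR {
    std::vector<int> rowPtr, colIdx;
    std::vector<double> val;
};

CSR assembleCSR(int nrows,
                const std::vector<int>& I, const std::vector<int>& J,
                const std::vector<double>& V) {
    // one ordered map per row accumulates duplicates: (i, j) -> sum of values
    std::vector<std::map<int, double>> rows(nrows);
    for (size_t k = 0; k < I.size(); ++k)
        rows[I[k]][J[k]] += V[k];

    CSR A;
    A.rowPtr.assign(nrows + 1, 0);
    for (int r = 0; r < nrows; ++r) {
        A.rowPtr[r + 1] = A.rowPtr[r] + (int)rows[r].size();
        for (const auto& e : rows[r]) {        // columns come out sorted per row
            A.colIdx.push_back(e.first);
            A.val.push_back(e.second);
        }
    }
    return A;
}
```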

5.
Tracking systems are important in computer vision, with applications in surveillance, human-computer interaction, and other areas. Consumer graphics processing units (GPUs) have experienced an extraordinary evolution in both computing performance and programmability, leading to greater use of the GPU for non-rendering applications. In this work we propose a real-time object tracking algorithm based on the hybridization of particle filtering (PF) and a multi-scale local search (MSLS) algorithm, presented for both CPU and GPU architectures. The developed system delivers precise tracking of single and multiple targets in monocular video, operating in real time at 70 frames per second for 640 × 480 video on the GPU, up to 1,100% faster than the CPU version of the algorithm.

6.
With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well-known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S-W) algorithm, as an example and demonstrate approaches that fully exploit its performance potential on CPU, GPU, and field-programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data (SIMD) and multithreading, with compiler options, to gain over 70× speedups over naive CPU versions on quad-core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unrolling, achieving 50× speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12× speedups, rather than the 100× reported by some studies, over the Intel quad-core CPU Q9400, under the same manufacturing technology and with both platforms fully optimized. In addition, for FPGA platforms, we customize a linear systolic array for the S-W algorithm in a 45-nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. At only a 133 MHz clock rate, the FPGA platform reaches the highest performance and is the most power-efficient platform, using only 25 W compared with 190 W for the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.
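For context, below is a minimal scalar sketch of the Smith–Waterman recurrence with a linear gap penalty, i.e., the kernel that the SIMD, GPU, and systolic-array versions above all optimize. The scoring constants are illustrative assumptions, and none of the platform-specific optimizations are shown.

```cuda
// Host-side sketch: scalar Smith-Waterman local alignment score with a linear
// gap penalty, using two rolling rows. MATCH/MISMATCH/GAP are assumed values.
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

int smithWatermanScore(const std::string& a, const std::string& b) {
    const int MATCH = 2, MISMATCH = -1, GAP = -2;
    int n = (int)a.size(), m = (int)b.size(), best = 0;
    std::vector<int> prev(m + 1, 0), curr(m + 1, 0);
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
            // local alignment: scores are clamped at zero
            curr[j] = std::max({0, diag, prev[j] + GAP, curr[j - 1] + GAP});
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);   // curr becomes the previous row for the next i
    }
    return best;
}
```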

7.
In the last decade, many papers have presented sequential connected component labeling (CCL) algorithms. As modern processors are multi-core and trending toward many cores, the design of a CCL algorithm should address parallelism and multithreading. After a review of sequential CCL algorithms and a study of their variations, this paper presents the parallel version of the Light Speed Labeling (LSL) algorithm for connected component analysis (CCA) and compares it to our parallelized implementations of state-of-the-art sequential algorithms. We provide benchmarks that help to characterize the intrinsic differences between these parallel algorithms. We show that, thanks to its run-based processing, LSL is intrinsically more efficient and faster than all pixel-based algorithms. We also show that the pixel-based algorithms are memory-bound on multi-socket machines and therefore inefficient and unable to scale, whereas LSL, thanks to its RLE compression, can scale on such high-end machines. On a 4 × 15-core machine and for 8192 × 8192 images, LSL outperforms its best competitor by a factor of 10.8× and achieves a throughput of 42.4 gigapixels labeled per second.
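The run-based processing that gives LSL its edge starts from a row-wise run-length encoding; the host-side sketch below shows only that RLE step (provisional labels are then assigned to runs and merged with the runs of the previous row). The LSL equivalence tables and the parallel partitioning are not reproduced here.

```cuda
// Host-side sketch: the RLE step underlying run-based labeling. Each row of a
// binary image becomes a list of [start, end) foreground runs; labeling then
// operates on runs instead of pixels. Equivalence resolution is omitted.
#include <cstdint>
#include <vector>

struct Run { int start, end; };   // half-open pixel interval on one row

std::vector<std::vector<Run>> encodeRuns(const std::vector<uint8_t>& img,
                                         int width, int height) {
    std::vector<std::vector<Run>> runs(height);
    for (int y = 0; y < height; ++y) {
        const uint8_t* row = &img[(size_t)y * width];
        int x = 0;
        while (x < width) {
            while (x < width && row[x] == 0) ++x;      // skip background
            int start = x;
            while (x < width && row[x] != 0) ++x;      // consume foreground run
            if (start < x) runs[y].push_back({start, x});
        }
    }
    return runs;
}
```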

8.
Kan Guangyuan, He Xiaoyan, Ding Liuqian, Li Jiren, Hong Yang, Liang Ke. Engineering with Computers, 2020, 36(1): 75-96

The generalized likelihood uncertainty estimation (GLUE) method is a well-known and widely used sensitivity and uncertainty analysis method. It provides a new way to address the “equifinality” problem encountered in hydrological model parameter estimation. In this research, we focus on the computational efficiency of the GLUE method. Inspired by emerging heterogeneous parallel computing technology, we parallelized GLUE at the algorithmic level and implemented the parallel algorithm on a hybrid heterogeneous hardware system combining a multi-core CPU and a many-core GPU. The parallel GLUE was implemented using the OpenMP and CUDA software ecosystems for the multi-core CPU and the many-core GPU, respectively. Applying the parallel GLUE to parameter sensitivity analysis of the Xinanjiang hydrological model demonstrated much better computational efficiency than traditional serial computing, and its correctness was also verified. The heterogeneous-parallel-computing-accelerated GLUE method has very good application prospects for theoretical analysis and real-world applications.
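The part of GLUE that parallelizes naturally is the loop over independently sampled parameter sets; the host-side OpenMP sketch below illustrates that structure. The ParamSet fields, the runModelLikelihood placeholder, and the acceptance threshold are illustrative assumptions, not the Xinanjiang implementation used in the paper.

```cuda
// Host-side OpenMP sketch of the GLUE screening loop: each sampled parameter
// set is evaluated independently, and "behavioural" sets are collected.
#include <omp.h>
#include <vector>

struct ParamSet { double k, b, c; };           // illustrative parameters

// Placeholder standing in for a full model run plus a likelihood measure
// (e.g., a Nash-Sutcliffe efficiency computed against observations).
static double runModelLikelihood(const ParamSet& p) {
    return 1.0 - (p.k * p.k + p.b * p.b + p.c * p.c);
}

std::vector<int> glueScreen(const std::vector<ParamSet>& samples,
                            double threshold, std::vector<double>& like) {
    like.assign(samples.size(), 0.0);
    std::vector<int> behavioural;
    #pragma omp parallel
    {
        std::vector<int> local;                // per-thread accepted indices
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < (int)samples.size(); ++i) {
            like[i] = runModelLikelihood(samples[i]);   // independent model runs
            if (like[i] >= threshold) local.push_back(i);
        }
        #pragma omp critical
        behavioural.insert(behavioural.end(), local.begin(), local.end());
    }
    return behavioural;
}
```

The same loop structure maps to the GPU by launching one thread (or block) per parameter set, which is the CUDA side described in the abstract.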


9.
Implementation and Study of LZSS Text Compression   (Cited by: 1; self-citations: 0; by others: 1)
The LZSS compression algorithm is designed and implemented, and then modified by the authors to better suit Chinese text. Tests show the modification is effective: compared with the standard LZSS algorithm, the compression ratio improves substantially, with a maximum compression ratio of about 20 for long Chinese text files, and the compression of English text files also outperforms the LZSS12 algorithm. The limiting compression ratio of the LZSS algorithm is also derived, which is of significant practical value.
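For reference, here is a minimal host-side sketch of the core LZSS step: finding the longest match of the lookahead inside the sliding window and choosing between an (offset, length) pair and a literal. The window size, minimum match length, and brute-force search are generic choices, not the paper's Chinese-text-oriented modifications.

```cuda
// Host-side sketch of the core LZSS decision: find the longest match of the
// lookahead inside the sliding window; emit (offset, length) if it is at least
// MIN_MATCH bytes long, otherwise emit a literal. Brute-force for clarity.
#include <cstdio>
#include <string>

const int WINDOW = 4096;     // typical sliding-window size (assumed)
const int MIN_MATCH = 3;     // shortest match worth encoding (assumed)

void lzssEncodeStep(const std::string& data, size_t pos,
                    size_t& outOffset, size_t& outLen) {
    size_t windowStart = pos > WINDOW ? pos - WINDOW : 0;
    outOffset = 0; outLen = 0;
    for (size_t cand = windowStart; cand < pos; ++cand) {
        size_t len = 0;
        // overlapping matches are allowed, as in LZ77-family coders
        while (pos + len < data.size() && data[cand + len] == data[pos + len])
            ++len;
        if (len > outLen) { outLen = len; outOffset = pos - cand; }
    }
    if (outLen >= MIN_MATCH)
        std::printf("(offset=%zu, length=%zu)\n", outOffset, outLen);
    else
        std::printf("literal '%c'\n", data[pos]);
}
```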

10.
The PMVS (Patch-based Multi-View Stereo) 3D reconstruction algorithm is widely used to reconstruct 3D scenes from UAV aerial images. To address its heavy computational load and high time complexity, a two-level-granularity parallel strategy combining CPU multithreading and the GPU (Multithread and GPU Parallel Schema, MGPS) is proposed. The method comprises a GPU-based parallel design for the feature extraction and patch expansion stages of PMVS, and a GPU/CPU task allocation mechanism for multiple images that assigns some tasks to multithreaded CPU execution and others to the GPU so that the total running time is minimized. Experiments were conducted on a high-performance server equipped with a 24-core CPU and an NVIDIA Tesla K20 GPU, reconstructing 16 UAV images at a resolution of 4081×2993. The results show that, compared with the serial PMVS algorithm, the MGPS-based PMVS achieves a speedup of about 4×, with feature extraction accelerated by up to 13× and the numerical error kept within 10%, yielding a more efficient PMVS 3D reconstruction. The MGPS-based PMVS algorithm can also be applied to cultural relic preservation, medical image processing, virtual reality, and other fields.

11.
This work studies the implementation of a dynamic pattern recognition algorithm on a GPU parallel computing platform. With the development of GPGPU (general-purpose graphics processing unit) hardware, large-scale GPU-based parallel computing can effectively handle the massive computation involved in dynamic pattern recognition. The paper introduces the dynamic pattern recognition algorithm, analyzes its large computational workload, decomposes the computation-intensive parts for parallelization, and removes the execution dependencies present in the original algorithm, finally obtaining a parallel implementation on a specific GPU platform, Jacket. Experiments show that, compared with the original serial CPU program, the parallelized GPU program achieves a clear speedup and therefore has strong value for engineering applications.

12.
This paper presents two schemes for parallel 2D discrete wavelet transform (DWT) on Compute Unified Device Architecture (CUDA) graphics processing units. In the first scheme, the image and filter are transformed to the spectral domain using the Fast Fourier Transform (FFT), multiplied, and then transformed back to the spatial domain using the inverse FFT. In the second scheme, the image pixels are convolved directly with the filters. Because there is no data dependence, the convolutions for data points at different positions can be executed concurrently. To reduce data transfer, the boundary extension and down-sampling are processed during the data loading stage, and transposition is completed implicitly during data storage. A similar technique is adopted when parallelizing the inverse 2D DWT. To further speed up data access, the filter coefficients are stored in constant memory. We have parallelized the 2D DWT for dozens of wavelet types and achieved a speedup factor of over 380 times compared with the CPU version. We applied the parallel 2D DWT in a ring artifact removal procedure, where execution was accelerated by nearly 200 times compared with the CPU version. The experimental results show that the proposed parallel 2D DWT on graphics processing units significantly improves performance for a wide variety of wavelet types and is promising for various applications. Copyright © 2015 John Wiley & Sons, Ltd.
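A minimal CUDA sketch of the second (direct-convolution) scheme is shown below: filter taps in constant memory, symmetric boundary extension, and down-sampling by two fused into the read index, with one output sample per thread. The filter length, tap values, and kernel name are placeholders rather than the paper's implementation.

```cuda
// Sketch of the direct-convolution scheme: one output per thread, filter taps
// in constant memory, symmetric boundary extension and dyadic down-sampling
// fused into the read index. FILTER_LEN and the taps are placeholders.
#include <cuda_runtime.h>

#define FILTER_LEN 9
__constant__ float c_lowpass[FILTER_LEN];     // filled via cudaMemcpyToSymbol

__device__ int reflect(int i, int n) {        // symmetric boundary extension
    if (i < 0)  return -i - 1;
    if (i >= n) return 2 * n - i - 1;
    return i;
}

// Row-wise low-pass + down-sample by 2: output has width/2 columns (width even).
__global__ void dwtRowLowpass(const float* in, float* out, int width, int height) {
    int ox = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int y  = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (ox >= width / 2 || y >= height) return;

    float acc = 0.0f;
    int center = 2 * ox;                              // dyadic down-sampling
    for (int k = 0; k < FILTER_LEN; ++k) {
        int x = reflect(center + k - FILTER_LEN / 2, width);
        acc += c_lowpass[k] * in[y * width + x];
    }
    out[y * (width / 2) + ox] = acc;
}
```

The taps would be uploaded once with cudaMemcpyToSymbol before the kernel launch; the column pass and the high-pass band follow the same pattern.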

13.
With the continuing rapid increase in the number of available 16S ribosomal RNA (rRNA) sequences, efficiently designing 16S rRNA targeted probes is a significant computational challenge. In our previous work, we designed a fast software tool called ProkProbePicker (PPP) that takes O(logN) time for a worst-case search. Despite this improvement, it can still take many hours for PPP to extract probes for all the clusters in a phylogenetic tree. Herein, a parallelized version of PPP is described. When run on 80 processors, this version of PPP took only 67 min to extract probes, whereas the sequential version of PPP needed some 87 h. The speedup increased linearly with the number of CPUs, demonstrating the excellent scalability of the parallelized version of PPP.

14.
In wireless communication, the Viterbi decoding algorithm (VDA) is one of the most popular channel decoding algorithms and is widely used in WLAN, WiMAX, and 3G communications. However, the throughput of the Viterbi decoder is constrained by its convolutional characteristic. Recently, the three-point VDA (TVDA) was proposed to solve this problem; in TVDA, the whole procedure is divided into three phases: the forward, trace-back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on multi-core CPU, graphics processing unit (GPU), and field-programmable gate array (FPGA), demonstrating approaches that fully exploit its performance potential on each computing platform. For CPU platforms, we apply two optimization methods, single instruction multiple data and multithreading, to gain an over 145× speedup over the naive CPU version on a quad-core CPU platform. For GPU platforms, we propose the combination of cached memory optimization, coalesced global memory accesses, a codeword packing scheme, and asynchronous data transfer, achieving a throughput of 404.65 Mbps and a 12× speedup over initial GPU versions on an NVIDIA GeForce GTX580 card, and a 7× speedup over the Intel quad-core CPU i5-2300, from the same manufacturing year and with both platforms fully optimized. In addition, for FPGA platforms, we customize a radix-4 pipelined architecture for the TVDA in a 45-nm FPGA chip from Xilinx (XC6VLX760). At a 209.15 MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we discuss the performance evaluation and efficiency comparison of these flexible architectures for real-time Viterbi decoding in terms of decoding throughput, power consumption, optimization schemes, programming costs, and price. Copyright © 2013 John Wiley & Sons, Ltd.
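For orientation, the sketch below shows one forward add-compare-select (ACS) step, the recurrence at the heart of every VDA variant, for a hard-decision rate-1/2 convolutional code. The constraint length and generator polynomials follow one common convention and are assumptions; the three-phase TVDA bookkeeping and the platform-specific optimizations are omitted.

```cuda
// Host-side sketch: one forward add-compare-select (ACS) step of hard-decision
// Viterbi decoding for a rate-1/2 convolutional code. K, G1, G2 follow one
// common convention and are illustrative only.
#include <climits>
#include <vector>

const int K = 7;                               // constraint length (assumed)
const int NUM_STATES = 1 << (K - 1);
const unsigned G1 = 0171, G2 = 0133;           // generator polynomials (octal)

static int parity(unsigned x) { int p = 0; while (x) { p ^= (int)(x & 1u); x >>= 1; } return p; }

// metric[s]: accumulated path metric of state s; (r0, r1): received hard bits.
// survivorPrev[ns] records the winning predecessor state for trace-back.
void acsStep(std::vector<int>& metric, std::vector<int>& survivorPrev,
             int r0, int r1) {
    std::vector<int> next(NUM_STATES, INT_MAX);
    survivorPrev.assign(NUM_STATES, -1);
    for (int s = 0; s < NUM_STATES; ++s) {
        if (metric[s] == INT_MAX) continue;                    // unreachable state
        for (int b = 0; b <= 1; ++b) {
            unsigned reg = ((unsigned)s << 1) | (unsigned)b;   // K encoder bits
            int o0 = parity(reg & G1), o1 = parity(reg & G2);  // expected outputs
            int bm = (o0 != r0) + (o1 != r1);                  // Hamming branch metric
            int ns = (int)(reg & (unsigned)(NUM_STATES - 1));  // drop the oldest bit
            if (metric[s] + bm < next[ns]) {                   // compare-select
                next[ns] = metric[s] + bm;
                survivorPrev[ns] = s;
            }
        }
    }
    metric.swap(next);
}
```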

15.
Based on the LZSS algorithm, a compression algorithm that incorporates the idea of WM multi-pattern matching (the WM_LZSS compression algorithm) is proposed. The algorithm uses a pattern library to automatically record long matching phrases that have appeared in the text read so far, and performs multi-pattern matching on the text in advance during compression. Experiments on WM_LZSS show that it achieves a higher compression ratio than LZSS in text file compression and is especially suitable for compressing long files with high internal text similarity.

16.
A Text File Compression Algorithm Based on LZSS   (Cited by: 1; self-citations: 0; by others: 1)
He Dan, Li Zhishu. 计算机应用 (Journal of Computer Applications), 2008, 28(9): 2335-2337
Building on the LZSS algorithm, a new compression algorithm better suited to text files is proposed. The algorithm uses a cache mechanism to automatically collect high-frequency phrases, which not only shortens compression and decompression times but also substantially improves the compression ratio. Comparative test data against LZSS show that the new algorithm clearly outperforms LZSS in text file compression.

17.
We present a novel algorithm to solve the non-negative single-source shortest path problem on road networks and graphs with low highway dimension. After a quick preprocessing phase, we can compute all distances from a given source in the graph with essentially a linear sweep over all vertices. Because this sweep is independent of the source, we are able to reorder vertices in advance to exploit locality. Moreover, our algorithm takes advantage of features of modern CPU architectures, such as SSE and multiple cores. Compared to Dijkstra’s algorithm, our method needs fewer operations, has better locality, and is better able to exploit parallelism at multi-core and instruction levels. We gain additional speedup when implementing our algorithm on a GPU, where it is up to three orders of magnitude faster than Dijkstra’s algorithm on a high-end CPU. This makes applications based on all-pairs shortest-paths practical for continental-sized road networks. Several algorithms, such as computing the graph diameter, arc flags, or exact reaches, can be greatly accelerated by our method.

18.
Transient simulation in circuit simulation tools such as SPICE and Xyce depends on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need to simulate very large circuits grows, the prevalence of multicore architectures enables us to use shared-memory parallel algorithms for such simulations, and a parallel factorization is a critical component of them. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently and maps well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate the irregular structure that arises in our target problems. It also uses a hierarchical two-dimensional data layout, which reduces synchronization costs and maps to the memory hierarchy found in multicore processors. We present an OpenMP-based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on the CPU (16 cores) and 7.4× on the Xeon Phi (32 cores) relative to the state-of-the-art solver KLU. Basker outperforms the Intel MKL Pardiso solver (PMKL) by as much as 30× on the CPU (16 cores) and 7.5× on the Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides a 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.

19.
We use the graphics processing unit (GPU) to accelerate tensor contractions, the most time-consuming operations in the variational method based on plaquette renormalized states. Using a frustrated Heisenberg J1–J2 model on a square lattice as an example, we implement the algorithm with the Compute Unified Device Architecture (CUDA). For a single plaquette contraction with bond dimension C = 3 for each tensor index, results are obtained 25 times faster on the GPU than on a current CPU core. This makes it possible to simulate systems of size 8 × 8 and larger, which are extremely time consuming on a single CPU. This technique greatly relieves the dependence of the computing time on C, whereas in the serial CPU computation the total required time scales with both C and the system size.

20.
We implemented a GPU-based parallel code to perform Monte Carlo simulations of the two-dimensional q-state Potts model. The algorithm is based on a checkerboard update scheme and assigns an independent random number generator to each thread. The implementation allows us to simulate systems of up to ~10^9 spins with an average time per spin flip of 0.147 ns on the fastest GPU card tested, representing a speedup of up to 155× compared with an optimized serial code running on a high-end CPU. The possibility of performing high-speed simulations at sufficiently large system sizes allowed us to provide positive numerical evidence for the existence of metastability in very large systems based on Binder's criterion, namely, on the existence or absence of specific heat singularities at spinodal temperatures different from the transition temperature.
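A minimal CUDA sketch of one checkerboard half-sweep of Metropolis updates for the q-state Potts model is given below. A simple per-thread LCG stands in for the independent generators mentioned above, the coupling J = 1 and periodic boundaries are assumed, and the lattice side L is taken to be even.

```cuda
// Sketch: one Metropolis half-sweep over sites of a given checkerboard colour
// for the 2D q-state Potts model. The per-thread LCG and the index mapping are
// simplifications for illustration; periodic boundary conditions, J = 1.
#include <cuda_runtime.h>

__device__ unsigned lcg(unsigned& s) { s = 1664525u * s + 1013904223u; return s; }

__global__ void pottsHalfSweep(int* spin, unsigned* rngState,
                               int L, int q, float beta, int colour) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per site of this colour
    int nSites = L * L / 2;
    if (idx >= nSites) return;

    // map idx to lattice coordinates of the requested checkerboard colour
    int row = idx / (L / 2);
    int col = 2 * (idx % (L / 2)) + ((row + colour) & 1);
    int site = row * L + col;

    int s  = spin[site];
    int sn = (s + 1 + (int)(lcg(rngState[idx]) % (unsigned)(q - 1))) % q;  // proposal != s

    int up    = spin[((row + L - 1) % L) * L + col];
    int down  = spin[((row + 1) % L) * L + col];
    int left  = spin[row * L + (col + L - 1) % L];
    int right = spin[row * L + (col + 1) % L];

    // Potts energy: -J per equal-neighbour bond, J = 1
    int eOld = -((up == s)  + (down == s)  + (left == s)  + (right == s));
    int eNew = -((up == sn) + (down == sn) + (left == sn) + (right == sn));
    float r  = (lcg(rngState[idx]) & 0xFFFFFFu) / 16777216.0f;   // uniform in [0,1)
    if (eNew <= eOld || r < __expf(-beta * (eNew - eOld)))
        spin[site] = sn;                                         // Metropolis accept
}
```

Alternating the two colours per sweep keeps every update independent of its neighbours, which is what lets all threads of one colour run concurrently without conflicts.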
