1.

Three‐dimensional thinning algorithms on graphics processing units and multicore CPUs

J. Jiménez J. Ruiz de Miras 《Concurrency and Computation》2012,24(14):1551-1571

Three‐dimensional curve skeletons are a very compact representation of three‐dimensional objects, with many applications in fields such as computer graphics, computer vision, and medical imaging. An important problem is that calculating the skeleton is very time‐consuming. Thinning is a widely used technique for calculating the curve skeleton because of the properties it ensures and its ease of implementation. In this paper, we present parallel versions of a thinning algorithm for efficient implementation on both graphics processing units and multicore CPUs. The parallel programming models used in our implementations are Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). The optimized parallel algorithms for the graphics processing unit achieve a speedup of 106.24x over the single‐process CPU version and more than 19x over the multithreaded CPU version. Copyright © 2011 John Wiley & Sons, Ltd.

2.

Efficient construction of bounding volume hierarchies into a complete octree for ray tracing

Ulises Olivares Héctor G. Rodríguez Arturo García Félix F. Ramos 《Computer Animation and Virtual Worlds》2016,27(3-4):358-368

This paper proposes an efficient construction scheme for bounding volume hierarchies based on a complete tree. This construction offers up to 4× faster construction times than the binned surface area heuristic and offers competitive ray traversal performance. The construction is fully parallelized on x86 CPU architectures; it takes advantage of the eight‐wide vector units and exploits the Advanced Vector Extensions available on current x86 CPU architectures. Additionally, this work presents a clustering algorithm for grouping primitives, which can be computed in linear time O(n). Furthermore, this construction uses the graphics processing unit to perform intensive operations efficiently. Copyright © 2016 John Wiley & Sons, Ltd.

3.

CPU–GPU hybrid parallel strategy for cosmological simulations

Yueqing Wang Yong Dou Song Guo Yuanwu Lei Dan Zou 《Concurrency and Computation》2014,26(3):748-765

Gadget is a simulation application for N‐body and smoothed particle hydrodynamics problems in cosmology, and it is widely applied to a range of cosmological problems. N‐body simulation focuses on the motion of N interacting particles, and smoothed particle hydrodynamics is a fluid simulation algorithm that studies the movement of fluid through particle simulation. Most scholars focus their attention on accelerating Gadget on multi‐core CPU or graphics processing unit (GPU) platforms. However, these research activities failed to achieve CPU–GPU hybrid computing, which resulted in a tremendous waste of CPU computing resources. In this paper, we propose a CPU–GPU hybrid parallel strategy to accelerate Gadget‐2, a massively parallel structure formation code for cosmological simulations. This strategy uses both the CPU and the GPU to calculate the short‐range force. To balance the CPU and GPU workloads, a dynamic task allocation scheme is proposed according to the difference in computational performance between the CPU and the GPU. Experimental results showed that our CPU–GPU hybrid parallel strategy achieved an overall speedup factor of 18.6 and a partial speedup factor for short‐range force calculation of 28.35 compared with a single‐core CPU implementation for particle counts on the order of millions. Moreover, compared with a GPU platform that contained 12 CPU cores and one GPU, our hybrid parallel strategy obtained overall and partial speedups of 6% and 20%, respectively. Furthermore, the hybrid strategy scales well: its performance improves as the problem size increases. However, the strategy also has a limitation: the performance gain shrinks as the ratio of CPU cores to GPU cards decreases. Finally, in our hybrid strategy, CPU utilization improved by 17.14% or better. Copyright © 2013 John Wiley & Sons, Ltd.
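
The dynamic task allocation idea in this abstract can be illustrated with a minimal Python sketch. It is not the Gadget‐2 implementation: the function names `split_work` and `rebalance`, and the choice to measure throughput as tasks per second from the previous step, are assumptions made for this example.

```python
def split_work(n_tasks, cpu_rate, gpu_rate):
    """Split n_tasks between CPU and GPU in proportion to their throughput.

    cpu_rate and gpu_rate are measured throughputs (tasks per second);
    both names are hypothetical and only serve this illustration.
    """
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    gpu_share = round(n_tasks * gpu_rate / (cpu_rate + gpu_rate))
    return n_tasks - gpu_share, gpu_share


def rebalance(n_tasks, cpu_tasks, cpu_seconds, gpu_tasks, gpu_seconds):
    """Re-derive the split for the next step from the last step's timings."""
    cpu_rate = cpu_tasks / cpu_seconds
    gpu_rate = gpu_tasks / gpu_seconds
    return split_work(n_tasks, cpu_rate, gpu_rate)


if __name__ == "__main__":
    # Start with an even split, then adapt once timings are available.
    cpu_n, gpu_n = split_work(1000, 1.0, 1.0)
    # Suppose the CPU needed 2.0 s and the GPU 1.0 s for their halves.
    print(rebalance(1000, cpu_n, 2.0, gpu_n, 1.0))
```

With this scheme the slower device receives fewer tasks in the next step, so both devices should finish at roughly the same time if their throughput stays stable.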

4.

Graphical processing unit‐based parallelization of the Open Shortest Path First and Border Gateway Protocol routing protocols

Dejan Dundjerski Miloš Tomašević 《Concurrency and Computation》2015,27(1):237-251

The exponentially growing number of devices on the Internet places an ever‐increasing load on network routers executing network protocols. Parallel processing has recently become an unavoidable means of scaling up router performance. The research effort described in this paper focuses on exploiting general‐purpose computing on graphics processing units to speed up the execution of network protocols. An additional benefit is off‐loading the CPU, which can then be fully dedicated to packet processing and forwarding. To this end, the Shortest Path First algorithm in the Open Shortest Path First protocol and the choice of the best routes in the Border Gateway Protocol are parallelized for efficient execution on the Compute Unified Device Architecture platform. An evaluation study was conducted on three different graphics processing units with representative network workloads for a varying number of routes and devices. The obtained speedup results confirmed the viability and cost‐effectiveness of such an approach. Copyright © 2014 John Wiley & Sons, Ltd.
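
For orientation, the Shortest Path First computation used by OSPF is Dijkstra's algorithm. The following is a plain serial Python sketch of that algorithm, not the CUDA parallelization the paper describes; the function name `shortest_path_first` and the dictionary-based graph layout are assumptions made for this example.

```python
import heapq


def shortest_path_first(graph, source):
    """Return the lowest path cost from source to every reachable router.

    graph maps each router to a dict {neighbour: link_cost}; this layout
    is an assumption made for the example.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


if __name__ == "__main__":
    topology = {
        "A": {"B": 1, "C": 4},
        "B": {"C": 2, "D": 5},
        "C": {"D": 1},
        "D": {},
    }
    print(shortest_path_first(topology, "A"))
```

The inner loop over a router's neighbours is the kind of independent work a GPU version could distribute across threads, although how the paper partitions the work is not stated in the abstract.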

5.

The positive temperature anomaly as detected by Landsat TM data in the eastern Marmara Sea (Turkey): possible link with the 1999 Izmit earthquake

M. T. Yürür 《International journal of remote sensing》2013,34(6):1205-1218

A long (～15 km) and narrow (～4 km) offshore positive temperature anomaly (～1.7°C) is observed in the Landsat Thematic Mapper (TM) thermal infrared (TIR) image acquired the day following the large İzmit earthquake (Mw 7.4) of 17 August 1999, in the eastern Marmara Sea, Turkey. The earthquake was generated along the North Anatolian Fault, which ruptured for about 150 km, and the anomaly formed at the western termination of this rupture. A discussion of whether this anomaly could have developed through processes other than seismic activity, together with considerations of fault geometry and sea bathymetry, suggests that the anomaly may result from aftershock activity near the western end of the earthquake fault. The formation of the anomaly requires the addition of a large quantity of hot water to the sea. The ascent of fault‐driven hot fluids to the sea bottom (seismic pumping) and the formation of thermal plumes may be the processes by which the sea surface temperature increased. Recent works and the present study suggest that TIR data analysis may be used as a tool in seismological studies.

6.

Speeding up solving of differential matrix Riccati equations using GPGPU computing and MATLAB

Jesus Peinado Jacinto J. Ibáñez Enrique Arias Vicente Hernández 《Concurrency and Computation》2012,24(12):1334-1348

In this work, we developed a parallel algorithm to speed up the solution of differential matrix Riccati equations using a backward differentiation formula algorithm based on a fixed‐point method. Differential matrix Riccati equations play an especially important role in several applications such as optimal control, filtering, and estimation. In some cases, the problem can be large, and it is worthwhile to speed up its solution as much as possible. Recently, modern graphics processing units (GPUs) have been used as a way to improve performance. In this paper, we used an approach based on general‐purpose computing on graphics processing units. We used NVIDIA GPUs with unified architecture. To do this, we used CUBLAS, a version of the basic linear algebra subprograms for GPUs, and a package for solving linear systems on GPUs (three different packages were studied). Moreover, we developed a MATLAB toolkit to use our implementation from MATLAB in such a way that if the user has a graphics card, the performance of the implementation is improved. If the user does not have such a card, the algorithm can also be run on the machine's CPU. Experimental results on an NVIDIA Quadro FX 5800 are shown. Copyright © 2011 John Wiley & Sons, Ltd.

7.

Feature tracking and matching in video using programmable graphics hardware

Sudipta N. Sinha Jan-Michael Frahm Marc Pollefeys Yakup Genc 《Machine Vision and Applications》2011,22(1):207-217

This paper describes novel implementations of the KLT feature tracking and SIFT feature extraction algorithms that run on the graphics processing unit (GPU) and are suitable for video analysis in real-time vision systems. While significant acceleration over standard CPU implementations is obtained by exploiting the parallelism provided by modern programmable graphics hardware, the CPU is freed up to run other computations in parallel. Our GPU-based KLT implementation tracks about a thousand features in real time at 30 Hz on 1,024 × 768 resolution video, which is a 20 times improvement over the CPU. The GPU-based SIFT implementation extracts about 800 features from 640 × 480 video at 10 Hz, which is approximately 10 times faster than an optimized CPU implementation.

8.

Gibraltar: A Reed‐Solomon coding library for storage applications on programmable graphics processors

Matthew L. Curry Anthony Skjellum H. Lee Ward Ron Brightwell 《Concurrency and Computation》2011,23(18):2477-2495

Reed–Solomon coding is a method for generating arbitrary amounts of erasure correction information from original data via matrix–vector multiplication in finite fields. Previous work has shown that modern CPUs are not well‐matched to this type of computation, requiring applications that depend on Reed–Solomon coding at high speeds (such as high‐performance storage arrays) to use hardware implementations. This work demonstrates that high performance is possible with current cost‐effective graphics processing units across a wide range of operating conditions and describes how performance will likely evolve in similar architectures. It describes the characteristics of the graphics processing unit architecture that enable high‐speed Reed–Solomon coding. A high‐performance practical library, Gibraltar, has been prototyped that performs Reed–Solomon coding on graphics processors in a manner suitable for storage arrays, along with applications with similar data resiliency needs. This library enables variably resilient erasure correcting codes to be used in a broad range of applications. Its performance is compared with that of a widely available CPU implementation, and a rationale for its API is presented. Its practicality is demonstrated through a usage example. Copyright © 2011 John Wiley & Sons, Ltd.
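
The matrix–vector multiplication in a finite field mentioned in this abstract can be sketched in a few lines of Python. This is an illustration only and does not reflect Gibraltar's API: the names `gf_mul` and `encode`, the field GF(2^8) with reducing polynomial 0x11D, and the small coding matrix are all assumptions made for this example.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8) by shift-and-add.

    The reducing polynomial 0x11D is a common choice for Reed-Solomon
    codes; it is an assumption here, not a detail taken from the paper.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= poly        # reduce back into 8 bits
        b >>= 1
    return result


def encode(matrix, data):
    """Compute one parity byte per matrix row: parity = matrix * data."""
    parity = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, data):
            acc ^= gf_mul(coeff, byte)
        parity.append(acc)
    return parity


if __name__ == "__main__":
    # Two parity bytes for three data bytes; the rows are illustrative.
    coding_matrix = [[1, 1, 1], [1, 2, 4]]
    print(encode(coding_matrix, [5, 7, 9]))
```

Each parity byte depends only on one matrix row and the data vector, so rows (and independent byte positions across a stripe) can be computed in parallel, which is the property a GPU implementation would rely on.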

9.

Parallel multi‐level 2D‐DWT on CUDA GPUs and its application in ring artifact removal

Leqing Zhu Yadong Zhou Daxing Zhang Dadong Wang Huiyan Wang Xun Wang 《Concurrency and Computation》2015,27(17):5188-5202

This paper presents two schemes for the parallel 2D discrete wavelet transform (DWT) on Compute Unified Device Architecture graphics processing units. In the first scheme, the image and filter are transformed to the spectral domain using the fast Fourier transform (FFT), multiplied, and then transformed back to the spatial domain using the inverse FFT. In the second scheme, the image pixels are convolved directly with the filters. Because there is no data dependence, the convolutions for data points at different positions can be executed concurrently. To reduce data transfer, boundary extension and down‐sampling are performed during the data loading stage, and transposition is completed implicitly during data storage. A similar technique is adopted when parallelizing the inverse 2D DWT. To further speed up data access, the filter coefficients are stored in constant memory. We have parallelized the 2D DWT for dozens of wavelet types and achieved a speedup of over 380 times compared with the CPU version. We applied the parallel 2D DWT in a ring artifact removal procedure; execution was accelerated nearly 200 times compared with the CPU version. The experimental results showed that the proposed parallel 2D DWT on graphics processing units can significantly improve performance for a wide variety of wavelet types and is promising for various applications. Copyright © 2015 John Wiley & Sons, Ltd.
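
The direct-convolution scheme can be sketched serially in Python. This is not the paper's CUDA code: the names `dwt1d` and `dwt2d`, the unnormalized Haar-style filters, and the periodic boundary extension are assumptions chosen to keep the example small.

```python
def dwt1d(signal, low=(0.5, 0.5), high=(0.5, -0.5)):
    """One DWT level: filter and keep every second output sample.

    Each coefficient is a sliding dot product over the input with
    periodic boundary extension, so no output depends on another.
    """
    n = len(signal)
    if n % 2:
        raise ValueError("signal length must be even")

    def coeff(filt, i):
        # Down-sampling by 2 is folded into the index 2 * i.
        return sum(f * signal[(2 * i + k) % n] for k, f in enumerate(filt))

    approx = [coeff(low, i) for i in range(n // 2)]
    detail = [coeff(high, i) for i in range(n // 2)]
    return approx, detail


def dwt2d(image, low=(0.5, 0.5), high=(0.5, -0.5)):
    """Separable 2D level: transform all rows, then all columns."""
    rows = [a + d for a, d in (dwt1d(list(r), low, high) for r in image)]
    cols = [a + d for a, d in (dwt1d(list(c), low, high) for c in zip(*rows))]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major


if __name__ == "__main__":
    print(dwt1d([1, 3, 5, 7]))
    print(dwt2d([[1, 3], [5, 7]]))
```

Because every call to `coeff` reads only the input signal, each output coefficient could be assigned to its own thread, which matches the abstract's point that positions can be processed concurrently.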

10.

A comparison of GPU strategies for unstructured mesh physics

Charles R. Ferenbaugh 《Concurrency and Computation》2013,25(11):1547-1558

There have been few efforts to date to write physics algorithms for general unstructured meshes (meshes composed of arbitrary polygons/polyhedra) on graphics processing units (GPUs). Typical strategies for GPU memory management, such as double‐buffering and coalescing memory accesses, are difficult to apply to the irregular memory storage patterns of unstructured meshes. This paper presents results from an initial GPU version of a typical unstructured mesh kernel. Three different memory management strategies are described and implemented. Timing results for all three strategies are presented, in some cases showing speedups of over 20 times compared with the original CPU code. Copyright © 2012 John Wiley & Sons, Ltd.

11.

High‐speed parallel implementations of the rainbow method based on perfect tables in a heterogeneous system

Jung Woo Kim Jungjoo Seo Jin Hong Kunsoo Park Sung‐Ryul Kim 《Software》2015,45(6):837-855

The computing power of graphics processing units (GPU) has increased rapidly, and there has been extensive research on general‐purpose computing on GPU (GPGPU) for cryptographic algorithms such as RSA, Elliptic Curve Cryptosystem (ECC), NTRU, and Advanced Encryption Standard. With the rise of GPGPU, commodity computers have become complex heterogeneous GPU+CPU systems. This new architecture poses new challenges and opportunities in high‐performance computing. In this paper, we present high‐speed parallel implementations of the rainbow method based on perfect tables, which is known as the most efficient time‐memory trade‐off, in the heterogeneous GPU+CPU system. We give a complete analysis of the effect of multiple checkpoints on reducing the cost of false alarms and take advantage of it for load balancing between GPU and CPU. For GTX460, our implementation is about 1.86 and 3.25 times faster than other GPU‐accelerated implementations, RainbowCrack and Cryptohaze, respectively, and for GTX580, 1.53 and 2.40 times faster. Copyright © 2014 John Wiley & Sons, Ltd.

12.

Co‐seismic surface ruptures produced by the 2005 Pakistan Mw 7.6 earthquake in the Muzaffarabad area, revealed by QuickBird imagery data

A. Lin J. Guo 《International journal of remote sensing》2013,34(1):235-246

High‐resolution QuickBird imagery data have been used to detect and analyse the co‐seismic surface ruptures produced by the 2005 Pakistan Mw 7.6 earthquake in the Muzaffarabad area. The analytical results and interpretations of the QuickBird images reveal that the co‐seismic surface ruptures are mostly concentrated on pre‐existing active faults striking northwest–southeast. Most of the co‐seismic surface ruptures take the form of compressional cracks with a right‐stepping en echelon geometric pattern. Individual cracks range from about a metre to 1 km in length and are generally 10 to 100 m long. In northern Muzaffarabad city, an east–west striking co‐seismic surface rupture zone ～1 km long occurred in the jog area between two northwest–southeast striking surface rupture zones. A zone of severe damage, along which all buildings completely collapsed, is concentrated in a deformation zone ～60 m wide on the uplifted side of the east–west striking surface rupture zone. Large‐scale landslides caused by strong ground motion are mostly confined to the uplifted side along the co‐seismic surface rupture zones. The deformation features and spatial distribution patterns of the co‐seismic surface ruptures and the ground motion direction indicate that the co‐seismic fault that triggered the 2005 Pakistan Mw 7.6 earthquake is a thrust fault with a right‐lateral slip component.

13.

Measurement of the left‐lateral displacement of Ms 8.1 Kunlun earthquake on 14 November 2001 using Landsat‐7 ETM+ imagery

Jian Guo Liu P. J. Mason Jiming Ma 《International journal of remote sensing》2013,34(10):1875-1891

An imageodesy study has been carried out, using pre‐ and post‐event Landsat‐7 Enhanced Thematic Mapper Plus (ETM+) images, to reveal regional co‐seismic displacement caused by the Ms 8.1 Kunlun earthquake in November 2001. The two Landsat scenes, Kusai Lake and Buka Daban, cover an area of some 57 600 km² (320 km W–E and about 180 km N–S), which includes most of the fault rupture zone. The co‐seismic displacement measured in the Kusai Lake scene shows that the average left‐lateral shift along the Kunlun fault is 4.8 m (ranging from 1.5 to 8.1 m) and the maximum shift appears west of the Kusai Lake. The splayed nature of the fault to the west of Buka Daban, where the fault splits into three branches, causes the displacement pattern to become complicated. Here the average left‐lateral shift, between the south side of the southern branch and the north side of the northern branch, is 4.6 m (ranging from 1.0 to 8.2 m). Our results also illustrate that the south side of the fault is the ‘active’ block, moving significantly in an east–south‐easterly direction, relative to the largely ‘stable’ northern block.

14.

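Water surface simulation based on the graphics processing unit

郭新钊 (Guo Xinzhao) 张军 (Zhang Jun) 《计算机仿真》 (Computer Simulation) 2010,27(1):218-221

Simulating water surface effects can greatly improve the realism of natural environment simulation. Traditional CPU‐based simulation has the drawback of consuming CPU time and system resources. To address this problem, a water surface simulation method based on the graphics processing unit (GPU) is established, and the implementation of water surface effects on the GPU and the reconstruction of the water surface mesh on the GPU are discussed. Because both the computation and the mesh reconstruction are carried out on the GPU, making full use of its powerful graphics processing capability, no additional system overhead is incurred. The method also improves the rendering of water surface detail, which increases the realism and real‐time performance of the water surface.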

15.

High-performance computing tools for the integrated assessment and modelling of social–ecological systems

《Environmental Modelling & Software》2013

16.

Multiscale and local search methods for real time region tracking with particle filters: local search driven by adaptive scale estimation on GPUs

Raúl Cabido Antonio S. Montemayor Juan José Pantrigo Bryson R. Payne 《Machine Vision and Applications》2009,21(1):43-58

Tracking systems are important in computer vision, with applications in surveillance, human computer interaction, etc. Consumer graphics processing units (GPUs) have experienced an extraordinary evolution in both computing performance and programmability, leading to greater use of the GPU for non-rendering applications. In this work we propose a real-time object tracking algorithm, based on the hybridization of particle filtering (PF) and a multi-scale local search (MSLS) algorithm, presented for both CPU and GPU architectures. The developed system provides successful results in precise tracking of single and multiple targets in monocular video, operating in real time at 70 frames per second for 640 × 480 video resolutions on the GPU, up to 1,100% faster than the CPU version of the algorithm.

17.

Optimization schemes and performance evaluation of Smith–Waterman algorithm on CPU, GPU and FPGA

Dan Zou Yong Dou Fei Xia 《Concurrency and Computation》2012,24(14):1625-1644

With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well‐known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S‐W) algorithm, as an example and demonstrate approaches that fully exploit its performance potential on CPU, GPU, and field‐programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data and multithreading, with compiler options, to gain over 70× speedups over naive CPU versions on quad‐core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50× speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12× speedups, instead of the 100× reported by some studies, over the Intel quad‐core CPU Q9400, under the same manufacturing technology and with both fully optimized. In addition, for FPGA platforms, we customize a linear systolic array for the S‐W algorithm in a 45‐nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. At a clock rate of only 133 MHz, the FPGA platform reaches the highest performance and is the most power‐efficient platform, using only 25 W compared with 190 W for the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.
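
As background for the optimizations listed in this abstract, the core Smith–Waterman recurrence can be written as a short serial Python sketch. It is an unoptimized reference form, not any of the paper's CPU, GPU, or FPGA versions; the function name `smith_waterman_score`, the linear gap penalty, and the scoring values are assumptions made for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the score
    matrix, since each cell depends on its left, upper, and upper-left
    neighbours only.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
```

Cells on the same anti-diagonal of the score matrix do not depend on each other, which is the parallelism that SIMD, GPU, and systolic-array implementations typically exploit.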

18.

Co‐seismic ruptures found up to 60 km south of the Kunlun fault after 14 November 2001, Ms 8.1, Kokoxili earthquake using Landsat‐7 ETM+ imagery

J. G. Liu C. E. Haselwimmer 《International journal of remote sensing》2013,34(20):4461-4470

A systematic visual interpretation of pre‐ and post‐earthquake Landsat‐7 ETM+ imagery of the 14 November 2001 Ms 8.1 Kokoxili earthquake has revealed significant post‐earthquake lineaments in the region south of the Kunlun fault, which we interpret as co‐seismic surface ruptures related to the event. This previously unreported surface rupturing is located in two broad swathes ～20 and ～60 km south of the main Kunlun fault. Pre‐existing lineaments and subtle tectonic geomorphologic features associated with these ruptures suggest that earthquake‐triggered displacement occurred along pre‐existing faults.

19.

A scalable approach to solving dense linear algebra problems on hybrid CPU‐GPU systems

Fengguang Song Jack Dongarra 《Concurrency and Computation》2015,27(14):3702-3723

Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU‐GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double‐precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared‐memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared‐memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd.

20.

Graphics processing unit‐based triangulations of Blinn molecular surfaces

Sérgio E.D. Dias Abel J.P. Gomes 《Concurrency and Computation》2011,23(17):2280-2291

Computing the surface of a molecule (e.g., a protein) plays an important role in the analysis of its geometric structure, as needed in the study of interactions between proteins, protein folding, protein docking, and so forth. There are a number of algorithms for the computation of molecular surfaces and their triangulations, but only a few take advantage of graphics processing unit computing. This paper describes a graphics processing unit‐based marching cubes algorithm to triangulate molecular surfaces. At the end of the paper, a performance analysis of three implementations (i.e., serial CPU, CUDA, and OpenCL) of the marching cubes‐based triangulation algorithm is presented to assess how molecular surfaces can be rendered in real time in the future. Copyright © 2011 John Wiley & Sons, Ltd.