Similar Documents
20 similar documents found.
1.
In this paper, we present the analysis and development of a cross-platform OpenCL implementation of the box-counting algorithm, which is one of the most widely used methods for estimating the Fractal Dimension. The Fractal Dimension is a relevant image analysis method used in several disciplines, but computing it is in general a time-consuming process, especially when working with 3D images. Unlike parallel programming models that are strictly tied to a hardware type and manufacturer, such as CUDA, OpenCL allows us to provide a single implementation suitable for execution on both GPUs and multi-core CPUs, whatever the hardware manufacturer. Sorting is a key part of the fast box-counting algorithm, and the final speedup is highly conditioned by the efficiency of the sorting algorithm used. Our study reveals that current OpenCL implementations of sorting algorithms are clearly slower than both CUDA implementations for GPUs and dedicated multi-core CPU implementations. Our OpenCL algorithm has been specifically optimized according to the type of the target device, and the results show average speedups of up to 7.46× on the GPU and 4× on the multi-core CPU, both relative to the single-threaded (sequential) CPU implementation.
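As a minimal sketch of the sort-based box counting for a single box size (written with Thrust rather than the authors' OpenCL code; the grid size, point layout, and the assumption that points lie in [0, g·s)³ are ours):

    // Count occupied boxes of edge length s in a 3D point cloud.
    // Each point is mapped to a linear box index; sorting groups equal
    // indices so unique() yields N(s), the number of non-empty boxes.
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/sort.h>
    #include <thrust/unique.h>

    struct BoxIndex {
        float s; int g;                 // box edge length, boxes per axis
        __host__ __device__ long long operator()(const float3& p) const {
            long long x = (long long)(p.x / s);
            long long y = (long long)(p.y / s);
            long long z = (long long)(p.z / s);
            return (x * g + y) * g + z; // linearized box coordinate
        }
    };

    long long count_boxes(thrust::device_vector<float3>& pts, float s, int g) {
        thrust::device_vector<long long> idx(pts.size());
        thrust::transform(pts.begin(), pts.end(), idx.begin(), BoxIndex{s, g});
        thrust::sort(idx.begin(), idx.end());
        auto end = thrust::unique(idx.begin(), idx.end());
        return end - idx.begin();       // N(s); repeat over scales, then
    }                                   // fit log N(s) vs log(1/s) for the FD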

2.
Today, there is a growing demand for computer vision and image processing in areas and applications such as military surveillance and biological and medical imaging. Edge detection is a vital image processing technique used as a pre-processing step in many computer vision algorithms. However, the presence of noise makes edge detection more challenging; therefore, an image restoration technique is needed to tackle this obstacle by presenting an adaptive solution. As processing complexity rises with recent high-definition technologies, the amount of data carried by an image is increasing dramatically, so increased processing power is needed to speed up the completion of certain tasks. In this paper, we present a parallel implementation of a hybrid algorithm comprising edge detection and image restoration, along with other processes, on the Compute Unified Device Architecture (CUDA) platform, exploiting the Single Instruction Multiple Thread (SIMT) execution model of a Graphics Processing Unit (GPU). The performance of the proposed method is tested and evaluated using well-known images from various applications. We evaluated the computation time of the parallel GPU implementation against sequential execution on the Central Processing Unit (CPU), both natively and with Hyper-Threading (HT). The naïve GPU version of the proposed edge detection, using direct global-memory access, is up to 37 times faster than the native CPU implementation, while the shared-memory version achieves speedups of up to 25 times over the native CPU implementation and 1.5 times over the HT implementation.
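A minimal sketch of the kind of naïve, global-memory edge-detection kernel such a baseline measures (a plain Sobel operator; the image layout and one-thread-per-pixel mapping are our assumptions, not the authors' code):

    // Naive Sobel edge detection: one thread per pixel, all reads from
    // global memory (the baseline typically compared against a
    // shared-memory tiled version).
    __global__ void sobel(const unsigned char* in, unsigned char* out,
                          int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;
        int gx = -in[(y-1)*w + x-1]     + in[(y-1)*w + x+1]
               - 2*in[y*w + x-1]        + 2*in[y*w + x+1]
               -   in[(y+1)*w + x-1]    + in[(y+1)*w + x+1];
        int gy = -in[(y-1)*w + x-1] - 2*in[(y-1)*w + x] - in[(y-1)*w + x+1]
               +  in[(y+1)*w + x-1] + 2*in[(y+1)*w + x] + in[(y+1)*w + x+1];
        int mag = abs(gx) + abs(gy);          // L1 approximation of gradient
        out[y*w + x] = mag > 255 ? 255 : mag; // clamp to 8-bit range
    }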

3.
The CPU/FPGA hybrid architecture is a common structure for reconfigurable computing. To simplify the use of the FPGA in such a hybrid architecture, this paper proposes a hardware-thread approach and designs an execution mechanism for hardware threads, so that the reconfigurable resources are used in the form of hardware threads. Software and hardware threads can execute in parallel as multiple threads communicating through shared data memory: the compute-intensive parts of a program run as hardware threads on the FPGA, while the control-intensive parts run as software threads on the CPU. Experiments on a hybrid-architecture platform simulated with the Simics software, in which DES, MD5SUM, and merge sort were restructured into mixed software/hardware multithreaded versions, show an average execution speedup of 2.30, effectively exploiting the computational capability of the CPU/FPGA hybrid architecture.
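As a host-side illustration of the shared-data threading model described (the FPGA side is only a stand-in function here; nothing below is the paper's actual mechanism):

    // Software/hardware co-threading sketch: a control thread on the CPU
    // and a compute thread (standing in for the FPGA hardware thread)
    // exchange work through a shared buffer.
    #include <thread>
    #include <atomic>
    #include <vector>

    std::vector<int> shared_buf(1024);       // shared data memory
    std::atomic<bool> ready{false}, done{false};

    void hw_thread() {                       // stand-in for the FPGA thread
        while (!ready.load(std::memory_order_acquire)) { }
        for (int& v : shared_buf) v *= 2;    // compute-intensive part
        done.store(true, std::memory_order_release);
    }

    int main() {
        std::thread hw(hw_thread);
        for (int i = 0; i < 1024; ++i) shared_buf[i] = i;  // control part
        ready.store(true, std::memory_order_release);
        while (!done.load(std::memory_order_acquire)) { }
        hw.join();
        return 0;
    }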

4.
In this paper we describe a new parallel Frequent Itemset Mining algorithm called "Frontier Expansion." This implementation is optimized to achieve high performance on a heterogeneous platform consisting of a shared-memory multiprocessor and multiple Graphics Processing Unit (GPU) coprocessors. Frontier Expansion is an improved data-parallel algorithm derived from the Equivalent Class Clustering (Eclat) method, in which a partial breadth-first search is utilized to exploit maximum parallelism while being constrained by the available memory capacity. In our approach, the vertical transaction lists are represented using a "bitset" representation and operated on with wide bitwise operations across multiple threads on a GPU. We evaluated our approach using four NVIDIA Tesla GPUs and observed a 6-30× speedup relative to state-of-the-art sequential Eclat and FPGrowth implementations executed on a multi-core CPU.
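The core of the bitset idea can be sketched as a support-counting kernel: intersect two vertical transaction lists word by word and popcount the result (the word layout is an assumption; this is not the Frontier Expansion code itself):

    // Support of a candidate itemset = |tidlist(A) AND tidlist(B)|.
    // Each thread ANDs one 32-bit word of the two bit vectors, counts
    // the set bits, and accumulates into a single support counter.
    __global__ void intersect_support(const unsigned int* listA,
                                      const unsigned int* listB,
                                      int nwords, int* support) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nwords)
            atomicAdd(support, __popc(listA[i] & listB[i]));
    }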

5.
We have implemented a parallel version of a dynamic programming biological sequence comparison algorithm to study the potential applicability of parallel computers to genetic sequence comparison. Our parallel program is built using C-Linda, a machine-independent parallel programming language, and was tested on both a 10-CPU Sequent Symmetry and a 64-CPU Intel Hypercube. C-Linda implements a shared associative memory model, "tuple space," through which multiple processes can communicate and coordinate control. In our master-worker (MW) parallel implementation, a master process creates several worker processes, extracts a test sequence and multiple library sequences from a database, and stores them in tuple space. Each worker reads the test sequence and then repeatedly extracts library strings from tuple space, performs pairwise sequence comparison using a local comparison algorithm to generate a similarity score, and returns the similarity scores to tuple space. The master collects the scores from tuple space and identifies the best match over all library sequences. We also implemented a method of global inter-worker communication that reduces the total search time by stopping string comparisons that have no chance of improving on the current best match. Comparisons of total run time, speedup, and efficiency were made between parallel and sequential versions of the basic MW implementation, as well as versions with the global abort threshold.
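A sketch of the master-worker pattern in plain C++ threads (a mutex-protected queue stands in for C-Linda's tuple space, and compare() is a placeholder for the dynamic-programming scorer; none of this is the paper's Linda code):

    #include <thread>
    #include <mutex>
    #include <queue>
    #include <vector>

    std::mutex mtx;
    std::queue<int> work;                     // library sequence ids
    std::vector<int> scores;

    int compare(int lib_id) { return lib_id % 100; }  // stand-in scorer

    void worker() {
        for (;;) {
            int id;
            {   std::lock_guard<std::mutex> lk(mtx);  // "in" one work tuple
                if (work.empty()) return;
                id = work.front(); work.pop();
            }
            scores[id] = compare(id);                 // "out" the result
        }
    }

    int main() {
        int nlib = 1000, nworkers = 8;
        scores.assign(nlib, 0);
        for (int i = 0; i < nlib; ++i) work.push(i);  // master fills the queue
        std::vector<std::thread> ws;
        for (int i = 0; i < nworkers; ++i) ws.emplace_back(worker);
        for (auto& t : ws) t.join();                  // master collects scores
        return 0;
    }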

6.
This article presents a GPU-based single-unit deadlock detection methodology and its algorithm, GPU-OSDDA. Our design utilizes the parallel hardware of the GPU to perform the computations and is thus able to overcome the major limitation of prior hardware-based approaches: it can handle thousands of processes and resources while achieving real-world run times. By storing the algorithm matrices as bit vectors and designing novel, efficient algorithmic methods, we not only reduce memory usage dramatically but also achieve two orders of magnitude speedup over CPU equivalents. Additionally, GPU-OSDDA acts as an interactive service to the CPU: all of the aforementioned computations and matrix-management techniques take place on the GPU, requiring minimal interaction with the CPU. GPU-OSDDA was implemented on three GPU cards: Tesla C2050, Tesla K20c, and Titan X. Our design shows overall speedups of 6-595× over CPU equivalents.
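To make the bit-vector idea concrete, here is a minimal host-side reduction for single-unit resources, with one 64-bit word per process row so "all requests grantable" is a single AND (our illustration, not GPU-OSDDA itself):

    #include <cstdint>
    #include <vector>

    bool deadlocked(int n, const std::uint64_t* alloc,
                    const std::uint64_t* request, std::uint64_t avail) {
        std::vector<char> done(n, 0);
        for (bool progress = true; progress; ) {
            progress = false;
            for (int p = 0; p < n; ++p) {
                // reducible: every requested resource bit is available
                if (!done[p] && (request[p] & ~avail) == 0) {
                    avail |= alloc[p];   // finished process frees resources
                    done[p] = 1;
                    progress = true;
                }
            }
        }
        for (int p = 0; p < n; ++p)
            if (!done[p]) return true;   // irreducible process => deadlock
        return false;
    }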

7.
Three-dimensional curve skeletons are a very compact representation of three-dimensional objects with many uses and applications in fields such as computer graphics, computer vision, and medical imaging. An important problem is that calculating the skeleton is a very time-consuming process. Thinning is a widely used technique for calculating the curve skeleton because of the properties it ensures and its ease of implementation. In this paper, we present parallel versions of a thinning algorithm for efficient implementation on both graphics processing units and multi-core CPUs. The parallel programming models used in our implementations are the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL). The optimized parallel algorithm for the graphics processing unit achieves a speedup of 106.24× over the single-process CPU version and more than 19× over the multithreaded CPU version.
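A sketch of one thinning sub-iteration as a CUDA kernel; for brevity this is the classic 2D Zhang-Suen test on a 0/1 image, not the paper's 3D algorithm, but the parallel structure (one thread per voxel, delete flags applied after the pass) is the same:

    __global__ void thin_pass(const unsigned char* img, unsigned char* del,
                              int w, int h, int pass) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w-1 || y >= h-1 || !img[y*w+x]) return;
        // neighbours p2..p9, clockwise starting from north
        int p[8] = { img[(y-1)*w+x],   img[(y-1)*w+x+1], img[y*w+x+1],
                     img[(y+1)*w+x+1], img[(y+1)*w+x],   img[(y+1)*w+x-1],
                     img[y*w+x-1],     img[(y-1)*w+x-1] };
        int B = 0, A = 0;              // neighbour count, 0->1 transitions
        for (int i = 0; i < 8; ++i) {
            B += p[i];
            A += (p[i] == 0 && p[(i+1)%8] == 1);
        }
        int c1 = pass ? p[0]*p[2]*p[6] : p[0]*p[2]*p[4];
        int c2 = pass ? p[0]*p[4]*p[6] : p[2]*p[4]*p[6];
        del[y*w+x] = (B >= 2 && B <= 6 && A == 1 && c1 == 0 && c2 == 0);
    }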

8.
This paper presents a reformulation of bidirectional path tracing that adequately divides the algorithm into processes efficiently executed in parallel on both the CPU and the GPU. We thus benefit from high-level optimization techniques such as double buffering, batch processing, and asynchronous execution, as well as from the exploitation of most of the CPU, GPU, and memory bus capabilities. Our approach, while avoiding the limitations of pure GPU implementations (such as limited complexity of shaders, light or camera models, and processed scene data sets), is more than ten times faster than standard bidirectional path-tracing implementations, leading to performance suitable for production-oriented rendering engines.
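A minimal sketch of double-buffered, asynchronous CPU/GPU batching in the spirit described (names, sizes, and the two stubs are illustrative, not the paper's renderer):

    #include <cuda_runtime.h>

    #define BATCH_BYTES (1 << 20)
    __global__ void trace_kernel(float* batch) { /* GPU stage (stub) */ }
    void prepare_batch(float* buf, int i) { /* CPU stage (stub) */ }

    void run_pipeline(int nbatches) {
        cudaStream_t s;  cudaStreamCreate(&s);
        float *host[2], *dev;
        cudaHostAlloc(&host[0], BATCH_BYTES, cudaHostAllocDefault); // pinned
        cudaHostAlloc(&host[1], BATCH_BYTES, cudaHostAllocDefault);
        cudaMalloc(&dev, BATCH_BYTES);
        prepare_batch(host[0], 0);
        for (int i = 0; i < nbatches; ++i) {
            cudaMemcpyAsync(dev, host[i & 1], BATCH_BYTES,
                            cudaMemcpyHostToDevice, s);
            trace_kernel<<<256, 256, 0, s>>>(dev);
            if (i + 1 < nbatches)
                prepare_batch(host[(i + 1) & 1], i + 1); // overlaps GPU work
            cudaStreamSynchronize(s);  // batch i done before buffer reuse
        }
        cudaFree(dev); cudaFreeHost(host[0]); cudaFreeHost(host[1]);
        cudaStreamDestroy(s);
    }

While the GPU traces batch i from one pinned buffer, the CPU fills the other, which is the essence of the double-buffering the abstract credits for its speedup.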

9.
10.
A recently proposed pipelined multithreading (PMT) technique exhibits wide applicability in parallelizing general sequential programs on multi-core processors. However, significant inter-core communication overhead limits PMT performance and prevents its commercial adoption. A simple and effective clustered pipelined multithreading (CPMT) approach is presented to accelerate sequential programs on commodity multi-core processors. This CPMT technique adopts a clustered communication mechanism that achieves very low average communication overhead by eliminating false sharing and by reducing communication-operation and transit delays in a software-only approach. A single-producer/single-consumer concurrent lock-free clusteredQueue algorithm based on a two-level queue structure is also proposed. The correctness of CPMT is demonstrated theoretically. The performance of the algorithm and of CPMT is evaluated on a commodity AMD Phenom four-core processor. With appropriate parameters, the enqueue and dequeue operations of the algorithm take 20.8 and 23 cycles, respectively. The speedup of CPMT ranges from 13.1% to 119.8% for typical loops extracted from the SPEC CPU 2000 benchmark suite.
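A minimal single-producer/single-consumer ring buffer showing the two countermeasures the abstract relies on: head and tail on separate cache lines (no false sharing) and acquire/release ordering only (a sketch, not the clusteredQueue itself):

    #include <atomic>
    #include <cstddef>

    template <typename T, size_t N>          // N must be a power of two
    struct SpscQueue {
        alignas(64) std::atomic<size_t> head{0};  // consumer position
        alignas(64) std::atomic<size_t> tail{0};  // producer position
        T buf[N];

        bool enqueue(const T& v) {           // producer thread only
            size_t t = tail.load(std::memory_order_relaxed);
            if (t - head.load(std::memory_order_acquire) == N) return false;
            buf[t & (N - 1)] = v;
            tail.store(t + 1, std::memory_order_release);
            return true;
        }
        bool dequeue(T& v) {                 // consumer thread only
            size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire)) return false;
            v = buf[h & (N - 1)];
            head.store(h + 1, std::memory_order_release);
            return true;
        }
    };

The clustered mechanism goes further by batching several elements per head/tail update, which amortizes the cache-coherence traffic this simple version pays on every operation.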

11.
Gadget is a simulation application for N-body and smoothed particle hydrodynamics (SPH) problems in cosmology, and it is widely applied to a range of cosmological problems. N-body simulation focuses on the motion of N interacting particles, while SPH is a fluid-simulation algorithm that studies fluid motion through particle simulation. Most previous work has focused on accelerating Gadget on multi-core CPU or graphics processing unit (GPU) platforms alone; such approaches fail to achieve CPU-GPU hybrid computing and therefore waste considerable CPU computing resources. In this paper, we propose a CPU-GPU hybrid parallel strategy to accelerate Gadget-2, a massively parallel structure-formation code for cosmological simulations. This strategy uses both the CPU and the GPU for the short-range force calculation. To keep the CPU and GPU workloads balanced, a dynamic task-allocation scheme is proposed based on the measured performance difference between the two devices. Experimental results show that our hybrid strategy achieved an overall speedup factor of 18.6, and a speedup factor of 28.35 for the short-range force calculation, compared with a single-core CPU implementation for million-particle problems. Moreover, compared with a platform containing 12 CPU cores and one GPU, our hybrid strategy improved overall performance by 6% and short-range force performance by 20%. The hybrid strategy also scales well: its benefit grows as the problem size increases, although the gain shrinks as the ratio of CPU cores to GPU cards decreases. Finally, under our hybrid strategy, CPU utilization improved by at least 17.14%.
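A small sketch of such a dynamic split: the next step's GPU share follows the throughput each device achieved on the previous step (the timing stubs are illustrative, and the concurrent execution of the two halves is elided for clarity):

    double time_gpu_forces(int lo, int hi);   // assumed: runs + times GPU part
    double time_cpu_forces(int lo, int hi);   // assumed: runs + times CPU part

    void simulate(int nparticles, int nsteps) {
        double gpu_rate = 1.0, cpu_rate = 1.0;     // particles per second
        for (int step = 0; step < nsteps; ++step) {
            double f = gpu_rate / (gpu_rate + cpu_rate);
            int split = (int)(f * nparticles);     // GPU takes [0, split)
            double tg = time_gpu_forces(0, split);
            double tc = time_cpu_forces(split, nparticles);
            gpu_rate = split / tg;                 // refresh the estimates
            cpu_rate = (nparticles - split) / tc;  // for the next step
        }
    }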

12.
In biological research, computer alignment of protein sequences is often needed to find similarities between them. Although the alignment of two sequences can be computed in reasonable time, massive alignment problems such as protein database search remain very central processing unit (CPU) time-consuming. In this paper, an optimized protein database search method is presented and tested with the Swiss-Prot database on graphics processing unit (GPU) devices, and the power of CPU multi-threaded computing is also employed to realize GPU-based heterogeneous parallelism. In the proposed method, a hybrid alignment approach combines the Smith-Waterman local alignment algorithm with the Needleman-Wunsch global alignment algorithm, and parallel database search is realized with the Compute Unified Device Architecture (CUDA) parallel computing framework. In the experiments, the algorithm was tested on a lower-end and a higher-end personal computer equipped with GeForce GTX 750 Ti and GeForce GTX 1070 graphics cards, respectively. The results show that the proposed parallel method achieves a speedup of up to 138.86 times over its serial counterpart, significantly improving the efficiency and convenience of protein database search.
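A sketch of the coarse-grained database-search parallelization: one thread scores the query against one database sequence with Smith-Waterman, using a rolling one-row DP (the scoring constants, flat sequence layout, and query-length cap are our assumptions):

    #define MAXQ 128                  // max query length (assumption)

    __global__ void sw_search(const char* query, int qlen,
                              const char* db, const int* offs,
                              const int* lens, int nseqs, int* best) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nseqs || qlen > MAXQ) return;
        const char* seq = db + offs[s];
        int H[MAXQ + 1];              // previous DP row, one per thread
        for (int j = 0; j <= qlen; ++j) H[j] = 0;
        int hi = 0;
        for (int i = 1; i <= lens[s]; ++i) {
            int diag = 0;             // H[i-1][0]
            for (int j = 1; j <= qlen; ++j) {
                int up = H[j], left = H[j-1];
                int m = diag + (seq[i-1] == query[j-1] ? 2 : -1);
                int h = max(max(m, 0), max(up - 2, left - 2)); // gap = -2
                diag = up; H[j] = h;  // roll the row forward
                if (h > hi) hi = h;
            }
        }
        best[s] = hi;                 // local-alignment score for sequence s
    }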

13.
This paper presents a deep and extensive performance analysis of the particle filter (PF) algorithm for a very compute-intensive 3D multi-view visual tracking problem. We compare different implementations and parameter settings of the PF algorithm on a CPU platform, taking advantage of the multithreading capabilities of modern processors, and on a graphics processing unit (GPU) platform using the NVIDIA CUDA computing environment as the development framework. We extend our experimental study to each individual stage of the PF algorithm and evaluate the quality-versus-performance trade-off among different ways to design these stages. We have observed that the GPU platform performs better than the multithreaded CPU platform when handling a large number of particles, but we also demonstrate that hybrid CPU/GPU implementations can run almost as fast as GPU-only solutions.
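For reference, one bootstrap PF step with the stages the paper benchmarks separately: predict, weight, and systematic resampling (a host-side sketch with a toy 1D motion and likelihood model, not the paper's 3D tracker):

    #include <vector>
    #include <random>
    #include <cmath>
    #include <algorithm>

    void pf_step(std::vector<double>& x, std::vector<double>& w, double z,
                 std::mt19937& rng) {
        size_t n = x.size();
        std::normal_distribution<double> noise(0.0, 0.1);
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i) {          // predict + weight
            x[i] += noise(rng);                   // toy motion model
            w[i] = std::exp(-0.5 * (z - x[i]) * (z - x[i])); // toy likelihood
            sum += w[i];
        }
        std::vector<double> nx(n);                // systematic resampling
        std::uniform_real_distribution<double> u01(0.0, 1.0 / n);
        double u = u01(rng), c = w[0] / sum;
        for (size_t i = 0, j = 0; i < n; ++i) {
            while (j < n - 1 && u > c) c += w[++j] / sum;
            nx[i] = x[j];                         // clone particle j
            u += 1.0 / n;
        }
        x.swap(nx);
        std::fill(w.begin(), w.end(), 1.0 / n);   // reset to uniform weights
    }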

14.
15.
Single-particle cryo-electron microscopy is one of the key techniques in structural biology. RELION (regularized likelihood optimization), a Bayesian software package for cryo-EM 3D image data processing, offers good performance and ease of use and has attracted wide attention; however, its enormous computational demands limit its application. Targeting the characteristics of the RELION algorithm, this paper studies GPU-based parallel optimization. We first give a thorough analysis of RELION's principles, the algorithmic structure of the program, and its performance bottlenecks. On this basis, the program is redesigned for the fine-grained GPU architecture and a GPU-based multi-level parallel model is proposed. To obtain good performance, RELION's data structures are reorganized, and an adaptive parallel framework is designed to avoid running out of GPU memory. Experimental results show that the GPU-based RELION implementation performs well: compared with a single CPU, the whole application is accelerated by more than 36 times, and the compute-intensive algorithms by more than 75 times. Tests on multiple GPUs show that GPU-based RELION also scales well.
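A small sketch of the kind of adaptive batching such a framework needs: size each batch by the device memory actually free, so processing never exhausts the GPU (only the sizing idea is shown; this is not RELION's code, and bytes_per_item is an assumed per-image cost):

    #include <cuda_runtime.h>
    #include <cstddef>

    size_t pick_batch_size(size_t bytes_per_item, size_t max_items) {
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);    // current free device memory
        size_t usable = free_b * 8 / 10;      // keep 20% headroom
        size_t n = usable / bytes_per_item;
        return n < max_items ? n : max_items; // clamp to the work remaining
    }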

16.
To address the low efficiency of the SKINNY encryption algorithm when implemented on a central processing unit (CPU), a fast implementation method based on the graphics processing unit (GPU) is proposed. First, an optimization scheme based on the structure of the SKINNY algorithm merges its five separate step operations into a single combined operation. Then, the characteristics of the algorithm's Electronic CodeBook (ECB) and CounTeR (CTR) modes are analyzed, and parallel designs covering parallel granularity, memory allocation, and related choices are given. Experimental results show that, compared with the traditional CPU implementation, the SKINNY algorithm implemented on the Compute Unified Device Architecture (CUDA) achieves greatly improved efficiency and throughput. Specifically, for data sizes of 16 MB or more, the proposed implementation reaches a peak efficiency gain of 99.85% with a peak speedup of 671 in ECB mode, and a peak efficiency gain of 99.87% with a peak speedup of 765 in CTR mode. Compared with existing parallel AES-256 (ECB) and SKINNY_ECB algorithms, the throughput of the proposed parallel SKINNY-256 (ECB) algorithm is 1.29 times and 2.55 times theirs, respectively.
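The CTR-mode parallelization can be sketched as follows: keystream block i is the cipher applied to nonce‖i, so every thread encrypts one 16-byte block independently (skinny128_encrypt is an assumed helper implemented elsewhere; only the parallel organization is shown):

    __device__ void skinny128_encrypt(const unsigned char* in,
                                      unsigned char* out,
                                      const unsigned char* roundkeys);
                                      // assumed full round function

    __global__ void ctr_encrypt(const unsigned char* pt, unsigned char* ct,
                                const unsigned char* nonce,     // 8 bytes
                                const unsigned char* roundkeys,
                                unsigned long long nblocks) {
        unsigned long long i = blockIdx.x * (unsigned long long)blockDim.x
                             + threadIdx.x;
        if (i >= nblocks) return;
        unsigned char ctr[16], ks[16];
        for (int b = 0; b < 8; ++b) ctr[b] = nonce[b];          // nonce half
        for (int b = 0; b < 8; ++b) ctr[8+b] = (i >> (56 - 8*b)) & 0xff;
        skinny128_encrypt(ctr, ks, roundkeys);                  // keystream
        for (int b = 0; b < 16; ++b)
            ct[16*i + b] = pt[16*i + b] ^ ks[b];                // XOR
    }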

17.
Efficient CUDA Implementation Techniques for the RSA Algorithm
CUDA (Compute Unified Device Architecture), a new computing architecture supporting general-purpose computation on GPUs, has been widely applied to large-scale data-parallel computing. RSA is a compute-intensive public-key cryptographic algorithm. This paper presents efficient CUDA-based parallel implementation techniques for RSA, the key being the launch of a large number of independent, concurrent Montgomery modular multiplication threads; the concrete thread organization, data storage structures, and shared-memory-based performance optimizations are given. Using this CUDA implementation, the performance and throughput of the RSA algorithm were measured on a particular GPU. Experimental results show that, compared with a general-purpose CPU implementation of RSA, the CUDA implementation achieves a speedup of more than 40 times.
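A sketch of that thread organization: each CUDA thread performs one complete modular exponentiation as left-to-right square-and-multiply over Montgomery products (the three helpers are assumed to exist and to tolerate aliased arguments; only the parallel structure is shown):

    #define WORDS 32                   // e.g. 1024-bit operands (assumption)

    __device__ void mont_mul(const unsigned int* a, const unsigned int* b,
                             unsigned int* r);       // r = a*b*R^-1 mod n
    __device__ void to_mont(const unsigned int* x, unsigned int* xm,
                            unsigned int* one_m);    // Montgomery forms
    __device__ void from_mont(const unsigned int* xm, unsigned int* x);

    __global__ void rsa_batch(const unsigned int* msgs, unsigned int* outs,
                              const unsigned int* e, int ebits, int nmsg) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nmsg) return;
        unsigned int base[WORDS], acc[WORDS];
        to_mont(msgs + t * WORDS, base, acc);        // acc starts as 1
        for (int i = ebits - 1; i >= 0; --i) {       // left-to-right binary
            mont_mul(acc, acc, acc);                 // square
            if ((e[i / 32] >> (i % 32)) & 1)         // exponent bit i
                mont_mul(acc, base, acc);            // multiply by base
        }
        from_mont(acc, outs + t * WORDS);
    }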

18.
The current trend in the development of parallel programming models is to combine different well-established models into a single programming model in order to support efficient implementation of a wide range of real-world applications. The dataflow model in particular has managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared-memory models; their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard. This article presents DaSH, the first comprehensive benchmark suite for hybrid dataflow and shared-memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs, which capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared-memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.
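For reference, the dataflow style this targets, in the OpenMP 4.0 task-dependency syntax the article mentions: tasks are ordered by in/out sets on shared variables rather than by explicit barriers (a minimal self-contained example, not a DaSH benchmark):

    #include <stdio.h>

    int main(void) {
        int a = 0, b = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)
            a = 1;                                 // producer of a
            #pragma omp task depend(out: b)
            b = 2;                                 // independent, may run first
            #pragma omp task depend(in: a) depend(in: b)
            printf("%d\n", a + b);                 // runs after both producers
        }
        return 0;
    }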

19.

The Needleman-Wunsch (NW) algorithm is a dynamic programming algorithm used in the pairwise global alignment of two biological sequences. In this paper, three sets of parallel implementations of the NW algorithm are presented using a mixture of specialized software and hardware solutions: a POSIX Threads-based, a SIMD Extensions-based, and a GPU-based implementation. The three implementations aim at improving the performance of the NW algorithm on large-scale input without affecting its accuracy. Our experiments show that the GPU-based implementation is the best, achieving performance 72.5× faster than the sequential implementation, whereas the best performance achieved by the POSIX Threads and SIMD techniques is 2× and 18.2× faster than the sequential implementation, respectively.
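For reference, the NW scoring recurrence as a sequential host-side function (our sketch, with linear gap penalties as assumed constants). The parallel versions exploit the fact that all cells on one anti-diagonal (i + j constant) depend only on the two previous anti-diagonals, so they can be filled concurrently by threads or SIMD lanes:

    #include <vector>
    #include <string>
    #include <algorithm>

    int nw_score(const std::string& a, const std::string& b,
                 int match, int mismatch, int gap) {   // gap is negative
        size_t n = a.size(), m = b.size();
        std::vector<int> prev(m + 1), cur(m + 1);
        for (size_t j = 0; j <= m; ++j) prev[j] = (int)j * gap;
        for (size_t i = 1; i <= n; ++i) {
            cur[0] = (int)i * gap;
            for (size_t j = 1; j <= m; ++j) {
                int diag = prev[j-1] + (a[i-1] == b[j-1] ? match : mismatch);
                cur[j] = std::max(diag,
                         std::max(prev[j] + gap, cur[j-1] + gap));
            }
            std::swap(prev, cur);                      // roll the DP rows
        }
        return prev[m];                                // global alignment score
    }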


20.
曹建立, 陈志奎, 王宇新, 郭禾. 《计算机工程》, 2021, 47(9): 217-226, 234.
To overcome the inability of the traditional seed-filling algorithm to fully utilize multi-core processors and its need for manually specified seeds, a parallel random-seed inverse filling algorithm based on dynamic connectivity and union-find is proposed. The filling task is divided into random seed generation, parallel filling, connected-region identification, and parallel merging-and-inversion steps, and CPU and GPU versions of each step are implemented in C++ and CUDA C, respectively. On this basis, the parameter combination that best exploits the hardware is selected from the many candidates. Experimental results show that, compared with the traditional inverse filling algorithm, the parallel random-seed inverse filling algorithm fully utilizes the multithreaded parallelism of multi-core and heterogeneous processors, achieving average speedups of 3.84 and 4.43 when processing single and batched images at six different resolutions, with peak speedups of 6.05 and 7.09 on 8K high-resolution images.
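The union-find structure at the heart of the merging step can be sketched as follows: regions grown from different random seeds that turn out to touch are merged by uniting their labels (a minimal host-side sketch with path compression, not the paper's implementation):

    #include <vector>

    struct UnionFind {
        std::vector<int> parent;
        explicit UnionFind(int n) : parent(n) {
            for (int i = 0; i < n; ++i) parent[i] = i;  // each label is a root
        }
        int find(int x) {                    // path halving on the way up
            while (parent[x] != x) x = parent[x] = parent[parent[x]];
            return x;
        }
        void unite(int x, int y) {           // merge two region labels
            parent[find(x)] = find(y);
        }
    };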
