Similar documents
20 similar documents found (search time: 31 ms)
1.
Graphics processing units and genetic programming: an overview
A top-end graphics card (GPU) plus a suitable SIMD interpreter can deliver a several-hundred-fold speedup, yet costs less than the computer holding it. We highlight AI and computational intelligence applications in the new field of general-purpose computing on graphics hardware (GPGPU). In particular, we survey the use of genetic programming (GP) on GPUs, give several applications from bioinformatics, and show how the fastest GP systems are based on an interpreter rather than compilation. Finally, using GP to generate GPU CUDA kernel C++ code is sketched.
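The interpreter-based approach can be illustrated with a minimal CUDA sketch (not the authors' system): a single postfix GP program is broadcast to every thread, and each thread interprets it over a different fitness case, so all test cases are evaluated in one SIMD pass. The opcode set, stack depth and data layout below are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical opcode set for a tiny arithmetic GP language.
enum Op { OP_X = 0, OP_CONST = 1, OP_ADD = 2, OP_SUB = 3, OP_MUL = 4 };

struct Instr { int op; float value; };   // value is used only by OP_CONST

#define MAX_PROG_LEN 64
#define MAX_STACK    16

__constant__ Instr d_prog[MAX_PROG_LEN];   // one program, broadcast to all threads

// Each thread interprets the same program on its own fitness case (SIMD interpretation).
__global__ void eval_program(const float* x, float* out, int n_cases, int prog_len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_cases) return;

    float stack[MAX_STACK];
    int sp = 0;
    for (int pc = 0; pc < prog_len; ++pc) {
        Instr ins = d_prog[pc];
        switch (ins.op) {
            case OP_X:     stack[sp++] = x[i];            break;
            case OP_CONST: stack[sp++] = ins.value;       break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_SUB:   sp--; stack[sp - 1] -= stack[sp]; break;
            case OP_MUL:   sp--; stack[sp - 1] *= stack[sp]; break;
        }
    }
    out[i] = stack[0];   // program result for fitness case i
}

int main()
{
    const int n = 1 << 20;   // one million fitness cases
    Instr prog[] = { {OP_X, 0}, {OP_X, 0}, {OP_MUL, 0}, {OP_CONST, 1.0f}, {OP_ADD, 0} }; // x*x + 1
    int len = sizeof(prog) / sizeof(prog[0]);

    float *d_x, *d_out;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));                 // dummy input data
    cudaMemcpyToSymbol(d_prog, prog, len * sizeof(Instr)); // upload the program once

    eval_program<<<(n + 255) / 256, 256>>>(d_x, d_out, n, len);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_x); cudaFree(d_out);
    return 0;
}
```

Evaluating many different programs at once (for example one program per block) follows the same pattern, with programs fetched from global or shared memory instead of constant memory.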

2.
Genetic programming on graphics processing units
The availability of low-cost, powerful parallel graphics cards has stimulated the porting of genetic programming (GP) to graphics processing units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs programmed in the CUDA language. In earlier work we showed that this setup allows fine-grained parallelization schemes that evaluate several GP programs in parallel, obtaining speedups for typical training-set and program sizes. Here we present another parallelization scheme together with optimizations of the program representation and the use of fast GPU memory. These roughly triple the computation speed, up to 4 billion GP operations per second. The code has been developed within the well-known ECJ library and is open source.

3.
4.
A fast CUDA-based training method for GMM models and its application
Because they can closely approximate almost any distribution, Gaussian mixture models (GMMs) are widely used in pattern recognition. GMM parameters are usually trained with the iterative expectation-maximization (EM) algorithm, which takes a very long time when the training set and the number of mixture components are large. NVIDIA's Compute Unified Device Architecture (CUDA) enables massively parallel computation by running many concurrent threads on a graphics processing unit (GPU). This paper proposes a fast CUDA-based GMM training method for very large data sets, including fast implementations of the K-means algorithm used for model initialization and of the EM algorithm used for parameter estimation. The method is also applied to training language-identification GMMs. Experimental results show that, compared with a single core of an Intel Dual-Core Pentium IV 3.0 GHz CPU, language GMM training on an NVIDIA GTS250 GPU is about 26 times faster.
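As a rough illustration of where the GPU parallelism comes from in EM-based GMM training, the kernel below computes the E-step responsibilities with one thread per sample. It is a hedged sketch assuming diagonal covariances and a small number of components, not the paper's implementation; the M-step accumulations (sums of responsibilities and weighted moments) would be done with parallel reductions and are omitted.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Minimal E-step sketch for a diagonal-covariance GMM: one thread per sample
// computes the responsibilities gamma(i,k) for all K components.
// Layout and size limits are illustrative assumptions.
__global__ void e_step(const float* x,      // n x D samples, row-major
                       const float* mean,   // K x D component means
                       const float* var,    // K x D diagonal variances
                       const float* logw,   // K log mixture weights
                       float* gamma,        // n x K responsibilities (output)
                       int n, int D, int K)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float logp[32];                    // assumes K <= 32
    float maxlog = -1e30f;
    for (int k = 0; k < K; ++k) {
        // log w_k + log N(x_i | mean_k, diag(var_k))
        float s = logw[k];
        for (int d = 0; d < D; ++d) {
            float diff = x[i * D + d] - mean[k * D + d];
            s += -0.5f * (logf(2.0f * 3.14159265f * var[k * D + d])
                          + diff * diff / var[k * D + d]);
        }
        logp[k] = s;
        maxlog = fmaxf(maxlog, s);
    }
    // log-sum-exp normalization for numerical stability
    float denom = 0.0f;
    for (int k = 0; k < K; ++k) denom += expf(logp[k] - maxlog);
    for (int k = 0; k < K; ++k) gamma[i * K + k] = expf(logp[k] - maxlog) / denom;
}
```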

5.
Molecular dynamics (MD) is an important research tool extensively applied in materials science. Running MD on a graphics processing unit (GPU) is an attractive new approach for accelerating MD simulations. Currently, GPU implementations of MD usually run in a one-host-process-one-GPU (OHPOG) scheme. This scheme may pose a limitation on the system size that an implementation can handle due to the small device memory relative to the host memory. In this paper, we present a one-host-process-multiple-GPU (OHPMG) implementation of MD with embedded-atom-model or semi-empirical tight-binding many-body potentials. Because more device memory is available in an OHPMG process, the system size that can be handled is increased to a few million or more atoms. In comparison with the serial CPU implementation, in which Newton’s third law is applied to improve the computational efficiency, our OHPMG implementation has achieved a 28.9x–86.0x speedup in double precision, depending on the system size, the cut-off ranges and the number of GPUs. The implementation can also handle a group of small simulation boxes in one run by combining the small boxes into a large box. This approach greatly improves the GPU computing efficiency when a large number of MD simulations for small boxes are needed for statistical purposes.
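A minimal sketch of the one-host-process-multiple-GPU idea (not the paper's MD code): the host loops over the available devices with cudaSetDevice, gives each device a contiguous slice of the atoms, and launches one force kernel per device. Halo exchange between slices and the many-body potential itself are omitted; the kernel body is a placeholder.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>
#include <cstdio>

// Placeholder force kernel: each thread handles one atom of this device's slice.
// A real EAM / tight-binding kernel would loop over neighbour lists here.
__global__ void compute_forces(const float4* pos, float4* force, int n_local)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_local) force[i] = make_float4(0.f, 0.f, 0.f, 0.f);
}

int main()
{
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    if (n_gpus == 0) { printf("no CUDA device\n"); return 0; }

    const int n_atoms = 4 * 1000 * 1000;             // millions of atoms spread over devices
    int chunk = (n_atoms + n_gpus - 1) / n_gpus;

    std::vector<float4*> d_pos(n_gpus), d_force(n_gpus);

    // One host process, multiple GPUs: each device owns a contiguous slice of atoms.
    for (int g = 0; g < n_gpus; ++g) {
        int n_local = std::min(chunk, n_atoms - g * chunk);
        cudaSetDevice(g);
        cudaMalloc(&d_pos[g],   n_local * sizeof(float4));
        cudaMalloc(&d_force[g], n_local * sizeof(float4));
        compute_forces<<<(n_local + 255) / 256, 256>>>(d_pos[g], d_force[g], n_local);
    }
    // Wait for all devices; boundary/halo exchange between slices is omitted.
    for (int g = 0; g < n_gpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
    printf("ran on %d GPU(s)\n", n_gpus);
    return 0;
}
```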

6.
This paper proposes a method called layered genetic programming (LAGEP) to construct a classifier based on multi-population genetic programming (MGP). LAGEP employs a layered architecture to arrange multiple populations: a layer is composed of a number of populations, and the results of the populations are discriminant functions. These functions transform the training set to construct a new training set, and the successive layer uses the new training set to obtain better discriminant functions. Moreover, because the functions generated by each layer are composed into one long discriminant function, which is the final result of LAGEP, every layer can evolve with short individuals. For each population, we propose an adaptive mutation rate tuning method that increases the mutation rate based on fitness values and the remaining generations. Several experiments are conducted with different settings of LAGEP and several real-world medical problems. Experimental results show that LAGEP achieves accuracy comparable to single-population GP in much less time.
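The abstract does not give the exact adaptive mutation schedule, so the following host-side fragment is only a hypothetical stand-in for the general idea: raise the mutation rate when fitness has stagnated and the run is nearing its end.

```cuda
// Hypothetical host-side sketch of an adaptive mutation-rate schedule in the spirit
// described above. The exact formula used by LAGEP is not given in the abstract;
// this is an illustrative stand-in, not the authors' method.
float adaptive_mutation_rate(float base_rate,      // e.g. 0.05
                             float best_fitness,   // current best (higher = better)
                             float prev_best,      // best fitness a few generations ago
                             int   gen,            // current generation index
                             int   max_gen)        // total generations
{
    float improvement = best_fitness - prev_best;            // stagnation signal
    float remaining   = 1.0f - (float)gen / (float)max_gen;  // fraction of run left
    float boost = (improvement <= 0.0f) ? (1.0f + (1.0f - remaining)) : 1.0f;
    float rate  = base_rate * boost;
    return rate > 0.5f ? 0.5f : rate;                        // cap the rate
}
```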

7.
Algorithms for constructing tree‐based classifiers are aimed at building an optimal set of rules implicitly described by some dataset of training samples. As the number of samples and/or attributes in the dataset increases, the required construction time becomes the limiting factor for interactive or even functional use. The problem is emphasized if tree derivation is part of an iterative optimization method, such as boosting. Attempts to parallelize the construction of classification trees have therefore been presented in the past. This paper describes a parallel method for binary classification tree construction implemented on a graphics processing unit (GPU) using compute unified device architecture (CUDA). By employing node‐level, attribute‐level, and split‐level parallelism, the task parallel and data parallel sections of tree induction are mapped to the architecture of a modern GPU. The GPU‐based solution is compared with the sequential and multi‐threaded CPU versions on public access datasets, and it is shown that an order of magnitude acceleration can be achieved on this data‐intensive task using inexpensive commodity hardware. The influence of dataset characteristics on the efficiency of parallel implementation is also analyzed. Copyright © 2015 John Wiley & Sons, Ltd.
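Split-level parallelism, one of the three levels mentioned above, can be sketched as follows (an illustration of the technique, not the paper's kernels): each CUDA thread scores one candidate threshold of one attribute by scanning the node's samples and computing the weighted Gini impurity, and the host then picks the threshold with the lowest score.

```cuda
#include <cuda_runtime.h>

// Split-level parallelism sketch for binary classification trees. Attribute values
// and candidate thresholds are assumed to be laid out densely; layout is illustrative.
__global__ void score_splits(const float* attr_values,  // values of one attribute at this node
                             const int*   labels,       // 0/1 class labels
                             const float* thresholds,   // candidate cut points
                             float*       gini_out,     // one impurity score per threshold
                             int n_samples, int n_thresholds)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_thresholds) return;

    float cut = thresholds[t];
    int nL = 0, nL1 = 0, nR = 0, nR1 = 0;        // counts and positive counts per side
    for (int i = 0; i < n_samples; ++i) {
        if (attr_values[i] <= cut) { nL++; nL1 += labels[i]; }
        else                       { nR++; nR1 += labels[i]; }
    }
    float giniL = 0.f, giniR = 0.f;
    if (nL > 0) { float p = (float)nL1 / nL; giniL = 2.f * p * (1.f - p); }
    if (nR > 0) { float p = (float)nR1 / nR; giniR = 2.f * p * (1.f - p); }
    gini_out[t] = (nL * giniL + nR * giniR) / (float)n_samples;  // lower is better
}
```

Attribute-level parallelism follows by launching one such scoring pass per attribute (or one block per attribute), and node-level parallelism by processing all frontier nodes of the tree in the same launch.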

8.
Classification using Ant Programming is a challenging data mining task which demands a great deal of computational resources when handling data sets of high dimensionality. This paper presents a new parallelization approach for an existing multi-objective Ant Programming model for classification, using GPUs and the NVIDIA CUDA programming model. The computational costs of the different steps of the algorithm are evaluated, and how best to parallelize them is discussed. The features of both the parallel CPU and GPU versions of the algorithm are presented. An experimental study evaluates the performance and efficiency of the rule interpreter, reporting execution times and speedups for varying population sizes, rule complexities, and data set dimensionalities. Experiments measure the original single-threaded times and the new multi-threaded CPU and GPU times with different numbers of GPU devices. The results are reported in terms of the number of GP operations per second of the interpreter (up to 10 billion GPops/s) and the speedup achieved (up to 834× vs the CPU, 212× vs a 4-threaded CPU). The proposed GPU model is shown to scale efficiently to larger datasets and to multiple GPU devices, which extends its applicability to significantly more complicated data sets that were previously unmanageable by the original algorithm in reasonable time.

9.
袁斌 (Yuan Bin), 《图学学报》 (Journal of Graphics), 2010, 31(3): 76
The rapid development of graphics hardware can be exploited to accelerate visualization. For non-uniform rectilinear grids, this paper presents a CPU ray-casting algorithm based on a uniform auxiliary grid, a GPU ray-casting algorithm based on auxiliary textures, and a slice-based 3D texture volume rendering algorithm, and tests them on an Nvidia GeForce 6800GT graphics card. The results show that the GPU algorithms are far faster than the CPU algorithm, and that slice-based 3D texture volume rendering is faster than GPU ray casting.

10.
Computing systems should be designed to exploit parallelism in order to improve performance. In general, a GPU (Graphics Processing Unit) can provide more parallelism than a CPU (Central Processing Unit), which has led to the wide use of heterogeneous computing systems that employ the CPU and the GPU together. In such systems, the efficiency of the scheduling scheme, which decides whether an application runs on the CPU or the GPU, is one of the most critical factors in determining performance. This paper proposes a dynamic scheduling scheme that selects the device on which to execute an application based on estimated-execution-time information. The proposed scheme chooses between the CPU and the GPU so as to minimize the completion time, yielding better system performance, although it requires a training period to collect the execution history. According to our simulations, the proposed estimated-execution-time scheduling improves the utilization of the CPU and the GPU compared to existing scheduling schemes, reducing execution time and enhancing the energy efficiency of heterogeneous computing systems.
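The abstract does not spell out the estimation model, so the host-side sketch below is only one plausible reading of estimated-execution-time scheduling: keep a per-application execution-time history for each device and dispatch each new job to the device whose queue would finish it earliest.

```cuda
#include <map>
#include <string>

// Hedged host-side sketch of estimated-execution-time scheduling. The history model,
// default estimates and blending factor are assumptions, not the paper's exact scheme.
enum Device { CPU = 0, GPU = 1 };

struct Scheduler {
    std::map<std::string, double> avg_time[2];   // per-device history of run times, seconds
    double busy_until[2] = {0.0, 0.0};           // when each device's queue drains

    Device pick(const std::string& app, double now) {
        double est[2];
        for (int d = 0; d < 2; ++d) {
            auto it = avg_time[d].find(app);
            // Unknown apps get an optimistic default so both devices are sampled
            // during the training period mentioned in the abstract.
            double run   = (it == avg_time[d].end()) ? 0.0 : it->second;
            double start = busy_until[d] > now ? busy_until[d] : now;
            est[d] = start + run;                // estimated completion time
        }
        Device d = (est[GPU] <= est[CPU]) ? GPU : CPU;
        busy_until[d] = est[d];
        return d;
    }

    // Called after a job finishes to refine the history (simple exponential blend).
    void record(const std::string& app, Device d, double measured) {
        double& a = avg_time[d][app];
        a = (a == 0.0) ? measured : 0.8 * a + 0.2 * measured;
    }
};
```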

11.
This article presents the parallel implementation on a GPU of a real-time dynamic tone-mapping operator. The operator we describe is generic and may be used by any application; however, the goal of our work is to integrate it into the graphics rendering process of a car driving simulator, so we studied its real-time implementation. The tone-mapping operator outputs a low dynamic range (LDR) image that preserves as much as possible of the contrast and luminance of the original high dynamic range (HDR) input image. It maps the luminances of the original scene to the output device's display values. We address the problem of mapping HDR images to standard displays, in which case the tone mapping compresses the luminance ratios. Several tone-mapping operators can be found in the literature, as well as some parallelizations; however, they use either static operators or adaptations of static operators. We have adapted the dynamic operator of Irawan and parallelized it on the GPU. Algorithmic optimizations have been performed, and we have explored different strategies for partitioning the computation between the CPU and the GPU. We chose to implement on the GPU the color-space conversions and the histogram interpolation, which are the most time-consuming steps on the CPU (1–2 s per 1,002 × 666 image). These optimizations increase the processing rate and the number of HDR images tone-mapped to LDR and displayed per second. The operator has been implemented on a Pentium 4 processor at 3.6 GHz and an Nvidia 8800 GTX GPU (728 MB, 518 GFLOPS). The execution speed is 15 times faster than the naive implementation of the algorithm. The display rate reaches 30 images per second, which fulfills our goal of a real-time video rate of 25 images per second.
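Of the two stages moved to the GPU, the color-space conversion is the simpler one. The fragment below is an illustrative kernel that converts linear RGB to log luminance and accumulates a luminance histogram with atomics; the Rec. 709 weights, bin count and layout are assumptions, and the Irawan operator's actual histogram interpolation is not reproduced here.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define NBINS 256

// Sketch of a GPU-side color-space step: per-pixel RGB -> log luminance plus a
// luminance histogram. Illustrative fragment only, not the authors' operator.
__global__ void luminance_histogram(const float4* hdr,   // linear RGB, w unused
                                    float* log_lum,      // per-pixel log luminance
                                    unsigned int* hist,  // NBINS bins, pre-zeroed
                                    float log_min, float log_max,
                                    int n_pixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pixels) return;

    float4 c = hdr[i];
    // Rec. 709 weights for the RGB -> luminance conversion (assumed color space).
    float Y = 0.2126f * c.x + 0.7152f * c.y + 0.0722f * c.z;
    float l = logf(fmaxf(Y, 1e-6f));
    log_lum[i] = l;

    int bin = (int)((l - log_min) / (log_max - log_min) * NBINS);
    bin = min(max(bin, 0), NBINS - 1);
    atomicAdd(&hist[bin], 1u);
}
```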

12.
Machine learning algorithms such as genetic programming (GP) can evolve biased classifiers when data sets are unbalanced. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria in the fitness function. This paper aims to both highlight the limitations of the current GP approaches in this area and develop several new fitness functions for binary classification with unbalanced data. Using a range of real-world classification problems with class imbalance, we empirically show that these new fitness functions evolve classifiers with good performance on both the minority and majority classes. Our approaches use the original unbalanced training data in the GP learning process, without the need to artificially balance the training examples from the two classes (e.g., via sampling).
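The paper's concrete fitness functions are not reproduced here; the host-side sketch below only shows the general flavor of such a function, weighting the minority and majority classes equally by averaging their per-class accuracies instead of using overall accuracy.

```cuda
// Hedged sketch of a class-balanced fitness function in the spirit described above
// (not the paper's exact formulation): average the per-class accuracies so the
// minority class carries the same weight as the majority class.
float balanced_fitness(const int* predicted, const int* actual, int n)
{
    int tp = 0, tn = 0, pos = 0, neg = 0;
    for (int i = 0; i < n; ++i) {
        if (actual[i] == 1) { pos++; if (predicted[i] == 1) tp++; }
        else                { neg++; if (predicted[i] == 0) tn++; }
    }
    float acc_pos = pos ? (float)tp / pos : 0.0f;   // accuracy on class 1 (e.g. minority)
    float acc_neg = neg ? (float)tn / neg : 0.0f;   // accuracy on class 0 (e.g. majority)
    return 0.5f * (acc_pos + acc_neg);              // equal weight per class
}
```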

13.
With advances in data acquisition devices, high-resolution digital elevation model (DEM) images have become increasingly common in digital terrain analysis. Existing curvilinear-structure extraction algorithms have high computational complexity and are therefore inefficient when extracting terrain feature lines from high-resolution DEM images. This paper proposes a strategy for accelerating the Steger curvilinear-structure extraction algorithm on a graphics processing unit (GPU), using the high parallelism of the Compute Unified Device Architecture (CUDA) to accelerate the compute-intensive Hessian matrix generation and image feature-point extraction modules. For megapixel-scale DEM images the algorithm achieves a speedup of more than 5x.
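A simplified sketch of the Hessian-generation stage (one thread per pixel) is shown below. Real Steger line extraction uses Gaussian-derivative convolutions rather than the plain central-difference stencil used here, so this only illustrates the per-pixel parallelism that CUDA exposes.

```cuda
#include <cuda_runtime.h>

// One thread per interior pixel computes the second derivatives (rxx, rxy, ryy)
// of the DEM by central finite differences. Boundary pixels are skipped.
__global__ void hessian_kernel(const float* dem, float* rxx, float* rxy, float* ryy,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

    int i = y * width + x;
    float c = dem[i];
    rxx[i] = dem[i - 1] - 2.0f * c + dem[i + 1];
    ryy[i] = dem[i - width] - 2.0f * c + dem[i + width];
    rxy[i] = 0.25f * (dem[i + width + 1] - dem[i + width - 1]
                      - dem[i - width + 1] + dem[i - width - 1]);
}
```

A typical launch would use a 2D grid, e.g. dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16); feature-point extraction would then analyse the eigenvalues of each 2x2 Hessian in a second kernel.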

14.
吴健, 兰时勇, 黄飞虎 (Wu Jian, Lan Shiyong, Huang Feihu), 《计算机工程》 (Computer Engineering), 2014, (2): 208-211, 218
To address the low speed and poor real-time performance of current high-resolution multi-channel video stitching systems, this paper proposes a multi-channel HD video stitching algorithm based on a parallel CPU/GPU architecture. The algorithm improves on the traditional orientation-based fast feature-point detection and rotation-invariant feature description algorithm by removing the image pyramid module used for scale invariance, uses a local registration method based on the overlap region, and blends the registered image data in parallel on the GPU. Following the principle of asynchronous CPU/GPU execution, registration of the current frame on the CPU and blending of the previous frame on the GPU proceed in parallel. A shared buffer between image-data computation and image rendering on the graphics card enables fast rendering of the frames. Experimental results show that, with four 2-megapixel network cameras, the panoramic stitching system built on this algorithm reaches a video frame rate of 17 f/s, meeting the real-time requirements of large scenes.

15.
General purpose computation on graphics processing units (GPUs) is rapidly entering various scientific and engineering fields, and many applications are being ported to GPUs for better performance. Various optimizations, frameworks, and tools are being developed for effective GPU programming. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called kernel coalesce that further enhances GPU performance and also optimizes CPU-to-GPU communication time. With the kernel coalesce methods proposed in this paper, kernel launch overheads are reduced by coalescing concurrent kernels, and data transfers are reduced in case of intermediate data generated and used among kernels. Computation on the device (GPU) is optimized by tuning the number of blocks and threads launched to the architecture. The block-level kernel coalesce method yields a prominent performance improvement on devices without support for concurrent kernels. The thread-level kernel coalesce method is better than the block-level method when the grid structure (i.e., the number of blocks and threads) is not well matched to the device architecture and leads to underutilization of device resources. Both methods perform similarly when the number of threads per block is approximately the same in different kernels and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. The multi-clock-cycle kernel coalesce method can be chosen if the programmer wants to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device. If the kernels have lightweight thread computations, the multi-clock-cycle kernel coalesce method gives better performance than the thread- and block-level methods. If the kernels to be coalesced are a combination of compute-intensive and memory-intensive kernels, warp interleaving gives higher device occupancy and improves performance. The multi-clock-cycle kernel coalesce method for micro-benchmark1 considered in this paper resulted in 10–40% and 80–92% improvement compared with separate kernel launches, without and with shared input and intermediate data among the kernels, respectively, on a Fermi architecture device (GTX 470). A nearest neighbor (NN) kernel from the Rodinia benchmark is coalesced with itself using the thread-level kernel coalesce method and warp interleaving, giving 131.9% and 152.3% improvement compared with separate kernel launches and 39.5% and 36.8% improvement compared with the block-level kernel coalesce method, respectively. Copyright © 2013 John Wiley & Sons, Ltd.
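The general idea of block-level kernel coalescing can be sketched as follows (an illustration of the technique, not the paper's implementation): two independent kernels are fused into one launch, and the block index decides which body a block executes, so a single launch overhead replaces two.

```cuda
#include <cuda_runtime.h>

// Illustrative block-level kernel coalescing: kernels A and B are fused into one
// launch, with the first half of the grid doing A's work and the second half B's.
__device__ void body_a(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;                     // original kernel A
}
__device__ void body_b(float* y, int n) {
    // Re-base the block index into B's own index space.
    int i = (blockIdx.x - gridDim.x / 2) * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;                     // original kernel B
}
__global__ void coalesced(float* x, int nx, float* y, int ny) {
    if (blockIdx.x < gridDim.x / 2) body_a(x, nx);
    else                            body_b(y, ny);
}
```

A launch such as coalesced<<<2 * blocks, threads>>>(x, nx, y, ny) assumes both halves need the same number of blocks; thread-level and multi-clock-cycle coalescing partition the work by thread index or by iteration instead of by block.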

16.
17.
This paper studies total variation (TV) based image denoising. To overcome the slow computation on a central processing unit (CPU), a parallel method on a graphics processing unit (GPU) is proposed. The dual model of the TV minimization problem is considered, the relationship between the primal and dual variables is established, and a gradient projection algorithm is used to solve for the dual variable. Numerical experiments are run on both the GPU and the CPU. The results show that the dual algorithm for the TV denoising model runs more efficiently on the GPU than on the CPU, and that the advantage of GPU parallel computation becomes more pronounced as the image size grows.
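For concreteness, one Chambolle-style gradient-projection iteration on the dual variable p = (p1, p2) maps naturally to two per-pixel kernels, as sketched below; the step size tau, the ROF data term f/lambda, and the boundary handling are illustrative choices rather than the paper's code. The denoised image is recovered afterwards as u = f - lambda * div(p).

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Pass 1: w = div(p) - f / lambda, using backward differences for the divergence.
__global__ void divergence_kernel(const float* p1, const float* p2, const float* f,
                                  float* w, float lambda, int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int i = y * W + x;
    float div = p1[i] - (x > 0 ? p1[i - 1] : 0.f)
              + p2[i] - (y > 0 ? p2[i - W] : 0.f);
    w[i] = div - f[i] / lambda;
}

// Pass 2: gradient-projection update p <- (p + tau * grad w) / (1 + tau * |grad w|),
// which keeps |p| <= 1 at every pixel.
__global__ void dual_update_kernel(float* p1, float* p2, const float* w,
                                   float tau, int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int i = y * W + x;
    float gx = (x < W - 1 ? w[i + 1] - w[i] : 0.f);   // forward differences
    float gy = (y < H - 1 ? w[i + W] - w[i] : 0.f);
    float denom = 1.0f + tau * sqrtf(gx * gx + gy * gy);
    p1[i] = (p1[i] + tau * gx) / denom;
    p2[i] = (p2[i] + tau * gy) / denom;
}
```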

18.
Adaptive GPU terrain rendering based on a composite LOD factor
Based on a quadtree organization of terrain tiles, this paper proposes an adaptive terrain rendering algorithm for the graphics processing unit (GPU). A composite level-of-detail (LOD) factor serves as the evaluation function for terrain tile nodes, quantifying static tile error, dynamic view-dependent error, and viewpoint movement speed. Smooth transitions of elevation values are implemented in the vertex shader to eliminate popping, and cracks are covered by adding "skirts". Experimental results show that the algorithm adapts well to the terrain and achieves high frame rates and GPU utilization.

19.
Speeding up the evaluation phase of GP classification algorithms on GPUs
The efficiency of evolutionary algorithms has become a well-studied problem, since it is one of their major weaknesses. Specifically, when these algorithms are employed for classification, the computational time they require grows excessively as the problem complexity increases. This paper proposes an efficient, scalable and massively parallel evaluation model using the NVIDIA CUDA GPU programming model to speed up the fitness calculation phase and greatly reduce the computational time. Experimental results show that our model significantly reduces the computational time compared to the sequential approach, reaching a speedup of up to 820×. Moreover, the model is able to scale to multiple GPU devices and can be easily extended to any evolutionary algorithm.

20.
Design and implementation of a parallel cuckoo search algorithm based on CUDA
Cuckoo search (CS) is an intelligent metaheuristic developed in recent years that has been successfully applied to many optimization problems. To address the long computation time of CS on big-data and large-scale complex problems, this paper proposes a parallel cuckoo search algorithm based on the Compute Unified Device Architecture (CUDA). The parallel implementation combines task parallelism and data parallelism: thread blocks of the graphics processing unit (GPU) are mapped to cuckoo individuals and threads to the individual dimensions of each individual, parallelizing the nest position update, fitness evaluation, nest abandonment and rebuilding, and best-individual selection of the CS algorithm. The entire iterative search runs on the GPU, reducing the CPU-GPU communication overhead. Simulation experiments on four classical benchmark functions show that, with the same convergence behavior as standard CS, the CUDA-based parallel CS achieves a speedup of up to 110x.
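The block-per-nest, thread-per-dimension mapping described above can be sketched as follows. A Gaussian random walk toward the best nest stands in for the full Lévy flight, the sphere function stands in for the benchmark objectives, and greedy replacement of worse nests is omitted, so this is only an illustration of the mapping, not the paper's algorithm.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define DIM 64   // problem dimensionality; one thread per dimension (assumption)

// One block per nest, one thread per coordinate: update the position and reduce
// the objective value within the block.
__global__ void cuckoo_step(float* nests,        // n_nests x DIM positions
                            float* fitness,      // one objective value per nest
                            const float* best,   // current global best position
                            float alpha, unsigned long long seed)
{
    int nest = blockIdx.x;
    int d = threadIdx.x;
    __shared__ float partial[DIM];

    curandState st;
    curand_init(seed, nest * blockDim.x + d, 0, &st);

    // Random-walk step of this coordinate toward the best nest (Lévy flight stand-in).
    float x = nests[nest * DIM + d];
    float newx = x + alpha * curand_normal(&st) * (x - best[d]);
    nests[nest * DIM + d] = newx;

    // Per-block reduction of the sphere objective f(x) = sum_d x_d^2.
    partial[d] = newx * newx;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (d < s) partial[d] += partial[d + s];
        __syncthreads();
    }
    if (d == 0) fitness[nest] = partial[0];
}
```

One iteration would be launched as cuckoo_step<<<n_nests, DIM>>>(...), with the best nest selected on the device in a follow-up reduction kernel so the whole search loop stays on the GPU, as the abstract describes.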

