Similar Documents
1.
This paper investigates the speed improvements available when using a graphics processing unit (GPU) for the evaluation of individuals in a genetic programming (GP) environment. An existing GP system is modified to enable parallel evaluation of individuals on a GPU device. Several issues related to implementing GP on a GPU are discussed, including how to perform tree-based GP on a device without recursion support, as well as the effect that proper memory layout has on speed when using CUDA-enabled nVidia GPU devices. The specific GP implementation is designed to evolve stock trading strategies using technical analysis indicators. The second goal of this research is to investigate the possible improvement in performance when training individuals on a larger number of stocks and training days. This increased training size (nearly 100,000 training points) is made practical by the speedups realized through GPU evaluation. Several scenarios were used to test various speed optimizations of GP evaluation on the GPU device, with a peak speedup factor of over 600 compared to sequential evaluation on a 2.4 GHz CPU. It is also found that increasing the number of stocks and the length of the training period can result in higher out-of-training testing profitability.
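A common way to evaluate tree-based GP individuals on a device without recursion support, sketched below, is to flatten each tree into postfix (reverse Polish) form and interpret it with an explicit stack, one thread per training point. This is a minimal illustrative CUDA sketch under assumed opcodes and data layout, not the paper's implementation.

// gp_postfix_eval.cu -- illustrative sketch, not the paper's implementation.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical opcodes; a real system would define its own instruction set.
enum Op { PUSH_X = 0, PUSH_CONST = 1, ADD = 2, SUB = 3, MUL = 4 };

#define MAX_STACK 32

// One thread evaluates one flattened (postfix) program on one training point.
__global__ void evalPostfix(const int *ops, const float *consts, int progLen,
                            const float *x, float *out, int nPoints)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPoints) return;

    float stack[MAX_STACK];   // explicit stack replaces recursion
    int sp = 0;

    for (int p = 0; p < progLen; ++p) {
        switch (ops[p]) {
        case PUSH_X:     stack[sp++] = x[i];        break;
        case PUSH_CONST: stack[sp++] = consts[p];   break;
        case ADD: sp--; stack[sp - 1] += stack[sp]; break;
        case SUB: sp--; stack[sp - 1] -= stack[sp]; break;
        case MUL: sp--; stack[sp - 1] *= stack[sp]; break;
        }
    }
    out[i] = stack[0];        // program value for this training point
}

int main()
{
    // Tiny example: the program computes x * x + 2 in postfix form: x x * 2 +
    int h_ops[]      = { PUSH_X, PUSH_X, MUL, PUSH_CONST, ADD };
    float h_consts[] = { 0, 0, 0, 2.0f, 0 };
    const int progLen = 5, nPoints = 4;
    float h_x[nPoints] = { 1, 2, 3, 4 }, h_out[nPoints];

    int *d_ops; float *d_consts, *d_x, *d_out;
    cudaMalloc(&d_ops, sizeof(h_ops));
    cudaMalloc(&d_consts, sizeof(h_consts));
    cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_ops, h_ops, sizeof(h_ops), cudaMemcpyHostToDevice);
    cudaMemcpy(d_consts, h_consts, sizeof(h_consts), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    evalPostfix<<<1, 64>>>(d_ops, d_consts, progLen, d_x, d_out, nPoints);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < nPoints; ++i) printf("f(%g) = %g\n", h_x[i], h_out[i]);
    return 0;
}

Evaluating one fitness case per thread keeps all threads of a warp executing the same instruction of the same program, which is what makes this mapping efficient on SIMD hardware.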

2.
Genetic programming on graphics processing units
The availability of low-cost, powerful parallel graphics cards has stimulated the porting of Genetic Programming (GP) to Graphics Processing Units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs when programmed in the CUDA language. In earlier work we showed that this setup allows fine-grained parallelization schemes that evaluate several GP programs in parallel, while obtaining speedups for typical training-set and program sizes. Here we present another parallelization scheme together with optimizations of the program representation and the use of fast GPU memory. This roughly triples computation speed, reaching up to 4 billion GP operations per second. The code has been developed within the well-known ECJ library and is open source.

3.
Genetic Programming (GP) (Koza, Genetic Programming, MIT Press, Cambridge, 1992) is well known as a computationally intensive technique. Subsequently, faster parallel versions have been implemented that harness the highly parallel hardware provided by graphics cards, enabling significant gains in GP performance. However, extracting the maximum performance from a graphics card for the purposes of GP is difficult. A key reason is that, in addition to the processor resources, the fast on-chip memory of graphics cards needs to be fully exploited. Techniques are presented that improve the performance of a graphics card implementation of tree-based GP by better exploiting this faster memory. It is demonstrated that both the L1 cache and shared memory need to be considered to extract maximum performance. Better GP program representation and use of the register file are also explored to further boost performance. Using an NVidia Kepler 670GTX GPU, a maximum performance of 36 billion Genetic Programming Operations per Second is demonstrated.
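Building on the previous sketch, a typical way to exploit the fast on-chip memory that this abstract emphasizes is to have each block stage the flattened program in shared memory once, so all threads then fetch instructions from on-chip storage rather than global memory. The kernel fragment below is a sketch with assumed names and limits, not the paper's code.

// Sketch: cache the flattened GP program in shared memory before interpreting it.
#define MAX_PROG 256   // assumed upper bound on program length

__global__ void evalCached(const int *ops, int progLen,
                           const float *x, float *out, int nPoints)
{
    __shared__ int sOps[MAX_PROG];

    // Cooperative copy: the threads of the block stride over the program.
    for (int p = threadIdx.x; p < progLen; p += blockDim.x)
        sOps[p] = ops[p];
    __syncthreads();                    // program now resides in on-chip memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPoints) return;

    // ... interpret sOps[0..progLen-1] exactly as in the previous sketch,
    //     but every instruction fetch now hits shared memory.
    out[i] = x[i];                      // placeholder for the interpreter loop
}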

4.
Classification using Ant Programming is a challenging data mining task which demands a great deal of computational resources when handling data sets of high dimensionality. This paper presents a new parallelization approach for an existing multi-objective Ant Programming classification model, using GPUs and the NVIDIA CUDA programming model. The computational costs of the different steps of the algorithm are evaluated and it is discussed how best to parallelize them. The features of both the parallel CPU and the GPU versions of the algorithm are presented. An experimental study evaluates the performance and efficiency of the rule interpreter, reporting execution times and speedups with respect to population size, the complexity of the rules mined, and the dimensionality of the data sets. Experiments compare the original single-threaded implementation with the new multi-threaded CPU and GPU implementations using different numbers of GPU devices. The results are reported in terms of the interpreter's GP operations per second (up to 10 billion GPops/s) and the speedup achieved (up to 834× vs the single-threaded CPU, 212× vs the 4-threaded CPU). The proposed GPU model is shown to scale efficiently to larger datasets and to multiple GPU devices, which extends its applicability to significantly more complex data sets that were previously unmanageable by the original algorithm in reasonable time.

5.
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify matrix–matrix multiplication as a natural entry point for a minimally invasive integration of GPUs. We investigate performance on the NVIDIA GeForce 8800 multicore chip, originally architected for intensive gaming applications, and exploit its architectural features to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leveraging the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.
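Dense matrix–matrix multiplication is singled out above as the natural entry point for GPU acceleration. The classic tiled CUDA kernel below (a generic textbook sketch, not the authors' solver) shows why this kernel suits the GPU: each tile of the inputs is staged in shared memory and reused by every thread of the block, so global-memory traffic drops by a factor of the tile size.

// Tiled single-precision matrix multiply: C = A * B, all matrices n x n,
// with n assumed to be a multiple of TILE for brevity.
#define TILE 16

__global__ void matmulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Stage one tile of A and one tile of B in shared memory.
        sA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}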

6.
The explosive growth in integration technology and the parallel nature of rasterization-based graphics APIs (Application Programming Interfaces) changed the panorama of consumer-level graphics: today, GPUs (Graphics Processing Units) are cheap, fast and ubiquitous. We show how to harness the computational power of GPUs to solve the incompressible Navier-Stokes fluid equations significantly faster (more than one order of magnitude on average) than with CPU solvers of comparable cost. While past approaches typically used Stam's implicit solver, we use a variation of SMAC (Simplified Marker and Cell). SMAC is widely used in engineering applications, where experimental reproducibility is essential. We thus show that the GPU is a viable and affordable processor for scientific applications. Our solver works with general rectangular domains (possibly with obstacles), implements a variety of boundary conditions, and incorporates energy transport through the traditional Boussinesq approximation. Finally, we discuss the implications of our solver in light of future GPU features and possible extensions such as three-dimensional domains and free-boundary problems.
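The dominant cost in SMAC-type incompressible flow solvers is the iterative pressure Poisson solve. The kernel below is a generic Jacobi relaxation step over a regular grid, written in CUDA purely as an illustration of that stencil; the paper's solver was implemented through graphics APIs and shaders, and its discretization details are not reproduced here.

// One Jacobi iteration of the pressure Poisson equation on an nx x ny grid.
// div holds the velocity divergence of each cell; dx is the grid spacing.
// Interior cells only; boundary handling is omitted in this sketch.
__global__ void jacobiPressure(const float *pIn, float *pOut, const float *div,
                               int nx, int ny, float dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;

    int idx = j * nx + i;
    pOut[idx] = 0.25f * (pIn[idx - 1] + pIn[idx + 1] +
                         pIn[idx - nx] + pIn[idx + nx] -
                         dx * dx * div[idx]);
}

Reading from pIn and writing to pOut (ping-pong buffers) keeps the update order-independent, which is why this relaxation parallelizes so cleanly.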

7.
General-purpose computing on graphics processing units (GPGPU) is becoming increasingly popular due to its high computational throughput for data-parallel applications. Modern GPU architectures have limited capability for error detection and fault tolerance since they were originally designed for graphics processing. However, rigorous execution correctness is required for general-purpose applications, which makes reliability a growing concern in GPGPU architecture design. With CMOS processing technologies continuously scaling down to the nano-scale, the on-chip soft error rate (SER) has been predicted to increase exponentially. GPGPUs with hundreds of cores integrated into a single chip are prone to a high SER. This paper takes a first step toward modeling and characterizing GPGPU reliability in light of soft errors. We develop GPGPU-SODA (GPGPU SOftware Dependability Analysis), a framework to estimate the soft-error vulnerability of the GPGPU microarchitecture. Using GPGPU-SODA, we observe that several microarchitectural structures in GPGPUs exhibit high soft-error susceptibility, and that structure vulnerability is sensitive to workload characteristics (e.g. branch divergence, memory access pattern). We further investigate the impact of several architectural optimizations on GPU soft-error robustness. For example, we find that increasing the number of threads supported by the GPU significantly affects GPGPU soft-error robustness, whereas changing the warp scheduling policy has little impact on structure vulnerability. The observations made in this study provide designers with useful guidance for building resilient GPGPUs: a comprehensive resiliency solution for GPGPUs should consider the entire GPGPU design instead of focusing solely on a particular structure.

8.
Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architectures have been widely used to maximize throughput by exploiting hardware parallelism. Unfortunately, branch divergence caused by branch instructions leads to underutilization of computational resources, degrading the performance of SIMD architectures. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, in contrast to graphics applications, general-purpose applications include many branch instructions, resulting in serious GPU performance degradation due to branch divergence. In this paper, we propose a concurrent warp execution (CWE) technique that reduces this performance degradation by increasing resource utilization. The proposed CWE selects co-warps so that more threads in a warp are active, leading to concurrent execution of the combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85 % over PDOM, 91 % over DWF) with little hardware overhead.
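CWE is a microarchitectural mechanism and cannot be shown as application code, but the divergence problem it targets can. In the first kernel below, threads of one warp take data-dependent branches, so the SIMD pipeline serializes both paths; the second variant removes the branch with arithmetic selection. Both kernels are illustrative sketches, unrelated to the paper's simulator.

// Divergent version: neighbouring threads may take different paths,
// so a warp executes both branches one after the other.
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = in[i] * 2.0f;     // path A
    else
        out[i] = -in[i];           // path B
}

// Branch-free version: both results are computed and blended, so every
// thread of the warp follows the same instruction stream.
__global__ void predicated(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float mask = in[i] > 0.0f ? 1.0f : 0.0f;   // selected without a divergent branch
    out[i] = mask * (in[i] * 2.0f) + (1.0f - mask) * (-in[i]);
}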

9.
Tracking systems are important in computer vision, with applications in surveillance, human-computer interaction, and other areas. Consumer graphics processing units (GPUs) have experienced an extraordinary evolution in both computing performance and programmability, leading to greater use of the GPU for non-rendering applications. In this work we propose a real-time object tracking algorithm based on the hybridization of particle filtering (PF) and a multi-scale local search (MSLS) algorithm, presented for both CPU and GPU architectures. The developed system provides successful results in the precise tracking of single and multiple targets in monocular video, operating in real time at 70 frames per second for 640 × 480 video resolution on the GPU, up to 1,100% faster than the CPU version of the algorithm.

10.
Verification has grown to dominate the cost of electronic system design, consuming about 60% of design effort. Among the various verification techniques, logic simulation remains the major one, and speeding it up yields great savings and shorter time-to-market. We parallelize logic simulation using Graphics Processing Units (GPUs). In the past, GPUs were special-purpose application accelerators suitable only for conventional graphics applications. The new generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. We develop a parallel cycle-based logic simulation algorithm that uses And-Inverter Graphs (AIGs) as the design representation. AIGs have proven to be an effective representation for various design automation applications, and we obtain similar benefits for speeding up logic simulation. We develop two clustering algorithms that partition the gates in a design into independent blocks. Our algorithms exploit the massively parallel GPU architecture featuring thousands of concurrent threads, fast memory, and memory coalescing for optimization. We demonstrate up to 5x and 21x speedups on several benchmarks using our simulation system with the first and second clustering algorithms, respectively. Our work ultimately results in a significant reduction in the overall design cycle.
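In cycle-based simulation, an AIG can be levelized so that every gate in a level depends only on already-computed values, and each level then becomes one data-parallel kernel launch. The sketch below assumes such a levelized layout (illustrative arrays, not the authors' simulator) and packs 32 simulation patterns per word so that each AND node costs a single bitwise operation.

// Evaluate one topological level of an And-Inverter Graph.
// fanin0/fanin1 give the indices of the two inputs of each gate in the level,
// inv0/inv1 flag whether that edge is complemented, and 'values' holds one
// 32-pattern word per AIG node (inputs of this level are already computed).
__global__ void evalAigLevel(const int *gateIds,
                             const int *fanin0, const int *fanin1,
                             const unsigned char *inv0, const unsigned char *inv1,
                             unsigned int *values, int nGates)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= nGates) return;

    unsigned int a = values[fanin0[g]];
    unsigned int b = values[fanin1[g]];
    if (inv0[g]) a = ~a;                 // complemented edge
    if (inv1[g]) b = ~b;
    values[gateIds[g]] = a & b;          // an AIG node is a 2-input AND
}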

11.
Zippy: A Framework for Computation and Visualization on a GPU Cluster
Due to its high performance/cost ratio, a GPU cluster is an attractive platform for large-scale general-purpose computation and visualization applications. However, the programming model for high-performance general-purpose computation on GPU clusters remains complex. In this paper, we introduce the Zippy framework, a general and scalable solution to this problem. It abstracts GPU cluster programming with a two-level parallelism hierarchy and a non-uniform memory access (NUMA) model, preserving the advantages of both the message-passing and shared-memory models. Zippy employs global arrays (GA) to simplify the communication, synchronization, and collaboration among multiple GPUs. Moreover, it exposes data locality to the programmer for optimal performance and scalability. We present three example applications developed with Zippy: sort-last volume rendering, Marching Cubes isosurface extraction and rendering, and lattice Boltzmann flow simulation with online visualization. They demonstrate that Zippy can ease the development and integration of parallel visualization, graphics, and computation modules on a GPU cluster.

12.
We present Glimmer, a new multilevel algorithm for multidimensional scaling designed to exploit modern graphics processing unit (GPU) hardware, together with GPU-SF, a parallel, force-based subsystem used by Glimmer. Glimmer organizes the input into a hierarchy of levels and recursively applies GPU-SF to combine and refine the levels. The multilevel nature of the algorithm makes local minima less likely, while GPU parallelism improves computation speed. We propose a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function. We demonstrate the benefits of Glimmer in terms of speed, normalized stress, and visual quality against several previous algorithms for a range of synthetic and real benchmark datasets. We also show that Glimmer on the GPU is substantially faster than a CPU implementation of the same algorithm.

13.
We report fast computation of computer-generated holograms (CGHs) using Xeon Phi coprocessors, recently released by Intel, which integrate many x86-based cores on a single chip. CGHs can generate arbitrary light wavefronts and are therefore a promising technology for many applications, for example three-dimensional displays, diffractive optical elements, and the generation of arbitrary beams, but they incur enormous computational cost. In this paper, we describe implementations of several CGH generation algorithms on the Xeon Phi and compare the Xeon Phi with a CPU and a graphics processing unit (GPU) in terms of performance and ease of programming.
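Algorithms in this family ultimately sum, for every hologram pixel, the wave contribution of every object point. The CUDA kernel below sketches that point-cloud summation with an assumed wavelength, pixel pitch, and unit amplitudes; it illustrates the computational pattern only and is not the paper's Xeon Phi code.

// Accumulate the real part of the object wave at each hologram pixel.
// (ox, oy, oz) are object-point coordinates; pitch is the pixel pitch of the
// hologram plane and lambda the wavelength -- all values are illustrative.
__global__ void pointCloudCGH(const float *ox, const float *oy, const float *oz,
                              int nPoints, float *holo, int width, int height,
                              float pitch, float lambda)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    float x = (px - width / 2) * pitch;
    float y = (py - height / 2) * pitch;
    float k = 2.0f * 3.14159265f / lambda;      // wavenumber
    float sum = 0.0f;

    for (int p = 0; p < nPoints; ++p) {
        float dx = x - ox[p], dy = y - oy[p], dz = oz[p];
        float r = sqrtf(dx * dx + dy * dy + dz * dz);   // pixel-to-point distance
        sum += cosf(k * r);                              // unit-amplitude wave
    }
    holo[py * width + px] = sum;
}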

14.
15.
Association rule mining is a well-known data mining task, but it requires much computational time and memory when mining large-scale data sets of high dimensionality. This is mainly due to the evaluation process, in which the antecedent and consequent of each rule mined are evaluated for every record. This paper presents a novel methodology for evaluating association rules on graphics processing units (GPUs). The evaluation model may be applied to any association rule mining algorithm. The use of GPUs and the compute unified device architecture (CUDA) programming model enables the rules mined to be evaluated in a massively parallel way, thus reducing the computational time required. The proposal takes advantage of concurrent kernel execution and asynchronous data transfers, which improves the efficiency of the model. In an experimental study, we evaluate interpreter performance and compare the execution time of the proposed model with single-threaded, multi-threaded, and graphics processing unit implementations. The results show an interpreter performance above 67 billion operations per second and a speed-up by a factor of up to 454 over the single-threaded CPU model when using two NVIDIA 480 GTX GPUs. The evaluation model demonstrates its efficiency and scalability with respect to problem complexity, number of instances, number of rules, and number of GPU devices.
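The heart of such an evaluation model is coverage counting: test each record against the antecedent and consequent of a rule, then aggregate the counts from which support and confidence are computed. The kernel below is a minimal sketch of that pattern under an assumed rule encoding (one interval condition per attribute); it is not the paper's interpreter.

// Each thread tests one record against one rule. A rule is encoded as, for
// every attribute, a [lo, hi] interval plus a flag saying whether the
// attribute belongs to the antecedent (1), the consequent (2) or is unused (0).
__global__ void evalRule(const float *records, int nRecords, int nAttrs,
                         const float *lo, const float *hi, const int *roleOf,
                         unsigned int *counts /* [0]=antecedent, [1]=both */)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nRecords) return;

    bool antecedent = true, consequent = true;
    for (int a = 0; a < nAttrs; ++a) {
        float v = records[r * nAttrs + a];
        bool inInterval = (v >= lo[a] && v <= hi[a]);
        if (roleOf[a] == 1) antecedent = antecedent && inInterval;
        if (roleOf[a] == 2) consequent = consequent && inInterval;
    }
    if (antecedent) {
        atomicAdd(&counts[0], 1u);                 // antecedent coverage
        if (consequent) atomicAdd(&counts[1], 1u); // full rule coverage
    }
}
// Host side: support = counts[1] / nRecords, confidence = counts[1] / counts[0].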

16.
We present a method for stochastic fiber tract mapping from diffusion tensor MRI (DT-MRI) implemented on graphics hardware. From the simulated fibers we compute a connectivity map that gives an indication of the probability that two points in the dataset are connected by a neuronal fiber path. A Bayesian formulation of the fiber model is given and it is shown that the inversion method can be used to construct plausible connectivity. An implementation of this fiber model on the graphics processing unit (GPU) is presented. Since the fiber paths can be stochastically generated independently of one another, the algorithm is highly parallelizable. This allows us to exploit the data-parallel nature of the GPU fragment processors. We also present a framework for the connectivity computation on the GPU. Our implementation allows the user to interactively select regions of interest and observe the evolving connectivity results during computation. Results are presented from the stochastic generation of over 250,000 fiber steps per iteration at interactive frame rates on consumer-grade graphics hardware.
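Because every stochastic fiber path evolves independently, one GPU thread can advance one walker. The kernel below sketches that idea with cuRAND-generated Gaussian perturbations of an assumed principal-direction field; the Bayesian fiber model of the paper, and its fragment-processor implementation, are considerably richer.

#include <curand_kernel.h>

// One thread advances one stochastic fiber path for nSteps steps.
// dir holds a precomputed principal diffusion direction (3 floats per voxel
// on an nx*ny*nz grid); seedPts holds one starting position per walker.
__global__ void traceFibers(const float *dir, int nx, int ny, int nz,
                            const float *seedPts, float *paths,
                            int nWalkers, int nSteps, float stepLen,
                            unsigned long long seed)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= nWalkers) return;

    curandState rng;
    curand_init(seed, w, 0, &rng);          // independent stream per walker

    float x = seedPts[3 * w], y = seedPts[3 * w + 1], z = seedPts[3 * w + 2];
    for (int s = 0; s < nSteps; ++s) {
        int vx = min(max((int)x, 0), nx - 1);
        int vy = min(max((int)y, 0), ny - 1);
        int vz = min(max((int)z, 0), nz - 1);
        int v = 3 * ((vz * ny + vy) * nx + vx);

        // Step along the local direction, perturbed by Gaussian noise.
        x += stepLen * (dir[v]     + 0.1f * curand_normal(&rng));
        y += stepLen * (dir[v + 1] + 0.1f * curand_normal(&rng));
        z += stepLen * (dir[v + 2] + 0.1f * curand_normal(&rng));

        int o = (w * nSteps + s) * 3;       // record the path for later mapping
        paths[o] = x; paths[o + 1] = y; paths[o + 2] = z;
    }
}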

17.
We study an automated verification method for the functional correctness of parallel programs running on graphics processing units (GPUs). Our method is based on Kojima and Igarashi's Hoare logic for GPU programs. Our algorithm generates verification conditions (VCs) from a program annotated with specifications and loop invariants, and passes them to off-the-shelf SMT solvers. It is often impossible, however, to solve naively generated VCs in reasonable time. A main difficulty stems from quantifiers over threads due to the parallel nature of GPU programs. To overcome this difficulty, we additionally apply several transformations to simplify VCs before calling the SMT solvers. Our implementation successfully verifies the correctness of several GPU programs, including matrix multiplication optimized using shared memory. In contrast to many existing verification tools for GPU programs, our verifier succeeds in verifying fully parameterized programs: parameters such as the number of threads and the sizes of matrices are all symbolic. We empirically confirm that our simplification heuristics are highly effective in improving the efficiency of the verification procedure.

18.
Maximum utilization of hardware resources is crucial to leveraging the enormous computational power of graphics processing units (GPUs). However, there is no effective metric to indicate whether the launched threads are kept busy. To address this issue, we propose a metric called ETU to describe the efficiency of thread utilization. First, we execute several CUDA SDK sample codes, with and without double-precision arithmetic, on two generations of GPUs to perform a preliminary validation of the ETU metric. Taking the spherical harmonic transform as an example, we then give two GPU implementations of the Legendre transforms and examine the relationship between ETU and application performance. Experimental results show that applications with a larger ETU usually achieve better performance, making the metric more accurate than the occupancy metric proposed by NVIDIA. Finally, we select the GPU implementations with better performance to accelerate the Legendre transforms in STSWM, a spectral transform shallow water model.

19.
We study the use of massively parallel architectures for computing a matrix inverse. Two different algorithms are reviewed, the traditional approach based on Gaussian elimination and the Gauss–Jordan elimination alternative, and several high-performance implementations are presented and evaluated. The target architecture is a current general-purpose multicore processor (CPU) connected to a graphics processor (GPU). Numerical experiments show the efficiency attained by the proposed implementations and how the computation of large-scale inverses, which only a few years ago would have required a distributed-memory cluster, now takes only a few minutes on a hybrid architecture formed by a multicore CPU and a GPU.
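Gauss–Jordan elimination maps naturally to the GPU as a host loop over pivots whose bulk row updates are data parallel. The program below is a compact sketch of that scheme on an augmented matrix [A | I], without pivoting and with illustrative names; it is not the authors' hybrid CPU-GPU implementation.

// Gauss-Jordan inversion sketch: the augmented matrix [A | I] is n x 2n,
// row-major, on the device. No pivoting -- assumes A is well conditioned.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void normalizeRow(float *aug, int n, int k, float pivot)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < 2 * n) aug[k * 2 * n + j] /= pivot;
}

__global__ void saveColumn(const float *aug, float *colK, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) colK[i] = aug[i * 2 * n + k];   // elimination factors
}

__global__ void eliminateRows(float *aug, const float *colK, int n, int k)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int i = blockIdx.y;                              // row
    if (j >= 2 * n || i == k) return;
    aug[i * 2 * n + j] -= colK[i] * aug[k * 2 * n + j];
}

void gaussJordanInverse(float *d_aug, float *d_col, int n)
{
    int threads = 128, blocksCols = (2 * n + threads - 1) / threads;
    int blocksRows = (n + threads - 1) / threads;
    for (int k = 0; k < n; ++k) {
        float pivot;
        cudaMemcpy(&pivot, d_aug + k * 2 * n + k, sizeof(float),
                   cudaMemcpyDeviceToHost);
        normalizeRow<<<blocksCols, threads>>>(d_aug, n, k, pivot);
        saveColumn<<<blocksRows, threads>>>(d_aug, d_col, n, k);
        eliminateRows<<<dim3(blocksCols, n), threads>>>(d_aug, d_col, n, k);
    }
}

int main()
{
    const int n = 2;
    float h_aug[n * 2 * n] = { 4, 7, 1, 0,     // [A | I]
                               2, 6, 0, 1 };
    float *d_aug, *d_col;
    cudaMalloc(&d_aug, sizeof(h_aug));
    cudaMalloc(&d_col, n * sizeof(float));
    cudaMemcpy(d_aug, h_aug, sizeof(h_aug), cudaMemcpyHostToDevice);

    gaussJordanInverse(d_aug, d_col, n);

    cudaMemcpy(h_aug, d_aug, sizeof(h_aug), cudaMemcpyDeviceToHost);
    printf("inverse:\n%6.3f %6.3f\n%6.3f %6.3f\n",
           h_aug[2], h_aug[3], h_aug[6], h_aug[7]);
    return 0;
}

Copying the pivot column into a separate buffer before the elimination step avoids the read/write race that would otherwise occur when column k of each row is both read (as the factor) and overwritten in the same kernel.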

20.
Discrete Wavelet Transform on Consumer-Level Graphics Hardware
The discrete wavelet transform (DWT) has been heavily studied and developed in various scientific and engineering fields. Its multiresolution and locality properties benefit applications that require progressiveness and the capture of high-frequency details. However, when dealing with enormous data volumes, its performance may drop drastically. On the other hand, with recent advances in consumer-level graphics hardware, personal computers are now usually equipped with a graphics processing unit (GPU) based graphics accelerator that offers SIMD-style parallel processing power. This paper presents a SIMD algorithm that performs convolution-based DWT entirely on the GPU, bringing a significant performance gain on a normal PC without extra cost. Although the forward and inverse wavelet transforms are mathematically different, the proposed algorithm unifies them into an almost identical process that can be implemented efficiently on the GPU. Different wavelet kernels and boundary extension schemes can be incorporated easily by simply modifying input parameters. To demonstrate its applicability and performance, we apply it to wavelet-based geometric design, stylized image processing, texture-illuminance decoupling, and JPEG2000 image encoding.
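The transform's data-parallel structure is easiest to see in the simplest case. The kernel below performs a single forward Haar analysis step on a 1-D signal; the paper's method generalizes this to arbitrary convolution kernels, boundary extensions, and a unified forward/inverse formulation, none of which are reproduced here.

// One forward Haar analysis step: n input samples produce n/2 approximation
// coefficients followed by n/2 detail coefficients (n assumed even).
__global__ void haarStep(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n / 2) return;

    const float s = 0.70710678f;                // 1/sqrt(2)
    float a = in[2 * i], b = in[2 * i + 1];
    out[i]         = s * (a + b);               // low-pass (approximation)
    out[n / 2 + i] = s * (a - b);               // high-pass (detail)
}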
