Similar documents
 Found 20 similar documents; search time 31 ms
1.
We describe an efficient and scalable symmetric iterative eigensolver developed for distributed memory multi‐core platforms. We achieve over 80% parallel efficiency by major reductions in communication overheads for the sparse matrix‐vector multiplication and basis orthogonalization tasks. We show that the scalability of the solver is significantly improved compared to an earlier version, after we carefully reorganize the computational tasks and map them to processing units in a way that exploits the network topology. We discuss the advantage of using a hybrid OpenMP/MPI programming model to implement such a solver. We also present strategies for hiding communication on a multi‐core platform. We demonstrate the effectiveness of these techniques by reporting the performance improvements achieved when we apply our solver to large‐scale eigenvalue problems arising in nuclear structure calculations. Because sparse matrix‐vector multiplication and inner product computation constitute the main kernels in most iterative methods, our ideas are applicable in general to the solution of problems involving large‐scale symmetric sparse matrices with irregular sparsity patterns. Copyright © 2013 John Wiley & Sons, Ltd.

2.
The Internet, in particular the World Wide Web, continues to expand at an amazing pace. We propose a new infrastructure, SuperWeb, to harness global resources, such as CPU cycles or disk storage, and make them available to every user on the Internet. SuperWeb has the potential for solving parallel supercomputing applications involving thousands of co-operating components on the Internet. However, we anticipate that initial implementations will be used inside large organizations with large heterogeneous intranets. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing realized by languages such as Java. Our SuperWeb prototype consists of brokers, clients and hosts. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Clients submit tasks that need to be executed. The broker maps client computations onto the registered hosts. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment. © 1997 John Wiley & Sons, Ltd.

3.
Feature tracking and matching in video using programmable graphics hardware   Total citations: 2, self-citations: 0, citations by others: 2
This paper describes novel implementations of the KLT feature tracking and SIFT feature extraction algorithms that run on the graphics processing unit (GPU) and are suitable for video analysis in real-time vision systems. While significant acceleration over standard CPU implementations is obtained by exploiting parallelism provided by modern programmable graphics hardware, the CPU is freed up to run other computations in parallel. Our GPU-based KLT implementation tracks about a thousand features in real time at 30 Hz on 1,024 × 768 resolution video, which is a 20-fold improvement over the CPU. The GPU-based SIFT implementation extracts about 800 features from 640 × 480 video at 10 Hz, which is approximately 10 times faster than an optimized CPU implementation.

4.
Modern GPUs excel in parallel computations, so they are an interesting target for matrix transformations such as the DCT, a fundamental part of MPEG video coding algorithms. For a system that encodes synthetic video (e.g., computer-generated frames), this approach becomes even more appealing, since the images to encode are already in the GPU, eliminating the cost of transferring raw video from the CPU to the GPU. However, after a raw frame has been transformed and quantized by the GPU, the resulting coefficients must be reordered, entropy encoded and framed into the resulting MPEG bitstream. These last steps are essentially sequential and their straightforward GPU implementation is inefficient compared to CPU-based implementations. We present different approaches to implementing part of these steps on the GPU, aiming for a better usage of the memory bus and compensating for the suboptimal use of the GPU with the gains in transfer time. We analyze three approaches to performing the zigzag scan and Huffman coding by combining GPU and CPU, and two approaches to assembling the results to build the actual output bitstream in both GPU and CPU memory. Our experiments show that reducing the amount of data transferred from GPU to CPU by implementing the last sequential compression steps on the GPU, and using a fast parallel scan implementation of the zigzag scan, improve the overall performance of the system. Savings in transfer time outweigh the extra cost incurred on the GPU.
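The coefficient reordering step mentioned in this abstract can be illustrated with a plain sequential zigzag scan. The sketch below is only a CPU-side illustration of the standard zigzag order over a square block; the function names `zigzag_order` and `zigzag_scan` are hypothetical, and it does not reproduce the GPU or parallel-scan approaches the paper analyzes.

```python
def zigzag_order(n=8):
    """Return the (row, col) pairs of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        for r in (reversed(rows) if s % 2 == 0 else rows):
            order.append((r, s - r))
    return order


def zigzag_scan(block):
    """Flatten a square block of coefficients into zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


if __name__ == "__main__":
    block = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    print(zigzag_scan(block))
```

Because the visiting order depends only on the block size, it can be computed once and reused as a lookup table for every block.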

5.
The Coupled Perturbed Kohn-Sham equations have been implemented in the Amsterdam Density Functional program package. Our implementation differs from previous ones in many ways. This program uses density fitting to calculate the Coulomb and exchange integrals. Further, all matrix elements of the Fock type matrix and its derivatives are calculated by numerical integration. The frozen core approximation is also implemented. Our implementation is approximately 10 times faster than a finite difference algorithm, and the absolute CPU times also compare favorably with other reported implementations.

6.
We describe a hybrid Lyapunov solver based on the matrix sign function, where the intensive parts of the computation are accelerated using a graphics processor (GPU) while the remaining operations are executed on a general-purpose multi-core processor (CPU). The initial stage of the iteration operates in single-precision arithmetic, returning a low-rank factor of an approximate solution. As the main computation in this stage consists of explicit matrix inversions, we propose a hybrid implementation of Gauß-Jordan elimination using look-ahead to overlap computations on GPU and CPU. To improve the approximate solution, we introduce an iterative refinement procedure that cheaply recovers full double-precision accuracy. In contrast to earlier approaches to iterative refinement for Lyapunov equations, this approach retains the low-rank factorization structure of the approximate solution. The combination of the two stages results in a mixed-precision algorithm that exploits the capabilities of both general-purpose CPUs and many-core GPUs and overlaps critical computations. Numerical experiments using real-world data and a platform equipped with two Intel Xeon QuadCore processors and an Nvidia Tesla C1060 show a significant efficiency gain of the hybrid method compared to a classical CPU implementation.

7.
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz’s, Cimmino’s, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method.
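As a point of reference for the row-projection methods listed in this abstract, a minimal sequential sketch of the classical Kaczmarz iteration is given below. It assumes a small dense, consistent system and a fixed number of sweeps; the function name `kaczmarz` and its parameters are hypothetical, and nothing here reflects the paper's GPU implementation.

```python
def kaczmarz(A, b, sweeps=100):
    """Approximate a solution of A x = b by cyclic row projections.

    A is a list of rows, b a list of right-hand-side values. Each step
    projects the current iterate onto the hyperplane defined by one row.
    """
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        for row, b_i in zip(A, b):
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0.0:
                continue  # a zero row defines no hyperplane; skip it
            residual = b_i - sum(a * x_j for a, x_j in zip(row, x))
            step = residual / norm_sq
            x = [x_j + step * a for a, x_j in zip(row, x)]
    return x


if __name__ == "__main__":
    A = [[2.0, 1.0],
         [1.0, 3.0]]
    b = [3.0, 5.0]
    print(kaczmarz(A, b))  # approaches the exact solution x = 0.8, y = 1.4
```

Each projection touches a single row, which is why variants such as Cimmino's method and component averaging, which process many rows at once, are the ones usually considered for parallel hardware.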

8.
In this paper we focus on the Jacobi-like solution of the eigenproblem for a real symmetric matrix from a parallel performance point of view: we try to optimize the algorithm by working on the communication-intensive part of the code. We discuss several parallel implementations and propose an implementation that overlaps communication with computation to reach a better efficiency. We show that the overlapping implementation can lead to significant improvements. We conclude by presenting our future work.
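For readers unfamiliar with the underlying method, the following is a minimal sequential sketch of the classical cyclic Jacobi eigenvalue iteration for a real symmetric matrix. The function name `jacobi_eigenvalues` and its tolerance and sweep parameters are assumptions made for illustration; the parallel, communication-overlapping variants discussed in the abstract are not shown.

```python
import math


def jacobi_eigenvalues(A, sweeps=50, tol=1e-12):
    """Return the eigenvalues of a real symmetric matrix, sorted ascending.

    Each plane rotation zeroes one off-diagonal pair (p, q); repeated
    sweeps over all pairs drive the matrix towards diagonal form.
    """
    n = len(A)
    A = [row[:] for row in A]  # work on a copy
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Angle chosen so that the rotated entry (p, q) vanishes.
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q (A <- A J)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q (A <- J^T A)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
    return sorted(A[i][i] for i in range(n))


if __name__ == "__main__":
    print(jacobi_eigenvalues([[2.0, 1.0], [1.0, 2.0]]))
```

Rotations acting on disjoint index pairs commute, which is the property parallel Jacobi orderings typically exploit.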

9.
10.
For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)‐2 symmetric matrix‐vector multiplication, and the BLAS‐3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi‐GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU‐GPU kernel into computational kernels at higher levels of the software stack, that is, a shared‐memory dense eigensolver and a distributed‐memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher‐level kernels, not only reducing the solution time but also enabling the solution of larger‐scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques. Copyright © 2013 John Wiley & Sons, Ltd.

11.
We consider the computation of shortest paths on Graphics Processing Units (GPUs). The blocked recursive elimination strategy we use is applicable to a class of algorithms (such as all-pairs shortest-paths, transitive closure, and LU decomposition without pivoting) having similar data access patterns. Using the all-pairs shortest-paths problem as an example, we uncover potential gains over this class of algorithms. The impressive computational power and memory bandwidth of the GPU make it an attractive platform to run such computationally intensive algorithms. Although improvements over CPU implementations have previously been achieved for those algorithms in terms of raw speed, the utilization of the underlying computational resources was quite low. We implemented a recursively partitioned all-pairs shortest-paths algorithm that harnesses the power of GPUs better than existing implementations. The alternate schedule of path computations allowed us to cast almost all operations into matrix–matrix multiplications on a semiring. Since matrix–matrix multiplication is highly optimized and has a high ratio of computation to communication, our implementation does not suffer from the premature saturation of bandwidth resources as iterative algorithms do. By increasing temporal locality, our implementation runs more than two orders of magnitude faster on an NVIDIA 8800 GPU than on an Opteron. Our work provides evidence that programmers should rethink algorithms instead of directly porting them to GPU.
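The idea of expressing shortest-path computations as matrix multiplication on a semiring can be sketched in a few lines. The example below uses the (min, +) semiring with repeated squaring, which is a simpler technique than the blocked recursive elimination the abstract describes and is swapped in here only for illustration; the names `min_plus_product` and `all_pairs_shortest_paths` are hypothetical.

```python
INF = float("inf")


def min_plus_product(A, B):
    """Multiply two square matrices on the (min, +) semiring."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def all_pairs_shortest_paths(W):
    """Return the shortest-path distance matrix for the weight matrix W.

    W[i][j] is the edge weight from i to j (INF if absent) and W[i][i] = 0.
    Squaring the matrix doubles the number of edges a path may use, so
    about log2(n) squarings cover all simple paths (no negative cycles).
    """
    n = len(W)
    D = [row[:] for row in W]
    hops = 1
    while hops < n - 1:
        D = min_plus_product(D, D)
        hops *= 2
    return D


if __name__ == "__main__":
    W = [[0, 3, INF],
         [INF, 0, 1],
         [2, INF, 0]]
    for row in all_pairs_shortest_paths(W):
        print(row)
```

The inner kernel has the same loop structure as an ordinary matrix product, with `min` replacing addition and `+` replacing multiplication.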

12.
Exascale computers are expected to have highly hierarchical architectures with nodes composed of multiple-core processors (CPU; central processing unit) and accelerators (GPU; graphics processing unit). The different programming levels generate new and difficult algorithmic issues. In particular, when solving extremely large linear systems, new programming paradigms of Krylov methods should be defined and evaluated with respect to the modern state of the art of scientific methods. Iterative Krylov methods involve linear algebra operations such as dot product, norm, addition of vectors and sparse matrix–vector multiplication. These operations are computationally expensive for large matrices. In this paper, we focus on the best way to perform these operations effectively, in double precision, on GPU in order to make iterative Krylov methods more robust and therefore reduce the computing time. The performance of our algorithms is evaluated on several matrices arising from engineering problems. Numerical experiments illustrate the robustness and accuracy of our implementation compared to the existing libraries. We deal with different preconditioned Krylov methods: Conjugate Gradient for symmetric positive-definite matrices, and Generalized Conjugate Residual, Bi-Conjugate Gradient Conjugate Residual, transpose-free Quasi Minimal Residual, Stabilized BiConjugate Gradient and Stabilized BiConjugate Gradient (L) for the solution of sparse linear systems with nonsymmetric matrices. We consider and compare several sparse compressed formats, and propose a way to implement Krylov methods effectively on GPU and on multicore CPU. Finally, we give strategies for faster algorithms by auto-tuning the threading design according to the problem characteristics and the hardware changes. As a conclusion, we propose and analyse hybrid sub-structuring methods that should pave the way to exascale hybrid methods.
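Sparse matrix–vector multiplication in a compressed format is the kernel this abstract refers to most often. The sketch below shows the widely used compressed sparse row (CSR) layout in plain sequential Python; the abstract does not say which formats the authors compare, so CSR is an assumption, and the helper names `dense_to_csr` and `csr_matvec` are hypothetical.

```python
def dense_to_csr(M):
    """Convert a dense matrix (list of rows) to CSR arrays.

    Returns (values, col_idx, row_ptr): the nonzero values, their column
    indices, and for each row the offset where its nonzeros start.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in M:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y


if __name__ == "__main__":
    M = [[4, 0, 1],
         [0, 3, 0],
         [2, 0, 5]]
    vals, cols, ptr = dense_to_csr(M)
    print(csr_matvec(vals, cols, ptr, [1, 2, 3]))
```

Each output entry depends on one row only, so rows can be processed independently; the indirect access `x[col_idx[k]]` is what makes memory behaviour irregular for matrices with irregular sparsity patterns.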

13.
Classic analyses of system implementations view user participation as a key element for successful implementation. However, under some conditions, avoiding user participation offers an alternative route to a successful implementation; this is advisable especially when the user network is weak and aligning user needs with the technological capabilities would take too many resources. To illustrate such a situation, we analyse how a successful implementation outcome of an enterprise resource planning (ERP) system emerged in a recently established conglomeration of two previously independent universities. The ERP was used to replace several legacy student administration systems for both political and functional reasons. It was deemed successful by both project consultants and the new university's management while the users were marginalised (‘black boxed’) and left to ‘pick up the pieces’ of an incomplete system using traditional methods such as shadow systems and workarounds. Using a process approach and an actor–network theory ‘reading’ of related socio‐technical events, we demonstrate how three networks of actors (management, the project team and the administrative users) collided and influenced the implementation outcome, and how the management and project network established the ERP as a reliable ally while the users, although enrolled in the network, were betrayed through marginalisation. Our analysis also suggests a useful way to conduct a ‘follow the network’ analysis explaining and accounting for the observed implementation outcome. We illustrate the benefits of using a socio‐technical processual analysis and show how stable actor networks must be constructed during large‐scale information technology change and how different actor groups perceive and influence the implementation outcome differently.

14.
We present a novel, hybrid parallel continuous collision detection (HPCCD) method that exploits the availability of multi‐core CPU and GPU architectures. HPCCD is based on a bounding volume hierarchy (BVH) and selectively performs lazy reconstructions. Our method works with a wide variety of deforming models and supports self‐collision detection. HPCCD takes advantage of hybrid multi‐core architectures, using the general‐purpose CPUs to perform the BVH traversal and culling while GPUs are used to perform elementary tests that reduce to solving cubic equations. We propose a novel task decomposition method that leads to a lock‐free parallel algorithm in the main loop of our BVH‐based collision detection to create a highly scalable algorithm. By exploiting the availability of hybrid, multi‐core CPU and GPU architectures, our proposed method achieves more than an order of magnitude improvement in performance using four CPU‐cores and two GPUs, compared to using a single CPU‐core. This improvement results in interactive performance, up to 148 fps, for various deforming benchmarks consisting of tens or hundreds of thousands of triangles.

15.
A software framework taking advantage of the parallel processing capabilities of CPUs and GPUs is designed for the real‐time interactive cutting simulation of deformable objects. Deformable objects are modelled as voxels connected by links. The voxels are embedded in an octree mesh used for deformation. Cutting is performed by disconnecting links swept by the cutting tool and then adaptively refining octree elements near the cutting tool trajectory. A surface mesh used for visual display is reconstructed from disconnected links using the dual contour method. Spatial hashing of the octree mesh and topology‐aware interpolation of the distance field are used for collision. Our framework uses a novel GPU implementation for inter‐object collision and object self collision, while tool‐object collision, cutting and deformation are assigned to the CPU, using multiple threads whenever possible. A novel method that splits cutting operations into four independent tasks running in parallel is designed. Our framework also performs data transfers between CPU and GPU simultaneously with other tasks to reduce their impact on performance. Simulation tests show that when compared to three‐threaded CPU implementations, our GPU accelerated collision is 53–160% faster; and the overall simulation frame rate is 47–98% faster.

16.
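Efficient algorithms for parallel solvers of matrix eigenproblems on SMP cluster systems   Total citations: 2, self-citations: 0, citations by others: 2
Tridiagonalization of a symmetric matrix and the solution of the eigenvalue problem for a symmetric tridiagonal matrix are the key steps of a parallel solver for dense symmetric eigenproblems. Targeting the multi-level architecture of SMP cluster systems, hybrid MPI/OpenMP parallel algorithms are presented for Householder-based tridiagonalization and for the divide-and-conquer algorithm for the tridiagonal eigenvalue problem. The study focuses on load balancing, communication overhead, and performance evaluation on SMP cluster systems. The design of the hybrid parallel algorithms combines a coarse-grained thread-parallel model with a task-sharing dynamic invocation method, which improves the load balance of the MPI algorithms and reduces communication overhead. Experiments on the DeepComp 6800 (深腾6800) show that the solver based on the hybrid parallel algorithms achieves better performance and scalability than the pure MPI version.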

17.
Parallel Computing, 2014, 40(8): 425-447
EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and an elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:
  • a method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
  • a method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
  • a method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
  • an approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.
Hybrid platforms tested in this study contain different numbers of CPUs and GPUs, from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems: both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively.

18.
This paper describes the design and implementation of three core factorization routines (LU, QR, and Cholesky) included in the out‐of‐core extension of ScaLAPACK. These routines allow the factorization and solution of a dense system that is too large to fit entirely in physical memory. The full matrix is stored on disk and the factorization routines transfer sub‐matrix panels into memory. The ‘left‐looking’ column‐oriented variant of the factorization algorithm is implemented to reduce the disk I/O traffic. The routines are implemented using a portable I/O interface and utilize high‐performance ScaLAPACK factorization routines as in‐core computational kernels. We present the details of the implementation for the out‐of‐core ScaLAPACK factorization routines, as well as performance and scalability results on a Beowulf Linux cluster. Copyright © 2000 John Wiley & Sons, Ltd.

19.
We compare the CPU effort and pricing biases of seven Fourier-based implementations. Our analyses show that truncation and discretization errors significantly increase as we move away from the Black–Scholes–Merton framework. We rank the speed and accuracy of the competing choices, showing which methods require smaller truncation ranges and which are the most efficient in terms of sampling densities. While all implementations converge well in the Bates jump-diffusion model, Attari's formula is the only Fourier-based method that does not blow up for any Variance Gamma parameter values. In terms of speed, the use of strike vector computations significantly reduces the computational burden, rendering both fast Fourier transforms (FFT) and plain delta-probability decompositions inefficient. We conclude that the multi-strike version of the COS method is notably faster than any other implementation, whereas the strike-optimized Carr–Madan formula is simultaneously faster and more accurate than the FFT, thus questioning the use of the latter.

20.
Sequence segmentation has gained popularity in bioinformatics, particularly in the study of DNA sequences. Information theoretic models have been used to provide accurate solutions to the segmentation of DNA sequences. Existing dynamic programming approaches provide an optimal solution to the segmentation problem. However, their quadratic time complexity prohibits their application to long sequences. In this paper, we propose a parallel approach to improve the performance of a quasilinear sequence segmentation algorithm. The target segmentation technique is a divide-and-conquer recursive algorithm that is based on information theory principles and models. We present three parallel implementations that aim at reducing the segmentation time. The first implementation uses the multithreading capabilities of CPUs. The second one is a hybrid implementation that utilizes the synergy between the CPU and the multithreading power of GPUs. The third implementation is a variation of the hybrid approach that uses unified memory between the CPU and the GPU instead of the standard memory copy approach. We demonstrate the applicability of the parallel implementations by testing them on real DNA sequences and randomly generated sequences with different lengths and different numbers of unique elements. The results show that the hybrid CPU-GPU approach outperforms the sequential implementation with a speedup of up to 5.9X, while the CPU parallel implementation provides a poor speedup of only 1.7X.
