Related Articles
A total of 20 related articles were found.
1.
For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures, but an effective scheduling scheme is also needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication and the BLAS-3 low-rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher levels of the software stack, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms exhibit algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels in their own right but also useful testbeds for studying the performance of emerging computers and the effects of various optimization techniques. Copyright © 2013 John Wiley & Sons, Ltd.
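As an illustration of the kernels involved (not the authors' multi-GPU implementation), the following is a minimal NumPy sketch of unblocked Householder tridiagonalization, in which the BLAS-2 symmetric matrix-vector product and the symmetric rank-2 update dominate the cost; all function and variable names are our own assumptions.

```python
import numpy as np

def tridiagonalize(A):
    """Unblocked Householder tridiagonalization of a symmetric matrix A.
    Returns (d, e): the diagonal and sub-diagonal of the tridiagonal form."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k+1:, k]
        nrm = np.linalg.norm(x)
        if nrm == 0.0:
            continue
        alpha = -np.sign(x[0]) * nrm if x[0] != 0 else -nrm
        v = x.copy()
        v[0] -= alpha
        v /= np.linalg.norm(v)
        # BLAS-2 symmetric matrix-vector product: the dominant cost per step
        w = A[k+1:, k+1:] @ v
        w -= v * (v @ w)                       # w = p - (v'p)v keeps the update symmetric
        # symmetric rank-2 update (BLAS syr2-like): A := A - 2(v w' + w v')
        A[k+1:, k+1:] -= 2.0 * (np.outer(v, w) + np.outer(w, v))
        A[k+1:, k] = 0.0
        A[k+1, k] = alpha
        A[k, k+1:] = A[k+1:, k]                # mirror to keep the stored matrix symmetric
    return np.diag(A).copy(), np.diag(A, -1).copy()
```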

2.
肖玄基, 张云泉, 李玉成, 袁良. 《软件学报》 (Journal of Software), 2013, 24(S2): 118-126
MAGMA is the first open-source linear algebra package targeting next-generation architectures (multicore CPUs and GPUs). It adopts numerous optimization techniques for heterogeneous platforms, including hybrid synchronization, communication avoidance, and dynamic task scheduling. It is similar to LAPACK in functionality, data storage, and interfaces, and can harness the large computing power of GPUs for numerical computation. This paper tests and analyzes MAGMA. It first analyzes the matrix factorization algorithms; it then uses the test results to examine MAGMA's effective optimization and parallelization methods, providing useful suggestions for using and optimizing MAGMA; finally, it proposes an adaptive tuning method for the matrix blocking algorithm, which in tests achieves a speedup of 1.09 for the SGEQRF routine on square matrices and 1.8 for the CGEQRF routine on tall-and-skinny matrices.

3.
A Recursive Algorithm for Cholesky Factorization and Its Improvement
Recursion is a new and effective approach to dense linear algebra computations. Recursion automatically produces variable matrix blockings, which makes full use of the memory hierarchies of today's high-performance computers. This paper studies a recursive algorithm for Cholesky factorization, gives a detailed derivation of the algorithm, implements it in Fortran 90 (which supports recursion), and further improves its speed by reordering the matrix elements in memory. The resulting algorithm is 15%-25% faster than the commonly used blocked algorithms.
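The paper's Fortran 90 implementation is not reproduced here; as a hedged illustration of the recursive blocking idea, the sketch below expresses recursive Cholesky factorization in NumPy/SciPy, where each split yields a triangular solve (TRSM-like) and a symmetric trailing update (SYRK-like). The names and fallback block size are our own choices.

```python
import numpy as np
from scipy.linalg import solve_triangular

def rchol(A, nb=64):
    """Recursive Cholesky factorization A = L @ L.T for symmetric positive definite A."""
    n = A.shape[0]
    if n <= nb:                                   # small blocks fall back to LAPACK
        return np.linalg.cholesky(A)
    k = n // 2
    L11 = rchol(A[:k, :k], nb)                                # recurse on the leading block
    L21 = solve_triangular(L11, A[k:, :k].T, lower=True).T    # triangular solve: L21 = A21 * L11^{-T}
    S22 = A[k:, k:] - L21 @ L21.T                             # symmetric trailing update
    L22 = rchol(S22, nb)                                      # recurse on the Schur complement
    L = np.zeros(A.shape)
    L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, L22
    return L
```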

4.
1. Introduction. Numerical computations for linear algebra problems, typified by Cholesky factorization and LU factorization, are widely used in modern scientific research and engineering. As computer architectures and technology develop, the computer algorithms and software implementing these numerical linear algebra computations continue to evolve as well. The general-purpose Basic Linear Algebra Subprograms (BLAS) library evolved from the Level-1 BLAS of the 1970s (performing vector-vector operations) to the Level-2 …

5.
In this paper, some parallel algorithms are described for solving numerical linear algebra problems on the Dawning-1000. They include matrix multiplication, LU factorization of a dense matrix, Cholesky factorization of a symmetric matrix, and eigendecomposition of symmetric matrices for real and complex data types. These programs are built on the fast BLAS library of the Dawning-1000 under the NX environment. Some comparison results under different parallel environments and implementation methods are also given for Cholesky factorization. The execution time, measured performance, and speedup for each problem on the Dawning-1000 are shown. For matrix multiplication and LU factorization, 1.86 GFLOPS and 1.53 GFLOPS are reached, respectively.

6.
A Parallel Fast Fourier Transform Algorithm Based on the Star Interconnection Network
The star interconnection network is an interconnection topology that lends itself to large-scale parallel computing. Exploiting the variety of ways in which the star network can be recursively decomposed, this paper proposes an implementation method for a parallel fast Fourier transform (FFT) algorithm on the star network. The method effectively reduces the communication overhead between processor nodes during the computation. The proposed mapping of star-graph nodes and data, together with the idea behind the parallel FFT implementation, can be extended to implementing other parallel algorithms on star networks, such as linear system solvers and matrix multiplication.
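The star-graph node and data mapping itself is not reproduced here; the following is only a minimal sequential sketch of the radix-2 Cooley-Tukey recursion whose recursive decomposition such a parallel FFT maps onto the network (input length assumed to be a power of two).

```python
import numpy as np

def fft_recursive(x):
    """Radix-2 Cooley-Tukey FFT (sequential sketch; len(x) must be a power of 2)."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    even = fft_recursive(x[0::2])                     # recurse on even-indexed samples
    odd = fft_recursive(x[1::2])                      # recurse on odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])
```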

7.
Based on the open-source linear algebra libraries OpenBLAS and BLIS, this paper studies performance optimization of the dense matrix multiplication (GEMM) operation. To address the problem of how to select the key blocking parameters of the blocked parallel algorithm for dense matrices, a performance optimization model is established. An improved genetic algorithm is used to solve this model: the measured GEMM performance for a given combination of blocking parameters (a population individual) serves as that individual's fitness, and repeated selection, crossover, and mutation operations find the optimal combination of blocking parameters that maximizes the performance of the dense matrix operation. Numerical experiments show that the GEMM performance under the optimal blocking parameters found by the genetic algorithm exceeds that under the default blocking parameters, achieving the optimization goal.
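As a rough, hedged sketch of the tuning idea (not the paper's OpenBLAS/BLIS implementation), the toy genetic algorithm below searches over candidate blocking parameters (mc, kc, nc), using the measured GFLOPS of a blocked matrix multiply as the fitness; the candidate set, population size, and mutation rate are illustrative assumptions.

```python
import time, random
import numpy as np

CHOICES = [32, 64, 96, 128, 192, 256]          # hypothetical candidate block sizes

def blocked_gemm_gflops(A, B, mc, kc, nc):
    """Time a simple blocked GEMM; the measured GFLOPS is the GA fitness."""
    m, k = A.shape
    n = B.shape[1]
    C = np.zeros((m, n))
    t0 = time.perf_counter()
    for i in range(0, m, mc):
        for p in range(0, k, kc):
            for j in range(0, n, nc):
                C[i:i+mc, j:j+nc] += A[i:i+mc, p:p+kc] @ B[p:p+kc, j:j+nc]
    return 2.0 * m * n * k / (time.perf_counter() - t0) / 1e9

def evolve(A, B, pop_size=8, generations=5):
    pop = [tuple(random.choice(CHOICES) for _ in range(3)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda g: -blocked_gemm_gflops(A, B, *g))
        parents = scored[:pop_size // 2]                          # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = tuple(random.choice(pair) for pair in zip(a, b))   # crossover
            if random.random() < 0.3:                             # mutation
                i = random.randrange(3)
                child = child[:i] + (random.choice(CHOICES),) + child[i+1:]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda g: blocked_gemm_gflops(A, B, *g))

# e.g. best_mc, best_kc, best_nc = evolve(np.random.rand(768, 768), np.random.rand(768, 768))
```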

8.
§1. Introduction. Numerical computations for linear algebra problems, typified by LU factorization and Cholesky factorization, are widely used in modern scientific research and engineering. As computer technology develops, the algorithms and software implementing these numerical linear algebra computations continue to evolve as well. High-performance RISC computers with multi-level memory hierarchies now dominate the field of numerical computing. RISC processors are very fast, and the speed gap between them and memory is large; whether a computer's performance can be fully realized hinges on whether the multi-level memory hierarchy and caches are used effectively. To this end, the current …

9.
Quantum control plays a key role in quantum technology, in particular for steering quantum systems. As the problem size grows exponentially with the system size, fast numerical algorithms and implementations are essential. We improved an existing code for quantum control with respect to two linear algebra tasks: the computation of the matrix exponential and efficient parallelisation of prefix matrix multiplication. For the matrix exponential we compare three methods: the eigendecomposition method, the Padé method, and a polynomial expansion based on Chebyshev polynomials. We show that the Chebyshev method outperforms the other methods both in computation time and in accuracy. For the prefix problem we compare the tree-based parallel prefix scheme, which is based on a recursive approach, with a sequential multiplication scheme in which only the individual matrix multiplications are parallelised. We show that this fine-grain approach outperforms the parallel prefix scheme by a factor of 2-3, depending on the parallel hardware and problem size, and also has lower memory requirements. Overall, the improved linear algebra implementations not only led to a considerable runtime reduction but also allowed us to tackle larger problems on the same parallel compute cluster.
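As a small illustration of one of the matrix-exponential options discussed (the eigendecomposition method; the Chebyshev expansion favoured by the paper is not reproduced), the sketch below computes exp(-i*t*H) for a Hermitian H and checks it against SciPy's Padé-based expm. Matrix size and tolerance are arbitrary test values.

```python
import numpy as np
from scipy.linalg import expm, eigh

def expm_eig(H, t=1.0):
    """Matrix exponential exp(-i*t*H) of a Hermitian H via eigendecomposition."""
    w, V = eigh(H)                                  # H = V diag(w) V^H
    return (V * np.exp(-1j * t * w)) @ V.conj().T   # V diag(exp(-i t w)) V^H

# quick consistency check against SciPy's Pade-based expm
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
H = (A + A.conj().T) / 2                            # Hermitian test matrix
assert np.allclose(expm_eig(H), expm(-1j * H), atol=1e-10)
```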

10.
G.J. Bierman. Automatica, 1983, 19(5): 503-511
The Rauch-Tung-Striebel smoother recursion is used to derive a new smoother algorithm based on a decomposition of the linear model's dynamical equation and maximal use of rank-1 matrix modifications. This new algorithm, it turns out, parallels Bierman's forward recursive square-root information filter/backward recursive U-D factorized covariance algorithm. The new result features computational efficiency, reliance on numerically stable matrix modification algorithms, and reduced computer storage.

11.
This paper discusses the development and refinement of several distributed matrix multiplication algorithms. Our goal in this research has been to determine if successful distribution of this problem is possible within a loosely-coupled environment. Our criteria for success are fast execution speed and, to a lesser extent, memory efficiency. Our results indicate that, perhaps counter-intuitively, it is possible to use distribution to improve the performance of dense matrix multiplication. The speed increase obtained ranges up to a factor of four, depending upon the algorithm and the process configuration used. Among the factors affecting performance are computational complexity, number and size of interprocess messages, and bookkeeping overhead. We conclude that this approach to matrix multiplication has potential. Furthermore, some of the principles discussed here may be usefully employed in the distribution of other algorithms of the same O(n3) computational complexity, such as LU decomposition (linear system solvers) and Cholesky factorization.

12.
Solving large-scale triangular linear systems is an important computational kernel in scientific and engineering applications, but limited processor cache capacity and architectural constraints keep its efficiency low on platforms such as CPUs and GPUs. In the blocked solution of large triangular systems, matrix multiplication is the dominant operation, and its efficiency is critical to improving the efficiency of the triangular solve. Taking a matrix-multiplication coprocessor, which achieves high matrix-multiply efficiency, as the computing platform, this paper proposes an implementation method and a performance-analysis model for the blocked solution of large-scale triangular linear systems on such a coprocessor, tailored to its architectural characteristics. Experimental results show that the computational efficiency of the large-scale triangular solve on the matrix-multiplication coprocessor reaches up to 85.9%, and that its actual performance and resource utilization are 2.42x and 10.72x, respectively, those of a GPU built on the same process technology.
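A hedged, generic NumPy/SciPy sketch of the blocked triangular solve structure described here (not the coprocessor implementation): the diagonal blocks are handled by small triangular solves, while the trailing update, a GEMM, is where most of the work, and hence most of the attainable efficiency, lies. The block size nb is an illustrative choice.

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_trsm(L, B, nb=256):
    """Solve L X = B with lower-triangular L, block by block."""
    n = L.shape[0]
    X = B.astype(float).copy()
    for k in range(0, n, nb):
        end = min(k + nb, n)
        # small triangular solve on the diagonal block
        X[k:end] = solve_triangular(L[k:end, k:end], X[k:end], lower=True)
        # GEMM update of the remaining right-hand-side rows: the dominant cost
        if end < n:
            X[end:] -= L[end:, k:end] @ X[k:end]
    return X
```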

13.
14.
D. Bini. Calcolo, 1988, 25(1-2): 37-51
We survey certain techniques such as approximation, interpolation and embedding in matrix algebra, which are particularly useful to devise fast parallel algorithms for some matrix-structured problems. The problems we consider are triangular Toeplitz matrix inversion and its algebraic counterpart polynomial division, matrix multiplication, and some computations concerning banded Toeplitz matrices.

15.
In this paper we propose a distributed packed storage format that exploits the symmetry or the triangular structure of a dense matrix. This format stores only half of the matrix while maintaining most of the efficiency of full storage for a wide range of operations. This work has been motivated by the fact that, in contrast to sequential linear algebra libraries (e.g. LAPACK), there is no routine or format that handles packed matrices in the currently available parallel distributed libraries. The proposed algorithms exclusively use the existing ScaLAPACK computational kernels, which proves the generality of the approach, provides easy portability of the code, and allows efficient re-use of existing software. The performance results obtained for the Cholesky factorization show that our packed format performs as well as or better than the ScaLAPACK full-storage algorithm for a small number of processors. For a larger number of processors, the ScaLAPACK full-storage routine performs slightly better until each processor runs out of memory. Copyright © 2006 John Wiley & Sons, Ltd.
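For illustration, a minimal sequential sketch of the LAPACK-style packed layout that the distributed format generalizes: only the lower triangle of a symmetric matrix is stored, column by column, roughly halving memory. The distributed ScaLAPACK-based algorithms of the paper are not reproduced; the helper names are our own.

```python
import numpy as np

def pack_lower(A):
    """Column-major packed storage of the lower triangle ('L' packed layout):
    only n*(n+1)/2 entries are kept instead of n*n."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def unpack_lower(ap, n):
    """Rebuild the full symmetric matrix from the packed lower triangle."""
    A = np.zeros((n, n))
    pos = 0
    for j in range(n):
        A[j:, j] = ap[pos:pos + n - j]
        pos += n - j
    return A + np.tril(A, -1).T        # mirror the stored triangle to restore symmetry
```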

16.
In many application domains, large-scale floating-point matrix multiplication is often one of the most time-consuming computational kernels. Emerging applications frequently involve large matrices in which at least one dimension is very small; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, large matrix multiplications usually have to be partitioned into fine-grained sub-block computation tasks. When accelerating non-uniform matrix multiplication, most existing linear-array hardware matrix multipliers suffer a large performance drop because they support only a fixed block size. To solve this problem, we propose an effective blocking optimization strategy and, on this basis, implement a matrix multiplier that supports variable block sizes on a Xilinx Zynq XC7Z045 FPGA. With 224 integrated processing elements, the multiplier achieves a measured performance of 48 GFLOPS on non-uniform matrix multiplications from real applications at a 150 MHz clock frequency, while requiring only 4.8 GB/s of bandwidth. Experimental results show that our blocking strategy delivers up to a 12% performance improvement over conventional blocking algorithms.

17.
We generalize Kedlaya and Umans’ modular composition algorithm to the multivariate case. As a main application, we give fast algorithms for many operations involving triangular sets (over a finite field), such as modular multiplication, inversion, or change of order. For the first time, we are able to exhibit running times for these operations that are almost linear, without any overhead exponential in the number of variables. As a further application, we show that, from the complexity viewpoint, Charlap, Coley, and Robbins’ approach to elliptic curve point counting can be competitive with the better known approach due to Elkies.

18.
Identification algorithms for concentrated dynamic systems are studied. The apparatus of pseudo-inversion of block matrices underlies adaptive identification. It is used to construct a recursive algorithm for linear least-squares problems and a recursive iterative algorithm for nonlinear least-squares problems.
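The paper's block pseudo-inverse formulation is not reproduced here; as a generic illustration of a recursive algorithm for linear least-squares problems, the sketch below implements the standard recursive least-squares (RLS) update built on a rank-1 (Sherman-Morrison) modification. The class name and initialization constant are our own assumptions.

```python
import numpy as np

class RecursiveLeastSquares:
    """Standard recursive least-squares update (illustrative sketch only)."""
    def __init__(self, n_params, delta=1e3):
        self.theta = np.zeros(n_params)        # parameter estimate
        self.P = delta * np.eye(n_params)      # inverse information matrix

    def update(self, phi, y):
        """Incorporate one observation y with regressor vector phi."""
        Pphi = self.P @ phi
        k = Pphi / (1.0 + phi @ Pphi)          # gain via the rank-1 (Sherman-Morrison) update
        self.theta += k * (y - phi @ self.theta)
        self.P -= np.outer(k, Pphi)
        return self.theta
```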

19.
A Study of a Recursive Algorithm for LU Factorization
陈建平. 《计算机科学》 (Computer Science), 2004, 31(6): 141-142
Introducing recursion into dense linear algebra computations produces automatic matrix blocking, making the algorithms well suited to the memory hierarchies of today's high-performance computers and improving execution speed. This paper studies a recursive LU factorization algorithm for solving systems of linear algebraic equations and gives a detailed derivation of the algorithm.
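As with the Cholesky case above, a hedged NumPy/SciPy sketch of the recursive LU idea (without pivoting, so the leading blocks are assumed nonsingular; this is not the paper's derivation): each recursion level yields two triangular solves and a Schur-complement GEMM update, the block operations that exploit the memory hierarchy.

```python
import numpy as np
from scipy.linalg import solve_triangular

def rlu(A, nb=64):
    """Recursive LU factorization without pivoting: A = L @ U."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n <= nb:
        # unblocked base case: classic Doolittle elimination
        L, U = np.eye(n), A.copy()
        for k in range(n - 1):
            L[k+1:, k] = U[k+1:, k] / U[k, k]
            U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:])
        return L, U
    k = n // 2
    L11, U11 = rlu(A[:k, :k], nb)
    U12 = solve_triangular(L11, A[:k, k:], lower=True)            # L11 U12 = A12
    L21 = solve_triangular(U11.T, A[k:, :k].T, lower=True).T      # L21 U11 = A21
    L22, U22 = rlu(A[k:, k:] - L21 @ U12, nb)                     # Schur complement
    L = np.block([[L11, np.zeros((k, n - k))], [L21, L22]])
    U = np.block([[U11, U12], [np.zeros((n - k, k)), U22]])
    return L, U
```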

20.
A Library for Pattern-based Sparse Matrix Vector Multiply
Pattern-based Representation (PBR) is a novel approach to improving the performance of Sparse Matrix-Vector Multiply (SMVM) numerical kernels. Motivated by our observation that many matrices can be divided into blocks that share a small number of distinct patterns, we generate custom multiplication kernels for frequently recurring block patterns. The resulting reduction in index overhead significantly reduces memory bandwidth requirements and improves performance. Unlike existing methods, PBR requires neither detection of dense blocks nor zero filling, making it particularly advantageous for matrices that lack dense nonzero concentrations. SMVM kernels for PBR can benefit from explicit prefetching and vectorization, and are amenable to parallelization. The analysis and format conversion to PBR is implemented as a library, making it suitable for applications that generate matrices dynamically at runtime. We present sequential and parallel performance results for PBR on two current multicore architectures, which show that PBR outperforms available alternatives for the matrices to which it is applicable, and that the analysis and conversion overhead is amortized in realistic application scenarios.
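PBR itself is not sketched here; for reference, the following minimal CSR sparse matrix-vector product shows the per-nonzero column-index traffic that PBR's custom block-pattern kernels aim to reduce.

```python
import numpy as np

def csr_spmv(data, indices, indptr, x):
    """Baseline CSR sparse matrix-vector product y = A @ x: every nonzero
    carries a column index, which is the index overhead PBR targets."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        y[i] = np.dot(data[start:end], x[indices[start:end]])
    return y

# for a scipy.sparse.csr_matrix A: csr_spmv(A.data, A.indices, A.indptr, x)
```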
