Similar Documents
18 similar documents found (search time: 187 ms).
1.
Solving large-scale triangular linear systems is an important computational kernel in scientific and engineering applications, but it runs at low efficiency on platforms such as CPUs and GPUs, limited by processor cache capacity and architectural design. In blocked solvers for large triangular systems, matrix multiplication is the dominant operation, so its efficiency is critical to the efficiency of the solve as a whole. Taking a matrix-multiplication coprocessor, on which matrix multiplication runs at high efficiency, as the computing platform, this work proposes an implementation method and a performance-analysis model for the blocked solution of large triangular systems tailored to the coprocessor's architectural characteristics. Experimental results show that the solver reaches a computational efficiency of up to 85.9% on the coprocessor, and that its measured performance and resource utilization are 2.42 and 10.72 times those of a GPU in the same process technology, respectively.
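A minimal C sketch of the blocked forward-substitution scheme the abstract describes, in which the dominant off-diagonal update is a plain GEMM (the operation that would be offloaded to the coprocessor); the block size, row-major layout, and the assumption that n divides evenly are illustrative, not details from the paper.

```c
/* Blocked solve of L*X = B for lower-triangular L (n x n, row-major) and
 * B (n x nrhs), assuming n is a multiple of NB.  gemm_update() is the
 * operation the paper offloads to the matrix-multiplication coprocessor;
 * here a plain triple loop stands in for it. */
#include <stddef.h>

enum { NB = 64 };   /* block size; in practice tuned to the accelerator (assumption) */

/* C -= A * B with A (m x k), B (k x n), C (m x n), row-major with leading
 * dimensions of the enclosing matrices. */
static void gemm_update(size_t m, size_t k, size_t n,
                        const double *A, size_t lda,
                        const double *B, size_t ldb,
                        double *C, size_t ldc) {
    for (size_t i = 0; i < m; i++)
        for (size_t p = 0; p < k; p++) {
            double a = A[i * lda + p];
            for (size_t j = 0; j < n; j++)
                C[i * ldc + j] -= a * B[p * ldb + j];
        }
}

/* Unblocked forward substitution on one diagonal block. */
static void trsm_diag(size_t nb, size_t nrhs,
                      const double *L, size_t ldl, double *X, size_t ldx) {
    for (size_t i = 0; i < nb; i++)
        for (size_t j = 0; j < nrhs; j++) {
            double s = X[i * ldx + j];
            for (size_t p = 0; p < i; p++)
                s -= L[i * ldl + p] * X[p * ldx + j];
            X[i * ldx + j] = s / L[i * ldl + i];
        }
}

/* For each block row k: first eliminate the contribution of the already
 * solved block rows with one big GEMM, then solve the diagonal block. */
void blocked_trsm(size_t n, size_t nrhs, const double *L, double *B) {
    for (size_t k = 0; k < n; k += NB) {
        if (k > 0)   /* B[k..k+NB) -= L[k..k+NB, 0..k) * B[0..k) */
            gemm_update(NB, k, nrhs, &L[k * n], n, B, nrhs, &B[k * nrhs], nrhs);
        trsm_diag(NB, nrhs, &L[k * n + k], n, &B[k * nrhs], nrhs);
    }
}
```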

2.
BLAS (basic linear algebra subprograms) is one of the most fundamental and important low-level math libraries. Within a standard BLAS library, the Level 3 routines, which cover matrix-matrix operations, are especially important and are widely called in many large-scale scientific and engineering computing applications. Level 3 BLAS routines are also compute-bound, making them crucial for fully exploiting a processor's computational performance. This work studies many-core parallel optimization techniques for Level 3 BLAS routines on the domestic SW26010-Pro processor. Specifically, a multi-level blocking algorithm is designed around the SW26010-Pro's memory hierarchy to expose the parallelism of matrix operations. On this basis, a data-sharing strategy based on the remote memory access (RMA) mechanism is designed to improve data-transfer efficiency among the slave cores (CPEs). The algorithm is further optimized comprehensively with triple buffering and parameter tuning to hide direct memory access (DMA) and RMA communication overheads. In addition, exploiting the SW26010-Pro's two hardware pipelines and its vectorized compute and memory-access instructions, several Level 3 operations, including matrix-matrix multiplication, matrix equation solving, and matrix transposition, are hand-optimized in assembly to raise floating-point efficiency. Experimental results show that the proposed parallel optimization techniques…
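The multi-level blocking idea generalizes beyond the SW26010-Pro; the sketch below is a plain cache-blocking loop nest in C, not the paper's implementation, and the block sizes MC/KC/NC are assumptions standing in for the memory-hierarchy levels the paper tunes.

```c
/* Generic multi-level blocking for C += A*B (row-major).  In the real
 * library the inner kernel is the hand-written, triple-buffered assembly
 * kernel operating on data staged into CPE local memory via DMA. */
#include <stddef.h>

enum { MC = 128, KC = 256, NC = 512 };   /* block sizes: assumptions */

static size_t min2(size_t a, size_t b) { return a < b ? a : b; }

void gemm_blocked(size_t m, size_t k, size_t n,
                  const double *A, const double *B, double *C) {
    for (size_t jc = 0; jc < n; jc += NC)            /* panel of B / C */
        for (size_t pc = 0; pc < k; pc += KC)        /* panel of A / B */
            for (size_t ic = 0; ic < m; ic += MC) {  /* block of A / C */
                size_t mb = min2(MC, m - ic);
                size_t kb = min2(KC, k - pc);
                size_t nb = min2(NC, n - jc);
                /* Inner kernel on one resident block triple. */
                for (size_t i = 0; i < mb; i++)
                    for (size_t p = 0; p < kb; p++) {
                        double a = A[(ic + i) * k + (pc + p)];
                        for (size_t j = 0; j < nb; j++)
                            C[(ic + i) * n + (jc + j)] +=
                                a * B[(pc + p) * n + (jc + j)];
                    }
            }
}
```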

3.
Systolic arrays have a regular structure and high throughput, are well suited to matrix multiplication, and are widely used to build high-performance convolution and matrix-multiplication accelerators. In deep-submicron processes, however, simply enlarging the array to raise chip performance leads to frequency degradation, sharply increased power consumption, and similar problems. Combining 3D integrated-circuit technology, this work therefore proposes 3D-MMA, a double-precision floating-point matrix-multiplication accelerator that maps a planar systolic array onto a 3D IC. First, a blocked mapping and scheduling algorithm is designed for this structure to raise matrix-multiplication efficiency; second, an acceleration system based on 3D-MMA is proposed, a performance model of 3D-MMA is built, and its design space is explored; finally, the implementation cost is evaluated and compared against existing state-of-the-art accelerators. Experimental results show that with 160 GB/s of memory bandwidth and a four-layer stack of 16x16 systolic arrays, 3D-MMA reaches a peak of 3 TFLOPS at 99% efficiency, with a lower implementation cost than a 2D design. In the same process technology, 3D-MMA delivers 1.36 and 1.92 times the performance of a linear-array accelerator and a K40 GPU, respectively, with far smaller area. The work explores the advantages of 3D ICs in designing high-performance matrix-multiplication accelerators and offers a reference for further improving high-performance computing platforms.
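The paper's performance model is not reproduced in the abstract, but a toy roofline-style calculation shows the kind of design-space arithmetic involved; the 1.5 GHz clock, the 2 flops per processing element per cycle, and the block size are assumptions, while the 4-layer 16x16 configuration and the 160 GB/s bandwidth come from the abstract.

```c
/* Toy roofline-style model for a stacked systolic matrix-multiply array:
 * compare the compute-bound peak with the bandwidth-bound ceiling. */
#include <stdio.h>

int main(void) {
    double layers = 4, rows = 16, cols = 16;   /* 4-layer 16x16 stack (paper) */
    double freq_ghz = 1.5;                     /* clock: assumption */
    double bw_gbs = 160.0;                     /* bandwidth (paper) */
    double nb = 1024;                          /* GEMM block size: assumption */

    /* Compute-bound peak: 2 flops (multiply + add) per PE per cycle. */
    double peak_gflops = 2.0 * layers * rows * cols * freq_ghz;

    /* Bandwidth ceiling for an nb x nb block GEMM: 2*nb^3 flops over
     * 3*nb^2 doubles (8 bytes each) moved, with no reuse across blocks. */
    double intensity = (2.0 * nb * nb * nb) / (3.0 * nb * nb * 8.0);
    double bw_gflops = bw_gbs * intensity;

    double attainable = peak_gflops < bw_gflops ? peak_gflops : bw_gflops;
    printf("peak %.0f GFLOPS, bandwidth ceiling %.0f GFLOPS -> attainable %.0f\n",
           peak_gflops, bw_gflops, attainable);   /* ~3072 GFLOPS, compute-bound */
    return 0;
}
```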

4.
BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations; Level 2 BLAS provides matrix-vector operations. This work designs and implements high-performance Level 1 and Level 2 BLAS routines for the domestic SW26010-Pro many-core processor. A slave-core (CPE) reduction strategy based on the RMA communication mechanism is designed, improving the reduction efficiency of several Level 1 and 2 routines. For routines with data dependences, such as TRSV and TPSV, an efficient parallel algorithm is proposed: it maintains the dependences through point-to-point synchronization and uses a task-mapping scheme tailored to triangular matrices, effectively reducing the number of point-to-point synchronizations among the CPEs and improving execution efficiency. Adaptive optimization, vector compression, and data reuse further raise the memory-bandwidth utilization of the Level 1 and 2 routines. Experimental results show that the Level 1 routines reach up to 95% memory-bandwidth utilization (above 90% on average) and the Level 2 routines reach up to 98% (above 80% on average). Compared with the widely used open-source library GotoBLAS, the Level 1 and 2 routines achieve average speedups of 18.78x and 25.96x, respectively. LU factorization, QR factorization, and symmetric eigenvalue problems, by calling…
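A sketch of the dependence structure behind the TRSV algorithm described above: each diagonal block must be solved in order, while the updates below it are independent and can be spread across cores. OpenMP stands in here for the paper's CPE point-to-point synchronization, which is hardware-specific.

```c
/* Blocked lower-triangular solve L*x = b (TRSV), n x n row-major L. */
#include <stddef.h>

enum { NB = 64 };   /* block size: assumption */

void blocked_trsv(size_t n, const double *L, double *x) {
    for (size_t k = 0; k < n; k += NB) {
        size_t kb = (n - k < NB) ? n - k : NB;

        /* 1. Sequential solve of the diagonal block (carries the dependence). */
        for (size_t i = 0; i < kb; i++) {
            double s = x[k + i];
            for (size_t j = 0; j < i; j++)
                s -= L[(k + i) * n + (k + j)] * x[k + j];
            x[k + i] = s / L[(k + i) * n + (k + i)];
        }

        /* 2. Independent GEMV-like updates of the rows below; these are the
         *    parts that get distributed over the slave cores. */
        #pragma omp parallel for
        for (size_t i = k + kb; i < n; i++) {
            double s = 0.0;
            for (size_t j = k; j < k + kb; j++)
                s -= L[i * n + j] * x[j];
            x[i] += s;
        }
    }
}
```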

5.
BLAS is one of the most important low-level supporting math libraries in scientific computing, and its Level 3 routines are the most widely used. Based on the domestic Sunway 1600 (申威1600) platform, this paper presents a high-performance implementation of GEMM, the Level 3 general matrix-multiplication routine of the BLAS library. On a single core, platform-specific techniques, including multiply-add instructions, loop unrolling, software-pipelined instruction scheduling, SIMD vectorization, and register blocking, are used for hand-tuned assembly-level optimization; on multiple cores, a multithreading scheme suited to the platform is proposed. Experimental results show an average speedup of 4.72x over the well-known open-source library GotoBLAS in single-core serial tests; in multicore scaling tests, the 4-thread version reaches 3.02x the single-thread performance on average.
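Register blocking and unrolling can be shown in portable C; the 4x4 micro-kernel below mirrors the structure that would be written in hand-tuned assembly. The 4x4 shape and the assumption that m and n divide evenly are illustrative, not the paper's parameters.

```c
/* 4x4 register-blocked GEMM micro-kernel for C += A*B, row-major.
 * The 16 accumulators map onto registers; a real kernel would add SIMD
 * loads and software pipelining on top of this structure. */
#include <stddef.h>

void gemm_4x4(size_t m, size_t k, size_t n,
              const double *A, const double *B, double *C) {
    for (size_t i = 0; i < m; i += 4)
        for (size_t j = 0; j < n; j += 4) {
            double c[4][4] = {{0}};           /* accumulator block */
            for (size_t p = 0; p < k; p++) {
                double a0 = A[(i+0)*k + p], a1 = A[(i+1)*k + p];
                double a2 = A[(i+2)*k + p], a3 = A[(i+3)*k + p];
                for (size_t jj = 0; jj < 4; jj++) {
                    double b = B[p*n + j + jj];   /* one B element feeds 4 FMAs */
                    c[0][jj] += a0 * b; c[1][jj] += a1 * b;
                    c[2][jj] += a2 * b; c[3][jj] += a3 * b;
                }
            }
            for (size_t ii = 0; ii < 4; ii++)
                for (size_t jj = 0; jj < 4; jj++)
                    C[(i+ii)*n + j + jj] += c[ii][jj];
        }
}
```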

6.
Using a modular-exponentiation coprocessor is the most effective way to improve RSA performance, but when the modulus length of the RSA exponentiation exceeds the maximum operand length the coprocessor supports, the coprocessor no longer applies directly. To address this problem, based on the Chinese Remainder Theorem and the Fischer-Seifert algorithm, this paper implements 2n-bit RSA on top of an n-bit modular-exponentiation coprocessor, and also uses the coprocessor to implement n-bit big-number multiplication and division, further improving RSA efficiency.
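A textbook sketch of the CRT decomposition the paper builds on: two half-length modular exponentiations replace one full-length one, so each fits the n-bit coprocessor. Toy 64-bit integers and the classic (p, q, d) = (61, 53, 2753) example stand in for real multi-precision arithmetic; the Fischer-Seifert length-doubling itself is not reproduced here.

```c
/* CRT-based RSA decryption with toy parameters (GCC/Clang __int128 used
 * for the intermediate products). */
#include <stdint.h>
#include <stdio.h>

/* Square-and-multiply modular exponentiation; on the target system this
 * is the operation handed to the coprocessor. */
static uint64_t modexp(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1, x = b % m;
    while (e) {
        if (e & 1) r = (uint64_t)((unsigned __int128)r * x % m);
        x = (uint64_t)((unsigned __int128)x * x % m);
        e >>= 1;
    }
    return r;
}

int main(void) {
    uint64_t p = 61, q = 53, e = 17, n = p * q;        /* n = 3233 */
    uint64_t d = 2753;                                 /* e*d = 1 mod lcm(p-1,q-1) */
    uint64_t dp = d % (p - 1), dq = d % (q - 1);
    uint64_t qinv = modexp(q, p - 2, p);               /* q^-1 mod p (p prime) */

    uint64_t c = modexp(42, e, n);                     /* encrypt m = 42 */

    uint64_t mp = modexp(c, dp, p);                    /* two half-size exps */
    uint64_t mq = modexp(c, dq, q);
    uint64_t h = qinv * ((mp + p - mq % p) % p) % p;   /* Garner recombination */
    uint64_t m = mq + h * q;

    printf("decrypted: %llu\n", (unsigned long long)m); /* prints 42 */
    return 0;
}
```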

7.
This paper presents the architecture of a public-key cryptography coprocessor that can compute both point multiplication on elliptic curves defined over Fp and the modular exponentiation used in RSA, supporting ECC with field lengths up to 256 bits and RSA with moduli up to 2048 bits. The coprocessor is simple in structure, easy to implement, and can be adjusted with little effort to meet users' area requirements.
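The point multiplication such a coprocessor implements reduces to double-and-add; a toy affine-coordinate version over F_17 (the textbook curve y^2 = x^3 + 2x + 2, an illustrative assumption) shows the control structure, though real designs use large fields and projective coordinates to avoid the per-step inversion.

```c
/* Double-and-add scalar multiplication on a toy elliptic curve. */
#include <stdint.h>
#include <stdio.h>

enum { P = 17, A = 2 };                     /* field and curve parameter a */
typedef struct { int64_t x, y; int inf; } Pt;

static int64_t mmod(int64_t v) { return ((v % P) + P) % P; }

static int64_t minv(int64_t v) {            /* Fermat: v^(p-2) mod p */
    int64_t r = 1, b = mmod(v), e = P - 2;
    while (e) { if (e & 1) r = r * b % P; b = b * b % P; e >>= 1; }
    return r;
}

static Pt padd(Pt p, Pt q) {                /* affine point addition */
    if (p.inf) return q;
    if (q.inf) return p;
    if (p.x == q.x && mmod(p.y + q.y) == 0) return (Pt){0, 0, 1};
    int64_t s = (p.x == q.x && p.y == q.y)
        ? mmod((3 * p.x * p.x + A) * minv(2 * p.y))   /* tangent slope */
        : mmod((q.y - p.y) * minv(q.x - p.x));        /* chord slope */
    int64_t x = mmod(s * s - p.x - q.x);
    return (Pt){x, mmod(s * (p.x - x) - p.y), 0};
}

static Pt pmul(int64_t k, Pt g) {           /* left-to-right double-and-add */
    Pt r = {0, 0, 1};
    for (int i = 62; i >= 0; i--) {
        r = padd(r, r);                     /* double */
        if ((k >> i) & 1) r = padd(r, g);   /* add */
    }
    return r;
}

int main(void) {
    Pt g = {5, 1, 0};                       /* generator of order 19 */
    Pt q = pmul(7, g);
    printf("7*G = (%lld, %lld)\n", (long long)q.x, (long long)q.y); /* (0, 6) */
    return 0;
}
```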

8.
Accelerating modular multiplication over GF(2^m) is the key to improving the performance of ECC algorithms over GF(2^m). Building on an analysis of elliptic-curve point multiplication, we construct a locally parallel recurrence for implementing modular multiplication on a linear systolic array, design the concrete cell structure and interconnection of the array, and give a performance analysis and simulation results. Experiments show that the locally parallel array structure can handle modular multiplication for a variety of elliptic curves.
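The recurrence a linear systolic array implements for GF(2^m) modular multiplication is easy to render in software: one shift-reduce-accumulate step per cell. The sketch below uses GF(2^8) with the AES polynomial as a concrete field, which is an assumption for illustration only.

```c
/* MSB-first bit-serial GF(2^m) modular multiplication: each loop
 * iteration corresponds to one cell (or one cycle) of the systolic array. */
#include <stdint.h>
#include <stdio.h>

enum { M = 8 };
static const uint16_t POLY = 0x11B;   /* x^8 + x^4 + x^3 + x + 1 */

uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint16_t acc = 0;
    for (int i = M - 1; i >= 0; i--) {       /* one step per array cell */
        acc <<= 1;                           /* acc *= x */
        if (acc & (1u << M)) acc ^= POLY;    /* reduce modulo f(x) */
        if (b & (1u << i)) acc ^= a;         /* conditionally add a */
    }
    return (uint8_t)acc;
}

int main(void) {
    /* 0x53 and 0xCA are inverses in AES's GF(2^8): product is 0x01. */
    printf("0x%02X\n", gf_mul(0x53, 0xCA));
    return 0;
}
```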

9.
The encoding of RS (Reed-Solomon) codes involves finite-field arithmetic, which is complex and slow and whose cost is hard for large-scale distributed storage systems to accept. To address this, an improved encoding algorithm for RS Cauchy codes is proposed. The algorithm uses a greedy strategy to select a locally optimal Cauchy matrix, reducing the amount of computation the Cauchy code requires. At the same time, the finite-field elements of the Cauchy matrix are replaced by binary matrices, converting finite-field arithmetic into XOR operations, and the resulting bit matrix is further optimized to cut the computation and raise encoding efficiency. Simulations show that the improved RS Cauchy code requires less computation than the Cauchy code built from the optimal Cauchy matrix found by exhaustive search, and encodes faster than EVENODD and STAR, array codes known for their encoding efficiency. It also has array-code-like properties, allowing simpler and more efficient decoding methods and thus improving decoding efficiency to some extent.
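Two of the ingredients named above, the Cauchy matrix construction and the XOR cost of an element's binary-matrix expansion, can be sketched directly; the field GF(2^4) and the particular X/Y sets are illustrative assumptions, and the greedy selection itself is reduced here to "minimize the total XOR cost".

```c
/* Cauchy matrix over GF(2^w) and the XOR cost of its entries.  Entry
 * C[i][j] = 1/(x_i ^ y_j) is always invertible in every square submatrix;
 * expanding an element into a w x w binary matrix turns encoding into
 * pure XOR, and the count of 1 bits is the cost a greedy search minimizes. */
#include <stdint.h>
#include <stdio.h>

enum { W = 4 };
static const uint8_t POLY = 0x13;      /* x^4 + x + 1 */

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        a <<= 1;
        if (a & 0x10) a ^= POLY;
    }
    return r & 0x0F;
}

static uint8_t gf_inv(uint8_t a) {     /* brute force: 16-element field */
    for (uint8_t x = 1; x < 16; x++)
        if (gf_mul(a, x) == 1) return x;
    return 0;
}

/* Ones in the w x w binary expansion of e; column j is e * x^j. */
static int xor_cost(uint8_t e) {
    int ones = 0;
    uint8_t col = e;
    for (int j = 0; j < W; j++) {
        ones += __builtin_popcount(col);
        col = gf_mul(col, 2);          /* next column: multiply by x */
    }
    return ones;
}

int main(void) {
    uint8_t x[2] = {1, 2}, y[3] = {4, 8, 9};   /* disjoint X and Y sets */
    int total = 0;
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 3; j++) {
            uint8_t e = gf_inv(x[i] ^ y[j]);   /* Cauchy entry 1/(x_i + y_j) */
            printf("%2u(%d xors) ", e, xor_cost(e));
            total += xor_cost(e);
        }
        printf("\n");
    }
    printf("total XOR cost: %d\n", total);     /* objective the greedy minimizes */
    return 0;
}
```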

10.
Design and Implementation of a General-Purpose ECC Coprocessor
蔡亮, 戴紫彬, 陈璐. 《计算机工程》, 2009, 35(4): 140-142.
This paper proposes a high-speed elliptic curve cryptography (ECC) coprocessor that supports arbitrary curves and arbitrary field polynomials over both prime and binary finite fields. The coprocessor performs all the basic ECC operations and, by invoking the basic operation units under instruction control, completes ECDSA and other derived algorithms. It supports ECC applications of any length up to 384 bits and improves system performance with techniques such as a word-based modular multiplier, operand separation, and RAM arrays.

11.
We present a new fast and scalable matrix multiplication algorithm called DIMMA (distribution-independent matrix multiplication algorithm) for block cyclic data distribution on distributed-memory concurrent computers. The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS (basic linear algebra subprograms) routine in each processor even when the block size is very small or very large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer. © 1998 John Wiley & Sons, Ltd.
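The pipelined overlap of computation and communication can be sketched with non-blocking MPI: while the current panel is multiplied by the sequential BLAS, the next panel is already in flight. The ring-shift bookkeeping below is a simplification, not DIMMA's actual LCM-block schedule.

```c
/* Overlapping a ring shift of A-panels with local GEMM calls.  a_panels
 * holds a double buffer of panel_elems doubles each; panel 0 is assumed
 * pre-loaded in buffer 0. */
#include <mpi.h>

void pipelined_gemm(double *a_panels, int panel_elems, int npanels,
                    const double *b, double *c, int rank, int size,
                    void (*local_gemm)(const double *, const double *, double *)) {
    int prev = (rank - 1 + size) % size, next = (rank + 1) % size;
    for (int p = 0; p < npanels; p++) {
        double *cur = a_panels + (p % 2) * panel_elems;
        double *nxt = a_panels + ((p + 1) % 2) * panel_elems;
        MPI_Request reqs[2];
        int nreq = 0;
        if (p + 1 < npanels) {   /* start moving the next panel now */
            MPI_Irecv(nxt, panel_elems, MPI_DOUBLE, prev, p + 1,
                      MPI_COMM_WORLD, &reqs[nreq++]);
            MPI_Isend(cur, panel_elems, MPI_DOUBLE, next, p + 1,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        }
        local_gemm(cur, b, c);   /* sequential BLAS overlaps the transfer */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }
}
```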

12.
In many application domains, large-scale floating-point matrix multiplication is often one of the most time-consuming computational kernels. Emerging applications frequently involve large matrices in which at least one dimension is very small; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, computing a large matrix multiplication requires partitioning the matrices into fine-grained sub-block computation tasks. When accelerating non-uniform matrix multiplication, most existing linear-array hardware matrix multipliers suffer a large performance drop because they support only a fixed block size. To solve this problem, an effective optimized blocking strategy is proposed. On this basis, a matrix multiplier supporting variable block sizes is implemented on a Xilinx Zynq XC7Z045 FPGA. With 224 integrated processing elements, the multiplier achieves a measured 48 GFLOPS at a 150 MHz clock on non-uniform matrix multiplications from real applications, while requiring only 4.8 GB/s of bandwidth. Experimental results show that the proposed blocking strategy achieves up to a 12% performance improvement over the traditional blocking algorithm.
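A toy version of the blocking decision the paper automates: among candidate block shapes that fit on-chip memory, pick the one that wastes the least of the array on a non-uniform matrix. The buffer size and the candidate shapes are illustrative assumptions.

```c
/* Variable block-size selection for a non-uniform (tall-skinny) matrix. */
#include <stdio.h>

#define ONCHIP_WORDS (64 * 1024)   /* on-chip capacity in doubles: assumption */

static double utilization(int m, int n, int bm, int bn) {
    int tm = (m + bm - 1) / bm, tn = (n + bn - 1) / bn;   /* tiles incl. padding */
    return (double)m * n / ((double)tm * bm * tn * bn);   /* useful fraction */
}

int main(void) {
    int m = 4096, n = 24;          /* non-uniform: one tiny dimension */
    int cands[][2] = {{128,128},{256,64},{512,32},{1024,16},{2048,8}};
    int best = 0; double best_u = 0;

    for (int i = 0; i < 5; i++) {
        int bm = cands[i][0], bn = cands[i][1];
        if (bm * bn > ONCHIP_WORDS) continue;    /* must fit the block buffer */
        double u = utilization(m, n, bm, bn);
        printf("%5dx%-5d utilization %.2f\n", bm, bn, u);
        if (u > best_u) { best_u = u; best = i; }
    }
    printf("chosen block: %dx%d\n", cands[best][0], cands[best][1]);
    return 0;
}
```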

13.
BLAS (Basic Linear Algebra Subprograms) is a library of basic routines that operate on vectors and matrices. Its routines fall into three levels, providing vector-vector (Level 1), matrix-vector (Level 2), and matrix-matrix (Level 3) operations, respectively. This paper studies the parallel implementation of Level 1 and Level 2 BLAS routines on the Sunway many-core processor, tunes their performance deeply by exploiting the platform's features, and summarizes the parallel implementation and optimization techniques for the Sunway platform. The Sunway 26010 (SW26010) CPU adopts a heterogeneous many-core architecture; the massive parallelism of its many compute cores gives a single chip 3 TFLOPS of double-precision floating-point performance. Experimental results show average speedups over the GotoBLAS reference implementation of up to 11.x for the Level 1 routines and 6.x for the Level 2 routines, with a clear performance gain from each individual optimization.

14.
We propose a new software package which would be very useful for implementing dense linear algebra algorithms on block-partitioned matrices. The routines are referred to as block basic linear algebra subprograms (BLAS), and their use is restricted to computations in which one or more of the matrices involved consists of a single row or column of blocks, and in which no more than one of the matrices consists of an unrestricted two-dimensional array of blocks. The functionality of the block BLAS routines can also be provided by Level 2 and 3 BLAS routines. However, for non-uniform memory access machines the use of the block BLAS permits certain optimizations in memory access to be taken advantage of. This is particularly true for distributed memory machines, for which the block BLAS are referred to as the parallel block basic linear algebra subprograms (PB-BLAS). The PB-BLAS are the main focus of this paper; for a block-cyclic data distribution, a single row or column of blocks lies in a single row or column of the processor template. The PB-BLAS consist of calls to the sequential BLAS for local computations, and calls to the BLACS for communication. The PB-BLAS are the building blocks for implementing ScaLAPACK, the distributed-memory version of LAPACK, and provide the same ease-of-use and portability for ScaLAPACK that the BLAS provide for LAPACK. The PB-BLAS comprise all Level 2 and 3 BLAS routines for dense matrix computations (not for banded matrices) and four auxiliary routines for transposing and copying a vector and/or a block vector. The PB-BLAS are currently available for all numeric data types, i.e., single and double precision, real and complex.

15.
In this paper, we present straightforward techniques for a highly efficient, scalable implementation of common matrix–matrix operations generally known as the Level 3 Basic Linear Algebra Subprograms (BLAS). This work builds on our recent discovery of a parallel matrix–matrix multiplication implementation, which has yielded superior performance, and requires little work space. We show that the techniques used for the matrix–matrix multiplication naturally extend to all important Level 3 BLAS and thus this approach becomes an enabling technology for efficient parallel implementation of these routines and libraries that use BLAS. Representative performance results on the Intel Paragon system are given. © 1997 John Wiley & Sons, Ltd.

16.
The most computationally demanding scientific problems are solved with large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores which share a hierarchically organized memory. The main basic component of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is directly based on the matrix multiplication routine implemented in the BLAS library. BLAS library is used in the form of packages implemented by the vendors or free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are used inside parallel codes, the two parallelism levels can interfere and produce a degradation of the performance. In this work, an auto-tuning method is proposed to select automatically the optimum number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix–matrix multiplication and to different matrix factorizations (LU, QR and Cholesky) by blocks. Traditional schemes which directly use the multithreaded routine of BLAS, dgemm, are compared with schemes combining the multithreaded dgemm with OpenMP.
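A condensed rendering of the auto-tuning idea: enumerate thread splits between the OpenMP level and the BLAS level and keep the one with the lowest modeled time. The cost model below (linear speedup up to the core count plus a per-thread overhead) is a toy stand-in for the paper's fitted model.

```c
/* Pick (OpenMP threads, BLAS threads) by evaluating a simple time model. */
#include <stdio.h>

#define CORES 16

/* Modeled time for `omp` OpenMP threads each calling dgemm with `blas`
 * threads; the coefficients are illustrative assumptions. */
static double model(double work, int omp, int blas) {
    int total = omp * blas;
    int useful = total < CORES ? total : CORES;
    double oversub = total > CORES ? 1.0 + 0.2 * (total - CORES) : 1.0;
    return work / useful * oversub + 1e-4 * total;
}

int main(void) {
    double work = 1.0;
    int best_omp = 1, best_blas = 1; double best_t = 1e30;
    for (int omp = 1; omp <= CORES; omp *= 2)
        for (int blas = 1; blas <= CORES; blas *= 2) {
            double t = model(work, omp, blas);
            if (t < best_t) { best_t = t; best_omp = omp; best_blas = blas; }
        }
    /* The chosen pair would then be applied with omp_set_num_threads() and
     * the BLAS-specific control, e.g. openblas_set_num_threads(). */
    printf("omp=%d blas=%d modeled time %.4f\n", best_omp, best_blas, best_t);
    return 0;
}
```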

17.
Double-precision general matrix multiplication (DGEMM) is one of the most important routines in the BLAS library; the core computation of most Level 3 BLAS routines is implemented by calling DGEMM. Exploiting the Loongson 3A's 128-bit memory-access instructions, this paper determines the best loop-unrolling scheme through theoretical analysis; to cope with the Loongson 3A's random cache-replacement policy, it uses address interleaving to reduce cache conflict misses; and to cope with the processor's limited memory bandwidth, it adopts a data-sharing task-partitioning scheme to reduce the volume of memory traffic. The optimized DGEMM runs more than twice as fast as the best-performing open-source BLAS library (GotoBLAS) in both single-core and multicore tests.

18.
LAN-connected workstations are a heterogeneous environment, where each workstation provides time-varying computing power, and thus dynamic load balancing mechanisms are necessary for parallel applications to run efficiently. Parallel basic linear algebra subprograms (BLAS) have recently shown promise as a means of taking advantage of parallel computing in solving scientific problems. Most existing parallel algorithms of BLAS are designed for conventional parallel computers; they do not take the particular characteristics of LAN-connected workstations into consideration. This paper presents a parallelizing method of Level 3 BLAS for LAN-connected workstations. The parallelizing method achieves dynamic load balancing through column-blocking data distribution. The experiment results indicate that this dynamic load balancing mechanism really leads to a more efficient parallel Level 3 BLAS for LAN-connected workstations.
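The column-blocking self-scheduling idea can be sketched compactly: result columns are cut into blocks, and each worker grabs the next free block when it finishes, so faster machines naturally take more work. Pthreads stand in below for LAN-connected workstations, and compute_block is a stub.

```c
/* Self-scheduled column-block distribution for C = A*B. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

enum { N = 1024, COL_BLOCK = 32, WORKERS = 4 };

static int next_col = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void compute_block(int c0, int c1) {
    (void)c0; (void)c1;   /* stub for the local GEMM on C[:, c0..c1) */
}

static void *worker(void *arg) {
    int done = 0;
    for (;;) {
        pthread_mutex_lock(&lock);          /* grab the next free block */
        int c0 = next_col;
        next_col += COL_BLOCK;
        pthread_mutex_unlock(&lock);
        if (c0 >= N) break;
        int c1 = c0 + COL_BLOCK > N ? N : c0 + COL_BLOCK;
        compute_block(c0, c1);
        done++;
    }
    printf("worker %ld handled %d blocks\n", (long)(intptr_t)arg, done);
    return NULL;
}

int main(void) {
    pthread_t t[WORKERS];
    for (long i = 0; i < WORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```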
