首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
BLAS (Basic Linear Algebra Subprograms)是一个以向量和矩阵为操作对象的基础函数库.该库中函数分为3个级别,各个级别分别提供了向量-向量(1级)、向量-矩阵(2级)、矩阵-矩阵(3级)之间的基本运算.本文研究如何在申威众核处理器上BLAS-1、2级函数的并行实现,并充分利用平台特性对它们进行深度的性能调优,归纳总结程序在申威平台上的并行实现与优化技巧.申威26010 CPU采用了异构众核架构,众多计算核心提供的大规模并行处理能力,使单块芯片具有3 TFLOPS的双精度浮点计算性能.实验结果显示BLAS-1、2级函数相对于GotoBLAS参考实现版的平均加速比分别高达11.x和6.x,对于每一优化手段,均有明显的性能加速.  相似文献   

2.
In this paper, we present straightforward techniques for a highly efficient, scalable implementation of common matrix–matrix operations generally known as the Level 3 Basic Linear Algebra Subprograms (BLAS). This work builds on our recent discovery of a parallel matrix–matrix multiplication implementation, which has yielded superior performance, and requires little work space. We show that the techniques used for the matrix–matrix multiplication naturally extend to all important Level 3 BLAS and thus this approach becomes an enabling technology for efficient parallel implementation of these routines and libraries that use BLAS. Representative performance results on the Intel Paragon system are given. © 1997 John Wiley & Sons, Ltd.  相似文献   

3.
The paper describes Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A ⋅ B, but also transposed multiplication routines C = AT ⋅ B, C = A ⋅ BT, and C = AT ⋅ BT, for a block cyclic data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. The PUMMA together provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer.  相似文献   

4.
BLAS (basic linear algebra subprograms)是高性能扩展数学库的一个重要模块,广泛应用于科学与工程计算领域. BLAS 1级提供向量-向量运算, BLAS 2级提供矩阵-向量运算.针对国产SW26010-Pro众核处理器设计并实现了高性能BLAS 1、2级函数.基于RMA通信机制设计了从核归约策略,提升了BLAS 1、2级若干函数的归约效率.针对TRSV、TPSV等存在数据依赖关系的函数,提出了一套高效并行算法,该算法通过点对点同步维持数据依赖关系,设计了适用于三角矩阵的高效任务映射机制,有效减少了从核点对点同步的次数,提高了函数的执行效率.通过自适应优化、向量压缩、数据复用等技术,进一步提升了BLAS 1、2级函数的访存带宽利用率.实验结果显示, BLAS 1级函数的访存带宽利用率最高可达95%,平均可达90%以上, BLAS 2级函数的访存带宽利用率最高可达98%,平均可达80%以上.与广泛使用的开源数学库GotoBLAS相比, BLAS 1、2级函数分别取得了平均18.78倍和25.96倍的加速效果. LU分解、QR分解以及对称特征值问题通过调用...  相似文献   

5.
In this paper,some parallel algorithms are described for solving numerical linear algebra problems on Dawning-1000.They include matrix multiplication,LU factorization of a dense matrix,Cholesky factorization of a symmetric matrix,and eigendecomposition of symmetric matrix for real and complex data types.These programs are constructed based on fast BLAS library of Dawning-1000 under NX environment.Some comparison results under different parallel environments and implementing methods are also given for Cholesky factorization.The execution time,measured performance and speedup for each problem on Dawning-1000 are shown.For matrix multiplication and IU factorization,1.86GFLOPS and 1.53GFLOPS are reached.  相似文献   

6.
Parallel accelerators are playing an increasingly important role in scientific computing. However, it is perceived that their weakness nowadays is their reduced “programmability” in comparison with traditional general-purpose CPUs. For the domain of dense linear algebra, we demonstrate that this is not necessarily the case. We show how the libflame library carefully layers routines and abstracts details related to storage and computation, so that extending it to take advantage of multiple accelerators is achievable without introducing platform specific complexity into the library code base. We focus on the experience of the library developer as he develops a library routine for a new operation, reduction of a generalized Hermitian positive definite eigenvalue problem to a standard Hermitian form, and configures the library to target a multi-GPU platform. It becomes obvious that the library developer does not need to know about the parallelization or the details of the multi-accelerator platform. Excellent performance on a system with four NVIDIA Tesla C2050 GPUs is reported. This makes libflame the first library to be released that incorporates multi-GPU functionality for dense matrix computations, setting a new standard for performance.  相似文献   

7.
We propose a new software package which would be very useful for implementing dense linear algebra algorithms on block-partitioned matrices. The routines are referred to as block basic linear algebra subprograms (BLAS), and their use is restricted to computations in which one or more of the matrices involved consists of a single row or column of blocks, and in which no more than one of the matrices consists of an unrestricted two-dimensional array of blocks. The functionality of the block BLAS routines can also be provided by Level 2 and 3 BLAS routines. However, for non-uniform memory access machines the use of the block BLAS permits certain optimizations in memory access to be taken advantage of. This is particularly true for distributed memory machines, for which the block BLAS are referred to as the parallel block basic linear algebra subprograms (PB-BLAS). The PB-BLAS are the main focus of this paper, and for a block-cyclic data distribution, in a single row or column of blocks lies in a single row or column of the processor template. The PB-BLAS consist of calls to the sequential BLAS for local computations, and calls to the BLACS for communication. The PB-BLAS are the building blocks for implementing ScaLAPACK, the distributed-memory version of LAPACK, and provide the same ease-of-use and portability for ScaLAPACK that the BLAS provide for LAPACK. The PB-BLAS consist of all Level 2 and 3 BLAS routines for dense matrix computations (not for banded matrix) and four auxiliary routines for transposing and copying of a vector and/or a block vector. The PB-BLAS are currently available for all numeric data types, i.e., single and double precision, real and complex.  相似文献   

8.
We present a new fast and scalable matrix multiplication algorithm called DIMMA (distribution-independent matrix multiplication algorithm) for block cyclic data distribution on distributed-memory concurrent computers. The algorithm is based on two new ideas; it uses a modified pipelined communication scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS (basic linear algebra subprograms) routine in each processor even when the block size is very small or very large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer. © 1998 John Wiley & Sons, Ltd.  相似文献   

9.
STREAM是微处理器上内存性能的基准测试程序,在多核多线程FT1000微处理器上发挥高性能是具有挑战性的研究工作。基于多级Cache结构,优化STREAM四个程序的指令流水线,根据寄存器数,设计了多级循环展开方法,根据指令延迟和Cache行的大小确定数据预取的数目,使用汇编语言编写了优化子程序。基于OpenMP并行环境,设计了STREAM并行程序,优化了局部化数据分配方式。数据测试结果表明,优化后的STREAM的性能比原始串行程序性能提高了19.2%~64.2%。优化后,并行程序的最高访存性能达到8.5 GB/s,对比优化前的最高访存性能最大提高了22.7%。  相似文献   

10.
BLAS (basic linear algebra subprograms)是最基本、最重要的底层数学库之一.在一个标准的BLAS库中,BLAS 3级函数涵盖的矩阵-矩阵运算尤为重要,在许多大规模科学与工程计算应用中被广泛调用.另外, BLAS 3级属于计算密集型函数,对充分发挥处理器的计算性能有至关重要的作用.针对国产SW26010-Pro处理器研究BLAS 3级函数的众核并行优化技术.具体而言,根据SW26010-Pro的存储层次结构,设计多级分块算法,挖掘矩阵运算的并行性.在此基础上,基于远程内存访问(remote memory access, RMA)机制设计数据共享策略,提高从核间的数据传输效率.进一步地,采用三缓冲、参数调优等方法对算法进行全面优化,隐藏直接内存访问(direct memory access, DMA)访存开销和RMA通信开销.此外,利用SW26010-Pro的两条硬件流水线和若干向量化计算/访存指令,还对BLAS 3级函数的矩阵-矩阵乘法、矩阵方程组求解、矩阵转置操作等若干运算进行手工汇编优化,提高了函数的浮点计算效率.实验结果显示,所提出的并行优化技术...  相似文献   

11.
阐述MPI与OpenMP进行并行计算的特点,并在Visual Studio 2010上构建一个基于两者的混合编程平台。程序在该平台上执行时能够同时实现多进程与进程内多线程编程,设计并实现一种基于数据划分的矩阵乘法的并行算法,将数据分解为两部分交给两个计算节点分别完成,并在每个计算节点内将数据进一步划分,交给多个线程同时执行。通过与非并行矩阵乘法、MPI矩阵乘法、OpenMP矩阵乘法运算性能进行比较,验证该算法可以有效地挖掘计算机的处理能力。  相似文献   

12.
Despite extensive research, optimal performance has not easily been available previously for matrix multiplication (especially for large matrices) on most architectures because of the lack of a structured approach and the limitations imposed by matrix storage formats. A simple but effective framework is presented here that lays the foundation for building high‐performance matrix‐multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three different representative RISC and CISC architectures on which they significantly outperform highly optimized libraries such as ATLAS and other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently generalizes the applicability of the memory hierarchy friendly Morton ordering to arbitrary‐sized matrices. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance for a range of problem sizes. Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm. The authors expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high‐performance code. These kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be effectively delivered to new and existing applications through the use of object‐oriented or compiler‐based approaches. Copyright © 2002 John Wiley & Sons, Ltd.  相似文献   

13.
This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge–Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this work an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors’ own implementation, both using the CSR storage scheme and working on Intel Xeon Phi, were investigated. The implementation based on the Intel MKL library uses the high-performance BLAS and Sparse BLAS routines. In our application we focus on OpenMP style programming. We implement SpMV operation and vector addition using the basic optimizing techniques and the vectorization. We evaluate our approach in native and offload modes for various number of cores and thread allocation affinities. Both implementations (based on Intel MKL and made by the authors) were compared in respect of the time, the speedup and the performance. The numerical experiments on Intel Xeon Phi show that the performance of authors’ implementation is very promising and gives a gain of up to two times compared to the multithreaded implementation (based on Intel MKL) running on CPU (Intel Xeon processor) and even three times in comparison with the application which uses Intel MKL on Intel Xeon Phi.  相似文献   

14.
The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance‐critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, may be written efficiently given a highly optimized general matrix multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently maintained using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations in use today), supports a large variety of performance‐critical routines. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

15.
线程级并行计算在图形渲染引擎中的研究   总被引:1,自引:0,他引:1  
针对并行计算技术在当前图形渲染系统中应用层面较浅的问题,为提高多核平台下图形应用程序CPU利用率,提出了一种新的Fork-Join多线程渲染方案。对当前流行的开源图形渲染引擎——OGRE引擎的渲染框架进行了多线程优化,用OpenMP方法对图形引擎的逻辑帧实现并行化,根据Win32线程库和DirectX11的多线程支持提出了一种渲染帧并行化方法,并将其应用于图形引擎。在多核平台上的实验结果表明,该方案能有效提高渲染速率和CPU利用率,改善CPU负载均衡。  相似文献   

16.
The use of a massively parallel machine is aimed at the development of applications programs to solve most significant scientific, engineering, industrial and commercial problems. High-performance computing technology has emerged as a powerful and indispensable aid to scientific and engineering research, product and process development, and all aspects of manufacturing. Such computational power can be achieved only by massively parallel computers. It also requires a new and more effective mode of interaction between the computational sciences and applications and those parts of computer science concerned with the development of algorithms and software. We are interested in using parallel processing to handle large numerical tasks such as linear algebra problems. Yet, programming such systems has proven itself to be very complicated, error-prone and architecture-specific. One successful method for alleviating this problem, a method that worked well in the case of the massively pipelined supercomputers, is to use subprogram libraries. These libraries are built to efficiently perform some basic operations, while hiding low-level system specifics from the programmer. Efficiently porting a library to a new hardware, be it a vector machine or a shared memory or message passing based multiprocessor, is a major undertaking. It is a slow process that requires an intimate knowledge of the hardware features and optimization issues. We propose a scheme for the creation of portable implementations of such libraries. We present an implementation of BLAS (basic linear algebra subprograms), which is used as a standard linear algebra library. Our parallel implementation uses the virtual machine for multiprocessors (VMMP) (1990), which is a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallell program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementating the same services on all target multiprocessors. Software created using this scheme is automatically efficient on both categories of MIMD machines, and on any hardware VMMP has been ported to. An additional level of abstraction is achieved using the programming language C++, an object-oriented language. Eckel, Stroustrup, 1989, 1986). For the programmer who is using BLAS-3, it is hiding both the data structures used to define linear algebra objects, and the parallel nature of the operations performed on these objects. We coded BLAS on top of VMMP. This code was run without any modifications on two shared memory machines-the commercial Sequent Symmetry and the experimental Taunop. (The code should run on any machine the VMMP was ported onto, given the availability of a C++ compiler). Performance results for this implementation are given. The speed-up of the BLAS-3 routines, tested on 22 processors of the Sequent, was in the range of 8.68 to 15.89. Application programs (e.g. Cholesky factorization) using the library routines achieved similar efficiency.  相似文献   

17.
The current paper explores the capability and flexibility of field programmable gate-arrays (FPGAs) to implement variable-precision floating-point (VP) arithmetic. First, the VP exact dot product algorithm, which uses exact fixed-point operations to obtain an exact result, is presented. A VP multiplication and accumulation unit (VPMAC) on FPGA is then proposed. In the proposed design, the parallel multipliers generate the partial products of mantissa multiplication in parallel, which is the most time-consuming part in the VP multiplication and accumulation operation. This method fully utilizes DSP performance on FPGAs to enhance the performance of the VPMAC unit. Several other schemes, such as two-level RAM bank, carry-save accumulation, and partial summation, are used to achieve high frequency and pipeline throughput in the product accumulation stage. The typical algorithms in Basic Linear Algorithm Subprograms (i.e., vector dot product, general matrix vector product, and general matrix multiply product), LU decomposition, and Modified Gram–Schmidt QR decomposition, are used to evaluate the performance of the VPMAC unit. Two schemes, called the VPMAC coprocessor and matrix accelerator, are presented to implement these applications. Finally, prototypes of the VPMAC unit and the matrix accelerator based on the VPMAC unit are created on a Xilinx XC6VLX760 FPGA chip. Compared with a parallel software implementation based on OpenMP running on an Intel Xeon Quad-core E5620 CPU, the VPMAC coprocessor, equipped with one VPMAC unit, achieves a maximum acceleration factor of 18X. Moreover, the matrix accelerator, which mainly consists of a linear array of eight processing elements, achieves 12X–65X better performance.  相似文献   

18.
This paper presents a collection of linear algebra subroutines for transputer networks. The developed pilot library is intended to form a basis of a complete parallel linear algebra library for validating computations, whose routines will deliver (as accurately as necessary) either the best possible result, or a corresponding inclusion based on controlled rounding and an optimal scalar product. So far, as a first step, we have produced code for interval arithmetic, scalar products and simple vector-matrix operations with maximum accuracy. For the solution of triangular systems and the LU decomposition of dense matrices, new versions of classical methods are implemented which allow the application of optimal scalar products as a single, invidisible operation and optimize the overlap of communication and computation. The library also contains routines for computing inclusions of unstructured, dense linear systems of equations. Network topology dependency is avoided in all numerical routines by the use of general communication routines. This way the user is able to work with different topologies like ring structures, binary trees and hypercubes.  相似文献   

19.
陈江  赵永华  迟学斌 《计算机工程》2005,31(22):58-60,94
COUPL+是一种基于消息传递模型的并行库,它将并行程序巾需要处理的数据划分、消息传递函数的调用等都封装在其函数中。COUPL+可以简化在分布式存储结构并行机上编写基于网格的应用程序的任务。该文简要介绍了COUPL+的基本原理,以及它与MPI、OpenMP和HPF的特性对比;并且使用COUPL+实现了共轭梯度法和结构化网格计算两种并行计算中常用的任务,也对比了使用MPI和HPF的性能差异。  相似文献   

20.
研究Android平台中密码运算加速方法,采用运算并行化的思想,利用Android平台的RenderScript并行运算机制实现大整数乘法运算,为椭圆曲线密码等密码运算提供高效快速的基本操作。设计并实现了适合并行处理的大整数乘法运算存储结构和运算执行逻辑,以矩阵的方式分割并处理大整数对象,可以一次同步完成所需的乘法和加法运算,进而得到最终运算结果。实验结果表明,与Android平台原生的Java大整数运算库相比,该方法在执行时间上具有明显优势。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号