期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

胡怡陈道琨杨超刘芳芳马文静尹万旺袁欣辉林蓉芬《软件学报》2023,34(9):4421-4436

BLAS (basic linear algebra subprograms)是高性能扩展数学库的一个重要模块,广泛应用于科学与工程计算领域. BLAS 1级提供向量-向量运算, BLAS 2级提供矩阵-向量运算.针对国产SW26010-Pro众核处理器设计并实现了高性能BLAS 1、2级函数.基于RMA通信机制设计了从核归约策略,提升了BLAS 1、2级若干函数的归约效率.针对TRSV、TPSV等存在数据依赖关系的函数,提出了一套高效并行算法,该算法通过点对点同步维持数据依赖关系,设计了适用于三角矩阵的高效任务映射机制,有效减少了从核点对点同步的次数,提高了函数的执行效率.通过自适应优化、向量压缩、数据复用等技术,进一步提升了BLAS 1、2级函数的访存带宽利用率.实验结果显示, BLAS 1级函数的访存带宽利用率最高可达95%,平均可达90%以上, BLAS 2级函数的访存带宽利用率最高可达98%,平均可达80%以上.与广泛使用的开源数学库GotoBLAS相比, BLAS 1、2级函数分别取得了平均18.78倍和25.96倍的加速效果. LU分解、QR分解以及对称特征值问题通过调用... 相似文献

2.

LINPACK routines based on level 2 BLAS

Margreet Louter-Nool 《The Journal of supercomputing》1989,3(4):331-349

Extended or Level 2 BLAS is intended to improve the performance of portable programs on high-performance computers. In this paper we examine where Extended BLAS routines may be inserted in LINPACK so that no changes in the parameter list have to be made. We also discuss why, for some algorithms, a simple restructuring in terms of Level 2 BLAS fails. We do not attempt to redesign the algorithms or to change the data structure. We concentrate on the translation of calls to the original (Level 1) BLAS into calls to Level 2 BLAS to improve readability, modularity, and efficiency. This examination results in a still portable subset of LINPACK with a better performance than the original routines. The measured performances of original and modified LINPACK routines on the CDC CYBER 990, CDC CYBER 205, CRAY X-MP, and the NEX SX-2 are compared and analyzed. 相似文献

3.

PB-BLAS: a set of parallel block basic linear algebra subprograms

Jaeyoung Choi Jack J. Dongarra David W. Walker 《Concurrency and Computation》1996,8(7):517-535

We propose a new software package which would be very useful for implementing dense linear algebra algorithms on block-partitioned matrices. The routines are referred to as block basic linear algebra subprograms (BLAS), and their use is restricted to computations in which one or more of the matrices involved consists of a single row or column of blocks, and in which no more than one of the matrices consists of an unrestricted two-dimensional array of blocks. The functionality of the block BLAS routines can also be provided by Level 2 and 3 BLAS routines. However, for non-uniform memory access machines the use of the block BLAS permits certain optimizations in memory access to be taken advantage of. This is particularly true for distributed memory machines, for which the block BLAS are referred to as the parallel block basic linear algebra subprograms (PB-BLAS). The PB-BLAS are the main focus of this paper, and for a block-cyclic data distribution, in a single row or column of blocks lies in a single row or column of the processor template. The PB-BLAS consist of calls to the sequential BLAS for local computations, and calls to the BLACS for communication. The PB-BLAS are the building blocks for implementing ScaLAPACK, the distributed-memory version of LAPACK, and provide the same ease-of-use and portability for ScaLAPACK that the BLAS provide for LAPACK. The PB-BLAS consist of all Level 2 and 3 BLAS routines for dense matrix computations (not for banded matrix) and four auxiliary routines for transposing and copying of a vector and/or a block vector. The PB-BLAS are currently available for all numeric data types, i.e., single and double precision, real and complex. 相似文献

4.

Parallel implementation of BLAS: general techniques for Level 3 BLAS

Almadena Chtchelkanova John Gunnels Greg Morrow James Overfelt Robert A. van de Geijn 《Concurrency and Computation》1997,9(9):837-857

In this paper, we present straightforward techniques for a highly efficient, scalable implementation of common matrix–matrix operations generally known as the Level 3 Basic Linear Algebra Subprograms (BLAS). This work builds on our recent discovery of a parallel matrix–matrix multiplication implementation, which has yielded superior performance, and requires little work space. We show that the techniques used for the matrix–matrix multiplication naturally extend to all important Level 3 BLAS and thus this approach becomes an enabling technology for efficient parallel implementation of these routines and libraries that use BLAS. Representative performance results on the Intel Paragon system are given. © 1997 John Wiley & Sons, Ltd. 相似文献

5.

Minimizing development and maintenance costs in supporting persistently optimized BLAS

R. Clint Whaley Antoine Petitet 《Software》2005,35(2):101-121

The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance‐critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, may be written efficiently given a highly optimized general matrix multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently maintained using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations in use today), supports a large variety of performance‐critical routines. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

6.

Performance of parallel Cholesky factorization algorithms using BLAS

Glenn R. Luecke Jae Heon Yun Philip W. Smith 《The Journal of supercomputing》1992,6(3-4):315-329

This paper considers four parallel Cholesky factorization algorithms, including SPOTRF from the February 1992 release of LAPACK, each of which call parallel Level 2 or 3 BLAS, or both. A fifth parallel Cholesky algorithm that calls serial Level 3 BLAS is also described. The efficiency of these five algorithms on the CRAY-2, CRAY Y-MP/832, Hitachi Data Systems EX 80, and IBM 3090-600J is evaluated and compared with a vendor-optimized parallel Cholesky factorization algorithm. The fifth parallel Cholesky algorithm that calls serial Level 3 BLAS provided the best performance of all algorithms that called BLAS routines. In fact, this algorithm outperformed the Cray-optimized libsci routine (SPOTRF) by 13–44%;, depending on the problem size and the number of processors used.This work was supported by grants from IMSL, Inc., and Hitachi Data Systems. The first version of this paper was presented as a poster session at Supercomputing '90, New York City, November 1990. 相似文献

7.

Restructuring Software: A Case Study

T. R. HOPKINS 《Software》1996,26(8):967-982

We use knot count and path count metrics to identify which routines in the Level 1 basic linear algebra subroutines (BLAS) might benefit from code restructuring. We then consider how logical restructuring and the improvements in the facilities available from successive versions of Fortran have allowed us to improve the complexity of the code as measured by knot count, path count and cyclomatic complexity, and the user interface of one of the identified routines which computes the Euclidean norm of a vector. With these reductions in complexity we hope that we have contributed to improvements in the maintainability and clarity of the code. Software complexity metrics and the control graph are used to quantify and provide a visual guide to the quality of the software, and the performance of two Fortran code restructuring tools is reported. Finally, we give some indication of the cost of the extra numerical robustness offered by the BLAS routine over the use of new Fortran 90 intrinsic functions. 相似文献

8.

Pumma: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers

Jaeyoung Choi David W. Walker Jack J. Dongarra 《Concurrency and Computation》1994,6(7):543-570

The paper describes Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A ⋅ B, but also transposed multiplication routines C = A^T ⋅ B, C = A ⋅ B^T, and C = A^T ⋅ B^T, for a block cyclic data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. The PUMMA together provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer. 相似文献

9.

Parallelizing a Level 3 BLAS Library for LAN-Connected Workstations

《Journal of Parallel and Distributed Computing》1996,38(1):28-36

LAN-connected workstations are a heterogeneous environment, where each workstation provides time-varying computing power, and thus dynamic load balancing mechanisms are necessary for parallel applications to run efficiently. Parallel basic linear algebra subprograms (BLAS) have recently shown promise as a means of taking advantage of parallel computing in solving scientific problems. Most existing parallel algorithms of BLAS are designed for conventional parallel computers; they do not take the particular characteristics of LAN-connected workstations into consideration. This paper presents a parallelizing method of Level 3 BLAS for LAN-connected workstations. The parallelizing method makes dynamic load balancing throughcolumn-blockingdata distribution. The experiment results indicate that this dynamic load balancing mechanism really leads to a more efficient parallel level 3 BLAS for LAN-connected workstations. 相似文献

10.

基于申威1600的3级BLAS GEMM函数优化

刘昊刘芳芳张鹏杨超蒋丽娟《计算机系统应用》2016,25(12):234-239

BLAS是当前科学计算领域重要的底层支持数学库之一,其中的3级BLAS函数应用最为广泛.本文基于国产申威1600平台,提出了一种基础线性代数库BLAS的三级函数通用矩阵乘GEMM的高性能实现方法.在单核上,使用乘加指令、循环展开、软件流水线指令重排、SIMD向量化运算、寄存器分块技术等与平台架构相关的技术手段,实现汇编级手工优化;在多核上,提出了适用于该平台的多线程加速方案.实验结果显示,在单核串行性能测试中,与知名开源数学库GotoBLAS相比,我们实现了平均4.72倍的加速效果;在多核并行扩展测试中,4线程版的性能则平均达到了单线程版性能的3.02倍. 相似文献

11.

Auto-tuned nested parallelism: A way to reduce the execution time of scientific software in NUMA systems

Jesús Cámara Javier Cuenca Luis-Pedro García Domingo Giménez 《Parallel Computing》2014

The most computationally demanding scientific problems are solved with large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores which share a hierarchically organized memory. The main basic component of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is directly based on the matrix multiplication routine implemented in the BLAS library. BLAS library is used in the form of packages implemented by the vendors or free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are used inside parallel codes, the two parallelism levels can interfere and produce a degradation of the performance. In this work, an auto-tuning method is proposed to select automatically the optimum number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix–matrix multiplication and to different matrix factorizations (LU, QR and Cholesky) by blocks. Traditional schemes which directly use the multithreaded routine of BLAS, dgemm, are compared with schemes combining the multithreaded dgemm with OpenMP. 相似文献

12.

基于申威众核处理器的1、2级BLAS函数优化研究

孙家栋孙乔邓攀杨超《计算机系统应用》2017,26(11):101-108

BLAS （Basic Linear Algebra Subprograms）是一个以向量和矩阵为操作对象的基础函数库.该库中函数分为3个级别,各个级别分别提供了向量-向量（1级）、向量-矩阵（2级）、矩阵-矩阵（3级）之间的基本运算.本文研究如何在申威众核处理器上BLAS-1、2级函数的并行实现,并充分利用平台特性对它们进行深度的性能调优,归纳总结程序在申威平台上的并行实现与优化技巧.申威26010 CPU采用了异构众核架构,众多计算核心提供的大规模并行处理能力,使单块芯片具有3 TFLOPS的双精度浮点计算性能.实验结果显示BLAS-1、2级函数相对于GotoBLAS参考实现版的平均加速比分别高达11.x和6.x,对于每一优化手段,均有明显的性能加速. 相似文献

13.

Using Ginkgo's memory accessor for improving the accuracy of memory-bound low precision BLAS

Thomas Grützmacher Hartwig Anzt Enrique S. Quintana-Ortí 《Software》2023,53(1):81-98

The roofline model not only provides a powerful tool to relate an application's performance with the specific constraints imposed by the target hardware but also offers a graphic representation of the balance between memory access cost and compute throughput. In this work, we present a strategy to break up the tight coupling between the precision format used for arithmetic operations and the storage format employed for memory operations. (At a high level, this idea is equivalent to compressing/decompressing the data in registers before/after invoking store/load memory operations.) In practice, we demonstrate that a “memory accessor” that hides the data compression behind the memory access, can virtually push the bandwidth-induced roofline, yielding higher performance for memory-bound applications using high precision arithmetic that can handle the numerical effects associated with lossy compression. We also demonstrate that memory-bound applications operating on low precision data can increase the accuracy by relying on the memory accessor to perform all arithmetic operations in high precision. In particular, we demonstrate that memory-bound BLAS operations (including the sparse matrix-vector product) can be re-engineered with the memory accessor and that the resulting accessor-enabled BLAS routines achieve lower rounding errors while delivering the same performance as the fast low precision BLAS. 相似文献

14.

An (almost) direct deployment of the Fast Multipole Method on the Cell processor

Pierre Fortin Jean-Luc Lamotte 《The Journal of supercomputing》2013,65(3):1205-1222

This paper presents the first deployment of the Fast Multipole Method on the Cell processor (PowerXCell 8i). We rely on the matrix formulation with BLAS routines of the FMB code (Fast Multipole with BLAS) in order to directly and efficiently offload the most time consuming operators of both far field and near field computations on the Cell heterogeneous cores. We detail the difficulties that had to be solved first, and we finally obtain a deployment in single and double precisions, which scales linearly on several Cell blades and which is able to handle both uniform and non-uniform distributions of particles. We also present our performance results and comparisons with multicore CPUs, as well as the limitations of our deployment on the Cell processor. 相似文献

15.

GOTOBLAS一般矩阵乘法高效实现机制的研究 总被引：2，自引：0，他引：2

下载免费PDF全文

蒋孟奇张云泉宋刚李玉成《计算机工程》2008,34(7):84-86,1

对GOTOBLAS库(GOTO)的实现机制,尤其是其中的一般矩阵乘法部分的实现进行了分析。结合近年来的一些研究成果,讨论了如何高效地实现矩阵相乘操作,把存储层次对程序性能的影响提高到计算模型的高度。对比实验表明,GOTO库的性能远远高于没有考虑存储层次的一般BLAS库。证明了GOTO库性能上的优越性和将存储层次引入计算模型的必要性。相似文献

16.

Performance of level 3 BLAS kernels in a dynamically partitioned data-flow environment

《Computing Systems in Engineering》1995,6(4-5):357-361

The Dynamically Partitioned Data-Flow (DPDF) model is based on an original analysis concept of the data dependency graph at the instruction level. Instead of a breadth first analysis, as in a classical Data-Flow Model, we execute instructions along data-dependent paths. As a consequence, data locality can be exploited by reusing results between the execution of consecutive instructions. In addition, the different paths are not statically defined but arise from a dynamical partitioning of the graph. This model presents the advantage to support very small cost dynamic scheduling and multitasking strategies. In order to study the efficiency of this new model, a first architecture has been defined. This architecture is currently limited to a single processor with one serial processing unit but four graph analyzing units (called prefetch units). Each of these prefetch units is able to build dynamically its own execution path inside the Data-Flow graph of an application. The efficiency of this architecture is studied on a numerical benchmark composed of a subset of the Livermore loops and of three routines of the Level 3 BLAS (GEMM, SYRK and TRSM). Our goal in these experimentations is to demonstrate the ability of the four prefetch units to feed the ALU. 相似文献

17.

Iterative diagonalization of symmetric matrices in mixed precision and its application to electronic structure calculations

Eiji Tsuchida Yoong-Kee Choe 《Computer Physics Communications》2012,183(4):980-985

Diagonalization of a large matrix is the computational bottleneck in many applications such as electronic structure calculations. We show that a speedup of over 30% can be achieved by exploiting 32-bit floating point operations, while keeping 64-bit accuracy. Moreover, most of the computationally expensive operations are performed by level-3 BLAS/LAPACK routines in our implementation, thus leading to optimal performance on most platforms. We also discuss the effectiveness of problem-specific preconditioners which take into account nondiagonal elements. 相似文献

18.

Explicit Fourth-Order Runge–Kutta Method on Intel Xeon Phi Coprocessor

Beata Bylina Joanna Potiopa 《International journal of parallel programming》2017,45(5):1073-1090

This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge–Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this work an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors’ own implementation, both using the CSR storage scheme and working on Intel Xeon Phi, were investigated. The implementation based on the Intel MKL library uses the high-performance BLAS and Sparse BLAS routines. In our application we focus on OpenMP style programming. We implement SpMV operation and vector addition using the basic optimizing techniques and the vectorization. We evaluate our approach in native and offload modes for various number of cores and thread allocation affinities. Both implementations (based on Intel MKL and made by the authors) were compared in respect of the time, the speedup and the performance. The numerical experiments on Intel Xeon Phi show that the performance of authors’ implementation is very promising and gives a gain of up to two times compared to the multithreaded implementation (based on Intel MKL) running on CPU (Intel Xeon processor) and even three times in comparison with the application which uses Intel MKL on Intel Xeon Phi. 相似文献

19.

Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study

S. Tabik G. Ortega E. M. Garzón 《The Journal of supercomputing》2014,70(2):577-587

Programmers usually implement iterative methods that solve partial differential equations by expressing them using a sequence of basic kernels from libraries optimized for the graphics processing unit (GPU). The global runtime of the resulting combination is often penalized by the smallest and most inefficient vector operations. To improve the GPU exploitation, we identify and analyze the potential kernels to be fused according to the data dependence, data type and size, and GPU resources. This paper provides an extensive analysis of the impact of fusing vector operations [level 1 of Basic Linear Algebra Subprograms (BLAS)] on the performance of the GPU. The experimental evaluation shows that this optimization provides noticeable improvement especially for kernels with lower memory requirements and on more modern GPUs. It is worth noting that the fused BLAS operations can be very useful to help programmers efficiently code iterative methods to solve large linear systems of equations for the GPU. Iterative methods such as biconjugate gradient method (BCG) are one of the examples that can benefit from this optimization strategy. Indeed, kernel fusion of vector routines makes the most efficient GPU implementation of BCG run between \(1.09\times \) and \(1.27\times \) faster on three GPUs of different characteristics. 相似文献

20.

Emmerald: a fast matrix–matrix multiply using Intel's SSE instructions

Douglas Aberdeen Jonathan Baxter 《Concurrency and Computation》2001,13(2):103-119

相似文献