期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

陈少虎张云泉张先轶程豪《软件学报》2010,21(Z1):214-223

BLAS 库是高性能计算中最基本的数学库,它的性能对超级计算机的性能有着极大的影响.而且随着CPU多核化的发展,BLAS 的多核并行性能已经变得比与体系结构相关的单核性能更加重要.实验以流行于高性能计算的Xeon、Opteron 系列多核X86 处理器为例,全面测试了GotoBLAS、Atlas、MKL 和ACML 四种主流的BLAS 库的所有1,2,3 级函数,并覆盖了不同计算规模和多核并行方面的测试.通过测试结果,分析源代码、BLAS 库资料和论文的方式,分析BLAS 有效的优化和并行方法,以及它们所适合的平台.为BLAS 的优化、使用,甚至高性能处理器的发展上提供有益的建议.实验结果表明,比起一个逻辑处理强大但是复杂的处理器,一个cache 更大、性能更好,内存带宽更宽、延迟更小,主频更高的处理器往往能在高性能计算中取得更好的性能.同时,X86 平台上的状况对其他体系结构也有巨大的借鉴意义. 相似文献

2.

基于申威1621的通用矩阵向量乘法的性能分析与优化

邓洁赵荣彩王磊《计算机应用》2022,(S1):215-220

通用矩阵向量乘法（GEMV）函数是整个二级基础线性代数子程序（BLAS）函数库的构建基础,BLAS作为关键基础计算软件之一,目前在申威处理器上却没有一个高性能实现的版本。针对上述问题,为充分发挥申威1621平台的高性能BLAS库计算优势,提出一种基于申威1621的通用矩阵向量乘法的性能分析与优化方法。首先对GEMV函数进行计算重排序、循环分块的改进;然后采取单指令多数据流（SIMD）以及指令重排的优化方式;最后对内存分配方式进行择优选择。测试结果表明,GEMV函数平均性能达到GotoBLAS版的2.17倍。在使用堆栈分配内存空间或增加对y向量步长的判断分支两种方案后,相较于GotoBLAS,小规模矩阵的平均性能由2.265倍提升至2.875倍。为提高大规模矩阵的性能,以及发挥申威1621多核处理器并行机制,在开启4线程后,平均性能达到单核的3.57倍。因此,优化后的GEMV函数在申威平台上较好的体现了并行效果。相似文献

3.

基于申威1621处理器的BLAS一级函数优化

李浩然王磊《计算机系统应用》2021,30(7):246-252

BLAS (Basic Linear Algebra Subprograms)是一个基本线性代数操作的数学函数标准, 该库函数分为三个级别, 每个级别提供了向量与向量(1级)、向量与矩阵(2级)、向量与向量(三级)之间的基本运算. 本文研究了在申威1621处理器上BLAS一级函数的优化方案, 以函数AXPY为例, 充分利用平台的架构特点对其进行性能调优,设计了自动的线程分配方案. 实验结果显示优化过后的BLAS一级函数AXPY相对于GotoBLAS参考实现版本的单核和多核加速比分别高达4.36和9.50, 对于每种优化方式均得到了一定的性能提升. 相似文献

4.

面向SW26010-Pro的1、2级BLAS函数众核并行优化技术

胡怡陈道琨杨超刘芳芳马文静尹万旺袁欣辉林蓉芬《软件学报》2023,34(9):4421-4436

BLAS (basic linear algebra subprograms)是高性能扩展数学库的一个重要模块,广泛应用于科学与工程计算领域. BLAS 1级提供向量-向量运算, BLAS 2级提供矩阵-向量运算.针对国产SW26010-Pro众核处理器设计并实现了高性能BLAS 1、2级函数.基于RMA通信机制设计了从核归约策略,提升了BLAS 1、2级若干函数的归约效率.针对TRSV、TPSV等存在数据依赖关系的函数,提出了一套高效并行算法,该算法通过点对点同步维持数据依赖关系,设计了适用于三角矩阵的高效任务映射机制,有效减少了从核点对点同步的次数,提高了函数的执行效率.通过自适应优化、向量压缩、数据复用等技术,进一步提升了BLAS 1、2级函数的访存带宽利用率.实验结果显示, BLAS 1级函数的访存带宽利用率最高可达95%,平均可达90%以上, BLAS 2级函数的访存带宽利用率最高可达98%,平均可达80%以上.与广泛使用的开源数学库GotoBLAS相比, BLAS 1、2级函数分别取得了平均18.78倍和25.96倍的加速效果. LU分解、QR分解以及对称特征值问题通过调用... 相似文献

5.

面向龙芯3A体系结构的BLAS库优化

何颂颂顾乃杰朱海涛刘燕君《小型微型计算机系统》2012,33(3):571-575

双精度普通矩阵乘法DGEMM是BLAS库中最核心的函数之一,大部分三级BLAS库函数的核心计算都是通过调用DGEM M来实现的.该文针对龙芯3A具有128位访存指令的特点,通过理论分析,找到了最佳的循环展开方式;针对龙芯3A的Cache替换策略(随机替换),通过使用地址交错技术,减少了Cache的冲突失效;针对龙芯3A访存带宽有限的问题,通过使用共享数据的任务划分方式,减少了数据访存量.优化后的DGEMM单核和多核运算速度均是性能最高的开源BLAS库(Goto-BLAS)的2倍多. 相似文献

6.

基于申威1621的高精度点积算法实现与优化

徐方洁王磊王一卓张亚光《计算机系统应用》2023,32(2):400-405

点积函数是BLAS库中的一级基础函数,其被科学计算等领域广泛调用.由于浮点计算会引入舍入误差,现有BLAS库中双精度点积函数不足以满足某些应用领域的精度要求,因此需要高精度算法来实现更精确可靠的计算.在本文中,面向国产申威1621平台,在现有的BLAS库的基础上,新增高精度点积函数的实现接口,来满足应用的高精度需求.同时,对于高精度点积算法运用循环展开、访存优化、指令重排等优化策略,实现汇编级手工优化.实验结果显示,文中高精度点积算法的计算结果精度,近似达到了双精度点积的两倍,有效提升了原始算法精度.同时,在保证精度提升的基础上,文中优化后的高精度点积函数相比未优化前,平均性能加速比达到了1.61. 相似文献

7.

多核龙芯3A上二级BLAS库的优化 总被引：1，自引：0，他引：1

李毅何颂颂李恺《计算机系统应用》2011,20(1):163-167

针对龙芯3A体系结构以及二级BLAS库函数的特点,在指令级、存储级和线程级抽取并行方案,总结了一些合适的优化方法,并对其进行了定量的分析.实验表明,这些优化可以将二级BLAS函数单线程的性能提升20%以上,多线程下也可以得到2.5倍左右的加速比,这对今后多核龙芯上的系统软件优化工作有着一定的帮助. 相似文献

8.

基于申威众核处理器的1、2级BLAS函数优化研究

孙家栋孙乔邓攀杨超《计算机系统应用》2017,26(11):101-108

BLAS （Basic Linear Algebra Subprograms）是一个以向量和矩阵为操作对象的基础函数库.该库中函数分为3个级别,各个级别分别提供了向量-向量（1级）、向量-矩阵（2级）、矩阵-矩阵（3级）之间的基本运算.本文研究如何在申威众核处理器上BLAS-1、2级函数的并行实现,并充分利用平台特性对它们进行深度的性能调优,归纳总结程序在申威平台上的并行实现与优化技巧.申威26010 CPU采用了异构众核架构,众多计算核心提供的大规模并行处理能力,使单块芯片具有3 TFLOPS的双精度浮点计算性能.实验结果显示BLAS-1、2级函数相对于GotoBLAS参考实现版的平均加速比分别高达11.x和6.x,对于每一优化手段,均有明显的性能加速. 相似文献

9.

国产SW26010-Pro处理器上3级BLAS函数众核并行优化

胡怡陈道琨杨超马文静刘芳芳宋超博孙强史俊达《软件学报》2024,35(3):1569-1584

BLAS (basic linear algebra subprograms)是最基本、最重要的底层数学库之一.在一个标准的BLAS库中,BLAS 3级函数涵盖的矩阵-矩阵运算尤为重要,在许多大规模科学与工程计算应用中被广泛调用.另外, BLAS 3级属于计算密集型函数,对充分发挥处理器的计算性能有至关重要的作用.针对国产SW26010-Pro处理器研究BLAS 3级函数的众核并行优化技术.具体而言,根据SW26010-Pro的存储层次结构,设计多级分块算法,挖掘矩阵运算的并行性.在此基础上,基于远程内存访问(remote memory access, RMA)机制设计数据共享策略,提高从核间的数据传输效率.进一步地,采用三缓冲、参数调优等方法对算法进行全面优化,隐藏直接内存访问(direct memory access, DMA)访存开销和RMA通信开销.此外,利用SW26010-Pro的两条硬件流水线和若干向量化计算/访存指令,还对BLAS 3级函数的矩阵-矩阵乘法、矩阵方程组求解、矩阵转置操作等若干运算进行手工汇编优化,提高了函数的浮点计算效率.实验结果显示,所提出的并行优化技术... 相似文献

10.

基于OpenCL的连续数据无关访存密集型函数并行与优化研究

蒋丽媛张云泉龙国平贾海鹏《计算机科学》2013,40(3):111-115

连续的数据无关是指计算目标矩阵连续的元素时使用的源矩阵元素之间没有关系且也为连续的,访存密集型是指函数的计算量较小,但是有大量的数据传输操作。在OpenCL框架下,以bitwise函数为例,研究和实现了连续数据无关访存密集型函数在GPU平台上的并行与优化。在考察向量化、线程组织方式和指令选择优化等多个优化角度在不同的GPU硬件平台上对性能的影响之后,实现了这个函数的跨平合性能移植。实验结果表明,在不考虑数据传输的前提下,优化后的函数与这个函数在OpenCV库中的CPU版本相比,在AMD HD 5850 GPU达到了平均40倍的性能加速比;在AMD HD 7970 GPU达到了平均90倍的性能加速比;在NVIDIA Tesla 02050 CPU上达到了平均60倍的性能加速比;同时,与这个函数在OpenCV库中的CUDA实现相比,在NVIDIA Tesla 02050平台上也达到了1.5倍的性能加速。相似文献

11.

一种基于线程的数据预取方法 总被引：1，自引：0，他引：1

下载免费PDF全文

欧国东张民选《计算机工程与科学》2008,30(1):119-122

多线程、多核处理器的推广受限于应用。目前,大部分应用尤其是桌面应用都是单线程程序,不能充分利用多线程处理器提供的多个现场并行执行来提高速度。使用空闲现场加速单线程应用是目前研究的一个热点,研究主要集中在提高传统串行应用存储访问的效率和分支预测的精度。在基于线程的数据预取方法中,数据预取线程是从主线程的执执行踪迹中提取的。它们使用空闲的现场,和主线程并行执行,在主线程需要数据之前把数据取到离处理器更近的存储层次。基于线程的数据预取方法能够有效地解决传统数据预取方法难以处理的诸多问题,如不规则内存访问模式。本文具体分析了应用程序中访存行为的特点,结合控制流处理,设计并验证了一种基于线程的数据预取方法TDP。模拟结果显示,使用TDP可以获得7％左右的性能提升。相似文献

12.

Box-counting algorithm on GPU and multi-core CPU: an OpenCL cross-platform study

Jesús Jiménez Juan Ruiz de Miras 《The Journal of supercomputing》2013,65(3):1327-1352

In this paper, we present the analysis and development of a cross-platform OpenCL implementation of the box-counting algorithm, which is one of the most widely-used methods for estimating the Fractal Dimension. The Fractal Dimension is a relevant image analysis method used in several disciplines, but computing it is in general a time consuming process, especially when working with 3D images. Unlike parallel programming models that strictly depend on the hardware type and manufacturer, like CUDA, OpenCL allows us to provide an implementation suitable for execution on both GPUs and multi-core CPUs, whatever the hardware manufacturer. Sorting is a key part of the fast box-counting algorithm and the final speedup is highly conditioned by the efficiency of the sorting algorithm used. Our study reveals that current OpenCL implementations of sorting algorithms are clearly slower when compared with both CUDA for GPU and specific multi-core CPU implementations. Our OpenCL algorithm has been specifically optimized according the type of the target device and the results show an average speedup of up to 7.46× and 4×, when executed on the GPU and the multi-core CPU respectively, both compared with the single-threaded (sequential) CPU implementation. 相似文献

13.

CLUS_GPU-BLASTP: accelerated protein sequence alignment using GPU-enabled cluster

Sita Rani O. P. Gupta 《The Journal of supercomputing》2017,73(10):4580-4595

Basic Local Alignment Search Tool (BLAST) is one of the most frequently used algorithms for bioinformatics applications. In this paper, an accelerated implementation of protein BLAST, i.e., CLUS_GPU-BLASTP for multiple query sequence processing in parallel, on graphical processing unit (GPU)-enabled high-performance cluster is proposed. The experimental setup consisted of a high-performance GPU-enabled cluster. Each compute node of the cluster consisted of two hex-core Intel, Xeon 2.93 GHz processors with 50 GB RAM and 12 MB cache. Each compute node was also equipped with a NVIDIA M2050 GPU. In comparison with the famous GPU-BLAST, our BLAST implementation is 2.1 times faster on single compute node. On a cluster of 12 compute nodes, our implementation gave a speedup of 13.2X. In comparison with standard single-threaded NCBI-BLAST, our implementation achieves a speedup ranging from 7.4X to 8.2X. 相似文献

14.

Iterative diagonalization of symmetric matrices in mixed precision and its application to electronic structure calculations

Eiji Tsuchida Yoong-Kee Choe 《Computer Physics Communications》2012,183(4):980-985

Diagonalization of a large matrix is the computational bottleneck in many applications such as electronic structure calculations. We show that a speedup of over 30% can be achieved by exploiting 32-bit floating point operations, while keeping 64-bit accuracy. Moreover, most of the computationally expensive operations are performed by level-3 BLAS/LAPACK routines in our implementation, thus leading to optimal performance on most platforms. We also discuss the effectiveness of problem-specific preconditioners which take into account nondiagonal elements. 相似文献

15.

Parallelization of the Kalman filter on multicore computational platforms

O. Rosén A. Medvedev T. Wigren 《Control Engineering Practice》2013,21(9):1188-1194

Parallelization of the Kalman filter algorithm, with emphasis on the specific demands of multicore architecture implementation, is investigated. The approach is based on the nonrestrictive assumption of a banded system matrix. Both time-varying and time-invariant systems can be generally transformed to such a form. The proposed method is applied to a radio interference power estimation problem for which speedup evaluations using up to eight cores are performed. It is shown that the algorithm is capable of achieving linear speedup in the number of cores used, while speedup factors for a parallel BLAS implementation are less than two. An algorithm analysis that provides guidelines to the choice of implementation hardware to meet a desired performance is also provided. 相似文献

16.

Simulating cortical networks on heterogeneous multi-GPU systems

Andrew Nere Sean Franey Atif Hashmi Mikko Lipasti 《Journal of Parallel and Distributed Computing》2013

Recent advances in neuroscientific understanding have highlighted the highly parallel computation power of the mammalian neocortex. In this paper we describe a GPGPU-accelerated implementation of an intelligent learning model inspired by the structural and functional properties of the neocortex. Furthermore, we consider two inefficiencies inherent to our initial implementation and propose software optimizations to mitigate such problems. Analysis of our application’s behavior and performance provides important insights into the GPGPU architecture, including the number of cores, the memory system, atomic operations, and the global thread scheduler. Additionally, we create a runtime profiling tool for the cortical network that proportionally distributes work across the host CPU as well as multiple GPGPUs available to the system. Using the profiling tool with these optimizations on Nvidia’s CUDA framework, we achieve up to 60× speedup over a single-threaded CPU implementation of the model. 相似文献

17.

Explicit Fourth-Order Runge–Kutta Method on Intel Xeon Phi Coprocessor

Beata Bylina Joanna Potiopa 《International journal of parallel programming》2017,45(5):1073-1090

This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge–Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this work an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors’ own implementation, both using the CSR storage scheme and working on Intel Xeon Phi, were investigated. The implementation based on the Intel MKL library uses the high-performance BLAS and Sparse BLAS routines. In our application we focus on OpenMP style programming. We implement SpMV operation and vector addition using the basic optimizing techniques and the vectorization. We evaluate our approach in native and offload modes for various number of cores and thread allocation affinities. Both implementations (based on Intel MKL and made by the authors) were compared in respect of the time, the speedup and the performance. The numerical experiments on Intel Xeon Phi show that the performance of authors’ implementation is very promising and gives a gain of up to two times compared to the multithreaded implementation (based on Intel MKL) running on CPU (Intel Xeon processor) and even three times in comparison with the application which uses Intel MKL on Intel Xeon Phi. 相似文献