Similar Documents
18 similar documents found (search time: 437 ms)
1.
肖玄基  张云泉  李玉成  袁良 《软件学报》2013,24(S2):118-126
MAGMA is the first open-source linear algebra package targeting next-generation architectures (multi-core CPUs and GPUs). It employs many optimizations for heterogeneous platforms, including hybrid synchronization, communication avoidance, and dynamic task scheduling. Its functionality, data storage, and interfaces are similar to LAPACK's, allowing numerical computations to exploit the massive computing power of GPUs. This paper tests and analyzes MAGMA. It first analyzes the matrix factorization algorithms; it then uses the test results to analyze MAGMA's effective optimization and parallelization methods, offering practical suggestions for using and optimizing MAGMA; finally, it proposes an adaptive tuning method for the matrix blocking algorithm. In tests, the method achieves a speedup of 1.09 for the SGEQRF routine on square matrices and 1.8 for the CGEQRF routine on tall-and-skinny matrices.
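The adaptive tuning idea in this abstract can be sketched as an empirical block-size sweep: time a blocked kernel over candidate tile sizes and keep the fastest. This is a hypothetical pure-Python stand-in (a toy blocked matrix product rather than MAGMA's SGEQRF), not the paper's actual method.

```python
# Hypothetical sketch: empirical block-size autotuning over a blocked
# matrix-matrix product. A real tuner would time SGEQRF/CGEQRF instead.
import time

def blocked_matmul(A, B, nb):
    """Multiply square row-major matrices A and B using nb x nb tiles."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += a * B[k][j]
    return C

def autotune_block_size(A, B, candidates):
    """Return (best_nb, result): the fastest candidate tile size and its output."""
    best_nb, best_t, best_C = None, float("inf"), None
    for nb in candidates:
        t0 = time.perf_counter()
        C = blocked_matmul(A, B, nb)
        t = time.perf_counter() - t0
        if t < best_t:
            best_nb, best_t, best_C = nb, t, C
    return best_nb, best_C
```

All tile sizes produce the same result; only the timing differs, which is what the tuner exploits.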

2.
Design and Implementation of Adaptive Tuning and Performance Optimization for PLASMA
PLASMA is an efficient linear algebra package whose data layout combined with tiling, fine-grained parallelism, and out-of-order execution greatly improves performance. PLASMA still has some problems, however: the tile size has a large impact on performance, and a large amount of data copying is generated. By comparing the implementation mechanisms of the traditional LAPACK and of PLASMA, this paper analyzes PLASMA's strengths and weaknesses and introduces two methods to remedy its shortcomings. Targeting PLASMA's architecture, and after extensive testing and analysis, the paper proposes the concept of the edge matrix, analyzes its impact on performance, and accordingly proposes an adaptive tuning method. PLASMA's performance is further improved by overlapping data copying with computation, and extensive tests verify the effectiveness of the optimization.

3.
曹越 《测控技术》2016,35(1):113-117
Using a domestically developed embedded processor as the platform, this paper analyzes the performance of the Android system. The Oprofile tool is used to collect memory-access hotspot functions under Android; combining the characteristics of the processor architecture and fully considering traditional cache behavior, assembly-level optimization algorithms are proposed for the hotspot memory-access functions in Android's Bionic C library and Libcutils library. Experiments show that, compared with the unoptimized versions, the optimized Bionic C and Libcutils libraries improve memory bandwidth by 8.91% and 12.3% and system performance by 1.54% and 3.81%, respectively; overall Android system performance improves by 5.35%.

4.
Stencil computations are an important class of computations in scientific applications, and tiling is the key technique for improving their data locality. To address the lack of temporal tiling in existing 3D stencil optimizations on the SW26010 processor and the need to tune tile parameters by hand, this paper introduces temporal tiling and proposes an adaptive tile-parameter algorithm for 3D stencils on the SW26010. By building a performance analysis model that accounts for constraints such as hardware compute capability and memory capacity, the paper systematically analyzes how tile parameters affect modeled performance, identifies performance bottlenecks, and guides the direction of tile-parameter optimization. Based on this model, the adaptive algorithm produces the tile parameters with the best predicted performance, facilitating rapid optimization and deployment of 3D stencils on the SW26010. Experiments with 3D 7-point and 3D 27-point stencil kernels show speedups of 1.47 and 1.29, respectively, under the adaptively selected tile parameters compared with 3D stencil optimizations without temporal tiling, and the measured optimal tile parameters match the theoretical optimum, validating the proposed performance model and adaptive tile-parameter algorithm.
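For readers unfamiliar with the kernel being tuned, here is a minimal sketch of one sweep of a 3D 7-point stencil with spatial tiling in plain Python. The tile sizes (bx, by, bz) are the kind of parameters the paper's model selects; this is an illustration, not the SW26010 implementation (which also uses temporal tiling).

```python
# One Jacobi-style sweep of a 3D 7-point stencil over an n^3 grid stored
# as a flat list, with spatial tiles of size bx x by x bz. Reads come
# from the old grid, so tiling does not change the numerical result.
def stencil_7pt_tiled(grid, n, bx, by, bz):
    idx = lambda x, y, z: (x * n + y) * n + z
    out = grid[:]  # boundary cells keep their old values
    for xx in range(1, n - 1, bx):
        for yy in range(1, n - 1, by):
            for zz in range(1, n - 1, bz):
                for x in range(xx, min(xx + bx, n - 1)):
                    for y in range(yy, min(yy + by, n - 1)):
                        for z in range(zz, min(zz + bz, n - 1)):
                            out[idx(x, y, z)] = (grid[idx(x, y, z)]
                                + grid[idx(x - 1, y, z)] + grid[idx(x + 1, y, z)]
                                + grid[idx(x, y - 1, z)] + grid[idx(x, y + 1, z)]
                                + grid[idx(x, y, z - 1)] + grid[idx(x, y, z + 1)]) / 7.0
    return out
```

Because the sweep reads only the old grid, any tile sizes yield identical output; tiling only changes the memory-access order, which is where the locality benefit comes from.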

5.
邓洁  赵荣彩  王磊 《计算机应用》2022,(S1):215-220
The general matrix-vector multiplication (GEMV) routine is the foundation on which the entire Level-2 BLAS (Basic Linear Algebra Subprograms) library is built. Although BLAS is one of the key foundational numerical libraries, no high-performance implementation yet exists for the Sunway (申威) processor. To exploit the computational advantages of a high-performance BLAS library on the Sunway 1621 platform, this paper presents a performance analysis and optimization method for GEMV on the Sunway 1621. The GEMV routine is first improved through computation reordering and loop tiling; single-instruction multiple-data (SIMD) vectorization and instruction reordering are then applied; finally, the memory allocation strategy is chosen empirically. Test results show that the optimized GEMV achieves an average of 2.17 times the performance of the GotoBLAS version. After adopting stack-allocated memory or adding a branch on the stride of the y vector, the average performance on small matrices improves from 2.265 to 2.875 times that of GotoBLAS. To improve performance on large matrices and exploit the parallelism of the multi-core Sunway 1621, enabling 4 threads yields an average of 3.57 times the single-core performance. The optimized GEMV routine thus parallelizes well on the Sunway platform.
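The computation reordering and loop tiling mentioned above can be sketched as a tiled GEMV in plain Python. This is a minimal illustration of the blocking idea only; the real Sunway implementation additionally relies on SIMD intrinsics, instruction reordering, and memory-allocation choices that cannot be shown at this level.

```python
# Loop-tiled GEMV (y = A @ x) for a dense m x n row-major matrix A.
# Tiling over columns keeps a slice of x hot in cache/registers while
# several rows consume it, which is the locality effect being exploited.
def gemv_tiled(A, x, mb, nb):
    m, n = len(A), len(A[0])
    y = [0.0] * m
    for ii in range(0, m, mb):        # tile over rows
        for jj in range(0, n, nb):    # tile over columns: x[jj:jj+nb] stays hot
            for i in range(ii, min(ii + mb, m)):
                s = 0.0
                for j in range(jj, min(jj + nb, n)):
                    s += A[i][j] * x[j]
                y[i] += s
    return y
```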

6.
This paper models and analyzes the test and diagnosis process of digital microfluidic biochips, and derives cost functions for testing and diagnosis in terms of the number of blocks used in parallel testing and the cell fault probability. Analysis of the cost functions in Matlab shows that as the number of parallel-test blocks increases, the test-and-diagnosis cost changes little; that is, the number of blocks has only a minor effect on cost. As the cell fault probability p increases, however, the test cost rises markedly. In addition, during diagnosis, faulty sub-arrays are diagnosed again according to the cell fault probability, and this process must be repeated several times until all faults are located. Among these rounds, the final fault-localization round has the largest cost, larger than the other rounds by tens of orders of magnitude, and it dominates the total cost. These conclusions provide an important theoretical basis for optimizing the test and diagnosis of digital microfluidic biochips and guidance for designing test and diagnosis methods.

7.
Existing polyhedral compilation tools often use simple heuristics to find the optimal statement fusion, and the loop fusion strategy must be adjusted by hand for each program to obtain the best performance. To address this problem, this paper proposes a loop fusion strategy based on data-reuse analysis for multi-core CPU targets. The strategy avoids unnecessary fusion constraints that would harm the exploitation of data locality: for different phases of scheduling, it introduces parallelism-oriented fusion constraints for different levels of parallelism, and for statements with complex array access patterns it introduces tiling-oriented fusion constraints aimed at CPU cache optimization. Unlike previous fusion strategies, it accounts for changes in spatial locality when computing fusion profit. The strategy is implemented in Polly, the polyhedral compilation module of the LLVM framework, and evaluated on selected benchmarks from test suites such as Polybench. Compared with several existing fusion strategies, the benchmarks gain 14.9%-62.5% average performance in single-core runs and 19.7%-94.9% average performance in multi-core runs, with a best-case speedup of 1.49x-3.07x on a single benchmark.

8.
Face Recognition via Block-Based Maximum-Similarity Embedded Sparse Coding
Embedding image-similarity priors can effectively improve the recognition performance of sparse-coding-based face recognition in low-dimensional feature spaces. To handle uncontrolled face images with expression variation, partial occlusion, and disguise, this paper proposes a face recognition method based on block-wise maximum-similarity embedded sparse coding. The method first partitions the training and test images into identical non-overlapping blocks; it then computes the similarity between corresponding blocks of the test image and each training image and takes the maximum as the image-level similarity; finally, the extracted maximum block similarity is embedded into sparse-coding-based face recognition. Tests on the standard AR face database show that, compared with weighted sparse-coding classification with globally embedded similarity, the proposed method achieves a substantial performance improvement when both training and test samples contain expression variation, occlusion, and disguise.

9.
Automatic Performance Optimization Techniques for SpMV and Their Application
Sparse matrix-vector multiplication (SpMV) is an important computational kernel that is invoked heavily in scientific computing. Because a straightforward SpMV implementation has a very low ratio of floating-point operations to memory accesses and a highly irregular access pattern, its achieved performance is usually poor. Using a register blocking algorithm and a heuristic block-size selection algorithm, the sparse matrix can be partitioned into small dense blocks, and elements of the vector x held in registers can be reused, improving the kernel's performance. This paper profiles and summarizes several key optimization techniques used in the OSKI package and evaluates them in real applications. The tests show that an application must call SpMV on the order of hundreds of times before the speedup outweighs the extra overhead introduced by applying these optimizations. On Pentium 4 and AMD Athlon platforms, tests on 10 matrices achieve average speedups of 1.69 and 1.48, respectively.
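As context for the abstract, here is the baseline kernel being optimized: SpMV over a compressed sparse row (CSR) matrix. Register blocking, as in OSKI, would instead store small dense r x c blocks so that entries of x can be reused from registers; this plain sketch is the version those techniques accelerate.

```python
# Baseline CSR SpMV: y = A @ x, with A given by three CSR arrays.
# values[k] is the k-th nonzero, col_idx[k] its column, and
# row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros.
def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]  # irregular access into x
        y[i] = s
    return y
```

The indirect access `x[col_idx[k]]` is the irregular memory pattern the abstract refers to.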

10.
As an emerging digital copyright protection technique, digital fingerprinting plays an important role in plagiarism detection, and the fingerprint generation algorithm directly determines the performance of a fingerprinting scheme. This paper describes three fingerprint generation algorithms commonly used in digital fingerprinting schemes, namely MD5, SHA1, and the Rabin fingerprint, introduces their basic principles, and evaluates their performance experimentally. To compare the three algorithms, test files ranging from 20 KB to 20 MB were constructed. The files were first preprocessed to remove irrelevant characters; the processed text was then split into blocks for fingerprint generation, and fingerprinting efficiency was measured under different blocking strategies and file sizes. The results show that the hash functions (MD5, SHA1) perform better on large files, while on small files they perform the same as the Rabin fingerprint. These results provide experimental support for designing new fingerprinting schemes and selecting fingerprint generation algorithms.
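The blocking strategy described above can be sketched with Python's standard hashlib implementations of MD5 and SHA1 (a Rabin fingerprint would need a custom polynomial-rolling implementation, omitted here). The chunk size is the experimental variable the abstract varies.

```python
# Block-wise fingerprint generation: split the input into fixed-size
# blocks and emit one hex digest per block using hashlib (MD5/SHA1).
import hashlib

def block_fingerprints(data, block_size, algorithm="md5"):
    """Return one hex fingerprint per block of `data` (bytes)."""
    fps = []
    for off in range(0, len(data), block_size):
        h = hashlib.new(algorithm)
        h.update(data[off:off + block_size])
        fps.append(h.hexdigest())
    return fps
```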

11.
§1. Introduction. Automatic parallelization of serial programs is a major research topic in today's high-performance computing field, and a goal people have pursued ever since high-performance parallel computers appeared. With the rapid advance of parallel computer hardware, massively parallel machines at the teraflops and even 4-teraflops scale have appeared abroad, providing the material basis for solving a batch of urgent large-scale scientific and engineering computing problems. By comparison, however, parallel algorithms, parallel software technology, and parallel development environments and support tools lag far behind, severely limiting the application efficiency and market adoption of parallel machines. At present, major parallel-computer vendors and application departments at home and abroad are investing heavily in manpower, materials, and funding to accelerate parallel software development, including porting the large body of existing mature serial software…

12.
The promise of future many‐core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP‐like pragmas and a run‐time system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS‐level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column‐major order, as employed by this library, needs to be abandoned in benefit of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run‐time. The parallelization of FLAME routines using SMPSs is simpler as this library includes blocked algorithms (or algorithms‐by‐blocks in the FLAME argot) for most operations and storage‐by‐blocks (or block data layout) is already in place. Copyright © 2009 John Wiley & Sons, Ltd.

13.
Improvements in performance modeling and the identification of computational regimes within software libraries are a critical first step in developing software libraries that are truly agile with respect to the application as well as to the hardware. It is shown here that Pareto ranking, a concept from multi‐objective optimization, can be an effective tool for mining large performance datasets. The approach is illustrated using software performance data gathered with both the public-domain LAPACK library and an asynchronous communication library based on the IBM LAPI active message library. Copyright © 2005 John Wiley & Sons, Ltd.
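The Pareto ranking concept used above can be sketched in a few lines. This is a generic illustration, assuming each performance record is a tuple of objectives to be minimized (e.g. runtime and memory), not the paper's actual mining pipeline.

```python
# Compute the Pareto front of a set of performance records, where each
# record is a tuple of objectives to minimize.
def pareto_front(points):
    """Return the subset of points not dominated by any other point."""
    def dominates(a, b):
        # a dominates b if a is <= in every objective and < in at least one
        return (all(ai <= bi for ai, bi in zip(a, b))
                and any(ai < bi for ai, bi in zip(a, b)))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

Ranking a dataset amounts to repeatedly extracting the front and removing it; the rank of a record is the iteration in which it is removed.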

14.
Programmer productivity has always been overlooked in the high performance computing (HPC) community as compared to performance optimization. In many parallel programming languages like MPI/MPI-IO, the performance optimizations are exposed to programmers as various function options. In order to write efficient code, they are required to know the exact usage of the optimization functions, so programmer productivity is limited. In this paper, we present RFSA, a Reduced Function Set Abstraction based on an existing parallel programming interface (MPI-IO) for I/O. The purpose of RFSA is to hide the performance optimization functions from the application developer and to relieve the application developer from deciding on a specific function. The proposed set of functions relies on a selection algorithm to decide among the most common optimizations provided by MPI-IO. We implement a selection algorithm for I/O functions like read, write, etc., and also merge a set of functions for data types and file views. By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show an improved programmer productivity (35.7% on average). This approach incurs an overhead of 2-5% for one particular optimization, and shows a performance improvement of 17% when a combination of different optimizations is required by an application.

15.
We propose a new software package which would be very useful for implementing dense linear algebra algorithms on block-partitioned matrices. The routines are referred to as block basic linear algebra subprograms (BLAS), and their use is restricted to computations in which one or more of the matrices involved consists of a single row or column of blocks, and in which no more than one of the matrices consists of an unrestricted two-dimensional array of blocks. The functionality of the block BLAS routines can also be provided by Level 2 and 3 BLAS routines. However, for non-uniform memory access machines the use of the block BLAS permits certain optimizations in memory access to be taken advantage of. This is particularly true for distributed memory machines, for which the block BLAS are referred to as the parallel block basic linear algebra subprograms (PB-BLAS). The PB-BLAS are the main focus of this paper; for a block-cyclic data distribution, a single row or column of blocks lies in a single row or column of the processor template. The PB-BLAS consist of calls to the sequential BLAS for local computations and calls to the BLACS for communication. The PB-BLAS are the building blocks for implementing ScaLAPACK, the distributed-memory version of LAPACK, and provide the same ease-of-use and portability for ScaLAPACK that the BLAS provide for LAPACK. The PB-BLAS consist of all Level 2 and 3 BLAS routines for dense matrix computations (not for banded matrices) and four auxiliary routines for transposing and copying a vector and/or a block vector. The PB-BLAS are currently available for all numeric data types, i.e., single and double precision, real and complex.

16.
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithms have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
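The block cyclic data distribution referred to in the last two abstracts has a simple ownership rule that can be written down directly: the matrix is cut into nb x nb blocks, and block (I, J) is owned by process (I mod Pr, J mod Pc) on a Pr x Pc process grid. A minimal sketch of that mapping:

```python
# 2D block-cyclic ownership, as used by ScaLAPACK-style libraries.
# The matrix is tiled into nb x nb blocks; block (I, J) goes to
# process (I mod Pr, J mod Pc) on a Pr x Pc process grid.
def owner_of_element(i, j, nb, Pr, Pc):
    """Process-grid coordinates owning global matrix element (i, j)."""
    return ((i // nb) % Pr, (j // nb) % Pc)
```

Cycling blocks over the grid is what gives the load balance these factorizations need: as the active submatrix shrinks during LU or QR, every process keeps a share of the remaining work.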

17.
In a previous paper (Vidal et al., 2008, [21]), we presented a parallel solver for the symmetric Toeplitz eigenvalue problem, which is based on a modified version of the Lanczos iteration. However, its efficient implementation on modern parallel architectures is not trivial. In this paper, we present an efficient implementation on multicore processors which takes advantage of the features of this architecture. Several optimization techniques have been incorporated into the algorithm: improvement of the Discrete Sine Transform routines, use of the Gohberg-Semencul formulas to solve the Toeplitz linear systems, optimization of the workload distribution among processors, and others. Although the algorithm follows a distributed memory parallel programming paradigm dictated by the nature of the mathematical derivation, special attention has been paid to obtaining the best performance in multicore environments. Hybrid techniques that merge OpenMP and MPI have been used to increase performance in these environments. Experimental results show that our implementation takes advantage of multicore architectures and clearly outperforms the results obtained with LAPACK or ScaLAPACK.

18.
In CAN communication, the CAN protocol allows the baud rate and the number and position of sample points within a bit period to be configured programmatically. Optimizing the bit-timing parameters ensures message synchronization and proper error detection even under extreme combinations of transmission delay and clock error. This paper discusses how to optimize the bit-timing parameters under given system constraints, explains how the bit-timing parameters are determined, and gives an example.
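The basic bit-timing arithmetic behind this abstract is standard for CAN: a bit is divided into time quanta (a 1-quantum SYNC_SEG plus the TSEG1 and TSEG2 segments), so the baud rate is f_clk / (prescaler * (1 + TSEG1 + TSEG2)) and the sample point sits at (1 + TSEG1) / (1 + TSEG1 + TSEG2) of the bit time. A minimal sketch (generic CAN arithmetic, not the paper's optimization procedure):

```python
# Standard CAN bit-timing arithmetic: derive baud rate and sample-point
# position from the controller clock, prescaler, and segment lengths
# (all segment lengths are in time quanta).
def can_bit_timing(f_clk_hz, prescaler, tseg1, tseg2):
    """Return (baud rate in bit/s, sample point as a fraction of the bit)."""
    tq_per_bit = 1 + tseg1 + tseg2          # SYNC_SEG + TSEG1 + TSEG2
    baud = f_clk_hz / (prescaler * tq_per_bit)
    sample_point = (1 + tseg1) / tq_per_bit  # sampling occurs after TSEG1
    return baud, sample_point
```

For example, a 16 MHz clock with prescaler 2, TSEG1 = 13, and TSEG2 = 2 gives 16 time quanta per bit, i.e. 500 kbit/s with the sample point at 87.5% of the bit, a common configuration.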

