Related Articles
19 related articles found (search time: 218 ms)
1.
Solving systems of Boolean equations is of great practical significance for cryptanalysis. However, in the actual computation of many solving algorithms, the conflict between uncontrollable growth in memory demand and the limited storage capacity of computer systems is currently the main bottleneck preventing further progress in Boolean equation solving. For solving algorithms based on term elimination, this paper analyzes the root cause of that conflict, proposes a way to resolve it, and then designs a completely new computer representation of Boolean polynomials, called BanYan. BanYan is suitable for solving algorithms based on leading-term reduction, such as F4, F5, and XL. By recording the generation information of intermediate results rather than the results themselves, it keeps the implementation from being overwhelmed by the storage burden caused by the rapid blow-up in the number of terms. Compared with traditional term-based Boolean polynomial representations such as BDDs and coefficient matrices, in both the average and the worst case the space required by BanYan is roughly 1/l of that of a term-based representation (where l is the average number of terms of the polynomials produced during the computation), significantly improving the practical solving power of Boolean equation system solvers.
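The core idea, storing how an intermediate polynomial was produced instead of its expanded term list, can be illustrated with a small sketch. The Python fragment below is only an illustration of that lazy-representation idea for GF(2) arithmetic, not the actual BanYan data structure or the F4/F5/XL reduction machinery; the class and names are hypothetical.

```python
# A Boolean monomial is a frozenset of variable indices (x_i^2 = x_i).
# An expanded polynomial is a set of monomials; addition over GF(2) is
# symmetric difference of the term sets.
Monomial = frozenset

class LazyPoly:
    """Records how a polynomial was generated instead of its terms."""
    def __init__(self, op, *args):
        self.op, self.args = op, args   # 'leaf', 'add', or 'mul'

    def expand(self):
        """Materialize the full term set only when it is really needed."""
        if self.op == 'leaf':                      # explicit input polynomial
            return set(self.args[0])
        if self.op == 'add':                       # p + q over GF(2)
            return self.args[0].expand() ^ self.args[1].expand()
        if self.op == 'mul':                       # monomial * p, with x^2 = x
            m, p = self.args
            return {Monomial(m | t) for t in p.expand()}
        raise ValueError(self.op)

# Example: f = x0*x1 + x2, g = x0 + 1; record "x2*g + f" without expanding it.
f = LazyPoly('leaf', {Monomial({0, 1}), Monomial({2})})
g = LazyPoly('leaf', {Monomial({0}), Monomial()})
h = LazyPoly('add', LazyPoly('mul', Monomial({2}), g), f)
print(sorted(map(sorted, h.expand())))   # [[0, 1], [0, 2]]: the x2 terms cancel
```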

2.
HPL is the Linpack benchmark package widely used in high-performance computing. In the traditional HPL algorithm, the matrix to be solved is distributed cyclically, block by block, across all processors. Because the low-level matrix-multiplication interface of the domestic accelerator (China Accelerator) only supports a custom interface, the traditional HPL algorithm is no longer suitable for CPU + China Accelerator heterogeneous systems. A fine-grained matrix distribution and encapsulation, dPEM, must therefore be built on top of the custom interface to provide a general HPL testing and configuration environment. At the same time, to fully exploit the efficiency of the domestic heterogeneous system, a heterogeneous cooperative matrix-multiplication scheduling algorithm, OA4MM, is designed. Experiments verify the effectiveness of dPEM and the efficiency of OA4MM; OA4MM improves performance by nearly 10% over the traditional heterogeneous HPL scheduling algorithm.
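The block-cyclic distribution mentioned above, in which blocks of the matrix are dealt round-robin over a process grid, can be sketched as follows. This is a generic illustration of the standard HPL-style 2D block-cyclic mapping, not the dPEM encapsulation or the OA4MM scheduler from the paper; the grid and block sizes are made-up example values.

```python
def block_owner(i, j, nb, P, Q):
    """Return the (p, q) coordinates of the process that owns matrix element
    (i, j) under a 2D block-cyclic distribution with block size nb on a
    P x Q process grid (the layout HPL assumes)."""
    return (i // nb) % P, (j // nb) % Q

# Example: a 16x16 matrix, 2x2 blocks, on a 2x2 process grid.
nb, P, Q, n = 2, 2, 2, 16
for i in range(0, n, nb):                 # print the owner of each block row
    print([block_owner(i, j, nb, P, Q) for j in range(0, n, nb)])
```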

3.
Numerical simulation is the main tool in planetary fluid dynamics research. This paper presents planetary fluid dynamics simulation on a CPU-MIC heterogeneous many-core platform, computing and simulating the magnetohydrodynamic motion of the Earth's outer core. Building on previous work [1-3], support for numerical simulation in the CPU-MIC heterogeneous many-core environment is added. The paper first describes the simulation workflow in this environment and then presents the implementation of a distributed parallel GMRES(m) many-core solver on the MIC. Next, a distributed parallel algorithm on the MIC is implemented for the solver's computational kernel, sparse matrix-vector multiplication (SpMV), overlapping computation with communication and data transfer with computation. To accelerate convergence of the planetary fluid dynamics equations, a distributed parallel polynomial preconditioner with SpMV as its basic operation is given for the MIC. Finally, several optimizations for the MIC many-core platform are proposed, such as multithreading, streaming stores, and data-transfer optimization. Simulations on Tianhe-2 show that, compared with the CPU version, the CPU-MIC heterogeneous many-core simulation achieves speedups of 6.93x on a single MIC card and 6.0x on 64 MIC cards.
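The solver's kernel, SpMV, is the operation the paper distributes and overlaps with communication on the MIC. A minimal serial CSR sparse matrix-vector product is sketched below for reference; it shows only the arithmetic being parallelized, not the distributed, overlapped MIC implementation.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR (values, column indices, row pointers)."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):                          # one output entry per row
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example:  [[4, 1, 0], [1, 3, 0], [0, 0, 2]]
values  = np.array([4.0, 1.0, 1.0, 3.0, 2.0])
col_idx = np.array([0, 1, 0, 1, 2])
row_ptr = np.array([0, 2, 4, 5])
print(spmv_csr(values, col_idx, row_ptr, np.array([1.0, 2.0, 3.0])))  # [6. 7. 6.]
```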

4.
The conjugate gradient algorithm is one of the most important methods for solving symmetric positive definite linear systems, and the systems it solves are typically sparse. As problem sizes keep growing, a single CPU, limited in both memory and compute capability, can no longer meet the real-time requirements of solving large sparse linear systems. This paper therefore proposes an MPI+CUDA heterogeneous parallel solver on a CPU+GPU heterogeneous platform. First, a hotspot performance analysis of the conjugate gradient algorithm is carried out to identify its computational difficulties and challenges. Then, tasks are partitioned according to the characteristics of the algorithm to obtain the heterogeneous parallel design. Finally, the heterogeneous parallel algorithm is optimized to address communication overhead, data-transfer overhead, and memory-access overhead, further improving solver efficiency and performance. Experimental results show that, compared with MPI-only and CUDALib-only parallelism, the MPI+CUDA hybrid achieves performance gains of 336% and 33%, respectively, on the Jacobi-preconditioned conjugate gradient algorithm (which has little serial work), and gains of 25% and 7% on the ILU-preconditioned conjugate gradient algorithm (which has more serial work). The results also show that the MPI+CUDA hybrid scales reasonably as the number of nodes increases.
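As a reference for the method being parallelized, here is a minimal single-process Jacobi-preconditioned conjugate gradient iteration in NumPy. It is only a sketch of the textbook algorithm; the paper's MPI+CUDA task partitioning and overhead optimizations are not represented.

```python
import numpy as np

def pcg_jacobi(A, b, tol=1e-10, max_iter=1000):
    """Jacobi(diagonal)-preconditioned CG for a symmetric positive definite A."""
    x = np.zeros_like(b)
    Minv = 1.0 / np.diag(A)          # Jacobi preconditioner: M = diag(A)
    r = b - A @ x
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                   # the SpMV: the dominant, GPU-friendly kernel
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(pcg_jacobi(A, b))              # approx [0.0909, 0.6364]
```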

5.
For the sparse matrix-vector multiplication in parallel implementations of algebraic multigrid (AMG), this paper establishes a new distribution and data storage scheme for sparse matrices, proposes a class of new parallel sparse matrix-vector multiplication algorithms with minimal communication volume and hidden communication, and implements a parallel solve-phase algorithm based on K-cycle iteration. Targeting modern multi-core processors and combining a fine-grained parallel programming model, an MPI+OpenMP hybrid parallel algorithm is implemented. Compared against the hypre (high performance preconditioners) package, when solving the three-dimensional Laplace equation at a parallel scale of 512 cores, the solve-phase running time improves on hypre by 56% on the Shenteng 7000 cluster and by 39% on the Yuan cluster, verifying the effectiveness of the algorithm.
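For a row-distributed SpMV, the communication volume is determined by how many off-process columns each process's rows reference. The sketch below computes that halo set for a toy row partition; it illustrates only how communication volume arises, not the paper's distribution scheme, K-cycle solver, or MPI+OpenMP implementation.

```python
def halo_columns(col_idx, row_ptr, my_rows):
    """Columns referenced by the rows in `my_rows` but owned by other
    processes (assuming column j is owned by the owner of row j).
    Each such column is one vector entry that must be received."""
    mine = set(my_rows)
    needed = set()
    for i in my_rows:
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j not in mine:
                needed.add(j)
    return needed

# 4x4 example pattern: row i references columns {i, (i+1) % 4}
col_idx = [0, 1, 1, 2, 2, 3, 3, 0]
row_ptr = [0, 2, 4, 6, 8]
print(halo_columns(col_idx, row_ptr, my_rows=[0, 1]))   # {2}: only column 2 is remote
```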

6.
On a heterogeneous parallel computing platform consisting of a cluster and GPUs, an MPI+CUDA hybrid programming model is used to compute the charge distribution in molecular dynamics simulations based on the ABEEMσπ model. The computational part of the charge-distribution solve is ported to the GPU, and the problems of high communication overhead and underutilized resources are addressed through asynchronous concurrency on the heterogeneous platform, improving solving efficiency. Performance tests show that, compared with a pure MPI parallel algorithm, the optimized GPU-accelerated heterogeneous parallel algorithm has a clear performance advantage in computing the charge distribution of large chemical molecular models.

7.
Traditional methods for computing the transitive closure of a graph suffer from heavy computation and long running times. To speed up transitive closure computation on large data volumes, this paper combines the compute-intensive nature of the algorithm with the features of the Open Computing Language (OpenCL) framework and proposes an OpenCL-based parallel transitive closure algorithm that uses local-memory-optimized parallel sub-matrix multiplication and tiled parallel matrix multiplication. The local-memory-optimized sub-matrix multiplication optimizes the computation steps, improves memory utilization on the graphics processing unit (GPU), and reduces data-access latency; tiled parallel matrix multiplication handles large matrix multiplications and improves utilization of the GPU compute cores. Results show that, compared with a serial CPU algorithm, an OpenMP-based parallel algorithm, and a CUDA-based parallel algorithm, the parallel transitive closure algorithm achieves speedups of 593.14x, 208.62x, and 1.05x, respectively, on an NVIDIA GeForce GTX 1070 under OpenCL.
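The connection between transitive closure and matrix multiplication that the paper exploits is that repeatedly squaring the Boolean reachability matrix converges to the closure in O(log n) multiplications. A plain NumPy sketch of that idea, without any OpenCL tiling or local-memory optimization, is shown below.

```python
import numpy as np

def transitive_closure(adj):
    """Transitive closure by repeated Boolean matrix squaring.
    adj: n x n boolean adjacency matrix; returns the reachability matrix."""
    reach = adj | np.eye(len(adj), dtype=bool)          # include self-reachability
    while True:
        # one Boolean matrix "multiply": entry (i, j) is true if any i -> k -> j
        nxt = (reach.astype(np.int64) @ reach.astype(np.int64) > 0) | reach
        if np.array_equal(nxt, reach):                   # fixed point reached
            return nxt
        reach = nxt

adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]], dtype=bool)               # path 0 -> 1 -> 2 -> 3
print(transitive_closure(adj).astype(int))               # upper-triangular ones
```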

8.
In practical engineering applications, traditional serial CPU computation often cannot meet the speed requirements of combustion simulations. Exploiting the greater computing power of GPUs over CPUs, the combustion physics equations are discretized on a staggered grid and the discretized equations are solved with the preconditioned biconjugate gradient stabilized method (PBiCGSTAB); GPU-oriented parallel algorithms for matrix-vector multiplication and inverse-matrix-vector multiplication are explored, yielding a feasible way to numerically solve laminar diffusion combustion on the GPU. Experimental results show that the GPU parallel program achieves a speedup of more than about 10x over the serial CPU program, and the computed results agree with physical reality, so the proposed method is both feasible and efficient.
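For reference, a minimal Jacobi-preconditioned BiCGSTAB iteration in NumPy is sketched below. It follows the standard published algorithm; the paper's staggered-grid discretization and GPU kernels are not reproduced, and the diagonal preconditioner is only a placeholder choice.

```python
import numpy as np

def pbicgstab(A, b, tol=1e-10, max_iter=1000):
    """Preconditioned BiCGSTAB with a Jacobi (diagonal) preconditioner."""
    x = np.zeros_like(b)
    Minv = 1.0 / np.diag(A)                 # placeholder preconditioner
    r = b - A @ x
    r_hat = r.copy()                        # fixed shadow residual
    rho_old = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    for _ in range(max_iter):
        rho = r_hat @ r
        beta = (rho / rho_old) * (alpha / omega)
        p = r + beta * (p - omega * v)
        p_hat = Minv * p                    # apply preconditioner
        v = A @ p_hat
        alpha = rho / (r_hat @ v)
        s = r - alpha * v
        s_hat = Minv * s
        t = A @ s_hat
        omega = (t @ s) / (t @ t)
        x = x + alpha * p_hat + omega * s_hat
        r = s - omega * t
        if np.linalg.norm(r) < tol:
            break
        rho_old = rho
    return x

A = np.array([[4.0, 1.0], [2.0, 3.0]])      # small nonsymmetric example system
b = np.array([1.0, 2.0])
print(pbicgstab(A, b), np.linalg.solve(A, b))   # both approx [0.1, 0.6]
```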

9.
方民权  张卫民  高畅  方建滨 《软件学报》2015,26(S2):247-256
The maximum noise fraction rotation (MNF rotation) method for dimensionality reduction of hyperspectral remote sensing images is computationally expensive and time-consuming. This work studies parallelization schemes and performance optimization of the MNF algorithm on multi-core CPU and many-core MIC (many integrated cores) platforms. Through hotspot analysis, parallel schemes and multiple optimization strategies are proposed for the hotspots of filtering, covariance matrix computation, and the MNF transform; the effects of the optimizations are quantified; an implementation based on MKL (math kernel library) functions is designed and evaluated; and the C-MNF parallel algorithm for multi-core CPUs and the M-MNF parallel algorithm for CPU/MIC are designed and implemented. Experimental results show that C-MNF achieves speedups of 58.9-106.4 on multi-core CPUs, while M-MNF on the CPU/MIC heterogeneous system performs best, with speedups of up to 137x.
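The MNF rotation itself, one of the hotspots listed above, amounts to noise-whitening followed by a principal component transform. A compact NumPy sketch of that serial computation is given below; the noise estimate from neighboring-pixel differences is a common simple choice, and none of the paper's CPU/MIC parallelization is included.

```python
import numpy as np

def mnf_rotation(X):
    """Maximum noise fraction transform of X (pixels x bands).
    Noise is estimated from differences of neighboring pixels; the output
    components are ordered by decreasing signal-to-noise ratio."""
    Xc = X - X.mean(axis=0)
    noise = np.diff(Xc, axis=0) / np.sqrt(2.0)      # crude noise estimate
    Cn = np.cov(noise, rowvar=False)                # noise covariance
    Cx = np.cov(Xc, rowvar=False)                   # data covariance
    # Noise-whitening: Cn = U diag(d) U^T  ->  W = U diag(d^-1/2)
    d, U = np.linalg.eigh(Cn)
    W = U / np.sqrt(d)
    # PCA of the noise-whitened data covariance.
    ew, V = np.linalg.eigh(W.T @ Cx @ W)
    order = np.argsort(ew)[::-1]                    # largest SNR first
    return Xc @ W @ V[:, order]

X = np.random.default_rng(0).normal(size=(1000, 5))  # toy "image": 1000 pixels, 5 bands
print(mnf_rotation(X).shape)                          # (1000, 5)
```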

10.
To address the lack of accurate performance-analysis models and targeted optimization methods in GPU parallel computing, a quantitative performance-analysis model for GPU parallel computation is proposed. By modeling the performance of the instruction pipeline, shared-memory accesses, and global-memory accesses, the model quantitatively analyzes parallel programs, helps programmers find runtime bottlenecks, and guides effective optimization. In the experiments, performance analysis of three representative real applications (dense matrix multiplication, tridiagonal linear system solving, and sparse matrix-vector multiplication) demonstrates the model's practicality and enables effective optimization of the algorithms.
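As an illustration of what such a quantitative model does (not the model proposed in the paper), the sketch below estimates a kernel's time as the largest of its compute cost and its two memory-traffic costs, and reports which one is the bottleneck; all hardware numbers in the example are made-up placeholders.

```python
def predict_kernel_time(flops, shared_bytes, global_bytes,
                        peak_flops, shared_bw, global_bw):
    """Very rough bound-based estimate: the kernel cannot run faster than its
    pipeline time or either of its memory-traffic times; the largest of the
    three is the predicted bottleneck."""
    times = {
        'pipeline':      flops / peak_flops,
        'shared memory': shared_bytes / shared_bw,
        'global memory': global_bytes / global_bw,
    }
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck

# Example: naive 1024^3 dense matmul, placeholder hardware numbers.
n = 1024
t, what = predict_kernel_time(flops=2 * n**3,
                              shared_bytes=0,               # no tiling: shared unused
                              global_bytes=8.0 * 2 * n**3,  # every operand from DRAM
                              peak_flops=5e12, shared_bw=2e12, global_bw=3e11)
print(f"{t*1e3:.2f} ms, bound by {what}")
```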

11.
We present fast and highly scalable parallel computations for a number of important and fundamental matrix problems on distributed memory systems (DMS). These problems include matrix multiplication, matrix chain product, and computing the powers, the inverse, the characteristic polynomial, the determinant, the rank, the Krylov matrix, and an LU- and a QR-factorization of a matrix, and solving linear systems of equations. Our highly scalable parallel computations for these problems are based on a highly scalable implementation of the fastest sequential matrix multiplication algorithm on DMS. We show that compared with the best known parallel time complexities on parallel random access machines (PRAM), the most powerful but unrealistic shared memory model of parallel computing, our parallel matrix computations achieve the same speeds on distributed memory parallel computers (DMPC), and have an extra polylog factor in the time complexities on DMS with hypercubic networks. Furthermore, our parallel matrix computations are fully scalable on DMPC and highly scalable over a wide range of system size on DMS with hypercubic networks. Such fast (in terms of parallel time complexity) and highly scalable (in terms of our definition of scalability) parallel matrix computations were rarely seen before on any distributed memory systems.

12.
尹孟嘉  许先斌  熊曾刚  张涛 《计算机科学》2015,42(12):13-17, 22
Performance evaluation and optimization are indispensable for designing efficient parallel programs, and the performance of the memory system directly affects the overall performance of the processor. GPGPU-Sim is used to simulate the GPU memory hierarchy and find the best configuration relationship between the number of SMs and the number of memory controllers. Matrix multiplication is a basic building block of scientific computing and a typical application that is both compute- and memory-intensive; its performance is an important indicator of GPU high-performance computing. Performance models, as a new technical approach to evaluating parallel systems, have advantages that many other evaluation methods cannot match. A performance model is built that quantitatively analyzes the instruction pipeline, shared-memory accesses, and global-memory accesses, locates the runtime bottleneck, and improves execution speed. Experiments show that the model is practical and enables effective optimization of matrix multiplication.
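The optimization such a model typically points to for matrix multiplication is tiling: loading blocks of the operands into fast (shared) memory and reusing them. A plain NumPy sketch of blocked multiplication is shown below purely to illustrate the access pattern; it is not the GPU kernel studied in the paper.

```python
import numpy as np

def blocked_matmul(A, B, tile=32):
    """C = A @ B computed tile by tile. On a GPU the inner A/B tiles would be
    staged in shared memory and reused by a thread block."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # each small product reuses a tile of A and a tile of B
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

rng = np.random.default_rng(1)
A, B = rng.normal(size=(96, 64)), rng.normal(size=(64, 80))
print(np.allclose(blocked_matmul(A, B), A @ B))   # True
```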

13.
This paper discusses how to effectively use idle workstations on a high-performance heterogeneous workstation platform connected by a high-speed LAN to solve the compute-intensive task of matrix multiplication. To obtain good parallel performance, a model and an algorithm for task scheduling across the heterogeneous workstation group are given. The algorithm accounts for the communication time between cooperating tasks, the data loading time, the result collection time, and the computation time on each heterogeneous workstation. With this model, the most suitable subset can be selected from the set of all available workstations, yielding the shortest execution time.
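A toy version of the subset-selection idea can be sketched as follows: split the rows of the product in proportion to workstation speed, estimate each candidate subset's finish time from load, compute, and collection costs, and keep the subset with the smallest estimate. The cost formula and the numbers below are illustrative assumptions, not the model from the paper.

```python
def makespan(subset, n_rows):
    """Estimated finish time if `subset` of workstations shares n_rows rows,
    with work split proportionally to speed (rows/second)."""
    total_speed = sum(w['speed'] for w in subset)
    times = []
    for w in subset:
        rows = n_rows * w['speed'] / total_speed
        times.append(w['load'] + rows / w['speed'] + w['collect'])
    return max(times)                     # the slowest cooperating task decides

def best_subset(workstations, n_rows):
    """Greedy: add workstations from fastest to slowest while the estimate improves."""
    ordered = sorted(workstations, key=lambda w: w['speed'], reverse=True)
    chosen, best = [ordered[0]], makespan([ordered[0]], n_rows)
    for w in ordered[1:]:
        t = makespan(chosen + [w], n_rows)
        if t < best:
            chosen, best = chosen + [w], t
    return chosen, best

stations = [{'name': 'A', 'speed': 800, 'load': 0.5, 'collect': 0.2},
            {'name': 'B', 'speed': 400, 'load': 0.8, 'collect': 0.3},
            {'name': 'C', 'speed': 100, 'load': 2.0, 'collect': 1.0}]
subset, t = best_subset(stations, n_rows=4000)
print([w['name'] for w in subset], round(t, 2))   # ['A', 'B'] 4.43: C is not worth adding
```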

14.
This paper presents the parallelization on a GPU of the sequential matrix diagonalization (SMD) algorithm, a method for diagonalizing polynomial covariance matrices, which is the most recent technique for polynomial eigenvalue decomposition. We first parallelize with CUDA the calculation of the polynomial covariance matrix. Then, following a formal transformation of the polynomial matrix multiplication code—extensively used by SMD—we insert in this code the cublasDgemm function of CUBLAS library. Furthermore, a specialized cache memory system is implemented within the GPU to greatly limit the PC-to-GPU transfers of slices of polynomial matrices. The resulting SMD code can be applied efficiently over high-dimensional data. The proposed method is verified using sequences of images of airplanes with varying spatial orientation. The performance of the parallel codes for polynomial covariance matrix generation and SMD is evaluated and reveals speedups of up to 161 and 67, respectively, relative to sequential execution on a PC.
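Polynomial matrix multiplication, the operation the authors map onto cublasDgemm, is a convolution of coefficient matrices: each lag of the product is a sum of ordinary matrix products of operand lags. A small NumPy sketch of that definition follows; it only restates the arithmetic, not the CUDA/CUBLAS implementation or the cache system described in the paper.

```python
import numpy as np

def poly_matmul(A, B):
    """Multiply polynomial matrices A(z) and B(z).
    A has shape (la, n, m) and B has shape (lb, m, p): A[k] is the n x m
    coefficient of z^k. The product has la + lb - 1 lags, and each lag is a
    convolution sum of ordinary matrix products (each one a GEMM)."""
    la, n, m = A.shape
    lb, m2, p = B.shape
    assert m == m2
    C = np.zeros((la + lb - 1, n, p))
    for i in range(la):
        for j in range(lb):
            C[i + j] += A[i] @ B[j]        # this inner product is the GEMM call
    return C

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4, 5))             # degree-2 polynomial of 4x5 matrices
B = rng.normal(size=(2, 5, 6))             # degree-1 polynomial of 5x6 matrices
print(poly_matmul(A, B).shape)             # (4, 4, 6)
```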

15.
This paper presents a practical evaluation and comparison of three state-of-the-art parallel functional languages. The evaluation is based on implementations of three typical symbolic computation programs, with performance measured on a Beowulf-class parallel architecture. We assess three mature parallel functional languages: PMLS, a system for implicitly parallel execution of ML programs; GPH, a mainly implicit parallel extension of Haskell; and Eden, a more explicit parallel extension of Haskell designed for both distributed and parallel execution. While all three languages employ a completely implicit approach to communication, each language takes a different approach to specifying and controlling parallelism, ranging from explicit identification of processes as language constructs (Eden) through annotation of potential parallelism (GPH) to automatic detection of parallel skeletons in sequential code (PMLS). We present detailed performance measurements of all three systems on a widely available parallel architecture: a Beowulf cluster of low-cost commodity workstations. We use three representative symbolic applications: a matrix multiplication algorithm, an exact linear system solver, and a simple ray-tracer. Our results show how moderate speedups can be achieved with little or no changes to the sequential code, and that parallel performance can be significantly improved even within our high-level model of parallel functional programming by controlling key aspects of the program such as load distribution and thread granularity.

16.
17.
Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message-passing architecture with a two-dimensional mesh topology. We analyze and compare three algorithms and obtain an implementation, BiMMeR, that uses communication primitives highly suited to the Delta and exploits the single node assembly-coded matrix multiplication. Our algorithm is completely general, i.e. able to deal with various data layouts as well as arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86%, with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800 × 8800 matrix. We describe BiMMeR's design and implementation and present performance results that demonstrate scalability and robust behavior over varying mesh topologies.

18.
We estimate parallel complexity of several matrix computations under both Boolean and arithmetic machine models using deterministic and probabilistic approaches. Those computations include the evaluation of the inverse, the determinant, and the characteristic polynomial of a matrix. Recently, processor efficiency of the previous parallel algorithms for numerical matrix inversion has been substantially improved in (Pan and Reif, 1987), reaching optimum estimates up to within a logarithmic factor; that work, however, applies neither to the evaluation of the determinant and the characteristic polynomial nor to exact matrix inversion nor to the numerical inversion of ill-conditioned matrices. We present four new approaches to the solution of those latter problems (having several applications to combinatorial computations) in order to extend the suboptimum time and processor bounds of (Pan and Reif, 1987) to the case of computing the inverse, determinant, and characteristic polynomial of an arbitrary integer input matrix. In addition, processor efficient algorithms using polylogarithmic parallel time are devised for some other matrix computations, such as triangular and QR-factorizations of a matrix and its reduction to Hessenberg form.

19.
To improve the efficiency of task scheduling on heterogeneous CMPs and fully exploit their heterogeneity and parallelism, an improved ant colony optimization task scheduling algorithm for heterogeneous CMPs, IACOTS, is proposed. IACOTS first establishes a task scheduling model, path-selection rules, and pheromone-update rules so that ant colony optimization can be applied to the heterogeneous CMP task scheduling problem. The basic ant colony algorithm is then improved with dynamic pheromone updating, a meeting-based parallel search strategy, and a mutation factor borrowed from genetic algorithms, overcoming its long search times and premature convergence. Simulation results show that IACOTS outperforms existing genetic algorithms, needs the fewest iterations to complete the same tasks, effectively reduces program execution time, and is suitable for task scheduling in large-scale parallel environments such as heterogeneous CMPs.
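As a reference point for the kind of search IACOTS improves on, here is a minimal basic ant colony optimization loop for assigning independent tasks to heterogeneous cores so as to minimize makespan. It is only a generic ACO sketch with made-up parameters and execution times; the paper's dynamic pheromone update, meeting-based parallel search, and mutation factor are not implemented.

```python
import random

# exec_time[t][c]: time of task t on core c (a made-up heterogeneous CMP).
exec_time = [[4, 2, 6], [3, 5, 2], [5, 4, 3], [2, 6, 4], [6, 3, 5]]
n_tasks, n_cores = len(exec_time), len(exec_time[0])
ALPHA, BETA, RHO, ANTS, ITERS = 1.0, 2.0, 0.1, 10, 50

def makespan(assign):
    """Finish time of the most loaded core under assignment task -> core."""
    load = [0.0] * n_cores
    for t, c in enumerate(assign):
        load[c] += exec_time[t][c]
    return max(load)

tau = [[1.0] * n_cores for _ in range(n_tasks)]       # pheromone trails
best, best_cost = None, float('inf')
random.seed(0)
for _ in range(ITERS):
    for _ in range(ANTS):
        assign = []
        for t in range(n_tasks):                      # each ant builds one solution
            weights = [tau[t][c] ** ALPHA * (1.0 / exec_time[t][c]) ** BETA
                       for c in range(n_cores)]
            assign.append(random.choices(range(n_cores), weights=weights)[0])
        cost = makespan(assign)
        if cost < best_cost:
            best, best_cost = assign, cost
    # evaporate, then reinforce the best-so-far assignment
    for t in range(n_tasks):
        for c in range(n_cores):
            tau[t][c] *= (1.0 - RHO)
        tau[t][best[t]] += 1.0 / best_cost
print(best, best_cost)
```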
