Similar Documents
20 similar documents retrieved.
1.
Intel, AMD, and IBM each provide vector extension libraries tailored to their own architectures, and vectorization delivers substantial speedups over traditional scalar computation. Motivated by this, a vector math library was developed for the Shenwei SW26010 processor. Building on an analysis of the series-expansion and iterative algorithms commonly used for elementary functions, an efficient vectorized algorithm was designed for the trigonometric, inverse trigonometric, exponential, and logarithmic functions, then implemented and optimized to support both high-accuracy and high-performance evaluation while satisfying floating-point requirements. Test results show that the algorithm meets the accuracy requirements of the target applications on the SW26010 processor, and that every function achieves an average speedup of at least 1.1 over the Intel VML math library.
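To make the approach concrete, here is a minimal sketch, assuming a 4-lane SIMD width and a simple Taylor polynomial, of how an elementary function such as exp can be evaluated batch-wise so that the inner loop maps onto vector registers; the range reduction and coefficients are illustrative, not the library's actual implementation.

```cpp
#include <cmath>

constexpr int kLanes = 4;  // assumed SIMD width (e.g. four doubles per register)

// Batched exp(): process kLanes elements per iteration so the compiler can
// keep the whole body in vector registers.
void vexp(const double* x, double* y, int n) {
    for (int i = 0; i < n; i += kLanes) {
        for (int l = 0; l < kLanes && i + l < n; ++l) {
            double xi = x[i + l];
            double k  = std::nearbyint(xi * 1.4426950408889634);   // x / ln2
            double r  = xi - k * 0.6931471805599453;               // reduced argument
            // low-degree polynomial for exp(r) on |r| <= ln2/2 (illustrative accuracy)
            double p  = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0))));
            y[i + l]  = std::ldexp(p, static_cast<int>(k));        // rescale by 2^k
        }
    }
}
```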

2.
This paper first introduces SIMD extension technology and analyzes three ways of using SIMD extensions, concluding that calling third-party libraries optimized for a specific target platform is the better way for application developers to quickly build efficient parallel programs. It then introduces the domestic Shenwei SW-1600 processor platform and develops SW-VML (SW Vector Math Library) using SIMD extensions and loop unrolling; during development, optimizations for aligned memory access and for simplifying vector conditional branches were proposed, solving the performance problems caused by unaligned accesses and by conversions between vector and scalar arrays. A multi-threaded OpenMP version was also developed, based on the SW compiler's OpenMP support. Finally, SW-VML was tested on the SW-1600 platform with different vector sizes: SIMD vectorization achieves a speedup of 2.08 over the serial code, and four threads achieve an average speedup of 2.26 over a single thread. SW-VML is a vector function package for developing efficient programs on the domestic Shenwei processor family, and a basic software toolkit for developing high-performance programs on a single compute node of the Sunway BlueLight platform.
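As an illustration of the loop-unrolling plus OpenMP combination described above (not SW-VML's actual source), here is a sketch with an assumed unroll factor of 4 and a plain sqrt kernel standing in for a library function:

```cpp
#include <cmath>
#include <omp.h>

void vsqrt(const double* __restrict x, double* __restrict y, long n) {
    const long unroll = 4;                 // assumed unroll factor
    long tail = n - n % unroll;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < tail; i += unroll) {
        // manual unrolling exposes independent operations to the SIMD unit
        y[i]     = std::sqrt(x[i]);
        y[i + 1] = std::sqrt(x[i + 1]);
        y[i + 2] = std::sqrt(x[i + 2]);
        y[i + 3] = std::sqrt(x[i + 3]);
    }
    for (long i = tail; i < n; ++i)        // scalar remainder loop
        y[i] = std::sqrt(x[i]);
}
```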

3.
刘芳芳, 杨超, 袁欣辉, 吴长茂, 敖玉龙. 《软件学报》 (Journal of Software), 2018, 29(12): 3921-3932
The Sunway TaihuLight, the world's first supercomputer with a peak performance above 100 PFlops, has been completed. It is built from the domestic Shenwei heterogeneous many-core processor, which differs from existing pure-CPU, CPU-MIC, and CPU-GPU architectures in adopting a master-slave core design, with a peak performance of 3 TFlops per processor and a memory bandwidth of 130 GB/s. Sparse matrix-vector multiplication (SpMV) is a very important kernel in scientific and engineering computing; it is well known to be bandwidth-bound and to involve indirect memory accesses, which makes an efficient implementation on the Shenwei processor very challenging. This paper proposes a general heterogeneous many-core parallel algorithm for CSR-format SpMV on the Shenwei processor. The algorithm is carefully designed in terms of task partitioning and LDM space partitioning; a combined static/dynamic buffering scheme is proposed to improve the hit rate of accesses to the vector x, and a combined static/dynamic task-scheduling scheme is proposed to achieve load balance. Several key factors affecting SpMV performance are also analyzed, and adaptive optimization further improves performance. Tests on 16 representative sparse matrices from the Matrix Market collection show speedups of up to about 10x over the master-core version, with an average speedup of 6.51. Based on the memory traffic of the master-core CSR SpMV, the tested matrices reach up to 86% of the processor's measured bandwidth, and 47% on average.
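For reference, the serial CSR SpMV kernel that such an algorithm partitions across the compute cores looks like the sketch below; the OpenMP pragma is only a stand-in for the paper's Shenwei-specific task scheduling and buffering.

```cpp
#include <vector>

void spmv_csr(int n_rows,
              const std::vector<int>&    row_ptr,   // size n_rows + 1
              const std::vector<int>&    col_idx,   // size nnz
              const std::vector<double>& val,       // size nnz
              const std::vector<double>& x,
              std::vector<double>&       y) {
    #pragma omp parallel for schedule(dynamic, 64)   // rows as work units
    for (int r = 0; r < n_rows; ++r) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            sum += val[k] * x[col_idx[k]];           // indirect access to x
        y[r] = sum;
    }
}
```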

4.
BLAS (Basic Linear Algebra Subprograms) is a standard for basic linear algebra routines. The library is organized into three levels, providing vector-vector (Level 1), matrix-vector (Level 2), and matrix-matrix (Level 3) operations. This paper studies optimization schemes for BLAS Level-1 functions on the Shenwei SW1621 processor. Taking AXPY as an example, the function is tuned by fully exploiting the architectural features of the platform, and an automatic thread-allocation scheme is designed. Experimental results show that the optimized Level-1 AXPY achieves single-core and multi-core speedups of up to 4.36 and 9.50, respectively, over the GotoBLAS reference implementation, and that every optimization step contributes a measurable performance gain.
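A minimal sketch of AXPY with a size-based thread-allocation heuristic in the spirit described above; the thresholds and thread counts are assumptions, not the SW1621 library's actual tuning.

```cpp
#include <omp.h>

// y = a*x + y with a simple heuristic: small vectors stay single-threaded,
// large vectors use all available threads.
void daxpy(long n, double a, const double* __restrict x, double* __restrict y) {
    int threads = (n < 10000) ? 1 : (n < 1000000 ? 4 : omp_get_max_threads());
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (long i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```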

5.
Accelerating streamline computation on 3D curvilinear grids with Streaming SIMD Extensions
张文, 李晓梅. 《计算机学报》 (Chinese Journal of Computers), 2001, 24(8): 785-790
Streamlines are a fundamental flow-visualization technique, but computing them is time-consuming. Intel processors (Pentium III, Pentium 4) provide the Streaming SIMD Extensions (SSE), which support instruction-level SIMD operation. Streamline computation on 3D curvilinear grids consists of velocity interpolation, numerical integration, and point location as its main sub-steps and exhibits a high degree of inherent SIMD parallelism. By organizing the data in SSE data types and SIMD-parallelizing the main sub-steps, an SSE algorithm for streamline computation was designed. The algorithm was implemented in two ways, using the vector class library and inline assembly, and the code was further tuned for the processor's microarchitecture. Test results show that SSE greatly accelerates streamline computation on 3D curvilinear grids: the vector-class-library version improves performance by about 55% over the conventional computation, and the inline-assembly version by about 75%.
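An illustrative SSE fragment in the spirit of the paper: advancing four streamline particles by one Euler integration step, one particle per SSE lane. The structure-of-arrays layout and the fixed step size are assumptions, not the paper's code.

```cpp
#include <xmmintrin.h>   // SSE intrinsics

// px/py/pz hold four particle positions, vx/vy/vz the interpolated velocities.
void euler_step4(float* px, float* py, float* pz,
                 const float* vx, const float* vy, const float* vz,
                 float dt) {
    __m128 h = _mm_set1_ps(dt);   // broadcast the step size to all four lanes
    _mm_storeu_ps(px, _mm_add_ps(_mm_loadu_ps(px), _mm_mul_ps(h, _mm_loadu_ps(vx))));
    _mm_storeu_ps(py, _mm_add_ps(_mm_loadu_ps(py), _mm_mul_ps(h, _mm_loadu_ps(vy))));
    _mm_storeu_ps(pz, _mm_add_ps(_mm_loadu_ps(pz), _mm_mul_ps(h, _mm_loadu_ps(vz))));
}
```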

6.
With the growing popularity of the RISC-V instruction set, a range of open-source and commercial IP soft cores has appeared for IoT smart hardware, embedded systems, AI chips, security devices, and high-performance computing. Balancing performance, power, and area requires an instruction set that can be trimmed and easily extended, together with matching software development support. Following a workflow of adding custom instructions, extending the ALU functional units, wiring up control signals and datapaths, FPGA prototyping, customizing the cross-compilation environment, and testing application programs, a customized RISC-V processor was implemented rapidly on an FPGA. Taking accelerated matrix computation as an example, a custom instruction for computing vector inner products was designed on the open-source Hummingbird E203 IP and verified on the FPGA prototype. Application tests show that the customized RISC-V processor significantly improves computational performance, with a speedup of 5.3 to 7.6 for matrix multiplication.
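A hypothetical sketch of how such a custom inner-product instruction might be exposed to C code through the RISC-V custom-0 opcode space and GNU inline assembly. The opcode and funct fields, the fixed block length of 8, and the operand meanings are all assumptions for illustration, not the instruction actually added to the E203 core.

```cpp
// Assumed encoding: rs1/rs2 = base addresses of two 8-element int arrays,
// rd = their inner product, major opcode custom-0 (0x0B).
static inline long dotp8(const int* a, const int* b) {
    long acc;
    // .insn r <opcode>, <funct3>, <funct7>, rd, rs1, rs2
    asm volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                 : "=r"(acc) : "r"(a), "r"(b));
    return acc;
}

// Plain-C baseline that the custom instruction is meant to replace.
long dotp8_ref(const int* a, const int* b) {
    long acc = 0;
    for (int i = 0; i < 8; ++i) acc += (long)a[i] * b[i];
    return acc;
}
```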

7.
With the emergence of SIMD technology, extending basic math libraries into vector math libraries has become an inevitable trend. Most functions in a basic math library have complex implementations with many branches, which makes vectorization difficult; moreover, gaps in the SIMD instruction set mean that some parts of a function cannot be vectorized directly, and the resulting frequent splitting and repacking of vectors degrades performance. To address these problems, a vectorization method for vector math libraries is proposed: by identifying the core code segment, vectorizing the data-preprocessing stage, and vectorizing the instructions, a basic math library can be vectorized quickly and effectively. Experiments show that with this method the performance of typical functions such as exp, pow, and log10 improves by 24.2% on average.
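A minimal example of the kind of data-preprocessing vectorization such a method relies on: replacing a data-dependent branch with a branch-free clamp that compilers map onto vector min/max instructions. The guard values and the use of a clamp here are illustrative assumptions, not one of the library's actual functions.

```cpp
// Clamp the argument into a safe range before the vectorized polynomial core.
void preprocess(const double* __restrict x, double* __restrict r, int n) {
    const double lo = -700.0, hi = 700.0;      // assumed overflow guards for exp
    for (int i = 0; i < n; ++i) {
        double v = x[i];
        // branch-free select pattern: compiles to vector min/max, no divergence
        v = (v < lo) ? lo : v;
        v = (v > hi) ? hi : v;
        r[i] = v;
    }
}
```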

8.
BLAS (Basic Linear Algebra Subprograms) is a basic library of routines operating on vectors and matrices. Its functions are organized into three levels, providing vector-vector (Level 1), matrix-vector (Level 2), and matrix-matrix (Level 3) operations. This paper studies the parallel implementation of BLAS Level-1 and Level-2 functions on the Shenwei many-core processor, tunes them in depth using the platform's architectural features, and summarizes parallel implementation and optimization techniques on the Shenwei platform. The SW26010 CPU adopts a heterogeneous many-core architecture whose large-scale parallelism gives a single chip 3 TFlops of double-precision floating-point performance. Experimental results show that the Level-1 and Level-2 functions achieve average speedups of up to 11.x and 6.x, respectively, over the GotoBLAS reference implementation, with a clear gain from each optimization step.
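For reference, the serial matrix-vector (Level-2) kernel that such an implementation parallelizes; the OpenMP row partitioning here is only a stand-in for the SW26010 compute-core partitioning described in the abstract.

```cpp
// y = A*x + y with a row-major m-by-n matrix A.
void dgemv(int m, int n, const double* A, const double* x, double* y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        const double* row = A + (long)i * n;
        for (int j = 0; j < n; ++j)
            sum += row[j] * x[j];          // contiguous row access, SIMD-friendly
        y[i] += sum;
    }
}
```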

9.
Targeting the domestic high-performance FT-M7002 DSP platform, optimized implementations of several types of dot-product algorithms were developed, completing the platform's math-library tool chain and fully exploiting the FT-M7002 core architecture. The dot-product kernels were optimized with SIMD vector parallelization, dual-channel DMA transfers, and SVR transfers, fully exploiting the programs' vector parallelism, raising data-transfer speed, and improving overall performance. Experimental results show that, across different input array sizes, the optimized dot-product kernels run 12.4166 to 45.2338 times faster on FT-M7002 than the unoptimized versions, and 1.3716 to 4.5196 times faster on average than the corresponding dot-product functions of TI's dsplib running on the TMS320C6678, demonstrating the computational advantage of this DSP platform over the mainstream TI platform.
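A generic double-buffering sketch of the transfer/compute overlap behind dual-channel DMA use: while one block is being multiplied, the next block is being transferred. dma_start and dma_wait are hypothetical placeholders, since the FT-M7002 DMA API is not given in the abstract, and n is assumed to be a multiple of the block size.

```cpp
extern void dma_start(double* dst, const double* src, int n);  // hypothetical DMA API
extern void dma_wait();                                        // hypothetical DMA API

double dot_pipelined(const double* a, const double* b, int n, int blk) {
    static double bufA[2][1024], bufB[2][1024];   // two on-chip buffers per input
    double sum = 0.0;
    int cur = 0;
    dma_start(bufA[cur], a, blk);                 // prime the first block
    dma_start(bufB[cur], b, blk);
    dma_wait();
    for (int off = 0; off < n; off += blk) {
        int next = cur ^ 1;
        if (off + blk < n) {                      // start fetching the next block
            dma_start(bufA[next], a + off + blk, blk);
            dma_start(bufB[next], b + off + blk, blk);
        }
        for (int i = 0; i < blk; ++i)             // compute on the current block
            sum += bufA[cur][i] * bufB[cur][i];
        if (off + blk < n) dma_wait();            // ensure the next block has arrived
        cur = next;
    }
    return sum;
}
```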

10.
Selective support vector machine ensemble based on an accelerated genetic algorithm
To improve the generalization performance of support vector machines, a selective SVM ensemble based on an accelerated genetic algorithm is proposed. Multiple independent sub-SVMs are generated and trained with the Bootstrap technique, and a fitness function is constructed on the basis of negative-correlation learning theory to improve the generalization of the sub-SVMs and increase the diversity among them. The accelerated genetic algorithm then computes the optimal weight of each sub-SVM in a weighted average, and the SVMs whose weights exceed a given threshold are selected for weighted ensembling. Experimental results show that the algorithm is an effective ensemble method that further improves the ensemble efficiency and generalization performance of SVMs.

11.
Auto-vectorization, an important means of exploiting SIMD extension units, is implemented in the LLVM compiler, but differences in vector length and instruction-set functionality mean that domestic platforms easily miss vectorization opportunities during auto-vectorization, or even suffer slowdowns after vectorization. To make full use of SIMD, the instruction cost model is refined with the instruction-set characteristics of the domestic platform so as to improve the accuracy of the profitability analysis, letting auto-vectorization generate concise, efficient vector instructions that the back end supports. On this basis, an improved control-flow vectorization method is proposed, and the added instruction cost information improves the adaptability of auto-vectorization, yielding an LLVM auto-vectorization system for the domestic platform. Experimental results show that, compared with the state before the auto-vectorizer was ported, the optimized port improves overall SPEC performance by 10.8%, the speedup on the TSVC test suite by 16%, the speedup under precise cost guidance by 42%, and the speedup under control-flow vectorization by 51%.
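A small example of the control-flow vectorization (if-conversion) problem the work addresses: the first loop contains a data-dependent branch that blocks SIMD code generation, while the second expresses the same computation as a select that a vectorizer can map onto masked or blend instructions. The loop body itself is an illustrative stand-in, not taken from the paper.

```cpp
void relu_branchy(const float* x, float* y, int n) {
    for (int i = 0; i < n; ++i) {
        if (x[i] > 0.0f)        // data-dependent branch inside the loop
            y[i] = x[i];
        else
            y[i] = 0.0f;
    }
}

void relu_ifconverted(const float* x, float* y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = (x[i] > 0.0f) ? x[i] : 0.0f;   // select form, straightforwardly vectorizable
}
```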

12.
13.
14.
Compilation Techniques for Multimedia Processors
The huge processing power needed by multimedia applications has led to multimedia extensions in the instruction set of microprocessors which exploit subword parallelism. Examples of these extended instruction sets are the Visual Instruction Set of the UltraSPARC processor, the AltiVec instruction set of the PowerPC processor, the MMX and ISS extensions of the Pentium processors, and the MAX-2 instruction set of the HP PA-RISC processor. Currently, these extensions can only be used by programs written in assembly language, through system libraries or by calling specialized macros in a high-level language. Therefore, these instructions are not used by most applications. We propose two code generation techniques to produce native code using these multimedia extensions for programs written in a high-level language: classical vectorization and vectorization by unrolling. Vectorization by unrolling is simpler than classical vectorization since data dependence analysis is reduced to acyclic control flow graph analysis. Furthermore, we address the problem of unaligned memory accesses. This can be handled by both static analysis and dynamic runtime checking. Preliminary experimental results for a code generator for the UltraSPARC VIS instruction set show that speedups of up to a factor of 4.8 are possible, and that vectorization by unrolling is much simpler but as effective as classical vectorization.
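A sketch of the dynamic runtime handling of unaligned accesses mentioned in the abstract: peel scalar iterations until the pointer reaches an assumed 16-byte alignment, then run the unrolled (vectorizable) main loop. This illustrates the idea only; it is not the paper's code generator output.

```cpp
#include <cstdint>

void scale(float* x, float s, int n) {
    int i = 0;
    // scalar peel loop: advance until x + i is 16-byte aligned
    while (i < n && (reinterpret_cast<std::uintptr_t>(x + i) & 15) != 0)
        x[i++] *= s;
    // aligned, unrolled main loop (four floats per iteration)
    for (; i + 4 <= n; i += 4) {
        x[i] *= s; x[i + 1] *= s; x[i + 2] *= s; x[i + 3] *= s;
    }
    for (; i < n; ++i)                     // scalar remainder
        x[i] *= s;
}
```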

15.
Recently, real-time processing of image recognition has been required for embedded applications such as automotive systems, robotics, and entertainment. To realize real-time image recognition on such systems, libraries optimized for embedded processors are needed. OpenCV is one of the most widely used libraries for computer vision applications and has many functions optimized for Intel processors, but none optimized for embedded processors. We present a parallel implementation of the OpenCV library on the Cell Broadband Engine (Cell), one of the most widely used high-performance embedded processors. Experimental results show that most of the functions optimized for the Cell processor are faster than the corresponding functions optimized for an Intel Core 2 Duo E6850 at 3.00 GHz.

16.
The Cell processor is a heterogeneous multi-core processor with one power processing engine (PPE) core and eight synergistic processing engine (SPE) cores. There is a significant amount of ongoing research in programming models and tools that attempts to make it easy to exploit the computation power of the Cell architecture. In our work, we explore supporting OpenMP on the Cell processor. Supporting OpenMP is attractive because programmers can continue using their familiar programming model and existing code can be reused. We base our work on IBM's XL compiler, for which we developed new compiler components and a new runtime library. Three major issues are addressed: (1) synchronization support on heterogeneous cores; (2) code generation targeting the different instruction sets; (3) data transfers and implementation of the OpenMP memory model. We present experimental results for some SPEC OMP 2001 and NAS benchmarks to demonstrate the effectiveness of this approach. A visualization tool based on Paraver is also used to provide insight into actual thread and synchronization behaviors.

17.
Building on a series of new design methods for arbitrary-order current-mode low-pass filters, a design method for nth-order, arbitrary-form current-mode signal-processing systems based on the MOCCII (multiple-output second-generation current conveyor) is proposed. The method can be used to design analog signal-processing systems of arbitrary form, and every system designed with it can realize high-pass, low-pass, band-pass, and other arbitrary signal-processing functions. The circuits are simple, and all passive components in each processor are grounded, which favors integration, so that the system design of MOCCII-based current-mode signal processors can follow a systematic procedure.

18.
Modern high energy physics experiments have to process terabytes of input data produced in particle collisions. The core of many data reconstruction algorithms in high energy physics is the Kalman filter, so the speed of Kalman filter based algorithms is of crucial importance in on-line data processing. This is especially true for the combinatorial track finding stage, where the Kalman filter based track fit is used very intensively. Developing fast reconstruction algorithms that use the maximum available power of processors is therefore important, in particular for the initial selection of events which carry signals of interesting physics. One such powerful feature, supported by almost all up-to-date PC processors, is the SIMD instruction set, which allows packing several data items into one register and operating on all of them at once, thus achieving more operations per clock cycle. The novel Cell processor extends the parallelization further by combining a general-purpose PowerPC processor core with eight streamlined coprocessing elements which greatly accelerate vector processing applications. In the investigation described here, after a significant memory optimization and a comprehensive numerical analysis, the Kalman filter based track fitting algorithm of the CBM experiment has been vectorized using inline operator overloading, so the algorithm remains flexible with respect to the CPU family used for data reconstruction. With these changes the SIMDized Kalman filter based track fitting algorithm takes 1 μs per track, which is 10000 times faster than the initial version; porting the algorithm to a Cell Blade computer gives another factor of 10 speedup. Finally, we compare the performance of the tracking algorithm running on three different CPU architectures: Intel Xeon, AMD Opteron and Cell Broadband Engine.
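A sketch of the operator-overloading approach described above: the filter is written once against a small vector type, and only that type changes between a scalar build and a SIMD build, keeping the algorithm portable across CPU families. The SSE-based definition shown here is an assumed x86 specialization for illustration, not the CBM code.

```cpp
#include <xmmintrin.h>

struct fvec {                               // four single-precision lanes per object
    __m128 v;
    fvec() : v(_mm_setzero_ps()) {}
    explicit fvec(float a) : v(_mm_set1_ps(a)) {}
    fvec(__m128 a) : v(a) {}
    friend fvec operator+(fvec a, fvec b) { return fvec(_mm_add_ps(a.v, b.v)); }
    friend fvec operator-(fvec a, fvec b) { return fvec(_mm_sub_ps(a.v, b.v)); }
    friend fvec operator*(fvec a, fvec b) { return fvec(_mm_mul_ps(a.v, b.v)); }
};

// A prediction step written once in terms of fvec processes four tracks at a time.
inline fvec predict(fvec x, fvec v, fvec dt) { return x + v * dt; }
```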

19.
Many embedded platforms consist of a heterogeneous collection of processing elements, memory modules, and communication subsystems. These components often implement different scheduling/arbitration policies, have different interfaces, and are supplied by different vendors. Hence, compositional techniques for modeling and analyzing such platforms are of interest. In prior work, the real-time calculus framework has proven to be very effective in this regard. However, real-time calculus has heretofore been limited to systems with uniprocessor processing elements, which is a serious impediment given the advent of multicore technologies. In this paper, a two-step approach is proposed that allows the power of real-time calculus to be applied in globally-scheduled multiprocessor systems: first, assuming that job response-time bounds are given, determine whether these bounds are met; second, using these bounds, determine the resulting residual processor supply and streams of job completion events using formalisms from real-time calculus. For this methodology to be applied in settings where response-time bounds are not specified, such bounds must be determined. Closed-form expressions for calculating such response-time bounds are presented for a large family of fixed-job-priority schedulers. We have also applied the developed analysis framework in a case study.

20.
A holistic recognition method for engineering drawings based on graphic topological features
This paper attempts to imitate the human process of visual perception of drawings in order to vectorize scanned images of engineering drawings. The method first analyzes the connectivity between node regions and line segments in the image description, obtaining the topological features of the drawing over a relatively large extent; it then analyzes and recognizes the local details of the drawing on the basis of these topological features, and uses the understanding of local details to refine the understanding of the whole. Through this whole-to-local and local-to-whole process, the vectorization of scanned engineering drawings is completed.
