期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

宋广辉郭绍忠赵捷陶小涵李飞许瑾晨《软件学报》2023,34(12):5704-5723

混合精度在深度学习和精度调整与优化方面取得了许多进展,广泛研究表明,面向Stencil计算的混合精度优化也是一个很有挑战性的方向.同时,多面体模型在自动并行化领域取得的一系列研究成果表明,该模型为循环嵌套提供很好的数学抽象,可以在其基础上进行一系列的循环变换.基于多面体编译技术设计并实现了一个面向Stencil计算的自动混合精度优化器,通过在中间表示层进行迭代空间划分、数据流分析和调度树转换,首次实现了源到源的面向Stencil计算的混合精度优化代码自动生成.实验表明,经过自动混合精度优化之后的代码,在减少精度冗余的基础上能够充分发挥其并行潜力,提升程序性能.以高精度计算为基准,在x86平台上最大加速比是1.76,几何平均加速比是1.15;在新一代国产申威平台上最大加速比是1.64,几何平均加速比是1.20. 相似文献

2.

一种面向异构众核处理器的并行编译框架 总被引：1，自引：0，他引：1

李雁冰赵荣彩韩林赵捷徐金龙李颖颖《软件学报》2019,30(4):981-1001

异构众核处理器是面向高性能计算领域处理器发展的重要趋势,但其更为复杂的体系结构使得编程难的问题更加突出.针对这一问题,基于开源编译器Open64,提出了一种面向异构众核处理器的并行编译框架,将程序自动转换为异构并行程序.该框架主要包括4个模块：任务划分模块用来识别适合进行加速计算的程序段,实现了嵌套循环的多维并行识别方法;数据布局模块完成数据在主存和SPM之间的布局,实现了数组边界分析和指针范围分析;传输优化模块实现了数据传输合并、传输外提、打包传输、数组转置等多种数据传输优化方法;收益评估模块在构建代价模型的基础上实现了一种动静结合的收益评估方法.并且,基于SW26010处理器,对该编译框架进行了实现,测试结果表明,该编译框架能够实现一些程序以面向异构众核结构的并行变换,且获得较好的加速效果. 相似文献

3.

面向异构多核处理器的的循环分块

李雁冰赵荣彩赵博黄品丰《计算机工程与设计》2015,36(1):168-173

将OpenACC编程模型用于异构多核处理器时,由于异构多核处理器加速设备内存有限,操作大量数据的代码不能获得很好的加速。针对这一问题,在OpenACC中引入循环分块子句,对循环进行分块处理,使每个循环块使用的数据能够存储在设备内存中;提出面向异构多核处理器的循环分块子句生成算法,并在基于Open64的"源-源"自动并行化系统Auto-ACC中进行实现。测试结果表明,在异构多核处理器上,扩展的循环分块子句及所提生成算法能够对程序进行明显的加速。相似文献

4.

面向MPI代码生成的Open64编译器后端

《计算机学报》2014,(7)

随着计算机体系结构的发展,分布式存储结构以其良好的扩展性逐渐占据了高性能计算机体系结构市场的主导地位.为了将现有的串行程序转换为能够在高性能计算机上运行的并行程序,研究人员提出了并行化编译器.然而,当前面向分布存储并行系统的编译器发展却相对较慢,而面向共享存储并行系统的编译器及其相应技术已逐渐成熟.一种开发面向分布存储并行系统编译器的可行方法是改进现有的面向共享存储并行系统的编译器,使其自动生成能够在分布存储结构高性能计算机上运行的MPI(Message Passing Interface)并行程序.因此,该文为面向共享存储并行系统的编译器Open64设计并实现了一个支持MPI代码生成的后端.根据分布式并行化编译的特点,主要从自动生成计算划分、改进循环优化和自动生成MPI并行代码3个方面对Open64进行了改进,使其能够实现面向分布存储的并行化编译.实验测试利用带有MPI后端的Open64对串行程序进行编译,生成的MPI并行代码可直接运行在具有分布存储结构的高性能计算机上.通过将该MPI并行代码的执行效率与传统面向分布存储并行系统编译器生成的MPI代码效率进行比较,并行效率有明显的提升. 相似文献

5.

基于OPS的计算流体力学软件多平台自动并行

王巍车永刚徐传福王正华《计算机工程与科学》2021,43(5):773-781

当前高性能计算机体系结构呈现多样性特征,给并行应用软件开发带来巨大挑战.采用领域特定语言OPS对高阶精度计算流体力学软件HNSC进行面向多平台的并行化,使用OPS API实现了代码的重构,基于OPS前后端自动生成了纯M PI、OpenM P、M PI+OpenM P和M PI+CUDA版本的可执行程序.在一个配有2块Intel Xeon CPU E5-2660 V3 CPU和1块NVIDIA Tesla K80 GPU的服务器上的性能测试表明,基于O PS自动生成的并行代码性能与手工并行代码的性能可比甚至更优,并且O PS自动生成的GPU并行代码相对于其CPU并行代码有明显的性能加速.测试结果说明,使用OPS等领域特定语言进行面向多平台的计算流体力学并行软件开发是一种可行且高效的途径. 相似文献

6.

面向异构多核架构的自适应编译框架

《计算机学报》2014,(7)

针对应用在移植到异构多核高性能计算机系统中所面临的可移植性差以及性能优化难度大的问题,文中提出一种面向异构多核架构的自适应编译框架.通过源到源编译解决传统并行编程模型应用向异构多核架构的映射问题;同时利用动态剖分信息,自适应地调整插桩并配置优化策略,形成迭代式的自动优化过程.文中自适应编译框架将软硬件映射机制与优化策略结合,有效地解决了同构并行应用向异构多核架构的移植问题并提高了应用的整体性能.实验结果表明,文中基于Cell架构实现的原型系统,很好地解决了异构多核架构下应用移植性等问题,同时应用性能有所提高. 相似文献

7.

面向规则DOACROSS循环的流水并行代码自动生成

刘晓娴赵荣彩赵捷徐金龙《软件学报》2014,25(6):1154-1168

发掘DOACROSS 循环中蕴含的并行性,选择合适的策略将其并行执行,对提升程序的并行性能非常重要.流水并行方式是规则DOACROSS 循环并行的重要方式.自动生成性能良好的流水并行代码是一项困难的工作,并行编译器对程序自动并行时常常对DOACROSS 循环作保守处理,损失了DOACROSS 循环包含的并行性,限制了程序的并行性能.针对上述问题,设计了一种选择计算划分循环层和循环分块层的启发式算法,给出了一个基于流水并行代价模型的循环分块大小计算公式,并使用计数信号量进行并行线程之间的同步,实现了基于OpenMP 的规则DOACROSS 循环流水并行代码的自动生成.通过对有限差分松弛法（finite difference relaxation,简称FDR）的波前（wavefront）循环和时域有限差分法（finite difference time domain,简称FDTD）中典型循环以及程序Poisson,LU 和Jacobi 的测试,算法自动生成的流水并行代码能够在多核处理器上获得明显的性能提升,使用的流水分块大小计算公式能够较为精确地计算出循环流水并行时的最佳分块大小.自动生成的流水并行代码与基于手工选择的最优分块大小的流水并行代码相比,加速比达到手工选择加速比的89%. 相似文献

8.

自动并行化中不规则循环的通信代码生成

傅立国姚远丁锐《计算机应用》2014,34(4):1014-1018

不规则计算在大规模并行应用中广泛存在。在面向分布存储结构的自动并行化过程中,较难在编译时为不规则循环生成并行代码。并行代码中的通信代码对程序运行结果的正确性以及加速效果有着严重的影响。通过分析程序的数组重分布图,使用部分冗余的通信方式来维持不规则数组访问的生产者消费者关系,可以在编译时为一类常见的不规则循环自动生成有效的通信代码。该方法使用计算分解和数组引用的访问表达式求解不规则数组在各处理器的本地定义集作为通信的数据集,分析针对此类不规则循环划分的通信策略,继而生成相应的通信代码。实验测试的结果取得了预期的加速效果,验证了方法的有效性。相似文献

9.

一种面向SIMD扩展部件的向量化统一架构

刘鹏赵荣彩赵博高伟《计算机科学》2014,41(9):28-31,44

随着多媒体应用的普及和高性能计算的需求,越来越多的处理器集成了SIMD扩展。为了针对不同SIMD扩展部件自动生成高效的向量化代码,设计了一套虚拟向量指令集,在此基础上构建了一种面向SIMD扩展部件的向量化统一架构。将输入程序通过向量识别等阶段转变为虚拟向量指令的中间表示,而后通过向量长度解虚拟化和指令集解虚拟化,将其转变为特定SIMD部件的向量指令集。在申威1600、DSP和Alpha上的实验结果表明:统一架构能够针对3种平台自动变换出高效的向量化代码,在DSP上的加速比要明显优于其它两种平台。相似文献

10.

面向应用的可重构编译器ASCRA(英文) 总被引：1，自引：0，他引：1

下载免费PDF全文

吴艳霞顾国昌孙延腾杨敏杨杰牛晓霞孙霖《计算机科学与探索》2011,5(3):267-279

在很多应用领域已经开展了可重构计算的研究,但是由于缺乏高层设计工具,设计者需要较深的软件和硬件专业知识才能开发GPP/RAU架构的程序,阻碍了其大规模应用。提出了一种面向应用的可重构编译器——ASCRA的初始架构,它可以自动将C语言映射为VHDL语言,从而解决可重构计算中自动编译工具的瓶颈。ASCRA编译器主要研究软硬件划分技术和面向硬件的优化技术,如脉动阵列、循环流水技术。在ML505开发平台上,设计实现了ASCRA编译器的验证平台,并通过实验给出了核心程序段生成VHDL代码的综合信息。相似文献

11.

Towards optimized tensor code generation for deep learning on sunway many-core processor

Mingzhen LI Changxi LIU Jianjin LIAO Xuegui ZHENG Hailong YANG Rujun SUN Jun XU Lin GAN Guangwen YANG Zhongzhi LUAN Depei QIAN 《Frontiers of Computer Science》2024,18(2):182101

The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. In the meanwhile, the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM that extends the original TVM to support ahead-of-time compilation for architecture requiring cross-compilation such as Sunway. In addition, we leverage the architecture features during the compilation such as core group for massive parallelism, DMA for high bandwidth memory transfer and local device memory for data locality, in order to generate efficient codes for deep learning workloads on Sunway. The experiment results show that the codes generated by swTVM achieve 1.79

×

improvement of inference latency on average compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap of deep learning and Sunway processor particularly with productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and Sunway many-core processor. 相似文献

12.

High‐level specifications for automatically generating parallel code

Alejandro Acosta Francisco Almeida Ignacio Pelez 《Concurrency and Computation》2013,25(7):989-1012

The arrival of multicore systems, along with the speed‐up potential available in graphics processing units, has given us unprecedented low‐cost computing power. These systems address some of the known architecture problems but at the expense of considerably increased programming complexity. Heterogeneity, at both the architectural and programming levels, poses a great challenge to programmers. Many proposals have been put forth to facilitate the job of programmers. Leaving aside proposals based on the development of new programming languages because of the effort this represents for the user (effort to learn and reuse code), the remaining proposals are based on transforming sequential code into parallel code, or on transforming parallel code designed for one architecture into parallel code designed for another. A different approach relies on the use of skeletons. The programmer has available set of parallel standards that comprise the basis for developing parallel code while programming sequential code. In this context, we propose a methodology for developing an automatic source‐to‐source transformation in a specific domain. This methodology is instantiated in a framework aimed at solving dynamic programming problems. Using this framework, the final user (a physician, mathematician, biologist, etc.) can express her problem using an equation in Latex, and the system will automatically generate the optimal parallel code for homogeneous or heterogeneous architectures. This approach allows for great portability toward these new emerging architectures and for great productivity, as evidenced by the computational results.Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

13.

下一代网络处理器及应用综述

赵玉宇程光刘旭辉袁帅唐路《软件学报》2021,32(2):445-474

网络处理器作为能够完成路由查找、高速分组处理以及QoS保障等主流业务的网络设备核心计算芯片,可以结合自身可编程性完成多样化分组处理需求,适配不同网络应用场景.面向超高带宽及智能化终端带来的网络环境转变,高性能可演进的下一代网络处理器设计是网络通信领域的热点问题,受到学者们的广泛关注.融合不同芯片架构优势、高速服务特定业... 相似文献

14.

神威E级原型机互连网络和消息机制

高剑刚卢宏生何王全任秀江陈淑平斯添浩周舟胡舒凯于康魏迪《计算机学报》2021,44(1):222-234

本文描述了神威E级原型机的互连网络和消息机制.神威E级原型机是继神威蓝光、神威?太湖之光之后神威家族的第三代计算机.该计算机作为一台E级计算机的原型机,峰值性能3.13PFlops,其最大的特色之一就是采用28Gbps传输技术,设计开发了新一代的神威高阶路由器和神威高性能网络接口两款芯片,在传统胖树的基础上,设计了双轨... 相似文献

15.

A framework to generate domain-specific manycore architectures from dataflow programs

《Microprocessors and Microsystems》2020

In the last 15 years we have seen, as a response to power and thermal limits for current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, to further improve performance and energy efficiency, when there are potentially hundreds of computing cores on a chip, we see a need for a specialization of individual cores and the development of heterogeneous manycore computer architectures.However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain specific manycore architectures based on RISC-V instruction set architecture and automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and cycle accurate emulator for the generated architecture.We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usages. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than the general purpose counterparts. In certain cases, replacing general purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures. 相似文献

16.

Welcome to the opportunities of binary translation 总被引：2，自引：0，他引：2

Altman E.R. Kaeli D. Sheffer Y. 《Computer》2000,33(3):40-45

A new processor architecture poses significant financial risk to hardware and software developers alike, so both have a vested interest in easily porting code from one processor to another. Binary translation offers solutions for automatically converting executable code to run on new architectures without recompiling the source code 相似文献

17.

Characterizing the challenges and evaluating the efficacy of a CUDA-to-OpenCL translator

Mark Gardner Paul SathreWu-chun Feng Gabriel Martinez 《Parallel Computing》2013

The proliferation of heterogeneous computing systems has led to increased interest in parallel architectures and their associated programming models. One of the most promising models for heterogeneous computing is the accelerator model, and one of the most cost-effective, high-performance accelerators currently available is the general-purpose, graphics processing unit (GPU). 相似文献

18.

Micro-Task Processing in Heterogeneous Reconfigurable Systems

下载免费PDF全文

Sebastian Wallner 《计算机科学技术学报》2005,20(5):624-634

New reconfigurable computing architectures are introduced to overcome some of the limitations of conventional microprocessors and fine-grained reconfigurable devices (e.g., FPGAs). One of the new promising architectures are Configurable System-on-Chip (CSoC) solutions. They were designed to offer high computational performance for real-time signal processing and for a wide range of applications exhibiting high degrees of parallelism. The programming of such systems is an inherently challenging problem due to the lack of an programming model. This paper describes a novel heterogeneous system architecture for signal processing and data streaming applications. It offers high computational performance and a high degree of flexibility and adaptability by employing a micro Task Controller (mTC) unit in conjunction with programmable and configurable hardware. The hierarchically organized architecture provides a programming model, allows an efficient mapping of applications and is shown to be easy scalable to future VLSI technologies. Several mappings of commonly used digital signal processing algorithms for future telecommunication and multimedia systems and implementation results are given for a standard-cell ASIC design realization in 0.18 micron 6-layer UMC CMOS technology. 相似文献

19.

Generating data transfers for distributed GPU parallel programs

F. Silber-Chaussumier A. Muller R. Habel 《Journal of Parallel and Distributed Computing》2013

Nowadays, high performance applications exploit multiple level architectures, due to the presence of hardware accelerators like GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node between the CPU and the accelerators and between computing nodes. We consider the case where the intra-node parallelism is handled with HMPP compiler directives and message-passing programming with MPI is used to program the inter-node communications. This way of programming on such an heterogeneous architecture is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into an heterogeneous HMPP + MPI exploiting multiple GPUs located on different computing nodes. 相似文献