Memory Access and Communication Fusion Compilation Optimization for Sunway Many-Core Processors
Original title: 申威众核处理器访存与通信融合编译优化
Cite this article: FANG Yan-Fei, LI Yan-Bing, DONG En-Ming, WANG Yun-Fei, LIU Qi. Memory Access and Communication Fusion Compilation Optimization for Sunway Many-Core Processors[J]. Journal of Software, 2024, 35(6).
Authors: FANG Yan-Fei  LI Yan-Bing  DONG En-Ming  WANG Yun-Fei  LIU Qi
Affiliation: National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
Funding: Fund of the Laboratory for Advanced Computing and Intelligence Engineering (national level); National Key Research and Development Program of China, Key Special Project (2021YFB0301100)
Abstract: The on-chip multi-level memory hierarchy of Sunway many-core processors is an important structure for alleviating the "memory wall". The fully software-managed SPM structure and the on-chip RMA communication mechanism offer many opportunities to improve application performance, but they also pose great challenges for application development, optimization, and porting. To fully exploit the characteristics of the on-chip memory hierarchy for better application performance while reducing the optimization burden on programmers, this paper proposes a compilation optimization method that fuses memory access and communication across the multi-level memory hierarchy. The method first designs fusion compiler directives that pass high-level program information to the compiler. It then constructs a compilation optimization benefit model and designs a heuristic, iterative framework for solving loop optimization schemes; the compiler both solves for the loop optimization scheme and performs the code transformation. Through compiler-generated DMA and RMA batch data transfer operations, core data residing in lower memory levels with high access latency is buffered in batches into higher memory levels with low access latency. Optimization experiments and analysis on three typical test cases show that the performance achieved by the proposed optimization is comparable to manual optimization and significantly better than that of the unoptimized programs.

Keywords: Sunway many-core processor  multi-level memory hierarchy  RMA communication  parallel language  compiler optimization
Received: 2023/9/11
Revised: 2023/10/30