期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

李雁冰赵荣彩韩林赵捷徐金龙李颖颖《软件学报》2019,30(4):981-1001

异构众核处理器是面向高性能计算领域处理器发展的重要趋势,但其更为复杂的体系结构使得编程难的问题更加突出.针对这一问题,基于开源编译器Open64,提出了一种面向异构众核处理器的并行编译框架,将程序自动转换为异构并行程序.该框架主要包括4个模块：任务划分模块用来识别适合进行加速计算的程序段,实现了嵌套循环的多维并行识别方法;数据布局模块完成数据在主存和SPM之间的布局,实现了数组边界分析和指针范围分析;传输优化模块实现了数据传输合并、传输外提、打包传输、数组转置等多种数据传输优化方法;收益评估模块在构建代价模型的基础上实现了一种动静结合的收益评估方法.并且,基于SW26010处理器,对该编译框架进行了实现,测试结果表明,该编译框架能够实现一些程序以面向异构众核结构的并行变换,且获得较好的加速效果. 相似文献

2.

Towards Efficient Short-Range Pair Interaction on Sunway Many-Core Architecture

下载免费PDF全文

Jun-Shi Chen Hong An Wen-Ting Han Zeng Lin Xin Liu 《计算机科学技术学报》2021,36(1):123-139

The short-range pair interaction consumes most of the CPU time in molecular dynamics(MD)simulations.The inherent computation sparsity makes it challenging to achieve high-performance kernel on the emerging many-core ar-chitecture.In this paper,we present a highly efficient short-range force kernel on the Sunway,a novel many-core architecture with many unique features.The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by the poor data locality and write conflicts.To enhance the data locality,we adopt a super cluster based neighbor list with an appropriate granularity that fits in the local memory of computing cores.In the absence of a low overhead locking mechanism,using data-privatization force array is a more feasible method to avoid write conflicts,but results in the large overhead of data reduction.We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks,which utilizes the on-chip data communication to reduce data reduction overhead and provide load balancing.Moreover,we exploit the single instruction multiple data(SIMD)parallelism and perform instruction reordering of the force kernel on this many-core processor.The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20％of peak flop rate on the Sunway many-core processor. 相似文献

3.

PipeRench: a reconfigurable architecture and compiler 总被引：1，自引：0，他引：1

Goldstein S.C. Schmit H. Budiu M. Cadambi S. Moe M. Taylor R.R. 《Computer》2000,33(4):70-77

With the proliferation of highly specialized embedded computer systems has come a diversification of workloads for computing devices. General-purpose processors are struggling to efficiently meet these applications' disparate needs, and custom hardware is rarely feasible. According to the authors, reconfigurable computing, which combines the flexibility of general-purpose processors with the efficiency of custom hardware, can provide the alternative. PipeRench and its associated compiler comprise the authors' new architecture for reconfigurable computing. Combined with a traditional digital signal processor, microcontroller or general-purpose processor, PipeRench can support a system's various computing needs without requiring custom hardware. The authors describe the PipeRench architecture and how it solves some of the pre-existing problems with FPGA architectures, such as logic granularity, configuration time, forward compatibility, hard constraints and compilation time 相似文献

4.

AutoConfig: 面向深度学习编译优化的自动配置机制

张洪滨周旭林邢明杰武延军赵琛《软件学报》2024,35(6)

随着深度学习模型和硬件架构的快速发展,深度学习编译器已经被广泛应用.目前,深度学习模型的编译优化和调优的方法主要依赖基于高性能算子库的手动调优和基于搜索的自动调优策略.然而,面对多变的目标算子和多种硬件平台的适配需求,高性能算子库往往需要为各种架构进行多次重复实现.此外,现有的自动调优方案也面临着搜索开销大和缺乏可解释性的挑战.为了解决上述问题,本文提出了AutoConfig,一种面向深度学习编译优化的自动配置机制.针对不同的深度学习计算负载和特定的硬件平台,AutoConfig可以构建具备可解释性的优化算法分析模型,采用静态信息提取和动态开销测量的方法进行综合分析,并基于分析结果利用可配置的代码生成技术自动完成算法选择和调优.本文创新性地将优化分析模型与可配置的代码生成策略相结合,不仅保证了性能加速效果,还减少了重复开发的开销,同时简化了调优过程.在此基础上,本文进一步将AutoConfig集成到深度学习编译器Buddy Compiler中,对矩阵乘法和卷积的多种优化算法建立分析模型,并将自动配置的代码生成策略应用在多种SIMD硬件平台上进行评估.实验结果验证了AutoConfig在代码生成策略中有效地完成了参数配置和算法选择.与经过手动或自动优化的代码相比,由AutoConfig生成的代码可达到相似的执行性能,并且无需承担手动调优的重复实现开销和自动调优的搜索开销. 相似文献

5.

申威众核处理器访存与通信融合编译优化

方燕飞李雁冰董恩铭王云飞刘齐《软件学报》2024,35(6)

申威众核片上多级存储层次是缓解众核“访存墙”的重要结构.完全由软件管理的SPM结构和片上RMA通信机制给应用性能提升带来很多机会,但也给应用程序开发优化与移植提出了很大挑战.为充分挖掘片上存储层次特点提升应用程序性能,同时减轻用户编程优化负担,本文提出了一种多级存储层次访存与通信融合的编译优化方法.该方法首先设计了融合编译指示,将程序高层信息传递给编译器.其次构建了编译优化收益模型并设计了启发式循环优化方案迭代求解框架,并由编译器完成循环优化方案的求解和优化代码的变换.通过编译生成的DMA和RMA批量数据传输操作,将较低存储层次空间中高访问延迟的核心数据批量缓冲进低访问延迟的更高存储层次空间中.在三个典型测试用例上进行了优化实验测试与分析,结果表明本文所提出的优化在性能上与手工优化相当,较未优化版程序性能有显著提升. 相似文献

6.

swLLVM: 面向神威新一代超级计算机的优化编译器

沈莉周文浩王飞肖谦武文浩张鲁飞安虹漆锋滨《软件学报》2024,35(5):2359-2378

异构众核架构具有超高的能效比, 已成为超级计算机体系结构的重要发展方向. 然而, 异构系统的复杂性给应用开发和优化提出了更高要求, 其在发展过程中面临好用性和可编程性等众多技术挑战. 我国自主研制的神威新一代超级计算机采用了国产申威异构众核处理器SW26010Pro. 为了发挥新一代众核处理器的性能优势, 支撑新兴科学计算应用的开发和优化, 设计并实现面向SW26010Pro平台的优化编译器swLLVM. 该编译器支持Athread和SDAA双模态异构编程模型, 提供多级存储层次描述及向量操作扩展, 并且针对SW26010Pro架构特点实现控制流向量化、基于代价的节点合并以及针对多级存储层次的编译优化. 测试结果表明, 所设计并实现的编译优化效果显著, 其中, 控制流向量化和节点合并优化的平均加速比分别为1.23和1.11, 而访存相关优化最高可获得2.49倍的性能提升. 最后, 使用SPEC CPU2006标准测试集从多个维度对swLLVM进行了综合评估, 相较于SWGCC的相同优化级别, swLLVM整型课题性能平均下降0.12%, 浮点型课题性能平均提升9.04%, 整体性能平均提升5.25%, 编译速度平均提升79.1%, 代码尺寸平均减少1.15%. 相似文献

7.

Superblock-based performance optimization for Sunway Math Library on SW26010 many-core processor

Cao Hao Guo Shaozhong Hao Jiangwei Xia Yuanyuan Xu Jinchen 《The Journal of supercomputing》2022,78(4):4827-4849

The Journal of Supercomputing - The SW26010 many-core processor is based on the Sunway architecture that is composed of management and computing processing elements (MPE and CPE, respectively),... 相似文献

8.

众核处理器Cache一致性研究综述

韩立敏安建峰高德远樊晓桠任向隆《计算机应用研究》2012,29(11):4011-4016

以瓦片结构众核处理器一致性协议的设计为主线,综述了国内外近年来关于众核处理器cache一致性的相关研究;介绍了不同NUCA结构对一致性协议的影响;分析和对比了几种传统目录一致性协议的特性及其存在的问题;归纳了最新几个面向众核结构一致性协议的设计思想和特性。最后为设计具备应用程序适应性和可扩展性的cache一致性协议指出了几个关键的设计方向。相似文献

9.

Baring it all to software: Raw machines 总被引：2，自引：0，他引：2

Waingold E. Taylor M. Srikrishna D. Sarkar V. Lee W. Lee V. Kim J. Frank M. Finch P. Barua R. Babb J. Amarasinghe S. Agarwal A. 《Computer》1997,30(9):86-93

The most radical of the architectures that appear in this issue are Raw processors-highly parallel architectures with hundreds of very simple processors coupled to a small portion of the on-chip memory. Each processor, or tile, also contains a small bank of configurable logic, allowing synthesis of complex operations directly in configurable hardware. Unlike the others, this architecture does not use a traditional instruction set architecture. Instead, programs are compiled directly onto the Raw hardware, with all units told explicitly what to do by the compiler. The compiler even schedules most of the intertile communication. The real limitation to this architecture is the efficacy of the compiler. The authors demonstrate impressive speedups for simple algorithms that lend themselves well to this architectural model, but whether this architecture will be effective for future workloads is an open question 相似文献

10.

How many cores do we need to run a parallel workload: A test drive of the Intel SCC platform?

Chen Liu Pollawat Thanarungroj Jean-Luc Gaudiot 《Journal of Parallel and Distributed Computing》2014

As semiconductor manufacturing technology continues to improve, it is possible to integrate more and more transistors onto a single processor. Many-core processor design has resulted in part from the search to utilize this enormous transistor real estate. The Single-Chip Cloud Computer (SCC) is an experimental many-core processor created by Intel Labs. In this paper we present a study in which we analyze this innovative many-core system by running several workloads with distinctive parallelism characteristics. We investigate the effect on system performance by monitoring specific hardware performance counters. Then, we experiment on varying different hardware configuration parameters such as number of cores, clock frequency and voltage levels. We execute the chosen workloads and collect the timing, power consumption and energy consumption information on such a many-core research platform. Thus, we can comprehensively analyze the behavior and scalability of the Intel SCC system with the introduced workload in terms of performance and energy consumption. Our results show that the profiled parallel workload execution has a communication bottleneck on the Intel SCC system. Moreover, our results indicate that we should carefully choose the number of cores to execute different workloads in order to yield a balance between execution performance and energy efficiency for different applications. 相似文献

11.

NNBlocks: a Blockly framework for AI computing

Chen Tai-Liang Chen Yi-Ru Yu Meng-Shiun Lee Jenq-Kuen 《The Journal of supercomputing》2021,77(8):8622-8652

The Journal of Supercomputing - Deep learning compiler tool, Tensor Virtual Machine (TVM), has excellent deployment, compilation, and optimization capabilities supported by the industry following... 相似文献

12.

Parallelizing and optimizing a bioinformatics pairwise sequence alignment algorithm for many-core architecture

David Díaz Francisco José Esteban Pilar Hernández Juan Antonio Caballero Gabriel Dorado Sergio Gálvez 《Parallel Computing》2011,37(4-5):244-259

Current computer engineering evolves at an accelerated pace, with hardware advancing towards new chip multiprocessors (CMP) architectures and with supporting software gearing towards new programming and abstraction paradigms, to obtain the maximum efficiency of the hardware at a low cost. In this context, Tilera Corporation has developed a brand new CMP architecture with 64 cores (tiles) called Tile64, and has launched several Peripheral Component Interconnect Express (PCIe) cards to be used and monitored from a host Personal Computer (PC). These cards may execute parallel applications built in C/C++ and compiled with the Tile-GCC compiler. We have previously demonstrated the usefulness of the Tile64 architecture for bioinformatics [S. Gálvez, D. Díaz, P. Hernández, F.J. Esteban, J.A. Caballero, G. Dorado, Next-generation bioinformatics: using many-core processor architecture to develop a web service for sequence alignment, Bioinformatics, 26 (2010) 683–686]. We have chosen a bioinformatics algorithm to test this many-core Tile64 architecture because of actual bioinformatics challenging needs: data-intensive workloads, space and time-consuming requirements and massive calculation. This algorithm, known as Needleman–Wunsch/Smith–Waterman (NW/SW), obtains an optimal sequence alignment in quadratic time and space cost, yet requires to be optimized to take full advantage of computing parallelization. In this paper we redesign, implement and fine-tune this algorithm, introducing key optimizations and changes that take advantage of specific Tile64 characteristics: RISC architecture, local tile’s cache, length of memory word, shared memory usage, RAM file system, tile’s intercommunication and job selection from a pool. The resulting algorithm – named MC64-NW/SW for Multicore64 Needleman–Wunsch/Smith–Waterman – achieves a gain of ～1000% when compared with the same algorithm on a ×86 multi-core architecture. As far as we know, our NW/SW implementation is the fastest ever published for a standalone PC when aligning a pair of sequences larger than 20 kb. 相似文献

13.

Landing Stencil Code on Godson-T

下载免费PDF全文

Hui-Min Cui Lei Wang Dong-Rui Fan Xiao-Bing Feng 《计算机科学技术学报》2010,25(4):886-894

The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers are faced with a unique opportunity to introducing new architecture features as well as adequate compiler technology — together they may have profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amendable performance optimization techniques on a many-core architecture Godson-T. Godson-T architecture has several unique features that are chosen for this study: 1) chip-level global addressable memory in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory based synchronization (e.g., full-empty bit for fine-grain synchronization). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implement a number of many-core-based optimization for Godson-T. Our experimental study shows good performance in both execution time speedup and scalability, validate the value of globally accessed SPM and fine-grain synchronization mechanism (full-empty bits) under the Godson-T, and provides some useful guidelines for future compiler technology of many-core chip architectures. 相似文献

14.

一种加速访存地址计算的编译优化

高秀武姜军白书敬黄亮明《计算机工程》2023,49(1):173-180

在国产申威高性能多核服务器系统中,基础编译系统对应用程序中访存操作进行代码生成时,没有考虑国产处理器指令特征,导致编译器生成的访存地址计算代码效率较低,影响国产高性能处理器的性能。为充分发挥国产处理器高性能计算能力,提出一种加速访存地址计算的编译优化方法。加速访存地址计算编译优化基于处理器支持带扩展因子的运算指令,在编译器后端内存地址表达式合法性检查中,添加针对乘加模式的地址计算表达式合法性检查算法,自动识别地址表达式中存在的乘加运算并进行合法性检验,对符合条件的地址表达式在代码生成阶段匹配生成带扩展因子的运算指令来快速计算访存地址,从而加快访存指令的发射与执行以及应用程序中的访存地址生成,提升访存效率。使用行业标准性能测试集SPEC CPU2006对优化效果进行评测,结果表明,相比优化前SPECspeed Integer与SPECspeed Float Point两个子集,该优化方法平均性能分别提高了2.53%与1.50%。相似文献

15.

Milepost GCC: Machine Learning Enabled Self-tuning Compiler

Grigori Fursin Yuriy Kashnikov Abdul Wahid Memon Zbigniew Chamski Olivier Temam Mircea Namolaru Elad Yom-Tov Bilha Mendelson Ayal Zaks Eric Courtois Francois Bodin Phil Barnard Elton Ashton Edwin Bonilla John Thomson Christopher K. I. Williams Michael O��Boyle 《International journal of parallel programming》2011,39(3):296-327

Tuning compiler optimizations for rapidly evolving hardware makes porting and extending an optimizing compiler for each new platform extremely challenging. Iterative optimization is a popular approach to adapting programs to a new architecture automatically using feedback-directed compilation. However, the large number of evaluations required for each program has prevented iterative compilation from widespread take-up in production compilers. Machine learning has been proposed to tune optimizations across programs systematically but is currently limited to a few transformations, long training phases and critically lacks publicly released, stable tools. Our approach is to develop a modular, extensible, self-tuning optimization infrastructure to automatically learn the best optimizations across multiple programs and architectures based on the correlation between program features, run-time behavior and optimizations. In this paper we describe Milepost GCC, the first publicly-available open-source machine learning-based compiler. It consists of an Interactive Compilation Interface (ICI) and plugins to extract program features and exchange optimization data with the cTuning.org open public repository. It automatically adapts the internal optimization heuristic at function-level granularity to improve execution time, code size and compilation time of a new program on a given architecture. Part of the MILEPOST technology together with low-level ICI-inspired plugin framework is now included in the mainline GCC. We developed machine learning plugins based on probabilistic and transductive approaches to predict good combinations of optimizations. Our preliminary experimental results show that it is possible to automatically reduce the execution time of individual MiBench programs, some by more than a factor of 2, while also improving compilation time and code size. On average we are able to reduce the execution time of the MiBench benchmark suite by 11% for the ARC reconfigurable processor. We also present a realistic multi-objective optimization scenario for Berkeley DB library using Milepost GCC and improve execution time by approximately 17%, while reducing compilation time and code size by 12% and 7% respectively on Intel Xeon processor. 相似文献

16.

面向应用的可重构编译器ASCRA(英文) 总被引：1，自引：0，他引：1

下载免费PDF全文

吴艳霞顾国昌孙延腾杨敏杨杰牛晓霞孙霖《计算机科学与探索》2011,5(3):267-279

在很多应用领域已经开展了可重构计算的研究,但是由于缺乏高层设计工具,设计者需要较深的软件和硬件专业知识才能开发GPP/RAU架构的程序,阻碍了其大规模应用。提出了一种面向应用的可重构编译器——ASCRA的初始架构,它可以自动将C语言映射为VHDL语言,从而解决可重构计算中自动编译工具的瓶颈。ASCRA编译器主要研究软硬件划分技术和面向硬件的优化技术,如脉动阵列、循环流水技术。在ML505开发平台上,设计实现了ASCRA编译器的验证平台,并通过实验给出了核心程序段生成VHDL代码的综合信息。相似文献

17.

Hardware Transactional Memory Exploration in Coherence-Free Many-Core Architectures

Dimitra Papagiannopoulou Andrea Marongiu Tali Moreshet Luca Benini Maurice Herlihy R. Iris Bahar 《International journal of parallel programming》2018,46(6):1304-1328

High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access costs. In order to keep the cores and memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories, rather than hardware managed caches that require some form of cache coherence management. These “coherence-free” systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges in the design of hardware speculative support. In this article, we present a new scheme for hardware transactional memory (HTM) support within a cluster-based, many-core embedded system that lacks an underlying cache-coherence protocol. We propose two alternative data versioning implementations for the HTM support, Full-Mirroring and Distributed Logging and we conduct a performance comparison between them. To the best of our knowledge, these are the first designs for speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our designs can achieve significant performance improvements over traditional lock-based schemes. 相似文献

18.

众核处理器片上同步机制和评估方法研究 总被引：1，自引：0，他引：1

徐卫志宋风龙刘志勇范东睿余磊张帅《计算机学报》2010,33(10)

同步机制是片上多核/众核处理器正确执行和协同通信的关键,其效率对处理器的性能非常重要.针对片上众核体系结构,提出并实现了两种粗粒度同步机制和一种细粒度同步机制,即片上专用硬件支持的同步机制、基于原语的片上互斥访问同步机制和基于满空标志位的细粒度同步机制;提出了粗粒度同步机制的评估标准和评估方法,并设计了量化评估程序.以片上同构众核处理器Godson-T模拟器和AMD Opteron商业片上多核处理器为平台,评估比较了提出的硬件支持的同步机制与基于原语的同步机制的性能.结果表明,硬件支持可以使得片上众核处理器的同步机制性能明显提高;在传统基于原语的同步机制中,大部分性能损失是由于负载不平衡和同步点的串行化操作而造成的等待时间. 相似文献

19.

低速率语音编解码专用处理器设计

宋辉崔慧娟唐昆《微计算机信息》2006,22(26):19-21

本文给出一种基于编码速率600bps的高质量声码器算法的专用处理器设计。介绍了语音编解码算法原理,专用处理器的体系结构,汇编器的开发和算法的移植。采用软硬件协同设计的方法,大大降低了算法的存储复杂度和运算复杂度,并在电路中验证了声码器地正确性。相似文献

20.

Larrabee: A Many-Core x86 Architecture for Visual Computing

《Micro, IEEE》2009,29(1):10-21

The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The article describes the Larrabee architecture, a software renderer optimized for it, and other highly parallel applications. The article analyzes performance through scalability studies based on real-world workloads. 相似文献