期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

刘晓娴赵荣彩丁锐《计算机科学》2013,40(9):38-43

多核处理器能够提升多线程程序的性能,但早已存在的诸多单线程程序无法从中获益,程序员也习惯于编写单线程程序.自动并行化技术是将单线程程序移植到多核上的重要手段,但是当循环中存在无法确定的数据依赖或复杂的控制流时,传统的自动并行化技术无法取得良好效果.Ottoni等人针对传统自动并行失败的循环提出了Decoupled Software Pipelining(DSWP)算法用以实现指令级的细粒度并行,但其需要对处理器体系结构的深入了解以及对核间通信队列和专用指令的硬件支持,并行性能和应用广泛性受到限制.基于OpenMP应用编程接口实现的DSWP并行不依赖于硬件上对核间通信队列和专用指令的支持,且不受平台的限制,但现有的OpenMP任务调度机制无法满足DSWP并行中对任务调度的需求.对现有的OpenMP任务调度机制进行扩展,增加了任务与线程绑定的属性,保证了基于OpenMP的DSWP并行程序的正确执行.在GCC的OpenMP运行库libgomp中扩展了任务绑定属性子句的功能,扩展后的GCC作为OpenMP DSWP程序的基础编译器,为自动并行提供支持.通过对基准测试集NPB3.3.1的测试表明,传统自动并行失败的循环,经OpenMP DSWP自动并行后在双核处理器上平均加速比达到1.23以上;使用添加了OpenMP DSWP算法的Open64编译器生成的并行程序,与仅使用传统自动并行方法的Intel 编译器和Open64编译器所得程序相比,平均加速比分别高出22％和26％. 相似文献

2.

面向规则DOACROSS循环的流水并行代码自动生成

刘晓娴赵荣彩赵捷徐金龙《软件学报》2014,25(6):1154-1168

发掘DOACROSS 循环中蕴含的并行性,选择合适的策略将其并行执行,对提升程序的并行性能非常重要.流水并行方式是规则DOACROSS 循环并行的重要方式.自动生成性能良好的流水并行代码是一项困难的工作,并行编译器对程序自动并行时常常对DOACROSS 循环作保守处理,损失了DOACROSS 循环包含的并行性,限制了程序的并行性能.针对上述问题,设计了一种选择计算划分循环层和循环分块层的启发式算法,给出了一个基于流水并行代价模型的循环分块大小计算公式,并使用计数信号量进行并行线程之间的同步,实现了基于OpenMP 的规则DOACROSS 循环流水并行代码的自动生成.通过对有限差分松弛法（finite difference relaxation,简称FDR）的波前（wavefront）循环和时域有限差分法（finite difference time domain,简称FDTD）中典型循环以及程序Poisson,LU 和Jacobi 的测试,算法自动生成的流水并行代码能够在多核处理器上获得明显的性能提升,使用的流水分块大小计算公式能够较为精确地计算出循环流水并行时的最佳分块大小.自动生成的流水并行代码与基于手工选择的最优分块大小的流水并行代码相比,加速比达到手工选择加速比的89%. 相似文献

3.

面向Open64的OpenMP程序优化

刘京郑启龙李彭勇郭连伟《计算机系统应用》2016,25(1):154-159

OpenMP规范了一系列的编译制导、环境变量和运行库,具有简单、可移植、支持增量并行等优点.但同时,采用FORK-JOIN模型所引起的频繁的线程管理开销也是制约OpenMP程序性能的瓶颈之一.本文讨论了如何利用并行区的合并与扩展,实现并行区的重构,并在此基础上利用Open64的IPA优化部件所提供的全局间过程分析能力,实现跨越过程边界的并行块的合并.最终实验表明,该方法有效地改进了OpenMP程序的运行性能. 相似文献

4.

利用OpenMP技术实现线性方程组并行求解

徐胜利《信息网络安全》2013,(5)

文章介绍了OpenMP的并行执行原理和语言规范,讨论了OpenMP的循环并行化、迭代相关、数据共享、任务调度等问题.接着研究了高斯-约当消元法固有的并行性,提出并行高斯-约当消元法,并基于多处理器平台HP Z620进行了测试.实验结果表明,理论分析与实验结果是一致的. 相似文献

5.

面向MPI代码生成的Open64编译器后端

《计算机学报》2014,(7)

随着计算机体系结构的发展,分布式存储结构以其良好的扩展性逐渐占据了高性能计算机体系结构市场的主导地位.为了将现有的串行程序转换为能够在高性能计算机上运行的并行程序,研究人员提出了并行化编译器.然而,当前面向分布存储并行系统的编译器发展却相对较慢,而面向共享存储并行系统的编译器及其相应技术已逐渐成熟.一种开发面向分布存储并行系统编译器的可行方法是改进现有的面向共享存储并行系统的编译器,使其自动生成能够在分布存储结构高性能计算机上运行的MPI(Message Passing Interface)并行程序.因此,该文为面向共享存储并行系统的编译器Open64设计并实现了一个支持MPI代码生成的后端.根据分布式并行化编译的特点,主要从自动生成计算划分、改进循环优化和自动生成MPI并行代码3个方面对Open64进行了改进,使其能够实现面向分布存储的并行化编译.实验测试利用带有MPI后端的Open64对串行程序进行编译,生成的MPI并行代码可直接运行在具有分布存储结构的高性能计算机上.通过将该MPI并行代码的执行效率与传统面向分布存储并行系统编译器生成的MPI代码效率进行比较,并行效率有明显的提升. 相似文献

6.

面向异构多核处理器的并行代价模型

黄品丰赵荣彩姚远赵捷《计算机应用》2013,33(6):1544-1547

现有的并行代价模型大多是面向共享存储或分布存储结构设计的,不完全适合异构多核处理器。为解决这个问题,提出了面向异构多核处理器的并行代价模型,通过定量刻画计算核心运算能力、存储访问延迟和数据传输开销对循环并行执行时间的影响,提高加速并行循环识别的准确性。实验结果表明,提出的并行代价模型能有效识别加速并行循环,将其识别结果作为后端生成并行代码的依据,可有效提高并行程序在异构多核处理器上的性能。相似文献

7.

面向神威高性能多核处理器的并行编译优化方法

周雍浩徐金龙李斌钱宏聂凯《计算机工程》2022,48(9):130-138

在神威高性能多核服务器上,自动并行化编译系统为识别和申明程序中的并行性,产生的OpenMP程序没有经过充分的优化,其采用简单的fork-join模型,存在大量的并行循环嵌套,导致运行效率低。为提升自动并行化编译系统产生的OpenMP程序的运行效率,提出一种并行域重构优化技术。并行域重构技术通过合并程序中的并行域和扩展嵌套循环中的并行域范围,减少OpenMP程序的并行域数目,降低线程组频繁创建和合并等控制开销,将简单fork-join模型的OpenMP程序转换为性能更为高效的单程序多数据模型的OpenMP程序。实验结果表明,在新一代神威高性能多核服务器SW1621平台上,并行域重构技术在NPB3.3-OMP测试集和SPEC OMP2012测试集上的运行效率分别提高了10.77%和7.94%的,可有效提升自动并行化编译系统OpenMP程序的执行效率。相似文献

8.

基于Docker的MPI和OpenMP混合编程

赵博颖肖鹏张力《计算机与现代化》2018,(5):60

针对当前搭建集群并行系统复杂且耗时等问题,提出基于Docker搭建并行系统。介绍轻量级虚拟化技术Docker的核心概念和基本架构,并基于Docker技术在Linux平台上搭建集群并行开发环境。简要阐述并行计算的思想,叙述MPI和OpenMP并行计算的基本概念和特点,针对矩阵并行乘法的算法建立MPI和OpenMP的混合编程模型,并给出混合编程模型与MPI并行编程模型以及OpenMP并行编程模型的性能对比,分析出现差异的原因。基于该混合编程模型比较Docker与传统物理机两者搭建的并行系统的并行效率。相似文献

9.

基于LLVM Pass的复杂嵌套循环自动并行化框架

马春燕吕炳旭叶许姣张雨《软件学报》2023,34(7):3022-3042

随着多核处理器的普及应用,针对嵌入式遗留系统中串行代码的自动并行化方法是研究热点.其中,针对具有非完美嵌套结构、非仿射依赖关系特征的复杂嵌套循环的自动并行化方法存在技术挑战.提出了一种基于LLVMPass的复杂嵌套循环的自动并行化框架(CNLPF).首先,提出了一种复杂嵌套循环的表示模型,即循环结构树,并将嵌套循环的正则区域自动转换为循环结构树表示;然后,对循环结构树进行数据依赖分析,构建循环内和循环间的依赖关系;最后,基于OpenMP共享内存的编程模型生成并行的循环程序.针对SPEC2006数据集中包含近500个复杂嵌套循环的6个程序案例,分别对其进行复杂嵌套循环占比统计和并行性能加速测试.结果表明,提出的自动并行化框架可以处理LLVMPolly无法优化的复杂嵌套循环,增强了LLVM的并行编译优化能力,且该方法结合Polly的组合优化,比单独采用Polly优化的加速效果提升了9%-43%. 相似文献

10.

基于多面体表示的向量化收益评估方法

下载免费PDF全文

张媛媛赵荣彩韩林《计算机工程》2012,38(7):266-268,272

循环变换可提高程序性能,但对其向量化后可能会导致代码性能损失,并不一定会得到预期性能提升。针对该问题,结合目标体系结构特征,在Open64中实现一个基于多面体表示指导循环变换的向量化收益评估模型。该模型可以有效分析各种循环变换方案的代价,选择向量化收益最大的方案组合作为最终的向量化方案。对SPEC测试集的swim等5个程序进行测试,结果表明,收益评估结果与实测向量化加速比相近,可避免盲目优化。相似文献

11.

Experiences Developing the OpenUH Compiler and Runtime Infrastructure

Barbara Chapman Deepak Eachempati Oscar Hernandez 《International journal of parallel programming》2013,41(6):825-854

The OpenUH compiler is a branch of the open source Open64 compiler suite for C, C++, and Fortran 95/2003, with support for a variety of targets including x86_64, IA-64, and IA-32. For the past several years, we have used OpenUH to conduct research in parallel programming models and their implementation, static and dynamic analysis of parallel applications, and compiler integration with external tools. In this paper, we describe the evolution of the OpenUH infrastructure and how we’ve used it to carry out our research and teaching efforts. 相似文献

12.

Optimizing OpenMP Programs on Software Distributed Shared Memory Systems

Min Seung-Jai Basumallik Ayon Eigenmann Rudolf 《International journal of parallel programming》2003,31(3):225-249

This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed system. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in very few applications only. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, resulting in 31% average improvement on our benchmarks. 相似文献

13.

A comparative performance study of common and popular task‐centric programming frameworks

Artur Podobas Mats Brorsson Karl‐Filip Faxn 《Concurrency and Computation》2015,27(1):1-28

Programmers today face a bewildering array of parallel programming models and tools, making it difficult to choose an appropriate one for each application. An increasingly popular programming model supporting structured parallel programming patterns in a portable and composable manner is the task‐centric programming model. In this study, we compare several popular task‐centric programming frameworks, including Cilk Plus, Threading Building Blocks, and various implementations of OpenMP 3.0. We have analyzed their performance on the Barcelona OpenMP Tasking Suite benchmark suite both on a 48‐core AMD Opteron 6172 server and a 64‐core TILEPro64 embedded many‐core processor. Our results show that the OpenMP offers the highest flexibility for programmers, and this flexibility comes to a cost. Frameworks supporting only a specific and more restrictive model, such as Cilk Plus and Threading Building Blocks, are generally more efficient both in terms of performance and energy consumption. However, Intel's implementation of OpenMP tasks performs the best and closest to the specialized run‐time systems. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

14.

Integrating Parallelizing Compilation Technologies for SMP Clusters

下载免费PDF全文

Xiao-BingFeng LiChen Yi-RanWang Xiao-MiAn LinMa Chun-LeiSang Zhao-QingZhang 《计算机科学技术学报》2005,20(1):0-0

In this paper, a source to source parallelizing compiler system, AutoPar, is presentd. The system transforms FORTRAN programs to multi-level hybrid MPI/OpenMP parallel programs. Integrated parallel optimizing technologies are utilized extensively to derive an effective program decomposition in the whole program scope. Other features such as synchronization optimization and communication optimization improve the performance scalability of the generated parallel programs, from both intra-node and inter-node. The system makes great effort to boost automation of parallelization. Profiling feedback is used in performance estimation which is the basis of automatic program decomposition. Performance results for eight benchmarks in NPB1.0 from NAS on an SMP cluster are given, and the speedup is desirable. It is noticeable that in the experiment, at most one data distribution directive and a reduction directive are inserted by the user in BT/SP/LU. The compiler is based on ORC, Open Research Compiler. ORC is a powerful compiler infrastructure, with such features as robustness, flexibility and efficiency. Strong analysis capability and well-defined infrastructure of ORC make the system implementation quite fast. 相似文献

15.

OpenGR: A directive-based grid programming environment

Motonori Hirano Mitsuhisa Sato Yoshio Tanaka 《Parallel Computing》2005,31(10-12):1140

A new grid programming environment for remote procedure call (RPC) based master–worker type task parallelization is presented. The environment is realized through the use of a set of compiler directives, called OpenGR, and is implemented in the present study based on the Omni OpenMP compiler system and Ninf-G grid-enabled RPC system as a parallel execution mechanism. Using OpenGR directives, existing sequential applications can be readily adapted to the grid environment as master–worker parallel programs using the RPC architecture. The combination of OpenGR and OpenMP directives also allows for the hybrid parallelization of sequential programs, supporting both synchronous and asynchronous parallelism. 相似文献

16.

A Comparison of Co-Array Fortran and OpenMP Fortran for SPMD Programming

Alan J. Wallcraft 《The Journal of supercomputing》2002,22(3):231-250

Co-Array Fortran, formally called F^––, is a small set of extensions to Fortran 90/95 for Single-Program-Multiple-Data (SPMD) parallel processing. OpenMP Fortran is a set of compiler directives that provide a high level interface to threads in Fortran, with both thread-local and thread-shared memory. OpenMP is primarily designed for loop-level directive-based parallelization, but it can also be used for SPMD programs by spawning multiple threads as soon as the program starts and having each thread then execute the same code independently for the duration of the run. The similarities and differences between these two SPMD programming models are described.Co-Array Fortran can be implemented using either threads or processes, and is therefore applicable to a wider range of machine types than OpenMP Fortran. It has also been designed from the ground up to support the SPMD programming style. To simplify the implementation of Co-Array Fortran, a formal Subset is introduced that allows the mapping of co-arrays onto standard Fortran arrays of higher rank. An OpenMP Fortran compiler can be extended to support Subset Co-Array Fortran with relatively little effort. 相似文献

17.

多核微机基于OpenMP的并行计算

蔡佳佳李名世郑锋《微机发展》2007,17(10):87-91

随着四核微机走向市场和八十核处理器在实验室研制成功,多核正引领软件研发发生基础性变化。开发人员需要在代码中添加线程来利用系统所提供的多个内核,从而提升PC应用软件的功能和性能。文中探讨在多核微机上进行并行计算的实现技术。介绍了共享存储系统并行编程接口OpenMP的模型、指令和库函数,以及Intel C 编译器9.1和Microsoft Visual Studio 2005等对OpenMP的支持;着重探讨了二维离散快速傅里叶变换并行算法的设计、实现与优化技术;展望了高性能并行计算软构件库的开发前景。相似文献