期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A compiler for exploiting nested parallelism in OpenMP programs

Xinmin Tian Jay P. Hoeflinger Grant Haab Yen-Kuang Chen Milind Girkar Sanjiv Shah 《Parallel Computing》2005,31(10-12):960

This paper presents the design and implementation of a parallelization framework and OpenMP runtime support in Intel^® C++ & Fortran compilers for exploiting nested parallelism in applications using OpenMP pragmas or directives. We conduct the performance evaluation of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel C++ compiler on Hyper-Threading Technology (HT) enabled multiprocessor systems. The performance results show that the multithreaded code generated by the Intel compiler achieved a speedup up to 4.69 on 4 processors with HT enabled for five different input video sequences for the H.264 encoder workload, and a 1.28 speedup on an HT enabled single-CPU system and 1.99 speedup on an HT-enabled dual-CPU system for the audio–visual speech recognition workload. The performance gain due to exploiting nested parallelism for leveraging Hyper-Threading Technology is up to 70% for two multimedia workloads under different multiprocessor system configurations. These results demonstrate that hyper-threading benefits can be achieved by exploiting nested parallelism through Intel compiler and runtime system support for OpenMP programs. 相似文献

2.

Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions

Chunhua Liao Daniel J. Quinlan Jeremiah J. Willcock Thomas Panas 《International journal of parallel programming》2010,38(5-6):361-378

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. Modern applications using high-level abstractions, such as C++ STL containers and complex user-defined class types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we use a source-to-source compiler infrastructure, ROSE, to explore compiler techniques to recognize high-level abstractions and to exploit their semantics for automatic parallelization. Several representative parallelization candidate kernels are used to study semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Preliminary results have shown that semantics of abstractions can help extend the applicability of automatic parallelization to modern applications and expose more opportunities to take advantage of multicore processors. 相似文献

3.

A Comparison of Co-Array Fortran and OpenMP Fortran for SPMD Programming

Alan J. Wallcraft 《The Journal of supercomputing》2002,22(3):231-250

Co-Array Fortran, formally called F^––, is a small set of extensions to Fortran 90/95 for Single-Program-Multiple-Data (SPMD) parallel processing. OpenMP Fortran is a set of compiler directives that provide a high level interface to threads in Fortran, with both thread-local and thread-shared memory. OpenMP is primarily designed for loop-level directive-based parallelization, but it can also be used for SPMD programs by spawning multiple threads as soon as the program starts and having each thread then execute the same code independently for the duration of the run. The similarities and differences between these two SPMD programming models are described.Co-Array Fortran can be implemented using either threads or processes, and is therefore applicable to a wider range of machine types than OpenMP Fortran. It has also been designed from the ground up to support the SPMD programming style. To simplify the implementation of Co-Array Fortran, a formal Subset is introduced that allows the mapping of co-arrays onto standard Fortran arrays of higher rank. An OpenMP Fortran compiler can be extended to support Subset Co-Array Fortran with relatively little effort. 相似文献

4.

基于LLVM Pass的复杂嵌套循环自动并行化框架

马春燕吕炳旭叶许姣张雨《软件学报》2023,34(7):3022-3042

随着多核处理器的普及应用,针对嵌入式遗留系统中串行代码的自动并行化方法是研究热点.其中,针对具有非完美嵌套结构、非仿射依赖关系特征的复杂嵌套循环的自动并行化方法存在技术挑战.提出了一种基于LLVMPass的复杂嵌套循环的自动并行化框架(CNLPF).首先,提出了一种复杂嵌套循环的表示模型,即循环结构树,并将嵌套循环的正则区域自动转换为循环结构树表示;然后,对循环结构树进行数据依赖分析,构建循环内和循环间的依赖关系;最后,基于OpenMP共享内存的编程模型生成并行的循环程序.针对SPEC2006数据集中包含近500个复杂嵌套循环的6个程序案例,分别对其进行复杂嵌套循环占比统计和并行性能加速测试.结果表明,提出的自动并行化框架可以处理LLVMPolly无法优化的复杂嵌套循环,增强了LLVM的并行编译优化能力,且该方法结合Polly的组合优化,比单独采用Polly优化的加速效果提升了9%-43%. 相似文献

5.

A parallel CFD rotor code using OpenMP

《Advances in Engineering Software》2001,32(8):665-671

The extended full-potential (FPX) helicopter rotor computational fluid dynamics (CFD) code of Fortran in its reduced two-dimensional version is successfully converted into a parallel version for multiprocessing. The FPX code with an internal grid generator solves the compressible full-potential equation using an approximately factored finite-difference scheme with added numerous physical modeling enhancements, including viscous boundary layers, shock-induced entropy corrections and wake-vortex embedding. The parallel version of the code uses open multi-processing (OpenMP) directives as parallel programming tool in shared-memory (SM) environment. The OpenMP code is portable and scalable, which can run on various computer platforms including UNIX platforms and Windows NT platforms. The performance study of the parallel code on SGI Origin 2000 UNIX platform is made. The results show that reasonable speedups through parallelization are obtained and that OpenMP is easy to use and an efficient parallel programming tool for the present problem. 相似文献

6.

Nebelung: Execution Environment for Transactional OpenMP

Miloš Milovanović Roger Ferrer Vladimir Gajinov Osman S. Unsal Adrian Cristal Eduard Ayguadé Mateo Valero 《International journal of parallel programming》2008,36(3):326-346

Future generations of Chip Multiprocessors (CMP) will provide dozens or even hundreds of cores inside the chip. Writing applications that benefit from the massive computational power offered by these chips is not going to be an easy task for mainstream programmers who are used to sequential algorithms rather than parallel ones. This paper explores the possibility of using Transactional Memory (TM) in OpenMP, the industrial standard for writing parallel programs on shared-memory architectures, for C, C++ and Fortran. One of the major complexities in writing OpenMP applications is the use of critical regions (locks), atomic regions and barriers to synchronize the execution of parallel activities in threads. TM has been proposed as a mechanism that abstracts some of the complexities associated with concurrent access to shared data while enabling scalable performance. The paper presents a first proof-of-concept implementation of OpenMP with TM. Some language extensions to OpenMP are proposed to express transactions. These extensions are implemented in our source-to-source OpenMP Mercurium compiler and our Software Transactional Memory (STM) runtime system Nebelung that supports the code generated by Mercurium. Hardware Transactional Memory (HTM) or Hardware-assisted STM (HaSTM) are seen as possible paths to make the tandem TM-OpenMP more scalable. In the evaluation section we show the preliminary results. The paper finishes with a set of open issues that still need to be addressed, either in OpenMP or in the hardware/software implementations of TM. 相似文献

7.

A scalable HPF implementation of a finite‐volume computational electromagnetics application on a CRAY T3E parallel system

Yi Pan Joseph J. S. Shang Minyi Guo 《Concurrency and Computation》2003,15(6):607-621

The time‐dependent Maxwell equations are one of the most important approaches to describing dynamic or wide‐band frequency electromagnetic phenomena. A sequential finite‐volume, characteristic‐based procedure for solving the time‐dependent, three‐dimensional Maxwell equations has been successfully implemented in Fortran before. Due to its need for a large memory space and high demand on CPU time, it is impossible to test the code for a large array. Hence, it is essential to implement the code on a parallel computing system. In this paper, we discuss an efficient and scalable parallelization of the sequential Fortran time‐dependent Maxwell equations solver using High Performance Fortran (HPF). The background to the project, the theory behind the efficiency being achieved, the parallelization methodologies employed and the experimental results obtained on the Cray T3E massively parallel computing system will be described in detail. Experimental runs show that the execution time is reduced drastically through parallel computing. The code is scalable up to 98 processors on the Cray T3E and has a performance similar to that of an MPI implementation. Based on the experimentation carried out in this research, we believe that a high‐level parallel programming language such as HPF is a fast, viable and economical approach to parallelizing many existing sequential codes which exhibit a lot of parallelism. Copyright © 2003 John Wiley & Sons, Ltd. 相似文献

8.

Productivity and Performance Portability of the OpenMP 3.0 Tasking Concept When Applied to an Engineering Code Written in Fortran 95

Paul Kapinos Dieter an Mey 《International journal of parallel programming》2010,38(5-6):379-395

The modeling of bevel gear cutting processes requires highly flexible data structures and algorithms. We compare the effort and performance of an OpenMP parallelization employing OpenMP 3.0 tasks with previously applied approaches like nesting parallel sections and stack-based algorithms when parallelizing recursive procedures written in Fortran 95 working on binary tree structures. We take a look at various combinations of recent hardware and Fortran compilers. 相似文献

9.

面向MPP Fortran的程序自动并行化

唐新春郭克榕《计算机研究与发展》1996,33(8):566-573

ＭＰＰＦｏｒｔｒａｎ是Ｃｒａｙ公司为分布存储、全局编址的ＣｒａｙＴ３ＤＭＰＰ系统推出的一种数据并行语言。本文首先介绍了ＭＰＰＦｏｒｔｒａｎ的主要特点，然后以该语言为例，对面向ＭＰＰ系统程序自动并行化的基本内容以及关键技术进行了分析和探讨。相似文献

10.

面向神威高性能多核处理器的并行编译优化方法

周雍浩徐金龙李斌钱宏聂凯《计算机工程》2022,48(9):130-138

在神威高性能多核服务器上,自动并行化编译系统为识别和申明程序中的并行性,产生的OpenMP程序没有经过充分的优化,其采用简单的fork-join模型,存在大量的并行循环嵌套,导致运行效率低。为提升自动并行化编译系统产生的OpenMP程序的运行效率,提出一种并行域重构优化技术。并行域重构技术通过合并程序中的并行域和扩展嵌套循环中的并行域范围,减少OpenMP程序的并行域数目,降低线程组频繁创建和合并等控制开销,将简单fork-join模型的OpenMP程序转换为性能更为高效的单程序多数据模型的OpenMP程序。实验结果表明,在新一代神威高性能多核服务器SW1621平台上,并行域重构技术在NPB3.3-OMP测试集和SPEC OMP2012测试集上的运行效率分别提高了10.77%和7.94%的,可有效提升自动并行化编译系统OpenMP程序的执行效率。相似文献

11.

基于JavaCC的C代码自动并行化的设计与实现

刘有耀杨鹏程《计算机应用》2016,36(9):2422-2426

针对当前大量遗产代码无法重复利用的问题,设计一种新的编译工具将C的串行代码转换为基于MPI+OpenMP的混合并行编程代码,降低了并行编程的开发成本。首先,通过对JavaCC的优化,实现一种可以解析C语言的词法和语法分析器,进行源代码分析并生成抽象语法树;其次,根据语法树对源代码进行控制依赖性和数据依赖性分析,产生可并行化的语句块分区;再次,按照提出的并行代码生成方法得到目标代码;最后,基于Visual Studio 2010构建目标代码仿真验证环境。实验结果表明,该工具可以较为理想地实现串行代码自动并行化,与手工编写的代码在加速比上的误差为8.2%~18.4%。相似文献

12.

面向OpenMP自动并行化的代价模型

李雁冰赵荣彩刘晓娴赵捷《软件学报》2014,25(S2):101-110

现有的OpenMP代价模型较为简单,既没有充分考虑OpenMP程序的执行细节,也无法适应不同的循环并行执行方式.针对上述问题,对最先进的产品级优化编译器Open64中已有的代价模型进行扩展,以单个并行候选循环为对象,建立一种用于OpenMP自动并行收益分析的代价模型.该模型在改进了Open64原有DOALL并行代价模型的基础上,又增加了DOACROSS流水并行代价模型和DSWP并行代价模型.实验结果表明,建立的代价模型能够较好地评估循环并行执行开销的趋势,为OpenMP自动并行化中的收益分析提供了有效的支持. 相似文献

13.

Employing nested OpenMP for the parallelization of multi-zone computational fluid dynamics applications

《Journal of Parallel and Distributed Computing》2006,66(5):686-697

In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study, we use the NanosCompiler that supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms (MPI+OpenMP and MLP) and discuss OpenMP implementation issues that affect the performance of multi-level parallel applications. 相似文献

14.

Support for Parallel and Concurrent Programming in C++

N. I. V’yukova V. A. Galatenko S. V. Samborskii 《Programming and Computer Software》2018,44(1):35-42

C++ was originally designed as a sequential programming language. For development of multithreaded applications, libraries, such as Pthreads, Windows threads, and Boost, are traditionally used. The C++11 standard introduced some basic concepts and means for developing parallel and concurrent programs, but the direct use of these low-level means requires high programming skills and significant efforts. The absence of high-level models of parallelism in C++ is somewhat compensated for by various parallel libraries and directive parallelization tools (such as OpenMP), as well as by language extensions supported by some compilers (Intel CilkPlus). Nevertheless, we still require more advanced means to express parallelism in programs at the level of language standard and language library. In this survey, we consider the means for parallel and concurrent programming that are included into the C++17 standard, as well as some capabilities that are to be expected in the future standards. 相似文献

15.

Automatic Parallelization of Array-oriented Programs for a Multi-core Machine

Wai-Mee Ching Da Zheng 《International journal of parallel programming》2012,40(5):514-531

We present the work on automatic parallelization of array-oriented programs for multi-core machines. Source programs written in standard APL are translated by a parallelizing APL-to-C compiler into parallelized C code, i.e. C mixed with OpenMP directives. We describe techniques such as virtual operations and data-partitioning used to effectively exploit parallelism structured around array-primitives. We present runtime performance data, showing the speedup of the resulting parallelized code, using different numbers of threads and different problem sizes, on a 4-core machine, for several examples. 相似文献

16.

The BonaFide C Analyzer: automatic loop-level characterization and coverage measurement

Sergio Aldea Diego R. Llanos Arturo Gonzalez-Escribano 《The Journal of supercomputing》2014,68(3):1378-1401

The advent of multicore technologies has increased the interest in parallelization techniques for existing sequential applications. These techniques include the need of detecting loops that are good candidates for parallelization, and classifying all variables of these loops according to their use, a task surprisingly hard to be carried out manually. In this paper, we introduce the BonaFide C Analyzer, an XML-based framework that combines static analysis of source code with profiling information to generate complete reports regarding all loops in a C application, including loop coverage, loop suitability for parallelization, a classification of all variables inside loops based on their accesses, and other hurdles that restrict the parallelization. This information allows to analyze how particular language constructs are used in real-world applications, and helps the programmer to parallelize the code. To show the features of the framework, we present the results of an in-depth loop characterization of C applications that are part of the SPEC CPU2006 benchmark suite. Our study shows that 47.72 % of loops present in the applications analyzed are potentially parallelizable with existent parallel programming models such as OpenMP, while an additional 37.7 % of loops could be run in parallel with the help of runtime speculative parallelization techniques. 相似文献

17.

基于Python的大规模高性能LBM多相流模拟

徐传福王曦刘舒陈世钊林玉《计算机科学》2020,47(1):17-23

Python由于具有丰富的第三方库、开发高效等优点,已成为数据科学、智能科学等应用领域最流行的编程语言之一。Python强调了对科学与工程计算的支持,目前已积累了丰富的科学与工程计算库和工具。例如,SciPy和NumPy等数学库提供了高效的多维数组操作及丰富的数值计算功能。以往,Python主要作为脚本语言,起到连接数值模拟前处理、求解器和后处理的“胶水”功能,以提升数值模拟的自动化处理水平。近年来,国外已有学者尝试采用Python代码实现求解计算功能,并在高性能计算机上开展了超大规模并行计算研究,取得了不错的效果。由于自身特点,高效大规模Python数值模拟的实现和性能优化与传统基于C/C++和Fortran的数值模拟等具有很大的不同。文中实现了国际上首个完全基于Python的大规模并行三维格子玻尔兹曼多相流模拟代码PyLBMFlow,探索了Python大规模高性能计算和性能优化方法。首先,利用NumPy多维数组和通用函数设计实现了LBM流场数据结构和典型计算内核,通过一系列性能优化并对LBM边界处理算法进行重构,大幅提升了Python的计算效率,相对于基准实现,优化后的串行性能提升了两个量级。在此基础上,采用三维流场区域分解方法,基于mpi4py和Cython实现了MPI+OpenMP混合并行;在天河二号超级计算机上成功模拟了基于D3Q19离散方法和Shan-Chen BGK碰撞模型的气液两相流,算例规模达百亿网格,并行规模达1024个结点,并行效率超过90%。相似文献

18.

应用程序并行与优化关键技术研究 总被引：4，自引：0，他引：4

莫则尧刘兴平廖振民《数值计算与计算机应用》2002,23(1):31-40

１．引言众所周知，当前流行的高性能微处理器均采用多级存储结构（寄存器、一级Ｃａｃｈｅ，二级Ｃａｃｈｅ，内存）和超标量指令流水线等多项技术，用于缓和快速的逻辑算术运算部件（主频２００ＭＨｚ－１ＧＨｚ）与较慢的访存速度（延迟１５０－５００ｎｓ，带宽２ＧＢ／ｓ）之间的不匹配．同时，这些技术使得应用程序发挥浮点峰值的比例很大程度上依赖于编译器所能挖掘的Ｃａｃｈｅ命中率的高低和指令级流水线并行度［２，４，１５］．但是，受编译技术的限制，目前串行应用程序一般只能发挥浮点峰值性能的３％－１５％，大有潜力可… 相似文献

19.

A smooth transition from serial to parallel processing in the industrial petroleum system modeling package PetroMod

H. Martin Bücker Armin I. Kauerauf Arno Rasch 《Computers & Geosciences》2008,34(11):1473-1479

Petroleum system modeling is a crucial technology to numerically simulate the generation, migration, accumulation, and loss of oil and gas through geologic time. The OpenMP programming paradigm is used to achieve modest parallelism on a shared-memory computer for the industrial petroleum system modeling package PetroMod. The significant advantage of this shared-memory parallelization approach is the simplicity of the OpenMP paradigm allowing a smooth transition to a parallel program by incrementally parallelizing a serial code. The process of using OpenMP to parallelize PetroMod is outlined and performance results on a Sun Fire E2900 are reported. 相似文献

20.

一种精确的调用图生成技术

唐新春郭春榕《计算机工程与设计》1997,18(5):34-39

调用图是过程间分析和程度自动并行化的基础。生成精确调用图可以进一步开发程序的并行性。此文针对Ｆｏｒｔｒａｎ程序，提出了一项完全消除哑过程，产生精确调用图的技术与相应的算法。该算法已在面向ＭＰＰＦｏｒｔｒａｎ的程序自动并行化工具中实现。相似文献