
1.

Dieter an Mey Samuel Sarholz Christian Terboven 《International journal of parallel programming》2007,35(5):459-476

OpenMP is widely accepted as a de facto standard for shared memory parallel programming in Fortran, C and C++. Nested parallelization was included in the first OpenMP specification, but it took a few years until the first commercially available compilers supported this optional part of the specification. We employed nested parallelization using OpenMP in three production codes: a C++ code for content-based image retrieval, a C++ code for the computation of critical points in multi-block CFD datasets, and a multi-block Navier-Stokes solver written in Fortran90. In this paper we discuss the opportunities as well as the deficiencies of the nested parallelization support in OpenMP.

2.

Parallel Compilation Optimization Method for Sunway High-Performance Multi-Core Processors

周雍浩徐金龙李斌钱宏聂凯《计算机工程》2022,48(9):130-138

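On Sunway high-performance multi-core servers, the OpenMP programs generated by the automatic parallelizing compiler, which identifies and declares the parallelism in a program, are not sufficiently optimized: they use a simple fork-join model and contain many nested parallel loops, which leads to low runtime efficiency. To improve the runtime efficiency of these generated OpenMP programs, a parallel region reconstruction optimization technique is proposed. By merging parallel regions in the program and extending the scope of parallel regions within nested loops, the technique reduces the number of parallel regions in an OpenMP program and lowers control overheads such as the frequent creation and joining of thread teams. It thereby converts an OpenMP program based on the simple fork-join model into a more efficient OpenMP program based on the single program, multiple data (SPMD) model. Experimental results show that on SW1621, a new-generation Sunway high-performance multi-core server platform, parallel region reconstruction improves runtime efficiency by 10.77% on the NPB3.3-OMP test suite and by 7.94% on the SPEC OMP2012 test suite, effectively improving the execution efficiency of the OpenMP programs generated by the automatic parallelizing compiler.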

3.

Hybrid parallel computing of minimum action method

《Parallel Computing》2013,39(10):638-651

In this work, we report a hybrid (MPI/OpenMP) parallelization strategy for the minimum action method recently proposed in [17]. For nonlinear dynamical systems, the minimum action method is a useful numerical tool to study the transition behavior induced by small noise and the structure of the phase space. The crucial part of the minimum action method is to minimize the Freidlin–Wentzell action functional. Due to the fact that the corresponding Euler–Lagrange equation is, in general, highly nonlinear and of high order, we solve the optimization problem directly instead of discretizing the Euler–Lagrange equation to provide a general but equivalent numerical framework. To enhance the efficiency of the minimum action method for general dynamical systems we consider parallel computing. In particular, we present a hybrid parallelization strategy based on MPI and OpenMP. Numerical results are presented to demonstrate the efficiency of the proposed parallelization strategy.

4.

Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions

Chunhua Liao Daniel J. Quinlan Jeremiah J. Willcock Thomas Panas 《International journal of parallel programming》2010,38(5-6):361-378

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. Modern applications using high-level abstractions, such as C++ STL containers and complex user-defined class types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we use a source-to-source compiler infrastructure, ROSE, to explore compiler techniques to recognize high-level abstractions and to exploit their semantics for automatic parallelization. Several representative parallelization candidate kernels are used to study semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Preliminary results have shown that semantics of abstractions can help extend the applicability of automatic parallelization to modern applications and expose more opportunities to take advantage of multicore processors.

5.

CLOMP: Accurately Characterizing OpenMP Application Overheads

Greg Bronevetsky John Gyllenhaal Bronis R. de Supinski 《International journal of parallel programming》2009,37(3):250-265

Despite its ease of use, OpenMP has failed to gain widespread use on large scale systems, largely due to its failure to deliver sufficient performance. Our experience indicates that the cost of initiating OpenMP regions is simply too high for the desired OpenMP usage scenario of many applications. In this paper, we introduce CLOMP, a new benchmark to characterize this aspect of OpenMP implementations accurately. CLOMP complements the existing EPCC benchmark suite to provide simple, easy to understand measurements of OpenMP overheads in the context of application usage scenarios. Our results for several OpenMP implementations demonstrate that CLOMP identifies the amount of work required to compensate for the overheads observed with EPCC. We also show that CLOMP captures limitations for OpenMP parallelization on SMT and NUMA systems. Finally, CLOMPI, our MPI extension of CLOMP, demonstrates which aspects of OpenMP interact poorly with MPI when MPI helper threads cannot run on the NIC.

6.

Performance characteristics of the multi-zone NAS parallel benchmarks

《Journal of Parallel and Distributed Computing》2006,66(5):674-685

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multi-zone, is derived from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on four different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes.

7.

Efficient Task Scheduling in the Parallel Result-Verifying Solution of Nonlinear Systems

Thomas Beelitz Bruno Lang Christian H. Bischof 《Reliable Computing》2006,12(2):141-151

Nonlinear systems occur in diverse applications, e.g., in the steady state analysis of chemical processes. If safety concerns require the results to be provably correct then result-verifying algorithms relying on interval arithmetic should be used for solving these systems. Since such algorithms are very computationally intensive, the coarse-grained inter-box parallelism should be exploited to make them feasible in practice. In this paper we briefly describe our framework SONIC for the verified solution of nonlinear systems and give detailed information about its parallelization with OpenMP and MPI. Our numerical results show that the implemented parallelization schemes are indeed successful. The more sophisticated MPI implementation seems to be superior to the easy-to-implement OpenMP version and shows almost linear speedup up to a large number of processors.

8.

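Multi-Level Parallel Programming Model and Parallel Optimization Techniques Based on SMP Clusters (total citations: 4; self-citations: 0; citations by others: 4)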

单莹吴建平王正华《计算机应用研究》2006,23(10):254-256

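This paper describes in detail the hybrid MPI/OpenMP parallel programming model, which suits the multi-level parallel architecture of SMP clusters and provides mechanisms for multi-level parallelism both between and within SMP nodes. On this basis, and in combination with practical performance evaluation methods, it introduces some commonly used and effective parallel optimization techniques at three levels (MPI, OpenMP, and the single processor) and points out that single-processor performance optimization cannot be neglected when improving the performance of parallel programs.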

9.

A hybrid message passing/shared memory parallelization of the adaptive integral method for multi-core clusters

Fangzhou Wei Ali E. Yilmaz 《Parallel Computing》2011,37(6-7):279-301

A hybrid message passing and shared memory parallelization technique is presented for improving the scalability of the adaptive integral method (AIM), an FFT based algorithm, on clusters of identical multi-core processors. The proposed hybrid MPI/OpenMP parallelization scheme is based on a nested one-dimensional (1-D) slab decomposition of the 3-D auxiliary regular grid and the associated AIM calculations: If there are M processors and T cores per processor, the scheme (i) divides the regular grid into M slabs and MT sub-slabs, (ii) assigns each slab/sub-slab and the associated operations to one of the processors/cores, and (iii) uses MPI for inter-processor data communication and OpenMP for intra-processor data exchange. The MPI/OpenMP parallel AIM is used to accelerate the solution of the combined-field integral equation pertinent to the analysis of time-harmonic electromagnetic scattering from perfectly conducting surfaces. The scalability of the scheme is investigated theoretically and verified on a state-of-the-art multi-core cluster for benchmark scattering problems. Timing and speedup results on up to 1024 quad-core processors show that the hybrid MPI/OpenMP parallelization of AIM exhibits better strong scalability (fixed problem size speedup) than pure MPI parallelization of it when multiple cores are used on each processor.
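The nested 1-D slab decomposition in the abstract above (M slabs, each cut into T sub-slabs, giving MT sub-slabs) can be sketched as plain index arithmetic. This is only an illustration: the abstract does not say how grid planes are balanced across slabs, so the near-equal contiguous split below is an assumption, and the names `split_range` and `nested_slabs` are hypothetical. No MPI or OpenMP calls are made; the sketch only computes which planes each processor and core would own.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, nearly equal half-open ranges.

    Assumption: the first `extra` parts get one additional element each.
    """
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for p in range(parts):
        hi = lo + base + (1 if p < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def nested_slabs(n_planes, m_procs, t_cores):
    """Two-level 1-D decomposition: M slabs (one per processor),
    each divided into T sub-slabs (one per core)."""
    layout = {}
    for proc, (lo, hi) in enumerate(split_range(0, n_planes, m_procs)):
        layout[proc] = split_range(lo, hi, t_cores)
    return layout


# Example: 16 grid planes, M = 2 processors, T = 4 cores per processor.
layout = nested_slabs(n_planes=16, m_procs=2, t_cores=4)
for proc, sub_slabs in layout.items():
    print(proc, sub_slabs)
```

In a hybrid code, the outer level would typically map to MPI ranks and the inner level to OpenMP threads within each rank, matching steps (ii) and (iii) of the abstract.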

10.

OpenGR: A directive-based grid programming environment

Motonori Hirano Mitsuhisa Sato Yoshio Tanaka 《Parallel Computing》2005,31(10-12):1140

A new grid programming environment for remote procedure call (RPC) based master–worker type task parallelization is presented. The environment is realized through the use of a set of compiler directives, called OpenGR, and is implemented in the present study based on the Omni OpenMP compiler system and Ninf-G grid-enabled RPC system as a parallel execution mechanism. Using OpenGR directives, existing sequential applications can be readily adapted to the grid environment as master–worker parallel programs using the RPC architecture. The combination of OpenGR and OpenMP directives also allows for the hybrid parallelization of sequential programs, supporting both synchronous and asynchronous parallelism.

11.

An Effective Scheduling Algorithm for Cluster OpenMP Systems

吴少刚章隆兵蔡飞胡伟武《计算机研究与发展》2004,41(7):1298-1305

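As a standard for shared-memory parallel programming, OpenMP has become one of the mainstream models for parallel program design thanks to its ease of use and support for incremental parallelization. The OpenMP standard was designed for UMA shared-memory architectures, and its loop scheduling mechanism considers only load balance, not data distribution. In a cluster OpenMP system, however, data locality is the key factor affecting performance. To address the unsuitability of the static scheduling strategy in the OpenMP standard for cluster computing, the LBS scheduling algorithm, which fully embodies the owner-computes rule, is proposed and implemented on a cluster OpenMP system (OpenMP/JIAJIA) through extended directives. Test results show that the LBS algorithm is very effective for cluster OpenMP systems.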
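The owner-computes rule mentioned in the abstract above can be illustrated with a small scheduling sketch: each loop iteration runs on the node that already holds the data it touches. This is not the LBS algorithm itself, whose details the abstract does not give. The block data distribution, the one-element-per-iteration access pattern, and the names `block_owner` and `owner_computes_schedule` are all assumptions made for illustration.

```python
def block_owner(index, n_elems, n_nodes):
    """Node holding element `index` under an assumed block distribution."""
    block = -(-n_elems // n_nodes)  # ceiling division: elements per node
    return index // block


def owner_computes_schedule(n_elems, n_nodes):
    """Assign iteration i to the node that owns element i (owner computes).

    Assumption: iteration i reads and writes only element i.
    """
    schedule = {node: [] for node in range(n_nodes)}
    for i in range(n_elems):
        schedule[block_owner(i, n_elems, n_nodes)].append(i)
    return schedule


# Example: 10 array elements distributed in blocks over 4 nodes.
schedule = owner_computes_schedule(n_elems=10, n_nodes=4)
for node, iterations in schedule.items():
    print(node, iterations)
```

A schedule chosen only for load balance could place an iteration on a node that does not hold its data, forcing remote fetches on a cluster; tying the iteration to the owner avoids that traffic at the cost of possibly uneven work, as the last node in the example shows.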

12.

A smooth transition from serial to parallel processing in the industrial petroleum system modeling package PetroMod

H. Martin Bücker Armin I. Kauerauf Arno Rasch 《Computers & Geosciences》2008,34(11):1473-1479

Petroleum system modeling is a crucial technology to numerically simulate the generation, migration, accumulation, and loss of oil and gas through geologic time. The OpenMP programming paradigm is used to achieve modest parallelism on a shared-memory computer for the industrial petroleum system modeling package PetroMod. The significant advantage of this shared-memory parallelization approach is the simplicity of the OpenMP paradigm allowing a smooth transition to a parallel program by incrementally parallelizing a serial code. The process of using OpenMP to parallelize PetroMod is outlined and performance results on a Sun Fire E2900 are reported.

13.

A Cost Model for Automatic OpenMP Parallelization

李雁冰赵荣彩刘晓娴赵捷《软件学报》2014,25(S2):101-110

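Existing OpenMP cost models are rather simple: they neither fully consider the execution details of OpenMP programs nor adapt to different ways of executing loops in parallel. To address these problems, the existing cost model in Open64, a state-of-the-art production-grade optimizing compiler, is extended to build a cost model for profitability analysis in automatic OpenMP parallelization, taking a single candidate parallel loop as its object. In addition to improving Open64's original DOALL parallel cost model, the model adds a DOACROSS pipeline-parallel cost model and a DSWP parallel cost model. Experimental results show that the cost model estimates the trend of loop parallel execution overhead well and provides effective support for profitability analysis in automatic OpenMP parallelization.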

14.

Design and Implementation of a Cluster OpenMP System (total citations: 5; self-citations: 0; citations by others: 5)

吴少刚章隆兵蔡飞顾丽红唐志敏《计算机学报》2004,27(7):904-912

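With its ease of use and support for incremental parallelization, OpenMP has become the programming standard for shared-memory architectures. Cluster systems are now the mainstream platform for high-performance computing, so research on cluster OpenMP systems is of great significance for promoting the development and adoption of parallel applications. Using the software DSM system JIAJIA as the OpenMP runtime system, together with a front-end compiler, OMP2JIA, the authors implemented the OpenMP/JIAJIA computing environment on a cluster system. To improve performance, they also extended the OpenMP directives according to the characteristics of cluster systems and optimized the back-end runtime library. Using 11 OpenMP applications, the authors compared the performance of this environment with that of a hardware cc-NUMA system supporting OpenMP (SGI 2100). The results show that the 7-machine average speedup of the authors' cluster OpenMP system is 4.62, versus 4.55 for the SGI 2100 system, so the two perform comparably.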

15.

Productivity and Performance Portability of the OpenMP 3.0 Tasking Concept When Applied to an Engineering Code Written in Fortran 95

Paul Kapinos Dieter an Mey 《International journal of parallel programming》2010,38(5-6):379-395

The modeling of bevel gear cutting processes requires highly flexible data structures and algorithms. We compare the effort and performance of an OpenMP parallelization employing OpenMP 3.0 tasks with previously applied approaches like nesting parallel sections and stack-based algorithms when parallelizing recursive procedures written in Fortran 95 working on binary tree structures. We take a look at various combinations of recent hardware and Fortran compilers.

16.

Performance Evaluation of a Multi-Zone Application in Different OpenMP Approaches

Haoqiang Jin Barbara Chapman Lei Huang Dieter an Mey Thomas Reichstein 《International journal of parallel programming》2008,36(3):312-325

We describe a performance study of a multi-zone application benchmark implemented in several OpenMP approaches that exploit multi-level parallelism and deal with unbalanced workload. The multi-zone application was derived from the well-known NAS Parallel Benchmarks (NPB) suite that involves flow solvers on collections of loosely coupled discretization meshes. Parallel versions of this application have been developed using the Subteam concept and Workqueuing model as extensions to the current OpenMP. We examine the performance impact of these extensions to OpenMP and compare with hybrid and nested OpenMP approaches on several large parallel systems.

17.

A high‐performance face detection system using OpenMP

P. E. Hadjidoukas V. V. Dimakopoulos M. Delakis C. Garcia 《Concurrency and Computation》2009,21(15):1819-1837

We present the development of a novel high‐performance face detection system using a neural network‐based classification algorithm and an efficient parallelization with OpenMP. We discuss the design of the system in detail along with experimental assessment. Our parallelization strategy starts with one level of threads and moves to the exploitation of nested parallel regions in order to further improve, by up to 19%, the image‐processing capability. The presented system is able to process images in real time (38 images/sec) by sustaining almost linear speedups on a system with a quad‐core processor and a particular OpenMP runtime library. Copyright © 2009 John Wiley & Sons, Ltd.

18.

The BonaFide C Analyzer: automatic loop-level characterization and coverage measurement

Sergio Aldea Diego R. Llanos Arturo Gonzalez-Escribano 《The Journal of supercomputing》2014,68(3):1378-1401

The advent of multicore technologies has increased the interest in parallelization techniques for existing sequential applications. These techniques require detecting loops that are good candidates for parallelization and classifying all variables of these loops according to their use, a task surprisingly hard to carry out manually. In this paper, we introduce the BonaFide C Analyzer, an XML-based framework that combines static analysis of source code with profiling information to generate complete reports regarding all loops in a C application, including loop coverage, loop suitability for parallelization, a classification of all variables inside loops based on their accesses, and other hurdles that restrict the parallelization. This information makes it possible to analyze how particular language constructs are used in real-world applications, and helps the programmer to parallelize the code. To show the features of the framework, we present the results of an in-depth loop characterization of C applications that are part of the SPEC CPU2006 benchmark suite. Our study shows that 47.72% of loops present in the applications analyzed are potentially parallelizable with existing parallel programming models such as OpenMP, while an additional 37.7% of loops could be run in parallel with the help of runtime speculative parallelization techniques.

19.

Experiments with Parallelizing Tribology Simulations

V. Chaudhary W. L. Hase H. Jiang L. Sun D. Thaker 《The Journal of supercomputing》2004,28(3):323-343

Different parallelization methods vary in their system requirements, programming styles, efficiency of exploring parallelism, and the application characteristics they can handle. For different situations, they can exhibit totally different performance gains. This paper compares OpenMP, MPI, and Strings for parallelizing a complicated tribology problem. The problem size and computing infrastructure are changed to assess the impact of this on various parallelization methods. All of them exhibit good performance improvements, which demonstrates the necessity and importance of applying parallelization in this field.

20.

Parallel Imaging Algorithm for Spaceborne SAR on SGI Systems

高国荣王开志刘兴钊韩传钊《计算机工程》2004,30(19):45-46,67

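This paper studies in depth the implementation of parallel processing algorithms for spaceborne synthetic aperture radar (SAR) on a distributed shared memory (DSM) HPC platform and compares parallel schemes implemented with two parallel programming models, message passing and OpenMP. On this basis, a process-based shared-variable parallel model is proposed. This model overcomes the shortcomings of the two preceding models, and experimental tests and practical SAR imaging applications show it to be an efficient and stable parallel scheme.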