期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Performance comparison of MPI and OpenMP on shared memory multiprocessors

Graud Krawezik Franck Cappello 《Concurrency and Computation》2006,18(1):29-61

When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restricts their choice to the standard programming environments (MPI and OpenMP), they have to select a programming approach among MPI and the variety of OpenMP programming styles. To help the programmer in their decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd. 相似文献

2.

Performance characteristics of the multi-zone NAS parallel benchmarks

《Journal of Parallel and Distributed Computing》2006,66(5):674-685

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multi-zone, is derived from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on four different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes. 相似文献

3.

COUPL+:并行PDE求解函数库

陈江赵永华迟学斌《计算机工程》2005,31(22):58-60,94

COUPL＋是一种基于消息传递模型的并行库,它将并行程序巾需要处理的数据划分、消息传递函数的调用等都封装在其函数中。COUPL＋可以简化在分布式存储结构并行机上编写基于网格的应用程序的任务。该文简要介绍了COUPL＋的基本原理,以及它与MPI、OpenMP和HPF的特性对比;并且使用COUPL＋实现了共轭梯度法和结构化网格计算两种并行计算中常用的任务,也对比了使用MPI和HPF的性能差异。相似文献

4.

Employing nested OpenMP for the parallelization of multi-zone computational fluid dynamics applications

《Journal of Parallel and Distributed Computing》2006,66(5):686-697

In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study, we use the NanosCompiler that supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms (MPI+OpenMP and MLP) and discuss OpenMP implementation issues that affect the performance of multi-level parallel applications. 相似文献

5.

High performance computing using MPI and OpenMP on multi-core parallel systems 总被引：1，自引：0，他引：1

Haoqiang Jin Dennis JespersenPiyush Mehrotra Rupak BiswasLei Huang Barbara Chapman 《Parallel Computing》2011,37(9):562-575

The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures. 相似文献

6.

Performance of a new CFD flow solver using a hybrid programming paradigm

《Journal of Parallel and Distributed Computing》2005,65(4):414-423

This paper presents several algorithmic innovations and a hybrid programming style that lead to highly scalable performance using shared memory for a new computational fluid dynamics flow solver. This hybrid model is then converted to a strict message-passing implementation, and performance results for the two are compared. Results show that using this hybrid approach our OpenMP implementation is actually marginally faster than the MPI version, with parallel speedups of up to 599 out of 640 using OpenMP and 486 with MPI. 相似文献

7.

CLOMP: Accurately Characterizing OpenMP Application Overheads

Greg Bronevetsky John Gyllenhaal Bronis R. de Supinski 《International journal of parallel programming》2009,37(3):250-265

Despite its ease of use, OpenMP has failed to gain widespread use on large scale systems, largely due to its failure to deliver sufficient performance. Our experience indicates that the cost of initiating OpenMP regions is simply too high for the desired OpenMP usage scenario of many applications. In this paper, we introduce CLOMP, a new benchmark to characterize this aspect of OpenMP implementations accurately. CLOMP complements the existing EPCC benchmark suite to provide simple, easy to understand measurements of OpenMP overheads in the context of application usage scenarios. Our results for several OpenMP implementations demonstrate that CLOMP identifies the amount of work required to compensate for the overheads observed with EPCC. We also show that CLOMP also captures limitations for OpenMP parallelization on SMT and NUMA systems. Finally, CLOMPI, our MPI extension of CLOMP, demonstrates which aspects of OpenMP interact poorly with MPI when MPI helper threads cannot run on the NIC. 相似文献

8.

多核环境下AREM模式混合并行计算研究 总被引：1，自引：1，他引：0

下载免费PDF全文

赵军吴建平宋君强辜旭赞《计算机工程与应用》2011,47(21):61-63

使用多核处理器已成为构建高性能计算机系统的主流方式。结合多核高性能计算机系统集共享内存结构和分布式内存结构于一体的体系结构特点,对AREM模式开展MPI/OpenMP混合并行计算研究与实现。性能测试结果表明,使用MPI/OpenMP混合并行计算可以将并行应用扩展至更大处理机规模,缩短计算时间,不对原程序结构做大的改动、以增量方式和较小的并行化代价,取得比较好的并行计算效果。相似文献

9.

MPI Correctness Checking for OpenMP/MPI Applications

Tobias Hilbrich Matthias S. Müller Bettina Krammer 《International journal of parallel programming》2009,37(3):277-291

The MPI interface is the de-facto standard for message passing applications, but it is also complex and defines several usage patterns as erroneous. A current trend is the investigation of hybrid programming techniques that use MPI processes and multiple threads per process. As a result, more and more MPI implementations support multi-threading, which are restricted by several rules of the MPI standard. In order to support developers of hybrid MPI applications, we present extensions to the MPI correctness checking tool Marmot. Basic extensions make it aware of OpenMP multi-threading, while further ones add new correctness checks. As a result, it is possible to detect errors that actually occur in a run with Marmot. However, some errors only occur for certain execution orders, thus, we present a novel approach using artificial data races, which allows us to employ thread checking tools, e.g., Intel Thread Checker, to detect MPI usage errors. 相似文献

10.

OpenMP extensions for master–slave message passing computing

P.E. Hadjidoukas T.S. Papatheodorou 《Parallel Computing》2005,31(10-12):1155

This paper presents a directive-based programming environment for master–slave message passing applications that enables the efficient execution of the same code on both shared and distributed memory multiprocessors. The environment exports an extension of the OpenMP workqueuing model, supports multiple levels of task parallelism and more than one master and provides transparent load balancing with a combination of static and dynamic scheduling of tasks. In addition, it operates exclusively through the available hardware on shared-memory machines and exploits MPI for explicit communication on clusters. Experimental results on a Linux-cluster demonstrate the successful combination of ease of programming with the performance of message passing. 相似文献

11.

多核集群系统上的混合编程模型研究

张军万剑怡《计算机与现代化》2009,(5)

对采用多核处理器作为SMP集群系统的计算节点的系统上的一种混合编程模型-MPI+OpenMP混合编程模型进行了深入的研究.建立了两个矩阵乘的混合并行算法,在多核集群平台上与纯MPI算法分别进行了实验,并进行了性能方面的比较.试验表明,混合编程具有更好的性能. 相似文献

12.

NAS Parallel Benchmarks with CUDA and beyond

Gabriell Araujo Dalvan Griebler Dinei A. Rockenbach Marco Danelutto Luiz G. Fernandes 《Software》2023,53(1):53-80

NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide ease of use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number of threads per block configuration. The results have revealed relevant performance improvement solely by changing the number of threads per block, showing performance improvements from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. The performance results have shown up to 267% improvements over the best benchmarks versions available. We also observe the best and worst design choices, concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior. 相似文献

13.

Supporting OpenMP on Cell

Kevin O’Brien Kathryn O’Brien Zehra Sura Tong Chen Tao Zhang 《International journal of parallel programming》2008,36(3):289-311

The Cell processor is a heterogeneous multi-core processor with one power processing engine (PPE) core and eight synergistic processing engine (SPE) cores. There is a significant amount of ongoing research in programming models and tools that attempts to make it easy to exploit the computation power of the Cell architecture. In our work, we explore supporting OpenMP on the Cell processor. It is attractive to support OpenMP because programmers can continue using their familiar programming model, and existing code can be re-used. We base our work on IBM’s XL compiler, and developed new components in the XL compiler and a new runtime library. Three major issues are addressed: (1) synchronization support on heterogeneous cores; (2) code generation targeting the different instruction sets; (3) data transfers and implement the OpenMP memory model. We present experimental results for some SPEC OMP 2001 and NAS benchmarks to demonstrate the effectiveness of this approach. A visualization tool based on Paraver is also used to provide some insights into actual thread and synchronization behaviors. 相似文献

14.

Exploiting Distributed-Memory and Shared-Memory Parallelism on Clusters of SMPs with Data Parallel Programs

Benkner Siegfried Sipkova Viera 《International journal of parallel programming》2003,31(1):3-19

Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high-level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results which show the effectiveness of these techniques. 相似文献

15.

基于分布/共享内存层次结构的并行程序设计 总被引：1，自引：0，他引：1

李清宝张平《计算机应用》2004,24(6):148-150,158

分布内存结构和共享内存结构各具特点，又有很强的互补性，分布／共享内存层次结构将两种结构相结合，以充分发挥其优势。文中主要讨论基于分布／共享内存层次结构的并行程序设计问题，介绍了MPI和OpenMP混合并行程序设计模式。相似文献

16.

SMP机群混合编程模型研究 总被引：12，自引：0，他引：12

陈勇陈国良李春生何家华《小型微型计算机系统》2004,25(10):1763-1767

研究了适用于 SMP机群的混合编程模型 ,并把它划分为 Open MP MPI和 Thread MPI两类 .通过研究指出 ,Open MP MPI优于 Thread MPI.在此基础上 ,重点研究了 Open MP MPI的实现机制、粗粒度和细粒度并行化方法、循环选择、优化措施以及注意事项等 ,得出细粒度并行化的 Open MP MPI是 SMP机群编程模型的一个较好选择的结论相似文献

17.

多核机群下MPI程序优化技术的研究

王洁衷璐洁曾宇《计算机科学》2011,38(10):281-284

多核处理器的新特性使多核机群的存储层次更加复杂,同时也给MPI程序带来了新的优化空间。国内外学者提出了许多多核机群下MPI程序的优化方法和技术。测试了3个不同多核机群的通信性能,并分别在Intel与 AMD多核机群下实验评估了几种具有普遍意义的优化技术:混合MPI/OpcnMP、优化MPI运行时参数以及优化 MPI进程摆放,同时对实验结果和优化性能进行了分析。相似文献

18.

OpenMP‐oriented applications for distributed shared memory architectures

Ami Marowka Zhenying Liu Barbara Chapman 《Concurrency and Computation》2004,16(4):371-384

The rapid rise of OpenMP as the preferred parallel programming paradigm for small‐to‐medium scale parallelism could slow unless OpenMP can show capabilities for becoming the model‐of‐choice for large scale high‐performance parallel computing in the coming decade. The main stumbling block for the adaptation of OpenMP to distributed shared memory (DSM) machines, which are based on architectures like cc‐NUMA, stems from the lack of capabilities for data placement among processors and threads for achieving data locality. The absence of such a mechanism causes remote memory accesses and inefficient cache memory use, both of which lead to poor performance. This paper presents a simple software programming approach called copy‐inside–copy‐back (CC) that exploits the data privatization mechanism of OpenMP for data placement and replacement. This technique enables one to distribute data manually without taking away control and flexibility from the programmer and is thus an alternative to the automat and implicit approaches. Moreover, the CC approach improves on the OpenMP‐SPMD style of programming that makes the development process of an OpenMP application more structured and simpler. The CC technique was tested and analyzed using the NAS Parallel Benchmarks on SGI Origin 2000 multiprocessor machines. This study shows that OpenMP improves performance of coarse‐grained parallelism, although a fast copy mechanism is essential. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

19.

基于SMP集群的混合并行编程模型研究 总被引：9，自引：3，他引：6

下载免费PDF全文

王惠春 ZHU Ding-ju 朱定局曹学年樊建平《计算机工程》2009,35(3):271-273

提出一种适用于SMP集群的混合MPI＋OpenMP并行编程模型。该模型贴近于SMP集群的体系结构且综合了消息传递和共享内存2种编程模型的优势,能获得较好的性能。讨论该混合模型的实现机制以及MPI消息传递模型的特点。实验结果表明,在一定条件下,该混合并行编程模型是SMP集群的最优选择。相似文献

20.

DaSH: A benchmark suite for hybrid dataflow and shared memory programming models

《Parallel Computing》2015

The current trend in development of parallel programming models is to combine different well established models into a single programming model in order to support efficient implementation of a wide range of real world applications. The dataflow model has particularly managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard.This article presents DaSH – the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics. 相似文献