1.
Bernd Mohr Allen D. Malony Sameer Shende Felix Wolf 《The Journal of supercomputing》2002,23(1):105-128
This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications.
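The directive-rewriting idea in this abstract can be illustrated with a minimal sketch. It is not OPARI's actual output: the hook names `region_enter` and `region_exit` and the `region_info` descriptor are hypothetical stand-ins for whatever interface a measurement library would provide. The sketch shows one plausible rewriting rule: a combined `parallel for` is split into `parallel` plus `for`, so that every thread of the team passes through the hooks with the directive's source location attached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical region descriptor: carries the source location of the
 * directive plus simple event counters (a real tool would record
 * timestamps or trace events instead). */
typedef struct {
    const char *file;   /* source file containing the directive */
    int line;           /* line number of the directive */
    long enter_count;   /* number of threads that entered the region */
    long exit_count;    /* number of threads that left the region */
} region_info;

/* Hypothetical measurement hooks, safe to call from several threads. */
static void region_enter(region_info *r) {
    #pragma omp atomic
    r->enter_count++;
}

static void region_exit(region_info *r) {
    #pragma omp atomic
    r->exit_count++;
}

/* The loop as the user wrote it. */
double sum_original(const double *a, size_t n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < (long)n; i++)
        s += a[i];
    return s;
}

/* What a rewriting tool might emit: the combined directive is split so
 * that hook calls can be placed inside the parallel region. */
double sum_instrumented(const double *a, size_t n, region_info *r) {
    assert(r != NULL);
    double s = 0.0;
    #pragma omp parallel
    {
        region_enter(r);
        #pragma omp for reduction(+:s)
        for (long i = 0; i < (long)n; i++)
            s += a[i];
        region_exit(r);
    }
    return s;
}
```

Both functions return the same sum; after a call to `sum_instrumented`, the two counters are equal and each holds the number of threads in the team. When the file is compiled without OpenMP support, the pragmas are ignored and the code runs serially with one enter and one exit event.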
2.
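To address performance degradation in OpenMP programs, this paper introduces the concepts of a performance degradation region and performance degradation intensity. Degradation intensity makes it possible to discard regions that do not degrade performance and to highlight degrading code segments with long execution times. In addition, decomposing a degradation region progressively narrows it down until the code segment that causes the degradation is located precisely. Removing the root cause of the degradation effectively improves the execution performance of an OpenMP program. A case study confirms the effectiveness of the proposed method for diagnosing and handling performance degradation in OpenMP programs.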
3.
Alejandro Duran Roger Ferrer Juan José Costa Marc Gonzàlez Xavier Martorell Eduard Ayguadé Jesús Labarta 《International journal of parallel programming》2007,35(4):393-416
OpenMP has focused on performance for numerical applications, but when we try to move this focus to other kinds
of applications, such as Web servers, we find one important gap. In these applications, performance is important, but reliability
is even more important, and OpenMP does not have any recovery mechanism. In this paper we present a novel proposal to address
this gap. In order to add error handling to OpenMP we propose some extensions to the current OpenMP specification. A directive
and a clause are proposed, defining a scope for the error handling (where the error can occur) and specifying a behaviour
for handling the specific errors. Some usage examples are presented, along with an evaluation showing the impact
of this proposal on OpenMP applications. We show that this impact is low enough to consider the proposal worthwhile for OpenMP.
4.
5.
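An OpenMP Fault-Tolerance Mechanism Based on Parallel Recomputing
Fault recovery based on parallel recomputing distributes the recovery workload across the nodes that have not failed and executes it in parallel, which significantly shortens recomputation time, lowers fault-recovery overhead, and improves the fault tolerance of parallel programs. Building on this recovery technique, the paper proposes PR-OMP, a fault-tolerance mechanism for OpenMP parallel programs that addresses problems such as segmented recomputation and redistribution of the recomputation load. It also extends traditional compiler data-flow analysis to OpenMP parallel programs and uses the extended analysis to compute and optimize the state-saving overhead. A compilation tool, GiFT-OMP, was designed and implemented to support PR-OMP. Experiments demonstrate the effectiveness of the PR-OMP mechanism and its supporting tool, and evaluate and analyze their performance and scalability.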
6.
Important components of molecular modeling applications are estimation and minimization of the internal energy of a molecule.
For macromolecules such as proteins and amino acids, energy estimation is performed using empirical equations known as force
fields. Over the past several decades, much effort has been directed towards improving the accuracy of these equations, and
the resulting increased accuracy has come at the expense of greater computational complexity. For example, the interactions
between a protein and surrounding water molecules have been modeled with improved accuracy using the generalized Born solvation
model, which increases the computational complexity to O(n^3). Fortunately, many force-field calculations are amenable to
parallel execution. This paper describes the steps that were
required to transform the Born calculation from a serial program into a parallel program suitable for parallel execution in
both the OpenMP and MPI environments. Measurements of the parallel performance on a symmetric multiprocessor reveal that the
Born calculation scales well for up to 144 processors. In some cases the OpenMP implementation scales better than the MPI
implementation, but in other cases the MPI implementation scales better than the OpenMP implementation. However, in all cases
the OpenMP implementation performs better than the MPI implementation, and requires less programming effort as well.
7.
WANG Jue HU ChangJun ZHANG JiLin & LI JianJiang School of Information Engineering University of Science Technology Beijing Beijing China 《中国科学:信息科学(英文版)》2010,(5):932-944
OpenMP is an emerging industry standard for shared memory architectures. While OpenMP has advantages in ease of use and incremental programming, message passing is still the most widely used programming model for distributed memory architectures today. How to effectively extend OpenMP to distributed memory architectures has been an active research topic. This paper proposes an OpenMP system, called KLCoMP, for distributed memory architectures. Based on the partially replicating shared arrays memory model, we propose ...
8.
When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they have to select a programming approach among MPI and the variety of OpenMP programming styles. To help the programmer in their decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd.
9.
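One-dimensional range-profile envelope alignment is performed with the accumulated cross-correlation method using a rectangular window and with the minimum-entropy method based on the frequency-domain shift property of the Fourier transform. Because envelope alignment algorithms involve large data volumes, high complexity, and long running times, a parallel envelope alignment algorithm for multicore processors is proposed. The method uses the OpenMP compiler directives #pragma omp section and #pragma omp for to apply multithreaded parallel optimization to the accumulated cross-correlation and minimum-entropy algorithms. Theoretical analysis and simulation experiments show that the method greatly improves the execution efficiency of the algorithms.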
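As a rough illustration of how the two directives named in this abstract divide the work, the sketch below aligns a profile against a reference by plain circular cross-correlation. This is a simplification chosen for brevity, not the paper's windowed accumulated cross-correlation or its minimum-entropy method, and the function names `xcorr_at`, `best_shift`, and `align_two` are hypothetical. The correlation values for all candidate shifts are independent, so they are computed with `omp for`; two unrelated profiles are aligned concurrently with `omp sections`.

```c
#include <assert.h>
#include <stdlib.h>

/* Circular cross-correlation of x against ref at one candidate shift s. */
static double xcorr_at(const double *ref, const double *x, int n, int s) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += ref[i] * x[(i + s) % n];
    return acc;
}

/* Returns the shift in [0, n) with the largest correlation (the first one
 * on ties), or -1 if the scratch buffer cannot be allocated. */
int best_shift(const double *ref, const double *x, int n) {
    assert(n > 0);
    double *c = malloc((size_t)n * sizeof *c);
    if (c == NULL)
        return -1;

    /* Each shift is independent, so the loop iterations can be shared
     * among threads without synchronization. */
    #pragma omp parallel for
    for (int s = 0; s < n; s++)
        c[s] = xcorr_at(ref, x, n, s);

    /* The argmax is cheap and is done serially. */
    int best = 0;
    for (int s = 1; s < n; s++)
        if (c[s] > c[best])
            best = s;

    free(c);
    return best;
}

/* Aligns two independent profiles; each section may run on its own thread. */
void align_two(const double *ref, const double *x1, const double *x2,
               int n, int *s1, int *s2) {
    #pragma omp parallel sections
    {
        #pragma omp section
        *s1 = best_shift(ref, x1, n);

        #pragma omp section
        *s2 = best_shift(ref, x2, n);
    }
}
```

For a reference with a single peak at index 1 and a profile with its peak at index 3, `best_shift` returns 2, the shift that maps the profile's peak onto the reference peak. The inner `parallel for` is nested inside the `sections` region in `align_two`; whether it actually gets extra threads depends on the runtime's nesting settings, but the result is the same either way.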
10.
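As parallel computing becomes widely used in military, industrial, and other fields, more users are turning to parallel processing to solve problems, and parallel program development for embedded multicore, multiprocessor platforms is becoming more common. Parallel debugging is an important part of parallel program development, and debugging real-time applications is very complex, yet parallel debugging environments remain relatively weak. This article presents the design of a performance analysis tool for parallel embedded real-time systems that can trace and analyze applications using very few resources and that achieves relatively high performance.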
11.
Tien-Hsiung Weng Ruey-Kuen Perng Barbara Chapman 《International journal of parallel programming》2007,35(5):493-505
In this paper, we describe our experience of creating an OpenMP implementation of the SPICE3 circuit simulator program. Given
the irregular patterns of access to dynamic data structures in the SPICE code, a parallelization using current standard OpenMP
directives is impossible without major rewriting of the original program. The aim of this work is to present a case study
showing the development of a shared memory parallel code with minimum effort. We present two implementations, one with minimal
code modification and one without modification to the original SPICE3 program using Intel’s taskq construct. We also discuss the results of the case study in terms of what future compiler tools may be needed to help OpenMP
application developers with similar porting goals. Our experiments using SPICE3, based on SRAM model simulation, were compiled
by the SUN compiler running on a SunFire V880 UltraSPARC-III 750 MHz and by the Intel icc compiler running on both an IBM
Itanium machine with four CPUs and a two-processor Intel Xeon machine. The results are promising.
12.
13.
A new parallel normalized optimized approximate inverse algorithm, based on the concept of antidiagonal wave pattern, for computing classes of explicitly approximate inverses, is introduced for symmetric multiprocessor systems. The parallel normalized explicit approximate inverses are used in conjunction with parallel normalized explicit preconditioned conjugate gradient schemes for the efficient solution of finite element sparse linear systems. The parallel design and implementation issues of the new algorithm are discussed and the parallel performance is presented using OpenMP. Copyright © 2008 John Wiley & Sons, Ltd.
14.
Y. Charlie Hu Honghui Lu Alan L. Cox Willy Zwaenepoel 《Journal of Parallel and Distributed Computing》2000,60(12):160
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions.
15.
Haoqiang Jin Barbara Chapman Lei Huang Dieter an Mey Thomas Reichstein 《International journal of parallel programming》2008,36(3):312-325
We describe a performance study of a multi-zone application benchmark implemented in several OpenMP approaches that exploit
multi-level parallelism and deal with unbalanced workload. The multi-zone application was derived from the well-known NAS
Parallel Benchmarks (NPB) suite that involves flow solvers on collections of loosely coupled discretization meshes. Parallel
versions of this application have been developed using the Subteam concept and Workqueuing model as extensions to the current
OpenMP. We examine the performance impact of these extensions to OpenMP and compare with hybrid and nested OpenMP approaches
on several large parallel systems.
16.
17.
Alan Morris Allen D. Malony Sameer S. Shende 《International journal of parallel programming》2007,35(4):417-436
Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation
and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread
performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach
has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications
with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify
threads.
18.
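To make full use of available multicore CPU resources, a fast parallel slicing algorithm based on the OpenMP framework is proposed and its performance is discussed. The algorithm uses the natural grouping of the model to build grouped topological relations, which shortens the time needed to construct the model's topological data structure. On this basis, OpenMP multithreading parallelizes both the construction of the topology and the computation of the layer contours, achieving a speedup close to the number of CPU cores and thus markedly reducing slicing time. Slicing very large STL files of complex 3D models shows that the algorithm is efficient and easy to implement.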
19.
This paper describes the implementation of a sizable subset of OpenMP on networks of workstations (NOWs); the source-to-source OpenMP compiler (AutoPar) is used for the JIAJIA home-based shared virtual memory (SVM) system. The paper suggests some simple modifications and extensions to the OpenMP standard to account for the differences between SVM and the symmetric multiprocessor (SMP) at which the OpenMP specification is aimed. The OpenMP translator is based on an automatic parallelization compiler, so it can check the semantic correctness of OpenMP programs, which is not required of an OpenMP-compliant implementation. AutoPar is measured on five applications, including programs from the NAS Parallel Benchmarks and real applications, on a cluster of eight Pentium Ⅱ PCs connected by 100 Mbps switched Ethernet. The evaluation shows that parallelization by annotating OpenMP directives is simple and that the performance of the generated JIAJIA code is still acceptable on NOWs.
20.
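To address the complexity and time cost of setting up cluster-based parallel systems, this paper proposes building the parallel system with Docker. It introduces the core concepts and basic architecture of Docker, a lightweight virtualization technology, and uses Docker to set up a cluster parallel development environment on Linux. It briefly explains the idea of parallel computing, describes the basic concepts and characteristics of MPI and OpenMP, builds a hybrid MPI and OpenMP programming model for parallel matrix multiplication, compares the performance of the hybrid model with the MPI-only and OpenMP-only models, and analyzes the reasons for the differences. Using the hybrid model, it then compares the parallel efficiency of parallel systems built with Docker and on traditional physical machines.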
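The abstract does not give the decomposition used for the hybrid matrix multiplication, so the following is only a sketch under one common assumption: rows of the result are split into contiguous blocks, one block per MPI process, and OpenMP threads share the rows inside each block. The MPI calls are deliberately left out; `rank` and `nranks` are ordinary parameters standing in for the values `MPI_Comm_rank` and `MPI_Comm_size` would supply, and the helper names `row_range` and `matmul_rows` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Row range [begin, end) owned by `rank` when n rows are divided as
 * evenly as possible among `nranks` processes; the first n % nranks
 * ranks receive one extra row. */
void row_range(int n, int rank, int nranks, int *begin, int *end) {
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    int base = n / nranks;
    int extra = n % nranks;
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}

/* Computes rows [begin, end) of C = A * B for row-major n x n matrices.
 * Rows are independent, so they are shared among OpenMP threads. */
void matmul_rows(const double *A, const double *B, double *C,
                 int n, int begin, int end) {
    #pragma omp parallel for
    for (int i = begin; i < end; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++)
                acc += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = acc;
        }
    }
}
```

In a real hybrid program each process would call `row_range` with its own rank, run `matmul_rows` on that block, and then gather the blocks of C on one process. Calling the two functions in a loop over all rank values in a single process produces the full product, which is a convenient way to check the decomposition before adding MPI.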