20 similar articles found (search took 62 ms)
This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications.
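
The directive-rewriting idea in the abstract above can be illustrated with a toy source-to-source pass. This is only a sketch under strong assumptions, not OPARI: it handles just a `#pragma omp parallel` directive followed by a brace-delimited block, it ignores braces inside strings and comments, and the emitted calls `monitor_fork` and `monitor_join` are hypothetical names, not the calls of the proposed interface. It shows how a rewriter can pass source locations as context information.

```python
import re

# Matches a line that starts an OpenMP parallel directive (toy pattern).
PARALLEL = re.compile(r"^\s*#pragma\s+omp\s+parallel\b")


def instrument(source, filename="example.c"):
    """Surround each '#pragma omp parallel' block with hypothetical monitor calls."""
    out = []
    stack = []      # (brace depth at which the block closes, source location)
    depth = 0       # current brace nesting depth
    waiting = None  # location of a directive still waiting for its '{'
    for lineno, line in enumerate(source.splitlines(), start=1):
        if PARALLEL.match(line):
            waiting = f"{filename}:{lineno}"
            out.append(f'monitor_fork("{waiting}");')  # hypothetical call
        out.append(line)
        for ch in line:
            if ch == "{":
                if waiting is not None:
                    stack.append((depth, waiting))
                    waiting = None
                depth += 1
            elif ch == "}":
                depth -= 1
                if stack and stack[-1][0] == depth:
                    location = stack.pop()[1]
                    out.append(f'monitor_join("{location}");')  # hypothetical call
    return "\n".join(out)


src = "#pragma omp parallel\n{\n  work();\n}\ndone();"
print(instrument(src))
```

A real tool would also have to handle combined directives, directives without braces, and Fortran syntax, which this sketch does not attempt.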

Virtual execution environments, such as the Java virtual machine, promote platform-independent software development. However, when it comes to analyzing algorithm complexity and performance bottlenecks, available tools focus on platform-specific metrics, such as the CPU time consumption on a particular system. Other drawbacks of many prevailing profiling tools are high overhead, significant measurement perturbation, and reduced portability, because the tools are often implemented in platform-dependent native code. This article presents a novel profiling approach, which is entirely based on program transformation techniques, in order to build a profiling data structure that provides calling-context-sensitive program execution statistics. We explore the use of platform-independent profiling metrics in order to make the instrumentation entirely portable and to generate reproducible profiles. We implemented these ideas within a Java-based profiling tool called JP. A significant novelty is that this tool achieves complete bytecode coverage by statically instrumenting the core runtime libraries and dynamically instrumenting the rest of the code. JP provides a small and flexible API to write customized profiling agents in pure Java, which are periodically activated to process the collected profiling information. Performance measurements show that, despite the presence of dynamic instrumentation, JP causes significantly less overhead than a prevailing tool for the profiling of Java code. Copyright © 2008 John Wiley & Sons, Ltd.
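
The calling-context-sensitive statistics mentioned in the abstract above can be illustrated with a small sketch. The following Python decorator is not JP and performs no bytecode instrumentation; it is a hypothetical stand-in that only shows how a calling-context tree accumulates invocation counts per call path. The names `CCTNode`, `profiled`, and `ROOT` are assumptions made for this illustration.

```python
import functools


class CCTNode:
    """One node of a calling-context tree: a function reached via one call path."""

    def __init__(self, name):
        self.name = name
        self.calls = 0      # invocations observed in this exact context
        self.children = {}  # callee name -> CCTNode

    def child(self, name):
        if name not in self.children:
            self.children[name] = CCTNode(name)
        return self.children[name]


ROOT = CCTNode("<root>")
_current = ROOT  # single-threaded sketch: one "current context" pointer


def profiled(func):
    """Wrap func so that each call is recorded under the caller's context."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _current
        caller = _current
        node = caller.child(func.__name__)
        node.calls += 1
        _current = node
        try:
            return func(*args, **kwargs)
        finally:
            _current = caller  # restore the context even if func raises

    return wrapper


@profiled
def leaf():
    return 1


@profiled
def a():
    return leaf() + leaf()


@profiled
def b():
    return leaf()


a()
b()
# leaf is now counted separately under a (2 calls) and under b (1 call).
```

A flat profile would report three calls to `leaf`; the tree keeps the two calling contexts apart, which is the property the abstract refers to.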

Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.

This paper presents a novel profiling approach, which is entirely based on program transformation techniques in order to enable exact profiling, preserving complete call stacks, method invocation counters, and bytecode instruction counters. We use the number of executed bytecode instructions as the profiling metric, which has several advantages, such as making the instrumentation entirely portable and generating reproducible profiles. These ideas have been implemented as the JP tool. It provides a small and flexible API to write portable profiling agents in pure Java, which are periodically activated to process the collected profiling information. Performance measurements show that JP causes significantly less overhead than a prevailing tool for the exact profiling of Java code.

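Program optimization is an important step in improving execution efficiency, and profiling is the first step of optimization. For sequential languages, the compiler inserts profiling code automatically when a command-line switch is given. Most compilers for parallel languages, however, lack this capability. Taking a portable dynamic profiler for a parallel C++ language as an example, this paper addresses the problem from two angles: it first gives a general method for implementing a portable dynamic profiler, and then analyzes an instrumentation tool for pC++.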

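Today's high-performance computer architectures are highly diverse, which poses a major challenge for developing parallel application software. The domain-specific language OPS is used to parallelize the high-order-accuracy computational fluid dynamics code HNSC for multiple platforms: the code is restructured with the OPS API, and the OPS front end and back end automatically generate executables in pure MPI, OpenMP, MPI+OpenMP, and MPI+CUDA versions. Performance tests on a server with two Intel Xeon E5-2660 v3 CPUs and one NVIDIA Tesla K80 GPU show that the parallel code generated automatically by OPS performs comparably to, or even better than, hand-parallelized code, and that the OPS-generated GPU code is markedly faster than the OPS-generated CPU code. The results indicate that using domain-specific languages such as OPS is a feasible and efficient route to multi-platform parallel CFD software development.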

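This paper analyzes the parallelism in solving the multigroup particle transport Sn equations on unstructured grids and, to match the characteristics of multicore clusters, designs a hybrid MPI/OpenMP program. The spatial grid is partitioned by domain decomposition, and compute nodes communicate through MPI message passing. Whenever an MPI process reaches a computation over energy groups, it spawns multiple OpenMP threads, so that the energy groups are processed in parallel by threads within a node. Numerical tests show that hybrid parallel computation of particle transport on unstructured grids matches the hardware structure of multicore clusters well and scales well, up to 1024 CPU cores.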

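Directive Extensions Suited to Cluster OpenMP Systems   Cited by: 1 in total (self-citations: 0, citations by others: 1)
Thanks to its ease of use and support for incremental parallelization, OpenMP has become the programming standard for shared-memory architectures. A cluster OpenMP system implements an OpenMP computing environment on a cluster, combining OpenMP's programmability with the scalability of clusters. OpenMP programs are mainly written in one of two styles, loop-level or SPMD; the loop-level style is easy to program and the SPMD style is hard. On a cluster OpenMP system, however, high-performance OpenMP programs must use the SPMD style. This paper describes a simple subset of OpenMP directive extensions suited to cluster OpenMP systems (including data distribution directives and loop scheduling modes) and implements it on the cluster OpenMP system OpenMP/JIAJIA. Application tests show that programming with these directive extensions retains the ease of the loop-level style while achieving performance comparable to the SPMD style, making it an effective way to program.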

When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they have to select a programming approach among MPI and the variety of OpenMP programming styles. To help programmers in their decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the greatest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd.

Biological sequence comparison is an important operation in bioinformatics that is often used to relate organisms. Smith and Waterman proposed an exact algorithm (SW) that compares two sequences in quadratic time and space. Due to high computing power and memory requirements, SW is usually executed on High Performance Computing (HPC) platforms such as multicore clusters and CellBEs. Since HPC architectures exhibit very different hardware characteristics, porting an application to them is an error-prone and time-consuming task. BSP++ is an implementation of BSP that aims to facilitate parallel programming, reducing the effort to port code. In this paper, we propose and evaluate a parallel BSP++ strategy to execute SW on multiple multicore and manycore platforms. Given the same base code, we generated MPI, OpenMP, MPI/OpenMP, CellBE and MPI/CellBE versions, which were executed on heterogeneous platforms with up to 6,144 cores. The results obtained with real DNA sequences show that the performance of our versions is comparable to the hand-tuned strategies in the literature, evidencing the appropriateness and flexibility of our approach.
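
For readers unfamiliar with the Smith-Waterman recurrence referred to above, here is a minimal sequential sketch. It is not the BSP++ strategy from the paper; the scoring values (match 2, mismatch -1, linear gap penalty -1) and the function name `smith_waterman_score` are assumptions chosen for illustration. The sketch returns only the best local alignment score, so it keeps just two rows of the score matrix; the standard traceback that recovers the alignment itself needs the full matrix, which is where the quadratic space comes from.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("ACGT", "AGT"))
```

Each cell depends on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; parallel implementations commonly exploit this.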

The extended full-potential (FPX) helicopter rotor computational fluid dynamics (CFD) code, written in Fortran, is successfully converted in its reduced two-dimensional version into a parallel version for multiprocessing. The FPX code, which has an internal grid generator, solves the compressible full-potential equation using an approximately factored finite-difference scheme with numerous added physical modeling enhancements, including viscous boundary layers, shock-induced entropy corrections and wake-vortex embedding. The parallel version of the code uses open multi-processing (OpenMP) directives as the parallel programming tool in a shared-memory (SM) environment. The OpenMP code is portable and scalable, and can run on various computer platforms including UNIX platforms and Windows NT platforms. A performance study of the parallel code on an SGI Origin 2000 UNIX platform is presented. The results show that reasonable speedups are obtained through parallelization and that OpenMP is an easy-to-use and efficient parallel programming tool for the present problem.

The purpose of this paper is to investigate the scalability and performance of seven simple OpenMP test programs and to compare their performance with equivalent MPI programs on an SGI Origin 2000. Data distribution directives were used to make sure that the OpenMP implementation had the same data distribution as the MPI implementation. For the matrix-times-vector (test 5) and the matrix-times-matrix (test 7) tests, the syntax allowed in OpenMP 1.1 does not allow OpenMP compilers to generate efficient code since the reduction clause is not currently allowed for arrays. (This problem is corrected in OpenMP 2.0.) For the remaining five tests, the OpenMP version performed and scaled significantly better than the corresponding MPI implementation, except for the right shift test (test 2) for a small message. Copyright © 2001 John Wiley & Sons, Ltd.
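
The array-reduction limitation mentioned above can be made concrete with a sketch of the manual workaround pattern: each worker accumulates into a private copy of the result array, and the copies are summed afterwards. This is an illustration only, not the paper's test code. Python threads stand in for OpenMP threads, the column-wise split is an assumed formulation of the matrix-times-vector test, and the name `matvec_by_columns` and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_by_columns(matrix, x, workers=4):
    """Compute y = matrix * x by splitting columns across workers.

    Every worker contributes to all of y, so each one gets a private
    partial result that is summed at the end (a hand-written array reduction).
    """
    n_rows, n_cols = len(matrix), len(x)

    def partial(worker_id):
        y_private = [0.0] * n_rows
        for j in range(worker_id, n_cols, workers):  # cyclic column split
            for i in range(n_rows):
                y_private[i] += matrix[i][j] * x[j]
        return y_private

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial, range(workers)))

    # Reduction step: element-wise sum of the private arrays.
    return [sum(p[i] for p in partials) for i in range(n_rows)]


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
y = matvec_by_columns(A, [1.0, 0.0, -1.0])
print(y)
```

With an array reduction clause, the private copies and the final merge are generated by the compiler instead of being written by hand.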

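Design and Implementation of a Cluster OpenMP System   Cited by: 5 in total (self-citations: 0, citations by others: 5)
Thanks to its ease of use and support for incremental parallelization, OpenMP has become the programming standard for shared-memory architectures. Clusters are now the mainstream platform for high-performance computing, so research on cluster OpenMP systems is valuable for advancing the development and adoption of parallel applications. Using the software DSM system JIAJIA as the OpenMP runtime system together with a front-end compiler, OMP2JIA, the authors implement the OpenMP/JIAJIA computing environment on a cluster; to improve performance, they extend the OpenMP directives to suit cluster characteristics and optimize the back-end runtime library. Using 11 OpenMP applications, they compare this environment with a hardware cc-NUMA system that supports OpenMP (SGI 2100). The results show an average speedup of 4.62 on 7 machines for the cluster OpenMP system versus 4.55 for the SGI 2100 system, so the two perform comparably.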
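
The speedup figures above can be put in perspective with a short calculation. The sketch assumes the standard definition of parallel efficiency (speedup divided by processor count). The abstract states the machine count only for the cluster (7), so efficiency is computed for the cluster alone and the two systems are compared by the ratio of their speedups.

```python
def parallel_efficiency(speedup, processors):
    """Standard definition: fraction of ideal linear speedup achieved."""
    return speedup / processors


cluster_speedup = 4.62  # average over 11 applications on 7 machines (from the abstract)
sgi_speedup = 4.55      # SGI 2100 figure from the abstract

cluster_eff = parallel_efficiency(cluster_speedup, 7)
ratio = cluster_speedup / sgi_speedup

print(f"cluster efficiency: {cluster_eff:.2f}")
print(f"cluster / SGI speedup ratio: {ratio:.3f}")
```

A ratio close to 1 is what the abstract means by the two systems performing comparably.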

The MPI interface is the de facto standard for message passing applications, but it is also complex and defines several usage patterns as erroneous. A current trend is the investigation of hybrid programming techniques that use MPI processes and multiple threads per process. As a result, more and more MPI implementations support multi-threading, the use of which is restricted by several rules of the MPI standard. In order to support developers of hybrid MPI applications, we present extensions to the MPI correctness checking tool Marmot. Basic extensions make it aware of OpenMP multi-threading, while further ones add new correctness checks. As a result, it is possible to detect errors that actually occur in a run with Marmot. However, some errors occur only for certain execution orders; we therefore present a novel approach using artificial data races, which allows us to employ thread checking tools, e.g., Intel Thread Checker, to detect MPI usage errors.

OpenMP has focused on performance for numerical applications, but when we try to shift this focus to other kinds of applications, such as Web servers, we find one important gap. In these applications, performance is important, but reliability is even more important, and OpenMP does not have any recovery mechanism. In this paper we present a novel proposal to address this gap. In order to add error handling to OpenMP we propose some extensions to the current OpenMP specification. A directive and a clause are proposed, defining a scope for the error handling (where the error can occur) and specifying a behaviour for handling the specific errors. Some examples of use are presented, along with an evaluation showing the impact of this proposal on OpenMP applications. We show that this impact is low enough to consider the proposal worthwhile for OpenMP.

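To address performance degradation in OpenMP programs, this paper introduces the concepts of performance degradation region and performance degradation intensity. Degradation intensity makes it possible to discard regions that are not degraded and to highlight degraded code segments with long execution times; decomposing a degradation region progressively narrows it down until the code segment causing the degradation is located precisely. Removing the root cause of the degradation effectively improves the execution performance of the OpenMP program. Case studies confirm the effectiveness of the proposed method for diagnosing and handling performance degradation in OpenMP programs.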

The message passing interface (MPI) is a standard used by many parallel scientific applications. It offers the advantage of a smoother migration path for porting applications from high performance computing systems to the Grid. In this paper Grid-enabled tools and libraries for developing MPI applications are presented. The first is MARMOT, a tool that checks the adherence of an application to the MPI standard. The second is PACX-MPI, an implementation of the MPI standard optimized for Grid environments. Besides the efficient development of the program, an optimal execution is of paramount importance for most scientific applications. We therefore discuss not only performance on the level of the MPI library, but also several application-specific optimizations, such as latency hiding, prefetching, caching and topology-aware algorithms, e.g., for a sparse parallel equation solver and an RNA folding code.

Inserting instrumentation code in a program is an effective technique for detecting, recording, and measuring many aspects of a program's performance. Instrumentation code can be added at any stage of the compilation process by specially-modified system tools such as a compiler or linker or by new tools from a measurement system. For several reasons, adding instrumentation code after the compilation process, by rewriting the executable file, presents fewer complications and leads to more complete measurements. This paper describes the difficulties in adding code to executable files that arose in developing the profiling and tracing tools qp and qpt. The techniques used by these tools to instrument programs on MIPS and SPARC processors are applicable in other instrumentation systems running on many processors and operating systems. In addition, many difficulties could have been avoided with minor changes to compilers and executable file formats. These changes would simplify this approach to measuring program performance and make it more generally useful.

This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to write portable code easily for a number of different platforms, relieving them of developing the specific code to offload tasks to the accelerators and to synchronize tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach is the productivity gains it yields for the programmer.
