Similar literature
 20 similar documents found (search time: 0 ms)
1.
This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules for instrumenting each directive and their combinations are applied to generate calls to the interface that are consistent with directive semantics and that pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked, and performance measurement to be controlled, using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and the OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications.
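As a rough illustration of source-level instrumentation by directive rewriting, the sketch below wraps each `#pragma omp parallel` block in a C source string with enter and exit calls that carry a region id and source location. It is a toy, not OPARI: the hook names `perf_region_enter` and `perf_region_exit`, the default file name, and the restriction to brace-delimited blocks that open on the line after the directive are all assumptions made for this example.

```python
import re

# Hypothetical hook names, chosen for this sketch only. They are not the
# interface defined in the paper.
ENTER_HOOK = "perf_region_enter"
EXIT_HOOK = "perf_region_exit"

# Matches a line that starts an OpenMP parallel directive.
PARALLEL_RE = re.compile(r"^\s*#pragma\s+omp\s+parallel\b")


def instrument(source: str, filename: str = "example.c") -> str:
    """Wrap every `#pragma omp parallel` block with enter/exit hook calls.

    Simplifications: the structured block must be brace-delimited and open
    with a lone `{` on the line after the directive, and braces inside
    strings or comments are not handled.
    """
    lines = source.splitlines()
    out = []
    region_id = 0
    i = 0
    while i < len(lines):
        line = lines[i]
        opens_block = i + 1 < len(lines) and lines[i + 1].strip() == "{"
        if PARALLEL_RE.match(line) and opens_block:
            region_id += 1
            # Pass context (region id, file, 1-based line of the directive).
            out.append(f'{ENTER_HOOK}({region_id}, "{filename}", {i + 1});')
            out.append(line)
            i += 1
            depth = 0
            # Copy the structured block through its matching closing brace.
            while i < len(lines):
                depth += lines[i].count("{") - lines[i].count("}")
                out.append(lines[i])
                i += 1
                if depth == 0:
                    break
            out.append(f"{EXIT_HOOK}({region_id});")
            continue
        out.append(line)
        i += 1
    return "\n".join(out)
```

Because the hooks are emitted outside the parallel block, they would run only on the thread that encounters the directive; a real tool would also need per-thread events inside the region and rules for the other directives.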

2.
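To address performance degradation in OpenMP programs, this paper introduces the concepts of a performance degradation region and performance degradation intensity. Using degradation intensity, regions that do not suffer degradation can be filtered out and degraded code segments with long execution times can be highlighted. At the same time, decomposing a degradation region progressively narrows it until the code segment that causes the degradation is located precisely. Removing the root cause of the degradation effectively improves the execution performance of an OpenMP program. A case study confirms the effectiveness of the proposed method for diagnosing and handling performance degradation in OpenMP programs.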

3.
OpenMP has focused on performance for numerical applications, but when we shift this focus to other kinds of applications, such as Web servers, one important gap appears. In these applications performance matters, but reliability matters even more, and OpenMP has no recovery mechanism. In this paper we present a novel proposal to address this gap. To add error handling to OpenMP, we propose extensions to the current OpenMP specification: a directive and a clause that define a scope for error handling (where the error can occur) and specify the behaviour for handling specific errors. We present some examples of use and an evaluation of the impact of this proposal on OpenMP applications. We show that this impact is low enough to consider the proposal worthwhile for OpenMP.

4.
5.
OpenMP is an emerging industry standard for shared memory architectures. While OpenMP has advantages in ease of use and incremental programming, message passing is still the most widely used programming model for distributed memory architectures. How to extend OpenMP effectively to distributed memory architectures has been an active research topic. This paper proposes an OpenMP system, called KLCoMP, for distributed memory architectures. Based on the partially replicating shared arrays memory model, we propose ...

6.
Estimating and minimizing the internal energy of a molecule are important components of molecular modeling applications. For macromolecules such as proteins and amino acids, energy estimation is performed using empirical equations known as force fields. Over the past several decades, much effort has been directed towards improving the accuracy of these equations, and the resulting increased accuracy has come at the expense of greater computational complexity. For example, the interactions between a protein and surrounding water molecules have been modeled with improved accuracy using the generalized Born solvation model, which increases the computational complexity to O(n^3). Fortunately, many force-field calculations are amenable to parallel execution. This paper describes the steps that were required to transform the Born calculation from a serial program into a parallel program suitable for execution in both the OpenMP and MPI environments. Measurements of the parallel performance on a symmetric multiprocessor reveal that the Born calculation scales well for up to 144 processors. In some cases the OpenMP implementation scales better than the MPI implementation, and in other cases the reverse holds. However, in all cases the OpenMP implementation performs better than the MPI implementation, and it requires less programming effort as well.

7.
In this paper, we describe our experience of creating an OpenMP implementation of the SPICE3 circuit simulator program. Given the irregular patterns of access to dynamic data structures in the SPICE code, a parallelization using current standard OpenMP directives is impossible without major rewriting of the original program. The aim of this work is to present a case study showing the development of a shared memory parallel code with minimum effort. We present two implementations, one with minimal code modification and one without modification to the original SPICE3 program, using Intel's taskq construct. We also discuss the results of the case study in terms of what future compiler tools may be needed to help OpenMP application developers with similar porting goals. For our experiments, which are based on SRAM model simulation, SPICE3 was compiled with the Sun compiler and run on a SunFire V880 (UltraSPARC-III, 750 MHz), and compiled with the Intel icc compiler and run on both an IBM Itanium machine with four CPUs and a two-processor Intel Xeon machine. The results are promising.

8.
We describe a performance study of a multi-zone application benchmark implemented with several OpenMP approaches that exploit multi-level parallelism and deal with unbalanced workload. The multi-zone application was derived from the well-known NAS Parallel Benchmarks (NPB) suite and involves flow solvers on collections of loosely coupled discretization meshes. Parallel versions of this application have been developed using the Subteam concept and the Workqueuing model as extensions to the current OpenMP. We examine the performance impact of these extensions to OpenMP and compare them with hybrid and nested OpenMP approaches on several large parallel systems.

9.
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions.

10.
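As parallel computing is widely applied in military, industrial and other fields, more users are turning to parallel processing to solve problems, and parallel program development for embedded multi-core, multi-processor platforms is becoming more common. Parallel debugging is an important part of parallel program development, and debugging real-time applications is very complex, yet parallel debugging environments remain relatively weak. This article studies and designs a performance analysis tool for parallel embedded real-time systems that can trace and analyze applications using very few resources and offers high performance.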

11.
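To address the difficulty of controlling Stewart parallel robots, a host-computer control system for a Stewart parallel robot based on OpenMP is implemented, providing fast, effective and intuitive vibration control of the robot. The system comprises a dynamics solving module, a data transmission module and a human-machine interaction module. It not only computes vibration data accurately and controls the robot's vibration, but also provides a clean and simple user interface that improves the user experience. To improve the execution efficiency of the software, OpenMP multithreaded parallel computing is used to accelerate the control algorithm, reaching a maximum speedup of 2.18. The correctness of the computation, the stability of the control and the execution efficiency of the software are verified.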

12.
Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper, we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.

13.
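To make full use of existing multi-core CPU computing resources, a fast parallel slicing algorithm based on the OpenMP framework is proposed and its performance is discussed. The algorithm uses the natural grouping features of the model to build topological relations group by group, which shortens the time needed to build the model's topological data structure. On this basis, OpenMP multithreading is used to parallelize both the construction of the topology and the computation of layer contours, achieving a speedup close to the number of CPU cores, so slicing time drops markedly. Example computations on very large STL files of complex 3D models show that the algorithm is efficient and easy to implement.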

14.
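Against the background of performance testing and analysis of an IBM Blade Center high-performance cluster system, and based on tests of that system, methods for optimizing parallel program performance on large-scale Linux clusters are studied. Tailored to the hardware characteristics of the cluster, a parallel performance test program based on Cannon's algorithm for matrix multiplication was written. By testing and analyzing this program in three parallel environments on the cluster (distributed-memory parallel, shared-memory parallel and hybrid parallel), guidelines for optimizing program performance are proposed and the efficiency of the platform is demonstrated, providing a basis for further development and research.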
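For readers unfamiliar with Cannon's algorithm, the following sketch simulates it sequentially on a q x q grid of blocks. It is not the test program from the abstract: the nested Python lists stand in for per-process block storage, the list rotations stand in for the neighbour-to-neighbour shifts, and all function names are invented for this example.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_into(c, d):
    """Accumulate block d into block c in place."""
    for i in range(len(c)):
        for j in range(len(c[0])):
            c[i][j] += d[i][j]


def split_blocks(m, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    bs = len(m) // q
    return [[[row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]
             for j in range(q)] for i in range(q)]


def join_blocks(blocks):
    """Reassemble a q x q grid of blocks into one matrix."""
    out = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            row = []
            for block in block_row:
                row.extend(block[r])
            out.append(row)
    return out


def cannon(a, b, q):
    """Multiply square matrices a and b on a simulated q x q process grid."""
    n = len(a)
    assert n % q == 0, "matrix size must be divisible by the grid size"
    bs = n // q
    A, B = split_blocks(a, q), split_blocks(b, q)
    C = [[[[0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]
    # Initial skew: block row i of A rotates left by i positions and
    # block column j of B rotates up by j positions.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        # Each grid position multiplies the two blocks it currently holds.
        for i in range(q):
            for j in range(q):
                add_into(C[i][j], matmul(A[i][j], B[i][j]))
        # Rotate A one step left and B one step up for the next round.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return join_blocks(C)
```

After the skew, grid position (i, j) holds A block (i, (i+j) mod q) and B block ((i+j) mod q, j), so the inner indices always match, and the q rotations visit every inner index once. In a distributed version each rotation would be a message exchange with a neighbouring process; in a shared-memory version the two inner loops over i and j are the natural candidates for parallel execution.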

15.
The aim of this work is to provide a high performance air quality simulation using the sulphur transport Eulerian model 2 (STEM-II) program. First, we optimize the sequential program to increase data locality. Then, the optimized program is parallelized using OpenMP shared-memory directives. Experimental results on a 32-processor SGI Origin 2000 show that the parallel program achieves significant reductions in execution time.

16.
17.
This paper discusses a novel approach to implementing OpenMP on clusters. Traditional approaches rely on Software Distributed Shared Memory systems to handle shared data. We discuss these and then introduce an alternative approach that translates OpenMP to Global Arrays (GA), explaining the basic strategy. GA requires a data distribution. We do not expect the user to supply this; rather, we show how we perform data distribution and work distribution according to the user-supplied OpenMP static loop schedules. An inspector–executor strategy is employed for irregular applications in order to gather information on accesses to potentially non-local data, group non-local data transfers and overlap communications with local computations. Furthermore, a new directive, INVARIANT, is proposed to provide information about the dynamic scope of data access patterns. This directive can help us generate efficient code for irregular applications using the inspector–executor approach. We also illustrate how to deal with some hard cases containing reshaping and strided accesses during the translation. Our experiments show promising results for the corresponding regular and irregular GA codes.
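The inspector–executor idea can be sketched in a few lines under simple assumptions: a one-dimensional array block-distributed over ranks, an indirect read pattern given by an index list, and a caller-supplied `fetch(rank, index_list)` function standing in for a grouped remote transfer. None of these names come from the paper or from the Global Arrays API, and overlapping communication with local computation is left out.

```python
def owner(index, block_size):
    """Rank that owns a global index under a simple block distribution."""
    return index // block_size


def inspector(indices, my_rank, block_size):
    """Scan the access pattern once and group non-local indices by owner.

    Returns a schedule: {owning rank: sorted list of distinct indices}.
    """
    needed = {}
    for idx in indices:
        rank = owner(idx, block_size)
        if rank != my_rank:
            needed.setdefault(rank, set()).add(idx)
    return {rank: sorted(idxs) for rank, idxs in needed.items()}


def executor(indices, schedule, my_rank, block_size, local_block, fetch):
    """Fetch remote values with one grouped request per owner, then compute.

    `fetch(rank, index_list)` is a hypothetical stand-in for a remote get and
    must return the values in the same order as `index_list`.
    """
    ghost = {}
    for rank, idx_list in schedule.items():
        for idx, value in zip(idx_list, fetch(rank, idx_list)):
            ghost[idx] = value
    base = my_rank * block_size
    total = 0
    for idx in indices:
        if owner(idx, block_size) == my_rank:
            total += local_block[idx - base]   # local access
        else:
            total += ghost[idx]                # value prefetched by executor
    return total
```

If the access pattern does not change between iterations, the schedule returned by `inspector` can be reused by many `executor` calls, which is the kind of information the abstract says the proposed INVARIANT directive is meant to convey.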

18.
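To analyze load balance in OpenMP programs accurately, the appropriate locations for taking measurements between synchronization points are analyzed in detail, a performance analysis unit is defined, a formula for the degree of load imbalance is given, and a method is proposed that measures the load balance of OpenMP parallel programs using the performance analysis unit as the object of analysis. The method uses Opari to insert POMP performance monitoring functions into the OpenMP source program automatically and, by inserting timers into the relevant performance functions, collects the program's load information with the analysis unit as the basic object. The method has been implemented in an OpenMP performance analysis tool and effectively helps users find load-imbalance bottlenecks in their programs.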
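The abstract does not reproduce its formula for the degree of load imbalance, so the sketch below substitutes one common definition as an assumption: for the per-thread busy times measured in one analysis unit (the code between two synchronization points), imbalance = (max - mean) / max. The function names and the sample unit names are likewise invented for illustration.

```python
def imbalance(thread_times):
    """Degree of load imbalance for one analysis unit.

    Assumed definition (not taken from the paper): (max - mean) / max.
    0.0 means all threads were equally busy; values near 1.0 mean that
    one thread did almost all of the work while the others waited.
    """
    longest = max(thread_times)
    if longest == 0:
        return 0.0
    mean = sum(thread_times) / len(thread_times)
    return (longest - mean) / longest


def rank_units(units):
    """Order analysis units from most to least imbalanced.

    `units` maps a unit name to the list of per-thread times collected by
    timers at the unit's start and end.
    """
    return sorted(units, key=lambda name: imbalance(units[name]), reverse=True)
```

For example, per-thread times of 4.0, 2.0, 2.0 and 0.0 seconds have a mean of 2.0 and a maximum of 4.0, so this definition gives an imbalance of 0.5, while four equal times give 0.0.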

19.
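To increase the encoding speed of AVS, the new-generation audio and video coding standard, OpenMP is used on a multi-core processor platform to study and implement GOP-level, slice-level, frame-level and task-queue-based frame-level parallel encoding algorithms for AVS. Tests on CIF-format video sequences show a maximum speedup of 3.82x on a quad-core platform. In addition, the task-queue-based frame-level parallel algorithm overcomes the low speedup of the frame-level parallel algorithm while keeping image quality unchanged. Experimental results show that OpenMP is a simple and effective parallel programming tool, and that each OpenMP-based AVS parallel encoding algorithm encodes significantly faster than the original serial algorithm.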

20.
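Design and Implementation of an OpenMP System for Clusters   Total citations: 5, self-citations: 0, citations by others: 5
With its ease of use and support for incremental parallelization, OpenMP has become the programming standard for shared-memory architectures. Cluster systems are now the mainstream platform for high-performance computing, so studying OpenMP systems for clusters is highly meaningful for advancing the development and adoption of parallel applications. Using the software DSM system JIAJIA as the OpenMP runtime system, together with a front-end compiler, OMP2JIA, the authors implemented the OpenMP/JIAJIA computing environment on a cluster system. To improve performance, they also extended the OpenMP directives according to the characteristics of cluster systems and optimized the back-end runtime library. Using 11 OpenMP applications, the authors compared the performance of this environment with that of a hardware cc-NUMA system that supports OpenMP (SGI 2100). The results show an average speedup of 4.62 on 7 machines for the authors' cluster OpenMP system and 4.55 for the SGI 2100 system, so the two perform comparably.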
