期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Design and Prototype of a Performance Tool Interface for OpenMP

Mohr Bernd Malony Allen D. Shende Sameer Wolf Felix 《The Journal of supercomputing》2001,23(1):105-128

This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications. 相似文献

2.

PC机群上共享存储与消息传递的比较 总被引：7，自引：0，他引：7

下载免费PDF全文

章隆兵吴少刚蔡飞胡伟武《软件学报》2004,15(6):842-849

共享存储和消息传递是目前两种主流的并行编程模型.一般认为,消息传递的可编程性不及共享存储友好.OpenMP是目前共享存储编程的实际工业标准.机群OpenMP系统在机群上提供了OpenMP编程环境,具有易编程和可扩展的特点,但是其性能如何一直是关注的热点.以机群OpenMP系统OpenMP/JIAJIA和典型的消息传递系相似文献

3.

Complete Formal Specification of the OpenMP Memory Model

Greg Bronevetsky Bronis R. de Supinski 《International journal of parallel programming》2007,35(4):335-392

相似文献

4.

Optimizing OpenMP Programs on Software Distributed Shared Memory Systems

Min Seung-Jai Basumallik Ayon Eigenmann Rudolf 《International journal of parallel programming》2003,31(3):225-249

This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed system. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in very few applications only. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, resulting in 31% average improvement on our benchmarks. 相似文献

5.

Performance Evaluation of Mixed-Mode OpenMP/MPI Implementations

J. Mark Bull James Enright Xu Guo Chris Maynard Fiona Reid 《International journal of parallel programming》2010,38(5-6):396-417

With the current prevalence of multi-core processors in HPC architectures mixed-mode programming, using both MPI and OpenMP in the same application, is seen as an important technique for achieving high levels of scalability. As there are few standard benchmarks written in this paradigm, it is difficult to assess the likely performance of such programs. To help address this, we examine the performance of mixed-mode OpenMP/MPI on a number of popular HPC architectures, using a synthetic benchmark suite and two large-scale applications. We find performance characteristics which differ significantly between implementations, and which highlight possible areas for improvement, especially when multiple OpenMP threads communicate simultaneously via MPI. 相似文献

6.

Signals,timers, and continuations for multithreaded user‐level protocols

Juan Carlos Gomez Jorge R. Ramos Vernon Rego 《Software》2006,36(5):449-471

Precise timing and asynchronous I/O are appealing features for many applications. Unix kernels provide such features on a per‐process basis, using signals to communicate asynchronous events to applications. Per‐process signals and timers are grossly inadequate for complex multithreaded applications that require per‐thread signals and timers that operate at finer granularity. To respond to this need, we present a scheme that integrates asynchronous (Unix) signals with user‐level threads, using the ARIADNE system as a platform. This is done with a view towards support for portable, multithreaded, and multiprotocol distributed applications, namely the CLAM (connectionless, lightweight, and multiway) communications library. In the same context, we propose the use of continuations as an efficient mechanism for reducing thread context‐switching and busy‐wait overheads in multithreaded protocols. Our proposal for integrating timers and signal‐handling mechanisms not only solves problems related to race conditions, but also offers an efficient and flexible interface for timing and signalling threads. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献

7.

GROMACS软件并行计算性能分析

张宝花徐顺《计算机系统应用》2016,25(12):16-23

分子动力学模拟是对微观分子原子体系在时间与空间上的运动模拟,是从微观本质上认识体系宏观性质的有力方法.针对如何提升分子动力学并行模拟性能的问题,本文以著名软件GROMACS为例,分析其在分子动力学模拟并行计算方面的实现策略,结合分子动力学模拟关键原理与测试实例,提出MPI+OpenMP并行环境下计算性能的优化策略,为并行计算环境下实现分子动力学模拟的最优化计算性能提供理论和实践参考.对GPU异构并行环境下如何进行MPI、OpenMP、GPU搭配选择以达到性能最优,本文亦给出了一定的理论和实例参考. 相似文献

8.

The Design of OpenMP Tasks 总被引：2，自引：0，他引：2

Ayguad Eduard Copty Nawal Duran Alejandro Hoeflinger Jay Lin Yuan Massaioli Federico Teruel Xavier Unnikrishnan Priya Zhang Guansong 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(3):404-418

OpenMP has been very successful in exploiting structured parallelism in applications. With increasing application complexity, there is a growing need for addressing irregular parallelism in the presence of complicated control structures. This is evident in various efforts by the industry and research communities to provide a solution to this challenging problem. One of the primary goals of OpenMP 3.0 is to define a standard dialect to express and efficiently exploit unstructured parallelism. This paper presents the design of the OpenMP tasking model by members of the OpenMP 3.0 tasking sub-committee which was formed for this purpose. The paper summarizes the efforts of the sub-committee (spanning over two years) in designing, evaluating and seamlessly integrating the tasking model into the OpenMP specification. In this paper, we present the design goals and key features of the tasking model, including a rich set of examples and an in-depth discussion of the rationale behind various design choices. We compare a prototype implementation of the tasking model with existing models, and evaluate it on a wide range of applications. The comparison shows that the OpenMP tasking model provides expressiveness, flexibility, and huge potential for performance and scalability. 相似文献

9.

Design and Prototype of a Performance Tool Interface for OpenMP

Bernd Mohr Allen D. Malony Sameer Shende Felix Wolf 《The Journal of supercomputing》2002,23(1):105-128

This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications. 相似文献

10.

高性能应用软件并行优化策略的研究

贾伟乐史小冬吕海峰《数据与计算发展前沿》2013,4(4):89-93

多核时代的来临对现有的应用软件提出了严重挑战,串行代码难以充分发挥硬件资源的性能;软件的并行优化成为亟待解决的重要问题。本文综合了 MPI,OpenMP,众核编程模型 CUDA 三个编程模型进行研究,讨论了适用于不同软件并行优化的方法,提出了适用于企业级应用的软件并行优化策略,最后总结和展望了软件并行优化的挑战和前景。相似文献

11.

Supporting Nested OpenMP Parallelism in the TAU Performance System

Alan Morris Allen D. Malony Sameer S. Shende 《International journal of parallel programming》2007,35(4):417-436

Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads. 相似文献

12.

机群OpenMP系统的设计与实现 总被引：5，自引：0，他引：5

吴少刚章隆兵蔡飞顾丽红唐志敏《计算机学报》2004,27(7):904-912

OpenMP以其易用性和支持增量并行的特点成为共享存储体系结构的编程标准．目前机群系统已成为高性能计算的主流平台,研究机群OpenMP系统对推进并行应用的开发和普及非常有意义．该文作者以软件DSM系统JIAJIA作为OpenMP的运行时系统,结合一个前端编译器OMP2JIA,在机群系统上实现了OpenMP／JIAJIA计算环境,同时在提高性能方面根据机群系统特点扩展了OpenMP制导,优化了后端运行时库。通过11个OpenMP应用,作者比较了该计算环境和一个支持OpenMP的硬件cc-NUMA系统(SGI 2100)的性能．结果表明,作者的机群OpenMP系统的7机平均加速比为4．62;SGI 2100系统为4．55,二者性能相当．相似文献

13.

CLOMP: Accurately Characterizing OpenMP Application Overheads

Greg Bronevetsky John Gyllenhaal Bronis R. de Supinski 《International journal of parallel programming》2009,37(3):250-265

Despite its ease of use, OpenMP has failed to gain widespread use on large scale systems, largely due to its failure to deliver sufficient performance. Our experience indicates that the cost of initiating OpenMP regions is simply too high for the desired OpenMP usage scenario of many applications. In this paper, we introduce CLOMP, a new benchmark to characterize this aspect of OpenMP implementations accurately. CLOMP complements the existing EPCC benchmark suite to provide simple, easy to understand measurements of OpenMP overheads in the context of application usage scenarios. Our results for several OpenMP implementations demonstrate that CLOMP identifies the amount of work required to compensate for the overheads observed with EPCC. We also show that CLOMP also captures limitations for OpenMP parallelization on SMT and NUMA systems. Finally, CLOMPI, our MPI extension of CLOMP, demonstrates which aspects of OpenMP interact poorly with MPI when MPI helper threads cannot run on the NIC. 相似文献

14.

基于分块数据结构的冲击问题并行计算 总被引：1，自引：0，他引：1

肖永浩黄清南《计算机辅助工程》2011,20(1):33-36

针对三维冲击问题,基于分块数据结构在共享内存并行机上实现OpenMP并行计算.分块数据结构不仅能有效利用计算机多层存储结构,而且增加OpenMP的并行粒度.数值实验表明:在使用分块数据结构后,串行程序的计算速度能提高3倍.通过柱体冲击平板数值模拟实验讨论并行程序的加速比和效率,表明并行程序能有效减少总计算时间. 相似文献

15.

面向CPS复杂事件流的不确定性研究

下载免费PDF全文

曹科宁李仁发张小明张鑫龙《计算机工程与科学》2015,37(3):415-421

信息物理融合系统CPS获得广泛应用需要解决的一个关键问题是软件中的信息处理部分,而复杂事件处理是CPS中信息处理的核心任务之一。CPS环境下的事件具有异构、分散、海量和不确定性等特征。在CPS实际应用中,因噪声、传感器误差、通讯技术等原因而造成的事件不确定性急需解决。为了解决CPS系统中存在的海量不确定事件流问题,提出一种处理不确定事件流的复杂事件处理方法USCEP,该方法不仅可以实时有效地处理海量不确定事件流,还可以有效计算复杂事件的概率。USCEP对现有RFID复杂事件监测方法 RCEDA进行了改进,提供了历史概率事件查询处理的支持,提出一种事件概率模型进行概率计算,并通过关联查询表来提高效率。实验表明,在处理不确定事件流时,该方法比传统方法具有更好的性能。相似文献

16.

Employing nested OpenMP for the parallelization of multi-zone computational fluid dynamics applications

《Journal of Parallel and Distributed Computing》2006,66(5):686-697

In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study, we use the NanosCompiler that supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms (MPI+OpenMP and MLP) and discuss OpenMP implementation issues that affect the performance of multi-level parallel applications. 相似文献

17.

A pi-calculus based semantics for WS-BPEL

《The Journal of Logic and Algebraic Programming》2007,70(1):96-118

相似文献

18.

Performance‐based parallel loop self‐scheduling using hybrid OpenMP and MPI programming on multicore SMP clusters

Chao‐Tung Yang Chao‐Chin Wu Jen‐Hsiang Chang 《Concurrency and Computation》2011,23(8):721-744

Parallel loop self‐scheduling on parallel and distributed systems has been a critical problem and it is becoming more difficult to deal with in the emerging heterogeneous cluster computing environments. In the past, some self‐scheduling schemes have been proposed as applicable to heterogeneous cluster computing environments. In recent years, multicore computers have been widely included in cluster systems. However, previous researches into parallel loop self‐scheduling did not consider certain aspects of multicore computers; for example, it is more appropriate for shared‐memory multiprocessors to adopt Open Multi‐Processing (OpenMP) for parallel programming. In this paper, we propose a performance‐based approach using hybrid OpenMP and MPI parallel programming, which partition loop iterations according to the performance weighting of multicore nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node. Experimental results show that the proposed approach performs better than previous schemes. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

19.

Low‐level PGAS computing on many‐core processors with TSHMEM

Bryant C. Lam Alan D. George Herman Lam Vikas Aggarwal 《Concurrency and Computation》2015,27(17):5288-5310

Diminishing returns from increased clock frequencies and instruction‐level parallelism have forced computer architects to adopt architectures that exploit wider parallelism through multiple processor cores. While emerging many‐core architectures have progressed at a remarkable rate, concerns arise regarding the performance and productivity of numerous parallel‐programming tools for application development. Development of parallel applications on many‐core processors often requires developers to familiarize themselves with unique characteristics of a target platform while attempting to maximize performance and maintain correctness of their applications. The family of partitioned global address space (PGAS) programming models comprises the current state of the art in balancing performance and programmability. One such PGAS approach is SHMEM, a lightweight, shared‐memory programming library that has demonstrated high performance and productivity potential for parallel‐computing systems with distributed‐memory architectures. In the paper, we present research, design, and analysis of a new SHMEM infrastructure specifically crafted for low‐level PGAS on modern and emerging many‐core processors featuring dozens of cores and more. Our approach (with a new library known as TSHMEM) is investigated and evaluated atop two generations of Tilera architectures, which are among the most sophisticated and scalable many‐core processors to date, and is intended to enable similar libraries atop other architectures now emerging. In developing TSHMEM, we explore design decisions and their impact on parallel performance for the Tilera TILE‐Gx and TILEPro many‐core architectures, and then evaluate the designs and algorithms within TSHMEM through microbenchmarking and applications studies with other communication libraries. Our results with barrier primitives provided by the Tilera libraries show dissimilar performance between the TILE‐Gx and TILEPro; therefore, TSHMEM's barrier design takes an alternative approach and leverages the on‐chip mesh network to provide consistent low‐latency performance. In addition, our experiments with TSHMEM show that naive collective algorithms consistently outperformed linear distributed collective algorithms when executed in an SMP‐centric environment. In leveraging these insights for the design of TSHMEM, our approach outperforms the OpenSHMEM reference implementation, achieves similar to positive performance over OpenMP and OSHMPI atop MPICH, and supports similar libraries in delivering high‐performance parallel computing to emerging many‐core systems. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

20.

基于Docker的MPI和OpenMP混合编程

赵博颖肖鹏张力《计算机与现代化》2018,(5):60

针对当前搭建集群并行系统复杂且耗时等问题,提出基于Docker搭建并行系统。介绍轻量级虚拟化技术Docker的核心概念和基本架构,并基于Docker技术在Linux平台上搭建集群并行开发环境。简要阐述并行计算的思想,叙述MPI和OpenMP并行计算的基本概念和特点,针对矩阵并行乘法的算法建立MPI和OpenMP的混合编程模型,并给出混合编程模型与MPI并行编程模型以及OpenMP并行编程模型的性能对比,分析出现差异的原因。基于该混合编程模型比较Docker与传统物理机两者搭建的并行系统的并行效率。相似文献