期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

章隆兵吴少刚蔡飞胡伟武《软件学报》2004,15(6):842-849

共享存储和消息传递是目前两种主流的并行编程模型.一般认为,消息传递的可编程性不及共享存储友好.OpenMP是目前共享存储编程的实际工业标准.机群OpenMP系统在机群上提供了OpenMP编程环境,具有易编程和可扩展的特点,但是其性能如何一直是关注的热点.以机群OpenMP系统OpenMP/JIAJIA和典型的消息传递系相似文献

2.

Wavelet Packet Image Decomposition on MIMD Architectures

《Real》2002,8(5):399-412

In this work, we describe and analyze algorithms for 2D wavelet packet (WP) decomposition for multicomputers and multiprocessors. In the case of multicomputers, the main goal is the generalization of former parallel WP algorithms which are constrained to a number of processor elements equal to a power of 4. For multiprocessors, we discuss several optimizations of shared-memory algorithms and finally we compare the results obtained on multicomputers and multi-processors employing the message passing (MPI) and shared-memory programming (OpenMP) paradigm, respectively. 相似文献

3.

A parallel CFD rotor code using OpenMP

《Advances in Engineering Software》2001,32(8):665-671

The extended full-potential (FPX) helicopter rotor computational fluid dynamics (CFD) code of Fortran in its reduced two-dimensional version is successfully converted into a parallel version for multiprocessing. The FPX code with an internal grid generator solves the compressible full-potential equation using an approximately factored finite-difference scheme with added numerous physical modeling enhancements, including viscous boundary layers, shock-induced entropy corrections and wake-vortex embedding. The parallel version of the code uses open multi-processing (OpenMP) directives as parallel programming tool in shared-memory (SM) environment. The OpenMP code is portable and scalable, which can run on various computer platforms including UNIX platforms and Windows NT platforms. The performance study of the parallel code on SGI Origin 2000 UNIX platform is made. The results show that reasonable speedups through parallelization are obtained and that OpenMP is easy to use and an efficient parallel programming tool for the present problem. 相似文献

4.

Exploiting Distributed-Memory and Shared-Memory Parallelism on Clusters of SMPs with Data Parallel Programs

Benkner Siegfried Sipkova Viera 《International journal of parallel programming》2003,31(1):3-19

Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high-level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results which show the effectiveness of these techniques. 相似文献

5.

OpenMP for Networks of SMPs

Y. Charlie Hu Honghui Lu Alan L. Cox Willy Zwaenepoel 《Journal of Parallel and Distributed Computing》2000,60(12):160

In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions. 相似文献

6.

Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

Nikolopoulos Dimitrios S. Ayguadé Eduard Polychronopoulos Constantine D. 《International journal of parallel programming》2002,30(4):225-255

This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer. 相似文献

7.

Optimizing OpenMP Programs on Software Distributed Shared Memory Systems

Min Seung-Jai Basumallik Ayon Eigenmann Rudolf 《International journal of parallel programming》2003,31(3):225-249

This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed system. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in very few applications only. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, resulting in 31% average improvement on our benchmarks. 相似文献

8.

基于SMP集群系统的并行编程模式研究与分析

宋伟宋玉《微机发展》2007,17(2):164-167

并行计算技术是计算机技术发展的重要方向之一,SMP与集群是当前主流的并行体系结构。当前并行程序设计方法主要采用基于消息传递模型的MPI和基于共享存储模型的OpenMP,两种编程模式各有特点和适用范围。对SMP集群以及MPI和OpenMP的特点进行了分析,介绍了在SMP集群系统中利用MPI和OpenMP混合编程的可行性方法。相似文献

9.

The distributed virtual shared-memory system based on the InfiniBand architecture

《Journal of Parallel and Distributed Computing》2005,65(10):1271-1280

Even though there have been strong research activities about distributed virtual shared-memory (DVSM) systems, their architectures have been not widely used in current high-performance computing markets. The reason is that the previously introduced DVSM systems use conventional interconnection technologies like Ethernet, which incurs high execution overhead due to process interruption at data communication for memory consistency. In this paper, we present the DVSM architecture based on the next generation of an interconnection technique, the InfiniBand Architecture (IBA). Because the IBA supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations in hardware, we can minimize the communication overhead for memory consistency on the DVSM system. For characterizing multithreaded applications on our IBA-based DVSM system, we examined two different shared-memory programming models, i.e. SPMD and OpenMP benchmarks. We show that our DVSM to use full features of the IBA can improve the performance significantly over the IPoIB-based DVSM system in all benchmarks, and also comparable to the bus-based shared-memory multiprocessor system in some benchmarks. 相似文献

10.

针对特普利茨线性系统的多级并行算法

下载免费PDF全文

张哲《计算机工程》2011,37(1):36-38

利用并行体系结构中不同层次级别的内存和计算单元,提出一种求解对称结构化特普利茨线性系统的多级并行算法。通过数学推导将特普利茨线性系统转换成柯西式线性系统,利用消息传递接口和开放多平台共享内存并行程序设计工具实现该算法,并通过实验验证其可行性。相似文献

11.

Nebelung: Execution Environment for Transactional OpenMP

Miloš Milovanović Roger Ferrer Vladimir Gajinov Osman S. Unsal Adrian Cristal Eduard Ayguadé Mateo Valero 《International journal of parallel programming》2008,36(3):326-346

Future generations of Chip Multiprocessors (CMP) will provide dozens or even hundreds of cores inside the chip. Writing applications that benefit from the massive computational power offered by these chips is not going to be an easy task for mainstream programmers who are used to sequential algorithms rather than parallel ones. This paper explores the possibility of using Transactional Memory (TM) in OpenMP, the industrial standard for writing parallel programs on shared-memory architectures, for C, C++ and Fortran. One of the major complexities in writing OpenMP applications is the use of critical regions (locks), atomic regions and barriers to synchronize the execution of parallel activities in threads. TM has been proposed as a mechanism that abstracts some of the complexities associated with concurrent access to shared data while enabling scalable performance. The paper presents a first proof-of-concept implementation of OpenMP with TM. Some language extensions to OpenMP are proposed to express transactions. These extensions are implemented in our source-to-source OpenMP Mercurium compiler and our Software Transactional Memory (STM) runtime system Nebelung that supports the code generated by Mercurium. Hardware Transactional Memory (HTM) or Hardware-assisted STM (HaSTM) are seen as possible paths to make the tandem TM-OpenMP more scalable. In the evaluation section we show the preliminary results. The paper finishes with a set of open issues that still need to be addressed, either in OpenMP or in the hardware/software implementations of TM. 相似文献

12.

OpenMP: shared-memory parallelism from the ashes

Throop J. 《Computer》1999,32(5):108-109

The paper discusses OpenMP, an emerging standard for shared-memory parallelism that has considerable industry support. Large corporations such as IBM, Intel, and Silicon Graphics are building products to support the standard and also participating in its development 相似文献

13.

High Performance Air Pollution Simulation Using OpenMP

María J. Martín Marta Parada Ramón Doallo 《The Journal of supercomputing》2004,28(3):311-321

The aim of this work is to provide a high performance air quality simulation using the sulphur transport Eulerian model 2 (STEM-II) program. First of all we optimize the sequential program with the aim of increasing data locality. Then, the optimized program is parallelized using OpenMP shared-memory directives. Experimental results on a 32-processor SGI Origin 2000 show that the parallel program achieves important reductions in the execution times. 相似文献

14.

The Nornir run-time system for parallel programs using Kahn process networks on multi-core machines—a flexible alternative to MapReduce

Željko Vrba Pål Halvorsen Carsten Griwodz Paul Beskow Håvard Espeland Dag Johansen 《The Journal of supercomputing》2013,63(1):191-217

Even though shared-memory concurrency is a paradigm frequently used for developing parallel applications on small- and middle-sized machines, experience has shown that it is hard to use. This is largely caused by synchronization primitives which are low-level, inherently non-deterministic, and, consequently, non-intuitive to use. In this paper, we present the Nornir run-time system. Nornir is comparable to well-known frameworks such as MapReduce and Dryad that are recognized for their efficiency and simplicity. Unlike these frameworks, Nornir also supports process structures containing branches and cycles. Nornir is based on the formalism of Kahn process networks, which is a shared-nothing, message-passing model of concurrency. We deem this model a simple and deterministic alternative to shared-memory concurrency. Experiments with real and synthetic benchmarks on up to 8 CPUs show that performance in most cases scales almost linearly with the number of CPUs, when not limited by data dependencies. We also show that the modeling flexibility allows Nornir to outperform its MapReduce counterparts using well-known benchmarks. 相似文献

15.

Optimizations and OpenMP implementation for the direct simulation Monte Carlo method

Da Gao Thomas E. Schwartzentruber 《Computers & Fluids》2011,42(1):73-81

Parallel implementation of a three-dimensional direct simulation Monte Carlo (DSMC) code employing complex data structures and dynamic memory allocation is detailed for shared memory systems using Open Multi-Processing (OpenMP). Several techniques to optimize the serial implementation of the DSMC method are first discussed. Specifically for a 3-level Cartesian grid, a Cartesian-based movement technique including particle indexing is demonstrated to result in a modest decrease in overall simulation expense of 34% compared with a ray-tracing technique combined with stored cell-connectivity. Two strategies for data localization leading to optimal usage of cache memory are demonstrated to speed up certain cell-based functions (such as collision computations) by a factor of 3.38–4.36. The shared-memory parallel implementation using OpenMP is then described in detail. Synchronization points and related critical sections are identified as major factors that impact the OpenMP parallel performance. Techniques to remove all such synchronization points in the OpenMP implementation of the DSMC method are outlined. For dual-core and quad-core systems, speedups of 1.99 and 3.74, respectively, are obtained for a (free-stream flow) test simulation with low granularity. Finally, the parallel performance of identical source code employing OpenMP is shown to be strongly correlated to the underlying computer architecture. Both Symmetric Multiprocessor (SMP) and non-uniform memory access (NUMA) systems are studied in order to achieve a better understanding of their impacts on parallel scalability when using OpenMP. 相似文献

16.

基于多核处理器的三维场景并行化探讨

吴玮欣陆达《计算机与现代化》2010,(3):15-18

提高三维场景的运行速度一直以来都是程序开发人员需要面临的一大难题,随着面向主流应用的多核处理器的出现与普及,利用处理器提供的多个内核而不通过编写多线程的方法来提高程序的并行性成为了一种可能。本文介绍虚拟现实开发工具OpenGL和共享存储系统并行编程接口OpenMP;分析OpenGL绘制三维场景的一般过程;并以纹理映射为例着重探讨在OpenGL程序中使用OpenMP来提高程序并行性的方法。相似文献

17.

A study of shared-memory parallelism in a multifrontal solver

《Parallel Computing》2014,40(3-4):34-46

We introduce shared-memory parallelism in a parallel distributed-memory solver, targeting multi-core architectures. Our concern in this paper is pure shared-memory parallelism, although the work will also impact distributed-memory parallelism. Our approach avoids a deep redesign and fully benefits from the numerical kernels and features of the original code. We use performance models to exploit coarse-grain parallelism in an OpenMP environment while, at the same time, also relying on third-party optimized multithreaded libraries. In this context, we propose simple approaches to take advantage of NUMA architectures, and original optimizations to limit thread synchronization costs. The performance gains are analyzed in detail on test problems from various application areas. Although the studied code is a direct solver for sparse systems of linear equations, the contributions of this paper are more general and could be useful in a wider range of situations. 相似文献

18.

Distributed shared memory infrastructure for virtual enterprise in building and construction

Fadi Sandakly João Garcia Paulo Ferreira Patrice Poyet 《Journal of Intelligent Manufacturing》2001,12(2):199-212

This paper proposes a new approach to building a virtual enterprise (VE) software infrastructure that offers persistence, concurrent access, coherence and security on a distributed datastore based on the distributed shared-memory paradigm. The platform presented, persistent distributed store (PerDiS), is demonstrated with test applications that show its suitability and adequate performance for the building and construction domain. In particular, the successful adaptation of standard data access interface (SDAI) to PerDiS and a comparison with CORBA are presented as examples of the potential of the distributed shared-memory paradigm for the VE environment. 相似文献

19.

Data race: tame the beast

K. Leung Z. Huang Q. Huang P. Werstein 《The Journal of supercomputing》2010,51(3):258-278

Data races hamper parallel programming and threaten the reliability of future software. This paper proposes the data race prevention scheme View-Oriented Data race Prevention (VODAP), which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses views to bundle mutual exclusion with data access. We have implemented the data race prevention scheme with a memory protection mechanism. Experimental results show that the extra overhead of memory protection is trivial in our applications. The performance is evaluated and compared with modern programming models such as OpenMP and Cilk. 相似文献

20.

DaSH: A benchmark suite for hybrid dataflow and shared memory programming models

《Parallel Computing》2015

The current trend in development of parallel programming models is to combine different well established models into a single programming model in order to support efficient implementation of a wide range of real world applications. The dataflow model has particularly managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard.This article presents DaSH – the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics. 相似文献