1.
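Parallel Program Design Based on a Distributed/Shared Memory Hierarchy  Total citations: 1, self-citations: 0, cited by others: 1
Distributed-memory and shared-memory architectures each have distinct characteristics and are strongly complementary; a distributed/shared memory hierarchy combines the two to exploit the strengths of both. This paper discusses parallel program design for the distributed/shared memory hierarchy and introduces the hybrid MPI and OpenMP parallel programming model.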
2.
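With the advent of multi-core computers, parallel computing has entered a new stage, and how to bring parallel techniques into spatial data processing systems has become an active research topic. This paper presents a design for a parallel spatial data processing system based on a distributed/shared memory architecture, intended to address the slow processing and data backlog caused by growing spatial data volumes and sharply higher downlink rates.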
3.
When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they have to choose between MPI and the variety of OpenMP programming styles. To help the programmer in their decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd.
4.
Y. Charlie Hu Honghui Lu Alan L. Cox Willy Zwaenepoel 《Journal of Parallel and Distributed Computing》2000,60(12):160
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions.
5.
《International Journal of Computer Mathematics》2012,89(9):1212-1238
In this paper, we present a highly efficient approach for numerically solving the Black–Scholes equation in order to price European and American basket options. To this end, hardware features of contemporary high performance computer architectures such as non-uniform memory access and hardware-threading are exploited by a hybrid parallelization using MPI and OpenMP, which drastically reduces the computing time. In this way, we achieve very good speed-ups and are able to price baskets with up to six underlyings. Our approach is based on a sparse grid discretization with finite elements and makes use of a sophisticated adaption. The resulting linear system is solved by a conjugate gradient method that uses a parallel operator for applying the system matrix implicitly. Since we exploit all levels of the operator's parallelism, we are able to benefit from the compute power of more than 100 cores. Several numerical examples as well as an analysis of the performance for different computer architectures are provided.
6.
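OpenMP Directive Extensions for Irregular Applications  Total citations: 1, self-citations: 0, cited by others: 1
Sparse matrix operations are at the core of many irregular applications. Their defining feature is that a reference to an element of one array depends on element values in two other arrays, which gives them an irregular memory access pattern. Targeting this property, this paper proposes a new OpenMP directive clause, indirect, and implements it on the cluster OpenMP system OpenMP/JIAJIA. Tests with a real OpenMP application, Equake, show that the directive extension is effective: the function code that uses the clause directly runs 18% faster, and the application as a whole runs 15% faster.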
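The irregular access pattern described in this abstract can be illustrated with a sparse matrix-vector product. The sketch below assumes compressed sparse row (CSR) storage, which the abstract does not name; the function and variable names are hypothetical, and the proposed `indirect` clause itself is not shown because its syntax is not given here.

```python
# Illustrative only: CSR storage is an assumption, not taken from the abstract.

def csr_matvec(row_ptr, col_idx, values, x):
    """Return y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # row_ptr decides which storage slots k belong to row i ...
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # ... and col_idx[k] decides which element of x is read,
            # so the access to x depends on two other arrays.
            y[i] += values[k] * x[col_idx[k]]
    return y


# 3x3 example matrix: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csr_matvec(row_ptr, col_idx, values, x)
print(y)
```

Because the indices into `x` are only known at run time, a compiler or a software DSM layer cannot tell in advance which parts of `x` each thread will touch, which is the kind of information a directive extension could supply.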
7.
8.
This paper presents the comparison of the COMOPS benchmark performance in MPI and shared memory on four different shared memory platforms: the DEC AlphaServer 8400/300, the SGI Power Challenge, the SGI Origin2000, and the HP-Convex Exemplar SPP1600. The paper also qualitatively analyzes the obtained performance data based on an understanding of the corresponding architecture and the MPI implementations. Some conclusions are made for the inter-processor communication performance on these four shared memory platforms.
9.
10.
Min Seung-Jai Basumallik Ayon Eigenmann Rudolf 《International journal of parallel programming》2003,31(3):225-249
This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed systems. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in very few applications only. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, result in a 31% average improvement on our benchmarks.
11.
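Directive Extensions Suited to Cluster OpenMP Systems  Total citations: 1, self-citations: 0, cited by others: 1
With its ease of use and support for incremental parallelization, OpenMP has become the programming standard for shared-memory architectures. A cluster OpenMP system provides an OpenMP computing environment on a cluster, combining OpenMP's ease of programming with the scalability of clusters. OpenMP programs are written mainly in two styles, loop-level and SPMD; loop-level is easy to program and SPMD is hard. On a cluster OpenMP system, however, high-performance OpenMP programs must use the SPMD style. This paper describes a simple subset of OpenMP directive extensions suited to cluster OpenMP systems (including data distribution directives and loop scheduling modes) and implements it on the cluster OpenMP system OpenMP/JIAJIA. Application tests show that programming with these extensions retains the ease of the loop-level style while achieving performance comparable to the SPMD style, making it an effective way to program.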
12.
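Research and Analysis of Parallel Programming Models Based on SMP Cluster Systems  Total citations: 4, self-citations: 1, cited by others: 4
Parallel computing is one of the major directions in the development of computer technology, and SMP systems and clusters are the mainstream parallel architectures today. Current parallel programming relies mainly on MPI, based on the message-passing model, and OpenMP, based on the shared-memory model; each has its own characteristics and range of application. This paper analyzes the characteristics of SMP clusters, MPI, and OpenMP, and describes a feasible approach to hybrid MPI and OpenMP programming on SMP cluster systems.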
13.
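This paper describes the structure of a high-performance parallel computing cluster whose compute nodes are shared-memory multiprocessors (SMP) built from dual-core processors. It studies the parallel computing granularity and optimization methods for such systems and describes how a hybrid MPI+OpenMP programming platform was built on the cluster. On this platform, the Mann iteration algorithm for solving systems of linear equations was implemented; numerical tests show that clusters of this kind deliver good computing performance. The system is already used in practical work, with good results.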
14.
D. J. Mavriplis Raja Das Joel Saltz R. E. Vermeland 《The Journal of supercomputing》1995,8(4):329-344
An efficient three-dimensional unstructured Euler solver is parallelized on a CRAY Y-MP C90 shared-memory computer and on an Intel Touchstone Delta distributed-memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between the two differing architectures are made. This work was sponsored in part by ARPA (NAG-1-1485) and by NASA Contract No. NAS1-19480 while authors Mavriplis, Saltz and Das were in residence at ICASE, NASA Langley Research Center, Hampton, Virginia. This research was performed in part using the Intel Touchstone Delta System operated by Caltech on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by NASA Langley Research Center and the Center for Research in Parallel Processing. The content of the information does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
15.
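Image processing has advanced greatly in recent years, but the computational load has grown with it. MPI-based parallel processing, one of the mainstream ways to speed up image processing, is receiving increasing attention. This paper first introduces the basics of MPI and then uses image sharpening with a gradient algorithm, an important image processing technique, as an example to show the concrete process of processing an image in parallel with MPI. It describes in detail the two core steps: building the parallel algorithm model and writing the program code. Finally, an analysis of the experimental data illustrates the important role of parallel computing in image processing.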
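A parallel algorithm model for gradient-based sharpening can be sketched as a row-strip decomposition. The sketch below is a serial simulation, not MPI code: each strip stands in for the data one MPI rank would receive. The gradient operator (forward differences, |dx| + |dy|), the sharpening rule (pixel plus gradient), and all function names are assumptions, since the abstract specifies none of them.

```python
# Serial simulation of a row-strip decomposition; every name is hypothetical.

def gradient_rows(img, r0, r1):
    """Gradient magnitude |dx| + |dy| for rows r0..r1-1, forward differences."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(r0, r1):
        row = []
        for c in range(w):
            dx = img[r][min(c + 1, w - 1)] - img[r][c]  # clamp at right edge
            dy = img[min(r + 1, h - 1)][c] - img[r][c]  # clamp at bottom edge
            row.append(abs(dx) + abs(dy))
        out.append(row)
    return out


def sharpen_serial(img):
    """Reference result: add the gradient magnitude to every pixel."""
    grad = gradient_rows(img, 0, len(img))
    return [[p + g for p, g in zip(prow, grow)] for prow, grow in zip(img, grad)]


def sharpen_by_strips(img, workers):
    """Process the image strip by strip, as separate ranks would."""
    h = len(img)
    result = []
    for k in range(workers):
        r0, r1 = k * h // workers, (k + 1) * h // workers
        # A strip needs one extra "halo" row below it for the dy term.
        strip = img[r0:min(r1 + 1, h)]
        grad = gradient_rows(strip, 0, r1 - r0)
        for local_r, grow in enumerate(grad):
            result.append([p + g for p, g in zip(strip[local_r], grow)])
    return result


img = [[1, 2, 4], [1, 3, 9], [2, 2, 2], [0, 5, 1]]
print(sharpen_by_strips(img, 2) == sharpen_serial(img))
```

In an actual MPI program the strips and their halo rows would be distributed and the results gathered with communication calls; the point here is only that each strip plus one halo row is enough to reproduce the serial result.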
16.
Minh Thanh Chung Josef Weidendorfer Karl Fürlinger Dieter Kranzlmüller 《Concurrency and Computation》2023,35(24):e7828
Load balancing is often a challenge in task-parallel applications. The balancing problems are divided into static and dynamic. “Static” means that we have some prior knowledge about load information and perform balancing before execution, while “dynamic” must rely on partial information of the execution status to balance the load at runtime. Conventionally, work stealing is a practical approach used in almost all shared memory systems. In distributed memory systems, the communication overhead can make stealing tasks too late. To improve, people have proposed a reactive approach to relax communication in balancing load. The approach leaves one dedicated thread per process to monitor the queue status and offload tasks reactively from a slow to a fast process. However, reactive decisions might be mistaken in high imbalance cases. First, this article proposes a performance model to analyze reactive balancing behaviors and understand the bound leading to incorrect decisions. Second, we introduce a proactive approach to improve further balancing tasks at runtime. The approach exploits task-based programming models with a dedicated thread as well, namely . Nevertheless, the main idea is to force not only to monitor load; it will characterize tasks and train load prediction models by online learning. “Proactive” indicates offloading tasks before each execution phase proactively with an appropriate number of tasks at once to a potential victim (denoted by an underloaded/fast process). The experimental results confirm speedup improvements from to in important use cases compared to the previous solutions. Furthermore, this approach can support co-scheduling tasks across multiple applications.
17.
The purpose of this paper is to investigate the scalability and performance of seven simple OpenMP test programs and to compare their performance with equivalent MPI programs on an SGI Origin 2000. Data distribution directives were used to make sure that the OpenMP implementation had the same data distribution as the MPI implementation. For the matrix‐times‐vector (test 5) and the matrix‐times‐matrix (test 7) tests, the syntax allowed in OpenMP 1.1 does not let OpenMP compilers generate efficient code, since the reduction clause is not currently allowed for arrays. (This problem is corrected in OpenMP 2.0.) For the remaining five tests, the OpenMP version performed and scaled significantly better than the corresponding MPI implementation, except for the right shift test (test 2) for a small message. Copyright © 2001 John Wiley & Sons, Ltd.
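To see why an array reduction matters for a matrix-times-vector test, consider splitting the work by columns: every worker then contributes to every element of the result vector. The sketch below simulates this serially in Python; the column-block partition and all names are assumptions, since the abstract does not show the loop structure of test 5.

```python
# Serial simulation of an array reduction; the partitioning is an assumption.

def matvec_column_blocks(a, x, workers):
    """Compute y = a @ x with columns split across simulated workers."""
    n_rows, n_cols = len(a), len(x)
    partials = []
    for k in range(workers):
        c0, c1 = k * n_cols // workers, (k + 1) * n_cols // workers
        y_private = [0] * n_rows  # each worker's private copy of y
        for j in range(c0, c1):
            for i in range(n_rows):
                y_private[i] += a[i][j] * x[j]
        partials.append(y_private)
    # Element-wise sum of the private copies: this combining step is what
    # a reduction clause over an array performs.
    return [sum(column) for column in zip(*partials)]


a = [[1, 2, 3], [4, 5, 6]]
x = [1, 0, 2]
y = matvec_column_blocks(a, x, 2)
print(y)
```

Without a reduction over the whole array, the private copies have to be combined by hand, for example inside a critical section, which is one plausible source of the inefficiency the abstract reports.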
18.
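The Super-Object model proposes a new method for implementing language-level virtual shared memory on distributed-memory multicomputers in order to support the shared-memory communication model. It introduces a new concept, the super-object, and, unlike other models, builds on it a new way of locating shared data: the global address identifier (name, offset). Combining the Super-Object model with Fortran 77, we implemented a run-time system and library calls that let programmers write parallel programs in Fortran. Finally, the paper describes the system's implementation and the performance achieved.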
19.
《Microprocessors and Microsystems》1995,19(10):609-619
The Reflective Memory System (RMS) represents a modular bus-based architecture that belongs to the class of distributed shared memory (DSM) systems with a hardware-implemented mechanism for maintenance of memory consistency. The RMS is characterized by an update consistency mechanism for shared data in the 'mirrored' memory regions. The goal of the research presented here was to achieve a relatively large global performance and scalability improvement, using a number of relatively small particular enhancements on independent and relatively non-correlated system components (a RISC-style methodology applied to DSM). This work analyses five proposed improvements to the existing RMS. Simulation analysis using a functional simulator based on a convenient and flexible synthetic workload model was carried out, in order to evaluate each proposed idea for a wide variety of workload parameters. Besides simulation results, each presented idea is also considered from the implementation point of view. Finally, the entire fully implemented prototype design is briefly mentioned.
20.
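A three-level CUDA+MPI+OpenMP parallel programming model provides coarse-grained parallelism between nodes, fine-grained parallelism within a node, and the CUDA programming model with the GPU as a parallel computing device. This new three-level hybrid model offers a more efficient parallel strategy for SMP clusters. This paper discusses how to set up the three-level parallel programming environment quickly and how to write multi-granularity hybrid parallel programs, and reports tests completed on a multi-node cluster.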