1.

Exploiting Distributed-Memory and Shared-Memory Parallelism on Clusters of SMPs with Data Parallel Programs

Benkner Siegfried Sipkova Viera 《International journal of parallel programming》2003,31(1):3-19

Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results which show the effectiveness of these techniques.

2.

Parallelization of stochastic bounds for Markov chains on multicore and manycore platforms

Jarosław Bylina 《The Journal of supercomputing》2018,74(4):1497-1509

The author demonstrates a methodology for parallelizing the computation of stochastic bounds for Markov chains on multicore and manycore platforms. The stochastic bounds algorithm is investigated for Markov chains with sparse matrices, which entails a great deal of irregular memory access. Its parallel implementations should scale across multiple threads and offer both high performance and performance portability between multicore and manycore platforms. The presented methods build on two parallelization extensions of the C++ language: OpenMP and Cilk Plus. For these two extensions, we use two programming models, namely loop parallelism and task-based parallelism. The numerical experiments show the execution time of the implementations and their scalability on multicore and manycore platforms. This work provides the parallel implementations and at the same time presents an educational example of how computer science problems with irregular memory access can be implemented for high performance using OpenMP and Cilk Plus.

3.

Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

Nikolopoulos Dimitrios S. Ayguadé Eduard Polychronopoulos Constantine D. 《International journal of parallel programming》2002,30(4):225-255

This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer.

4.

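An Effective Scheduling Algorithm for Cluster OpenMP Systems

吴少刚 章隆兵 蔡飞 胡伟武 《计算机研究与发展》2004,41(7):1298-1305

As a standard for shared-memory parallel programming, OpenMP has become one of the mainstream models for parallel program design thanks to its ease of use and its support for incremental parallelization. The OpenMP standard was designed for UMA shared-memory architectures, so its loop scheduling mechanisms consider only load balance and do not need to take data distribution into account. In cluster OpenMP systems, however, data locality is a key factor affecting performance. To address the unsuitability of the static scheduling strategy in the OpenMP standard for cluster computing, the paper proposes the LBS scheduling algorithm, which fully embodies the owner-computes rule, and implements it through directive extensions on a cluster OpenMP system (OpenMP/JIAJIA). Test results show that the LBS algorithm is very effective for cluster OpenMP systems.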
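The owner-computes rule mentioned in this abstract can be illustrated with a minimal sketch. This is not the LBS algorithm itself, whose details the abstract does not give. It assumes a simple block distribution of an array over cluster nodes, and the names `block_owner` and `owner_computes_schedule` are hypothetical.

```python
def block_owner(index, n_elems, n_nodes):
    """Node that owns element `index` under an assumed block distribution."""
    block = -(-n_elems // n_nodes)  # ceiling division: elements per node
    return index // block


def owner_computes_schedule(n_elems, n_nodes):
    """Assign loop iteration i to the node that owns a[i] (owner-computes rule)."""
    schedule = {node: [] for node in range(n_nodes)}
    for i in range(n_elems):
        schedule[block_owner(i, n_elems, n_nodes)].append(i)
    return schedule


# Ten iterations over four nodes: each node only touches data it owns,
# so no iteration needs to fetch array elements from a remote node.
schedule = owner_computes_schedule(10, 4)
for node, iterations in schedule.items():
    print(node, iterations)
```

Unlike a schedule chosen only for load balance, this mapping follows the data layout, which is the property the abstract identifies as critical on clusters.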

5.

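Parallel Programming Based on OpenTM

黄昕 《计算机与现代化》2009,(6)

OpenTM introduces transactional syntax and semantics on top of OpenMP, providing a directive-based programming interface for transactional memory programming. This paper takes the LU application from the standard parallel benchmark suite NPB as an example and uses the speculative parallel execution capability of transactional memory, together with the OpenTM interface, to parallelize a pipelined algorithm. Experiments show that OpenTM programs are simple to write, avoid the complexity of lock-based approaches, and can play a major role in scientific computing.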

6.

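Parallel Solution and Optimization of Large-Scale Sparse Linear Systems in the GRAPES Dynamic Framework

张琨 贾金芳 严文昕 黄建强 王晓英 《计算机工程》2022,48(1):149-154+162

Solving the Helmholtz equation is the core of the dynamic framework of the GRAPES numerical weather prediction system and can be transformed into the solution of a large-scale sparse linear system. Constrained by hardware resources and data size, however, the efficiency of this solve has become the bottleneck that limits improvements in the system's computational performance. The generalized conjugate residual method for large-scale sparse linear systems is implemented with three parallel approaches, MPI, MPI+OpenMP, and CUDA, and an incomplete LU (ILU) preconditioner is used to improve the condition number of the coefficient matrix and accelerate convergence of the iterative method. In the CPU parallel scheme, MPI handles coarse-grained parallelism and communication between processes, while OpenMP uses shared memory to realize fine-grained parallelism within each process. In the GPU parallel scheme, the CUDA model applies optimizations to data transfer, memory access coalescing, and shared memory use. Experimental results show that reducing the number of iterations through preconditioning markedly improves computational performance. The MPI+OpenMP hybrid parallel optimization improves performance by about 35% over the MPI parallel optimization, and the CUDA parallel optimization improves performance by about 50% over the MPI+OpenMP hybrid optimization, giving the best performance of the three.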

7.

Michael S. Noble 《Concurrency and Computation》2008,20(16):1877-1891

8.

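An OpenMP Extension Method for Embedded Multi-core Platforms

王庆 季振洲 刘涛 《计算机科学与探索》2011,5(1):81-86

Developing an effective programming method for multi-core platforms has become an important goal of parallel software research. OpenMP parallel programs are implemented and run effectively on an embedded multi-core platform. Because embedded systems have limited memory resources, the paper proposes extending OpenMP with a custom directive, tiling, to improve the efficiency of parallel programs on embedded multi-core platforms. The extended OpenMP parallel programs support loop tiling and can thus make full use of the hierarchical...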

9.

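MPI+OpenMP Hybrid Programming Implementation of Multi-Color SSOR-PCG

林绍忠 许合伟 颉志强 《计算机辅助工程》2013,22(6):79-83

Parallelizing the Symmetric Successive Over Relaxation Preconditioned Conjugate Gradient (SSOR-PCG) method is difficult because two triangular systems must be solved in parallel at every iteration. To address this, multi-color ordering is adopted to increase the degree of parallelism, and a parallel program suited to distributed shared-memory computers is developed on the MPI+OpenMP hybrid programming model. Effective MPI communication functions are selected through testing, and three measures for avoiding races on shared data are given, to be chosen according to problem size and available memory.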
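The idea behind multi-color ordering can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a greedy coloring of the matrix graph and uses a 5-point grid stencil as a stand-in test matrix. The function names are hypothetical. Unknowns that share a color have no coupling between them, so within one color the triangular-solve updates can proceed in parallel.

```python
def greedy_coloring(adjacency):
    """Give each node the smallest color not used by an already colored neighbour.

    adjacency: dict mapping node -> set of neighbours (nonzero off-diagonals).
    """
    colors = {}
    for node in sorted(adjacency):
        used = {colors[nb] for nb in adjacency[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


def grid_adjacency(nx, ny):
    """Coupling pattern of a 5-point stencil on an nx-by-ny grid (illustrative)."""
    adj = {}
    for i in range(nx):
        for j in range(ny):
            candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            adj[(i, j)] = {(a, b) for a, b in candidates
                           if 0 <= a < nx and 0 <= b < ny}
    return adj


adjacency = grid_adjacency(4, 4)
colors = greedy_coloring(adjacency)

# Group unknowns by color: each group can be updated concurrently.
groups = {}
for node, color in colors.items():
    groups.setdefault(color, []).append(node)
print({color: len(nodes) for color, nodes in groups.items()})
```

For this stencil the greedy pass reproduces the familiar red-black ordering with two colors. Less regular matrices generally need more colors.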

10.

Performance characteristics of the multi-zone NAS parallel benchmarks

《Journal of Parallel and Distributed Computing》2006,66(5):674-685

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multi-zone, is derived from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on four different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes.

11.

OpenMP extensions for master–slave message passing computing

P.E. Hadjidoukas T.S. Papatheodorou 《Parallel Computing》2005,31(10-12):1155

This paper presents a directive-based programming environment for master–slave message passing applications that enables the efficient execution of the same code on both shared and distributed memory multiprocessors. The environment exports an extension of the OpenMP workqueuing model, supports multiple levels of task parallelism and more than one master and provides transparent load balancing with a combination of static and dynamic scheduling of tasks. In addition, it operates exclusively through the available hardware on shared-memory machines and exploits MPI for explicit communication on clusters. Experimental results on a Linux-cluster demonstrate the successful combination of ease of programming with the performance of message passing.

12.

Data race: tame the beast

K. Leung Z. Huang Q. Huang P. Werstein 《The Journal of supercomputing》2010,51(3):258-278

Data races hamper parallel programming and threaten the reliability of future software. This paper proposes the data race prevention scheme View-Oriented Data race Prevention (VODAP), which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses views to bundle mutual exclusion with data access. We have implemented the data race prevention scheme with a memory protection mechanism. Experimental results show that the extra overhead of memory protection is trivial in our applications. The performance is evaluated and compared with modern programming models such as OpenMP and Cilk.

13.

Comparing programming models for medical imaging on multi‐core systems

Philipp Kegel Maraike Schellmann Sergei Gorlatch 《Concurrency and Computation》2011,23(10):1051-1065

Multi‐core processors offer a huge potential of parallelism but pose a challenge of program development for achieving high performance in real applications. We compare three popular parallel programming models—POSIX threads (Pthreads), OpenMP, and Threading Building Blocks (TBB)—regarding their use for multi‐core systems. We analyze how these models can be employed for implementing various parallelizations of a real‐world application from the area of medical imaging, and we conduct extensive runtime experiments to measure performance. Our main contribution is a comprehensive comparison of Pthreads, OpenMP, and TBB with respect to the following criteria: program development effort, programming style, level of abstraction, and runtime performance on multi‐cores. Copyright © 2010 John Wiley & Sons, Ltd.

14.

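Multi-level Parallel Programming Model and Parallel Optimization Techniques for SMP Clusters* (total citations: 4; self-citations: 0; citations by others: 4)

单莹 吴建平 王正华 《计算机应用研究》2006,23(10):254-256

The paper describes in detail the hybrid parallel programming model MPI/OpenMP, which suits the multi-level parallel architecture of SMP clusters and provides mechanisms for multi-level parallelism both between and within SMP nodes. On this basis, combined with practical performance evaluation methods, it introduces commonly used and effective parallel optimization techniques at three levels, MPI, OpenMP, and the single processor, and points out that single-processor performance optimization is an issue that cannot be neglected when improving the performance of parallel programs.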

15.

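Research and Application of a Multi-level Parallel Volume Rendering Algorithm (total citations: 1; self-citations: 0; citations by others: 1)

洪振刚 罗省贤 《计算机工程与科学》2009,31(Z1)

Volume rendering of three-dimensional data fields is an important research direction in scientific visualization. Building on a study and summary of the development history and key techniques of volume rendering, this paper focuses on the ray casting algorithm. Targeting clusters of multi-core processors, it proposes and implements a parallel ray casting volume rendering algorithm based on a multi-level parallel programming model, and successfully applies the algorithm to a three-dimensional shallow urban geological model, achieving good visualization results. Performance comparison experiments at different problem sizes were carried out for the ray casting algorithm under an MPI environment and under a multi-level MPI+OpenMP environment. The experiments and analysis show that the multi-level parallel ray casting algorithm speeds up volume rendering and that the MPI+OpenMP multi-level parallel model outperforms the pure MPI programming model.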

16.

VCluster: a thread‐based Java middleware for SMP and heterogeneous clusters with thread migration support

Hua Zhang Joohan Lee Ratan Guha 《Software》2008,38(10):1049-1071

Clusters, composed of symmetric multiprocessor (SMP) machines and heterogeneous machines, have become increasingly popular for high‐performance computing. Message‐passing libraries, such as message‐passing interface (MPI) and parallel virtual machine (PVM), are de facto parallel programming libraries for clusters that usually consist of homogeneous and uni‐processor machines. For SMP machines, MPI is combined with multithreading libraries like POSIX Thread and OpenMP to take advantage of the architecture. In addition to existing parallel programming libraries that are in C/C++ and FORTRAN programming languages, the Java programming language presents itself as another alternative with its object‐oriented framework, platform neutral byte code, and ever‐increasing performance. This paper presents a new parallel programming model and a library, VCluster, which implements this model. VCluster is based on migrating virtual threads instead of processes to support clusters of SMP machines more efficiently. The implementation uses thread migration, which can be used in dynamic load balancing. VCluster was developed in pure Java, utilizing the portability of Java to support clusters of heterogeneous machines. Several applications are developed to illustrate the use of this library and compare the usability and performance of VCluster with other approaches. Copyright © 2007 John Wiley & Sons, Ltd.

17.

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

《Journal of Parallel and Distributed Computing》2014,74(12):3202-3216

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.

18.

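Multi-level Parallel Model Support and Performance Optimization on Heterogeneous Multi-core Processors

李士刚 胡长军 王珏 李建江 《软件学报》2013,24(12):2782-2796

Low power consumption and low cost have given heterogeneous multi-core processors a significant share of supercomputer computing resources. However, heterogeneous multi-core processors feature high bandwidth and loosely coupled coherence, so achieving the desired memory and compute performance requires more attention to low-level hardware details. This work implements CellMLP, a multi-level parallel model for the Cell BE processor, a typical heterogeneous multi-core processor. By extending the C language with compiler directives, it supports data-parallel, task-parallel, and pipeline-parallel programming models and improves the productivity of parallel programming. For runtime optimization, data parallelism uses techniques such as parallel data transfer across SPEs and double buffering to raise data transfer bandwidth. Task parallelism uses a new hybrid task queue to support asynchronous task stealing, which reduces contention among SPE threads and improves the scalability of task parallelism. Pipeline parallelism is the first to use a blocking signal transfer mechanism to implement low-overhead synchronization between SPE threads. Experiments on applications including Stream, NAS Benchmark, and BOTS show that CellMLP efficiently supports a variety of typical parallel applications. A performance comparison with SARC and CellSs, comparable current programming models, shows that CellMLP has clear advantages in actual data transfer bandwidth and in support for irregular applications.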

19.

Improving Performance of Dynamic Programming via Parallelism and Locality on Multicore Architectures

Guangming Tan Ninghui Sun Gao G.R. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(2):261-274

Dynamic programming (DP) is a popular technique which is used to solve combinatorial search and optimization problems. This paper focuses on one type of DP, which is called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. On an algorithmic level, both parallelism and locality do benefit from a specific data dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm by decomposing computation operators and percolating data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but portable performance as well, across the 8-core Sun Niagara and quad-core Intel Clovertown.
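To make the nonuniform dependencies of NPDP concrete, here is a minimal sketch using matrix-chain multiplication, a classic problem of this type. It is chosen here for illustration and is not taken from the paper, and it does not reproduce the paper's percolation or pipelining algorithm. The function name `matrix_chain_cost` is hypothetical.

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications to multiply a chain of matrices.

    Matrix t has shape dims[t] x dims[t + 1]. Cell m[i][j] depends on
    m[i][k] and m[k + 1][j] for every split point k, so the number and
    position of dependencies vary from cell to cell.
    """
    n = len(dims) - 1
    m = [[0] * n for _ in range(n)]
    for span in range(1, n):
        # All cells on the same diagonal (same span) are independent of one
        # another, so this inner loop is the natural unit of parallel work.
        for i in range(n - span):
            j = i + span
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return m[0][n - 1]


# Three matrices: 10x30, 30x5, 5x60. Multiplying the first two first is cheapest.
print(matrix_chain_cost([10, 30, 5, 60]))  # → 4500
```

Each cell reads a full row segment and a full column segment of the table, which is why both parallelism and locality are hard to exploit without restructuring the computation.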

20.

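COUPL+: A Parallel Library for Solving PDEs

陈江 赵永华 迟学斌 《计算机工程》2005,31(22):58-60,94

COUPL+ is a parallel library based on the message-passing model. It encapsulates within its functions the data partitioning and the calls to message-passing routines that a parallel program has to handle. COUPL+ simplifies the task of writing mesh-based applications on distributed-memory parallel machines. This paper briefly introduces the basic principles of COUPL+ and compares its features with those of MPI, OpenMP, and HPF. It also uses COUPL+ to implement two tasks common in parallel computing, the conjugate gradient method and structured-mesh computation, and compares the performance against MPI and HPF.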