Similar literature
1.
This article focuses on the effect of both process topology and load balancing on various programming models for SMP clusters and iterative algorithms. More specifically, we consider nested loop algorithms with constant flow dependencies that can be parallelized on SMP clusters with the aid of the tiling transformation. We investigate three parallel programming models, namely a popular monolithic message-passing implementation and two hybrid ones that employ both message passing and multi-threading. We conclude that the selection of an appropriate mapping topology for the mesh of processes has a significant effect on overall performance, and we provide an algorithm for specifying such an efficient topology according to the iteration space and data dependencies of the algorithm. We also propose static load balancing techniques for distributing computation between threads, which reduce the disadvantage of the master thread handling all inter-process communication, a limitation often imposed by the message passing library. Both improvements are implemented as compile-time optimizations and are evaluated experimentally. An overall comparison of these parallel programming styles on SMP clusters, based on micro-kernel experiments, is also provided.

2.
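Efficient Algorithms for a Parallel Matrix Eigenproblem Solver on SMP Cluster Systems   Total citations: 2, self-citations: 0, citations by others: 2
Tridiagonalization of symmetric matrices and the computation of eigenvalues of symmetric tridiagonal matrices are the key steps of a parallel eigensolver for dense symmetric matrices. Targeting the multi-level architecture of SMP cluster systems, hybrid MPI+OpenMP parallel algorithms are given for Householder-based tridiagonalization and for the divide-and-conquer algorithm for the tridiagonal eigenvalue problem. The study concentrates on load balance, communication overhead, and performance evaluation in the SMP cluster environment. The design of the hybrid parallel algorithms combines a coarse-grained thread-parallel mode with a task-sharing dynamic invocation method, which improves the load balance of the MPI algorithms and reduces communication overhead. Experiments on the 深腾6800 (DeepComp 6800) show that the solver based on the hybrid parallel algorithms has better performance and scalability than the pure MPI version.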

3.
Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy that closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results that show the effectiveness of these techniques.

4.
The problem of solving an infinite system of linear equations finitely expressed is addressed. Modifications of the Gauss–Seidel method are presented that are especially suitable for implementation on SMP machines with a small number of processors. One of the proposed parallel algorithms, which concentrates the computational effort where it is most needed, turns out to be more efficient than the sequential algorithm, even in terms of the total number of operations.
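
For reference, the classical sequential Gauss–Seidel iteration that this entry takes as its starting point can be sketched as follows. This is only the textbook method applied to a small finite system; the modifications for infinite systems and SMP machines described in the abstract are not reproduced here. The function name `gauss_seidel`, the fixed iteration count, and the sample matrix are illustrative assumptions.

```python
def gauss_seidel(a, b, iterations=50):
    """Textbook Gauss-Seidel for a finite system a x = b.

    Assumes a is square with nonzero diagonal entries and that the
    iteration converges (for example, a is strictly diagonally dominant).
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            # x[j] for j < i already holds this sweep's updated value;
            # x[j] for j > i still holds the previous sweep's value.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / a[i][i]
    return x


# Hypothetical 2x2 example: 4x + y = 1 and 2x + 3y = 2.
a = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
x = gauss_seidel(a, b)
```

Because each update reads values written earlier in the same sweep, the plain method is inherently sequential, which is why a parallel variant needs to restructure the sweep.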

5.
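Based on a semiclassical molecular dynamics model, a two-level parallel simulation system for laser chemical reactions is implemented on an SMP cluster. A coarse-grained atom decomposition algorithm is combined with fine-grained parallel matrix multiplication to parallelize the force computation in laser chemical reaction simulation, and the effect of granularity partitioning on the parallel efficiency of semiclassical molecular dynamics simulation is analyzed. Tests on an SMP cluster show that, when 128 processors are used to simulate a molecular system of 500 C atoms, parallel efficiency reaches 70%. With the number of CPUs fixed, fine-grained parallelism within SMP nodes has a considerable effect on the parallel efficiency of semiclassical molecular dynamics simulation. The system can simulate laser chemical reactions of large molecular systems, maintains efficient use of computing resources while improving speedup, and meets the needs of laser chemical reaction simulation.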

6.
Hua Zhang  Joohan Lee  Ratan Guha 《Software》2008,38(10):1049-1071
Clusters composed of symmetric multiprocessor (SMP) machines and heterogeneous machines have become increasingly popular for high-performance computing. Message-passing libraries, such as the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM), are de facto parallel programming libraries for clusters that usually consist of homogeneous, uni-processor machines. For SMP machines, MPI is combined with multithreading libraries like POSIX Threads and OpenMP to take advantage of the architecture. In addition to existing parallel programming libraries for the C/C++ and FORTRAN programming languages, the Java programming language presents itself as another alternative with its object-oriented framework, platform-neutral byte code, and ever-increasing performance. This paper presents a new parallel programming model and a library, VCluster, which implements this model. VCluster is based on migrating virtual threads instead of processes to support clusters of SMP machines more efficiently. The implementation uses thread migration, which can be used in dynamic load balancing. VCluster was developed in pure Java, utilizing the portability of Java to support clusters of heterogeneous machines. Several applications are developed to illustrate the use of this library and to compare the usability and performance of VCluster with other approaches. Copyright © 2007 John Wiley & Sons, Ltd.

7.
A unified distance transform algorithm and architecture   Total citations: 1, self-citations: 0, citations by others: 1
Standard distance transform algorithms produce approximate results and are unsuitable for real-time implementation since they require massive parallelism. A new unified algorithm is presented that computes distance and related nearest feature transforms concurrently for arbitrary bit maps, based on any distance function from a broad class. The algorithm has an efficient implementation on serial processors, and a unified transform architecture is proposed for feasible real-time performance based on parallel row scanning followed by parallel column scanning. Its importance is that it supports real-time performance and a broader set of machine vision applications than the standard approach.

8.
Particle-in-cell simulations often suffer from load imbalance on parallel machines due to the competing requirements of the field-solve and particle-push computations. We propose a new algorithm that balances the two computations independently. The grid for the field-solve computation is statically partitioned. The particles within a processor's sub-domain(s) are dynamically balanced by migrating spatially compact groups of particles from heavily loaded processors to lightly loaded ones as needed. The algorithm has been implemented in the quicksilver electromagnetic particle-in-cell code. We provide details of the implementation and present performance results for quicksilver running models with up to a billion grid cells and particles on thousands of processors of a large distributed-memory parallel machine.

9.
Many computationally intensive problems from science and engineering are irregular in nature. This makes it difficult to develop an efficient parallel implementation, even for shared-memory machines. As a typical example, we investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the question of what programming and system support is needed to yield an efficient implementation for a large number of processors. As an execution platform we use the SB-PRAM, a shared memory machine with up to 2048 processors. The processors of the SB-PRAM can access the global memory in unit time, which is the basis for an exact performance prediction. Common approaches for parallel implementations, like lock protection for concurrent accesses and sequential or distributed task queues, are replaced by more efficient access mechanisms and data structures that can be realized by the powerful multiprefix operations of the SB-PRAM. Their use simplifies the implementation and yields large speedup values.

10.
FOR-loops are the main source of parallelism in programs. A nonlinear transformation algorithm for parallelizing the execution of FOR-loop models is proposed. It is shown that, through the mapping of a nonlinear transformation, iterations of FOR-loops can be executed in parallel. The algorithm is useful in exploiting the parallelism of FOR-loops with one or more partitions on the innermost loop. Algorithms to partition and map nested FOR-loops onto fixed-size systolic arrays are discussed. Based on the time and space mapping schemes, all the iterations of FOR-loops can be correctly executed on the array processors in parallel.

11.
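A Multi-Granularity Hybrid Parallel Programming Model for 3D Meshes on SMP Clusters   Total citations: 2, self-citations: 0, citations by others: 2
To improve the execution efficiency of large-scale parallel algorithms on 3D meshes, and in view of the two-level distributed/shared memory hierarchy of SMP clusters, different ways of implementing hybrid programming on SMP clusters are introduced. A multi-granularity hybrid parallel programming model is proposed for the parallel solution of the shortest-path problem on 3D mesh models, an MPI+OpenMP hybrid parallel algorithm implementing it is given, and its performance is compared on an SMP cluster with that of a coarse-grained MPI (Message Passing Interface) parallel algorithm. The results show that the multi-granularity hybrid parallel programming model achieves better speedup and running efficiency.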

12.
A control parallel and a novel data parallel implementation of the Sutherland-Hodgman polygon clipping algorithm are presented. The two implementations are based on the INMOS transputer and the AMT Distributed Array Processor respectively; both of these machines are general purpose parallel processors. Performance figures are reported and implications for further work are discussed.
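
As background for this entry, the sequential Sutherland-Hodgman algorithm can be sketched as below. This is the standard textbook formulation, not the control parallel or data parallel implementation the abstract reports; the function name `clip_polygon` and the assumption that the clip polygon is convex with counter-clockwise vertices are choices made for this illustration.

```python
def clip_polygon(subject, clip):
    """Clip polygon `subject` against a convex polygon `clip`.

    Both are lists of (x, y) tuples; `clip` is assumed to be convex
    with vertices in counter-clockwise order.
    """
    def inside(p, a, b):
        # True if p lies on or to the left of the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Point where segment p-q crosses the infinite line through a and b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        denom = dx1 * dy2 - dy1 * dx2
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        current, output = output, []
        for k in range(len(current)):
            p, q = current[k - 1], current[k]  # edge from previous vertex to this one
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))  # entering the clip region
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))      # leaving the clip region
        if not output:
            break  # nothing left to clip
    return output


# Hypothetical example: two overlapping axis-aligned squares.
subject = [(0, 0), (2, 0), (2, 2), (0, 2)]
window = [(1, 1), (3, 1), (3, 3), (1, 3)]
clipped = clip_polygon(subject, window)
```

The outer loop processes one clip edge at a time and feeds its output to the next edge, a pipeline structure that lends itself to being spread across processors.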

13.
The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient code is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives: distance to a line (a point is on the line if it is sufficiently close to it) and intersection with a line (a point is on the line if it is an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length and orientation.
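
The "distance to a line" alternative can be illustrated with a small sketch in which every pixel makes the same independent test, as one processor per pixel would on a SIMD machine. The function name `line_pixels`, the 0.5-pixel threshold, and the bounding-box restriction to the segment are assumptions made for this illustration; the abstract only says that a point is on the line if it is sufficiently close to it. The nested loops here stand in for work that a SIMD machine would do in a single step.

```python
import math


def line_pixels(width, height, p0, p1, threshold=0.5):
    """Return the set of pixels within `threshold` of the segment p0-p1.

    Assumes p0 != p1. The threshold and the bounding-box restriction
    are illustrative choices, not taken from the paper.
    """
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    lit = set()
    for y in range(height):
        for x in range(width):
            # Perpendicular distance from pixel (x, y) to the infinite line.
            dist = abs(dx * (y - y0) - dy * (x - x0)) / length
            # Keep only pixels inside the segment's bounding box.
            in_box = (min(x0, x1) <= x <= max(x0, x1)
                      and min(y0, y1) <= y <= max(y0, y1))
            if in_box and dist <= threshold:
                lit.add((x, y))
    return lit


# Hypothetical example: a diagonal on a 4x4 grid.
diagonal = line_pixels(4, 4, (0, 0), (3, 3))
```

Each pixel's test uses only the line endpoints and its own coordinates, so no pixel depends on another's result.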

14.
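With applied research on systolic arrays as background, the classical parallel algorithm for the discrete Fourier transform (DFT) and its characteristics on array machines are analyzed. To address the weaknesses of that algorithm's parallel implementation on processors, a cube-vector method (方体向量法) suited to large-scale DFT in a parallel processing environment is proposed. The method requires no data transposition between processors and reduces both inter-processor communication and the dependencies among operand data, so the transform can proceed asynchronously to a large extent and is freed from constraints on operand size. A concrete example of a three-dimensional DFT implemented with the cube-vector method on a systolic array is also given.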

15.
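Application of Hybrid Parallel Techniques to Laser Chemical Reaction Simulation   Total citations: 2, self-citations: 0, citations by others: 2
To improve the efficiency of laser chemical reaction simulation, hybrid parallel techniques and a two-level parallel approach are introduced into semiclassical molecular dynamics simulation. A two-level parallel simulation algorithm for laser chemical reactions is designed and implemented on the MPI+OpenMP hybrid model: the upper level uses MPI for atom-decomposition parallelism across nodes, and the lower level uses OpenMP for multithreaded parallel matrix multiplication within a node. Tests on an SMP cluster show that the parallel efficiency of simulating laser chemical reactions of large molecular systems exceeds 60%. Applying hybrid parallel techniques can therefore effectively improve the efficiency of laser chemical reaction simulation.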

16.
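A Parallel Compilation Framework for Heterogeneous Many-Core Processors   Total citations: 1, self-citations: 0, citations by others: 1
Heterogeneous many-core processors are an important trend in processor development for high-performance computing, but their more complex architecture makes programming even harder. To address this problem, a parallel compilation framework for heterogeneous many-core processors is proposed, built on the open-source compiler Open64, that automatically converts programs into heterogeneous parallel programs. The framework comprises four main modules. The task partitioning module identifies program regions suitable for accelerated computation and implements a multi-dimensional parallelism recognition method for nested loops. The data layout module handles the placement of data between main memory and SPM and implements array bounds analysis and pointer range analysis. The transfer optimization module implements several data transfer optimizations, including transfer merging, transfer hoisting, packed transfer, and array transposition. The benefit evaluation module builds a cost model and implements a combined static and dynamic benefit evaluation method on top of it. The framework has been implemented for the SW26010 processor, and test results show that it can apply parallel transformations targeting the heterogeneous many-core architecture to some programs and obtain good speedups.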

17.
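Research on a Hybrid Parallel Programming Model Based on SMP Clusters   Total citations: 9, self-citations: 3, citations by others: 6
A hybrid MPI+OpenMP parallel programming model suited to SMP clusters is proposed. The model closely matches the architecture of SMP clusters and combines the advantages of the message-passing and shared-memory programming models, so it can achieve good performance. The implementation mechanism of the hybrid model and the characteristics of the MPI message-passing model are discussed. Experimental results show that, under certain conditions, the hybrid parallel programming model is the best choice for SMP clusters.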

18.
Data-parallel, volume-rendering algorithms   Total citations: 1, self-citations: 0, citations by others: 1
In this presentation, we consider the image-composition scheme for parallel volume rendering in which each processor is assigned a portion of the volume. A processor renders its data by using any existing volume-rendering algorithm. We describe one such parallel algorithm that also takes advantage of vector-processing capabilities. The resulting images from all processors are then combined (composited) in visibility order to form the final image. The major advantage of this approach is that, as viewing and shading parameters change, only 2D partial images, and not 3D volume data, are communicated among processors. Through experimental results and performance analysis, we show that our parallel algorithm is amenable to extremely efficient implementations on distributed memory, multiple instruction-multiple data (MIMD), vector-processor architectures. This algorithm is also very suitable for hardware implementation based on image composition architectures. It supports various volume-rendering algorithms, and it can be extended to provide load-balanced execution.
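
The compositing step described in this entry, combining per-processor partial images in visibility order, can be sketched with the standard front-to-back "over" operator. This is a generic illustration, not the paper's implementation: the function name `composite_front_to_back`, the flat list-of-pixels image layout, and the use of single-channel premultiplied color are all assumptions.

```python
def composite_front_to_back(partials):
    """Composite partial images given in visibility order (nearest first).

    Each partial image is a list of (color, alpha) pixels, where color is
    a single premultiplied channel. All images must have the same length.
    """
    n = len(partials[0])
    color = [0.0] * n
    alpha = [0.0] * n
    for image in partials:
        for i, (c, a) in enumerate(image):
            # Only the fraction not yet covered by nearer images shows through.
            remaining = 1.0 - alpha[i]
            color[i] += remaining * c
            alpha[i] += remaining * a
    return list(zip(color, alpha))


# Hypothetical example: two one-row partial images, two pixels each.
front = [(0.5, 0.5), (0.2, 1.0)]
back = [(1.0, 1.0), (1.0, 1.0)]
final = composite_front_to_back([front, back])
```

Only these 2D pixel lists need to move between processors, which matches the abstract's point that 3D volume data is never communicated.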

19.
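Research on Hybrid Parallel Computing for the AREM Model in a Multi-Core Environment   Total citations: 1, self-citations: 1, citations by others: 0
Multi-core processors have become the mainstream way to build high-performance computer systems. Taking into account the architecture of multi-core high-performance computer systems, which combine shared-memory and distributed-memory structures, hybrid MPI/OpenMP parallel computing is studied and implemented for the AREM model. Performance tests show that hybrid MPI/OpenMP parallel computing can scale the parallel application to larger processor counts and shorten computation time, and that it achieves good parallel results incrementally, at low parallelization cost, and without major changes to the original program structure.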

20.
Streamline computation in a very large vector field data set represents a significant challenge due to the nonlocal and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large, multicore platform. With multicore processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelize over seeds and parallelize over blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, nonhybrid distributed implementation.
