Found 20 similar documents (search time: 15 ms)
This article focuses on the effect of both process topology and load balancing on various programming models for SMP clusters and iterative algorithms. More specifically, we consider nested loop algorithms with constant flow dependencies that can be parallelized on SMP clusters with the aid of the tiling transformation. We investigate three parallel programming models, namely a popular message passing monolithic parallel implementation, as well as two hybrid ones that employ both message passing and multi-threading. We conclude that the selection of an appropriate mapping topology for the mesh of processes has a significant effect on the overall performance, and provide an algorithm for the specification of such an efficient topology according to the iteration space and data dependencies of the algorithm. We also propose static load balancing techniques for the computation distribution between threads that diminish the disadvantage of the master thread assuming all inter-process communication due to limitations often imposed by the message passing library. Both improvements are implemented as compile-time optimizations and are further experimentally evaluated. An overall comparison of the above parallel programming styles on SMP clusters, based on micro-kernel experimental evaluation, is also provided.
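
To illustrate why the shape of the process mesh matters, the following minimal sketch enumerates the 2D factorizations of a process count and keeps the one with the lowest modeled boundary traffic for a rectangular iteration space. The cost model (tile face lengths scaled by per-dimension weights `wx`, `wy`) and the names `factor_pairs` and `best_mesh` are assumptions made for illustration; this is not the topology algorithm from the article.

```python
def factor_pairs(p):
    """Yield every (px, py) with px * py == p."""
    for px in range(1, p + 1):
        if p % px == 0:
            yield px, p // px


def best_mesh(p, nx, ny, wx=1.0, wy=1.0):
    """Pick the px-by-py process mesh with the lowest modeled communication.

    nx, ny: extent of the iteration space in each dimension.
    wx, wy: hypothetical weights for data crossing tile faces normal to x and y.
    """
    def cost(px, py):
        tile_x, tile_y = nx / px, ny / py
        # A face normal to x has length tile_y; a face normal to y has length tile_x.
        # A dimension with a single process has no neighbor, so it costs nothing.
        cx = wx * tile_y if px > 1 else 0.0
        cy = wy * tile_x if py > 1 else 0.0
        return cx + cy

    return min(factor_pairs(p), key=lambda mesh: cost(*mesh))
```

Under this simplified model a square iteration space favors a square mesh, while a strongly elongated one pushes all processes along its long dimension.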

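Research on the MPI+OpenMP Hybrid Programming Model for SMP Clusters   Cited by: 4 (self-citations: 1, by others: 3)
This paper discusses the characteristics of the MPI+OpenMP hybrid programming model and how it is implemented. A hybrid parallel algorithm for solving the Laplace partial differential equation is developed and compared in performance with a pure MPI algorithm on the HL-2A high-performance computing system. The results show that the hybrid parallel algorithm achieves better scalability and speedup.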

Parallel loop self‐scheduling on parallel and distributed systems has been a critical problem and it is becoming more difficult to deal with in the emerging heterogeneous cluster computing environments. In the past, some self‐scheduling schemes have been proposed as applicable to heterogeneous cluster computing environments. In recent years, multicore computers have been widely included in cluster systems. However, previous research into parallel loop self‐scheduling did not consider certain aspects of multicore computers; for example, it is more appropriate for shared‐memory multiprocessors to adopt Open Multi‐Processing (OpenMP) for parallel programming. In this paper, we propose a performance‐based approach using hybrid OpenMP and MPI parallel programming, which partitions loop iterations according to the performance weighting of multicore nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node. Experimental results show that the proposed approach performs better than previous schemes. Copyright © 2010 John Wiley & Sons, Ltd.
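
The idea of sizing each node's share by its performance weight and core count can be sketched for a single scheduling step as follows. The function name `partition_iterations`, the `(weight, cores)` input format, and the rounding rule are assumptions for illustration, not the scheme evaluated in the paper.

```python
def partition_iterations(total, nodes):
    """Split `total` loop iterations across nodes in proportion to weight * cores.

    nodes: list of (performance_weight, core_count) pairs, one per node.
    Returns a list of iteration counts that sums to `total`.
    """
    capacities = [weight * cores for weight, cores in nodes]
    total_capacity = sum(capacities)
    shares = [int(total * cap / total_capacity) for cap in capacities]
    # Hand out the iterations lost to rounding down, largest capacity first.
    leftover = total - sum(shares)
    order = sorted(range(len(nodes)), key=lambda i: capacities[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares
```

In a hybrid program, each share would go to one MPI process, whose OpenMP threads then divide it among the cores of that node.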

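Research on a Hybrid Parallel Programming Model for SMP Clusters   Cited by: 6 (self-citations: 3, by others: 6)
This paper proposes a hybrid MPI+OpenMP parallel programming model suited to SMP clusters. The model closely matches the architecture of SMP clusters and combines the strengths of the message-passing and shared-memory programming models, so it can achieve good performance. The implementation mechanism of the hybrid model and the characteristics of the MPI message-passing model are discussed. Experimental results show that, under certain conditions, the hybrid parallel programming model is the best choice for SMP clusters.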

The Earth Simulator (ES) is an SMP cluster system. There are two types of parallel programming models available on the ES. One is a flat programming model, in which a parallel program is implemented by MPI interfaces only, both within an SMP node and among nodes. The other is a hybrid programming model, in which a parallel program is written by using thread programming within an SMP node and MPI programming among nodes simultaneously. It is generally known that it is difficult to obtain the same high level of performance using the hybrid programming model as can be achieved with the flat programming model.

In this paper, we have evaluated scalability of the code for direct numerical simulation of the Navier–Stokes equations on the ES. The hybrid programming model achieves the sustained performance of 346.9 Gflop/s, while the flat programming model achieves 296.4 Gflop/s with 16 PNs of the ES for a DNS problem size of 256³. For small scale problems, however, the hybrid programming model is not as efficient because of microtasking overhead. It is shown that there is an advantage for the hybrid programming model on the ES for the larger size problems.

This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especially for SMP clusters. The tool set comprises the semi-automatic instrumentation of user programs, the monitoring of the cache behavior, the visualization of the measured data, and optimization techniques for improving the user program for better cache usage.

As current hardware performance counters do not give sufficient user relevant information, new hardware monitors are designed that provide more detailed information about the cache utilization related to the data structures and code blocks in the user program. The expense of the hardware and software realization will be assessed to minimize the risk of a real implementation of the investigated monitors. The usefulness of the hardware monitors is evaluated by a cache simulator.

Ming Hsiang Huang, Wuu Yang. Software, 2020, 50(10): 1877-1904
OpenACC is a directive-based programming model which allows programmers to write graphic processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.

Management of forthcoming exascale clusters requires frequent collection of run‐time information about the nodes and the running applications. This paper presents a new paradigm for providing online information to the management system of scalable clusters, consisting of a large number of nodes and one or more masters that manage these nodes. We describe the details of resilient gossip algorithms for sharing local information within subsets of nodes and for sending global information to a master, which holds information on all the nodes. The presented algorithms are decentralized, scalable and resilient, working well even when some nodes fail, without needing any recovery protocol. The paper gives formal expressions for approximating the average ages of the local information at each node and the information collected by the master. It then shows that these results closely match the results of simulations and measurements on a real cluster. The paper also investigates the resilience of the algorithms and the impact on the average age when nodes or masters fail. The main outcome of this paper is that partitioning of large clusters can improve the quality of information available to the management system without increasing the number of messages per node. Copyright © 2015 John Wiley & Sons, Ltd.

All over the world, human resources are used on all kinds of different scheduling problems, many of which are time-consuming and tedious. Scheduling tools are thus very welcome. This paper presents a research project where Genetic Algorithms (GAs) are used as the basis for solving a timetabling problem concerning medical doctors attached to an emergency service. All the doctors express personal preferences, thereby making the scheduling rather difficult. In its natural form, the timetabling problem for the emergency service is stated as a number of constraints to be fulfilled. For this reason, it was decided to compare the strength of a Co-evolutionary Constraint Satisfaction (CCS) technique with that of two other GA approaches. Distributed GAs and a simple special-purpose hill climber were introduced to improve the performance of the three algorithms. Finally, the performance of the GAs was compared with that of some standard, non-GA approaches. The distributed hybrid GAs were by far the most successful, and one of these hybrid algorithms is currently used for solving the timetabling problem at the emergency service. © 1997 John Wiley & Sons, Ltd.

MUPPET is a problem-solving environment for scientific computing with message-based multiprocessors. It consists of four parts: concurrent languages, programming environments, application environments and man-machine interfaces. The programming paradigm of MUPPET is based on parallel abstract machines and transformations between them. This paradigm allows the development of programs which are portable among multiprocessors with different interconnection topologies.

In this paper we discuss the MUPPET programming paradigm. We give an introduction to the language CONCURRENT MODULA-2 and the graphic specification system GONZO. The graphic specification system tries to introduce graphics as a tool for programming. It is also the basis for program generation and transformation.

This paper describes the design and implementation of an Efficient Architecture for Running THreads (EARTH) runtime system for a multi‐processor/multi‐node cluster. The EARTH model was designed to support the efficient execution of parallel (multi‐threaded) programs with irregular fine‐grain parallelism using off‐the‐shelf computers. Implementing an EARTH runtime system requires an explicitly threaded runtime system. For portability, we built this runtime system on top of Pthreads under Linux and used sockets for inter‐node communication. Moreover, in order to make the best use of the resources available on a cluster of symmetric multi‐processors (SMP), this implementation enables the overlapping of communication and computation. We used Threaded‐C, a language designed to implement the programming model supported by the EARTH architecture. This language allows the expression of various levels of parallelism and provides the primitives needed to manage the required communication and synchronization. The Threaded‐C programming language supports irregular fine‐grain parallelism through a two‐level hierarchy of threads and fibers. It also provides various synchronization and communication constructs that reflect the nature of EARTH's fibers (non‐preemptive execution with data‐driven scheduling) as well as the extensive use of split‐phase transactions on EARTH to execute long‐latency operations. Copyright © 2003 John Wiley & Sons, Ltd.

The paper presents a comparison of ant algorithms and simulated annealing as well as their applications in multicriteria discrete dynamic programming. The considered dynamic process consists of finite states and decision variables. In order to describe the effectiveness of multicriteria algorithms, four measures of the quality of the nondominated set approximations are used.

In this article we evaluate a family of criss–cross algorithms for linear programming problems, comparing the results obtained by these algorithms over a set of test problems with those obtained by the simplex algorithms implemented in the XPRESS commercial package. We describe the known criss–cross variants existing in the literature and introduce new versions obtained with a slight modification based on the original criss–cross algorithm. Moreover, we consider some computational details of our implementation and describe the set of test problems used.

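Efficient Algorithms for a Parallel Matrix Eigenproblem Solver on SMP Clusters   Cited by: 2 (self-citations: 0, by others: 2)
Tridiagonalizing a symmetric matrix and computing the eigenvalues of a symmetric tridiagonal matrix are the key steps of a parallel solver for dense symmetric eigenproblems. Targeting the multi-level architecture of SMP clusters, this paper presents MPI+OpenMP hybrid parallel algorithms for Householder-based tridiagonalization and for the divide-and-conquer algorithm for the tridiagonal eigenvalue problem. The study concentrates on load balancing, communication overhead, and performance evaluation in the SMP cluster environment. The hybrid algorithms combine a coarse-grained thread-parallel mode with dynamic, task-sharing invocation, which improves the load balance of the MPI algorithms and lowers the communication overhead. Experiments on the DeepComp 6800 (深腾6800) show that the solver based on the hybrid parallel algorithms has better performance and scalability than the pure MPI version.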

Since a static work distribution does not allow for satisfactory speed‐ups of parallel irregular algorithms, there is a need for a dynamic distribution of work and data that can be adapted to the runtime behavior of the algorithm. Task pools are data structures which can distribute tasks dynamically to different processors where each task specifies computations to be performed and provides the data for these computations. This paper discusses the characteristics of task‐based algorithms and describes the implementation of selected types of task pools for shared‐memory multiprocessors. Several task pools have been implemented in C with POSIX threads and in Java. The task pools differ in the data structures to store the tasks, the mechanism to achieve load balance, and the memory manager used to store the tasks. Runtime experiments have been performed on three different shared‐memory systems using a synthetic algorithm, the hierarchical radiosity method, and a volume rendering algorithm. Copyright © 2004 John Wiley & Sons, Ltd.
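
A central task pool of the kind described can be sketched with one shared queue from which worker threads fetch tasks, and to which a running task may add new tasks. This is a minimal illustration only: the paper's pools are written in C with POSIX threads and in Java, and the name `run_task_pool` and the task protocol (a callable returning `(result, new_tasks)`) are assumptions introduced here.

```python
import queue
import threading


def run_task_pool(initial_tasks, worker_count=4):
    """Run tasks from a shared central pool; a task may return new tasks.

    Each task is a zero-argument callable returning (result, new_tasks).
    """
    pool = queue.Queue()
    results = []
    results_lock = threading.Lock()
    for task in initial_tasks:
        pool.put(task)

    def worker():
        while True:
            task = pool.get()
            if task is None:  # sentinel: shut this worker down
                pool.task_done()
                return
            result, new_tasks = task()
            # Enqueue children before marking the parent done, so the pool
            # never looks empty while work is still being generated.
            for child in new_tasks:
                pool.put(child)
            with results_lock:
                results.append(result)
            pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    pool.join()  # returns once every queued task, including spawned ones, is done
    for _ in threads:
        pool.put(None)
    for t in threads:
        t.join()
    return results
```

In CPython the global interpreter lock limits the speedup for CPU-bound tasks, so the sketch shows the structure of the pool rather than its performance.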

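A Survey of Intelligent Algorithms for Solving Integer Programming Problems*   Cited by: 7 (self-citations: 0, by others: 7)
To provide a reference for methods of solving large-scale integer programming problems, this paper analyzes and reviews research on solving integer programming problems with intelligent algorithms. In view of the shortcomings of existing algorithms, it discusses possible future research directions for applying intelligent algorithms to integer programming problems.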

We derive a bound on the computational complexity of linear programs whose coefficients are real algebraic numbers. Key to this result is a notion of problem size that is analogous in function to the binary size of a rational-number problem. We also view the coefficients of a linear program as members of a finite algebraic extension of the rational numbers. The degree of this extension is an upper bound on the degree of any algebraic number that can occur during the course of the algorithm, and in this sense can be viewed as a supplementary measure of problem dimension. Working under an arithmetic model of computation, and making use of a tool for obtaining upper and lower bounds on polynomial functions of algebraic numbers, we derive an algorithm based on the ellipsoid method that runs in time bounded by a polynomial in the dimension, degree, and size of the linear program. Similar results hold under a rational number model of computation, given a suitable binary encoding of the problem input. This research was funded by the National Science Foundation under Grant DMS88-10192.

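To speed up molecular dynamics simulation on symmetric multiprocessing (SMP) clusters, a hybrid MPI+TBB parallel programming model is introduced into the parallel molecular dynamics method. Based on this model, a hybrid parallel algorithm is designed and implemented in the molecular dynamics package LAMMPS: process-level parallelism between nodes uses MPI with spatial decomposition, and thread-level parallelism within a node uses TBB with critical sections. Tests on an SMP cluster show that, for large systems and large node counts, the method markedly reduces communication time and raises the speedup by 45% over the pure MPI model. The results indicate that the MPI+TBB hybrid parallel programming model benefits parallel molecular dynamics simulation and clearly improves efficiency.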

This paper evaluates different forms of rank-based selection that are used with genetic algorithms and genetic programming. Many types of rank based selection have exactly the same expected value in terms of the sampling rate allocated to each member of the population. However, the variance associated with that sampling rate can vary depending on how selection is implemented. We examine two forms of tournament selection and compare these to linear rank-based selection using an explicit formula. Because selective pressure has a direct impact on population diversity, we also examine the interaction between selective pressure and different mutation strategies.
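
The expected sampling rates mentioned above can be written down directly for two common schemes. The sketch below uses the textbook formulas for linear ranking and for binary tournament selection with replacement; the function names and the `bias` parameter are assumptions for illustration and are not taken from the paper, and the sketch covers expected values only, not the variance the paper studies.

```python
def linear_rank_probs(n, bias=2.0):
    """Selection probability per rank (index 0 = worst) under linear ranking.

    bias is the expected number of samples given to the best individual
    and should lie between 1.0 and 2.0.
    """
    return [(2.0 - bias + 2.0 * (bias - 1.0) * i / (n - 1)) / n for i in range(n)]


def binary_tournament_probs(n):
    """Probability that rank i (index 0 = worst) wins a tournament of two
    individuals drawn uniformly with replacement, with the better one kept."""
    return [((i + 1) ** 2 - i ** 2) / n ** 2 for i in range(n)]
```

Multiplying each probability by the population size gives the expected number of copies of that individual in the next sampling round.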

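Hybrid electric vehicles are usually driven by two different power sources, an internal combustion engine and a battery. For a given power demand, the problem the hybrid powertrain control method has to solve is how to split the output power between the two sources so that fuel consumption over the whole driving cycle is minimized. This paper uses an improved dynamic programming method to optimize the output power of the two power sources and runs a system simulation in PSATv6.1. The simulation results show that, compared with on/off control, the method can effectively reduce the series hybrid electric vehicle's...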
