Similar Literature: 20 matching records found
1.
To solve the load imbalance problem of a solution-adaptive finite element application program on a distributed memory multicomputer, nodes of a refined finite element graph can be remapped to processors, or the load of a refined finite element graph can be redistributed based on the current load of each processor. For the former case, remapping can be performed by fast mapping algorithms. For the latter case, a load-balancing algorithm can be applied to balance the computational load of the processors. In this paper, three tree-based parallel load-balancing methods, the MCSTLB method, the BTLB method, and the CBTLB method, are proposed to deal with the load imbalance problem of solution-adaptive finite element application programs. To evaluate the proposed methods, we have implemented them along with three mapping methods, the AE/ORB method, the AE/MC method, and the MLkP method, on an SP2 parallel machine. Three criteria are used for the performance evaluation: the execution time of the mapping/load-balancing methods, the execution time of a solution-adaptive finite element application program under the different mapping/load-balancing methods, and the speedups achieved by the mapping/load-balancing methods for such a program. The experimental results show that (1) if the initial partitioning is performed by a mapping method and either that mapping method or a load-balancing method is used at each refinement to balance the processor loads, the execution time of the application program under a load-balancing method is always shorter than under the mapping method, and (2) the execution time of the application program under the CBTLB method is shorter than under the BTLB method and the MCSTLB method.

2.
If a finite element mesh has a sufficiently regular structure, it is easy to decide in advance how to distribute the mesh among the processors of a distributed-memory parallel processor, but if the mesh is unstructured the problem becomes much more difficult. The distribution should be made so that each processor has approximately equal work to do, and such that communication overhead is minimized. If the mesh is solution-adaptive, i.e. the mesh and hence the load-balancing problem change discretely during execution of the code, then it is most efficient to decide the optimal mesh distribution in parallel. In this paper three parallel algorithms, orthogonal recursive bisection (ORB), eigenvector recursive bisection (ERB) and a simple parallelization of simulated annealing (SA) have been implemented for load balancing a dynamic unstructured triangular mesh on 16 processors of an NCUBE machine. The test problem is a solution-adaptive Laplace solver, with an initial mesh of 280 elements, refined in seven stages to 5772 elements. We present execution times for the solver resulting from the mesh distributions using the three algorithms, as well as results on imbalance, communication traffic and element migration. The load-balancing itself is fastest with ORB, but a very long run of SA produces a saving of 21% in the execution time of the Laplace solver. ERB is only a little slower than ORB, and yet produces a mesh distribution whose execution time is 15% faster than ORB.
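For readers unfamiliar with orthogonal recursive bisection, the sketch below gives the serial core of the idea: split the element centroids at the median of the longer coordinate axis and recurse. It is an illustration only (assuming 2D centroids and a power-of-two number of partitions) and omits element weights, parallelism, and communication-cost considerations.

```python
import numpy as np

def orb_partition(points, n_parts):
    """Orthogonal recursive bisection of 2D element centroids.

    points : (N, 2) array of element centroids
    n_parts: number of partitions (assumed a power of two here)
    Returns an array of length N with a partition id for each element.
    """
    labels = np.zeros(len(points), dtype=int)

    def bisect(idx, first_id, parts):
        if parts == 1:
            labels[idx] = first_id
            return
        pts = points[idx]
        # Split along the axis with the larger extent, at the median element.
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
        order = idx[np.argsort(pts[:, axis])]
        half = len(order) // 2
        bisect(order[:half], first_id, parts // 2)
        bisect(order[half:], first_id + parts // 2, parts - parts // 2)

    bisect(np.arange(len(points)), 0, n_parts)
    return labels

# Example: partition 1000 random centroids among 16 processors.
rng = np.random.default_rng(0)
print(np.bincount(orb_partition(rng.random((1000, 2)), 16)))
```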

3.
In this paper, we propose a prefix code matching parallel load-balancing method (PCMPLB) to efficiently deal with the load imbalance of solution-adaptive finite element application programs on distributed memory multicomputers. The main idea of the PCMPLB method is first to construct a prefix code tree for processors. Based on the prefix code tree, a schedule for performing load transfer among processors can be determined by concurrently and recursively dividing the tree into two subtrees and finding a maximum matching for processors in the two subtrees until the leaves of the prefix code tree are reached. We have implemented the PCMPLB method on an SP2 parallel machine and compared its performance with two load-balancing methods, the directed diffusion method and the multilevel diffusion method, and five mapping methods, the AE/ORB method, the AE/MC method, the MLkP method, the PARTY library method, and the JOSTLE-MS method. An unstructured finite element graph Truss was used as a test sample. During the execution, Truss was refined five times. Three criteria, the execution time of mapping/load-balancing methods, the execution time of an application program under different mapping/load-balancing methods, and the speedups achieved by mapping/load-balancing methods for an application program, are used for the performance evaluation. The experimental results show that (1) if a mapping method is used for the initial partitioning and this mapping method or a load-balancing method is used in each refinement, the execution time of an application program under a load-balancing method is less than that of the mapping method. (2) The execution time of an application program under the PCMPLB method is less than that of the directed diffusion method and the multilevel diffusion method.
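The flavor of such tree-guided pairwise load transfer can be conveyed with the toy sketch below: recursively halve the processor list (as a binary/prefix tree would) and, at each level, pair processors across the two halves so that each pair averages its load. This is a simplification for illustration only; the PCMPLB prefix code tree construction and the maximum-matching step are not reproduced here.

```python
def tree_transfer_schedule(procs, loads):
    """Toy tree-guided pairwise load transfer (illustration, not PCMPLB).

    procs : list of processor ids
    loads : dict processor id -> current load (modified in place)
    Returns a list of levels; each level is a list of (src, dst, amount).
    """
    schedule = []

    def balance(group):
        if len(group) < 2:
            return
        half = len(group) // 2
        left, right = group[:half], group[half:]
        level = []
        for a, b in zip(left, right):
            amount = (loads[a] - loads[b]) / 2.0
            if amount != 0.0:
                src, dst = (a, b) if amount > 0 else (b, a)
                level.append((src, dst, abs(amount)))
                loads[src] -= abs(amount)
                loads[dst] += abs(amount)
        schedule.append(level)
        balance(left)
        balance(right)

    balance(list(procs))
    return schedule

loads = {0: 40.0, 1: 10.0, 2: 30.0, 3: 20.0}
print(tree_transfer_schedule([0, 1, 2, 3], loads))
print(loads)   # all four processors end up at 25.0
```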

4.
Twelve adaptive image-space decomposition algorithms are presented for sort-first parallel direct volume rendering (DVR) of unstructured grids on distributed-memory architectures. The algorithms are presented under a novel taxonomy based on the dimension of the screen decomposition, the dimension of the workload arrays used in the decomposition, and the scheme used for workload-array creation and querying the workload of a region. For the 2D decomposition schemes using 2D workload arrays, a novel scheme is proposed to query the exact number of screen-space bounding boxes of the primitives in a screen region in constant time. A probe-based chains-on-chains partitioning algorithm is exploited for load balancing in optimal 1D decomposition and iterative 2D rectilinear decomposition (RD). A new probe-based optimal 2D jagged decomposition (OJD) is proposed which is much faster than the dynamic-programming based OJD scheme proposed in the literature. The summed-area table is successfully exploited to query the workload of a rectangular region in constant time in both OJD and RD schemes for the subdivision of general 2D workload arrays. Two orthogonal recursive bisection (ORB) variants are adapted to relax the straight-line division restriction in conventional ORB through using the medians-of-medians approach on regular mesh and quadtree superimposed on the screen. Two approaches based on the Hilbert space-filling curve and graph-partitioning are also proposed. An efficient primitive classification scheme is proposed for redistribution in 1D, and 2D rectilinear and jagged decompositions. The performance comparison of the decomposition algorithms is modeled by establishing appropriate quality measures for load-balancing, amount of primitive replication and parallel execution time. The experimental results on a Parsytec CC system using a set of benchmark volumetric datasets verify the validity of the proposed performance models. The performance evaluation of the decomposition algorithms is also carried out through the sort-first parallelization of an efficient DVR algorithm.
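The constant-time rectangle query via a summed-area table is a standard trick and is easy to reproduce; the sketch below is a generic illustration (not the authors' code) of how a 2D workload array is turned into a table that answers any axis-aligned rectangle query in O(1).

```python
import numpy as np

def summed_area_table(workload):
    """Inclusive 2D prefix sums over a workload array; enables O(1) rectangle queries."""
    return workload.cumsum(axis=0).cumsum(axis=1)

def rect_workload(sat, r0, c0, r1, c1):
    """Total workload of the rectangle with corners (r0, c0)..(r1, c1), inclusive."""
    total = sat[r1, c1]
    if r0 > 0:
        total -= sat[r0 - 1, c1]
    if c0 > 0:
        total -= sat[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]
    return total

w = np.arange(12).reshape(3, 4)        # toy 3x4 workload array
sat = summed_area_table(w)
assert rect_workload(sat, 1, 1, 2, 3) == w[1:3, 1:4].sum()
```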

5.
The Convergence of Realistic Distributed Load-Balancing Algorithms
We give a general model of partially asynchronous, distributed load-balancing algorithms for the discrete load model in parallel computers, where the processor loads are treated as non-negative integers. We prove that all load-balancing algorithms in this model are finite. This means that all load-balancing algorithms based on this model are guaranteed to reach a stable situation at a certain time (which depends on the particular algorithm) at which no load will be sent from one processor to another. With an additional assumption, we prove that the largest load difference between any two processors, in the final stable situation of the load-balancing algorithms in this model, is upper-bounded by the diameter of the topology.
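Restated compactly (this is only the abstract's claim rewritten in symbols, with l_i the final load of processor i and D the diameter of the interconnection graph):

```latex
\max_{i,j} \lvert l_i - l_j \rvert \;\le\; D .
```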

6.
To efficiently execute a finite element application program on a distributed memory multicomputer, we need to distribute the nodes of a finite element graph to the processors of the multicomputer as evenly as possible and minimize the communication cost among processors. This partitioning problem is known to be NP-complete. Therefore, many heuristics have been proposed to find satisfactory sub-optimal solutions, and many graph partitioners have been developed based on them. Among these, Jostle, Metis, and Party are considered the best graph partitioners currently available. In order to minimize the total cut-edges, these three graph partitioners generally allow 3% to 5% load imbalance among processors; this is a tradeoff between the communication cost and the computation cost of the partitioning problem. In this paper, we propose an optimization method, the dynamic diffusion method (DDM), to balance the 3% to 5% load imbalance allowed by these three graph partitioners while minimizing the total cut-edges among partitioned modules. To evaluate the proposed method, we compare the performance of the dynamic diffusion method with the directed diffusion method and the multilevel diffusion method on an IBM SP2 parallel machine. Three 2D and two 3D irregular finite element graphs are used as test samples. For each test sample, 3% and 5% load imbalance situations are tested. From the experimental results, we draw the following conclusions. (1) The dynamic diffusion method can improve the partition results of these three partitioners in terms of the total cut-edges and the execution time of a Laplace solver in most test cases, while the directed diffusion method and the multilevel diffusion method may fail in many cases. (2) The optimization results of the dynamic diffusion method are better than those of the directed diffusion method and the multilevel diffusion method in terms of the total cut-edges and the execution time of a Laplace solver for most test cases. (3) The dynamic diffusion method can balance the load of processors for all test cases.
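As background on the diffusion-style methods compared here, one round of classical first-order diffusion load balancing looks like the sketch below. This is generic background only, not the proposed dynamic diffusion method; the neighbor topology and diffusion coefficient are illustrative assumptions.

```python
def diffusion_step(loads, neighbors, alpha=0.25):
    """One round of classical first-order diffusion load balancing.

    loads     : dict processor -> current load
    neighbors : dict processor -> list of neighboring processors
    alpha     : diffusion coefficient (must be small enough for stability)
    """
    new_loads = dict(loads)
    for p, nbrs in neighbors.items():
        for q in nbrs:
            new_loads[p] += alpha * (loads[q] - loads[p])
    return new_loads

# Four processors on a ring, initially imbalanced.
loads = {0: 16.0, 1: 4.0, 2: 8.0, 3: 12.0}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
for _ in range(20):
    loads = diffusion_step(loads, ring)
print(loads)   # all values approach the average, 10.0
```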

7.
Load-balancing algorithms are designed essentially to distribute the load equally across processors and maximize their utilization while minimizing the total task execution time. In order to achieve these goals, the load-balancing mechanism should be "fair" in distributing the load across the different processors. This implies that the difference between the heaviest-loaded and the lightest-loaded processors should be minimized. Therefore, the load information on each processor must be kept up to date so that the load-balancing mechanism can be more effective. In this work, we present an application-independent dynamic algorithm for scheduling tasks and load balancing in message-passing systems. We propose a DAG-based Dynamic Load Balancing algorithm for Real-time applications (DAG-DLBR) that is designed to work dynamically to cope with possible changes in the load that might occur during runtime. This algorithm addresses the challenge of devising a load-balancing scheme that judiciously deals with the hybrid execution of an existing real-time application (represented by a Directed Acyclic Graph (DAG)) together with newly arriving jobs. The main objective of this algorithm is to reduce response times of the newly arriving jobs while maintaining the time constraints of the existing DAG. To evaluate the performance of the DAG-DLBR algorithm, a comparison with the performance of two common dynamic load-balancing algorithms is presented. This comparison is performed by evaluating, experimentally, the execution time of the different load-balancing algorithms on a homogeneous real parallel machine. In addition, the values of load imbalance, execution time, and communication overhead are evaluated analytically using different benchmarks as test-bed workloads. These workloads cover a wide range of dynamic applications with different task types. Experimental results illustrate the improved performance of the DAG-DLBR algorithm compared to both distributed and hierarchical-based algorithms by at least 12% and 19%, respectively. This improvement holds for all workloads, even highly dependent ones. The DAG-DLBR algorithm achieves lower computation time than the corresponding values of both the distributed and the hierarchical-based algorithms for 4, 8, 12, and 16 processors.

8.
The paper presents an extension of the composite programmable graph grammar (CP-graph grammar) suitable for modeling the parallel direct solver algorithm utilized by the hp finite element method (hp-FEM). In the proposed graph grammar model, the computational mesh is represented by a CP-graph. The presented graph grammar models the solver algorithm by a set of graph grammar productions. The graph grammar model makes it possible to examine the concurrency of the algorithm by analyzing the interdependence between the atomic tasks, tasks, and super-tasks. The atomic tasks correspond to the graph grammar productions, representing basic indivisible parts of the algorithm. The level of atomic tasks models the concurrency for shared memory architectures. The tasks, in turn, correspond to groups of atomic tasks with predefined inter-task communication channels; they constitute the grain for the decomposition of the parallel algorithm for distributed memory architectures. Finally, the super-tasks correspond to groups of tasks resulting from the execution of a load-balancing algorithm. The solver algorithm is tested on a distributed memory Linux cluster with up to 192 processors.

9.
We describe a compiler and run-time system that allow data-parallel programs to execute on a network of heterogeneous UNIX workstations. The programming language supported is Dataparallel C, a SIMD language with virtual processors and a global name space. This parallel programming environment allows the user to take advantage of the power of multiple workstations without adding any message-passing calls to the source program. Because the performance of individual workstations in a multi-user environment may change during the execution of a Dataparallel C program, the run-time system automatically performs dynamic load balancing. We present experimental results that demonstrate the usefulness of dynamic load balancing in a multi-user environment. These results suggest that initially allocating the same amount of work to each processor and letting the dynamic load-balancing algorithm adjust the load during program execution yields very good performance. Hence, neither the compiler nor the run-time system needs a priori knowledge of the speeds of the machines that will participate in a program execution.

10.
In a shared-nothing parallel database system, a join operation is split into a set of tasks that are allocated to the nodes in the system to be executed concurrently and independently. While parallel processing can greatly reduce the completion time of a join operation, system performance may degrade because of load imbalance across the nodes caused by data skew in the relations. Load-balanced join processing uses various techniques to distribute the load evenly among the nodes in a system and hence improves overall system performance. In this paper, the basic issues in designing load-balanced parallel join algorithms are identified. From the solutions to those issues, a large set of load-balanced join algorithms can be constructed. The performance of four representative algorithms (two dynamic load-balancing algorithms proposed in this paper and two static load-balancing algorithms adapted from similar algorithms in the literature) is studied and compared with that of a parallel join algorithm that does not balance the join load. The results of our study clearly show the benefits of load balancing. This study also demonstrates that the dynamic load-balancing techniques proposed in this paper are not only feasible but also provide good system performance.

11.
Observations on using genetic algorithms for dynamic load-balancing
Load-balancing problems arise in many applications, but, most importantly, they play a special role in the operation of parallel and distributed computing systems. Load balancing deals with partitioning a program into smaller tasks that can be executed concurrently and mapping each of these tasks to a computational resource such as a processor (e.g., in a multiprocessor system) or a computer (e.g., in a computer network). By developing strategies that map these tasks to processors in a way that balances out the load, the total processing time can be reduced with improved processor utilization. Most research on load balancing has focused on static scenarios that, in most cases, employ heuristic methods. However, genetic algorithms have gained immense popularity over the last few years as a robust and easily adaptable search technique. The work proposed here investigates how a genetic algorithm can be employed to solve the dynamic load-balancing problem. A dynamic load-balancing algorithm is developed whereby optimal or near-optimal task allocations can "evolve" during the operation of the parallel computing system. The algorithm considers other load-balancing issues such as threshold policies, information exchange criteria, and interprocessor communication. The effects of these and other issues on the success of the genetic-based load-balancing algorithm, as compared with the first-fit heuristic, are outlined.
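As a rough illustration of the encoding such genetic approaches typically use, the toy sketch below evolves a static task-to-processor assignment whose fitness is the makespan. It is a generic illustration only, not the paper's dynamic algorithm (thresholds, information exchange, and communication costs are omitted); all parameter values are arbitrary.

```python
import random

def ga_load_balance(task_costs, n_procs, pop_size=30, generations=200, seed=1):
    """Toy genetic algorithm: chromosome[i] = processor assigned to task i;
    the objective is to minimize the makespan (load of the busiest processor)."""
    rng = random.Random(seed)
    n_tasks = len(task_costs)

    def makespan(chrom):
        loads = [0.0] * n_procs
        for task, proc in enumerate(chrom):
            loads[proc] += task_costs[task]
        return max(loads)

    pop = [[rng.randrange(n_procs) for _ in range(n_tasks)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        survivors = pop[: pop_size // 2]                  # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)               # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                        # mutation
                child[rng.randrange(n_tasks)] = rng.randrange(n_procs)
            children.append(child)
        pop = survivors + children
    best = min(pop, key=makespan)
    return best, makespan(best)

costs = [5, 3, 8, 2, 7, 4, 6, 1]
print(ga_load_balance(costs, 3))
```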

12.
Many parallel algorithms use hypercubes as the communication topology among their processes. When such algorithms are executed on hypercube multicomputers, the communication cost is kept to a minimum since processes can be allocated to processors in such a way that only communication between neighboring processors is required. However, the scalability of hypercube multicomputers is constrained by the fact that the interconnection cost per node increases with the total number of nodes. From a scalability point of view, meshes and toruses are more interesting classes of interconnection topologies. This paper focuses on the execution of algorithms with a hypercube communication topology on multicomputers with mesh or torus interconnection topologies. The proposed approach is based on looking at different embeddings of hypercube graphs onto mesh or torus graphs. The paper concentrates on toruses, since an already known embedding, called the standard embedding, is optimal for meshes. In this paper, an embedding of hypercubes onto toruses of any given dimension is proposed. This novel embedding is called the xor embedding. The paper presents a set of performance figures for both the standard and the xor embeddings and shows that the latter outperforms the former for any torus. In addition, it is proven that for a one-dimensional torus (a ring) the xor embedding is optimal in the sense that it minimizes the execution time of a class of parallel algorithms with hypercube topology. This class of algorithms is frequently found in real applications, such as the FFT and some classes of sorting algorithms.

13.
The paper presents a generic technique for mapping parallel algorithms onto parallel architectures. The proposed technique is a fast recursive mapping algorithm which is a component of the Cluster-M programming tool. The other components of Cluster-M are the Specification module and the Representation module. In the Specification module, for a given task specified by a high-level machine-independent program, a clustered task graph called Spec graph is generated. In the Representation module, for a given architecture or computing organization, a clustered system graph called Rep graph is generated. Given a task (or system) graph, a Spec (or Rep) graph can be generated using one of the clustering algorithms presented in the paper. The clustering is done only once for a given task graph (system graph) independent of any system graphs (task graphs). It is a machine-independent (application-independent) clustering, and therefore it is not repeated for different mappings. The Cluster-M mapping algorithm presented produces a sub-optimal matching of a given Spec graph containing M task modules, onto a Rep graph of N processors, in O(MN) time. This generic algorithm is suitable for both the allocation problem and the scheduling problem. Its performance is compared to other leading techniques. We show that Cluster-M produces better or similar results in significantly less time and using fewer or an equal number of processors as compared to the other known methods.

14.
Finite element analysis of large-scale nonlinear structural dynamics problems is very time-consuming. Several explicit parallel finite element algorithms based on parallel solution strategies are therefore proposed for a Message Passing Interface (MPI) cluster environment. Building on domain decomposition with explicit message passing, overlapping and non-overlapping domain decomposition techniques and dynamic task allocation are adopted, and interprocessor communication is optimized by overlapping computation with communication. Five algorithms are studied: a non-overlapping-communication domain decomposition algorithm, an overlapping-communication domain decomposition algorithm, a group dynamic task allocation algorithm, a dynamic task allocation algorithm, and a dynamic load-balancing algorithm. To carry out nonlinear dynamic finite element analysis in a cluster environment, explicit parallel finite element algorithms based on these parallel solution strategies were developed. A parallel finite element program based on the message-passing programming model was written, numerical examples were run on a workstation cluster, the performance of the algorithms was analyzed, and the results were compared with the conventional Newmark algorithm. The examples show that the group dynamic task allocation algorithm outperforms the dynamic task allocation algorithm but is inferior to the domain decomposition algorithms, while the dynamic load-balancing algorithm performs best. For problems of the same size, the proposed algorithms are faster than the Newmark algorithm. The proposed parallel algorithms are thus feasible and effective for finite element analysis of nonlinear structural dynamics problems.
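The computation/communication overlap mentioned above follows a standard non-blocking exchange pattern. The sketch below (using mpi4py, a hypothetical 1D ring of subdomains, and toy data; it is not the authors' code) posts the boundary exchanges, does interior work while messages are in flight, and only then touches the received ghost values.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # hypothetical 1D ring of subdomains

u = np.full(1000, float(rank))                       # toy local subdomain values
recv_l, recv_r = np.empty(1), np.empty(1)

# Post non-blocking boundary exchanges, then compute while they are in flight.
requests = [comm.Isend(u[:1], dest=left), comm.Isend(u[-1:], dest=right),
            comm.Irecv(recv_l, source=left), comm.Irecv(recv_r, source=right)]

interior_sum = u[1:-1].sum()          # interior work needing no ghost values

MPI.Request.Waitall(requests)         # communication finished

boundary_avg = (recv_l[0] + u[0] + u[-1] + recv_r[0]) / 4.0   # work needing ghosts
print(f"rank {rank}: interior_sum={interior_sum:.1f} boundary_avg={boundary_avg:.1f}")
```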

15.
This paper gives a systematic methodology for the formulation of parallel computation structures and algorithms. The fundamental definition of a computation structure is a graph where each node is the binding of an action to a data object and the arcs are the dependency relationships between the unit computations executed at the nodes. The structure of the graph is determined by the selection of elements for the model of computation in which the graph is expressed. An abstract machine is created by defining the resources (for example, instruction sets for the processors) that realize the conceptual elements of the model of parallel computation. An algorithm is a mapping of the computation graph to the abstract machine and a program which traverses the mapped graph to execute the computation. The methodology proceeds by describing parallel computations on successively more fully specified abstract machines. A model of parallel computation is selected and an abstract machine implementing the model of computation is defined. The specification of increasingly resolved abstract machines is structured both by increasing the span of elements from the model of computation represented in the machine and by increasing the level of detail resolved for each element.

16.
周筠, 蒋富. 《计算机科学》, 2018, 45(Z11): 573-575
Marching Cubes is a classic algorithm for visualizing medical volume data, but the poor quality of the generated meshes and the slow execution speed are two major drawbacks that hinder its use in numerical analysis. This paper proposes an improved, hardware-accelerated Marching Cubes algorithm. The algorithm uses the Compute Unified Device Architecture (CUDA) to fully exploit the divide-and-conquer nature of Marching Cubes, classifying the volume data in parallel with CUDA's programmability and thereby speeding up the extraction of active voxels and active edges. In addition, the improved algorithm offsets the resulting active edges according to a midpoint-projection scheme, which improves mesh quality. Experiments show that the algorithm supports interactive, high-quality mesh modeling even when the iso-value threshold is not known in advance.
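The voxel-classification step that the abstract parallelizes on the GPU can be stated compactly. The sketch below is a serial NumPy illustration of that classification only (a cell is active when its eight corner samples straddle the iso-value); it is not the CUDA kernel and does not include the edge-offsetting step.

```python
import numpy as np

def active_voxels(volume, iso):
    """Mark Marching Cubes cells whose eight corner samples straddle the iso-value.

    volume : 3D scalar field, shape (nx, ny, nz)
    iso    : iso-value
    Returns a boolean array of shape (nx-1, ny-1, nz-1).
    """
    c = [volume[i:volume.shape[0] - 1 + i,
                j:volume.shape[1] - 1 + j,
                k:volume.shape[2] - 1 + k]
         for i in (0, 1) for j in (0, 1) for k in (0, 1)]
    corners = np.stack(c, axis=0)              # (8, nx-1, ny-1, nz-1)
    return (corners.min(axis=0) < iso) & (corners.max(axis=0) > iso)

vol = np.random.default_rng(0).random((64, 64, 64))
mask = active_voxels(vol, 0.5)
print(mask.sum(), "active voxels out of", mask.size)
```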

17.
A new mapping algorithm is presented for domain decomposition, for the purpose of allowing researchers to conduct finite element analysis on massively parallel computers. Over the last few years, massively parallel MIMD machines such as the Intel Touchstone Delta and, more recently, the Intel Touchstone Paragon have become increasingly popular for speeding up finite element computations. Most of these applications use domain decomposition as a first step towards conquering the problem. Many different algorithms have been developed by researchers to achieve an effective domain decomposition. Some of these methods use connectivity information only, some use coordinate information only, while others use both together. Some algorithms are based on assigning weights to nodes using a particular strategy, while others are recursive in nature. As discussed in this paper, the logic employed in various algorithms works perfectly well for certain meshes decomposed into certain numbers of subdomains, while it gives far from perfect results for other meshes, or for the same meshes decomposed into a different number of subdomains. The logic used in the proposed algorithm has been developed so that it is closer to a human's natural way of thinking when making decisions. Fairly large meshes can be decomposed in a matter of seconds on a Sun SPARCstation by the proposed algorithm, and its execution time remains almost the same for any number of subdomains.

18.
The problem of maximal clique enumeration (MCE) is to enumerate all of the maximal cliques in a graph. Once enumerated, maximal cliques are widely used to solve problems in areas such as 3-D protein structure alignment, genome mapping, gene expression analysis, and detection of social hierarchies. Even the most efficient serial MCE algorithms require large amounts of time to enumerate the maximal cliques in networks arising from these problems that contain hundreds, thousands, or larger numbers of vertices. Previous attempts to provide practical solutions to the MCE problem through parallel implementation have had limited success, largely due to a number of challenges inherent to the nature of the MCE combinatorial search space. On the one hand, MCE algorithms often create a backtracking search tree with a highly irregular structure that is hard or impossible to predict; therefore, almost any static decomposition of the search tree among parallel processors results in highly unbalanced processor execution times. On the other hand, the data-intensive nature of the MCE problem often makes naive dynamic load distribution strategies that require extensive data movement prohibitively expensive. As a result, good scaling of the overall execution time of parallel MCE algorithms has been reported for only up to a couple hundred processors. In this paper, we propose a parallel, scalable, and memory-efficient MCE algorithm for distributed and/or shared memory high performance computing architectures, whose runtime scales linearly for thousands of processors on real-world application graphs with hundreds and thousands of nodes. Its scalability and efficiency are attributed to the proposed: (a) representation of the search tree decomposition to enable parallelization; (b) parallel depth-first backtracking search to both constrain the search space and minimize memory requirements; (c) least stringent synchronization to minimize data movement; and (d) on-demand work stealing intelligently coupled with work stack splitting to minimize computing elements' idle time. To the best of our knowledge, the proposed parallel MCE algorithm is the first to achieve a linear scaling runtime using up to 2048 processors on Cray XT machines for a number of real-world biological networks.
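For context, the irregular backtracking search tree described above is typically generated by a Bron-Kerbosch-style enumeration. The sketch below is that classical serial algorithm with pivoting, given only as background; the paper's contributions (parallel tree decomposition, depth-first search constraints, work stealing) are not reproduced here.

```python
def maximal_cliques(adj):
    """Serial Bron-Kerbosch enumeration with pivoting.

    adj : dict mapping vertex -> set of neighboring vertices
    Returns a list of maximal cliques, each a set of vertices.
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)          # r is a maximal clique
            return
        # Pivot rule: pick u in P | X maximizing |N(u) & P|.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(maximal_cliques(adj))   # [{1, 2, 3}, {2, 3, 4}] (order may vary)
```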

19.
A Greedy Algorithm for Scheduling Fork-Join Task Graphs
The goal of task scheduling is to assign the set of tasks that make up a parallel program to multiple processors so that the program's completion time is minimized; this is an NP-complete problem. Although many algorithms produce optimal schedules when the tasks satisfy certain conditions, most of them ignore issues such as economizing on the number of processors and minimizing the total completion time of the program. The Fork-Join structure is a basic structure in parallel processing. Therefore, a new greedy scheduling algorithm that can produce optimal schedules is proposed specifically for Fork-Join task graphs. The algorithm achieves a high speedup and overall efficiency, and its time complexity is O(v²), where v is the number of tasks in the task set. Experimental results show that, compared with other algorithms, the proposed algorithm yields shorter schedule lengths and shorter completion times while using fewer processors.
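To give a concrete feel for greedy scheduling on a fork-join graph, the sketch below applies a simple longest-processing-time-first rule to the independent middle tasks between the fork and the join. It is an illustration of the greedy idea only, not the optimal algorithm proposed in the paper, and it ignores communication costs.

```python
def greedy_fork_join_schedule(fork_cost, task_costs, join_cost, n_procs):
    """Illustrative greedy (LPT-style) schedule for a fork-join task graph.

    fork_cost  : cost of the fork task (runs first)
    task_costs : costs of the independent middle tasks
    join_cost  : cost of the join task (runs after all middle tasks)
    n_procs    : number of processors
    Returns (assignment of middle tasks to processors, makespan).
    """
    assignment = {}
    # No middle task can start before the fork task finishes.
    finish = [float(fork_cost)] * n_procs
    # Longest-processing-time-first: assign each task to the earliest-free processor.
    for task in sorted(range(len(task_costs)), key=lambda t: -task_costs[t]):
        proc = min(range(n_procs), key=lambda p: finish[p])
        assignment[task] = proc
        finish[proc] += task_costs[task]
    # The join task starts once every middle task has finished.
    makespan = max(finish) + join_cost
    return assignment, makespan

print(greedy_fork_join_schedule(1.0, [5, 3, 8, 2, 7, 4], 1.0, 3))
```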

20.
As the user base of e-commerce websites keeps growing, high concurrency has become a major challenge in building large-scale e-commerce systems, and using a load-balancing algorithm to balance the load across the nodes of a Web service cluster is one way to cope with it. However, the load-balancing algorithms in common use today all have shortcomings. To address this problem, a Dynamic Adaptive Weight Round-Robin Random Load-Balancing (DAWRRRLB) algorithm is proposed. The algorithm takes into account the multiple factors that affect the performance of the server nodes in a Web service cluster, dynamically adjusts the nodes' load weights according to their real-time load during operation, and combines an improved Pick-K algorithm with weighted round-robin load balancing so that the best-performing server nodes are always the ones serving requests. Repeated comparative experiments show that the improved DAWRRRLB algorithm effectively improves load-balancing efficiency.
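A minimal sketch of the general idea of dynamically weighted dispatching follows; it is an illustration only, not the DAWRRRLB algorithm (whose Pick-K and weighting details are not given in the abstract). Node names, capacities, and the headroom-based weight formula are all illustrative assumptions.

```python
import random

class DynamicWeightedBalancer:
    """Toy dynamically weighted request dispatcher: each node's weight is
    recomputed from its most recently reported load, and requests are sent
    by weighted random choice, so lightly loaded nodes get more traffic."""

    def __init__(self, capacities, seed=0):
        self.capacities = dict(capacities)           # node -> nominal capacity
        self.loads = {node: 0.0 for node in capacities}
        self.rng = random.Random(seed)

    def report_load(self, node, load):
        """Called periodically with each node's measured real-time load."""
        self.loads[node] = load

    def pick(self):
        nodes = list(self.capacities)
        # Weight = remaining headroom, kept above a small positive floor.
        weights = [max(self.capacities[n] - self.loads[n], 0.01) for n in nodes]
        return self.rng.choices(nodes, weights=weights, k=1)[0]

lb = DynamicWeightedBalancer({"web1": 10.0, "web2": 10.0, "web3": 5.0})
lb.report_load("web2", 9.0)                          # web2 is nearly saturated
picks = [lb.pick() for _ in range(1000)]
print({n: picks.count(n) for n in ("web1", "web2", "web3")})
```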
