Similar literature
 Found 20 similar articles; search took 375 ms
1.
The two-way stripes partition mapping and the greedy assignment mapping are proposed to map finite element graphs composed of a number of rectilinear four-node elements on hypercubes. The two-way stripes partition mapping is a two-phase mapping approach. In the first phase a two-way stripes partition heuristic is used to lower the communication cost. In the second phase the load transfer heuristic is used to balance the computational load among processors. The greedy assignment mapping tries to minimize the communication cost and balance the computational load of processors simultaneously. Our simulation results show that the speedups for the two-way stripes partition mapping are better than those for the greedy assignment mapping when the load balancing criterion is achieved in both approaches (that is, the number of nodes in each processor is at most one more than the number of nodes in any other processor). However, the greedy approach performs well at a much lower cost. The work of this author was supported in part by NSF under contract CCR-9110812.

2.
Due to a significant communication overhead of sending and receiving data, the loop partitioning approaches on distributed memory systems must guarantee not just the computation load balance but computation+communication load balance. The previous approaches in loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning solution for a limited subset of DOALL loops. But a large category of DOALL loops inevitably result in communication and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general cases of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first performs partitioning of iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balance each partition with respect to total computation+communication, and then performs mapping of partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a larger communication volume than parallelism. 
We then perform data space partitioning based on a new "larger partition owns" rule, which minimizes the communication overhead of a compute-intensive partition by localizing its references relatively more than those of a smaller, non-compute-intensive partition. A partition interaction graph is then constructed, which the architecture-dependent analysis phase uses to merge partitions in order to achieve granularity adjustment, computation+communication load balance, and mapping onto the actual number of available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D.

3.
This paper proposes a simple yet efficient algorithm to distribute loads evenly on multiprocessor computers with hypercube interconnection networks. This algorithm was developed based upon the well-known dimension exchange method. However, the error accumulation suffered by other algorithms based on the dimension exchange method is avoided by exploiting the notion of regular distributions, which are commonly deployed for data distributions in parallel programming. This algorithm achieves a perfect load balance over P processors with an error of 1 and the worst-case time complexity of O(M log2 P), where M is the maximum number of tasks initially assigned to each processor. Furthermore, perfect load balance is achieved over subcubes as well: once a hypercube is balanced, if the cube is decomposed into two subcubes by the lowest bit of node addresses, then the difference between the numbers of the total tasks of these subcubes is at most 1.
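
To make the error accumulation mentioned above concrete, here is a minimal sketch of the plain dimension exchange method with integer task counts. It is not the regular-distribution algorithm of the abstract. The function name `dimension_exchange` and the rule that the lower-numbered node keeps the odd task are assumptions made for illustration.

```python
def dimension_exchange(loads):
    """Balance integer task counts on a hypercube by plain dimension exchange.

    loads[i] is the task count on node i; len(loads) must be a power of two.
    In step d, every node averages its load with the neighbor whose address
    differs in bit d. Rounding rule (an assumption): the lower-numbered node
    keeps the extra task when the pair total is odd.
    """
    p = len(loads)
    assert p > 0 and p & (p - 1) == 0, "node count must be a power of two"
    loads = list(loads)
    dim = p.bit_length() - 1  # hypercube dimension, log2(p)
    for d in range(dim):
        for node in range(p):
            partner = node ^ (1 << d)
            if node < partner:  # handle each pair once
                total = loads[node] + loads[partner]
                loads[node] = (total + 1) // 2
                loads[partner] = total // 2
    return loads


print(dimension_exchange([8, 0, 0, 0, 0, 0, 0, 0]))  # evenly divisible case
print(dimension_exchange([3, 0, 1, 0]))              # rounding errors add up
```

In the second call, four tasks on four nodes end up as `[2, 1, 1, 0]`: the rounding in each dimension compounds, so the final spread is 2 rather than the ideal 0. This is the kind of accumulated error the abstract says its algorithm avoids.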

4.
Parallelizing the Data Cube   Cited by: 1 (self: 0, others: 1)
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.

5.

The most widely used technique to allow for parallel simulations in molecular dynamics is spatial domain decomposition, where the physical geometry is divided into boxes, one per processor. This technique can inherently produce computational load imbalance when either the spatial distribution of particles or the computational cost per particle is not uniform. This paper shows the benefits of using a hybrid MPI+OpenMP model to deal with this load imbalance. We consider LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator), a prototypical molecular dynamics simulator that provides its own balancing mechanism and an OpenMP implementation for many of its modules, allowing for a hybrid setup. In this work, we extend and optimize the current OpenMP implementation of LAMMPS and evaluate three setups: MPI-only, MPI with the LAMMPS balance mechanism, and a hybrid setup using our improved OpenMP version. This comparison is made using the five standard benchmarks included in the LAMMPS distribution plus two additional test cases. Results show that the hybrid approach handles load imbalance more effectively (50% improvement over MPI-only for a highly imbalanced test case) than the LAMMPS balance mechanism (only 43% improvement), and that it also improves simulations with issues other than load imbalance.


6.
Amnon Barak, Amnon Shiloh. Software, 1985, 15(9): 901-913
This paper deals with the organization of a distributed load-balancing policy for a multicomputer system which consists of a cluster of independent computers that are interconnected by a local area communication network. We introduce three algorithms necessary to maintain load balancing in this system: the local load algorithm, used by each processor to monitor its own load; the exchange algorithm, for exchanging load information between the processors; and the process migration algorithm, which uses this information to dynamically migrate processes from overloaded to underloaded processors. The policy that we present is distributed, i.e. each processor uses the same policy. It is both dynamic, responding to load changes without a priori knowledge of the resources that each process requires, and stable, in that unnecessary overloading of a processor is minimized. We give the essential details of the implementation of the policy and initial results on its performance. Our results confirm the feasibility of building distributed systems that are based on network communication for uniform access, resource sharing and improved reliability, as well as the use of workstations without a secondary storage device.

7.
张娜, 杨波, 陈贞翔, 孙润元. Computer Engineering (《计算机工程》), 2008, 34(17): 102-104
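Load balancers are important devices for improving network performance. This paper studies load balancing systems and their algorithms. After comparing several algorithms, it selects an agent-based dynamic feedback load balancing algorithm and designs a load balancer suited to campus networks, built on the Intel IXP425 network processor with the VxWorks 5.5 embedded kernel. Load balancing tests on several campus network proxy egress points in a laboratory environment show that the load balancer performs well as a load balancing device for small and medium-sized networks.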

8.
Diffusion Schemes for Load Balancing on Heterogeneous Networks   Cited by: 1 (self: 0, others: 1)
Several different diffusion schemes have previously been developed for load balancing on homogeneous processor networks. We generalize existing schemes in order to deal with heterogeneous networks. Generalized schemes may operate efficiently on networks where each processor can have arbitrary computing power, i.e., the load will be balanced proportionally to these powers. The balancing flow that is calculated by schemes for homogeneous networks is minimal with regard to the l2-norm, and we prove this to hold true for generalized schemes, too. We demonstrate the usability of generalized schemes by a number of experiments on several heterogeneous networks.
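
As a rough illustration of balancing "proportionally to these powers", the sketch below runs a first-order diffusion iteration in which the flow over each edge is driven by the difference in load per unit of computing power. It is an assumption-laden toy, not necessarily the exact scheme of the paper: the function name `diffuse`, the fixed step size `alpha`, and the ring topology in the example are all chosen for illustration, and loads are treated as real numbers.

```python
def diffuse(loads, powers, edges, alpha, steps):
    """First-order diffusion with real-valued loads on a heterogeneous network.

    loads[i]  : current load on processor i
    powers[i] : relative computing power of processor i (positive)
    edges     : list of undirected links (i, j)
    alpha     : step size (assumed small enough for the iteration to converge)
    """
    loads = [float(w) for w in loads]
    for _ in range(steps):
        delta = [0.0] * len(loads)
        for i, j in edges:
            # Flow from i to j follows the gap in normalized load w/c.
            flow = alpha * (loads[i] / powers[i] - loads[j] / powers[j])
            delta[i] -= flow
            delta[j] += flow
        loads = [w + d for w, d in zip(loads, delta)]
    return loads


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
result = diffuse([80, 0, 0, 0], powers=[1, 2, 1, 4], edges=ring,
                 alpha=0.25, steps=500)
print([round(w, 3) for w in result])
```

With total load 80 and powers summing to 8, the fixed point of this iteration gives each processor 10 units of load per unit of power, so the loads approach 10, 20, 10, and 40. Every transfer moves load along an edge, so the total is conserved throughout.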

9.
In this paper, we present a static load balancing method for mapping production rules in an expert system onto processors of a message-passing multicomputer. The method uses simulated annealing to achieve a nearly optimal allocation of production rules onto processor nodes. The approach balances the initial rule distribution to avoid higher communication demand among processor nodes at run time. A formal mapping model is developed and a new cost function is defined for the annealing process. New heuristic swap functions and cooling policies are provided to ensure the quality of the annealing process. A software load balancing package, SIMAL, was implemented on a SUN workstation to carry out the benchmark experiments. The overhead associated with this mapping method is O(m ln m), where m is the number of rules in the production system. Two benchmark production systems, Toru-waltz and Tourney, are mapped onto a hypercube computer with 32 nodes. Experimental benchmark results verify the effectiveness of the rule mapping method. The method can be applied for distributed artificial intelligence processing or for the parallel execution of cooperating expert systems on a message-passing multicomputer.
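
The general idea of annealing a rule-to-processor mapping can be sketched as follows. This is a generic simulated annealing loop, not SIMAL: the cost function (load spread plus weighted cross-processor communication), the single-rule move, the geometric cooling schedule, and every name and parameter value are hypothetical stand-ins for the paper's own cost function, swap functions, and cooling policies.

```python
import math
import random


def cost(assign, weights, comm, n_procs, lam=1.0):
    """Hypothetical cost: load spread plus weighted cross-processor traffic."""
    load = [0.0] * n_procs
    for rule, proc in enumerate(assign):
        load[proc] += weights[rule]
    imbalance = max(load) - min(load)
    cut = sum(w for (a, b), w in comm.items() if assign[a] != assign[b])
    return imbalance + lam * cut


def anneal(weights, comm, n_procs, seed=0, t0=10.0, cooling=0.95, sweeps=200):
    """Return the best assignment found and its cost."""
    rng = random.Random(seed)
    assign = [i % n_procs for i in range(len(weights))]  # round-robin start
    cur = cost(assign, weights, comm, n_procs)
    best, best_cost = list(assign), cur
    t = t0
    for _ in range(sweeps):
        for _ in range(len(weights)):
            rule = rng.randrange(len(weights))
            old = assign[rule]
            assign[rule] = rng.randrange(n_procs)  # move one rule
            new = cost(assign, weights, comm, n_procs)
            # Metropolis rule: keep improvements, sometimes keep worse moves.
            if new <= cur or rng.random() < math.exp((cur - new) / t):
                cur = new
                if cur < best_cost:
                    best, best_cost = list(assign), cur
            else:
                assign[rule] = old  # reject and undo the move
        t *= cooling  # geometric cooling
    return best, best_cost


# Eight unit-cost rules; rules (0,1), (2,3), (4,5), (6,7) communicate heavily.
weights = [1] * 8
comm = {(0, 1): 5, (2, 3): 5, (4, 5): 5, (6, 7): 5}
mapping, final_cost = anneal(weights, comm, n_procs=2)
print(mapping, final_cost)
```

The round-robin start separates every communicating pair, giving a cost of 20 under this toy cost function. Because the loop records the best assignment seen, the returned cost can never exceed that starting value.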

10.
Automatic Synthesis for Data-Driven Processor Arrays   Cited by: 1 (self: 0, others: 1)
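This paper proposes a data-driven processor array architecture that effectively balances storage and computation and is well suited to high-performance algorithm acceleration on FPGAs. It also proposes an automatic synthesis framework targeting this architecture, through which regular loops can be mapped efficiently onto the data-driven processor array. Experimental results demonstrate the effectiveness of the framework, and the generated designs outperform a general-purpose processor.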

11.
张苗, 张德贤. Journal of Computer Applications (《计算机应用》), 2011, 31(7): 1808-1810
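Heterogeneous multi-core processor architectures can effectively reduce power overhead and represent the trend in processor development, but load imbalance can make processor execution unstable. This paper proposes a heterogeneous multi-core scheduling mechanism that combines heterogeneity-aware static scheduling with dynamic thread migration; it resolves load imbalance among different cores and improves throughput. Simulation experiments comparing this mechanism with a static scheduling strategy (SS) show that it improves the performance of heterogeneous multi-core processors and keeps execution stable.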

12.
李佳, 陈志刚, 章志兵, 陈容. Computer Engineering (《计算机工程》), 2007, 33(14): 113-115
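Task allocation and scheduling is one of the core problems in grid computing. By building an evaluation model system for task scheduling, and taking into account communication cost, processor utilization, and load balance, this paper proposes an algorithm for evaluating the merits of scheduling strategies. It defines the problems the algorithm addresses, introduces concepts such as contact probability, processor satisfaction rate, and load pressure difference, and describes the structure and steps of the algorithm. Different allocation methods are evaluated numerically, the results are analyzed, and conclusions are drawn. The algorithm drops a series of idealized assumptions and therefore has some practical significance.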

13.
Diffusion algorithms are some of the most popular algorithms for dynamic load balancing in which loads move from heavily loaded processors to lightly loaded neighbor processors. To achieve a global load balance in a parallel computer, the algorithm is iterated until the load difference between any two processors is smaller than a specified value. Therefore, one fundamental property to be studied is algorithm convergence. Several analytical works on the convergence of different diffusion load balancing algorithms have been carried out, but they treat loads as non-negative real quantities. In this paper, we describe the Diffusion Algorithm Searching Unbalanced Domains (DASUD) algorithm, which uses loads as non-negative integer values and, unlike existing algorithms, reaches a local balance situation where the maximum load difference between any two processors in the set of neighbor processors for each processor is one load unit. The convergence property of an asynchronous implementation of DASUD using integer loads is proven theoretically.

14.
The consolidation of Internet devices into a universal/portable device will soon be accomplishable through the incorporation of reconfigurable computing in system-on-a-chip (SOC). At any particular moment, it could be a video/audio mobile phone, an MP3 song player, or another device. The basic construct of these multimedia processing algorithms can be described as deeply nested Do loop algorithms. They are considered the most demanding data-intensive algorithms and hence ideal candidates for an array of reconfigurable nanoprocessors. Therefore, an algorithm-to-hardware synthesis methodology is important for an efficient exploitation of both spatial parallelism and temporal pipelining. In this paper, we propose a processor array synthesis methodology. It can map an n-level nested Do loop represented by a nonuniform or shift-variant data dependence graph to a near-optimal one- or two-dimensional processor array under the available resource constraints to satisfy high-throughput computation demands.

15.
A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework of nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for the same array. This can significantly improve the performance since a priori incompatible references can be simultaneously optimized. Finally, the process considers not only the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop.

16.
17.
A loosely coupled multiprocessor system contains multiple processors which have their own local memories. To balance the load among multiple processors is of fundamental importance in enhancing the performance of such a multiple processor system. Probabilistic load balancing in a heterogeneous multiple processor system with many job classes is considered in this study. The load balancing scheme is formulated as a nonlinear programming problem with linear constraints. An optimal probabilistic load balancing algorithm is proposed to solve this nonlinear programming problem. The proposed load balancing method is proven globally optimal in the sense that it results in a minimum overall average job response time on a probabilistic basis.

18.
Array syntax, which is supported in many technical programming languages, adds expressive power by allowing operations on and assignments to whole arrays and array sections. To compile an array assignment statement to a uniprocessor, the language processor must convert the statement into a loop that has the same meaning. This process is called scalarization. Scalarization presents a significant technical problem because an array assignment needs to be implemented as if all inputs are fetched before any outputs are stored. Since a loop intermixes loads and stores, the compiler typically allocates a temporary array to hold the intermediate result. Because these extra temporary arrays can cause performance problems in cache, many techniques have been developed to avoid their use or minimize their size. In this paper, we present a novel application of two compiler strategies, loop alignment and loop skewing, to address this problem. We show that these strategies can achieve the asymptotically minimal memory allocation for stencil computations. Our experiments with loop alignment and loop skewing demonstrate that they are extremely effective in improving memory hierarchy performance of Fortran 90 array code on standard uniprocessors. The result should be applicable to other array languages, such as MATLAB.
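
The "fetch all inputs before storing any output" requirement can be seen in a small sketch. The example statement `a[1:] = a[:-1] + a[1:]` and the three function names are assumptions chosen for illustration. The temporary-free variant uses simple loop reversal, a more basic remedy than the loop alignment and loop skewing studied in the paper, which are not reproduced here.

```python
def array_semantics(a):
    """Reference meaning of a[1:] = a[:-1] + a[1:]: fetch everything, then store."""
    rhs = [a[i - 1] + a[i] for i in range(1, len(a))]  # full-size temporary
    return a[:1] + rhs


def naive_scalarized(a):
    """Forward loop without a temporary: reads values it has already overwritten."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + a[i]  # a[i-1] was updated in the previous iteration
    return a


def reversed_scalarized(a):
    """Backward loop without a temporary: each read still sees the original value."""
    a = list(a)
    for i in range(len(a) - 1, 0, -1):
        a[i] = a[i - 1] + a[i]  # a[i-1] has not been written yet
    return a


data = [1, 2, 3, 4]
print(array_semantics(data))      # [1, 3, 5, 7]
print(naive_scalarized(data))     # [1, 3, 6, 10], a running sum, not the intended result
print(reversed_scalarized(data))  # [1, 3, 5, 7]
```

Reversal works here because the statement only reads elements at lower indices than the one it writes. Statements that read in both directions, such as stencils, are where a compiler needs either a temporary or a transformation of the kind the abstract describes.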

19.
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in Θ(log P) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel A* algorithms to solve the Traveling Salesman Problem on an nCUBE2 hypercube multicomputer. The QE strategy yields average speedup improvements of about 20-185% and 15-120% at low and intermediate work densities (the ratio of the problem size to P), respectively, over three well-known load balancing methods: the round-robin (RR), the random communication (RC), and the neighborhood averaging (NA) strategies. The average speedup observed on 1024 processors is about 985, representing a very high efficiency of 0.96. Finally, we analyze and empirically evaluate the scalability of parallel A* algorithms in terms of the isoefficiency metric.
Our analysis gives (1) a Θ(P log P) lower bound on the isoefficiency function of any parallel A* algorithm, and (2) a general expression for the upper bound on the isoefficiency function of our parallel A* algorithm using the QE strategy on any topology; for the hypercube and 2-D mesh architectures the upper bounds on the isoefficiency function are found to be Θ(P log2P) and Θ(P[formula]), respectively. Experimental results validate our analysis, and also show that parallel A* search has better scalability using the QE load balancing strategy than using the RR, RC, or NA strategies.

20.
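Self-scheduling schemes for parallel loops study how to distribute loop iterations across processors for parallel computation with minimal run-time overhead and the best possible load balance. Early self-scheduling schemes were based on a pessimistic view: they assumed that parallel loops are non-uniformly distributed, so to overcome load imbalance the loop body was split into a large number of task packets, which led to high scheduling overhead. This paper proposes a class of optimistic self-scheduling schemes, which assume that the loop is uniformly distributed, so that an initial partition of the loop according to the number of available processors achieves good load balance. The optimistic schemes also provide a simple and effective method for overcoming load imbalance caused by a poor initial partition…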
