Similar Documents
1.
In this paper, we present a static load balancing method for mapping production rules in an expert system onto the processors of a message-passing multicomputer. The method uses simulated annealing to achieve a nearly optimal allocation of production rules onto processor nodes. The approach balances the initial rule distribution to avoid high communication demand among processor nodes at run time. A formal mapping model is developed and a new cost function is defined for the annealing process. New heuristic swap functions and cooling policies are provided to ensure the quality of the annealing process. A software load balancing package, SIMAL, was implemented on a SUN workstation to carry out the benchmark experiments. The overhead associated with this mapping method is O(m ln m), where m is the number of rules in the production system. Two benchmark production systems, Toru-waltz and Tourney, are mapped onto a hypercube computer with 32 nodes. Experimental benchmark results verify the effectiveness of the rule mapping method. The method can be applied to distributed artificial intelligence processing or to the parallel execution of cooperating expert systems on a message-passing multicomputer.
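The annealing-based mapping described above can be illustrated with a short, generic simulated-annealing sketch: a random rule-to-node assignment is perturbed by single-rule moves, and uphill moves are accepted with a temperature-dependent probability under a geometric cooling schedule. This is a minimal illustration only, not the SIMAL package; the cost function (load imbalance plus inter-node communication), the move operator, and the cooling parameters are assumptions made for the example.

```python
import math
import random

def anneal_mapping(num_rules, num_nodes, rule_cost, comm, T0=10.0, alpha=0.95, iters=2000):
    """Assign rules to nodes by simulated annealing (illustrative sketch).

    rule_cost[r] -- estimated firing cost of rule r (load term)
    comm[(r, s)] -- communication weight between rules r and s
    Cost = load imbalance across nodes + weight of rule pairs split across nodes.
    """
    assign = [random.randrange(num_nodes) for _ in range(num_rules)]

    def cost(a):
        load = [0.0] * num_nodes
        for r, n in enumerate(a):
            load[n] += rule_cost[r]
        imbalance = max(load) - min(load)
        cut = sum(w for (r, s), w in comm.items() if a[r] != a[s])
        return imbalance + cut

    cur = cost(assign)
    T = T0
    for _ in range(iters):
        r = random.randrange(num_rules)
        old = assign[r]
        assign[r] = random.randrange(num_nodes)        # simple move operator
        new = cost(assign)
        if new > cur and random.random() >= math.exp((cur - new) / T):
            assign[r] = old                            # reject uphill move
        else:
            cur = new                                  # accept move
        T *= alpha                                     # geometric cooling
    return assign, cur
```

In practice the move operator, cooling policy, and cost function would be tuned as the abstract describes, following the paper's formal mapping model.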

2.
Due to the significant communication overhead of sending and receiving data, loop partitioning approaches on distributed memory systems must guarantee not just computation load balance but computation+communication load balance. Previous approaches to loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning solution for a limited subset of DOALL loops. But a large category of DOALL loops inevitably results in communication, and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general case of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first performs partitioning of the iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balance each partition with respect to total computation+communication, and map partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a larger communication volume than parallelism. We then perform data space partitioning based on a new "larger partition owns" rule to minimize the communication overhead for a compute-intensive partition by localizing its references relatively more than a smaller, non-compute-intensive partition. A partition interaction graph is then constructed, which is used by the architecture-dependent analysis phase to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping onto the actual number of available processors. Relevant theory and algorithms are developed, along with a performance evaluation on the Cray T3D.

3.
张娟  陆林生 《计算机工程》2010,36(9):73-76,79
To address the low parallelism and load imbalance of existing partitioning algorithms, an automatic partitioning algorithm for multi-zone, multi-code problems is proposed. A cyclic allocation algorithm distributes processors among the computational zones, and a Block-based recursive bisection of the undirected graph carries out task assignment within each zone. Experimental results show that the algorithm distributes processors roughly evenly over the entire computational space while minimizing inter-processor communication.
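For illustration, the recursive bisection step mentioned above can be sketched as repeatedly splitting a weighted undirected graph into two halves of roughly equal weight. The BFS-based split used here is a deliberately crude stand-in for a real graph bisection routine, and the helper names and the power-of-two partition count are assumptions for the example, not the paper's algorithm.

```python
from collections import deque

def bisect(graph, weight, nodes):
    """Split `nodes` into two parts of roughly equal total weight by growing a
    BFS region from an arbitrary seed (a crude stand-in for graph bisection)."""
    nodes = set(nodes)
    total = sum(weight[v] for v in nodes)
    in_set, part_a, acc = set(), [], 0.0
    queue = deque([next(iter(nodes))])
    while queue and acc < total / 2:
        v = queue.popleft()
        if v in in_set:
            continue
        in_set.add(v)
        part_a.append(v)
        acc += weight[v]
        queue.extend(u for u in graph[v] if u in nodes and u not in in_set)
    part_b = [v for v in nodes if v not in in_set]
    return part_a, part_b

def recursive_bisection(graph, weight, nodes, num_parts):
    """Recursively bisect until `num_parts` partitions are produced
    (num_parts is assumed to be a power of two)."""
    if num_parts <= 1 or len(nodes) <= 1:
        return [list(nodes)]
    a, b = bisect(graph, weight, nodes)
    return (recursive_bisection(graph, weight, a, num_parts // 2) +
            recursive_bisection(graph, weight, b, num_parts // 2))
```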

4.
This paper introduces a new load balancing and communication minimizing heuristic used in the Inverse Remote Procedure Call (IRPC) system. While the paper briefly describes the IRPC system, the focus is on the new IRPC assignment heuristic. The IRPC compiler maps a distributed program to a graph that represents program objects and their dependencies (due to invocations and parameter passing) as nodes and edges, respectively. In the graph, the system preserves conditional and iterative flows, records network transmission and execution costs, and marks nodes that have to reside at specific network sites. The graph is then partitioned by the heuristic to derive a (sub)optimal node assignment to network sites, minimizing load imbalance and network data transport. The resulting program partition is then reflected in the physical object distribution, and remote and local object communication is transparently implemented. The compiler and run-time system use efficient implementation techniques such as type prediction, inlining, splitting, and subprogram passing. The last of these allows remote code to be copied to local data, as an alternative to copying data to the remote site, whenever this will reduce network data transport. The IRPC graph partitioning heuristic operates in time O(E(log d + l + log M)), where M is the number of network sites, E is the number of communication edges, and d is the maximum degree of a node; l is a parameter of the algorithm and can vary between 1 and N, where N is the number of communicating objects. This complexity is more nearly independent of M, and considerably better in terms of E and N, than that of previously known related algorithms, such as A*, which employs backtracking and is potentially exponential, or the max-flow/min-cut class of network flow algorithms or heuristics, which tend to be at least Ω(MN^2 E); and it can be made (by choosing l appropriately) as efficient as even such fast heuristics as heaviest-edge-first, minimal communication, and Kernighan–Lin. In an extensive quantitative evaluation, the heuristic has been demonstrated to perform very well, giving on average 75% traffic cost reductions for over 95% of the programs when compared to random partitioning, and outperforming in cost reduction and actual execution time the three aforementioned fast heuristics, even with a large l. Thus, to the best of our knowledge, this is the first report of a well-performing assignment heuristic that is both essentially linear in the number of communication edges and better than existing, established heuristics of no better complexity.

5.
The monetary cost optimization of a computer utility load distribution problem is examined. The problem is to find the sequence in which to distribute divisible computing load from a root processor to its children processors which achieves the lowest monetary distribution cost. The convergence performance of a heuristic greedy algorithm is studied. This problem is directly relevant to computer utilities which offer computing and software hosting to organizations for a monetary charge.

7.
To solve the static task scheduling problem in grid simulation, an innovative entity scheduling strategy based on an interaction-priority algorithm is proposed. The strategy uses an evaluation criterion that takes both communication optimization and computational balance into account. Scheduling proceeds in two phases, clustering and mapping. The clustering phase uses an interaction-priority algorithm that groups interaction-intensive entities into one entity group, which is then scheduled onto the same processor; the mapping phase uses a computation-priority heuristic that preferentially maps computation-heavy entity clusters onto the most capable processors. Finally, the scheduling strategy combining these two algorithms is compared experimentally with a greedy bisection method; the results show that the proposed approach better optimizes communication and computation performance in the simulation and is better suited to grid simulation.
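A generic two-phase sketch of the clustering-then-mapping idea described above: first merge the entities joined by the heaviest interaction edges until the number of clusters matches the number of processors, then map the most computation-heavy clusters to the fastest processors. This is an illustrative sketch under those simplifying assumptions, not the paper's interaction-priority or computation-priority algorithms.

```python
def cluster_entities(interact, compute, num_clusters):
    """Greedily merge the two clusters joined by the heaviest interaction edge
    until only `num_clusters` clusters remain.
    interact: dict {(a, b): interaction weight}; compute: dict {entity: cost}."""
    clusters = {e: {e} for e in compute}            # each entity starts alone
    owner = {e: e for e in compute}                 # entity -> cluster representative
    for (a, b), _ in sorted(interact.items(), key=lambda kv: -kv[1]):
        if len(clusters) <= num_clusters:
            break
        ra, rb = owner[a], owner[b]
        if ra == rb:
            continue
        clusters[ra] |= clusters.pop(rb)            # merge rb into ra
        for e in clusters[ra]:
            owner[e] = ra
    return list(clusters.values())

def map_clusters(clusters, compute, capacity):
    """Computation-priority mapping: the heaviest cluster goes to the fastest
    processor. Assumes one cluster per processor. capacity: {proc: speed}."""
    by_cost = sorted(clusters, key=lambda c: -sum(compute[e] for e in c))
    by_speed = sorted(capacity, key=lambda p: -capacity[p])
    return {p: c for p, c in zip(by_speed, by_cost)}
```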

8.
Partitioning and mapping of nested loops for linear array multicomputers
In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project an n-dimensional computational structure, representing an n-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers.

9.
Computational Grids are emerging as a new infrastructure for high performance computing. Since the resources in a Grid can be heterogeneous and distributed, mesh-based applications require a mesh partitioner that considers both processor and network heterogeneity. We have developed a heterogeneous mesh partitioner, called PaGrid. PaGrid uses a multilevel graph partitioning approach, augmented by execution time load balancing in the final uncoarsening phase. We show that minimization of total communication cost (e.g., as used by JOSTLE) can lead to significant load being placed on processors connected by slow links, which results in higher application execution times. Therefore, PaGrid balances the estimated execution time of the application across processors. PaGrid performance is compared with two existing mesh partitioners, METIS 4.0 and JOSTLE 3.0, for mapping several application meshes to two models of heterogeneous computational Grids. PaGrid is found to produce significantly better partitions than JOSTLE and slightly better partitions than METIS in most cases, in terms of estimated application execution time averaged over a large number of runs with different random number seeds.
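The balancing objective described above, estimated execution time rather than total communication volume, can be sketched as a simple cost model for a heterogeneous system: each processor's estimate is its assigned computation divided by its speed plus its off-processor communication divided by the link bandwidth, and the partitioner tries to minimize the maximum estimate. This is a hypothetical cost model for illustration, not PaGrid's actual formulation.

```python
def estimated_exec_time(partition, work, comm, speed, bandwidth):
    """Estimate per-processor execution time on a heterogeneous system.

    partition: dict {vertex: proc}
    work:      dict {vertex: computation weight}
    comm:      dict {(u, v): communication volume} for mesh edges
    speed:     dict {proc: relative compute speed}
    bandwidth: dict {(p, q): link bandwidth}, assumed defined for every
               ordered pair of distinct processors
    Returns dict {proc: estimated time}; a partitioner with this objective
    would minimize the maximum of these values rather than total cut volume.
    """
    time = {p: 0.0 for p in speed}
    for v, p in partition.items():
        time[p] += work[v] / speed[p]                 # computation term
    for (u, v), vol in comm.items():
        p, q = partition[u], partition[v]
        if p != q:
            t = vol / bandwidth[(p, q)]               # communication term
            time[p] += t
            time[q] += t
    return time

# Balancing objective: minimize max(estimated_exec_time(...).values())
```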

10.
Parallelizing the Data Cube
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close-to-optimal load balance between processors. The actual run times observed show an optimal speedup of p.
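The coarse-task load balancing described above can be illustrated with a standard longest-processing-time-first (LPT) greedy assignment: tasks (e.g., subcube or group-by computations weighted by estimated size) are taken in decreasing weight order and placed on the currently least-loaded processor. This is a generic sketch, not the paper's tree-partitioning algorithm.

```python
import heapq

def assign_subcubes(task_weights, num_procs):
    """Longest-processing-time-first assignment of coarse tasks to processors.

    task_weights: dict {task: estimated cost, e.g. estimated group-by size}
    Returns dict {proc: [tasks]} with roughly balanced total weight.
    """
    heap = [(0.0, p) for p in range(num_procs)]     # (current load, proc id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    for task, w in sorted(task_weights.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)               # least-loaded processor so far
        assignment[p].append(task)
        heapq.heappush(heap, (load + w, p))
    return assignment
```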

11.
Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. We present a novel method called PLUM to dynamically balance the processor workloads with a global view. This paper describes the implementation and integration of all major components within our dynamic load balancing strategy for adaptive grid calculations. Mesh adaption, repartitioning, processor assignment, and remapping are critical components of the framework that must be accomplished rapidly and efficiently so as not to cause a significant overhead to the numerical simulation. A data redistribution model is also presented that predicts the remapping cost on the SP2. This model is required to determine whether the gain from a balanced workload distribution offsets the cost of data movement. Results presented in this paper demonstrate that PLUM is an effective dynamic load balancing strategy which remains viable on a large number of processors.
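The remapping decision mentioned above, whether the gain from a balanced workload offsets the cost of data movement, can be reduced to a one-line check once a redistribution-cost estimate is available. The sketch below is a simplified, hypothetical form of that decision rule, not PLUM's SP2 redistribution model.

```python
def should_remap(current_loads, balanced_load, remap_cost, remaining_iters):
    """Decide whether repartitioning pays off.

    current_loads:   per-processor work under the current (imbalanced) mapping
    balanced_load:   per-processor work expected after remapping
    remap_cost:      predicted one-time data-redistribution cost
    remaining_iters: solver iterations expected before the next mesh adaption
    Each step is bounded by the slowest processor, so the per-iteration gain
    is max(current_loads) - balanced_load.
    """
    gain_per_iter = max(current_loads) - balanced_load
    return gain_per_iter * remaining_iters > remap_cost
```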

12.
To solve the load imbalance problem of a solution-adaptive finite element application program on a distributed memory multicomputer, nodes of a refined finite element graph can be remapped to processors, or the load of a refined finite element graph can be redistributed based on the current load of each processor. In the former case, remapping can be performed by some fast mapping algorithms. In the latter case, a load-balancing algorithm can be applied to balance the computational load of each processor. In this paper, three tree-based parallel load-balancing methods, the MCSTLB method, the BTLB method, and the CBTLB method, are proposed to deal with the load imbalance problems of solution-adaptive finite element application programs. To evaluate the performance of the proposed methods, we have implemented them along with three mapping methods, the AE/ORB method, the AE/MC method, and the MLkP method, on an SP2 parallel machine. Three criteria are used for the performance evaluation: the execution time of the mapping/load-balancing methods, the execution time of a solution-adaptive finite element application program under different mapping/load-balancing methods, and the speedups achieved by the mapping/load-balancing methods for a solution-adaptive finite element application program. The experimental results show that 1) if the initial mapping is performed by a mapping method and the same mapping method and the load-balancing methods are used in each refinement to balance the load of the processors, the execution time of an application program under a load-balancing method is always shorter than that under the mapping method, and 2) the execution time of an application program under the CBTLB method is shorter than that under the BTLB method and the MCSTLB method.

13.
A fundamental issue affecting the performance of a parallel application running on a heterogeneous computing system is the assignment of tasks to the processors in the system. The task assignment problem for more than three processors is known to be NP-hard, and therefore satisfactory suboptimal solutions obtainable in an acceptable amount of time are generally sought. This paper proposes a simple and effective iterative greedy algorithm to deal with the problem, with the goal of minimizing the total sum of execution and communication costs. The main idea in this algorithm is to improve the quality of the assignment in an iterative manner using results from previous iterations. The algorithm first uses a constructive heuristic to find an initial assignment and iteratively improves it in a greedy way. Through simulations over a wide range of parameters, we have demonstrated the effectiveness of our algorithm by comparing it with recent competing task assignment algorithms in the literature.
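A minimal sketch of a construct-then-improve greedy assignment in the spirit of the abstract: an initial assignment places each task on its cheapest processor, and repeated improvement passes move single tasks to whichever processor lowers the total execution-plus-communication cost. The cost model and stopping rule are assumptions for illustration; the published algorithm's exact heuristics may differ.

```python
def total_cost(assign, exec_cost, comm_cost):
    """Total execution cost plus communication for task pairs split across processors.
    exec_cost[t][p]:   cost of running task t on processor p
    comm_cost[(t, u)]: cost incurred when tasks t and u are on different processors."""
    c = sum(exec_cost[t][p] for t, p in assign.items())
    c += sum(w for (t, u), w in comm_cost.items() if assign[t] != assign[u])
    return c

def greedy_assign(tasks, procs, exec_cost, comm_cost, max_iters=10):
    # Constructive phase: put each task on its cheapest processor.
    assign = {t: min(procs, key=lambda p: exec_cost[t][p]) for t in tasks}
    best = total_cost(assign, exec_cost, comm_cost)
    # Improvement phase: keep moving single tasks while the total cost drops.
    for _ in range(max_iters):
        improved = False
        for t in tasks:
            current = assign[t]
            for p in procs:
                if p == current:
                    continue
                assign[t] = p
                c = total_cost(assign, exec_cost, comm_cost)
                if c < best:
                    best, current, improved = c, p, True
                else:
                    assign[t] = current            # revert the trial move
            assign[t] = current
        if not improved:
            break
    return assign, best
```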

14.
Aggregate data objects (such as arrays) are distributed across the processor memories when compiling a data-parallel language for a distributed-memory machine. The mapping determines the amount of communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: an alignment that maps all the objects to an abstract template, followed by a distribution that maps the template to the processors. This paper describes algorithms for solving the various facets of the alignment problem: axis and stride alignment, static and mobile offset alignment, and replication labeling. We show that optimal axis and stride alignment is NP-complete for general program graphs and give a heuristic method that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. We also show how local graph contractions can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. We show how to model the static offset alignment problem using linear programming, and we show that loop-dependent mobile offset alignment is sometimes necessary for optimum performance. We describe an algorithm for determining mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself or can be used to improve performance. We describe an algorithm based on network flow that replicates objects so as to minimize the total amount of broadcast communication in replication.

15.
Homogeneous processor arrays are emerging in tera-scale computation, and effective fault tolerance techniques are essential to improving the reliability of such complex integrated circuits. We study degradable processor arrays that achieve fault tolerance by employing reconfiguration. Three bypass schemes and three rerouting schemes are proposed to reconfigure three-dimensional processor arrays with defective processors into target arrays without faults. A heuristic algorithm is proposed to construct a target array on the selected rows and columns. It is also proved that the proposed greedy plane rerouting algorithm (GPR) produces a maximum target array. In addition, the problem of constructing a communication-efficient array is considered in this paper. An algorithm is proposed to refine the communication among processors within the target array constructed by GPR. An experimental study shows that the proposed algorithm GPR produces target arrays with higher harvest and lower degradation on host arrays with fault density no more than 5%. In addition, the communication performance is significantly optimized by reducing the number of long interconnects, and the average improvement is about 34% for all cases considered in this paper.

16.
A repartitioning hypergraph model for dynamic load balancing
In parallel adaptive applications, the computational structure of the applications changes over time, leading to load imbalances even though the initial load distributions were balanced. To restore balance and to keep communication volume low in further iterations of the applications, dynamic load balancing (repartitioning) of the changed computational structure is required. Repartitioning differs from static load balancing (partitioning) due to the additional requirement of minimizing migration cost to move data from an existing partition to a new partition. In this paper, we present a novel repartitioning hypergraph model for dynamic load balancing that accounts for both communication volume in the application and migration cost to move data, in order to minimize the overall cost. The use of a hypergraph-based model allows us to accurately model communication costs rather than approximate them with graph-based models. We show that the new model can be realized using hypergraph partitioning with fixed vertices and describe our parallel multilevel implementation within the Zoltan load balancing toolkit. To the best of our knowledge, this is the first implementation for dynamic load balancing based on hypergraph partitioning. To demonstrate the effectiveness of our approach, we conducted experiments on a Linux cluster with 1024 processors. The results show that, in terms of reducing total cost, our new model compares favorably to the graph-based dynamic load balancing approaches, and multilevel approaches improve the repartitioning quality significantly.

17.
We consider the problem of reducing data traffic among processor nodes during the parallel factorization of a sparse matrix on a hypercube multiprocessor. A task assignment strategy based on the structure of an elimination tree is presented. This assignment is aimed at achieving load balancing among the processors and also reducing the amount of processor-to-processor data communication. An analysis of regular grid problems is presented, providing a bound on communication volume generated by the new strategy, and showing that the allocation scheme is optimal in the asymptotic sense. Some experimental results on the performance of this scheme are presented.

18.
Real quantum computing technologies have different restrictions and constraints which need to be considered during circuit synthesis. In certain technologies, only physically adjacent qubits can interact, which restricts their realizations to the linear nearest neighbor (LNN) architecture. In this work, we formulate the line ordering problem in the LNN architecture as a task assignment problem: find a mapping (permutation) between a task graph and a processor graph with minimum cost. We propose two different approaches, a greedy heuristic and a meta-heuristic algorithm based on Harmony Search, to solve the task assignment problem. Experimental results show that our algorithms were able to reduce the quantum cost of benchmark circuits by approximately 30% on average. Moreover, the proposed algorithms were compared to one recently proposed ordering algorithm and were found to further improve the cost by approximately 16%.
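A simple greedy line-ordering sketch in the spirit of the heuristic approach above: start from the most strongly interacting qubit pair and repeatedly append, at either end of the line, the unplaced qubit with the heaviest interaction with that end, so frequently interacting qubits end up physically adjacent. This is an illustrative heuristic only; the paper's greedy and Harmony Search algorithms are not reproduced here.

```python
def greedy_line_order(num_qubits, interact):
    """Order qubits on a line so heavily interacting pairs sit close together.

    interact[(a, b)]: number of two-qubit gates acting on qubits a and b
    (pairs may be stored in either orientation). Assumes num_qubits >= 2.
    """
    def w(a, b):
        return interact.get((a, b), 0) + interact.get((b, a), 0)

    # Seed the line with the most strongly interacting pair.
    a, b = max(((i, j) for i in range(num_qubits) for j in range(i + 1, num_qubits)),
               key=lambda p: w(*p))
    order, placed = [a, b], {a, b}
    while len(order) < num_qubits:
        # Pick the unplaced qubit interacting most with either end of the line.
        q, end = max(((q, end) for q in range(num_qubits) if q not in placed
                      for end in (0, -1)),
                     key=lambda qe: w(qe[0], order[qe[1]]))
        if end == 0:
            order.insert(0, q)
        else:
            order.append(q)
        placed.add(q)
    return order
```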

19.
In this paper, a processor allocation mechanism for NoC-based chip multiprocessors is presented. Processor allocation is a well-known problem in parallel computer systems and aims to allocate the processing nodes of a multiprocessor to different tasks of an input application at run time. The proposed mechanism targets optimizing the on-chip communication power/latency and relies on two procedures: processor allocation and task migration. Allocation is done by a fast heuristic algorithm to allocate the free processors to the tasks of an incoming application when a new application begins execution. The task-migration algorithm is activated when some application completes execution and frees up the allocated resources. Task migration uses the recently deallocated processors and tries to rearrange the current tasks in order to find a better mapping for them. The proposed method can also capture the dynamic traffic pattern of the network and perform task migration based on the current communication demands of the tasks. Consequently, task migration adapts the task mapping to the current network status. We adopt a non-contiguous processor allocation strategy in which the tasks of the input application are allowed to be mapped onto disjoint regions (groups of processors) of the network. We then use virtual point-to-point circuits, a state-of-the-art fast on-chip connection designed for network-on-chips, to virtually connect the disjoint regions and make the communication latency/power closer to the values offered by contiguous allocation schemes. The experimental results show considerable improvement over existing allocation mechanisms.

20.
The problem of assigning a task with a precedence constraint to a distributed computing system is addressed. Task turnaround time, with regard to communication overhead and idle time, is adopted to measure the performance of the task assignment. The assignment of the modules is determined, as is the sequence of message transmission, to balance the processor load and reduce communication overhead. The search for the optimal task assignment with a precedence constraint is known to be NP-complete in the strong sense (Garey et al. 1979). A heuristic algorithm with polynomial time complexity is then proposed in order to solve the task-assignment problem effectively. Experimental results show that the proposed approach produces a near-optimal or even optimal task assignment.
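A generic list-scheduling sketch for precedence-constrained assignment, in the spirit of the heuristic above: tasks are visited in topological order and each is placed on the processor giving the earliest finish time, with a communication delay charged whenever a predecessor resides on a different processor. The data layout and the earliest-finish-time rule are assumptions for illustration, not the paper's algorithm.

```python
def list_schedule(tasks, preds, exec_time, comm_delay, num_procs):
    """Schedule precedence-constrained tasks onto processors.

    tasks:      list of task ids in a valid topological order
    preds:      dict {task: list of predecessor tasks}
    exec_time:  dict {task: execution time}
    comm_delay: dict {(pred, task): message delay if placed on different processors}
    Returns (placement, finish) dicts.
    """
    proc_free = [0.0] * num_procs          # time at which each processor becomes idle
    placement, finish = {}, {}
    for t in tasks:
        best = None
        for p in range(num_procs):
            ready = proc_free[p]
            for u in preds.get(t, []):
                delay = 0.0 if placement[u] == p else comm_delay.get((u, t), 0.0)
                ready = max(ready, finish[u] + delay)
            done = ready + exec_time[t]
            if best is None or done < best[0]:
                best = (done, p)            # earliest finish time wins
        done, p = best
        placement[t], finish[t] = p, done
        proc_free[p] = done
    return placement, finish
```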
