首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In distributed memory multicomputers, local memory accesses are much faster than those involving interprocessor communication. For the sake of reducing or even eliminating the interprocessor communication, the array elements in programs must be carefully distributed to local memory of processors for parallel execution. We devote our efforts to the techniques of allocating array elements of nested loops onto multicomputers in a communication-free fashion for parallelizing compilers. We first analyze the pattern of references among all arrays referenced by a nested loop, and then partition the iteration space into blocks without interblock communication. The arrays can be partitioned under the communication-free criteria with nonduplicate or duplicate data. Finally, a heuristic method for mapping the partitioned array elements and iterations onto the fixed-size multicomputers under the consideration of load balancing is proposed. Based on these methods, the nested loops can execute without any communication overhead on the distributed memory multicomputers. Moreover, the performance of the strategies with nonduplicate and duplicate data for matrix multiplication is studied  相似文献   

2.
Various contiguous and noncontiguous processor allocation policies have been proposed for mesh-connected multicomputers. Contiguous allocation suffers from high external processor fragmentation because it requires that the processors allocated to a parallel job be contiguous and have the same topology as the multicomputer. The goal of lifting the contiguity condition in noncontiguous allocation is reducing processor fragmentation. However, this can increase the communication overhead because the distances traversed by messages can be longer, and messages from different jobs can interfere with each other by competing for communication resources. The extra communication overhead depends on how the allocation request is partitioned and mapped to free processors. In this paper, we investigate a new class of noncontiguous allocation schemes for two-dimensional mesh-connected multicomputers. These schemes are different from previous ones in that request partitioning is based on the submeshes available for allocation. The available submeshes selected for allocation to a job are such that a high degree of contiguity among their processors is achieved. The proposed policies are compared to previous noncontiguous policies using detailed simulations, where several common communication patterns are considered. The results show that the proposed policies can reduce the communication overhead and improve performance substantially.  相似文献   

3.
Mapping of parallel programs onto parallel computers for efficient execution is a fundamental problem of great significance in parallel processing. This paper presents an architecture-independent software tool for contention-free mapping of arbitrary parallel programs onto parallel computers with arbitrary configurations. This mapping tool is based on an efficient heuristic algorithm that runs in time O(n 3+m 4) in the worst case for mapping n tasks onto m processors, where m n in most practical cases. It is fully implemented and incorporated into a graph editing system to produce a graphical mapping tool which enables its user to monitor and control the mapping process. The user can assist the mapping process or employ the algorithm to map automatically. Our mapping tool has been tested and its performance evaluated extensively. Experimental results show that our tool combines user intuition and mapping heuristics effectively to make it a powerful mapping tool which is practical to use. Our mapping tool can be easily extended for use in the more general case when the link contention-degree is bounded to a fixed system-specified value without increasing its complexity.  相似文献   

4.
Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processorp to continue running the application being checkpointed except during the time thatp is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements.  相似文献   

5.
A method for executing nested loops with constant loop-carried dependencies in parallel on message-passing multiprocessor systems to reduce communication overhead is presented. In the partitioning phase, the nested loop is divided into blocks that reduce the interblock communication, without regard to the machine topology. The execution ordering of the iterations is defined by a given time function based on L. Lamport's (1974) hyperplane method. The iterations are then partitioned into blocks so that the execution ordering is not disturbed, and the amount of interblock communication is minimized. In the mapping phase, the partitioned blocks are mapped onto a fixed-size multiprocessor system in such a manner that the blocks that have to exchange data frequently are allocated to the same processor or neighboring processors. A heuristic mapping algorithm for hypercube machines is proposed  相似文献   

6.
Nectar系统是Carnegie Mellon大学于1988年研制成功的一个典型的异构型多计算机系统,本文通过分析Nectar系统的结构,详述了在设计多系统时所面临的一些问题和解决办法,此外,以Nectar系统为例讨论了如何通过对软硬件的优化来减少网络通讯的开销,提高网络通讯效率。  相似文献   

7.
We address the problem of mapping divide-and-conquer programs to mesh connected multicomputers with wormhole or store-and-forward routing. We propose the binomial tree as an efficient model of parallel divide-and-conquer and present two mappings of the binomial tree to the 2D mesh. Our mappings exploit regularity in the communication structure of the divide-and-conquer computation and are also sensitive to the underlying flow control scheme of the target architecture. We evaluate these mappings using new metrics which are extensions of the classical notions of dilation and contention. We introduce the notion of communication slowdown as a measure of the total communication overhead incurred by a parallel computation. We conclude that significant performance gains can be realized when the mapping is sensitive to the flow control scheme of the target architecture  相似文献   

8.
9.
A generalized mapping strategy that uses a combination of graph theory, mathematical programming, and heuristics is proposed. The authors use the knowledge from the given algorithm and the architecture to guide the mapping. The approach begins with a graphical representation of the parallel algorithm (problem graph) and the parallel computer (host graph). Using these representations, the authors generate a new graphical representation (extended host graph) on which the problem graph is mapped. An accurate characterization of the communication overhead is used in the objective functions to evaluate the optimality of the mapping. An efficient mapping scheme is developed which uses two levels of optimization procedures. The objective functions include minimizing the communication overhead and minimizing the total execution time which includes both computation and communication times. The mapping scheme is tested by simulation and further confirmed by mapping a real world application onto actual distributed environments  相似文献   

10.
In this paper, we propose a prefix code matching parallel load-balancing method (PCMPLB) to efficiently deal with the load imbalance of solution-adaptive finite element application programs on distributed memory multicomputers. The main idea of the PCMPLB method is first to construct a prefix code tree for processors. Based on the prefix code tree, a schedule for performing load transfer among processors can be determined by concurrently and recursively dividing the tree into two subtrees and finding a maximum matching for processors in the two subtrees until the leaves of the prefix code tree are reached. We have implemented the PCMPLB method on an SP2 parallel machine and compared its performance with two load-balancing methods, the directed diffusion method and the multilevel diffusion method, and five mapping methods, the AE/ORB method, the AE/MC method, the MLkP method, the PARTY library method, and the JOSTLE-MS method. An unstructured finite element graph Truss was used as a test sample. During the execution, Truss was refined five times. Three criteria, the execution time of mapping/load-balancing methods, the execution time of an application program under different mapping/load-balancing methods, and the speedups achieved by mapping/load-balancing methods for an application program, are used for the performance evaluation. The experimental results show that (1) if a mapping method is used for the initial partitioning and this mapping method or a load-balancing method is used in each refinement, the execution time of an application program under a load-balancing method is less than that of the mapping method. (2) The execution time of an application program under the PCMPLB method is less than that of the directed diffusion method and the multilevel diffusion method.  相似文献   

11.
Many parallel algorithms use hypercubes as the communication topology among their processes. When such algorithms are executed on hypercube multicomputers the communication cost is kept minimum since processes can be allocated to processors in such a way that only communication between neighbor processors is required. However, the scalability of hypercube multicomputers is constrained by the fact that the interconnection cost-per-node increases with the total number of nodes. From scalability point of view, meshes and toruses are more interesting classes of interconnection topologies. This paper focuses on the execution of algorithms with hypercube communication topology on multicomputers with mesh or torus interconnection topologies. The proposed approach is based on looking at different embeddings of hypercube graphs onto mesh or torus graphs. The paper concentrates on toruses since an already known embedding, which is called standard embedding, is optimal for meshes. In this paper, an embedding of hypercubes onto toruses of any given dimension is proposed. This novel embedding is called xor embedding. The paper presents a set of performance figures for both the standard and the xor embeddings and shows that the latter outperforms the former for any torus. In addition, it is proven that for a one-dimensional torus (a ring) the xor embedding is optimal in the sense that it minimizes the execution time of a class of parallel algorithms with hypercube topology. This class of algorithms is frequently found in real applications, such as FFT and some class of sorting algorithms  相似文献   

12.
Benchmark evaluation of the IBM SP2 for parallel signal processing   总被引:1,自引:0,他引:1  
This paper evaluates the IBM SP2 architecture, the AIX parallel programming environment, and the IBM message-passing library (MPL) through STAP (Space-Time Adaptive Processing) benchmark experiments. Only coarse-grain parallelism was exploited on the SP2 due to its high communication overhead. A new parallelization scheme is developed for programming message passing multicomputers. Parallel STAP benchmark structures are illustrated with domain decomposition, efficient mapping of partitioned programs, and optimization of collective communication operations. We measure the SP2 performance in terms of execution time, Gflop/s rate, speedup over a single SP2 node, and overall system utilization. With 256 nodes, the Maul SP2 demonstrated the best performance of 23 Gflop/s in executing the High-Order Post-Doppler program, corresponding to a 34% system utilization. We have conducted a scalability analysis to reveal the performance growth rate as a function of machine size and STAP problem size. Important lessons learned from these parallel processing benchmark experiments are discussed in the context of real-time, adaptive, radar signal processing on massively parallel processors (MPP)  相似文献   

13.
《Parallel Computing》1997,22(12):1661-1675
This paper presents a mapping scheme for parallel pipelined execution of the Back-propagation Learning Algorithm on distributed memory multiprocessors. The proposed implementation exhibits inter-layer or pipelined parallelism, unique to the multilayer neural networks. Simple algorithms have been presented, which allow the data transfer involved in both recall and learning phases of the back-propagation algorithm to be carried out with a small communication overhead. The effectiveness of the mapping scheme has been illustrated, by estimating the speedup of the proposed implementation on an array of T-805 transputers.  相似文献   

14.
In this paper, we present a static load balancing method for mapping production rules in an expert system onto processors of a message-passing multicomputer. The method uses simulated annealing to achieve a nearly optimal allocation of production rules onto processor nodes. The approach balances the initial rule distribution to avoid higher communication demand among processor nodes at run time. A formal mapping model is developed and a new cost function is defined for the annealing process. New heuristic swap functions and cooling policies are provided to ensure the quality of the annealing process. A software load balancing package, SIMAL, was implemented on a SUN workstation to carry out the benchmark experiments. The overhead associated with this mapping method is O(m In m), where m is the number of rules in the production system. Two benchmark production systems, Toru-waltz and Tourney, are mapped onto a hypercube computer with 32 nodes. Experimental benchmark results verify the effectiveness of the rule mapping method. The method can be applied for distributed artificial intelligence processing or for the parallel execution of cooperating expert systems on a message-passing multicomputer.  相似文献   

15.
Adaptive data partitioning (ADP) which reduces the execution time of parallel programs by reducing interprocessor communication for iterative parallel loops is discussed. It is shown that ADP can be integrated into a communication-reducing back end for existing parallelizing compilers or as part of a machine-specific partitioner for parallel programs. A multiprocessor model to analyze program execution factors that lead to interprocessor communication and a model for the iterative parallel loop to quantify communication patterns within a program are defined. A vector notation is chosen to quantify communication across a global data set. Communication parameters are computed by examining the indexes of array accesses and are adjusted to reflect the underlying system architecture by compensating for cache line sizes. These values are used to generate rectangular and hexagonal partitions that reduce interprocessor communication  相似文献   

16.
Partitioning of processors on a multiprocessor system involves logically dividing the system into processor partitions. Programs can be executed in the different partitions in parallel. Optimally setting the partition size can significantly improve the throughput of multiprocessor systems. The speedup characteristics of parallel programs can be defined by execution signatures. The execution signature of a parallel program on a multiprocessor system is the rate at which the program executes in the absence of other programs and depends upon the number of allocated processors, the specific architecture, and the specific program implementation. Based on the execution signatures, this paper analyzes simple Markovian models of dynamic partitioning. From the analysis, when there are at most two multiprocessor partitions, the optimal dynamic partition size can be found which maximizes throughput. Compared against other partitioning schemes, the dynamic partitioning scheme is shown to be the best in terms of throughput when thereconfiguration overhead is low. If the reconfiguration overhead is high, dynamic partitioning is to be avoided. An expression for the reconfiguration overhead threshold is derived. A general iterative partitioning technique is presented. It is shown that the technique gives maximum throughput forn partions.  相似文献   

17.
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient algorithms for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution. The most significant improvement of our methods is that a processor does not need to construct the send/receive data sets for a redistribution. Based on the packing/unpacking information that derived from the BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution and vice versa, a processor can pack/unpack array elements into (from) messages directly. To evaluate the performance of our methods, we have implemented our methods along with the Thakur's methods and the PITFALLS method on an IBM SP2 parallel machine. The experimental results show that our algorithms outperform the Thakur's methods and the PITFALLS method for all test samples. This result encourages us to use the proposed algorithms for array redistribution.  相似文献   

18.
Parallel ray tracing of complex scenes on multicomputers requires the distribution of both computation and scene data to the processors. This is carried out during preprocessing and usually consumes too much time and memory. The paper presents an efficient parallel subdivision algorithm that decomposes a given scene into rectangular regions adaptively and maps the resultant regions to the node processors of a multicomputer. The proposed algorithm uses efficient data structures to identify the splitting planes quickly. Furthermore the mapping of the regions and the objects to the node processors is performed while parallel spatial subdivision proceeds. The proposed algorithm is implemented on an Intel iPSC/2 hypercube multicomputer and promising results have been obtained.  相似文献   

19.
Guo  Minyi  Pan  Yi  Liu  Zhen 《The Journal of supercomputing》2003,25(3):199-214
Communication set generation significantly influences the performance of parallel programs. However, studies seldom give attention to the problem of communication set generation for irregular applications. In this paper, we propose communication optimization techniques for the situation of irregular array references in nested loops. In our methods, the local array distribution schemes are determined so that the total number of communication messages is minimized. Then, we explain how to support communication set generation at compile-time by introducing some symbolic analysis techniques. In our symbolic analysis, symbolic solutions of a set of symbolic expression are obtained by using certain restrictions. We introduce symbolic analysis algorithms to obtain the solutions in terms of a set of equalities and inequalities. Finally, experimental results on a parallel computer CM-5 are presented to validate our approach.  相似文献   

20.
In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access on distributed memory multicomputers. Since the redistribution is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient methods for multi-dimensional array redistribution. Based on the previous work, the basic-cycle calculation technique, we present a basic-block calculation (BBC) and a complete-dimension calculation (CDC) techniques. We also developed a theoretical model to analyze the computation costs of these two techniques. The theoretical model shows that the BBC method has smaller indexing costs and performs well for the redistribution with small array size. The CDC method has smaller packing/unpacking costs and performs well when array size is large. When implemented these two techniques on an IBM SP2 parallel machine along with the PITFALLS method and the Prylli's method, the experimental results show that the BBC method has the smallest execution time of these four algorithms when the array size is small. The CDC method has the smallest execution time of these four algorithms when the array size is large.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号