Similar Documents
1.
This paper presents the analysis of a parallel formulation of depth-first search. At the heart of this parallel formulation is a dynamic work-distribution scheme that divides the work between different processors. The effectiveness of the parallel formulation is strongly influenced by the work-distribution scheme and the target architecture. We introduce the concept of the isoefficiency function to characterize the effectiveness of different architectures and work-distribution schemes. Many researchers considered the ring architecture to be quite suitable for parallel depth-first search; our analytical and experimental results show that hypercube and shared-memory architectures are significantly better. The analysis of previously known work-distribution schemes motivated the design of substantially improved schemes for ring and shared-memory architectures. In particular, we present a work-distribution algorithm that guarantees close to optimal performance on a shared-memory architecture with a message-combining interconnection network (e.g., the RP3). Much of the analysis presented in this paper is applicable to other parallel algorithms in which work is dynamically shared between different processors (e.g., parallel divide-and-conquer algorithms). The concept of isoefficiency is useful in characterizing the scalability of a variety of parallel algorithms. This work was supported by Army Research Office Grant No. DAAG29-84-K-0060 to the Artificial Intelligence Laboratory, and Office of Naval Research Grant N00014-86-K-0763 to the Computer Science Department at the University of Texas at Austin.
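As a rough illustration of the isoefficiency idea mentioned above, the sketch below (not from the paper; the overhead model and constants are assumptions for demonstration only) computes parallel efficiency from a simple work/overhead model and searches for how fast the problem size must grow with the processor count to hold efficiency constant.

```python
import math

# Minimal sketch of the isoefficiency idea, assuming a toy overhead model.
# Efficiency E = W / (W + To(W, p)), where W is the useful (serial) work and
# To is the total overhead (communication, idling) summed over p processors.

def efficiency(W, p, To):
    """Parallel efficiency for work W on p processors given overhead To(W, p)."""
    return W / (W + To(W, p))

def isoefficiency_work(p, To, target_E=0.8, W=1.0):
    """Smallest work W (found by doubling) that keeps efficiency >= target_E on p processors."""
    while efficiency(W, p, To) < target_E:
        W *= 2.0
    return W

# Hypothetical overhead shape: To(W, p) = c * p * log2(p), independent of W.
To = lambda W, p: 5.0 * p * math.log2(p)

for p in (2, 8, 32, 128):
    print(p, isoefficiency_work(p, To))
# The required W grows roughly as p log p here; that growth rate is the
# isoefficiency function of the (assumed) scheme/architecture combination.
```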

2.
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in Θ(log P) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel A* algorithms to solve the Traveling Salesman Problem on an nCUBE2 hypercube multicomputer. The QE strategy yields average speedup improvements of about 20-185% and 15-120% at low and intermediate work densities (the ratio of the problem size to P), respectively, over three well-known load balancing methods: the round-robin (RR), random communication (RC), and neighborhood averaging (NA) strategies. The average speedup observed on 1024 processors is about 985, representing a very high efficiency of 0.96. Finally, we analyze and empirically evaluate the scalability of parallel A* algorithms in terms of the isoefficiency metric. Our analysis gives (1) a Θ(P log P) lower bound on the isoefficiency function of any parallel A* algorithm, and (2) a general expression for the upper bound on the isoefficiency function of our parallel A* algorithm using the QE strategy on any topology; for the hypercube and 2-D mesh architectures the upper bounds on the isoefficiency function are found to be Θ(P log²P) and Θ(P[formula]), respectively. Experimental results validate our analysis, and also show that parallel A* search has better scalability using the QE load balancing strategy than using the RR, RC, or NA strategies.
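For reference, a minimal sequential A* skeleton is sketched below; this is the textbook algorithm being parallelized, not the paper's QE scheme, and `neighbors` and `h` are placeholders the caller must supply. Roughly speaking, the parallel formulation gives each processor its own open list, and the QE strategy redistributes open-list nodes so that processors work on nodes of comparable quality.

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Minimal sequential A* skeleton.
    neighbors(n) yields (next_node, edge_cost); h(n) is an admissible heuristic."""
    open_list = [(h(start), 0, start)]          # entries are (f = g + h, g, node)
    best_g = {start: 0}
    while open_list:
        f, g, node = heapq.heappop(open_list)
        if node == goal:
            return g                            # cost of an optimal solution
        if g > best_g.get(node, float("inf")):
            continue                            # stale queue entry
        for nxt, cost in neighbors(node):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(open_list, (ng + h(nxt), ng, nxt))
    return None                                 # goal unreachable
```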

3.
The grid and the mesh of trees (or MOT) are among the best-known parallel architectures in the literature. Both of them enjoy efficient VLSI layouts, simplicity of topology, and a large number of parallel algorithms that can efficiently execute on them. One drawback of these architectures is that algorithms that perform best on one of them do not perform very well on the other. Thus there is a gap between the algorithmic capabilities of these two architectures. We propose a new class of parallel architectures, called the mesh-connected trees (or MCT), that can execute grid algorithms as efficiently as the grid, and MOT algorithms as efficiently as the MOT, up to a constant amount of slowdown. In particular, the MCT topology contains the MOT as a subgraph and emulates the grid via an embedding with dilation 3 and congestion 2. This significant amount of computational versatility offered by the MCT comes at no additional VLSI area cost over these earlier networks. Many topological, routing, and embedding properties analyzed here suggest that the MCT architecture is also a serious competitor for the hypercube. In fact, while the MCT is much simpler and cheaper than the hypercube, for all the algorithms we developed, the running time complexity on the MCT matches that of well-known hypercube algorithms. We also present an interesting variant of the MCT architecture that admits both the MOT and the torus as its subgraphs. While most of the discussion in this paper is focused on the MCT architecture itself, these analyses can be easily extended to the variant of the MCT presented here.
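Dilation and congestion figures such as those quoted above can be checked mechanically for any concrete embedding. The sketch below (illustrative only; the graphs and the node mapping are toy assumptions, not the MCT construction) measures dilation as the longest host path assigned to a guest edge and congestion as the largest number of guest edges routed over a single host edge, assuming each guest edge is routed along one BFS shortest path.

```python
from collections import deque

def shortest_path(host, u, v):
    """BFS shortest path from u to v in the host graph (adjacency-list dict)."""
    parent = {u: None}
    q = deque([u])
    while q:
        x = q.popleft()
        if x == v:
            break
        for y in host[x]:
            if y not in parent:
                parent[y] = x
                q.append(y)
    path, node = [], v
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def dilation_and_congestion(guest_edges, host, phi):
    """Dilation and congestion of embedding phi (guest node -> host node),
    routing each guest edge along one shortest path in the host."""
    dilation, load = 0, {}
    for a, b in guest_edges:
        path = shortest_path(host, phi[a], phi[b])
        dilation = max(dilation, len(path) - 1)
        for e in zip(path, path[1:]):
            e = tuple(sorted(e))
            load[e] = load.get(e, 0) + 1
    return dilation, max(load.values())

# Toy example: embed a 4-cycle (guest) into a 4-node path (host).
host = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
guest_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
phi = {0: 0, 1: 1, 2: 2, 3: 3}
print(dilation_and_congestion(guest_edges, host, phi))  # (3, 2): the wrap edge stretches across the path
```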

4.
A novel reconfigurable architecture based on a multiring multiprocessor network is described. The reconfigurability of the architecture is shown to result in a low network diameter and also a low degree of connectivity for each node in the network. The mathematical properties of the network topology and the hardware for the reconfiguration switch are described. Primitive parallel operations on the network topology are described and analyzed. The architecture is shown to contain 2D mesh topologies of varying sizes and also a single one-factor of the Boolean hypercube in any given configuration. A large class of algorithms for the 2D mesh and the Boolean n-cube are shown to map efficiently onto the proposed architecture without loss of performance. The architecture is shown to be well suited for a number of problems in low- and intermediate-level computer vision, such as the FFT, edge detection, template matching, and the Hough transform. Timing results for typical low- and intermediate-level vision algorithms on a transputer-based prototype are presented.

5.
A massively parallel fine-grained SIMD (single-instruction, multiple-data-stream) computer for machine vision computations is described. The architecture features a polymorphic-torus network which inserts an individually controllable switch into every node of the two-dimensional torus, so that the network is dynamically reconfigurable to match the algorithm. Reconfiguration is accomplished by circuit switching and is achieved at the fine-grained level. Using both the processor coordinate in the torus and the data for reconfiguration, the polymorphic torus achieves solution times that are superior or equivalent to those of popular vision architectures such as the mesh, tree, pyramid, and hypercube for many of the vision algorithms discussed. An implementation of the architecture is given to illustrate its VLSI efficiency.

6.
The crossed cube architecture for parallel computation
The construction of a crossed cube, which has many of the properties of the hypercube but has a diameter only about half as large, is discussed. This network is self-routing, in the sense that there is a simple distributed routing algorithm which guarantees optimal paths between any pair of vertices. This fact, together with other properties such as regularity, symmetry, high connectivity, and a simple recursive structure, suggests that the crossed cube may be an attractive alternative to the ordinary hypercube for massively parallel architectures. SIMD algorithms which utilize the architecture are developed, and it is shown that the CQn architecture can profitably emulate the ordinary hypercube. It is also shown that the addition of simple switches can improve the capabilities of the system significantly. For instance, the dynamic reconfiguration capability allows hypercube algorithms to be executed on the proposed architecture. The use of these switches also improves the embedding properties of the system.
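For comparison, the ordinary hypercube against which the crossed cube is measured is itself self-routing via dimension-ordered (e-cube) routing; the crossed cube's own routing algorithm is more involved and is not reproduced here. The sketch below (illustrative baseline only) shows e-cube routing on the ordinary n-cube.

```python
def ecube_route(src, dst, n):
    """Dimension-ordered (e-cube) routing on an ordinary n-dimensional hypercube:
    correct the differing address bits from the lowest dimension to the highest.
    Returns the sequence of nodes visited; path length = Hamming distance."""
    path, cur = [src], src
    for d in range(n):
        if (cur ^ dst) & (1 << d):      # bit d still differs
            cur ^= (1 << d)             # traverse the dimension-d link
            path.append(cur)
    return path

print(ecube_route(0b000, 0b101, 3))     # [0, 1, 5]: two hops for Hamming distance 2
```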

7.
Various Artificial Neural Networks (ANNs) have been proposed in recent years to mimic the human brain in solving problems involving human-like intelligence. Efficient mapping of ANNs comprising a large number of neurons onto various distributed MIMD architectures is discussed in this paper. The massive interconnection among neurons demands a communication-efficient architecture. Issues related to the suitability of MIMD architectures for simulating neural networks are discussed. Performance analysis of the ring, torus, binary tree, hypercube, and extended hypercube for simulating artificial neural networks is presented. Our studies reveal that the performance of the extended hypercube is better than those of the ring, torus, binary tree, and hypercube topologies.
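The suitability of a topology for the dense communication of neural network simulation is driven largely by its node degree and diameter. The quick sketch below is a back-of-the-envelope comparison of these two figures for the four standard topologies named above (it is not the paper's performance model, and the extended hypercube is omitted because its parameters depend on its specific construction).

```python
import math

def topology_stats(P):
    """Node degree and diameter for standard topologies on ~P processors.
    Assumes P is a power of four, so the torus side and hypercube dimension are exact."""
    d = int(math.log2(P))
    side = int(math.isqrt(P))
    return {
        "ring":        {"degree": 2, "diameter": P // 2},
        "2-D torus":   {"degree": 4, "diameter": 2 * (side // 2)},
        "binary tree": {"degree": 3, "diameter": 2 * (d - 1)},   # complete tree of ~P nodes
        "hypercube":   {"degree": d, "diameter": d},
    }

for name, s in topology_stats(256).items():
    print(f"{name:12s} degree={s['degree']:2d} diameter={s['diameter']}")
```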

8.
Parallel Computing, 1988, 9(1): 37-52
We have mapped onto the iPSC hypercube a particle-in-cell (PIC) algorithm that executes a plasma simulation. PIC simulates the movement of charged particles under the influence of an electrostatic field. This application provides a simple example of the problems associated with load balancing on distributed-memory architectures. We present several alternative solutions to mappings of the algorithm onto the hypercube. One solution's performance is modeled and benchmarked with data from an implementation on the iPSC. The model is used to predict performance for larger problem sizes and a state-of-the-art hypercube architecture. We also introduce the use of provably optimal global communication algorithms that are needed for the PIC implementation on the hypercube.

9.
This paper presents a design methodology for developing efficient distributed-memory parallel programs for block recursive algorithms such as the fast Fourier transform (FFT) and bitonic sort. This design methodology is specifically suited for most modern supercomputers having a distributed-memory architecture with a circuit-switched or wormhole routed mesh or a hypercube interconnection network. A mathematical framework based on the tensor product and other matrix operations is used for representing algorithms. Communication-efficient implementations with effectively overlapped computation and communication are achieved by manipulating the mathematical representation using the tensor product algebra. Performance results for FFT programs on the Intel Paragon are presented. © 1998 John Wiley & Sons, Ltd.
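A key computational identity behind such tensor-product formulations is that a factor of the form (I ⊗ A) applied to a vector amounts to applying A independently to contiguous blocks (local work), while (B ⊗ I) applies B across strided blocks (a distributed, communication-bearing pattern); reshaping turns both into ordinary matrix products. The snippet below is a small numpy check of this identity (illustrative; not code from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((m, m))
x = rng.standard_normal(m * n)

# (I_m ⊗ A) x : apply A to each of the m contiguous length-n blocks of x.
y_blocked = (x.reshape(m, n) @ A.T).reshape(-1)
assert np.allclose(y_blocked, np.kron(np.eye(m), A) @ x)

# (B ⊗ I_n) x : apply B across the m blocks (a strided, "distributed" pattern).
y_strided = (B @ x.reshape(m, n)).reshape(-1)
assert np.allclose(y_strided, np.kron(B, np.eye(n)) @ x)

print("tensor-product identities verified")
```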

10.
Tensor products of matrices play a very important role in approximation and interpolation. This paper describes a systolic algorithm for tensor products on mesh-connected arrays and the closely related hypercube architectures (including the Connection Machine). It is based on a new operator, called the bullet operator, which is a higher-dimensional matrix operation. The applications of tensor products to multivariable spline blending approximation as well as graphics/image processing are indicated.

11.
This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to further improvement in scalability on the CM-5 by a constant factor; as a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than that with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation.
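For reference, one PCG iteration has the structure sketched below (a standard textbook formulation with a diagonal/Jacobi preconditioner, not the paper's parallel code); the two dot products per iteration are exactly the reductions whose communication cost the paper analyzes.

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-8, max_iter=200):
    """Preconditioned conjugate gradient with a diagonal (Jacobi) preconditioner.
    The two inner products per iteration (r.z and p.Ap) become global reductions
    when the vectors are distributed across processors."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                       # matrix-vector product (local work + boundary exchange)
        alpha = rz / (p @ Ap)            # reduction #1
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r               # preconditioner solve (purely local for Jacobi)
        rz_new = r @ z                   # reduction #2
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Tiny symmetric positive definite example
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(pcg(A, b, 1.0 / np.diag(A)))
```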

12.
In this paper, we propose and analyze a new interconnection network, the k-ary hypercube. This new architecture captures the advantages of the mesh network and those of the binary hypercube. We show the Hamiltonicity of this network and its capability of efficiently simulating other topologies. It has a smaller degree than that of its equivalent binary hypercube (the one with at least as many nodes) and has a smaller diameter than its equivalent mesh of processors.

13.
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objectives of this paper are to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying scalability issues, and discuss their interrelationships. For example, we derive an important relationship between time-constrained scaling and the isoefficiency function. We point out some of the weaknesses of the existing schemes for measuring scalability, and discuss possible ways of extending them.
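One of the formalisms surveyed, the isoefficiency function, follows from a standard efficiency identity. In the usual textbook form (stated here for context, not quoted from the paper), with W the problem size measured as serial work, T_p the parallel time, and T_o(W,p) = pT_p − W the total overhead:

```latex
E \;=\; \frac{S}{p} \;=\; \frac{W}{W + T_o(W,p)} \;=\; \frac{1}{1 + T_o(W,p)/W},
\qquad\text{so holding } E \text{ fixed requires}\qquad
W \;=\; \frac{E}{1-E}\, T_o(W,p) \;=\; K\, T_o(W,p).
```

The rate at which W must grow with p to satisfy this relation is the isoefficiency function of the algorithm-architecture pair.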

14.
In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics: the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors. Our goal here is to determine the most scalable load balancing schemes for different architectures such as hypercube, mesh, and network of workstations. For each of these architectures, we establish lower bounds on the scalability of any possible load balancing scheme. We present the scalability analysis of a number of load balancing schemes that have not been analyzed before. This gives us valuable insights into their relative performance for different problem and architectural characteristics. For each of these architectures, we are able to determine near optimal load balancing schemes. Results obtained from implementation of these schemes in the context of the Tautology Verification problem on the Ncube/2 (a trademark of the Ncube Corporation) multicomputer are used to validate our theoretical results for the hypercube architecture. These results also demonstrate the accuracy and viability of our framework for scalability analysis.
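Schemes of this kind typically follow a receiver-initiated pattern: an idle processor picks a donor, requests work, and the donor splits off part of its local work. The sketch below is a generic illustration of that pattern; the policy names (random polling, asynchronous round robin) come from the related load-balancing literature rather than from this abstract, and the half-splitting rule is an assumption for the example.

```python
import random
from collections import deque

def split_work(local_stack):
    """Donor side: give away roughly half of the untried work pieces.
    (Half-splitting is one common policy; the schemes compared in this line of
    work differ mainly in how an idle processor chooses whom to ask.)"""
    if len(local_stack) < 2:
        return None                              # too little work to be worth sharing
    half = len(local_stack) // 2
    return deque(local_stack.popleft() for _ in range(half))

def pick_donor_random_polling(my_id, P):
    """Random polling: ask a processor chosen uniformly at random (never yourself)."""
    donor = random.randrange(P - 1)
    return donor if donor < my_id else donor + 1

def pick_donor_async_round_robin(pointers, my_id, P):
    """Asynchronous round robin: each processor cycles through targets with its
    own local pointer, skipping itself."""
    donor = pointers.setdefault(my_id, (my_id + 1) % P)
    pointers[my_id] = (donor + 1) % P
    if donor == my_id:
        donor = pointers[my_id]
        pointers[my_id] = (donor + 1) % P
    return donor
```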

15.
We present two new parallel algorithms QSP1 and QSP2 based on sequential quicksort for sorting data on a mesh multicomputer, and analyze their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers, while QSP1 is fairly close to optimal. Lang et al. (1) and Schnorr et al. (2) have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) run-time complexity for the one-element-per-processor case. Both QSP1 and QSP2 have better scalability than the scaled-down variants of these algorithms (for the case in which there are more elements than processors). We also analyze a different variant of Lang's sort which is as scalable as QSP2. We briefly discuss another metric called resource consumption. According to this metric, both QSP1 and QSP2 are superior to variants of Lang's sort.

16.
There are two ways, other than the standard fast Fourier transform (FFT) algorithm, of computing Fourier transforms of real data, namely, (1) the real fast Fourier transform (RFFT) algorithm, and (2) the fast Hartley transform (FHT) algorithm. On a sequential computer, it has been shown that both the RFFT and the FHT algorithms are faster than the FFT algorithm. However, it is not obvious that the same is true on a parallel machine. The communication requirements of the RFFT and the FHT algorithms, which are critical to the cost of any parallel implementation, are different from those of the FFT algorithm. In this paper we present efficient implementations of the RFFT and the FHT algorithms on a hypercube machine. Experimental results are given for the implementation of the RFFT and the FHT algorithms on the NCUBE machine.
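The Hartley transform mentioned above is tightly related to the Fourier transform of the same real sequence: H[k] = Re(F[k]) − Im(F[k]). The snippet below is a numerical sanity check of this relation (not the parallel implementation from the paper).

```python
import numpy as np

def dht(x):
    """Direct O(N^2) discrete Hartley transform: H[k] = sum_n x[n] * cas(2*pi*n*k/N),
    where cas(t) = cos(t) + sin(t)."""
    N = len(x)
    n = np.arange(N)
    t = 2.0 * np.pi * np.outer(n, n) / N
    return (np.cos(t) + np.sin(t)) @ x

rng = np.random.default_rng(1)
x = rng.standard_normal(16)                   # real input
F = np.fft.fft(x)
assert np.allclose(dht(x), F.real - F.imag)   # DHT recovered from the DFT of the same data
print("DHT/DFT relation verified")
```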

17.
Many parallel algorithms use hypercubes as the communication topology among their processes. When such algorithms are executed on hypercube multicomputers, the communication cost is kept to a minimum since processes can be allocated to processors in such a way that only communication between neighboring processors is required. However, the scalability of hypercube multicomputers is constrained by the fact that the interconnection cost per node increases with the total number of nodes. From a scalability point of view, meshes and toruses are more interesting classes of interconnection topologies. This paper focuses on the execution of algorithms with a hypercube communication topology on multicomputers with mesh or torus interconnection topologies. The proposed approach is based on looking at different embeddings of hypercube graphs onto mesh or torus graphs. The paper concentrates on toruses, since an already known embedding, called the standard embedding, is optimal for meshes. In this paper, an embedding of hypercubes onto toruses of any given dimension is proposed. This novel embedding is called the xor embedding. The paper presents a set of performance figures for both the standard and the xor embeddings and shows that the latter outperforms the former for any torus. In addition, it is proven that for a one-dimensional torus (a ring) the xor embedding is optimal in the sense that it minimizes the execution time of a class of parallel algorithms with hypercube topology. This class of algorithms is frequently found in real applications, such as the FFT and some classes of sorting algorithms.

18.
The simplicity of regular mesh topology Network on Chip (NoC) architecture leads to reductions in design time and manufacturing cost. A weakness of the regular-shaped architecture is its inability to efficiently support cores of different sizes. A proposed way in the literature to deal with this is to utilize the region concept, which helps to accommodate cores larger than the tile size in mesh topology NoC architectures. The region concept offers many new opportunities for NoC design, as well as new design issues and challenges. One of the most important among these is the design of an efficient deadlock-free routing algorithm: available adaptive routing algorithms developed for the regular mesh topology cannot ensure freedom from deadlocks. In this paper, we list and discuss many new design issues which need to be handled when designing NoC systems incorporating cores larger than the tile size. We also present and compare two deadlock-free routing algorithms for mesh topology NoC with regions. The idea of the first algorithm is borrowed from the area of fault-tolerant networks, where a network topology is rendered irregular due to faults in routers or links, and is adapted for the new context. We compare this with an algorithm designed using a methodology for the design of application-specific routing algorithms for communication networks. The application-specific routing algorithm tries to maximize adaptivity by using static and dynamic communication requirements of the application. Our study shows that the application-specific routing algorithm not only provides much higher adaptivity, but also superior performance compared to the other algorithm in all traffic cases. However, this higher performance comes at a higher area cost for implementing the network routers.
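For context, the canonical deadlock-free routing algorithm on a regular mesh NoC is dimension-ordered XY routing (route fully in X, then in Y); an oversized-core region can block the links such a route needs, which is what motivates the two algorithms compared above. A minimal sketch of the baseline (illustrative only, not either of the paper's algorithms) is shown below.

```python
def xy_route(src, dst):
    """Dimension-ordered XY routing on a 2-D mesh NoC: move along X to the
    destination column, then along Y. Deadlock-free on a regular mesh because
    no packet ever turns from the Y dimension back to X."""
    (x, y), (dx, dy) = src, dst
    hops = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

print(xy_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```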

19.
We present in this paper an efficient algorithm for solving the integral knapsack problem on the hypercube. The main idea is to represent the computations of the dynamic programming formulation as a precedence graph (which has the structure of an irregular mesh). Then, we propose a time-optimal scheduling algorithm for computing this irregular mesh on the hypercube.
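In its sequential form, the dynamic programming formulation referenced above is the classic recurrence over integer capacities, and the precedence graph being scheduled is the dependency structure of that table. A sequential sketch of one common integral variant (the 0/1 knapsack; the paper's contribution is the hypercube scheduling, not this recurrence) is given below.

```python
def knapsack_01(values, weights, capacity):
    """Classic 0/1 knapsack DP over integer capacities.
    After processing item i, f[c] is the best value achievable using items 0..i
    within capacity c; the (item, capacity) cells and their dependencies form
    the precedence graph that gets scheduled onto the parallel machine."""
    f = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):   # reverse order: each item used at most once
            f[c] = max(f[c], f[c - w] + v)
    return f[capacity]

print(knapsack_01(values=[60, 100, 120], weights=[1, 2, 3], capacity=5))  # 220
```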

20.
Providing highly flexible connectivity is a major architectural challenge for hardware implementation of reconfigurable neural networks. We perform an analytical evaluation and comparison of different configurable interconnect architectures (mesh NoC, tree, shared bus, and point-to-point) emulating variants of two neural network topologies (having full and random configurable connectivity). We derive analytical expressions and asymptotic limits for the performance (in terms of bandwidth) and cost (in terms of area and power) of the interconnect architectures, considering three communication methods (unicast, multicast, and broadcast). It is shown that the multicast mesh NoC provides the highest performance/cost ratio and consequently is the most suitable interconnect architecture for configurable neural network implementation. Routing table size requirements and their impact on scalability are analyzed. A modular hierarchical architecture based on the multicast mesh NoC is proposed to allow large-scale neural network emulation. Simulation results successfully validate the analytical models and the asymptotic behavior of the network as a function of its size.
