Similar literature
1.
In general, message-passing multiprocessors suffer from communication overhead between processors, and shared-memory multiprocessors suffer from memory contention. Also, in computer vision tasks, data I/O overhead limits performance. In particular, high-level vision tasks, which are complex and require nondeterministic communication, are strongly affected by these disadvantages. This paper proposes a flexibly (tightly/loosely) coupled hypercube multiprocessor (FCHM) for high-level vision to alleviate these problems. It proposes a variable address space memory scheme in which a set of adjacent memory modules can be merged into a shared memory module by a dynamically partitionable hypercube topology. The architecture is quantitatively analyzed using computational models and simulated on Intel's Personal SuperComputer (iPSC/I), a hypercube multiprocessor. A parallel algorithm for exhaustive search is simulated on FCHM using the iPSC/I, showing significant performance improvements over the iPSC/I itself. This research was supported in part by IBM Corporation.

2.
The hypercube is one of the most widely used topologies because it provides a small diameter and supports the embedding of various interconnection networks. For very large systems, however, the number of links needed by the hypercube may become prohibitively large. In this paper, we propose a hierarchical interconnection network based on hypercubes, called the hierarchical hypercube network (HHN), for massively parallel computers. The HHN has fewer links than the comparable hypercube. In particular, when we construct networks with 2^K nodes, the node degree of the HHN with the minimum node degree is O([formula]), while that of the hypercube is O(K). Despite its smaller node degree, many parallel algorithms can be executed on the HHN with the same time complexity as on the hypercube.

3.
Performance modeling of Cartesian product networks   (Cited by: 1; self-citations: 0; citations by others: 1)
This paper presents a comprehensive performance model for fully adaptive routing in wormhole-switched Cartesian product networks. The model is general enough to be used for any product graph, and experimental (simulation) results show that it remains highly accurate even under heavy traffic and in the saturation region, where other models have severe problems predicting network performance. Most popular interconnection networks, including the mesh, hypercube, and torus, can be defined as a Cartesian product of two or more networks. Torus and mesh networks are the most popular topologies in recent parallel supercomputers, and they have been widely used to realize on-chip networks in recent multicore and multiprocessor systems.
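The claim that meshes, tori, and hypercubes are Cartesian products can be checked with a small sketch. The code below is only an illustration and is not taken from the paper; the helper names (`cartesian_product`, `path`, `cycle`, `edge_count`) are hypothetical, and graphs are assumed to be undirected adjacency dictionaries.

```python
from itertools import product


def cartesian_product(g, h):
    """Cartesian product of two undirected graphs given as adjacency dicts."""
    result = {(u, v): set() for u, v in product(g, h)}
    for u, v in result:
        # Follow an edge of g while keeping the h-coordinate fixed.
        for u2 in g[u]:
            result[(u, v)].add((u2, v))
        # Follow an edge of h while keeping the g-coordinate fixed.
        for v2 in h[v]:
            result[(u, v)].add((u, v2))
    return result


def path(n):
    """Path graph on n nodes (a linear array)."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


def cycle(n):
    """Cycle graph on n nodes (a ring); assumes n >= 3."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def edge_count(g):
    """Number of undirected edges in an adjacency dict."""
    return sum(len(neighbors) for neighbors in g.values()) // 2


mesh = cartesian_product(path(3), path(4))      # 3 x 4 mesh
torus = cartesian_product(cycle(4), cycle(4))   # 4 x 4 torus
k2 = path(2)                                    # a single edge
cube3 = cartesian_product(cartesian_product(k2, k2), k2)  # 3-dimensional hypercube
```

Every node of `cube3` has degree 3 and every node of `torus` has degree 4, as expected for a 3-cube and a 2-D torus.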

4.
Parallel Computing, 2007, 33(1): 2-20
In multiprocessor systems, interconnection network design is critical for overall system performance. Among the popular interconnection networks, unidirectional ring-based networks have been one of the popular choices for high-performance, large-scale shared-memory multiprocessor systems. In this paper, we propose the "Torus Ring", a modified version of the two-level hierarchical ring. The Torus Ring has the same complexity as the hierarchical ring; the only difference is the way it connects the local rings. Compared to hierarchical rings, the Torus Ring exploits the memory access locality of application programs more efficiently. It has an advantage over the hierarchical ring when the destination of a packet is an adjacent local ring, especially the backward adjacent local ring. Under the assumption that packet destinations are uniformly distributed across the processing nodes, the average number of hops in the Torus Ring is equal to that of the hierarchical ring. However, the performance gain of the Torus Ring is expected to increase in a real parallel programming environment because of the memory access locality of application programs. In the simulation results, the latency of the interconnection network is reduced by up to 19% and the execution time by up to 10% at a moderate ring utilization ratio.

5.
Efficient Collective Communications in Dual-Cube   (Cited by: 1; self-citations: 2; citations by others: 1)
The hypercube, or n-cube, has been widely used as the interconnection network in parallel computers. However, the major drawback of the hypercube is that the number of communication links per node grows with the total number of nodes in the system. This paper introduces a new interconnection network, the dual-cube, for large-scale parallel computers and describes algorithms for efficient collective communications in the dual-cube. The dual-cube network mitigates the growth in the number of links in large-scale hypercube networks while retaining the hypercube's topological properties. The design of efficient routing algorithms for collective communications is a key issue for any interconnection network. In this paper, we show that collective communications can be done in the dual-cube with almost the same communication times as in the hypercube.

6.
In distributed shared-memory multiprocessor systems, parallel tasks communicate by sharing memory data. As the system size increases, this communication cost becomes the main factor limiting overall parallelism and performance. In this paper, we propose a new solution to the problem that judiciously manages the relevant resources, namely the shared data and the interconnection network (IN) through which the sharing is carried out. In this approach, communication cost is minimized by means of data migration/allocation based on an analysis of general layered task graphs, the sharing behavior of parallel tasks, and the network topology. Our method is not applicable to read-only variables. Further, for the time being, the usefulness of the method is limited to multiprocessors in which no cache coherence mechanism is implemented. Four typical interconnection topologies for multiprocessors are considered: shared-bus, hierarchical-bus, 2-D mesh, and fat-tree structures. Efficient data allocation algorithms are developed for each of the four network topologies; they make decisions on data allocation/migration at compile time. For one shared variable, in a system with n processors executing a p-layer task graph, the complexity of the algorithm is O(np) for the shared bus and O(n^2 p) for the remaining three topologies. We also give an algorithm that determines an optimal allocation/migration scheme for multiple shared variables. However, the cost of this algorithm becomes prohibitive when the number of shared variables is high. Therefore, a heuristic of low complexity is suggested. The heuristic is optimal for some topologies.

7.
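A Performance Study of Several Hypercube Interconnection Structures   (Cited by: 2; self-citations: 0; citations by others: 2)
As parallel processing systems continue to grow in scale, hypercube interconnection structures have come into widespread use. By studying the cost-performance ratio of network structures, this paper analyzes several commonly used hypercube interconnection structures and draws some useful conclusions.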

8.
International Journal of Computer Mathematics, 2012, 89(9): 1774-1781
The diagnosability of a multiprocessor system is an important research topic in parallel processing. As a family of promising optical interconnection topologies for massively parallel computers, optical multi-mesh hypercube (OMMH) networks integrate the positive features of both hypercube and mesh topologies and circumvent the lack of scalability of hypercubes and the large diameter of meshes. This paper studies an (l, m, n)-OMMH network and proves that its diagnosability under the comparison diagnosis model is n+4 for l≥5, m≥5, n≥3.

9.
This paper proposes a simple yet efficient algorithm to distribute loads evenly on multiprocessor computers with hypercube interconnection networks. The algorithm is based on the well-known dimension exchange method. However, it avoids the error accumulation suffered by other algorithms based on the dimension exchange method by exploiting the notion of regular distributions, which are commonly used for data distribution in parallel programming. The algorithm achieves a perfect load balance over P processors with an error of 1 and a worst-case time complexity of O(M log2 P), where M is the maximum number of tasks initially assigned to each processor. Furthermore, perfect load balance is achieved over subcubes as well: once a hypercube is balanced, if the cube is decomposed into two subcubes by the lowest bit of the node addresses, then the difference between the total numbers of tasks in these subcubes is at most 1.
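For context, the classic dimension exchange method that this abstract builds on can be sketched as follows. This is the baseline only, not the paper's regular-distribution algorithm; the function name `dimension_exchange` and the tie-breaking rule (the lower-address node keeps the extra task) are assumptions made for illustration.

```python
def dimension_exchange(loads):
    """Classic dimension exchange on a hypercube with len(loads) == 2**d nodes.

    In step k, every node pairs with its neighbor across dimension k and the
    two nodes split their combined load as evenly as integers allow.
    """
    p = len(loads)
    if p == 0 or p & (p - 1) != 0:
        raise ValueError("the number of nodes must be a power of two")
    loads = list(loads)
    dim = 0
    while (1 << dim) < p:
        for node in range(p):
            partner = node ^ (1 << dim)  # neighbor across dimension `dim`
            if node < partner:           # handle each pair once
                total = loads[node] + loads[partner]
                # Assumed tie-break: the lower-address node keeps the extra task.
                loads[node] = (total + 1) // 2
                loads[partner] = total // 2
        dim += 1
    return loads


balanced = dimension_exchange([8, 0, 0, 0, 0, 0, 0, 0])  # -> [1, 1, 1, 1, 1, 1, 1, 1]
skewed = dimension_exchange([3, 0, 1, 0])                # -> [2, 1, 1, 0]
```

The second call shows the error accumulation the abstract refers to: four tasks on four nodes could be balanced exactly, but the rounding in each step leaves a difference of 2 between the most and least loaded nodes.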

10.
Many parallel algorithms and library routines for computer vision and image processing (CVIP) tasks on distributed-memory multiprocessors are available. The typical image distribution may use column-, row-, or block-based mapping. Integrating a set of library routines for a CVIP application requires a global optimization that determines the data mapping of individual tasks by considering inter-task communication. The main difficulty in deriving the optimal image data distribution for each task is that CVIP task computation may involve loops, and the number of available processors and the size of the input image may vary at run time. In this paper, a CVIP application is modeled using a task chain with imperfectly nested loops, specified by conventional visual languages such as Khoros and Explorer. A mapping algorithm is proposed that optimizes the average run-time performance of CVIP applications with nested loops by considering the data redistribution overheads and possible run-time parameter variations. A taxonomy of CVIP operations is provided and used to further reduce the complexity of the algorithm. Experimental results on both low-level image processing and high-level computer vision applications are presented to validate this approach.

11.
Interconnecting the nodes of massively parallel processor networks requires concepts that are extendable. In this paper, a 'recursive network' is described. With a basic building block that has an essentially fixed number of links, arbitrarily large systems can be configured. The interconnection topology is the same at all levels, so a simple routing algorithm can be applied. The recursive network is described and compared with hypercube and mesh networks with respect to the system diameter and the efficient use of the links.

12.
Transposing an N×N array that is distributed row- or columnwise across P = N processors is a fundamental communication task that requires time-consuming interprocessor communication. It is the underlying communication task for the fast Fourier transform of long sequences and multidimensional arrays. It is also the key communication task for certain weather and climate models. A parallel transposition algorithm is presented for hypercube and mesh-connected multicomputers with programmable networks. The optimal scheduling of network transmissions is not unique and is known to be nontrivial. Here, scheduling is determined by a single de Bruijn sequence of N bits. The elements in each processor are first preordered and then, in groups of log2 P adjacent elements, either transmitted or not transmitted, depending on the corresponding bit in the de Bruijn sequence. The algorithm is optimal both in overall time and in the time that any individual element is in the network. The results are extended to other communication tasks including shuffles, bit reversal, index reversal, and general index-digit permutation. The case PN and rectangular arrays with non-power-of-two dimensions are also discussed. Algorithms for mesh-connected multicomputers are developed by embedding the hypercube in the mesh. The optimal implementation of the algorithms requires certain architectural features that are not currently available in the marketplace.
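As background for the scheduling device mentioned in this abstract, the sketch below generates a de Bruijn sequence with the standard Lyndon-word construction. It only illustrates what such a sequence is; how the paper maps the bits to transmissions is not described in the abstract and is not reproduced here. The names `de_bruijn` and `cyclic_windows` are hypothetical. For N = 8 processors, a binary sequence of order log2 N = 3 has exactly N = 8 bits.

```python
def de_bruijn(k, n):
    """Return a de Bruijn sequence over the alphabet {0, ..., k-1} of order n.

    Read cyclically, the result contains every length-n word exactly once.
    Uses the standard construction that concatenates Lyndon words.
    """
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence


def cyclic_windows(seq, n):
    """All length-n windows of seq, wrapping around at the end."""
    doubled = seq + seq[:n - 1]
    return [tuple(doubled[i:i + n]) for i in range(len(seq))]


bits = de_bruijn(2, 3)               # 8 bits, one per processor when N = 8
windows = cyclic_windows(bits, 3)    # the 8 distinct 3-bit words
```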

13.
Many parallel algorithms use hypercubes as the communication topology among their processes. When such algorithms are executed on hypercube multicomputers, the communication cost is kept to a minimum because processes can be allocated to processors in such a way that only communication between neighboring processors is required. However, the scalability of hypercube multicomputers is constrained by the fact that the interconnection cost per node increases with the total number of nodes. From a scalability point of view, meshes and toruses are more interesting classes of interconnection topologies. This paper focuses on the execution of algorithms with a hypercube communication topology on multicomputers with mesh or torus interconnection topologies. The proposed approach is based on examining different embeddings of hypercube graphs onto mesh or torus graphs. The paper concentrates on toruses because an already known embedding, called the standard embedding, is optimal for meshes. An embedding of hypercubes onto toruses of any given dimension is proposed. This novel embedding is called the xor embedding. The paper presents a set of performance figures for both the standard and the xor embeddings and shows that the latter outperforms the former for any torus. In addition, it is proven that for a one-dimensional torus (a ring) the xor embedding is optimal in the sense that it minimizes the execution time of a class of parallel algorithms with hypercube topology. This class of algorithms is frequently found in real applications, such as the FFT and some classes of sorting algorithms.

14.
This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient (PCG) algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to further improvement in scalability on the CM-5 by a constant factor. As a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than one with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation.

15.
The binary hypercube has been one of the most frequently chosen interconnection networks for parallel computers because it provides a low diameter and is so robust that it can very efficiently emulate a wide variety of other frequently used networks. However, the major drawback of the hypercube is that the number of communication channels per processor grows with the total number of processors in the system. This drawback has a direct effect on the very-large-scale integration (VLSI) complexity of the hypercube network. This short note proposes a new topology that is produced from the hypercube by a uniform reduction in the number of edges for each node. This edge reduction technique produces networks with lower complexity than hypercubes while maintaining, to a large extent, the powerful hypercube properties. An extensive comparison of the proposed reduced hypercube (RH) topology with the conventional hypercube is included. It is also shown that several copies of the popular cube-connected cycles network can be emulated simultaneously by an RH with dilation 1.

16.
1 Introduction. Let G = (V, E) be a connected, undirected graph with a weight function W from the set E of edges to the set of reals. A spanning tree is a subgraph T = (V, E_T), E_T ⊆ E, of G such that T is a tree. The weight W(T) of a spanning tree T is the sum of the weights of its edges. A spanning tree with the smallest possible weight is called a minimum spanning tree (MST) of G. Computing an MST of a given weighted graph is an important problem that arises in many applications. For this …
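The MST definition above can be made concrete with a short sequential sketch. The excerpt is cut off before it describes the paper's own method, so the code below simply uses Kruskal's algorithm with a union-find structure as a stand-in; the function name `minimum_spanning_tree` and the edge format `(weight, u, v)` are assumptions for illustration.

```python
def minimum_spanning_tree(num_vertices, edges):
    """Kruskal's algorithm on a connected, undirected, weighted graph.

    `edges` is a list of (weight, u, v) tuples with vertices 0..num_vertices-1.
    Returns the list of edges in one minimum spanning tree.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Find the representative of x's component, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge joins two different components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


example_edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3)]
mst = minimum_spanning_tree(4, example_edges)
mst_weight = sum(weight for weight, _, _ in mst)  # W(T) = 1 + 2 + 3 = 6
```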

17.
All-to-all personalized exchange is one of the densest collective communication patterns and occurs in many important applications in parallel computing. Previous all-to-all personalized exchange algorithms were mainly developed for hypercube and mesh/torus networks. Although the algorithms for a hypercube may achieve optimal time complexity, the network suffers from unbounded node degrees and thus scales poorly because of the limited number of I/O ports in a processor. On the other hand, a mesh/torus has a constant node degree and better scalability in this respect, but its all-to-all personalized exchange algorithms have higher time complexity. In this paper, we propose an alternative approach to efficient all-to-all personalized exchange by considering another important type of network for parallel computing systems, the multistage network. We present a new all-to-all personalized exchange algorithm for a class of unique-path, self-routable multistage networks. We first develop a generic method for decomposing all-to-all personalized exchange patterns into permutations that are realizable in these networks, and then present a new all-to-all personalized exchange algorithm based on this method. The newly proposed algorithm has O(n) time complexity for an n×n network, which is optimal for all-to-all personalized exchange. Because it takes advantage of the fast switch setting of self-routable switches and the single input/output port per processor in a multistage network, we believe that a multistage network could be a better choice for implementing all-to-all personalized exchange, owing to its shorter communication latency and better scalability.
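The idea of decomposing an all-to-all personalized exchange into n permutations can be sketched with a simple cyclic-shift schedule. This is an assumed, generic decomposition chosen for illustration; it does not check whether each permutation is realizable in a particular multistage network, which is the part the paper addresses. The names `exchange_schedule` and `all_to_all` are hypothetical.

```python
def exchange_schedule(n):
    """Return n permutations; in round k, processor src sends to (src + k) % n.

    Taken together, the rounds cover every (source, destination) pair once.
    """
    return [[(src + k) % n for src in range(n)] for k in range(n)]


def all_to_all(messages):
    """Simulate the exchange; messages[src][dst] is the item src sends to dst.

    Returns received, where received[dst][src] is the item dst got from src.
    """
    n = len(messages)
    received = [[None] * n for _ in range(n)]
    for perm in exchange_schedule(n):
        # Each round is a permutation, so no destination gets two items at once.
        assert sorted(perm) == list(range(n))
        for src, dst in enumerate(perm):
            received[dst][src] = messages[src][dst]
    return received


outgoing = [[f"{src}->{dst}" for dst in range(4)] for src in range(4)]
incoming = all_to_all(outgoing)
```

The schedule uses n rounds for n processors, which matches the O(n) bound quoted in the abstract for an n×n network.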

18.
The Chained-Cubic Tree (CCT) interconnection network topology was recently proposed as part of the continuing effort to improve the performance of interconnection networks. This topology, which promises to exhibit the best properties of the hypercube and tree topologies, needs to be investigated in depth in order to evaluate its performance against other interconnection network topologies. This work is a complementary effort that investigates load balancing, one of the most important aspects of performance improvement. The paper proposes a new load balancing algorithm for CCT interconnection networks. The proposed algorithm, called the Hybrid Dynamic Parallel Scheduling Algorithm (HD-PSA), combines two common load balancing strategies: dynamic load balancing and parallel scheduling. The performance of the proposed algorithm is evaluated both analytically and experimentally in terms of various performance metrics, including execution time, load balancing accuracy, communication cost, number of task hops, and task locality.

19.
We deal with the permutation routing problem on graphs modeling interconnection networks. In our model, called routing via factors, the communication pattern at each routing step is a directed 1-factor in a symmetric digraph. This adds a new feature, that of continuous packet movement, to previously studied routing types, where the routing of a permutation is reduced to a sequence of permutations from a given class. We focus especially on bipartite graphs and give sufficient conditions for a graph to be rearrangeable in our model. We propose a general technique for routing via factors that we apply to the 2D mesh and the hypercube.

20.
Obtaining efficient execution of parallel programs in workstation networks is a difficult problem for the user. Unlike dedicated parallel computer resources, network resources are shared, heterogeneous, vary in availability, and offer communication performance that is still an order of magnitude slower than parallel computer interconnection networks. Prophet, a system that automatically schedules data parallel SPMD programs in workstation networks for the user, has been developed. Prophet uses application and resource information to select the appropriate type and number of workstations, divide the application into component tasks and data across these workstations, and assign tasks to workstations. This system has been integrated into the Mentat parallel processing system developed at the University of Virginia. A suite of scientific Mentat applications has been scheduled using Prophet on a heterogeneous workstation network. The results are promising and demonstrate that scheduling SPMD applications can be automated with good performance. Copyright © 1999 John Wiley & Sons, Ltd.
