首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
A new methodology named CALMANT (CC-cube Algorithms on Meshes and Tori) for mapping a type of algorithm that we call CC-cube algorithm onto multicomputers with hypercube, mesh, or torus interconnection topology is proposed. This methodology is suitable when the initial problem can be expressed as a set of processes that communicate through a hypercube topology (a CC-cube algorithm). There are many important algorithms that fit into the CC-cube type. CALMANT is based on three different techniques: (a) the standard embedding to assign the processes of the algorithm to the nodes of the mesh multicomputer; (b) the communication pipelining technique to increase the level of communication parallelism inherent in the CC-cube algorithms; and (c) optimal message-scheduling algorithms proposed in this work in order to avoid conflicts and minimizing in this way the communication time. Although CALMANT is proposed for multicomputers with different interconnection network topologies, the paper only focuses on the particular case of meshes.  相似文献   

2.
We present the design of E-kernel, an embedding kernel on the Victor V256 message-passing partitionable multiprocessor, developed for the support of program mapping and network reconfiguration. E-kernel supports the embedding of a new network topology onto Victor's 2D mesh and also the embedding of a task graph onto the 2D mesh network or the reconfigured network. In the current implementation, the reconfigured network can be a line or an even-size ring, and the task graphs meshes or tori of a variety of dimensions and shapes or graphs with similar topologies. For application programs having these task graph topologies and that are designed according to the communication model of E-kernel, they can be run without any change on partitions connected by the 2D mesh, line, or ring. Further, E-kernel attempts the communication optimization of these programs on the different networks automatically, thus making both the network topology and the communication optimization attempt completely transparent to the application programs. Many of the embeddings used in E-kernel are optimal or asymptotically optimal (with respect to minimum dilation cost). The implementation of E-kernel translated some of the many theoretical results in graph embeddings into practical tools for program mapping and network reconfiguration in a parallel system. E-kernel is functional on Victor V256. Measurements of E-kernel's performance on V256 are also included  相似文献   

3.
This paper proposes a routing algorithm for the interconnection of multiple processors based on the shortest-path and deflection-routing principles. The routing algorithm, named SPDRA (Shortest Path and Deflection Routing Algorithm), is applied to multiprocessor systems with a single-stage shuffle physical topology. SPDRA is general-purpose, as opposed to the majority of routing algorithms for multiprocessor systems which are optimized for particular traffic patterns generated by a restricted class of parallel algorithms. The general-purpose nature of SPDRA allows perfomance comparisons with a wide class of routing algorithms for multiprocessor systems that, similar to the single-stage shuffle physical topology, have a fixed node-to-processor ratio. The paper compares SPDRA with hypercube algorithms for bidimensional meshes and torus physical topologies, routing algorithms for hierarchical tridimensional tori, and algorithms for routing permutations in shuffle networks, which constitute the most widely accepted approaches for multiprocessor interconnection. SPDRA exhibits a performance advantage for a broad range of network sizes and, in general, the performance advantage grows as the number of processors increases. However, this paper compares the SPDRA algorithm against a limited set of multiprocessor systems and does not demonstrate a general superiority of SPDRA over all systems with a fixed node-to-processor ratio and, especially, with a growing node-to-processor ratio, such as multistage networks.  相似文献   

4.
在大规模并行系统中,系统级互连网络的设计至关重要.InfiniBand作为一种高性能交换式网络被广泛应用于大规模并行处理系统中.mesh/torus拓扑结构相较于目前普遍应用于InfiniBand网络的胖树拓扑结构拥有更好的性能与可扩展性.尽管如此,研究发现,用传统的mesh/torus拓扑结构构建InfiniBand互连网络存在诸多问题.分析了传统网络拓扑结构的缺陷,并提出了一种基于InfiniBand的多链路mesh/torus互连网络.这种改进型的拓扑结构通过充分利用交换机间的多链路可以获得比传统mesh/torus网络更高的带宽.另外,同时给出了与该网络拓扑结构相配套的高效路由算法.最后,通过网络仿真技术对提出的算法进行了评估,实验结果显示提出的路由算法相较于其他路由算法拥有更好的性能与可扩展性.  相似文献   

5.
The paper presents novel embeddings of various classical topologies into the OPAM multicomputer. OPAM consists of a large number of processors that are connected by a two level, crossbar based interconnection network. The network combines a large, optical circuit-switched crossbar (reconfigurable network), with many small, packet-switching crossbars. The necessary embedding is very different than classical approaches. The goal in our case is to minimize routing decisions, so that communication requests can be satisfied by passing through two small crossbars. We show how to map parallel programs to this architecture using graph contraction notations. The family of parallel programs that we consider consists of multiple processes and communication links that are represented by connected, regular graphs such as rings, trees, two dimensional grids, cube connected cycles and hypercubes. In each case we show how to partition the vertex set of the program's graph to subsets, and how to assign each subset a cluster of processors in order to realize the topology of the given problem. In some of the cases we also prove that our partition and assignment algorithms are optimal  相似文献   

6.
We study the cross product as a method for generating and analyzing interconnection network topologies for multiprocessor systems. Consider two interconnection graphs G1 and G2 each with some established properties such as symmetry, low degree and diameter, scalability, simple optimal routing, recursive structure (partitionability), fault tolerance, existence of node-disjoint paths, low cost embedding, and efficient broadcasting. We investigate and evaluate the corresponding properties for the cross product of G1 and G2 based on the properties of G1 and those of G2. We also give a mathematical characterization of product families of graphs which are closed under the cross product operation. This investigation is useful in two ways. On one hand, it gives a new tool for further studying some of the known interconnection topologies, such as the hypercube and the mesh, which can be defined using the cross product operation. On the other hand, it can be used in defining and evaluating new interconnection graphs using the cross product operation on known topologies  相似文献   

7.
In a hypercube of high dimension, the interconnection is quite complex and thus links are likely to fail. In this paper, a general, systematic solution is proposed for the embedding of a group of regular graphs in an injured hypercube with faulty links. These graphs include rings, linear paths, binomial trees, binary trees, meshes, tori, products of the above graphs, and many others. Unlike many existing algorithms which are capable of embedding only one type of graphs, our algorithm embeds the above graphs in a unified way, all centered around a notion callededge matrix.  相似文献   

8.
《国际计算机数学杂志》2012,89(8):1595-1602
Twisted cubes are an important class of hypercube-variant interconnection networks for parallel computing. In this paper, we evaluate the fault-tolerant mesh/torus embedding abilities of twisted cubes. By reducing the fault-tolerant mesh/torus embedding problem to the fault-tolerant pancyclicity, we propose several schemes for embedding meshes/tori in faulty twisted cubes. The obtained results reveal another appealing fault-tolerant feature of twisted cubes.  相似文献   

9.
We introduce and analyze a new interconnection topology, called the k-dimensional folded Petersen (FPk) network, which is constructed by iteratively applying the Cartesian product operation on the well-known Petersen graph. Since the number of nodes in FPk is restricted to a power of ten, for better scalability we propose a generalization, the folded Petersen cube network FPQn,k =Qn×FPk, which is a product of the n-dimensional binary hypercube (Qn) and FPk. The FPQn,k topology provides regularity, node- and edge-symmetry, optimal connectivity (and therefore maximal fault-tolerance), logarithmic diameter, modularity, and permits simple self-routing and broadcasting algorithms. With the same node-degree and connectivity, FPQ n,k has smaller diameter and accommodates more nodes than Q n+3k, and its packing density is higher compared to several other product networks. This paper also emphasizes the versatility of the folded Petersen cube networks as a multicomputer interconnection topology by providing embeddings of many computationally important structures such as rings, multi-dimensional meshes, hypercubes, complete binary trees, tree machines, meshes of trees, and pyramids. The dilation and edge-congestion of all such embeddings are at most two  相似文献   

10.
The authors compare the performance of two join algorithms on both cube and ring interconnections for message-based multicomputers, and investigate the effects that the number of processors and the type of interconnection scheme have on the performance. First, the parallel hybrid-hash join algorithm and the parallel join-index join algorithm for both the cube and ring connected multicomputers are presented. The performance of these algorithms is then compared through analytical cost modeling. The result shows that the join-index join algorithm gives good performance only when the join selectivity is very small, and the hybrid-hash join algorithm performs consistently well under most situations. It is shown that the cube topology yields better execution time than the same algorithm on the ring topology. Furthermore, increasing the number of processors has a more significant improvement on the execution time of the cube than for the ring configuration. The applicability of join indexes on the parallel database algorithms is also discussed  相似文献   

11.
The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh despite the fact that large scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures such as SIMD hypercube and mesh architectures and shared memory architecture  相似文献   

12.
《国际计算机数学杂志》2012,89(9):1774-1781
Diagnosability of a multiprocessor system is one important study topic in the parallel processing area. As a family of promising optical interconnection topologies for massively parallel computers, the optical multi-mesh hypercube (OMMH) networks integrate positive features of both hypercube and mesh topologies and circumvent the lack of scalability of hypercubes and the large diameter of meshes. This paper studies an (l, m, n)-OMMH network and, proves that its diagnosability under the comparison diagnosis model is n+4 for l≥5, m≥5, n≥3.  相似文献   

13.
The embedding of one interconnection network into another is a very important issue in the design and analysis of parallel algorithms. Through such embeddings, the algorithms originally developed for one architecture can be directly mapped to another architecture. This paper describes a new embedding method, based on matrix transformations, for optimally embedding hierarchical hypercube networks (HHNs) into the hypercube (binary n-cube). Thus, this embedding method has practical importance in enhancing the capabilities and extending the usefulness of the hypercube, since hierarchical hypercube networks have proven to be very cost-effective for a wide range of applications  相似文献   

14.
Transposing anN×Narray that is distributed row- or columnwise acrossP=Nprocessors is a fundamental communication task that requires time- consuming interprocessor communication. It is the underlying communication task for the fast Fourier transform of long sequences and multidimensional arrays. It is also the key communication task for certain weather and climate models. A parallel transposition algorithm is presented for hypercube and mesh connected multicomputers with programmable networks. The optimal scheduling of network transmissions is not unique and is known to be nontrivial. Here, scheduling is determined by a single de Bruijn sequence ofNbits. The elements in each processor are first preordered and then, in groups of log2 Padjacent elements, either transmitted or not transmitted, depending on the corresponding bit in the de Bruijn sequence. The algorithm is optimal both in overall time and the time that any individual element is in the network. The results are extended to other communication tasks including shuffles, bit reversal, index reversal, and general index-digit permutation. The casePNand rectangular arrays with non-power-of-two dimensions are also discussed. Algorithms for mesh connected multicomputers are developed by embedding the hypercube in the mesh. The optimal implementation of the algorithms requires certain architectural features that are not currently available in the marketplace.  相似文献   

15.
We experimentally analyze some properties of simulated annealing algorithms (SA) and genetic algorithms (GA) for mapping data to multicomputers. These properties include sensitiviiy to user parameters, fault tolerance capability, and applicability to different multicomputer topologies. Some user parameters are included in the objective function and are architecture- or problem-dependent parameters. The others are used in the GA and SA algorithms. The fault tolerance capability is demonstrated by mapping data to a multicomputer with some faulty processors. We assume a hypercube multicomputer architecture in most experiments. However, mapping to mesh, array, ring, tree, and star graph topologies is also demonstrated. The experimental results show that the GA and SA are insensitive to user parameters in wide ranges, completely fault tolerant, and unbiased towards particular multicomputer topologies. These properties of flexibility and general applicability, which are lacking in other heuristic algorithms, make the GA and SA attractive for automatic parallelization systems.  相似文献   

16.
In this paper, we propose and analyze a new interconnection network, the k-ary hypercube. This new architecture captures the advantages of the mesh network and those of the binary hypercube. We show that the hamiltoniacity of this network and its capability of efficiently simulating other topologies. It has a smaller degree than that of its equivalent binary hypercube (the one with at least as many nodes) and has a smaller diameter than its equivalent mesh of processors.  相似文献   

17.
All-to-all personalized exchange is one of the most dense collective communication patterns and occurs in many important applications in parallel computing. Previous all-to-all personalized exchange algorithms were mainly developed for hypercube and mesh/torus networks. Although the algorithms for a hypercube may achieve optimal time complexity, the network suffers from unbounded node degrees and thus has poor scalability in terms of I/O port limitation in a processor. On the other hand, a mesh/torus has a constant node degree and better scalability in this aspect, but the all-to-all personalized exchange algorithms have higher time complexity. In this paper, we propose an alternative approach to efficient all-to-all personalized exchange by considering another important type of networks, multistage networks, for parallel computing systems. We present a new all-to-all personalized exchange algorithm for a class of unique-path, self-routable multistage networks. We first develop a generic method for decomposing all-to-all personalized exchange patterns into some permutations which are realizable in these networks, and then present a new all-to-all personalized exchange algorithm based on this method. The newly proposed algorithm has O(n) time complexity for an n×n network, which is optimal for all-to-all personalized exchange. By taking advantage of fast switch setting of self-routable switches and the property of a single input/output port per processor in a multistage network, we believe that a multistage network could be a better choice for implementing all-to-all personalized exchange due to its shorter communication latency and better scalability  相似文献   

18.
The well-known torus an its variants,which we call hyper-rings,as well as hypercube architectures are further studied and evaluated as interconnecion networks for multicomputers,Comparisons are made among hyper-rings and between hyper-ring and hypercube networks under different communication patterns.It is concluded that although it is believed that a hypercube is generally superior to hyper-rings in performance,this is not always the case,paricu larly for locally constrained applications,where communications occur mostly among neighboring nodes.  相似文献   

19.
Partitioning and mapping of nested loops for linear array multicomputers   总被引:1,自引:1,他引:0  
In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project ann-dimensional computational structure, representing ann-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers.  相似文献   

20.
This paper introduces the parallelization on a distributed memory multicomputer of two iterative methods for finding all the roots of a given polynomial. The parallel algorithms share the computation of the roots among the processors and perform a total exchange of the data at each step. Since the amount of communications is the main drawback of this approach, we study the effect of the network topology on the performance of the algorithms. Particularly, we show that among the different classical processors networks topologies (ring, 2d-torus or n-cube), the hypercube topology minimizes the communications. For each topology is computed the optimal number of processors. Experiments on the hypercube FPS T40 illustrate the results.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号