
1.

Incremental design of scalable interconnection networks using basic building blocks

Yang M. Ni L.M. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(11):1126-1140

In this paper, we present an incremental design of scalable interconnection networks in multicomputer systems using basic building blocks. Both network topologies and routing algorithms are considered. We use wormhole-routed small-scale 2D meshes as basic building blocks. The minimum requirement to expand these networks is a single building block. This implies that the network does not have to maintain the regular 2D mesh topology. Some new topologies are introduced: incomplete meshes, based on adaptive routing algorithms designed from the turn model, and extended incomplete meshes, based on XY routing. We show that the original routing algorithm can be adopted to send a message between any source and destination without using store-and-forward and without causing deadlock. Constructing the network incrementally in this way requires little or no rewiring and preserves the high bisection density and short diameter of the network. The design methods can be used to economically and incrementally build expandable and scalable parallel computers.
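As a point of reference for the XY routing named in this abstract, here is a minimal sketch of plain dimension-order (XY) routing on a complete 2D mesh. It does not model the paper's incomplete or extended incomplete meshes; the function name `xy_route` and the `(x, y)` coordinate convention are assumptions made for illustration.

```python
def xy_route(src, dst):
    """Return the hop-by-hop path from src to dst on a complete 2D mesh.

    Nodes are (x, y) tuples. The message first travels along the X
    dimension until the column matches, then along the Y dimension.
    """
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:          # phase 1: correct the X coordinate
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:          # phase 2: correct the Y coordinate
        y += step_y
        path.append((x, y))
    return path
```

Because every message finishes its X movement before turning into Y, no message ever turns from Y back into X, which is the usual argument for why XY routing on a full mesh is deadlock-free.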

2.

On the Performance of Parallel Matrix Factorisation on the Hypermesh

Al-Ayyoub A. Ould-Khaoua M. Day K. 《The Journal of supercomputing》2001,20(1):37-53

Most common multicomputer networks, e.g. d-ary h-cubes, are graph topologies where an edge (channel) interconnects exactly two vertices (nodes). Hypergraphs are a generalisation of the graph model, where a channel interconnects an arbitrary number of nodes. Previous studies have used synthetic workloads (e.g. statistical distributions) to stress the superior performance characteristics of regular multi-dimensional hypergraphs, also known as hypermeshes, over d-ary h-cubes. There has been, however, hardly any study that has considered real-world parallel applications. This paper contributes towards filling this gap by providing a comparative study of the performance of one of the most common numerical problems, namely matrix factorisation, on the hypermesh, hypercube, and d-ary h-cube. To this end, the paper first introduces orthogonal networks as a unified model for describing both the graph and hypergraph topologies. It then develops a generalised parallel algorithm for matrix factorisation and evaluates its performance on the hypermesh, hypercube and d-ary h-cube. The results reveal that the hypermesh supports matrix computation more efficiently, and therefore provides more evidence of the hypermesh as a viable network for future large-scale multicomputers.

3.

YOMNA – An efficient deadlock-free multicast wormhole algorithm in 2-D mesh multicomputers

《Journal of Systems Architecture》2000,46(12):1073-1091

A mesh network is a popular architecture which has been implemented in many multicomputer systems. It is preferred because it offers useful edge connectivity and can be partitioned into units that are still meshes. It is also scalable and has a number of features that make it particularly amenable to high-performance computing. The 2-D mesh topology has become increasingly important to multicomputer design because of its many desirable properties including scalability, low bandwidth and fixed degree of nodes. The essential pattern in new multicomputer generations is the multicast wormhole pattern, which corresponds to one-to-many communication in which one source sends the same message to multiple destination nodes. In wormhole routing, a message is decomposed into words or flits, and flits of one message may be spread out among several nodes. Deadlock in the interconnection network occurs when no message can advance towards its destination. Some deadlock-free routing algorithms for wormhole routing have been proposed, but they did not take network latency and network traffic into consideration. An optimal message routing should achieve both minimum traffic and minimum latency for the communication patterns involved. Unfortunately, finding optimal message routing has been shown to be NP-hard for most common multicomputer topologies. In this paper, an efficient algorithm (YOMNA) is introduced to find a deadlock-free multicast wormhole routing in 2-D mesh parallel machines. The YOMNA algorithm is a tree-based technique, in which the router simultaneously sends incoming flits on more than one outgoing channel. The YOMNA algorithm is compared with dual-path multicast routing, which is a path-based technique. The YOMNA algorithm has been proved to be deadlock-free. The network latency and the network traffic are calculated for the YOMNA algorithm and for dual-path multicast routing. The results demonstrate that the YOMNA algorithm outperforms dual-path routing.
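For readers unfamiliar with the dual-path baseline named in this abstract, the sketch below shows the destination-partitioning step as dual-path multicast is commonly described for 2-D meshes: nodes are labeled along a boustrophedon (snake-like) Hamiltonian path, and the destinations are split into those labeled above and below the source. This is not YOMNA itself, and the function names and the `width` parameter are assumptions for illustration.

```python
def hamiltonian_label(x, y, width):
    """Label node (x, y) by its position on a snake-like Hamiltonian path.

    Even rows are traversed left to right, odd rows right to left.
    """
    return y * width + (x if y % 2 == 0 else width - 1 - x)

def dual_path_split(src, dests, width):
    """Split destinations into a 'high' and a 'low' group relative to src.

    The high group is sorted by ascending label and the low group by
    descending label, so each group can be visited by one path that
    moves monotonically along the labeling.
    """
    s = hamiltonian_label(*src, width)
    labeled = [(hamiltonian_label(x, y, width), (x, y)) for x, y in dests]
    high = [node for label, node in sorted(labeled) if label > s]
    low = [node for label, node in sorted(labeled, reverse=True) if label < s]
    return high, low
```

A path-based scheme like this sends at most two message copies from the source, whereas a tree-based scheme replicates flits inside routers, which is the contrast the abstract draws.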

4.

A class of highly scalable optical crossbar-connected interconnection networks (SOCNs) for parallel computing systems

Webb B. Louri A. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(5):444-458

A class of highly scalable interconnect topologies called the Scalable Optical Crossbar-Connected Interconnection Networks (SOCNs) is proposed. This proposed class of networks combines the use of tunable Vertical Cavity Surface Emitting Lasers (VCSELs), Wavelength Division Multiplexing (WDM) and a scalable, hierarchical network architecture to implement large-scale optical crossbar based networks. A free-space and optical waveguide-based crossbar interconnect utilizing tunable VCSEL arrays is proposed for interconnecting processor elements within a local cluster. A similar WDM optical crossbar using optical fibers is proposed for implementing intercluster crossbar links. The combination of the two technologies produces large-scale optical fan-out switches that could be used to implement relatively low cost, large scale, high bandwidth, low latency, fully connected crossbar clusters supporting up to hundreds of processors. An extension of the crossbar network architecture is also proposed that implements a hybrid network architecture that is much more scalable. This could be used to connect thousands of processors in a multiprocessor configuration while maintaining a low latency and high bandwidth. Such an architecture could be very suitable for constructing relatively inexpensive, highly scalable, high bandwidth, and fault-tolerant interconnects for large-scale, massively parallel computer systems. This paper presents a thorough analysis of two example topologies, including a comparison of the two topologies to other popular networks. In addition, an overview of a proposed optical implementation and power budget is presented, along with analysis of proposed media access control protocols and corresponding optical implementation.

5.

Hung-Yi Teng Chien-Nan Lin Ren-Hung Hwang 《Information Systems Frontiers》2014,16(1):45-58

Unstructured peer-to-peer (P2P) overlay networks with a two-layer hierarchy, comprising an upper layer of super-peers and an underlying layer of ordinary peers, are used to improve the performance of large-scale P2P applications such as content distribution and storage. To cope with the continuous growth in participating peers, a scalable and efficient super-peer overlay topology is essential. However, relatively little research has been conducted on constructing such a super-peer overlay topology. In existing solutions, the number of connections that a super-peer must maintain is directly proportional to the total number of super-peers. For very large-scale P2P applications, i.e. those with more than 1,000,000 participating peers, these solutions are neither scalable nor practical. Therefore, in this paper, we propose a scalable hierarchical unstructured P2P system in which a self-similar square network graph (SSNG) is used to construct and maintain the super-peer overlay topology adaptively. The SSNG topology is a constant-degree topology in which each node maintains a constant number of neighbor nodes. Moreover, a simple and efficient message forwarding algorithm is presented to ensure that each super-peer receives just one flooding message. The analytical results show that the proposed SSNG-based overlay is more scalable and efficient than the perfect difference graph (PDG)-based overlay proposed in the literature.

6.

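Analysis of the relationship between multicomputer interconnection networks and parallel algorithm structure, with a simulation study on a Transputer network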

林成江《小型微型计算机系统》1993,(2)

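This paper models multicomputer interconnection networks and analyzes how the structure of a parallel algorithm affects the performance of various interconnection topologies, as well as the communication overhead of exchanging information between nodes in a multicomputer system. The model is then simulated on a Transputer network using TRANSIM.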

7.

A location management strategy for Ad Hoc networks using hierarchical forwarding pointers with a threshold

李皓 李德敏 蔡元正 《传感器与微系统》2008,27(1):33-35,38

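In mobile Ad Hoc networks, self-organization and frequent node movement cause the network topology to change frequently. As the number of nodes grows, network overhead increases rapidly, which directly limits the scalability of the network. Location management strategies are therefore increasingly being introduced into mobile Ad Hoc networks to reduce this overhead. This paper combines hierarchical forwarding pointers with a threshold and proposes a new location management strategy. A simulation-based comparison with other strategies shows that it scales better.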

8.

Optimizing I/O server placement for parallel I/O on switch-based irregular networks

Yih-Fang Lin Chien-Min Wang Jan-Jan Wu 《The Journal of supercomputing》2006,36(3):201-217

In this paper, we study I/O server placement for optimizing parallel I/O performance on switch-based clusters, which typically adopt irregular network topologies to allow construction of scalable systems with incremental expansion capability. Finding an optimal solution to this problem is computationally intractable. We quantify the number of messages travelling through each network link by a workload function, and develop three heuristic algorithms to find good solutions based on the values of the workload function. The maximum-workload-based heuristic chooses the locations for I/O nodes in order to minimize the maximum value of the workload function. The distance-based heuristic aims to minimize the average distance between the compute nodes and I/O nodes, which is equivalent to minimizing the average workload on the network links. The load-balance-based heuristic balances the workload on the links based on a recursive traversal of the routing tree for the network. Our simulation results demonstrate the performance advantage of our algorithms over a number of algorithms commonly used in existing parallel systems. In particular, the load-balance-based algorithm is superior to the other algorithms in most cases, with an improvement ratio of 10% to 95% in terms of parallel I/O throughput.
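To make the distance-based objective in this abstract concrete, the sketch below picks I/O node locations that reduce the average hop distance from each compute node to its nearest I/O node. The greedy selection, the adjacency-dictionary graph format, and all function names are assumptions for illustration; the abstract does not say how the paper's heuristic searches for placements.

```python
from collections import deque

def hop_distances(adj, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def greedy_io_placement(adj, compute_nodes, candidates, k):
    """Greedily choose k I/O nodes to lower the average compute-to-I/O distance.

    Assumes the graph is connected. Greedy choice is an illustrative
    stand-in, not the algorithm from the paper.
    """
    dist = {c: hop_distances(adj, c) for c in candidates}
    nearest = {n: float("inf") for n in compute_nodes}
    chosen = []
    for _ in range(k):
        def cost(c):
            # Total distance if candidate c were added to the chosen set.
            return sum(min(nearest[n], dist[c][n]) for n in compute_nodes)
        best = min((c for c in candidates if c not in chosen), key=cost)
        chosen.append(best)
        for n in compute_nodes:
            nearest[n] = min(nearest[n], dist[best][n])
    return chosen, sum(nearest.values()) / len(compute_nodes)
```

On a five-node chain, for example, a single I/O node lands in the middle, since that position minimizes the summed hop count to all other nodes.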

9.

ComPaSS: A Communication Package for Scalable Software Design

《Journal of Parallel and Distributed Computing》1994,22(3):449-461

In massively parallel computers (MPCs), efficient communication among processors is critical to performance. This paper describes the initial implementation of the ComPaSS communication library to support scalable software development in MPCs. ComPaSS provides high-level global communication operations for both data manipulation and process control, many of which are based upon a small set of low-level communication primitives. The low-level operations of the ComPaSS library are provably optimal for a class of architectures representative of many commercial scalable systems, in particular those using wormhole routing and n-dimensional mesh network topologies. This paper concentrates on the multicast and multireceive components of the ComPaSS library, which are fundamental to implementing efficient high-level data parallel operations. The design of the multicast and multireceive primitives is described and an example of a data parallel application utilizing ComPaSS multicast is given. The scalability of these primitives is discussed, and improvements in performance resulting from use of the library on a 64-node nCUBE-2 are presented.

10.

An optimized broadcasting technique for WK-recursive topologies

G. Della Vecchia C. Sanges 《Future Generation Computer Systems》1990,5(4):353-357

This paper describes a study carried out as part of research within the Strategic Project on Parallel Computation, supported by the National Research Council of Italy, in which the Hybrid Computing Research Center is collaborating with the Department of Computer Science of the University of Pisa on the design of a massively parallel system based on VLSI components having a direct network message passing architecture. Within this project the authors devised a new general class of communication network topology, which they called WK-recursive, to meet the demand for scalable and efficient communication structures for very large parallel systems. The object of the present paper is to describe an optimized technique for performing effective broadcasting operations on networks belonging to the WK-recursive class, one prototype of which has been realized at the Hybrid Computing Research Center. This technique is characterized by a number of properties which will be profitably exploited in the design of the distributed operating system for such a massively parallel system, especially as far as the kernel run-time support is concerned.

11.

A restricted multicast routing algorithm for triple-based hierarchical networks

乔保军 石峰 计卫星 《计算机应用》2007,27(4):801-804

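Multicast routing algorithms play an important role in the communication performance of interconnection networks and in the overall performance of multiprocessor systems. For the triple-based hierarchical interconnection network, this paper proposes TRMA, a tree-based restricted multicast routing algorithm that balances performance, cost, and ease of implementation. The algorithm routes messages by exploiting the hierarchical structure of the network and the topology information embedded in the node codes; it is simple in design and easy to implement in hardware. Unlike other tree-based multicast routing algorithms, TRMA does not require the source node to build a multicast tree and store it in the message before sending, which greatly reduces the workload of the source node and improves overall system performance. Simulations comparing TRMA with unicast-based multicast routing show that TRMA achieves lower network latency and less network traffic.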

12.

Multipath Dissemination in Regular Mesh Topologies

Mamidisetty Kranthi K. Duan Minlan Sastry Shivakumar Sastry P.S. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(8):1188-1201

Mesh topologies are important for large-scale peer-to-peer systems that use low-power transceivers. The Quality of Service (QoS) in such systems is known to decrease as the scale increases. We present a scalable approach for dissemination that exploits all the shortest paths between a pair of nodes and improves the QoS. Despite the presence of multiple shortest paths in a system, we show that these paths cannot be exploited by spreading the messages over the paths in a simple round-robin manner; nodes along one of these paths will always handle more messages than the nodes along the other paths. We characterize the set of shortest paths between a pair of nodes in regular mesh topologies and derive rules, using this characterization, to effectively spread the messages over all the available paths. These rules ensure that all the nodes that are at the same distance from the source handle roughly the same number of messages. By modeling the multihop propagation in the mesh topology as a multistage queuing network, we present simulation results from a variety of scenarios that include link failures and propagation irregularities to reflect real-world characteristics. Our method achieves improved QoS in all these scenarios.
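The load imbalance described in this abstract can be loosely illustrated by counting shortest paths on a 2-D grid. The sketch below assumes a rectangular mesh with the source at the origin and the destination at offset `(dx, dy)`; it counts all shortest paths and those passing through a given intermediate node. It does not reproduce the paper's characterization or its spreading rules, and the function names are hypothetical.

```python
from math import comb

def shortest_path_count(dx, dy):
    """Number of shortest paths between two mesh nodes dx columns and dy rows apart."""
    return comb(dx + dy, dx)

def paths_through(i, j, dx, dy):
    """Number of shortest paths from (0, 0) to (dx, dy) that pass through (i, j).

    Requires 0 <= i <= dx and 0 <= j <= dy.
    """
    return comb(i + j, i) * comb((dx - i) + (dy - j), dx - i)
```

For a destination at offset (2, 2) there are 6 shortest paths. Of the three nodes two hops from the source, (1, 1) lies on 4 of them while (2, 0) and (0, 2) each lie on 1, so spreading messages evenly over paths loads same-distance nodes unevenly.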

13.

HiCOO: Hierarchical cooperation for scalable communication in Global Address Space programming models on Cray XT systems

Weikuan Yu Xinyu Que Vinod Tipparaju Jeffrey S. Vetter 《Journal of Parallel and Distributed Computing》2012

Global Address Space (GAS) programming models enable a convenient, shared-memory style addressing model. Typically this is realized through one-sided operations that can enable asynchronous communication and data movement. With the size of petascale systems reaching 10,000s of nodes and 100,000s of cores, the underlying runtime systems face critical challenges in (1) scalably managing resources (such as memory for communication buffers), and (2) gracefully handling unpredictable communication patterns and any associated contention. For any solution that addresses these resource scalability challenges, equally important is the need to maintain the performance of GAS programming models. In this paper, we describe a Hierarchical COOperation (HiCOO) architecture for scalable communication in GAS programming models. HiCOO formulates a cooperative communication architecture: with inter-node cooperation amongst multiple nodes (a.k.a. multinode) and hierarchical cooperation among multinodes that are arranged in various virtual topologies. We have implemented HiCOO for a popular GAS runtime library, Aggregate Remote Memory Copy Interface (ARMCI). By extensively evaluating different virtual topologies in HiCOO in terms of their impact on memory scalability, network contention, and application performance, we identify MFCG as the most suitable virtual topology. The resulting HiCOO architecture is able to realize scalable resource management and achieve resilience to network contention, while at the same time maintaining or enhancing the performance of scientific applications. In one case, it reduces the total execution time of an NWChem application by 52%.

14.

A scalable multibus configuration for connecting transputer links

Adda M. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(3):245-253

The paper presents the development and the performance of a novel bus-based message passing interconnection scheme which can be used to join a large number of INMOS transputers via their serial communication links. The main feature of this architecture is that it avoids the communication overhead which occurs in systems where processing nodes relay communications to their neighbors. It also produces a flexible and scalable machine whose attractive characteristics are its simplicity and low latency for large configurations. We show that this architecture is free from deadlock, exhibits much smaller latency than most directly connected transputer networks and has a scalable bandwidth, in contrast to other bus topologies.

15.

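Application of the fully connected cubic network in parallel processing systems (total citations: 3; self-citations: 1; citations by others: 2)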

王洪玉 董秀国 《计算机研究与发展》2001,38(5):609-615

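This paper proposes a recursive, multi-level hierarchical interconnection network with constant node degree for massively parallel processing systems, called the fully connected cubic network (FCCN). FCCN offers good scalability and extensibility. An m-FCCN is obtained recursively from eight (m-1)-FCCNs. The node degree of an FCCN is a constant 4, independent of the size of the network, and both the network diameter and the average node distance are proportional to the cube root of the number of nodes. A simple routing algorithm for FCCN is presented. The FCCN interconnection structure is applied in a large-scale hybrid optoelectronic processing system, and computed results show that FCCN achieves relatively high parallel processing efficiency.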
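The scaling stated in this abstract (an m-FCCN is built from eight (m-1)-FCCNs, and the diameter grows with the cube root of the node count) can be worked through numerically. In the sketch below, the size of the level-0 network, `base_nodes=8`, is a hypothetical value chosen for illustration; the abstract does not state it.

```python
def fccn_nodes(m, base_nodes=8):
    """Node count of an m-FCCN, assuming a 0-FCCN has base_nodes nodes.

    Each level multiplies the node count by 8, as stated in the abstract.
    base_nodes is a hypothetical parameter, not taken from the paper.
    """
    return base_nodes * 8 ** m

def cube_root_growth(m):
    """Growth factor of any quantity proportional to N^(1/3) from level 0 to level m.

    Since N grows by 8^m, its cube root grows by (8^m)^(1/3) = 2^m.
    """
    return 2 ** m
```

Under these assumptions, each added level multiplies the node count by 8 but only doubles the diameter and the average node distance, while the node degree stays at 4.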

16.

An InfiniBand-based multi-link mesh/torus interconnection network for large-scale parallel systems

夏晓爽 刘轶 王允彬 钱德沛 《计算机研究与发展》2012,49(1):76-82

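The design of the system-level interconnection network is critical in large-scale parallel systems. InfiniBand, a high-performance switched network, is widely used in massively parallel processing systems. Compared with the fat-tree topology that is currently common in InfiniBand networks, the mesh/torus topology offers better performance and scalability. Nevertheless, studies have found that building an InfiniBand interconnection network with a traditional mesh/torus topology raises a number of problems. This paper analyzes the shortcomings of the traditional topology and proposes an InfiniBand-based multi-link mesh/torus interconnection network. By making full use of multiple links between switches, the improved topology achieves higher bandwidth than a traditional mesh/torus network. An efficient routing algorithm matched to the topology is also given. Finally, the proposed algorithm is evaluated by network simulation, and the results show that it offers better performance and scalability than other routing algorithms.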

17.

Finding the roots of a polynomial on an MIMD multicomputer

Michel Cosnard Pierre Fraigniaud 《Parallel Computing》1990,15(1-3):75-85

This paper introduces the parallelization on a distributed memory multicomputer of two iterative methods for finding all the roots of a given polynomial. The parallel algorithms share the computation of the roots among the processors and perform a total exchange of the data at each step. Since the amount of communication is the main drawback of this approach, we study the effect of the network topology on the performance of the algorithms. In particular, we show that among the classical processor network topologies (ring, 2D torus or n-cube), the hypercube topology minimizes the communications. The optimal number of processors is computed for each topology. Experiments on the FPS T40 hypercube illustrate the results.

18.

A family of simple distributed minimum connected dominating set-based topology construction algorithms

Pedro M. Wightman Miguel A. Labrador 《Journal of Network and Computer Applications》2011,34(6):1997-2010

This paper considers the problem of topology construction to save energy in wireless sensor networks. The proposed topology construction mechanisms build reduced topologies using the Connected Dominating Set approach in a distributed, efficient, and simple manner. This problem is very challenging because the solution must provide a connected network with complete coverage of the area of interest using the minimum number of nodes possible. Further, the algorithms need to be computationally inexpensive and the protocols simple enough in terms of their message and computation complexity, so they do not consume more energy creating the reduced topology than the energy that they are supposed to save. In addition, it is desirable to reduce or completely eliminate the need for localization mechanisms since they introduce additional costs and energy consumption. To this end, we present the family of A3 distributed topology construction algorithms, four simple algorithms that build reduced topologies with very low computational and message complexity without the need for localization information: A3, A3Cov, A3Lite and A3CovLite. The algorithms are compared in sparse and dense networks against optimal theoretical bounds for connected-coverage topologies and two distributed heuristics found in the literature, using the number of active nodes and the ratio of coverage as the main performance metrics. The results demonstrate that there is no clear winner; rather, trade-offs exist. If coverage is not as critical as energy (network lifetime), it would be better to use A3Lite, as it needs fewer nodes and messages. If coverage is very important for the application, then A3CovLite is the best option, mostly because of its lower message complexity.

19.

FIR: An efficient routing strategy for tori and meshes

《Journal of Parallel and Distributed Computing》2006,66(7):907-921

Recent massively parallel computers are based on clusters of PCs. These machines use one of the recently proposed standard interconnects. These interconnects use either source routing or distributed routing based on forwarding tables. While source routers are simpler, distributed routers provide more flexibility, allowing the network to achieve higher performance. Distributed routing can be implemented by fixed hardware specific to a routing function on a given topology or by using forwarding tables. The main problem with the latter approach is the lack of scalability of forwarding tables. In this paper, we propose a distributed routing strategy for commercial switches, flexible interval routing, that is scalable in both memory and routing time because it is not based on tables. At the same time, the strategy is easy to reconfigure and can implement the most commonly used routing algorithms in the most widely used regular topologies.

20.

A scalable distributed dynamic interval mapping algorithm

刘仲 周兴铭 《计算机学报》2006,29(10):1757-1763

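This paper proposes a scalable distributed dynamic interval mapping algorithm that supports weighted data distribution. When the set of storage nodes changes, the algorithm immediately rebalances the distribution of data objects according to the available resources, migrates data objects from all storage nodes in parallel, and migrates the minimum number of data objects. Building on this, a distributed node address calculation algorithm is proposed that lets compute nodes learn autonomously through a view correction algorithm and adapt automatically to the new system scale. This removes the centralized access bottleneck of existing approaches and makes the system highly scalable.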
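As background for the weighted interval mapping idea in this abstract, the sketch below shows only the basic lookup: the unit interval is divided among storage nodes in proportion to their weights, and a point in [0, 1) is mapped to the node whose sub-interval contains it. It does not implement the paper's minimal-migration rebalancing or its view correction algorithm, and the function names and data layout are assumptions for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

def build_intervals(weights):
    """Divide [0, 1) among nodes in proportion to their weights.

    weights: dict mapping node name to a positive weight.
    Returns the node list and the upper bound of each node's interval.
    """
    nodes = list(weights)
    total = sum(weights.values())
    bounds = [c / total for c in accumulate(weights[n] for n in nodes)]
    return nodes, bounds

def lookup(nodes, bounds, x):
    """Map a point x in [0, 1) to the node whose interval contains it."""
    idx = bisect_right(bounds, x)
    return nodes[min(idx, len(nodes) - 1)]
```

In practice `x` would typically come from hashing an object identifier into [0, 1). With weights `{"a": 1, "b": 3}`, node `a` owns [0, 0.25) and node `b` owns [0.25, 1), so `b` receives about three times as many objects.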