
1.

Parallel routing algorithms for nonblocking electronic and photonic switching networks

Lu E. Zheng S.Q. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(8):702-713

We study the connection capacity of a class of rearrangeable nonblocking (RNB) and strictly nonblocking (SNB) networks with/without crosstalk-free constraint, model their routing problems as weak or strong edge-colorings of bipartite graphs, and propose efficient routing algorithms for these networks using parallel processing techniques. This class of networks includes networks constructed from banyan networks by horizontal concatenation of extra stages and/or vertical stacking of multiple planes. We present a parallel algorithm that runs in O(lg² N) time for the RNB networks of complexities ranging from O(N lg N) to O(N^1.5 lg N) crosspoints and parallel algorithms that run in O(min{d* lg N, √N}) time for the SNB networks of O(N^1.5 lg N) crosspoints, using a completely connected multiprocessor system of N processing elements. Our algorithms can be translated into algorithms with an O(lg N lg lg N) slowdown factor for the class of N-processor hypercubic networks, whose structures are no more complex than a single plane in the RNB and SNB networks considered.

2.

Work-efficient routing algorithms for rearrangeable symmetrical networks

Cam H. Fortes J.A.B. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(7):733-741

The work performed by a parallel algorithm is the product of its running time and the number of processors it requires. This paper presents work-efficient (or cost-optimal) routing algorithms to determine the switch settings for realizing permutations on rearrangeable symmetrical networks such as Benes and the reduced Ω_N Ω_N^-1. These networks have 2n-1 stages with N=2ⁿ inputs/outputs, each stage consisting of N/2 crossbar switches of size (2×2). Previously known parallel routing algorithms for a rearrangeable network with N inputs determine the states of all switches recursively in O(n) iterations using N processors. Each iteration determines the switch settings of at most two stages of the network and requires at least O(n) time on a computer of N processors, regardless of the type of its interconnection network. Hence, the work of any previously known parallel routing algorithm equals at least O(Nn²) for setting up all the switches of a rearrangeable network. The new routing algorithms run on a computer of p processors, 1⩽p⩽N/n, and perform work O(Nn). Moreover, because the range of p is large, the new routing algorithms do not have to be changed in case some processors become faulty.
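The size and work figures quoted in this abstract can be tabulated with a short Python sketch. It is only arithmetic on the stated formulas: the function names are hypothetical, and the work values drop all constant factors, so they indicate orders of magnitude rather than measured costs.

```python
def benes_parameters(n: int) -> dict:
    """Size and work figures for a rearrangeable network with N = 2**n inputs.

    Uses only the formulas quoted in the abstract; constants hidden by the
    O-notation are dropped, so the work figures are orders of magnitude.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    inputs = 2 ** n
    stages = 2 * n - 1                 # 2n-1 stages
    switches_per_stage = inputs // 2   # N/2 switches of size 2x2 per stage
    return {
        "inputs": inputs,
        "stages": stages,
        "switches_per_stage": switches_per_stage,
        "total_switches": stages * switches_per_stage,
        "work_previous": inputs * n * n,  # O(N n^2): N processors, O(n) iterations of O(n) time
        "work_new": inputs * n,           # O(N n)
    }


def new_algorithm_time(n: int, p: int) -> float:
    """Time implied by work O(N n) on p processors, for 1 <= p <= N/n."""
    inputs = 2 ** n
    if not 1 <= p <= inputs / n:
        raise ValueError("p must satisfy 1 <= p <= N/n")
    return inputs * n / p


if __name__ == "__main__":
    print(benes_parameters(3))
    print(new_algorithm_time(4, 4))
```

For n = 3 the sketch reports 8 inputs, 5 stages and 20 switches in total, and the ratio between the two work figures is n.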

3.

Parallel constant-time connectivity algorithms on a reconfigurable network of processors

Alnuweiri H.M. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(1):105-110

This short note presents constant-time algorithms for labeling the connected components of an image on a network of processors with a wide reconfigurable bus. The algorithms are based on a processor indexing scheme which employs constant-weight codes. The use of such codes enables identifying a single representative processor for each component in a constant number of steps. The proposed algorithms can label an N×N image in O(1) time using N² processors, which is optimal. Furthermore, the proposed techniques lead to an O(log N/log log N)-time image labeling algorithm on a network of N² processors with a reconfigurable bus of width log N bits. It is shown that these techniques can be applied to labeling an undirected N-vertex graph represented by an adjacency matrix.

4.

Hypercube sandwich approach to conferencing

Joanne F. Houlahan Lenore J. Cowen Gerald M. Masson 《The Journal of supercomputing》1996,10(3):271-283

This paper presents a novel cascaded conference network that provides distributed processing and signal transmission among members of disjoint sets of generic send/receive devices called conferees. It assumes an online request model in which idle groups of conferees may request the formation of a conference interconnection. Once a conference is established, all conferees remain connected until the entire conference is dissolved. The Hypercube Sandwich Network (HSN) consists of two components. A bidirectional permutation network is used for routing purposes to and from a hypercube of special processing elements for the purpose of conference formation. The HSN achieves strictly nonblocking performance for N conferees using O(N log N) processing elements, and this is shown to be tight to within a log^(1/4) N factor. Previous constructions required a quadratic number of processing elements for strictly nonblocking performance or could only provide wide-sense nonblocking conferencing. If the stronger requirement is made that the communication delay is logarithmic in the conference size, a simple algorithm is presented for wide-sense nonblocking conferencing in an HSN with O(N log N) processing elements. An earlier version of this paper was presented at the 1995 International Conference on Parallel Processing Techniques and Applications.

5.

Optimal algorithms for the channel-assignment problem on a reconfigurable array of processors with wider bus networks

Shi-Jinn Horng Horng-Ren Tsai Yi Pan Seitzer J. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(11):1124-1138

The computation model on which the algorithms are developed is the reconfigurable array of processors with wider bus networks (abbreviated to RAPWBN). The main difference between the RAPWBN model and other existing reconfigurable parallel processing systems is that the bus width of each network is bounded within the range [2, [√N]]. Such a strategy not only saves silicon area on the chip and increases the computational power enormously, but also allows the execution speed of the proposed algorithms to be tuned by the bus bandwidth. To demonstrate the computational power of the RAPWBN, the channel-assignment problem is derived in this paper. For the channel-assignment problem with N pairs of components, we first design an O(T + [N/ω]) time parallel algorithm using 2N processors with a 2N-row by 2N-column bus network, where the bus width of each bus network is ω-bit for 2 ≤ ω ≤ [√N] and T = [log_ω N] + 1. By tuning the bus bandwidth to the natural log N-bit and the extended N^(1/c)-bit (N^(1/c) > log N) for any constant c ≥ 1, two more results which run in O(log N/log log N) and O(1) time, respectively, are also derived. When compared to the algorithms proposed by Olariu et al. [17] and Lin [14], it is shown that our algorithm runs in the equivalent time complexity while significantly reducing the number of processors to O(N).

6.

A self-routing nonblocking multistage network based on binary path-finding and multiple Omega networks

Zhang Lian, Gu Naijie, Liu Gang 《计算机应用》2005,25(12):2923-2924

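This paper proposes a new self-routing nonblocking multistage network that can carry arbitrary multicast signals between its inputs and outputs without blocking. The network is built by recursive construction and is based on the concept of binary splitting. It is constructed recursively from one binary splitting network and two half-size multicast routing networks. A multicast signal is replicated by the first Omega network and split in binary fashion toward the output ports; it then enters an N×N Omega×Omega^-1 network, followed by an N/2×N/2 Omega×Omega^-1 network, and so on. Each Omega×Omega^-1 network permutes its inputs to the upper half or the lower half of its outputs according to the current significant bit of the destination address. The signals then enter the upper and lower sub-Omega×Omega^-1 networks, which process them in the same way, and so on until all significant address bits have been handled, which completes the self-routed nonblocking multicast transmission. Because the Omega×Omega^-1 networks of different sizes can all be set up and routed in parallel, the setup time of this new multi-Omega network is O(N log N), the routing time is O(log² N), and the hardware cost is O(N log² N). Its cost is lower than that of currently known multicast network designs.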
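As background for the self-routing idea, the following Python sketch shows the standard destination-tag routing of a single unicast message through one Omega network with N = 2**n ports. It is not the multicast construction of this paper. The function name `omega_route` and the way port positions are represented are assumptions made for illustration.

```python
from typing import List


def omega_route(src: int, dst: int, n: int) -> List[int]:
    """Port positions visited by one message in an n-stage Omega network.

    Each stage applies a perfect shuffle (rotate the n-bit position left by
    one) and then a 2x2 switch that sets the lowest bit to the next
    destination bit, most significant bit first.
    """
    size = 1 << n
    if n < 1 or not (0 <= src < size and 0 <= dst < size):
        raise ValueError("need n >= 1 and 0 <= src, dst < 2**n")
    mask = size - 1
    position = src
    path = [position]
    for stage in range(n):
        # Perfect shuffle: rotate the n-bit address left by one position.
        position = ((position << 1) | (position >> (n - 1))) & mask
        # Switch setting: the low bit becomes the destination bit for this stage.
        bit = (dst >> (n - 1 - stage)) & 1
        position = (position & ~1) | bit
        path.append(position)
    return path


if __name__ == "__main__":
    print(omega_route(5, 2, 3))
```

After n stages every bit of the position has been replaced by a destination bit, so the message arrives at `dst` without any central routing table. Two messages can still collide inside a single Omega network, which is why nonblocking designs add extra structure.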

7.

2-Dilated flattened butterfly: A nonblocking switching topology for high-radix networks

Ajithkumar Thamarakuzhi John A. Chandy 《Computer Communications》2011,34(15):1822-1835

High-performance computing is highly dependent on the communication network connecting the nodes. In this paper, we propose a 2-Dilated flattened butterfly (2DFB) network which provides non-blocking performance for relatively low cost overhead. We study the topological properties of the proposed 2DFB network and compare it with different nonblocking switching topologies. We observe that a dilation factor of two is sufficient to obtain nonblocking property for a flattened butterfly structure irrespective of its size or dimension. Dilating each link in a flattened butterfly causes an increase in cost. Therefore, we modeled the implementation cost of a 2DFB network and compared it with other popular nonblocking networks. We observe that the cost of a 2DFB is less than other nonblocking networks, while at the same time providing reduced latency because of its reduced diameter and hop count. We also propose a procedure to develop a conflict-free static routing schedule as well as an adaptive load balanced routing scheme (ALDFB) for 2DFB networks. Finally, we also describe the hardware implementation of a 2DFB network using the NetFPGA as the switching element and verify the nonblocking behavior of a 2DFB. We also show that the 2DFB topology can be used to build high speed switching systems with reduced cost.

8.

Massively parallel algorithms for trace-driven cache simulations

Nicol D.M. Greenberg A.G. Lubachevsky B.D. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(8):849-859

Considers the use of massively parallel architectures to execute a trace-driven simulation of a single cache set. A method is presented for the least-recently-used (LRU) policy, which, regardless of the set size C, runs in time O(log N) using N processors on the EREW (exclusive read, exclusive write) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. We present timings of this algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference-based line replacement policies is considered, which includes LRU as well as the least-frequently-used (LFU) and random replacement policies. A simulation method is presented for any such policy that, on any trace of length N directed to a C line set, runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
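To make the simulated object concrete, here is a sequential Python sketch of a trace-driven LRU simulation for one cache set of C lines. It is the plain one-reference-at-a-time baseline, not the parallel EREW method of the paper. The function name `simulate_lru_set` and the representation of a trace as a list of addresses are assumptions.

```python
from collections import OrderedDict


def simulate_lru_set(trace, capacity):
    """Return one boolean per reference in the trace: True for a hit.

    The OrderedDict keeps resident lines ordered from least recently used
    (front) to most recently used (back).
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    lines = OrderedDict()
    hits = []
    for address in trace:
        if address in lines:
            lines.move_to_end(address)       # refresh recency on a hit
            hits.append(True)
        else:
            if len(lines) >= capacity:
                lines.popitem(last=False)    # evict the least recently used line
            lines[address] = None
            hits.append(False)
    return hits


if __name__ == "__main__":
    print(simulate_lru_set([1, 2, 1, 3, 2, 4, 1], 3))
```

Each reference depends on the state left by all earlier references, which is the sequential dependency that a parallel simulation has to break.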

9.

Fully dynamic maintenance of k-connectivity in parallel

Weifa Liang Brent R.P. Hong Shen 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(8):846-864

Given a graph G=(V, E) with n vertices and m edges, the k-connectivity of G denotes either the k-edge connectivity or the k-vertex connectivity of G. In this paper, we deal with the fully dynamic maintenance of k-connectivity of G in the parallel setting for k=2, 3. We study the problem of maintaining k-edge/vertex connected components of a graph undergoing repeated dynamic updates, such as edge insertions and deletions, and answering the query of whether two vertices are included in the same k-edge/vertex connected component. Our major results are the following: (1) An NC algorithm for the 2-edge connectivity problem is proposed, which runs in O(log n log(m/n)) time using O(n^(3/4)) processors per update and query. (2) It is shown that the biconnectivity problem can be solved in O(log² n) time using O(nα(2n, n)/log n) processors per update and O(1) time with a single processor per query or in O(log n log(m/n)) time using O(nα(2n, n)/log n) processors per update and O(log n) time using O(nα(2n, n)/log n) processors per query, where α(.,.) is the inverse of Ackermann's function. (3) An NC algorithm for the triconnectivity problem is also derived, which takes O(log n log(m/n) + log n log log n/α(3n, n)) time using O(nα(3n, n)/log n) processors per update and O(1) time with a single processor per query. (4) An NC algorithm for the 3-edge connectivity problem is obtained, which has the same time and processor complexities as the algorithm for the triconnectivity problem. To the best of our knowledge, the proposed algorithms are the first NC algorithms for the problems using O(n) processors in contrast to Ω(m) processors for solving them from scratch. In particular, the proposed NC algorithm for the 2-edge connectivity problem uses only O(n^(3/4)) processors. All the proposed algorithms run on a CRCW PRAM.

10.

A star network approach in heterogeneous multiprocessors system on chip

Chao Wang Xi Li Junneng Zhang Xuehai Zhou Aili Wang 《The Journal of supercomputing》2012,62(3):1404-1424

Multiprocessor System on Chip (MPSoC) platform plays a vital role in parallel processor architecture design. However, with the growing number of processors, interconnect on chip is becoming one of the major bottlenecks of MPSoC architecture. In this paper, we propose a star network based on peer to peer links on FPGA. The star network utilizes fast simplex links (FSL) as basic structure to connect the scheduler with heterogeneous processing elements, including processors and hardware IP cores. Blocking and nonblocking application interfaces are provided for high level programming. We built a prototype system on FPGA to evaluate the transfer time and hardware cost of the proposed star network architecture. Experimental results demonstrated that the average transfer time for each word could be reduced to 7 cycles, which achieves 14× speedup against state-of-the-art shared memory designs in the literature. Moreover, the star network cost only 1.2% Flip Flops and 2.45% LUTs of a single FPGA.

11.

Efficient parallel implementation of Bose Hubbard model: Exact numerical ground states and dynamics of gaseous Bose-Einstein condensates

Mary Ann E. Leung William P. Reinhardt 《Computer Physics Communications》2007,177(4):348-356

We present a parallel implementation of the Bose Hubbard model, using imaginary time propagation to find the lowest quantum eigenstate and real time propagation for simulation of quantum dynamics. Scaling issues, performance of sparse matrix-vector multiplication, and a parallel algorithm for determining nonzero matrix elements are described. Implementation of imaginary time propagation yields an O(N) linear convergence on a single processor and slightly better than ideal performance on up to 160 processors for a particular problem size. The determination of the nonzero matrix elements is intractable using sequential non-optimized techniques for large problem sizes. Thus, we discuss a parallel algorithm that takes advantage of the intrinsic structural characteristics of the Fock-space matrix representation of the Bose Hubbard Hamiltonian and utilizes a parallel implementation of a Fock state look up table to make this task solvable within reasonable timeframes. Our parallel algorithm demonstrates near ideal scaling on thousands of processors. We include results for a matrix 22.6 million square, with 202 million nonzero elements, utilizing 2048 processors.

12.

Multiway merging in parallel

Zhaofang Wen 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(1):11-17

The problem of merging k (k⩾2) sorted lists is considered. We give an optimal parallel algorithm which takes O((n log k/p)+log n) time using p processors on a parallel random access machine that allows concurrent reads and exclusive writes, where n is the total size of the input lists. This algorithm achieves O(log n) time using p=n log k/log n processors. Most of the previous research for this problem has been focused on the case when k=2. Very recently, parallel solutions for the case when k=2 have been reported. Our solution is the first logarithmic time optimal parallel algorithm for the problem when k⩾2. It can also be seen as a unified optimal parallel algorithm for sorting and merging. In order to support the algorithm, a new processor assignment strategy is also presented.
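For reference, the sequential counterpart of this problem can be sketched in Python with a binary heap, which merges k sorted lists of total size n in O(n log k) time. This is the standard heap-based baseline, not the CREW PRAM algorithm of the paper. The function name `merge_sorted_lists` is an assumption.

```python
import heapq


def merge_sorted_lists(lists):
    """Merge k sorted lists into one sorted list using a heap of size at most k."""
    # Each heap entry is (value, list index, position inside that list).
    heap = [(lst[0], index, 0) for index, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, index, position = heapq.heappop(heap)
        merged.append(value)
        following = position + 1
        if following < len(lists[index]):
            heapq.heappush(heap, (lists[index][following], index, following))
    return merged


if __name__ == "__main__":
    print(merge_sorted_lists([[1, 4, 9], [2, 3], [], [0, 10]]))
```

The O(n log k) sequential cost is the work that a p-processor algorithm running in O(n log k/p + log n) time matches. With n lists of one element each, the same routine sorts its input, which illustrates why merging and sorting can be treated in a unified way.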

13.

A fast and scalable parallel sorting algorithm on arrays with pipelined optical buses

Chen Hongjian, Chen Ling, Qin Ling, Xu Xiaohua, Tu Li 《计算机工程》2004,30(24):17-18,191

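Y. Pan proposed a parallel sorting algorithm on the pipelined optical bus array model (LARPBS) that sorts N elements using N processors in O(log N) time in the best case and O(N) time in the worst case. Building on that algorithm, this paper proposes a scalable fast parallel sorting algorithm on the LARPBS model that sorts N elements using p (1 ≤ p ≤ N) processors in O(N log N/p) time in the best case and O(N²/p) time in the worst case. It also proposes an improved fast and efficient parallel sorting algorithm on the LARPBS model that sorts N elements using N processors in O(log √N) time in the best case and O(√N) time in the worst case.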

14.

A new algorithm based on Givens rotations for solving linear equations on fault-tolerant mesh-connected processors

Murthy K.N.B. Bhuvaneswari K. Ram Murthy C.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):825-832

In this paper, we propose a new I/O-overhead-free parallel algorithm based on Givens rotations for solving a system of linear equations. The algorithm uses a new technique called two-sided elimination and requires an N×(N+1) mesh-connected processor array to solve N linear equations in (5N-log N-4) time steps. The array is well suited for VLSI implementation as identical processors with simple and regular interconnection pattern are required. We also describe a fault-tolerant scheme based on an algorithm-based fault tolerance (ABFT) approach. This scheme has small hardware and time overhead and can tolerate up to N processor failures.
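The underlying numerical technique can be sketched sequentially in Python: Givens rotations reduce the augmented matrix [A | b] to upper-triangular form, and back substitution recovers the solution. This is the textbook one-sided elimination, not the two-sided elimination or the mesh mapping of the paper. Reading the N×(N+1) array as the augmented matrix is an interpretation, and the function name `solve_givens` is an assumption.

```python
import math


def solve_givens(a, b):
    """Solve a x = b for a square matrix a using Givens rotations."""
    n = len(a)
    # Augmented matrix with N rows and N+1 columns.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        for row in range(col + 1, n):
            x, y = m[col][col], m[row][col]
            r = math.hypot(x, y)
            if r == 0.0:
                continue  # both entries already zero, nothing to rotate
            c, s = x / r, y / r
            # Rotate rows `col` and `row` so that m[row][col] becomes zero.
            for k in range(col, n + 1):
                upper, lower = m[col][k], m[row][k]
                m[col][k] = c * upper + s * lower
                m[row][k] = -s * upper + c * lower
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        if m[i][i] == 0.0:
            raise ValueError("matrix is singular")
        tail = sum(m[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (m[i][n] - tail) / m[i][i]
    return solution


if __name__ == "__main__":
    print(solve_givens([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3]))
```

Each rotation touches only two rows, which is one reason Givens-based elimination maps naturally onto regular processor arrays. The exact zero test for singularity is a simplification; a practical solver would compare against a tolerance.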

15.

A fast algorithm for computing a histogram on reconfigurable mesh

Ju-Wook Jang Heonchul Park Prasanna V.K. 《IEEE transactions on pattern analysis and machine intelligence》1995,17(2):97-106

The reconfigurable mesh captures salient features from a variety of sources, including the content addressable array parallel processor, the CHiP, the polymorphic-torus network and the bus automaton. It consists of an array of processors interconnected by a reconfigurable bus system. The bus system can be used to dynamically obtain various interconnection patterns between the processors. In this paper, we present a fast algorithm for computing the histogram of an N×N image with h grey levels in O(min{√h+log*(N/h),N}) time on an N×N reconfigurable mesh assuming each PE has a constant amount of local memory. This algorithm runs on the PARBUS and MRN/LRN models. In addition, histogram modification can be performed in O(√h) time on the same model. A variant of our algorithm runs in O(min{√h+log log(N/h),N}) time on an N×N RMESH in which each PE has constant storage. This result improves the known time and memory bounds for histogramming on the RMESH model.

16.

A parallel implementation of Strassen’s matrix multiplication algorithm for wormhole-routed all-port 2D torus networks

Cesur Baransel Kayhan M. İmre 《The Journal of supercomputing》2012,62(1):486-509

A new parallel implementation of Strassen’s matrix multiplication algorithm is proposed for massively parallel supercomputers with 2D, all-port torus interconnection networks. The proposed algorithm employs a special conflict-free routing pattern for better scalability and is able to yield a performance rate very close to the theoretical bound for many practical network and matrix sizes. It effectively scales up to very large networks typically containing hundreds of thousands of processors where petaflop or exaflop processing rates are sought.

17.

An analysis of cube-connected cycles and circular shuffle networks for parallel computation

《Journal of Parallel and Distributed Computing》1988,5(6):741-754

This paper is concerned with demonstrating isomorphism between two classes of computer architectures that support parallel computation, viz., the cube-connected cycles (CCC) network and the homogeneous circular shuffle network (HCSN). This is done by developing a suitable and common notation for addressing processing elements and specifying interconnections in the two networks. The implications of such an equivalence are discussed. Properties and algorithms concerning HCSN networks, with respect to routing and fault tolerance, thereby, immediately become applicable to CCC networks. As for HCSN networks, their VLSI layout is now apparent. The networks are shown to be totally symmetric with respect to each processor, and in some cases may be recursively defined in terms of modules. Further, any algorithm that runs on an HCSN network also runs on a CCC network without any modification. It is also shown that a large class of algorithms that run on a CCC network can be implemented, with slight modification, on an HCSN network. In particular, an implementation of the DESCENT algorithm on an HCSN network is proposed.

18.

An O(log n) pyramid Hough transform

Jean-Michel Jolion Azriel Rosenfeld 《Pattern recognition letters》1989,9(5):343-349

This paper describes a divide-and-conquer Hough transform technique for detecting a given number of straight edges or lines in an image. This technique is designed for implementation on a pyramid of processors, and requires only O(log n) computational steps for an image of size n × n.

19.

Modeling and distributed simulation of a broadband-ISDN network

Chai A. Ghosh S. 《Computer》1993,26(9):37-51

A distributed approach to communication network simulation using a network of workstations configured as a loosely coupled parallel processor to model and simulate the broadband integrated services digital network (B-ISDN) is proposed. In a loosely coupled parallel processor system, a number of concurrently executable processors communicate asynchronously using explicit messages over high-speed links. Since this architecture is similar to that of B-ISDN networks, it constitutes a realistic testbed for their modeling and simulation. The authors describe an implementation of this approach on 50 Sun workstations at Brown University. Performance results, based on representative B-ISDN networks and realistic traffic models, indicate that the distributed approach is efficient and accurate.

20.

An optimal fault-tolerant routing algorithm for weighted bidirectional double-loop networks

Dharmasena H.P. Yan X. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(9):841-852

Double-loop networks are widely used in computer networks. In this paper, we present an optimal message routing algorithm and an optimal fault-tolerant message routing algorithm for weighted bidirectional double-loop networks. The algorithms presented are novel, and they do not use routing tables. After a precalculation of O(log N) steps to determine network parameters, the algorithms can route messages using constant time at each node along the route. The algorithm presented can route messages in the presence of up to three faulty nodes or links. The fault-tolerant routing algorithm guarantees an optimal route in the presence of one node failure.