1.

A processor-time-minimal systolic array for cubical mesh algorithms

Cappello P. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(1):4-13

Using a directed acyclic graph (DAG) model of algorithms, the paper focuses on time-minimal multiprocessor schedules that use as few processors as possible. Such a processor-time-minimal scheduling of an algorithm's DAG is first illustrated using a triangular-shaped 2-D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n×n×n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing matrix product as well as many other algorithms. Completion of the cubical mesh requires 3n-2 steps. It is shown that the number of processing elements needed to achieve this time bound is at least ⌈3n²/4⌉. A systolic array for the cubical directed mesh is then presented. It completes the mesh using the minimum number of steps and exactly ⌈3n²/4⌉ processing elements; that is, it is processor-time-minimal. The systolic array's topology is that of a hexagonally shaped, cylindrically connected, 2-D directed mesh.
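The processor bound quoted in this abstract can be checked numerically for small n. The sketch below is not the paper's proof; it assumes that in any schedule of length 3n-2, node (i, j, k) of the mesh is forced into step i+j+k (0-indexed), since every node lies on a longest path, so the widest level gives a lower bound on processors. The function names are illustrative only.

```python
from collections import Counter
from itertools import product


def schedule_length(n: int) -> int:
    """Number of steps on a longest path through the n x n x n directed mesh."""
    return 3 * n - 2


def widest_level(n: int) -> int:
    """Largest number of mesh nodes that share the same value of i + j + k."""
    levels = Counter(i + j + k for i, j, k in product(range(n), repeat=3))
    return max(levels.values())


def ceil_three_quarters_n_squared(n: int) -> int:
    """Integer form of the ceiling of 3n^2/4."""
    return (3 * n * n + 3) // 4


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, schedule_length(n), widest_level(n), ceil_three_quarters_n_squared(n))
```

For the small sizes tried here the widest level coincides with the ceiling of 3n²/4, which is consistent with the bound stated in the abstract.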

2.

A processor-time-minimal systolic array for transitive closure

Scheiman C.J. Cappello P.R. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(3):257-269

Using a directed acyclic graph (DAG) model of algorithms, the authors focus on processor-time-minimal multiprocessor schedules: time-minimal multiprocessor schedules that use as few processors as possible. The Kung, Lo, and Lewis (KLL) algorithm for computing the transitive closure of a relation over a set of n elements requires at least 5n-4 parallel steps. As originally reported, their systolic array comprises n² processing elements. It is shown that any time-minimal multiprocessor schedule of the KLL algorithm's DAG needs at least n²/3 processing elements. Then a processor-time-minimal systolic array realizing the KLL DAG is constructed. Its processing elements are organized as a cylindrically connected 2-D mesh when n ≡ 0 (mod 3). When n ≢ 0 (mod 3), the 2-D mesh is connected as a torus.
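For readers who want the underlying computation in concrete form, here is a sequential reference for transitive closure using Warshall's algorithm. It is a plain baseline under the assumption that the relation is given as a Boolean adjacency matrix; it does not reproduce the KLL DAG or the systolic schedule described in the abstract.

```python
from typing import List


def transitive_closure(adj: List[List[bool]]) -> List[List[bool]]:
    """Warshall's algorithm: reach[i][j] is True if j is reachable from i."""
    n = len(adj)
    reach = [row[:] for row in adj]  # copy so the input is left unchanged
    for k in range(n):               # allow k as an intermediate element
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


if __name__ == "__main__":
    # Relation 0 -> 1 -> 2 on three elements.
    chain = [
        [False, True, False],
        [False, False, True],
        [False, False, False],
    ]
    for row in transitive_closure(chain):
        print(row)
```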

3.

Square meshes are not optimal for convex hull computation

Bhagavathi D. Gurla H. Olariu S. Schwing J.L. Jingyuan Zhang 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(6):545-554

Recently it has been noticed that for semigroup computations and for selection, rectangular meshes with multiple broadcasting yield faster algorithms than their square counterparts. The contribution of the paper is to provide yet another example of a fundamental problem for which this phenomenon occurs. Specifically, we show that the problem of computing the convex hull of a set of n sorted points in the plane can be solved in O(n^(1/8) log^(3/4) n) time on a rectangular mesh with multiple broadcasting of size n^(3/8) log^(1/4) n × n^(5/8)/log^(1/4) n. The fastest previously known algorithms on a square mesh of size √n×√n run in O(n^(1/6)) time in case the n points are pixels in a binary image, and in O(n^(1/6) log^(3/2) n) time for sorted points in the plane.

4.

Work-efficient routing algorithms for rearrangeable symmetrical networks

Cam H. Fortes J.A.B. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(7):733-741

The work performed by a parallel algorithm is the product of its running time and the number of processors it requires. This paper presents work-efficient (or cost-optimal) routing algorithms to determine the switch settings for realizing permutations on rearrangeable symmetrical networks such as Benes and the reduced Ω_N Ω_N^(-1). These networks have 2n-1 stages with N=2ⁿ inputs/outputs, each stage consisting of N/2 crossbar switches of size (2×2). Previously known parallel routing algorithms for a rearrangeable network with N inputs determine the states of all switches recursively in O(n) iterations using N processors. Each iteration determines the switch settings of at most two stages of the network and requires at least O(n) time on a computer of N processors, regardless of the type of its interconnection network. Hence, the work of any previously known parallel routing algorithm equals at least O(Nn²) for setting up all the switches of a rearrangeable network. The new routing algorithms run on a computer of p processors, 1 ≤ p ≤ N/n, and perform work O(Nn). Moreover, because the range of p is large, the new routing algorithms do not have to be changed in case some processors become faulty.
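The sizes and work figures in this abstract are easy to tabulate. The sketch below only evaluates the stated formulas (2n-1 stages, N/2 switches per stage, and the growth terms N·n² and N·n with constants dropped); the function names are hypothetical and no routing is performed.

```python
from typing import Tuple


def network_size(n: int) -> Tuple[int, int, int]:
    """Return (inputs N, stages, total 2x2 switches) for N = 2**n inputs."""
    inputs = 2 ** n
    stages = 2 * n - 1
    switches = stages * (inputs // 2)
    return inputs, stages, switches


def work_terms(n: int) -> Tuple[int, int]:
    """Growth terms only: (N * n**2 for earlier algorithms, N * n for the new ones)."""
    inputs = 2 ** n
    return inputs * n * n, inputs * n


if __name__ == "__main__":
    for n in (3, 6, 10):
        print(n, network_size(n), work_terms(n))
```

The ratio of the two work terms is n, which is the gap the abstract describes between the earlier algorithms and the work-efficient ones.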

5.

An optimal algorithm for the angle-restricted all nearest neighbor problem on the reconfigurable mesh, with applications

Nakano K. Olariu S. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(9):983-990

Given a set S of n points in the plane and two directions r₁ and r₂, the Angle-Restricted All Nearest Neighbor problem (ARANN, for short) asks to compute, for every point p in S, the nearest point in S lying in the planar region bounded by two rays in the directions r₁ and r₂ emanating from p. The ARANN problem generalizes the well-known ANN problem and finds applications to pattern recognition, image processing, and computational morphology. Our main contribution is to present an algorithm that solves an instance of size n of the ARANN problem in O(1) time on a reconfigurable mesh of size n×n. Our algorithm is optimal in the sense that Ω(n²) processors are necessary to solve the ARANN problem in O(1) time. By using our ARANN algorithm, we can provide O(1) time solutions to the tasks of constructing the Geographic Neighborhood Graph and the Relative Neighborhood Graph of n points in the plane on a reconfigurable mesh of size n×n. We also show that, on a somewhat stronger reconfigurable mesh of size n×n², the Euclidean Minimum Spanning Tree of n points can be computed in O(1) time.
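The ARANN problem statement can be made concrete with a sequential brute-force reference. This is a sketch of the problem definition only, not the reconfigurable-mesh algorithm. It assumes the two directions are given as angles in radians, that the region is the wedge swept counterclockwise from the first direction to the second (boundary rays included), and that points are distinct; the function name is illustrative.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def arann_brute_force(points: List[Point], theta1: float, theta2: float) -> List[Optional[int]]:
    """For each point, index of its nearest neighbor inside the wedge, or None."""
    full_turn = 2 * math.pi
    span = (theta2 - theta1) % full_turn  # angular width of the wedge
    nearest: List[Optional[int]] = []
    for a, (px, py) in enumerate(points):
        best, best_dist = None, math.inf
        for b, (qx, qy) in enumerate(points):
            if a == b:
                continue
            # Direction of q as seen from p, measured counterclockwise from theta1.
            offset = (math.atan2(qy - py, qx - px) - theta1) % full_turn
            if offset <= span:
                dist = math.hypot(qx - px, qy - py)
                if dist < best_dist:
                    best, best_dist = b, dist
        nearest.append(best)
    return nearest


if __name__ == "__main__":
    sample = [(0.0, 0.0), (1.0, 1.0), (3.0, 1.0), (-1.0, 0.0)]
    print(arann_brute_force(sample, 0.0, math.pi / 2))
```

This reference takes O(n²) sequential time, which gives a simple correctness check for any faster implementation.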

6.

Constructing Euclidean minimum spanning trees and all nearest neighbors on reconfigurable meshes

Lai T.H. Ming-Jye Sheng 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(8):806-817

A reconfigurable mesh, R-mesh for short, is a two-dimensional array of processors connected by a grid-shaped reconfigurable bus system. Each processor has four I/O ports that can be locally connected during execution of algorithms. This paper considers the d-dimensional Euclidean minimum spanning tree (EMST) and the all nearest neighbors (ANN) problem. Two results are reported. First, we show that a minimum spanning tree of n points in a fixed d-dimensional space can be constructed in O(1) time on a √(n³)×√(n³) R-mesh. Second, all nearest neighbors of n points in a fixed d-dimensional space can be constructed in O(1) time on an n×n R-mesh. There is no previous O(1) time algorithm for the EMST problem; ours is the first such algorithm. A previous R-mesh algorithm exists for the two-dimensional ANN problem; we extend it to any d-dimensional space. Both of the proposed algorithms have a time complexity independent of n but growing with d. The time complexity is O(1) if d is a constant.

7.

Hot-potato algorithms for permutation routing

Newman I. Schuster A. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(11):1168-1176

We develop a methodology for the design of hot-potato algorithms for routing permutations. The basic idea is to convert existing store-and-forward routing algorithms to hot-potato algorithms. Using it, we obtain the following complexity bounds for permutation routing: n×n mesh: 7n+o(n) steps; 2ⁿ hypercube: O(n²) steps; n×n torus: 4n+o(n) steps. The algorithm for the two-dimensional grid is the first to be both deterministic and asymptotically optimal. The algorithm for the 2ⁿ-node Boolean cube is the first deterministic algorithm that achieves a complexity of o(2ⁿ) steps.

8.

Parallel marching Poisson solvers

Marian Vajteršic 《Parallel Computing》1984,1(3-4):325-330

The paper presents parallel algorithms for solving the Poisson equation at N² mesh points. The methods based on marching techniques are structured for efficient parallel realization. Using orthogonal decomposition properties of the arising matrices, the algorithms can be formulated in terms of transformed vectors. On an MIMD computer with not more than N processors, the computations can be performed in horizontal slices with minimal synchronization requirements. Considering an SIMD machine with N² processors, the complexity bound O(log N) has been achieved, whereby a single marching step requires only 10 log N steps.

9.

An O((log log n)²) time algorithm to compute the convex hull of sorted points on reconfigurable meshes

Hayashi T. Nakano K. Olariu S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(12):1167-1179

The problem of computing the convex hull of a set of n sorted points in the plane is one of the fundamental tasks in image processing, pattern recognition, cellular network design, and robotics, among many others. Somewhat surprisingly, in spite of a great deal of effort, the best previously known algorithm to solve this problem on a reconfigurable mesh of size √n×√n was running in O(log² n) time. It was open for more than ten years to obtain an algorithm for this important problem running in sublogarithmic time. Our main contribution is to provide the first breakthrough: we propose an almost optimal convex hull algorithm running in O((log log n)²) time on a reconfigurable mesh of size √n×√n. With slight modifications, this algorithm can be implemented to run in O((log log n)²) time on a reconfigurable mesh of size (√n/log log n)×(√n/log log n). Clearly, the latter algorithm is work-optimal. We also show that any algorithm that computes the convex hull of a set of n sorted points on an n-processor reconfigurable mesh must take Ω(log log n) time. Our result opens the door to an entire slew of efficient convex-hull-based algorithms on reconfigurable meshes.
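As a point of comparison for the sorted-points setting, a sequential baseline is sketched below using Andrew's monotone chain method, which runs in linear time once the points are sorted. It is not the reconfigurable-mesh algorithm from the abstract. It assumes the input is already sorted lexicographically by (x, y) with no duplicate points, and the function names are illustrative.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _half_hull(seq: Iterable[Point]) -> List[Point]:
    """Build one chain, discarding points that do not make a strict left turn."""
    chain: List[Point] = []
    for p in seq:
        while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
            chain.pop()
        chain.append(p)
    return chain


def convex_hull_sorted(points: List[Point]) -> List[Point]:
    """Hull vertices in counterclockwise order; input must be sorted by (x, y)."""
    if len(points) <= 2:
        return list(points)
    lower = _half_hull(points)
    upper = _half_hull(reversed(points))
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    pts = sorted([(0, 0), (1, 1), (2, 0), (2, 2), (0, 2)])
    print(convex_hull_sorted(pts))
```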

10.

Compaction of Schedules and a Two-Stage Approach for Duplication-Based DAG Scheduling

Bozdag D. Ozguner F. Catalyurek U.V. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(6):857-871

Many DAG scheduling algorithms generate schedules that require a prohibitively large number of processors. To address this problem, we propose a generic algorithm, SC, to minimize the processor requirement of any given valid schedule. SC preserves the schedule length of the original schedule and reduces processor count by merging processor schedules and removing redundant duplicate tasks. To the best of our knowledge, this is the first algorithm to address this highly unexplored aspect of DAG scheduling. On average, SC reduced the processor requirement by 91, 82, and 72 percent for schedules generated by the PLW, TCSD, and CPFD algorithms, respectively. The SC algorithm has a low complexity (O(N³)) compared to most duplication-based algorithms. Moreover, it decouples processor economization from the schedule length minimization problem. To take advantage of these features of SC, we also propose a scheduling algorithm SDS, having the same time complexity as SC. Our experiments demonstrate that schedules generated by SDS are only 3 percent longer than those of CPFD (O(N⁴)), one of the best algorithms in that respect. SDS and SC together form a two-stage scheduling algorithm that produces schedules with high quality and low processor requirement, and has lower complexity than the comparable algorithms that produce similar high-quality results.

11.

PROCESSOR-TIME-OPTIMAL SYSTOLIC ARRAYS

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(3-4):167-199

Minimizing the amount of time and number of processors needed to perform an application reduces the application's fabrication cost and operation costs. A directed acyclic graph (dag) model of algorithms is used to define a time-minimal schedule and a processor-time-minimal schedule. We present a technique for finding a lower bound on the number of processors needed to achieve a given schedule of an algorithm. The application of this technique is illustrated with a tensor product computation. We then apply the technique to the free schedule of algorithms for matrix product, Gaussian elimination, and transitive closure. For each, we provide a time-minimal processor schedule that meets these processor lower bounds, including the one for tensor product.

12.

L₂ vector median filters on arrays with reconfigurable optical buses

Chin-Hsiung Wu Shi-Jinn Horng 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(12):1281-1292

In spite of their good filtering characteristics for vector-valued image processing, the usability of vector median filters is limited by their high computational complexity. Given an N × N image and a W × W window, the computational complexity of the vector median filter is O(W⁴N²). In this paper, we design three fast and efficient parallel algorithms for vector median filtering based on the 2-norm (L₂) on arrays with reconfigurable optical buses (AROB). For 1 ≤ p ≤ W ≤ q ≤ N, our algorithms run in O(W⁴ log W/p⁴), O(W²N²/p⁴q² log W) and O(1) times using p⁴N² / log W, p⁴q² / log W, and W⁴N² log N processors, respectively. In the sense of the product of time and the number of processors used, the first two results are cost optimal and the last one is time optimal.

13.

Task clustering and scheduling for distributed memory parallel architectures

Palis M.A. Jing-Chiou Liou Wei D.S.L. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(1):46-55

This paper addresses the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributed memory parallel architectures. Because of the high communication overhead in existing parallel machines, a crucial step in scheduling is task clustering, the process of coalescing fine grain tasks into single coarser ones so that the overall execution time is minimized. The task clustering problem is NP-hard, even when the number of processors is unbounded and task duplication is allowed. A simple greedy algorithm is presented for this problem which, for a task graph with arbitrary granularity, produces a schedule whose makespan is at most twice optimal. Indeed, the quality of the schedule improves as the granularity of the task graph becomes larger. For example, if the granularity is at least 1/2, the makespan of the schedule is at most 5/3 times optimal. For a task graph with n tasks and e inter-task communication constraints, the algorithm runs in O(n(n lg n+e)) time, which is n times faster than the currently best known algorithm for this problem. Similar algorithms are developed that produce: (1) optimal schedules for coarse grain graphs; (2) 2-optimal schedules for trees with no task duplication; and (3) optimal schedules for coarse grain trees with no task duplication.

14.

Dynamic task scheduling using online optimization

Hamidzadeh B. Lau Ying Kit Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(11):1151-1163

Algorithms for scheduling independent tasks onto the processors of a multiprocessor system must trade off processor load balance, memory locality, and scheduling overhead. Most existing algorithms, however, do not adequately balance these conflicting factors. This paper introduces the self-adjusting dynamic scheduling (SADS) class of algorithms that use a unified cost model to explicitly account for these factors at runtime. A dedicated processor performs scheduling in phases by maintaining a tree of partial schedules and incrementally assigning tasks to the least-cost schedule. A scheduling phase terminates whenever any processor becomes idle, at which time partial schedules are distributed to the processors. An extension of the basic SADS algorithm, called DBSADS, controls the scheduling overhead by giving higher priority to partial schedules with more task-to-processor assignments. These algorithms are compared to two distributed scheduling algorithms within a database application on an Intel Paragon distributed memory multiprocessor system.

15.

An improved constant-time algorithm for computing the Radon and Hough transforms on a reconfigurable mesh

Yi Pan Keqin Li Hamdi M. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》1999,29(4):417-421

The Hough transform is an important problem in image processing and computer vision. An efficient algorithm for computing the Hough transform has been proposed on a reconfigurable array by Kao et al. (1995). For a problem with a √N×√N image and an n×n parameter space, the algorithm runs in constant time on a three-dimensional (3-D) n×n×N reconfigurable mesh where the data bus is N^(1/c)-bit wide. To the best of our knowledge, this is the most efficient constant-time algorithm for computing the Hough transform on a reconfigurable mesh. In this paper, an improved Hough transform algorithm on a reconfigurable mesh is proposed. For the same problem, our algorithm runs in constant time on a 3-D n*n×n×√n√n reconfigurable mesh, where the data bus is only log N-bit wide. In most practical situations, n=O(√N). Hence, our algorithm requires much less VLSI area to accomplish the same task. In addition, our algorithm can compute the Radon transform (a generalized Hough transform) in O(1) time on the same model, whereas the algorithm in the above paper cannot easily be adapted to computing the Radon transform.

16.

Sorting n² numbers on n×n meshes

Nigam M. Sahni S. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(12):1221-1225

We show that by folding data from an n×n mesh onto an n×(n/k) submesh, sorting on the submesh, and finally unfolding back onto the entire n×n mesh, it is possible to sort on bidirectional and strict unidirectional meshes using a number of routing steps that is very close to the distance lower bound for these architectures.

17.

Embedding of complete binary trees into meshes with row-column routing

Sang-Kyu Lee Hyeong-Ah Choi 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(5):493-497

This paper considers the problem of embedding complete binary trees into meshes using row-column routing and obtains the following results: a complete binary tree with 2^p-1 nodes can be embedded (1) with link congestion one into a (9/8)√(2^p)×(9/8)√(2^p) mesh when p is even and a √((9/8)2^p)×√((9/8)2^p) mesh when p is odd, and (2) with link congestion two into a √(2^p)×√(2^p) mesh when p is even, and a √(2^p-1)×√(2^p-1) mesh when p is odd.

18.

An efficient parallel recognition algorithm for bipartite-permutation graphs

Chang-Wu Yu Gen-Huey Chen 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(1):3-10

We present a parallel recognition algorithm for bipartite-permutation graphs. The algorithm can be executed in O(log n) time on the CRCW PRAM if O(n³/log n) processors are used, or O(log² n) time on the CREW PRAM if O(n³/log² n) processors are used. Chen and Yesha (1993) have presented another CRCW PRAM algorithm that takes O(log² n) time if O(n³) processors are used. Compared with Chen and Yesha's algorithm, our algorithm requires either less time and fewer processors on the same machine model, or fewer processors on a weaker machine model. Our algorithm can also be applied to determine if two bipartite-permutation graphs are isomorphic.

19.

Constant time dynamic programming on directed reconfigurable networks

Bertossi A.A. Mei A. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(6):529-536

Several dynamic programming algorithms are considered which can be efficiently implemented using parallel networks with reconfigurable buses. The bit model of general reconfigurable meshes with directed links, common write, and unit-time delay for broadcasting is assumed. Given two sequences of length m and n, respectively, their longest common subsequence can be found in constant time by an O(mh)×O(nh) directed reconfigurable mesh, where h=min{m, n}+1. Moreover, given an n-node directed graph G=(V, E) with (possibly negative) integer weights on its arcs, the shortest distances from a source node v ∈ V to all other nodes can be found in constant time by an O(n²w)×O(n²w) directed reconfigurable mesh, where w is the maximum arc weight.
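The longest common subsequence problem mentioned in this abstract has a standard sequential dynamic program, sketched below as a reference for what the mesh computes in constant time. This is the textbook O(mn) recurrence with a rolling row, not the reconfigurable-mesh construction; the function name is illustrative.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of a longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)  # row for the empty prefix of a
    for x in a:
        cur = [0]  # column for the empty prefix of b
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)        # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop one symbol
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Note that the result never exceeds min{m, n}, so it takes one of min{m, n}+1 possible values, which matches the quantity h in the abstract.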

20.

Fully dynamic maintenance of k-connectivity in parallel

Weifa Liang Brent R.P. Hong Shen 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(8):846-864

Given a graph G=(V, E) with n vertices and m edges, the k-connectivity of G denotes either the k-edge connectivity or the k-vertex connectivity of G. In this paper, we deal with the fully dynamic maintenance of k-connectivity of G in the parallel setting for k=2, 3. We study the problem of maintaining k-edge/vertex connected components of a graph undergoing repeated dynamic updates, such as edge insertions and deletions, and answering the query of whether two vertices are included in the same k-edge/vertex connected component. Our major results are the following: (1) An NC algorithm for the 2-edge connectivity problem is proposed, which runs in O(log n log(m/n)) time using O(n^(3/4)) processors per update and query. (2) It is shown that the biconnectivity problem can be solved in O(log² n) time using O(nα(2n, n)/log n) processors per update and O(1) time with a single processor per query, or in O(log n log(m/n)) time using O(nα(2n, n)/log n) processors per update and O(log n) time using O(nα(2n, n)/log n) processors per query, where α(·,·) is the inverse of Ackermann's function. (3) An NC algorithm for the triconnectivity problem is also derived, which takes O(log n log(m/n) + log n log log n/α(3n, n)) time using O(nα(3n, n)/log n) processors per update and O(1) time with a single processor per query. (4) An NC algorithm for the 3-edge connectivity problem is obtained, which has the same time and processor complexities as the algorithm for the triconnectivity problem. To the best of our knowledge, the proposed algorithms are the first NC algorithms for the problems using O(n) processors in contrast to Ω(m) processors for solving them from scratch. In particular, the proposed NC algorithm for the 2-edge connectivity problem uses only O(n^(3/4)) processors. All the proposed algorithms run on a CRCW PRAM.