首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Using a directed acyclic graph (DAG) model of algorithms, the paper focuses on time-minimal multiprocessor schedules that use as few processors as possible. Such a processor-time-minimal scheduling of an algorithm's DAG first is illustrated using a triangular shaped 2-D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n×n×n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing matrix product as well as many other algorithms. Completion of the cubical mesh required 3n-2 steps. It is shown that the number of processing elements needed to achieve this time bound is at least [3n2/4]. A systolic array for the cubical directed mesh is then presented. It completes the mesh using the minimum number of steps and exactly [3n 2/4] processing elements it is processor-time-minimal. The systolic array's topology is that of a hexagonally shaped, cylindrically connected, 2-D directed mesh  相似文献   

2.
Using a directed acyclic graph (DAG) model of algorithms, the authors focus on processor-time-minimal multiprocessor schedules: time-minimal multiprocessor schedules that use as few processors as possible. The Kung, Lo, and Lewis (KLL) algorithm for computing the transitive closure of a relation over a set of n elements requires at least 5n-4 parallel steps. As originally reported. their systolic array comprises n2 processing elements. It is shown that any time-minimal multiprocessor schedule of the KLL algorithm's dag needs at least n2/3 processing elements. Then a processor-time-minimal systolic array realizing the KLL dag is constructed. Its processing elements are organized as a cylindrically connected 2-D mesh, when n=0 mod 3. When n≠0 mod 3, the 2-D mesh is connected as a torus  相似文献   

3.
Recently it has been noticed that for semigroup computations and for selection, rectangular meshes with multiple broadcasting yield faster algorithms than their square counterparts. The contribution of the paper is to provide yet another example of a fundamental problem for which this phenomenon occurs. Specifically, we show that the problem of computing the convex hull of a set of n sorted points in the plane can be solved in O(n1/8 log 3/4) time on a rectangular mesh with multiple broadcasting of size n3/8 log1/4 n×n5/8/log1/4n. The fastest previously known algorithms on a square mesh of size √n×√n run in O(n1/6) time in case the n points are pixels in a binary image, and in O(n1/6log3/2 n) time for sorted points in the plane  相似文献   

4.
The work performed by a parallel algorithm is the product of its running time and the number of processors it requires. This paper presents work-efficient (or cost-optimal) routing algorithms to determine the switch settings for realizing permutations on rearrangeable symmetrical networks such as Benes and the reduced Ω NΩN-1. These networks have 2n-1 stages with N=2n inputs/outputs, each stage consisting of N/2 crossbar switches of size (2×2). Previously known parallel routing algorithms for a rearrangeable network with N inputs determine the states of all switches recursively in O(n) iterations using N processors. Each iteration determines the switch settings of at most two stages of the network and requires at least O(n) time on a computer of N processors, regardless of the type of its interconnection network. Hence, the work of any previously known parallel routing algorithm equals at least O(Nn2) for setting up all the switches of a rearrangeable network. The new routing algorithms run on a computer of p processors, 1⩽p⩽N/n, and perform work O(Nn). Moreover, because the range of p is large, the new routing algorithms do not have to be changed in case some processors become faulty  相似文献   

5.
Given a set S of n points in the plane and two directions r1 and r2, the Angle-Restricted All Nearest Neighbor problem (ARANN, for short) asks to compute, for every point p in S, the nearest point in S lying in the planar region bounded by two rays in the directions r1 and r2 emanating from p. The ARANN problem generalizes the well-known ANN problem and finds applications to pattern recognition, image processing, and computational morphology. Our main contribution is to present an algorithm that solves an instance of size n of the ARANN problem in O(1) time on a reconfigurable mesh of size n×n. Our algorithm is optimal in the sense that Ω(n2) processors are necessary to solve the ARANN problem in O(1) time. By using our ARANN algorithm, we can provide O(1) time solutions to the tasks of constructing the Geographic Neighborhood Graph and the Relative Neighborhood Graph of n points in the plane on a reconfigurable mesh of size n×n. We also show that, on a somewhat stronger reconfigurable mesh of size n×n2, the Euclidean Minimum Spanning Tree of n points can be computed in O(1) time  相似文献   

6.
A reconfigurable mesh, R-mesh for short, is a two-dimensional array of processors connected by a grid-shaped reconfigurable bus system. Each processor has four I/O ports that can be locally connected during execution of algorithms. This paper considers the d-dimensional Euclidean minimum spanning tree (EMST) and the all nearest neighbors (ANN) problem. Two results are reported. First, we show that a minimum spanning tree of n points in a fixed d-dimensional space can be constructed in O(1) time on a √(n3)×√(n3) R-mesh. Second, all nearest neighbors of n points in a fixed d-dimensional space can be constructed in O(1) time on an n×n R-mesh. There is no previous O(1) time algorithm for the EMST problem; ours is the first such algorithm. A previous R-mesh algorithm exists for the two-dimensional ANN problem; we extend it to any d-dimensional space. Both of the proposed algorithms have a time complexity independent of n but growing with d. The time complexity is O(1) if d is a constant  相似文献   

7.
We develop a methodology for the design of hot-potato algorithms for routing permutations. The basic idea is to convert existing store-and-forward routing algorithms to hot-potato algorithms. Using it, we obtain the following complexity bounds for permutation routing: n×n Mesh: 7n+o(n) steps; 2n hypercube: O(n2) steps; n×n Torus: 4n+o(n) steps. The algorithm for the two-dimensional grid is the first to be both deterministic and asymptotically optimal. The algorithm for the 2n-nodes Boolean cube is the first deterministic algorithm that achieves a complexity of o(2n) steps  相似文献   

8.
The paper presents parallel algorithms for solving Poisson equation at N2 mesh points. The methods based on marching techniques are structured for efficient parallel realization. Using orthogonal decomposition properties of arising matrices, the algorithms can be formulated in terms of transformed vectors. On a MIMD computer with not more than N processors, the computations can be performed in horizontal slices with minimal synchronization requirements. Considering an SIMD machine with N2 processors, the complexity bound O(log N) has been achieved, whereby the single marching requires 10 log N steps only.  相似文献   

9.
The problem of computing the convex hull of a set of n sorted points in the plane is one of the fundamental tasks in image processing, pattern recognition, cellular network design, and robotics, among many others. Somewhat surprisingly, in spite of a great deal of effort, the best previously known algorithm to solve this problem on a reconfigurable mesh of size √n×√n was running in O(log2 n) time. It was open for more than ten years to obtain an algorithm for this important problem running in sublogarithmic time. Our main contribution is to provide the first breakthrough: we propose an almost optimal convex hull algorithm running in O((log log n)2) time on a reconfigurable mesh of size √n×√n. With slight modifications, this algorithm can be implemented to run in O((log log n)2) time on a reconfigurable mesh of size √n/loglogn×√n/loglogn. Clearly, the latter algorithm is work-optimal. We also show that any algorithm that computes the convex hull of a set of n sorted points on an n-processor reconfigurable mesh must take Ω(log log n) time. Our result opens the door to an entire slew of efficient convex-hull-based algorithms on reconfigurable meshes  相似文献   

10.
Many DAG scheduling algorithms generate schedules that require prohibitively large number of processors. To address this problem, we propose a generic algorithm, SC, to minimize the processor requirement of any given valid schedule. SC preserves the schedule length of the original schedule and reduces processor count by merging processor schedules and removing redundant duplicate tasks. To the best of our knowledge, this is the first algorithm to address this highly unexplored aspect of DAG scheduling. On average, SC reduced the processor requirement 91, 82, and 72 percent for schedules generated by PLW, TCSD, and CPFD algorithms, respectively. SC algorithm has a low complexity (O{N}3) compared to most duplication-based algorithms. Moreover, it decouples processor economization from schedule length minimization problem. To take advantage of these features of SC, we also propose a scheduling algorithm SDS, having the same time complexity as SC. Our experiments demonstrate that schedules generated by SDS are only 3 percent longer than CPFD (O{N}4), one of the best algorithms in that respect. SDS and SC together form a two-stage scheduling algorithm that produces schedules with high quality and low processor requirement, and has lower complexity than the comparable algorithms that produce similar high-quality results.  相似文献   

11.
Abstract

Minimizing the amount of time and number of processors needed to perform an application reduces the application's fabrication cost and operation costs. A directed acyclic graph (dag) model of algorithms is used to define a time-minimal schedule and a processor-time-minimal schedule, We present a technique for finding a lower bound on the number of processors needed to achieve a given schedule of an algorithm. The application of this technique is illustrated with a tensor product computation. We then apply the technique to the free schedule of algorithms for matrix product, Gaussian elimination, and transitive closure. For each, we provide a time-minimal processor schedule that meets these processor lower bounds, including the one for tensor product.  相似文献   

12.
In spite of their good filtering characteristics for vector-valued image processing, the usability of vector median filters is limited by their high computational complexity. Given an N × N image and a W × W window, the computational complexity of vector median filter is O(W4N2). In this paper, we design three fast and efficient parallel algorithms for vector median filtering based on the 2-norm (L2) on the arrays with reconfigurable optical buses (AROB). For 1 ⩽ p ⩽ W ⩽ q ⩽ N, our algorithms run in O(W4 log W/p4), O(W2N2/p 4q2 log W) and O(1) times using p4N2 / log W, p4q2 / log W, and W4N2 log N processors, respectively. In the sense of the product of time and the number of processors used, the first two results are cost optimal and the last one is time optimal  相似文献   

13.
This paper addresses the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributed memory parallel architectures. Because of the high communication overhead in existing parallel machines, a crucial step in scheduling is task clustering, the process of coalescing fine grain tasks into single coarser ones so that the overall execution time is minimized. The task clustering problem is NP-hard, even when the number of processors is unbounded and task duplication is allowed. A simple greedy algorithm is presented for this problem which, for a task graph with arbitrary granularity, produces a schedule whose makespan is at most twice optimal. Indeed, the quality of the schedule improves as the granularity of the task graph becomes larger. For example, if the granularity is at least 1/2, the makespan of the schedule is at most 5/3 times optimal. For a task graph with n tasks and e inter-task communication constraints, the algorithm runs in O(n(n lg n+e)) time, which is n times faster than the currently best known algorithm for this problem. Similar algorithms are developed that produce: (1) optimal schedules for coarse grain graphs; (2) 2-optimal schedules for trees with no task duplication; and (3) optimal schedules for coarse grain trees with no task duplication  相似文献   

14.
Algorithms for scheduling independent tasks on to the processors of a multiprocessor system must trade-off processor load balance, memory locality, and scheduling overhead. Most existing algorithms, however, do not adequately balance these conflicting factors. This paper introduces the self-adjusting dynamic scheduling (SADS) class of algorithms that use a unified cost model to explicitly account for these factors at runtime. A dedicated processor performs scheduling in phases by maintaining a tree of partial schedules and incrementally assigning tasks to the least-cost schedule. A scheduling phase terminates whenever any processor becomes idle, at which time partial schedules are distributed to the processors. An extension of the basic SADS algorithm, called DBSADS, controls the scheduling overhead by giving higher priority to partial schedules with more task-to-processor assignments. These algorithms are compared to two distributed scheduling algorithms within a database application on an Intel Paragon distributed memory multiprocessor system.  相似文献   

15.
The Hough transform is an important problem in image processing and computer vision. An efficient algorithm for computing the Hough transform has been proposed on a reconfigurable array by Kao et al. (1995). For a problem with an √N×√N image and an n×n parameter space, the algorithm runs in a constant time on a three-dimensional (3-D) n×n×N reconfigurable mesh where the data bus is N1c/-bit wide. To our best knowledge, this is the most efficient constant-time algorithm for computing the Hough transform on a reconfigurable mesh. In this paper, an improved Hough transform algorithm on a reconfigurable mesh is proposed. For the same problem, our algorithm runs in constant time on a 3-D n*n×n×√n√n reconfigurable mesh, where the data bus is only log N-bit wide. In most practical situations, n=O(√N). Hence, our algorithm requires much less VLSI area to accomplish the same task. In addition, our algorithm can compute the Radon transform (a generalized Hough transform) in O(1) time on the same model, whereas the algorithm in the above paper cannot be adapted to computing Radon transform easily  相似文献   

16.
We show that by folding data from an n×n mesh onto an n×(n/k) submesh, sorting on the submesh, and finally unfolding back onto the entire n×n mesh it is possible to sort on bidirectional and strict unidirectional meshes using a number of routing steps that is very close to the distance lower bound for these architectures  相似文献   

17.
This paper considers the problem of embedding complete binary trees into meshes using the row-column routing and obtained the following results: a complete binary tree with 2p-1 nodes can be embedded (1) with link congestion one into a 9/8√(2p9/ 8√(2p) mesh when p is even and a √( 9/82p)×√(9/ 82p) mesh when p is odd, and (2) with link congestion two into a √(2p)×√(2p) mesh when p is even, and a √(2p-1)×√(2p-1) mesh when p is odd  相似文献   

18.
We present a parallel recognition algorithm for bipartite-permutation graphs. The algorithm can be executed in O(log n) time on the CRCW PRAM if O(n3/log n) processors are used, or O(log2 n) time on the CREW PRAM if O(n3/log2 n) processors are used. Chen and Yesha (1993) have presented another CRCW PRAM algorithm that takes O(log2n) time if O(n 3) processors are used. Compared with Chen and Yesha's algorithm, our algorithm requires either less time and fewer processors on the same machine model, or fewer processors on a weaker machine model. Our algorithm can also be applied to determine if two bipartite-permutation graphs are isomorphic  相似文献   

19.
Several dynamic programming algorithms are considered which can be efficiently implemented using parallel networks with reconfigurable buses. The bit model of general reconfigurable meshes with directed links, common write, and unit-time delay for broadcasting is assumed. Given two sequences of length m and n, respectively, their longest common subsequence can be found in constant time by an O(mh)×O(nh) directed reconfigurable mesh, where h=min{m, n}+1. Moreover, given an n-node directed graph G=(V, E) with (possibly negative) integer weights on its arcs, the shortest distances from a source node ν ϵ V to all other nodes can be found in constant time by an O(n2w) x O(n2w) directed reconfigurable mesh, where w is the maximum are weight  相似文献   

20.
Given a graph G=(V, E) with n vertices and m edges, the k-connectivity of G denotes either the k-edge connectivity or the k-vertex connectivity of G. In this paper, we deal with the fully dynamic maintenance of k-connectivity of G in the parallel setting for k=2, 3. We study the problem of maintaining k-edge/vertex connected components of a graph undergoing repeatedly dynamic updates, such as edge insertions and deletions, and answering the query of whether two vertices are included in the same k-edge/vertex connected component. Our major results are the following: (1) An NC algorithm for the 2-edge connectivity problem is proposed, which runs in O(log n log(m/n)) time using O(n3/4) processors per update and query. (2) It is shown that the biconnectivity problem can be solved in O(log2 n ) time using O(nα(2n, n)/logn) processors per update and O(1) time with a single processor per query or in O(log n logn/m) time using O(nα(2n, n)/log n) processors per update and O(logn) time using O(nα(2n, n)/logn) processors per query, where α(.,.) is the inverse of Ackermann's function. (3) An NC algorithm for the triconnectivity problem is also derived, which takes O(log n logn/m+logn log log n/α(3n, n)) time using O(nα(3n, n)/log n) processors per update and O(1) time with a single processor per query. (4) An NC algorithm for the 3-edge connectivity problem is obtained, which has the same time and processor complexities as the algorithm for the triconnectivity problem. To the best of our knowledge, the proposed algorithms are the first NC algorithms for the problems using O(n) processors in contrast to Ω(m) processors for solving them from scratch. In particular, the proposed NC algorithm for the 2-edge connectivity problem uses only O(n3/4) processors. All the proposed algorithms run on a CRCW PRAM  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号