Similar Documents (20 results)
1.
Consider any known sequential algorithm for matrix multiplication over an arbitrary ring with time complexity O(N^α), where 2 < α ≤ 3. We show that such an algorithm can be parallelized on a distributed memory parallel computer (DMPC) in O(log N) time by using N^α/log N processors. Such a parallel computation is cost optimal and matches the performance of PRAM. Furthermore, our parallelization on a DMPC can be made fully scalable, that is, for all 1 ≤ p ≤ N^α/log N, multiplying two N×N matrices can be performed by a DMPC with p processors in O(N^α/p) time, i.e., linear speedup and cost optimality can be achieved in the range [1..N^α/log N]. This unifies all known algorithms for matrix multiplication on DMPC, standard or non-standard, sequential or parallel. Extensions of our methods and results to other parallel systems are also presented. For instance, for all 1 ≤ p ≤ N^α/log N, multiplying two N×N matrices can be performed by p processors connected by a hypercubic network in O(N^α/p + (N²/p^{2/α})(log p)^{2(α−1)/α}) time, which implies that if p = O(N^α/(log N)^{2(α−1)/(α−2)}), linear speedup can be achieved. Such a parallelization is highly scalable. The above claims result in significant progress in scalable parallel matrix multiplication (as well as solving many other important problems) on distributed memory systems, both theoretically and practically.
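For orientation, the α = 3 baseline referred to above is just the standard triple-loop product; a minimal sequential sketch in Python (illustrative only, function names are ours, and this is not the DMPC parallelization itself):

```python
def mat_mul(A, B):
    """Naive O(N^3) matrix product over any ring-like element type.

    This is the alpha = 3 sequential baseline; a faster scheme such as
    Strassen's (alpha ~ 2.81) would play the same role in the statement above.
    """
    n, m, k = len(A), len(B[0]), len(B)
    assert all(len(row) == k for row in A), "inner dimensions must agree"
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0
            for t in range(k):
                s += A[i][t] * B[t][j]
            C[i][j] = s
    return C

# Example: 2x2 product
print(mat_mul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```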

2.
We present an optimal parallel algorithm for the construction of (a, b)-trees, a generalization of 2-3 trees, 2-3-4 trees, and B-trees. We show the existence of a canonical form for (a, b)-trees, with a very regular structure, which allows us to obtain a scalable parallel algorithm for the construction of a minimum-height (a, b)-tree with N keys in O(N/p + log log N) time using p ≤ N/log log N processors on the EREW PRAM model, and in O(N/p) time using p ≤ N processors on the CREW model. We show that the average memory utilization for the canonical form is at least 50% better than that for the worst case and is also better than that for a random (a, b)-tree. A significant feature of the proposed parallel algorithm is that its time complexity depends neither on a nor on b, and hence our general algorithm is superior to earlier algorithms for parallel construction of B-trees.

3.
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in Θ(log P) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel A* algorithms to solve the Traveling Salesman Problem on an nCUBE2 hypercube multicomputer. The QE strategy yields average speedup improvements of about 20-185% and 15-120% at low and intermediate work densities (the ratio of the problem size to P), respectively, over three well-known load balancing methods: the round-robin (RR), the random communication (RC), and the neighborhood averaging (NA) strategies. The average speedup observed on 1024 processors is about 985, representing a very high efficiency of 0.96. Finally, we analyze and empirically evaluate the scalability of parallel A* algorithms in terms of the isoefficiency metric. Our analysis gives (1) a Θ(P log P) lower bound on the isoefficiency function of any parallel A* algorithm, and (2) a general expression for the upper bound on the isoefficiency function of our parallel A* algorithm using the QE strategy on any topology; for the hypercube and 2-D mesh architectures the upper bounds on the isoefficiency function are found to be Θ(P log² P) and Θ(P[formula]), respectively. Experimental results validate our analysis, and also show that parallel A* search has better scalability using the QE load balancing strategy than using the RR, RC, or NA strategies.

4.
Computing an optimal solution to the knapsack problem is known to be NP-hard. Consequently, fast parallel algorithms for finding such a solution without using an exponential number of processors appear unlikely. An attractive alternative is to compute an approximate solution to this problem rapidly using a polynomial number of processors. In this paper, we present an efficient parallel algorithm for finding approximate solutions to the 0–1 knapsack problem. Our algorithm takes an ε, 0 < ε < 1, as a parameter and computes a solution such that the ratio of its deviation from the optimal solution is at most a fraction ε of the optimal solution. For a problem instance having n items, this computation uses O(n^{5/2}/ε^{3/2}) processors and requires O(log³ n + log² n log(1/ε)) time. The upper bound on the processor requirement of our algorithm is established by reducing it to a problem on weighted bipartite graphs. This processor complexity is a significant improvement over that of other known parallel algorithms for this problem.
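A common sequential route to such an ε-approximation is profit scaling followed by dynamic programming over scaled profits. The sketch below is that textbook FPTAS, not the parallel bipartite-graph reduction of the paper; all function and variable names are illustrative:

```python
def knapsack_fptas(profits, weights, capacity, eps):
    """Sequential FPTAS for the 0-1 knapsack problem via profit scaling.

    Returns a list of chosen item indices whose total profit is at least
    (1 - eps) * OPT; assumes each item individually fits the capacity,
    as in the standard analysis.
    """
    n = len(profits)
    if n == 0:
        return []
    p_max = max(profits)
    scale = eps * p_max / n if eps * p_max / n > 0 else 1.0  # profit scaling factor
    scaled = [int(p // scale) for p in profits]
    total = sum(scaled)
    INF = float("inf")
    # min_w[i][v]: minimum weight of a subset of the first i items with scaled profit v
    min_w = [[INF] * (total + 1) for _ in range(n + 1)]
    min_w[0][0] = 0
    for i in range(1, n + 1):
        for v in range(total + 1):
            min_w[i][v] = min_w[i - 1][v]                     # skip item i-1
            if v >= scaled[i - 1]:
                take = min_w[i - 1][v - scaled[i - 1]] + weights[i - 1]
                if take < min_w[i][v]:
                    min_w[i][v] = take                        # take item i-1
    best_v = max(v for v in range(total + 1) if min_w[n][v] <= capacity)
    # Standard traceback over the 2-D table.
    items, v = [], best_v
    for i in range(n, 0, -1):
        if min_w[i][v] != min_w[i - 1][v]:
            items.append(i - 1)
            v -= scaled[i - 1]
    return items

# Example: profits [60, 100, 120], weights [2, 4, 6], capacity 10, eps = 0.1
print(knapsack_fptas([60, 100, 120], [2, 4, 6], 10, 0.1))  # items 1 and 2
```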

5.
In this paper we consider the problem of scheduling a collection of independent tasks on multiple processors (denote the number of processors by p) so that the maximum completion time is minimized. We present two new algorithms, the LPT-MinHeight (LPTMH) algorithm and the Split-LPT (SLPT) algorithm. Both algorithms are based on the LPT (Largest Processing Time first) algorithm. The worst-case imbalance for the LPTMH algorithm never exceeds 1/(e − 1) ≈ 0.582, while the worst-case imbalance for the SLPT algorithm is (p − 1)/(p + 1) < 1. The SLPT bound is equal to the bound for a previously published algorithm, while the LPTMH bound is the best known so far. Both LPTMH and SLPT take much less running time than competing algorithms. Results of experiments show that the SLPT algorithm performs better on average than the LPTMH algorithm and as well as other known algorithms.
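For reference, plain LPT sorts jobs by decreasing processing time and greedily assigns each to the currently least-loaded machine; a minimal sketch follows (the LPTMH and SLPT refinements themselves are not reproduced here, and the names are ours):

```python
import heapq

def lpt_schedule(processing_times, p):
    """Largest Processing Time first: greedily assign each job, in decreasing
    order of processing time, to the least-loaded of p machines.
    Returns (makespan, assignment) where assignment[j] is job j's machine."""
    loads = [(0, m) for m in range(p)]        # min-heap of (current load, machine id)
    heapq.heapify(loads)
    assignment = [None] * len(processing_times)
    for j in sorted(range(len(processing_times)),
                    key=lambda j: processing_times[j], reverse=True):
        load, m = heapq.heappop(loads)        # least-loaded machine
        assignment[j] = m
        heapq.heappush(loads, (load + processing_times[j], m))
    makespan = max(load for load, _ in loads)
    return makespan, assignment

# Example: 7 jobs on 3 machines
print(lpt_schedule([5, 5, 4, 4, 3, 3, 3], 3))
```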

6.
Samal and Henderson claim that any parallel algorithm for enforcing arc consistency in the worst case must have Ω(na) sequential steps, where n is the number of nodes, and a is the number of labels per node. We argue that Samal and Henderson's argument makes assumptions about how processors are used, and give a counterexample that enforces arc consistency in a constant number of steps using O(n²a²2^{na}) processors. It is possible that the lower bound holds for a polynomial number of processors; if such a lower bound were to be proven, it would answer an important open question in theoretical computer science concerning the relation between the complexity classes P and NC. The strongest existing lower bound for the arc consistency problem states that it cannot be solved in polylogarithmic time unless P = NC.

7.
We consider the problem of optimally assigning the modules of a parallel/pipelined program over the processors of a multiple processor system under certain restrictions on the interconnection structure of the program as well as the multiple computer system. We show that for a variety of such problems, it is possible to decide whether a partition of the modular program exists in which the load on any processor is within a certain bound. This method, when combined with a binary search over a fixed range, provides an optimal solution to the partitioning problem. The specific problems we consider are partitioning of (1) a chain-structured parallel program over a chain-like computer system, (2) multiple chain-like programs over a host-satellite system, and (3) a tree-structured parallel program over a host-satellite system. For a problem with N modules and M processors, the complexity of our algorithm is no worse than O(M log(N) log(W_T/ε)), where W_T is the cost of assigning all modules to one processor, and ε the desired accuracy. This algorithm provides an improvement over the recently developed best known algorithm, which runs in O(MN log(N)) time. This research was supported by a grant from the Division of Research Extension and Advisory Services, University of Engineering and Technology Lahore, Pakistan. Further support was provided by NASA Contracts NAS1-17070 and NAS1-18107 while the author was resident at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, Virginia, USA.
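The feasibility test behind the "binary search over a fixed range" idea has a simple greedy form in the chain-on-chain case; a hedged sketch, under the simplifying assumption that only module execution weights matter and inter-processor communication costs are ignored (names are illustrative):

```python
def fits(weights, M, bound):
    """Greedy probe: can the chain of modules be cut into at most M
    contiguous pieces, each of total weight <= bound?"""
    pieces, current = 1, 0
    for w in weights:
        if w > bound:
            return False
        if current + w > bound:
            pieces += 1        # start a new piece (next processor)
            current = w
        else:
            current += w
    return pieces <= M

def min_bottleneck(weights, M, eps=1e-6):
    """Binary search over the bottleneck load between max(weights) and
    W_T = sum(weights), calling the greedy probe at each step."""
    lo, hi = max(weights), sum(weights)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if fits(weights, M, mid):
            hi = mid
        else:
            lo = mid
    return hi

print(min_bottleneck([4, 2, 7, 1, 5, 3], 3))  # close to 8.0, e.g. [4,2] [7,1] [5,3]
```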

8.
Parallel merge sort is useful for sorting a large quantity of data progressively. The merge sort should be parallelized carefully, since the conventional algorithm has poor performance due to the successive halving of the number of participating processors, down to one in the last merging stage. The proposed load-balanced merge sort utilizes all processors throughout the computation. It evenly distributes data to all processors in each stage, so every processor is forced to work in all phases. Significant performance enhancement has been achieved, up to a speedup of (P − 1)/log P, where P is the number of processors. Experimental results demonstrate a speedup of 9.6 (upper bound of 10.7) on a 32-processor Cray T3E when sorting 4M 32-bit integers, and a speedup of 2.3 (upper bound of 2.8) on an 8-node PC cluster.

9.
This paper presents efficient hypercube algorithms for solving triangular systems of linear equations by using various matrix partitioning and mapping schemes. Recently, several parallel algorithms have been developed for this problem. In these algorithms, the triangular solver is treated as the second stage of Gaussian elimination. Thus, the triangular matrix is distributed by columns (or rows) in a wrap fashion, since it is likely that the matrix is distributed this way after an LU decomposition has been done on the matrix. However, the efficiency of these algorithms is low. Our motivation is to develop various data partitioning and mapping schemes for hypercube algorithms by treating the triangular solver as an independent problem. Theoretically, the computation time of our best algorithm is ((12p + 1)n² + 36p³ − 28p²)/(24p²), and an upper bound on the communication time is 2αp log p (log n − log p) + 2α(log n − log p − 1) log p + (cn/p − 2c)(2 log p − 1) + log p(cnc − α), where α is the (communication startup time)/(one-entry scanning time), c is a constant, n is the order of the triangular system, and p is the number of nodes in the hypercube. Experimental results show that the algorithm is efficient. The efficiency of the algorithm is 0.945 when p = 2, n = 513, and 0.93 when p = 8, n = 1025.
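The sequential operation being parallelized here is ordinary substitution on a triangular matrix; a minimal forward-substitution sketch for a lower-triangular system, independent of the hypercube partitioning and mapping schemes discussed in the paper:

```python
def forward_substitution(L, b):
    """Solve L x = b for a nonsingular lower-triangular matrix L (lists of lists).
    Runs in O(n^2) sequential time; a parallel solver distributes these
    row/column updates across processors."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))   # contribution of solved unknowns
        x[i] = (b[i] - s) / L[i][i]
    return x

# Example: 3x3 lower-triangular system
L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 1.0, 5.0]]
print(forward_substitution(L, [2.0, 5.0, 14.0]))  # [1.0, 1.333..., 1.733...]
```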

10.
Optical interconnections attract the attention of many engineers and scientists due to their potential for gigahertz transfer rates and concurrent access to the bus in a pipelined fashion. These unique characteristics of optical interconnections give us the opportunity to reconsider traditional algorithms designed for ideal parallel computing models, such as PRAMs. Since the PRAM model is far from practice, not all algorithms designed on this model can be implemented on a realistic parallel computing system. From this point of view, we study Cole's pipelined merge sort [Cole R. Parallel merge sort. SIAM J Comput 1988;17(4):770–85] on the CREW PRAM and extend it in an innovative way to an optical interconnection model, the LARPBS (Linear Array with Reconfigurable Pipelined Bus System) model [Pan Y, Li K. Linear array with a reconfigurable pipelined bus system—concepts and applications. J Inform Sci 1998;106:237–58]. Although Cole's algorithm is optimal, communication details have not been provided, due to the fact that it is designed for a PRAM. We close this gap in our sorting algorithm on the LARPBS model and obtain an O(log N)-time optimal sorting algorithm using O(N) processors. This is a substantial improvement over the previous best sorting algorithm on the LARPBS model, which runs in O(log N log log N) worst-case time using N processors [Datta A, Soundaralakshmi S, Owens R. Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2002;13(3):212–22]. Our solution allows processors to be assigned and reused efficiently. We also discover two new properties of Cole's sorting algorithm, which are presented as lemmas in this paper.

11.
Li, Jie; Pan, Yi; Shen, Hong. The Journal of Supercomputing, 2003, 24(3): 251-258.
Topological sort of an acyclic graph has many applications, such as job scheduling and network analysis. Due to its importance, it has been tackled on many models. Dekel et al. [3] proposed an algorithm for solving the problem in O(log² N) time on the hypercube or shuffle-exchange networks with O(N³) processors. Chaudhuri [2] gave an O(log N) algorithm using O(N³) processors on a CRCW PRAM model. On the LARPBS (Linear Arrays with a Reconfigurable Pipelined Bus System) model, Li et al. [5] showed that the problem for a weighted directed graph with N vertices can be solved in O(log N) time by using N³ processors. In this paper, a more efficient topological sort algorithm is proposed on the same LARPBS model. We show that the problem can be solved in O(log N) time by using N³/log N processors. The algorithm has better time and processor complexities than the best algorithm on the hypercube, and has the same time complexity but better processor complexity than the best algorithm on the CRCW PRAM model.
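As a sequential point of reference for these parallel bounds, Kahn's queue-based algorithm topologically sorts a DAG in O(V + E) time; a minimal sketch (illustrative only, not the LARPBS algorithm):

```python
from collections import deque

def topological_sort(n, edges):
    """Kahn's algorithm: repeatedly remove vertices of in-degree zero.
    `edges` is a list of (u, v) pairs meaning u must precede v.
    Returns a topological order, or raises if the graph has a cycle."""
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != n:
        raise ValueError("graph contains a cycle")
    return order

# Example: 0 -> 1 -> 3, 0 -> 2 -> 3
print(topological_sort(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # e.g. [0, 1, 2, 3]
```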

12.
The paper presents the sequential and the parallel algorithm for solving the nearest-neighbor problem in the plane, based on the generalized Voronoi diagram construction. The applications of the problem are found in the areas of networking, communications, distributed systems, computer modeling and information retrieval. The input for the problem is the set of circular sites S with varying radii, the query point p and the metric (Minkowski or power) according to which the site, neighboring the query point, is to be reported. The sequential algorithm takes O(n) time to build the data structure and O(log n) time for each query. The parallel algorithm requires O(log n log) preprocessing time and O(log) query time on EREW PRAM architecture with n/log n processors. The IDG/NNM software was developed for an experimental study of the problem. The experimental results demonstrate that the Voronoi diagram method outperforms the kd tree method for all tested input configurations. The tests were conducted on large data sets, comprising thousands of generators.

13.
We present an optimal parallel randomized algorithm for the Voronoi diagram of a set of n nonintersecting (except possibly at endpoints) line segments in the plane. Our algorithm runs in O(log n) time with high probability using O(n) processors on a CRCW PRAM. This algorithm is optimal in terms of work done, since the sequential time bound for this problem is Ω(n log n). Our algorithm improves by an O(log n) factor the previously best known deterministic parallel algorithm, given by Goodrich, Ó Dúnlaing, and Yap, which runs in O(log² n) time using O(n) processors. We obtain this result by using a new "two-stage" random sampling technique. By choosing large samples in the first stage of the algorithm, we avoid the hurdle of problem-size "blow-up" that is typical in recursive parallel geometric algorithms. We combine the two-stage sampling technique with efficient search and merge procedures to obtain an optimal algorithm. This technique gives an alternative optimal algorithm for the Voronoi diagram of points as well (all other optimal parallel algorithms for this problem use the transformation to three-dimensional half-space intersection).

14.
We present a unified parallel algorithm for constructing various search trees. The tree construction is based on a unified scheme, called bottom-level balancing, which constructs a perfectly balanced search tree having a uniform distribution of keys. The algorithm takes O(log log N) time using N/log log N processors on the EREW PRAM model, and O(1) time with N processors on the CREW model, where N is the number of keys in the tree.
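Sequentially, a minimum-height search tree over sorted keys can be built by recursively placing the median at the root; a minimal sketch of that baseline (the bottom-level balancing scheme and its PRAM parallelization are not reproduced here, and the names are ours):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def build_balanced(sorted_keys):
    """Build a minimum-height binary search tree from sorted keys by
    recursively choosing the median as the root."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    return Node(sorted_keys[mid],
                build_balanced(sorted_keys[:mid]),
                build_balanced(sorted_keys[mid + 1:]))

def height(node):
    """Number of node levels on the longest root-to-leaf path."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

print(height(build_balanced(list(range(15)))))  # 4: minimum possible height for 15 keys
```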

15.
The distance transform (DT) is an image computation tool which can be used to extract information about the shape and the position of the foreground pixels relative to each other. It converts a binary image into a grey-level image, where each pixel has a value corresponding to the distance to the nearest foreground pixel. The time complexity of computing the distance transform depends strongly on the distance metric used; in particular, the more exact the distance transform, the longer the execution time. Nowadays, thousands of images often have to be processed in a limited time, and it is hardly feasible for a sequential computer to compute distance transforms in real time. In order to provide efficient distance transform computation, it is therefore desirable to develop a parallel algorithm for this operation. In this paper, based on the diagonal propagation approach, we first provide an O(N²) time sequential algorithm to compute the chessboard distance transform (CDT) of an N×N image, which is a DT using the chessboard distance metric. Based on the proposed sequential algorithm, the CDT of a 2D binary image array of size N×N can be computed in O(log N) time on the EREW PRAM model using O(N²/log N) processors, in O(log log N) time on the CRCW PRAM model using O(N²/log log N) processors, and in O(log N) time on the hypercube computer using O(N²/log N) processors. Following the mapping proposed by Lee and Horng, the algorithm for the medial axis transform is also efficiently derived. The medial axis transform of a 2D binary image array of size N×N can be computed in O(log N) time on the EREW PRAM model using O(N²/log N) processors, in O(log log N) time on the CRCW PRAM model using O(N²/log log N) processors, and in O(log N) time on the hypercube computer using O(N²/log N) processors. The proposed parallel algorithms are composed of a set of prefix operations. In each prefix operation phase, only increment (add-one) and minimum operations are employed, so the algorithms are especially efficient in practical applications.
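Sequentially, the chessboard DT can also be obtained with the classic two-pass (forward/backward) propagation in O(N²) time; a minimal sketch of that baseline, not the diagonal-propagation parallel algorithms of the paper (names are illustrative):

```python
def chessboard_dt(image):
    """Two-pass chessboard (L-infinity) distance transform of a binary image.
    image[i][j] == 1 marks a foreground pixel; the result gives, for every
    pixel, the chessboard distance to the nearest foreground pixel.
    Assumes the image contains at least one foreground pixel."""
    n, m = len(image), len(image[0])
    INF = n + m  # larger than any possible chessboard distance
    d = [[0 if image[i][j] else INF for j in range(m)] for i in range(n)]
    # Forward pass: propagate from the top-left neighbours.
    for i in range(n):
        for j in range(m):
            for di, dj in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    d[i][j] = min(d[i][j], d[a][b] + 1)
    # Backward pass: propagate from the bottom-right neighbours.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            for di, dj in ((1, 1), (1, 0), (1, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    d[i][j] = min(d[i][j], d[a][b] + 1)
    return d

img = [[0, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0]]
print(chessboard_dt(img))  # [[1, 1, 1, 2], [1, 0, 1, 2], [1, 1, 1, 2]]
```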

16.
In this paper we solve the open problem of designing a cost-optimal parallel algorithm for generating permutations of M elements out of the set {0, 1, …, N − 1}, in lexicographic order. Our algorithm runs on the simplest model of parallel computation, i.e., a linear array of size M, where each processor PE(i) needs only a constant number of words of length log N, and is responsible for producing, with constant delay, the ith components in the successive permutations. This algorithm runs in time O(N!/(N − M)!) on an array of size M and is therefore cost-optimal when the time needed to output the permutations is taken into account. Moreover, by doubling the number of cells, it is possible to implement this algorithm on a unidirectional linear array.
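The same output sequence, all N!/(N − M)! length-M permutations of {0, …, N − 1} in lexicographic order, is easy to produce sequentially; a minimal sketch (illustrative only, not the linear-array algorithm):

```python
from itertools import permutations

def lex_permutations(N, M):
    """Yield all N!/(N-M)! permutations of M elements out of {0,...,N-1}
    in lexicographic order. itertools.permutations emits tuples in
    lexicographic order when its input is sorted, which range(N) is."""
    yield from permutations(range(N), M)

print(list(lex_permutations(3, 2)))
# [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```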

17.
In this paper, a parallel algorithm is presented to find all cut-vertices and blocks of an interval graph. If the list of sorted end points of the intervals of an interval graph is given, then the proposed algorithm takes O(log n) time and O(n/log n) processors on an EREW PRAM; if the sorted list is not given, then the time and processor complexities are O(log n) and O(n), respectively.

18.
We study the problem of parallel computation of a schedule for a system of n unit-length tasks on m identical machines, when the tasks are related by a set of precedence constraints. We present NC algorithms for computing an optimal schedule in the case where m, the number of available machines, does not vary with time and the precedence constraints are represented by a collection of outtrees. The algorithms run on an exclusive read, exclusive write (EREW) PRAM. Their complexities are O(log n) and O((log n)²) parallel time using O(n²) and O(n) processors, respectively. The schedule computed by our algorithms is a height-priority schedule. As a complementary result, we show that it is very unlikely that computing such a schedule is in NC when any of the above conditions is significantly relaxed. We prove that the problem is P-complete under logspace reductions when the precedence constraints are a collection of intrees and outtrees, or for a collection of outtrees when the number of available machines is allowed to increase with time. The time span of a height-priority schedule for an arbitrary precedence constraints graph is at most 2 − 1/(m − 1) times the optimal (N. F. Chen and C. L. Liu, Proc. 1974 Sagamore Computer Conference on Parallel Processing, T. Feng (Ed.), Springer-Verlag, Berlin, 1975, pp. 1–16). Whereas it is P-complete to produce the classical height-priority schedules even for very restricted precedence constraint graphs, we present a simple NC parallel algorithm which produces a different schedule that is only 2 − 1/m times the optimal.

19.
In this paper we present optimal processor × time parallel algorithms for term matching and anti-unification of terms represented as trees. Term matching is the special case of unification in which one of the terms is restricted to contain no variables. It has wide applicability to logic programming, term rewriting systems and symbolic pattern matching. Anti-unification is the dual problem of unification, in which one computes the most specific generalization of two terms. It has application to inductive inference and theorem proving. Our algorithms run in O(log² N) time using N/log² N processors on a shared-memory model of computation that prohibits simultaneous reads or writes (EREW PRAM). These algorithms are the first polylogarithmic-time EREW algorithms with a processor × time product of the same order as that of their sequential counterparts, thereby permitting optimal speed-ups using any number of processors up to N/log² N. We also use the techniques developed in the paper to provide an N/log N-processor, O(log N)-time algorithm for a shared-memory model that allows both simultaneous reads and simultaneous writes (CRCW PRAM). Supported by NSF Grant IRI-88-09324 and NSF/DARPA Grant CCR-8908092.
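Sequentially, term matching is a single recursive walk that binds each pattern variable at most once and checks consistency; a minimal sketch over terms encoded as nested tuples (the term encoding and names are our own, and the parallel EREW algorithms are not reproduced):

```python
def term_match(pattern, subject, bindings=None):
    """Match `pattern` (which may contain variables) against the ground term
    `subject`. Terms are tuples (functor, arg1, ..., argk); variables are
    strings starting with '?'. Returns a substitution dict or None."""
    if bindings is None:
        bindings = {}
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:                      # variable seen before:
            return bindings if bindings[pattern] == subject else None
        bindings[pattern] = subject                  # first occurrence: bind it
        return bindings
    if (not isinstance(subject, tuple) or pattern[0] != subject[0]
            or len(pattern) != len(subject)):
        return None                                  # functor/arity mismatch
    for p_arg, s_arg in zip(pattern[1:], subject[1:]):
        if term_match(p_arg, s_arg, bindings) is None:
            return None
    return bindings

# f(?x, g(?x, a))  matched against  f(b, g(b, a))
print(term_match(("f", "?x", ("g", "?x", ("a",))),
                 ("f", ("b",), ("g", ("b",), ("a",)))))   # {'?x': ('b',)}
```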

20.
We present several fast algorithms for multiple addition and prefix sums on the Linear Array with a Reconfigurable Pipelined Bus System (LARPBS), a recently proposed architecture based on optical buses. Our algorithm for adding N integers runs on an (N log M)-processor LARPBS in O(log* N) time, where log* N is the number of times the logarithm has to be taken to reduce N below 1 and M is the largest integer in the input. Our addition algorithm improves the time complexity of several matrix multiplication algorithms proposed by Li, Pan and Zheng (IEEE Trans. Parallel and Distributed Systems, 9(8):705–720, 1998). We also present several fast algorithms for computing prefix sums of N integers on the LARPBS. For integers with bounded magnitude, our first algorithm for prefix sum computation runs in O(log log N) time using N processors and in O(1) time using N^{1+ε} processors, for ε < 1. For integers with unbounded magnitude, the first algorithm for multiple addition runs in O(log log N log* N) time using N log M processors, where M is the largest integer in the input. Our second algorithm for multiple addition runs in O(log* N) time using N^{1+ε} log M processors, for ε < 1. We also show suitable extensions of our algorithm for real numbers.
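For intuition about the prefix-sum operation itself, the logarithmic-depth doubling (Hillis–Steele) scan is sketched below, with the parallel rounds simply simulated in a loop; this is not the LARPBS algorithm, and the names are illustrative:

```python
def prefix_sums(values):
    """Inclusive prefix sums by repeated doubling: after round r, position i
    holds the sum of the last 2^r inputs ending at i. With one processor per
    element each round is O(1) parallel time, giving O(log N) rounds overall;
    here the rounds are simulated sequentially."""
    x = list(values)
    n = len(x)
    step = 1
    while step < n:
        # In a parallel model all positions update simultaneously,
        # so read from the previous round's array.
        prev = x[:]
        for i in range(step, n):
            x[i] = prev[i] + prev[i - step]
        step *= 2
    return x

print(prefix_sums([3, 1, 4, 1, 5, 9, 2, 6]))  # [3, 4, 8, 9, 14, 23, 25, 31]
```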
