Similar Documents
20 similar documents found (search time: 31 ms)
1.
Dehne, Dittrich, Hutchinson. Algorithmica, 2003, 36(2): 97–122
External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posed as a challenge by the ACM Working Group on Storage I/O for Large-Scale Computing. In this paper we provide a simulation technique which produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms. The techniques obtained can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. When applied to existing BSP-like algorithms, our simulation technique produces improved parallel EM algorithms for a large number of problems.

2.
Abstract. Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a simple, deterministic simulation technique which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms. It optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel EM algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including three-dimensional convex hulls (two-dimensional Voronoi diagrams), and various graph problems. We show that certain parallel algorithms known for the BSP model can be used to obtain EM algorithms that meet well-known I/O complexity lower bounds for various problems, including sorting.

3.
Abstract. We present an optimal parallel randomized algorithm for the Voronoi diagram of a set of n nonintersecting (except possibly at endpoints) line segments in the plane. Our algorithm runs in O(log n) time with high probability using O(n) processors on a CRCW PRAM. This algorithm is optimal in terms of work done since the sequential time bound for this problem is Ω(n log n). Our algorithm improves by an O(log n) factor the previously best known deterministic parallel algorithm, given by Goodrich, Ó Dúnlaing, and Yap, which runs in O(log^2 n) time using O(n) processors. We obtain this result by using a new "two-stage" random sampling technique. By choosing large samples in the first stage of the algorithm, we avoid the hurdle of problem-size "blow-up" that is typical in recursive parallel geometric algorithms. We combine the two-stage sampling technique with efficient search and merge procedures to obtain an optimal algorithm. This technique gives an alternative optimal algorithm for the Voronoi diagram of points as well (all other optimal parallel algorithms for this problem use the transformation to three-dimensional half-space intersection).

4.
5.
Sanders, Egner, Korst. Algorithmica, 2003, 35(1): 21–55
Abstract. High performance applications involving large data sets require the efficient and flexible use of multiple disks. In an external memory machine with D parallel, independent disks, only one block can be accessed on each disk in one I/O step. This restriction leads to a load balancing problem that is perhaps the main inhibitor for the efficient adaptation of single-disk external memory algorithms to multiple disks. We solve this problem for arbitrary access patterns by randomly mapping blocks of a logical address space to the disks. We show that a shared buffer of O(D) blocks suffices to support efficient writing. The analysis uses the properties of negative association to handle dependencies between the random variables involved. This approach might be of independent interest for probabilistic analysis in general. If two randomly allocated copies of each block exist, N arbitrary blocks can be read within I/O steps with high probability. The redundancy can be further reduced from 2 to 1+1/r for any integer r without a big impact on reading efficiency. From the point of view of external memory models, these results rehabilitate Aggarwal and Vitter's "single-disk multi-head" model [1] that allows access to D arbitrary blocks in each I/O step. This powerful model can be emulated on the physically more realistic independent disk model [2] with small constant overhead factors. Parallel disk external memory algorithms can therefore be developed in the multi-head model first. The emulation result can then be applied directly or further refinements can be added.
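The core idea above — keep two randomly placed copies of every logical block and, when reading, fetch each requested block from whichever of its copies sits on a less loaded disk — can be sketched in a few lines. The toy Python below illustrates only the allocation and a greedy read-scheduling heuristic; the disk count D, the redundancy r, and the per-block pseudo-random placement are assumptions of this sketch, not the paper's exact scheme or its high-probability analysis.

```python
import random
from collections import defaultdict

D = 8   # number of independent disks (assumed for this sketch)
r = 2   # copies kept of each logical block, as in the redundancy-2 case above

def copies(block_id):
    """Pseudo-randomly place a logical block on r distinct disks (its copies)."""
    rng = random.Random(block_id)          # deterministic per block id
    return rng.sample(range(D), r)

def schedule_reads(block_ids):
    """Greedy read scheduling: fetch each requested block from whichever of its
    copies currently has the shortest queue, so a batch of N requests spreads
    nearly evenly over the D disks.  The paper proves high-probability bounds
    on the number of parallel I/O steps; this is only the intuition."""
    queue = defaultdict(int)               # pending reads per disk
    choice = {}
    for b in block_ids:
        disk = min(copies(b), key=lambda d: queue[d])
        choice[b] = disk
        queue[disk] += 1
    steps = max(queue.values(), default=0) # parallel I/O steps this batch needs
    return choice, steps
```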

6.
The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation, wherein the destination locations for individual data blocks are calculated, and data communication, wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct, indirect, and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In the indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K, the redistribution can be performed in O(log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations. Received June 1, 1997; revised March 10, 1998.
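As a concrete illustration of the index-set computation step, the sketch below computes, without any communication, where each locally held block of a cyclic(b) layout lands when the block size grows by an integer factor K (cyclic(b) → cyclic(K·b)). The function names and the assumption that K is a whole number are choices of this sketch; the paper's framework additionally schedules the actual exchanges conflict-free.

```python
from collections import defaultdict

def owner(i, b, p):
    """Processor that holds global element i under a cyclic(b) layout on p processors."""
    return (i // b) % p

def destination_sets(me, p, b, K, local_blocks):
    """Index-set computation for a cyclic(b) -> cyclic(K*b) redistribution.
    Every source block of size b falls entirely inside one target block of
    size K*b, so its destination follows from purely local arithmetic."""
    dest = defaultdict(list)
    for lb in range(local_blocks):         # lb-th block stored on processor `me`
        gb = lb * p + me                   # its global block index in the source layout
        first_elem = gb * b                # first global element of that block
        dest[owner(first_elem, K * b, p)].append(gb)
    return dest                            # destination processor -> global block indices

# Example: 4 processors, source block size 2, target block size 6 (K = 3)
# print(destination_sets(me=1, p=4, b=2, K=3, local_blocks=5))
```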

7.
In this paper a parallel algorithm is given that, given a graph G=(V,E), decides whether G is a series-parallel graph, and, if so, builds a decomposition tree for G of series and parallel composition rules. The algorithm uses O(log|E| log*|E|) time and O(|E|) operations on an EREW PRAM, and O(log|E|) time and O(|E|) operations on a CRCW PRAM. The results hold for undirected as well as for directed graphs. Algorithms with the same resource bounds are described for the recognition of graphs of treewidth two, and for constructing tree decompositions of treewidth two. Hence efficient parallel algorithms can be found for a large number of graph problems on series-parallel graphs and graphs with treewidth two. These include many well-known problems like all problems that can be stated in monadic second-order logic. Received July 15, 1997; revised January 29, 1999, and June 23, 1999.
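For readers unfamiliar with the decomposition, a two-terminal graph is series-parallel exactly when it can be reduced to a single edge between its terminals by repeatedly (a) merging parallel edges and (b) bypassing internal degree-2 vertices. The sequential reference check below uses that characterization; it is a hedged sketch of the underlying reductions, not the paper's EREW/CRCW PRAM algorithm and not the full treewidth-two machinery.

```python
from collections import defaultdict, Counter

def is_two_terminal_sp(edges, s, t):
    """Reduce the multigraph by series and parallel reductions; it is
    two-terminal series-parallel iff a single edge s-t remains."""
    adj = defaultdict(Counter)
    for u, v in edges:
        if u == v:
            return False                   # self-loops never arise from SP compositions
        adj[u][v] += 1
        adj[v][u] += 1
    changed = True
    while changed:
        changed = False
        # parallel reduction: collapse multi-edges between the same vertex pair
        for u in list(adj):
            for v, mult in list(adj[u].items()):
                if mult > 1:
                    adj[u][v] = adj[v][u] = 1
                    changed = True
        # series reduction: bypass an internal vertex of degree exactly 2
        for v in list(adj):
            if v in (s, t) or v not in adj or sum(adj[v].values()) != 2:
                continue
            a, b = list(adj[v].elements())
            if a == b:
                continue                   # handled by the parallel reduction next pass
            del adj[a][v], adj[b][v], adj[v]
            adj[a][b] += 1
            adj[b][a] += 1
            changed = True
    return set(adj) == {s, t} and adj[s] == Counter({t: 1})
```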

8.
We consider the problem of dynamically allocating and deallocating local memory resources among multiple users in a parallel or distributed system. Given a group of independent users and a collection of interconnected local memory devices, we want to render the fragmentation of the memory resources irrelevant by allowing any user to allocate space for his or her purposes as long as there is space available anywhere in the system. In effect, we would like it to appear to the users as though they are allocating memory from a single central pool of memory, even though the space is distributed throughout the system. Our goal is to devise an on-line allocation algorithm that minimizes two cost measures: first, the fraction of unused space, which arises due to fragmentation of the memory; second, the slowdown needed by the system to service user requests, which arises due to the contention for access to the memory devices. We solve this distributed dynamic allocation problem in near-optimal fashion by devising an algorithm that allows the memory to be used to 100% of capacity despite the fragmentation and guarantees that service delays will always be within a constant factor of optimal. The algorithm is completely on-line (no foreknowledge of user activity is assumed) and can accommodate any sequence of allocations and deallocations by the users that does not violate global memory bounds. We also consider the distributed dynamic allocation problem in the more restrictive setting where the local memory devices are connected by a low-degree fixed-connection network, rather than being fully interconnected. In this case, communication costs must be more explicitly considered in our allocation algorithms. We give allocation algorithms for butterfly and hypercube networks, and prove necessary and sufficient conditions on the total amount of memory space needed for near-optimal algorithms to exist. Received November 5, 1996; revised December 10, 1997.

9.
In this paper we present deterministic parallel algorithms for the coarse-grained multicomputer (CGM) and bulk synchronous parallel (BSP) models for solving the following well-known graph problems: (1) list ranking, (2) Euler tour construction in a tree, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, and (7) 2-edge connectivity and biconnectivity (testing and component computation). The algorithms require O(log p) communication rounds with linear sequential work per round (p = number of processors, N = total input size). Each processor creates, during the entire algorithm, messages of total size O((N/p) log p). The algorithms assume that the local memory per processor (i.e., N/p) is larger than p^ε, for some fixed ε > 0. Our results imply BSP algorithms with O(log p) supersteps, O(g (N/p) log p) communication time, and O((N/p) log p) local computation time. It is important to observe that the number of communication rounds/supersteps obtained in this paper is independent of the problem size, and grows only logarithmically with respect to p. With growing problem size, only the sizes of the messages grow but the total number of messages remains unchanged. Due to the considerable protocol overhead associated with each message transmission, this is an important property. The result for Problem (1) is a considerable improvement over those previously reported. The algorithms for Problems (2)–(7) are the first practically relevant parallel algorithms for these standard graph problems. Received July 5, 2000; revised April 16, 2001.
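List ranking, Problem (1) above, is the simplest to picture, and its classical PRAM formulation is pointer jumping: every node repeatedly adds its successor's rank to its own and then jumps its pointer two hops ahead. The sequential simulation below shows only that doubling idea; it uses O(log n) rounds, whereas the point of the paper is to get by with O(log p) communication rounds on a CGM/BSP machine.

```python
def list_rank(succ):
    """succ[i] is the successor of node i in a linked list; the tail points to
    itself.  Returns rank[i] = number of nodes strictly after i.
    Pointer-jumping sketch: each round would be one parallel PRAM step."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    for _ in range(max(1, n).bit_length()):      # ceil(log2 n) doubling rounds suffice
        rank = [rank[i] + rank[succ[i]] for i in range(n)]   # read old values, write new
        succ = [succ[succ[i]] for i in range(n)]
    return rank

# Example: list 3 -> 1 -> 4 -> 0 -> 2 (node 2 is the tail)
# succ = [2, 4, 2, 1, 0]  ->  list_rank(succ) == [1, 3, 0, 4, 2]
```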

10.
Practical parallel algorithms, based on classical sequential Union-Find algorithms for computing transitive closures of binary relations, are described and implemented for both shared memory and distributed memory parallel computers. By practical algorithms, we mean algorithms that are efficient for parallel systems with bounded numbers of processors as opposed to algorithms where the number of processors grows with the problem size. Transitive closures are useful for decomposing many application problems into independent subproblems. The implementations were on an ENCORE Multimax shared memory machine and an NCUBE hypercube. Our implementations indicate that transitive closure computations are intrinsically difficult for distributed memory parallel machines because of the need for global information. By contrast, our results for shared memory machines exhibited excellent speedups. Supported in part by NSF Grant DCR-8619103, ONR contract N000-86-G-0202, and DOE Grant DE-FG02-85ER25001. Supported in part by RADC contract F30602-85-C-0303.
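The sequential union-find structure the paper starts from is worth recalling, since the parallel versions are built on it. Below is a standard implementation (union by rank plus path compression) and a helper that turns a list of related pairs into the equivalence classes of the generated transitive closure; the helper name is ours, for illustration only.

```python
class UnionFind:
    """Classical sequential union-find (union by rank + path compression)."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def transitive_closure_classes(n, pairs):
    """Group 0..n-1 into the equivalence classes generated by the given pairs."""
    uf = UnionFind(n)
    for a, b in pairs:
        uf.union(a, b)
    classes = {}
    for x in range(n):
        classes.setdefault(uf.find(x), []).append(x)
    return list(classes.values())
```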

11.
The range tree is a fundamental data structure for multidimensional point sets, and, as such, is central in a wide range of geometric and database applications. In this paper we describe the first nontrivial adaptation of range trees to the parallel distributed memory setting (BSP-like models). Given a set of n points in d-dimensional Cartesian space, we show how to construct on a coarse-grained multicomputer a distributed range tree T in time O(s/p + T_c(s,p)), where s = n log^{d-1} n is the size of the sequential data structure and T_c(s,p) is the time to perform an h-relation with h = Θ(s/p). We then show how T can be used to answer a given set Q of m = O(n) range queries in time O((s log m)/p + T_c(s,p)) and O((s log m)/p + T_c(s,p) + k/p), where k is the number of results to be reported. These parallel construction and search algorithms are both highly efficient, in that their running times are the sequential time divided by the number of processors, plus a constant number of parallel communication rounds. Received June 1, 1997; revised March 10, 1998.
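To fix ideas about what the distributed structure stores, here is a compact sequential cousin of the 2-D range tree: a segment tree over the points sorted by x whose nodes keep their y-values sorted (a "merge-sort tree"), answering orthogonal range-counting queries in O(log^2 n) time with O(n log n) space. This is only a single-processor illustration of the s = n log^{d-1} n idea for d = 2, not the paper's BSP construction or its h-relation-based query batching; the class name and numeric-coordinate assumption are ours.

```python
import bisect

class MergeSortTree:
    """Static 2-D range counting over points (x, y) with numeric coordinates."""
    def __init__(self, points):
        self.pts = sorted(points)                 # primary order: by x (then y)
        self.n = len(self.pts)
        self.ys = [None] * (4 * self.n)           # per-node sorted y-values
        if self.n:
            self._build(1, 0, self.n - 1)

    def _build(self, node, lo, hi):
        if lo == hi:
            self.ys[node] = [self.pts[lo][1]]
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid)
        self._build(2 * node + 1, mid + 1, hi)
        self.ys[node] = sorted(self.ys[2 * node] + self.ys[2 * node + 1])

    def count(self, x1, x2, y1, y2):
        """Number of points with x1 <= x <= x2 and y1 <= y <= y2."""
        lo = bisect.bisect_left(self.pts, (x1, float('-inf')))
        hi = bisect.bisect_right(self.pts, (x2, float('inf'))) - 1
        if lo > hi:
            return 0
        return self._query(1, 0, self.n - 1, lo, hi, y1, y2)

    def _query(self, node, lo, hi, ql, qr, y1, y2):
        if qr < lo or hi < ql:
            return 0
        if ql <= lo and hi <= qr:
            ys = self.ys[node]
            return bisect.bisect_right(ys, y2) - bisect.bisect_left(ys, y1)
        mid = (lo + hi) // 2
        return (self._query(2 * node, lo, mid, ql, qr, y1, y2) +
                self._query(2 * node + 1, mid + 1, hi, ql, qr, y1, y2))
```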

12.
This paper considers a variety of geometric pattern recognition problems on input sets of size n using a coarse-grained multicomputer model consisting of p processors with Ω(n/p) local memory each (i.e., Ω(n/p) memory cells of Θ(log n) bits apiece), where the processors are connected to an arbitrary interconnection network. It introduces efficient scalable parallel algorithms for a number of geometric problems including the rectangle finding problem, the maximal equally spaced collinear points problem, and the point set pattern matching problem. All of the algorithms presented are scalable in that they are applicable and efficient over a very wide range of ratios of problem size to number of processors. In addition to the practicality imparted by scalability, these algorithms are easy to implement in that all required communications can be achieved by a small number of calls to standard global routing operations.

13.
In this paper we describe a technique for finding efficient parallel algorithms for problems on directed graphs that involve checking the existence of certain kinds of paths in the graph. This technique provides efficient algorithms for finding dominators in flow graphs, performing interval and loop analysis on reducible flow graphs, and finding the feedback vertices of a digraph. Each of these algorithms takes O(log^2 n) time using the same number of processors needed for fast matrix multiplication. All of these bounds are for an EREW PRAM.
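The matrix-multiplication connection is easiest to see for plain reachability: adding the identity to the adjacency matrix and squaring it ⌈log2 n⌉ times yields the reflexive-transitive closure, and each squaring is one (fast) matrix product — which is exactly where the processor bound for fast matrix multiplication enters. The numpy sketch below shows only that sequential skeleton, not the paper's dominator, interval-analysis, or feedback-vertex algorithms.

```python
import numpy as np

def transitive_closure(adj):
    """adj is an n x n 0/1 adjacency matrix; returns the boolean reachability
    matrix.  ceil(log2 n) squarings over the boolean semiring suffice because
    each squaring doubles the path length accounted for."""
    A = (np.asarray(adj) != 0).astype(np.int64)
    n = A.shape[0]
    R = A | np.eye(n, dtype=np.int64)       # paths of length <= 1
    for _ in range(max(1, n).bit_length()):
        R = (R @ R > 0).astype(np.int64)    # one matrix product per doubling step
    return R.astype(bool)
```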

14.
This paper considers four parallel Cholesky factorization algorithms, including SPOTRF from the February 1992 release of LAPACK, each of which calls parallel Level 2 or Level 3 BLAS, or both. A fifth parallel Cholesky algorithm that calls serial Level 3 BLAS is also described. The efficiency of these five algorithms on the CRAY-2, CRAY Y-MP/832, Hitachi Data Systems EX 80, and IBM 3090-600J is evaluated and compared with a vendor-optimized parallel Cholesky factorization algorithm. The fifth parallel Cholesky algorithm that calls serial Level 3 BLAS provided the best performance of all algorithms that called BLAS routines. In fact, this algorithm outperformed the Cray-optimized libsci routine (SPOTRF) by 13–44%, depending on the problem size and the number of processors used. This work was supported by grants from IMSL, Inc., and Hitachi Data Systems. The first version of this paper was presented as a poster session at Supercomputing '90, New York City, November 1990.
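The structural point — that casting most of the work as matrix-matrix (Level 3 BLAS) operations is what pays off — is easiest to see in a right-looking blocked Cholesky: each step does a small diagonal factorization, a triangular solve (TRSM-like) for the panel, and a large symmetric rank-k update (SYRK/GEMM-like) of the trailing matrix. The numpy sketch below is a serial illustration of that blocking, not any of the five parallel algorithms compared in the paper; the block size nb is an arbitrary choice.

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Return lower-triangular L with A ≈ L @ L.T for symmetric positive
    definite A, using a right-looking blocked elimination (the structure
    behind Level-3-BLAS factorizations such as xPOTRF)."""
    A = np.array(A, dtype=float)            # work on a copy
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(0, n, nb):
        kend = min(k + nb, n)
        # factor the diagonal block (small, unblocked Cholesky)
        L[k:kend, k:kend] = np.linalg.cholesky(A[k:kend, k:kend])
        if kend < n:
            # panel solve: find L21 with L21 @ L11^T = A21   (TRSM-like)
            L[kend:, k:kend] = np.linalg.solve(
                L[k:kend, k:kend], A[kend:, k:kend].T).T
            # trailing-matrix update: A22 -= L21 @ L21^T     (SYRK/GEMM-like)
            A[kend:, kend:] -= L[kend:, k:kend] @ L[kend:, k:kend].T
    return L
```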

15.
We consider the following partition problem: Given a set S of n elements that is organized as k sorted subsets of size n/k each and given a parameter h with 1/k ≤ h ≤ n/k, partition S into g = O(n/(hk)) subsets D_1, D_2, ..., D_g of size Θ(hk) each, such that, for any two indices i and j with 1 ≤ i < j ≤ g, no element in D_i is bigger than any element in D_j. Note that with various combinations of the values of parameters h and k, several fundamental problems, such as merging, sorting, and finding an approximate median, can be formulated as or be reduced to this partition problem. The partition problem also finds many applications in solving problems of parallel computing and computational geometry. In this paper we present efficient parallel algorithms for solving the partition problem and a number of its applications. Our parallel partition algorithm runs in O(log n) time using processors in the EREW PRAM model. The complexity bounds of our parallel partition algorithm on the respective special cases match those of the optimal EREW PRAM algorithms for merging, sorting, and finding an approximate median. Using our parallel partition algorithm, we are also able to obtain better complexity bounds (even possibly on a weaker parallel model) than the previously best known parallel algorithms for several important problems, including parallel multiselection, parallel multiranking, and parallel sorting of k sorted subsets. Received May 5, 1996; revised July 30, 1998.
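A naive sequential routine makes the specification concrete: merge the k sorted subsets and cut the result into consecutive chunks of roughly hk elements, so that every element of D_i is at most every element of D_j for i < j. This shows only what has to be computed (and why merging and sorting arise as special cases); it is not the paper's O(log n)-time EREW PRAM algorithm, which avoids paying for a full merge.

```python
import heapq

def partition_sorted_lists(lists, h):
    """lists: k sorted lists; h: granularity parameter from the problem statement.
    Returns groups D_1, D_2, ... of size about h*k each, ordered so that all
    elements of an earlier group are <= all elements of a later group.
    Naive O(n log k) sequential reference, not the parallel algorithm."""
    k = len(lists)
    chunk = max(1, int(round(h * k)))
    merged = list(heapq.merge(*lists))
    return [merged[i:i + chunk] for i in range(0, len(merged), chunk)]

# Example: partition_sorted_lists([[1, 4, 7], [2, 5, 8], [3, 6, 9]], h=1)
# -> [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```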

16.
Andrews, Bender, Zhang. Algorithmica, 2008, 32(2): 277–301
Abstract. Processor speed and memory capacity are increasing several times faster than disk speed. This disparity suggests that disk I/O performance could become an important bottleneck. Methods are needed for using disks more efficiently. Past analysis of disk scheduling algorithms has largely been experimental and little attempt has been made to develop algorithms with provable performance guarantees. We consider the following disk scheduling problem. Given a set of requests on a computer disk and a convex reachability function that determines how fast the disk head travels between tracks, our goal is to schedule the disk head so that it services all the requests in the shortest time possible. We present a 3/2-approximation algorithm (with a constant additive term). For the special case in which the reachability function is linear we present an optimal polynomial-time solution. The disk scheduling problem is related to the special case of the Asymmetric Traveling Salesman Problem with the triangle inequality (ATSP-Δ) in which all distances are either 0 or some constant α. We show how to find the optimal tour in polynomial time and describe how this gives another approximation algorithm for the disk scheduling problem. Finally we consider the on-line version of the problem in which uniformly distributed requests arrive over time. We present an algorithm related to the above ATSP-Δ.

17.
Parallel updates of minimum spanning trees (MSTs) have been studied in the past. These updates allowed a single change in the underlying graph, such as a change in the cost of an edge or an insertion of a new vertex. Multiple update problems for MSTs are concerned with handling more than one such change. In the sequential case multiple update problems may be solved using repeated applications of an efficient algorithm for a single update. However, for efficiency reasons, parallel algorithms for multiple update problems must consider all changes to the underlying graph simultaneously. In this paper we describe parallel algorithms for updating an MST when k new vertices are inserted or deleted in the underlying graph, when the costs of k edges are changed, or when k edge insertions and deletions are performed. For the multiple vertex insertion update, our algorithm achieves time and processor bounds of O(log n · log k) and nk/(log n · log k), respectively, on a CREW parallel random access machine. These bounds are optimal for dense graphs. A novel feature of this algorithm is a transformation of the previous MST and k new vertices to a bipartite graph which enables us to obtain the above-mentioned bounds.
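For a single inserted vertex the sequential update is simple, and it shows why the paper's transformation is natural: the new MST uses only the old tree edges plus the new vertex's edges, so Kruskal on those n−1+deg(v) edges suffices. The sketch below is that single-update sequential reference (function name and edge formats are ours); the paper's contribution is handling k simultaneous insertions, deletions, or cost changes in parallel.

```python
def mst_after_vertex_insertion(n, tree_edges, new_edges):
    """n: number of old vertices (labelled 0..n-1); the new vertex is labelled n.
    tree_edges: (u, v, w) edges of the old MST; new_edges: (u, w) edges joining
    the new vertex to old vertex u with cost w.
    Returns the edge list of the updated MST (Kruskal + union-find)."""
    parent = list(range(n + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    candidates = [(w, u, v) for u, v, w in tree_edges] + \
                 [(w, u, n) for u, w in new_edges]
    mst = []
    for w, u, v in sorted(candidates):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((u, v, w))
    return mst
```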

18.
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Blockwise access to data is a central theme in the design of efficient EM algorithms. A similar requirement arises in the transmission of data between processors in high performance parallel computation systems, for which blockwise communication is a crucial issue. We consider multisearch problems, where a large number of queries are to be simultaneously processed and satisfied by navigating through large data structures on parallel computers. Our examples originate as algorithms for parallel machines, and we adapt them to the EM situation where the queries and data structure are considered to be much larger than the size of the available internal memory. This paper presents techniques to achieve blocking for I/O as well as for communication in multisearch on the BSP and EM-BSP models. We describe improvements to the 1-optimal BSP* multisearch algorithm of [8] which permit larger search trees to be handled. In the area of EM algorithms, new algorithms for multisearch in balanced trees are described. For search trees of size O(n log n), where n is the number of queries, we obtain a work-optimal, parallel, randomized EM multisearch algorithm whose I/O and communication time are smaller, asymptotically, than the computation time. We obtain a deterministic version via a similar technique. These algorithms are obtained via the simulation techniques of [12], [17], [13], [14], and [24]. For larger trees we describe a parallel EM algorithm which is 1-optimal considering computation, communication, and I/O (for suitable parameter constellations) plus I/O-optimal. We give a lower bound on the number of I/O operations required for filtering n queries through a binary or multiway search tree of size m when m ≥ n^{2+ε}, for a constant ε > 0. Online publication February 20, 2001.

19.
Abstract. In this paper we present two parallel versions of the bisection method to compute the spectrum of symmetric Toeplitz matrices. Both parallel algorithms have been implemented and analysed on a virtual shared memory multiprocessor using a portable message-passing environment. The algorithms very efficiently parallelize the sequential method, and the application of a dynamic strategy to distribute the computations produces better results than the use of a static method. We also improve the performance of the original sequential algorithm by applying Newton's method for the final approximation of the eigenvalues. However, the poor performance of the original sequential algorithm produces low speedups when we compare the parallel methods with the best available sequential algorithm.
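The serial kernel that both parallel versions distribute is eigenvalue-counting bisection: by Sylvester's law of inertia, the number of negative pivots in a symmetric elimination of T − xI equals the number of eigenvalues of T below x, so bisection on x isolates any eigenvalue. The sketch below shows that counting/bisection step for a general symmetric matrix; it does not exploit Toeplitz structure, does not pivot, and omits the Newton polishing step mentioned above, and the interval endpoints are assumed to come from, e.g., Gershgorin discs.

```python
import numpy as np

def eigs_below(T, x):
    """Number of eigenvalues of the symmetric matrix T that are < x, counted
    as negative pivots of an unpivoted symmetric elimination of T - x*I
    (Sylvester's law of inertia).  Assumes no pivot is exactly zero."""
    A = np.asarray(T, dtype=float) - x * np.eye(len(T))
    neg = 0
    for k in range(len(A)):
        piv = A[k, k]
        if piv < 0:
            neg += 1
        if k + 1 < len(A):
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k+1:, k]) / piv
    return neg

def kth_eigenvalue(T, j, lo, hi, tol=1e-10):
    """Bisection for the j-th smallest eigenvalue (0-based) of symmetric T,
    given an interval [lo, hi] known to contain the whole spectrum."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eigs_below(T, mid) > j:   # at least j+1 eigenvalues lie below mid
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```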

20.
In this paper a systematic method for the design of efficient parallel algorithms for the dynamic evaluation of computation trees and/or expressions is presented. This method involves the use of uniform closure properties of certain classes of unary functions. Using this method, optimal parallel algorithms are given for many computation tree problems which are important in parallel algebraic and numerical computation, and parallel code generation, on exclusive-read exclusive-write parallel random access machines. Our algorithmic result is complemented by a P-complete tree problem. Received February 13, 1995; revised March 25, 1996.
