Similar Documents
20 similar documents found (search time: 343 ms)
1.
Algorithms for parallel memory, I: Two-level memories
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with P disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into P separate physical devices. Our algorithms' performance can be significantly better than that obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than ℓ times the optimal number of I/Os is exponentially small in ℓ(log ℓ) log(M/B), where M is the internal memory size. A summarized version of this research was presented at the 22nd Annual ACM Symposium on Theory of Computing, Baltimore, MD, May 1990. This work was done while the first author was at Brown University. Support was provided in part by a National Science Foundation Presidential Young Investigator Award with matching funds from IBM, by NSF Research Grants DCR-8403613 and CCR-9007851, by Army Research Office Grant DAAL03-91-G-0035, and by the Office of Naval Research and the Defense Advanced Research Projects Agency under Contract N00014-91-J-4052, ARPA Order 8225. This work was done in part while the second author was at Brown University supported by a Bellcore graduate fellowship and at Bellcore.
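A quick back-of-the-envelope sketch (not from the paper; parameter values invented) of the two quantities the abstract compares: the optimal parallel-I/O count, on the order of (N/(PB)) · log(N/B)/log(M/B), versus disk striping, which behaves like a single disk with logical block size P·B and therefore pays log(N/(PB))/log(M/(PB)) passes:

```python
import math

def sort_ios(N, M, B, P, striped=False):
    """Parallel I/Os to sort N records with internal memory M, block size B
    and P disks, up to constant factors. Disk striping treats the P disks
    as one disk with logical block size P*B, which wastes I/Os when M/(P*B)
    gets small."""
    if striped:
        B, P = P * B, 1
    passes = math.log(N / B) / math.log(M / B)   # distribution/merge passes
    return (N / (P * B)) * passes                # blocks moved per pass

N, M, B, P = 10**9, 10**6, 10**3, 16             # invented parameters
print(f"optimal : {sort_ios(N, M, B, P):.3g} parallel I/Os")
print(f"striping: {sort_ios(N, M, B, P, striped=True):.3g} parallel I/Os")
```

The gap between the two widens as the striped logical block P·B approaches the memory size M.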

2.
Let U and V be two sets of points in the plane, where |U| = k, |V| = ℓ, and n = k + ℓ. These two sets of points induce a directed complete bipartite graph in which the points represent nodes and an edge is directed from each node in U to each node in V. Each edge is given a cost equal to the distance between the corresponding nodes measured by some metric d on the plane. We consider the transportation problem on such a graph. We present an O(n^2.5 log n log N) algorithm, where N is the magnitude of the largest supply or demand. The algorithm uses some fundamental results of computational geometry and scaling of supplies and demands. The algorithm is valid for the ℓ1 metric, the ℓ2 metric, and the ℓ∞ metric. The running time for the ℓ1 and ℓ∞ metrics can be improved to O(n^2 (log n)^3 log N). D. S. Atkinson was supported by the National Science Foundation under Grant CCR-9057481 (PYI). P. M. Vaidya was supported by the National Science Foundation under Grants CCR-9057481 and CCR-9007195.
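For contrast with the specialized geometric algorithm, here is a minimal generic solver for the same kind of transportation instance using networkx's min-cost flow (an assumed dependency; the ℓ2 metric and the rounding of costs to integers are illustration choices, and this generic route does not achieve the paper's running time):

```python
import math
import networkx as nx

def transport_cost(U, V, supplies, demands):
    """Solve a planar transportation instance by generic min-cost flow.
    U, V: lists of (x, y) points; supplies/demands: positive integers with
    sum(supplies) == sum(demands). Edge costs use the L2 metric, scaled to
    integers for the solver."""
    G = nx.DiGraph()
    for i, s in enumerate(supplies):
        G.add_node(("u", i), demand=-s)            # negative demand = supply
    for j, d in enumerate(demands):
        G.add_node(("v", j), demand=d)
    for i, p in enumerate(U):
        for j, q in enumerate(V):
            w = round(1000 * math.dist(p, q))      # integer edge cost
            G.add_edge(("u", i), ("v", j), weight=w)
    flow = nx.min_cost_flow(G)
    return sum(flow[("u", i)][("v", j)] * G[("u", i)][("v", j)]["weight"]
               for i in range(len(U)) for j in range(len(V))) / 1000

print(transport_cost([(0, 0), (2, 0)], [(1, 0)], [1, 2], [3]))  # -> 3.0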

3.
This paper presents the analysis of a parallel formulation of depth-first search. At the heart of this parallel formulation is a dynamic work-distribution scheme that divides the work between different processors. The effectiveness of the parallel formulation is strongly influenced by the work-distribution scheme and the target architecture. We introduce the concept of the isoefficiency function to characterize the effectiveness of different architectures and work-distribution schemes. Many researchers considered the ring architecture to be quite suitable for parallel depth-first search. Our analytical and experimental results show that hypercube and shared-memory architectures are significantly better. The analysis of previously known work-distribution schemes motivated the design of substantially improved schemes for ring and shared-memory architectures. In particular, we present a work-distribution algorithm that guarantees close to optimal performance on a shared-memory/ω-network-with-message-combining architecture (e.g., the RP3). Much of the analysis presented in this paper is applicable to other parallel algorithms in which work is dynamically shared between different processors (e.g., parallel divide-and-conquer algorithms). The concept of isoefficiency is useful in characterizing the scalability of a variety of parallel algorithms. This work was supported by Army Research Office Grant No. DAAG29-84-K-0060 to the Artificial Intelligence Laboratory, and Office of Naval Research Grant N00014-86-K-0763 to the Computer Science Department at the University of Texas at Austin.
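A hedged sketch of how an isoefficiency function can be extracted numerically: fix a target efficiency E = W/(W + To(W, p)) and solve W = (E/(1−E)) · To(W, p) by fixed-point iteration. The overhead function To below is hypothetical, purely to show the mechanics:

```python
import math

def isoefficiency_W(overhead, p, E=0.5, iters=60):
    """Smallest problem size W that sustains efficiency E on p processors,
    found by fixed-point iteration on the isoefficiency relation
    W = (E / (1 - E)) * To(W, p), where To is the total overhead."""
    W = 1.0
    for _ in range(iters):
        W = (E / (1 - E)) * overhead(W, p)
    return W

# Hypothetical overhead of some work-distribution scheme (made up for the demo).
To = lambda W, p: p * math.sqrt(p) * math.log2(max(W, 2.0))

for p in (16, 64, 256):
    print(f"p={p:3d}  ->  W must grow to ~{isoefficiency_W(To, p):,.0f}")
```

The faster W must grow with p to hold E constant, the less scalable the scheme/architecture pair.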

4.
We present algorithms for the randomized simulation of a shared memory machine (PRAM) on a Distributed Memory Machine (DMM). In a PRAM, memory conflicts occur only through concurrent access to the same cell, whereas the memory of a DMM is divided into modules, one for each processor, and concurrent accesses to the same module create a conflict. The delay of a simulation is the time needed to simulate a parallel memory access of the PRAM. Any general simulation of an m-processor PRAM on an n-processor DMM will necessarily have delay at least m/n. A randomized simulation is called time-processor optimal if the delay is O(m/n) with high probability. Using a novel simulation scheme based on hashing, we obtain a time-processor optimal simulation with delay O(log log(n) log*(n)). The best previous simulations use a simpler scheme based on hashing and have much larger delay: Θ(log(n)/log log(n)) for the simulation of an n-processor PRAM on an n-processor DMM, and Θ(log(n)) in the case where the simulation is time-processor optimal. Our simulations use several (two or three) hash functions to distribute the shared memory among the memory modules of the DMM. The stochastic processes modeling the behavior of our algorithms, and their analyses based on powerful classes of universal hash functions, may be of independent interest. Research partially supported by NSF/DARPA Grant CCR-9005448. Work was done while at the University of California at Berkeley and the International Computer Science Institute, Berkeley, CA. Research partially supported by National Science Foundation Operating Grant CCR-9016468, National Science Foundation Operating Grant CCR-9304722, United States-Israel Binational Science Foundation Grant No. 89-00312, United States-Israel Binational Science Foundation Grant No. 92-00226, and ESPRIT BR Grant EC-US 030. Part of this work was done during a visit at the International Computer Science Institute at Berkeley; supported in part by DFG-Forschergruppe "Effiziente Nutzung massiv paralleler Systeme", Teilprojekt 4, and by the Esprit Basic Research Action Nr. 7141 (ALCOM II).
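A toy illustration (not the paper's access-scheduling protocol) of why several hash functions help: placing each requested cell on the least-loaded of its hashed modules, in the spirit of multiple-choice allocation, sharply reduces the worst module congestion that lower-bounds the delay of a step. The linear hash functions below are stand-ins:

```python
import random
from collections import Counter

def step_delay(requests, n_modules, hashes):
    """Congestion of one simulated PRAM step on a DMM: each requested
    shared-memory cell goes to the least-loaded of the modules offered by
    its hash functions (greedy multiple-choice placement). The step's
    delay is at least the worst module load."""
    load = Counter()
    for cell in requests:
        options = [h(cell) % n_modules for h in hashes]
        target = min(options, key=lambda m: load[m])   # least-loaded copy
        load[target] += 1
    return max(load.values())

n = 1024
hashes = [lambda x, a=a: (a * x + 17) % 1_000_003 for a in (3, 5, 11)]
requests = random.sample(range(10**6), n)   # n distinct cells in one step
print("one hash function   :", step_delay(requests, n, hashes[:1]))
print("three hash functions:", step_delay(requests, n, hashes))
```

The printed maxima typically differ by roughly a factor of two, mirroring the gap between the single-hash and multiple-hash delay bounds quoted above.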

5.
Massively parallel processors have begun using commodity operating systems that support demand-paged virtual memory. To evaluate the utility of virtual memory, we measured the behavior of seven shared-memory parallel application programs on a simulated distributed-shared-memory machine. Our results (1) confirm the importance of gang CPU scheduling, (2) show that a page-faulting processor should spin rather than invoke a parallel context switch, (3) show that our parallel programs frequently touch most of their data, and (4) indicate that memory, not just CPUs, must be gang scheduled. Overall, our experiments demonstrate that demand paging has limited value on current parallel machines because of the applications' synchronization and memory reference patterns and the machines' high page-fault and parallel context-switch overheads. An earlier version of this paper was presented at Supercomputing '94. This work is supported in part by NSF Presidential Young Investigator Award CCR-9157366; NSF Grants MIP-9225097, CCR-9100968, and CDA-9024618; Office of Naval Research Grant N00014-89-J-1222; Department of Energy Grant DE-FG02-93ER25176; and donations from Thinking Machines Corporation, Xerox Corporation, and Digital Equipment Corporation.

6.
This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to shortcut pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)-step, linear-processor, linear-space, conservative algorithms for a variety of problems on n-node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions in an expression tree. We give O(lg^2 n)-step, linear-processor, linear-space, conservative algorithms for problems on graphs of size n, including finding a minimum-cost spanning forest, computing biconnected components, and constructing an Eulerian cycle. Most of these algorithms use as a subroutine a generalization of the prefix computation to trees. We show that any such treefix computation can be performed in O(lg n) steps using a conservative variant of Miller and Reif's tree-contraction technique. This research was supported in part by the Defense Advanced Research Projects Agency under Contract N00014-80-C-0622 and by the Office of Naval Research under Contract N00014-86-K-0593. Charles Leiserson is supported in part by an NSF Presidential Young Investigator Award with matching funds provided by AT&T Bell Laboratories and Xerox Corporation. Bruce Maggs is supported in part by an NSF Fellowship.
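The pointer-shortcutting idea in the lemma is the same mechanism as classic pointer jumping; here is a compact sequential simulation of the synchronous rounds for list ranking (a stand-in illustration of shortcutting, not the DRAM treefix algorithm itself):

```python
def list_rank(next_):
    """Rank each node of a linked list (distance to the tail) by pointer
    jumping. Each round shortcuts every pointer past one node, halving the
    remaining chains, so ~log2(n) synchronous rounds suffice; the parallel
    rounds are simulated here by whole-array updates."""
    n = len(next_)
    nxt = list(next_)                     # nxt[i] == i marks the list tail
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    for _ in range(max(1, n).bit_length()):
        # new rank/next computed from the old arrays: synchronous semantics
        rank = [rank[i] + rank[nxt[i]] if nxt[i] != i else rank[i]
                for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]   # shortcut: skip a node
    return rank

# 0 -> 1 -> 2 -> 3 (tail points to itself)
print(list_rank([1, 2, 3, 3]))   # [3, 2, 1, 0]
```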

7.
Lisp and its descendants are among the most important and widely used of programming languages. At the same time, parallelism in the architecture of computer systems is becoming commonplace. There is a pressing need to extend the technology of automatic parallelization that has become available to Fortran programmers of parallel machines, to the realm of Lisp programs and symbolic computing. In this paper we present a comprehensive approach to the compilation of Scheme programs for shared-memory multiprocessors. Our strategy has two principal components: interprocedural analysis and program restructuring. We introduce procedure strings and stack configurations as a framework in which to reason about interprocedural side-effects and object lifetimes, and develop a system of interprocedural analysis, using abstract interpretation, that is used in the dependence analysis and memory management of Scheme programs. We introduce the transformations of exit-loop translation and recursion splitting to treat the control structures of iteration and recursion that arise commonly in Scheme programs. We propose an alternative representation for s-expressions that facilitates the parallel creation and access of lists. We have implemented these ideas in a parallelizing Scheme compiler and run-time system, and we complement the theory of our work with snapshots of programs during the restructuring process, and some preliminary performance results of the execution of object codes produced by the compiler. This work was supported in part by the National Science Foundation under Grant No. NSF MIP-8410110, the U.S. Department of Energy under Grant No. DE-FG02-85ER25001, the Office of Naval Research under Grant No. ONR N00014-88-K-0686, the U.S. Air Force Office of Scientific Research under Grant No. AFOSR-F49620-86-C-0136, and by a donation from the IBM Corporation.

8.
We study the grouping by swapping problem, which occurs in memory compaction and in computing the exponential of a matrix. In this problem we are given a sequence of n numbers drawn from {0, 1, 2, ..., m−1} with repetitions allowed; we are to rearrange them, using as few swaps of adjacent elements as possible, into an order such that all the like numbers are grouped together. It is known that this problem is NP-hard. We present a probabilistic analysis of a grouping algorithm called MEDIAN that works by sorting the numbers in the sequence according to their median positions. Our results show that the expected behavior of MEDIAN is within 10% of optimal and is asymptotically optimal as n/m → ∞ or as n/m → 0. The work of this author was supported in part by the Texas Advanced Research Program under Grant 4096.
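A small sketch of MEDIAN as described: order the distinct values by the median of their occurrence positions, then count how many adjacent swaps reaching that grouped order costs (the bubble-sort distance, i.e., inversions against the target ranking). Helper names are mine:

```python
from statistics import median

def median_grouping_swaps(seq):
    """MEDIAN heuristic: sort values by the median of their positions in
    seq, then count the adjacent swaps needed to reach that grouped order
    (= inversions of the induced rank sequence)."""
    positions = {}
    for idx, v in enumerate(seq):
        positions.setdefault(v, []).append(idx)
    order = sorted(positions, key=lambda v: median(positions[v]))
    target_rank = {v: r for r, v in enumerate(order)}
    ranks = [target_rank[v] for v in seq]
    # equal ranks are already "grouped" relative to each other
    return sum(ranks[i] > ranks[j] for i in range(len(ranks))
               for j in range(i + 1, len(ranks)))

print(median_grouping_swaps([0, 1, 0, 2, 1]))  # 2 swaps: -> 0,0,1,1,2
```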

9.
Assume we are given an n × n binary image containing horizontally convex features; i.e., for each feature, each of its rows' pixels form an interval on that row. In this paper we consider the problem of assigning topological numbers to such features, i.e., assigning a number to every feature f so that all features to the left of f in the image have a smaller number assigned to them. This problem arises in solutions to the stereo matching problem. We present a parallel algorithm to solve the topological numbering problem in O(n) time on an n × n mesh of processors. The key idea of our solution is to create a tree from which the topological numbers can be obtained even though the tree does not uniquely represent the "to the left of" relationship of the features. The work of M. J. Atallah was supported by the Office of Naval Research under Grants N00014-84-K-0502 and N00014-86-K-0689, and the National Science Foundation under Grant DCR-8451393, with matching funds from AT&T. Part of this work was done while he was a Visiting Scientist at the Center for Advanced Architectures project of the Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffett Field, CA 94035, USA. S. E. Hambrusch's work was supported by the Office of Naval Research under Contracts N00014-84-K-0502 and N00014-86-K-0689, and by the National Science Foundation under Grant MIP-87-15652. Part of this work was done while she was visiting the International Computer Science Institute, Berkeley, CA 94704, USA. The work of L. E. TeWinkel was supported by the Office of Naval Research under Contract N00014-86-K-0689.

10.
Fractional cascading is a technique designed to allow efficient sequential search in a graph with catalogs of total size n. The search consists of locating a key in the catalogs along a path. In this paper we show how to preprocess a variety of fractional cascaded data structures whose underlying graph is a tree so that searching can be done efficiently in parallel. The preprocessing takes O(log n) time with n/log n processors on an EREW PRAM. For a balanced binary tree, cooperative search along root-to-leaf paths can be done in O((log n)/log p) time using p processors on a CREW PRAM. Both of these time/processor constraints are optimal. The searching in the fractional cascaded data structure can be either explicit, in which the search path is specified before the search starts, or implicit, in which the branching is determined at each node. We apply this technique to a variety of geometric problems, including point location, range search, and segment intersection search. An earlier version of this work appears in Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, July 1990, pp. 307–316. The first author's support was provided in part by National Science Foundation Grant CCR-9007851, by the U.S. Army Research Office under Grants DAAL03-91-G-0035 and DAAH04-93-0134, and by the Office of Naval Research and the Advanced Research Projects Agency under Contract N00014-91-J-4052, ARPA Order 8225. This research was performed while the second author was at Brown University. Support was provided in part by an NSF Presidential Young Investigator Award CCR-9047466, with matching funds from IBM, by National Science Foundation Grant CCR-9007851, by the U.S. Army Research Office under Grant DAAL03-91-G-0035, and by the Office of Naval Research and the Advanced Research Projects Agency under Contract N00014-91-J-4052, ARPA Order 8225.

11.
We consider the problem of sorting n integers when the elements are drawn from the restricted domain [1..n]. A new deterministic parallel algorithm for sorting n integers is obtained. Its running time is O(log n log(n/log n)) using n/log n processors on an EREW (exclusive read exclusive write) PRAM (parallel random access machine). The algorithm can also be modified to become optimal with a suitably chosen number of processors. This algorithm belongs to class EP (Efficient, Polynomial fast).
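For reference, the sequential baseline against which such a parallel algorithm is measured is linear-time counting sort over the domain [1..n]; a minimal sketch:

```python
def integer_sort(a):
    """Sequential counting sort for keys in [1..n], n = len(a): the
    work-optimal O(n) baseline that a cost-optimal parallel integer
    sorting algorithm must match in total work."""
    n = len(a)
    count = [0] * (n + 1)
    for x in a:                  # histogram of key occurrences
        count[x] += 1
    out = []
    for key in range(1, n + 1):  # emit keys in increasing order
        out.extend([key] * count[key])
    return out

print(integer_sort([3, 1, 4, 1, 5, 2, 6, 5]))  # keys drawn from [1..8]
```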

12.
We answer questions about the distribution of the maximum size of queues and data structures as a function of time. The concept of maximum occurs in many issues of resource allocation. We consider several models of growth, including general birth-and-death processes, the M/G/∞ model, and a non-Markovian process (data structure) for processing plane-sweep information in computational geometry, called hashing with lazy deletion (HwLD). It has been shown that HwLD is optimal in terms of expected time and dynamic space; our results show that it is also optimal in terms of expected preallocated space, up to a constant factor. We take two independent and complementary approaches: first, in Section 2, we use a variety of algebraic and analytical techniques to derive exact formulas for the distribution of the maximum queue size in stationary birth-and-death processes and in a nonstationary model related to file histories. The formulas allow numerical evaluation and some asymptotics. In our second approach, in Section 3, we consider the M/G/∞ model (which includes M/M/∞ as a special case) and use techniques from the analysis of algorithms to get optimal big-oh bounds on the expected maximum queue size and on the expected maximum amount of storage used by HwLD in excess of the optimal amount. The techniques appear extendible to other models, such as M/M/1. Research was also done while the author was at Princeton University, supported in part by a Procter Fellowship. Research was also done while the author was on sabbatical at INRIA in Rocquencourt, France, and at Ecole Normale Supérieure in Paris, France. Support was provided in part by National Science Foundation Research Grant DCR-84-03613, by an NSF Presidential Young Investigator Award with matching funds from an IBM Faculty Development Award and an AT&T research grant, by a Guggenheim Fellowship, and by the Office of Naval Research and the Defense Advanced Research Projects Agency under Contract N00014-83-K-0146 and ARPA Order 6320, Amendment 1.
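A Monte Carlo sketch of the "expected maximum" quantity for the M/M/∞ special case (the paper derives exact and asymptotic formulas; this simulation, with invented rates, only illustrates the random variable being bounded):

```python
import random

def max_occupancy(lam, horizon, seed=0):
    """Monte Carlo estimate of the maximum number of items simultaneously
    present in an M/M/infinity system over [0, horizon]: Poisson arrivals
    at rate lam, independent Exp(1) service times. The running maximum can
    only increase at an arrival instant, so checking arrivals suffices."""
    rng = random.Random(seed)
    t, departures, peak = 0.0, [], 0
    while t < horizon:
        t += rng.expovariate(lam)                      # next arrival
        departures = [d for d in departures if d > t]  # drop finished items
        departures.append(t + rng.expovariate(1.0))    # this item's departure
        peak = max(peak, len(departures))
    return peak

print(max_occupancy(lam=10.0, horizon=1000.0))   # invented rates
```

The stationary occupancy here is Poisson with mean lam; the quantity studied above is how far the maximum over a long horizon exceeds that mean.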

13.
We present two new parallel algorithms, QSP1 and QSP2, based on sequential quicksort for sorting data on a mesh multicomputer, and analyze their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers, while QSP1 is fairly close to optimal. Lang et al. (1) and Schnorr et al. (2) have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) run-time complexity for the one-element-per-processor case. Both QSP1 and QSP2 have better scalability than the scaled-down variants of these algorithms (for the case in which there are more elements than processors). We also analyze a different variant of Lang's sort which is as scalable as QSP2. We briefly discuss another metric called resource consumption. According to this metric, both QSP1 and QSP2 are superior to variants of Lang's sort.

14.
Summary. The efficient parallel algorithms proposed for many fundamental problems, such as list ranking, integer sorting and computing preorder numberings on trees, are very sensitive to processor failures. The requirement of efficiency (commonly formalized using Parallel-time × Processors as a cost measure) has led to the design of highly tuned PRAM algorithms which, given the additional constraint of simple processor failures, unfortunately become inefficient or even incorrect. We propose a new notion of robustness, which combines efficiency with fault tolerance. For the common case of fail-stop errors, we develop a general and easy to implement technique to make robust many efficient parallel algorithms, e.g., algorithms for all the problems listed above. More specifically, for any dynamic pattern of fail-stop errors on a CRCW PRAM with at least one surviving processor, our method increases the original algorithm cost by at most a log^2 multiplicative factor. Our technique is based on a robust solution of the problem of Write-All, i.e., using P processors, write 1's in all locations of an N-sized array. In addition we show that at least a log/log log multiplicative overhead will be incurred for certain patterns of failures by any algorithm that implements robust solutions to Write-All with P = N. However, by exploiting parallel slackness, we obtain an optimal cost algorithm when P is suitably smaller than N. Paris C. Kanellakis is a professor of computer science at Brown University. His primary research interests are in the application of logic to computer science, such as high-level query languages for database systems, parallel evaluation of logic programs, and type inference for programming languages. He has published many articles on these subjects and is the author of the chapter "Elements of Relational Database Theory" in the Handbook of Theoretical Computer Science (Elsevier, 1990). He has also contributed to the theory of distributed computing and to combinatorial optimization. Alex Allister Shvartsman is an engineer at Digital Equipment Corporation. His professional interests include the design and development of efficient distributed systems, distributed resource management, and theoretical foundations of fault-tolerant parallel computation. At Digital he architected and managed the development of distributed control systems that automated several of Digital's manufacturing processes. He is currently on academic leave at Brown University. An extended abstract of a part of this work appears as: Kanellakis and Shvartsman (1989), in the Proceedings of the 8th ACM Symposium on Principles of Distributed Computing, Edmonton, 1989. This author was supported by NSF grant IRI-8617344, an Alfred P. Sloan fellowship and ONR grant N00014-83-K-0146, ARPA Order No. 4786. This author was supported by the DEC GEEP Doctoral program and ONR grant N00014-91-J-1613.
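To make the Write-All primitive concrete, here is a deliberately naive fail-stop simulation (mine, not the paper's log^2-overhead algorithm): in each synchronous step every surviving processor writes one still-empty cell, and processor 0 is kept alive to mirror the "at least one surviving processor" assumption; the returned work count is the Parallel-time × Processors cost actually expended:

```python
import random

def write_all(N, P, fail_prob=0.3, seed=7):
    """Naive fail-stop Write-All: per synchronous step, every live
    processor writes one still-empty cell (simple striding over the
    empties), then may crash; processor 0 never fails. Returns
    (all cells written?, total processor-steps of work used)."""
    rng = random.Random(seed)
    cells = [0] * N
    live = list(range(P))
    work = 0
    while 0 in cells:
        empties = [i for i, c in enumerate(cells) if c == 0]
        for k, _ in enumerate(live):
            cells[empties[k % len(empties)]] = 1   # duplicate writes possible
            work += 1
        live = [p for p in live if p == 0 or rng.random() > fail_prob]
    return all(cells), work

print(write_all(N=64, P=16))
```

The interesting question, answered by the paper, is how little extra work above N such a scheme can be forced to do against worst-case crash patterns.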

15.
Many applications of digital image processing now deal with three- and higher-dimensional images. One way to represent n-dimensional digital images is to use the specialization graphs of subspaces of the Alexandroff topological space ℤ^n (where ℤ denotes the integers with the Khalimsky line topology). In this paper the dimension of any such graph is defined in three ways, and the equivalence of the three definitions is established. Two of the definitions have a geometric basis and are closely related to the topological definition of inductive dimension; the third extends the Alexandroff dimension to graphs. Diagrams are given of graphs that are dimensionally correct discrete models of Euclidean spaces, n-dimensional spheres, a projective plane and a torus. New characterizations of n-dimensional (digital) surfaces are presented. Finally, the local structure of the space ℤ^n is analyzed, and it is shown that ℤ^n is an n-dimensional surface for all n ≥ 1.

16.
We prove an Ω(lg^2 n / lg lg n) lower bound for the depth of n-input sorting networks based on the shuffle permutation. The best previously known lower bound was the trivial Ω(lg n) bound, while the best upper bound is given by Batcher's O(lg^2 n)-depth bitonic sorting network. The proof technique employed in the lower bound argument may be of independent interest. C. G. Plaxton was supported by NSF Research Initiation Award CCR-9111591, and the Texas Advanced Research Program under Grant No. 003658-480. T. Suel was supported by the Texas Advanced Research Program under Grant No. 003658-480, and by an MCD Fellowship of the University of Texas at Austin.
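The upper-bound side is easy to make concrete: Batcher's bitonic network on n = 2^k inputs has exactly k(k+1)/2 comparator stages. A sketch that both sorts and counts that depth:

```python
def bitonic_sort(a):
    """Batcher's bitonic sorter (len(a) must be a power of two). The two
    nested loops expose the network structure: one comparator stage per
    (phase, distance) pair, k(k+1)/2 stages in all for n = 2^k inputs."""
    a = list(a)
    n = len(a)
    depth = 0
    phase = 2
    while phase <= n:                      # merge bitonic runs of length `phase`
        d = phase // 2
        while d > 0:                       # one comparator stage per distance d
            depth += 1
            for i in range(n):
                j = i ^ d
                if j > i:
                    ascending = (i & phase) == 0   # direction of this block
                    if (a[i] > a[j]) == ascending:
                        a[i], a[j] = a[j], a[i]
            d //= 2
        phase *= 2
    return a, depth

print(bitonic_sort([5, 1, 4, 8, 7, 2, 6, 3]))  # ([1..8], depth 6 for k=3)
```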

17.
On parallel integer sorting
We present an optimal algorithm for sorting n integers in the range [1, n^c] (for any constant c) for the EREW PRAM model where the word length is n^ε, for any ε > 0. Using this algorithm, the best known upper bound for integer sorting on the (O(log n) word length) EREW PRAM model is improved. In addition, a novel parallel range-reduction algorithm which results in a near optimal randomized integer sorting algorithm is presented. For the case when the keys are uniformly distributed integers in an arbitrary range, we give an algorithm whose expected running time is optimal. Supported by NSF-DCR-85-03251 and ONR contract N00014-87-K-0310.

18.
Summary. The problem of fault-tolerant agreement is fundamental to distributed computing. When agreement is to be reached in spite of arbitrary behavior by faulty processors, this problem is called Distributed Consensus. By suitably bounding the number of faulty processors relative to n, the number of processors in the system, we are able to derive two new protocols for Distributed Consensus. Both are simple and use messages that are only one bit in length, and both provide for early stopping: the fewer failures there are, the fewer rounds of communication are required. One protocol is optimal with respect to the number of rounds of communication required, and the other is asymptotically optimal with respect to the total number of message bits exchanged. James E. Burns received the B.S. degree in mathematics from the California Institute of Technology, the M.B.I.S. degree from Georgia State University, and the M.S. and Ph.D. degrees in information and computer science from the Georgia Institute of Technology. He served on the faculty of Computer Science at Indiana University and the College of Computing at the Georgia Institute of Technology before joining Bellcore in 1993. He is currently a Member of Technical Staff in the Network Control Research Department, where he is studying the telephone control network with special interest in behavior when faults occur. He also has research interests in theoretical issues of distributed and parallel computing, especially relating to problems of synchronization and fault tolerance. Gil Neiger was born on February 19, 1957 in New York, New York. In June 1979, he received an A.B. in Mathematics and Psycholinguistics from Brown University in Providence, Rhode Island. In February 1985, he spent two weeks picking cotton in Nicaragua in a brigade of international volunteers. In January 1986, he received an M.S. in Computer Science from Cornell University in Ithaca, New York and, in August 1988, he received a Ph.D. in Computer Science, also from Cornell University. On August 20, 1988, Dr. Neiger married Hilary Lombard in Lansing, New York. Since August 1988, he has been an Assistant Professor in the College of Computing (formerly School of Information and Computer Science) at the Georgia Institute of Technology in Atlanta, Georgia. Dr. Neiger is a member of the editorial boards of the Chicago Journal of Theoretical Computer Science and the Journal of Parallel and Distributed Computing. This author was supported in part by the National Science Foundation under grants CCR-8909663, CCR-9106627, and CCR-9301454.

19.
Let P and Q be two convex, n-vertex polygons. We consider the problem of computing, in parallel, some functions of P and Q when P and Q are disjoint. The model of parallel computation we consider is the CREW-PRAM, i.e., the synchronous shared-memory model where concurrent reads are allowed but no two processors can simultaneously attempt to write in the same memory location (even if they are trying to write the same thing). We show that a CREW-PRAM having n^(1/k) processors can compute the following functions in O(k^(1+ε)) time: (i) the common tangents between P and Q, and (ii) the distance between P and Q (and hence a straight line separating them). The positive constant ε can be made arbitrarily close to zero. Even with a linear number of processors, it was not previously known how to achieve constant-time performance for computing these functions. The algorithm for problem (ii) is easily modified to detect the case of zero distance as well. This research was supported by the Office of Naval Research under Grants N00014-84-K-0502 and N00014-86-K-0689, and the National Science Foundation under Grant DCR-8451393, with matching funds from AT&T.
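A brute-force O(n^2) sequential stand-in for function (ii): since the polygons are disjoint, their boundaries do not cross, so the minimum distance is realized between a vertex of one polygon and an edge of the other. Function names are mine:

```python
import math
from itertools import product

def point_seg_dist(p, a, b):
    """Euclidean distance from point p to segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    t = 0.0 if dx == dy == 0 else max(0.0, min(1.0,
        ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def polygon_dist(P, Q):
    """Distance between two disjoint convex polygons (vertex lists),
    by checking every vertex of each against every edge of the other."""
    edges = lambda V: [(V[i], V[(i + 1) % len(V)]) for i in range(len(V))]
    return min(min(point_seg_dist(p, *e) for p, e in product(P, edges(Q))),
               min(point_seg_dist(q, *e) for q, e in product(Q, edges(P))))

P = [(0, 0), (2, 0), (1, 2)]
Q = [(4, 0), (6, 0), (5, 2)]
print(polygon_dist(P, Q))   # 2.0, between vertices (2,0) and (4,0)
```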

20.
In this paper we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However, many of these merge-based algorithms have large underlying constants in the time bounds, because they suffer from a lack of read parallelism on the PDM. The irregular consumption of the runs during the merge affects the read parallelism and contributes to the increased sorting time. In this paper, we first introduce a novel idea called dirty sequence accumulation that improves the read parallelism. Next, we show analytically that this idea can reduce the number of parallel I/Os required to sort the input close to the known lower bound. We verify experimentally our dirty sequence idea with the standard R-way merge and show that our idea can reduce the number of parallel I/Os needed to sort on the PDM significantly.
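A toy model (my assumptions, not the paper's dirty-sequence scheme) of why irregular run consumption hurts read parallelism on the PDM: suppose block b of run r sits on disk (r + b) mod D and blocks are fetched on demand; consecutive fetches can share one parallel I/O only while they hit distinct disks. Evenly consumed runs reach the optimum, a skewed merge does not:

```python
def parallel_read_ios(runs, B, D):
    """Parallel read I/Os for an R-way merge on D disks, under the toy
    placement disk(r, b) = (r + b) % D and on-demand fetching in
    merge-consumption order; a greedy batch of fetches counts as one
    parallel I/O only while all its disks are distinct."""
    taken = [0] * len(runs)
    fetches = []                                   # demanded blocks' disks
    for _, rid in sorted((x, i) for i, r in enumerate(runs) for x in r):
        if taken[rid] % B == 0:                    # first element of a block
            fetches.append((rid + taken[rid] // B) % D)
        taken[rid] += 1
    ios, busy = 0, set()
    for disk in fetches:                           # greedy batching into I/Os
        if disk in busy:
            ios, busy = ios + 1, set()
        busy.add(disk)
    return ios + (1 if busy else 0)

interleaved = [list(range(i, 60, 3)) for i in range(3)]        # even consumption
skewed = [list(range(i * 20, i * 20 + 20)) for i in range(3)]  # one run at a time
print(parallel_read_ios(interleaved, B=4, D=3))   # 5: matches 15 blocks / 3 disks
print(parallel_read_ios(skewed, B=4, D=3))        # 6: read parallelism lost
```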
