Similar Documents
20 similar documents found.
1.
This paper presents a parallel formulation of depth-first search which retains the storage efficiency of sequential depth-first search and can be mapped onto any MIMD architecture. To study its effectiveness it has been implemented to solve the 15-puzzle problem on three commercially available multiprocessors: the Sequent Balance 21000, the Intel Hypercube, and the BBN Butterfly. We have been able to achieve fairly linear speedup on the Sequent up to 30 processors (the maximum configuration available) and on the Intel Hypercube and BBN Butterfly up to 128 processors (the maximum configurations available). Many researchers considered the ring architecture to be quite suitable for parallel depth-first search; our experimental results show that hypercube and shared-memory architectures are significantly better. At the heart of our parallel formulation is a dynamic work distribution scheme that divides the work between different processors. The effectiveness of the parallel formulation is strongly influenced by the work distribution scheme and by architectural features such as the presence or absence of shared memory, the diameter of the network, the relative speed of the communication network, etc. In a companion paper,(1) we analyze the effectiveness of different load-balancing schemes and architectures, and also present new, improved work distribution schemes. This work was supported by Army Research Office Grant No. DAAG29-84-K-0060 to the Artificial Intelligence Laboratory, and Office of Naval Research Grant N00014-86-K-0763 to the Computer Science Department at the University of Texas at Austin.
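The dynamic work distribution in such formulations is typically realized by stack splitting: an idle processor sends a work request, and the donor splits its stack of untried alternatives and ships part of it back. The sketch below illustrates only that idea; the worker/donor interfaces (`expand`, `request_work`) and the alternating split heuristic are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative sketch of parallel depth-first search with dynamic work
# distribution by stack splitting. Interface names are hypothetical.
from collections import deque

def split_stack(stack):
    """Donor keeps roughly half of its open nodes and donates the rest."""
    keep, donated = deque(), deque()
    for i, node in enumerate(stack):
        (keep if i % 2 == 0 else donated).append(node)
    return keep, donated

def dfs_worker(local_stack, expand, is_goal, request_work):
    """Depth-first search on a local stack; ask a donor for work when it runs dry."""
    while True:
        if not local_stack:
            local_stack = request_work()   # e.g. poll donors; None/empty signals termination
            if not local_stack:
                return None                # no work anywhere: global termination
        node = local_stack.pop()
        if is_goal(node):
            return node
        local_stack.extend(expand(node))   # push successors; pop() keeps the search depth-first
```

Which processor is asked for work (round robin, random, nearest neighbor) and which stack entries are donated are exactly the load-balancing choices whose architectural impact the abstract and its companion paper analyze.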

2.
The authors present a scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh- and hypercube-connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm-architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube than on a mesh, despite the fact that large-scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures, such as SIMD hypercube and mesh architectures and shared-memory architectures.
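For reference, the isoefficiency function mentioned above is usually obtained from the standard efficiency expression; the derivation below is the textbook formulation (notation assumed here, not quoted from this paper), with W the problem size measured in units of sequential work and T_o(W, p) the total overhead on p processors.

```latex
% Textbook isoefficiency derivation (notation assumed, not the paper's own).
% W = problem size (total useful work), p = processors, T_o(W,p) = total overhead.
\[
  T_P = \frac{W + T_o(W,p)}{p}, \qquad
  E   = \frac{W / T_P}{p} = \frac{1}{1 + T_o(W,p)/W}.
\]
% Holding the efficiency E fixed and solving for W gives the isoefficiency function:
\[
  W \;=\; \frac{E}{1-E}\,T_o(W,p) \;=\; K\,T_o(W,p), \qquad K = \frac{E}{1-E}.
\]
```

A combination whose required W grows only slowly with p under this relation scales well; one that needs, say, exponential growth in W scales poorly, which is the sense in which one algorithm-architecture pair is called more scalable than another.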

3.
Assume we are given an n × n binary image containing horizontally convex features; i.e., for each feature, the pixels of each of its rows form an interval on that row. In this paper we consider the problem of assigning topological numbers to such features, i.e., assigning a number to every feature f so that all features to the left of f in the image have a smaller number assigned to them. This problem arises in solutions to the stereo matching problem. We present a parallel algorithm to solve the topological numbering problem in O(n) time on an n × n mesh of processors. The key idea of our solution is to create a tree from which the topological numbers can be obtained even though the tree does not uniquely represent the "to the left of" relationship of the features. The work of M. J. Atallah was supported by the Office of Naval Research under Grants N00014-84-K-0502 and N00014-86-K-0689, and the National Science Foundation under Grant DCR-8451393, with matching funds from AT&T. Part of this work was done while he was a Visiting Scientist at the Center for Advanced Architectures project of the Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffett Field, CA 94035, USA. S. E. Hambrusch's work was supported by the Office of Naval Research under Contracts N00014-84-K-0502 and N00014-86-K-0689, and by the National Science Foundation under Grant MIP-87-15652. Part of this work was done while she was visiting the International Computer Science Institute, Berkeley, CA 94704, USA. The work of L. E. TeWinkel was supported by the Office of Naval Research under Contract N00014-86-K-0689.

4.
Lisp and its descendants are among the most important and widely used of programming languages. At the same time, parallelism in the architecture of computer systems is becoming commonplace. There is a pressing need to extend the technology of automatic parallelization that has become available to Fortran programmers of parallel machines to the realm of Lisp programs and symbolic computing. In this paper we present a comprehensive approach to the compilation of Scheme programs for shared-memory multiprocessors. Our strategy has two principal components: interprocedural analysis and program restructuring. We introduce procedure strings and stack configurations as a framework in which to reason about interprocedural side-effects and object lifetimes, and develop a system of interprocedural analysis, using abstract interpretation, that is used in the dependence analysis and memory management of Scheme programs. We introduce the transformations of exit-loop translation and recursion splitting to treat the control structures of iteration and recursion that arise commonly in Scheme programs. We propose an alternative representation for s-expressions that facilitates the parallel creation and access of lists. We have implemented these ideas in a parallelizing Scheme compiler and run-time system, and we complement the theory of our work with snapshots of programs during the restructuring process, and some preliminary performance results of the execution of object codes produced by the compiler. This work was supported in part by the National Science Foundation under Grant No. NSF MIP-8410110, the U.S. Department of Energy under Grant No. DE-FG02-85ER25001, the Office of Naval Research under Grant No. ONR N00014-88-K-0686, the U.S. Air Force Office of Scientific Research under Grant No. AFOSR-F49620-86-C-0136, and by a donation from the IBM Corporation.

5.
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in Θ(log P) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel A* algorithms to solve the Traveling Salesman Problem on an nCUBE2 hypercube multicomputer. The QE strategy yields average speedup improvements of about 20-185% and 15-120% at low and intermediate work densities (the ratio of the problem size to P), respectively, over three well-known load balancing methods: the round-robin (RR), the random communication (RC), and the neighborhood averaging (NA) strategies. The average speedup observed on 1024 processors is about 985, representing a very high efficiency of 0.96. Finally, we analyze and empirically evaluate the scalability of parallel A* algorithms in terms of the isoefficiency metric. Our analysis gives (1) a Θ(P log P) lower bound on the isoefficiency function of any parallel A* algorithm, and (2) a general expression for the upper bound on the isoefficiency function of our parallel A* algorithm using the QE strategy on any topology; for the hypercube and 2-D mesh architectures the upper bounds on the isoefficiency function are found to be Θ(P log² P) and Θ(P[formula]), respectively. Experimental results validate our analysis, and also show that parallel A* search has better scalability using the QE load balancing strategy than using the RR, RC, or NA strategies.

6.
Let P and Q be two convex, n-vertex polygons. We consider the problem of computing, in parallel, some functions of P and Q when P and Q are disjoint. The model of parallel computation we consider is the CREW-PRAM, i.e., the synchronous shared-memory model where concurrent reads are allowed but no two processors can simultaneously attempt to write to the same memory location (even if they are trying to write the same thing). We show that a CREW-PRAM having n^{1/k} processors can compute the following functions in O(k^{1+ε}) time: (i) the common tangents between P and Q, and (ii) the distance between P and Q (and hence a straight line separating them). The positive constant ε can be made arbitrarily close to zero. Even with a linear number of processors, it was not previously known how to achieve constant-time performance for computing these functions. The algorithm for problem (ii) is easily modified to detect the case of zero distance as well. This research was supported by the Office of Naval Research under Grants N00014-84-K-0502 and N00014-86-K-0689, and the National Science Foundation under Grant DCR-8451393, with matching funds from AT&T.

7.
A task scheduling algorithm for parallel execution of logic programs on NUMA multiprocessors is proposed. The algorithm enforces a so-called fair polling policy. We show analytically that the proposed algorithm has a good isoefficiency function. Results from simulation on switch-based NUMA multiprocessors, the BBN Butterfly GP1000 and TC2000, corroborate the analysis. The proposed algorithm exhibits performance characteristics very similar to those of its counterpart that uses a shared memory. It achieves reasonable speedup on benchmarks, using a nonconstant-time task migration protocol. In addition, fair polling algorithms (with or without a shared memory) are shown to be consistently superior to several other known polling schemes that do not maintain fairness, for a variety of benchmark programs. Supported by Air Force Office of Scientific Research Grants AFOSR-88-0152 and AFOSR-91-0350.

8.
We present an optimal parallel algorithm for computing a cycle separator of an n-vertex embedded planar undirected graph in O(log n) time on n/log n processors. As a consequence, we also obtain an improved parallel algorithm for constructing a depth-first search tree rooted at any given vertex in a connected planar undirected graph in O(log² n) time on n/log n processors. The best previous algorithms for computing depth-first search trees and cycle separators achieved the same time complexities, but with n processors. Our algorithms run on a parallel random access machine that permits concurrent reads and concurrent writes in its shared memory and allows an arbitrary processor to succeed in case of a write conflict. A preliminary version of this paper appeared as "Improved Parallel Depth-First Search in Undirected Planar Graphs" in the Proceedings of the Third Workshop on Algorithms and Data Structures, 1993, pp. 407-420. Supported in part by NSF Grant CCR-9101385.

9.
Latency measures the delay caused by communication between processors and memory modules over the network in a parallel system. Using intensive measurements and simulation, we show that network latency forms a major obstacle to improving parallel computing performance and scalability. We present an experimental metric that uses network latency to measure and evaluate the scalability of parallel programs and architectures. This latency metric is an extension of the isoefficiency function [Grama et al., IEEE Parallel Distrib. Technology 1, 3 (1993), 12-21] and the iso-speed metric [Sun and Rover, IEEE Trans. Parallel Distrib. Systems 5, 6 (1994), 599-613]. We give a measurement method for using this latency metric, and report the experimental results of evaluating the scalability of several scientific computing algorithms on the KSR-1 shared-memory architecture. Our analysis and experiments show that the latency metric is a practical method to effectively predict and evaluate scalability based on measured latencies inherent in the program and the architecture.

10.
Distributed match-making
In many distributed computing environments, processes are concurrently executed by nodes in a store-and-forward communication network. Distributed control issues as diverse as name servers, mutual exclusion, and replicated data management involve making matches between such processes. We propose a formal problem called distributed match-making as the generic paradigm. Algorithms for distributed match-making are developed and their complexity is investigated in terms of messages and in terms of storage needed. Lower bounds on the complexity of distributed match-making are established. Optimal algorithms, or nearly optimal algorithms, are given for particular network topologies. The work of the second author was supported in part by the Office of Naval Research under Contract N00014-85-K-0168, by the Office of Army Research under Contract DAAG29-84-K-0058, by the National Science Foundation under Grant DCR-83-02391, and by the Defense Advanced Research Projects Agency (DARPA) under Contract N00014-83-K-0125. Current address of both authors: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands.

11.
Massively parallel processors have begun using commodity operating systems that support demand-paged virtual memory. To evaluate the utility of virtual memory, we measured the behavior of seven shared-memory parallel application programs on a simulated distributed-shared-memory machine. Our results (1) confirm the importance of gang CPU scheduling, (2) show that a page-faulting processor should spin rather than invoke a parallel context switch, (3) show that our parallel programs frequently touch most of their data, and (4) indicate that memory, not just CPUs, must be gang scheduled. Overall, our experiments demonstrate that demand paging has limited value on current parallel machines because of the applications' synchronization and memory reference patterns and the machines' high page-fault and parallel context-switch overheads. An earlier version of this paper was presented at Supercomputing '94. This work is supported in part by NSF Presidential Young Investigator Award CCR-9157366; NSF Grants MIP-9225097, CCR-9100968, and CDA-9024618; Office of Naval Research Grant N00014-89-J-1222; Department of Energy Grant DE-FG02-93ER25176; and donations from Thinking Machines Corporation, Xerox Corporation, and Digital Equipment Corporation.

12.
There is a large and growing body of literature concerning the solutions of geometric problems on mesh-connected arrays of processors. Most of these algorithms are optimal (i.e., run in time O(n^{1/d}) on a d-dimensional n-processor array), and they all assume that the parallel machine is trying to solve a problem of size n on an n-processor array. Here we investigate the situation where we have a mesh of size p and we are interested in using it to solve a problem of size n > p. The goal we seek is to achieve, when solving a problem of size n > p, the same speedup as when solving a problem of size p. We show that for many geometric problems, the same speedup can be achieved when solving a problem of size n > p as when solving a problem of size p. The research of M. J. Atallah was supported by the Office of Naval Research under Contracts N00014-84-K-0502 and N00014-86-K-0689, the Air Force Office of Scientific Research under Grant AFOSR-90-0107, the National Science Foundation under Grant DCR-8451393, and the National Library of Medicine under Grant R01-LM05118. Jyh-Jong Tsay's research was partially supported by the Office of Naval Research under Contract N00014-84-K-0502, the Air Force Office of Scientific Research under Grant AFOSR-90-0107, and the National Science Foundation under Grant DCR-8451393.

13.
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objectives of this paper are to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying scalability issues, and discuss their interrelationships. For example, we derive an important relationship between time-constrained scaling and the isoefficiency function. We point out some of the weaknesses of the existing schemes for measuring scalability, and discuss possible ways of extending them.

14.
We present two new parallel algorithms, QSP1 and QSP2, based on sequential quicksort for sorting data on a mesh multicomputer, and analyze their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers, while QSP1 is fairly close to optimal. Lang et al.(1) and Schnorr et al.(2) have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) run-time complexity for the one-element-per-processor case. Both QSP1 and QSP2 have better scalability than the scaled-down variants of these algorithms (for the case in which there are more elements than processors). We also analyze a different variant of Lang's sort which is as scalable as QSP2. We briefly discuss another metric called resource consumption. According to this metric, both QSP1 and QSP2 are superior to variants of Lang's sort.
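As a reminder of the sequential building block the two algorithms are derived from, the sketch below is plain recursive quicksort; it is illustrative only and does not show the papers' actual contribution (the mapping of partitions onto sub-meshes and the isoefficiency analysis).

```python
# Plain sequential quicksort: the base algorithm that QSP1/QSP2 parallelize.
def quicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    # In a mesh setting, `less` and `greater` would be handed to disjoint
    # groups of processors and sorted concurrently.
    return quicksort(less) + equal + quicksort(greater)

print(quicksort([5, 3, 8, 1, 9, 2, 7]))   # [1, 2, 3, 5, 7, 8, 9]
```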

15.
Weighted heuristic search (best-first or depth-first) refers to search with a heuristic function multiplied by a constant w [31]. The paper shows, for the first time, that for optimization queries in graphical models the weighted heuristic best-first and weighted heuristic depth-first branch-and-bound search schemes are competitive energy-minimization anytime optimization algorithms. Weighted heuristic best-first schemes have been investigated for path-finding tasks; however, their potential for graphical models was ignored, possibly because of their memory costs and because the alternative, depth-first branch and bound, seemed very appropriate for bounded depth. Weighted heuristic depth-first search has not been studied for graphical models. We report on a significant empirical evaluation, demonstrating the potential of both weighted heuristic best-first search and weighted heuristic depth-first branch-and-bound algorithms as anytime approximation schemes (that have sub-optimality bounds), and compare against one of the best depth-first branch-and-bound solvers to date.
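To make the weighting concrete: a weighted best-first search orders its frontier by f(n) = g(n) + w·h(n) with w >= 1, trading solution quality for search effort; with an admissible h the returned cost is within a factor w of optimal. The sketch below is a generic illustration under those assumptions; the graph and heuristic interfaces are hypothetical, and this is not the paper's solver or its graphical-model encoding.

```python
# Generic weighted best-first (weighted A*) sketch: priority f = g + w*h.
import heapq, itertools

def weighted_best_first(start, goal, successors, h, w=2.0):
    """successors(n) yields (neighbor, edge_cost); h(n) estimates remaining cost."""
    counter = itertools.count()                       # tie-breaker so the heap never compares nodes
    frontier = [(w * h(start), next(counter), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path                            # cost is at most w times optimal (admissible h)
        for nbr, cost in successors(node):
            g2 = g + cost
            if g2 < best_g.get(nbr, float("inf")):    # keep only the cheapest known path to nbr
                best_g[nbr] = g2
                heapq.heappush(frontier,
                               (g2 + w * h(nbr), next(counter), g2, nbr, path + [nbr]))
    return None
```

Raising w makes the search greedier (fewer expansions, looser bound), while w = 1 recovers ordinary A*; rerunning or continuing the search with decreasing w is one common way to obtain the anytime behaviour the abstract refers to.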

16.
Multilisp is a parallel programming language derived from the Scheme dialect of Lisp by addition of the future construct. It has been implemented on Concert, a 32-processor shared-memory multiprocessor. A statistics-gathering feature of Concert Multilisp produces parallelism profiles showing the number of processors busy with computing or overhead, as a function of time. Experience gained using parallelism profiles and other measurement tools on several application programs has revealed three basic ways in which future generates concurrency. These ways are illustrated on two example programs: the Lisp mapping function mapcar and the partitioning routine from Quicksort. Experience with Multilisp programming exposes issues relating to side effects, error and exception handling, low-level operations for explicit manipulation of futures and tasks, and speculative computing, which are also discussed. The basic outlines of Multilisp are now fairly clear and have stood the test of being used for several applications, but further language design work is especially needed in the areas of speculative computing and exception handling. This research was supported in part by the Defense Advanced Research Projects Agency and was monitored by the Office of Naval Research under contract numbers N00014-83-K-0125 and N00014-84-K-0099.
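For readers unfamiliar with the construct: (future e) starts evaluating e concurrently and immediately returns a placeholder; a task that touches the placeholder blocks until the value is ready. The Python sketch below is an analogy only, using concurrent.futures rather than Multilisp or the Concert runtime, to show how futures turn mapcar into concurrent work; because of Python's GIL it illustrates the semantics rather than the speedup.

```python
# Analogy to Multilisp's future: submit() starts work concurrently and returns
# a placeholder; result() forces (blocks on) the value, like touching a future.
from concurrent.futures import ThreadPoolExecutor

def parallel_mapcar(f, xs, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(f, x) for x in xs]   # roughly (mapcar (lambda (x) (future (f x))) xs)
        return [fut.result() for fut in futures]    # touching each future forces its value

if __name__ == "__main__":
    print(parallel_mapcar(lambda x: x * x, range(8)))   # [0, 1, 4, 9, 16, 25, 36, 49]
```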

17.
M. Sun, R. Glowinski. Calcolo 30(3): 219-239, 1993.
Pathwise approximate solutions to the Zakai nonlinear filtering equation are analyzed and computed through operator splitting techniques. Several examples are presented to test the performance of the operator splitting algorithms. The work of R. Glowinski was supported by the U.S. Army Research Office under Contract DAAL03-86-K-0138 and by NSF via Grant INT-8612680.

18.
Summary: A scheme for the compilation of imperative or functional programs into systolic programs is demonstrated on matrix composition/decomposition and Gauss-Jordan elimination. Using this scheme, programs for the processor network Warp and for several transputer networks have been generated. Christian Lengauer holds a Dipl. Math. (1976) from the Free University of Berlin, and an M.Sc. (1978) and Ph.D. (1982) in Computer Science from the University of Toronto. He was an Assistant Professor of Computer Sciences at The University of Texas at Austin from 1982 to 1989 and is presently a Lecturer in Computer Science at the University of Edinburgh. His past research has been in the areas of systolic design, formal semantics and program construction, and automated theorem proving. Michael Barnett received a B.A. in Computer Science from Brooklyn College/City University of New York in 1985, and is currently a Ph.D. candidate at the University of Texas at Austin, where he has been since 1986. From 1985 to 1986 he worked at IBM's T.J. Watson Research Center. His current research interests include formal methods, programming methodology, and functional programming. Duncan G. Hudson III received the B.A. degree in computer sciences from The University of Texas at Austin in 1987 and the M.S.C.S. degree in computer sciences from The University of Texas at Austin in 1989. He has worked as a Graduate Research Assistant at The University of Texas at Austin in the areas of graphical parallel programming environments, parallel numerical algorithms, and object-oriented programming languages for parallel architectures, and as a Software Design Engineer at Texas Instruments in the areas of object-oriented databases and parallel image understanding. He is currently a Ph.D. candidate in the Department of Computer Sciences at The University of Texas at Austin. His current research interests include parallel architectures and algorithms and parallelizing compilers. This research was supported in part by the following funding agencies: through Carnegie-Mellon University by the Defense Advanced Research Projects Agency monitored by the Space and Naval Warfare Systems Command under Contract N00039-87-C-0251 and by the Office of Naval Research under Contracts N00014-87-K-0385 and N00014-87-K-0533; through Oxford University by the Science and Engineering Research Council under Contract GR/E 63902; through the University of Texas at Austin by the Office of Naval Research under Contract N00014-86-K-0763 and by the National Science Foundation under Contract DCR-8610427.

19.
Hot spot contention on a network-based shared-memory architecture occurs when a large number of processors try to access a globally shared variable across the network. While multistage interconnection network (MIN) and hierarchical ring (HR) structures are two important bases on which to build large-scale shared-memory multiprocessors, the different interconnection networks and cache/memory systems of the two architectures respond very differently to network bottleneck situations. In this paper, we present a comparative performance evaluation of hot spot effects on MIN-based and HR-based shared-memory architectures. Both nonblocking MIN-based and HR-based architectures are classified, and analytical models are described for understanding the network differences and for evaluating hot spot performance on both architectures. The analytical comparisons indicate that HR-based architectures have the potential to handle the various contentions caused by hot spots more efficiently than MIN-based architectures. Intensive performance measurements of hot spots have been conducted on the BBN TC2000 (MIN-based) and the KSR1 (HR-based) machines. Performance experiments were also conducted on synchronization lock algorithms to capture practical experience with hot spots. The experimental results are consistent with the analytical models, and present practical observations and an evaluation of hot spots on the two types of architectures.

20.
The paper presents several parallel DSP (digital signal processing) algorithms and their performance analysis, targeting a hybrid message-passing and shared-memory architecture that has been built at the New Jersey Institute of Technology. The current version of our system contains eight powerful TMS320C40 processors. The algorithms are implemented on our system using message passing only, shared memory only, and, if possible, a combination of both of these parallel processing paradigms. Comparisons show that TurboNet's robust, hybrid architecture results in significant performance gains because of the flexibility it introduces.
