Similar Documents (20 results)
1.
We present algorithms for the randomized simulation of a shared memory machine (PRAM) on a Distributed Memory Machine (DMM). In a PRAM, memory conflicts occur only through concurrent access to the same cell, whereas the memory of a DMM is divided into modules, one for each processor, and concurrent accesses to the same module create a conflict. The delay of a simulation is the time needed to simulate a parallel memory access of the PRAM. Any general simulation of an m-processor PRAM on an n-processor DMM will necessarily have delay at least m/n. A randomized simulation is called time-processor optimal if the delay is O(m/n) with high probability. Using a novel simulation scheme based on hashing, we obtain a time-processor optimal simulation with delay O(log log(n) log*(n)). The best previous simulations use a simpler scheme based on hashing and have much larger delay: Θ(log(n)/log log(n)) for the simulation of an n-processor PRAM on an n-processor DMM, and Θ(log(n)) in the case where the simulation is time-processor optimal. Our simulations use several (two or three) hash functions to distribute the shared memory among the memory modules of the DMM. The stochastic processes modeling the behavior of our algorithms, and their analyses based on powerful classes of universal hash functions, may be of independent interest. Research partially supported by NSF/DARPA Grant CCR-9005448. Work was done while at the University of California at Berkeley and the International Computer Science Institute, Berkeley, CA. Research partially supported by National Science Foundation Operating Grant CCR-9016468, National Science Foundation Operating Grant CCR-9304722, United States-Israel Binational Science Foundation Grant No. 89-00312, United States-Israel Binational Science Foundation Grant No. 92-00226, and ESPRIT BR Grant EC-US 030. Part of this work was done during a visit at the International Computer Science Institute at Berkeley; supported in part by DFG-Forschergruppe Effiziente Nutzung massiv paralleler Systeme, Teilprojekt 4, and by the Esprit Basic Research Action Nr. 7141 (ALCOM II).
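As a sketch of the hashing idea (not the paper's exact algorithm), the fragment below maps a shared-memory address to one of two candidate DMM modules chosen by independent universal hash functions, preferring the module that is currently less loaded; all names and parameters are illustrative only.

```cpp
#include <cstdint>
#include <vector>

// Illustrative universal hash h(x) = ((a*x + b) mod p) mod n.
// Assumes p < 2^32 and keys < 2^32 so a*x fits in 64 bits.
struct UniversalHash {
    uint64_t a, b, p, n;                          // p prime, 1 <= a < p, 0 <= b < p
    uint64_t operator()(uint64_t x) const { return ((a * x + b) % p) % n; }
};

// Each address has two candidate modules; the request goes to the one with
// the shorter request queue -- the "two choices" idea behind the small delay.
uint64_t route(uint64_t addr, const UniversalHash& h1, const UniversalHash& h2,
               const std::vector<std::size_t>& queueLen) {
    uint64_t m1 = h1(addr), m2 = h2(addr);
    return queueLen[m1] <= queueLen[m2] ? m1 : m2;
}
```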

2.
That the influence of the PRAM model is ubiquitous in parallel algorithm design is as clear as the fact that it is technologically infeasible for the foreseeable future. The current generation of parallel hardware prominently features distributed memory and high‐performance interconnection networks—very much the antithesis of the shared memory required for the PRAM model. It has been shown that, in spite of communication costs, for some problems very fast parallel algorithms are available for distributed‐memory machines—from embarrassingly parallel problems to sorting and numerical analysis. In contrast, it is known that for other classes of problems PRAM‐style shared‐memory simulation on a distributed‐memory machine can, in theory, produce solutions of comparable performance to the best possible for such architectures. The Bulk Synchronous Parallel (BSP) model accurately represents most parallel machines—theoretical and actual—in an execution and cost model. We introduce a scalable portable PRAM realization appropriate for BSP computers and a methodology for usage. Our system is fast and built upon the familiar sequential C++ coupled with the new standard BSP library of parallel computation and communication primitives. It is portable to and predictable on a vast number of parallel computers including workstation clusters, a 256‐processor Cray T3D, an 8‐node IBM SP/2 and a 4‐node shared‐memory SGI Power Challenge machine. Our approach achieves simplicity of programming over direct‐mode BSP programming for reasonable overhead cost. We objectively compare optimized BSP and PRAM algorithms implemented with our C++ PRAM library and provide encouraging experimental results for our new style of programming. Copyright © 2000 John Wiley & Sons, Ltd.
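To make the superstep discipline concrete, here is a minimal hedged sketch (our own API, not the paper's C++ PRAM library) of a shared array in which writes issued during a superstep become visible only at the next synchronization, as any BSP-based PRAM realization must guarantee.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hedged sketch: writes are buffered and applied only at sync(), so all
// "processors" see a consistent snapshot within a superstep. The buffered
// writes correspond to the h-relation whose size drives the BSP cost g*h + l.
class BspSharedArray {
    std::vector<int> cells;
    std::vector<std::pair<std::size_t, int>> pending;   // buffered writes
public:
    explicit BspSharedArray(std::size_t n) : cells(n, 0) {}
    int  read(std::size_t i) const { return cells[i]; }          // snapshot read
    void write(std::size_t i, int v) { pending.emplace_back(i, v); }
    void sync() {                              // superstep boundary
        for (auto& w : pending)
            cells[w.first] = w.second;         // arbitrary-CRCW: last write wins
        pending.clear();
    }
};
```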

3.
The parallel language FORK [1], based on a scalable shared memory model, is a PASCAL-like language with some additional parallel constructs. A PRAM (Parallel Random Access Machine) algorithm can be expressed at a high level of abstraction as a FORK program, which is translated into efficient PRAM code guaranteeing the theoretically predicted runtimes.

In this paper, we concentrate on those features of the language FORK related to parallelism, such as the group concept, shared memory access, and synchronous or asynchronous execution. We present a trace-based denotational interleaving semantics in which processes describe synchronous computations. Processes are created and deleted dynamically and run asynchronously. The interleaving rules reflect the underlying CRCW (concurrent-read, concurrent-write) PRAM model.


4.
The SB-PRAM is a shared-memory parallel computer that has been designed according to the PRAM model from theoretical computer science. The SB-PRAM realizes a concurrent-read, concurrent-write PRAM where each processor can access the global memory in unit time. This article describes the programming environment of the SB-PRAM that enables a programmer to develop efficient and portable programs without dealing with architectural details of the machine. In particular, we discuss compiler and operating system issues and show that the runtime functions of the P4 environment and several parallel data structures can be implemented very efficiently by using special features of the SB-PRAM. In contrast to other parallel machines, the synchronization of processors and the management of concurrent accesses to the global memory only require a few machine instructions independent of the number of processors participating in the operation. This efficient implementation of the runtime system is the basis for good performance of many challenging applications.
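The flavor of this constant-instruction synchronization can be sketched with a counter barrier: each processor executes a constant number of instructions regardless of how many participate. On the SB-PRAM the fetch-and-add (multiprefix) completes in unit time; the std::atomic version below only approximates that, and is our own sketch, not SB-PRAM runtime code.

```cpp
#include <atomic>
#include <thread>

// Hedged sketch: counter-based barrier built on fetch-and-add.
class CounterBarrier {
    std::atomic<int> count{0}, phase{0};
    const int n;
public:
    explicit CounterBarrier(int nprocs) : n(nprocs) {}
    void wait() {
        int myPhase = phase.load();
        if (count.fetch_add(1) == n - 1) {       // last arrival resets, releases
            count.store(0);
            phase.fetch_add(1);
        } else {
            while (phase.load() == myPhase)      // spin until phase advances
                std::this_thread::yield();
        }
    }
};
```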

5.
In light of the recent shift towards shared-memory systems in parallel explicit model checking, we explore the relative advantages and disadvantages of shared versus private hash tables. Since the use of shared state storage allows for techniques unavailable in distributed memory, these are evaluated, both theoretically and practically, in a prototype implementation. Experimental data is presented to assess the practical utility of those techniques, compared with the static partitioning of the state space that is more traditional in distributed-memory algorithms.
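For reference, the static-partitioning baseline mentioned above amounts to assigning every state a unique owning worker by hash; the sketch below (our own, simplified) shows the ownership rule and the trade-off it embodies.

```cpp
#include <cstdint>

// Hedged sketch: under static partitioning, each worker keeps a private,
// lock-free-by-construction table for the states it owns, but every
// successor owned by another worker costs a handoff to that worker.
int ownerOf(uint64_t stateHash, int nWorkers) {
    return static_cast<int>(stateHash % static_cast<uint64_t>(nWorkers));
}
// With a shared table, any worker may insert any state directly (e.g., via a
// compare-and-swap per bucket), trading handoffs for table synchronization.
```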

6.
An important mid-level task for computer vision is addressed. The problem consists of labeling connected components in N^{1/2} × N^{1/2} binary images. This task can be solved with parallel computers by using a simple and novel algorithm. The parallel computing model used is a synchronous fine-grained shared-memory model where only one processor can read from or write to the same memory location at a given time. This model is known as the exclusive-read exclusive-write parallel RAM (EREW PRAM). Using this model, the algorithm presented has O(log N) complexity. The algorithm can run on parallel machines other than the EREW PRAM. In particular, it offers an optimal image component labeling algorithm for mesh-connected computers.
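The paper's O(log N) algorithm is more refined, but the underlying min-label propagation idea can be sketched as follows, with each outer iteration playing the role of one synchronous PRAM step (a hedged illustration, not the paper's algorithm).

```cpp
#include <algorithm>
#include <vector>

// Hedged sketch: every foreground pixel (img == 1) repeatedly adopts the
// smallest label among itself and its 4-neighbors until no label changes.
std::vector<int> labelComponents(const std::vector<std::vector<int>>& img) {
    int h = (int)img.size(), w = (int)img[0].size();
    std::vector<int> label(h * w);
    for (int i = 0; i < h * w; ++i) label[i] = i;      // unique initial labels
    bool changed = true;
    while (changed) {                                  // one "PRAM round"
        changed = false;
        std::vector<int> next = label;
        for (int r = 0; r < h; ++r)
            for (int c = 0; c < w; ++c) {
                if (!img[r][c]) continue;
                int best = label[r * w + c];
                const int dr[] = {-1, 1, 0, 0}, dc[] = {0, 0, -1, 1};
                for (int k = 0; k < 4; ++k) {
                    int nr = r + dr[k], nc = c + dc[k];
                    if (nr >= 0 && nr < h && nc >= 0 && nc < w && img[nr][nc])
                        best = std::min(best, label[nr * w + nc]);
                }
                if (best != next[r * w + c]) { next[r * w + c] = best; changed = true; }
            }
        label = next;
    }
    return label;
}
```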

7.
We present the design and implementation of a parallel algorithm for computing Gröbner bases on distributed memory multiprocessors. The parallel algorithm is irregular both in space and time: the data structures are dynamic pointer-based structures and the computations on the structures have unpredictable duration. The algorithm is presented as a series of refinements on a transition rule program, in which computation proceeds by nondeterministic invocations of guarded commands. Two key data structures, a set and a priority queue, are distributed across processors in the parallel algorithm. The data structures are designed for high throughput and latency tolerance, as appropriate for distributed memory machines. The programming style represents a compromise between shared-memory and message-passing models. The distributed nature of the data structures shows through their interface in that the semantics are weaker than with shared atomic objects, but they still provide a shared abstraction that can be used for reasoning about program correctness. In the data structure design there is a classic trade-off between locality and load balance. We argue that this is best solved by designing scheduling structures in tandem with the state data structures, since the decision to replicate or partition state affects the overhead of dynamically moving tasks. This work was supported in part by the Advanced Research Projects Agency of the Department of Defense monitored by the Office of Naval Research under contract DABT63-92-C-0026, by AT&T, and by the National Science Foundation through an Infrastructure Grant (number CDA-8722788) and a Research Initiation Award (number CCR-9210260). The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

8.
An algorithm for selecting the concatenated hash code for partial-match or multiple-attribute retrieval in a hashing scheme is presented. The optimal code length for each attribute is determined with respect to a merit function. Two adjustment algorithms are then presented to find the optimal code length under the integer and nonnegative lower bound constraints. Finally, an algorithm is given for incremental expansion of the concatenated hash code in an extendible hashing scheme. This research was supported by the Defense Advanced Research Projects Agency under Contract MDA903-78-C-0293, and the National Science Foundation under Grant NCS78-05978.
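A hedged sketch of the basic mechanism (the paper's contribution is choosing the per-attribute code lengths optimally, which is not reproduced here): each attribute's hash contributes a fixed number of bits to a concatenated bucket code, so a partial-match query that fixes only some attributes leaves the remaining bit fields as wildcards.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hedged sketch: attribute i contributes bits[i] low-order hash bits.
// Assumes each bits[i] < 64 and the total does not exceed 64.
uint64_t concatenatedCode(const std::vector<std::string>& attrs,
                          const std::vector<int>& bits) {
    uint64_t code = 0;
    for (std::size_t i = 0; i < attrs.size(); ++i) {
        uint64_t h = std::hash<std::string>{}(attrs[i]);
        code = (code << bits[i]) | (h & ((1ULL << bits[i]) - 1));
    }
    return code;
}
```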

9.
In this paper we give parallel algorithms for a number of problems defined on point sets and polygons. All our algorithms have optimal T(n) · P(n) products, where T(n) is the time complexity and P(n) is the number of processors used, and are for the EREW PRAM or CREW PRAM models. Our algorithms provide parallel analogues to well-known phenomena from sequential computational geometry, such as the fact that problems for polygons can oftentimes be solved more efficiently than point-set problems, and that nearest-neighbor problems can be solved without explicitly constructing a Voronoi diagram. The research of R. Cole was supported in part by NSF Grants CCR-8702271, CCR-8902221, and CCR-8906949, by ONR Grant N00014-85-K-0046, and by a John Simon Guggenheim Memorial Foundation fellowship. M. T. Goodrich's research was supported by the National Science Foundation under Grant CCR-8810568 and by the National Science Foundation and DARPA under Grant CCR-8908092.

10.
Linda meets Unix
Leler, W. Computer, 1990, 23(2): 43–54.
The limitations of the shared-memory and distributed-memory models for explicit parallel programming are discussed, and a new model, the Linda parallel communication paradigm, which was designed specifically for parallel programming, is examined. Processes communicate in Linda by way of a shared data space called tuple space, which acts something like an associative memory, since tuples are identified by matching on a key rather than by using a specific address. This model is adapted for use as the basis of a new class of operating systems, and a specific instance, QIX, is presented. Like Linda, this operating system model can support both the shared-memory and the distributed-memory styles of programming. Thus, it provides the benefits of both, while avoiding hardware dependencies. QIX also incorporates a novel scheme for name resolution that is easier to use than other methods and provides significant benefits in the operating system, and it directly supports communication between programs written in different languages.
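A minimal sketch of the tuple-space idea follows; it uses single string keys instead of Linda's full typed-pattern matching, and a non-blocking in() for brevity (in real Linda, in() blocks until a matching tuple appears).

```cpp
#include <list>
#include <optional>
#include <string>
#include <utility>

// Hedged sketch: out() deposits a tuple; in() removes the first tuple whose
// key matches, regardless of who wrote it or where it lives -- no addresses.
class TupleSpace {
    std::list<std::pair<std::string, std::string>> tuples;
public:
    void out(std::string key, std::string value) {
        tuples.emplace_back(std::move(key), std::move(value));
    }
    std::optional<std::string> in(const std::string& key) {   // destructive read
        for (auto it = tuples.begin(); it != tuples.end(); ++it)
            if (it->first == key) {
                std::string v = std::move(it->second);
                tuples.erase(it);
                return v;
            }
        return std::nullopt;
    }
};
```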

11.
New parallel objective function determination methods for the job shop scheduling problem are proposed in this paper, considering the makespan and the sum of job execution times criteria; the proposed methods can, however, also be applied to other popular objective functions such as job tardiness or flow time. The Parallel Random Access Machine (PRAM) model is applied for the theoretical analysis of algorithm efficiency. The methods require fine-grained parallelization; the proposed approach is therefore especially suited to parallel computing systems with fast shared memory (e.g., GPGPU, General-Purpose computing on Graphics Processing Units).
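For concreteness, once per-job completion times are known for a given schedule, the two criteria named above are plain reductions; the sketch below (ours, not the paper's parallel determination method) shows only the objective evaluation, which is exactly the kind of fine-grained, data-parallel step that maps well onto a GPU.

```cpp
#include <algorithm>
#include <vector>

struct Objectives { long long makespan, totalCompletion; };

// Hedged sketch: makespan is a max-reduction, total completion a sum-reduction.
Objectives evaluate(const std::vector<long long>& jobCompletion) {
    Objectives o{0, 0};
    for (long long c : jobCompletion) {
        o.makespan = std::max(o.makespan, c);
        o.totalCompletion += c;
    }
    return o;
}
```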

12.
We present the first optimal parallel algorithms for the verification and sensitivity analysis of minimum spanning trees. Our algorithms are deterministic and run in O(log n) time and require linear work in the CREW PRAM model. These algorithms are used as a subroutine in the linear-work randomized algorithm for finding minimum spanning trees of Cole, Klein, and Tarjan. Research partially supported by a National Science Foundation Graduate Fellowship and by DIMACS (Center for Discrete Mathematics and Theoretical Computer Science), a National Science Foundation Science and Technology Center, Grant No. NSF-STC88-09648. Research at Princeton University was partially supported by the National Science Foundation, Grant No. CCR-8920505, the Office of Naval Research, Contract No. N00014-91-J-1463, and by DIMACS (Center for Discrete Mathematics and Theoretical Computer Science), a National Science Foundation Science and Technology Center, Grant No. NSF-STC88-09648.

13.
The block distributed memory model
We introduce a computation model for developing and analyzing parallel algorithms on distributed memory machines. The model allows the design of algorithms using a single address space and does not assume any particular interconnection topology. We capture performance by incorporating a cost measure for interprocessor communication induced by remote memory accesses. The cost measure includes parameters reflecting memory latency, communication bandwidth, and spatial locality. Our model allows the initial placement of the input data and pipelined prefetching. We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, FFT, and matrix multiplication. We show that most of these algorithms achieve optimal or near-optimal communication complexity while simultaneously guaranteeing an optimal speed-up in computational complexity. Ongoing experimental work in testing and evaluating these algorithms has thus far shown very promising results.
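A hedged one-liner captures the shape of such a cost measure: a remote access is charged a latency term once plus a per-word bandwidth term, which is what rewards blocked transfers and pipelined prefetching. Parameter names are ours, not the model's.

```cpp
// Hedged sketch: cost of one remote access of `words` consecutive words.
double remoteAccessCost(double latency, double perWordCost, int words) {
    return latency + perWordCost * words;   // locality amortizes the latency
}
```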

14.
Research advances in parallel graph algorithms
In this survey we review recent developments and the main results in parallel graph algorithms and point out some possible directions for future research. Topics covered include: graph search techniques on shared-memory models, connected-component and minimum spanning tree algorithms, incremental parallel graph algorithms, shortest-path algorithms, maximal independent set algorithms, maximal and maximum matching algorithms, graph coloring algorithms, Euler tour and Hamiltonian cycle algorithms, graph isomorphism algorithms, graph k-connectivity algorithms, and maximum-flow/minimum-cut algorithms.

15.
In sequential systems, hash tables yield almost constant time performance for single element accesses. However, in massively parallel systems, we need to consider a large number of parallel accesses. Consequently, the potential queueing delay as well as the communication overhead can alter the relative performance of hash algorithms. Thus, it is necessary to reevaluate the performance of conventional hash algorithms and investigate new algorithms that exploit the parallelism without suffering from excessive communication overheads or queueing delays. In this paper, we first study the performance of data parallel hash algorithms with conventional collision resolution strategies. For SIMD/SPMD hypercube systems, neither linear probing nor double hashing yields satisfactory performance. Thus, we develop a new collision resolution strategy, namely, hypercube hashing. Hypercube hashing combines the randomness provided in double hashing with the low communication cost inherited from linear probing to yield better performance. We also investigate efficient implementation of the chaining algorithm in data parallel systems and its performance. From the simulation results, hypercube hashing significantly outperforms the other open addressing strategies in all cases (under the assumption of a random input key space). For high load factors, chaining performs better than hypercube hashing. However, with a low load factor, hypercube hashing significantly outperforms the chaining algorithm.
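The two conventional probe sequences under comparison can be sketched as follows; hypercube hashing itself is not reproduced here, only the linear-probing rule (good locality, bad clustering) and double-hashing rule (good dispersal, poor locality on a hypercube network) that it mediates between.

```cpp
#include <cstdint>

// Hedged sketch of the classical open-addressing probe sequences.
uint64_t linearProbe(uint64_t h, uint64_t attempt, uint64_t tableSize) {
    return (h + attempt) % tableSize;            // consecutive slots
}
uint64_t doubleHashProbe(uint64_t h1, uint64_t h2, uint64_t attempt,
                         uint64_t tableSize) {
    // h2|1 keeps the stride odd so every slot is reached when
    // tableSize is a power of two.
    return (h1 + attempt * (h2 | 1)) % tableSize;
}
```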

16.
The traditional implementation of sequential consistency in shared-memory systems requires memory accesses to be globally performed in program order. Based on an event ordering model for correct executions in shared-memory systems, this paper proposes and proves that out-of-order execution does not influence the correctness of an execution provided a certain condition is met. Simulation results show that the out-of-order execution proposed in this paper is an effective way to improve the performance of a sequentially consistent shared-memory system.
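The classic store-buffering litmus test makes the constraint concrete: under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible, and it is exactly the reordering of a store past a later load that could produce it. The C++ version below (our illustration, not the paper's model) uses sequentially consistent atomics, which are the default.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

int main() {
    std::thread t0([] { x.store(1); r1 = y.load(); });   // seq_cst by default
    std::thread t1([] { y.store(1); r2 = x.load(); });
    t0.join(); t1.join();
    // Under sequential consistency, r1 == 0 && r2 == 0 cannot happen.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```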

17.
Fractional cascading is a technique designed to allow efficient sequential search in a graph with catalogs of total size n. The search consists of locating a key in the catalogs along a path. In this paper we show how to preprocess a variety of fractional cascaded data structures whose underlying graph is a tree so that searching can be done efficiently in parallel. The preprocessing takes O(log n) time with n/log n processors on an EREW PRAM. For a balanced binary tree, cooperative search along root-to-leaf paths can be done in O((log n)/log p) time using p processors on a CREW PRAM. Both of these time/processor constraints are optimal. The searching in the fractional cascaded data structure can be either explicit, in which the search path is specified before the search starts, or implicit, in which the branching is determined at each node. We apply this technique to a variety of geometric problems, including point location, range search, and segment intersection search. An earlier version of this work appears in Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, July 1990, pp. 307–316. The first author's support was provided in part by National Science Foundation Grant CCR-9007851, by the U.S. Army Research Office under Grants DAAL03-91-G-0035 and DAAH04-93-0134, and by the Office of Naval Research and the Advanced Research Projects Agency under Contract N00014-91-J-4052, ARPA Order 8225. This research was performed while the second author was at Brown University. Support was provided in part by an NSF Presidential Young Investigator Award CCR-9047466, with matching funds from IBM, by National Science Foundation Grant CCR-9007851, by the U.S. Army Research Office under Grant DAAL03-91-G-0035, and by the Office of Naval Research and the Advanced Research Projects Agency under Contract N00014-91-J-4052, ARPA Order 8225.

18.
In Xen virtualization, an inter-domain communication mechanism based on shared memory already exists. To simplify the development of Xen virtualization drivers and improve their stability, an inter-domain communication mechanism based on message queues is proposed. A block of memory is allocated to hold the message queue and is shared among all domains through the grant table. The message queue is created in this shared memory and its start address is stored in Xenstore, so that every domain can use the queue to send and receive messages. In the results, CPU throughput decreases by 38.3%, the transfer rate decreases by 38.2%, and the response time increases by 64.3%. The model effectively reduces the complexity of writing split device drivers and significantly improves driver reliability.
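The message-queue layer can be sketched as a single-producer/single-consumer ring of the kind that would live in a granted shared page; the Xen-specific parts (grant-table mapping, Xenstore lookup of the ring address) are omitted, and the code is an illustration rather than the paper's implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hedged sketch: lock-free SPSC ring buffer suitable for one shared page.
struct MsgRing {
    static constexpr std::size_t N = 64;          // slots, power of two
    std::atomic<uint32_t> head{0}, tail{0};       // producer/consumer cursors
    uint64_t slot[N];

    bool send(uint64_t msg) {                     // producer domain
        uint32_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N) return false; // full
        slot[h % N] = msg;
        head.store(h + 1, std::memory_order_release);
        return true;
    }
    bool recv(uint64_t& msg) {                    // consumer domain
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;    // empty
        msg = slot[t % N];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};
```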

19.
The Global Arrays (GA) toolkit provides a shared-memory programming model in which data locality is explicitly managed by the programmer. It inter-operates with MPI and supports a variety of language bindings. The Disk Resident Arrays (DRA) model extends the GA programming model to secondary storage. GA and DRA together provide a convenient programming model that encourages locality-aware programming by the user, while presenting a high-level abstraction. High performance depends on the appropriate distribution of the data in the disk-resident arrays. In this paper, we discuss the addition of layout transformation support to DRA. The implementation of an efficient parallel layout transformation algorithm is done on top of existing GA/DRA functions; thus GA/DRA is itself used in implementing the enhanced DRA functionality. Experimental performance data is provided that demonstrates the effectiveness of the new layout transformation functionality. This work was supported in part through funding from the U.S. Department of Energy and the National Science Foundation (award 0121676).
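As a hedged illustration of what a layout transformation must reconcile, the index arithmetic below shows how the same global element gets different (owner, offset) coordinates under blocked versus cyclic distributions, so changing layout means moving data between those owners; DRA's actual algorithm operates on disk-resident blocks rather than single elements.

```cpp
#include <utility>

// Hedged sketch: (owner, local offset) of global index i among p parts.
std::pair<int, long long> blockedOwner(long long i, long long n, int p) {
    long long b = (n + p - 1) / p;                     // block size
    return { static_cast<int>(i / b), i % b };
}
std::pair<int, long long> cyclicOwner(long long i, int p) {
    return { static_cast<int>(i % p), i / p };
}
```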

20.
Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming systems in the same application. We introduce an MPI-integrated shared-memory programming model that is incorporated into MPI through a small extension to the one-sided communication interface. We discuss the integration of this interface with the MPI 3.0 one-sided semantics and describe solutions for providing portable and efficient data sharing, atomic operations, and memory consistency. We describe an implementation of the new interface in the MPICH2 and Open MPI implementations and demonstrate an average performance improvement of 40% in the communication component of a five-point stencil solver.
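The interface described became the MPI 3.0 shared-memory window extension; a minimal usage sketch (error handling omitted) shows ranks on one node allocating a shared window and then accessing each other's segments with plain loads and stores instead of MPI_Put/MPI_Get.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Ranks that share a node form their own communicator...
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int rank; MPI_Comm_rank(node, &rank);

    // ...and jointly allocate a window of node-shared memory.
    double* mine; MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node, &mine, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);      // passive-target epoch
    *mine = rank;                                 // plain store, no MPI_Put
    MPI_Win_sync(win);                            // flush the store
    MPI_Barrier(node);                            // order stores before loads

    if (rank > 0) {                               // read the left neighbor
        double* left; MPI_Aint sz; int disp;
        MPI_Win_shared_query(win, rank - 1, &sz, &disp, &left);
        std::printf("rank %d sees %.0f\n", rank, *left);
    }
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```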
