Similar Literature
20 similar documents found
1.
Current application and technology trends are causing researchers and developers to revisit shared memory multiprocessing. The authors look at what is needed to maintain growth, particularly for commercial applications. The first step is to carefully examine the current use of shared memory multiprocessing and arrive at intelligent projections of future use based on application and technology trends. The second step is to begin filling gaps in programming models and architectures for shared memory multiprocessing. Finally, and possibly concurrently with the second step, researchers must look at ways to make the development of parallel software more feasible, including the development of compilers and tools. We look at the issues in these three steps. We also show how architectural issues are tightly interwoven with application trends.

2.
Two paradigms for distributed shared memory on loosely-coupled computing systems are compared: the shared data-object model as used in Orca, a programming language specially designed for loosely-coupled computing systems, and the shared virtual memory model. For both paradigms two systems are described, one using only point-to-point messages, the other using broadcasting as well. The two paradigms and their implementations are described briefly. Their performances are compared on four applications: the travelling-salesman problem, alpha-beta search, matrix multiplication and the all-pairs shortest-paths problem. Measurements were obtained on a system consisting of 10 MC68020 processors connected by an Ethernet. For comparison purposes, the applications have also been run on a system with physical shared memory. In addition, the paper gives measurements for the first two applications above when remote procedure call is used as the communication mechanism. The measurements show that both paradigms can be used efficiently for programming large-grain parallel applications, with significant speed-ups. The structured shared data-object model achieves the highest speed-ups and is easiest to program and to debug.

3.
We describe a single-copy mechanism which enables efficient message passing among UNIX processes on shared memory multiprocessors. A special version of PVMe, IBM's AIX implementation of the PVM message passing programming model, has been built based on this approach. The preliminary results reported here show the clear advantage of the single-copy approach over more conventional schemes.
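
The single-copy idea can be illustrated outside of PVMe as well. The sketch below is not the PVMe implementation; it is a minimal illustration, using a POSIX shared mapping and C11 atomics, of a sender writing its payload exactly once into a buffer that the receiving process reads in place (the struct layout, the spin-wait, and the omitted error handling are simplifications made for brevity).

    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct msg { atomic_int ready; char payload[64]; };

    int main(void) {
        /* one shared mapping, visible to the child created by fork() */
        struct msg *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        atomic_init(&m->ready, 0);
        if (fork() == 0) {                         /* receiver */
            while (!atomic_load(&m->ready))        /* spin until the message lands */
                ;
            printf("received: %s\n", m->payload);  /* read in place: no second copy */
            _exit(0);
        }
        strcpy(m->payload, "hello");               /* the single copy */
        atomic_store(&m->ready, 1);
        wait(NULL);
        return 0;
    }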

4.
Summary: We study the relation between knowledge and space. That is, we analyze how much shared memory space is needed in order to learn certain kinds of facts. Such results are useful tools for reasoning about shared memory systems. In addition we generalize a known impossibility result, and show that results about how knowledge can be gained and lost in message passing systems also hold for shared memory systems. Michael Merritt received a B.S. degree in Philosophy and in Computer Science from Yale College in 1978, the M.S. and Ph.D. degrees in Information and Computer Science in 1980 and 1983, respectively, from the Georgia Institute of Technology. Since 1983 he has been a member of technical staff at AT&T Bell Laboratories, and has taught as an adjunct or visiting lecturer at Stevens Institute of Technology, Massachusetts Institute of Technology, and Columbia University. In 1989 he was program chair for the ACM Symposium on Principles of Distributed Computing. His research interests include distributed and concurrent computation, both algorithms and formal methods for verifying their correctness, cryptography, and security. He is an editor for Distributed Computing and for Information and Computation, recently co-authored a book on database concurrency control algorithms, and is a member of the ACM and of Computer Professionals for Social Responsibility. Gadi Taubenfeld received the B.A., M.Sc. and Ph.D. degrees in Computer Science from the Technion (Israel Institute of Technology), in 1982, 1984 and 1988, respectively. From 1988 to 1990 he was a research scientist at Yale University. Since 1991 he has been a member of technical staff at AT&T Bell Laboratories. His primary research interests are in concurrent and distributed computing. A preliminary version of this work appeared in the Proceedings of the Tenth Annual ACM Symposium on Principles of Distributed Computing, pages 189–200, Montreal, Canada, August 1991.

5.
While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper, we present the Heterogeneous Programming Library (HPL), which permits the development of heterogeneous applications addressing both portability and programmability while not sacrificing high performance. This is achieved by means of an embedded language and data types provided by the library, with which generic computations to be run on heterogeneous devices can be expressed. A comparison with OpenCL in terms of programmability and performance shows that both approaches offer very similar performance, while highlighting the programmability advantages of HPL.

6.
We study the power of reliable anonymous distributed systems, where processes do not fail, do not have identifiers, and run identical programs. We are interested specifically in the relative powers of systems with different communication mechanisms: anonymous broadcast, read-write registers, or read-write registers plus additional shared-memory objects. We show that a system with anonymous broadcast can simulate a system of shared-memory objects if and only if the objects satisfy a property we call idemdicence; this result holds regardless of whether either system is synchronous or asynchronous. Conversely, the key to simulating anonymous broadcast in anonymous shared memory is the ability to count: broadcast can be simulated by an asynchronous shared-memory system that uses only counters, but read-write registers by themselves are not enough. We further examine the relative power of different types and sizes of bounded counters and conclude with a non-robustness result. James Aspnes is a Professor of Computer Science at Yale University. He received his Ph.D. from Carnegie-Mellon University in 1992. Faith Ellen Fich is a Professor of Computer Science at the University of Toronto. She received her Ph.D. from the University of California, Berkeley in 1982. Eric Ruppert was educated at the University of Toronto, where he completed his doctorate in 1999. He spent a year as a postdoctoral fellow at Brown University and is an Associate Professor at York University.

7.
We provide efficient constructions and tight bounds for shared memory systems accessed by n processes, up to t of which may exhibit Byzantine failures, in a model previously explored by Malkhi et al. [21]. We show that sticky bits are universal in the Byzantine failure model for n ≥ 3t + 1, an improvement over the previous result requiring n ≥ (2t + 1)(t + 1). Our result follows from a new strong consensus construction that uses sticky bits and tolerates t Byzantine failures among n processes for any n ≥ 3t + 1, the best possible bound on n for strong consensus. We also present tight bounds on the efficiency of implementations of strong consensus objects from sticky bits and similar primitive objects. Research supported in part by a grant from the Israel Science Foundation, and by the Hermann Minkowski Minerva Center for Geometry at Tel Aviv University. This work was partially completed while at AT&T Labs and while visiting the Institute for Advanced Study, Princeton, NJ. Research supported in part by US-Israel Binational Science Foundation Grant 2002246. This work was partially completed while visiting AT&T Labs. This work was partially completed while at AT&T Labs. Research supported in part by the National Science Foundation under Grant No. CCR-0331584. A preliminary version of the results presented in this paper appeared in [23].

8.
We consider the interaction of weakly atomic Software Transactional Memory (STM) providing single global lock atomicity with the x86 memory consistency model. We show that a practical design for such an STM requires that some program behaviour be disallowed, due to the strictness of the x86 memory consistency model in comparison to the language level memory models hitherto considered in weakly atomic STM designs. We present the design and construction of such an STM that disallows races between a transactional read and a non-transactional write. We also report on a practical application of this STM to elide legacy locks in x86 binaries. This allows software transactional memory to be applied without requiring software to be a priori written with awareness of transactional memory and without any restriction on source language or compiler. As an example, we show how a mainstream multiplayer game can use transactional memory with zero changes and 11% overhead over language level transactional memory, which requires over 700 annotations and severely restricts software development.
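
For readers unfamiliar with single global lock atomicity, its reference semantics can be pictured with an illustrative C/pthreads sketch (the names below are invented for the example): a weakly atomic STM must make transactions behave as if they were all guarded by one global lock, while the plain write in the second thread is exactly the kind of transactional-read versus non-transactional-write race that the design described above disallows.

    #include <pthread.h>
    #include <stdio.h>

    /* Single-global-lock (SGL) reference semantics: every transaction behaves
     * as if it were protected by this one lock. */
    static pthread_mutex_t sgl = PTHREAD_MUTEX_INITIALIZER;
    static int shared_counter = 0;

    static void *transactional_reader(void *arg) {
        pthread_mutex_lock(&sgl);            /* "begin transaction" */
        int a = shared_counter;
        int b = shared_counter;              /* if every access were transactional,
                                                a == b would be guaranteed */
        pthread_mutex_unlock(&sgl);          /* "commit" */
        printf("reads inside the transaction: %d %d\n", a, b);
        return NULL;
    }

    static void *nontransactional_writer(void *arg) {
        shared_counter = 42;   /* plain write outside any transaction; racing it
                                  against the transactional reads above is the
                                  behaviour the STM described here rules out */
        return NULL;
    }

    int main(void) {
        pthread_t r, w;
        pthread_create(&r, NULL, transactional_reader, NULL);
        pthread_create(&w, NULL, nontransactional_writer, NULL);
        pthread_join(r, NULL);
        pthread_join(w, NULL);
        return 0;
    }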

9.
The current trend in development of parallel programming models is to combine different well established models into a single programming model in order to support efficient implementation of a wide range of real world applications. The dataflow model has particularly managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard. This article presents DaSH – the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.
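
The OpenMP 4.0 task-dependency feature mentioned above lets a shared memory program express dataflow-style ordering directly. A minimal C illustration (not taken from DaSH) follows; the depend clauses build a producer/consumer chain that the runtime schedules like a small dataflow graph.

    #include <stdio.h>

    /* build with an OpenMP-enabled compiler, e.g. cc -fopenmp */
    int main(void) {
        int x = 0, y = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: x)                 /* producer of x */
            x = 1;
            #pragma omp task depend(in: x) depend(out: y)   /* waits for x, produces y */
            y = x + 1;
            #pragma omp task depend(in: y)                  /* runs after the chain */
            printf("y = %d\n", y);
            #pragma omp taskwait
        }
        return 0;
    }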

10.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to the destination's cache before it is needed, eliminating cache misses in the destination's cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.

11.
In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with a low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed a reduction of up to 13.9% in execution time, 30.5% in cache misses, and 39.4% in the number of invalidation messages.
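
The mapping itself happens inside the operating system in this proposal, driven by coherence-protocol statistics. Purely as a user-level illustration of what a computed mapping amounts to, the hedged sketch below pins two communicating threads to specific cores with the Linux pthread_setaffinity_np call; the assumption that cores 0 and 1 share a cache level is invented for the example.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to one core (Linux-specific). */
    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *worker(void *arg) {
        pin_to_core(*(int *)arg);   /* the paper migrates threads dynamically; here the
                                       placement is fixed by hand for illustration */
        /* ... work that communicates through shared data ... */
        return NULL;
    }

    int main(void) {
        static int cores[2] = { 0, 1 };   /* assumed to share a cache level */
        pthread_t t[2];
        for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &cores[i]);
        for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
        return 0;
    }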

12.
In this paper, we compare the relative performance of two iterative schemes, based on projection techniques, on a shared memory multiprocessor, the VAX 6240. We consider the CG-accelerated Block-SSOR method and the CG-accelerated Symmetric-Kaczmarz method for the solution of large non-symmetric systems of linear equations. We show that the regular structure of many matrices can be exploited by the CG-accelerated Block-SSOR method to provide good speedup in a multiprocessing environment. However, the CG-accelerated Symmetric-Kaczmarz method, while being a viable alternative on a scalar machine, is unable to benefit from multiprocessing.

13.
THSVM: A Distributed Shared Memory Simulation System
A DSM system, also known as an SVM system, implements on a multicomputer with distributed memory a memory system that is physically distributed but logically shared, combining the ease of programming of shared memory with the good scalability of distributed memory. This paper describes THSVM, a distributed shared memory simulation system implemented on the parallel graph-reduction intelligent workstation developed in the Department of Computer Science at Tsinghua University, presents a performance analysis of the system, and concludes with suggestions for future work.

14.
Summary: The abstraction of a shared memory is of growing importance in distributed computing systems. Traditional memory consistency ensures that all processes agree on a common order of all operations on memory. Unfortunately, providing these guarantees entails access latencies that prevent scaling to large systems. This paper weakens such guarantees by defining causal memory, an abstraction that ensures that processes in a system agree on the relative ordering of operations that are causally related. Because causal memory is weakly consistent, it admits more executions, and hence more concurrency, than either atomic or sequentially consistent memories. This paper provides a formal definition of causal memory and gives an implementation for message-passing systems. In addition, it describes a practical class of programs that, if developed for a strongly consistent memory, run correctly with causal memory. Mustaque Ahamad is an Associate Professor in the College of Computing at the Georgia Institute of Technology. He received his M.S. and Ph.D. degrees in Computer Science from the State University of New York at Stony Brook in 1983 and 1985, respectively. His research interests include distributed operating systems, consistency of shared information in large scale distributed systems, and replicated data systems. James E. Burns received the B.S. degree in mathematics from the California Institute of Technology, the M.B.I.S. degree from Georgia State University, and the M.S. and Ph.D. degrees in information and computer science from the Georgia Institute of Technology. He served on the faculty of Computer Science at Indiana University and the College of Computing at the Georgia Institute of Technology before joining Bellcore in 1993. He is currently a Member of Technical Staff in the Network Control Research Department, where he is studying the telephone control network with special interest in behavior when faults occur. He also has research interests in theoretical issues of distributed and parallel computing, especially relating to problems of synchronization and fault tolerance. This work was supported in part by the National Science Foundation under grants CCR-8619886, CCR-8909663, CCR-9106627, and CCR-9301454. Parts of this paper appeared in S. Toueg, P.G. Spirakis, and L. Kirousis, editors, Proceedings of the Fifth International Workshop on Distributed Algorithms, volume 579 of Lecture Notes in Computer Science, pages 9–30, Springer-Verlag, October 1991. The photograph of Professor J.E. Burns was published in Volume 8, No. 2, 1994 on page 59. This author's contributions were made while he was a graduate student at the Georgia Institute of Technology. No photograph or biographical information is available for P.W. Hutto. Gil Neiger was born on February 19, 1957 in New York, New York. In June 1979, he received an A.B. in Mathematics and Psycholinguistics from Brown University in Providence, Rhode Island. In February 1985, he spent two weeks picking cotton in Nicaragua in a brigade of international volunteers. In January 1986, he received an M.S. in Computer Science from Cornell University in Ithaca, New York and, in August 1988, he received a Ph.D. in Computer Science, also from Cornell University. On August 20, 1988, Dr. Neiger married Hilary Lombard in Lansing, New York. He is currently a Staff Software Engineer at Intel's Software Technology Lab in Hillsboro, Oregon. Dr. Neiger is a member of the editorial boards of the Chicago Journal of Theoretical Computer Science and the Journal of Parallel and Distributed Computing.

15.
The goal of this paper is to return attention to two problems that arise in the context of supporting the monitor as a mechanism for concurrent programming. This paper will re-examine the monitor concept in its original context—a multiprocessing environment implemented on a single processor sharing memory with and being interrupted by asynchronous peripheral devices—and will address the two previously unresolved problems. The first is the conflict between the immediate resumption requirement in explicit signalling and the policies and priorities of the process scheduler. The second is the possibility of deadlock inherent in nested monitors and in its most important instance, the dynamic resource allocation problem. After briefly describing the historical context of these two problems, the paper proposes a language structure called a signalling region that together with the notion of encapsulation by modules solves the immediate resumption problem and avoids the nested monitor problem. The former is done by a combination of the signal-and-return semantics of Concurrent Pascal and the signal-and-continue semantics of Mesa and StarMod. The latter is done by suggesting that mutual exclusion and data encapsulation are distinct concepts that, if separated, make nested encapsulation possible while avoiding the problems of nested mutual exclusion. Classical examples of the use of signalling regions in an extended Modula-2 are given as well as an implementation by translation to unextended Modula-2 together with a Kernel module.
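
The signalling-region construct itself is specific to the paper, but the signal-and-continue discipline it builds on is easy to show with POSIX threads. The sketch below is a conventional one-slot bounded buffer, not the paper's extended Modula-2: because pthread condition variables give Mesa-style signal-and-continue rather than signal-and-return, the waiter must re-check its condition in a loop after being woken.

    #include <pthread.h>

    /* A one-slot "monitor": one mutex for mutual exclusion plus two
     * condition variables, in the Mesa/StarMod signal-and-continue style. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  not_full, not_empty;
        int full;
        int slot;
    } buffer_t;

    void buffer_init(buffer_t *b) {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->not_full, NULL);
        pthread_cond_init(&b->not_empty, NULL);
        b->full = 0;
    }

    void buffer_put(buffer_t *b, int v) {
        pthread_mutex_lock(&b->lock);
        while (b->full)                         /* re-check: the signaller keeps running */
            pthread_cond_wait(&b->not_full, &b->lock);
        b->slot = v;
        b->full = 1;
        pthread_cond_signal(&b->not_empty);     /* signal-and-continue, not signal-and-return */
        pthread_mutex_unlock(&b->lock);
    }

    int buffer_get(buffer_t *b) {
        pthread_mutex_lock(&b->lock);
        while (!b->full)
            pthread_cond_wait(&b->not_empty, &b->lock);
        int v = b->slot;
        b->full = 0;
        pthread_cond_signal(&b->not_full);
        pthread_mutex_unlock(&b->lock);
        return v;
    }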

16.
We examine the use of atomic instructions in implementing barriers that do not require previously initialized memory. We show how n identical processes can use uninitialized shared memory to elect a leader that then initializes the shared memory. The processes first use the uninitialized memory to obtain unique identifiers in the range 0 to n−1 and then meet at a barrier. After passing the barrier, the leader initializes the shared memory. When n is not a power of 2, this barrier implementation, a tournament algorithm, avoids extra work by taking advantage of information implicit in the algorithm for obtaining the unique identifiers. The only atomic instruction that we require is one that complements a bit.
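
For contrast with the construction above, a conventional barrier that does assume initialized shared memory can be written with C11 atomics as follows. This sense-reversing sketch is not the paper's tournament algorithm: it needs barrier_init to run before any process arrives, which is exactly the requirement the paper removes.

    #include <stdatomic.h>

    /* Sense-reversing barrier for n threads; requires initialization first. */
    typedef struct {
        atomic_int count;    /* threads still expected at this episode */
        atomic_int sense;    /* flips once per completed episode */
        int n;
    } barrier_t;

    void barrier_init(barrier_t *b, int n) {
        atomic_init(&b->count, n);
        atomic_init(&b->sense, 0);
        b->n = n;
    }

    /* local_sense starts at 0 and is private to each thread. */
    void barrier_wait(barrier_t *b, int *local_sense) {
        *local_sense = !*local_sense;
        if (atomic_fetch_sub(&b->count, 1) == 1) {      /* last thread to arrive */
            atomic_store(&b->count, b->n);              /* reset for the next episode */
            atomic_store(&b->sense, *local_sense);      /* release the waiters */
        } else {
            while (atomic_load(&b->sense) != *local_sense)
                ;                                       /* spin until released */
        }
    }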

17.
Shared memory is a simple yet powerful paradigm for structuring systems. Recently, there has been an interest in extending this paradigm to non-shared memory architectures as well. For example, the virtual address spaces for all objects in a distributed object-based system could be viewed as constituting a global distributed shared memory. We propose a set of primitives for managing distributed shared memory. We present an implementation of these primitives in the context of an object-based operating system as well as on top of Unix.

18.
The availability of low-cost, high-performance microprocessors has led to various designs of shared memory multiprocessor systems. As a result, commercial products based on shared memory have proliferated. Such a multiprocessor system is heavily influenced by the structure of its memory system, and most configurations include local cache memories. The more processors a system carries, the larger the local cache memory needed to keep the traffic to and from the shared memory at a reasonable level. The implementation of local cache memories, however, is not a simple task because of environmental limitations. In particular, the general lack of board space presents a formidable problem. A cache memory system usually needs space mostly to support its complex control logic circuits for the cache itself and network interfaces such as snooping logic circuits for the shared bus. Although packaging can be made denser to reduce system size, there are still multiple processors per board. This calls for a more area-efficient cache memory architecture. This paper presents a shared-cache design for the dual-processor board of a bus-based symmetric multiprocessor. The design and implementation issues are described first, and then the evaluation and measurement results are discussed. The shared cache proposed in this paper proves quite area-efficient without significant loss of throughput or scalability. It has been implemented as a plug-in unit for TICOM, a prevalent commercial multiprocessor system.

19.
Database query processing can benefit significantly from parallelism. Parallel database algorithms combine substantial CPU and I/O activity, memory requirements, and massive data exchange between processes, all of which must be considered to obtain optimal performance. Since parallel external sorting is a very typical example, we have focused on sorting to tune Volcano, a new query processing system. The purpose of the Volcano project is to provide efficient, extensible tools for query and request processing in novel application domains, particularly in object-oriented and scientific database systems, and for experimental database performance research. It includes all query processing algorithms conventionally used in relational database systems as well as several new ones, and can execute all of them in parallel. In this article, we present Volcano's parallel external sorting algorithm and a sequence of enhancements to improve its performance. We obtained very good absolute performance, 84 seconds for 100 MB of data, as well as near-linear speedup with sixteen CPUs and disks. Furthermore, these results were achieved on a shared-memory machine despite the common belief that parallel query processing is best implemented on distributed-memory systems. We detail our tuning measures and report on their effectiveness.

20.
There are two distinct types of MIMD (Multiple Instruction, Multiple Data) computers: the shared memory machine, e.g. Butterfly, and the distributed memory machine, e.g. Hypercubes, Transputer arrays. Typically these utilize different programming models: the shared memory machine has monitors, semaphores and fetch-and-add; whereas the distributed memory machine uses message passing. Moreover there are two popular types of operating systems: a multi-tasking, asynchronous operating system and a crystalline, loosely synchronous operating system.

In this paper I firstly describe the Butterfly, Hypercube and Transputer array MIMD computers, and review monitors, semaphores, fetch-and-add and message passing; then I explain the two types of operating systems and give examples of how they are implemented on these MIMD computers. Next I discuss the advantages and disadvantages of shared memory machines with monitors, semaphores and fetch-and-add, compared to distributed memory machines using message passing, answering questions such as “is one model ‘easier’ to program than the other?” and “which is ‘more efficient’?”. One may think that a shared memory machine with monitors, semaphores and fetch-and-add is simpler to program and runs faster than a distributed memory machine using message passing, but we shall see that this is not necessarily the case. Finally I briefly discuss which type of operating system to use and on which type of computer. This of course depends on the algorithm one wishes to compute.
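
As a concrete reminder of what one of the shared memory primitives being compared looks like in code, here is a small fetch-and-add example in C11 (a generic sketch, not tied to the Butterfly): each thread atomically claims the next work item from a shared counter, with no messages and no explicit lock.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    enum { TOTAL_ITEMS = 1000000, NTHREADS = 4 };

    static atomic_long next_item;   /* shared work counter, starts at 0 */

    static void *worker(void *arg) {
        long i;
        /* fetch-and-add hands each caller a distinct item index */
        while ((i = atomic_fetch_add(&next_item, 1)) < TOTAL_ITEMS) {
            /* ... process item i ... */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int k = 0; k < NTHREADS; k++) pthread_create(&t[k], NULL, worker, NULL);
        for (int k = 0; k < NTHREADS; k++) pthread_join(t[k], NULL);
        printf("all %d items claimed\n", TOTAL_ITEMS);
        return 0;
    }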

