期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An implementation of distributed shared memory

Umakishore Ramachandran M. Yousef A. Khalidi 《Software》1991,21(5):443-464

Shared memory is a simple yet powerful paradigm for structuring systems. Recently, there has been an interest in extending this paradigm to non-shared memory architectures as well. For example, the virtual address spaces for all objects in a distributed object-based system could be viewed as constituting a global distributed shared memory. We propose a set of primitives for managing distributed shared memory. We present an implementation of these primitives in the context of an object-based operating system as well as on top of Unix. 相似文献

2.

Comparing shared and distributed memory computers

Clive F. Baillie 《Parallel Computing》1988,8(1-3):101-110

There are two distinct types of MIMD (Multiple Instruction, Multiple Data) computers: the shared memory machine, e.g. Butterfly, and the distributed memory machine, e.g. Hypercubes, Transputer arrays. Typically these utilize different programming models: the shared memory machine has monitors, semaphores and fetch-and-add; whereas the distributed memory machine uses message passing. Moreover there are two popular types of operating systems: a multi-tasking, asynchronous operating system and a crystalline, loosely synchronous operating system.

In this paper I firstly describe the Butterfly, Hypercube and Transputer array MIMD computers, and review monitors, semaphores, fetch-and-add and message passing; then I explain the two types of operating systems and give examples of how they are implemented on these MIMD computers. Next I discuss the advantages and disadvantages of shared memory machines with monitors, semaphores and fetch-and-add, compared to distributed memory machines using message passing, answering questions such as “is one model ‘easier’ to program than the other?” and “which is ‘more efficient‘?”. One may think that a shared memory machine with monitors, semaphores and fetch-and-add is simpler to program and runs faster than a distributed memory machine using message passing but we shall see that this is not necessarily the case. Finally I briefly discuss which type of operating system to use and on which type of computer. This of course depends on the algorithm one wishes to compute. 相似文献

3.

Towards implementation of a novel scheme for data prefetching on distributed shared memory systems

Hsiao-Hsi Wang Kuan-Ching Li Ssu-Hsuan Lu Chun-Chieh Yang 《The Journal of supercomputing》2009,47(2):111-126

High speed networks and rapidly improving microprocessor performance make the network of workstations an extremely important tool for parallel computing in order to speedup the execution of scientific applications. Shared memory is an attractive programming model for designing parallel and distributed applications, where the programmer can focus on algorithmic development rather than data partition and communication. Based on this important characteristic, the design of systems to provide the shared memory abstraction on physically distributed memory machines has been developed, known as Distributed Shared Memory (DSM). DSM is built using specific software to combine a number of computer hardware resources into one computing environment. Such an environment not only provides an easy way to execute parallel applications, but also combines available computational resources with the purpose of speeding up execution of these applications. DSM systems need to maintain data consistency in memory, which usually leads to communication overhead. Therefore, there exists a number of strategies that can be used to overcome this overhead issue and improve overall performance. Strategies as prefetching have been proven to show great performance in DSM systems, since they can reduce data access communication latencies from remote nodes. On the other hand, these strategies also transfer unnecessary prefetching pages to remote nodes. In this research paper, we focus on the access pattern during execution of a parallel application, and then analyze the data type and behavior of parallel applications. We propose an adaptive data classification scheme to improve prefetching strategy with the goal to improve overall performance. Adaptive data classification scheme classifies data according to the accessing sequence of pages, so that the home node uses past history access patterns of remote nodes to decide whether it needs to transfer related pages to remote nodes. From experimental results, we can observe that our proposed method can increase the accuracy of data access in effective prefetch strategy by reducing the number of page faults and misprefetching. Experimental results using our proposed classification scheme show a performance improvement of about 9–25% over the same benchmark applications running on top of an original JIAJIA DSM system.

Kuan-Ching Li (Corresponding author)Email:

相似文献

4.

A comparison of two paradigms for distributed shared memory

Willem G. Levelt M. Frans Kaashoek Henri E. Bal Andrew S. Tanenbaum 《Software》1992,22(11):985-1010

Two paradigms for distributed shared memory on loosely-coupled computing systems are compared: the shared data-object model as used in Orca, a programming language specially designed for loosely-coupled computing systems, and the shared virtual memory model. For both paradigms two systems are described, one using only point-to-point messages, the other using broadcasting as well. The two paradigms and their implementations are described briefly. Their performances are compared on four applications: the travelling-salesman problem, alpha-beta search, matrix multiplication and the all-pairs shortest-paths problem. Measurements were obtained on a system consisting of 10 MC68020 processors connected by an Ethernet. For comparison purposes, the applications have also been run on a system with physical shared memory. In addition, the paper gives measurements for the first two applications above when remote procedure call is used as the communication mechanism. The measurements show that both paradigms can be used efficiently for programming large-grain parallel applications, with significant speed-ups. The structured shared data-object model achieves the highest speed-ups and is easiest to program and to debug. 相似文献

5.

Performance evaluation of directory protocols on an optical broadcast-based distributed shared memory multiprocessor

?pek Abas?kele? 《Computers & Electrical Engineering》2010,36(1):114-131

Recent advances in the development of optical technologies suggest the possible emergence of broadcast-based optical interconnects within cache-coherent distributed shared memory (DSM) multiprocessor architectures. It is well known that the cache-coherence protocol is a critical issue in designing such architectures because it directly affects memory latencies. In this paper, we evaluate via simulation the performance of three directory-based cache-coherence protocols; strict request-response, intervention forwarding and reply forwarding on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus), which is a low-latency and high-bandwidth broadcast-based fiber-optic interconnection network supporting DSM. The simulated system contains 64 nodes, each of which has a processor, a cache controller, a directory controller and an output channel. Simulations have been conducted for each protocol to measure average processor utilization, average network latency and average number of packets transferred over the network for varying values of the important DSM parameters such as the ratio of the mean channel service time to mean thread run time (T/R), probability of a cache block being in modified state {P(M)}, the fraction of write misses {P(W)} and home node contention rate. The results reveal that for all cases, except for low values of P(M), intervention forwarding gives the worst performance (lowest processor utilization and highest latency). The performance of strict request-response and reply forwarding is comparable for several values of the DSM parameters and contention rate. For a contention rate of 0%, the increase of P(M) makes reply forwarding perform better than strict request-response. The performance of all protocols decreases with the increase of P(W) and contention rate. However, the performance of strict request-response is the least affected among other protocols due to the negative impact of the increase of P(W) and contention rate. Therefore, for the full contention case (i.e. contention rate of 100%); for low values of P(M), or for mid values of P(M) and high values of P(W), strict request-response performs better than reply forwarding. These results are significant in the sense that they provide an insight to multiprocessor architecture designers for comparing the performance of different directory-based cache-coherence protocols on a broadcast-based interconnection network for different values of the DSM parameters and varying rates of contention. 相似文献

6.

Delay tolerant lazy release consistency for distributed shared memory in opportunistic networks

《Pervasive and Mobile Computing》2016

Opportunistic networks (ONs) allow mobile wireless devices to interact with one another through a series of opportunistic contacts. While ONs exploit mobility of devices to route messages and distribute information, the intermittent connections among devices make many traditional computer collaboration paradigms, such as distributed shared memory (DSM), very difficult to realize. DSM systems, developed for traditional networks, rely on relatively stable, consistent connections among participating nodes to function properly.We propose a novel delay tolerant lazy release consistency (DTLRC) mechanism for implementing distributed shared memory in opportunistic networks. DTLRC permits mobile devices to remain independently productive while separated, and provides a mechanism for nodes to regain coherence of shared memory if and when they meet again. DTLRC allows applications to utilize the most coherent data available, even in the challenged environments typical to opportunistic networks. Simulations demonstrate that DTLRC is a viable concept for enhancing cooperation among mobile wireless devices in opportunistic networking environment. 相似文献

7.

Causal memory: definitions,implementation, and programming

Mustaque Ahamad Gil Neiger James E. Burns Prince Kohli Phillip W. Hutto 《Distributed Computing》1995,9(1):37-49

Summary The abstraction of a shared memory is of growing importance in distributed computing systems. Traditional memory consistency ensures that all processes agree on a common order of all operations on memory. Unfortunately, providing these guarantees entails access latencies that prevent scaling to large systems. This paper weakens such guarantees by definingcausal memory, an abstraction that ensures that processes in a system agree on the relative ordering of operations that arecausally related. Because causal memory isweakly consistent, it admits more executions, and hence more concurrency, than either atomic or sequentially consistent memories. This paper provides a formal definition of causal memory and gives an implementation for message-passing systems. In addition, it describes a practical class of programs that, if developed for a strongly consistent memory, run correctly with causal memory. Mustaque Ahamad is an Associate Professor in the College of Computing at the Georgia Institute of Technology. He received his M.S. and Ph.D. degrees in Computer Science from the State University of New York at Stony Brook in 1983 and 1985 respectively. His research interests include distributed operating systems, consistency of shared information in large scale distributed systems, and replicated data systems. James E. Burns received the B.S. degree in mathematics from the California Institute of Technology, the M.B.I.S. degree from Georgia State University, and the M.S. and Ph.D. degrees in information and computer science from the Georgia Institute of Technology. He served on the faculty of Computer Science at Indiana University and the College of Computing at the Georgia Institute of Technology before joining Bellcore in 1993. He is currently a Member of Technical Staff in the Network Control Research Department, where he is studying the telephone control network with special interest in behavior when faults occur. He also has research interests in theoretical issues of distributed and parallel computing especially relating to problems of synchronization and fault tolerance.This work was supported in part by the National Science Foundation under grants CCR-8619886, CCR-8909663, CCR-9106627, and CCR-9301454. Parts of this paper appeared in S. Toueg, P.G. Spirakis, and L. Kirousis, editors,Proceedings of the Fifth International Workshop on Distributed Algorithms, volume 579 ofLecture Notes on Computer Science, pages 9–30, Springer-Verlag, October 1991The photograph of Professor J.E. Burns was published in Volume 8, No. 2, 1994 on page 59This author's contributions were made while he was a graduate student at the Georgia Institute of Technology. No photograph and biographical information is available for P.W. Hutto Gil Neiger was born on February 19, 1957 in New York, New York. In June 1979, he received an A.B. in Mathematics and Psycholinguistics from Brown University in Providence, Rhode Island. In February 1985, he spent two weeks picking cotton in Nicaragua in a brigade of international volunteers. In January 1986, he received an M.S. in Computer Science from Cornell University in Ithaca, New York and, in August 1988, he received a Ph.D. in Computer Science, also from Cornell University. On August 20, 1988, Dr. Neiger married Hilary Lombard in Lansing, New York. He is currently a Staff Software Engineer at Intel's Software Technology Lab in Hillsboro, Oregon. Dr. Neiger is a member of the editorial boards of theChicago Journal of Theoretical Computer Science and theJournal of Parallel and Distributed Computing. 相似文献

8.

The new mexico state university ring-star system: A distributed unix environment

Arthur I. Karshmer Dirk J. Depree James Phelan 《Software》1983,13(12):1157-1168

A distributed version of the UNIX operations system is currently under development through a joint effort of New Mexico State University and the Hebrew University of Jerusalem. A microprocessor version of the UNIX kernel has been developed which will run on any PDP-11 or LSI-11 based processing element and allows processes to run in a UNIX ‘look-alike’ environment. Each process is fully transportable among all processors in the system. Although the preliminary version of the system was built in a star configuration, the system is currently being upgraded by the addition of a communication ring with 8-bit microprocessors as ring interface units. The current paper describes the software structure, the hardware structure and the communication protocol of the system. 相似文献

9.

UNIX系统中进程利用共享内存进行通讯的方式 总被引：2，自引：0，他引：2

刘唱军《电脑与信息技术》2003,11(3):57-59

文章阐述了在UNIX系统进行程序设计时利用共享内存进行进程间通讯的方法并给出了相关函数的调用，该方法可以提高应用程序的开发速度和运行效率。相似文献

10.

Data race avoidance and replay scheme for developing and debugging parallel programs on distributed shared memory systems

Yung-Chang Chiu Tyng-Yeu Liang 《Parallel Computing》2011,37(1):11-25

Distributed shared memory (DSM) allows parallel programs to run on distributed computers by simulating a global virtual shared memory, but data racing bugs may easily occur when the threads of a multi-threaded process concurrently access the physically distributed memory. Earlier tools to help programmers locate data racing bugs in non-DSM parallel programs are not easily applied to DSM systems. This study presents the data race avoidance and replay scheme (DRARS) to assist debugging parallel programs on DSM or multi-core systems. DRARS is a novel tool which controls the consistency protocol of the target program, automatically preventing a large class of data racing bugs when the parallel program is subsequently run, obviating much of the need for manual debugging. For data racing bugs that cannot be avoided automatically, DRARS performs a deterministic replay-type function on DSM systems, faithfully reproducing the behavior of the parallel program during run time. Because one class of data racing bugs has already been eliminated, the remaining manual debugging task is greatly simplified. Unlike previous debugging methods, DRARS does not require that the parallel program be written in a specific style or programming language. Moreover, DRARS can be implemented in most consistency protocols. In this paper, DRARS is realized and verified in real experiments using the eager release consistency protocol on a DSM system with various applications. 相似文献

11.

The parallel ‘Deutschland-Modell’ - A message-passing version for distributed memory computers

Ulrich Schättler Elisabeth Krenzien 《Parallel Computing》1997,23(14):13-2226

The parallel ‘Deutschland-Modell’ and its implementation on distributed memory parallel computers using the message-passing library PARMACS 6.0 is described. Performance results on a Cray T3D are given and the problem of dynamical load imbalances is addressed. 相似文献

12.

Parallelizing with BDSC,a resource-constrained scheduling algorithm for shared and distributed memory systems

《Parallel Computing》2015

We introduce a new parallelization framework for scientific computing based on BDSC, an efficient automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size. BDSC extends Yang and Gerasoulis’s Dominant Sequence Clustering (DSC) algorithm; it uses sophisticated cost models and addresses both shared and distributed parallel memory architectures. We describe BDSC, its integration within the PIPS compiler infrastructure and its application to the parallelization of four well-known scientific applications: Harris, ABF, equake and IS. Our experiments suggest that BDSC’s focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these four applications running on both OpenMP and MPI frameworks. 相似文献

13.

基于多空间内存共享的高速网络链路数据包捕获方法*

张敦行张广兴张大方谢高岗于真《计算机应用研究》2008,25(3):807-810

结合共享内存和NAPI技术提出了一种基于通用硬件平台和开源软件实现的高速网络链路数据包捕获方案,能够将数据包的捕获能力和捕获效率提升到一个新水平.通过实验表明,在通用PC服务器上实现该方案完全能够满足千兆链路的监测需求,数据包处理能力可达到线速148.8万pps. 相似文献

14.

Language issues in the implementation of a kernel

N. Natarajan Mukul K. Sinha 《Software》1979,9(9):771-778

In this paper we present some issues encountered in the implementation of a kernel for a multiprocessor system using a high level language called CCNPASCAL. We present the nature of the problem concerning each issue, the solutions we adopted and possible better alternatives which we could not adopt for various reasons. 相似文献

15.

A communication system for distributed computations on a network of personal computers

《Advances in Engineering Software》1999,30(4):255-260

This paper deals with a software system PCCOM which makes it possible to simulate a distributed memory environment for parallel computations by using a local area network of personal computers. This system consists of FORTRAN subroutines that can be used from application programs. Parallel computations may be performed on a network of personal computers under DOS or any other operating system. This software package simulates, partially, the widespread PVM (Parallel Virtual Machine) package which runs under UNIX. A simple example of an application is given here. 相似文献

16.

The computational complexity of the quadrant interlocking (QI) iterative methods on shared memory parallel computers

《国际计算机数学杂志》2012,89(3-4):391-410

In this paper the Quadrant Interlocking (QI) matrix splitting is shown to yield parallel iterative methods for the solution of linear equations with improved convergence rates for both synchronous and asynchronous versions of the algorithms. 相似文献

17.

A parametrized algorithm that implements sequential, causal, and cache memory consistencies

Ernesto Jiménez Vicent Cholvi 《Journal of Systems and Software》2008,81(1):120-131

In this paper, we present an algorithm that can be used to implement sequential, causal, or cache consistency in distributed shared memory (DSM) systems. For this purpose it includes a parameter that allows us to choose the consistency model to be implemented. If all processes run the algorithm with the same value in this parameter, the corresponding consistency is achieved. (Additionally, the algorithm tolerates that processes use certain combination of parameter values.) This characteristic allows a concrete consistency model to be chosen, but implements it with the more efficient algorithm in each case (depending of the requirements of the applications). Additionally, as far as we know, this is the first algorithm proposed that implements cache coherence.In our algorithm, all the read and write operations are executed locally when implementing causal and cache consistency (i.e., they are fast). It is known that no sequential algorithm has only fast memory operations. In our algorithm, however, all the write operations and some read operations are fast when implementing sequential consistency. The algorithm uses propagation and full replication, where the values written by a process are propagated to the rest of the processes. It works in a cyclic turn fashion, with each process of the DSM system, broadcasting one message in turn. The values written by the process are sent in the message (instead of sending one message for each write operation): However, unnecessary values are excluded. All this permits the amount of message traffic owing to the algorithm to be controlled. 相似文献

18.

Parallel operation of CartaBlanca on shared and distributed memory computers

N. T. Padial‐Collins W. B. VanderHeyden D. Z. Zhang E. D. Dendy D. Livescu 《Concurrency and Computation》2004,16(1):61-77

We describe the parallel performance of the pure Java CartaBlanca code on heat transfer and multiphase fluid flow problems. CartaBlanca is designed for parallel computations on partitioned unstructured meshes. It uses Java's thread facility to manage computations on each of the mesh partitions. Inter‐partition communications are handled by two compact objects for node‐by‐node communication along partition boundaries and for global reduction calculations across the entire mesh. For distributed calculations, the JavaParty package from the University of Karlsruhe is demonstrated to work with CartaBlanca. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

19.

Workload decomposition strategies for hierarchical distributed‐shared memory parallel systems and their implementation with integration of high‐level parallel languages

Sergio Briguglio Beniamino Di Martino Gregorio Vlad 《Concurrency and Computation》2002,14(11):933-956

In this paper we address the issue of workload decomposition in programming hierarchical distributed‐shared memory parallel systems. The workload decomposition we have devised consists of a two‐stage procedure: a higher‐level decomposition among the computational nodes; and a lower‐level one among the processors of each computational node. By focusing on porting of a case study particle‐in‐cell application, we have implemented the described work decomposition without large programming effort by using and integrating the high‐level language extensions High‐Performance Fortran and OpenMP. Copyright © 2002 John Wiley & Sons, Ltd. 相似文献

20.

Implementation of a boundary element method on distributed memory computers 总被引：1，自引：0，他引：1

El Mostafa Daoudi Jacques Lobry 《Parallel Computing》1992,18(12):1317-1324

In this paper, we analyse and compare different parallel implementations of the Boundary Element Method on distributed memory computers. We deal with the computation of two-dimensional magnetostatic problems. The resulting linear system will be solved using Householder transformation and Gaussian elimination. Experimental results are obtained on a Meiko Computing Surface with 32 T800 transputers. 相似文献