Similar documents
20 similar documents found.
1.
Shared memory is a simple yet powerful paradigm for structuring systems. Recently, there has been an interest in extending this paradigm to non-shared memory architectures as well. For example, the virtual address spaces for all objects in a distributed object-based system could be viewed as constituting a global distributed shared memory. We propose a set of primitives for managing distributed shared memory. We present an implementation of these primitives in the context of an object-based operating system as well as on top of Unix.
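As a purely illustrative aside (the abstract does not list the proposed primitives), a page-granularity DSM segment with hypothetical read/write primitives might be sketched in C++ as follows; the class name, method names and single-node backing store are assumptions, not the paper's interface.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Hypothetical page-granularity DSM primitives; names, granularity and the
// single-node backing store are assumptions for illustration only.
class DsmSegment {
public:
    explicit DsmSegment(std::size_t bytes) : backing_(bytes, 0) {}

    // A real primitive would first acquire ownership of (or invalidate remote
    // copies of) the pages covering [offset, offset + len).
    void write(std::size_t offset, const void* src, std::size_t len) {
        std::memcpy(backing_.data() + offset, src, len);
    }

    // A real primitive would fault in any pages not cached locally.
    void read(std::size_t offset, void* dst, std::size_t len) const {
        std::memcpy(dst, backing_.data() + offset, len);
    }

private:
    std::vector<std::uint8_t> backing_;
};

int main() {
    DsmSegment seg(4096);            // one "page" of shared state
    int value = 42;
    seg.write(0, &value, sizeof value);
    int out = 0;
    seg.read(0, &out, sizeof out);
    std::cout << "read back " << out << "\n";
}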

2.
Due to advances in fiber optics and VLSI technology, interconnection networks that allow simultaneous broadcasts are becoming feasible. Distributed shared memory (DSM) implementations on such networks promise high performance even for small applications with small granularity. This paper, after summarizing the architecture of one such implementation called the Simultaneous Multiprocessor Optical Exchange Bus (SOME-Bus), presents simple algorithms for improving the performance of parallel programs running on the SOME-Bus multiprocessor implementing cache-coherent DSM. The algorithms are based on run-time data redistribution via a dynamic page migration protocol. They use memory access references together with information on average channel utilization, average channel waiting time, the number of messages in the channel queue, or short-term average channel waiting time, reported by each node and gathered by hardware monitors, to make correct decisions about the placement of shared data. Simulations with four parallel codes on a 64-processor SOME-Bus show that the algorithms yield significant performance improvements, including reductions in execution time, the number of remote memory accesses, average channel waiting time and average network latency, and increases in average channel utilization.
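To make the migration idea concrete, here is a minimal C++ sketch of a decision rule of the kind described above: move a page toward the node that references it most, but only when the current home node's reported channel utilization suggests remote accesses are becoming expensive. The thresholds, field names and the rule itself are assumptions, not the paper's algorithms.

#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical page-migration heuristic; thresholds and fields are assumptions.
struct PageStats {
    std::size_t owner;                     // current home node of the page
    std::vector<std::size_t> remote_refs;  // references per node, from hardware monitors
};

struct NodeStats {
    double channel_utilization;            // reported by each node's monitor
};

std::size_t choose_home(const PageStats& page,
                        const std::vector<NodeStats>& nodes,
                        double utilization_threshold = 0.6,
                        double dominance_ratio = 2.0) {
    std::size_t best = page.owner;
    std::size_t best_refs = page.remote_refs[page.owner];
    for (std::size_t n = 0; n < page.remote_refs.size(); ++n)
        if (page.remote_refs[n] > best_refs) { best = n; best_refs = page.remote_refs[n]; }

    // Migrate only if another node clearly dominates the reference count and the
    // current home's channel is busy enough for the remote traffic to hurt.
    if (best != page.owner &&
        nodes[page.owner].channel_utilization > utilization_threshold &&
        best_refs >= dominance_ratio * (page.remote_refs[page.owner] + 1))
        return best;
    return page.owner;
}

int main() {
    PageStats page{0, {3, 25, 4, 2}};
    std::vector<NodeStats> nodes{{0.8}, {0.4}, {0.3}, {0.2}};
    std::cout << "suggested home node: " << choose_home(page, nodes) << "\n";  // prints 1
}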

3.
We describe the evolution of a distributed shared memory (DSM) system, Mirage, and the difficulties encountered when moving the system from a Unix-based kernel on the VAX to a Unix-based kernel on personal computers. Mirage provides a network-transparent form of shared memory for a loosely coupled environment. The system hides network boundaries for processes that are accessing shared memory and is upward compatible with the Unix System V Interface Definition. This paper addresses the architectural dependencies in the design of the system and evaluates the performance of the implementation. The new version, MIRAGE+, performs well compared to Mirage even though eight times the amount of data is sent on each page fault because of the larger page size used in the implementation. We show that the performance of systems with a large ratio of page size to network packet size can be dramatically improved on conventional hardware by applying three well-known techniques: packet blasting, compression, and running at interrupt level. The measured time for a page fault in MIRAGE+ has been reduced by 37 per cent by sending a page using packet blasting instead of using a handshake for each portion of the page. When compression was added to MIRAGE+, the time to fault a page across the network was further improved by 47 per cent when the page was compressed into one network packet. Our measured performance compares favorably with the amount of time it takes to fault a page from disk. Lastly, running at interrupt level may improve performance by 16 per cent when faulting pages without compression.
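A rough sketch of the "compress, then send" step: run-length encode a page and count how many network packets the page fault now needs. The RLE codec, the 8 KB page and the 1500-byte packet payload are illustrative assumptions; MIRAGE+'s actual compression scheme is not described in this abstract.

#include <cstdint>
#include <iostream>
#include <vector>

// Run-length encode a page and check whether it now fits in one packet, in
// which case the fault needs a single message instead of a multi-packet blast.
std::vector<std::uint8_t> rle_compress(const std::vector<std::uint8_t>& page) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < page.size();) {
        std::uint8_t value = page[i];
        std::size_t run = 1;
        while (i + run < page.size() && page[i + run] == value && run < 255) ++run;
        out.push_back(static_cast<std::uint8_t>(run));
        out.push_back(value);
        i += run;
    }
    return out;
}

int main() {
    const std::size_t kPageSize = 8192;    // large page (assumed size)
    const std::size_t kPacketSize = 1500;  // typical Ethernet payload (assumed)
    std::vector<std::uint8_t> page(kPageSize, 0);  // mostly-zero page compresses well
    auto packed = rle_compress(page);
    std::size_t packets = (packed.size() + kPacketSize - 1) / kPacketSize;
    std::cout << packed.size() << " bytes after RLE, " << packets
              << " packet(s) instead of "
              << (kPageSize + kPacketSize - 1) / kPacketSize << "\n";
}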

4.
Congestion occurring in the input queues of broadcast-based multiprocessor architectures can severely limit their overall performance. Existing congestion control algorithms estimate congestion from a node's output channel parameters, such as the number of free virtual channels or the number of packets waiting in the channel queue. In this paper, we propose a new congestion control algorithm to prevent congestion on broadcast-based multiprocessor architectures with multiple input queues. Our algorithm performs congestion control at the packet level and takes into account the number of the next input queue that will be accessed by the processor; these are the fundamental differences between our algorithm and algorithms based on virtual channel congestion control. The performance of the algorithm is evaluated with OPNET Modeler under various synthetic traffic patterns on a 64-node Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) architecture employing the message passing protocol. Performance measures such as average input waiting time, average network response time and average processor utilization were collected before and after applying the algorithm. The results show that the proposed algorithm decreases the average input waiting time by 13.99% to 20.39% and the average network response time by 8.76% to 20.36%, and increases average processor utilization by 1.92% to 6.63%. The performance of the algorithm is compared with that of other congestion control algorithms, and our algorithm performs better under all traffic patterns. A theoretical analysis of the proposed method using queueing networks is also presented.
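An illustrative sketch (an assumption about the mechanism, not the paper's algorithm) of packet-level admission that consults the next input queue the receiving processor will service: that queue will drain soon, so it may accept more, while any other queue is throttled once its depth reaches a threshold.

#include <cstddef>
#include <iostream>
#include <vector>

// Receiver-side state assumed for the example: one input queue per source node
// and a pointer to the queue the processor will service next.
struct Receiver {
    std::vector<std::size_t> input_queue_depth;
    std::size_t next_queue;
};

bool admit_packet(const Receiver& dst, std::size_t source, std::size_t threshold = 8) {
    // The queue about to be serviced will drain shortly; others are throttled
    // once their depth reaches the (assumed) threshold.
    return source == dst.next_queue || dst.input_queue_depth[source] < threshold;
}

int main() {
    Receiver r{{9, 9, 1, 0}, 1};
    std::cout << std::boolalpha
              << admit_packet(r, 1) << ' '    // true: queue 1 is about to be serviced
              << admit_packet(r, 0) << '\n';  // false: queue 0 is saturated
}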

5.
Two paradigms for distributed shared memory on loosely-coupled computing systems are compared: the shared data-object model as used in Orca, a programming language specially designed for loosely-coupled computing systems, and the shared virtual memory model. For both paradigms two systems are described, one using only point-to-point messages, the other using broadcasting as well. The two paradigms and their implementations are described briefly. Their performances are compared on four applications: the travelling-salesman problem, alpha-beta search, matrix multiplication and the all-pairs shortest-paths problem. Measurements were obtained on a system consisting of 10 MC68020 processors connected by an Ethernet. For comparison purposes, the applications have also been run on a system with physical shared memory. In addition, the paper gives measurements for the first two applications above when remote procedure call is used as the communication mechanism. The measurements show that both paradigms can be used efficiently for programming large-grain parallel applications, with significant speed-ups. The structured shared data-object model achieves the highest speed-ups and is easiest to program and to debug.

6.
High-speed networks and rapidly improving microprocessor performance make networks of workstations an extremely important tool for parallel computing and for speeding up the execution of scientific applications. Shared memory is an attractive programming model for designing parallel and distributed applications, because the programmer can focus on algorithmic development rather than on data partitioning and communication. This characteristic has motivated the design of systems that provide the shared memory abstraction on physically distributed memory machines, known as Distributed Shared Memory (DSM). DSM is built with software that combines a number of computer hardware resources into one computing environment. Such an environment not only provides an easy way to execute parallel applications, but also combines available computational resources with the purpose of speeding up their execution. DSM systems need to maintain data consistency in memory, which usually leads to communication overhead. A number of strategies exist to overcome this overhead and improve overall performance. Strategies such as prefetching have been shown to perform well in DSM systems, since they can reduce the latency of data accesses to remote nodes. On the other hand, these strategies may also transfer unnecessarily prefetched pages to remote nodes. In this paper, we focus on the access pattern during execution of a parallel application and analyze the data types and behavior of parallel applications. We propose an adaptive data classification scheme to improve the prefetching strategy and, with it, overall performance. The scheme classifies data according to the page access sequence, so that the home node can use the past access patterns of remote nodes to decide whether related pages need to be transferred to them. Experimental results show that the proposed method increases the accuracy of the prefetching strategy by reducing the number of page faults and mis-prefetches. Experiments using the proposed classification scheme show a performance improvement of about 9–25% over the same benchmark applications running on top of the original JIAJIA DSM system.
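A minimal sketch of history-based prefetching in this spirit: the home node remembers which page each remote node requested after which, and on a fault predicts the most frequent successor page. The data structures and the "most frequent successor" rule are assumptions, not the paper's classification scheme.

#include <cstddef>
#include <iostream>
#include <map>
#include <optional>
#include <utility>

// The home node tracks, per remote node, the page requested after each page,
// and uses that history to suggest a prefetch candidate.
class HomeNodeHistory {
public:
    void record_fault(int remote_node, std::size_t page) {
        auto it = last_page_.find(remote_node);
        if (it != last_page_.end())
            ++successor_count_[{it->second, page}];
        last_page_[remote_node] = page;
    }

    // Page most often seen right after `page`, if any.
    std::optional<std::size_t> predict_next(std::size_t page) const {
        std::optional<std::size_t> best;
        std::size_t best_count = 0;
        for (const auto& [key, count] : successor_count_) {
            if (key.first == page && count > best_count) {
                best = key.second;
                best_count = count;
            }
        }
        return best;
    }

private:
    std::map<int, std::size_t> last_page_;  // last page faulted by each remote node
    std::map<std::pair<std::size_t, std::size_t>, std::size_t> successor_count_;
};

int main() {
    HomeNodeHistory h;
    for (int round = 0; round < 3; ++round) {
        h.record_fault(/*remote_node=*/1, 10);
        h.record_fault(1, 11);   // page 11 repeatedly follows page 10
        h.record_fault(1, 13);
    }
    if (auto next = h.predict_next(10))
        std::cout << "prefetch page " << *next << " after a fault on page 10\n";
}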

7.
When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model that will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they still have to choose between MPI and a variety of OpenMP programming styles. To help the programmer in this decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd.
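For readers unfamiliar with the distinction, the fragment below contrasts two of the OpenMP styles compared above on a trivial vector update: a loop-level directive versus an SPMD region in which each thread computes its own index range, much as an MPI rank would. It is only a stylistic illustration, not a NAS kernel; compile with, e.g., g++ -fopenmp.

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);

    // Style 1: loop-level parallelism -- a directive per hot loop.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += 2.0 * x[i];

    // Style 2: SPMD -- one enclosing parallel region; each thread owns a block
    // of the data, much like an MPI rank.
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int chunk = (n + nthreads - 1) / nthreads;
        int lo = tid * chunk;
        int hi = lo + chunk < n ? lo + chunk : n;
        for (int i = lo; i < hi; ++i)
            y[i] += 2.0 * x[i];
    }

    std::printf("y[0] = %f\n", y[0]);  // 2 + 2*1 + 2*1 = 6
    return 0;
}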

8.
Opportunistic networks (ONs) allow mobile wireless devices to interact with one another through a series of opportunistic contacts. While ONs exploit the mobility of devices to route messages and distribute information, the intermittent connections among devices make many traditional computer collaboration paradigms, such as distributed shared memory (DSM), very difficult to realize. DSM systems, developed for traditional networks, rely on relatively stable, consistent connections among participating nodes to function properly. We propose a novel delay tolerant lazy release consistency (DTLRC) mechanism for implementing distributed shared memory in opportunistic networks. DTLRC permits mobile devices to remain independently productive while separated, and provides a mechanism for nodes to regain coherence of shared memory if and when they meet again. DTLRC allows applications to utilize the most coherent data available, even in the challenged environments typical of opportunistic networks. Simulations demonstrate that DTLRC is a viable concept for enhancing cooperation among mobile wireless devices in opportunistic networking environments.
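A toy sketch of the reconciliation step implied above: two nodes that diverged while disconnected exchange per-page version counters at an opportunistic contact and each adopts the fresher copy. The "highest version wins" rule and the data structures are assumptions, not the DTLRC protocol itself.

#include <cstdint>
#include <iostream>
#include <map>

// Per-page version counter plus a stand-in for the page contents.
struct PageCopy {
    std::uint64_t version = 0;
    int value = 0;
};

using Memory = std::map<std::size_t, PageCopy>;

// When two nodes meet, each page converges to the copy with the higher version.
void reconcile(Memory& a, Memory& b) {
    for (auto& [page, copy_a] : a) {
        auto& copy_b = b[page];
        if (copy_b.version > copy_a.version) copy_a = copy_b;
        else if (copy_a.version > copy_b.version) copy_b = copy_a;
    }
    for (auto& [page, copy_b] : b)
        if (!a.count(page)) a[page] = copy_b;
}

int main() {
    Memory node1{{0, {3, 30}}, {1, {1, 10}}};
    Memory node2{{0, {2, 20}}, {1, {5, 50}}};
    reconcile(node1, node2);   // an opportunistic contact
    std::cout << node1[0].value << ' ' << node1[1].value << '\n';  // 30 50
}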

9.
Network researchers have dedicated a notable part of their efforts to modeling traffic and to implementing efficient traffic generators. We feel there is a strong demand for traffic generators capable of reproducing realistic traffic patterns according to theoretical models while also achieving high performance. This work presents an open distributed platform for traffic generation that we call the distributed internet traffic generator (D-ITG), capable of producing traffic (at the network, transport and application layers) at packet level and of accurately replicating appropriate stochastic processes for both the inter-departure time (IDT) and packet size (PS) random variables. We implemented two different versions of our distributed generator. In the first, a log server records the information transmitted by senders and receivers, and these communications are based on either TCP or UDP. In the other, senders and receivers make use of the MPI library. A complete performance comparison between the centralized version and the two distributed versions of D-ITG is presented.
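The core of such a generator can be sketched in a few lines: draw each packet's inter-departure time (IDT) and packet size (PS) from configurable distributions and advance a clock. The exponential IDT, uniform PS and the parameters below are arbitrary example choices, not D-ITG defaults.

#include <iostream>
#include <random>

int main() {
    std::mt19937 rng(12345);
    std::exponential_distribution<double> idt_seconds(1000.0);   // ~1000 pkt/s (assumed)
    std::uniform_int_distribution<int> ps_bytes(64, 1500);       // assumed size range

    double clock = 0.0;
    long total_bytes = 0;
    const int packets = 10000;
    for (int i = 0; i < packets; ++i) {
        clock += idt_seconds(rng);        // when the next packet departs
        total_bytes += ps_bytes(rng);     // how large it is
    }
    std::cout << "generated " << packets << " packets in " << clock
              << " s, offered load " << (total_bytes * 8 / clock) / 1e6
              << " Mbit/s\n";
}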

10.
Distributed systems deliver a cost-effective and scalable solution to increasingly performance-intensive applications by utilizing shared resources. Gang scheduling is considered an efficient time-space sharing scheduling algorithm for parallel and distributed systems. In this paper we examine the performance of scheduling strategies for jobs that are bags of independent gangs in a heterogeneous system. A simulation model is used to evaluate the performance of bag-of-gangs scheduling in the presence of high-priority jobs when migrations are implemented. The simulation results reveal the significant role of the implemented migration scheme as a load-balancing factor in a heterogeneous environment. Another significant benefit of migrations presented in this paper is the reduction of the fragmentation caused in the schedule by gang-scheduled jobs and the alleviation of the performance impact of the high-priority jobs.

11.
Synchronization in a distributed virtual environment (DVE) involves mechanisms that ensure a consistent view of the virtual world for all participants. Most applications in a DVE involve collaborative activities that include both non-contention and contention cases. Transmitting update messages is sufficient to support synchronization only for non-contention activity; contention activity requires an additional mechanism to control access to a common object. In this paper, we present a compromised synchronization control mechanism that supports both non-contention and contention activities. The mechanism employs frequent update events and multiple-lock checking to control synchronization. Frequent update events support a dynamic virtual world for non-contention activity. Multiple-lock checking is embedded to ensure consistency when the common object must be accessed simultaneously during a contention event. The performance of the compromised synchronization is measured by simulation in terms of locking time, sampling events, number of logical processes, and traffic tolerance. A prototype application is also implemented to compare the results at a small scale. Based on the simulation and experimental results, the compromised synchronization control mechanism can support up to 100 participants for non-contention activity and provides good performance for contention activity at a small scale. The mechanism is considered suitable for collaborative applications in which contention is a critical event.
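A minimal stand-in for the contention-handling half of the mechanism: before manipulating a shared object a participant must acquire that object's lock, whereas non-contention state would simply be streamed as frequent update events. The single-process lock table below only illustrates the idea of lock checking; the distributed multiple-lock protocol itself is not shown and the object/participant identifiers are made up.

#include <iostream>
#include <mutex>
#include <string>
#include <unordered_map>

// Tracks which participant currently owns each contended object.
class LockTable {
public:
    bool try_acquire(const std::string& object, int participant) {
        std::lock_guard<std::mutex> g(m_);
        auto [it, inserted] = owner_.emplace(object, participant);
        return inserted || it->second == participant;
    }
    void release(const std::string& object, int participant) {
        std::lock_guard<std::mutex> g(m_);
        auto it = owner_.find(object);
        if (it != owner_.end() && it->second == participant) owner_.erase(it);
    }
private:
    std::mutex m_;
    std::unordered_map<std::string, int> owner_;
};

int main() {
    LockTable locks;
    std::cout << std::boolalpha
              << locks.try_acquire("door-17", 1) << ' '   // true: participant 1 wins
              << locks.try_acquire("door-17", 2) << '\n'; // false: contention detected
    locks.release("door-17", 1);
    std::cout << locks.try_acquire("door-17", 2) << '\n'; // true
}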

12.
Distributed shared memory (DSM) allows parallel programs to run on distributed computers by simulating a global virtual shared memory, but data racing bugs may easily occur when the threads of a multi-threaded process concurrently access the physically distributed memory. Earlier tools to help programmers locate data racing bugs in non-DSM parallel programs are not easily applied to DSM systems. This study presents the data race avoidance and replay scheme (DRARS) to assist debugging parallel programs on DSM or multi-core systems. DRARS is a novel tool which controls the consistency protocol of the target program, automatically preventing a large class of data racing bugs when the parallel program is subsequently run, obviating much of the need for manual debugging. For data racing bugs that cannot be avoided automatically, DRARS performs a deterministic replay-type function on DSM systems, faithfully reproducing the behavior of the parallel program during run time. Because one class of data racing bugs has already been eliminated, the remaining manual debugging task is greatly simplified. Unlike previous debugging methods, DRARS does not require that the parallel program be written in a specific style or programming language. Moreover, DRARS can be implemented in most consistency protocols. In this paper, DRARS is realized and verified in real experiments using the eager release consistency protocol on a DSM system with various applications.

13.
In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed reductions of up to 13.9% in execution time, 30.5% in cache misses and 39.4% in the number of invalidation messages.
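The two steps can be sketched as follows: accumulate a thread-to-thread communication matrix (in the paper this comes from cache coherence traffic; here it is fed in by hand) and then greedily place the most heavily communicating pairs on cores assumed to share a cache. The greedy pairing heuristic and the two-level topology are assumptions for illustration, not the paper's heuristics.

#include <iostream>
#include <vector>

int main() {
    const int threads = 4;
    std::vector<std::vector<long>> comm(threads, std::vector<long>(threads, 0));
    auto record_sharing = [&](int a, int b, long events) {
        comm[a][b] += events;
        comm[b][a] += events;
    };
    record_sharing(0, 2, 900);   // stand-ins for coherence-protocol observations
    record_sharing(1, 3, 750);
    record_sharing(0, 1, 50);

    // Assumed topology: cores {0,1} share one cache, cores {2,3} share another.
    const int core_groups[2][2] = {{0, 1}, {2, 3}};

    std::vector<int> placement(threads, -1);
    std::vector<bool> placed(threads, false);
    for (const auto& group : core_groups) {
        long best = -1;
        int a = -1, b = -1;
        for (int i = 0; i < threads; ++i)
            for (int j = i + 1; j < threads; ++j)
                if (!placed[i] && !placed[j] && comm[i][j] > best) {
                    best = comm[i][j];
                    a = i;
                    b = j;
                }
        placement[a] = group[0];   // heaviest-communicating pair shares a cache
        placement[b] = group[1];
        placed[a] = placed[b] = true;
    }
    for (int t = 0; t < threads; ++t)
        std::cout << "thread " << t << " -> core " << placement[t] << '\n';
}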

14.
We present algorithms for the randomized simulation of a shared memory machine (PRAM) on a Distributed Memory Machine (DMM). In a PRAM, memory conflicts occur only through concurrent access to the same cell, whereas the memory of a DMM is divided into modules, one for each processor, and concurrent accesses to the same module create a conflict. The delay of a simulation is the time needed to simulate a parallel memory access of the PRAM. Any general simulation of an m-processor PRAM on an n-processor DMM will necessarily have delay at least m/n. A randomized simulation is called time-processor optimal if the delay is O(m/n) with high probability. Using a novel simulation scheme based on hashing we obtain a time-processor optimal simulation with delay O(log log(n) log*(n)). The best previous simulations use a simpler scheme based on hashing and have much larger delay: Θ(log(n)/log log(n)) for the simulation of an n-processor PRAM on an n-processor DMM, and Θ(log(n)) in the case where the simulation is time-processor optimal. Our simulations use several (two or three) hash functions to distribute the shared memory among the memory modules of the DMM. The stochastic processes modeling the behavior of our algorithms, and their analyses based on powerful classes of universal hash functions, may be of independent interest. Research partially supported by NSF/DARPA Grant CCR-9005448; work was done while at the University of California at Berkeley and the International Computer Science Institute, Berkeley, CA. Research partially supported by National Science Foundation Operating Grants CCR-9016468 and CCR-9304722, United States-Israel Binational Science Foundation Grants No. 89-00312 and No. 92-00226, and ESPRIT BR Grant EC-US 030. Part of the work was done during a visit at the International Computer Science Institute at Berkeley; supported in part by DFG-Forschergruppe "Effiziente Nutzung massiv paralleler Systeme", Teilprojekt 4, and by Esprit Basic Research Action Nr. 7141 (ALCOM II).
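As a toy illustration of the hashing idea (not the paper's simulation scheme or its analysis): map each shared cell to two DMM modules with two hash functions and serve each access from the currently less loaded module, which smooths module contention. The hash functions and the "shorter queue" rule are assumptions.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

int main() {
    const std::size_t modules = 8;
    std::vector<std::size_t> load(modules, 0);   // outstanding requests per module

    // Two (illustrative) hash functions mapping shared addresses to modules.
    auto h1 = [&](std::uint64_t addr) {
        return std::hash<std::uint64_t>{}(addr) % modules;
    };
    auto h2 = [&](std::uint64_t addr) {
        return std::hash<std::uint64_t>{}(addr * 0x9e3779b97f4a7c15ULL + 1) % modules;
    };

    // One simulated parallel step: many processors access distinct cells.
    for (std::uint64_t addr = 0; addr < 64; ++addr) {
        std::size_t m1 = h1(addr), m2 = h2(addr);
        std::size_t chosen = load[m1] <= load[m2] ? m1 : m2;   // pick the shorter queue
        ++load[chosen];
    }
    std::size_t worst = 0;
    for (std::size_t l : load) worst = std::max(worst, l);
    std::cout << "maximum module congestion in this step: " << worst << '\n';
}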

15.
Coherence misses and invalidation traffic limit the performance of bus-based multiprocessors using write-invalidate snooping caches. This paper considers optimizations of a write-invalidate protocol that remove such overhead. While coherence misses are attacked by a hybrid update/invalidate protocol and another technique where update instructions are selectively inserted by a compiler, invalidation traffic is reduced by three optimizations that coalesce ownership acquisition with miss handling: migrate-on-dirty, an adaptive hardware-based scheme, and compiler-controlled insertion of load-exclusive instructions.

The relative effectiveness of these optimizations is evaluated using detailed architectural simulations and a set of four parallel programs. We find that while both of the update-based schemes effectively remove most coherence misses, the hybrid update/invalidate scheme causes lower traffic. By contrast, the compiler-based approach to cutting invalidation traffic is slightly more efficient than the adaptive hardware-based scheme. Moreover, the migrate-on-dirty heuristic is found to have devastating effects on the miss rate.


16.
The probability of a station failing to deliver packets before their deadlines, called the probability of dynamic failure, P_dyn, is an important measure for the communication subsystem of a distributed real-time system. Another closely-related performance measure is the ε-bounded delivery time, T_ε, which is defined as the least time needed to deliver a packet with probability greater than 1 − ε. Using P_dyn and T_ε, we comparatively evaluate four contention protocols often used in distributed real-time systems: (i) the token passing protocol and its priority-based variation (called the token scheduling protocol), and (ii) the p_i-persistent protocol and a priority-based variation thereof. The communication subsystem equipped with the different contention protocols is modeled first as embedded Markov chains. Then, we derive the probability distributions of access delay, from which P_dyn and T_ε can be calculated. The blocking probability, Q_i, can also be derived from the access delay distribution. These measures are derived first under the assumption of a single buffer at each station. The single-buffer model is then extended to the multiple-buffer case. The effects of buffer size on P_dyn, T_ε, and Q_i, and the performance improvement with multiple buffers, are analyzed over a wide range of network traffic. The work reported in this paper was supported in part by the ONR under Grant N00014-92-J-1080. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the ONR.
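Restated as formulas (a paraphrase of the definitions above, with D denoting the random delivery delay of a packet and d its deadline; this is not an excerpt from the paper):

P_{\mathrm{dyn}} = \Pr[D > d], \qquad
T_{\varepsilon} = \min\{\, t : \Pr[D \le t] > 1 - \varepsilon \,\}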

17.
Shared Virtual Memory (SVM) provides a low-cost and effective way to implement the shared-memory programming paradigm. SVMs involve a number of concepts that include consistency models/protocols, sharing patterns, false sharing, and fragmentation issues. The range of issues encountered in an SVM introduces a level of complexity and presents a challenge to many SVM researchers. This paper presents a careful study of SVM systems focusing on how workload characteristics affect the performance of consistency protocols. This knowledge is used to propose a novel consistency protocol that improves system performance. The paper pursues two main goals: (i) to illustrate how different SVM workload characteristics are interrelated, and (ii) to motivate the design of a new multiple-writer memory consistency protocol. To achieve the first goal, we provide a detailed workload characterization analysis and a discussion of how consistency models and protocols work. To achieve the second goal, we describe a software-based SVM protocol that achieves better performance than a hardware protocol proposed in the literature. In some workloads, the speedup obtained over the baseline protocol is more than 20%.
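For context, multiple-writer SVM protocols commonly rely on the twin/diff technique sketched below: snapshot a page before writing, then at a release ship only the bytes that changed, so concurrent writers to different parts of a page do not ping-pong it. This is the generic mechanism, given as background; it is not the specific protocol proposed in the paper.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// A diff is the list of (offset, new value) pairs that differ from the twin.
struct Diff {
    std::vector<std::pair<std::size_t, std::uint8_t>> changes;
};

Diff make_diff(const std::vector<std::uint8_t>& twin,
               const std::vector<std::uint8_t>& page) {
    Diff d;
    for (std::size_t i = 0; i < page.size(); ++i)
        if (page[i] != twin[i]) d.changes.push_back({i, page[i]});
    return d;
}

void apply_diff(std::vector<std::uint8_t>& page, const Diff& d) {
    for (const auto& [offset, value] : d.changes) page[offset] = value;
}

int main() {
    std::vector<std::uint8_t> home_copy(4096, 0);
    std::vector<std::uint8_t> twin = home_copy, local = home_copy;

    local[10] = 1;                 // this writer touches only bytes 10 and 11
    local[11] = 2;
    Diff d = make_diff(twin, local);

    apply_diff(home_copy, d);      // only 2 bytes travel, not the whole 4 KB page
    std::cout << "diff carries " << d.changes.size() << " changed byte(s)\n";
}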

18.
We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions to be made of the performance of the software on large numbers of processors of a given parallel system, by benchmarking the code on small numbers of processors only. After describing the methods used and emphasizing the simplicity of their implementation, we test the approach on a range of engineering software applications built upon multigrid algorithms. Despite their simplicity, the models are demonstrated to provide accurate and robust predictions across a range of different parallel architectures, partitioning strategies and multigrid codes. In particular, the effectiveness of the predictive methodology is shown for a practical engineering software implementation of an elastohydrodynamic lubrication solver.
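The methodology can be illustrated with a deliberately simple stand-in: assume a run-time model, calibrate its coefficients from timings on a few small processor counts, and extrapolate. The model form T(p) = a/p + b*log2(p) and the sample timings below are assumptions purely for the example; the paper derives its own model for multigrid codes.

#include <cmath>
#include <initializer_list>
#include <iostream>

int main() {
    // Measured run times (seconds) on 2 and 8 processors (hypothetical data).
    const double p1 = 2.0, t1 = 52.0;
    const double p2 = 8.0, t2 = 16.0;

    // Solve the 2x2 system  a/p + b*log2(p) = t  for the coefficients (a, b).
    const double x11 = 1.0 / p1, x12 = std::log2(p1);
    const double x21 = 1.0 / p2, x22 = std::log2(p2);
    const double det = x11 * x22 - x12 * x21;
    const double a = (t1 * x22 - x12 * t2) / det;
    const double b = (x11 * t2 - t1 * x21) / det;

    // Extrapolate to processor counts that were never benchmarked.
    for (int p : {16, 32, 64, 128})
        std::cout << "predicted T(" << p << ") = " << a / p + b * std::log2(p)
                  << " s\n";
}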

19.
20.
A discrete-event simulation model was developed to evaluate the performance of an automated material handling system (AMHS) for a wafer fab with a zone control scheme that avoids all vehicle collisions. The layout of this AMHS is a custom configuration, and the track option contains turntables, turnouts and high-speed express lanes. The interarrival behavior of all stockers, taken from the real data set, was analyzed to verify the assumptions of the simulation model. The results show that the underlying interarrival-time distributions of most stockers are exponential or Weibull. The simulation results show that the number of vehicles significantly affects the average delivery time and the average throughput. A simple one-factor response surface model is used to determine the appropriate number of vehicles. The study also investigated how to determine the number of vehicles in an automated guided vehicle-based intrabay material handling system.
