Similar documents
20 similar documents found.
1.
Shared memory is a simple yet powerful paradigm for structuring systems. Recently, there has been an interest in extending this paradigm to non-shared memory architectures as well. For example, the virtual address spaces for all objects in a distributed object-based system could be viewed as constituting a global distributed shared memory. We propose a set of primitives for managing distributed shared memory. We present an implementation of these primitives in the context of an object-based operating system as well as on top of Unix.
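As a purely illustrative aside (the abstract does not list the proposed primitives), a page-granularity DSM segment with hypothetical read/write primitives might be sketched in C++ as follows; the class name, method names and single-node backing store are assumptions, not the paper's interface.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Hypothetical page-granularity DSM primitives; names, granularity and the
// single-node backing store are assumptions for illustration only.
class DsmSegment {
public:
    explicit DsmSegment(std::size_t bytes) : backing_(bytes, 0) {}

    // A real primitive would first acquire ownership of (or invalidate remote
    // copies of) the pages covering [offset, offset + len).
    void write(std::size_t offset, const void* src, std::size_t len) {
        std::memcpy(backing_.data() + offset, src, len);
    }

    // A real primitive would fault in any pages not cached locally.
    void read(std::size_t offset, void* dst, std::size_t len) const {
        std::memcpy(dst, backing_.data() + offset, len);
    }

private:
    std::vector<std::uint8_t> backing_;
};

int main() {
    DsmSegment seg(4096);            // one "page" of shared state
    int value = 42;
    seg.write(0, &value, sizeof value);
    int out = 0;
    seg.read(0, &out, sizeof out);
    std::cout << "read back " << out << "\n";
}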

2.
Due to advances in fiber optics and VLSI technology, interconnection networks that allow simultaneous broadcasts are becoming feasible. Distributed shared memory (DSM) implementations on such networks promise high performance even for small applications with small granularity. This paper, after summarizing the architecture of one such implementation called the Simultaneous Multiprocessor Optical Exchange Bus (SOME-Bus), presents simple algorithms for improving the performance of parallel programs running on the SOME-Bus multiprocessor implementing cache-coherent DSM. The algorithms are based on run-time data redistribution via a dynamic page migration protocol. They use memory access references together with information on average channel utilization, average channel waiting time, the number of messages in the channel queue, or short-term average channel waiting time, reported by each node and gathered by hardware monitors, to make correct decisions about the placement of shared data. Simulations with four parallel codes on a 64-processor SOME-Bus show that the algorithms yield significant performance improvements, including reductions in execution time, the number of remote memory accesses, average channel waiting time and average network latency, and increases in average channel utilization.
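To make the migration idea concrete, here is a minimal C++ sketch of a decision rule of the kind described above: move a page toward the node that references it most, but only when the current home node's reported channel utilization suggests remote accesses are becoming expensive. The thresholds, field names and the rule itself are assumptions, not the paper's algorithms.

#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical page-migration heuristic; thresholds and fields are assumptions.
struct PageStats {
    std::size_t owner;                     // current home node of the page
    std::vector<std::size_t> remote_refs;  // references per node, from hardware monitors
};

struct NodeStats {
    double channel_utilization;            // reported by each node's monitor
};

std::size_t choose_home(const PageStats& page,
                        const std::vector<NodeStats>& nodes,
                        double utilization_threshold = 0.6,
                        double dominance_ratio = 2.0) {
    std::size_t best = page.owner;
    std::size_t best_refs = page.remote_refs[page.owner];
    for (std::size_t n = 0; n < page.remote_refs.size(); ++n)
        if (page.remote_refs[n] > best_refs) { best = n; best_refs = page.remote_refs[n]; }

    // Migrate only if another node clearly dominates the reference count and the
    // current home's channel is busy enough for the remote traffic to hurt.
    if (best != page.owner &&
        nodes[page.owner].channel_utilization > utilization_threshold &&
        best_refs >= dominance_ratio * (page.remote_refs[page.owner] + 1))
        return best;
    return page.owner;
}

int main() {
    PageStats page{0, {3, 25, 4, 2}};
    std::vector<NodeStats> nodes{{0.8}, {0.4}, {0.3}, {0.2}};
    std::cout << "suggested home node: " << choose_home(page, nodes) << "\n";  // prints 1
}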

3.
We describe the evolution of a distributed shared memory (DSM) system, Mirage, and the difficulties encountered when moving the system from a Unix-based kernel on the VAX to a Unix-based kernel on personal computers. Mirage provides a network-transparent form of shared memory for a loosely coupled environment. The system hides network boundaries for processes that are accessing shared memory and is upward compatible with the Unix System V Interface Definition. This paper addresses the architectural dependencies in the design of the system and evaluates the performance of the implementation. The new version, MIRAGE+, performs well compared to Mirage even though eight times the amount of data is sent on each page fault because of the larger page size used in the implementation. We show that the performance of systems with a large ratio of page size to network packet size can be dramatically improved on conventional hardware by applying three well-known techniques: packet blasting, compression, and running at interrupt level. The measured time for a page fault in MIRAGE+ has been reduced by 37 per cent by sending a page using packet blasting instead of using a handshake for each portion of the page. When compression was added to MIRAGE+, the time to fault a page across the network was further improved by 47 per cent when the page was compressed into one network packet. Our measured performance compares favorably with the amount of time it takes to fault a page from disk. Lastly, running at interrupt level may improve performance by 16 per cent when faulting pages without compression.
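A rough sketch of the "compress, then send" step: run-length encode a page and count how many network packets the page fault now needs. The RLE codec, the 8 KB page and the 1500-byte packet payload are illustrative assumptions; MIRAGE+'s actual compression scheme is not described in this abstract.

#include <cstdint>
#include <iostream>
#include <vector>

// Run-length encode a page and check whether it now fits in one packet, in
// which case the fault needs a single message instead of a multi-packet blast.
std::vector<std::uint8_t> rle_compress(const std::vector<std::uint8_t>& page) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < page.size();) {
        std::uint8_t value = page[i];
        std::size_t run = 1;
        while (i + run < page.size() && page[i + run] == value && run < 255) ++run;
        out.push_back(static_cast<std::uint8_t>(run));
        out.push_back(value);
        i += run;
    }
    return out;
}

int main() {
    const std::size_t kPageSize = 8192;    // large page (assumed size)
    const std::size_t kPacketSize = 1500;  // typical Ethernet payload (assumed)
    std::vector<std::uint8_t> page(kPageSize, 0);  // mostly-zero page compresses well
    auto packed = rle_compress(page);
    std::size_t packets = (packed.size() + kPacketSize - 1) / kPacketSize;
    std::cout << packed.size() << " bytes after RLE, " << packets
              << " packet(s) instead of "
              << (kPageSize + kPacketSize - 1) / kPacketSize << "\n";
}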

4.
Congestion occurring in the input queues of broadcast-based multiprocessor architectures can severely limit their overall performance. Existing congestion control algorithms estimate congestion from a node's output channel parameters, such as the number of free virtual channels or the number of packets waiting in the channel queue. In this paper, we propose a new congestion control algorithm to prevent congestion on broadcast-based multiprocessor architectures with multiple input queues. Our algorithm performs congestion control at the packet level and takes into account the number of the next input queue that will be accessed by the processor; these are the fundamental differences between our algorithm and algorithms based on virtual channel congestion control. The performance of the algorithm is evaluated with OPNET Modeler under various synthetic traffic patterns on a 64-node Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) architecture employing the message passing protocol. Performance measures such as average input waiting time, average network response time and average processor utilization were collected before and after applying the algorithm. The results show that the proposed algorithm decreases the average input waiting time by 13.99% to 20.39% and the average network response time by 8.76% to 20.36%, and increases average processor utilization by 1.92% to 6.63%. The performance of the algorithm is compared with that of other congestion control algorithms, and our algorithm performs better under all traffic patterns. A theoretical analysis of the proposed method using queueing networks is also presented.
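An illustrative sketch (an assumption about the mechanism, not the paper's algorithm) of packet-level admission that consults the next input queue the receiving processor will service: that queue will drain soon, so it may accept more, while any other queue is throttled once its depth reaches a threshold.

#include <cstddef>
#include <iostream>
#include <vector>

// Receiver-side state assumed for the example: one input queue per source node
// and a pointer to the queue the processor will service next.
struct Receiver {
    std::vector<std::size_t> input_queue_depth;
    std::size_t next_queue;
};

bool admit_packet(const Receiver& dst, std::size_t source, std::size_t threshold = 8) {
    // The queue about to be serviced will drain shortly; others are throttled
    // once their depth reaches the (assumed) threshold.
    return source == dst.next_queue || dst.input_queue_depth[source] < threshold;
}

int main() {
    Receiver r{{9, 9, 1, 0}, 1};
    std::cout << std::boolalpha
              << admit_packet(r, 1) << ' '    // true: queue 1 is about to be serviced
              << admit_packet(r, 0) << '\n';  // false: queue 0 is saturated
}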

5.
Two paradigms for distributed shared memory on loosely-coupled computing systems are compared: the shared data-object model as used in Orca, a programming language specially designed for loosely-coupled computing systems, and the shared virtual memory model. For both paradigms two systems are described, one using only point-to-point messages, the other using broadcasting as well. The two paradigms and their implementations are described briefly. Their performances are compared on four applications: the travelling-salesman problem, alpha-beta search, matrix multiplication and the all-pairs shortest-paths problem. Measurements were obtained on a system consisting of 10 MC68020 processors connected by an Ethernet. For comparison purposes, the applications have also been run on a system with physical shared memory. In addition, the paper gives measurements for the first two applications above when remote procedure call is used as the communication mechanism. The measurements show that both paradigms can be used efficiently for programming large-grain parallel applications, with significant speed-ups. The structured shared data-object model achieves the highest speed-ups and is easiest to program and to debug.

6.
High-speed networks and rapidly improving microprocessor performance make networks of workstations an extremely important tool for parallel computing and for speeding up the execution of scientific applications. Shared memory is an attractive programming model for designing parallel and distributed applications, because the programmer can focus on algorithmic development rather than on data partitioning and communication. This characteristic has motivated the design of systems that provide the shared memory abstraction on physically distributed memory machines, known as Distributed Shared Memory (DSM). DSM is built with software that combines a number of computer hardware resources into one computing environment. Such an environment not only provides an easy way to execute parallel applications, but also combines available computational resources with the purpose of speeding up their execution. DSM systems need to maintain data consistency in memory, which usually leads to communication overhead. A number of strategies exist to overcome this overhead and improve overall performance. Strategies such as prefetching have been shown to perform well in DSM systems, since they can reduce the latency of data accesses to remote nodes. On the other hand, these strategies may also transfer unnecessarily prefetched pages to remote nodes. In this paper, we focus on the access pattern during execution of a parallel application and analyze the data types and behavior of parallel applications. We propose an adaptive data classification scheme to improve the prefetching strategy and, with it, overall performance. The scheme classifies data according to the page access sequence, so that the home node can use the past access patterns of remote nodes to decide whether related pages need to be transferred to them. Experimental results show that the proposed method increases the accuracy of the prefetching strategy by reducing the number of page faults and mis-prefetches. Experiments using the proposed classification scheme show a performance improvement of about 9–25% over the same benchmark applications running on top of the original JIAJIA DSM system.
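A minimal sketch of history-based prefetching in this spirit: the home node remembers which page each remote node requested after which, and on a fault predicts the most frequent successor page. The data structures and the "most frequent successor" rule are assumptions, not the paper's classification scheme.

#include <cstddef>
#include <iostream>
#include <map>
#include <optional>
#include <utility>

// The home node tracks, per remote node, the page requested after each page,
// and uses that history to suggest a prefetch candidate.
class HomeNodeHistory {
public:
    void record_fault(int remote_node, std::size_t page) {
        auto it = last_page_.find(remote_node);
        if (it != last_page_.end())
            ++successor_count_[{it->second, page}];
        last_page_[remote_node] = page;
    }

    // Page most often seen right after `page`, if any.
    std::optional<std::size_t> predict_next(std::size_t page) const {
        std::optional<std::size_t> best;
        std::size_t best_count = 0;
        for (const auto& [key, count] : successor_count_) {
            if (key.first == page && count > best_count) {
                best = key.second;
                best_count = count;
            }
        }
        return best;
    }

private:
    std::map<int, std::size_t> last_page_;  // last page faulted by each remote node
    std::map<std::pair<std::size_t, std::size_t>, std::size_t> successor_count_;
};

int main() {
    HomeNodeHistory h;
    for (int round = 0; round < 3; ++round) {
        h.record_fault(/*remote_node=*/1, 10);
        h.record_fault(1, 11);   // page 11 repeatedly follows page 10
        h.record_fault(1, 13);
    }
    if (auto next = h.predict_next(10))
        std::cout << "prefetch page " << *next << " after a fault on page 10\n";
}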

7.
When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model that will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they still have to choose between MPI and a variety of OpenMP programming styles. To help the programmer in this decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd.
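For readers unfamiliar with the distinction, the fragment below contrasts two of the OpenMP styles compared above on a trivial vector update: a loop-level directive versus an SPMD region in which each thread computes its own index range, much as an MPI rank would. It is only a stylistic illustration, not a NAS kernel; compile with, e.g., g++ -fopenmp.

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);

    // Style 1: loop-level parallelism -- a directive per hot loop.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += 2.0 * x[i];

    // Style 2: SPMD -- one enclosing parallel region; each thread owns a block
    // of the data, much like an MPI rank.
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int chunk = (n + nthreads - 1) / nthreads;
        int lo = tid * chunk;
        int hi = lo + chunk < n ? lo + chunk : n;
        for (int i = lo; i < hi; ++i)
            y[i] += 2.0 * x[i];
    }

    std::printf("y[0] = %f\n", y[0]);  // 2 + 2*1 + 2*1 = 6
    return 0;
}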

8.
Opportunistic networks (ONs) allow mobile wireless devices to interact with one another through a series of opportunistic contacts. While ONs exploit the mobility of devices to route messages and distribute information, the intermittent connections among devices make many traditional computer collaboration paradigms, such as distributed shared memory (DSM), very difficult to realize. DSM systems, developed for traditional networks, rely on relatively stable, consistent connections among participating nodes to function properly. We propose a novel delay tolerant lazy release consistency (DTLRC) mechanism for implementing distributed shared memory in opportunistic networks. DTLRC permits mobile devices to remain independently productive while separated, and provides a mechanism for nodes to regain coherence of shared memory if and when they meet again. DTLRC allows applications to utilize the most coherent data available, even in the challenged environments typical of opportunistic networks. Simulations demonstrate that DTLRC is a viable concept for enhancing cooperation among mobile wireless devices in opportunistic networking environments.
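A toy sketch of the reconciliation step implied above: two nodes that diverged while disconnected exchange per-page version counters at an opportunistic contact and each adopts the fresher copy. The "highest version wins" rule and the data structures are assumptions, not the DTLRC protocol itself.

#include <cstdint>
#include <iostream>
#include <map>

// Per-page version counter plus a stand-in for the page contents.
struct PageCopy {
    std::uint64_t version = 0;
    int value = 0;
};

using Memory = std::map<std::size_t, PageCopy>;

// When two nodes meet, each page converges to the copy with the higher version.
void reconcile(Memory& a, Memory& b) {
    for (auto& [page, copy_a] : a) {
        auto& copy_b = b[page];
        if (copy_b.version > copy_a.version) copy_a = copy_b;
        else if (copy_a.version > copy_b.version) copy_b = copy_a;
    }
    for (auto& [page, copy_b] : b)
        if (!a.count(page)) a[page] = copy_b;
}

int main() {
    Memory node1{{0, {3, 30}}, {1, {1, 10}}};
    Memory node2{{0, {2, 20}}, {1, {5, 50}}};
    reconcile(node1, node2);   // an opportunistic contact
    std::cout << node1[0].value << ' ' << node1[1].value << '\n';  // 30 50
}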

9.
Network researchers have dedicated a notable part of their efforts to modeling traffic and to implementing efficient traffic generators. We feel there is a strong demand for traffic generators capable of reproducing realistic traffic patterns according to theoretical models while also achieving high performance. This work presents an open distributed platform for traffic generation that we call the distributed internet traffic generator (D-ITG), capable of producing traffic (at the network, transport and application layers) at packet level and of accurately replicating appropriate stochastic processes for both the inter-departure time (IDT) and packet size (PS) random variables. We implemented two different versions of our distributed generator. In the first, a log server records the information transmitted by senders and receivers, and these communications are based on either TCP or UDP. In the other, senders and receivers make use of the MPI library. A complete performance comparison between the centralized version and the two distributed versions of D-ITG is presented.
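The core of such a generator can be sketched in a few lines: draw each packet's inter-departure time (IDT) and packet size (PS) from configurable distributions and advance a clock. The exponential IDT, uniform PS and the parameters below are arbitrary example choices, not D-ITG defaults.

#include <iostream>
#include <random>

int main() {
    std::mt19937 rng(12345);
    std::exponential_distribution<double> idt_seconds(1000.0);   // ~1000 pkt/s (assumed)
    std::uniform_int_distribution<int> ps_bytes(64, 1500);       // assumed size range

    double clock = 0.0;
    long total_bytes = 0;
    const int packets = 10000;
    for (int i = 0; i < packets; ++i) {
        clock += idt_seconds(rng);        // when the next packet departs
        total_bytes += ps_bytes(rng);     // how large it is
    }
    std::cout << "generated " << packets << " packets in " << clock
              << " s, offered load " << (total_bytes * 8 / clock) / 1e6
              << " Mbit/s\n";
}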

10.
Distributed systems deliver a cost-effective and scalable solution to increasingly performance-intensive applications by utilizing shared resources. Gang scheduling is considered an efficient time-space sharing scheduling algorithm for parallel and distributed systems. In this paper we examine the performance of scheduling strategies for jobs that are bags of independent gangs in a heterogeneous system. A simulation model is used to evaluate the performance of bag-of-gangs scheduling in the presence of high-priority jobs when migrations are implemented. The simulation results reveal the significant role of the implemented migration scheme as a load-balancing factor in a heterogeneous environment. Another significant benefit of migrations presented in this paper is the reduction of the fragmentation caused in the schedule by gang-scheduled jobs and the alleviation of the performance impact of the high-priority jobs.

11.
Synchronization in a distributed virtual environment (DVE) involves mechanisms that ensure a consistent view of the virtual world for all participants. Most applications in a DVE involve collaborative activities that include both non-contention and contention cases. Transmitting update messages is sufficient to support synchronization only for non-contention activity; contention activity requires an additional mechanism to control access to a common object. In this paper, we present a compromised synchronization control mechanism that supports both non-contention and contention activities. The mechanism employs frequent update events and multiple-lock checking to control synchronization. Frequent update events support a dynamic virtual world for non-contention activity. Multiple-lock checking is embedded to ensure consistency when the common object must be accessed simultaneously during a contention event. The performance of the compromised synchronization is measured by simulation in terms of locking time, sampling events, number of logical processes, and traffic tolerance. A prototype application is also implemented to compare the results at a small scale. Based on the simulation and experimental results, the compromised synchronization control mechanism can support up to 100 participants for non-contention activity and provides good performance for contention activity at a small scale. The mechanism is considered suitable for collaborative applications in which contention is a critical event.
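A minimal stand-in for the contention-handling half of the mechanism: before manipulating a shared object a participant must acquire that object's lock, whereas non-contention state would simply be streamed as frequent update events. The single-process lock table below only illustrates the idea of lock checking; the distributed multiple-lock protocol itself is not shown and the object/participant identifiers are made up.

#include <iostream>
#include <mutex>
#include <string>
#include <unordered_map>

// Tracks which participant currently owns each contended object.
class LockTable {
public:
    bool try_acquire(const std::string& object, int participant) {
        std::lock_guard<std::mutex> g(m_);
        auto [it, inserted] = owner_.emplace(object, participant);
        return inserted || it->second == participant;
    }
    void release(const std::string& object, int participant) {
        std::lock_guard<std::mutex> g(m_);
        auto it = owner_.find(object);
        if (it != owner_.end() && it->second == participant) owner_.erase(it);
    }
private:
    std::mutex m_;
    std::unordered_map<std::string, int> owner_;
};

int main() {
    LockTable locks;
    std::cout << std::boolalpha
              << locks.try_acquire("door-17", 1) << ' '   // true: participant 1 wins
              << locks.try_acquire("door-17", 2) << '\n'; // false: contention detected
    locks.release("door-17", 1);
    std::cout << locks.try_acquire("door-17", 2) << '\n'; // true
}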

12.
Distributed shared memory (DSM) allows parallel programs to run on distributed computers by simulating a global virtual shared memory, but data racing bugs may easily occur when the threads of a multi-threaded process concurrently access the physically distributed memory. Earlier tools to help programmers locate data racing bugs in non-DSM parallel programs are not easily applied to DSM systems. This study presents the data race avoidance and replay scheme (DRARS) to assist debugging parallel programs on DSM or multi-core systems. DRARS is a novel tool which controls the consistency protocol of the target program, automatically preventing a large class of data racing bugs when the parallel program is subsequently run, obviating much of the need for manual debugging. For data racing bugs that cannot be avoided automatically, DRARS performs a deterministic replay-type function on DSM systems, faithfully reproducing the behavior of the parallel program during run time. Because one class of data racing bugs has already been eliminated, the remaining manual debugging task is greatly simplified. Unlike previous debugging methods, DRARS does not require that the parallel program be written in a specific style or programming language. Moreover, DRARS can be implemented in most consistency protocols. In this paper, DRARS is realized and verified in real experiments using the eager release consistency protocol on a DSM system with various applications.

13.
In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed reductions of up to 13.9% in execution time, 30.5% in cache misses and 39.4% in the number of invalidation messages.
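The two steps can be sketched as follows: accumulate a thread-to-thread communication matrix (in the paper this comes from cache coherence traffic; here it is fed in by hand) and then greedily place the most heavily communicating pairs on cores assumed to share a cache. The greedy pairing heuristic and the two-level topology are assumptions for illustration, not the paper's heuristics.

#include <iostream>
#include <vector>

int main() {
    const int threads = 4;
    std::vector<std::vector<long>> comm(threads, std::vector<long>(threads, 0));
    auto record_sharing = [&](int a, int b, long events) {
        comm[a][b] += events;
        comm[b][a] += events;
    };
    record_sharing(0, 2, 900);   // stand-ins for coherence-protocol observations
    record_sharing(1, 3, 750);
    record_sharing(0, 1, 50);

    // Assumed topology: cores {0,1} share one cache, cores {2,3} share another.
    const int core_groups[2][2] = {{0, 1}, {2, 3}};

    std::vector<int> placement(threads, -1);
    std::vector<bool> placed(threads, false);
    for (const auto& group : core_groups) {
        long best = -1;
        int a = -1, b = -1;
        for (int i = 0; i < threads; ++i)
            for (int j = i + 1; j < threads; ++j)
                if (!placed[i] && !placed[j] && comm[i][j] > best) {
                    best = comm[i][j];
                    a = i;
                    b = j;
                }
        placement[a] = group[0];   // heaviest-communicating pair shares a cache
        placement[b] = group[1];
        placed[a] = placed[b] = true;
    }
    for (int t = 0; t < threads; ++t)
        std::cout << "thread " << t << " -> core " << placement[t] << '\n';
}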

14.
We present algorithms for the randomized simulation of a shared memory machine (PRAM) on a Distributed Memory Machine (DMM). In a PRAM, memory conflicts occur only through concurrent access to the same cell, whereas the memory of a DMM is divided into modules, one for each processor, and concurrent accesses to the same module create a conflict. The delay of a simulation is the time needed to simulate a parallel memory access of the PRAM. Any general simulation of an m-processor PRAM on an n-processor DMM will necessarily have delay at least m/n. A randomized simulation is called time-processor optimal if the delay is O(m/n) with high probability. Using a novel simulation scheme based on hashing we obtain a time-processor optimal simulation with delay O(log log(n) log*(n)). The best previous simulations use a simpler scheme based on hashing and have much larger delay: Θ(log(n)/log log(n)) for the simulation of an n-processor PRAM on an n-processor DMM, and Θ(log(n)) in the case where the simulation is time-processor optimal. Our simulations use several (two or three) hash functions to distribute the shared memory among the memory modules of the DMM. The stochastic processes modeling the behavior of our algorithms, and their analyses based on powerful classes of universal hash functions, may be of independent interest. Research partially supported by NSF/DARPA Grant CCR-9005448; work was done while at the University of California at Berkeley and the International Computer Science Institute, Berkeley, CA. Research partially supported by National Science Foundation Operating Grants CCR-9016468 and CCR-9304722, United States-Israel Binational Science Foundation Grants No. 89-00312 and No. 92-00226, and ESPRIT BR Grant EC-US 030. Part of the work was done during a visit at the International Computer Science Institute at Berkeley; supported in part by DFG-Forschergruppe "Effiziente Nutzung massiv paralleler Systeme", Teilprojekt 4, and by Esprit Basic Research Action Nr. 7141 (ALCOM II).
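As a toy illustration of the hashing idea (not the paper's simulation scheme or its analysis): map each shared cell to two DMM modules with two hash functions and serve each access from the currently less loaded module, which smooths module contention. The hash functions and the "shorter queue" rule are assumptions.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

int main() {
    const std::size_t modules = 8;
    std::vector<std::size_t> load(modules, 0);   // outstanding requests per module

    // Two (illustrative) hash functions mapping shared addresses to modules.
    auto h1 = [&](std::uint64_t addr) {
        return std::hash<std::uint64_t>{}(addr) % modules;
    };
    auto h2 = [&](std::uint64_t addr) {
        return std::hash<std::uint64_t>{}(addr * 0x9e3779b97f4a7c15ULL + 1) % modules;
    };

    // One simulated parallel step: many processors access distinct cells.
    for (std::uint64_t addr = 0; addr < 64; ++addr) {
        std::size_t m1 = h1(addr), m2 = h2(addr);
        std::size_t chosen = load[m1] <= load[m2] ? m1 : m2;   // pick the shorter queue
        ++load[chosen];
    }
    std::size_t worst = 0;
    for (std::size_t l : load) worst = std::max(worst, l);
    std::cout << "maximum module congestion in this step: " << worst << '\n';
}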

15.
Coherence misses and invalidation traffic limit the performance of bus-based multiprocessors using write-invalidate snooping caches. This paper considers optimizations of a write-invalidate protocol that remove such overhead. While coherence misses are attacked by a hybrid update/invalidate protocol and another technique where update instructions are selectively inserted by a compiler, invalidation traffic is reduced by three optimizations that coalesce ownership acquisition with miss handling: migrate-on-dirty, an adaptive hardware-based scheme, and compiler-controlled insertion of load-exclusive instructions.

The relative effectiveness of these optimizations is evaluated using detailed architectural simulations and a set of four parallel programs. We find that while both of the update-based schemes effectively remove most coherence misses, the hybrid update/invalidate scheme causes lower traffic. By contrast, the compiler-based approach to cutting invalidation traffic is slightly more efficient than the adaptive hardware-based scheme. Moreover, the migrate-on-dirty heuristic is found to have devastating effects on the miss rate.


16.
The probability of a station failing to deliver packets before their deadlines, called the probability of dynamic failure, P_dyn, is an important measure for the communication subsystem of a distributed real-time system. Another closely-related performance measure is the ε-bounded delivery time, T_ε, which is defined as the least time needed to deliver a packet with probability greater than 1 − ε. Using P_dyn and T_ε, we comparatively evaluate four contention protocols often used in distributed real-time systems: (i) the token passing protocol and its priority-based variation (called the token scheduling protocol), and (ii) the p_i-persistent protocol and a priority-based variation thereof. The communication subsystem equipped with the different contention protocols is modeled first as embedded Markov chains. Then, we derive the probability distributions of access delay, from which P_dyn and T_ε can be calculated. The blocking probability, Q_i, can also be derived from the access delay distribution. These measures are derived first under the assumption of a single buffer at each station. The single-buffer model is then extended to the multiple-buffer case. The effects of buffer size on P_dyn, T_ε, and Q_i, and the performance improvement with multiple buffers, are analyzed over a wide range of network traffic. The work reported in this paper was supported in part by the ONR under Grant N00014-92-J-1080. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the ONR.
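Restated as formulas (a paraphrase of the definitions above, with D denoting the random delivery delay of a packet and d its deadline; this is not an excerpt from the paper):

P_{\mathrm{dyn}} = \Pr[D > d], \qquad
T_{\varepsilon} = \min\{\, t : \Pr[D \le t] > 1 - \varepsilon \,\}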

17.
Shared Virtual Memory (SVM) provides a low-cost and effective way to implement the shared-memory programming paradigm. SVMs involve a number of concepts that include consistency models/protocols, sharing patterns, false sharing, and fragmentation issues. The range of issues encountered in an SVM introduces a level of complexity and presents a challenge to many SVM researchers. This paper presents a careful study of SVM systems focusing on how workload characteristics affect the performance of consistency protocols. This knowledge is used to propose a novel consistency protocol that improves system performance. The paper pursues two main goals: (i) to illustrate how different SVM workload characteristics are interrelated, and (ii) to motivate the design of a new multiple-writer memory consistency protocol. To achieve the first goal, we provide a detailed workload characterization analysis and a discussion of how consistency models and protocols work. To achieve the second goal, we describe a software-based SVM protocol that achieves better performance than a hardware protocol proposed in the literature. In some workloads, the speedup obtained over the baseline protocol is more than 20%.
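For context, multiple-writer SVM protocols commonly rely on the twin/diff technique sketched below: snapshot a page before writing, then at a release ship only the bytes that changed, so concurrent writers to different parts of a page do not ping-pong it. This is the generic mechanism, given as background; it is not the specific protocol proposed in the paper.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// A diff is the list of (offset, new value) pairs that differ from the twin.
struct Diff {
    std::vector<std::pair<std::size_t, std::uint8_t>> changes;
};

Diff make_diff(const std::vector<std::uint8_t>& twin,
               const std::vector<std::uint8_t>& page) {
    Diff d;
    for (std::size_t i = 0; i < page.size(); ++i)
        if (page[i] != twin[i]) d.changes.push_back({i, page[i]});
    return d;
}

void apply_diff(std::vector<std::uint8_t>& page, const Diff& d) {
    for (const auto& [offset, value] : d.changes) page[offset] = value;
}

int main() {
    std::vector<std::uint8_t> home_copy(4096, 0);
    std::vector<std::uint8_t> twin = home_copy, local = home_copy;

    local[10] = 1;                 // this writer touches only bytes 10 and 11
    local[11] = 2;
    Diff d = make_diff(twin, local);

    apply_diff(home_copy, d);      // only 2 bytes travel, not the whole 4 KB page
    std::cout << "diff carries " << d.changes.size() << " changed byte(s)\n";
}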

18.
We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions to be made of the performance of the software on large numbers of processors of a given parallel system, by benchmarking the code on small numbers of processors only. After describing the methods used and emphasizing the simplicity of their implementation, we test the approach on a range of engineering software applications built upon multigrid algorithms. Despite their simplicity, the models are demonstrated to provide accurate and robust predictions across a range of different parallel architectures, partitioning strategies and multigrid codes. In particular, the effectiveness of the predictive methodology is shown for a practical engineering software implementation of an elastohydrodynamic lubrication solver.
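The methodology can be illustrated with a deliberately simple stand-in: assume a run-time model, calibrate its coefficients from timings on a few small processor counts, and extrapolate. The model form T(p) = a/p + b*log2(p) and the sample timings below are assumptions purely for the example; the paper derives its own model for multigrid codes.

#include <cmath>
#include <initializer_list>
#include <iostream>

int main() {
    // Measured run times (seconds) on 2 and 8 processors (hypothetical data).
    const double p1 = 2.0, t1 = 52.0;
    const double p2 = 8.0, t2 = 16.0;

    // Solve the 2x2 system  a/p + b*log2(p) = t  for the coefficients (a, b).
    const double x11 = 1.0 / p1, x12 = std::log2(p1);
    const double x21 = 1.0 / p2, x22 = std::log2(p2);
    const double det = x11 * x22 - x12 * x21;
    const double a = (t1 * x22 - x12 * t2) / det;
    const double b = (x11 * t2 - t1 * x21) / det;

    // Extrapolate to processor counts that were never benchmarked.
    for (int p : {16, 32, 64, 128})
        std::cout << "predicted T(" << p << ") = " << a / p + b * std::log2(p)
                  << " s\n";
}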

19.
20.
A discrete-event simulation model was developed to evaluate the performance of an automated material handling system (AMHS) for a wafer fab with a zone control scheme that avoids all vehicle collisions. The layout of this AMHS is a custom configuration, and the track option contains turntables, turnouts and high-speed express lanes. The interarrival behavior of all stockers, taken from the real data set, was analyzed to verify the assumptions of the simulation model. The results show that the underlying interarrival-time distributions of most stockers are exponential or Weibull. The simulation results show that the number of vehicles significantly affects the average delivery time and the average throughput. A simple one-factor response surface model is used to determine the appropriate number of vehicles. The study also investigated how to determine the number of vehicles in an automated guided vehicle-based intrabay material handling system.
