Similar Documents
20 similar documents found.
1.
We study the performance benefits of speculation in a release consistent software distributed shared memory system. We propose a new protocol, speculative home-based release consistency (SHRC), that speculatively updates data at remote nodes to reduce the latency of remote memory accesses. Our protocol employs a predictor that uses patterns in past accesses to shared memory to predict future accesses. We have implemented our protocol in a release consistent software distributed shared memory system that runs on commodity hardware. We evaluate our protocol implementation using eight software distributed shared memory benchmarks and show that it can result in significant performance improvements.
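The abstract specifies the predictor only at this level of detail. As a loose illustration of pattern-based access prediction, the C sketch below maps the last two shared-page accesses to a predicted next page; the table layout, hash function, and history depth are assumptions for illustration, not SHRC's actual predictor.

    #include <stdint.h>

    #define TABLE_SIZE 4096          /* assumed predictor table size */

    /* Each entry maps a history of two page IDs to a predicted next page. */
    struct pred_entry {
        uint32_t hist[2];            /* last two pages accessed */
        uint32_t next;               /* page predicted to follow them */
        int      valid;
    };

    static struct pred_entry table[TABLE_SIZE];
    static uint32_t hist[2];         /* rolling access history */

    static unsigned hash_hist(void) {
        return (hist[0] * 31u + hist[1]) % TABLE_SIZE;
    }

    /* Called on every shared-page access: record what followed the old
     * history, then look up a prediction for the new history.  Returns
     * the page to push speculatively, or UINT32_MAX if none. */
    uint32_t predict_and_learn(uint32_t page) {
        struct pred_entry *e = &table[hash_hist()];
        e->hist[0] = hist[0]; e->hist[1] = hist[1];
        e->next = page; e->valid = 1;

        hist[0] = hist[1]; hist[1] = page;
        struct pred_entry *p = &table[hash_hist()];
        if (p->valid && p->hist[0] == hist[0] && p->hist[1] == hist[1])
            return p->next;
        return UINT32_MAX;
    }

Because updates based on such a prediction are only speculative, a misprediction costs bandwidth but cannot violate release consistency.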

2.
We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, called cycle detection, is based on work by Shasha and Snir and checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information from synchronization analysis, which handles post-wait synchronization, barriers, and locks. We also make the analysis efficient by exploiting the common code image property of SPMD programs. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared memory style into a Split-C program, which has primitives for nonblocking memory operations and one-way communication. The optimizations include message pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data reuse. The performance improvements are as high as 20–35% for programs running on a CM-5 multiprocessor using the Split-C language as a global address layer. Even larger benefits can be expected on machines with higher communication latency relative to processor speed.
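The hazard that cycle detection guards against shows up already in the classic two-processor idiom analyzed by Shasha and Snir, sketched here with pthreads (the variable names are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    /* Shared flags; both start at 0. */
    int x = 0, y = 0;
    int rx, ry;

    /* P1 writes x then reads y; P2 writes y then reads x.  The four
     * interfering accesses form a cycle, so neither read may be hoisted
     * above its preceding write: if both were, both threads could
     * observe 0, an outcome impossible under sequential consistency. */
    void *p1(void *arg) { (void)arg; x = 1; ry = y; return NULL; }
    void *p2(void *arg) { (void)arg; y = 1; rx = x; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("rx=%d ry=%d\n", rx, ry);   /* under SC, at least one is 1 */
        return 0;
    }

Accesses that lie on no such cycle remain free to be reordered or pipelined, which is what the communication optimizations exploit.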

3.
A Software Data Prefetching Strategy Based on a Weak Consistency Model
窦勇, 周兴铭. 《软件学报》 (Journal of Software), 1997, 8(2): 81-86
To address the long latency of remote memory accesses in distributed shared memory systems, this paper proposes a memory-access optimization strategy based on the weak-ordering consistency model. The strategy exploits the information provided by synchronization operations in parallel programs to prefetch, in blocks at synchronization points, the data that will soon be used. This approach effectively hides the long latency of remote memory accesses.
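A minimal sketch of prefetching in blocks at synchronization points, against a hypothetical DSM interface (dsm_prefetch_block and dsm_barrier are assumed names, not the paper's API):

    #include <stddef.h>

    /* Hypothetical DSM runtime interface (assumed for illustration). */
    void dsm_prefetch_block(const void *addr, size_t nbytes);  /* non-blocking */
    void dsm_barrier(void);

    /* At a synchronization point, issue block prefetches for the data
     * the next phase will read, overlapping the transfers with the
     * barrier wait.  Under weak ordering, remote writes need only
     * become visible at synchronization points, so the synchronization
     * point is a natural place to fetch fresh copies. */
    void phase_boundary(const void *blocks[], size_t nbytes[], int nblocks)
    {
        for (int i = 0; i < nblocks; i++)
            dsm_prefetch_block(blocks[i], nbytes[i]);
        dsm_barrier();    /* transfers proceed while we wait */
    }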

4.
Cache-only memory access (COMA) multiprocessors support scalable coherent shared memory with a uniform memory access programming model. The local portion of shared memory associated with a processor is organized as a cache. This cache-based organization of memory results in long remote memory access latencies. Latency-hiding mechanisms can reduce effective remote memory access latency by making data present in a processor's local memory by the time the data are needed. In this paper we study the effectiveness of latency-hiding mechanisms on the KSR2 multiprocessor in improving the performance of three programs. The communication patterns of each program are analyzed and the mechanisms for latency hiding are applied. Results from a 52-processor system indicate that these mechanisms hide a significant portion of the latency of remote memory accesses. The results also quantify benefits in overall application performance. An earlier version of this paper was presented at the 1995 International Conference on Parallel Processing Techniques and Applications.

5.
Many researchers approach the problem of programming distributed memory machines by assuming a global shared name space. Thus the user views the distributed memory of the machine as though it were shared. A major issue that arises at this point is how to manage the memory. When a processor accesses data stored on another processor's memory, data must be moved between the two processors. Once these data are retrieved from another processor's memory, several interesting issues are raised. Where should these data be stored locally? What transformations must be performed to the code to guarantee that the nonlocal accesses reference the correct memory location? What optimizations can be performed to reduce the time spent in accessing the nonlocal data? In this paper we examine various data migration mechanisms that allow an explicit and controlled mapping of data to memory. We describe, experimentally evaluate, and model a set of schemes for storing and retrieving off-processor array elements. The schemes are all based on using hash tables for efficient access of nonlocal data. The three different techniques evaluated are the basic hashed cache, partial enumeration, and full enumeration, the details of which are described in the paper. In all three schemes, nonlocal data are stored in hash tables—the difference is in the amount of memory used by the schemes and the retrieval mechanisms for nonlocal data.
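A minimal C sketch of the basic hashed-cache variant: off-processor elements live in a direct-mapped local hash table keyed by global index (the table size, collision policy, and fetch_remote routine are assumptions; the paper's three schemes differ precisely in memory use and retrieval detail):

    #include <stdint.h>

    #define HASH_SIZE 8192                      /* assumed cache capacity */

    struct hslot { int64_t gindex; double value; int valid; };
    static struct hslot cache[HASH_SIZE];

    /* Hypothetical runtime call fetching an element from its owner. */
    extern double fetch_remote(int64_t gindex);

    /* Return element gindex of a distributed array; [lo, hi) is the
     * locally owned range.  A cache hit avoids the remote access. */
    double get_element(int64_t gindex, const double *local,
                       int64_t lo, int64_t hi)
    {
        if (gindex >= lo && gindex < hi)        /* locally owned */
            return local[gindex - lo];

        struct hslot *s = &cache[(uint64_t)gindex % HASH_SIZE];
        if (s->valid && s->gindex == gindex)    /* cached nonlocal element */
            return s->value;

        double v = fetch_remote(gindex);        /* remote memory access */
        s->gindex = gindex; s->value = v; s->valid = 1;
        return v;
    }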

6.
Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed memory parallel computers.

7.
In distributed shared memory multiprocessors, remote memory references generate processor-to-memory traffic, which may result in a bottleneck. It is therefore important to design algorithms that minimize the number of remote memory references. We establish a lower bound of three on remote reference time complexity for mutual exclusion algorithms in a model where processes communicate by means of a general read-modify-write primitive that accesses at most one shared variable in one instruction. Since the general read-modify-write primitive is a generalization of a variety of atomic primitives that have been implemented in multiprocessor systems, our lower bound holds for all mutual exclusion algorithms that use such primitives. Furthermore, this lower bound is shown to be tight by presenting an algorithm with the matching upper bound.
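For contrast with the paper's optimal algorithm, a naive spin lock built from one read-modify-write primitive (C11 atomic_exchange on a single shared variable) shows why bounding remote references matters:

    #include <stdatomic.h>

    /* One shared variable; each atomic_exchange is a single
     * read-modify-write instruction on it. */
    static atomic_int lock_word = 0;

    void lock(void)
    {
        /* Every failed exchange is another remote memory reference on a
         * distributed shared memory machine, so the remote-reference
         * count of this entry section is unbounded under contention;
         * algorithms meeting the three-reference bound avoid this by
         * spinning on locally held variables instead. */
        while (atomic_exchange(&lock_word, 1) == 1)
            ;  /* spin */
    }

    void unlock(void)
    {
        atomic_store(&lock_word, 0);  /* one further remote reference */
    }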

8.
One of the critical goals in code optimization for multi-processor-system-on-a-chip (MPSoC) architectures is to minimize the number of off-chip memory accesses. This is because such accesses can be extremely costly from both performance and power angles. While conventional data locality optimization techniques can be used for improving data access pattern of each processor independently, such techniques usually do not consider locality for shared data. This paper proposes a strategy that reduces the number of off-chip references due to shared data. It achieves this goal by restructuring a parallelized application code in such a fashion that a given data block is accessed by parallel processors within the same time frame, so that its reuse is maximized while it is in the on-chip memory space. This tends to minimize the number of off-chip references since the accesses to a given data block are clustered within a short period of time during execution. Our approach employs a polyhedral tool that helps us isolate computations that manipulate a given data block. In order to test the effectiveness of our approach, we implemented it using a publicly-available compiler infrastructure and conducted experiments with twelve data-intensive embedded applications. Our results show that optimizing data locality for shared data elements is very useful in practice.
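The restructuring's effect can be shown with a small C/OpenMP sketch (block size, kernel, and the hand-written parallelization are assumptions; the paper derives such transformations with a polyhedral tool rather than by hand):

    #include <omp.h>

    #define N 4096
    #define B 256    /* assumed block size that fits in on-chip memory */

    /* Illustrative per-processor kernel. */
    static double kernel(int t, double x) { return x * (t + 1); }

    /* Each of T processors applies its own kernel to the whole shared
     * array.  Left unblocked, every processor streams the array on its
     * own schedule and a given block may be fetched off-chip up to T
     * times.  Making the block loop outermost clusters all processors'
     * accesses to block b into the same time frame, so the block is
     * fetched once and reused while it is on-chip. */
    void process_blocked(const double *shared, double **out, int T)
    {
        for (int b = 0; b < N; b += B) {
            #pragma omp parallel for
            for (int t = 0; t < T; t++)
                for (int i = b; i < b + B; i++)
                    out[t][i] = kernel(t, shared[i]);
        }
    }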

9.
Performance needs of many database applications dictate that the entire database be stored in main memory. The dali system is a main memory storage manager designed to provide the persistence, availability and safety guarantees one typically expects from a disk-resident database, while at the same time providing very high performance by virtue of being tuned to support in-memory data. User processes map the entire database into their address space and access data directly, thus avoiding expensive remote procedure calls and buffer manager interactions typical of accesses in disk-resident commercial systems available today. dali recovers the database to a consistent state in the case of system as well as process failures. It also provides unique concurrency control and memory protection features, as well as ordered and unordered index structures. Both object-oriented and relational database management systems have been implemented on top of dali. dali provides access to multiple layers of application programming interface, including its low-level recovery, concurrency control and indexing components as well as its high-level relational component. Finally, various features of dali can be tailored to the needs of an application to achieve high performance; for example, concurrency control and logging can be turned off if not desired, enabling dali to efficiently support applications that require non-persistent memory-resident data to be shared by multiple processes.
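The direct-access path that dali is built around can be illustrated with plain POSIX calls: the database image is mapped into the process address space and records are touched through ordinary pointers, with no buffer-manager indirection (the file name and record layout are placeholders, not dali's format):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct record { long key; double value; };   /* placeholder layout */

    int main(void)
    {
        int fd = open("db.img", O_RDWR);         /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);

        /* Map the whole database: reads and writes become ordinary
         * memory accesses rather than buffer-manager calls. */
        struct record *db = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
        if (db == MAP_FAILED) { perror("mmap"); return 1; }

        if ((size_t)st.st_size >= sizeof *db)
            printf("first key: %ld\n", db[0].key);  /* direct access */

        munmap(db, st.st_size);
        close(fd);
        return 0;
    }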

10.
High-level parallel programming models supporting dynamic fine-grained threads in a global object space are becoming increasingly popular for expressing irregular applications based on sophisticated adaptive algorithms and pointer-based data structures. However, implementing these multithreaded computations on scalable parallel machines poses significant challenges, particularly with respect to object caching. Object caching techniques must be able to tolerate unresponsive processors and protocol handler occupancy delays. This paper examines whether these challenges can be offset by leveraging responsive general-purpose communication architectural features (such as remote memory access and atomic operations), possibly compensating for the lack of more sophisticated hardware primitives by relying upon increased involvement of the run-time system and the compiler. A detailed performance analysis of four irregular applications, using the Illinois Concert System on the Cray T3D and the SGI Origin 2000, finds that existing software distributed shared memory (DSM) systems are capable of delivering good performance only in the presence of a high level of responsive communication architecture support (specifically, support for remote atomic operations). Recognizing that this situation stems from the synchronous request–reply nature of DSM protocols, we present a composable object caching framework, called view caching, which exploits knowledge of application data access semantics to construct custom protocols that require reduced processor synchronization. View caching protocols are more tolerant to responsiveness and occupancy delays and are able to exploit even lower level responsive communication primitives (such as nonatomic remote memory accesses) for a performance benefit.

11.
In a Multi-Processor System-on-a-Chip (MPSoC) based on Network-on-Chip (NoC), which processes massive data in a distributed fashion, communication is concentrated on shared memory. This paper proposes an assignment algorithm that can minimize the total power consumption for data communication in executing application programs and a switch structure that can reduce communication congestion resulting from simultaneous accesses to the shared memory. The proposed assignment algorithm gives higher priority to the tasks transferring a larger amount of data to shared memory, so that these tasks can be assigned to the PEs close to shared memory. The proposed switch structure was designed to support multi-port memory, which is often used for shared memory. The ports of the proposed switch are dedicated to be connected with in/out ports of shared memory in order to increase communication bandwidth between PEs and shared memories. By adopting the proposed scheme, the congestion caused by the concentrated requests to the memory can be reduced. Experimental results show that power consumption for transferring data in High-Definition (HD) H.264 decoder, Motion-JPEG decoder, MP3 decoder and 2D Wavelet transform codes has been reduced by 23.9% on the average, when compared with the cases of applying the well-known FC, BN and SA algorithms. The area has been slightly increased by 1.7% compared to conventional NoC structures.
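A greatly simplified sketch of the priority rule in C: tasks are sorted by the volume of data they exchange with shared memory and greedily paired with the PEs nearest the memory port (the hop-distance metric and the one-task-per-PE pairing are assumptions, not the paper's full algorithm):

    #include <stdlib.h>

    struct task { int id; long mem_bytes; };  /* data moved to/from shared memory */
    struct pe   { int id; int hops; };        /* NoC hop distance to memory port */

    static int by_bytes_desc(const void *a, const void *b) {
        long d = ((const struct task *)b)->mem_bytes
               - ((const struct task *)a)->mem_bytes;
        return (d > 0) - (d < 0);
    }
    static int by_hops_asc(const void *a, const void *b) {
        return ((const struct pe *)a)->hops - ((const struct pe *)b)->hops;
    }

    /* Greedy sketch: the heaviest communicators get the PEs closest to
     * shared memory, shortening their paths and hence the energy spent
     * per transfer.  assign[task id] receives the chosen PE id. */
    void assign_tasks(struct task *tasks, struct pe *pes, int n, int *assign)
    {
        qsort(tasks, n, sizeof *tasks, by_bytes_desc);
        qsort(pes,   n, sizeof *pes,   by_hops_asc);
        for (int i = 0; i < n; i++)
            assign[tasks[i].id] = pes[i].id;
    }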

12.
Distributed Shared-Memory (DSM) systems are shared-memory multiprocessor architectures in which each processor node contains a partition of the shared memory. In hybrid DSM systems coherence among caches is maintained by a software-implemented coherence protocol relying on some hardware support. Hardware support is provided to satisfy every node hit (the common case) and software is invoked only for accesses to remote nodes. In this paper we compare the design and performance of four hybrid distributed shared memory (DSM) organizations by detailed simulation of the same hardware platform. We have implemented the software protocol handlers for the four architectures. The handlers are written in C and assembly code. Coherence transactions are executed in trap and interrupt handlers. Together with the application, the handlers are executed in full detail in execution-driven simulations of six complete benchmarks with coarse-grain and fine-grain sharing. We relate our experience implementing and simulating the software protocols for the four architectures. Because the overhead of remote accesses is very high in hybrid systems, the system of choice is different from that for purely hardware systems.

13.
In recent years, cluster computing has been widely investigated and there is no doubt that it can provide a cost-effective computing infrastructure by aggregating computational power, communication, and storage resources. Moreover, it is also considered to be a very attractive platform for low-cost supercomputing. Distributed shared memory (DSM) systems utilize the physical memory of each computing node interconnected in a private network to form a global virtual shared memory. Since this global shared memory is distributed among the computing nodes, accessing data located in remote computing nodes is an absolute necessity. However, this results in significant remote memory access latencies, which are major sources of overhead in DSM systems. For these reasons, a number of strategies have been devised to increase overall system performance and decrease this overhead. Prefetching is one such approach which can reduce latencies, although it always increases the workload of the home nodes. In this paper, we propose a scheme named Agent Home Scheme. Its most noticeable feature, when compared to other schemes, is that the agent home distributes the workload of each computing node when sending data. By doing this, we can reduce not only the workload of the home nodes, by balancing the workload across the nodes, but also the waiting time. Experimental results show that the proposed method can obtain about 20% higher performance than the original JIAJIA, about 18% more than the History Prefetching Strategy (HPS), and about 10% higher than the Effective Prefetch Strategy (EPS).

14.
In a distributed system, data servers (file systems and databases) can easily become bottlenecks. We propose an approach to offloading data access requests from overloaded data servers to nodes that are idle or less busy. This approach is referred to as remote caching, and the idle or less busy nodes are called mutual servers as they help out the busy server nodes on data accesses. In addition to server and client local caches, frequently accessed data are cached in the main memory of mutual servers, thus improving the data access time in the system. We evaluate several data propagation strategies among data servers and mutual servers. These include policies in which senders are active/passive and receivers are active/passive in initiating data propagation. For example, an active sender takes the initiative to offload data onto a passive receiver. Simulation results show that the active-sender/passive-receiver policy is the method of choice in most cases. Active-sender policies are best able to exploit the main memory of other idle nodes in the expected normal condition where some nodes are overloaded and others are less loaded. All active policies perform far better than the policy without remote caching even in the degenerate case where each node is equally loaded.

15.
The performance and energy efficiency of current systems is influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is the introduction of different memory access times, depending on the core that requested the transaction, and which cache or main memory bank responded to it. In this context, the locality of the memory accesses plays a key role for the performance and energy efficiency of parallel applications. Accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping). In this paper, we present LAPT, a hardware-based mechanism to store the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform an optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC Benchmark Suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively (6.7% and 5.3% on average).
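On Linux, the data-mapping half of such a scheme corresponds to the libnuma move_pages system call; a minimal sketch, assuming the observed access pattern has already chosen a target node for each page (LAPT itself gathers that pattern in hardware through the page table):

    #define _GNU_SOURCE
    #include <numaif.h>    /* move_pages; link with -lnuma */
    #include <stdio.h>

    /* Migrate count pages to the NUMA nodes that access them most.
     * pages holds page-aligned addresses; nodes the target per page. */
    int map_pages_to_nodes(void **pages, int *nodes, unsigned long count)
    {
        int status[count];                     /* per-page result or -errno */
        long rc = move_pages(0 /* this process */, count, pages,
                             nodes, status, MPOL_MF_MOVE);
        if (rc < 0) { perror("move_pages"); return -1; }
        for (unsigned long i = 0; i < count; i++)
            if (status[i] < 0)
                fprintf(stderr, "page %lu not moved: %d\n", i, status[i]);
        return 0;
    }

The thread-mapping half would similarly pin communicating threads to the same node, for example with pthread_setaffinity_np.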

16.
Parallel applications can be executed using the idle computing capacity of workstation clusters. However, it remains unclear how to schedule the processors among different applications most effectively. Processor scheduling algorithms that were successful for shared-memory machines have proven to be inadequate for distributed memory environments due to the high costs of remote memory accesses and redistributing data. We investigate how knowledge of system load and application characteristics can be used in scheduling decisions. We propose a new algorithm based on adaptive equipartitioning, which, by properly exploiting both the information types above, performs better than other nonpreemptive scheduling rules, and nearly as well as idealized versions of preemptive rules (with free preemption). We conclude that the new algorithm is suitable for use in scheduling parallel applications on networks of workstations.

17.
Due to advances in fiber optics and VLSI technology, interconnection networks that allow simultaneous broadcasts are becoming feasible. Distributed shared memory (DSM) implementations on such networks promise high performance even for small applications with small granularity. This paper, after summarizing the architecture of one such implementation called the Simultaneous Multiprocessor Optical Exchange Bus (SOME-Bus), presents simple algorithms for improving the performance of parallel programs running on the SOME-Bus multiprocessor implementing cache-coherent DSM. The algorithms are based on run-time data redistribution via a dynamic page migration protocol. They use memory access references together with information on the average channel utilization, average channel waiting time, number of messages in the channel queue, or short-term average channel waiting time reported by each node and gathered by hardware monitors to make correct decisions related to the placement of shared data. Simulations with four parallel codes on a 64-processor SOME-Bus show that the algorithms yield significant performance improvements such as reductions in the execution times, number of remote memory accesses, average channel waiting times, and average network latencies, and increases in average channel utilizations.

18.
We present a novel and portable threads-based system for concurrent applications on shared- and distributed-memory environments. The Ariadne system provides stateful user-space threads that can be very effective in medium- to coarse-grained applications. The interface is the same for uniprocessors and multiprocessors. Sequential programs are readily converted into parallel programs for shared or distributed memory, with low development effort. Ariadne supports the development of customized schedulers, and offers a thread migration capability in distributed environments. Scheduling of computations at the threads level enables both task- and data-driven executions. Thread migration is a useful feature which turns remote memory accesses into local accesses, enables load-balancing and simplifies program development. Ariadne employs a unique runtime stack rewriting mechanism to migrate threads between homogeneous processors. Ariadne currently runs on the SPARC (SunOS 4.x, SunOS 5.x), Sequent Symmetry, Intel Paragon, Silicon Graphics IRIX and IBM RS/6000 environments. We present some examples of Ariadne programs, along with performance measurements. © 1998 John Wiley & Sons, Ltd.

19.
This paper considers the problem of electing an eventual leader in an asynchronous shared memory system. While this problem has received a lot of attention in message-passing systems, very few solutions have been proposed for shared memory systems. As an eventual leader cannot be elected in a pure asynchronous system prone to process crashes, the paper first proposes to enrich the asynchronous system model with an additional assumption. That assumption (denoted AWB) is particularly weak. It is made up of two complementary parts. More precisely, it requires that, after some time, (1) there is a process whose write accesses to some shared variables be timely, and (2) the timers of (t−f) other processes be asymptotically well-behaved (t denotes the maximal number of processes that may crash, and f the actual number of process crashes in a run). The asymptotically well-behaved timer notion is a new notion that generalizes and weakens the traditional notion of timers whose durations are required to monotonically increase when the values they are set to increase (a timer works incorrectly when it expires at arbitrary times, i.e., independently of the value it has been set to). The paper then focuses on the design of t-resilient AWB-based eventual leader protocols. “t-resilient” means that each protocol can cope with up to t process crashes (taking t=n−1 provides wait-free protocols, i.e., protocols that can cope with any number of process failures). Two protocols are presented. The first enjoys the following noteworthy properties: after some time only the elected leader has to write the shared memory, and all but one of the shared variables have a bounded domain, be the execution finite or infinite. This protocol is consequently optimal with respect to the number of processes that have to write the shared memory. The second protocol guarantees that all the shared variables have a bounded domain. This is obtained at the following additional price: t+1 processes are required to forever write the shared memory. A theorem is proved which states that this price has to be paid by any protocol that elects an eventual leader in a bounded shared memory model. This second protocol is consequently optimal with respect to the number of processes that have to write in such a constrained memory model. In a very interesting way, these protocols show an inherent tradeoff relating the number of processes that have to write the shared memory and the bounded/unbounded attribute of that memory.
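A highly simplified C11 sketch of the heartbeat principle underneath such protocols: every live process advances a shared counter, and a reader elects the smallest-id process whose counter still moves between two timer expirations (this deliberately ignores the bounded-variable constructions and the exact AWB timer assumption that are the paper's contribution):

    #include <stdatomic.h>
    #include <limits.h>

    #define NPROC 8

    /* Shared heartbeat counters, one per process. */
    atomic_ulong heartbeat[NPROC];

    /* Writer side: each live process keeps its counter advancing;
     * AWB guarantees some process's writes are eventually timely. */
    void beat(int self) { atomic_fetch_add(&heartbeat[self], 1); }

    /* Reader side, called once per timer expiration with the counter
     * values observed last time: elect the smallest-id process that
     * has made progress since then. */
    int leader(unsigned long prev[NPROC])
    {
        int ldr = INT_MAX;      /* INT_MAX: no candidate observed yet */
        for (int i = 0; i < NPROC; i++) {
            unsigned long h = atomic_load(&heartbeat[i]);
            if (h != prev[i] && i < ldr)
                ldr = i;
            prev[i] = h;
        }
        return ldr;
    }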

20.
Many computation-intensive problems from science and engineering are irregular in nature. This makes it difficult to develop an efficient parallel implementation, even for shared-memory machines. As a typical example, we investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the issue of what programming and system support is needed to yield an efficient implementation for a large number of processors. As an execution platform we use the SB-PRAM, a shared memory machine with up to 2048 processors. The processors of the SB-PRAM can access the global memory in unit time, which is the basis for an exact performance prediction. Common approaches for parallel implementations like lock protection for concurrent accesses and sequential or distributed task queues are replaced by more efficient access mechanisms and data structures that can be realized by the powerful multiprefix operations of the SB-PRAM. Their use simplifies the implementation and yields large speedup values.
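On conventional hardware the multiprefix idiom survives as atomic fetch-and-add, losing only the unit-time guarantee; a C11 sketch of the lock-free slot claiming that replaces a lock-protected task queue:

    #include <stdatomic.h>
    #include <stddef.h>

    #define QCAP 65536

    struct task { void (*fn)(void *); void *arg; };
    static struct task queue[QCAP];
    static atomic_size_t tail;

    /* Every caller receives a distinct slot index: on the SB-PRAM a
     * multiprefix add hands all simultaneous callers consecutive
     * indices in unit time, while C11 fetch-add gives the same result
     * serialized by the memory system.  No lock protects the queue. */
    int enqueue(struct task t)
    {
        size_t slot = atomic_fetch_add(&tail, 1);
        if (slot >= QCAP) return -1;   /* queue full */
        queue[slot] = t;
        return 0;
    }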
