Similar Documents
20 similar documents found.
1.
In large-scale distributed shared-memory multiprocessor systems, reducing remote access latency as much as possible is the key to improving overall system performance. This paper proposes a router cache structure and describes in detail a coherence protocol based on the router cache. The protocol is effective in reducing remote access latency and increasing the system's effective bandwidth.

2.
A large scale, cache-based multiprocessor that is interconnected by a hierarchical network such as hierarchical buses or a multistage interconnection network (MIN) is considered. An adaptive cache coherence scheme for the system is proposed based on a hardware approach that handles multiple shared reads efficiently. The new protocol allows multiple copies of a shared data block in the hierarchical network, but minimizes the cache coherence overhead by dynamically partitioning the network into sharing and nonsharing regions based on program behavior. The new cache coherence scheme effectively utilizes the bandwidth of the hierarchical networks and exploits the locality properties of parallel algorithms. Simulation experiments have been carried out to analyze the performance of the new protocol. The simulation results show that the new protocol gives 15% to 30% performance improvement over some existing cache coherence schemes on similar systems for a wide range of workload parameters.
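As a rough illustration of the dynamic sharing-region idea described above, the sketch below models a per-block directory entry that tracks the smallest subtree of the hierarchy covering all current sharers: reads widen the region, writes collapse it back to the writer. Everything here (a binary hierarchy, the Region/noteRead/noteWrite names) is an assumption for illustration, not the paper's design.

```cpp
// Hypothetical model of a sharing region: a subtree of the hierarchy,
// identified by its level and the subtree root's id at that level.
struct Region {
    int level  = 0;   // 0 = a single node; higher = a wider subtree
    int rootId = 0;   // which subtree at that level
};

class BlockState {
    Region sharing_;  // current sharing region for this block
public:
    // A read may widen the sharing region until it covers the reader;
    // copies inside the region can be shared cheaply.
    void noteRead(int node) {
        while (!covers(sharing_, node)) {
            sharing_.rootId >>= 1;   // assume a binary hierarchy for the sketch
            ++sharing_.level;
        }
    }
    // A write collapses the region to the writer: copies outside the new
    // region are invalidated, keeping coherence traffic local.
    void noteWrite(int node) { sharing_ = Region{0, node}; }
private:
    static bool covers(const Region& r, int node) {
        return (node >> r.level) == r.rootId;
    }
};
```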

3.
Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollably high variability in the checkpoint interval.
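A minimal sketch of the mechanism these schemes share, a replacement policy that keeps main memory a consistent recovery point, might look as follows (a toy cache model with illustrative names, not any of the three evaluated schemes):

```cpp
// Sketch: a dirty line modified since the last checkpoint cannot simply be
// written back, because memory must always hold a consistent checkpoint
// state; evicting such a line forces a new checkpoint first.
#include <cstddef>
#include <vector>

struct Line { bool dirty = false; bool modifiedSinceCkpt = false; };

class CheckpointingCache {
    std::vector<Line> lines_;
public:
    explicit CheckpointingCache(std::size_t n) : lines_(n) {}

    void write(std::size_t i) { lines_[i].dirty = lines_[i].modifiedSinceCkpt = true; }

    void evict(std::size_t i) {
        if (lines_[i].dirty && lines_[i].modifiedSinceCkpt)
            takeCheckpoint();        // memory stays a consistent recovery point
        writeBack(i);
        lines_[i] = Line{};
    }

    // Flush all dirty lines and mark the new recovery point; on a transient
    // error the system rolls back to the state memory held at this instant.
    void takeCheckpoint() {
        for (std::size_t i = 0; i < lines_.size(); ++i)
            if (lines_[i].dirty) { writeBack(i); lines_[i].dirty = false; }
        for (auto& l : lines_) l.modifiedSinceCkpt = false;
    }
private:
    void writeBack(std::size_t) { /* model: copy the line to memory */ }
};
```

The sketch also hints at the variability result quoted above: any eviction of a line modified since the last checkpoint can force a new checkpoint, so the interval is driven by the replacement stream rather than by the program.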

4.
Cache-only memory access (COMA) multiprocessors support scalable coherent shared memory with a uniform memory access programming model. The local portion of shared memory associated with a processor is organized as a cache. This cache-based organization of memory results in long remote memory access latencies. Latency-hiding mechanisms can reduce effective remote memory access latency by making data present in a processor's local memory by the time the data are needed. In this paper we study the effectiveness of latency-hiding mechanisms on the KSR2 multiprocessor in improving the performance of three programs. The communication patterns of each program are analyzed and the mechanisms for latency hiding are applied. Results from a 52-processor system indicate that these mechanisms hide a significant portion of the latency of remote memory accesses. The results also quantify benefits in overall application performance. An earlier version of this paper was presented at the 1995 International Conference on Parallel Processing Techniques and Applications.

5.
Virtual-memory-based cache coherence is a mechanism that relies only on hardware that already exists on the microprocessors of a shared-memory multiprocessor system, yet dynamically detects and resolves potential cache inconsistencies using virtual-memory techniques. The key feature of the approach is that the virtual-memory translation hardware on each processor is used to detect shared accesses that could lead to memory incoherencies, and VM page fault handlers execute the appropriate actions to maintain cache coherence. VM-based cache coherence basically trades off design simplicity for increased software overheads. The work presented in this paper evaluates this trade-off. We show that VM-based cache coherence performs well for scientific applications that require significant aggregate memory bandwidth.
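The detection mechanism lends itself to a small POSIX sketch: shared pages are mapped with restricted protection, and the page fault handler performs the coherence action before upgrading the page so the faulting access can retry. This is a minimal illustration (error handling and signal-safety details omitted), not the paper's implementation:

```cpp
#include <cstdint>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

static void coherence_fault(int, siginfo_t* si, void*) {
    // Round the faulting address down to its page.
    void* page = (void*)((uintptr_t)si->si_addr & ~(uintptr_t)(getpagesize() - 1));
    // ...fetch an up-to-date copy / invalidate remote copies here...
    mprotect(page, getpagesize(), PROT_READ | PROT_WRITE);  // allow the retry
}

int main() {
    struct sigaction sa{};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = coherence_fault;
    sigaction(SIGSEGV, &sa, nullptr);

    std::size_t pg = getpagesize();
    char* shared = (char*)mmap(nullptr, pg, PROT_READ,      // writes will fault
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    shared[0] = 1;  // faults, handler runs the coherence action, retry succeeds
    return 0;
}
```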

6.
We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, called cycle detection, is based on work by Shasha and Snir and checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information from synchronization analysis, which handles post-wait synchronization, barriers, and locks. We also make the analysis efficient by exploiting the common code image property of SPMD programs. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared memory style into a Split-C program, which has primitives for nonblocking memory operations and one-way communication. The optimizations include message pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data reuse. The performance improvements are as high as 20-35% for programs running on a CM-5 multiprocessor using the Split-C language as a global address layer. Even larger benefits can be expected on machines with higher communication latency relative to processor speed.
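The cycle-detection core can be pictured as a reachability check on a graph of shared accesses: program-order edges on each processor plus conflict edges between interfering accesses on different processors; a reordering is unsafe if the edges it would relax lie on a cycle. A minimal, illustrative version of the check (not the paper's exact formulation):

```cpp
#include <vector>

// adj[u] lists v for every edge u -> v (program order or conflict).
static bool dfsHasCycle(int u, const std::vector<std::vector<int>>& adj,
                        std::vector<int>& state) {        // 0=new 1=open 2=done
    state[u] = 1;
    for (int v : adj[u]) {
        if (state[v] == 1) return true;                   // back edge: a cycle
        if (state[v] == 0 && dfsHasCycle(v, adj, state)) return true;
    }
    state[u] = 2;
    return false;
}

bool hasCycle(const std::vector<std::vector<int>>& adj) {
    std::vector<int> state(adj.size(), 0);
    for (int u = 0; u < (int)adj.size(); ++u)
        if (state[u] == 0 && dfsHasCycle(u, adj, state)) return true;
    return false;
}
```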

7.
Aquarius-II is a cache coherent multiprocessor system designed for the parallel execution of Prolog programs. It contains two tiers of memory: synchronization memory and high bandwidth (HB) memory. The synchronization memory consists of snooping caches connected to a bus and is used to store rendezvous points, synchronization bits, synchronization variables such as locks and semaphores, and most of the write-shared data. The HB memory is used to store the bulk of the application program code and data. It contains caches and an inexpensive VLSI chip based crossbar interconnection network to memory. The caches connected to the crossbar do not have full snooping capability. The architecture is evaluated by a full simulation of parallel execution of Prolog programs on Aquarius-II. The design details of the components of the architecture and simulation results are presented. Simulation results indicate that the two-tier memory system significantly reduces memory interference and speeds up synchronization when compared to a single-bus multiprocessor. This shared memory multiprocessor architecture has the potential to support other parallel programming paradigms.

8.
Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit this integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture and adds a small shared data cache to each node of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by keeping the second- and third-level directories out of the processor chip and using compressed data structures. A taxonomy of the L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that significant latency reductions can be obtained by using the proposed node architecture, which translates into reductions of more than 30 percent in application execution time in several cases.

9.
Concurrent C/C++ is a superset of C and C++ that provides parallel programming facilities based on message passing. Upon porting Concurrent C/C++ to a shared memory multiprocessor, the authors believed it would be appropriate to supplement Concurrent C/C++ with explicit facilities for synchronizing accesses to shared data structures. The capsule, which is a shared memory access mechanism designed especially for Concurrent C/C++ to match the C++ data abstraction facility called the class, is discussed. Capsules are like monitors but they have significant advantages. Capsules satisfy T. Bloom's (1979) criteria for expressiveness of synchronization conditions, support inheritance, allow operations to execute in parallel, and permit them to time out. The design of capsules is reviewed. The author evaluates existing shared memory mechanisms, describes capsules, gives examples of capsules, compares capsules with monitors, and discusses how capsules are implemented by the Concurrent C compiler.
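Since capsules are described as monitor-like, a rough modern-C++ analogue conveys the flavor. The sketch below shows only monitor-style mutual exclusion plus the timed operations the abstract mentions; capsule inheritance and intra-capsule parallelism are omitted, and the names are illustrative:

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>

template <typename T>
class BoundedBuffer {                 // one shared data structure, one "capsule"
    std::mutex m_;
    std::condition_variable notEmpty_;
    std::deque<T> q_;
public:
    void put(T v) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push_back(std::move(v));
        notEmpty_.notify_one();
    }
    // A timed operation, echoing the capsules' ability to time out.
    std::optional<T> get(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(m_);
        if (!notEmpty_.wait_for(lk, timeout, [this] { return !q_.empty(); }))
            return std::nullopt;      // timed out, as a capsule operation may
        T v = std::move(q_.front());
        q_.pop_front();
        return v;
    }
};
```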

10.
A Software Data Prefetching Strategy Based on the Weak Consistency Model
Dou Yong, Zhou Xingming. Journal of Software (《软件学报》), 1997, 8(2):81-86
Addressing the long latency of remote accesses in distributed shared memory, this paper proposes a memory access optimization strategy based on the weak-ordering consistency model. The main idea is to exploit the information provided by synchronization operations in parallel programs and to prefetch, in blocks at synchronization points, the data that is about to be used. This method effectively hides the long latency of remote memory accesses.
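A minimal sketch of prefetching in blocks at a synchronization point: right after the barrier that makes remote data visible, prefetches are issued for the whole block before the compute loop touches it. The barrier_wait() primitive, the data layout, and GCC/Clang's __builtin_prefetch are illustrative stand-ins, not the paper's mechanism:

```cpp
#include <cstddef>

extern void barrier_wait();                 // assumed synchronization primitive
constexpr std::size_t kLine = 64;           // assumed cache line size

void consume(const double* remote, std::size_t n, double* out) {
    barrier_wait();                         // remote data now guaranteed valid
    for (std::size_t i = 0; i < n * sizeof(double); i += kLine)
        __builtin_prefetch((const char*)remote + i, /*rw=*/0, /*locality=*/1);
    for (std::size_t i = 0; i < n; ++i)     // remote latency largely overlapped
        out[i] = remote[i] * 2.0;
}
```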

11.
This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor instruction overheads, cache miss ratios, and memory system bandwidth requirements, and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied by using a parallelizing compiler and are evaluated via execution-driven simulations of large, optimized, numerical application codes with loop-level and vector parallelism.
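The hybrid scheme's central decision can be sketched abstractly: where the consumer of a datum is known at the end of a parallel loop, the producer forwards it (pushes it toward the consumer's cache); otherwise the consumer falls back on prefetching before its own loop. forward_to() and the consumerOf map are assumptions of this sketch, standing in for compiler analysis and hardware/runtime primitives:

```cpp
#include <cstddef>

extern void forward_to(int consumer, const void* addr, std::size_t bytes);

void endOfParallelLoop(const double* produced, std::size_t n,
                       const int* consumerOf /* -1 if unknown */) {
    for (std::size_t i = 0; i < n; ++i) {
        if (consumerOf[i] >= 0)   // consumer known: push the data to it
            forward_to(consumerOf[i], &produced[i], sizeof produced[i]);
        // else: the consumer will prefetch &produced[i] before its next loop
    }
}
```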

12.
A Framework of Memory Consistency Models

13.
Multiprocessors in which a shared bus is used by the processors to communicate with common memory are an emerging class of machines on which there is a need to support parallel programming languages. A language construct found in a number of parallel programming languages to support synchronization and communication is the interprocess rendezvous. Shared-bus multiprocessors require a protocol to keep the data in their caches coherent. There are two major categories of these protocols: invalidation and write-broadcast. This paper examines the requirements for cache coherence protocols to support efficient interprocessor rendezvous. The approach taken is to examine the memory referencing patterns to the run-time data structures during rendezvous execution. The appropriate coherence protocol is shown to be a function of the processor scheduling strategy used by the run-time system at synchronization points during the rendezvous. When processes migrate freely as a result of the scheduling strategy, invalidation protocols are found to be more efficient. When migration is restricted by the scheduler, write-broadcast protocols are more efficient.

14.
The success of large-scale, hierarchical and distributed shared memory systems hinges on our ability to reduce delays resulting from remote accesses to shared data. To facilitate this, we present a compile-time algorithm for analyzing programs with doall-style parallelism to determine when read and write accesses to shared data are redundant (unnecessary). Once identified, redundant remote accesses can be replaced by local accesses or eliminated entirely. This optimization improves program performance in two ways. First, slow memory accesses are replaced by faster ones. Second, the time to perform other remote memory accesses may be reduced as a result of the decreased traffic level. We also show how the information obtained through redundancy analysis can be used for other compiler optimizations such as prefetching and cache management.
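The effect of redundancy elimination is easy to show on a toy loop: the repeated remote read is redundant because no intervening write can change it, so it becomes a single hoisted local access. Illustrative code, not from the paper:

```cpp
// `shared` stands for data living in remote shared memory.
void before(const double* shared, double* local, int n) {
    for (int i = 0; i < n; ++i)
        local[i] = shared[0] * i + shared[0];   // two remote reads per iteration
}

void after(const double* shared, double* local, int n) {
    const double s = shared[0];                 // one remote read, hoisted
    for (int i = 0; i < n; ++i)
        local[i] = s * i + s;                   // redundant accesses now local
}
```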

15.
The two most common approaches to managing shared-access memory, free lists and buddy systems, have significant drawbacks. Free list algorithms have poor memory access characteristics, and buddy systems utilize their space inefficiently. In this paper, we present an alternative approach to parallel-access memory management based on the fast-fits algorithm. A fast-fits memory manager stores free blocks in a tree structure, providing fast access and efficient space use. Since the fast-fits algorithm accesses fewer blocks than a free list algorithm, it reduces the amount of cache invalidation overhead due to the memory manager. Our performance experiments show that the parallel-access fast-fits memory manager allows significantly greater access rates than a serial-access fast-fits memory manager does. We note that shared-memory multiprocessor systems need efficient dynamic storage allocators, both for system purposes and to support parallel programs.
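A minimal model of the fast-fits idea, with std::multimap standing in for the paper's tree: an allocation finds a fitting block in O(log n) tree accesses instead of walking a free list. A real fast-fits allocator uses its own structure and supports concurrent access, which this single-threaded sketch does not show:

```cpp
#include <cstddef>
#include <map>

class FastFitsish {
    std::multimap<std::size_t, void*> free_;     // size -> free block
public:
    void release(void* p, std::size_t size) { free_.emplace(size, p); }

    void* acquire(std::size_t size) {
        auto it = free_.lower_bound(size);       // first block large enough
        if (it == free_.end()) return nullptr;   // caller must grow the heap
        void* p = it->second;
        std::size_t rest = it->first - size;
        free_.erase(it);
        if (rest > 0)                            // return the tail as a new block
            free_.emplace(rest, static_cast<char*>(p) + size);
        return p;
    }
};
```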

16.
As microprocessors become faster and demand more bandwidth, the already limited scalability of a shared bus decreases even further. DICE, a shared-bus multiprocessor, utilizes cache only memory architecture (COMA) to effectively decrease the speed gap between modern high-performance microprocessors and the bus. DICE tries to optimize COMA for a shared-bus medium, in particular to reduce the detrimental effects of cache coherence and the “last memory block” problem on replacement. In this paper, we present the coherence and replacement protocol of the DICE multiprocessor and its design trade-offs. We describe a four-state write-invalidate coherence protocol in detail. Replacement, which poses a unique overhead problem of COMA, requires that a victim block with ownership be relocated to a remote node in order not to discard the last cached memory block. We show that the relocation process can be efficiently implemented by using a temporary storage called relocation buffer and a priority-based selection algorithm. We present performance results that show a drastic reduction in global bus traffic compared to a traditional shared-bus multiprocessor architecture.
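The priority-based victim selection can be sketched as below. The states and numeric priorities are illustrative, not DICE's actual encoding, but they capture the constraint the abstract describes: an owned block may be the last copy in the system, so it must be relocated via the relocation buffer rather than dropped:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class State : uint8_t { Invalid, SharedNonOwner, CleanOwner, DirtyOwner };

// Lower number = better victim: prefer blocks that can be dropped cheaply.
static int victimPriority(State s) {
    switch (s) {
        case State::Invalid:        return 0;
        case State::SharedNonOwner: return 1;  // another copy exists: just drop
        case State::CleanOwner:     return 2;  // may be the last copy: relocate
        case State::DirtyOwner:     return 3;  // relocate via relocation buffer
    }
    return 4;
}

template <std::size_t Ways>
std::size_t chooseVictim(const std::array<State, Ways>& set) {
    std::size_t best = 0;
    for (std::size_t w = 1; w < Ways; ++w)
        if (victimPriority(set[w]) < victimPriority(set[best])) best = w;
    return best;   // if an owner: stage in the relocation buffer, then off-node
}
```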

17.
Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempted to achieve the minimum completion time by distributing the workload as evenly as possible while minimizing the number of synchronization operations required. The authors consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to nonlocal data. They show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. They propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. They compare the performance of this new algorithm to other known algorithms by using five representative kernel programs on a Silicon Graphics multiprocessor workstation, a BBN Butterfly, a Sequent Symmetry, and a KSR-1, and show that the new algorithm offers substantial performance improvements, up to a factor of 4 in some cases. The authors conclude that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds.
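A skeletal version of locality-aware iteration placement, assuming a hypothetical ownerOf() mapping from an iteration to the processor whose local memory holds the data it touches (the balancing and synchronization-minimizing parts of the real algorithm are only noted in a comment):

```cpp
#include <vector>

std::vector<std::vector<int>> affinitySchedule(int iters, int procs,
                                               int (*ownerOf)(int)) {
    std::vector<std::vector<int>> plan(procs);
    for (int i = 0; i < iters; ++i)
        plan[ownerOf(i) % procs].push_back(i);   // co-locate work with its data
    // A real scheduler would now let idle processors steal from the tails of
    // overloaded queues, trading a little locality for load balance.
    return plan;
}
```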

18.
With the rapid development of multi/many-core processors, contention in the shared cache becomes increasingly serious, restricting the performance improvement of parallel programs. Recent research has employed the page coloring mechanism to realize cache partitioning on real systems and to reduce contention in the shared cache. However, page coloring-based cache partitioning has side effects: first, page coloring restricts the memory space that an application can allocate, which may lead to memory pressure; second, changing the cache partition dynamically requires massive page copying, which incurs large overhead. To make page coloring-based cache partitioning more practical, this paper proposes a malloc allocator-based dynamic cache partitioning mechanism with page coloring. Memory allocated by our malloc allocator can be dynamically partitioned among different applications according to the partitioning policy. Coloring only the dynamically allocated pages relieves memory pressure and reduces the page copying overhead caused by re-coloring, compared to coloring all pages. To further reduce the overhead, we introduce a minimum-distance page copying strategy and a lazy flush strategy. We conduct experiments on a real system to evaluate these strategies, and the results show that they work well for reducing cache misses and re-coloring overhead.
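The arithmetic page coloring relies on is compact enough to show directly: a physical page's cache color is given by the set-index bits above the page offset, so an allocator that hands an application only pages of its assigned colors confines that application to a slice of the shared cache. The cache geometry below is an assumed example, not from the paper:

```cpp
#include <cstdint>

constexpr unsigned kPageShift = 12;                      // 4 KiB pages
constexpr unsigned kCacheSize = 2 * 1024 * 1024;         // example: 2 MiB shared L2
constexpr unsigned kWays      = 8;
constexpr unsigned kNumColors = kCacheSize / kWays / (1u << kPageShift);  // = 64

unsigned pageColor(uint64_t physAddr) {
    return (physAddr >> kPageShift) & (kNumColors - 1);  // low set-index bits
}
```

A colored malloc allocator then satisfies each application's requests only from physical pages whose color falls in that application's partition, and re-partitioning means copying pages to pages of the new colors, which is the overhead the paper's strategies attack.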

19.
Write-invalidate protocols suffer from memory-access penalties due to coherence misses. While write-update or hybrid update/invalidate protocols can reduce coherence misses, the update traffic can increase memory-system contention. We show in this paper that update-based cache protocols can perform significantly better than write-invalidate protocols by incorporating a write cache in each processing node. Because it is legal to delay the propagation of modifications of a block until the next synchronization under relaxed memory consistency models, a write cache can significantly reduce traffic by exploiting locality in write accesses. By concentrating on a cache-coherent NUMA architecture, we study the implementation aspects of augmenting a write-invalidate, a write-update and two hybrid update/invalidate protocols with write caches. Through detailed architectural simulations using five benchmark programs, we find that write caches, with only a few blocks each, help write-invalidate protocols to cut the false-sharing miss rate and hybrid update/invalidate protocols to keep other copies, including the memory copy, clean at an acceptable write traffic level. Overall, the memory-access penalty associated with coherence misses is drastically reduced.
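Under a relaxed model, write propagation may legally wait until the next synchronization point, which is the basis of the write cache. A toy sketch of merging writes per block and flushing once at a release point (names illustrative, hardware details abstracted into propagate()):

```cpp
#include <cstdint>
#include <unordered_map>

extern void propagate(uint64_t blockAddr);   // assumed: update/invalidate others

class WriteCache {
    std::unordered_map<uint64_t, int> pending_;   // block -> merged write count
public:
    void write(uint64_t blockAddr) { ++pending_[blockAddr]; }  // merge locally

    void atSynchronization() {                    // release point: flush once
        for (const auto& kv : pending_)
            propagate(kv.first);                  // one transaction per block,
        pending_.clear();                         // however many writes merged
    }
};
```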

20.
In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. The Cache Coherence With Data Prefetching (CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme to five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.
