期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Compiler analysis for cache coherence: interprocedural array data-flow analysis and its impact on cache performance

Choi L. Pen-Chung Yew 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(9):879-896

In this paper, we present compiler algorithms for detecting references to stale data in shared-memory multiprocessors. The algorithm consists of two key analysis techniques, state reference detection and locality preserving analysis. While the stale reference detection finds the memory reference patterns that may violate cache coherence, the locality preserving analysis minimizes the number of such stale references by analyzing both temporal and spatial reuses. By computing the regions referenced by arrays inside loops, we extend the previous scalar algorithms for more precise analysis. We develop a full interprocedural array data-flow algorithm, which performs both bottom-up side-effect analysis and top-down context analysis on the procedure call graph to further exploit locality across procedure boundaries. The interprocedural algorithm eliminates cache invalidations at procedure boundaries, which were assumed in the previous compiler algorithms. We have fully implemented the algorithm in the Polaris parallelizing compiler. Using execution-driven simulations on Perfect Club benchmarks, we demonstrate how unnecessary cache misses can be eliminated by the automatic stale reference detection. The algorithm can be used to implement cache coherence in the shared-memory multiprocessors that do not have hardware directories, such as Cray T3D. 相似文献

2.

Maintaining Cache Coherence through Compiler-Directed Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》1998,53(2):170

In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. TheCache Coherence With Data Prefetching(CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme on five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency. 相似文献

3.

An advanced compiler framework for non-cache-coherentmultiprocessors

Yunheung Paek Navarro A. Zapata E. Hoeflinger J. Padua D. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(3):241-259

The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit a very stable and scalable performance for a variety of application programs. Considerable evidence suggests that they are more stable and scalable than many other shared-memory multiprocessors. However, the principal drawback of these machines is a lack of programmability, caused by the absence of the global cache coherence that is necessary to provide a convenient shared view of memory in hardware. This forces the programmer to keep careful track of where each piece of data is stored, a complication that is unnecessary when a pure shared-memory view is presented to the user. We believe that a remedy for this problem is advanced compiler technology. In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would otherwise be necessary to achieve good performance with this type of machine. From our experiments, we learned that our compiler performs well for a variety of applications on the T3D and T3E and we found a few sophisticated techniques that could improve performance even more once they are fully implemented in the compiler 相似文献

4.

Efficient Integration of Compiler-Directed Cache Coherence and Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》2001,61(12):1775

Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme. 相似文献

5.

Improving memory utilization in cache coherence directories

Lilja D.J. Yew P.-C. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(10):1130-1146

Efficiently maintaining cache coherence is a major problem in large-scale shared memory multiprocessors. Hardware directory coherence schemes have very high memory requirements, while software-directed schemes must rely on imprecise compile-time memory disambiguation. Recently proposed dynamically tagged directory schemes allocate pointers to blocks only as they are referenced, which significantly reduces their memory requirements, but they still allocate pointers to blocks that do not need them. The authors present two compiler optimizations that exploit the high-level sharing information available to the compiler to further reduce the size of a tagged directory by allocating pointers only when necessary. Trace-driven simulations are used to show that the performance of this combined hardware-software approach is comparable to other coherence schemes, but with significantly lower memory requirements. In addition, these simulations suggest that this approach is less sensitive to the quality of the memory disambiguation and interprocedural analysis performed by the compiler than software-only coherence schemes 相似文献

6.

Static Coarse Grain Task Scheduling with Cache Optimization Using OpenMP

Nakano Hirofumi Ishizaka Kazuhisa Obata Motoki Kimura Keiji Kasahara Hironori 《International journal of parallel programming》2003,31(3):211-223

Effective use of cache memory is getting more important with increasing gap between the processor speed and memory access speed. Also, use of multigrain parallelism is getting more important to improve effective performance beyond the limitation of loop iteration level parallelism. Considering these factors, this paper proposes a coarse grain task static scheduling scheme considering cache optimization. The proposed scheme schedules coarse grain tasks to threads so that shared data among coarse grain tasks can be passed via cache after task and data decomposition considering cache size at compile time. It is implemented on OSCAR Fortran multigrain parallelizing compiler and evaluated on Sun Ultra80 four-processor SMP workstation using Swim and Tomcatv from the SPEC fp 95. As the results, the proposed scheme gives us 4.56 times speedup for Swim and 2.37 times on 4 processors for Tomcatv respectively against the Sun Forte HPC Ver. 6 update 1 loop parallelizing compiler. 相似文献

7.

An optical interconnection network and a modified snooping protocol for the design of large-scale symmetric multiprocessors (SMPs)

Louri A. Kodi A.K. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(12):1093-1104

In symmetric multiprocessors (SMPs), the cache coherence overhead and the speed of the shared buses limit the address/snoop bandwidth needed to broadcast transactions to all processors. As a solution, a scalable address subnetwork called symmetric multiprocessor network (SYMNET) is proposed in which address requests and snoop responses of SMPs are implemented optically. SYMNET not only uses passive optical interconnects that increases the speed of the proposed network, but also pipelines address requests at a much faster rate than electronics. This increases the address bandwidth for snooping, but the preservation of cache coherence can no longer be maintained with the usual snooping protocols. A modified coherence protocol, coherence in SYMNET (COSYM), is introduced to solve the coherence problem. COSYM was evaluated with a subset of Splash-2 benchmarks and compared with the electrical bus-based MOESI protocol. The simulation studies have shown a 5-66 percent improvement in execution time for COSYM as compared to MOESI for various applications. Simulations have also shown that the average latency for a transaction to complete using COSYM protocol was 5-78 percent better than the MOESI protocol. It is also seen that SYMNET can scale up to hundreds of processors while still using fast snooping-based cache coherence protocols, and additional performance gains may be attained with further improvement in optical device technology. 相似文献

8.

High Performance Software Coherence for Current and Future Architectures

《Journal of Parallel and Distributed Computing》1995,29(2):179-195

Shared memory provides an attractive and intuitive programming model for large-scale parallel computing, but requires a coherence mechanism to allow caching for performance while ensuring that processors do not use stale data in their computation. Implementation options range from distributed shared memory emulations on networks of workstations to tightly coupled fully cache-coherent distributed shared memory multiprocessors. Previous work indicates that performance varies dramatically from one end of this spectrum to the other. Hardware cache coherence is fast, but also costly and time-consuming to design and implement, while DSM systems provide acceptable performance on only a limit class of applications. We claim that an intermediate hardware option-memory-mapped network interfaces that support a global physical address space, without cache coherence-can provide most of the performance benefits of fully cache-coherent hardware, at a fraction of the cost. To support this claim we present a software coherence protocol that runs on this class of machines, and use simulation to conduct a performance study. We look at both programming and architectural issues in the context of software and hardware coherence protocols. Our results suggest that software coherence on NCC-NUMA machines in a more cost-effective approach to large-scale shared-memory multiprocessing than either pure distributed shared memory or hardware cache coherence. 相似文献

9.

The DASH prototype: Logic overhead and performance 总被引：1，自引：0，他引：1

Lenoski D. Laudon J. Joe T. Nakahira D. Stevens L. Gupta A. Hennessy J. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(1):41-61

The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. The hardware overhead of directory-based cache coherence in a 48-processor is examined. The data show that the overhead is only about 10-15%, which appears to be a small cost for the ease of programming offered by coherent caches and the potential for higher performance. The performance of the system is discussed, and the speedups obtained by a variety of parallel applications running on the prototype are shown. Using a sophisticated hardware performance monitor, the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup are characterized. The optimizations incorporated in the DASH protocol are evaluated in terms of their effectiveness on parallel applications and on atomic tests that stress the memory system 相似文献

10.

Design of an adaptive cache coherence protocol for large scalemultiprocessors

Yang Q. Thangadurai G. Bhuyan L.N. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(3):281-293

A large scale, cache-based multiprocessor that is interconnected by a hierarchical network such as hierarchical buses or a multistage interconnection network (MIN) is considered. An adaptive cache coherence scheme for the system is proposed based on a hardware approach that handles multiple shared reads efficiently. The new protocol allows multiple copies of a shared data block in the hierarchical network, but minimizes the cache coherence overhead by dynamically partitioning the network into sharing and nonsharing regions based on program behavior. The new cache coherence scheme effectively utilizes the bandwidth of the hierarchical networks and exploits the locality properties of parallel algorithms. Simulation experiments have been carried out to analyze the performance of the new protocol. The simulation results show that the new protocol gives 15% to 30% performance improvement over some existing cache coherence schemes on similar systems for a wide range of workload parameters 相似文献

11.

Energy-efficient synonym data detection and consistency for virtual cache

《Microprocessors and Microsystems》2016

The cache memory consumes a large proportion of the energy used by a processor. In the on-chip cache, the translation lookaside buffer (TLB) accounts for 20–50% of energy consumption of the on-chip cache. To reduce energy consumption caused by TLB accesses, a virtual cache can be accessed by virtual addresses which are issued by a processor directly. However, a virtual cache may result in the synonym problem. In this paper, we propose low-cost synonym detection hardware and a synonym data coherence mechanism. These reduce the energy consumption incurred by TLB lookups, and maintain synonym data consistency in the virtual cache. The proposed synonym detection hardware efficiently reduces the number of blocks that must be looked up in a virtual cache for saving energy. In addition, the proposed synonym data coherence mechanism also reduces the number of invalidated blocks in the virtual cache to prevent the destruction of cache locality. The simulation results show that our proposed energy-aware virtual cache consumes 51%, 27%, and 20% less energy than the traditional physical cache, traditional virtual cache, and synonym lookaside buffer (SLB), respectively. In addition, our proposed design shows almost the same static energy consumption as SLB, and reduces static energy consumption by about 20% compared with the traditional physical cache and virtual cache. 相似文献

12.

The potential of compile-time analysis to adapt the cache coherenceenforcement strategy to the data sharing characteristics

Mounes-Toussi F. Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(5):470-481

相似文献

13.

基于硬件性能计数器的软件性能数据采集与分析研究

程克非张聪汪林林张勤《计算机应用》2005,25(10):2431-2433

引入了基于CPU硬件性能计数器的性能数据采集和分析方法，从软件运行时刻的细粒度参数入手分析软件运行时刻的性能表现，从而更为准确地反映系统实际的动态运行状态。实验证明，这种方法对于需要详细掌握系统动态运行状态的应用能够提供非常有效的分析数据，同时也在一定程度上对编译器的性能优化给出了相关参考数据。相似文献

14.

存储模型仿真器的设计与实现 总被引：2，自引：1，他引：1

吴俊敏杨超陈国良张淼辉门珂《计算机研究与发展》2005,42(3):394-403

存储一致性问题和高速缓存一致性问题是共享存储并行计算机中两个最关键的问题,通过仿真器对它们进行了量化研究,设计并实现了一个存储模型仿真器MMS．基于MMS仿真了不同并行机结构模型下多种存储一致性模型的行为;针对不同类型的计算问题比较了不同的存储一致性模型,并对实验结果进行了分析;实现了几个不同的高速缓存一致性协议,并比较了它们的性能．相似文献

15.

Design and analysis of a scalable cache coherence scheme based onclocks and timestamps

Min S.L. Baer J.-L. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(1):25-44

A timestamp-based software-assisted cache coherence scheme that does not require any global communication to enforce the coherence of multiple private caches is proposed. It is intended for shared memory multiprocessors. The scheme is based on a compile-time marking of references and a hardware-based local incoherence detection scheme. The possible incoherence of a cache entry is detected and the associated entry is implicitly invalidated by comparing a clock (related to program flow) and a timestamp (related to the time of update in the cache). Results of a performance comparison, which is based on a trace-driven simulation using actual traces. between the proposed timestamp-based scheme and other software-assisted schemes indicate that the proposed scheme performs significantly better than previous software-assisted schemes, especially when the processors are carefully scheduled so as to maximize the reuse of cache contents. This scheme requires neither a shared resource nor global communication and is, therefore, scalable up to a large number of processors 相似文献

16.

Synergistic design of an application-oriented sparse directory on many-core embedded systems

《Journal of Systems Architecture》2017

As many-core embedded systems are evolving from single-memory based designs to systems-on-a-chip running on an on-chip network, implementing a cache coherence mechanism in large-scale many-core embedded systems turns out to be a technical challenge. However, existing coherence mechanisms are difficult to scale beyond tens of cores, which require either excessive area or energy, complex hierarchical protocols, or inexact representations of sharer sets. In this paper, we present a hardware-software synergistic design of a cache coherence mechanism by considering OS-level application allocation and hardware-level coherence operations. The proposed application-oriented sparse directory (AoSD) cooperates with a contiguous allocation algorithm to isolate cache coherence traffic and thereby reduce interferences among applications. The proposed micro-architecture of sharer set representations is area-efficient; moreover, it can also be configured dynamically to track a flexible and exact sharer set. We verify our design by analyzing memory requirements of different cache organizations and implementing our design on a popular simulator Graphite to evaluate cache coherence traffic improvement. The results show that our design is both area-efficient and efficient with improvements in memory network performance by 11.74%–28.72%. It is also indicated that our design is feasible to scale up to work well in thousands-of-cores embedded systems. 相似文献

17.

A practical approach to dynamic load balancing 总被引：1，自引：0，他引：1

Watts J. Taylor S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(3):235-248

This paper presents a cohesive, practical load balancing framework that improves upon existing strategies. These techniques are portable to a broad range of prevalent architectures, including massively parallel machines, such as the Cray T3D/E and Intel Paragon, shared memory systems, such as the Silicon Graphics PowerChallenge, and networks of workstations. As part of the work, an adaptive heat diffusion scheme is presented, as well as a task selection mechanism that can preserve or improve communication locality. Unlike many previous efforts in this arena, the techniques have been applied to two large-scale industrial applications on a variety of multicomputers. In the process, this work exposes a serious deficiency in current load balancing strategies, motivating further work in this area 相似文献

18.

Benefit based cache data placement and update for mobile peer to peer networks

Fan Ye Qing Li Enhong Chen 《World Wide Web》2011,14(3):243-259

Mobile Peer to Peer (MP2P) networks provide decentralization, self-organization, scalability characters, but suffer from high latency and link break problems. In this paper, we study the cache/replication placement and cache update problems arising in such kind of networks. While researchers have proposed various replication placement algorithms to place data across the network to address the problem, it was proven as NP-hard. As a result, many heuristic algorithms have been brought forward for solving the problem. In this article, we propose an effective and low cost cache placement strategy combined with an update scheme which can be easily implemented in a decentralized way. The contribution of this paper is the adaptive and flexible cache placement and update algorithms designed for real MP2P network usage. The combination of MP2P cache placement and update is the novelty of this article. Extensive experiments are conducted to demonstrate the efficiency of the cache placement and update scheme. 相似文献

19.

Effcient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers

下载免费PDF全文

Benjamín Sahelices Agustín de Dios Pablo Ibáez Víctor Vials-Yúfera José María Llabería 《计算机科学技术学报》2012,27(1):75-91

Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrive at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters. 相似文献

20.

Bandwidth Adaptive Cache Coherence Optimizations for Chip Multiprocessors

Abdullah Kayi Olivier Serres Tarek El-Ghazawi 《International journal of parallel programming》2014,42(3):435-455

Chip Multiprocessors (CMPs) have different technological parameters and physical constraints than earlier multi-processor systems, which should be taken into consideration when designing cache coherence protocols. Also, contemporary cache coherence protocols use invalidate schemes that are known to generate a high number of coherence misses. This is especially true under producer-consumer sharing patterns that can become a performance bottleneck as the number of cores increases. This paper presents two mechanisms to design efficient and scalable cache coherence protocols for CMPs. First, we propose an adaptive hybrid protocol to reduce coherence misses observed in write-invalidate based protocols. The proposed protocol is based on a write-invalidate scheme. However, adaptively, it can push updates to potential consumers based on observed producer-consumer sharing patterns. Secondly, we extend this adaptive protocol with an interconnection resource aware mechanism. Experimental evaluations, conducted on a tiled-CMP via full-system simulation, were used to assess the performance from our proposed dynamic hybrid protocols. Performance analysis is presented on a set of scientific applications from the SPLASH-2 and NAS parallel benchmark suites. Results showed that the proposed mechanisms reduce cache-to-cache sharing misses up to 48 % and speed up application performance up to 34 %. In addition, the proposed interconnection resource aware mechanism is proven to perform well under varying interconnection utilizations. 相似文献