1.

Techniques for Improving Performance of Hybrid Snooping Cache Protocols

Fredrik Dahlgren 《Journal of Parallel and Distributed Computing》1999,59(3):329

Bus-based multiprocessors constitute a cost-effective class of shared-memory multiprocessors. Private caches are the key to an efficient utilization of the shared bus, and most such systems use a write-invalidate cache-coherence protocol to keep the caches coherent. Two important factors that limit the performance of the system are cache misses that lead to long-latency reads and bus congestion because of read misses and coherence traffic. While hybrid write-invalidate/write-update snooping protocols lead to fewer read misses than write-invalidate protocols, previous studies have shown them to be incapable of providing consistent performance improvements because of heavily increased coherence traffic. In this paper, we analyze how the deficiencies of hybrid snooping protocols can be dramatically reduced by using write caches and read snarfing (also called read-broadcast) under release consistency. Our performance evaluation is based on program-driven simulation and a set of five scientific applications with different sharing behaviors including migratory sharing as well as producer–consumer sharing. We show that one of the evaluated hybrid protocols, extended with write caches as well as read snarfing, manages to reduce the number of coherence misses by between 83 and 93% as compared to a write-invalidate protocol for all five applications in this study. In addition, the number of bus transactions is reduced substantially. However, we also show that read snarfing and hybrid snooping protocols might lead to higher cache occupancy because of increased sharing. Because of the small implementation cost of the hybrid protocol and the two extensions, we believe the combination to be an effective approach to boosting the performance of bus-based multiprocessors.

2.

Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection

Håkan Grahn Per Stenström 《Journal of Parallel and Distributed Computing》1996,39(2):168

Although directory-based write-invalidate cache coherence protocols have a potential to improve the performance of large-scale multiprocessors, coherence misses limit the processor utilization. Therefore, so-called competitive-update protocols (hybrid protocols that on a per-block basis dynamically switch between write-invalidate and write-update) have been considered as a means to reduce the coherence miss rate and have been shown to be a better coherence policy for a wide range of applications. Unfortunately, such protocols may cause high traffic peaks for applications with extensive use of migratory objects. These traffic peaks can offset the performance gain of a reduced miss rate if the network bandwidth is not sufficient. We propose in this study to extend a competitive-update protocol with a previously published adaptive mechanism that can dynamically detect migratory objects and reduce the coherence traffic they cause. Detailed architectural simulations based on five scientific and engineering applications show that this adaptive protocol outperforms a write-invalidate protocol by reducing the miss rate and bandwidth needed by up to 71 and 26%, respectively.
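The per-block switching between write-update and write-invalidate mentioned in this abstract can be pictured with a small sketch. It assumes the common counter-based formulation of competitive update, in which each cached copy counts the updates it has received since its last local access and invalidates itself once a threshold is reached. The class name, method names, and the default threshold of 4 are illustrative assumptions, not details taken from the paper, and the migratory-data detection extension is not modeled.

```python
class CompetitiveCopy:
    """One cached copy of a block under a competitive-update policy (illustrative)."""

    def __init__(self, threshold=4):  # threshold value is an assumption
        self.threshold = threshold
        self.valid = True
        self.unused_updates = 0  # updates received since the last local access

    def local_access(self):
        """A local read or write shows the copy is in use. Returns True on a hit."""
        if not self.valid:
            return False  # coherence miss: the copy must be refetched
        self.unused_updates = 0
        return True

    def remote_update(self):
        """Apply an update sent by another processor. Returns True if still valid."""
        if not self.valid:
            return False
        self.unused_updates += 1
        if self.unused_updates >= self.threshold:
            # Nobody here is reading the updates: drop the copy, so further
            # writes by the remote processor no longer generate update traffic.
            self.valid = False
        return self.valid

    def refetch(self):
        """Reload the block after a miss and start counting from zero again."""
        self.valid = True
        self.unused_updates = 0
```

A copy that is read between remote writes stays in update mode, while a copy that is never read falls back to invalidate behavior after `threshold` updates.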

3.

Using Write Caches to Improve Performance of Cache Coherence Protocols in Shared-Memory Multiprocessors

Dahlgren F. Stenstrom P. 《Journal of Parallel and Distributed Computing》1995,26(2)

Write-invalidate protocols suffer from memory-access penalties due to coherence misses. While write-update or hybrid update/invalidate protocols can reduce coherence misses, the update traffic can increase memory-system contention. We show in this paper that update-based cache protocols can perform significantly better than write-invalidate protocols by incorporating a write cache in each processing node. Because it is legal to delay the propagation of modifications of a block until the next synchronization under relaxed memory consistency models, a write cache can significantly reduce traffic by exploiting locality in write accesses. By concentrating on a cache-coherent NUMA architecture, we study the implementation aspects of augmenting a write-invalidate, a write-update and two hybrid update/invalidate protocols with write caches. Through detailed architectural simulations using five benchmark programs, we find that write caches, with only a few blocks each, help write-invalidate protocols to cut the false-sharing miss rate and hybrid update/invalidate protocols to keep other copies, including the memory copy, clean at an acceptable write traffic level. Overall, the memory-access penalty associated with coherence misses is drastically reduced.
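The way a write cache exploits write locality can be sketched as follows: writes to the same block are merged in a small buffer and propagated only when the block is evicted or when a synchronization (release) point is reached. This is a minimal model under stated assumptions; the class name, the FIFO eviction order, the 4-block capacity, and the 32-byte block size are all hypothetical choices and are not taken from the paper.

```python
from collections import OrderedDict


class WriteCache:
    """Tiny write-combining buffer, flushed at synchronization points (illustrative)."""

    def __init__(self, num_blocks=4, block_size=32):  # assumed sizes
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()  # block number -> {offset: value}
        self.sent = []               # (block number, merged words) propagated so far

    def write(self, addr, value):
        """Buffer a write; only an eviction causes traffic here."""
        block = addr // self.block_size
        if block not in self.blocks:
            if len(self.blocks) == self.num_blocks:
                # Buffer full: propagate the oldest block (FIFO is an assumption).
                self._propagate(*self.blocks.popitem(last=False))
            self.blocks[block] = {}
        self.blocks[block][addr % self.block_size] = value

    def release(self):
        """At a release, all delayed modifications must become visible."""
        while self.blocks:
            self._propagate(*self.blocks.popitem(last=False))

    def _propagate(self, block, words):
        # One combined update per block instead of one per individual write.
        self.sent.append((block, dict(words)))
```

In this model, eight writes to the same block followed by a release produce a single combined propagation rather than eight separate updates.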

4.

An evaluation of hardware-based and compiler-controlled optimizations of snooping cache protocols

Fredrik Dahlgren Jonas Skeppstedt Per Stenström 《Future Generation Computer Systems》1998,13(6):469-487

Coherence misses and invalidation traffic limit the performance of bus-based multiprocessors using write-invalidate snooping caches. This paper considers optimizations of a write-invalidate protocol that remove such overhead. While coherence misses are attacked by a hybrid update/invalidate protocol and another technique where update instructions are selectively inserted by a compiler, invalidation traffic is reduced by three optimizations that coalesce ownership acquisition with miss handling: migrate-on-dirty, an adaptive hardware-based scheme, and compiler-controlled insertion of load-exclusive instructions.

The relative effectiveness of these optimizations is evaluated using detailed architectural simulations and a set of four parallel programs. We find that while both of the update-based schemes effectively remove most coherence misses, the hybrid update/invalidate scheme causes lower traffic. By contrast, the compiler-based approach to cut invalidation traffic is slightly more efficient than the adaptive hardware-based scheme. Moreover, the migrate-on-dirty heuristic is found to have devastating effects on the miss rate.

5.

A leakage-aware L2 cache management technique for producer-consumer sharing in low-power chip multiprocessors

Hyunhee Kim Jihong Kim 《Journal of Parallel and Distributed Computing》2011,71(12):1545-1557

This paper proposes a novel leakage management technique for applications with producer-consumer sharing patterns. Although previous research has proposed leakage management techniques by turning off inactive cache blocks, these techniques can be further improved by exploiting the various run-time characteristics of target applications in CMPs. By exploiting particular access sequences observed in producer-consumer sharing patterns and the spatial locality of shared buffers, our technique enables a more aggressive turn-off of L2 cache blocks of these buffers. Experimental results using a CMP simulator show that our proposed technique reduces the energy consumption of on-chip L2 caches, a shared bus, and off-chip memory by up to 31.3% over the existing cache leakage power management techniques with no significant performance loss.

6.

Speculative incoherent cache protocols

Jaehyuk Huh Burger D. Jichuan Chang Sohi G.S. 《Micro, IEEE》2004,24(6):104-109

Multiprocessing and multithreading are becoming ubiquitous even on single chips. With increasing cache sizes, coherence misses in such systems will account for a larger fraction of all cache misses. As communication latencies increase, this larger fraction of coherence misses will cause significant and increased performance losses. Tuning coherence protocols for specific communication patterns and applications can reduce communication latencies. However, these optimizations increase a protocol's design complexity, making the protocol difficult to verify. A competing approach requires parallel programmers to tune applications to work well with simpler protocols. Speculative execution has successfully improved performance in various scenarios. We propose a new type of load speculation, called coherence decoupling. Coherence decoupling is a microarchitectural mechanism that implements separate protocols for speculative use and for the eventual verification of values. The technique reduces the effect of long communication latencies while mitigating the burdens on the coherence protocol designer and the parallel programmer.

7.

Cache vulnerability mitigation using an adaptive cache coherence protocol

Mohammad Maghsoudloo Hamid R. Zarandi 《The Journal of supercomputing》2014,68(3):1048-1067

This paper proposes an adaptive cache coherence protocol to improve the reliability of caches against soft errors in shared-memory multi-core processors. The proposed protocol is based on a comprehensive study and analysis intended to determine the effects of cache coherence protocols on the characteristics of cache memories. The outcomes of this analysis indicate that differences in handling dirty data items play an important role in deciding for or against a cache coherence protocol. Based on these primary results, the proposed protocol tries to enhance the reliability of caches by means of sharing management. Sharing is dynamically adjusted according to the operational mode of the CPU. The experimental results show that the proposed protocol leads to about a 16% improvement in MTTF, with no performance degradation and with negligible bandwidth and cache energy consumption overheads compared to previous works.

8.

Design of an adaptive cache coherence protocol for large scale multiprocessors

Yang Q. Thangadurai G. Bhuyan L.N. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(3):281-293

A large scale, cache-based multiprocessor that is interconnected by a hierarchical network such as hierarchical buses or a multistage interconnection network (MIN) is considered. An adaptive cache coherence scheme for the system is proposed based on a hardware approach that handles multiple shared reads efficiently. The new protocol allows multiple copies of a shared data block in the hierarchical network, but minimizes the cache coherence overhead by dynamically partitioning the network into sharing and nonsharing regions based on program behavior. The new cache coherence scheme effectively utilizes the bandwidth of the hierarchical networks and exploits the locality properties of parallel algorithms. Simulation experiments have been carried out to analyze the performance of the new protocol. The simulation results show that the new protocol gives 15% to 30% performance improvement over some existing cache coherence schemes on similar systems for a wide range of workload parameters.

9.

DP&TB: a coherence filtering protocol for many-core chip multiprocessors

Fengkai Yuan Zhenzhou Ji 《The Journal of supercomputing》2013,66(1):249-261

Future many-core chip multiprocessors (CMPs) will integrate hundreds of processor cores on chip. Two cache coherence protocols dominate current CMPs. The token-based protocol (Token) provides high performance, but it generates a prohibitive amount of network traffic, which translates into excessive power consumption. The directory-based protocol (Directory) reduces network traffic, but it incurs the storage overhead of the directory and comparatively low performance caused by indirection, which limits its applicability to many-core CMPs. In this work, we present DP&TB, a novel cache coherence protocol particularly suited to future many-core CMPs. In DP&TB, cache coherence is maintained at the granularity of a page, which makes it possible to filter out both unnecessary coherence inspections for blocks inside private pages and network traffic for blocks inside shared pages. We employ Directory to detect private and shared pages and Token to maintain the coherence of the blocks inside shared pages. DP&TB inherits the merits of Directory and Token and overcomes their problems. Experimental results show that DP&TB outperforms both Directory and Token, improving performance by 9.1% over Token and network traffic by 13.8% over Directory. In addition, the storage overhead of DP&TB is less than half of that of Directory. Our proposal can fulfill the requirement of many-core CMPs to achieve high performance, power and area efficiency.
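The page-granularity filtering idea in this abstract can be illustrated with a short sketch. It assumes a simple first-touch rule: a page is private to the first core that touches it and becomes shared as soon as a second core accesses it. That rule, the 4096-byte page size, and the names `PageClassifier` and `access` are assumptions made for illustration; the abstract does not describe how DP&TB actually detects private and shared pages beyond saying that Directory is used for it.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageClassifier:
    """Classifies pages as private or shared on first touch (illustrative)."""

    def __init__(self):
        self.owner = {}      # page number -> first core that touched the page
        self.shared = set()  # pages that have been touched by more than one core

    def access(self, core, addr):
        """Return True if this access needs block-level coherence handling."""
        page = addr // PAGE_SIZE
        if page in self.shared:
            return True
        first = self.owner.setdefault(page, core)
        if first == core:
            return False  # private page: the coherence inspection is filtered out
        self.shared.add(page)  # a second core arrived: the page becomes shared
        return True
```

Only accesses for which `access` returns True would go on to the block-level protocol (Token, in the abstract's description); accesses to private pages skip it entirely.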

10.

A Survey of Cache Coherence Research for Many-Core Processors

韩立敏 安建峰 高德远 樊晓桠 任向隆 《计算机应用研究》2012,29(11):4011-4016

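Taking the design of coherence protocols for tiled many-core processors as its main thread, this survey reviews recent research in China and abroad on cache coherence for many-core processors. It describes how different NUCA organizations affect the coherence protocol, analyzes and compares the characteristics and shortcomings of several traditional directory-based coherence protocols, and summarizes the design ideas and properties of several recent coherence protocols aimed at many-core architectures. Finally, it points out several key design directions for cache coherence protocols that are both application-adaptive and scalable.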

11.

CCNoC: Cache-Coherent Network on Chip for Chip Multiprocessors (total citations: 1; self-citations: 1; citations by others: 0)

王惊雷 薛一波 王海霞 李崇民 汪东升 《计算机科学技术学报》2010,25(2):257-266

As the number of cores in chip multiprocessors (CMPs) increases, the cache coherence protocol has become a key issue in the integration of chip multiprocessors. Supporting a cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and ...

12.

Research on Key Techniques of Cache Coherence Protocols for Multi-Core Processors

黄安文 张民选 《计算机工程与科学》2009,31(Z1)

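The continuing growth in the scale of multi-core processors and the increasing complexity of inter-core communication mechanisms make cache coherence harder to maintain. Starting from the background of the cache coherence problem in multi-core processors, this paper analyzes the implementation mechanisms of the snooping, directory, Token, and Hammer protocols and their advantages and disadvantages in a multi-core environment. It then analyzes and discusses in detail the development trends and technical challenges of future cache coherence protocol design for multi-core processors from three perspectives: co-design of the coherence protocol with the on-chip interconnect, protocol optimization strategies for low-power applications, and verification and fault-tolerance mechanisms for cache coherence protocols.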

13.

Filtering directory lookups in CMPs

A. Bosque V. Viñals P. Ibáñez J.M. Llabería 《Microprocessors and Microsystems》2011,35(8):695-707

Coherence protocols consume an important fraction of power to determine which coherence action to perform. Specifically, on CMPs with shared cache and directory-based coherence protocol implemented as a duplicate of local cache tags, we have observed that a large fraction of directory lookups cause a miss, because the block looked up is not allocated in any local cache. To reduce the number of directory lookups and therefore the power consumption, we propose to add a filter before the directory access. We introduce two filter implementations. In the first one, filtering information is explicitly kept in the shared cache for every block. In the second one, filtering information is decoupled from the shared cache organization, so the filter size does not depend on the shared cache size. We evaluate our filters in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with write-through local caches and a shared cache. We show that, for SPLASH2 benchmarks, the proposed filters reduce the number of directory lookups performed by 60% while power consumption is reduced by ∼28%. For Specweb2005, the number of directory lookups performed is reduced by 68% (44%), while directory power consumption is reduced by 19% (9%) using the first (second) filter implementation.

14.

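Research on Cache Resource Management Mechanisms for Chip Multiprocessors (total citations: 2; self-citations: 1; citations by others: 1)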

贾小敏 张民选 齐树波 赵天磊 《计算机科学》2011,38(1):295-301

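As chip multiprocessors become the mainstream of processor development and on-chip cache resources keep growing, cache resource management has become a key issue for chip multiprocessors. This paper reviews research progress on cache resource management for chip multiprocessors and, according to research content, divides it into two categories: cache partitioning and cache sharing. For cache partitioning, it discusses the main components and the general form, and analyzes and compares typical cache partitioning mechanisms for chip multiprocessors. For cache sharing, it outlines the main research topics and introduces and compares several mainstream cache sharing mechanisms for chip multiprocessors. The analysis concludes that page partitioning managed cooperatively by software and hardware should be the focus of future research on cache partitioning mechanisms, while research on cache sharing mechanisms should start from the cache behavior characteristics of the target applications.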

15.

An adaptive cache coherence protocol specification for parallel input/output systems

Garcia-Carballeira F. Carretero J. Calderon A. Perez J.M. Garcia J.D. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(6):533-545

Caching has been intensively used in memory and traditional file systems to improve system performance. However, the use of caching in parallel file systems and I/O libraries has been limited to I/O nodes to avoid cache coherence problems. We specify an adaptive cache coherence protocol that is very suitable for parallel file systems and parallel I/O libraries. This model exploits the use of caching, both at processing and I/O nodes, providing performance improvement mechanisms such as aggressive prefetching and delayed-write techniques. The cache coherence problem is solved by using a dynamic scheme of cache coherence protocols with different sizes and shapes of granularity. The proposed model is very appropriate for parallel I/O interfaces, such as MPI-IO. Performance results, obtained on an IBM SP2, are presented to demonstrate the advantages offered by the cache management methods proposed.

16.

Reliability improvement in private non-uniform cache architecture using two enhanced structures for coherence protocols and replacement policies

Mohammad Maghsoudloo Hamid R. Zarandi 《Microprocessors and Microsystems》2014

In this paper, a comprehensive study is first conducted to investigate the effects of cache coherence protocols and cache replacement policies on the characteristics of NUCA in current many-core processors. The main focus of this study is to analyze the effects of coherence protocols and replacement policies on the vulnerability of caches. The outcomes of this analysis indicate two facts: (i) differences in handling write operations play an important role in deciding for or against a cache coherence protocol; (ii) near-optimal solutions to the replacement problem, aimed at enhancing performance, can also have a positive influence on reducing the cache vulnerability factor. Based on the results of the first step, two schemes are introduced to enhance the reliability of caches by modifying the structures of cache coherence protocols and cache replacement policies. The first scheme tries to manage sharing of the dirty data items among different same-level caches. The second gives old dirty blocks priority over clean blocks for replacement. The proposed schemes show about 18% improvement in MTTF, with negligible performance, bandwidth and energy consumption overhead compared to previous cache structures.

17.

Low-level implementation of the SISC protocol for thread-level speculation on a multi-core architecture

《Parallel Computing》2017

Chip Multiprocessors (CMP) have emerged during the last decades as a very attractive way to use the ever-increasing on-chip transistor count. However, classical parallelization techniques have failed to fully exploit parallelism in existing sequential applications due to false data dependencies. This paper focuses on the Thread-level Speculation (TLS) technique, an alternative way to exploit the transistor budget in a CMP. With TLS, even possibly data-dependent threads can run in parallel as long as the semantics of the sequential execution is preserved. Special hardware support monitors the actual data dependencies between threads at run time and, if they are violated, misspeculation effects are undone, usually through replay. This kind of system is known as a speculative CMP. However, the TLS mechanism requires complex protocols that integrate cache coherence and speculation to maintain program order among multiple versions of data. Current TLS protocol evaluations are usually inadequate because they are not done at a low enough level. A realistic evaluation of speculative CMPs must be performed either on real hardware or with very detailed cycle-accurate simulator models. In this paper we focus on a low-level evaluation of the write-invalidate TLS protocol Speculation Integrated with Snoopy Coherence (SISC) proposed in [1]. This evaluation relies on a cycle-level simulation environment with detailed cycle-level cache memories, cache controller and system bus. On top of this, a speculative four-core architecture is simulated and three new modules (Scheduler, Squash Arbiter and Supplier Arbiter) are provided to support low-level implementation of the SISC protocol. The overall cost of the SISC protocol is evaluated by means of the CACTI tool in three domains: access latency, area, and power.
The evaluation goal was to keep the cache access time below the cycle latency and the area and power overheads within an acceptable budget. The SISC protocol has been compared against a regular MESI-based architecture in both 32-bit and 64-bit versions. We kept the cache access time below the cycle latency, and we managed to keep the data cache area and static power overheads below 32% and 35%, respectively.

18.

Coherence and Replacement Protocol of DICE—A Bus-Based COMA Multiprocessor

《Journal of Parallel and Distributed Computing》1999,57(1):14-32

As microprocessors become faster and demand more bandwidth, the already limited scalability of a shared bus decreases even further. DICE, a shared-bus multiprocessor, utilizes cache only memory architecture (COMA) to effectively decrease the speed gap between modern high-performance microprocessors and the bus. DICE tries to optimize COMA for a shared-bus medium, in particular to reduce the detrimental effects of cache coherence and the “last memory block” problem on replacement. In this paper, we present the coherence and replacement protocol of the DICE multiprocessor and its design trade-offs. We describe a four-state write-invalidate coherence protocol in detail. Replacement, which poses a unique overhead problem of COMA, requires that a victim block with ownership be relocated to a remote node in order not to discard the last cached memory block. We show that the relocation process can be efficiently implemented by using a temporary storage called relocation buffer and a priority-based selection algorithm. We present performance results that show a drastic reduction in global bus traffic compared to a traditional shared-bus multiprocessor architecture.

19.

Efficient Execution of Multiple Queries on Deep Memory Hierarchy (total citations: 1; self-citations: 0; citations by others: 1)

Yan Zhang Zhi-Feng Chen and Yuan-Yuan Zhou 《计算机科学技术学报》2007,22(2):273-279

This paper proposes a complementary novel idea, called MiniTasking, to further reduce the number of cache misses by improving the data temporal locality for multiple concurrent queries. Our idea is based on the observation that, in many workloads such as decision support systems (DSS), there is usually a significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing to improve data temporal locality by scheduling query execution at three levels: query level batching, operator level grouping and mini-task level scheduling. The experimental results with various types of concurrent TPC-H query workloads show that, with the traditional N-ary Storage Model (NSM) layout, MiniTasking significantly reduces the L2 cache misses by up to 83%, and thereby achieves 24% reduction in execution time. With the Partition Attributes Across (PAX) layout, MiniTasking further reduces the cache misses by 65% and the execution time by 9%. For the TPC-H throughput test workload, MiniTasking improves the end performance up to 20%.

20.

A Cache coherence protocol for MIN-based multiprocessors

Mazin S. Yousif Chita R. Das Matthew J. Thazhuthaveetil 《The Journal of supercomputing》1994,8(2):163-185

In this paper we present a cache coherence protocol for multistage interconnection network (MIN)-based multiprocessors with two distinct private caches: private-blocks caches (PCache) containing blocks private to a process and shared-blocks caches (SCache) containing data accessible by all processes. The architecture is extended by a coherence control bus connecting all shared-block cache controllers. Timing problems due to variable transit delays through the MIN are dealt with by introducing Transient states in the proposed cache coherence protocol. The impact of the coherence protocol on system performance is evaluated through a performance study of three phases. Assuming homogeneity of all nodes, a single-node queuing model (phase 3) is developed to analyze system performance. This model is solved for processor and coherence bus utilizations using the mean value analysis (MVA) technique with shared-blocks steady state probabilities (phase 1) and communication delays (phase 2) as input parameters. The performance of our system is compared to that of a system with an equivalent-sized unified cache and with a multiprocessor implementing a directory-based coherence protocol. System performance measures are verified through simulation.
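For readers unfamiliar with the mean value analysis (MVA) technique named in this abstract, the following is a minimal sketch of the textbook exact MVA recursion for a closed, single-class queueing network with an optional think time. It is not the paper's single-node model: the function name `mva`, the choice of stations, and any service demands passed to it are hypothetical, whereas the paper derives its inputs from steady-state probabilities and communication delays.

```python
def mva(demands, customers, think_time=0.0):
    """Exact single-class MVA for a closed network of queueing stations.

    demands    -- service demand per visit cycle at each station (time units)
    customers  -- number of circulating customers (e.g. outstanding requests)
    think_time -- delay spent outside the stations between cycles
    Returns (throughput, per-station utilization, per-station mean queue length).
    """
    queue = [0.0] * len(demands)  # mean queue lengths with zero customers
    throughput = 0.0
    for n in range(1, customers + 1):
        # An arriving customer sees the queue left by a network with n - 1 customers.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        throughput = n / (think_time + sum(residence))
        queue = [throughput * r for r in residence]  # Little's law per station
    utilization = [throughput * d for d in demands]
    return throughput, utilization, queue
```

For example, `mva([1.0, 1.0], 2)` models two equally loaded stations with two customers. Applying the same recursion to a processor station and a coherence-bus station would give utilizations of the kind the abstract refers to, under whatever demands one assumes.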