Similar Documents
20 similar documents found
1.
Finding new memory materials to replace DRAM is a current research hotspot. Phase-change memory (PCM) has attracted wide attention for its low power consumption, high storage density, and non-volatility; however, PCM endures only a limited number of writes, so using it as main memory requires reducing the number of write operations to it. An effective approach to this problem is to optimize the cache replacement policy so that fewer dirty blocks are evicted from the cache. Existing work protects dirty blocks mainly by assigning them higher protection priority at insertion and on access hits, but no longer distinguishes dirty from clean blocks during priority demotion, so the cache may still evict a dirty block first even when many clean blocks are present. We propose a new cache replacement policy, MAC, which uses a multi-dimensional hierarchy to place an impassable boundary between dirty and clean blocks, giving dirty blocks stronger protection. Simulation experiments show that, compared with the LRU replacement policy, MAC reduces memory writes by about 25.12% on average at a low hardware cost, with almost no impact on program performance.
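The abstract does not specify MAC's multi-dimensional hierarchy, but its core invariant (dirty blocks are never chosen for eviction while clean blocks remain) can be illustrated with a minimal sketch; the class name and the single LRU ordering here are assumptions, not the paper's design.

```python
from collections import OrderedDict

class CleanFirstSet:
    """One set of a write-back cache that evicts clean blocks before dirty
    ones, so PCM writes (dirty evictions) happen only as a last resort."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> dirty flag; LRU first

    def access(self, tag, is_write):
        if tag in self.blocks:
            dirty = self.blocks.pop(tag) or is_write
            self.blocks[tag] = dirty  # re-insert at the MRU position
            return None               # hit: no eviction
        victim = self._evict() if len(self.blocks) >= self.ways else None
        self.blocks[tag] = is_write
        return victim                 # (tag, was_dirty) or None

    def _evict(self):
        for tag, dirty in self.blocks.items():   # scan from the LRU side
            if not dirty:             # clean victim found: cheap eviction
                del self.blocks[tag]
                return (tag, False)
        tag = next(iter(self.blocks)) # set is all dirty: evict LRU dirty
        del self.blocks[tag]
        return (tag, True)            # this eviction costs a PCM write
```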

2.
This paper proposes an adaptive cache coherence protocol to improve the reliability of caches against soft errors in shared-memory multi-core processors. The proposed protocol is based on a comprehensive study and analysis of the effects of cache coherence protocols on the characteristics of cache memories. The outcomes of this analysis indicate that differences in how dirty data items are handled play an important role in favoring one cache coherence protocol over another. Building on these results, the proposed protocol enhances the reliability of caches through sharing management: sharing is dynamically adjusted according to the operational mode of the CPU. The experimental results show that the proposed protocol yields about a 16% improvement in MTTF, with no performance degradation and with negligible bandwidth and cache energy consumption overheads compared to previous work.

3.
The core of current-generation high-performance multiprocessor systems is out-of-order execution processors with aggressive branch prediction. Despite their relatively high branch prediction accuracy, these processors still execute many memory instructions down mispredicted paths. Previous work focused on uniprocessors showed that these wrong-path (WP) memory references may pollute the caches and increase the amount of cache and memory traffic. On the positive side, however, they may prefetch data into the caches for memory references on the correct path. While computer architects have thoroughly studied the impact of WP effects in uniprocessor systems, there is no comparable work for multiprocessor systems. In this paper, we explore the effects of WP memory references on the memory system behavior of shared-memory multiprocessor (SMP) systems for both broadcast and directory-based cache coherence. Our results show that these WP memory references can increase the amount of cache-to-cache transfers by 32%, invalidations by 8% and 20% for broadcast and directory-based SMPs, respectively, and the number of writebacks by up to 67% for both systems. In addition to the extra coherence traffic, WP memory references also increase the number of cache line state transitions by 21% and 32% for broadcast and directory-based SMPs, respectively. To reduce the performance impact of these WP memory references, we introduce two simple mechanisms: filtering WP blocks that are not likely to be used, and WP-aware cache replacement. Together they yield speedups of up to 37%.
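The abstract names the two mechanisms but not their implementation; as a hedged illustration of the second one, a wrong-path-aware victim choice might look like the following sketch (the wrong_path flag and the scan order are assumptions).

```python
# Sketch: wrong-path (WP) aware victim selection (illustrative).
def choose_victim(set_blocks):
    """set_blocks: blocks of one set, ordered LRU -> MRU. Each block carries
    a wrong_path flag, set when it was fetched down a mispredicted path and
    cleared if a correct-path reference later touches it."""
    for blk in set_blocks:        # scan from the LRU side
        if blk.wrong_path:
            return blk            # prefer evicting never-used WP blocks
    return set_blocks[0]          # fall back to the plain LRU victim
```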

4.
In glueless shared-memory multiprocessors where cache coherence is usually maintained using a directory-based protocol, the fast access to the on-chip components (caches and network router, among others) contrasts with the much slower main memory. Unfortunately, directory-based protocols need to obtain the sharing status of every memory block before coherence actions can be performed. This information has traditionally been stored in main memory, and therefore these cache coherence protocols are far from being optimal. In this work, we propose two alternative designs for the last-level private cache of glueless shared-memory multiprocessors: the lightweight directory and the SGluM cache. Our proposals completely remove directory information from main memory and store it in the home node’s L2 cache, thus reducing both the number of accesses to main memory and the directory memory overhead. The main characteristics of the lightweight directory are its simplicity and the significant improvement in the execution time for most applications. Its drawback, however, is that the performance of some particular applications could be degraded. On the other hand, the SGluM cache offers more modest improvements in execution time for all the applications by adding some extra structures that cope with the cases in which the lightweight directory fails.

5.
In a sectored cache, a cache line is divided into several subblocks, each of which is a basic coherence unit. Partial block invalidation can then be performed on cache lines to eliminate false sharing on invalidation-based multiprocessors. Sectored caches often include a facility, called bounteous transfer, to supply extra subblocks along with the missed subblock on a read miss. Unfortunately, previous work on sectored caches concentrated mainly on solving the false-sharing problem and overlooked the prefetching effects of bounteous transfer. In this paper, we evaluate the performance impact of bounteous transfer on a MESI-based sectored cache. Three types of bounteous transfer are evaluated: bounteous transfer with valid subblocks (BT-V), bounteous transfer with clean subblocks (BT-C), and bounteous transfer disabled (No-BT). We simulated the execution of typical benchmarks (FFT, LU, Radix, SOR) on the MESI-based sectored cache. Two metrics, U-rate and R-rate, are proposed to help observe sharing granularities and coherence overhead. Evaluation results show that different benchmarks work better with different kinds of bounteous transfer, and that using bounteous transfer carelessly may degrade performance.
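A minimal sketch of the three transfer variants on a read miss, assuming per-subblock state is tracked as invalid/clean/dirty; the enum and function names are illustrative, not from the paper.

```python
# Sketch: which extra subblocks accompany the missed one (illustrative).
from enum import Enum

class BTMode(Enum):
    NO_BT = 0   # send only the missed subblock
    BT_C  = 1   # also send subblocks that are clean at the supplier
    BT_V  = 2   # also send every valid subblock, clean or dirty

def subblocks_to_transfer(line, missed_idx, mode):
    """line: list of per-subblock states, each 'invalid'|'clean'|'dirty'."""
    chosen = {missed_idx}
    for i, state in enumerate(line):
        if i == missed_idx or state == 'invalid':
            continue
        if mode is BTMode.BT_V:
            chosen.add(i)              # prefetch every valid subblock
        elif mode is BTMode.BT_C and state == 'clean':
            chosen.add(i)              # prefetch only clean subblocks
    return sorted(chosen)
```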

6.
The increasing gap between processor and memory speeds, as well as the introduction of multi-core CPUs, have exacerbated the dependency of CPU performance on the memory subsystem. This trend motivates the search for more efficient caching mechanisms, enabling both faster service of frequently used blocks and decreased power consumption. In this paper we describe a novel, random-sampling-based predictor that can distinguish transient cache insertions from non-transient ones. We show that this predictor can identify a small set of data-cache-resident blocks that service most of the memory references, thus serving as a building block for new cache designs and block replacement policies. Although we only discuss the L1 data cache, we have found this predictor to be efficient for L1 instruction caches and shared L2 caches as well.
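A hedged sketch of one way such a sampling predictor could be organized: a small random sample of inserted blocks is tracked, and observed reuse in the sample classifies future insertions as transient or not. The sampling rate, the thresholds, and the bookkeeping are assumptions; the paper's construction is not given in the abstract.

```python
import random

class TransiencePredictor:
    """Track a random sample of cache insertions and measure how often
    sampled blocks are re-referenced before eviction (illustrative)."""
    def __init__(self, sample_rate=0.01):
        self.sample_rate = sample_rate
        self.sampled = {}             # block address -> reused yet?
        self.evicted = self.reused = 0

    def on_insert(self, addr):
        if random.random() < self.sample_rate:
            self.sampled[addr] = False

    def on_hit(self, addr):
        if addr in self.sampled:
            self.sampled[addr] = True

    def on_evict(self, addr):
        if addr in self.sampled:
            self.evicted += 1
            if self.sampled.pop(addr):
                self.reused += 1

    def likely_transient(self):
        # Predict insertions transient when sampled blocks rarely see reuse.
        return self.evicted >= 32 and self.reused / self.evicted < 0.5
```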

7.
Write-invalidate protocols suffer from memory-access penalties due to coherence misses. While write-update or hybrid update/invalidate protocols can reduce coherence misses, the update traffic can increase memory-system contention. We show in this paper that update-based cache protocols can perform significantly better than write-invalidate protocols by incorporating a write cache in each processing node. Because it is legal to delay the propagation of modifications of a block until the next synchronization under relaxed memory consistency models, a write cache can significantly reduce traffic by exploiting locality in write accesses. By concentrating on a cache-coherent NUMA architecture, we study the implementation aspects of augmenting a write-invalidate, a write-update and two hybrid update/invalidate protocols with write caches. Through detailed architectural simulations using five benchmark programs, we find that write caches, with only a few blocks each, help write-invalidate protocols to cut the false-sharing miss rate and hybrid update/invalidate protocols to keep other copies, including the memory copy, clean at an acceptable write traffic level. Overall, the memory-access penalty associated with coherence misses is drastically reduced.
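A minimal sketch of the write-cache idea under release consistency: writes are coalesced locally and only propagated at a synchronization release. The flush-at-release structure follows the abstract; the buffer size and method names are illustrative assumptions.

```python
# Sketch: per-node write cache that coalesces writes until a release.
class WriteCache:
    def __init__(self, capacity=4, propagate=print):
        self.capacity = capacity      # the paper finds a few blocks suffice
        self.pending = {}             # block addr -> coalesced new data
        self.propagate = propagate    # sends one coherence update per block

    def write(self, addr, data):
        if addr not in self.pending and len(self.pending) == self.capacity:
            old_addr, old_data = self.pending.popitem()   # make room
            self.propagate(old_addr, old_data)
        self.pending[addr] = data     # later writes to addr coalesce here

    def release(self):
        # Release synchronization: all delayed writes must become visible.
        for addr, data in self.pending.items():
            self.propagate(addr, data)
        self.pending.clear()
```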

8.
Streaming-media cache proxy servers located between the Internet backbone and a common access network can cooperate with one another to improve the cache hit rate and maintain load balance. This paper proposes a tightly coupled multi-proxy cooperation mechanism with shared cache space, and presents a cache replacement policy and a load-balancing algorithm for cooperating proxies. NS2 simulations verify that the mechanism gives the system better overall performance.

9.
Energy consumption is one of the most significant aspects of large-scale storage systems, where multilevel caches are widely used. In a typical hierarchical storage structure, upper-level storage serves as a cache for the lower level, forming a distributed multilevel cache system. In the past two decades, several classic LRU-based multilevel cache policies have been proposed to improve the overall I/O performance of storage systems. However, few power-aware multilevel cache policies focus on the storage devices at the bottom level, which consume more than 27% of the energy of the whole system [1]. To address this problem, we propose a novel power-aware multilevel cache (PAM) policy that can reduce the energy consumption of high-performance, high-I/O-bandwidth storage devices. In our PAM policy, an appropriate number of cold dirty blocks in the upper-level cache are identified and flushed directly to the storage devices, which with high probability extends the time disks can remain in standby mode. To demonstrate the effectiveness of the proposed policy, we conduct several simulations with real-world traces. Compared to existing popular cache schemes such as PALRU, PB-LRU, and Demote, PAM reduces power consumption by up to 15% under different I/O workloads and improves energy efficiency by up to 50.5%.
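A hedged sketch of the core PAM step as described: identify cold dirty blocks in the upper-level cache and flush them straight to the storage devices so the disks can stay in standby longer afterwards. The coldness threshold, the block attributes, and the disk.write interface are assumptions for illustration.

```python
# Sketch: flush cold dirty blocks before the disk enters standby
# (illustrative; `disk.write` and the block fields are assumed).
import time

def flush_cold_dirty(cache, disk, idle_threshold=300.0):
    """cache: iterable of blocks with .dirty, .last_access, .addr, .data."""
    now = time.time()
    cold_dirty = [b for b in cache
                  if b.dirty and now - b.last_access > idle_threshold]
    # Writing these now avoids waking the disk later for a lazy write-back,
    # extending the expected standby period.
    for b in sorted(cold_dirty, key=lambda b: b.addr):  # near-sequential I/O
        disk.write(b.addr, b.data)
        b.dirty = False
```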

10.
A cache scheme based on an approximate LRU algorithm   (Cited by: 2; self-citations: 0, other citations: 2)
We propose an approximate LRU algorithm for managing blocks in an extension cache. Using this algorithm, we design an extension-cache scheme that filters LRU data blocks: the LRU-block-filtering cache (LBF cache). Simulation results show that the LBF cache outperforms extension caches of similar structure (such as the victim cache and the assist cache), and even improves on a direct-mapped cache of twice the capacity.

11.
Multiprocessors in which a shared bus is used by the processors to communicate with common memory are an emerging class of machines where there is a need to support parallel programming languages. A language construct found in a number of parallel programming languages to support synchronization and communication is the interprocess rendezvous. Shared-bus multiprocessors require a protocol to keep the data in their caches coherent. There are two major categories of these protocols: invalidation and write-broadcast. This paper examines the requirements for cache coherence protocols to support efficient interprocessor rendezvous. The approach taken is to examine the memory referencing patterns to the run-time data structures during rendezvous execution. The appropriate coherence protocol is shown to be a function of the processor scheduling strategy used by the run-time system at synchronization points during the rendezvous. When processes migrate freely as a result of the scheduling strategy, invalidation protocols are found to be more efficient. When migration is restricted by the scheduler, write-broadcast protocols are more efficient.

12.
In this paper, we explore two techniques for reducing memory latency in bus-based multiprocessors. The first one, designed for sector caches, is a snoopy cache coherence protocol that uses a large transfer block to take advantage of spatial locality, while using a small coherence block (called a subblock) to avoid false sharing. The second technique is read snarfing (or read broadcasting), in which all caches can acquire data transmitted in response to a read request to update invalid blocks in their own cache.

We evaluated the two techniques by simulating 6 applications that exhibit a variety of reference patterns. We compared the performance of the new protocol against that of the Illinois protocol with both small and large block sizes and found that it was effective in reducing memory latency and providing more consistent, good results than the Illinois protocol with a given line size. Read snarfing also improved performance mostly for protocols that use large line sizes.
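A minimal sketch of read snarfing as described: when a read response appears on the bus, a cache holding that block in an invalid state captures the data. The state names follow a generic MESI-like protocol; the class and method names are illustrative.

```python
# Sketch: snooping caches snarf bus read responses (illustrative).
class SnoopingCache:
    def __init__(self):
        self.lines = {}   # addr -> {'state': 'M'|'E'|'S'|'I', 'data': ...}

    def snoop_read_response(self, addr, data):
        line = self.lines.get(addr)
        # Read snarfing: a cache that still has a tag match for an
        # invalidated line refreshes it from the passing response.
        if line is not None and line['state'] == 'I':
            line['data'] = data
            line['state'] = 'S'   # now a shared, up-to-date copy
```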


13.
Performance evaluation of Web proxy cache replacement policies   (Cited by: 10; self-citations: 0, other citations: 10)
Martin, Rich, Tai. Performance Evaluation, 2000, 39(1-4): 149-164
The continued growth of the World-Wide Web and the emergence of new end-user technologies such as cable modems necessitate the use of proxy caches to reduce latency, network traffic and Web server loads. In this paper we analyze the importance of different Web proxy workload characteristics in making good cache replacement decisions. We evaluate workload characteristics such as object size, recency of reference, frequency of reference, and turnover in the active set of objects. Trace-driven simulation is used to evaluate the effectiveness of various replacement policies for Web proxy caches. The extended duration of the trace (117 million requests collected over 5 months) allows long term side effects of replacement policies to be identified and quantified.

Our results indicate that higher cache hit rates are achieved using size-based replacement policies. These policies store a large number of small objects in the cache, thus increasing the probability of an object being in the cache when requested. To achieve higher byte hit rates a few larger files must be retained in the cache. We found frequency-based policies to work best for this metric, as they keep the most popular files, regardless of size, in the cache. With either approach it is important that inactive objects be removed from the cache to prevent performance degradation due to pollution.
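The abstract contrasts size-based and frequency-based policies; here is a minimal sketch of a size-aware eviction rule in the spirit of the size-based family (the function and its simple largest-first heuristic are assumptions, not the specific policies the paper evaluates).

```python
# Sketch: size-aware eviction -- evict the largest objects first, which
# maximizes the count of small objects kept and thus the object hit rate.
def evict_until_fits(cache, need_bytes, capacity):
    """cache: dict url -> size_bytes. Frees space for a new object."""
    used = sum(cache.values())
    for url in sorted(cache, key=cache.get, reverse=True):  # largest first
        if used + need_bytes <= capacity:
            break
        used -= cache.pop(url)
    return cache
```

A frequency-based policy would instead rank objects by reference count, keeping the most popular files regardless of size, which favors byte hit rate over object hit rate.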


14.
Alpert, D.B., Flynn, M.J. IEEE Micro, 1988, 8(4): 44-54
Design trade-offs for integrated microprocessor caches are examined. A model of cache utilization is introduced to evaluate the effects of varying the block size on cache performance. By considering the overhead cost of storing address tags and replacement information along with data, it is found that large block sizes lead to more cost-effective cache designs than predicted by previous studies. When the overhead cost is high, caches that fetch only partial blocks on a miss perform better than similar caches that fetch entire blocks. This study indicates that lessons from mainframe and minicomputer design practice should be critically examined when applied to the design of microprocessors.
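To make the overhead argument concrete, a small worked example under assumed numbers (the cache size and per-line overhead bits are illustrative, not the paper's figures):

```python
# Worked example: fraction of per-line bits that hold data, as block size
# grows (illustrative numbers, not taken from the paper).
cache_bytes = 16 * 1024          # assumed 16 KiB direct-mapped cache
overhead_bits = 22 + 2           # assumed per-line tag + state bits

for block_bytes in (8, 16, 32, 64, 128):
    lines = cache_bytes // block_bytes
    data_bits = block_bytes * 8
    efficiency = data_bits / (data_bits + overhead_bits)
    print(f"{block_bytes:4d} B blocks: {lines:5d} lines, "
          f"{efficiency:.1%} of line bits hold data")
# Larger blocks amortize the fixed per-line overhead (72.7% data bits at
# 8 B vs. 97.7% at 128 B here), which is one reason the paper finds large
# blocks more cost-effective than earlier studies predicted.
```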

15.
熊劲, 李国杰. 《计算机学报》, 1994, 17(12): 922-929
In shared-memory multiprocessor systems, the performance of the memory subsystem is one of the keys to the performance of the whole system. Through address-trace-driven simulation, we compare the performance of two second-level cache schemes for shared-memory multiprocessors in terms of miss rate, average memory access time, and bus utilization, and also compare them against a configuration without a second-level cache.

16.
Bus-based multiprocessors constitute a cost-effective class of shared-memory multiprocessors. Private caches are the key to an efficient utilization of the shared bus, and most such systems use a write-invalidate cache-coherence protocol to keep the caches coherent. Two important factors that limit the performance of the system are cache misses that lead to long-latency reads and bus congestion because of read misses and coherence traffic. While hybrid write-invalidate/write-update snooping protocols lead to fewer read misses than write-invalidate protocols, previous studies have shown them to be incapable of providing consistent performance improvements because of heavily increased coherence traffic. In this paper, we analyze how the deficiencies of hybrid snooping protocols can be dramatically reduced by using write caches and read snarfing (also called read-broadcast) under release consistency. Our performance evaluation is based on program-driven simulation and a set of five scientific applications with different sharing behaviors including migratory sharing as well as producer–consumer sharing. We show that one of the evaluated hybrid protocols, extended with write caches as well as read snarfing, manages to reduce the number of coherence misses by between 83 and 93% as compared to a write-invalidate protocol for all five applications in this study. In addition, the number of bus transactions is reduced substantially. However, we also show that read snarfing and hybrid snooping protocols might lead to higher cache occupancy because of increased sharing. Because of the small implementation cost of the hybrid protocol and the two extensions, we believe the combination to be an effective approach to boosting the performance of bus-based multiprocessors.

17.
In this paper, we implement some notable hierarchical or decision-tree-based packet classification algorithms such as extended grid of tries (EGT), hierarchical intelligent cuttings (HiCuts), HyperCuts, and hierarchical binary search (HBS) on an IXP2400 network processor. By using all six of the available processing microengines (MEs), we find that none of these existing packet classification algorithms achieve the line speed of OC-48 provided by IXP2400. To improve the search speed of these packet classification algorithms, we propose the use of software cache designs to take advantage of the temporal locality of the packets, because IXP network processors have no built-in caches for fast path processing in MEs. Furthermore, we propose hint-based cache designs to reduce the search duration of the packet classification data structure when cache misses occur. Both the header and prefix caches are studied. Although the proposed cache schemes are designed for all the dimension-by-dimension packet classification schemes, they are, nonetheless, the most suitable for HBS. Our performance simulations show that the HBS enhanced with the proposed cache schemes performs the best in terms of classification speed and number of memory accesses when the memory requirement is in the same range as those of HiCuts and HyperCuts. Based on the experiments with all the high and low locality packet traces, five MEs are sufficient for the proposed rule cache with hints to achieve the line speed of OC-48 provided by IXP2400.
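The abstract describes header caches plus hints that shorten the tree search on a miss, without giving the structures; a heavily hedged sketch of that general shape follows (the coarse hint key, the FIFO eviction, and the tree.search API are all assumptions).

```python
# Sketch: packet classification with a flow cache and search hints
# (illustrative; not the paper's design).
class HintedClassifier:
    """Flow cache plus hint cache in front of a decision-tree classifier."""
    def __init__(self, tree, capacity=4096):
        self.tree = tree
        self.capacity = capacity
        self.flows = {}   # exact 5-tuple -> rule id (header cache)
        self.hints = {}   # coarse key -> subtree node where search resolved

    def classify(self, hdr):
        """hdr: (src_ip, dst_ip, src_port, dst_port, proto) tuple."""
        if hdr in self.flows:
            return self.flows[hdr]                 # header-cache hit
        coarse = (hdr[0] >> 16, hdr[4])            # assumed coarse hint key
        start = self.hints.get(coarse)             # None -> search from root
        rule, node = self.tree.search(hdr, start)  # shorter walk on hint hit
        if len(self.flows) >= self.capacity:
            self.flows.pop(next(iter(self.flows)))  # FIFO-style eviction
        self.flows[hdr] = rule
        self.hints[coarse] = node
        return rule
```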

18.
Directory-based cache coherence in large-scale multiprocessors   (Cited by: 1; self-citations: 0, other citations: 1)
The usefulness of shared-data caches in large-scale multiprocessors, the relative merits of different coherence schemes, and system-level methods for improving directory efficiency are addressed. The research presented is part of an effort to build a high-performance, large-scale multiprocessor. The various classes of cache directory schemes are described, and a method of measuring cache coherence is presented. The various directory schemes are analyzed, and ways of improving the performance of directories are considered. It is found that the best solutions to the cache-coherence problem result from a synergy between a multiprocessor's software and hardware components.

19.
Because of the cache pollution problem, traditional cache replacement policies controlled by hardware alone cannot achieve satisfactory cache utilization. To address this, EPIC introduces cache hints to assist in controlling cache replacement. This paper proposes a compiler-assisted cache replacement policy, Optimal Cache Partitioning (OCP). The OCP replacement policy simplifies cache behavior and cache miss analysis. Experimental results show that the OCP replacement policy effectively reduces the cache miss rate.

20.
For the particular architecture of distributed RAID, a cache module based on bus snooping is designed. Because bus snooping places too high a bandwidth demand on the shared network bus, the module adopts a block-mapped main-memory strategy, using less bandwidth and fewer data operations and thereby improving the performance of the distributed RAID system. The performance of the cache module design is analyzed, and solutions to the cache coherence problem in multiprocessor systems are analyzed and compared.
