首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Second-level buffer cache management   总被引:2,自引:0,他引:2  
Buffer caches are commonly used in servers to reduce the number of slow disk accesses or network messages. These buffer caches form a multilevel buffer cache hierarchy. In such a hierarchy, second-level buffer caches have different access patterns from first-level buffer caches because accesses to a second-level are actually misses from a first-level. Therefore, commonly used cache management algorithms such as the least recently used (LRU) replacement algorithm that work well for single-level buffer caches may not work well for second-level. We investigate multiple approaches to effectively manage second-level buffer caches. In particular, we report our research results in 1) second-level buffer cache access pattern characterization, 2) a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-level buffer caches, 3) a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second-level buffer cache hit ratios over corresponding local algorithms, and 4) implementation and evaluation of these algorithms in a real storage system connected with commercial database servers (Microsoft SQL server and Oracle) running industrial-strength online transaction processing benchmarks.  相似文献   

2.
An effectiveness-based adaptive cache replacement policy   总被引:1,自引:0,他引:1  
Belady’s optimal cache replacement policy is an algorithm to work out the theoretical minimum number of cache misses, but the rationale behind it was too simple. In this work, we revisit the essential function of caches to develop an underlying analytical model. We argue that frequency and recency are the only two affordable attributes of cache history that can be leveraged to predict a good replacement. Based on those two properties, we propose a novel replacement policy, the Effectiveness-Based Replacement policy (EBR) and a refinement, Dynamic EBR (D-EBR), which combines measures of recency and frequency to form a rank sequence inside each set and evict blocks with lowest rank. To evaluate our design, we simulated all 30 applications from SPEC CPU2006 for uni-core system and a set of combinations for 4-core systems, for different cache sizes. The results show that EBR achieves an average miss rate reduction of 12.4%. With the help of D-EBR, we can tune the weight ratio between ‘frequency’ and ‘recency’ dynamically. D-EBR can nearly double the miss reduction achieved by EBR alone. In terms of hardware overhead, EBR requires half the hardware overhead of real LRU and even compared with Pseudo LRU the overhead is modest.  相似文献   

3.
It is observed that the limited memory space of direct-mapped caches is not used in balance therefore incurs extra conflict misses. We propose a novel cache organization of a balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder through which the mapping of memory reference addresses to cache subarrays can be optimized hence conflict misses of direct-mapped caches can be resolved. The experimental results show that the miss rate of balanced cache is lower than that of the same sized two-way set-associative caches on average and can be as low as that of the same sized four-way set-associative caches for particular applications. Compared with previous techniques, the balanced cache requires only one cycle to access all cache hits and has the same access time as direct-mapped caches.  相似文献   

4.
There have been numerous techniques proposed in the literature that aim to improve the performance of cache memories by reducing cache conflicts. These techniques were proposed over the past decade and each proposal independently claimed to reduce conflict misses. However, because the published results used different benchmarks and different experimental setups, it is not easy to compare them. In this paper we report a side-by-side comparison of these techniques. We also evaluate the suitability of some of these techniques for caches with higher set associativities. In addition to evaluating techniques for their impact on cache misses and average memory access times, we also evaluate the techniques for their ability in reducing the non-uniformity of cache accesses.The conclusion of our work is that, each application may benefit from a different technique and no single scheme works universally well for all applications. We also observe that, for the majority of applications, XORing (XOR) and Odd-multiplier indexing schemes perform reasonably well. Among programmable associativity techniques, B-cache performs better than column-associative and adaptive-caches, but column-associative caches require very minimal extensions to hardware. Uniformity of cache accesses is improved most by B-cache technique while column-associative cache also improves cache access uniformities.Based on the observation that different techniques benefit different applications, we explored the use of multiple, programmable addressing mechanisms, each addressing scheme designed for a specific application. We include some preliminary data using multiple addressing schemes.  相似文献   

5.
We propose a simple solution to the problem of efficient stack evaluation of LRU multiprocessor cache memories with arbitrary set-associative mapping. It is an extension of the existing stack evaluation techniques for all set-associative LRU uniprocessor caches. Special marker entries are used in the stack to represent data blocks (or lines) deleted by an invalidation-based cache coherence protocol. A method of marker-splitting is employed when a data block below a marker in the stack is accessed. Using this technique, one-pass trace evaluation of memory access trace yields hit ratios for all cache sizes and set-associative mappings of multiprocessor caches in a single pass over a memory reference trace. Simulation experiments on some multiprocessor trace data show an order-of-magnitude speed-up in simulation time using this one-pass technique  相似文献   

6.
Given N request streams and L⩽N LRU caches, the cache assignment problem asks to which cache each stream should be assigned in order to minimize the overall miss rate. An efficient solution to this problem is provided, based on characterizing each stream using the stack reference model and characterizing the interaction of the streams using a bursty stream model. It is shown that for Bernoulli (purely random) mixing of streams, the optimal cache assignment is to have one cache per stream. In practice streams are mixed in a way that is much “burstier” than can be represented by the Bernoulli model. Therefore a method is presented for superposition of bursty streams. The performance of the methods developed for bursty stream superposition and cache assignment are tested using trace data obtained from the database system DB2. The resulting cache assignment recommendations are then applied to the DB2 system, and considerable performance improvement is found to result  相似文献   

7.
Dynamic Partitioning of Shared Cache Memory   总被引:6,自引:0,他引:6  
This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches.Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses.The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.  相似文献   

8.
传统的缓存替换算法由于不能适应应用程序的流式访问行为而导致缓存性能不佳.设计基于周期检测的预测方法,分析程序访存重用距离的规律性和流式访问的复杂性,提出用重用距离预测能同时适应简单流和复杂流访问模式的RDP算法.RDP的基本思想是预测重用距离并动态维护重用距离计数,动态调整缓存数据的替换顺序,通过流采样缩减存储开销.实验结果表明,RDP算法能够很好地适应程序中多样化的流访问模式,其总体性能优于LRU算法和DIP算法,在32MB缓存上比传统LRU算法平均减少了27.5%的缓存缺失.  相似文献   

9.
Wide-issue and high-frequency processors require not only a low-latency but also high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes on how to direct the memory references have been proposed in order to achieve a close match to the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration, thus suffer from access conflicts within one cache and unbalanced loads between the caches. It is observed in this paper that if one can group data references defined in a program into several regions (access regions) to allow parallel accesses, providing separate small caches – access region cache for these regions may prove to have better performance. A register-guided memory reference partitioning approach is proposed and it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as a basic guide for instruction steering. With the initial assignment to a specific access region cache per the base register name, a reassignment mechanism is applied to capture the access pattern when program is moving across its access regions. In addition, a distribution mechanism is introduced to adaptively enable access regions to extend or shrink among the physical caches to reduce potential conflicts further. The simulations of SPEC CPU2000 benchmarks have shown that the semantic-based scheme can reduce the conflicts effectively, and obtain considerable performance improvement in terms of IPC; with 8 access region caches, 25–33% higher IPC is achieved for integer benchmark programs than a comparable 8-banked cache, while the benefit is less for floating-point benchmark programs, 19% at most.  相似文献   

10.
To confer the robustness and high quality of service, modern computing architectures running real-time applications should provide high system performance and high timing predictability. Cache memory is used to improve performance by bridging the speed gap between the main memory and CPU. However, the cache introduces timing unpredictability creating serious challenges for real-time applications. Herein, we introduce a miss table (MT) based cache locking scheme at level-2 (L2) cache to further improve the timing predictability and system performance/power ratio. The MT holds information of block addresses related to the application being processed which cause most cache misses if not locked. Information in MT is used for efficient selection of the blocks to be locked and victim blocks to be replaced. This MT based approach improves timing predictability by locking important blocks with the highest number of misses inside the cache for the entire execution time. In addition, this technique decreases the average delay per task and total power consumption by reducing cache misses and avoiding unnecessary data transfers. This MT based solution is effective for both uniprocessors and multicores. We evaluate the proposed MT-based cache locking scheme by simulating an 8-core processor with 2 levels of caches using MPEG4 decoding, H.264/AVC decoding, FFT, and MI workloads. Experimental results show that in addition to improving the predictability, a reduction of 21% in mean delay per task and a reduction of 18% in total power consumption are achieved for MPEG4 (and H.264/AVC) by using MT and locking 25% of the L2. The MT results in about 5% delay and power reductions on these video applications, possibly more on applications with worse cache behavior. For the FFT and MI (and other) applications whose code fits inside the level-1 instruction (I1) cache, the mean delay per task increases only by 3% and total power consumption increases by 2% due to the addition of the MT.  相似文献   

11.
The modern chip multiprocessors are vulnerable to transient faults caused by either on-purpose attacks or system mistakes, especially for those with large and multi-level caches in cloud servers. In this paper, we propose a modified/shared replication cache to keep a redundancy for the latest accessed and modified/shared L2 cache lines. According to the experiments based on Multi2Sim, this cache with proper size can provide considerable data reliability. In addition, the cache can reduce the average latency of memory hierarchy for error correction, with only about 20.2% of L2 cache energy cost and 2% of L2 cache silicon overhead.  相似文献   

12.
The memory behavior of cache oblivious stencil computations   总被引:1,自引:0,他引:1  
We present and evaluate a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an “ideal cache” of size Z, our algorithm saves a factor of Θ(Z 1/n ) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy. We evaluate our algorithm in terms of the number of cache misses, and demonstrate that the memory behavior agrees with our theoretical predictions. Our experimental evaluation is based on a finite-difference solution of a heat diffusion problem, as well as a Gauss-Seidel iteration and a 2-dimensional LBMHD program, both reformulated as cache oblivious stencil computations. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.  相似文献   

13.
The L1 data cache is one of the most frequently accessed structures in the processor. Because of this and its moderate size it is a major consumer of power. In order to reduce its power consumption, in this paper a small filter structure that exploits the special features of the references to the stack region is proposed. This filter, which acts as a top – non-inclusive – level of the data memory hierarchy, consists of a register set that keeps the data stored in the neighborhood of the top of the stack. Our simulation results show that using a small Stack Filter (SF) of just a few registers, 10–25% data cache power savings can be achieved on average, with a negligible performance penalty.  相似文献   

14.
In embedded systems caches are very precious for keeping low the memory bandwidth and to allow employing slow and narrow off-chip devices. Conversely, the power and die size resources consumed by the cache force the embedded system designers to use small and simple cache memories. This kind of caches can experience poor performance because of their not flexible placement policy. In this scenario, a big fraction of the misses can originate from the mismatch between the cache behavior and the memory accesses' locality features (conflict misses).In this paper we analyze the conflict miss phenomenon and define a cache utilization measure. Then we propose an object level Cache Aware allocation Technique (CAT) to transform the application to fit the cache structure, minimize the number of conflict misses and maximize cache exploitation. The solution transforms the program layout using the standard functionalities of a linker.The CAT approach allowed the considered applications to deliver the same performance on two times and sometimes four times smaller caches. Moreover the CAT improved programs on direct-mapped caches outperformed the original versions on set-associative caches. In this way, the results highlight that our approach can help embedded system designers to meet the system requirements with smaller and simpler cache memories.  相似文献   

15.
This paper proposes a novel contribution in Web caching area, especially in Web cache replacement, so-called intelligent client-side Web caching scheme (ICWCS). This approach is developed by splitting the client-side cache into two caches: short-term cache that receives the Web objects from the Internet directly, and long-term cache that receives the Web objects from the short-term cache. The objects in short-term cache are removed by least recently used (LRU) algorithm as short-term cache is full. More significantly, when the long-term cache saturates, the neuro-fuzzy system is employed efficiently in managing contents of the long-term cache. The proposed solution is validated by implementing trace-driven simulation and the results are compared with least recently used (LRU) and least frequently used (LFU) algorithms; the most common policies of evaluating Web caching performance. The simulation results have revealed that the proposed approach improves the performance of Web caching in terms of hit ratio (HR), up to 14.8% and 17.9% over LRU and LFU. In terms of byte hit ratio (BHR), the Web caching performance is improved up to 2.57% and 26.25%, and for latency saving ratio (LSR), the performance is better with 8.3% and 18.9% over LRU and LFU, respectively.  相似文献   

16.
On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper will show that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By parallel computation of cache addresses and memory addresses, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in a cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to a same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost.  相似文献   

17.
Main memory cache performance continues to play an important role in determining the overall performance of object-oriented, object-relational and XML databases. An effective method of improving main memory cache performance is to prefetch or pre-load pages in advance to their usage, in anticipation of main memory cache misses. In this paper we describe a framework for creating prefetching algorithms with the novel features of path and cache consciousness. Path consciousness refers to the use of short sequences of object references at key points in the reference trace to identify paths of navigation. Cache consciousness refers to the use of historical page access knowledge to guess which pages are likely to be main memory cache resident most of the time and then assumes these pages do not exist in the context of prefetching. We have conducted a number of experiments comparing our approach against four highly competitive prefetching algorithms. The results shows our approach outperforms existing prefetching techniques in some situations while performing worse in others. We provide guidelines as to when our algorithm should be used and when others maybe more desirable.  相似文献   

18.
The realization of modern processors is based on a multicore architecture with increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, last level cache is shared among several or all cores (e.g., L3 cache) and each core possesses private low level caches (e.g., L1 and L2 caches). Superlinear speedup is possible for matrix multiplication algorithm executed in a shared memory multiprocessor due to the existence of a superlinear region. It is a region where cache requirements for matrix storage of the sequential execution incur more cache misses than in parallel execution. This paper shows theoretically and experimentally that there is a region, where the superlinear speedup can be achieved. We provide a theoretical proof of existence of a superlinear speedup and determine boundaries of the region where it can be achieved. The experiments confirm our theoretical results. Therefore, these results will have impact on future software development and exploitation of parallel hardware on the basis of a shared memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

19.
Finding the best data layout has been an ultimate goal of memory optimization. Even with data access profile, heuristic algorithms are needed to reorganize data layout for better locality. The best layout could be found by running the given application with all possible data layouts and selecting the best performing layout. This approach, however, can incur too much overhead, particulary when the number of possible layouts are too many. In this paper, we present a composition-based cache simulation for structure reorganization. Instead of running all possible layouts, we simulate only the primary subsets of layouts and compose the cache misses for all layouts by summing up the cache misses of component subsets. Our experiment with the composition-based cache simulation shows that the differences in the cache misses are within 10% of the full cache simulation for 4-way and 8-way set associative caches. In addition to the cache miss estimation, our heuristic algorithm takes account of the extra instruction overhead incurred by structure reorganization. Our experiment with several structure intensive benchmarks shows the 37% reduction in the L1D read misses and the 28% reduction in the L2 read misses. As a result, the execution times are also reduced by 19% on average.  相似文献   

20.
Resizable caches can trade-off capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have demonstrated their ability to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, thus, offering even more opportunities to dynamically adjust cache resources to the workload.In this paper, we demonstrate that the preferred control methodology for data cache reconfiguring in a SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically; the former minimizes the arithmetic mean cache access time (which we will call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies with the degree of multi-threading.We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. We show that the HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one and two thread workloads. Moreover, HAMAT achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one, two, and four thread workloads, respectively, over the best fixed-configuration cache design.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号