Similar Documents
20 similar documents retrieved (search time: 78 ms)
1.
It is observed that the limited memory space of direct-mapped caches is not used in a balanced way and therefore incurs extra conflict misses. We propose a novel cache organization, the balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder, through which the mapping of memory reference addresses to cache subarrays can be optimized and hence conflict misses of direct-mapped caches can be resolved. The experimental results show that the miss rate of the balanced cache is lower on average than that of same-sized two-way set-associative caches and can be as low as that of same-sized four-way set-associative caches for particular applications. Compared with previous techniques, the balanced cache requires only one cycle to serve all cache hits and has the same access time as direct-mapped caches.
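The abstract describes the mechanism only at a high level; below is a minimal Python sketch of the general idea of a programmable subarray decoder that remaps which subarray a direct-mapped lookup uses, with a toy rebalancing policy. All class names, sizes, and the rebalancing rule are illustrative assumptions, not the paper's hardware design.

```python
# Minimal sketch (illustrative, not the paper's hardware): a direct-mapped cache whose sets are
# grouped into subarrays, with a small programmable table that remaps an address's natural
# subarray to a possibly different one so that heavily used groups land on idle subarrays.

class BalancedDMCache:
    def __init__(self, num_sets=256, num_subarrays=8, block_bytes=64):
        self.block_bytes = block_bytes
        self.num_sets = num_sets
        self.num_subarrays = num_subarrays
        self.sets_per_subarray = num_sets // num_subarrays
        self.tags = [None] * num_sets                      # one block per set (direct-mapped)
        self.decoder = list(range(num_subarrays))          # programmable subarray decoder (identity at start)
        self.subarray_refs = [0] * num_subarrays           # usage counters driving rebalancing

    def _index(self, addr):
        block = addr // self.block_bytes
        natural_set = block % self.num_sets
        natural_sub = natural_set // self.sets_per_subarray
        offset_in_sub = natural_set % self.sets_per_subarray
        mapped_sub = self.decoder[natural_sub]             # remapping applied here
        return mapped_sub * self.sets_per_subarray + offset_in_sub, block

    def access(self, addr):
        set_idx, block = self._index(addr)
        self.subarray_refs[set_idx // self.sets_per_subarray] += 1
        hit = self.tags[set_idx] == block
        if not hit:
            self.tags[set_idx] = block                     # direct-mapped fill
        return hit

    def rebalance(self):
        # Toy policy: steer the busiest subarray's traffic onto the least-used one by swapping
        # their decoder entries. A real design must also handle relocating or invalidating the
        # blocks affected by the remapping; storing full block numbers as tags keeps this sketch
        # correct (no false hits), at the cost of losing the displaced contents.
        busiest = max(range(self.num_subarrays), key=lambda s: self.subarray_refs[s])
        idlest = min(range(self.num_subarrays), key=lambda s: self.subarray_refs[s])
        i, j = self.decoder.index(busiest), self.decoder.index(idlest)
        self.decoder[i], self.decoder[j] = self.decoder[j], self.decoder[i]
        self.subarray_refs = [0] * self.num_subarrays
```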

2.
Second-level buffer cache management
Buffer caches are commonly used in servers to reduce the number of slow disk accesses or network messages. These buffer caches form a multilevel buffer cache hierarchy. In such a hierarchy, second-level buffer caches have different access patterns from first-level buffer caches because accesses to a second-level cache are actually misses from a first-level cache. Therefore, commonly used cache management algorithms, such as the least recently used (LRU) replacement algorithm, that work well for single-level buffer caches may not work well for second-level caches. We investigate multiple approaches to effectively manage second-level buffer caches. In particular, we report our research results in 1) second-level buffer cache access pattern characterization, 2) a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-level buffer caches, 3) a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second-level buffer cache hit ratios over corresponding local algorithms, and 4) implementation and evaluation of these algorithms in a real storage system connected with commercial database servers (Microsoft SQL Server and Oracle) running industrial-strength online transaction processing benchmarks.
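As a rough illustration of a multi-queue style policy, the sketch below keeps several LRU queues indexed by the logarithm of a block's access frequency and lets idle blocks drift back down over time. The parameters, the simplified demotion rule, and the omission of the history queue are assumptions, not the published MQ algorithm.

```python
# Simplified multi-queue replacement sketch (illustrative parameters, no history queue).
from collections import OrderedDict
import math

class MultiQueueCache:
    def __init__(self, capacity, num_queues=8, lifetime=100):
        self.capacity = capacity
        self.lifetime = lifetime                                   # idle time before demotion
        self.queues = [OrderedDict() for _ in range(num_queues)]   # block -> (freq, expire_time)
        self.time = 0

    def _queue_of(self, freq):
        return min(int(math.log2(freq)), len(self.queues) - 1)

    def _size(self):
        return sum(len(q) for q in self.queues)

    def _demote_expired(self):
        # Blocks untouched for `lifetime` ticks drift one queue lower.
        for level in range(len(self.queues) - 1, 0, -1):
            q = self.queues[level]
            for blk in list(q):
                freq, expire = q[blk]
                if expire < self.time:
                    del q[blk]
                    self.queues[level - 1][blk] = (freq, self.time + self.lifetime)

    def access(self, blk):
        self.time += 1
        self._demote_expired()
        for q in self.queues:
            if blk in q:                                   # hit: promote by access frequency
                freq, _ = q.pop(blk)
                freq += 1
                self.queues[self._queue_of(freq)][blk] = (freq, self.time + self.lifetime)
                return True
        if self._size() >= self.capacity:                  # miss: evict LRU block of lowest queue
            for q in self.queues:
                if q:
                    q.popitem(last=False)
                    break
        self.queues[0][blk] = (1, self.time + self.lifetime)
        return False
```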

3.
Wide-issue and high-frequency processors require not only a low-latency but also a high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for the L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes for directing memory references have been proposed in order to achieve a close match to the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration and thus suffer from access conflicts within one cache and unbalanced loads between the caches. It is observed in this paper that if one can group the data references of a program into several regions (access regions) to allow parallel accesses, providing separate small caches (access region caches) for these regions may yield better performance. A register-guided memory reference partitioning approach is proposed; it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as the basic guide for instruction steering. After the initial assignment to a specific access region cache per base register name, a reassignment mechanism is applied to capture the access pattern as the program moves across its access regions. In addition, a distribution mechanism is introduced that adaptively allows access regions to extend or shrink among the physical caches to further reduce potential conflicts. Simulations of the SPEC CPU2000 benchmarks show that the semantic-based scheme can reduce conflicts effectively and obtain considerable performance improvement in terms of IPC; with 8 access region caches, 25–33% higher IPC is achieved for integer benchmark programs than with a comparable 8-banked cache, while the benefit is smaller for floating-point benchmark programs, at most 19%.
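A minimal sketch of the steering idea, assuming a fixed number of access region caches and a toy conflict-driven reassignment rule; the names and thresholds are illustrative, not the paper's mechanism.

```python
# Illustrative sketch: memory references are steered to one of several small "access region
# caches" using the base register number of the load/store, with a small table that can be
# reassigned when the observed conflicts suggest the access pattern has shifted.

NUM_REGION_CACHES = 8

class RegionSteering:
    def __init__(self):
        self.assign = {}                      # base register -> region cache id
        self.conflicts = [0] * NUM_REGION_CACHES

    def steer(self, base_reg):
        # Initial assignment: base register r maps to cache r mod N.
        if base_reg not in self.assign:
            self.assign[base_reg] = base_reg % NUM_REGION_CACHES
        return self.assign[base_reg]

    def report_conflict(self, cache_id):
        self.conflicts[cache_id] += 1

    def maybe_reassign(self, base_reg, threshold=1000):
        # Toy reassignment: if the register's current cache sees many conflicts,
        # move its references to the currently least-conflicted cache.
        cur = self.steer(base_reg)
        if self.conflicts[cur] > threshold:
            target = min(range(NUM_REGION_CACHES), key=lambda c: self.conflicts[c])
            self.assign[base_reg] = target
```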

4.
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.

5.
On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called the one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper shows that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By computing cache addresses and memory addresses in parallel, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in the cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to the same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost.
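The sketch below only illustrates why a non-power-of-two (e.g., prime) number of sets spreads a strided reference stream across the cache; the paper's actual contribution, computing such a mapping cheaply with one's complement arithmetic in parallel with the address calculation, is not modeled here.

```python
# Minimal sketch: compare how a power-of-two set count and a prime set count spread a
# strided access pattern over cache sets under simple modulo indexing.
from collections import Counter

def set_distribution(addresses, num_sets, block_bytes=32):
    """Map each address to a set index and count how often each set is used."""
    return Counter((addr // block_bytes) % num_sets for addr in addresses)

# A loop walking a matrix column often produces a large power-of-two stride.
stride = 4096
addresses = [i * stride for i in range(1024)]

pow2_sets = set_distribution(addresses, num_sets=128)   # power of two: everything collides
prime_sets = set_distribution(addresses, num_sets=127)  # prime: references spread out

print("sets touched with 128 sets:", len(pow2_sets))    # expected: 1
print("sets touched with 127 sets:", len(prime_sets))   # expected: 127
```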

6.
Aggressive prefetching mechanisms improve the performance of some important applications, but substantially increase bus traffic and “pressure” on cache tag arrays. They may even reduce the performance of applications that are not memory bound. We introduce a “feedback” mechanism, termed the Prefetcher Assessment Buffer (PAB), which filters out requests that are unlikely to be useful. With this, applications that cannot benefit from aggressive prefetching do not suffer from its side-effects. The PAB is evaluated with different configurations, e.g., “all L1 accesses trigger prefetches” and “only misses to L1 trigger prefetches”. When compared with the non-selective concurrent use of multiple prefetchers, the PAB’s application to prefetching from main memory to the L2 cache can reduce the number of loads from main memory by up to 25% without losing performance. Applying more sophisticated techniques to prefetches between the L2 and L1 caches can increase IPC by 4% while reducing the traffic between the caches 8-fold.
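A hedged sketch of a feedback filter in the spirit of the PAB: it remembers recently issued prefetch addresses per prefetcher, credits a prefetcher when a demand access matches one of them, and suppresses prefetchers whose measured accuracy falls below a threshold. The buffer size, warm-up period, and accuracy threshold are assumptions, not the paper's design.

```python
# Sketch of a prefetch-usefulness filter (structure and thresholds are assumptions).
from collections import deque

class PrefetcherAssessmentBuffer:
    def __init__(self, capacity=256, min_accuracy=0.25, warmup=64):
        self.capacity = capacity
        self.min_accuracy = min_accuracy
        self.warmup = warmup
        self.pending = deque()                 # (prefetcher_id, block_addr) in issue order
        self.pending_set = {}                  # block_addr -> prefetcher_id
        self.issued = {}                       # prefetcher_id -> assessed prefetch count
        self.useful = {}                       # prefetcher_id -> prefetches later demanded

    def _accuracy(self, pid):
        issued = self.issued.get(pid, 0)
        if issued < self.warmup:
            return 1.0                         # grace period: let a new prefetcher prove itself
        return self.useful.get(pid, 0) / issued

    def filter_prefetch(self, pid, block_addr):
        """Return True if this prefetch should actually be sent to memory."""
        if len(self.pending) >= self.capacity: # age out the oldest tracked prefetch
            _, old_addr = self.pending.popleft()
            self.pending_set.pop(old_addr, None)
        # Track the candidate even if it is filtered, so an improving prefetcher regains trust.
        self.pending.append((pid, block_addr))
        self.pending_set[block_addr] = pid
        self.issued[pid] = self.issued.get(pid, 0) + 1
        return self._accuracy(pid) >= self.min_accuracy

    def on_demand_access(self, block_addr):
        """Called for every demand access; credits the prefetcher that predicted it."""
        pid = self.pending_set.pop(block_addr, None)
        if pid is not None:
            self.useful[pid] = self.useful.get(pid, 0) + 1
```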

7.
The performance loss resulting from different cache misses is variable in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the latency toleration ability of processor cor...

8.
Recently, EDRAM cells have gained much attention as a promising alternative for constructing on-chip memories. However, due to the inherent characteristics of DRAM cells, they need to be refreshed periodically, causing a huge refresh energy burden. In particular, employing EDRAM cells in large-scale last-level caches makes the refresh burden much higher due to their large capacity. In this paper, we propose a selective fine-grain round-robin refresh scheme for both performance improvement and refresh energy reduction. To reduce bank conflicts between normal cache accesses and refresh operations, we employ a refresh scheme which refreshes cache lines in a bank-wise round-robin fashion. We also apply a selective refresh that depends on the inclusion information in the cache hierarchy. For data which reside in both the LLC and an upper-level cache (i.e., the L2 cache), accesses are filtered by the upper-level cache. Based on this insight, we skip the refresh of a cache block in the EDRAM-based LLC that also exists in the upper-level caches. By doing so, we can reduce unnecessary refresh operations in EDRAM-based LLCs. According to our evaluation, our proposed scheme improves performance by 7.3% and reduces energy per instruction by 13.3% compared to the baseline all-bank refresh scheme.
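A small sketch of the refresh policy as described: banks are visited round-robin, one line per refresh opportunity, and a line is skipped when inclusion information says an upper-level cache also holds it. The interface (an inclusion callback) and the sizes are assumptions.

```python
# Sketch of bank-wise round-robin refresh with inclusion-based skipping (names are assumptions).

class SelectiveRoundRobinRefresh:
    def __init__(self, num_banks, lines_per_bank, in_upper_level):
        self.num_banks = num_banks
        self.lines_per_bank = lines_per_bank
        self.in_upper_level = in_upper_level    # callback: (bank, line) -> bool, from inclusion info
        self.bank_ptr = 0                       # which bank refreshes next (round-robin)
        self.line_ptr = [0] * num_banks         # next line to refresh within each bank
        self.skipped = 0
        self.refreshed = 0

    def refresh_tick(self):
        """One refresh opportunity: visit the next line of the next bank in round-robin order."""
        bank = self.bank_ptr
        line = self.line_ptr[bank]
        if self.in_upper_level(bank, line):
            self.skipped += 1                   # an upper-level copy filters accesses; skip refresh
        else:
            self.refreshed += 1                 # issue the actual eDRAM refresh for (bank, line)
        self.line_ptr[bank] = (line + 1) % self.lines_per_bank
        self.bank_ptr = (bank + 1) % self.num_banks
        return bank, line

# Toy usage with a dummy inclusion map.
rr = SelectiveRoundRobinRefresh(num_banks=8, lines_per_bank=4096,
                                in_upper_level=lambda b, l: (b + l) % 4 == 0)
for _ in range(32):
    rr.refresh_tick()
print(rr.refreshed, rr.skipped)
```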

9.
The sharing of caches among proxies is an important technique to reduce Web traffic, alleviate network bottlenecks, and improve the response time of document requests. Most existing work on cooperative caching has focused on serving misses collaboratively. Very few studies have examined the effect of cooperation on document placement schemes and its potential enhancements to cache hit ratio and latency reduction. We propose a new document placement scheme which takes into account the contention at individual caches in order to limit the replication of documents within a cache group and increase the document hit ratio. The main idea of this new scheme is to view the aggregate disk space of the cache group as a global resource of the group and to use the concept of cache expiration age to measure the contention of individual caches. The decision of whether to cache a document at a proxy is made collectively among the caches that already have a copy of this document. We refer to this new document placement scheme as the Expiration Age-based scheme (EA scheme). The EA scheme effectively reduces the replication of documents across the cache group, while ensuring that a copy of the document always resides in a cache where it is likely to stay for the longest time. We report our study on the potentials and limits of the EA scheme using both analytic modeling and trace-based simulation. The analytical model compares and contrasts the existing (ad hoc) placement scheme of cooperative proxy caches with our new EA scheme and indicates that the EA scheme improves the effectiveness of aggregate disk usage, thereby increasing the average time duration for which documents stay in the cache. The trace-based simulations show that the EA scheme yields higher hit rates and better response times compared to the existing document placement schemes used in most caching proxies.
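The following sketch illustrates one plausible reading of the EA placement rule: each proxy estimates its cache expiration age from the lifetimes of evicted documents, and a new replica is created only if the requesting proxy would likely keep the document longer than the proxies that already hold it. The formula and decision rule are assumptions, not the paper's exact model.

```python
# Hedged sketch of an expiration-age-based placement decision (illustrative, not the paper's model).
import time

class ProxyCache:
    def __init__(self, name):
        self.name = name
        self.docs = {}                       # url -> insertion time
        self.eviction_ages = []              # observed lifetimes of evicted documents

    def cache(self, url):
        self.docs[url] = time.time()

    def evict(self, url):
        age = time.time() - self.docs.pop(url)
        self.eviction_ages.append(age)

    def expiration_age(self):
        """Average time a document stayed cached before eviction (high = low contention)."""
        if not self.eviction_ages:
            return float("inf")              # an uncontended cache keeps documents "forever"
        return sum(self.eviction_ages) / len(self.eviction_ages)

def should_cache_copy(requesting_proxy, holders):
    """Collective placement decision among the proxies that already hold the document."""
    if not holders:
        return True                          # nobody has it: always cache the first copy
    # Only replicate if the requester would likely keep it longer than every current holder.
    return requesting_proxy.expiration_age() > max(h.expiration_age() for h in holders)
```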

10.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. The SLC has the low power consumption and low access delay of a direct-mapped cache and, by adding a new link field, a cache hit rate similar to that of a two-way set-associative cache. In addition, we developed another design known as multiple linked caches (MLC) to further reduce the power consumption of each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption of each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access memory to obtain the requested instruction directly if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, a traditional cache, and a filter cache in terms of cache hit rate, power consumption, and execution time.
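A hypothetical sketch of a single-linked lookup in the spirit of column-associative designs: each line carries a link field naming one alternate line that is probed on a primary miss. The alternate-placement rule here is an assumption, and the MLC/BTB extensions described above are not modeled.

```python
# Illustrative sketch: a direct-mapped instruction cache whose lines carry a link field pointing
# to one alternate line, so a primary-location miss gets a second chance before going to memory.

class SingleLinkedCache:
    def __init__(self, num_lines=256):
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.links = [None] * num_lines      # link field: alternate line index, or None

    def _primary(self, block):
        return block % self.num_lines

    def access(self, block):
        idx = self._primary(block)
        if self.tags[idx] == block:
            return "primary hit"
        alt = self.links[idx]
        if alt is not None and self.tags[alt] == block:
            return "linked hit"              # second probe through the link field
        # Miss: keep the displaced (possibly frequently reused) victim reachable by moving it
        # to an assumed alternate line and recording that line in the link field.
        victim = self.tags[idx]
        if victim is not None:
            alt = (idx + self.num_lines // 2) % self.num_lines
            self.tags[alt] = victim
            self.links[idx] = alt
        self.tags[idx] = block
        return "miss"
```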

11.
Write-invalidate protocols suffer from memory-access penalties due to coherence misses. While write-update or hybrid update/invalidate protocols can reduce coherence misses, the update traffic can increase memory-system contention. We show in this paper that update-based cache protocols can perform significantly better than write-invalidate protocols by incorporating a write cache in each processing node. Because it is legal to delay the propagation of modifications of a block until the next synchronization under relaxed memory consistency models, a write cache can significantly reduce traffic by exploiting locality in write accesses. By concentrating on a cache-coherent NUMA architecture, we study the implementation aspects of augmenting a write-invalidate, a write-update and two hybrid update/invalidate protocols with write caches. Through detailed architectural simulations using five benchmark programs, we find that write caches, with only a few blocks each, help write-invalidate protocols to cut the false-sharing miss rate and hybrid update/invalidate protocols to keep other copies, including the memory copy, clean at an acceptable write traffic level. Overall, the memory-access penalty associated with coherence misses is drastically reduced.

12.
Traditional cache replacement algorithms deliver poor cache performance because they cannot adapt to the streaming access behavior of applications. We design a prediction method based on period detection, analyze the regularity of memory-reuse distances in programs and the complexity of streaming accesses, and propose RDP, an algorithm that uses reuse-distance prediction to adapt to both simple and complex streaming access patterns. The basic idea of RDP is to predict reuse distances, dynamically maintain reuse-distance counters, and dynamically adjust the replacement order of cached data, while reducing storage overhead through stream sampling. Experimental results show that RDP adapts well to the diverse streaming access patterns found in programs; its overall performance is better than that of the LRU and DIP algorithms, and on a 32 MB cache it reduces cache misses by 27.5% on average compared with traditional LRU.
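A simplified sketch of a reuse-distance-driven replacement decision: the predictor assumes the next reuse distance repeats the last observed one and evicts the resident block whose predicted next use is farthest away, so blocks touched only once (streaming data) go first. The period-detection and stream-sampling mechanisms of RDP are not modeled here.

```python
# Hedged sketch of reuse-distance-based replacement (a simplification, not the RDP design).

class ReuseDistancePredictor:
    def __init__(self, capacity):
        self.capacity = capacity
        self.time = 0
        self.last_access = {}        # block -> time of previous access
        self.predicted = {}          # block -> predicted reuse distance (last observed distance)
        self.resident = set()

    def access(self, block):
        self.time += 1
        if block in self.last_access:
            # Predict that the next reuse distance repeats the last observed one.
            self.predicted[block] = self.time - self.last_access[block]
        self.last_access[block] = self.time
        hit = block in self.resident
        if not hit:
            if len(self.resident) >= self.capacity:
                self.resident.remove(self._victim())
            self.resident.add(block)
        return hit

    def _victim(self):
        # Evict the resident block expected to be reused farthest in the future; blocks with no
        # prediction yet (likely streaming data seen once) are treated as never reused.
        def expected_next_use(b):
            return self.last_access[b] + self.predicted.get(b, float("inf"))
        return max(self.resident, key=expected_next_use)
```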

13.
In the race to improve cache performance, many researchers have proposed schemes that increase a cache's associativity. The associativity of a cache is the number of places in the cache where a block may reside. In a direct-mapped cache, which has an associativity of 1, there is only one location to search for a match for each reference. In a cache with associativity n (an n-way set-associative cache), there are n locations. Increasing associativity reduces the miss rate by decreasing the number of conflict, or interference, references. The column-associative cache and the predictive sequential associative cache seem to have achieved near-optimal performance for an associativity of two. Increasing associativity beyond two, therefore, is one of the most important ways to further improve cache performance. We propose two schemes for implementing associativity greater than two: the sequential multicolumn cache, which is an extension of the column-associative cache, and the parallel multicolumn cache. For an associativity of four, they achieve the low miss rate of a four-way set-associative cache. Our simulation results show that both schemes can effectively reduce the average access time.
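A minimal sketch of a sequential multicolumn-style lookup: up to four alternative locations of a direct-mapped array are probed one after another, approximating a four-way hit rate at the cost of extra probes. The index functions and fill policy are assumptions; real designs add swapping or way prediction so that most hits complete on the first probe.

```python
# Illustrative sketch of sequential probing over multiple "columns" of one direct-mapped array.

class SequentialMulticolumnCache:
    def __init__(self, num_lines=1024, columns=4):
        self.num_lines = num_lines
        self.columns = columns
        self.tags = [None] * num_lines

    def _locations(self, block):
        base = block % self.num_lines
        step = self.num_lines // self.columns
        # Column i rehashes the reference into a different quarter of the array (assumed hash).
        return [(base + i * step) % self.num_lines for i in range(self.columns)]

    def access(self, block):
        locs = self._locations(block)
        for probe, idx in enumerate(locs):
            if self.tags[idx] == block:
                return probe + 1                    # hit after `probe + 1` sequential probes
        # Miss: fill the first free alternative, otherwise replace the first-column line.
        for idx in locs:
            if self.tags[idx] is None:
                self.tags[idx] = block
                return 0
        self.tags[locs[0]] = block
        return 0
```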

14.
In this paper, we implement some notable hierarchical or decision-tree-based packet classification algorithms such as extended grid of tries (EGT), hierarchical intelligent cuttings (HiCuts), HyperCuts, and hierarchical binary search (HBS) on an IXP2400 network processor. By using all six of the available processing microengines (MEs), we find that none of these existing packet classification algorithms achieve the line speed of OC-48 provided by IXP2400. To improve the search speed of these packet classification algorithms, we propose the use of software cache designs to take advantage of the temporal locality of the packets because IXP network processors have no built-in caches for fast path processing in MEs. Furthermore, we propose hint-based cache designs to reduce the search duration of the packet classification data structure when cache misses occur. Both the header and prefix caches are studied. Although the proposed cache schemes are designed for all the dimension-by-dimension packet classification schemes, they are, nonetheless, the most suitable for HBS. Our performance simulations show that the HBS enhanced with the proposed cache schemes performs the best in terms of classification speed and number of memory accesses when the memory requirement is in the same range as those of HiCuts and HyperCuts. Based on the experiments with all the high and low locality packet traces, five MEs are sufficient for the proposed rule cache with hints to achieve the line speed of OC-48 provided by IXP2400.
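A hedged sketch of the software caching idea (not the IXP2400 microengine code): an exact-match header cache sits in front of the classifier, and a coarser hint cache remembers where in the classification structure a related search ended, so a header-cache miss can resume part-way down instead of from the root. The coarse key, the wholesale eviction, and the dummy classifier are assumptions.

```python
# Sketch of a header cache plus hint cache in front of a packet classifier (names are assumptions).

class HintedClassifierCache:
    def __init__(self, classify, capacity=4096):
        self.classify = classify              # classify(header, hint) -> (rule, new_hint)
        self.capacity = capacity
        self.header_cache = {}                # exact 5-tuple -> matched rule
        self.hint_cache = {}                  # coarse key -> search hint (e.g. a subtree node)

    @staticmethod
    def _coarse_key(header):
        src, dst, sport, dport, proto = header
        return dst & 0xFFFFFF00               # assumption: one hint per destination /24

    def lookup(self, header):
        rule = self.header_cache.get(header)
        if rule is not None:
            return rule                       # header-cache hit: no classifier walk at all
        hint = self.hint_cache.get(self._coarse_key(header))
        rule, new_hint = self.classify(header, hint)
        if len(self.header_cache) >= self.capacity:
            self.header_cache.clear()         # toy wholesale eviction; a real cache evicts per entry
        self.header_cache[header] = rule
        self.hint_cache[self._coarse_key(header)] = new_hint
        return rule

# Dummy classifier for illustration only: the "rule" is the destination /24 and the hint is unused.
cache = HintedClassifierCache(lambda h, hint: (h[1] >> 8, h[1] >> 8))
print(cache.lookup((0x0A000001, 0xC0A80101, 1234, 80, 6)))   # full classification on first packet
print(cache.lookup((0x0A000001, 0xC0A80101, 1234, 80, 6)))   # served from the header cache
```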

15.
In embedded systems, caches are precious for keeping memory bandwidth requirements low and for allowing the use of slow and narrow off-chip devices. Conversely, the power and die-size resources consumed by the cache force embedded system designers to use small and simple cache memories. Such caches can experience poor performance because of their inflexible placement policy. In this scenario, a big fraction of the misses can originate from the mismatch between the cache behavior and the locality features of the memory accesses (conflict misses). In this paper we analyze the conflict miss phenomenon and define a cache utilization measure. We then propose an object-level Cache Aware allocation Technique (CAT) that transforms the application to fit the cache structure, minimizing the number of conflict misses and maximizing cache exploitation. The solution transforms the program layout using the standard functionality of a linker. The CAT approach allowed the considered applications to deliver the same performance on caches two, and sometimes four, times smaller. Moreover, the CAT-improved programs running on direct-mapped caches outperformed the original versions running on set-associative caches. These results highlight that our approach can help embedded system designers meet system requirements with smaller and simpler cache memories.

16.
Cache misses in small, limited-associativity primary caches very often replace live cache blocks, given the dominance of capacity and conflict misses. Towards motivating novel cache organizations, we study the comparative characteristics of the virtual memory address pairs involved in typical primary-cache contention (block replacements) for the SPEC2000 integer benchmarks. We focus on the cache tag bits, and results show that (i) often just a few tag bits differ between contending addresses, and (ii) accesses to certain segments or page groups of the virtual address space (i.e. certain tag-bit groups) contend frequently. Cache-conscious virtual address space allocation can further reduce the number of conflicting tag bits. We mention two directions for exploiting such page-level contention patterns to improve cache cost and performance.
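A small sketch of the kind of measurement described: replay an address trace through a direct-mapped cache and, at every replacement, record the population count of the XOR of the incoming and evicted tags. The cache geometry and the synthetic trace are illustrative.

```python
# Count how many tag bits differ between contending addresses in a direct-mapped cache.
from collections import Counter

def tag_bit_difference_profile(addresses, num_sets=512, block_bytes=32):
    cache = {}                                   # set index -> resident tag
    profile = Counter()                          # differing-tag-bit count -> occurrences
    for addr in addresses:
        block = addr // block_bytes
        set_idx, tag = block % num_sets, block // num_sets
        old_tag = cache.get(set_idx)
        if old_tag is not None and old_tag != tag:
            profile[bin(old_tag ^ tag).count("1")] += 1    # popcount of the XOR of the two tags
        cache[set_idx] = tag
    return profile

# Two arrays whose base addresses differ in a single tag bit contend set-by-set.
trace = []
for i in range(4096):
    trace.append(0x00100000 + 4 * i)             # array A
    trace.append(0x00120000 + 4 * i)             # array B, one differing tag bit
print(tag_bit_difference_profile(trace))
```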

17.
Bus-based multiprocessors constitute a cost-effective class of shared-memory multiprocessors. Private caches are the key to an efficient utilization of the shared bus, and most such systems use a write-invalidate cache-coherence protocol to keep the caches coherent. Two important factors that limit the performance of the system are cache misses that lead to long-latency reads and bus congestion because of read misses and coherence traffic. While hybrid write-invalidate/write-update snooping protocols lead to fewer read misses than write-invalidate protocols, previous studies have shown them to be incapable of providing consistent performance improvements because of heavily increased coherence traffic. In this paper, we analyze how the deficiencies of hybrid snooping protocols can be dramatically reduced by using write caches and read snarfing (also called read-broadcast) under release consistency. Our performance evaluation is based on program-driven simulation and a set of five scientific applications with different sharing behaviors including migratory sharing as well as producer–consumer sharing. We show that one of the evaluated hybrid protocols, extended with write caches as well as read snarfing, manages to reduce the number of coherence misses by between 83 and 93% as compared to a write-invalidate protocol for all five applications in this study. In addition, the number of bus transactions is reduced substantially. However, we also show that read snarfing and hybrid snooping protocols might lead to higher cache occupancy because of increased sharing. Because of the small implementation cost of the hybrid protocol and the two extensions, we believe the combination to be an effective approach to boosting the performance of bus-based multiprocessors.

18.
In this article, we discuss the performance modeling and optimization of Sparse Matrix-Vector Multiplication (SpMV) on NVIDIA GPUs using CUDA. SpMV has a very low computation-to-data ratio and its performance is mainly bound by memory bandwidth. We propose optimizations of SpMV based on ELLPACK from two aspects: (1) enhanced performance for the dense vector by reducing cache misses, and (2) reduced accessed matrix data by index reduction. With matrix bandwidth reduction techniques, both cache usage enhancement and index compression can be enabled. For GPUs with better cache support, we propose a differentiated memory access scheme to avoid contamination of the caches by matrix data. Performance evaluation shows that the combined speedups of the proposed optimizations are 16% (single precision) and 12.6% (double precision) on the GT-200 GPU, and 19% (single precision) and 15% (double precision) on the GF-100 GPU.
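To make the ELLPACK storage format concrete, here is a minimal NumPy sketch of packing a sparse matrix into padded column-index and value arrays and multiplying it by a dense vector. The paper's CUDA kernels, bandwidth-reduction reordering, and index-compression steps are not shown.

```python
# Minimal ELLPACK SpMV sketch (CPU-side, for illustration only).
import numpy as np

def ellpack_from_rows(rows):
    """Pack a list of {col: value} rows into ELLPACK arrays padded to the longest row."""
    max_nnz = max(len(r) for r in rows)
    cols = np.zeros((len(rows), max_nnz), dtype=np.int32)     # padded column indices
    vals = np.zeros((len(rows), max_nnz), dtype=np.float64)   # padded values (0.0 = padding)
    for i, r in enumerate(rows):
        for j, (c, v) in enumerate(sorted(r.items())):
            cols[i, j], vals[i, j] = c, v
    return cols, vals

def ellpack_spmv(cols, vals, x):
    """y[i] = sum_j vals[i, j] * x[cols[i, j]]; padded entries contribute zero."""
    return (vals * x[cols]).sum(axis=1)

rows = [{0: 4.0, 2: 1.0}, {1: 3.0}, {0: 1.0, 1: 2.0, 2: 5.0}]
cols, vals = ellpack_from_rows(rows)
x = np.array([1.0, 2.0, 3.0])
print(ellpack_spmv(cols, vals, x))     # expected: [ 7.  6. 20.]
```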

19.
A NUCA Substrate for Flexible CMP Cache Sharing
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.

20.
In this article, we analyze the file access characteristics of smartphone applications and find that a large portion of file data in smartphones is written only once. This phenomenon arises from the behavior of SQLite, a lightweight database library used in most smartphone applications. Based on this observation, we present a new buffer cache management scheme for smartphone systems that considers the non-reusability of the write-only-once data that we observe. A buffer cache improves file access performance by maintaining hot data in memory, thereby servicing subsequent requests without storage accesses. The proposed scheme classifies write-only-once data and aggressively evicts them from the buffer cache to improve cache space utilization. Experimental results with various real smartphone applications show that the proposed buffer cache management scheme improves the performance of the smartphone buffer cache by 5%–33%. We also show that our scheme can reduce the buffer cache size to 1/4 of the original system without performance degradation, which allows the reduction of energy consumption in a smartphone memory system by 27%–92%.
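A hedged sketch of the buffer-cache policy described: blocks written exactly once and never re-read are classified as write-only-once and evicted ahead of ordinary LRU victims. The classification rule and data structures are simplifications, not the paper's implementation.

```python
# Sketch of a write-only-once-aware buffer cache (illustrative classification and eviction).
from collections import OrderedDict

class WriteOnceAwareBufferCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()            # block -> {"writes": n, "reads": n}, LRU order

    def _evict(self):
        # Prefer a block that was written exactly once and never read again (e.g. SQLite
        # journal-style data); fall back to plain LRU if no such block exists.
        for blk, stats in list(self.blocks.items()):
            if stats["writes"] == 1 and stats["reads"] == 0:
                del self.blocks[blk]
                return
        self.blocks.popitem(last=False)

    def access(self, blk, is_write):
        hit = blk in self.blocks
        if hit:
            stats = self.blocks.pop(blk)       # move to MRU position on a hit
        else:
            if len(self.blocks) >= self.capacity:
                self._evict()
            stats = {"writes": 0, "reads": 0}
        stats["writes" if is_write else "reads"] += 1
        self.blocks[blk] = stats
        return hit
```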
