Similar Documents
20 similar documents retrieved (search time: 31 ms)
1.
Mueller, Frank. Real-Time Systems, 2000, 18(2-3): 217-247.
This paper contributes a comprehensive study of a framework to bound worst-case instruction cache performance for caches with arbitrary levels of associativity. The framework is formally introduced, operationally described, and its correctness is shown. Results of incorporating instruction cache predictions within pipeline simulation show that timing predictions for set-associative caches remain just as tight as predictions for direct-mapped caches. The low cache simulation overhead allows interactive use of the analysis tool and scales well with increasing associativity. The approach taken is based on a data-flow specification of the problem and provides another step toward worst-case execution time prediction of contemporary architectures and its use in schedulability analysis for hard real-time systems.

2.
High-performance processors employ aggressive branch prediction and prefetching techniques to increase performance. Speculative memory references caused by these techniques sometimes bring data into the caches that are not needed by correct execution. This paper proposes the use of the first-level caches as filters that predict the usefulness of speculative memory references. With the proposed technique, speculative memory references bring data only into the first-level caches rather than all levels in the cache hierarchy. The processor monitors the use of the cache blocks in the first-level caches and decides which blocks to keep in the cache hierarchy based on the usefulness of cache blocks. It is shown that a simple implementation of this technique usually outperforms inclusive and exclusive baseline cache hierarchies commonly used by today’s processors and results in IPC performance improvements of up to 10% on the SPEC CPU2000 integer benchmarks.
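To make the filtering idea concrete, here is a minimal sketch of the per-block bookkeeping as the abstract describes it; the structure and function names are illustrative assumptions, not the paper's implementation:

```c
/* Illustrative assumption of the per-block bookkeeping, not the paper's design. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;
    bool     valid;
    bool     speculative;  /* filled by a speculative (wrong-path/prefetch) reference */
    bool     used;         /* later touched by committed execution */
} L1Block;

/* Speculative fills go only into L1, not into the lower cache levels. */
void l1_fill(L1Block *b, uint64_t tag, bool is_speculative) {
    b->tag = tag;
    b->valid = true;
    b->speculative = is_speculative;
    b->used = false;
}

/* A committed access marks the block as useful. */
void l1_commit_access(L1Block *b) { b->used = true; }

/* On eviction, keep a block in L2 only if it was demand-fetched or its
 * speculative fetch turned out to be useful. */
bool keep_in_l2_on_evict(const L1Block *b) {
    return !b->speculative || b->used;
}
```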

3.
Cache memories reduce memory latency and traffic in computing systems. Most existing caches are implemented as board-based systems. Advancing VLSI technology will soon permit significant caches to be integrated on chip with the processors they support. In designing on-chip caches, the constraints of VLSI become significant. The primary constraints are economic limitations on circuit area and off-chip communications. The paper explores the design of on-chip instruction-only caches in terms of these constraints. The primary contribution of this work is the development of a unified economic model of on-chip instruction-only cache design which integrates the points of view of the cache designer and of the floorplan architect. With suitable data, this model permits the rational allocation of constrained resources to the achievement of a desired cache performance. Specific conclusions are that random line replacement is superior to LRU replacement, due to an increased flexibility in VLSI floorplan design; that variable set associativity can be an effective tool in regulating a chip's floorplan; and that sectoring permits area efficient caches while avoiding high transfer widths. Results are reported on economic functionality, from chip area and transfer width to miss ratio. These results, or the underlying analysis, can be used by microprocessor architects to make intelligent decisions regarding appropriate cache organizations and resource allocations.

4.
On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper will show that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By parallel computation of cache addresses and memory addresses, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in a cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to the same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost.
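The even-distribution effect of an odd or prime set count is easy to see in isolation. The sketch below is an illustration of that distribution effect only, not the paper's one's complement address hardware:

```c
/* Illustrative only: shows how an odd set count spreads strided
 * references, while a power-of-two set count maps them to one set. */
#include <stdio.h>
#include <stdint.h>

static unsigned set_index(uint64_t block_addr, unsigned num_sets) {
    return (unsigned)(block_addr % num_sets);
}

int main(void) {
    const unsigned pow2_sets = 64, odd_sets = 63;
    /* A loop touching every 64th block: pathological for 64 sets. */
    for (uint64_t i = 0; i < 8; i++) {
        uint64_t blk = i * 64;
        printf("block %4llu -> set %2u (64 sets), set %2u (63 sets)\n",
               (unsigned long long)blk,
               set_index(blk, pow2_sets), set_index(blk, odd_sets));
    }
    return 0;  /* with 64 sets every block maps to set 0; with 63 they spread */
}
```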

5.
Traditional set associative caches are seriously prone to conflict misses. We propose an adapted new skewed associative architecture as an attempt to alleviate this problem. It has already been shown that skewed associative caches can reduce the rate of conflict misses by using different hash functions to index different banks. Building on this observation, we propose yet another approach to further reduce the rate of conflict misses, nicknamed YAARC (Yet Another Approach to Reducing Conflicts), that uses different hash functions to index into a single bank. Mathematical modeling and simulation results are exploited to evaluate the impact of YAARC on the rate of conflict misses. Mathematical analysis shows the superiority of YAARC caches over set and skewed associative caches from the conflict miss point of view. Simulations, using benchmarks from the SPEC CPU2000 benchmark suite that earlier researchers have reported to be the best candidates for cache performance evaluation, also show a nearly 43% conflict miss rate improvement for the skewed associative cache over the set associative cache, and a nearly 31% improvement for the YAARC cache over the skewed associative cache. This implies that YAARC caches considerably outperform set and skewed associative caches from the conflict miss point of view. Since YAARC caches require only a negligible amount of additional hardware, they can be considered a cost-effective approach to minimizing the rate of conflict misses.
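A conceptual sketch of the YAARC idea as stated in the abstract follows: two different hash functions index the same bank, giving each block two candidate lines. The hash functions and the replacement policy here are assumptions for illustration only:

```c
/* Conceptual sketch; hash choices and fill policy are assumed, not the paper's. */
#include <stdint.h>
#include <stdbool.h>

#define BANK_LINES 256

typedef struct { uint64_t tag; bool valid; uint32_t stamp; } Line;
static Line bank[BANK_LINES];   /* a single bank */

static unsigned h1(uint64_t tag) { return (unsigned)(tag % BANK_LINES); }
static unsigned h2(uint64_t tag) {            /* a second, different mix */
    return (unsigned)(((tag ^ (tag >> 8)) * 2654435761u) % BANK_LINES);
}

/* Probe both candidate lines of the same bank; on a miss, replace the
 * least recently touched of the two. */
bool lookup(uint64_t tag, uint32_t now) {
    unsigned a = h1(tag), b = h2(tag);
    if (bank[a].valid && bank[a].tag == tag) { bank[a].stamp = now; return true; }
    if (bank[b].valid && bank[b].tag == tag) { bank[b].stamp = now; return true; }
    unsigned v = (bank[a].stamp <= bank[b].stamp) ? a : b;  /* older line loses */
    bank[v] = (Line){ .tag = tag, .valid = true, .stamp = now };
    return false;
}
```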

6.
A Dynamic Implicit Isolation Mechanism for Shared Caches in On-Chip Many-Core Architectures
Memory bandwidth is the key constraint on performance scaling for many-core processors, so it is necessary to design the last-level on-chip cache to be shared by all processor cores. Isolating conflicting data placed in the shared cache is key to improving shared-cache performance. This paper proposes a cache block linking hardware mechanism to isolate the data of different threads in the shared cache. The mechanism is evaluated on a cycle-accurate many-core architecture simulator using the Splash-2 benchmark suite and tasks from bioinformatics. Experimental results show that, compared with a conventional shared cache, the cache block linking mechanism reduces the shared cache's conflict miss rate by about 20% and improves IPC by about 10% on average.

7.
Predicting the worst-case execution time (WCET) and best-case execution time (BCET) of a real-time program is a challenging task. Though much progress has been made in obtaining tighter timing predictions by using techniques that model the architectural features of a machine, significant overestimations of WCET and underestimations of BCET can still occur. Even with perfect architectural modeling, dependencies on data values can constrain the outcome of conditional branches and the corresponding set of paths that can be taken in a program. While branch constraint information has been used in the past by some timing analyzers, it has typically been specified manually, which is both tedious and error prone. This paper describes efficient techniques for automatically detecting branch constraints by a compiler and automatically exploiting these constraints within a timing analyzer. The result is significantly tighter timing analysis predictions without requiring additional interaction with a user.

8.
Efficient and Precise Cache Behavior Prediction for Real-Time Systems
Abstract interpretation is a technique for the static detection of dynamic properties of programs. It is semantics based, that is, it computes approximate properties of the semantics of programs. On this basis, it supports correctness proofs of analyses. It replaces commonly used ad hoc techniques by systematic, provable ones, and it allows for the automatic generation of analyzers from specifications by existing tools. In this work, abstract interpretation is applied to the problem of predicting the cache behavior of programs. Abstract semantics of machine programs are defined which determine the contents of caches. For interprocedural analysis, existing methods are examined and a new approach that is especially tailored for the cache analysis is presented. This allows for a static classification of the cache behavior of memory references of programs. The calculated information can be used to improve worst case execution time estimations. It is possible to analyze instruction, data, and combined instruction/data caches for common (re)placement and write strategies. Experimental results are presented that demonstrate the applicability of the analyses.
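As one concrete ingredient of such analyses, the sketch below shows the classic "must" analysis join and update for a fully associative LRU set: a reference is classified "always hit" if its block survives in the joined abstract state. The representation is an illustrative assumption in the style of these analyses, not the paper's code:

```c
/* Illustrative "must" analysis for a fully associative LRU set. */
#include <stdint.h>

#define WAYS        4
#define BLOCKS      64
#define NOT_PRESENT WAYS   /* age >= associativity: not guaranteed cached */

/* Abstract "must" state: an upper bound on the LRU age of every block. */
typedef struct { uint8_t age[BLOCKS]; } MustState;

/* Join at a control-flow merge: a block is guaranteed in the cache only
 * if it is guaranteed on both incoming paths, with the larger age bound. */
void must_join(MustState *out, const MustState *a, const MustState *b) {
    for (int blk = 0; blk < BLOCKS; blk++) {
        uint8_t x = a->age[blk], y = b->age[blk];
        out->age[blk] = (x > y) ? x : y;        /* NOT_PRESENT dominates */
    }
}

/* Abstract access: the touched block becomes youngest; blocks that were
 * younger than it age by one (possibly falling out of the abstract cache). */
void must_access(MustState *s, int blk) {
    uint8_t old = s->age[blk];
    for (int i = 0; i < BLOCKS; i++)
        if (s->age[i] < old) s->age[i]++;
    s->age[blk] = 0;   /* blk now guaranteed in the cache with age 0 */
}
```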

9.
This paper proposes CSAC (classification on self-adaptive cache), an efficient, widely applicable, and easy-to-implement packet classification algorithm. By caching the classification query path of the packet set within an attribute subspace, the algorithm reuses query results for classifying subsequent packets in the same subspace. On a cache miss, the query need not restart from scratch, reducing the time overhead of misses. To cope with the effect of changing traffic context on cache behavior, the algorithm adopts a self-adaptive cache mechanism that dynamically adjusts the cache granularity, structure, and the position of cache entries within hash buckets, effectively sustaining the cache hit rate. Moreover, the algorithm requires no preprocessing and supports multi-dimensional complex rules (e.g., layer 4-7 attributes and logical match operations) as well as incremental rule updates, making it well suited to applications with complex packet classification, such as network perimeter security, user traffic auditing, and load balancing. High-end firewall and intrusion detection devices built with CSAC perform well in real network environments.
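A rough sketch of the core caching loop follows, under heavy assumptions: the subspace key, the cache layout, and the omission of the self-adaptive granularity adjustment are all illustrative, not CSAC's actual data structures:

```c
/* Illustrative sketch of subspace-keyed result caching (assumptions throughout). */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { uint32_t src, dst; uint16_t sport, dport; uint8_t proto; } Flow;

typedef struct {
    uint64_t key;       /* hash of the attribute subspace of the flow  */
    int      rule_id;   /* cached classification decision              */
    void    *end_node;  /* where the query path ended, for later reuse */
    bool     valid;
} CacheEntry;

#define SLOTS 1024
static CacheEntry cache[SLOTS];

/* Subspace key: coarse prefixes of the 5-tuple (an assumed granularity;
 * CSAC adjusts this granularity dynamically). */
static uint64_t subspace_key(const Flow *f) {
    return ((uint64_t)(f->src >> 8) << 32) ^ ((uint64_t)(f->dst >> 8) << 8)
         ^ ((uint64_t)f->dport << 1) ^ f->proto;
}

int classify(const Flow *f, int (*full_lookup)(const Flow *, void **end_node)) {
    uint64_t k = subspace_key(f);
    CacheEntry *e = &cache[k % SLOTS];
    if (e->valid && e->key == k)
        return e->rule_id;              /* same subspace: reuse the result */
    void *end_node = NULL;
    int rid = full_lookup(f, &end_node);
    /* Cache the decision and the end of the query path for this subspace. */
    *e = (CacheEntry){ .key = k, .rule_id = rid, .end_node = end_node, .valid = true };
    return rid;
}
```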

10.
Component-Based WebGIS and Its Caching Framework
This paper introduces and analyzes the architecture, component decomposition, and Web application model of Geo-Union, a multi-tier component-based WebGIS system. To improve Geo-Union's performance, a spatial caching framework is introduced with three tiers: a spatial database cache, a network spatial cache, and a spatial data proxy server. The spatial database cache improves the read efficiency of the spatial database; the network spatial cache improves spatial data read efficiency for a single user or a local area network; and the spatial data proxy server improves the overall efficiency of spatial data reads across the wide area network.

11.
Exploiting client caches to build large Web caches
New demands brought by the continuing growth of the Internet will be met in part by more effective and comprehensive use of caching. This paper proposes to exploit client browser caches in the context of cooperative proxy caching by constructing the client caches within each organization (e.g., corporate networks) as a peer-to-peer (P2P) client cache. Via trace-driven simulations we evaluate the potential performance benefit of cooperative proxy caching with/without exploiting client caches. We show that exploiting client caches in cooperative proxy caching can significantly improve performance, particularly when the size of individual proxy caches is limited compared to the universe of Web objects. We further devise a cooperative hierarchical greedy-dual replacement algorithm (Hier-GD), which not only provides some cache coordination but also utilizes client caches. Through Hier-GD, we explore the design issues of how to exploit client caches in cooperative proxy caching to build large Web caches. We show that Hier-GD is technically practical and can potentially improve the performance of cooperative proxy caching by utilizing client caches.
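Hier-GD builds on greedy-dual replacement. The sketch below shows only the per-cache greedy-dual eviction rule with an assumed retrieval cost, not the cooperative or hierarchical parts of Hier-GD; all names are illustrative:

```c
/* Greedy-dual eviction rule only; the hierarchical coordination is omitted. */
#include <string.h>

#define N 4   /* toy cache of N objects */

typedef struct { const char *url; double size_kb; double h; int valid; } Obj;
static Obj    cache[N];
static double L = 0.0;   /* inflation value: rises whenever an object is evicted */

/* Each object carries H = L + cost/size; evict the minimum-H object and
 * raise L to its H, so long-resident objects gradually age out. */
void access_obj(const char *url, double size_kb, double cost) {
    int free_slot = -1, victim = 0;
    for (int i = 0; i < N; i++) {
        if (cache[i].valid && strcmp(cache[i].url, url) == 0) {
            cache[i].h = L + cost / cache[i].size_kb;   /* hit: restore H */
            return;
        }
        if (!cache[i].valid) free_slot = i;
        else if (!cache[victim].valid || cache[i].h < cache[victim].h) victim = i;
    }
    if (free_slot < 0) {              /* full: evict the lowest-H object */
        L = cache[victim].h;          /* inflate L to the evicted value  */
        cache[victim].valid = 0;
        free_slot = victim;
    }
    cache[free_slot] = (Obj){ url, size_kb, L + cost / size_kb, 1 };
}
```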

12.
This paper presents an implementation strategy for an embedded browser cache. Network data are classified and buffered appropriately using an in-memory caching technique; meanwhile, based on page structure and access information, a simple and practical cache eviction policy is used to make full use of cache resources, giving the system good performance.

13.
A new cache architecture based on temporal and spatial locality
A data cache system is designed as a low-power/high-performance cache structure for embedded processors. A direct-mapped cache is a favorite choice for a short cycle time, but suffers from a high miss rate. Hence the proposed dual data cache is an approach to improve the miss ratio of a direct-mapped cache without affecting its access time. The proposed cache system can exploit temporal and spatial locality effectively by maximizing the effective cache memory space for any given cache size. The proposed cache system consists of two caches, i.e., a direct-mapped cache with a small block size and a fully associative spatial buffer with a large block size. Temporal locality is utilized by caching candidate small blocks selectively into the direct-mapped cache. Spatial locality can also be utilized aggressively by fetching multiple neighboring small blocks whenever a cache miss occurs. According to the results of comparison and analysis, similar performance can be achieved with a cache one quarter the size of a conventional direct-mapped cache. It is also shown that the power consumption of the proposed cache can be reduced by around 4% compared with the victim cache configuration.
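A simplified sketch of the dual-cache lookup path described above; the sizes, the sub-block bookkeeping, and the promotion rule are assumptions for illustration, not the paper's exact policy:

```c
/* Direct-mapped cache (small blocks) + fully associative spatial buffer
 * (large blocks), probed together. Parameters are assumed. */
#include <stdint.h>
#include <stdbool.h>

#define DM_SETS    128   /* direct-mapped cache of 8 B small blocks        */
#define SB_ENTRIES   8   /* fully associative buffer of 64 B large blocks  */

typedef struct { uint64_t tag; bool valid; } DMLine;
typedef struct { uint64_t tag; bool valid; uint8_t used_mask; } SBLine;

static DMLine dm[DM_SETS];
static SBLine sb[SB_ENTRIES];

bool access_addr(uint64_t addr) {
    uint64_t small = addr >> 3, large = addr >> 6;  /* block numbers */
    DMLine *d = &dm[small % DM_SETS];
    if (d->valid && d->tag == small)
        return true;                                /* temporal-locality hit */
    for (int i = 0; i < SB_ENTRIES; i++)
        if (sb[i].valid && sb[i].tag == large) {    /* spatial-locality hit  */
            sb[i].used_mask |= (uint8_t)(1u << (small & 7));
            *d = (DMLine){ .tag = small, .valid = true }; /* promote small block */
            return true;
        }
    /* Miss: fetch the whole large block into the spatial buffer (FIFO). */
    static int next;
    sb[next] = (SBLine){ .tag = large, .valid = true,
                         .used_mask = (uint8_t)(1u << (small & 7)) };
    next = (next + 1) % SB_ENTRIES;
    return false;
}
```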

14.
Traditional cache replacement algorithms perform poorly because they cannot adapt to the streaming access behavior of applications. This paper designs a prediction method based on period detection, analyzes the regularity of memory reuse distances and the complexity of streaming accesses, and proposes RDP, an algorithm that uses reuse-distance prediction to adapt to both simple and complex streaming access patterns. The basic idea of RDP is to predict reuse distances, dynamically maintain reuse-distance counters, and dynamically adjust the replacement order of cached data, with stream sampling used to reduce the storage overhead. Experimental results show that RDP adapts well to the diverse streaming access patterns in programs and outperforms the LRU and DIP algorithms overall, reducing cache misses by 27.5% on average over traditional LRU on a 32 MB cache.
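A minimal sketch of reuse-distance based replacement in this spirit; the period-detection predictor itself is omitted and the predicted distance is simply supplied by the caller, so everything here is an assumption, not RDP's actual structure:

```c
/* Reuse-distance counters steer eviction; the predictor is external. */
#include <stdint.h>
#include <stdbool.h>

#define WAYS 16
typedef struct { uint64_t tag; bool valid; uint32_t rd_counter; } Way;
static Way set[WAYS];   /* one cache set */

bool access_set(uint64_t tag, uint32_t predicted_rd) {
    for (int i = 0; i < WAYS; i++)             /* hit: reset the prediction */
        if (set[i].valid && set[i].tag == tag) {
            set[i].rd_counter = predicted_rd;
            return true;
        }
    for (int i = 0; i < WAYS; i++)             /* miss: age every block */
        if (set[i].valid && set[i].rd_counter > 0)
            set[i].rd_counter--;
    int victim = 0;                            /* prefer a free way, else the
                                                  largest remaining distance */
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid) { victim = i; break; }
        if (set[victim].valid && set[i].rd_counter > set[victim].rd_counter)
            victim = i;
    }
    set[victim] = (Way){ .tag = tag, .valid = true, .rd_counter = predicted_rd };
    return false;
}
```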

15.
Nowadays, inter-task interference is the main difficulty in analyzing the timing behavior of multicores. The timing-predictable embedded multicore architecture MERASA, which allows safe worst-case execution time (WCET) estimation, has emerged as an attractive solution. In this architecture, WCET can be estimated via the upper bound delay (UBD), which can be bounded by the interference-aware bus arbiter (IABA) and dynamic cache partitioning such as columnization or bankization. However, this architecture faces a dilemma between decreasing the UBD and using the shared cache efficiently. To obtain tighter WCET estimates, we propose a novel approach that reduces the UBD by optimizing the bank-to-core mapping on a multicore system with IABA and a two-level partitioned cache. For this, we first present a new UBD computation model based on an analysis of inter-task interference delay, and then put forward a core-sequence optimization method for the bank-to-core mapping along with optimizing algorithms that minimize the UBD. Experimental results demonstrate that our approach can reduce the WCET estimate by 4% to 37%.
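The search over bank-to-core mappings can be pictured as follows. This sketch abstracts the paper's UBD model into a caller-supplied cost function and uses an exhaustive permutation search, whereas the paper proposes smarter core-sequence methods; everything here is illustrative:

```c
/* Brute-force search for a minimum-UBD bank-to-core mapping (illustrative;
 * the UBD model itself is the paper's and is abstracted as a callback). */
#include <string.h>

#define CORES 4

static int    best_map[CORES];   /* best bank assigned to each core */
static double best_ubd = 1e300;

static void search(int map[CORES], int depth, unsigned used,
                   double (*ubd_of)(const int map[CORES])) {
    if (depth == CORES) {                       /* complete mapping: score it */
        double u = ubd_of(map);
        if (u < best_ubd) {
            best_ubd = u;
            memcpy(best_map, map, sizeof best_map);
        }
        return;
    }
    for (int bank = 0; bank < CORES; bank++)    /* one-to-one assignment */
        if (!(used & (1u << bank))) {
            map[depth] = bank;                  /* core `depth` gets `bank` */
            search(map, depth + 1, used | (1u << bank), ubd_of);
        }
}
```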

16.
A new Web cache sharing scheme is presented. Our scheme reduces the duplicated copies of the same objects in global shared Web caches. It also reduces the message overhead of existing schemes significantly. Trace-driven simulations with actual Web cache logs show that the proposed scheme performs better than the two well-known Web cache sharing schemes, the Internet Cache Protocol and the Cache Array Routing Protocol.
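For background, the sketch below illustrates CARP-style highest-random-weight hash routing, one of the two baselines the abstract compares against: each URL maps deterministically to one proxy, so no duplicate copies are stored and no query messages are exchanged. This is not the paper's new scheme, and the hash function is an assumption:

```c
/* CARP-style highest-random-weight routing (baseline illustration only). */
#include <stdint.h>

static uint32_t mix(uint32_t h, const char *s) {
    while (*s) h = (h * 16777619u) ^ (uint8_t)*s++;   /* FNV-style step */
    return h;
}

/* Route a URL to the proxy whose combined hash of (url, proxy id) is largest. */
int route_url(const char *url, const char *proxies[], int n) {
    int best = 0;
    uint32_t best_w = 0;
    for (int i = 0; i < n; i++) {
        uint32_t w = mix(mix(2166136261u, url), proxies[i]);
        if (i == 0 || w > best_w) { best_w = w; best = i; }
    }
    return best;   /* every client computes the same owner, so no duplicates */
}
```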

17.
In second-level cache replacement for cluster file systems, a single miss in one storage node's second-level cache can break the parallelism of the cluster file system, reducing the efficiency of the other storage nodes' second-level caches and thus lowering the overall hit rate of the system cache. This paper proposes the concept of cooperative replacement across storage nodes and designs the replacement algorithm CMQ. Simulation results show that CMQ achieves a significantly higher hit rate than traditional replacement algorithms such as LRU, LFU, and MQ.

18.
In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. The Cache Coherence With Data Prefetching (CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme to five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.
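Conceptually, the compiler's transformation looks like the sketch below; cache_invalidate() and prefetch_shared() are hypothetical stand-ins for machine-specific operations, not a real API, and the stubs exist only so the sketch compiles:

```c
/* Conceptual CCDP-style transformation; the intrinsics are hypothetical. */
#include <stddef.h>

static void cache_invalidate(const void *p) { (void)p; /* drop stale copy  */ }
static void prefetch_shared(const void *p)  { (void)p; /* fetch fresh copy */ }

/* After a synchronization point, another processor may have written
 * shared[]. The compiler marks these reads as potentially stale and emits
 * invalidate + prefetch ahead of use, enforcing coherence in software
 * while hiding the memory latency of the refetch. */
double consume(const double *shared, size_t n) {
    for (size_t i = 0; i < n; i++) {
        cache_invalidate(&shared[i]);    /* the possibly stale line goes away */
        prefetch_shared(&shared[i]);     /* start fetching up-to-date data    */
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += shared[i];                /* reads now hit refreshed lines */
    return sum;
}
```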

19.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. SLC has the features of low power consumption and low access delay, similar to a direct-mapped cache, and a high cache hit rate similar to a two-way set-associative cache, by adding a new link field. In addition, we developed another design known as multiple linked caches (MLC) to further reduce the power consumption during each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption during each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory to obtain the requested instruction directly if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, a traditional cache, and a filter cache in terms of cache hit rate, power consumption, and execution time.
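A sketch of the SLC lookup as described, where each direct-mapped line carries a link to one alternate line; the field layout, reset, and the fill policy that sets up the links are assumptions for illustration:

```c
/* Direct-mapped lookup with a one-step link field (illustrative layout). */
#include <stdint.h>
#include <stdbool.h>

#define LINES   256
#define NO_LINK 0xFFFF

typedef struct { uint64_t tag; bool valid; uint16_t link; } SLCLine;
static SLCLine slc[LINES];

void slc_reset(void) {
    for (int i = 0; i < LINES; i++)
        slc[i] = (SLCLine){ .tag = 0, .valid = false, .link = NO_LINK };
}

/* Probe the home line first; on a mismatch, follow the link field to the
 * single alternate line, approximating 2-way behavior for hot blocks.
 * The fill path that installs links for conflicting blocks is omitted. */
bool slc_lookup(uint64_t addr) {
    unsigned idx = (unsigned)(addr % LINES);
    uint64_t tag = addr / LINES;
    if (slc[idx].valid && slc[idx].tag == tag)
        return true;                               /* direct-mapped hit  */
    uint16_t alt = slc[idx].link;
    if (alt != NO_LINK && slc[alt].valid && slc[alt].tag == tag)
        return true;                               /* hit via link field */
    return false;                                  /* miss               */
}
```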

20.
This paper presents the performance analysis results for the RAP-WAM AND-Parallel Prolog architecture on shared-memory multiprocessor organizations. The goal of this parallel model is to provide inference speeds beyond those attainable in sequential systems, while supporting conventional logic programming semantics. Special emphasis is placed on sequential performance, storage efficiency, and low control overhead. First, the concepts and techniques used in the parallel execution model are described, along with the general methodology, benchmarks, and simulation tools used for its evaluation. Results are given both at the memory reference level and at the memory organization level. A two-level shared-memory architecture model is presented together with an analysis of various solutions to the cache coherency problem. Finally, RAP-WAM shared-memory simulation results are presented. It is argued that the RAP-WAM model can exploit coherent caches and attain speeds in excess of 2 MLIPS with current shared-memory multiprocessing technology for real applications that exhibit medium degrees of parallelism.
