期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Two fast and high-associativity cache schemes

Chenxi Zhang Xiaodong Zhang Yong Yan 《Micro, IEEE》1997,17(5):40-49

In the race to improve cache performance, many researchers have proposed schemes that increase a cache's associativity. The associativity of a cache is the number of places in the cache where a block may reside. In a direct-mapped cache, which has an associativity of 1, there is only one location to search for a match for each reference. In a cache with associativity n-an n-way set-associative cache-there are n locations. Increasing associativity reduces the miss rate by decreasing the number of conflict, or interference, references. The column-associative cache and the predictive sequential associative cache seem to have achieved near-optimal performance for an associativity of two. Increasing associativity beyond two, therefore, is one of the most important ways to further improve cache performance. We propose two schemes for implementing associativity greater than two: the sequential multicolumn cache, which is an extension of the column-associative cache, and the parallel multicolumn cache. For an associativity of four, they achieve the low miss rate of a four-way set-associative cache. Our simulation results show that both schemes can effectively reduce the average access time 相似文献

2.

Exploring cache performance in multithreaded processors

Dimitris Lioupis Sotiris MiliosAuthor vitae 《Microprocessors and Microsystems》1997,20(10):230

Multithreading is a well known technique to hide latency in a non-blocking cache architecture. By switching execution from one thread to another, the CPU can perform useful work, while waiting for pending requests to be processed by the main memory. In this paper we examine the effects of varying the associativity and block size on cache performance in a reduced locality of reference environment, due to multithreading. We find that for associativity equal to the number of threads, the cache produces very low miss rate even for small sizes. Also by taking into account the increase in cycle time due to larger cache size or associativity we find that the optimum cache configuration for best processor performance is 16Kbytes direct mapped. Finally, with a constant main memory bandwidth, increasing the block size to more than 32 bytes, reduces the miss rate, but degrades processor performance. 相似文献

3.

Exploring cache performance in multithreaded processors

Dimitris Lioupis Sotiris Milios 《Microprocessors and Microsystems》1997,20(10):631-642

Multithreading is a well known technique to hide latency in a non-blocking cache architecture. By switching execution from one thread to another, the CPU can perform useful work, while waiting for pending requests to be processed by the main memory. In this paper we examine the effects of varying the associativity and block size on cache performance in a reduced locality of reference environment, due to multithreading. We find that for associativity equal to the number of threads, the cache produces very low miss rate even for small sizes. Also by taking into account the increase in cycle time due to larger cache size or associativity we find that the optimum cache configuration for best processor performance is 16Kbytes direct mapped. Finally, with a constant main memory bandwidth, increasing the block size to more than 32 bytes, reduces the miss rate, but degrades processor performance. 相似文献

4.

主成分线性回归模型分析应用程序性能 总被引：3，自引：0，他引：3

李胜梅程步奇高兴誉乔林汤志忠《计算机研究与发展》2009,46(11)

应用程序的性能分析能够给体系架构设计者和性能优化者提供有效的参考和指导.采用主成分线性回归模型分析了SPEC CPU2006的整型程序性能.模型选取性能监测单元采样到的事件为自变量,每条指令的时钟周期数(CPI)作为因变量.模型中采用主成分分析法消除了性能事件之间的相关性.实验结果表明,模型的拟合优度在90%以上,对性能进行预测的平均相对误差为15%.模型从量化上分析了L1,L2高速缓存缺失作为影响性能的关键因素是怎样影响程序性能的. 相似文献

5.

On the design of on-chip instruction caches

Carl McCrosky Brian ven der Buhs 《Microprocessors and Microsystems》1988,12(10):563-572

Cache memories reduce memory latency and traffic in computing systems. Most existing caches are implemented as board-based systems. Advancing VLSI technology will soon permit significant caches to be integrated on chip with the processors they support. In designing on-chip caches, the constraints of VLSI become significant. The primary constraints are economic limitations on circuit area and off-chip communications. The paper explores the design of on-chip instruction-only caches in terms of these constraints. The primary contribution of this work is the development of a unified economic model of on-chip instruction-only cache design which integrates the points of view of the cache designer and of the floorplan architect. With suitable data, this model permits the rational allocation of constrained resources to the achievement of a desired cache performance. Specific conclusions are that random line replacement is superior to LRU replacement, due to an increased flexibility in VLSI floorplan design; that variable set associativity can be an effective tool in regulating a chip's floorplan; and that sectoring permits area efficient caches while avoiding high transfer widths. Results are reported on economic functionality, from chip area and transfer width to miss ratio. These results, or the underlying analysis, can be used by microprocessor architects to make intelligent decisions regarding appropriate cache organizations and resource allocations. 相似文献

6.

Performance of One''s Complement Caches

Qing Yang Sridar Adina T. Sun 《Journal of Parallel and Distributed Computing》1998,48(2):143

On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper will show that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By parallel computation of cache addresses and memory addresses, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in a cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to a same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost. 相似文献

7.

SAGA:一种由流特性制导的微处理器高速缓存分配策略

陈彧林隽民乔林汤志忠《计算机学报》2008,31(11)

传统的缓存替换策略,如广泛使用的LRU算法,在程序工作集大于缓存容量的情况下,不能有效开发流式数据的重用性,导致缓存性能很差.文中提出一种流特性制导的缓存分配策略(SAGA).该策略利用流检测引擎来发掘程序中的流特性信息,进而动态地在发生缓存缺失时指导是否为缺失数据分配缓存块,最终提高数据缓存的性能.实验表明,对于SPEC2000FP程序集,在1MB缓存上,比较于LRU策略,使用SAGA策略时缓存的缺失平均减少了31%,程序平均CPI降低4%. 相似文献

8.

ELF:基于无用块消除和低重用块过滤的共享Cache管理策略

隋秀峰吴俊敏陈国良唐轶轩《计算机学报》2011,34(1):143-153

当代CMP处理器通常采用基于LRU替换策略或其近似算法的共享最后一级Cache设计.然而,随着LLC容量和相联度的增长,LRU和理论最优替换算法之间的性能差距日趋增大.为此已提出多种Cache管理策略来解决这一问题,但是它们多数仅针对单一的内存访问类型,且对Cache访问的频率信息关注较少,因而性能提升具有很大的局限性... 相似文献

9.

A new cache architecture based on temporal and spatial locality 总被引：5，自引：0，他引：5

Jung-Hoon Jang-Soo Shin-Dug 《Journal of Systems Architecture》2000,46(15):1451-1467

A data cache system is designed as low power/high performance cache structure for embedded processors. Direct-mapped cache is a favorite choice for short cycle time, but suffers from high miss rate. Hence the proposed dual data cache is an approach to improve the miss ratio of direct-mapped cache without affecting this access time. The proposed cache system can exploit temporal and spatial locality effectively by maximizing the effective cache memory space for any given cache size. The proposed cache system consists of two caches, i.e., a direct-mapped cache with small block size and a fully associative spatial buffer with large block size. Temporal locality is utilized by caching candidate small blocks selectively into the direct-mapped cache. Also spatial locality can be utilized aggressively by fetching multiple neighboring small blocks whenever a cache miss occurs. According to the results of comparison and analysis, similar performance can be achieved by using four times smaller cache size comparing with the conventional direct-mapped cache.And it is shown that power consumption of the proposed cache can be reduced by around 4% comparing with the victim cache configuration. 相似文献

10.

一种多线程阵列众核处理器的二级Cache划分机制

陈逸飞朱蕾李宏亮《计算机工程与科学》2019,41(3):400-408

阵列众核处理器由于其较高的计算性能和能效比已经广泛应用于高性能计算领域。而要构建未来高性能计算系统处理器必须解决严峻的"访存墙"挑战以及核心协同问题。通常的阵列处理器,其核心多采用单线程结构,以减少开销,但是对访存提出了较高的要求。引入硬件同时多线程技术,针对实验中单核心多线程二级Cache利用率较低的问题,提出了一种共享二级Cache划分机制。经实验模拟,通过上述优化的共享二级Cache划分机制,二级指令Cache失效率下降18.59%,数据Cache失效率下降6.60%,整体CPI性能提升达到10.1%。相似文献

11.

Miss-aware LLC buffer management strategy based on heterogeneous multi-core

Fang Juan Zhang Xibei Liu Shijian Chang Zeqing 《The Journal of supercomputing》2019,75(8):4519-4528

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the last-level cache (LLC), the competition for LLC is more serious. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of LLC capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications have high number of concurrent threads and they can tolerate access latency. Taking into account the GPU program memory latency tolerance characteristics, we propose an LLC buffer management strategy (buffer-for-GPU, BFG) for heterogeneous multi-core. A buffer is added on the side of LLC to filtrate streaming requests of GPU. Cache-insensitive GPU messages directly access to buffer instead of accessing to LLC, thereby filtering the GPU request and freeing up the LLC space for the CPU application. Then, for the different characteristics of CPU and GPU applications, an improved LRU replacement taking into account the recent access time and access frequency of the cache block is adopted. The cache misses-aware algorithm dynamically selects the improved LRU or LRU algorithm to fit the current operating state by comparing the miss rate of cache in buffer so that the performance of the system will be improved significantly.

相似文献

12.

A cache-aware program transformation technique suitable for embedded systems

《Information and Software Technology》2002,44(13):783-795

In embedded systems caches are very precious for keeping low the memory bandwidth and to allow employing slow and narrow off-chip devices. Conversely, the power and die size resources consumed by the cache force the embedded system designers to use small and simple cache memories. This kind of caches can experience poor performance because of their not flexible placement policy. In this scenario, a big fraction of the misses can originate from the mismatch between the cache behavior and the memory accesses' locality features (conflict misses).In this paper we analyze the conflict miss phenomenon and define a cache utilization measure. Then we propose an object level Cache Aware allocation Technique (CAT) to transform the application to fit the cache structure, minimize the number of conflict misses and maximize cache exploitation. The solution transforms the program layout using the standard functionalities of a linker.The CAT approach allowed the considered applications to deliver the same performance on two times and sometimes four times smaller caches. Moreover the CAT improved programs on direct-mapped caches outperformed the original versions on set-associative caches. In this way, the results highlight that our approach can help embedded system designers to meet the system requirements with smaller and simpler cache memories. 相似文献

13.

面向多线程多道程序的加权共享Cache划分 总被引：5，自引：1，他引：4

所光杨学军《计算机学报》2008,31(11)

并行应用在共享Cache结构的多核处理器执行时,会因为对共享Cache的冲突访问而产生性能下降和执行时间不确定的现象.共享Cache划分技术可以把共享Cache互斥地分配给多个进程使用,是解决该问题的有效方法.由于线程间的数据共享,线程数目不同的应用对共享Cache的利用率不同,但传统的以失效率最低为目标的共享Cache划分算法(例如UCP)没有区分应用线程数目的不同.文中设计了一种面向多线程多道程序的加权共享Cache划分框架(Weighted Cache Partitioning,WCP),包括面向应用的失效率监控器和加权Cache划分算法.失效率监控器以进程为单位动态监控在不同的Cache容量下应用的失效率;而加权Cache划分算法扩展了传统的失效率最优的Cache划分算法,根据应用线程数目的不同在进行Cache划分时给应用赋予不同的权值,以使具有更多线程的应用获得更多的共享Cache,从而提高系统的整体性能.实验结果表明:加权Cache划分算法虽然失效率有所增高,但却改进了IPC吞吐率、加权加速比和公平性.在由科学和工程计算应用组成的多道程序测试用例中,WCP-1的IPC吞吐率比以失效率最低为目标函数的共享Cache划分算法最高高出10.8%,平均高出5.5%. 相似文献

14.

Performance modeling using Monte Carlo simulation

Srinivasan R. Cook J. Lubeck O. 《Computer Architecture Letters》2006,5(1):38-41

Cycle accurate simulation has long been the primary tool for micro-architecture design and evaluation. Though accurate, the slow speed often imposes constraints on the extent of design exploration. In this work, we propose a fast, accurate Monte-Carlo based model for predicting processor performance. We apply this technique to predict the CPI of in-order architectures and validate it against the Itanium-2. The Monte Carlo model uses micro-architecture independent application characteristics, and cache, branch predictor statistics to predict CPI with an average error of less than 7%. Since prediction is achieved in a few seconds, the model can be used for fast design space exploration that can efficiently cull the space for cycle-accurate simulations. Besides accurately predicting CPI, the model also breaks down CPI into various components, where each component quantifies the effect of a particular stall condition (branch misprediction, cache miss, etc.) on overall CPI. Such a CPI decomposition can help processor designers quickly identify and resolve critical performance bottlenecks. 相似文献

15.

YAARC: yet another approach to further reducing the rate of conflict misses

Mohsen Sharifi Behrouz Zolfaghari 《The Journal of supercomputing》2008,44(1):24-40

Traditional set associative caches are seriously prone to conflict misses. We propose an adapted new skewed associative architecture as an attempt to alleviate this problem. It has already been shown that skewed associative caches can reduce the rate of conflict misses by using different hash functions to index different banks. Building on this observation, we propose yet another approach to further reduce the rate of conflict misses, nicknamed YAARC (Yet Another Approach to Reducing Conflicts) that uses different hash functions to index into a single bank. Mathematical modeling and simulation results are exploited to evaluate the impact of YAARC on the rate of conflict misses. Mathematical analysis show the superiority of YAARC caches over set and skewed associative caches from the conflict miss point of view. Simulations, using some benchmarks from SPEC CPU2000 benchmark suit that former researchers have reported them as the best candidates for cache performance evaluation, also show nearly 43% conflict miss rate improvement for the skewed associative cache over the set associative cache, and nearly 31% improvement for the YAARC cache over the skewed associative cache. This implies that YAARC caches considerably outperform set and skewed associative caches from the conflict miss point of view. Since production of YAARC caches require a dispensable amount of hardware overhead, they can be considered as a cost effective approach to minimize the rate of conflict misses.

Behrouz ZolfaghariEmail:

相似文献

16.

缓存关联对数据库系统性能的影响及其优化策略

冯柯陈刚董金祥《计算机科学》2005,32(3):79-82

研究了缓存关联对数据库系统性能的影响,并提出了一种优化策略,大大降低了数据访问时对缓存的争用。该优化技术实现简单,对已有系统改动小,已经实现并应用于开放源代码数据库系统Postgres和MySQL,以及自主开发的CoreBase数据库系统中,测试结果表明：数据库的查询速度得到了极大的提高。相似文献

17.

一种处理器系统接口部件的设计与实现

李文郇丹丹高翔唐志敏《计算机工程与科学》2006,28(5):118-121

本文给出了一种处理器系统接口部件的具体设计方案.该接口部件通过使用Split读和片外Cache来提高处理器的性能.测试结果表明,Split读和片外Cache能够以比较低的代价使处理器性能得到很大提高. 相似文献

18.

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham 《International journal of parallel programming》2013,41(2):305-330

The demand for low-power embedded systems requires designers to tune processor parameters to avoid excessive energy wastage. Tuning on a per-application or per-application-phase basis allows a greater saving in energy consumption without a noticeable degradation in performance. On-chip caches often consume a significant fraction of the total energy budget and are therefore prime candidates for adaptation. Fixed-configuration caches must be designed to deliver low average memory access times across a wide range of potential applications. However, this can lead to excessive energy consumption for applications that do not require the full capacity or associativity of the cache at all times. Furthermore, in systems where the clock period is constrained by the access times of level-1 caches, the clock frequency for all applications is effectively limited by the cache requirements of the most demanding phase within the most demanding application. This results in both performance and energy efficiency that represents the lowest common denominator across the applications. In this work we present a Set and way Management cache Architecture for Run-Time reconfiguration (SMART cache), a cache architecture that allows reconfiguration in both its size and associativity. Results show the energy-delay of the Smart cache is on average 70 and 12 % better than the baseline configuration for a two-core and four-core system respectively, with just 2 % away from oracle result and also with an overall performance degradation of less than 2 % compared with a baseline statically-configured cache. 相似文献

19.

RACMan: Replication-aware cache management for manycore CMPs with private LLCs

《Microprocessors and Microsystems》2017

The last level cache (LLC) in private configurations offer lower latency and isolation but extinguishes the possibility of sharing underutilized cache resources. Cooperative Caching (CC) provides capacity sharing by spilling a line evicted from one cache to another. However, CC proposals did not pay enough attention to the natural problem of private LLC, replication. The static policies either indulging the replicated blocks (replicas) in or excluding them out of LLC invariably are deficient for the complex cache capacity situations in manycore environment. In this paper, we present replication-aware cache management (RACMan) to optimize replication for private configurations. RACMan relies on a novel coarse-grained low-overheard mechanism PBFP that monitors and predicts the replica reusability to dynamically adjust LLC insertion policies giving replicas different positions of LRU chain and chances of survival in LLC according to the prediction. Experiment results show our proposal is competent to optimize replication by performing better than two baseline systems in the respects of L2 Hit Rate, Network Traffics, IPC, and Dynamic Energy. RACMan fulfils the requirements of manycore CMPs with private LLC for increasing system performance, area efficiency, and scalability. 相似文献

20.

M<Emphasis Type="SmallCaps">osaic</Emphasis>: A Scalable Coherence Protocol

Lucia G. Menezo Valentin Puente Pablo Abad Jose-Angel Gregorio 《International journal of parallel programming》2018,46(6):1110-1138

The coherence protocol presented in this work, denoted Mosaic, introduces a new approach to face the challenges of complex multilevel cache hierarchies in future many-core systems. The essential aspect of the proposal is to eliminate the condition of inclusiveness through the different levels of the memory hierarchy while maintaining the complexity of the protocol limited. Cost reduction decisions taken to reduce this complexity may introduce artificial inefficiencies in the on-chip cache hierarchy, especially when the number of cores and private cache size is large. Our approach trades area and complexity for on-chip bandwidth, employing an integrated broadcast mechanism in a directory structure. In energy terms, the protocol scales like a conventional directory coherence protocol, but relaxes the shared information inclusiveness. This allows the performance implications of directory size and associativity reduction to be overcome. As it is even simpler than a conventional directory, the results of our evaluation show that the approach is quite insensitive, in terms of performance and energy expenditure, to the size and associativity of the directory. 相似文献