Similar Documents
20 similar documents found.
1.
Cache memories reduce memory latency and traffic in computing systems. Most existing caches are implemented as board-based systems. Advancing VLSI technology will soon permit significant caches to be integrated on chip with the processors they support. In designing on-chip caches, the constraints of VLSI become significant; the primary constraints are economic limitations on circuit area and off-chip communication. This paper explores the design of on-chip instruction-only caches under these constraints. The primary contribution of this work is a unified economic model of on-chip instruction-only cache design that integrates the points of view of the cache designer and the floorplan architect. With suitable data, this model permits the rational allocation of constrained resources toward a desired cache performance. Specific conclusions are that random line replacement is superior to LRU replacement, owing to increased flexibility in VLSI floorplan design; that variable set associativity can be an effective tool for regulating a chip's floorplan; and that sectoring permits area-efficient caches while avoiding high transfer widths. Results are reported as economic functions relating chip area and transfer width to miss ratio. These results, and the underlying analysis, can be used by microprocessor architects to make informed decisions about appropriate cache organizations and resource allocations.

2.
In embedded systems, caches are precious for keeping memory bandwidth low and for allowing the use of slow, narrow off-chip devices. Conversely, the power and die area consumed by the cache force embedded-system designers to use small and simple cache memories. Such caches can suffer poor performance because of their inflexible placement policy: a large fraction of the misses can originate from the mismatch between the cache's behavior and the locality features of the memory accesses (conflict misses). In this paper we analyze the conflict-miss phenomenon and define a cache-utilization measure. We then propose an object-level Cache-Aware allocation Technique (CAT) that transforms the application to fit the cache structure, minimizing the number of conflict misses and maximizing cache exploitation. The solution transforms the program layout using the standard functionality of a linker. CAT allowed the considered applications to deliver the same performance on caches two, and sometimes four, times smaller. Moreover, CAT-improved programs on direct-mapped caches outperformed the original versions on set-associative caches. These results highlight that our approach can help embedded-system designers meet system requirements with smaller and simpler cache memories.
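
To make the conflict analysis behind such a linker-level technique concrete, here is a minimal C sketch (my illustration, not the paper's CAT algorithm; the cache geometry is an assumption) of how two objects' address ranges can collide in a direct-mapped cache, which is exactly the overlap a cache-aware linker tries to minimize when it reorders objects:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed direct-mapped geometry: 32-byte lines, 256 sets = 8 KB cache. */
    #define LINE_SIZE 32u
    #define NUM_SETS  256u

    /* Set index a direct-mapped cache derives from an address. */
    static uint32_t set_index(uint32_t addr) {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    /* Number of cache sets two objects' address ranges share: a crude
     * measure of the conflict potential a cache-aware linker minimizes. */
    static uint32_t overlapping_sets(uint32_t base_a, uint32_t size_a,
                                     uint32_t base_b, uint32_t size_b) {
        uint8_t used_a[NUM_SETS] = {0};
        uint32_t overlap = 0;
        for (uint32_t a = base_a; a < base_a + size_a; a += LINE_SIZE)
            used_a[set_index(a)] = 1;
        for (uint32_t b = base_b; b < base_b + size_b; b += LINE_SIZE)
            overlap += used_a[set_index(b)];
        return overlap;
    }

    int main(void) {
        /* Two 1 KB objects placed one cache-size (8 KB) apart fight over
         * the same 32 sets ... */
        printf("%u\n", overlapping_sets(0x0000, 1024, 0x2000, 1024));
        /* ... but shifting the second object by 1 KB makes them disjoint. */
        printf("%u\n", overlapping_sets(0x0000, 1024, 0x2400, 1024));
        return 0;
    }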

3.
The advent of chip multiprocessors brings many challenges to processor design and implementation, and the design of the on-chip memory system is one of the most important. To mitigate the increasingly severe memory-wall problem, researchers typically place a large-capacity last-level cache on chip, and the design and optimization of on-chip last-level caches has become an active research topic. This paper introduces the challenges facing last-level cache design in chip multiprocessors (CMPs), then surveys a range of CMP last-level cache optimization techniques based on private and on shared designs, and compares and analyzes them.

4.
The limited memory space of direct-mapped caches is often used unevenly, which incurs extra conflict misses. We propose a novel cache organization, the balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder, through which the mapping of memory reference addresses to cache subarrays can be optimized, so that the conflict misses of direct-mapped caches can be resolved. Experimental results show that the miss rate of the balanced cache is lower on average than that of a same-sized two-way set-associative cache, and can be as low as that of a same-sized four-way set-associative cache for particular applications. Compared with previous techniques, the balanced cache requires only one cycle for all cache hits and has the same access time as a direct-mapped cache.
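
The core mechanism is easy to picture in software. The following C sketch (my reconstruction of the idea, not the paper's circuit; the geometry values are assumptions) models a programmable subarray decoder as a small remap table that redirects a logical subarray to a physical one, so references crowding one subarray can be spread to an underused one:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 8 subarrays of 64 sets each (512 sets total). */
    #define NUM_SUBARRAYS     8u
    #define SETS_PER_SUBARRAY 64u

    /* The programmable part: starts as the identity mapping. */
    static uint8_t remap[NUM_SUBARRAYS] = {0, 1, 2, 3, 4, 5, 6, 7};

    /* Translate a logical set index to a physical set via the remap table. */
    static uint32_t decode(uint32_t set_index) {
        uint32_t subarray = (set_index / SETS_PER_SUBARRAY) % NUM_SUBARRAYS;
        uint32_t offset   = set_index % SETS_PER_SUBARRAY;
        return remap[subarray] * SETS_PER_SUBARRAY + offset;
    }

    int main(void) {
        /* Redirect references from an overloaded subarray 0 to subarray 5. */
        remap[0] = 5;
        printf("set 10 -> physical set %u\n", decode(10));  /* prints 330 */
        return 0;
    }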

5.
Research on an Improved Cache-Sensitive B+-Tree
王晨  陈刚  董金祥 《计算机测量与控制》2006,14(11):1531-1534,1550
In a main-memory database, the number of processor cache misses has a major impact on system performance. A cache-conscious index can reduce the cache misses incurred by query operations and thereby improve system performance. The traditional design sets the node size equal to the cache-block size, on the assumption that this minimizes cache misses; however, such a design ignores the impact of TLB misses on system performance. We propose a cache-conscious index, the Modified Cache-Sensitive B+-tree (MCSB+ tree), which accounts for the impact of both cache misses and TLB misses on system performance, and offers better operation performance than traditional cache-conscious indexes.
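
For readers unfamiliar with cache-conscious node layout, here is a minimal C node declaration in the CSB+-tree style (a sketch of mine, not the MCSB+ tree's actual layout): children are stored contiguously so that one child-group pointer replaces per-child pointers and more keys fit per node, and the node is sized to a 64-byte cache line, the traditional rule this paper re-examines against TLB-miss cost.

    #include <stdint.h>

    #define LINE_BYTES 64
    /* Keys that fit after the child-group pointer and the key count. */
    #define NODE_KEYS \
        ((LINE_BYTES - sizeof(void *) - sizeof(uint16_t)) / sizeof(uint32_t))

    struct csb_node {
        void    *child_group;     /* base of the contiguous child array  */
        uint16_t nkeys;           /* number of valid keys in this node   */
        uint32_t keys[NODE_KEYS]; /* keys packed into the remaining room */
    };

    /* The classic rule: one node per cache line. */
    _Static_assert(sizeof(struct csb_node) <= LINE_BYTES,
                   "node exceeds a cache line");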

6.
Cache misses in small, limited-associativity primary caches very often replace live cache blocks, given the dominance of capacity and conflict misses. Towards motivating novel cache organizations, we study the comparative characteristics of the virtual-memory address pairs involved in typical primary-cache contention (block replacements) for the SPEC2000 integer benchmarks. We focus on the cache tag bits, and results show that (i) often just a few tag bits differ between contending addresses, and (ii) accesses to certain segments or page groups of the virtual address space (i.e. certain tag-bit groups) contend frequently. Cache-conscious virtual address space allocation can further reduce the number of conflicting tag bits. We mention two directions for exploiting such page-level contention patterns to improve cache cost and performance.
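
The paper's first observation is easy to state as code. The sketch below (mine; the cache geometry and 32-bit address width are assumptions) counts the differing tag bits between an incoming address and the victim it replaces, the quantity that turns out to be small for typical contention:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 32-byte lines (5 offset bits), 128 sets (7 index
     * bits), so the tag is the top 20 bits of a 32-bit virtual address. */
    #define OFFSET_BITS 5
    #define INDEX_BITS  7

    static uint32_t tag_of(uint32_t vaddr) {
        return vaddr >> (OFFSET_BITS + INDEX_BITS);
    }

    /* XOR the two tags and count set bits (GCC/Clang builtin). */
    static int differing_tag_bits(uint32_t incoming, uint32_t victim) {
        return __builtin_popcount(tag_of(incoming) ^ tag_of(victim));
    }

    int main(void) {
        /* Addresses one cache-size (4 KB) apart differ in a single tag bit. */
        printf("%d\n", differing_tag_bits(0x00401000u, 0x00400000u));
        return 0;
    }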

7.
Traditional cache replacement policies, such as the widely used LRU algorithm, cannot effectively exploit the reuse of streaming data when the program's working set exceeds the cache capacity, leading to poor cache performance. This paper proposes a stream-characteristic-guided cache allocation policy (SAGA). The policy uses a stream-detection engine to discover stream information in the program, then dynamically decides on each cache miss whether to allocate a cache block for the missed data, ultimately improving data-cache performance. Experiments show that on a 1 MB cache, compared with LRU, SAGA reduces cache misses by 31% on average and lowers program CPI by 4% on average for the SPEC2000 floating-point benchmarks.
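
A minimal sketch of the allocation decision SAGA makes is shown below (my reconstruction of the idea; the detector structure and confidence threshold are invented). A tiny state record tracks the stride history of an access sequence, and once the sequence is classified as streaming, misses it causes are serviced without allocating a cache block:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Per-stream detector state; a real engine would key a small table
     * of these by load PC. */
    struct stream_entry {
        uint32_t last_addr;
        int32_t  stride;
        uint32_t run_len;   /* consecutive accesses matching the stride */
    };

    #define STREAM_RUN_THRESHOLD 8  /* assumed confidence threshold */

    static bool is_streaming(struct stream_entry *e, uint32_t addr) {
        int32_t stride = (int32_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->stride)
            e->run_len++;
        else {
            e->stride  = stride;
            e->run_len = 0;
        }
        e->last_addr = addr;
        return e->run_len >= STREAM_RUN_THRESHOLD;
    }

    /* On a miss: allocate a block only for non-streaming data. */
    static bool should_allocate(struct stream_entry *e, uint32_t miss_addr) {
        return !is_streaming(e, miss_addr);
    }

    int main(void) {
        struct stream_entry e = {0};
        for (int i = 1; i <= 12; i++)   /* a unit-stride (64-byte) stream */
            printf("addr %#x allocate=%d\n", (unsigned)(i * 64),
                   should_allocate(&e, (uint32_t)(i * 64)));
        return 0;
    }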

8.
A Low-Power Data Cache Scheme Based on Very Narrow Values
Reducing power consumption has become one of the most important design concerns. Modern microprocessors employ on-chip caches to bridge the huge speed gap between main memory and the central processing unit (CPU), but the cache has also become a major source of processor power consumption, making the design of low-power cache arrays increasingly important. Very narrow values (VNVs), which need only a few bits to store, account for a large fraction of what caches store and access. Based on this observation, we propose a low-power cache organization based on very narrow values (VNVC). In a VNVC, the data array is split into a low-bit array and a high-bit array. Under the control of a flag bit, the high-bit storage cells of words holding very narrow values are turned off to save both dynamic and static power. VNVC achieves low power purely by modifying the storage arrays, needs no extra auxiliary hardware, and does not affect the performance of the original cache, so it suits all cache organizations. Simulation results with 12 SPEC2000 benchmarks show that a very-narrow-value width of 4 bits yields the greatest savings, reducing dynamic power by 29.85% and static power by 29.94% on average.
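
What counts as a very narrow value can be expressed in a few lines. The C sketch below (my illustration; only the 4-bit width comes from the abstract) tests whether a 32-bit word's upper bits are pure sign extension, which is the condition under which the high-bit array can be gated off:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A 32-bit word is a k-bit very narrow value if its upper 32-k bits
     * are pure sign extension, so only the low k bits plus a flag need
     * power. The shift goes through uint32_t to avoid shifting a
     * negative value. */
    static bool is_vnv(int32_t word, int k) {
        int32_t low = (int32_t)((uint32_t)word << (32 - k)) >> (32 - k);
        return low == word;
    }

    int main(void) {
        printf("%d %d %d %d\n",
               is_vnv(5, 4),     /* 1: fits in 4 bits          */
               is_vnv(-3, 4),    /* 1: small negatives fit too */
               is_vnv(9, 4),     /* 0: needs 5 bits            */
               is_vnv(200, 4));  /* 0                          */
        return 0;
    }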

9.
Research on On-Chip Memory Systems for Multicore Processors
The growing gap between multicore processors' computing power and their memory-access speed constrains system-performance gains. Addressing this problem, this paper analyzes the memory-system design features of several typical multicore processors and discusses the key issues in the development of multicore on-chip memory systems, including non-uniform cache access caused by latency, the constraints that the core-to-cache interconnect imposes on memory-access performance, and the growing complexity of on-chip cache design.

10.
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache it sends a copy of the datum to the caches of the processors that the compiler identified as its consumers. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches forwarding does not significantly increase the number of conflict misses. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.

11.
A Low-Power Instruction Cache Scheme Based on a Record Buffer
Modern microprocessors mostly employ on-chip caches to bridge the huge speed gap between main memory and the central processing unit (CPU), but the cache has become a major source of processor power consumption, most of which comes from the instruction cache. A buffer can filter out most instruction-cache accesses and thereby reduce power, but a considerable number of unnecessary accesses to the storage arrays remain. We therefore propose RBC, a low-power instruction-cache organization based on a record buffer. With a record buffer and modified storage arrays, RBC filters out most unnecessary storage-array accesses and effectively reduces cache power. Simulation results on 10 SPEC2000 benchmarks show that, compared with a conventional buffer-based cache organization, this scheme saves 24.33% of the instruction cache's power at a cost of only 6.01% in processor performance and 3.75% in area.

12.
When a chip multiprocessor runs several different programs, allocating an appropriate share of cache space to each application becomes a hard problem. Cache partitioning is an effective way to solve it, and most existing partitioning schemes target the shared last-level cache. The private cache partitioning (PCP) approach organizes multiple private caches together with a distributed coherence engine (DCE), uses a hardware profiling unit to obtain each program's hit distribution across the cache ways, uses this information to guide the partitioning algorithm, and finally has each DCE partition the cache space according to the algorithm's result. Experimental results show that PCP lowers the miss rate and improves program execution performance.
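
The way-granularity hit profile mentioned above can drive a simple greedy partitioner. The sketch below is my reconstruction in the spirit of utility-based partitioning, not PCP's published algorithm: hits[p][w] is the extra hits program p gains from its w-th way, and ways are handed out one at a time to the program with the largest marginal gain:

    #include <stdint.h>
    #include <stdio.h>

    #define NPROG 2   /* co-running programs (assumed) */
    #define NWAYS 8   /* ways in the partitioned cache (assumed) */

    /* Greedy allocation: repeatedly give the next way to the program
     * whose next way would add the most hits. */
    static void partition(const uint32_t hits[NPROG][NWAYS], int alloc[NPROG]) {
        for (int p = 0; p < NPROG; p++) alloc[p] = 0;
        for (int given = 0; given < NWAYS; given++) {
            int best = 0;
            for (int p = 1; p < NPROG; p++)
                if (hits[p][alloc[p]] > hits[best][alloc[best]]) best = p;
            alloc[best]++;
        }
    }

    int main(void) {
        /* Hypothetical profiles: program 0 saturates after 2 ways,
         * program 1 keeps benefiting from more ways. */
        const uint32_t hits[NPROG][NWAYS] = {
            {900, 300, 10, 5, 2, 1, 1, 1},
            {800, 700, 600, 500, 400, 300, 200, 100},
        };
        int alloc[NPROG];
        partition(hits, alloc);
        printf("program 0: %d ways, program 1: %d ways\n",
               alloc[0], alloc[1]);     /* prints 2 and 6 */
        return 0;
    }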

13.
To offer robustness and a high quality of service, modern computing architectures running real-time applications should provide high system performance and high timing predictability. Cache memory improves performance by bridging the speed gap between main memory and the CPU. However, the cache introduces timing unpredictability, creating serious challenges for real-time applications. Here, we introduce a miss-table (MT) based cache-locking scheme at the level-2 (L2) cache to improve both timing predictability and the system's performance/power ratio. The MT holds the block addresses of the application being processed that would cause the most cache misses if not locked. The information in the MT is used for efficient selection of the blocks to be locked and of the victim blocks to be replaced. This MT-based approach improves timing predictability by locking the most miss-prone blocks inside the cache for the entire execution time. In addition, the technique decreases the average delay per task and the total power consumption by reducing cache misses and avoiding unnecessary data transfers. The MT-based solution is effective for both uniprocessors and multicores. We evaluate the proposed MT-based cache-locking scheme by simulating an 8-core processor with two levels of cache, using MPEG4 decoding, H.264/AVC decoding, FFT, and MI workloads. Experimental results show that, in addition to improving predictability, using the MT and locking 25% of the L2 reduces the mean delay per task by 21% and the total power consumption by 18% for MPEG4 (and H.264/AVC). The MT itself contributes about 5% of the delay and power reductions on these video applications, possibly more on applications with worse cache behavior. For the FFT and MI (and other) applications whose code fits inside the level-1 instruction (I1) cache, the mean delay per task increases by only 3% and total power consumption by 2% due to the addition of the MT.
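
A minimal sketch of the lock-selection step follows (my reading of the miss-table idea; the table size, profile values, and budget are invented): profile miss counts per block address, sort them, and lock the worst offenders up to the locking budget, e.g. 25% of the L2 as in the paper's experiments:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One miss-table entry: a block address and its profiled miss count. */
    struct mt_entry { uint32_t block_addr; uint32_t misses; };

    /* qsort comparator: descending by miss count. */
    static int by_misses_desc(const void *a, const void *b) {
        const struct mt_entry *x = a, *y = b;
        return (x->misses < y->misses) - (x->misses > y->misses);
    }

    /* Lock the top-`budget` blocks with the highest miss counts. */
    static void select_locked(struct mt_entry *mt, size_t n, size_t budget) {
        qsort(mt, n, sizeof *mt, by_misses_desc);
        for (size_t i = 0; i < n && i < budget; i++)
            printf("lock block %#x (%u misses)\n",
                   mt[i].block_addr, mt[i].misses);
    }

    int main(void) {
        struct mt_entry mt[] = {
            {0x1000, 40}, {0x2000, 310}, {0x3000, 75}, {0x4000, 150},
        };
        select_locked(mt, 4, 2);   /* budget: lock the 2 worst offenders */
        return 0;
    }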

14.
Continuing advances in main-memory technology have made main-memory multimedia databases feasible. Research shows that the performance of a main-memory multimedia database system is strongly affected by processor cache misses, and cache-conscious main-memory indexes are an effective means of improving retrieval efficiency. To remedy the SA-tree's unsuitability for main-memory access, we propose a variant of it, the CSA-tree. The CSA-tree uses PCA dimensionality reduction to represent nodes at different levels of the tree with different dimensionalities, which not only improves the utilization of cache space but also reduces CPU load, thereby improving index query efficiency. Extensive experiments demonstrate that the CSA-tree offers good high-dimensional retrieval performance in main-memory environments.

15.
A Dynamic Implicit Isolation Mechanism for the Shared Cache in On-Chip Many-Core Architectures
Memory bandwidth is the key factor limiting the performance growth of many-core processors, so it is necessary to design the last-level on-chip cache to be shared by all cores. Isolating conflicting data within the shared cache is key to improving its performance. This paper proposes a hardware cache-block linking mechanism to isolate the data of different threads in the shared cache. The mechanism is evaluated on a cycle-accurate simulator of an on-chip many-core architecture, using the SPLASH-2 programs and bioinformatics workloads. Experimental results show that, compared with a conventional shared cache, the cache-block linking mechanism reduces the shared cache's conflict miss rate by about 20% and raises IPC by about 10% on average.

16.
Modern high-performance SoCs usually introduce a programmer-transparent on-chip cache to buffer main-memory data. A conventional data cache, however, is constrained by its capacity and set associativity and therefore often suffers conflicts. This work eliminates part of those conflicts by introducing a scratchpad memory (SPM), another on-chip memory that coexists with the data cache. We propose an MMU-managed on-chip memory architecture in which the cache and the SPM coexist. Borrowing the ideas of virtual-memory management, segments of the program address space that are contiguous virtually but scattered physically are buffered by these heterogeneous on-chip memories, so that pages prone to causing data-cache conflicts are relocated into the SPM during execution, yielding both energy and performance benefits.
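
The relocation policy can be sketched as a page-grain counter scheme (my illustration of the idea; the threshold and table shape are invented): conflict misses are counted per virtual page, and a page that keeps conflicting in the data cache has its translation retargeted to SPM space:

    #include <stdbool.h>
    #include <stdint.h>

    #define NPAGES             1024  /* tracked pages (assumed)          */
    #define CONFLICT_THRESHOLD 64    /* relocation trigger (assumed)     */

    enum backing { BACK_DRAM_CACHED, BACK_SPM };

    struct page_entry {
        uint32_t     spm_frame;        /* valid when backing == BACK_SPM */
        uint16_t     conflict_misses;  /* conflict misses seen so far    */
        enum backing backing;
    };

    static struct page_entry table[NPAGES];

    /* Called by the miss handler when a miss is classified as a conflict:
     * once a page crosses the threshold, its translation is remapped so
     * future accesses go to the SPM instead of the conflicting cache sets. */
    static void note_conflict(uint32_t vpage, uint32_t *next_free_spm_frame) {
        struct page_entry *e = &table[vpage % NPAGES];
        if (e->backing == BACK_SPM)
            return;                     /* already relocated */
        if (++e->conflict_misses >= CONFLICT_THRESHOLD) {
            e->backing   = BACK_SPM;
            e->spm_frame = (*next_free_spm_frame)++;
        }
    }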

17.
This paper presents a low-power instruction cache design: a line buffer is inserted between the CPU and the level-1 instruction cache to reduce the number of CPU accesses to the instruction cache and thereby lower its power consumption. In addition, a refill-control unit is added to the line-buffer controller so that, on an instruction-cache miss, instructions from off-chip memory can be delivered directly to the CPU, minimizing the fetch delay caused by the miss. Verification shows that the design reduces power while also improving instruction-cache performance.
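
The fetch path of such a design can be sketched in a few lines of C (illustrative only; the paper describes an RTL design, and all names here are mine). Sequential fetches that hit the one-line buffer never touch the instruction cache, which is where the power saving comes from:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BYTES 32u   /* assumed line size */

    struct line_buffer {
        uint32_t line_addr;          /* address of the buffered line */
        bool     valid;
        uint8_t  data[LINE_BYTES];
    };

    /* Returns true if the fetch is served by the buffer; on false the
     * controller accesses the I-cache (or, on an I-cache miss, streams
     * the refill from external memory to the CPU and buffer directly).
     * Assumes pc is 4-byte aligned. */
    static bool lb_fetch(struct line_buffer *lb, uint32_t pc, uint8_t out[4]) {
        uint32_t line = pc & ~(LINE_BYTES - 1);
        if (lb->valid && lb->line_addr == line) {
            for (int i = 0; i < 4; i++)
                out[i] = lb->data[(pc % LINE_BYTES) + (uint32_t)i];
            return true;   /* buffer hit: the I-cache stays idle */
        }
        return false;      /* buffer miss: go to the I-cache     */
    }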

18.
With the rapid development of integrated-circuit process technology, the single-chip multiprocessor (CMP) is an effective way to utilize on-chip transistor resources and improve system performance. The cores of a CMP share data through shared levels of the memory hierarchy, such as a shared L1 cache or a shared L2 cache. The Exchangeable Data Cache Architecture CMP (EDCA-CMP) shares the data caches by exchanging the contents of the L1 data caches, which lowers the latency of accesses to the next level of the hierarchy, raises the data-cache hit rate, and achieves higher performance.

19.
Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction-level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation, which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location, and program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB), uses a centralized buffer to support speculative versions. Our proposal, the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. Our evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance and that private-cache solutions trade off hit rate for hit latency.

20.
Iyer, Ravi 《World Wide Web》2004,7(3):259-280
As Internet usage continues to expand rapidly, careful attention must be paid to the design of Internet servers to achieve high performance and end-user satisfaction. Currently, the memory system remains a significant performance bottleneck for Internet servers employing multi-GHz processors. In this paper our aim is two-fold: (1) to characterize the cache/memory performance of web-server workloads and (2) to propose and evaluate cache-design alternatives for future web servers. We chose SPECweb99 as the representative web-server workload, and our entire characterization and evaluation methodology is based on our CASPER simulation framework. We begin by exploring the processor cache design space for single- and dual-processor servers. Based on our observations, we then evaluate other cache-hierarchy alternatives such as chipset caches, coherence filters, and decompressed page stores. We show the sensitivity of these components to basic organization parameters such as cache size, line size, and degree of associativity. We also present the performance implications of routing memory requests initiated by I/O devices through these caches. Based on detailed simulation data and its implications for system-level performance, this paper shows that chipset caches have significant potential for improving future web-server performance.
