Similar Documents
17 similar documents retrieved (search time: 187 ms)
1.
The memory-wall problem keeps cache research perennially important. As on-chip cache capacity keeps growing, wire delay has become a major constraint on cache design: to provide a uniform access latency, traditional designs must accommodate the access time of the cache bank farthest from the processor. Researchers therefore proposed the Non-Uniform Cache Architecture (NUCA), which has all but become the trend for large-capacity cache design in future processors. When the processor accesses a NUCA, a hit in a bank near the processor means a short wait, while a hit in a distant bank means a longer one. This paper surveys the origins and development of NUCA technology and the most representative NUCA systems to date; it points out two multiprocessor memory-system technologies that NUCA research can draw on, NUMA and COMA; finally, it identifies the key problems in NUCA research and sketches corresponding solution approaches.

2.
CMP is an important direction in the evolution of processor architecture, and verifying cache coherence is a key task in CMP design. Based on the MESI coherence protocol, this paper builds a verification model for CMP cache coherence, summarizes three verification methods (state enumeration, model checking, and the symbolic state method), and gives a complexity analysis of each; a small state-enumeration sketch follows.
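To make the state-enumeration idea concrete, here is a minimal, hypothetical sketch (an illustration, not the paper's model): it exhaustively explores the reachable global states of one cache line kept coherent by MESI across n caches and asserts the coherence invariant at every state. Evictions are ignored and bus transactions are assumed atomic.

```python
# Hypothetical state-enumeration sketch (not the paper's model): explore all
# reachable global MESI states of one line shared by n caches via BFS, and
# check the coherence invariant at each state.
from collections import deque
from itertools import product

def next_state(gs, requester, op):
    """Global state after `requester` performs `op` ('read' or 'write')."""
    gs = list(gs)
    if op == "read":
        if gs[requester] != "I":
            return tuple(gs)                      # read hit: no change
        if any(s != "I" for s in gs):             # another cache has a copy
            gs = ["S" if s in "MES" else "I" for s in gs]
            gs[requester] = "S"
        else:
            gs[requester] = "E"                   # exclusive on a clean miss
    else:
        if gs[requester] in "ME":
            gs[requester] = "M"                   # silent upgrade on write hit
        else:
            gs = ["I"] * len(gs)                  # invalidate all other copies
            gs[requester] = "M"
    return tuple(gs)

def enumerate_reachable(n):
    init = ("I",) * n
    seen, work = {init}, deque([init])
    while work:
        gs = work.popleft()
        valid = [s for s in gs if s != "I"]
        # Invariant: an M or E copy must be the only valid copy in the system.
        assert len(valid) <= 1 or all(s == "S" for s in valid), gs
        for req, op in product(range(n), ("read", "write")):
            ns = next_state(gs, req, op)
            if ns not in seen:
                seen.add(ns)
                work.append(ns)
    return seen

print(sorted(enumerate_reachable(2)))   # 6 reachable global states for 2 caches
```

Even this toy shows why the complexity analysis matters: the raw state space is 4^n per line, and enumeration stays tractable only because the protocol makes most of it unreachable.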

3.
The Non-Uniform Cache Architecture (NUCA) has all but become the direction for future large on-chip caches. In the NUCA of a multicore processor, contention among cores for shared data can leave that data parked in the central cache banks, increasing NUCA access latency. This paper proposes a bank coherence technique that supports data replicas: by selectively creating replicas in the NUCA for the cores that access the data, bank coherence eases multicore contention for shared data. The paper describes the design of the bank coherence protocol in detail, and finally evaluates eight NPB benchmarks on a full-system simulator. The results show that bank coherence effectively eases contended access to shared data in multicore processors: compared with a CMP-DNUCA structure without it, the approach raises system IPC by 5.95% on average. A rough sketch of a replica policy appears below.
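As an illustration only (the threshold, the counters, and the invalidation rule are assumptions, not the paper's protocol), a home bank might track per-core read counts and spawn a replica near a hot reader, with writes invalidating every replica to keep the banks coherent:

```python
# Hypothetical replica-based bank coherence sketch (details assumed): the home
# bank counts reads per core and creates a replica in the bank nearest a core
# once its count crosses REPLICA_THRESHOLD; a write invalidates all replicas.
from collections import defaultdict

REPLICA_THRESHOLD = 4  # assumed tuning parameter

class HomeBank:
    def __init__(self, line_addr):
        self.line_addr = line_addr
        self.read_counts = defaultdict(int)   # core id -> reads since last write
        self.replicas = set()                 # cores holding a nearby replica

    def read(self, core):
        if core in self.replicas:
            return "hit in local replica bank"
        self.read_counts[core] += 1
        if self.read_counts[core] >= REPLICA_THRESHOLD:
            self.replicas.add(core)           # place copy in bank nearest `core`
            return "replica created near core %d" % core
        return "served from home bank"

    def write(self, core):
        # Keep banks coherent: drop every replica, reset the counters.
        invalidated = len(self.replicas)
        self.replicas.clear()
        self.read_counts.clear()
        return "write by core %d; %d replicas invalidated" % (core, invalidated)

bank = HomeBank(0x80)
for _ in range(5):
    print(bank.read(core=1))
print(bank.write(core=0))
```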

4.
As integrated-circuit process technology advances, integrating a large cache on chip has become the trend in microprocessors. Interconnect wire delay, however, accounts for an ever larger share of access time and has become the performance bottleneck of large caches, so new cache architectures are needed to overcome it. The Non-Uniform Cache Architecture reduces hit time and improves performance by supporting multiple latency levels and block migration inside the cache, overcoming the limits that wire delay places on large caches, and it has become a focus of research on on-chip memory hierarchies. This paper reviews progress on NUCA models, gives particular attention to NUCA models for chip multiprocessors, and compares the contributions and shortcomings of the different models. It closes with an outlook on the development of NUCA.

5.
The arrival of on-chip multicore technology brings many challenges to processor design and implementation, and the design of the on-chip memory system is among the most important. To ease the ever more severe memory-wall problem, researchers typically place a large last-level cache on chip, and the design and optimization of on-chip last-level caches has become a hot research topic. This paper introduces the challenges facing last-level cache design in chip multiprocessors (CMPs), then surveys a range of CMP last-level cache optimization techniques built on private and shared baseline designs, and compares and analyzes them.

6.
With the rapid advance of integrated-circuit process technology, the single-chip multiprocessor (CMP) is an effective way to use on-chip transistor resources and raise system performance. The cores of a CMP share data through shared levels of the memory hierarchy, such as a shared L1 cache or a shared L2 cache. The Exchangeable Data Cache Architecture CMP (EDCA-CMP) shares data caches by exchanging the contents of the cores' L1 data caches, which lowers the latency of accesses to the next memory level, raises the data-cache hit rate, and yields higher performance.

7.
The Non-Uniform Cache Architecture (NUCA) has all but become the design trend for future large on-chip caches. In a NUCA, data promotion, which places frequently accessed data in cache banks closer to the processor to shorten the processor's wait for that data, has a major effect on performance. Existing promotion techniques, however, apply a fixed policy and ignore the actual state of the target bank, so they can push more useful data in the target bank away from the processor; the resulting cache pollution severely limits what promotion can deliver. To address this, the paper proposes smart multi-hop promotion, which senses the state of candidate target banks and dynamically chooses a suitable target bank for each promoted datum, improving promotion efficiency and reducing cache pollution. The design also cleverly reuses the return path of processor accesses: it merely extends the format of processor access messages and adds no extra accesses to the cache banks. In a detailed full-system-simulator evaluation of 15 benchmarks from the NAS Parallel Benchmarks and the Livermore Benchmark, smart multi-hop promotion saves 1.50 times as many clock cycles per promotion operation as existing techniques (up to 2.61 times), and raises system IPC by 6.24% on average (up to 19.03%). A rough sketch of such a state-aware target choice follows.
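Purely as an illustration (the hotness metric and the first-fit rule are assumptions, not the paper's policy), a state-aware promotion might walk the banks on the return path and stop at the first bank whose coldest line is colder than the line being promoted:

```python
# Hypothetical state-aware promotion sketch (metric and rule assumed): on a
# hit in a distant bank, examine the banks on the return path toward the
# processor and promote into the first bank whose victim would be colder than
# the promoted line, instead of blindly swapping with the next-closer bank.
def pick_promotion_target(path_banks, line_hotness):
    """path_banks: banks ordered from nearest-to-processor outward.
    Each bank is a dict {line: hotness}. Returns index of chosen bank or None."""
    for idx, bank in enumerate(path_banks):
        victim_hotness = min(bank.values()) if bank else 0
        if victim_hotness < line_hotness:
            return idx  # promoting here demotes only a colder line
    return None  # every closer bank holds hotter data: avoid cache pollution

# Banks from closest to farthest; hotness = recent access count (assumed metric).
banks = [{"A": 9, "B": 8}, {"C": 2, "D": 7}, {"E": 1}]
print(pick_promotion_target(banks, line_hotness=5))  # -> 1: bank 0 is too hot
```

The final `None` case is exactly the pollution argument in the abstract: if every closer bank holds hotter data, the best move is not to promote at all.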

8.
Cache design in a multicore environment is shaped by wire delay, application behavior, and other factors, and both the private and the shared scheme have their own shortcomings. This paper proposes a heterogeneous CMP cache structure: the multicore chip is built from two kinds of nodes with different cache hierarchies, and techniques such as indirect-index-based cache capacity reuse are designed to provide an on-chip memory hierarchy that is both capacity-efficient and fast to access. Full-system evaluations on SPEC CPU2000, SPLASH-2, and other programs show that the heterogeneous CMP cache structure adapts to the needs of diverse applications, improving average performance by up to 16% for single-process applications and up to 9% for multithreaded ones. The heterogeneous CMP cache is also simple to design in hardware and readily implementable in engineering practice, and its design ideas will be applied in future Loongson multicore processor designs.

9.
Simultaneous multithreading (SMT) is a latency-tolerant architecture with a shared L2 cache that can execute multiple instructions from multiple threads each cycle, which increases pressure on the memory hierarchy. This paper studies how to partition the cache shared by concurrently executing threads in an SMT processor, focusing on fairness in cache sharing and its relation to throughput. The traditional LRU policy partitions the shared cache implicitly according to thread demand, giving more cache space to high-demand threads; this unfairness in cache management can cause thread starvation and priority inversion. The paper implements an adaptive runtime partitioning (ARP) mechanism to manage the shared cache. ARP uses fairness as the partitioning metric and a dynamic partitioning algorithm to optimize it; the algorithm is easy to implement and needs little profiling. In hardware, classical monitors collect each thread's stack-distance information at a storage overhead below 0.25%. Experiments show that, compared with LRU-based cache partitioning, ARP improves the fairness of a 2-way SMT processor by a factor of 2.26 while raising throughput by 14.75% on average. A simplified sketch of stack-distance-driven partitioning appears below.
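As a simplified stand-in for ARP (the fairness objective used here, equalizing miss ratios, and the exhaustive search are assumptions), stack-distance histograms let us predict each thread's miss ratio at any way allocation, and a partition can then be chosen against a fairness metric rather than by LRU's implicit demand-driven split:

```python
# Hypothetical fairness-driven partitioning sketch from stack-distance
# profiles: hist[d] counts accesses with stack distance d, so a thread given
# w ways misses every access with distance >= w. We pick the split of W ways
# between two threads that makes their miss ratios as equal as possible.
def miss_ratio(hist, ways):
    total = sum(hist)
    return sum(hist[ways:]) / total if total else 0.0

def fair_split(hist_a, hist_b, total_ways):
    best = None
    for w in range(1, total_ways):            # give w ways to A, the rest to B
        gap = abs(miss_ratio(hist_a, w) - miss_ratio(hist_b, total_ways - w))
        if best is None or gap < best[0]:
            best = (gap, w)
    return best[1]

# Thread A reuses a small working set; thread B touches a larger, flatter one.
hist_a = [50, 30, 10, 5, 3, 1, 1, 0]
hist_b = [10, 10, 10, 10, 10, 10, 10, 10]
w_a = fair_split(hist_a, hist_b, total_ways=8)
print(w_a, 8 - w_a)   # ways for A and B under the fairness objective
```

ARP's actual hardware collects these histograms with stack-distance monitors at under 0.25% storage overhead; the search above is a software analogue of its dynamic partitioning step.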

10.
Current research on cache management techniques for multicore processor systems
The design and management of the cache structure in multicore processors is an important problem in microprocessor design. Mainstream commercial microprocessors today adopt an architecture with a shared last-level cache, and the performance of the on-chip last-level cache usually has a large impact on overall processor performance, so shared-cache management has become a hot research topic. This paper first introduces today's mainstream multicore processors and their design issues, then presents three important shared-cache management techniques: thread scheduling, NUCA, and cache partitioning. It concludes with directions for the development of multicore cache management techniques.

11.
The growing influence of wire delay in cache design means that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Moreover, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to off-chip memory, given the significant speed gap between processor and memory. A bank replacement policy that manages the NUCA cache efficiently is therefore desirable. However, the decentralized nature of NUCA undermines conventional replacement policies: banks operate independently of one another, so replacement decisions are confined to a single NUCA bank. In this paper, we propose three different techniques for handling replacements in NUCA caches.

12.
The many-core SoC is a clear future trend, and process yield will face many unpredictable challenges. Non-uniform cache architecture (NUCA) can improve the performance of many-core SoCs for embedded systems: it embeds a NoC in the cache memory and speeds up data access by distributing traffic across several banks in parallel. A fault-tolerance mechanism is important in NUCA because the chip can then keep working efficiently when some memory banks become unusable. In this paper, we design a router that works with static and dynamic cache remapping policies to support faulty banks in NUCA. When an L2 cache bank becomes unusable, the static remapping policy (SRP) selects a suitable neighboring cache bank to take over its accesses according to a collected remapping cost that considers the cache status and the traffic status of the router. We also propose a dynamic remapping policy (DRP) that selects a suitable cache bank at runtime to fit the actual load of the nodes neighboring the faulty bank. Experimental results show that SRP improves performance by approximately 26% on average and DRP by approximately 28%. A toy sketch of the SRP-style cost-based choice follows.
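A toy version of the SRP-style choice (the cost weights ALPHA and BETA are assumptions; the abstract says only that the cost considers cache status and router traffic status):

```python
# Hypothetical SRP-style neighbor selection sketch (weights assumed): when a
# bank is marked faulty, redirect its addresses to the neighboring bank with
# the lowest combined cost, mixing cache occupancy and router traffic.
ALPHA, BETA = 0.6, 0.4  # assumed weights for cache vs. traffic pressure

def select_remap_target(neighbors):
    """neighbors: list of (bank_id, cache_load, router_load), loads in [0, 1]."""
    cost = lambda n: ALPHA * n[1] + BETA * n[2]
    return min(neighbors, key=cost)[0]

faulty_bank_neighbors = [
    ("north", 0.9, 0.2),   # nearly full bank behind a quiet router
    ("south", 0.4, 0.5),   # moderate on both counts
    ("east",  0.3, 0.9),   # roomy bank behind a congested router
]
print(select_remap_target(faulty_bank_neighbors))  # -> 'south'
```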

13.
Chip multiprocessors (CMP) have emerged over the last decades as a very attractive way to use the ever-increasing on-chip transistor count. However, classical parallelization techniques fail to fully extract parallelism from existing sequential applications because of data dependences that cannot be ruled out statically. This paper focuses on Thread-Level Speculation (TLS), an alternative way to exploit the transistor budget of a CMP. With TLS, even possibly data-dependent threads can run in parallel as long as the semantics of the sequential execution is preserved. Special hardware support monitors the actual data dependences between threads at run time and, if they are violated, undoes the effects of misspeculation, usually through replay; such a system is known as a speculative CMP. The TLS mechanism, however, requires complex protocols that integrate cache coherence and speculation to maintain program order among multiple versions of data, and current TLS protocol evaluations are usually inadequate because they are not performed at a sufficiently low level. A realistic evaluation of speculative CMPs requires either real hardware or very detailed cycle-accurate simulator models. In this paper we focus on a low-level evaluation of Speculation Integrated with Snoopy Coherence (SISC), the write-invalidate TLS protocol proposed in [1]. The evaluation relies on a cycle-level simulation environment with detailed models of the cache memories, the cache controller, and the system bus. On top of this, we simulate a speculative four-core architecture and provide three new modules (Scheduler, Squash Arbiter, and Supplier Arbiter) to support a low-level implementation of the SISC protocol. The overall cost of the SISC protocol is evaluated with the CACTI tool in three domains: access latency, area, and power. The goal was to keep the cache access time below the cycle latency and the area and power overheads within an acceptable budget. SISC is compared against a regular MESI-based architecture in both 32-bit and 64-bit versions: the cache access time stays below the cycle latency, and the data cache area and static power overheads stay below 32% and 35%, respectively. A minimal sketch of the dependence check that drives squashing follows.
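The squash condition itself is compact; here is a textbook read-set/write-set formulation (a generic TLS check, not the SISC protocol's actual implementation):

```python
# Hypothetical TLS dependence-violation sketch (generic, not SISC): threads
# are ordered by sequential semantics; if an earlier thread writes an address
# that a later (more speculative) thread has already read, the later thread
# consumed stale data and must be squashed and replayed.
class SpecThread:
    def __init__(self, order):
        self.order = order            # position in sequential program order
        self.read_set, self.write_set = set(), set()
        self.squashed = False

    def read(self, addr):
        self.read_set.add(addr)

    def write(self, addr, threads):
        self.write_set.add(addr)
        for t in threads:             # snoop: check more-speculative readers
            if t.order > self.order and addr in t.read_set:
                t.squashed = True     # RAW violation -> squash and replay

t0, t1 = SpecThread(0), SpecThread(1)
threads = [t0, t1]
t1.read(0x100)              # the speculative thread reads early...
t0.write(0x100, threads)    # ...then the earlier thread writes that address
print(t1.squashed)          # True: t1 must re-execute with the new value
```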

14.
A cache structure trading off latency and capacity in chip multiprocessors
L2 cache design in a chip multiprocessor faces a tension between latency and capacity that cannot be resolved by either pure design: a private organization has low hit latency but reduces the cache's effective capacity, while a shared organization increases effective capacity but suffers long hit latency. This paper proposes a CMP cache structure that trades off latency against capacity (TCLC). TCLC is a hybrid of the private and shared designs: its core idea is to dynamically identify the sharing type of each cache block and optimize each type separately, migrating private blocks, replicating shared read-only blocks, and centrally placing shared read-write blocks, so that access latency approaches that of a private design while effective capacity approaches that of a shared design, mitigating the impact of wire delay and reducing average memory access latency. Full-system simulation shows that TCLC improves performance by 13.7% on average over a private design and by 12% on average over a shared design. A small sketch of the classification dispatch appears below.
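The dispatch rules below come straight from the abstract; the detection side of this sketch (a sharer set and a written flag per block) is a simplifying assumption:

```python
# TCLC-style block classification sketch (dispatch rules from the abstract,
# detection mechanism assumed): track which cores touch a block and whether
# anyone writes it, then pick the matching placement policy.
def classify(block):
    if len(block["sharers"]) <= 1:
        return "private"
    return "shared-rw" if block["written"] else "shared-ro"

POLICY = {
    "private":   "migrate toward the owning core",
    "shared-ro": "replicate near each reader",
    "shared-rw": "place in the center banks",
}

blocks = {
    0x10: {"sharers": {0},       "written": True},    # one core, read-write
    0x20: {"sharers": {0, 1, 2}, "written": False},   # many readers, no writer
    0x30: {"sharers": {1, 3},    "written": True},    # contended read-write
}
for addr, blk in blocks.items():
    print(hex(addr), "->", POLICY[classify(blk)])
```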

15.
The significant speed gap between processor and memory makes last-level cache performance crucial for multi-core architectures (MCA). Non-uniform cache architecture (NUCA) has been proposed to overcome the performance limitations of MCA for many embedded applications: the cache is partitioned into sub-banks, each an independently accessible entity connected by a fast on-chip network (NoC). This paper presents two NoC-assisted mechanisms that improve the performance and power consumption of NUCA coherence. The first provides priority-based communication on top of a wormhole-routing architecture to support NUCA coherence: high-priority coherent packets are transmitted first to save time. The second offers multicast communication on top of the proposed priority-based NoC to provide efficient cache coherency for NUCA; coherence packets are dispatched and collected at collecting nodes (CN) to further reduce the number of coherent messages flowing in the NoC. Experimental results show that priority-based transmission improves performance by approximately 10%, that the proposed multicast mechanism further improves performance and decreases NoC power consumption in NUCA by approximately 15%, and that together the two mechanisms enhance performance by 25% on average. A minimal sketch of priority-based packet scheduling follows.
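A minimal sketch of the first mechanism's core idea, priority-based transmission at a router output port (the two priority levels and the FIFO tie-breaking are assumptions):

```python
# Hypothetical priority-based NoC output port sketch (priority levels
# assumed): coherence packets (invalidations, acks) preempt ordinary data
# replies in the arbiter so coherence stalls resolve sooner.
import heapq
import itertools

COHERENCE, DATA = 0, 1          # lower value = served first
_seq = itertools.count()        # tie-breaker keeps FIFO order within a class

class PriorityPort:
    def __init__(self):
        self.queue = []

    def enqueue(self, packet, priority):
        heapq.heappush(self.queue, (priority, next(_seq), packet))

    def transmit(self):
        return heapq.heappop(self.queue)[2] if self.queue else None

port = PriorityPort()
port.enqueue("data reply for core 3", DATA)
port.enqueue("invalidate 0x40", COHERENCE)
port.enqueue("ack to home bank", COHERENCE)
while (pkt := port.transmit()):
    print(pkt)   # both coherence packets drain before the data reply
```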

16.
Wire delays and leakage energy consumption are both growing problems in the design of large on-chip caches built in deep-submicron technologies. D-NUCA (Dynamic Non-Uniform Cache Architecture) caches exploit aggressive sub-banking of the cache and a migration mechanism that speeds up access to frequently used data, limiting the effect of wire delays on performance. Way Adaptable D-NUCA is a leakage-power reduction technique specifically suited to D-NUCA caches: it dynamically varies the powered-on portion of the cache area according to the caching needs of the running workload, but it relies on application-dependent parameters that must be evaluated off-line, which limits its effectiveness in a general-purpose, multiprogrammed environment. In this paper, we propose a new power-reduction technique for D-NUCA caches that still adapts the powered-on cache area to the needs of the running workload but does not rely on application-dependent parameters. Results show that our proposal saves around 49% of total cache energy consumption in a single-core environment and 44% in a CMP environment. With the addition of a timer, it performs similarly to previously proposed leakage-reduction techniques and outperforms them when they are applied in a workload-independent manner. A rough sketch of workload-adaptive way gating follows.
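One plausible parameter-free control rule (entirely an assumption; the abstract does not specify the paper's rule) watches where hits land among the powered-on ways and grows or shrinks the powered set accordingly:

```python
# Hypothetical workload-adaptive way-gating sketch (control rule assumed):
# if nearly all hits land in the most-recently-used half of the powered-on
# ways, the tail ways are idle and one can be gated off; if too many hits
# fall in the tail, power a way back on. No offline, per-application tuning.
def adapt_ways(powered, mru_half_hits, total_hits, min_ways=2, max_ways=16):
    if total_hits == 0:
        return powered
    mru_fraction = mru_half_hits / total_hits
    if mru_fraction > 0.95 and powered > min_ways:
        return powered - 1          # working set fits: shed a way, save leakage
    if mru_fraction < 0.80 and powered < max_ways:
        return powered + 1          # tail ways are earning hits: grow again
    return powered

ways = 8
for mru, total in [(990, 1000), (985, 1000), (700, 1000)]:
    ways = adapt_ways(ways, mru, total)
    print(ways)   # 7, then 6, then back to 7 as the working set grows
```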

17.
Nonuniform cache access designs solve the on-chip wire delay problem for future large integrated caches. By embedding a network in the cache, NUCA designs let data migrate within the cache, clustering the working set nearest the processor. The authors propose several designs that treat the cache as a network of banks and facilitate nonuniform accesses to different physical regions. NUCA architectures offer low-latency access, increased scalability, and greater performance stability than conventional uniform access cache architectures.
