期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

吴俊杰潘晓辉《计算机工程与科学》2009,31(11)

非一致Cache体系结构(NUCA)几乎已经成为未来片上大容量cache的发展方向。多核处理器的NUCA结构中,多个处理器核对共享数据的竞争访问,可能导致数据经常处于中部的cache Bank,增加NUCA的访问延迟。本文提出支持数据副本的Bank一致性技术,通过有选择地在NUCA中为访问的处理器核创建不同的数据副本,Bank一致性技术能够缓解多核处理器对共享数据的竞争问题。本文详细地介绍了Bank一致性协议的设计方法。最后,使用全系统模拟器对8个NPB基准测试程序进行了详细评测。实验结果表明,Bank一致性技术能够有效缓解多核处理器中共享数据的竞争访问问题。相比不支持Bank一致性技术的CMP-DNUCA结构,本文的方法能将系统IPC性能平均提升5.95%。相似文献

2.

非一致Cache体系结构技术综述

吴俊杰杨学军《计算机工程与科学》2011,33(2):51

存储墙问题使得Cache技术的研究始终非常重要。面对日益增长的片上Cache容量,线延迟逐渐成为制约Cache设计的重要因素。为了提供统一的访问延迟,传统的Cache设计方法不得不迁就离处理器最远的Cache Bank的访问时间。为此,研究人员提出了一种非一致Cache结构(NUCA),NUCA几乎成为未来处理器中大容量Cache设计的一种趋势。处理器访问NUCA时,如果在离处理器较近的Bank中发生命中,处理器的等待时间就较短;如果在离处理器较远的Bank中发生命中,处理器的等待时间就较长。本文综述了NUCA技术产生的原因、发展,以及当前最典型的NUCA系统;并且指出了对NUCA技术研究有借鉴的两种多机存储系统技术——NUMA和COMA;最后,提出了NUCA技术研究的关键问题,并给出了相应的解决思路。相似文献

3.

多核处理器非一致Cache体系结构延迟优化技术研究综述 总被引：1，自引：0，他引：1

黄安文高军张民选《计算机研究与发展》2012,(Z1):118-124

非一致Cache体系结构(non-uniform cache architecture,NUCA)为解决多核处理器(chip multi-processor)"存储墙"难题提供了新的设计思路.重点关注面向CMP的NUCA延迟优化技术,在介绍若干典型NUCA模型的基础上,分析大容量Cache环境下共享/私有机制中的延迟-容量权衡问题,讨论映射、迁移、复制和搜索等数据管理机制在多核环境下的优缺点.最后,针对基于片上网络(network-on-chip,NoC)互连结构的可扩展CMP体系结构,从NUCA模型优化、数据管理和一致性维护机制3个方面讨论和预测未来CMP NUCA延迟优化领域的发展趋势及面临的挑战性问题. 相似文献

4.

基于分支路径跟踪的猜测访存数据Cache污染控制技术

刘松鹤宋焕生亓淑敏李文敏《计算机应用研究》2013,30(7):2064-2067

“存储墙”问题是高性能处理器设计必须跨越的障碍之一, 高效、智能的Cache系统是处理器存储体系的关键因素。具有分支预测能力的处理器在猜测执行分支路径上访存指令时取回的存储器数据所导致的Cache污染会显著影响Cache和处理器性能。分析了猜测执行和Cache数据污染对处理器性能的影响, 在此基础上结合分支预测机制的特征提出了一种基于分支路径跟踪的Cache污染控制技术——Contra, 通过构建分支路径跟踪表对猜测路径写入Cache的数据进行跟踪, 并对这些数据的存储、访问和替换过程进行控制, 有效地避免了污染数据对Cache效率的影响, 提升了处理器存储系统的性能。仿真结果表明, Contra技术相对于baseline结构来说, L1 D-Cache命中率提升幅度为0. 03%～6. 69%, 平均提升为1. 80%; IPC的提升幅度为0. 01%～6. 60%, 平均提升为2. 56%。相似文献

5.

Pentium4处理器的内存层次分析

吴金齐欢《计算机技术与发展》2004,14(7)

处理器存储系统的效率对其整体性能有着十分重要的作用.文中介绍了P4处理器内存的体系结构,它包括一级数据Cache、二级Cache、Trace Cache;各部分完成的功能以及为提高命中率和降低存取时间,从而提高效率而采取的预取处理机制;P4处理器主要采取具有层次结构的内存设计、大容量的二级Cache和在跟踪Cache中采用预取处理机制的方法来提高Cache的命中率和降低未命中的代价来缩短处理器的访问时间,最终达到提高处理器整体性能的目的. 相似文献

6.

密码嵌入式处理器中高速缓存的研究与设计

王晓燕杨先文陈海民《计算机工程与设计》2012,33(8):3000-3005

为了提高密码嵌入式处理器的运行效率,给出了一种哈佛结构的高速缓存(Cache)设计,包括指令Cache(iCache)和数据Cache(dCache)。采用双端口RAM和较低的硬件开销设计了标签存储器和指令/数据存储器,并描述了iCache和dCache控制流程。实现时配置iCache容量为4KB、dCache容量为8KB,并完成了向密码嵌入式处理器的集成。FPGA验证结果表明其满足处理器的应用要求;性能分析结果表明,采用Cache比处理器直接访问主存在速度上至少提高5.26倍。相似文献

7.

多核处理机系统Cache管理技术研究现状 总被引：1，自引：0，他引：1

下载免费PDF全文

所光杨学军《计算机工程与科学》2010,32(7):65-68

多核处理器的Cache结构设计和管理是微处理器设计领域的重要问题。当前主流的商用微处理器均采用共享最后一级Cache的系统结构,而片上最后一级Cache的性能通常对处理器的性能影响较大,因此共享Cache的管理问题成为当前研究热点。本文首先介绍当前主流多核处理器及其设计问题,然后介绍了共享Cache管理的三项重要技术:线程调度、NUCA和Cache划分,最后给出多核处理器Cache管理技术的发展方向。相似文献

8.

DSCF:一种面向共享存储多核DSP的数据流分簇前向技术

汪东陈书明《计算机研究与发展》2008,45(8)

多核数字信号处理器(DSP)的性能常常受限于共享存储的长延迟Cache一致性访问.数据前向(forwarding)技术是隐藏长延迟访问的一种有效手段.根据多核DSP应用的两类重要特征,提出了一种面向共享存储多核DSP结构的数据流分簇前向技术DSCF(data stream clustered forwarding).DSCF方法的主要特点是:兼容基本的共享存储Cache一致性协议;不污染目标Cache;数据的传输速度能够与消费速度相匹配;系统结构的可扩展性好.典型测试程序的模拟评测表明,采用DSCF方法能够将Cache一致性失效率平均降低44%,将系统总体性能提升30%～70%. 相似文献

9.

面向Cache优化的向量指令集设计与测评

曾坤《计算机工程与科学》2009,31(Z1)

为微处理器扩展向量指令集是提升现代微处理器性能的一种可行手段,然而传统向量指令对存储系统的访问表现出较差的局部性,因此难以与现代微处理器设计中广泛使用的Cache很好的结合。本文以优化Cache性能为目标,对传统向量指令集进行改造,提出了COV(Cache Optimized Vector Instruction Set)向量指令集,并以OpenRISC1200为平台,对该指令集进行了实现与测评,获得了约四倍的性能加速比。相似文献

10.

同时多线程处理器上的Cache性能分析与优化

隋秀峰吴俊敏陈国良《小型微型计算机系统》2009,30(1)

同时多线程(SMT)是一种延迟容忍的体系结构,它在每个周期内可以执行多个线程的多条指令.在SMT处理器上,对于片上共享存储这个复杂的结构资源,至今还没有很好的共享和冲突解决方案.本文着重研究了在多个并发执行的线程间划分共享Cache所存在的问题,指出基于LRU策略的传统Cache会根据需要隐式地划分共享Cache,这在某些情况下会导致全局性能的下降.针对这一问题并且考虑到SMT处理器上对Cache访问带宽的需求,本文提出采用一种多模块多体的Cache结构设计方案.并且在一个修改过的SMT模拟器上对该设计方案进行了性能评价.实验结果显示,相比于基于LRU策略的传统Cache,这一结构可以将一个4路SMT处理器的IPC提高9%. 相似文献

11.

Supporting faulty banks in NUCA by NoC assisted remapping mechanisms

Kuei-Chung Chang Chen-Yu Chen Chin-Sheng Yu Ching-Wen Chen 《The Journal of supercomputing》2014,67(2):305-323

The many-core SoC is a future trend technology, and the process yield will face many unpredictable challenges. Nonuniform cache architecture (NUCA) can improve the performance of many-core SoC for embedded systems. It embeds a NoC into the cache memory to enhance the data access by distributing traffic loads to several banks in parallel. Providing fault-tolerant mechanism in NUCA is very important because the chip can still work efficiently when some memory banks are unusable. In this paper, we design a specific router working with static and dynamic cache remapping policies to support faulty banks in NUCA. When a L2 cache bank in NUCA is unusable, static remapping policy (SRP) selects a suitable neighbor cache bank according to the collected remapping cost to assist with the cache access by considering cache status and traffic status of the router. We also propose a dynamic remapping policy (DRP) to select the suitable cache bank dynamically at runtime to fit the real loading status of neighbor nodes around the faulty bank. The experimental results show that the average improvement of the SRP is approximated to 26 %, and the average improvement of the DRP is approximated to 28 %. 相似文献

12.

Nonuniform cache architectures for wire-delay dominated on-chip caches

Changkyu Kim Burger D. Keckler S.W. 《Micro, IEEE》2003,23(6):99-107

Nonuniform cache access designs solve the on-chip wire delay problem for future large integrated caches. By embedding a network in the cache, NUCA designs let data migrate within the cache, clustering the working set nearest the processor. The authors propose several designs that treat the cache as a network of banks and facilitate nonuniform accesses to different physical regions. NUCA architectures offer low-latency access, increased scalability, and greater performance stability than conventional uniform access cache architectures. 相似文献

13.

Replacement techniques for dynamic NUCA cache designs on CMPs

Javier Lira Carlos Molina Ryan N. Rakvic Antonio González 《The Journal of supercomputing》2013,64(2):548-579

The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessors (CMP) architectures to reduce requests to the offchip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has eliminated the effectiveness of replacement policies because banks operate independently of each other, and hence their replacement decisions are restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches. 相似文献

14.

A NUCA Substrate for Flexible CMP Cache Sharing 总被引：1，自引：0，他引：1

Jaehyuk Huh J. Changkyu Kim C. Shafi H. Lixin Zhang L. Burger D. Keckler S.W. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1028-1040

We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses. 相似文献

15.

重用感知的非一致缓存迁移策略研究

汪玲黄炎袁光辉《计算机工程》2014,(2):81-85

随着工艺的持续进步,多核处理器集成了越来越多的核以及片上缓存系统,因此利用非一致缓存架构(NUCA)应对片上多核处理器的缓存系统中逐渐增大的线延迟。高效的缓存块迁移策略对整个缓存系统至关重要。当前动态非一致缓存架构(D-NUCA)中的缓存块迁移策略未考虑缓存块的历史访问信息,导致缓存块在不同的bank之间抖动从而增加缓存块的访问延迟。为此,提出一种重用感知的缓存块迁移(RABM)策略,采用缓存块的历史迁移信息来预测将来的缓存块迁移,从而提升D-NUCA的性能以及降低整个缓存系统的功耗。基于PARSEC基准测试程序的全系统仿真结果显示,与D-NUCA相比,基于RABM的D-NUCA可以使每时钟周期指令数平均提高9.6%,片上缓存系统功耗降低14%。相似文献

16.

Improving performance of multi-core NUCA coherent systems using NoC-assisted mechanisms

Kuei-Chung Chang Ing-Ming Liao Chiu-Han Liao 《The Journal of supercomputing》2012,62(3):1318-1337

The significant speed-gap between processor and memory makes last-level cache performance crucial for multi-core architectures (MCA). Non-uniform cache architecture (NUCA) has been proposed to overcome the performance limitations of MCA for many embedded applications. The cache is partitioned into sub-banks, with each sub-bank being an independently accessible entity connected with a fast on-chip network (NoC). This paper presents two NoC-assisted mechanisms to improve the performance and power consumption of NUCA coherence. The first mechanism provides priority-based communication based on the wormhole routing architecture to support NUCA coherence. High-priority coherent packets are transmitted first to save time. The second mechanism offers multicasting communication based on the proposed priority-based NoC to provide efficient cache coherency for NUCA. We dispatch and collect coherence packets at the collecting nodes (CN) to further decrease the number of coherent messages flowing in the NoC. Experimental results show that the priority-based transmission can improve performance by approximately 10?%. The proposed multicasting mechanism can further improve performance and decrease power consumption of the NoC in NUCA by approximately 15?%. The two proposed mechanisms can together enhance the performance by 25?% averagely. 相似文献

17.

片上非一致Cache体系结构研究

下载免费PDF全文

贾小敏黄彩霞张民选孙彩霞齐树波《计算机工程与科学》2009,31(8)

随着集成电路制造工艺的发展,片上集成大容量Cache成为微处理器的发展趋势。然而,互连线延迟所占比例越来越大,成为大容量Cache的性能瓶颈,因此需要新的Cache体系结构来克服这些问题。非一致Cache体系结构通过在Cache内部支持多级延迟和数据块迁移来减少Cache的命中时间,提高性能,从而克服互连线延迟对大容量Cache的限制,已经成为微处理器片上存储结构的研究热点。本文回顾了非一致Cache体系结构模型的研究进展,特别是对片上多核处理器中的非一致Cache体系结构模型进行了详细介绍,比较了不同模型的贡献和不足。最后,对非一致Cache体系结构的发展进行了展望。相似文献