1.

张昆刘骁郑方谢向辉《计算机工程与科学》2017,39(5):834-840

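Many-core processor design is tightly constrained by chip area, and how to devote that limited area to computing capability is an active topic in many-core architecture research. Focusing on the instruction cache structure of many-core processors, this paper studies sharing the level-1 instruction cache among multiple cores to improve the performance of the instruction system and the processor pipeline. It presents the structural design of the shared instruction cache, evaluates the structure with cycle-accurate performance simulation, and obtains area overhead and timing figures by synthesizing RTL code. Test results show that the shared instruction cache reduces the cache miss rate by 11% to 27% and improves pipeline performance by 4% to 7%.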

2.

A Survey of Level-Zero Instruction Cache Research

张昆郝子宇郑方谢向辉《计算机工程与科学》2017,39(3):405-412

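High efficiency is an important goal of processor design. Because instruction-side components occupy a growing share of chip area and consume a considerable share of chip power, researchers have proposed the level-zero instruction cache. A level-zero instruction cache is small, costs little energy per access, and is tightly coupled to the pipeline; on a fetch hit, parts of the pipeline logic can be gated. It can therefore effectively improve the energy efficiency of the pipeline's instruction-side components. This paper surveys the different existing level-zero instruction cache structures and their development and application, and outlines future research directions for level-zero instruction cache design.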

3.

An Energy-Efficient, Structurally Asymmetric Instruction Cache

刘骁高红光陈芳园丁亚军《计算机工程与科学》2017,39(3):443-450

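In modern microprocessors, reading and comparing tags accounts for a large share of instruction cache energy. This paper proposes a low-energy instruction cache based on inference: the asymmetric instruction cache. Exploiting the low proportion of jump instructions, the structure handles jump instructions and sequential instructions differently and uses simplified tag management bits that do not correspond one-to-one with the data. It adopts two new techniques, hit inference and variable-length instruction fetch: hit inference removes the excess tag comparisons on instruction cache hits, and variable-length instruction fetch raises the hit rate of sequential instruction blocks. Experimental results show that, for the selected SPEC2006 benchmarks, the asymmetric instruction cache lowers fetch energy by 40% to 60% relative to a conventional L1 instruction cache and by 9% relative to the tagless instruction cache structure TH IC. In fetch ED2P, it improves on the conventional L1 instruction cache by about 50% and on TH IC by about 17%.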

4.

Research on Software Instruction Cache Techniques for Chip Multiprocessors

过锋李宏亮谢向辉黄永勤《计算机工程与科学》2009,31(Z1)

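Advances in semiconductor technology allow more processing cores to be integrated on a chip. For storage units, which consume considerable area and power, reducing area and power effectively is an important direction in chip multiprocessor research. Software instruction caching is an effective way to reduce both the complexity and the power of instruction storage. This paper compares hardware cache structures with software instruction cache structures in depth, analyzes two typical software instruction cache structures in detail, and summarizes their characteristics and the key problems that remain, providing a reference for instruction storage design in chip multiprocessors.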

5.

Design of a Pipeline-Tightly-Coupled Instruction Loop Cache for Many-Core Processors

张昆过锋郑方谢向辉《计算机研究与发展》2017,54(4):813-820

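Energy efficiency is an important problem that future high-performance computers must solve. Many-core processors are an important means of building such computers, and optimizing their microarchitecture is especially critical to improving energy efficiency. This paper proposes a pipeline-tightly-coupled instruction loop cache for many-core processors, which uses a small L0 instruction cache to provide more energy-efficient instruction fetch. As an attempt to tie architecture research closely to hardware feasibility, the design treats hardware implementation cost as a key constraint throughout. To limit the effect of the L0 instruction cache on pipeline performance, the cache uses a loop-exit prefetch technique, which ensures that the low-power instruction fetch it provides translates into better pipeline energy efficiency. The instruction loop cache is simulated on the gem5 simulator. Results on SPEC2006 show that, without degrading pipeline performance, a typical configuration of the design reduces instruction fetch power by 27% and the dynamic power of the pipeline front-end components by 31.5%.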

6.

A Multi-Level Instruction Cache Structure for Array Many-Core Processors

陈逸飞李宏亮刘骁高红光《计算机工程与科学》2018,40(4):571-579

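Array many-core processors are widely used in high-performance computing because of their high computing performance and energy efficiency. Processors for future high-performance computing systems, however, must overcome the severe "memory wall" challenge and the problem of core coordination. In typical array processors, cores are mostly single-threaded to reduce overhead, which places high demands on memory access. This work introduces hardware simultaneous multithreading into the individual cores of an array many-core processor. To address the observation in experiments that the level-1 instruction cache hit rate drops markedly as the thread count grows, it proposes a redundant instruction cache storage structure for array many-core processors and, on top of that structure, the use of FIFO and LRU-like replacement policies. In simulation experiments with the optimized cache design, the overall instruction cache miss rate with two threads fell by 25.2%, and overall CPI performance improved by 30.2%.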
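The abstract names FIFO and LRU-like replacement but does not describe the redundant cache structure itself, so the following is only a minimal sketch of how the two textbook policies differ on a single cache set. The function name `count_misses`, the trace, and the two-way set size are illustrative assumptions, not details from the paper.

```python
from collections import OrderedDict


def count_misses(trace, ways, policy):
    """Count misses for one cache set holding `ways` lines.

    `policy` is "fifo" or "lru" (hypothetical labels for this sketch).
    """
    lines = OrderedDict()  # oldest entry first
    misses = 0
    for tag in trace:
        if tag in lines:
            if policy == "lru":
                # Under LRU a hit refreshes recency; under FIFO it does not.
                lines.move_to_end(tag)
            continue
        misses += 1
        if len(lines) >= ways:
            # Evict the front entry: oldest insertion (FIFO)
            # or least recently used (LRU).
            lines.popitem(last=False)
        lines[tag] = True
    return misses


# Illustrative access trace of line tags for a two-way set.
trace = ["A", "B", "A", "C", "A", "B"]
fifo_misses = count_misses(trace, ways=2, policy="fifo")
lru_misses = count_misses(trace, ways=2, policy="lru")
```

On this trace FIFO evicts the frequently reused line `A` once it becomes the oldest insertion, while LRU keeps it, which is the kind of difference a replacement-policy study measures.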

7.

Optimized Design of a Cache Model for the WWW (cited 3 times in total; self-citations: 0; citations by others: 3)

王东《计算机工程与设计》1998,19(2):61-64,F003

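The WWW provides a convenient means of accessing remote information resources. For Web users, an important measure of Web service quality is the time spent retrieving information. There are many ways to shorten retrieval time; this paper describes how a caching mechanism reduces the number of user requests for resources and thereby shortens the retrieval time the user perceives. It discusses a cache model based on a C/S (client/server) structure and proposes a new cache replacement algorithm that takes parameters such as document length and network load into account, optimizing the cache model. Experiments show that the algorithm outperforms current cache replacement algorithms.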

8.

Research and Design of XOR Hash Indexing for Shared Instruction Caches

刘骁唐勇郑方丁亚军《计算机学报》2019,42(11)

9.

Buffer Structure of an Ethernet Switch Control Chip (cited 1 time in total; self-citations: 0; citations by others: 1)

刘宇王玉艳《计算机工程》2010,36(10):248-250

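To implement switching control, an Ethernet switch control chip needs a suitable data buffer structure. This paper adopts paged management of the packet buffer space, a scheduling method for free buffer space, and egress port queue management. Through a descriptor design for the packet buffer space and an analysis of the corresponding destination port structure, it improves the utilization of the chip's buffer space and enhances chip performance.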

10.

ICN Cooperative Caching Based on Node Heat and Cache Replacement Rate

《计算机工程》2018,(2)

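The default LCE caching policy of information-centric networking (ICN) caches content at every node on the return path of a data packet. This produces many redundant copies and fails to make full use of cache resources. To address the problem, this paper proposes a caching policy based on node heat and cache replacement rate. It selects particular nodes on the return path to cache content, takes into account how network traffic differs across regions and time periods, periodically computes node heat and cache replacement rate, and uses them as the metrics for deciding whether content is cached at a node. Experimental results show that, compared with the LCE and CLFM policies, the proposed policy effectively reduces the average request hop count and the source-side hit rate, and achieves a higher caching gain.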

11.

Memory Access Optimization on the Loongson 2F

苏波李凯徐志广何颂颂《计算机系统应用》2010,19(1):171-175

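In typical data processing programs, computation time often plays only a secondary role, so the efficiency of the memory access pattern strongly affects program performance. ATLAS was installed on KD-50-I, a high-performance computer system built on the Loongson 2F processor, and in testing its performance reached only 30% of the Loongson 2F's theoretical peak. Several memory access optimizations were then applied: loop unrolling to reduce the number of memory accesses in functions and raise the ratio of computation to memory access; data blocking and partial copying to strengthen access locality and reduce cache misses; and a non-blocking cache to speed up memory access. Together they improved ATLAS performance by more than 50%.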
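Data blocking, one of the optimizations listed above, can be illustrated with a tiled matrix multiplication. This is a generic sketch in Python rather than the ATLAS kernels tuned for the Loongson 2F; the function name `matmul_blocked` and the `block` parameter are assumptions made for illustration, and the locality benefit only shows up in compiled code where a tile fits in cache.

```python
def matmul_blocked(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) one tile at a time.

    Each (ii, kk, jj) step touches only block x block tiles of a, b and c,
    so in a compiled implementation the working set can stay cache-resident.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Work inside one tile; min() handles a ragged last tile.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]  # reused across the whole j loop
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul_blocked(a, b, n=2, block=1)
```

The tile size is normally chosen so that the tiles in use fit together in the target cache level; libraries such as ATLAS search for that size empirically.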

12.

A Code Layout Method for Periodic Instruction Cache Prefetching

扈啸陈书明《计算机研究与发展》2009,46(5)

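In processors with caches, code layout and instruction prefetching are common techniques for reducing instruction fetch latency. Code layout concerns the relative spatial position of executed code, while instruction prefetching concerns the relative timing of code execution. On-chip trace technology obtains a program's execution path and timing information non-intrusively, linking the spatial and temporal relationships of code execution, and so provides a basis for combining layout and prefetching. On the YHFT-DSP platform, this work sets up prefetches using the periodic behavior of the running program, executes prefetch instructions on idle units of the VLIW processor, and proposes a function-level code layout method whose goal is to increase prefetch tolerance. Experimental results show that the method prefetches effectively and reduces instruction cache misses.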

13.

Multi-Level Memory Hierarchy Optimization Based on Empirical Search

陆平静车永刚王正华《计算机工程与应用》2006,42(34):67-69

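The memory wall is a major obstacle to single-machine performance optimization, and easing it depends on memory optimization of programs. This paper proposes a multi-level memory hierarchy optimization method based on empirical search. It recasts the optimization of the multi-level memory hierarchy as an empirical search over optimization parameters and uses a genetic algorithm to select the globally optimal solution. Experiments show that the technique adapts to different applications, greatly reduces memory access time, and lessens the effect of memory on program performance, thereby effectively easing the memory wall problem.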

14.

Execution History Guided Instruction Prefetching

Zhang Yi Haga Steve Barua Rajeev 《The Journal of supercomputing》2004,27(2):129-147

The increasing gap in performance between processors and main memory has made effective instruction prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to the I-cache. A recent study by Rivers et al. [19] shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First, we present a method that does not require an extra port to the I-cache. Second, the performance improvement for our method is greater than the best competing method BHGP [23] even disregarding the improvement from not having an extra port. The three key features of our method that prevent the above deficiencies are as follows. First, late prefetching is prevented by correlating misses to dynamically preceding instructions. For example, if the I-cache miss latency is 12 cycles, then the instruction that was fetched 12 cycles prior to the miss is used as the prefetch trigger. Second, the miss history table is kept to a reasonable size by grouping contiguous cache misses together and associating them with one preceding instruction, and therefore, one table entry. Third, the extra I-cache port is avoided through efficient prefetch filtering methods. Experiments show that for our benchmarks, chosen for their poor I-cache performance, an average improvement of 9.2% in runtime is achieved versus the BHGP methods [23], while the hardware cost is also reduced. The improvement will be greater if the runtime impact of avoiding an extra port is considered. When compared to the original machine without prefetching, our method improves performance by about 35% for our benchmarks.
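The first feature in the abstract, using the instruction fetched one miss latency before a miss as the prefetch trigger, can be sketched as follows. This is a simplified illustration, not the paper's hardware: it assumes one instruction fetch per cycle, and it omits the grouping of contiguous misses and the prefetch filtering. The class name `MissHistory` and its methods are hypothetical.

```python
from collections import deque


class MissHistory:
    """Map trigger addresses to the miss addresses they should prefetch."""

    def __init__(self, miss_latency):
        # Sliding window of the last `miss_latency` fetched addresses.
        self.recent = deque(maxlen=miss_latency)
        self.table = {}  # trigger address -> list of addresses to prefetch

    def fetch(self, addr, is_miss):
        """Record one fetch; on a miss, learn a trigger for it."""
        if is_miss and len(self.recent) == self.recent.maxlen:
            # Oldest window entry was fetched `miss_latency` fetches ago.
            trigger = self.recent[0]
            targets = self.table.setdefault(trigger, [])
            if addr not in targets:
                targets.append(addr)
        self.recent.append(addr)

    def prefetches_for(self, addr):
        """Addresses to prefetch when `addr` is fetched again."""
        return self.table.get(addr, [])


# Illustrative run with a 3-cycle miss latency: the miss at address 3
# is associated with address 0, which was fetched 3 cycles earlier.
history = MissHistory(miss_latency=3)
for address in (0, 1, 2):
    history.fetch(address, is_miss=False)
history.fetch(3, is_miss=True)
learned = history.prefetches_for(0)
```

Issuing the prefetch when the trigger is next fetched gives the missing line a full miss latency to arrive, which is how the scheme avoids late prefetches.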

15.

The Value of a Small Microkernel for Dreamy Memory and the RAMpage Memory Hierarchy

Philip Machanick 《计算机科学技术学报》2005,20(5):586-595

This paper explores the potential for the RAMpage memory hierarchy to use a microkernel with a small memory footprint, in a specialized cache-speed static RAM (tightly-coupled memory, TCM). Dreamy memory is DRAM kept in low-power mode, unless referenced. Simulations show that a small microkernel suits RAMpage well, in that it achieves significantly better speed and energy gains than a standard hierarchy from adding TCM. RAMpage, in its best 128KB L2 case, gained 11% speed using TCM, and reduced energy 14%. Equivalent conventional hierarchy gains were under 1%. While 1MB L2 was significantly faster against lower-energy cases for the smaller L2, the larger SRAM's energy does not justify the speed gain. Using a 128KB L2 cache in a conventional architecture resulted in a best-case overall run time of 2.58s, compared with the best dreamy mode run time (RAMpage without context switches on misses) of 3.34s, a speed penalty of 29%. Energy in the fastest 128KB L2 case was 2.18J vs. 1.50J, a reduction of 31%. The same RAMpage configuration without dreamy mode took 2.83s as simulated, and used 2.393, an acceptable trade-off (penalty under 10%) for being able to switch easily to a lower-energy mode.
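The percentages quoted in the abstract follow directly from the run times and energies it reports. The short check below recomputes them; the variable names are illustrative, and it assumes that the 2.18J figure belongs to the conventional configuration and the 1.50J figure to dreamy mode, which is how the sentence reads.

```python
# Run times in seconds, as quoted in the abstract.
conventional_time_s = 2.58
dreamy_time_s = 3.34
no_dreamy_time_s = 2.83

# Energies in joules (assumed attribution: conventional vs. dreamy mode).
conventional_energy_j = 2.18
dreamy_energy_j = 1.50

# Slowdown of dreamy mode relative to the conventional best case.
speed_penalty = dreamy_time_s / conventional_time_s - 1

# Energy saved by dreamy mode relative to the conventional case.
energy_reduction = 1 - dreamy_energy_j / conventional_energy_j

# Slowdown of RAMpage without dreamy mode relative to conventional.
no_dreamy_penalty = no_dreamy_time_s / conventional_time_s - 1

print(f"speed penalty:     {speed_penalty:.1%}")
print(f"energy reduction:  {energy_reduction:.1%}")
print(f"no-dreamy penalty: {no_dreamy_penalty:.1%}")
```

The three values round to 29%, 31%, and just under 10%, matching the figures in the abstract.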

16.

Research on the Memory Hierarchy of the X Processor

付桂涛高军邢座程《计算机与现代化》2007,(12):22-24

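As computer applications continue to expand, streaming media applications and scientific computing are becoming an important workload for microprocessors. Streaming media applications are characterized by abundant data parallelism, little data reuse, and a large amount of computation per memory access. Because of bandwidth limits, traditional microprocessor architectures struggle to serve these characteristics. The X processor is a stream processor. To match the characteristics of stream applications, it adopts a new three-level stream memory hierarchy consisting of local register files, a stream register file, and off-chip memory, which effectively solves the bandwidth problem. This paper tests the design on a simulation platform with two methods (RS coding and test programs), verifying that the stream memory hierarchy relieves the bandwidth bottleneck and confirming the correctness of the design.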

17.

A Study on Modeling and Optimization of Memory Systems

Jason Liu Pedro Espina Xian-He Sun 《计算机科学技术学报》2021,36(1):71-89

Accesses Per Cycle (APC), Concurrent Average Memory Access Time (C-AMAT), and Layered Performance Matching (LPM) are three memory performance models that consider both data locality and memory access concurrency. The APC model measures the throughput of a memory architecture and therefore reflects the quality of service (QoS) of a memory system. The C-AMAT model provides a recursive expression for the memory access delay and therefore can be used for identifying the potential bottlenecks in a memory hierarchy. The LPM method transforms a global memory system optimization into localized optimizations at each memory layer by matching the data access demands of the applications with the underlying memory system design. These three models have been proposed separately through prior efforts. This paper reexamines the three models under one coherent mathematical framework. More specifically, we present a new memory-centric view of data accesses. We divide the memory cycles at each memory layer into four distinct categories and use them to recursively define the memory access latency and concurrency along the memory hierarchy. This new perspective offers new insights with a clear formulation of the memory performance considering both locality and concurrency. Consequently, the performance model can be easily understood and applied in engineering practices. As such, the memory-centric approach helps establish a unified mathematical foundation for model-driven performance analysis and optimization of contemporary and future memory systems.
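The abstract does not state the formulas, so the sketch below uses the single-level forms as they are commonly given in the earlier C-AMAT literature: AMAT = H + MR × AMP, C-AMAT = H / C_H + pMR × pAMP / C_M, and APC = 1 / C-AMAT. Treat these definitions, the function names, and the sample numbers as assumptions; the paper's own memory-centric formulation may differ in detail.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Classic average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty


def c_amat(hit_time, hit_concurrency, pure_miss_rate,
           pure_miss_penalty, miss_concurrency):
    """Concurrent AMAT: hit and pure-miss terms scaled by their concurrency."""
    return (hit_time / hit_concurrency
            + pure_miss_rate * pure_miss_penalty / miss_concurrency)


def apc(c_amat_cycles):
    """Accesses per cycle, taken here as the reciprocal of C-AMAT."""
    return 1.0 / c_amat_cycles


# Illustrative numbers only: 2-cycle hits, 5% misses, 40-cycle penalty.
serial = amat(hit_time=2, miss_rate=0.05, miss_penalty=40)
overlapped = c_amat(hit_time=2, hit_concurrency=2, pure_miss_rate=0.05,
                    pure_miss_penalty=40, miss_concurrency=2)
throughput = apc(overlapped)
```

With both concurrency terms set to 1 and every miss counted as a pure miss, the C-AMAT expression reduces to AMAT, which is why it is usually described as a generalization of the classic model.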

18.

Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings (cited 1 time in total; self-citations: 0; citations by others: 1)

Mellor-Crummey John Whalley David Kennedy Ken 《International journal of parallel programming》2001,29(3):217-247

The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically under-utilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.

19.

A Performance Evaluation Model for the Memory Hierarchy of Multi-Core Processors

郭建军戴葵王志英《计算机研究与发展》2009,46(Z1)

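This paper presents a model for evaluating the memory hierarchy performance of multi-core processors. Built on queueing theory, the model is fast to solve and can show, early in the design process, how different configuration parameters affect overall processor performance, so that the memory hierarchy can be adjusted and the design optimized.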