Similar Documents
1.
Many-core SoCs are a future technology trend, and process yield will face many unpredictable challenges. Nonuniform cache architecture (NUCA) can improve the performance of many-core SoCs for embedded systems: it embeds a NoC in the cache memory to speed up data access by distributing traffic loads across several banks in parallel. Providing a fault-tolerance mechanism in NUCA is very important so that the chip can still work efficiently when some memory banks are unusable. In this paper, we design a dedicated router that works with static and dynamic cache remapping policies to support faulty banks in NUCA. When an L2 cache bank in NUCA is unusable, the static remapping policy (SRP) selects a suitable neighboring cache bank to take over the accesses, according to a remapping cost collected from the cache status and the traffic status of the router. We also propose a dynamic remapping policy (DRP) that selects the target cache bank at runtime to fit the actual load of the nodes around the faulty bank. The experimental results show an average improvement of approximately 26% for SRP and approximately 28% for DRP.
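To make the selection step concrete, here is a minimal Python sketch of SRP-style remapping under stated assumptions: a 4x4 mesh, a cost that linearly weights bank occupancy against router traffic, and the Bank/neighbors/srp_select names are all illustrative, not the paper's actual design. DRP would re-run the same selection at runtime as the load fields change.

```python
from dataclasses import dataclass

@dataclass
class Bank:
    x: int
    y: int
    faulty: bool = False
    occupancy: float = 0.0   # fraction of the bank in use (cache status)
    traffic: float = 0.0     # load at the bank's router (traffic status)

def neighbors(banks, fx, fy):
    """Usable mesh neighbors (N/S/E/W) of the faulty bank at (fx, fy)."""
    coords = {(b.x, b.y): b for b in banks}
    for dx, dy in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        b = coords.get((fx + dx, fy + dy))
        if b is not None and not b.faulty:
            yield b

def srp_select(banks, fx, fy, w_cache=0.5, w_traffic=0.5):
    """Pick the remap target with the lowest weighted remapping cost."""
    return min(neighbors(banks, fx, fy),
               key=lambda b: w_cache * b.occupancy + w_traffic * b.traffic)

banks = [Bank(x, y) for x in range(4) for y in range(4)]
banks[5].faulty = True       # the bank at (1, 1) is unusable
banks[6].occupancy = 0.9     # a heavily used neighbor should be avoided
t = srp_select(banks, 1, 1)
print("remap accesses of (1,1) to", (t.x, t.y))
```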

2.
A NUCA Substrate for Flexible CMP Cache Sharing
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
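As a rough illustration of what a "degree of sharing" means for bank mapping, the hypothetical bank_for function below partitions 64 banks among 16 processors so that each group of d processors shares its own pool; the modular line-interleaving is our assumption, not the paper's mapping policy.

```python
BANKS, CPUS = 64, 16

def bank_for(addr, cpu, degree):
    """Map a 64-byte line to a bank in the pool shared by `degree` CPUs."""
    banks_per_pool = BANKS * degree // CPUS       # 4 banks per CPU
    pool = (cpu // degree) * banks_per_pool       # pool of this CPU's group
    return pool + (addr // 64) % banks_per_pool   # interleave lines in pool

for d in (1, 2, 4, 16):   # private ... fully shared
    print(f"degree {d:2}: cpu3, addr 0x1040 -> bank {bank_for(0x1040, 3, d)}")
```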

3.
As process technology continues to advance, multicore processors integrate more and more cores and on-chip cache, so Non-Uniform Cache Architecture (NUCA) is used to cope with the growing wire delay in the cache systems of chip multiprocessors. An efficient cache block migration policy is crucial to the whole cache system. The block migration policy in the current Dynamic NUCA (D-NUCA) does not consider the historical access information of cache blocks, so blocks thrash between banks and their access latency increases. To address this, we propose a Reuse-Aware Block Migration (RABM) policy, which uses the historical migration information of cache blocks to predict future migrations, improving D-NUCA performance and reducing the power consumption of the whole cache system. Full-system simulation results based on the PARSEC benchmarks show that, compared with D-NUCA, RABM-based D-NUCA improves instructions per cycle by 9.6% on average and reduces on-chip cache power by 14%.
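The abstract does not spell out the predictor, so the sketch below shows one plausible reuse-gated migration filter: a small saturating counter per block that must saturate before the block is allowed to migrate, which damps the thrashing described above. The counter width and threshold are assumptions.

```python
class BlockState:
    """Per-block migration filter with a 2-bit saturating reuse counter."""
    def __init__(self):
        self.counter = 0

    def should_migrate(self, hit_in_remote_bank):
        if hit_in_remote_bank:
            self.counter = min(self.counter + 1, 3)
        else:
            self.counter = max(self.counter - 1, 0)
        return self.counter == 3   # migrate only after repeated reuse

s = BlockState()
for hit in (True, True, False, True, True):
    print("migrate?", s.should_migrate(hit))
```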

4.
Spin-Transfer Torque RAM (STT-RAM) has the advantages of high circuit density and negligible leakage power. However, it suffers from long write latency and high write energy. It is therefore difficult to replace the entire SRAM L1 cache with STT-RAM, but we can relax the retention time of the STT-RAM cell to improve its write performance and replace part of the SRAM capacity to reduce leakage power. In this paper, we propose a locality-aware approach to L1 cache design with a hybrid SRAM and volatile STT-RAM configuration. Based on the principle of cache locality, a data block is first mapped to SRAM to reduce write latency and write energy, and is later moved to volatile STT-RAM to reduce leakage power. After a period with no accesses to a data block in the volatile STT-RAM, we stop its refresh operations to further reduce power consumption. Experimental results show that, compared with an SRAM-only L1 cache, our hybrid cache configuration and data migration methodology reduce energy consumption by about 15–20%, with only about 5% latency overhead. Compared with an STT-RAM-only L1 cache, we reduce memory access latency by nearly 20% with similar or even better energy consumption.
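A toy model of the described placement flow, under our own assumptions about sizes and the idle window: fills go to SRAM first, LRU victims are demoted to volatile STT-RAM, and refresh is stopped for STT-RAM blocks that stay idle past a threshold.

```python
from collections import OrderedDict

SRAM_WAYS, STT_WAYS, IDLE_LIMIT = 2, 6, 100   # ticks before refresh stops

sram = OrderedDict()   # addr -> last access time, kept in LRU order
stt = {}               # addr -> (last_access, still_refreshing)

def access(addr, now):
    if addr in sram:                     # cheap SRAM hit
        sram.move_to_end(addr)
        sram[addr] = now
        return "sram hit"
    if addr in stt:                      # promote back to SRAM on reuse
        del stt[addr]
    sram[addr] = now                     # fills always land in SRAM first
    if len(sram) > SRAM_WAYS:            # demote the LRU block to STT-RAM
        victim, t = sram.popitem(last=False)
        if len(stt) < STT_WAYS:
            stt[victim] = (t, True)
    return "fill"

def stop_idle_refresh(now):              # power knob: let idle blocks decay
    for a, (t, refreshing) in list(stt.items()):
        if refreshing and now - t > IDLE_LIMIT:
            stt[a] = (t, False)

for i, a in enumerate([1, 2, 3, 1, 4]):
    print(a, access(a, i))
stop_idle_refresh(200)
print(stt)   # long-idle demoted blocks are no longer refreshed
```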

5.
Improving the memory energy consumption of programs that manipulate arrays is an important problem, as these codes spend large amounts of energy accessing off-chip memory. We propose a data-driven strategy to optimize memory energy consumption in a banked memory system. Our compiler-based strategy modifies the original execution order of loop iterations in array-dominated applications to increase the length of the periods in which memory banks are idle (i.e., not accessed by any loop iteration). To achieve this, it first classifies loop iterations according to their bank access patterns and then, with the help of a polyhedral tool, tries to bring iterations with similar bank access patterns close together. Increasing the idle periods of memory banks brings two major benefits: first, it allows us to place more memory banks into low-power operating modes; second, it enables us to use a more aggressive (i.e., more energy-saving) operating mode for a given bank instead of a less aggressive one. The proposed strategy can reduce memory energy consumption in both sequential and parallel applications. It has been implemented in an experimental compiler using a polyhedral tool and evaluated with nine array-dominated applications on both a cacheless system and a system with cache memory. Our experimental results indicate that the proposed strategy is very successful in reducing memory system energy, improving memory energy by as much as 36.8 percent over a strategy that uses low-power modes without optimizing the data access pattern. Our results also show that optimizations targeting off-chip memory energy can generate very different results from those targeting only cache locality.
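The classification step can be pictured with the short sketch below: each iteration is tagged with the set of banks it touches (the a[i] + a[N-1-i] access pattern and the bank-size constant are hypothetical), and iterations with identical bank signatures are scheduled together; the legal reordering in the paper is done by a polyhedral tool.

```python
from collections import defaultdict

N, BANK_SIZE = 1024, 256   # hypothetical array and bank geometry

def banks_touched(i):
    """Banks accessed by iteration i of a hypothetical a[i] + a[N-1-i]."""
    return frozenset({i // BANK_SIZE, (N - 1 - i) // BANK_SIZE})

groups = defaultdict(list)
for i in range(N):
    groups[banks_touched(i)].append(i)   # classify by bank signature

# Emit iterations group by group so each bank pair gets long idle gaps.
schedule = [i for sig in sorted(groups, key=sorted) for i in groups[sig]]
print("bank signatures:", [sorted(s) for s in groups])
print("first reordered iterations:", schedule[:4])
```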

6.
The growing influence of wire delay in cache design means that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to off-chip memory, given the significant speed gap between processor and memory. A bank replacement policy that efficiently manages the NUCA cache is therefore desirable. However, the decentralized nature of NUCA undermines the effectiveness of replacement policies, because banks operate independently of each other and their replacement decisions are restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.

7.
Non-Uniform Cache Architecture (NUCA) has almost become the design trend for future large-capacity on-chip caches. In a NUCA, data promotion reduces the processor's waiting time by moving frequently accessed data into cache banks closer to the processor, and thus has an important impact on NUCA performance. However, existing promotion techniques use a fixed promotion policy and do not consider the actual state of the target bank; they can easily push more useful data in the target bank away from the processor, polluting the cache and severely limiting the benefit of promotion. To address this problem, this paper proposes a smart multi-hop promotion technique that senses the state of candidate target banks and dynamically selects a suitable target bank for the promoted data, improving promotion efficiency and reducing cache pollution. Its design cleverly reuses the return path of processor accesses: it only slightly extends the format of processor access messages and adds no extra accesses to the cache banks. Detailed full-system simulation of 15 benchmarks from the NAS Parallel Benchmarks and the Livermore Benchmark shows that smart multi-hop promotion saves 1.50x as many clock cycles per promotion operation as existing promotion techniques (up to 2.61x), and improves system IPC by 6.24% on average and up to 19.03%.
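A minimal sketch of state-aware promotion in this spirit, with assumed names and an assumed "heat" metric: walk the banks on the return path toward the processor and remember the last one that has a free way or a resident block colder than the promoted block.

```python
def pick_target(path_banks, block_heat):
    """path_banks: banks from the block's position toward the CPU, each a
    dict with 'free_ways' and 'min_heat' (its coldest resident block)."""
    best = None
    for bank in path_banks:              # nearest-to-CPU bank comes last
        if bank["free_ways"] > 0 or bank["min_heat"] < block_heat:
            best = bank                  # keep moving closer if possible
    return best                          # None means: do not promote

path = [{"id": 2, "free_ways": 0, "min_heat": 9},
        {"id": 1, "free_ways": 0, "min_heat": 3},
        {"id": 0, "free_ways": 1, "min_heat": 7}]   # adjacent to the CPU
print(pick_target(path, block_heat=5))   # promotes all the way to bank 0
```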

8.
High processing speed is required to support computation-intensive applications. Cache memory improves processing speed by reducing the speed gap between the fast processing core and slow main memory. However, the problem with adopting caches in computing systems is twofold: caches are power hungry (which challenges energy constraints), and caches introduce execution time unpredictability (which challenges support for real-time multimedia applications). Recently published articles suggest that cache locking improves predictability; however, increased cache activity due to aggressive cache locking makes the system consume more energy and become less efficient. In this paper, we investigate the impact of cache parameters and cache locking on power consumption and performance for real-time multimedia applications running on low-power devices. We consider Intel Pentium-like single-processor and Xeon-like multicore architectures, both with a two-level cache memory hierarchy, using three popular multimedia applications: MPEG-4 (the global video coding standard), H.264/AVC (the network-friendly video coding standard), and the recently introduced H.265/HEVC (for improved video quality and data compression ratio). Experimental results show that a cache locking mechanism added to an optimized cache memory structure is very promising for increasing the performance/power ratio of low-power systems running multimedia applications. According to the simulation results, performance can be improved by decreasing the cache miss rate by up to 36%, and total power consumption can be reduced by up to 33%. We also observe that H.265/HEVC has a significant performance advantage over H.264/AVC (and MPEG-4) for smaller caches.

9.
As feature sizes shrink and frequencies increase, leakage energy will become the main source of energy consumption in future microprocessors, and the on-chip cache will be an important part of total processor energy. To reduce leakage energy, set-associative data caches adopt a banked structure, using bit-line isolation to isolate the bit lines of unaccessed cache banks and put them into a low-power state. This paper proposes a new data cache replacement policy, ELSS. The policy exploits the good spatial locality of data-cache addresses and additionally recognizes stride access patterns in the data address stream to guide block replacement. By placing blocks that follow sequential or stride patterns into the same bank whenever possible, the number of bank switches is reduced. Experiments show that the ELSS replacement policy further reduces the bank switch count of a bit-line-isolated data cache by 9% compared with LRU and saves an additional 8% of data cache energy, with a smaller performance impact than LRU.
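The stride recognition can be as simple as the classic two-delta detector sketched below (ELSS's exact tables, and how the detected stride steers bank placement, may differ): a stable stride lets the controller keep the stream's blocks in one active bank.

```python
class StrideDetector:
    """Report a stride once the same address delta is seen twice."""
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Return the detected stride, or None if no stable pattern."""
        stride = None
        if self.last_addr is not None:
            cur = addr - self.last_addr
            if cur == self.last_stride:   # same delta twice -> stride
                stride = cur
            self.last_stride = cur
        self.last_addr = addr
        return stride

d = StrideDetector()
for a in (0x100, 0x140, 0x180, 0x1C0):
    print(hex(a), "stride:", d.observe(a))
```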

10.
A Low-Power-Oriented Shared Cache Partitioning Scheme for Multicore Processors
With the development of multicore processors, the capacity of on-chip caches has grown, and their share of total chip power keeps increasing. Reducing cache power has become a hot topic in cache design. This paper studies a low-power-oriented shared cache partitioning technique (LP-CP) for multicore processors. We propose a cache partitioning framework that adds miss-rate monitors to the processor to collect program miss rates dynamically, and then uses a low-power-oriented shared cache partitioning algorithm to compute partitioning strategies that stay within a performance-degradation threshold. In a dual-core system with a shared L2 cache, we evaluated the low-power-oriented partitioning with multiprogrammed workloads: with performance-loss thresholds of 1% and 3%, the cache shutdown ratio of the system reaches 20.8% and 36.9%, respectively.
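A sketch of the threshold-bounded allocation idea, with invented miss-rate curves: each core receives the fewest ways whose miss rate is within the performance-loss threshold of its best case, and the remaining ways can be power-gated. A real policy must also keep the sum of the allocations within the total way count.

```python
def partition(miss_curves, total_ways, threshold):
    """Give each core the fewest ways within `threshold` of its best rate."""
    alloc, off = {}, total_ways
    for core, curve in miss_curves.items():   # curve[w] = miss rate, w ways
        best = curve[total_ways]              # miss rate with all the ways
        for w in range(1, total_ways + 1):
            if curve[w] - best <= threshold:  # acceptable performance loss
                alloc[core] = w
                off -= w
                break
    return alloc, off                         # `off` ways can be gated

curves = {
    0: {1: .120, 2: .095, 3: .094, 4: .093, 5: .093, 6: .093, 7: .093, 8: .093},
    1: {1: .300, 2: .100, 3: .065, 4: .062, 5: .061, 6: .060, 7: .060, 8: .060},
}
print(partition(curves, total_ways=8, threshold=0.01))  # ({0: 2, 1: 3}, 3)
```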

11.
Energy consumption is one of the most significant aspects of large-scale storage systems, where multilevel caches are widely used. In a typical hierarchical storage structure, upper-level storage serves as a cache for the lower level, forming a distributed multilevel cache system. In the past two decades, several classic LRU-based multilevel cache policies have been proposed to improve the overall I/O performance of storage systems. However, few power-aware multilevel cache policies focus on the storage devices at the bottom level, which consume more than 27% of the energy of the whole system [1]. To address this problem, we propose a novel power-aware multilevel cache (PAM) policy that can reduce the energy consumption of high-performance, high-bandwidth storage devices. In our PAM policy, an appropriate number of cold dirty blocks in the upper-level cache are identified and flushed directly to the storage devices, giving disks a high probability of an extended stay in standby mode. To demonstrate the effectiveness of the proposed policy, we conduct several simulations with real-world traces. Compared to existing popular cache schemes such as PALRU, PB-LRU, and Demote, PAM reduces power consumption by up to 15% under different I/O workloads and improves energy efficiency by up to 50.5%.
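The core decision can be sketched as below, with an assumed age threshold and block metadata layout: cold dirty blocks are cleaned only while the disk is already active, so later evictions do not force a standby disk to spin up.

```python
import time

AGE_LIMIT = 60.0          # seconds without access => the block is "cold"

def flush_cold_dirty(cache, disk_active, now=None):
    """cache: addr -> {'dirty': bool, 'last_access': float}."""
    now = time.monotonic() if now is None else now
    flushed = []
    if not disk_active:   # never wake a standby disk just to clean blocks
        return flushed
    for addr, meta in cache.items():
        if meta["dirty"] and now - meta["last_access"] > AGE_LIMIT:
            meta["dirty"] = False        # write-back would happen here
            flushed.append(addr)
    return flushed

cache = {0x10: {"dirty": True,  "last_access": 0.0},
         0x20: {"dirty": True,  "last_access": 95.0},
         0x30: {"dirty": False, "last_access": 0.0}}
print(flush_cold_dirty(cache, disk_active=True, now=100.0))  # [0x10] only
```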

12.
Embedded processors rely on the efficient use of instruction-level parallelism to answer the performance and energy needs of modern applications. Though improving performance is the primary goal for processors in general, it can have a negative impact on energy consumption, a particularly critical constraint for current systems. In this paper, we present SoMMA, a software-managed memory architecture for embedded multi-issue processors that can reduce energy consumption and the energy-delay product (EDP) while still increasing memory bandwidth. We combine software-managed memories (SMM) with the data cache, and leverage the lower energy access cost of SMMs to provide a processor with reduced energy consumption and EDP. SoMMA also provides better overall performance, as memory accesses can be performed in parallel at no cost in extra memory ports. Compiler-automated code transformations minimize the programmer's effort to benefit from the proposed architecture. The approach shows average speedups of 1.118x and 1.121x while consuming up to 11% and 12.8% less energy when comparing two modified ρVEX processors with their baselines in full-system comparisons. SoMMA also reduces full-system EDP by up to 41.5%, maintaining the same processor area as the baseline processors.

13.
An on-chip instruction cache is a potentially power-hungry component in embedded systems due to its large chip area and high access frequency. Aiming at reducing the power consumption of the on-chip cache, we propose a Reduced One-Bit Tag Instruction Cache (ROBTIC), in which the cache size is judiciously reduced and the cache tag field contains only the least significant bit of the full tag. We develop a cache operational control scheme for ROBTIC so that, with the one-bit cache tag, program locality can still be exploited efficiently. For applications where most memory accesses are localized, our cache achieves performance similar to a traditional full-tag cache; however, its power consumption is significantly reduced thanks to the much smaller cache size, narrower tag array (just one bit), and tinier tag comparison circuit. Experiments on a set of benchmarks implemented in a 180 nm CMOS process demonstrate that our design can reduce the dynamic power consumption of a traditional cache by up to 27.3% and its area by 30.9% when the cache size is fixed at 32 instructions, outperforming existing partial-tag based cache designs. With cache size customization, a further 47.8% power saving can be achieved. Our experimental results also show that when implemented in deep sub-micron technologies where leakage power is not negligible, the design remains efficient: a consistent power-saving trend (about 22%) is observed for technologies from 130 nm down to 65 nm.
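The paper's control scheme is not reproduced here, but the toy model below shows one way a single tag bit can stay correct for localized code: a controller tracks a window of two adjacent tag regions and invalidates the cache whenever a fetch leaves the window, so the stored bit only ever has to distinguish those two regions. The window policy and sizes are our assumptions.

```python
LINES = 32                                 # toy cache: 32 instructions

class OneBitTagICache:
    def __init__(self):
        self.valid = [False] * LINES
        self.tagbit = [0] * LINES
        self.base = None                   # full tag tracked by control

    def fetch(self, pc):
        tag, index = pc // LINES, pc % LINES
        if self.base is None or tag not in (self.base, self.base + 1):
            self.valid = [False] * LINES   # left the tracked window
            self.base = tag
        bit = tag - self.base              # one bit picks between regions
        hit = self.valid[index] and self.tagbit[index] == bit
        self.valid[index], self.tagbit[index] = True, bit
        return hit

c = OneBitTagICache()
print([c.fetch(pc) for pc in (0, 1, 0, 1, 40, 0, 200)])
```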

14.
As technology continues to shrink, reducing leakage is critical for energy efficiency. Previous work on low-power GPUs (graphics processing units) focuses on techniques for dynamic power reduction, such as DVFS (dynamic voltage/frequency scaling) and clock gating. In this paper, we explore the potential of architecture-level power gating techniques for leakage reduction on GPUs. In particular, we focus on the most power-hungry components, the shader processors. We observe that, due to varying scene complexity, the shader resources required to satisfy the target frame rate vary across frames. Therefore, we propose a predictive shader shutdown technique that exploits workload variation across frames to reduce leakage in the shader processors. The experimental results show that predictive shader shutdown achieves up to 46% leakage reduction on shader processors with negligible performance degradation.

15.
In present multi-core devices, the individual processors do not need to operate at the highest possible frequencies. Instead, there is a need to reduce the power, complexity, and area of individual processor components such as caches. In this paper we propose a low-area, high-performance cache replacement policy for embedded processors called Hierarchical Non-Most-Recently-Used (H-NMRU). H-NMRU is a parameterizable policy that lets us trade off performance against area. We extended the Dinero cache simulator with the H-NMRU policy and performed architectural exploration with a set of cellular and multimedia benchmarks. On a 16-way cache, a two-level H-NMRU policy whose first and second levels have 8 and 2 branches, respectively, performs as well as Pseudo-LRU with a storage area saving of 27%. Compared to true LRU, H-NMRU on a 16-way cache saves a large amount of area (82%) with a marginal increase in cache misses (3%). Similar results were observed on other cache-like structures such as branch target buffers. Therefore, the two-level H-NMRU cache replacement policy (with associativity/2 and 2 branches on the two levels) is a very attractive option for caches in embedded processors with associativities greater than 4. We present a case study where it is used on the L2 cache with substantial gains in performance and area for single- and dual-core platforms.
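The policy itself is simple enough to sketch directly from the description; only the random choice among non-MRU children is our assumption.

```python
import random

class HNMRUSet:
    """Two-level H-NMRU state for one 16-way set: 8 groups of 2 ways."""
    def __init__(self, l1_branches=8, l2_branches=2):
        self.l1, self.l2 = l1_branches, l2_branches
        self.mru_l1 = 0                      # most recently used group
        self.mru_l2 = [0] * l1_branches      # MRU way inside each group

    def touch(self, way):
        """Update MRU state on an access to `way` (0..15)."""
        g, w = divmod(way, self.l2)
        self.mru_l1, self.mru_l2[g] = g, w

    def victim(self):
        """Step to a not-MRU group, then a not-MRU way inside it."""
        g = random.choice([i for i in range(self.l1) if i != self.mru_l1])
        w = random.choice([i for i in range(self.l2) if i != self.mru_l2[g]])
        return g * self.l2 + w

s = HNMRUSet()
for way in (3, 7, 12):
    s.touch(way)
print("victim:", s.victim())   # never from group 6, which holds way 12
```

Per set this needs a 3-bit level-1 MRU pointer plus one MRU bit for each of the 8 groups, i.e. 11 bits versus the 15 bits of a binary Pseudo-LRU tree for 16 ways, which is consistent with the 27% storage saving quoted above (assuming a tree-PLRU baseline).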

16.
This paper introduces a cache structure based on a tag pre-comparison scheme. Pre-comparing the tag fields avoids accesses to irrelevant tag and data arrays, reducing access power. An inverted clock is introduced to optimize the access timing so that the average access latency stays below one cycle. Experiments show that, while the hit rate is preserved, the memory-access optimization behaves very consistently across the benchmark programs, and the power advantage grows with associativity. Compared with a prediction-based structure, it achieves an average power reduction of 28.5% at 8-way associativity.

17.
As feature sizes shrink, leakage power is gradually becoming one of the main constraints on microprocessor design. Sleep Cache and Drowsy Cache are two important techniques for reducing cache leakage power. SB_CLPE, a statistics-based cache leakage power estimation method, estimates the leakage power of a Sleep Cache or Drowsy Cache; a cache architecture designed around it can estimate cache leakage power in real time during program execution. By collecting statistics on the access intervals of all cache blocks, SB_CLPE can estimate the cache leakage power under different decay intervals and thereby obtain the optimal decay interval that minimizes leakage power. Experiments show that, for a Sleep Cache, the leakage power estimates of SB_CLPE deviate from results obtained through simulation with the HotLeakage simulator by only 3.16% on average, and the optimal decay intervals also agree well. A cache architecture using SB_CLPE can estimate the optimal decay interval in real time during program execution and dynamically adjust the decay interval to achieve the best power reduction.
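A sketch of the estimation itself, with illustrative leakage rates and wake-up cost: given the collected access-interval statistics, the leakage energy for each candidate decay interval D is computed analytically (full leakage for min(gap, D) of every gap, low-power leakage afterwards, plus a wake-up penalty), and the D with the lowest estimate is selected.

```python
FULL, DROWSY, WAKE = 1.0, 0.1, 5.0   # leak power (a.u.) and wake-up cost

def leakage(intervals, D):
    """Estimated leakage energy for decay interval D over observed gaps."""
    e = 0.0
    for gap in intervals:
        if gap <= D:
            e += gap * FULL                          # never decayed
        else:
            e += D * FULL + (gap - D) * DROWSY + WAKE
    return e

def best_decay_interval(intervals, candidates):
    return min(candidates, key=lambda D: leakage(intervals, D))

gaps = [5, 8, 12, 300, 450, 7, 900]       # collected access intervals
print(best_decay_interval(gaps, candidates=(10, 50, 100, 500)))   # -> 10
```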

18.
A Prefetch Policy That Incorporates the State of the Miss Queue
As the gap between memory access speed and processor speed becomes ever more pronounced, memory performance has become the bottleneck of overall computer system performance. By analyzing the miss behavior of the instruction cache and the data cache, this paper proposes a prefetch policy that incorporates the state of the miss queue. The policy preserves the order of instruction and data accesses, which helps extract prefetch streams, and it separates instruction-stream and data-stream prefetching to avoid mutual replacement. When choosing the moment to issue a prefetch, it considers not only whether the bus is currently idle but also the state of the miss queue, reducing the impact on the processor's normal memory requests. A stream-filtering mechanism improves prefetch accuracy and lowers the memory bandwidth demanded by prefetching. The results show that with this policy, the processor's average memory latency is reduced by 30%, and the IPC of the SPEC CPU2000 programs improves by 8.3% on average.
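The issue-time check is easy to picture; in the sketch below (queue size, reserve threshold, and names are assumptions) a prefetch launches only when the bus is idle and the miss queue still has spare entries for demand misses.

```python
from collections import deque

MSHR_ENTRIES, RESERVE = 8, 2    # keep 2 entries free for demand misses

class Prefetcher:
    def __init__(self):
        self.miss_queue = deque(maxlen=MSHR_ENTRIES)

    def may_issue(self, bus_idle):
        free = MSHR_ENTRIES - len(self.miss_queue)
        return bus_idle and free > RESERVE   # don't crowd out demand misses

    def demand_miss(self, addr):
        self.miss_queue.append(("demand", addr))

p = Prefetcher()
for a in range(6):
    p.demand_miss(a)
print(p.may_issue(bus_idle=True))   # False: queue too full (6 of 8 used)
```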

19.
This paper proposes a novel leakage management technique for applications with producer-consumer sharing patterns. Although previous research has proposed leakage management techniques that turn off inactive cache blocks, these techniques can be further improved by exploiting the run-time characteristics of target applications on CMPs. By exploiting the particular access sequences observed in producer-consumer sharing patterns and the spatial locality of shared buffers, our technique enables a more aggressive turn-off of the L2 cache blocks of these buffers. Experimental results using a CMP simulator show that our proposed technique reduces the energy consumption of on-chip L2 caches, the shared bus, and off-chip memory by up to 31.3% over existing cache leakage power management techniques, with no significant performance loss.

20.
To provide robustness and a high quality of service, modern computing architectures running real-time applications should offer high system performance and high timing predictability. Cache memory improves performance by bridging the speed gap between main memory and the CPU; however, the cache introduces timing unpredictability, creating serious challenges for real-time applications. Herein, we introduce a miss table (MT) based cache locking scheme at the level-2 (L2) cache to improve timing predictability and the system performance/power ratio. The MT holds the block addresses, related to the application being processed, that would cause the most cache misses if not locked. Information in the MT is used for efficient selection of the blocks to be locked and of the victim blocks to be replaced. This MT-based approach improves timing predictability by keeping the blocks with the highest number of misses locked in the cache for the entire execution time. In addition, the technique decreases the average delay per task and the total power consumption by reducing cache misses and avoiding unnecessary data transfers. The solution is effective for both uniprocessors and multicores. We evaluate the proposed MT-based cache locking scheme by simulating an 8-core processor with two levels of caches using MPEG4 decoding, H.264/AVC decoding, FFT, and MI workloads. Experimental results show that, in addition to improving predictability, a 21% reduction in mean delay per task and an 18% reduction in total power consumption are achieved for MPEG4 (and H.264/AVC) by using the MT and locking 25% of the L2. The MT yields about 5% delay and power reductions on these video applications, and possibly more on applications with worse cache behavior. For the FFT and MI (and other) applications whose code fits inside the level-1 instruction (I1) cache, the mean delay per task increases by only 3% and total power consumption increases by 2% due to the addition of the MT.
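A sketch of the miss-table construction and lock selection, on an invented miss trace: count misses per block address, then lock the heaviest missers into at most 25% of the L2, matching the configuration quoted above. The table size and trace are illustrative.

```python
from collections import Counter

def build_miss_table(miss_trace, entries=16):
    """MT: the block addresses with the most misses, highest first."""
    return Counter(miss_trace).most_common(entries)

def blocks_to_lock(miss_table, l2_blocks, fraction=0.25):
    budget = int(l2_blocks * fraction)      # lock at most 25% of the L2
    return [addr for addr, _ in miss_table[:budget]]

trace = [0xA, 0xB, 0xA, 0xC, 0xA, 0xB, 0xD, 0xA]   # block addrs that missed
mt = build_miss_table(trace)
print(blocks_to_lock(mt, l2_blocks=8))   # lock the 2 most-missing blocks
```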
