Similar Literature
20 similar documents found (search time: 31 ms)
1.
Advances in semiconductor technology increase the power density of recent Chip Multi-Processors (CMPs), which significantly raises the leakage energy consumption of on-chip Last Level Caches (LLCs). Performance-linked dynamic tuning of LLC size is a promising option for reducing cache leakage. This paper reduces static power consumption by dynamically shutting down or turning on cache banks based on system performance and cache bank usage statistics. Shutting down a cache bank remaps its future requests to another active bank, called the target bank. The proposed method is evaluated under three implementation policies: (1) the system may decide to shut down or turn on cache banks periodically throughout process execution; (2) the system may shut down banks only initially, and once bank restarting begins, no further shutdowns are permitted; (3) the cache is resized as in the first policy, but with predefined time slices during which it cannot be resized. For a 4 MB 4-way set-associative L2 cache, experimental analysis shows a 66% reduction in static energy with a 29% gain in Energy Delay Product (EDP) for the first strategy; for the second policy, static power is reduced by 59% with 27% savings in EDP. Finally, the last policy saves 65% in static power and 30% in EDP with minimal performance penalty.
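A minimal sketch of the first (periodic) policy, under assumed interval statistics: the threshold values, counter fields, and least-loaded target selection below are illustrative guesses, since the abstract does not specify the controller's actual parameters.

```python
# Hedged sketch of periodic bank shutdown/turn-on based on usage statistics.
# Thresholds, counters, and the remapping rule are assumptions, not the
# paper's actual design.

NUM_BANKS = 8
SHUTDOWN_THRESHOLD = 100   # accesses per interval below which a bank may sleep
WAKEUP_THRESHOLD = 1000    # target-bank load above which a sleeper is revived

class BankState:
    def __init__(self):
        self.active = True
        self.accesses = 0    # accesses observed in the current interval
        self.target = None   # active bank that absorbs remapped requests

banks = [BankState() for _ in range(NUM_BANKS)]

def end_of_interval(banks):
    """Periodically shut down cold banks and wake banks whose target is hot."""
    active = [b for b in banks if b.active]
    for bank in banks:
        if bank.active and bank.accesses < SHUTDOWN_THRESHOLD and len(active) > 1:
            bank.active = False
            # Remap future requests to the least-loaded other active bank.
            bank.target = min((b for b in active if b is not bank),
                              key=lambda b: b.accesses)
            active.remove(bank)
        elif not bank.active and bank.target.accesses > WAKEUP_THRESHOLD:
            bank.active = True   # restart the bank to relieve its target
            bank.target = None
            active.append(bank)
    for bank in banks:
        bank.accesses = 0        # reset statistics for the next interval

def route(banks, bank_id):
    """Follow the remapping chain from a (possibly sleeping) bank to an active one."""
    bank = banks[bank_id]
    while not bank.active:
        bank = bank.target
    return bank
```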

2.
The last level cache (LLC) in a private configuration offers lower latency and isolation, but eliminates the possibility of sharing underutilized cache resources. Cooperative Caching (CC) provides capacity sharing by spilling a line evicted from one cache to another. However, CC proposals have not paid enough attention to replication, a problem inherent to private LLCs. Static policies that invariably either admit replicated blocks (replicas) into the LLC or exclude them from it are inadequate for the complex cache capacity situations of manycore environments. In this paper, we present replication-aware cache management (RACMan) to optimize replication for private configurations. RACMan relies on PBFP, a novel coarse-grained, low-overhead mechanism that monitors and predicts replica reusability to dynamically adjust LLC insertion policies, giving replicas different positions in the LRU chain and different chances of survival in the LLC according to the prediction. Experimental results show that our proposal optimizes replication effectively, outperforming two baseline systems in L2 hit rate, network traffic, IPC, and dynamic energy. RACMan fulfils the requirements of manycore CMPs with private LLCs for increased system performance, area efficiency, and scalability.

3.
Non-volatile memory (NVM) offers low energy consumption, good scalability, and high storage density, making it a candidate to replace traditional SRAM as on-chip cache; however, its write operations incur high energy and latency, so write performance must be optimized before large-scale deployment. This paper proposes a dynamic bypass policy based on cache block reuse information to optimize the performance of NVM caches. It analyzes the reuse characteristics of benchmark accesses to the last-level cache (LLC) and, from each block's reuse information, dynamically predicts whether the corresponding write should bypass the non-volatile cache; a prediction table drives the bypass decision when filling on an LLC miss. Write-backs from the upper-level cache use dynamic path selection: a monitoring module chooses a suitable upper-level cache for bypassed blocks and fills blocks with high reuse counts into it, reducing the number of LLC writes. Experimental results show that, compared with a cache design without bypassing, the policy reduces the running time of all SPLASH-2 programs on a 4-core processor by 6.6% on average and lowers cache energy consumption by 22.5% on average, effectively improving overall cache performance.
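A minimal sketch of a reuse-based bypass predictor of the kind the abstract describes, assuming a PC-indexed table of saturating counters; the paper's actual table organization and training rule may differ.

```python
# Illustrative sketch of reuse-based bypass for an NVM LLC. The PC-hash
# indexing, 2-bit counters, and threshold are assumptions for illustration.

BYPASS_THRESHOLD = 1   # predicted reuse <= threshold  ->  bypass the NVM LLC
TABLE_SIZE = 1024

prediction_table = [2] * TABLE_SIZE   # 2-bit saturating reuse counters (0..3)

def table_index(pc):
    return hash(pc) % TABLE_SIZE

def should_bypass(pc):
    """On an LLC miss, decide whether the fill should skip the NVM cache."""
    return prediction_table[table_index(pc)] <= BYPASS_THRESHOLD

def train(pc, block_was_reused):
    """Update the predictor when a block leaves the LLC."""
    i = table_index(pc)
    if block_was_reused:
        prediction_table[i] = min(3, prediction_table[i] + 1)
    else:
        prediction_table[i] = max(0, prediction_table[i] - 1)
```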

4.
Many-core SoCs are a future technology trend, and process yield will face many unpredictable challenges. Nonuniform cache architecture (NUCA) can improve the performance of many-core SoCs for embedded systems: it embeds a NoC into the cache memory to speed up data access by distributing traffic across several banks in parallel. Providing a fault-tolerance mechanism in NUCA is important because the chip can then still work efficiently when some memory banks are unusable. In this paper, we design a specific router working with static and dynamic cache remapping policies to support faulty banks in NUCA. When an L2 cache bank in NUCA is unusable, the static remapping policy (SRP) selects a suitable neighbor cache bank to assist with its accesses according to a collected remapping cost that considers both the cache status and the traffic status of the router. We also propose a dynamic remapping policy (DRP) that selects the target cache bank at runtime to fit the real load status of the nodes neighboring the faulty bank. The experimental results show average improvements of approximately 26% for SRP and 28% for DRP.
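A hedged sketch of SRP-style target selection, assuming a linear cost that weighs cache utilization against router traffic; the weights and the cost form are invented for illustration, as the abstract only says both statuses are considered.

```python
# Sketch of the static remapping policy (SRP): when a bank is faulty, pick
# the neighbor with the lowest combined remapping cost. The cost weights are
# illustrative assumptions.

ALPHA = 0.6   # weight of cache utilization in the remapping cost
BETA = 0.4    # weight of router traffic load in the remapping cost

def remapping_cost(neighbor):
    """neighbor = (cache_utilization, traffic_load), both normalized to [0, 1]."""
    utilization, traffic = neighbor
    return ALPHA * utilization + BETA * traffic

def select_target_bank(neighbors):
    """Choose the neighbor bank that absorbs the faulty bank's accesses."""
    return min(range(len(neighbors)), key=lambda i: remapping_cost(neighbors[i]))

# e.g. three neighbors: the second one (index 1) is least loaded overall
print(select_target_bank([(0.9, 0.5), (0.3, 0.2), (0.6, 0.8)]))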

5.
With the emergence of 3D-stacking technology, dynamic random-access memory (DRAM) can be stacked on chips to build a DRAM last-level cache (LLC). Compared with static random-access memory (SRAM), DRAM is larger but slower. Much existing research has been devoted to improving workload performance by using SRAM and stacked DRAM together, ranging from SRAM structure improvements to optimized cache tag and data access. Little attention, however, has been paid to designing an LLC scheduling scheme for multi-programmed workloads with different memory footprints. Motivated by this, we propose a self-adaptive LLC scheduling scheme that utilizes SRAM and 3D-stacked DRAM efficiently to achieve better workload performance. The scheme employs (1) an evaluation unit that probes and evaluates cache information while programs execute, and (2) an implementation unit that self-adaptively chooses SRAM or DRAM. To make the scheduling scheme work correctly, we develop a data migration policy. Extensive experiments show that our method improves multi-programmed workload performance by up to 30% compared with state-of-the-art methods.

6.
CPU and GPU development has hit new bottlenecks, and integrating both on the same chip has become a new trend. This heterogeneous architecture puts pressure on the management of shared on-chip resources, among which management of the shared last-level cache (LLC) is critical to performance. Because CPU and GPU programs behave differently, sharing the LLC between them poses new challenges. By analyzing the memory access patterns of GPU programs and drawing on earlier cache management schemes, we propose equal static partitioning and optimal static partitioning of the LLC in a fused CPU-GPU system. Experimental results show that cache partitioning effectively avoids interference between CPU and GPU programs: compared with the traditional LRU policy, equal static partitioning and optimal static partitioning improve overall system performance by 7.68% and 11.62%, respectively.
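A minimal sketch of equal static way partitioning, assuming a 16-way LLC with per-way LRU timestamps; the way counts are illustrative, and the optimal partitioning would choose a different split per workload.

```python
# Sketch of equal static way partitioning between CPU and GPU for a shared
# LLC, as in the first scheme above. Way counts are illustrative.

ASSOCIATIVITY = 16
CPU_WAYS = set(range(0, 8))    # ways 0-7 reserved for CPU lines
GPU_WAYS = set(range(8, 16))   # ways 8-15 reserved for GPU lines

def victim_way(lru_timestamps, requestor):
    """Choose an LRU victim only among the requestor's own ways, so CPU and
    GPU evictions can never interfere with each other's partition."""
    allowed = CPU_WAYS if requestor == "cpu" else GPU_WAYS
    # lru_timestamps[w] holds a per-way LRU timestamp (lower = older).
    return min(allowed, key=lambda w: lru_timestamps[w])

# Example: a GPU eviction picks among ways 8-15 only.
timestamps = [3, 7, 1, 9, 4, 12, 0, 6, 15, 2, 8, 11, 5, 14, 10, 13]
print(victim_way(timestamps, "gpu"))   # way 9 (oldest GPU-side timestamp, 2)
```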

7.
This paper presents CMP-VR (Chip-Multiprocessor with Victim Retention), an approach that improves cache performance by reducing the number of off-chip memory accesses. The objective is to retain chosen victim cache blocks on chip for as long as possible. Some sets of the CMP's last-level cache (LLC) may be heavily used while others are not. In CMP-VR, a number of ways in every set serve as reserved storage, allowing a victim block from a heavily used set to be stored in the reserve space of another set. In this way the load of heavily used sets is distributed among underused sets, logically increasing the associativity of the heavily used sets without increasing the actual associativity or size of the cache. Experimental evaluation using full-system simulation shows that CMP-VR has a lower off-chip miss rate than a baseline tiled CMP. Results are presented for different cache sizes and associativities for CMP-VR and the baseline configuration. The best improvements are 45.5% in miss rate and 14% in cycles per instruction (CPI) for a 4 MB, 4-way set-associative LLC; reducing CPI and miss rate together guarantees a performance improvement.
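A hedged sketch of victim retention, assuming reserved ways tracked as a per-set spill list tagged with the home set; the reserve size and the donor-selection order are assumptions, not the paper's design.

```python
# Hedged sketch of CMP-VR-style victim retention: each set keeps reserved
# ways that can host victims spilled from other, heavily used sets.

NUM_SETS = 4
RESERVE_WAYS = 1

class CacheSet:
    def __init__(self):
        self.reserve = []   # (home_set, tag) victims retained for other sets

sets = [CacheSet() for _ in range(NUM_SETS)]

def spill_victim(sets, home_index, victim_tag):
    """On eviction from a hot set, retain the victim in another set's reserve
    instead of letting it leave the chip."""
    for i, s in enumerate(sets):
        if i != home_index and len(s.reserve) < RESERVE_WAYS:
            s.reserve.append((home_index, victim_tag))
            return True
    return False   # no reserve space anywhere: fall back to normal eviction

def lookup_retained(sets, home_index, tag):
    """A miss in the home set additionally probes the other sets' reserves."""
    for s in sets:
        if (home_index, tag) in s.reserve:
            return tag
    return None
```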

8.
A NUCA Substrate for Flexible CMP Cache Sharing
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.

9.
Fang Juan, Zhang Xibei, Liu Shijian, Chang Zeqing. The Journal of Supercomputing, 2019, 75(8): 4519-4528.

When multiple processor (CPU) cores and a GPU integrated on the same chip share the last-level cache (LLC), competition for the LLC intensifies. CPU and GPU workloads have different memory access characteristics and therefore differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can cause significant performance degradation; GPU applications, by contrast, run large numbers of concurrent threads and can tolerate access latency. Exploiting this latency tolerance of GPU programs, we propose an LLC buffer management strategy (buffer-for-GPU, BFG) for heterogeneous multi-cores. A buffer is added beside the LLC to filter streaming GPU requests: cache-insensitive GPU accesses go directly to the buffer instead of the LLC, filtering out GPU requests and freeing LLC space for CPU applications. Then, to match the different characteristics of CPU and GPU applications, an improved LRU replacement policy that considers both the recency and the access frequency of each cache block is adopted. A cache-miss-aware algorithm dynamically selects the improved LRU or plain LRU to fit the current operating state by comparing miss rates in the buffer, so that system performance improves significantly.


10.
Contemporary CMP processors usually adopt shared last-level cache designs based on the LRU replacement policy or its approximations. However, as LLC capacity and associativity grow, the performance gap between LRU and the theoretically optimal replacement algorithm keeps widening. Many cache management strategies have been proposed to address this problem, but most of them target only a single type of memory access and pay little attention to cache access frequency, so their performance gains are quite limited...

11.
A Two-Stage Mapping Distribution Method for 2D Tiles Based on Vector Base Pairs
To address the performance differences among the storage devices of an object-based storage system, a two-stage mapping distribution method for 2D tiles based on vector base pairs is proposed. First, a group of virtual devices with identical performance and unlimited capacity is constructed, and the 2D tiles are distributed across the M virtual devices according to a vector-base-pair distribution template. Next, a pseudo-random number generator maps the tiles of the M virtual devices uniformly onto M equal subintervals of [0, 1). Finally, [0, 1) is repartitioned into M intervals according to each real device's share of the tiles, and the 2D tiles mapped into each interval are distributed to the corresponding real device.
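A minimal sketch of this staged mapping under stated assumptions: a modulo grid stands in for the vector-base-pair template, and an MD5-based hash plays the role of the pseudo-random number generator.

```python
# Sketch of the staged mapping described above: tiles go first to M equal
# virtual devices, a pseudo-random function spreads them uniformly over
# [0, 1), and capacity-weighted intervals map them onto real devices.

import bisect
import hashlib
from itertools import accumulate

M = 4                                       # number of virtual/real devices
REAL_SHARES = [0.4, 0.3, 0.2, 0.1]          # per-device tile share (sums to 1)
BOUNDARIES = list(accumulate(REAL_SHARES))  # interval upper bounds in [0, 1)

def virtual_device(x, y):
    """Stage 1: vector-base-pair template, simplified here to a modulo grid."""
    return (x + y) % M

def to_unit_interval(vdev, x, y):
    """Stage 2: pseudo-randomly map a tile to a point in [0, 1)."""
    digest = hashlib.md5(f"{vdev}:{x},{y}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def real_device(x, y):
    """Stage 3: capacity-weighted interval lookup picks the real device."""
    point = to_unit_interval(virtual_device(x, y), x, y)
    return bisect.bisect_right(BOUNDARIES, point) % M   # % M guards rounding

print(real_device(10, 17))   # which storage device holds tile (10, 17)
```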

12.
Before an application can actually be launched in a many-core system, it must first be mapped to a number of tiles (cores). This online application mapping process can unfortunately lead to a serious resource leak problem, referred to as tile fragmentation: the free (uncommitted) tiles in any single contiguous region are inadequate to accommodate the performance needs of an incoming application, even though the total number of free tiles may still exceed what the application requires. When applications have to be mapped to noncontiguous tiles due to fragmentation, there is an obvious performance penalty from increased communication distances. As a result, defragmentation that consolidates fragmented tiles needs to be exercised routinely, and it must not introduce computation overhead high enough to adversely impact system performance. In this paper, we propose a task-migration-based adaptive tile defragmentation algorithm that consolidates running applications through online task migration. The algorithm relocates applications' tile regions so that a contiguous free tile region is formed and maintained; future applications can then be mapped to regions with low communication distance. Both the computation overhead and the quality of the defragmentation result are set adaptively in response to the system workload. Enabled by its low overhead, the proposed defragmentation algorithm is an effective resource management enhancement to existing runtime task-to-tile mapping methods, with as much as 3x system throughput improvement observed in some experiments.

13.
For cache hierarchy levels where program locality is not as evident as in L1, LRU replacement does not seem to be the optimal way to determine which blocks will be requested soon. The literature is prolific on alternative reuse-distance estimations at the last on-chip cache level, proving the difficulty of achieving an optimal hit rate. One key aspect for performance is knowing inter- and intra-application reuse-distance variability. Many solutions already do this, but most rely on a simple choice among a few alternative policies. The experiments performed to motivate this proposal confirm application variability, but also show that application behavior is far more than bimodal, leaving a performance gap that current hybrid policies cannot cover. In this paper we propose a mobile insertion position replacement policy (MIP), which combines well-known LRU ordering and promotion policies with a completely adaptive insertion mechanism. The dynamic behavior of insertion captures hit-rate variability more accurately. Using set dueling and dynamic set sampling for prediction, our mechanism continuously estimates the insertion position that maximizes the cache hit rate. The hardware overhead compared with an LRU replacement algorithm is merely three 3-bit saturating counters per LLC bank. Our experiments show that, for a wide range of applications, MIP improves the hit rate of LRU by 30% on average; MIP outperforms current state-of-the-art replacement policies of similar implementation cost by 10% on average, and by 20% in single-thread or multi-thread workloads.
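A hedged sketch of set-dueling-driven insertion-position selection, assuming two candidate positions and one 3-bit counter; the paper states that three such counters per bank suffice, so the real mechanism is richer than this.

```python
# Sketch of a mobile insertion position mechanism using set dueling.
# The sampled-set layout and candidate positions are illustrative.

ASSOCIATIVITY = 16

class InsertionDuel:
    """Pick the LRU-stack insertion position that currently wins the duel."""
    def __init__(self):
        self.counter = 4                              # 3-bit counter (0..7)
        self.candidates = (0, ASSOCIATIVITY // 2)     # MRU vs. mid-stack

    def miss_in_sample(self, which):
        """Misses in sampled sets move the counter against that candidate."""
        if which == 0:
            self.counter = min(7, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

    def insertion_position(self):
        """Follower sets insert new blocks at the winning stack position."""
        return self.candidates[0] if self.counter < 4 else self.candidates[1]
```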

14.
The cache memory consumes a large proportion of a processor's energy, and within the on-chip cache the translation lookaside buffer (TLB) accounts for 20-50% of that energy. To reduce the energy consumed by TLB accesses, a virtual cache can be accessed directly with the virtual addresses issued by the processor; however, a virtual cache may suffer from the synonym problem. In this paper, we propose low-cost synonym detection hardware and a synonym data coherence mechanism that reduce the energy consumption incurred by TLB lookups while maintaining synonym data consistency in the virtual cache. The proposed synonym detection hardware efficiently reduces the number of blocks that must be looked up in the virtual cache, saving energy, and the proposed coherence mechanism also reduces the number of invalidated blocks in the virtual cache to prevent the destruction of cache locality. Simulation results show that our energy-aware virtual cache consumes 51%, 27%, and 20% less energy than a traditional physical cache, a traditional virtual cache, and a synonym lookaside buffer (SLB), respectively. In addition, our design shows almost the same static energy consumption as the SLB and reduces static energy consumption by about 20% compared with traditional physical and virtual caches.

15.
Web mapping has become a popular way of distributing interactive digital maps over the internet. Instead of generating map images dynamically on the fly, they can be pre-generated and served from a server-side cache for faster retrieval. However, these caches can grow to unmanageable sizes when the cartography covers mid-to-large areas at multiple rendering scales, forcing modest organizations to use partial caches containing just a subset of the total tiles and making their services less attractive than mapping services such as Google Maps or Microsoft Bing Maps. This work proposes a neural-network-based intelligent system that predicts which areas are likely to be requested in the future from a catalog of geographic features and a short history of past requests. These priority regions can be used by a tile prefetching policy to achieve an optimal population of the cache. The neural networks are trained and validated using supervised learning on real data sets from a public nationwide web map service. Trace-driven simulations demonstrate that accurate long-term predictions, up to a 90% cache-hit ratio, can be obtained with the proposed model by prefetching only a small fraction, 20%, of the total tiles, and with a short training period.

16.
Last-level cache (LLC) memory access patterns significantly influence DRAM performance because they determine the DRAM row buffer hit rate (RBH), which is strongly related to DRAM access latency. However, accurately quantifying these effects is difficult without LLC memory access traces from detailed simulations. This paper proposes an analytical model of DRAM access latency based on RBH analysis. By exploring the relationship between LLC spatial locality (described in this paper by the LLC stride distribution) and the different RBH cases, the unknown parameters of the model can be quantified. Instead of detailed simulations, the LLC stride distributions are deduced from stride distributions extracted from software traces together with the LLC miss rate estimated by the multi-level cache behavior model of our prior work, significantly decreasing the model's time overhead. Fifteen benchmarks chosen from Mobybench 2.0, MiBench 1.0, and MediaBench II are used to evaluate the model's accuracy. Compared with the results of full simulations in gem5 and DRAMSim2, the average error of our model is below 7%, while the prediction process is sped up by 50 times on average.
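The core of such a model can be written as an RBH-weighted latency mix; the timing constants below are illustrative, not the paper's calibrated parameters.

```python
# Back-of-the-envelope form of the latency model above: average DRAM access
# latency as a row-buffer-hit-rate (RBH) weighted mix of hit and miss
# latencies. Timing values are illustrative assumptions.

T_ROW_HIT = 15    # ns: column access only (row already open)
T_ROW_MISS = 45   # ns: precharge + activate + column access

def avg_dram_latency(rbh):
    """Expected access latency for a given row buffer hit rate in [0, 1]."""
    return rbh * T_ROW_HIT + (1.0 - rbh) * T_ROW_MISS

print(avg_dram_latency(0.7))   # 24.0 ns at a 70% row buffer hit rate
```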

17.
Recent studies have shown that embedded DRAM (eDRAM) is a more promising candidate than SRAM for 3D-stacked last-level caches (LLCs): (i) eDRAM occupies less area thanks to its smaller bit cell, and (ii) eDRAM has much lower leakage power and access energy, since it uses far fewer transistors. Unlike SRAM cells, however, eDRAM cells must be refreshed periodically to retain their data. Since refresh operations consume a noticeable amount of energy, it is important to adopt an appropriate refresh interval, which depends strongly on temperature. The conventional refresh method assumes the worst-case temperature for all eDRAM stacked cache banks, resulting in unnecessarily frequent refresh operations. In this paper, we propose a novel temperature-aware refresh scheme for 3D-stacked eDRAM caches that dynamically changes the refresh interval depending on the temperature of the eDRAM stacked LLC. Compared with the conventional refresh method, our scheme reduces the number of refresh operations of the eDRAM stacked LLC by 28.5% on average (on a 32 MB eDRAM LLC) with small area overhead, and consequently reduces overall eDRAM LLC energy consumption by 12.5% on average (on a 32 MB eDRAM LLC).
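A minimal sketch of temperature-dependent refresh interval selection; the retention table values are illustrative assumptions (real retention curves come from device characterization).

```python
# Sketch of a temperature-aware refresh controller: the refresh interval is
# looked up from the current bank temperature instead of being fixed at the
# worst case. Table values are illustrative assumptions.

# (max temperature in Celsius, safe refresh interval in microseconds);
# retention roughly halves per ~10 C, hence the decreasing intervals.
RETENTION_TABLE = [
    (45, 256),
    (65, 128),
    (85, 64),
    (105, 32),   # worst case: what a conventional controller always uses
]

def refresh_interval_us(bank_temperature_c):
    """Return the longest refresh interval that is safe at this temperature."""
    for max_temp, interval in RETENTION_TABLE:
        if bank_temperature_c <= max_temp:
            return interval
    return RETENTION_TABLE[-1][1]   # clamp above-range readings to worst case

print(refresh_interval_us(52))   # 128 us: 4x fewer refreshes than worst case
```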

18.
Recently, eDRAM cells have gained much attention as a promising alternative for constructing on-chip memories. However, due to the inherent characteristics of DRAM cells, they must be refreshed periodically, imposing a heavy refresh energy burden; employing eDRAM cells in large-scale last-level caches makes this burden even higher because of their large capacity. In this paper, we propose a selective fine-grain round-robin refresh scheme that both improves performance and reduces refresh energy. To reduce bank conflicts between normal cache accesses and refresh operations, cache lines are refreshed in a bank-wise round-robin fashion. We also apply a selective refresh that exploits inclusion information in the cache hierarchy: data that reside in both the LLC and an upper-level cache (i.e., the L2 cache) have their accesses filtered by the upper level, so we skip refreshing those blocks in the eDRAM-based LLC. This eliminates unnecessary refresh operations in eDRAM-based LLCs. According to our evaluation, the proposed scheme improves performance by 7.3% and reduces energy per instruction by 13.3% compared with a baseline all-bank refresh scheme.
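A hedged sketch of the selective round-robin refresh, assuming a per-line bit that records whether an upper-level cache also holds the data; the bookkeeping of that bit is an assumption of this sketch.

```python
# Sketch of selective refresh: a bank-wise round-robin walk that skips LLC
# lines whose data also reside in an upper-level cache. The per-line
# "in_upper_level" bit is an assumed bookkeeping mechanism.

NUM_BANKS = 4
LINES_PER_BANK = 8

class Line:
    def __init__(self):
        self.valid = False
        self.in_upper_level = False   # set when the L2 also holds this data

banks = [[Line() for _ in range(LINES_PER_BANK)] for _ in range(NUM_BANKS)]
cursor = [0] * NUM_BANKS             # per-bank round-robin refresh pointer

def refresh_tick(bank_id):
    """Advance one bank's round-robin pointer; refresh only uncovered lines."""
    line = banks[bank_id][cursor[bank_id]]
    cursor[bank_id] = (cursor[bank_id] + 1) % LINES_PER_BANK
    if line.valid and not line.in_upper_level:
        return True    # issue the actual eDRAM refresh pulse for this line
    return False       # refresh skipped: line invalid, or the L2 holds it
```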

19.
We design a cache management policy called ELRRIP that eliminates low-reuse blocks and predicts access intervals. Exploiting the observation that low-reuse blocks occupy the shared last-level cache of a multi-core processor for a long time, ELRRIP (1) predicts low-reuse blocks by observing the historical access information of data in the cache level above the shared LLC and evicts them preferentially, and (2) identifies potentially low-reuse blocks with an improved access-interval prediction technique and also evicts them preferentially. We further propose TADELRRIP based on ELRRIP. Experiments show that, on a 4-core processor, TADELRRIP improves the weighted speedup by 9.14% on average.

20.
Multidimensional data are accessed in linear form in storage systems: adjacent nodes in two- or higher-dimensional space are mapped by various algorithms to non-adjacent positions in one-dimensional space, so accesses to neighboring nodes in a high-dimensional space incur differing access distances and latencies at their 1-D storage locations. This paper proposes a storage mapping method based on the space-filling curve Z-Ordering, together with a metric for its access distance, and compares it with conventional (row-/column-major) orderings. Z-Ordering clusters data nodes that are adjacent in high-dimensional space into nearby one-dimensional storage locations, strengthening locality. By adjusting the amount of cache space used for prefetching, this enhanced locality can be exploited to raise the cache hit rate. Experimental results show improved access speed for multidimensional data and better overall system performance.
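A minimal sketch of Z-Ordering itself, the standard bit-interleaving (Morton) encoding; the 16-bit coordinate width is an arbitrary choice for this sketch.

```python
# Sketch of the Z-Ordering mapping described above: interleaving the bits of
# the (x, y) coordinates yields a 1-D Morton index, so 2-D neighbors tend to
# land near each other in linear storage.

def morton_encode(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z

# Neighbors (2, 3) and (3, 3) map to adjacent linear addresses 14 and 15,
# which is the locality that benefits prefetching and cache hit rates.
print(morton_encode(2, 3), morton_encode(3, 3))
```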
