Similar Literature
A total of 20 similar documents were found.
1.
朱艳娜  王党辉 《计算机科学》2018,45(Z6):513-517
Multi-Level Cell Spin-Transfer Torque RAM (MLC STT-RAM) can store multiple bits in a single memory cell and is a promising replacement for SRAM in building large-capacity, low-power last-level caches (LLCs). The static power of MLC STT-RAM is theoretically zero, and it offers high density and excellent read characteristics, but its drawback is inefficient write operations. To address this problem, this work implements a write-intensity prediction technique for the MLC STT-RAM LLC, together with the corresponding cache organization, on top of a hard/soft logical partitioning of the MLC STT-RAM cache. By dynamically predicting which cache blocks are write-intensive, the technique helps the MLC STT-RAM LLC reduce the cost of write operations. The basic idea of the prediction is to exploit the correlation between the address of the memory-access instruction and the behaviour of the corresponding cache block, and to decide where data are placed in the LLC according to the prediction result. Experimental results show that applying write-intensity prediction to the MLC STT-RAM LLC reduces dynamic write power by 6.3% while slightly improving system performance.
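
A minimal sketch of such a prediction mechanism is given below, assuming a table of 2-bit saturating counters indexed by the memory-access instruction's PC and a two-region (soft/hard) placement decision; the class name, table organization, training events, and threshold are illustrative assumptions rather than the paper's implementation.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal sketch of a PC-indexed write-intensity predictor, assuming
// 2-bit saturating counters and a hard/soft region split (illustrative only).
class WriteIntensityPredictor {
    std::unordered_map<uint64_t, uint8_t> table;   // PC -> 2-bit counter
public:
    // Train on every LLC write caused by the instruction at `pc`.
    void trainOnWrite(uint64_t pc) {
        uint8_t &c = table[pc];
        if (c < 3) ++c;                            // saturate at 3
    }
    // Train toward "not write-intensive" when a block is evicted with few writes.
    void trainOnCleanEviction(uint64_t pc) {
        auto it = table.find(pc);
        if (it != table.end() && it->second > 0) --it->second;
    }
    // Placement decision on an LLC fill: write-intensive blocks go to the
    // fast (soft) region, the rest to the dense (hard) region.
    bool placeInSoftRegion(uint64_t pc) const {
        auto it = table.find(pc);
        return it != table.end() && it->second >= 2;
    }
};
```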

2.

As embedded memory technology evolves, traditional Static Random Access Memory (SRAM) has reached the end of its development. As manufacturing process technology continues to shrink, a next-generation memory technology is needed because the leakage current of SRAM grows exponentially. Non-volatile memories such as STT-MRAM (Spin-Transfer Torque Magnetic Random Access Memory) and PCM (Phase Change Memory) are good candidates for replacing SRAM in embedded memory systems, offering attractive characteristics in terms of power consumption, leakage power, density, and latency. Nonetheless, non-volatile memories have two major problems that hinder their use as next-generation memory. First, the lifetime of a non-volatile memory cell is limited by the number of write operations. Second, a write operation incurs more latency and power than a read operation of the same size. This study describes a compiler optimization technique for hybrid cache memories that overcomes these disadvantages of the non-volatile component. Specifically, to minimize the number of write operations to non-volatile memory, we present a data placement technique that considers the locations of register spill data. A large portion of memory accesses is generated by the spill code of the register allocator in an optimizing compiler, and such spill data can be partially removed by recalculation. We therefore implement an optimization that rearranges data placement and applies recalculation to minimize write instructions to the non-volatile memory. Our experimental results show that the proposed technique reduces the average number of spill codes by 20% and improves energy consumption by 20.2% on average.
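
The spill-placement idea can be sketched as follows; this is a hedged illustration assuming fixed relative NVM read/write costs and a simple per-candidate comparison between spilling and recalculation, not the paper's actual compiler pass (the names and constants are hypothetical).

```cpp
// Illustrative sketch: deciding whether to spill a value (one NVM store plus
// reloads) or to recompute it at each use. Costs are assumed relative units.
struct SpillCandidate {
    int numUses;          // reloads needed if spilled
    int recomputeCost;    // cycles to recompute the value at a use site
    bool recomputable;    // value can be rebuilt from still-live operands
};

enum class Decision { SpillToNVM, Rematerialize };

Decision chooseSpillStrategy(const SpillCandidate &c) {
    const int NVM_WRITE_COST = 10;   // assumed: NVM store several times a read
    const int NVM_READ_COST  = 2;    // assumed: NVM load cost
    if (!c.recomputable)
        return Decision::SpillToNVM;
    int spillCost = NVM_WRITE_COST + c.numUses * NVM_READ_COST;
    int rematCost = c.numUses * c.recomputeCost;
    return (rematCost < spillCost) ? Decision::Rematerialize
                                   : Decision::SpillToNVM;
}
```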


3.
As process feature sizes shrink, the leakage power of conventional SRAM-based on-chip caches grows exponentially, which hinders further increases in on-chip cache capacity. Based on the victim-cache principle, and exploiting the fast writes of SRAM together with the non-volatility, high density, and extremely low leakage power of STT-RAM, a hybrid instruction cache built from SRAM and STT-RAM is designed. Experiments show that, compared with a conventional SRAM-based instruction cache, the hybrid instruction cache increases capacity without increasing area and significantly improves the instruction cache hit rate.

4.
This paper focuses on energy consumption, a major problem in the dark silicon era. As energy consumption becomes a key issue for the operation and maintenance of cloud data centers, cloud computing providers are becoming significantly concerned. Here, we show how spin-transfer torque random access memory (STT-RAM) can be used as an on-chip L2 cache to achieve lower energy than conventional SRAM-based L2 caches. High density, fast read access, and non-volatility make STT-RAM a compelling technology for on-chip memories. Previous studies have mainly examined specific schemes based on common applications and do not provide a thorough analysis of emerging scale-out applications under multiple design options. Here, we discuss different perspectives on performance and energy efficiency in cloud processors by running emerging scale-out workloads. Experimental results on the CloudSuite benchmarks show that the proposed method reduces energy by 51% (on average) and improves energy-delay product by 37% (on average), with an instructions-per-cycle degradation of only 22% (on average) compared to the SRAM method.

5.
Non-volatile memories offer low energy consumption, good scalability, and high storage density, and can replace conventional SRAM as on-chip caches, but their write operations incur high energy and latency, so write performance must be optimized before large-scale deployment. A dynamic bypass policy based on cache-block reuse information is proposed to optimize the cache performance of non-volatile memories. The reuse characteristics of benchmark accesses to the last-level cache (LLC) are analyzed, and the reuse information of a cache block is used to dynamically predict whether the corresponding write should bypass the non-volatile cache; a prediction table drives the bypassing of fills on LLC misses. Write-backs from the upper-level cache use dynamic path selection: a monitoring module chooses a suitable upper-level cache for the bypassed blocks, and blocks with high reuse counts are filled into it to reduce the number of LLC writes. Experimental results show that, compared with a cache design without bypassing, the policy reduces the average execution time of all SPLASH-2 programs on a 4-core processor by 6.6% and lowers cache energy by 22.5% on average, effectively improving overall cache performance.
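
Below is a minimal sketch of a reuse-information-driven bypass decision, assuming a small table of saturating reuse counters indexed by a hash of the block address and a fixed bypass threshold; the prediction-table organization, training events, and threshold in the paper may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged sketch of a reuse-based bypass predictor for a non-volatile LLC:
// a small table of saturating reuse counters decides whether a fill on an
// LLC miss bypasses the NVM cache (avoiding a costly NVM write).
class BypassPredictor {
    std::vector<uint8_t> reuse;            // saturating reuse counters
    static constexpr int kThreshold = 1;   // assumed bypass threshold
public:
    explicit BypassPredictor(size_t entries = 4096) : reuse(entries, 0) {}
    size_t index(uint64_t blockAddr) const { return (blockAddr >> 6) % reuse.size(); }

    void recordHit(uint64_t blockAddr) {          // block reused while resident
        uint8_t &c = reuse[index(blockAddr)];
        if (c < 3) ++c;
    }
    void recordDeadEviction(uint64_t blockAddr) { // evicted without reuse
        uint8_t &c = reuse[index(blockAddr)];
        if (c > 0) --c;
    }
    // On an LLC miss: bypass the NVM fill (send data directly to the upper
    // level) when the predicted reuse is low.
    bool shouldBypass(uint64_t blockAddr) const {
        return reuse[index(blockAddr)] <= kThreshold;
    }
};
```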

6.
With the rapid growth in integration density and clock frequency, power and heat dissipation have become key challenges in architecture design. Near-threshold voltage operation is a promising technique for effectively reducing processor energy. However, at near-threshold voltages a large number of SRAM cells fail, raising the error rate of the first-level cache and posing a serious reliability challenge. Many existing approaches correct cache errors by sacrificing cache capacity or introducing extra latency, but most of them only work under low SRAM cell failure rates and perform poorly when the failure rate is high. This paper proposes FTFLC (Fault-Tolerant First-Level Cache), a fault-tolerant first-level cache structure for near-threshold voltage built from conventional 6T SRAM, which performs better in high-failure-rate environments. FTFLC uses a two-level mapping mechanism: a block-mapping mechanism and a bit-correction mechanism protect faulty sub-blocks and faulty bits within a cache line, respectively. An FTFLC initialization algorithm is also proposed to combine the two mapping mechanisms and increase the usable cache capacity. Finally, FTFLC is evaluated with the gem5 simulator in a high-failure-rate environment at 650 mV and compared with three existing cache structures: 10T-Cache, Bit-fix, and Correction Prediction. The results show that, while keeping area and energy overheads low, FTFLC improves performance by at least 3.86% over the other structures and increases the usable L1 cache capacity by 12.5%.

7.
Radiation-induced soft errors have become an emerging reliability threat to high-performance microprocessor design. As the size of on-chip cache memory has steadily increased over the past decades, techniques that make caches resilient to soft errors are becoming increasingly important for processor reliability. However, conventional soft-error-resilient techniques significantly increase cache access latency and energy consumption, resulting in undesirable performance and energy-efficiency degradation. Emerging 3D integration technology provides an attractive advantage: a 3D microarchitecture exhibits heterogeneous soft-error-resilience characteristics due to the shielding effect of die stacking. Moreover, this shielding effect leaves several inner dies inherently invulnerable to soft errors, as they are implicitly protected by the outer dies. To exploit this invulnerability, we propose a soft-error-resilient 3D cache architecture in which data blocks on the invulnerable dies carry no soft-error protection; accesses to blocks on these dies therefore incur considerably reduced latency and energy. Furthermore, we propose to maximize accesses to the invulnerable dies by dynamically moving data blocks among dies, achieving further performance and energy-efficiency improvement. Simulation results show that the proposed 3D cache architecture reduces power consumption by up to 65% for the L1 instruction cache, 60% for the L1 data cache, and 20% for the L2 cache, while overall IPC performance improves by 5% on average.

8.
Caches greatly improve the performance of embedded processors, but caches, and instruction caches in particular, account for a large fraction of processor power; disabling unnecessary accesses to the tag SRAM and data SRAM can greatly reduce it. A pipelined instruction-cache access mechanism is proposed that disables unnecessary data-SRAM accesses; in addition, by recording information about the current instruction-cache line and predicting the next line, a sliding window of cache lines is formed that disables unnecessary tag-SRAM accesses. The proposed method incurs no performance loss, and power analysis in a SMIC 90 nm process shows that the power of instruction accesses is reduced by 50%.
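
A hedged sketch of the line sliding-window idea follows, assuming a 64-byte line and a single-entry window covering the current line and a pre-checked sequential next line; the real design's window contents and pipeline timing are not specified here.

```cpp
#include <cstdint>

// Hedged sketch: if consecutive fetches fall in the same I-cache line (or in
// the predicted sequential next line whose tag was checked ahead of time),
// the tag-SRAM access is skipped. Line size and window depth are assumptions.
class TagSkipWindow {
    static constexpr uint64_t kLineBytes = 64;
    uint64_t currentLine = ~0ull;   // line address of the last verified fetch
    uint64_t nextLine    = ~0ull;   // sequential next line, pre-verified
    bool     nextValid   = false;
public:
    // Returns true if the tag SRAM must be read for this fetch.
    bool needTagLookup(uint64_t fetchAddr) {
        uint64_t line = fetchAddr / kLineBytes;
        if (line == currentLine) return false;      // same line: skip tag read
        if (nextValid && line == nextLine) {         // predicted next line
            currentLine = line;
            nextValid = false;
            return false;                            // tag already checked
        }
        return true;                                 // full tag lookup needed
    }
    // Called after a full lookup hits; the next sequential line can then be
    // checked in the background so a later skip is safe.
    void updateAfterLookup(uint64_t fetchAddr, bool nextLinePreChecked) {
        currentLine = fetchAddr / kLineBytes;
        nextLine = currentLine + 1;
        nextValid = nextLinePreChecked;
    }
};
```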

9.
STT-RAM is considered a promising alternative to SRAM due to its low static power (non-volatility) and high density. However, the write operation of STT-RAM is inefficient in energy and speed compared to SRAM, and various device-, circuit-, and architecture-level solutions have been proposed to tackle this inefficiency. One such solution is to redesign the STT-RAM cell for better write characteristics at the cost of a shortened retention time (volatile STT-RAM). Because retention failures of STT-RAM are stochastic, the extra overhead of periodic scrubbing with an error-correcting code (ECC) is required to tolerate them; the more frequent the scrubbing and the stronger the ECC, the shorter the retention time that can be tolerated. Using an analysis based on an analytic STT-RAM model, we conducted extensive experiments over various volatile STT-RAM cache design parameters, including scrubbing period, ECC strength, and target failure rate. The experimental results show the impact of these parameter variations on last-level cache energy and performance, and provide a guideline for designing a volatile STT-RAM cache with ECC and scrubbing.
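
One way to reason about the scrubbing-period/ECC-strength trade-off is sketched below, assuming independent per-bit retention failures with probability 1 - exp(-t/tau) over an interval t and a block-level ECC correcting up to k errors; the numbers (tau, block size, ECC strength) are placeholders, not values from the paper's analytic model.

```cpp
#include <cmath>
#include <cstdio>

// Hedged analytic sketch (not the paper's exact model): a block of nBits
// protected by an ECC correcting up to kCorrect errors fails when more than
// kCorrect bits flip between two scrubs.
double blockFailureProb(int nBits, int kCorrect, double scrubPeriod, double tau) {
    double p = 1.0 - std::exp(-scrubPeriod / tau);   // per-bit flip probability
    // P(X <= kCorrect) for X ~ Binomial(nBits, p), computed via logs.
    double ok = 0.0;
    for (int i = 0; i <= kCorrect; ++i) {
        double logTerm = std::lgamma(nBits + 1) - std::lgamma(i + 1)
                       - std::lgamma(nBits - i + 1)
                       + i * std::log(p) + (nBits - i) * std::log1p(-p);
        ok += std::exp(logTerm);
    }
    return 1.0 - ok;
}

int main() {
    double tau = 1.0;                 // assumed mean retention time (seconds)
    int nBits = 512, kCorrect = 2;    // 64-byte block, assumed 2-bit-correcting ECC
    const double periods[] = {1e-3, 1e-2, 1e-1};   // candidate scrub periods (s)
    for (double period : periods) {
        std::printf("scrub period %.0e s -> block failure prob %.3e\n",
                    period, blockFailureProb(nBits, kCorrect, period, tau));
    }
    return 0;
}
```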

10.
Wire delays and leakage energy consumption are both growing problems in the design of large on-chip caches built in deep submicron technologies. D-NUCA (Dynamic Non-Uniform Cache Architecture) caches exploit aggressive sub-banking of the cache and a migration mechanism that reduces the access latency of frequently accessed data, thereby limiting the effect of wire delays on performance. Way Adaptable D-NUCA is a leakage-power-reduction technique specifically suited to D-NUCA caches: it dynamically varies the portion of the cache area that is powered on according to the caching needs of the running workload, but it relies on application-dependent parameters that must be evaluated off-line, which limits its effectiveness in general-purpose, multiprogrammed environments. In this paper, we propose a new power-reduction technique for D-NUCA caches that still adapts the powered-on cache area to the needs of the running workload but does not rely on application-dependent parameters. Results show that our proposal saves around 49% of total cache energy consumption in a single-core environment and 44% in a CMP environment. With the addition of a timer, it performs similarly to previously proposed leakage-reduction techniques and outperforms them when they are applied in a workload-independent manner.

11.
The modern semiconductor industry is evolving rapidly. Portable and mobile devices are becoming smaller every day, and there is a growing demand for longer battery life, so researchers must focus on leakage power in stand-by mode. SRAM must communicate reliably with CPUs, DSPs, and other processors in low-power applications such as battery-powered handheld devices, and design engineers now focus mainly on producing large-capacity, high-bandwidth, low-energy memories. Memory is an integral part of most of these systems and shrinks as the system scales down, so low-power, high-speed memory architecture is a major concern, as is the durability of static random access memory (SRAM) cells. This paper describes an SRAM architecture designed to reduce power consumption and leakage using sleep-transistor and MTCMOS (Multi-Threshold Complementary Metal Oxide Semiconductor) techniques, which reduce CMOS transistor leakage. Multiple threshold strategies give the proposed 8T SRAM cell high speed, increased reliability, and low leakage current in stand-by mode. Based on parameters such as power dissipation at different temperatures, read voltage, write voltage, read delay, and write delay, the proposed 8T SRAM architecture achieves lower power consumption than previously designed 6T, 7T, 8T, and 13T SRAM architectures. The simulations are conducted in Cadence Virtuoso using UMC 55 nm technology.

12.
Because of the very high static power of DRAM chips, building the large-capacity main memory of high-performance computer systems out of DRAM leads to excessive energy consumption, which has motivated research on new large-capacity main-memory organizations. To address this problem, a hybrid main-memory system based on SRAM and PRAM is designed, in which SRAM serves as a dedicated write cache for PRAM and an improved LRFU algorithm is applied to the SRAM write cache, effectively reducing main-memory energy and extending the usable lifetime of PRAM with little impact on main-memory performance. Simulation results show that the energy-delay product (EDP) of the proposed hybrid memory is 40% of that of a pure DRAM memory; moreover, compared with a pure PRAM memory, the number of PRAM writes is reduced by 28.5%, and compared with using SRAM as a cache, PRAM writes are reduced by 13%.
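
The write-cache policy can be illustrated with the sketch below: a fully associative SRAM buffer managed by LRFU, where each block's combined recency and frequency (CRF) value decays as 0.5^(lambda * elapsed) and the lowest-CRF block is written back to PRAM on eviction. The lambda value, associativity, and interface are illustrative assumptions, not the paper's modified LRFU.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <unordered_map>

// Hedged sketch of an LRFU-managed SRAM write cache in front of PRAM.
class LrfuWriteCache {
    struct Entry { double crf; uint64_t lastTouch; };
    std::unordered_map<uint64_t, Entry> lines;    // block address -> metadata
    size_t capacity;
    double lambda;
    uint64_t now = 0;

    double decayed(const Entry &e) const {
        return e.crf * std::pow(0.5, lambda * double(now - e.lastTouch));
    }
public:
    LrfuWriteCache(size_t cap, double lam = 0.01) : capacity(cap), lambda(lam) {}

    // A CPU write is absorbed by the SRAM cache; returns the address of a
    // block written back to PRAM on eviction, or ~0 if nothing is evicted.
    uint64_t write(uint64_t blockAddr) {
        ++now;
        uint64_t evicted = ~0ull;
        auto it = lines.find(blockAddr);
        if (it == lines.end()) {
            if (lines.size() >= capacity) {          // evict lowest-CRF block
                auto victim = lines.begin();
                double best = std::numeric_limits<double>::max();
                for (auto i = lines.begin(); i != lines.end(); ++i) {
                    double v = decayed(i->second);
                    if (v < best) { best = v; victim = i; }
                }
                evicted = victim->first;             // this write reaches PRAM
                lines.erase(victim);
            }
            lines[blockAddr] = {1.0, now};
        } else {                                      // hit: CRF = F(0) + F(dt)*CRF
            it->second.crf = 1.0 + decayed(it->second);
            it->second.lastTouch = now;
        }
        return evicted;
    }
};
```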

13.
In this paper, a data-sensitive write circuit is presented to decrease the write-attempt error rate in STT-RAM memories. CMOS technology presents a myriad of challenges to system designers in terms of soft-error reliability, volatility, power consumption, and scalability. STT-RAM (Spin-Transfer Torque RAM) is a promising non-volatile memory whose write errors arise mainly from fabrication-process fluctuation. These errors, which occur at different current densities, are a key drawback of STT-RAM memory technology. This paper addresses the problem and provides a solution. A continuous and dynamic method, together with a dual-source approach, is proposed, and the physical characteristics of the transistors are optimized. The proposed technique has two major parts. The first is a thermally assisted circuit designed to reduce the asymmetric behaviour of writing the 0 and 1 states. The second is a dynamic write-current injector designed to accelerate the write operation. To validate the design, the results of several functional simulations performed on DYSCO are compared with the results of existing related studies. Using physical-parameter optimization, we achieve WER reduction, write-performance improvement, and power gains. On average, write-time latency is improved by 27.17% with an area overhead bounded by 11%.

14.
Multi-level cell spin-transfer torque RAM (MLC STT-RAM) is an emerging non-volatile storage medium. Unlike SRAM, which stores information as charge, MLC STT-RAM stores information by using a spin-polarized current to change the magnetization direction of the free layer through a magnetic tunnel junction (MTJ), which makes it naturally immune to electromagnetic interference. Exploiting this resistance to electromagnetic radiation, this paper explores using MLC STT-RAM as the storage medium for register design in radiation-hardened aerospace environments. In MLC STT-RAM, each memory cell has four resistance states, and transitions between different states incur different energy and latency costs. Conventional SRAM-oriented register allocation does not consider the impact of different write-state transitions and heuristically selects potential spill variables without considering spill priority, so it is not suitable for register allocation on MLC STT-RAM. To address this problem, a spill-optimization strategy for write-state-transition-aware MLC STT-RAM register allocation is proposed. Specifically, a spill-cost model is first built as a linear combination of the frequencies of each write-state transition; spill variables are then selected according to this model, keeping low-cost variables in registers while preferring to spill high-cost variables, thereby realizing an MLC STT-RAM-oriented register-allocation strategy.
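
The cost model described above, a linear combination of write-state-transition frequencies, might look like the following sketch; the 4x4 transition-cost matrix and the profile structure are hypothetical placeholders for device-characterized values.

```cpp
#include <array>

// Hedged sketch of the spill-cost model: the cost of keeping a virtual
// register in the MLC STT-RAM register file is a linear combination of the
// estimated frequencies of the write-state transitions its writes would
// cause. The weight matrix below is purely illustrative.
constexpr int kStates = 4;   // MLC cell resistance states

// Assumed relative cost of writing state j over state i (soft transitions
// cheap, hard transitions expensive).
constexpr double kTransitionCost[kStates][kStates] = {
    {0.0, 1.0, 2.0, 3.0},
    {1.0, 0.0, 1.5, 2.5},
    {2.0, 1.5, 0.0, 1.0},
    {3.0, 2.5, 1.0, 0.0},
};

struct VariableProfile {
    // freq[i][j]: estimated number of i -> j transitions caused by this
    // variable's register writes (from profiling or static estimates).
    std::array<std::array<double, kStates>, kStates> freq{};
};

double writeTransitionCost(const VariableProfile &p) {
    double cost = 0.0;
    for (int i = 0; i < kStates; ++i)
        for (int j = 0; j < kStates; ++j)
            cost += kTransitionCost[i][j] * p.freq[i][j];
    // The allocator keeps low-cost variables in the MLC STT-RAM registers
    // and prefers to spill high-cost variables.
    return cost;
}
```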

15.
High-performance Web sites rely on Web server "farms" — hundreds of computers serving the same content — for scalability, reliability, and low-latency access to Internet content. Deploying these scalable farms typically requires the power of distributed or clustered file systems. Building Web server farms on file systems complements hierarchical proxy caching: proxy caching replicates Web content throughout the Internet, reducing latency from network delays and off-loading traffic from the primary servers, while Web server farms scale resources at a single site, reducing latency from queuing delays. Both technologies are essential when building a high-performance infrastructure for content delivery. The authors present a cache-consistency model and locking protocol customized for file systems that are used as scalable infrastructure for Web server farms. The protocol takes advantage of the Web's relaxed consistency semantics to reduce latency and network overhead. Our hybrid approach preserves strong consistency for concurrent write sharing, with time-based consistency and push caching for readers (Web servers). Using simulation, we compare our approach to the Andrew file system and to the sequential-consistency file-system protocols that we propose to replace.

16.

Low-power cache memory in a system on chip is in high demand today. As MOSFET channel lengths shrink, low-power SRAM design has become a more challenging task. This paper presents a differential 8T SRAM cell with minimal power consumption. The proposed cell uses a pair of transmission gates (TGs) as access switches; because TGs are used instead of pass-gate access transistors, its write access time (TWA) is short, and its low power consumption is due to the stacking effect. This paper compares the design metrics of the proposed cell with conventional 6T (CON6T) and ZIGZAG 8T (ZG8T) SRAM cells. The proposed 8T SRAM cell shows a 1.15×/1.17× improvement in TWA compared to CON6T/ZG8T, at a penalty of 2.65×/2× in read access time (TRA). The proposed cell consumes 3.22× less hold power than both CON6T and ZG8T, and 4.41× (4.44×) less write power than CON6T (ZG8T). It occupies 1.37× less chip area than ZG8T, at the expense of 1.49× more area than CON6T. The proposed cell also achieves 1.5×/3× higher stability during write operation than CON6T/ZG8T, respectively. Its read static noise margin is the same as CON6T but 3.2× lower than ZG8T.


17.
Recent studies have shown that embedded DRAM (eDRAM) is a promising approach for 3D stacked last-level caches (LLCs), rather than SRAM, due to its advantages over SRAM: (i) eDRAM occupies less area than SRAM because of its smaller bit-cell size; and (ii) eDRAM has much lower leakage power and access energy than SRAM, since it uses far fewer transistors. However, unlike SRAM cells, eDRAM cells must be refreshed periodically in order to retain data. Since refresh operations consume a noticeable amount of energy, it is important to adopt an appropriate refresh interval, which depends strongly on temperature. The conventional refresh method assumes the worst-case temperature for all eDRAM stacked cache banks, resulting in unnecessarily frequent refresh operations. In this paper, we propose a novel temperature-aware refresh scheme for 3D stacked eDRAM caches. Our scheme dynamically changes the refresh interval depending on the temperature of the eDRAM stacked last-level cache (LLC). Compared to the conventional refresh method, our scheme reduces the number of refresh operations of the eDRAM stacked LLC by 28.5% on average (on a 32 MB eDRAM LLC) with small area overhead, and consequently reduces overall eDRAM LLC energy consumption by 12.5% on average (on a 32 MB eDRAM LLC).
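
A hedged sketch of per-bank temperature-aware refresh-interval selection is shown below; the retention-versus-temperature table and interval values are illustrative assumptions, not the paper's characterization data.

```cpp
#include <cstdint>

// Hedged sketch: each eDRAM bank is refreshed at the interval allowed by its
// own measured temperature rather than the chip-wide worst case. eDRAM
// retention time shrinks rapidly as temperature rises, so hotter banks must
// be refreshed more often (table values are placeholders).
struct RetentionPoint { int maxTempC; uint32_t refreshIntervalUs; };

constexpr RetentionPoint kTable[] = {
    { 40, 400 },    // <= 40 C : refresh every 400 us
    { 60, 200 },
    { 80, 100 },
    {100,  50 },
    {127,  25 },    // worst case, equivalent to a conventional fixed policy
};

uint32_t refreshInterval(int bankTempC) {
    for (const auto &p : kTable)
        if (bankTempC <= p.maxTempC) return p.refreshIntervalUs;
    return kTable[sizeof(kTable) / sizeof(kTable[0]) - 1].refreshIntervalUs;
}

// Per-bank refresh scheduling based on the bank's current temperature.
uint32_t nextRefreshTimeUs(uint32_t nowUs, int bankTempC) {
    return nowUs + refreshInterval(bankTempC);
}
```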

18.
As process technology continues to advance, multi-core processors integrate more and more cores and on-chip cache, so non-uniform cache architecture (NUCA) is used to cope with the growing wire delay in the cache systems of chip multiprocessors. An efficient cache-block migration policy is critical to the whole cache system. The block-migration policy in current dynamic NUCA (D-NUCA) does not consider the access history of cache blocks, causing blocks to ping-pong between banks and increasing block access latency. A reuse-aware block migration (RABM) policy is therefore proposed, which uses the migration history of cache blocks to predict future migrations, improving D-NUCA performance and reducing the power of the whole cache system. Full-system simulation with the PARSEC benchmarks shows that, compared with D-NUCA, RABM-based D-NUCA improves instructions per cycle by 9.6% on average and reduces on-chip cache power by 14%.
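
A minimal sketch of a history-based migration filter in the spirit of RABM follows, assuming a per-block 2-bit confidence counter that must saturate before a remote block is promoted; the actual RABM policy, storage, and thresholds may differ.

```cpp
#include <cstdint>
#include <unordered_map>

// Hedged sketch of a reuse-aware migration filter for D-NUCA: a block is
// promoted toward the requesting core only after its recent history suggests
// the migration will pay off, damping ping-pong between banks.
class MigrationFilter {
    std::unordered_map<uint64_t, uint8_t> history;   // block -> confidence
public:
    // Called on every hit to a block residing in a remote (slower) bank.
    // Returns true when the block should actually be migrated closer.
    bool shouldMigrate(uint64_t blockAddr) {
        uint8_t &c = history[blockAddr];
        if (c < 3) { ++c; return false; }   // not yet confident: stay put
        return true;                        // repeated remote hits: migrate
    }
    // Called when a previously migrated block is pulled away again by another
    // core before being reused: reset confidence to suppress future ping-pong.
    void penalizeEarlyDemotion(uint64_t blockAddr) {
        auto it = history.find(blockAddr);
        if (it != history.end()) it->second = 0;
    }
};
```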

19.
Research on a low-power dynamically reconfigurable cache algorithm
The dynamically reconfigurable cache algorithm monitors changes in program phases using instruction counts and decides when to adjust the cache capacity. Within a program phase, a state machine pre-judges cache accesses according to the average access time, and the result of this prediction determines the cache configuration for the current phase. Experimental results show that this algorithm reduces power by 61% compared with a conventional 4-way set-associative cache, with only about 2% performance loss; compared with existing algorithms, both power and performance are further improved.
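
The phase-driven resizing idea might be sketched as below, assuming the cache is resized by enabling or disabling ways at phase boundaries based on the change in average access time; the tolerance threshold and way granularity are illustrative assumptions, not the paper's state machine.

```cpp
// Hedged sketch of phase-based cache reconfiguration: at the end of each
// instruction-count window, the observed average access time decides whether
// the cache keeps its current number of enabled ways, grows, or shrinks.
struct ReconfigurableCache {
    int enabledWays = 4;   // out of a 4-way set-associative cache
    static constexpr int kMinWays = 1;
    static constexpr int kMaxWays = 4;
    double lastAvgAccessTime = 0.0;

    // Called once per program phase (e.g., every N committed instructions).
    void endOfPhase(double avgAccessTime) {
        const double kDegradeTolerance = 1.02;   // assumed: allow ~2% slowdown
        if (lastAvgAccessTime > 0.0 &&
            avgAccessTime > lastAvgAccessTime * kDegradeTolerance) {
            // Shrinking hurt performance: power a way back on.
            if (enabledWays < kMaxWays) ++enabledWays;
        } else {
            // Performance is stable: try powering one more way off.
            if (enabledWays > kMinWays) --enabledWays;
        }
        lastAvgAccessTime = avgAccessTime;
    }
};
```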

20.
In this paper, we evaluate the compressibility of the L1 data caches and the L2 cache in general-purpose graphics processing units (GPGPUs). Our proposed scheme is geared toward improving the performance and power of GPGPUs through cache compression. GPGPUs are throughput-oriented devices that execute thousands of threads simultaneously. To handle the working set of this massive number of threads, modern GPGPUs employ several levels of caches, and the design trend shows that cache sizes continue to grow to support even more thread-level parallelism. We propose using cache compression to increase effective cache capacity, improve performance, and reduce power consumption in GPGPUs. Our work is motivated by the observation that the values within a cache block are similar, i.e., the arithmetic difference between two successive values within a cache block is small. To reduce data redundancy in the L1 data caches and the L2 cache, we use the low-cost, implementation-efficient base-delta-immediate (BDI) algorithm. BDI replaces a cache block with a base and an array of deltas, where the combined size of the base and deltas is less than the original cache block. We also study the locality of fields in integer and floating-point numbers and find that the entropy of the fields varies across data types; based on this entropy, we offer different BDI compression schemes for integer and floating-point numbers. We augment a simple yet effective predictor that determines the type of values dynamically in hardware, without the help of a compiler or programmer. Evaluation results show that, on average, cache compression improves performance by 8% and reduces cache energy by 9%.
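
A simplified slice of BDI compression is sketched below: only the base-4, delta-1 configuration is shown, whereas real BDI evaluates several base/delta sizes and a zero (immediate) base in parallel and picks the smallest encoding.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hedged sketch of one base-delta compression pass: the block is viewed as
// 4-byte words; if every word fits in a 1-byte signed delta from the first
// word (the base), the block is stored as base + deltas.
struct CompressedBlock {
    uint32_t base;
    std::vector<int8_t> deltas;   // one per word
};

std::optional<CompressedBlock> compressBase4Delta1(const uint32_t *words, size_t n) {
    CompressedBlock out{words[0], {}};
    out.deltas.reserve(n);
    for (size_t i = 0; i < n; ++i) {
        int64_t d = int64_t(words[i]) - int64_t(words[0]);
        if (d < INT8_MIN || d > INT8_MAX)
            return std::nullopt;              // does not fit: stay uncompressed
        out.deltas.push_back(int8_t(d));
    }
    return out;                               // 4 + n bytes instead of 4 * n
}

void decompress(const CompressedBlock &c, uint32_t *words) {
    for (size_t i = 0; i < c.deltas.size(); ++i)
        words[i] = uint32_t(int64_t(c.base) + c.deltas[i]);
}
```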
