
1.

Write reconstruction for write throughput improvement on MLC PCM based main memory

《Journal of Systems Architecture》2016

The emerging Phase Change Memory (PCM) is considered one of the most promising candidates to replace DRAM as main memory due to its better scalability and non-volatility. With multi-bit storage capability, Multiple-Level-Cell (MLC) PCM outperforms Single-Level-Cell (SLC) PCM in density. However, high write latency has been a performance bottleneck for MLC PCM for two reasons: first, MLC PCM has a much longer programming time; second, the write latencies of different cell state transitions vary significantly. When cells are written concurrently in burst mode, the write latency of a burst is determined by its worst-case state transition. To improve the write throughput of MLC PCM based main memory, this paper proposes a Write Reconstruction (WR) scheme. WR reconstructs multiple burst writes targeting the same memory row so that the worst-case cells are grouped together in some of the writes. With this approach, the write latency of the other writes is reduced. WR incurs low implementation overhead and shows significant efficiency. Experimental results show that WR reduces write latency by 18.1% on average, with negligible power overhead.
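The effect described in this abstract can be illustrated with a toy latency model. The sketch below is not the paper's WR scheme: it assumes that a burst takes as long as its slowest cell, that all bursts have the same size, and that cells may be regrouped freely. The latency values and the function names are hypothetical.

```python
def burst_latency(cells):
    """A burst finishes only when its slowest cell has been programmed."""
    return max(cells)


def total_latency(bursts):
    """Bursts to the same row are issued one after another."""
    return sum(burst_latency(burst) for burst in bursts)


def regroup(bursts):
    """Toy reconstruction: pack the slowest cells into as few bursts as possible.

    Assumes every burst has the same number of cells.
    """
    size = len(bursts[0])
    cells = sorted((c for burst in bursts for c in burst), reverse=True)
    return [cells[i:i + size] for i in range(0, len(cells), size)]


# Hypothetical per-cell programming latencies in arbitrary units:
# each original burst contains exactly one slow transition.
bursts = [[1, 1, 8, 1], [1, 8, 1, 1], [1, 1, 1, 8], [8, 1, 1, 1]]
before = total_latency(bursts)           # every burst waits for a slow cell
after = total_latency(regroup(bursts))   # only one burst waits for slow cells
```

In this model the slow cells end up in a single burst, so the remaining bursts complete at the speed of the fast transitions. A real memory controller has placement constraints that the model ignores.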

2.

Hyper switching memory utilization on hybrid main memory for improved task execution and reduced power consumption

《Microprocessors and Microsystems》2020

The problem of maximizing PCM lifetime has been well studied. The arrival of non-volatile memory devices has replaced the traditional DRAM. Still, DRAM has many limitations regarding endurance and high-power write operations. Similarly, a number of designs have been discussed to maximize the lifetime of PCM by caching main memory in the available DRAM. Still, they could not achieve the desired reduction in power consumption or increase in memory utilization. To reduce power consumption and maximize lifetime, a categorical model is presented in this paper. The proposed method categorizes processes according to their memory access activity. Each categorized process is allocated to the corresponding part of the hybrid memory, which encourages maximum reads and minimum writes in PCM. The proposed method increases the lifetime of PCM compared with other methods.

3.

Heterogeneous memory page migration mechanism based on a DRAM victim cache

裴颂文钱艺幻叶笑春刘海坤孔令和《计算机研究与发展》2022,59(3):568-581

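When massive numbers of data requests access a heterogeneous memory system, heterogeneous memory pages migrate back and forth frequently between dynamic random access memory (DRAM) and non-volatile memory (NVM). However, migration policies designed for traditional memory pages struggle to adapt to rapid, dynamic changes in how "hot" or "cold" a memory page is, which causes pages migrated from DRAM to N...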

4.

A wear-leveling strategy for a novel hybrid memory architecture based on Bloom filters

张震付印金胡谷雨《计算机应用》2018,38(8):2230-2235

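Phase change memory (PCM) is expected to become the next generation of main memory thanks to its low power consumption, but its limited endurance is a major obstacle to wide adoption. Existing dynamic random access memory (DRAM) caching and wear-leveling techniques extend PCM lifetime from two angles: the former reduces the number of PCM writes, and the latter evens out the distribution of write operations. However, the former does not consider the read/write tendency of data when writing it back, and the latter suffers from problems with data-swap granularity, space overhead, and randomness in scenarios with strong spatial locality. Therefore, a new hybrid memory architecture is designed. It combines the Least Recently Used (LRU) algorithm with the Least Frequently Used with aging (LFU-Aging) algorithm in a caching policy that distinguishes the read/write tendency of data, and it uses a Bloom filter (BF) in a dynamic wear-leveling algorithm aimed at working sets with strong spatial locality, which effectively reduces redundant writes while achieving inter-group wear leveling with low space overhead. Experimental results show that the strategy reduces write operations on PCM by 13.4%~38.6% and effectively evens out the write distribution of more than 90% of the groups.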
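The abstract does not say how the Bloom filter is used inside the wear-leveling algorithm, so the following is only a minimal sketch of the data structure itself. The class name, the bit-array size, the number of hash functions, and the idea of recording written group IDs are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Compact set membership with no false negatives and some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Hypothetical use: remember which memory groups were written recently.
recently_written = BloomFilter()
for group_id in range(0, 100, 2):
    recently_written.add(group_id)
```

A filter like this answers "possibly written" or "definitely not written" in a fixed amount of space, which fits the low space overhead the abstract emphasizes.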

5.

Data parallel address architecture

Jung Ho Ahn Dally W.J. 《Computer Architecture Letters》2006,5(1):30-33

Data parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of increasing latency. At the same time, the throughput of modern DRAMs is very sensitive to access patterns due to the time required to precharge and activate banks and to switch between read and write access. To achieve memory reference parallelism, a system may simultaneously issue references from multiple reference threads. Alternatively, multiple references from a single thread can be issued in parallel. In this paper, we examine this tradeoff and show that allowing only a single thread to access DRAM at any given time significantly improves performance by increasing the locality of the reference stream and hence reducing precharge/activate operations and read/write turnaround. Simulations of scientific and multimedia applications show that generating multiple references from a single thread gives, on average, 17% better performance than generating references from two parallel threads.

6.

Data loss recovery for power failure in flash memory storage systems

《Journal of Systems Architecture》2015,61(1):12-27

Due to the rapid development of flash memory technology, NAND flash has been widely used as a storage device in portable embedded systems, personal computers, and enterprise systems. However, flash memory is prone to performance degradation due to the long latency of flash program and erase operations. One common technique for hiding long program latency is to use a temporary buffer to hold write data. Although DRAM is often used to implement the buffer because of its high performance and low bit cost, it is volatile; thus, data may be lost when the storage system loses power. As a solution to this issue, recent operating systems frequently issue flush commands to force storage devices to permanently move data from the buffer into the non-volatile area. However, the excessive use of flush commands may worsen the write performance of the storage system. In this paper, we propose two data loss recovery techniques that require fewer write operations to flash memory. These techniques remove unnecessary flash writes by storing storage metadata together with user data, utilizing the spare area associated with each data page.

7.

Energy efficient task allocation for hybrid main memory architecture

《Journal of Systems Architecture》2016

Compared with conventional dynamic random access memory (DRAM), emerging non-volatile memory (NVM) technologies provide better density and energy efficiency. However, current NVM devices typically suffer from high write power, long write latency, and low write endurance. In this paper, we study the task allocation problem for a hybrid main memory architecture with both DRAM and PRAM, in order to balance system performance against the energy consumption of the memory subsystem by assigning a memory device to each individual task. For an embedded system with a static set of periodic tasks, we design an integer linear programming (ILP) based offline adaptive space allocation (offline-ASA) algorithm to obtain the optimal task allocation. Furthermore, we propose an online adaptive space allocation (online-ASA) algorithm for dynamic task sets where task arrivals are not known in advance. Experimental results show that our proposed schemes achieve 27.01% energy saving on average, with an additional performance cost of 13.6%.

8.

A write optimization method exploiting the asymmetry of phase change memory

张格毅陈小刚郭继鹏宋志棠陈邦明《计算机工程与应用》2021,57(14):75-82

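Phase change memory (PCM) offers high integration density, low power consumption, and non-volatility, making it one of the most promising storage media for non-volatile memory. Reducing its write latency and extending its lifetime are pressing problems when PCM is used as non-volatile memory. To this end, a write method that handles erase and write operations independently, RSIW (Reset and Set Independently Write), is proposed, exploiting the asymmetry between PCM erase and write times. Unlike traditional write schemes, the method separates the write and erase operations and performs the slow write operations during idle periods, which significantly improves PCM write speed. RSIW can also be combined with a wear-leveling policy to balance the write frequency of individual blocks effectively. The method and its implementation details are described and compared with similar schemes that exploit the erase/write asymmetry of PCM. Finally, experiments with the gem5 simulator show that the method improves system running efficiency by 37.1%~69.1% compared with similar techniques.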

9.

Rethinking and optimizing index design for persistent memory

韩书楷熊子威蒋德钧熊劲《计算机研究与发展》2021,58(2):356-370

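Non-volatile memory (NVM) is a new type of storage medium that has emerged in recent years. On the one hand, like traditional volatile memory, it offers low access latency and byte addressability; on the other hand, unlike volatile memory, its stored data is not lost when power is removed, and it also offers higher density and lower energy overhead. These characteristics make non-volatile memory a promising candidate for large-scale use in future computer...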

10.

PMSS: A programmable memory system and scheduler for complex memory patterns

Tassadaq Hussain Amna Haider Eduard Ayguadé 《Journal of Parallel and Distributed Computing》2014

The HPC industry demands more computing units on FPGAs to enhance performance through task/data parallelism. FPGAs can deliver their best performance on certain kernels by customizing the hardware for the application. However, applications are getting more complex, with multiple kernels and complex data arrangements, which generates overhead when scheduling and managing system resources. For this reason, all classes of multi-threaded machines, from minicomputers to supercomputers, require an efficient hardware scheduler and memory manager that improve the effective bandwidth and latency of the DRAM main memory. Such an architecture could be a very competitive choice for supercomputing systems that must meet the parallelism demands of HPC benchmarks. In this article, we propose a Programmable Memory System and Scheduler (PMSS), which provides high-speed complex data access patterns to the multi-threaded architecture. The proposed PMSS system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor based system that has been integrated with the Xilkernel operating system. Results show that the modified PMSS based multi-accelerator system consumes 50% less hardware resources and 32% less on-chip power, and achieves approximately a 19x speedup compared to the MicroBlaze based system.

11.

A scheduling method for optimizing the read and write performance of solid-state drives

朱玥吴非熊钦谢长生《计算机科学》2017,44(6):51-56

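Compared with traditional mechanical hard disks, NAND Flash-based solid-state drives are widely used in data centers, cloud computing, online transaction processing, and similar scenarios thanks to their non-volatility, high performance, and low power consumption. However, because read operations in NAND Flash are far faster than write operations, a read request may be blocked by a write request when the two are executed concurrently, resulting in very high read latency. In many read-dominated settings, especially online transaction processing (where reads account for more than 90% of all requests), the sharp increase in read latency seriously degrades overall system performance. A scheduling strategy for optimizing read and write performance is proposed, which dynamically adjusts the priority order of read and write requests beneath the flash translation layer so that read performance improves significantly. In the experiments, a solid-state drive simulator was designed and implemented to evaluate the effectiveness of the scheduling strategy systematically. The results show that, under this strategy, both the maximum and the average read latency decrease significantly, by 72% and 41% respectively.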

12.

Pinpointing and scheduling access conflicts to improve internal resource utilization in solid-state drives

Xuchao XIE Liquan XIAO Dengping WEI Qiong LI Zhenlong SONG Xiongzi GE 《Frontiers of Computer Science》2019,13(1):35

Modern solid-state drives (SSDs) are integrating more internal resources to achieve higher capacity. Parallelizing accesses across internal resources can potentially enhance the performance of SSDs. However, exploiting parallelism inside SSDs is challenging owing to real-time access conflicts. In this paper, we propose a highly parallelizable I/O scheduler (PIOS) to improve internal resource utilization in SSDs from the perspective of I/O scheduling. Specifically, we first pinpoint the conflicting flash requests with precision during the address translation in the Flash Translation Layer (FTL). Then, we introduce conflict eliminated requests (CERs) to reorganize the I/O requests in the device-level queue by dispatching conflicting flash requests to different CERs. Owing to the significant performance discrepancy between flash read and write operations, PIOS employs differentiated scheduling schemes for read and write CER queues to always allocate internal resources to the conflicting CERs that are more valuable. The small dominant size prioritized scheduling policy for the write queue significantly decreases the average write latency. The high parallelism density prioritized scheduling policy for the read queue better utilizes resources by exploiting internal parallelism aggressively. Our evaluation results show that the parallelizable I/O scheduler (PIOS) can accomplish better SSD performance than existing I/O schedulers implemented in both SSD devices and operating systems.

13.

Prober: exploiting sequential characteristics in buffer for improving SSDs write performance

Wen ZHOU Dan FENG Yu HUA Jingning LIU Fangting HUANG Yu CHEN Shuangwu ZHANG 《Frontiers of Computer Science》2016,10(5):951-964

Solid state disks (SSDs) are becoming one of the mainstream storage devices due to their salient features, such as high read performance and low power consumption. In order to obtain high write performance and extend flash lifespan, SSDs leverage an internal DRAM to buffer frequently rewritten data and so reduce the number of program operations on the flash. However, existing buffer management algorithms fall short in leveraging data access features to predict data attributes. In various real-world workloads, most large sequential write requests are rarely rewritten in the near future. Once these write requests occur, much hot data is evicted from DRAM into flash memory, jeopardizing overall system performance. To address this problem, we propose a novel large write data identification scheme called Prober. This scheme probes large sequential write sequences among the write streams at an early stage to prevent them from residing in the buffer. Meanwhile, to further release space and reduce the waiting time for handling incoming requests, we temporarily buffer the large data in DRAM when the buffer has free space, and leverage an active write-back scheme for large sequential write data when the flash array becomes idle. Experimental results demonstrate that our schemes improve the hit ratio of write requests by up to 10%, decrease the average response time by up to 42%, and reduce the number of erase operations by up to 11%, compared with state-of-the-art buffer replacement algorithms.

14.

A spill data aware memory assignment technique for improving power consumption of multimedia memory systems

Youn Jonghee Cho Doosan 《Multimedia Tools and Applications》2019,78(5):5463-5478

As embedded memory technology evolves, the traditional Static Random Access Memory (SRAM) technology has reached the end of its development. As manufacturing process technology scales further, a next-generation memory technology is badly needed because of the exponentially increasing leakage current of SRAM. Non-volatile memories such as STT-MRAM (Spin Torque Transfer Magnetic Random Access Memory) and PCM (Phase Change Memory) are good candidates for replacing SRAM technology in embedded memory systems. They have many advantageous characteristics in terms of power consumption, leakage power, size (density), and latency. Nonetheless, nonvolatile memories have two major problems that hinder their use as next-generation memory. First, the lifetime of a nonvolatile memory cell is limited by the number of write operations. Second, a write operation incurs higher latency and power than a read operation of the same size. This study describes a compiler optimization technique that overcomes these disadvantages of the nonvolatile memory component in hybrid cache memories. Specifically, to minimize the number of write operations to nonvolatile memory, we present a data replacement technique that considers the locations of the register spill data. A large portion of memory accesses is produced by the spill data of the register allocator in an optimizing compiler. Such spill data can be partially removed using a recalculation method. Thus, we implemented an optimization technique that rearranges the data placement with recalculation to minimize the write instructions on the nonvolatile memory. Our experimental results show that the proposed technique can reduce the average number of spill codes by 20% and improves energy consumption by 20.2% on average.

15.

Memory access schedule minimization for embedded systems

Jingtong Hu Chun Jason Xue Wei-Che Tseng Qingfeng Zhuge Yingchao Zhao Edwin H.-M. Sha 《Journal of Systems Architecture》2012,58(1):48-59

The growing gap between microprocessor speed and DRAM speed is a major problem that computer designers are facing. In order to narrow the gap, it is necessary to improve DRAM’s speed and throughput. To achieve this goal, this paper proposes techniques that take advantage of the 3-stage access of contemporary DRAM chips by grouping accesses to the same row together and interleaving the execution of memory accesses from different banks. A family of Bubble Filling Scheduling (BFS) algorithms is proposed in this paper to minimize memory access schedule length and improve memory access time for embedded systems. When the memory access trace is known, as in some application-specific embedded systems, this information can be fully utilized to generate efficient memory access schedules. The offline BFS algorithm can generate schedules which are 47.49% shorter than in-order scheduling and 8.51% shorter than existing burst scheduling on average. When memory accesses are received by the single memory controller in real time, they have to be scheduled as they arrive. The online BFS algorithm in this paper serves this purpose and generates schedules which are 58.47% shorter than in-order scheduling and 4.73% shorter than burst scheduling on average. To improve memory throughput and further reduce the memory access schedule length, an architecture with dual memory controllers is proposed. According to the experimental results, the dual controller algorithm can generate schedules which are 62.89% shorter than in-order scheduling, 14.23% shorter than burst scheduling, and 10.07% shorter than single controller BFS algorithms on average.
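The benefit of grouping accesses to the same row can be seen in a small model that counts row activations. This is not the BFS algorithm from the paper; it is a sketch that assumes one open row per bank, represents each access as a hypothetical `(bank, row)` pair, and ignores data dependencies and timing between accesses.

```python
def row_activations(accesses):
    """Count how often a bank must open a row different from its current one."""
    open_row = {}
    activations = 0
    for bank, row in accesses:
        if open_row.get(bank) != row:
            activations += 1
            open_row[bank] = row
    return activations


def group_by_row(accesses):
    """Reorder so that accesses to the same (bank, row) become adjacent.

    Groups keep the order in which they first appear in the trace.
    """
    first_seen = {}
    for index, access in enumerate(accesses):
        first_seen.setdefault(access, index)
    return sorted(accesses, key=lambda access: first_seen[access])


# Hypothetical trace that alternates between two rows of bank 0.
trace = [(0, 1), (0, 2), (0, 1), (1, 5), (0, 2), (1, 5)]
in_order = row_activations(trace)
grouped = row_activations(group_by_row(trace))
```

Each avoided activation saves a precharge/activate pair in this model. A real scheduler must also preserve the ordering of reads and writes to the same address.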

16.

Robust performance in hybrid-memory cooperative caches

Luiz Ramos Ricardo Bianchini 《Parallel Computing》2014

Modern servers require large main memories, which so far have been enabled by increasing DRAM’s density. With DRAM’s scalability nearing its limit, Phase-Change Memory (PCM) is being considered as an alternative technology. PCM is denser, more scalable, and consumes less idle power than DRAM, while exhibiting byte-addressability and access times in the nanosecond range. Still, PCM is slower than DRAM and has limited endurance. These characteristics prompted the study of hybrid memory systems, combining a small amount of DRAM and a large amount of PCM. In this paper, we leverage hybrid memories to improve the performance of cooperative memory caches in server clusters. Our approach entails a novel policy that exploits popularity information in placing objects across servers and memory technologies. Our results show that (1) DRAM-only and PCM-only memory systems do not perform well in all cases; and (2) when managed properly, hybrid memories always exhibit the best or close-to-best performance, with significant gains in many cases, without increasing energy consumption.

17.

Energy optimization for multi-level cell non-volatile memory using state remapping

《Microprocessors and Microsystems》2017

Non-volatile Memory (NVM) is emerging as a promising technology to build future main memory or cache. Multi-level cell (MLC) NVM that stores multiple bits in a single cell has been developed in recent years. Each NVM technology has its own write scheme for storing multiple bits, and the amount of write energy varies across states. For MLC Phase-Change Memory (PCM), the energy consumption of writing the intermediate states, ‘01’ and ‘10’, is greater than that of writing states ‘00’ and ‘11’. For MLC Spin-Transfer Torque Magnetic RAM (STT-MRAM), the energy consumption of flipping the left bit of a 2-bit cell is greater than that of flipping the right bit. To reduce MLC NVM write energy consumption, we propose one encoding scheme that reduces the number of intermediate-state writes for MLC PCM and another that decreases the number of left-bit flips for MLC STT-MRAM. The main idea of both schemes is state remapping. For MLC PCM, we find the two states with the lowest write frequency and remap them to states ‘01’ and ‘10’, respectively. For MLC STT-MRAM, we seek the remapping decision that minimizes the number of left-bit flips and reduces the writes of states ‘01’ and ‘10’. The experimental results show that the encoding scheme for MLC PCM saves 5.25% energy on average and the encoding scheme for MLC STT-MRAM saves 12.17% energy on average.
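The remapping idea for MLC PCM can be sketched as follows. Only the rule stated in the abstract is used: the two least frequently written states are mapped to the costly intermediate states '01' and '10'. Everything else, including the function names, the tie-breaking order, and the sample data, is an assumption made for illustration.

```python
from collections import Counter

ALL_STATES = ("00", "01", "10", "11")
EXPENSIVE = ("01", "10")  # intermediate states: costlier to write per the abstract
CHEAP = ("00", "11")


def build_remap(states):
    """Map the two rarest states to the expensive codes, the rest to cheap ones."""
    counts = Counter(states)
    # Least frequent first; ties are broken by state name (an assumption).
    order = sorted(ALL_STATES, key=lambda s: (counts[s], s))
    mapping = dict(zip(order[:2], EXPENSIVE))
    mapping.update(zip(order[2:], CHEAP))
    return mapping


def encode(states, mapping):
    return [mapping[s] for s in states]


def decode(states, mapping):
    inverse = {code: state for state, code in mapping.items()}
    return [inverse[s] for s in states]


def expensive_writes(states):
    return sum(s in EXPENSIVE for s in states)


# Hypothetical write stream dominated by the intermediate states.
data = ["01"] * 5 + ["10"] * 4 + ["00"] * 1 + ["11"] * 2
mapping = build_remap(data)
encoded = encode(data, mapping)
```

Because the mapping is a bijection over the four states, the original data is recovered by applying the inverse mapping on reads. The mapping itself would have to be stored somewhere, which this sketch does not address.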

18.

A hybrid main memory simulator based on DRAM and PCM

张德志万寿红岳丽华《计算机系统应用》2017,26(9):16-23

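Phase change memory (PCM) has become a hot topic in main memory research thanks to its non-volatility, high read speed, and low static power consumption. However, PCM devices are not yet readily available, so PCM-based algorithms cannot be validated effectively. This paper therefore proposes using a main memory simulator to simulate and validate PCM algorithms. It first reviews the characteristics of existing main memory simulators and points out that they do not fully meet the practical needs of current main memory research; on this basis, it proposes and builds a hybrid main memory simulator based on DRAM and PCM. Experimental comparison with existing simulators shows that the proposed simulator effectively simulates hybrid DRAM and PCM memory architectures, supports different forms of hybrid main memory systems, and is highly configurable. Finally, a usage example illustrates the ease of use of the simulator's programming interface.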

19.

A hybrid memory architecture supporting fine-grained data migration

Ye CHI Jianhui YUE Xiaofei LIAO Haikun LIU Hai JIN 《Frontiers of Computer Science》2024,18(2):182103

Hybrid memory systems composed of dynamic random access memory (DRAM) and non-volatile memory (NVM) often exploit page migration technologies to take full advantage of the different memory media. Most previous proposals migrate data at a granularity of 4 KB pages, and thus waste memory bandwidth and DRAM resources. In this paper, we propose Mocha, a non-hierarchical architecture that organizes DRAM and NVM in a flat address space physically, but manages them in a cache/memory hierarchy. Since the commercial NVM device, the Intel Optane DC Persistent Memory Module (DCPMM), actually accesses the physical media at a granularity of 256 bytes (an Optane block), we manage the DRAM cache at the 256-byte size to adapt to this feature of Optane. This design not only enables fine-grained data migration and management for the DRAM cache, but also avoids write amplification for Intel Optane DCPMM. We also create an Indirect Address Cache (IAC) in the Hybrid Memory Controller (HMC) and propose a reverse address mapping table in the DRAM to speed up address translation and cache replacement. Moreover, we exploit a utility-based caching mechanism to filter cold blocks in the NVM and further improve the efficiency of the DRAM cache. We implement Mocha in an architectural simulator. Experimental results show that Mocha can improve application performance by 8.2% on average (up to 24.6%), and reduce energy consumption by 6.9% and data migration traffic by 25.9% on average, compared with a typical hybrid memory architecture, HSCC.

20.

A space allocation and reuse strategy for PCM-based embedded systems

《Journal of Systems Architecture》2014,60(8):655-667

Phase change memory (PCM) has emerged as a promising candidate to replace DRAM in embedded systems, due to its appealing properties, such as zero leakage power, scalability, shock-resistivity and high density. However, it can only sustain a limited number of write operations. Moreover, a program in an embedded system usually distributes write traffic in an extremely unbalanced way, which can further decrease PCM lifetime. In this paper, we propose a space-based wear leveling technique at the software compiler level that exploits program-specific features. The basic idea is to extend frequently written variables into specific-sized arrays and distribute writes evenly across the allocated array. In this way, we can effectively distribute the write traffic of the program across the whole PCM chip. A space allocation and reuse (SAR) strategy and a polynomial-time algorithm are proposed to produce optimal and near-optimal space allocation, respectively, for achieving a balanced write distribution. The experimental results show that our technique can greatly extend the lifetime of PCM-based embedded systems compared with previous work, and achieves approximately 94% of the theoretical maximum lifetime. Compared with a baseline scheme without a wear-leveling mechanism, our technique introduces no more than 0.8% extra writes and 0.7% running overhead.
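The basic idea of extending a frequently written variable into an array can be sketched at runtime as shown below. The paper applies the idea at the compiler level and chooses array sizes with its SAR strategy; neither is reproduced here. The class name, the round-robin rotation, and the slot count are assumptions made for illustration.

```python
class SpreadVariable:
    """A logical variable backed by several slots; writes rotate across them."""

    def __init__(self, slots):
        self.values = [None] * slots
        self.write_counts = [0] * slots  # per-slot wear, tracked for illustration
        self.current = -1                # index of the slot holding the live value

    def write(self, value):
        # Advance to the next slot so no single cell absorbs every write.
        self.current = (self.current + 1) % len(self.values)
        self.values[self.current] = value
        self.write_counts[self.current] += 1

    def read(self):
        return self.values[self.current]


# Hypothetical hot loop counter spread over four slots.
counter = SpreadVariable(slots=4)
for i in range(100):
    counter.write(i)
```

With four slots, each cell receives a quarter of the writes that a single cell would otherwise absorb. The cost is the extra space and the index update, in the same spirit as the small overheads reported in the abstract.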