Similar Documents
20 similar documents found.
1.
In large-scale, data-intensive application scenarios, the drawbacks of accessing data in row-storage order are increasingly apparent: it can no longer meet the performance demands of high-speed data access, and more efficient means of transferring and processing data are urgently needed. Extending memory with new access modes that are compatible with both row- and column-direction access is therefore significant for improving access efficiency, reducing overall power consumption, and saving memory space. This paper details how column-direction memory access is realized in two settings, dynamic random access memory and non-volatile memory, focusing on the structural design of the storage cells and the process of performing column-direction access. Finally, the two memory access modes are compared and summarized, and application scenarios for row/column access are surveyed: in-memory databases, data mining, data-encryption algorithms, and real-time systems.
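To make the row/column contrast concrete, here is a minimal C sketch of our own (not from the paper): the same row-major matrix is traversed in both directions, and on conventional row-oriented memory the column-direction loop runs far slower, precisely the asymmetry that column-accessible memories aim to remove.

```c
#include <stdio.h>
#include <stdlib.h>

#define ROWS 4096
#define COLS 4096

int main(void) {
    /* One matrix, two traversal directions: row-direction access walks
     * consecutive addresses and reuses whole cache lines; column-direction
     * access strides by COLS elements and touches a new cache line on
     * nearly every reference, the slow case on row-oriented memory. */
    long long sum = 0;
    int *m = calloc((size_t)ROWS * COLS, sizeof *m);
    if (!m) return 1;

    for (int r = 0; r < ROWS; r++)        /* row direction: fast    */
        for (int c = 0; c < COLS; c++)
            sum += m[(size_t)r * COLS + c];

    for (int c = 0; c < COLS; c++)        /* column direction: slow */
        for (int r = 0; r < ROWS; r++)
            sum += m[(size_t)r * COLS + c];

    printf("%lld\n", sum);
    free(m);
    return 0;
}
```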

2.
On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small capacities. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called the one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has previously considered using such complement numbers in cache memory designs. This paper shows that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By computing cache addresses and memory addresses in parallel, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in the cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to the same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost.
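The even-distribution idea can be sketched in software. In the illustration below (hypothetical parameters, not the paper's circuit), cache lines referenced with a stride of 256 all collide in one set when the set count is a power of two, but spread across distinct sets when the set count is the prime 257; the actual design additionally computes the index in parallel with the memory address so that hit time is unaffected.

```c
#include <stdio.h>

/* Set-index mapping with a power-of-two vs. a prime number of sets. */
#define SETS_POW2  256   /* conventional: index = line & (SETS - 1)    */
#define SETS_PRIME 257   /* odd/prime set count spreads strided data   */

int main(void) {
    /* A loop touching lines with stride 256 maps every reference to
     * set 0 under the power-of-two cache, but to distinct sets under
     * the prime-modulus cache. */
    for (unsigned line = 0; line < 8 * 256; line += 256)
        printf("line %5u -> pow2 set %3u, prime set %3u\n",
               line, line % SETS_POW2, line % SETS_PRIME);
    return 0;
}
```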

3.
Low-power memory technology based on program memory-access patterns
To keep pace with ever-increasing compute capability, the memory systems of mobile handheld devices have grown larger and structurally more complex. As a result, the memory system, mainly the on-chip caches and main memory, accounts for a rising share of total system energy. Since today's handheld devices are mostly battery-powered and battery capacity is very limited, low-power design of the memory system is especially important. Although existing memory devices provide some hardware support for energy saving, that potential can only be fully realized in combination with the regularities of an application's memory-access behavior. This paper reviews and summarizes existing low-power memory techniques, introduces the concept of a program's memory-access pattern, distills three aspects of what such a pattern comprises, and then details how memory-access patterns are applied in low-power techniques for on-chip caches and main memory. Finally, it looks ahead to possible directions for low-power memory-system research that exploits access patterns.

4.
Using symmetric multiprocessors (SMPs) as nodes can bring embedded clusters a higher compute price/performance ratio, but multiple levels of parallelism and memory also introduce problems of memory consistency, scalability, and performance variation. This paper proposes LESC, an embedded cluster model based on shared memory. Through a high degree of integration, the model realizes a highly scalable three-level structure of "compute unit, interconnect-coherence module, system", achieving power/cost effectiveness. LESC implements the basic functions of distributed shared memory; its directory cache coherence and extended shared-memory mechanisms improve the traditional memory hierarchy, and a "shared-memory virtual network" provides efficient module-level communication, avoiding network hardware overhead while supporting MPI programming. Tests on a real system platform built on this model show that intra-module MPI communication performance is more than 3 times that of traditional embedded clusters, inter-unit communication reaches more than 86% of intra-unit performance, and Linpack scaling in the worst case approaches 70% of ideal.

5.
Sparse matrix-vector multiplication (SpMV) is at the core of many scientific computations, and the large volume of indirect and random memory accesses in the computation is its main bottleneck. By analyzing the data structures and computational process of SpMV, this paper derives the memory-access characteristics of the different data involved and proposes a cache partitioning method oriented to those characteristics. Tests on 12 SpMV workloads show that the proposed cache partitioning method effectively raises the cache hit rate of the reusable vector while reducing the computation's demand for cache space.
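For reference, a textbook CSR SpMV kernel (a generic sketch, not the paper's code) makes the access classes explicit; the partitioning idea follows directly from which arrays are streamed and which are reused.

```c
/* CSR SpMV with the access characteristics annotated: val, col_idx and
 * row_ptr are streamed once with no reuse, y is written sequentially,
 * and only the source vector x is reused -- through the indirect,
 * random index col_idx[j] -- so x is the data worth keeping resident. */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {            /* one pass over the rows  */
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];   /* irregular, reusable x   */
        y[i] = sum;                          /* streaming, write-only y */
    }
}
```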

6.
This work takes phase-change memory (PCRAM) as the main memory of the Imagine stream processor and optimizes its performance. A PCRAM performance analysis model is built. To address PCRAM's limited write endurance, redundant bit-writes are avoided, extending PCRAM lifetime by a factor of 3.4. PCRAM's non-volatility is exploited to avoid unnecessary cache-line write-backs. The impact of memory-access scheduling algorithms on PCRAM performance is analyzed; the results show that the row/open scheduling algorithm performs better and is well suited to PCRAM.
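The redundant-write-avoidance idea can be sketched as a data-comparison write: read the old contents and write back only what changed. The sketch below is a word-granularity software analogue over a simulated line, not the paper's bit-level hardware scheme.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_WORDS 8

/* Simulated PCRAM line: on a line write, only words whose contents
 * actually change cost a device write. Counting skipped writes shows
 * how avoiding redundant writes extends the limited write endurance. */
static uint64_t pcram_line[LINE_WORDS];   /* stands in for the device */
static long     writes_issued, writes_skipped;

void pcram_write_line(const uint64_t *new_data)
{
    for (int i = 0; i < LINE_WORDS; i++) {
        if (pcram_line[i] == new_data[i]) { writes_skipped++; continue; }
        pcram_line[i] = new_data[i];      /* only changed words are written */
        writes_issued++;
    }
}

int main(void) {
    uint64_t buf[LINE_WORDS] = {1, 2, 3, 4, 5, 6, 7, 8};
    pcram_write_line(buf);                /* 8 real writes            */
    buf[3] = 42;
    pcram_write_line(buf);                /* 1 real write, 7 skipped  */
    printf("issued=%ld skipped=%ld\n", writes_issued, writes_skipped);
    return 0;
}
```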

7.
The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them rely strongly on the predictability of future accesses and often fail when memory accesses do not contain much locality. To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limitations. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on a dedicated processor. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are carried out with the development of an architectural simulator and compilation tools. Our performance results show that the proposed HiDISC model reduces cache misses by 19.7% and improves the overall IPC (instructions per cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles of memory access latency, HiDISC improves performance by 17.2%.

8.
The MEC (memory-efficient convolution) algorithm suffers from low cache hit rates and long memory-access latency on conventional devices because the data addresses it accesses are not contiguous. This paper proposes an optimization method suited to MEC's memory-access behavior, in two parts: intermediate-matrix construction and matrix computation. For the intermediate-matrix construction, the data read order is modified so that reads match the algorithm's access behavior. For the matrix computation, the kernel matrix is rearranged into a memory layout better suited to matrix multiplication, the computation between the intermediate matrix and the kernel matrix is redesigned using the compute functions provided by the TVM (tensor virtual machine) platform, and the platform's built-in parallel library is used to accelerate the computation. Experimental results show that, compared with the traditional MEC algorithm, the proposed optimization effectively addresses the low cache hit rate and long memory-access latency, achieving an average 50% speedup on single convolution layers and at least a 57% speedup in multi-layer neural networks over MEC, and up to an 80% speedup over the spatial-combination algorithm.

9.
In general, NAND flash memory has advantages in low power consumption, storage capacity, and fast erase/write performance in contrast to NOR flash. However, the main drawback of NAND flash memory is its slow access time for random read operations. We therefore propose a new NAND flash memory package that overcomes this drawback: a high-performance, low-power NAND flash memory system with a dual cache memory. The proposed NAND flash package consists of two parts, a NAND flash memory module and a dual cache module. The new NAND flash memory system achieves dramatically higher performance and lower power consumption than conventional NAND-type flash memory modules. Our results show that the proposed system can eliminate about 78% of write operations into the flash memory cells and about 70% of read operations from the flash memory cells using only 3KB of additional cache space. This represents high potential for low power consumption and high performance gains.
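The effect of a cache in front of the flash cells can be sketched in a few lines. The toy model below uses a single write-back cache and counts the cell operations it absorbs; the paper's design uses a dual (separate) cache structure and real device timing, so treat this purely as an illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE   512
#define CACHE_LINES 6     /* ~3KB cache budget, echoing the paper */

/* Toy write-back page cache over simulated NAND: hits never touch the
 * cells; a miss costs one cell read, and evicting a dirty line costs
 * one cell write. */
static struct { int tag; int dirty; uint8_t data[PAGE_SIZE]; }
    cache[CACHE_LINES];
static long cell_reads, cell_writes;

static uint8_t *lookup(int page, int for_write)
{
    int slot = page % CACHE_LINES;
    if (cache[slot].tag != page + 1) {           /* miss               */
        if (cache[slot].dirty) cell_writes++;    /* write back victim  */
        cell_reads++;                            /* fetch from cells   */
        cache[slot].tag = page + 1;              /* +1: 0 marks empty  */
        cache[slot].dirty = 0;
    }
    if (for_write) cache[slot].dirty = 1;
    return cache[slot].data;
}

int main(void) {
    for (int i = 0; i < 1000; i++)               /* hot working set    */
        memset(lookup(i % 8, 1), i, PAGE_SIZE);
    printf("cell reads=%ld writes=%ld\n", cell_reads, cell_writes);
    return 0;
}
```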

10.
Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes it harder to determine which data will be needed by each core, and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism to improve memory affinity in STM applications that fit the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Then it employs data-prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that the proposed mechanism can achieve performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.
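The cache-level half of the mechanism can be sketched with a plain helper thread that runs a fixed distance ahead of the worker and issues __builtin_prefetch (a GCC/Clang builtin). This is a minimal sketch under those assumptions; the actual framework also drives NUMA page-allocation policies, which is omitted here.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double work[N];
static volatile int cursor;        /* worker position; racy read is
                                      acceptable for this sketch */

/* Helper thread: busy-spins and pulls upcoming worklist items into
 * cache a fixed distance ahead of the worker. */
static void *prefetcher(void *arg)
{
    (void)arg;
    while (cursor < N) {
        int ahead = cursor + 64;   /* prefetch distance */
        if (ahead < N) __builtin_prefetch(&work[ahead], 0, 1);
    }
    return NULL;
}

int main(void) {                   /* compile with -pthread */
    pthread_t helper;
    pthread_create(&helper, NULL, prefetcher, NULL);
    double sum = 0.0;
    for (cursor = 0; cursor < N; cursor++)  /* the "worklist" loop */
        sum += work[cursor];
    pthread_join(helper, NULL);
    printf("%f\n", sum);
    return 0;
}
```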

11.
High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access costs. In order to keep the cores and memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories, rather than hardware-managed caches that require some form of cache coherence management. These "coherence-free" systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges to the design of hardware speculative support. In this article, we present a new scheme for hardware transactional memory (HTM) support within a cluster-based, many-core embedded system that lacks an underlying cache-coherence protocol. We propose two alternative data versioning implementations for the HTM support, Full-Mirroring and Distributed Logging, and we conduct a performance comparison between them. To the best of our knowledge, these are the first designs for speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our designs can achieve significant performance improvements over traditional lock-based schemes.
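One way to version data without coherence support is an undo log: each transactional write saves the old value, commit drops the log, and abort replays it backwards. The C sketch below is a software analogue in the spirit of Distributed Logging, with invented types and no concurrency control; the paper's designs do this in hardware across the cluster.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_CAP 256

struct undo_entry { uint64_t *addr; uint64_t old; };
struct tx_log     { struct undo_entry e[LOG_CAP]; int n; };

/* Transactional write: version the old data in the log, then write
 * in place. Overflow handling is omitted for brevity. */
void tx_write(struct tx_log *log, uint64_t *addr, uint64_t value)
{
    assert(log->n < LOG_CAP);
    log->e[log->n].addr = addr;
    log->e[log->n].old  = *addr;
    log->n++;
    *addr = value;
}

/* Abort: roll the log back in reverse order, restoring old values. */
void tx_abort(struct tx_log *log)
{
    while (log->n > 0) {
        log->n--;
        *log->e[log->n].addr = log->e[log->n].old;
    }
}

/* Commit: the new values are already in place; just drop the log. */
void tx_commit(struct tx_log *log) { log->n = 0; }
```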

12.
In this article, we analyze the file access characteristics of smartphone applications and find that a large portion of file data in smartphones is written only once. This phenomenon arises from the behavior of SQLite, a lightweight database library used in most smartphone applications. Based on this observation, we present a new buffer cache management scheme for smartphone systems that accounts for the non-reusability of the write-only-once data we observe. A buffer cache improves file access performance by maintaining hot data in memory, thereby servicing subsequent requests without storage accesses. The proposed scheme classifies write-only-once data and aggressively evicts them from the buffer cache to improve cache space utilization. Experimental results with various real smartphone applications show that the proposed buffer cache management scheme improves the performance of the smartphone buffer cache by 5%–33%. We also show that our scheme can reduce the buffer cache size to 1/4 of the original system without performance degradation, which allows the energy consumption of a smartphone memory system to be reduced by 27%–92%.
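A minimal sketch of the eviction idea, with invented block metadata: blocks seen to be written once and never read back are classified as non-reusable and chosen as victims ahead of everything else, with LRU as the tie-breaker.

```c
#include <stdio.h>

#define CACHE_CAP 64

/* Per-block metadata; layout is illustrative, not the paper's design. */
struct block { int id; int writes; int reads; long last_use; };
static struct block cache[CACHE_CAP];
static int used;

/* Victim selection: prefer write-only-once blocks (one write, never
 * read back), then fall back to least-recently-used. */
static int victim(void)
{
    int v = 0;
    for (int i = 1; i < used; i++) {
        int wi = cache[i].writes == 1 && cache[i].reads == 0;
        int wv = cache[v].writes == 1 && cache[v].reads == 0;
        if (wi != wv) { if (wi) v = i; continue; }
        if (cache[i].last_use < cache[v].last_use) v = i;  /* LRU */
    }
    return v;
}

int main(void) {
    cache[0] = (struct block){10, 3, 5, 100};  /* hot, reused     */
    cache[1] = (struct block){11, 1, 0, 300};  /* write-only-once */
    cache[2] = (struct block){12, 2, 1, 50};   /* old but reused  */
    used = 3;
    printf("evict block %d\n", cache[victim()].id);  /* -> 11 */
    return 0;
}
```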

13.
Processors in embedded systems mostly employ cache architectures in order to alleviate the access latency gap between processors and memory systems. Caches in embedded systems usually occupy a major fraction of the implemented chip area, so the power dissipation of the cache system constitutes a significant fraction of the power dissipated by the entire processor. In this paper, we propose a compressed tag architecture to reduce the power dissipation of the tag store in cache systems. We introduce a new tag-matching mechanism using a locality buffer and a tag compression technique. The main power-reduction feature of our proposal is the use of small compressed-tag matching instead of full tag matching, at modest additional hardware cost. The simulation results show that the proposed model provides power and energy-delay product reductions of up to 27.8% and 26.5%, respectively, while still providing a level of system performance comparable to regular cache systems.

14.
马明理, 陈刚, 董金祥. 《计算机测量与控制》, 2006, 14(11): 1551-1553, 1556
This paper describes the design and implementation of a new multithreaded memory allocation technique, NIXMalloc, proposing two efficient allocation strategies and their adaptive tuning methods, which effectively improve the memory-management performance of multithreaded applications. The Local strategy privatizes the superblock object (Span) per thread; garbage collection and memory-layout adjustment at superblock granularity give better multithreaded performance. The Global strategy uses adaptive tuning, dynamically adjusting memory prefetching and per-thread cache limits based on runtime monitoring of the application's memory usage. Experiments show that NIXMalloc improves memory-management performance and throughput while reducing memory usage, achieving good time and space efficiency in multithreaded application systems.
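The Local strategy's thread-privatized spans can be sketched as per-thread bump allocation, assuming GCC/Clang __thread thread-local storage; span-level garbage collection and the adaptive Global tuning are omitted.

```c
#include <stdlib.h>

#define SPAN_BYTES (64 * 1024)
#define OBJ_BYTES  64

/* Each thread carves small objects out of its own span, so the hot
 * allocation path takes no lock; only span refill touches the global
 * allocator. Hedged illustration only, not NIXMalloc's actual code. */
struct span { char *cursor; char *end; };
static __thread struct span local;   /* one span per thread (TLS) */

void *local_alloc(void)
{
    if (local.cursor == NULL || local.cursor + OBJ_BYTES > local.end) {
        local.cursor = malloc(SPAN_BYTES);    /* refill: global path  */
        if (local.cursor == NULL) return NULL;
        local.end = local.cursor + SPAN_BYTES;
    }
    void *obj = local.cursor;                 /* lock-free bump alloc */
    local.cursor += OBJ_BYTES;
    return obj;
}
```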

15.
Hybrid memory systems composed of dynamic random access memory (DRAM) and non-volatile memory (NVM) often exploit page migration technologies to take full advantage of the different memory media. Most previous proposals migrate data at the granularity of 4 KB pages, and thus waste memory bandwidth and DRAM resources. In this paper, we propose Mocha, a non-hierarchical architecture that organizes DRAM and NVM in a flat physical address space but manages them as a cache/memory hierarchy. Since the commercial NVM device, Intel Optane DC Persistent Memory Modules (DCPMM), actually accesses the physical media at a granularity of 256 bytes (an Optane block), we manage the DRAM cache at 256-byte granularity to match this feature of Optane. This design not only enables fine-grained data migration and management for the DRAM cache, but also avoids write amplification for Intel Optane DCPMM. We also place an Indirect Address Cache (IAC) in the Hybrid Memory Controller (HMC) and propose a reverse address mapping table in the DRAM to speed up address translation and cache replacement. Moreover, we exploit a utility-based caching mechanism to filter cold blocks in the NVM and further improve the efficiency of the DRAM cache. We implement Mocha in an architectural simulator. Experimental results show that Mocha improves application performance by 8.2% on average (up to 24.6%) and reduces energy consumption by 6.9% and data migration traffic by 25.9% on average, compared with a typical hybrid memory architecture, HSCC.
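A direct-mapped lookup at 256-byte granularity (illustrative tag layout, not Mocha's actual metadata) shows the basic management unit: one miss fills exactly one Optane block rather than the 16 blocks a 4 KB page migration would drag in.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK 256           /* matches the 256-byte Optane media block */
#define SETS  4096

/* One tag word per set; bit 0 is the valid bit, the rest is the tag. */
static uint64_t tags[SETS];

int dram_cache_lookup(uint64_t paddr)
{
    uint64_t block = paddr / BLOCK;
    uint64_t set   = block % SETS;
    uint64_t tag   = (block / SETS) << 1 | 1;
    if (tags[set] == tag) return 1;     /* hit in the DRAM cache       */
    tags[set] = tag;                    /* miss: fill one 256B block   */
    return 0;
}

int main(void) {
    long hits = 0;
    for (int pass = 0; pass < 2; pass++)
        for (uint64_t a = 0; a < 64 * 1024; a += 64)  /* 64B refs */
            hits += dram_cache_lookup(a);
    printf("hits=%ld\n", hits);         /* second pass hits entirely */
    return 0;
}
```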

16.
The order in which memory transactions are processed has a major impact on memory-access performance. Multiple pending transactions issued by the same SoC device tend to have contiguous addresses and the same read/write type. However, traditional bus arbitration interleaves the pending-transaction sequences of different devices on their way to the memory controller, and the controller's scheduling window is limited, so such sequences usually cannot access memory back to back. To solve this problem, a new bus arbitration method, CGH, is proposed. Exploiting the communication behavior of SoC devices, it identifies sequences of pending transactions from the same device that share a row address and read/write type and grants them arbitration consecutively, reducing the number of times the memory must switch row address or read/write type. Furthermore, when selecting the pending-transaction sequence to grant next, it prefers requests whose row address and read/write type match the most recently granted transaction, improving memory-access efficiency further. After applying the CGH arbitration method to the 北大众志-SKSoC, system memory performance improved by 21.37% while bus area increased by only 2.83%. In addition, with fewer row-address switches, memory energy consumption dropped by 15.15%.
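The grant policy can be sketched as follows (invented request fields, oldest-first fallback): the arbiter first tries to extend the current run by granting a pending request whose row address and read/write type match the most recently granted transaction.

```c
#include <stdio.h>

struct req { int row; int is_write; int valid; };

static int last_row = -1, last_wr = -1;

/* Grant one request: prefer a match with the last granted (row, type)
 * so same-row, same-type runs go out back to back; otherwise take the
 * oldest valid request. Returns the granted index, or -1 if idle. */
int arbitrate(struct req *pend, int n)
{
    int pick = -1;
    for (int i = 0; i < n; i++) {
        if (!pend[i].valid) continue;
        if (pend[i].row == last_row && pend[i].is_write == last_wr) {
            pick = i;                    /* extend the current run */
            break;
        }
        if (pick < 0) pick = i;          /* remember oldest valid  */
    }
    if (pick >= 0) {
        last_row = pend[pick].row;
        last_wr  = pend[pick].is_write;
        pend[pick].valid = 0;            /* grant and retire       */
    }
    return pick;
}

int main(void) {
    struct req q[4] = {{7, 0, 1}, {3, 1, 1}, {7, 0, 1}, {3, 1, 1}};
    for (int g; (g = arbitrate(q, 4)) >= 0; )  /* grants 0,2,1,3 */
        printf("grant req %d (row %d %s)\n", g, q[g].row,
               q[g].is_write ? "W" : "R");
    return 0;
}
```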

17.
The growing gap between microprocessor speed and DRAM speed is a major problem facing computer designers. To narrow the gap, it is necessary to improve DRAM's speed and throughput. To this end, this paper proposes techniques that take advantage of the 3-stage access characteristics of contemporary DRAM chips by grouping accesses to the same row together and interleaving the execution of memory accesses from different banks. A family of Bubble Filling Scheduling (BFS) algorithms is proposed to minimize memory access schedule length and improve memory access time for embedded systems. When the memory access trace is known, as in some application-specific embedded systems, this information can be fully utilized to generate efficient memory access schedules: the offline BFS algorithm generates schedules that are on average 47.49% shorter than in-order scheduling and 8.51% shorter than existing burst scheduling. When memory accesses are received by a single memory controller in real time, they have to be scheduled as they arrive; the online BFS algorithm serves this purpose and generates schedules that are on average 58.47% shorter than in-order scheduling and 4.73% shorter than burst scheduling. To improve memory throughput and further shorten the memory access schedule, an architecture with dual memory controllers is proposed. According to the experimental results, the dual-controller algorithm generates schedules that are on average 62.89% shorter than in-order scheduling, 14.23% shorter than burst scheduling, and 10.07% shorter than the single-controller BFS algorithms.
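A software approximation of the offline case (ours, not the paper's algorithm): sort the known trace so same-row accesses are adjacent, then issue whole row-groups round-robin across banks so one bank's activate/precharge can overlap another's data transfer.

```c
#include <stdio.h>
#include <stdlib.h>

#define NBANKS 2

struct acc { int bank; int row; };

static int by_bank_row(const void *a, const void *b)
{
    const struct acc *x = a, *y = b;
    return x->bank != y->bank ? x->bank - y->bank : x->row - y->row;
}

int main(void) {
    struct acc t[] = {{0,5},{1,9},{0,5},{1,9},{0,7},{1,9},{0,5}};
    int n = sizeof t / sizeof t[0];
    qsort(t, n, sizeof t[0], by_bank_row);   /* same-row refs adjacent */

    int next[NBANKS], end[NBANKS];           /* per-bank cursors */
    for (int b = 0; b < NBANKS; b++) { next[b] = -1; end[b] = -1; }
    for (int i = 0; i < n; i++) {
        if (next[t[i].bank] < 0) next[t[i].bank] = i;
        end[t[i].bank] = i + 1;
    }
    /* Emit whole row-groups round-robin across banks: no row is
     * reopened, and bank activates overlap each other's transfers. */
    for (int left = n, b = 0; left > 0; b = (b + 1) % NBANKS) {
        if (next[b] < 0 || next[b] >= end[b]) continue;
        int row = t[next[b]].row;
        while (next[b] < end[b] && t[next[b]].row == row) {
            printf("issue bank %d row %d\n", b, row);
            next[b]++; left--;
        }
    }
    return 0;
}
```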

18.
Hierarchical explicit memory-access mechanism based on the ESCA system
To address the memory-wall problem in high-performance hybrid computing systems, and based on an analysis of their computation patterns and the limitations of traditional memory-access mechanisms, this paper proposes a hierarchical explicit memory-access mechanism suited to hybrid computing systems, implemented and evaluated on the ESCA many-core processor system. Experimental results show that for the core application DGEMM, latency hiding covers 56% of total run time and yields a 1.5x speedup, compensating for the speed gap between computation and memory access and improving system computational efficiency.
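Explicit, software-managed transfers hide latency through double buffering: while tile k is consumed from one local buffer, tile k+1 is copied into the other. In the sketch below, memcpy stands in for ESCA's explicit transfer primitives, which are assumptions on our part.

```c
#include <stdio.h>
#include <string.h>

#define TILE   256
#define NTILES 64

static double far_mem[NTILES][TILE];     /* stands in for remote memory */
static double local_buf[2][TILE];        /* two on-chip buffers         */

int main(void) {
    double sum = 0.0;
    /* Prime the pipeline with the first tile. */
    memcpy(local_buf[0], far_mem[0], sizeof local_buf[0]);
    for (int k = 0; k < NTILES; k++) {
        int cur = k & 1;
        if (k + 1 < NTILES)              /* start the next transfer early;
                                            on real hardware this is an
                                            asynchronous explicit copy   */
            memcpy(local_buf[cur ^ 1], far_mem[k + 1], sizeof local_buf[0]);
        for (int i = 0; i < TILE; i++)   /* compute on the current tile  */
            sum += local_buf[cur][i];
    }
    printf("%f\n", sum);
    return 0;
}
```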

19.
Radiation-induced soft errors have become an emerging reliability threat to high-performance microprocessor design. As the size of on-chip cache memory has steadily increased over the past decades, techniques for resilience against soft errors in caches are becoming increasingly important for processor reliability. However, conventional soft error resilience techniques significantly increase the access latency and energy consumption of cache memory, resulting in undesirable degradation of performance and energy efficiency. Emerging 3D integration technology provides an attractive advantage: a 3D microarchitecture exhibits heterogeneous soft-error-resilience characteristics due to the shielding effect of die stacking. Moreover, the 3D shielding effect yields several inner dies that are inherently invulnerable to soft errors, as they are implicitly protected by the outer dies. To exploit this invulnerability benefit, we propose a soft-error-resilient 3D cache architecture in which data blocks on the soft-error-invulnerable dies carry no protection against soft errors; an access to a data block on an invulnerable die therefore incurs considerably reduced latency and energy. Furthermore, we propose maximizing accesses to the soft-error-invulnerable dies by dynamically moving data blocks among dies, achieving further improvements in performance and energy efficiency. Simulation results show that the proposed 3D cache architecture can reduce power consumption by up to 65% for the L1 instruction cache, 60% for the L1 data cache, and 20% for the L2 cache, while overall IPC performance improves by 5% on average.

20.
As the gap between memory access speed and processor speed grows ever wider, memory performance has become the bottleneck for improving processor performance. Based on an analysis of program memory-access behavior, this paper proposes an adaptive stack cache scheme with fast address calculation. The scheme separates stack accesses from data-cache accesses and fully exploits the characteristics of stack-space data access, raising instruction-level parallelism, reducing data-cache pollution, and lowering the data-cache miss rate; a fast address-calculation strategy reduces the hit time of stack accesses. The stack cache shuts itself off adaptively when a stack overflow occurs, avoiding the performance impact of stack switching. A process identifier is added to the stack cache tags, so data need not be written back to lower levels of the memory hierarchy on a process switch, making the scheme suitable for multi-process environments. Results on SPEC CPU2000 programs show that with the adaptive stack cache and fast address calculation, 25.8% of memory instructions can execute in parallel, the data-cache miss rate drops by 9.4% on average, and IPC improves by 6.9% on average.
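The routing decision can be sketched in software (illustrative region bounds and counters, not the paper's hardware): the effective address is formed early as SP+offset, stack-region references go to the dedicated stack cache, and a stack overflow adaptively shuts the stack cache off.

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t stack_lo = 0x7f0000000000u, stack_hi = 0x7f0000100000u;
static int stack_cache_on = 1;
static long stack_refs, dcache_refs;

/* Route one memory reference: fast SP+offset address calculation,
 * then steer stack-region accesses away from the data cache. */
void route_access(uint64_t sp, int32_t offset)
{
    uint64_t ea = sp + offset;               /* early effective address */
    if (stack_cache_on && ea >= stack_lo && ea < stack_hi)
        stack_refs++;                        /* dedicated stack cache   */
    else
        dcache_refs++;                       /* normal data-cache path  */
}

void on_stack_overflow(void) { stack_cache_on = 0; }  /* adaptive off */

int main(void) {
    route_access(stack_lo + 0x1000, -16);    /* local variable: stack   */
    route_access(0x600000000000u, 0);        /* heap access: d-cache    */
    on_stack_overflow();
    route_access(stack_lo + 0x1000, -16);    /* now routed to d-cache   */
    printf("stack=%ld dcache=%ld\n", stack_refs, dcache_refs);
    return 0;
}
```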
