期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

黄震春李三立马群生《小型微型计算机系统》2002,23(1):9-13

由于存储器间距日益扩大 ,存储系统对计算机系统整体性能的影响越来越严重 ,存储系统模拟器的研究与开发也日益重要 .传统的模拟器更多地将注意力集中于对 Cache的模拟 ,而对存储系统整体的模拟不够 .为了模拟并分析存储系统各部分的性能与其对存储系统整体性能的影响 ,本文设计并实现存储系统模拟器 Si Mem Sy(SImulator ofMEMory SYstem) .实验表明 ,Si Mem Sy可以准确、高效地对存储系统进行模拟并得到可信的结果相似文献

2.

Dynamic Partitioning of Shared Cache Memory 总被引：6，自引：0，他引：6

G. E. Suh L. Rudolph S. Devadas 《The Journal of supercomputing》2004,28(1):7-26

This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches.Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses.The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory. 相似文献

3.

基于Simics的分布式一致性协议仿真

郑志硕郑存陆曹宏徙《计算机与现代化》2011,(9):105-108

用于多种计算机系统和指令系统仿真的Virtutech Simics只提供一个简单的顺序扁平侦听式高速缓存一致性（Snoo-ping Cache Coherence Protocol）模型支持MESI协议,从而制约了可仿真的并行处理器个数。以下将基于目录的分布式高速缓存一致性协议（Distributed Directory-based Cache Coherence Protocol）模型应用于Simics中并给出基于Simics的分布式一致性协议的仿真结果。这一结果证实分布式协议能降低事件总数,减少网络中的事件。本文提出一个简单的基于目录的分布式高速缓存一致性协议,从而解决制约Simics的可扩放性问题。相似文献

4.

Linked instruction caches for enhancing power efficiency of embedded systems

Chang-Jung Ku Ching-Wen Chen An Hsia Chun-Lin Chen 《Microprocessors and Microsystems》2014

The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. SLC has the features of low power consumption and low access delay, similar to a direct mapping cache, and a high cache hit rate similar to a two way-set associative cache by adding a new link field. In addition, we developed another design known as a multiple linked caches (MLC) to further reduce the power consumption during each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption during each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory to obtain the requested instruction directly if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, traditional cache, and filter cache in terms of the cache hit rate, power consumption, and execution time. 相似文献

5.

一种片上众核结构共享Cache动态隐式隔离机制研究 总被引：2，自引：0，他引：2

宋风龙刘志勇范东睿张军超余磊《计算机学报》2009,32(10)

访存带宽是限制众核处理器件能提升的关键,将片上最后一级Cache设计为所有处理器核共享是必要的.在共享Cache中隔离放置冲突的数据,是提高共享Cache性能的关键.文中提出了缓存块链接的硬件方法,用于隔离共享Cache中不同线程之间的数据.文中基于时钟精准的片上众核结构模拟器,使用Splash2程序组和生物信息学中的仟务,对所提机制进行了评估.实验结果表明,与传统共享Cache相比,使用缓存块链接机制时,使得共享Cache的冲突性缺失率降低约20%,而使得IPC平均提高了约10%. 相似文献

6.

On the design of on-chip instruction caches

Carl McCrosky Brian ven der Buhs 《Microprocessors and Microsystems》1988,12(10):563-572

Cache memories reduce memory latency and traffic in computing systems. Most existing caches are implemented as board-based systems. Advancing VLSI technology will soon permit significant caches to be integrated on chip with the processors they support. In designing on-chip caches, the constraints of VLSI become significant. The primary constraints are economic limitations on circuit area and off-chip communications. The paper explores the design of on-chip instruction-only caches in terms of these constraints. The primary contribution of this work is the development of a unified economic model of on-chip instruction-only cache design which integrates the points of view of the cache designer and of the floorplan architect. With suitable data, this model permits the rational allocation of constrained resources to the achievement of a desired cache performance. Specific conclusions are that random line replacement is superior to LRU replacement, due to an increased flexibility in VLSI floorplan design; that variable set associativity can be an effective tool in regulating a chip's floorplan; and that sectoring permits area efficient caches while avoiding high transfer widths. Results are reported on economic functionality, from chip area and transfer width to miss ratio. These results, or the underlying analysis, can be used by microprocessor architects to make intelligent decisions regarding appropriate cache organizations and resource allocations. 相似文献

7.

ARP:同时多线程处理器中共享Cache自适应运行时划分机制 总被引：1，自引：1，他引：0

隋秀峰吴俊敏陈国良《计算机研究与发展》2008,45(7)

同时多线程是一种延迟容忍的体系结构,采用共享的二级Cache,在每个周期内可以执行多个线程的多条指令,这就会增加对存储层次的压力,文中主要研究了SMT处理器中多个并发执行的线程之间共享Cache的划分问题,尤其是Cache共享中的公平性问题以及它和吞吐量之间的关系,传统的LRU策略会根据线程的需要隐式地划分共享Cache,给具有较高需求的线程分配较多的Cache空间,对Cache的管理具有不公平性,从而会引起线程饿死、优先级反转等问题,实现了一种自适应、运行时划分机制(ARP)来管理共享Cache.ARP采用公平性作为划分的度量,并且使用动态划分算法来优化公平性,该算法具有易于实现,所需剖析较少的特点,硬件上使用经典的监控器来收集每个线程的栈距离信息,其存储开销不到0.25%.实验结果显示,与基于LRU的Cache划分相比,ARP可以将一个2路SMT处理器的公平性提高2.26倍,而将吞吐量平均提高14.75%. 相似文献

8.

一种基于Inter-warp异构性的缓存管理与内存调度机制

方娟魏泽琳于婷雯《计算机工程与科学》2019,41(5):788-795

在GPU中,一个warp内的所有线程在锁步中执行相同的指令。某些线程的内存请求可以得到快速处理,而其余请求会经历较长时间。在最慢的请求完成之前,warp不能执行下一条指令,导致内存发散。对GPU中warp间的异构性进行了研究,实现并优化了一种基于inter warp异构性的缓存管理机制和内存调度策略,以减少内存发散和缓存排队延迟的负面影响。根据缓存命中率将warp分类,以驱动后面的3个组件：（1）基于warp类型的缓存旁路技术组件,使低缓存利用率的warp进入旁路,不访问L2缓存;（2）基于warp类型的缓存插入/提升策略组件,防止来自高缓存利用率warp的数据被过早清除;（3）基于warp类型的内存控制器组件,优先处理从高缓存利用率的warp接收到的请求,并优先处理来自相同warp的请求。基于warp间异构性的缓存管理和内存调度机制在8种不同的GPGPU应用中,与基准GPU相比,平均加速18.0％。相似文献

9.

cache profiling信息指导的软件流水

周谦冯晓兵张兆庆《计算机研究与发展》2008,45(5):834-840

软件流水是一种重要的指令调度技术,它通过同时执行来自不同循环迭代的指令来加快循环的执行时间.随着处理器速度和访存速度差距越拉越大,访存指令尤其是cache miss的访存指令日益成为系统性能提高的瓶颈.由于这些指令的延迟不是固定的,如何在软件流水中预测并掩盖这些访存指令的延迟是非常重要的.与前人预测访存延迟的方法不同,引入cache profiling技术,通过动态收集到profile信息来预测访存延迟,并进行适当的调度.当增加模调度循环中的访存指令的延迟时,启动间隔也会随之增大,导致性能不会随之上升.CSMS算法和FLMS算法在尽量不增大启动间隔的情况下,改变访存指令的延迟.改进了CSMS算法和FLMS算法,根据cache profiling的信息来改变访存延迟,所以比前人的方法更为准确.实验表明,新方法可以有效地提高程序性能,对SPEC2000测试程序平均性能提高1%左右,个别例子的性能改进高达11%. 相似文献

10.

存储模型仿真器的设计与实现 总被引：2，自引：1，他引：1

吴俊敏杨超陈国良张淼辉门珂《计算机研究与发展》2005,42(3):394-403

存储一致性问题和高速缓存一致性问题是共享存储并行计算机中两个最关键的问题,通过仿真器对它们进行了量化研究,设计并实现了一个存储模型仿真器MMS．基于MMS仿真了不同并行机结构模型下多种存储一致性模型的行为;针对不同类型的计算问题比较了不同的存储一致性模型,并对实验结果进行了分析;实现了几个不同的高速缓存一致性协议,并比较了它们的性能．相似文献

11.

Performance of One''s Complement Caches

Qing Yang Sridar Adina T. Sun 《Journal of Parallel and Distributed Computing》1998,48(2):143

On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper will show that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By parallel computation of cache addresses and memory addresses, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in a cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to a same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost. 相似文献

12.

Cache Profiling技术 总被引：1，自引：0，他引：1

周谦冯晓兵张兆庆《计算机工程》2006,32(13):47-48

如何减少和隐藏cache失效的延迟,是人们关注的热点。编译器为了得到cache访问命中的情况,往往使用模拟器去跑一遍来得到结果,这样的速度很慢。为了克服以上缺点,提出了在编译器中作cache profiling来获取cache访问的信息。类似于value profiling和stride profiling,cache profiling对访存指令作插装,可以有效地提高速度,并且只需要编译器的支持即可。Cache profiling获得的信息可以用来改进指令调度、软件预取、生成cache hint和辅助线程等。相似文献

13.

SMA: A Self-Monitored Adaptive Cache Warm-Up Scheme for Microprocessor Simulation

Yue?Luo Email author Lizy?K.?John Lieven?Eeckhout 《International journal of parallel programming》2005,33(5):561-581

This paper presents an adaptive technique for warming up caches in sampled microprocessor simulation. The simulator monitors the warm-up process of the caches and decides when the caches are warmed up based on simple heuristics. This mechanism allows the warm up length to be adaptive to cache sizes and benchmark variability characteristics. With only half or one-third of the average warm-up length of previous methods, the proposed Self-Monitored Adaptive (SMA) warm-up technique achieves CPI results very similar to previous methods. On average SMA exhibits only 0.2% warm-up error in CPI. For simulating small caches, the SMA technique can reduce the warm-up overhead by an order of magnitude compared to previous techniques. Finally, SMA gives the user some indicator of warm-up error at the end of the cycle-accurate simulation that helps the user to gauge the accuracy of the warm-up. 相似文献

14.

Maintaining Cache Coherence through Compiler-Directed Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》1998,53(2):170

In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. TheCache Coherence With Data Prefetching(CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme on five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency. 相似文献

15.

Using the First-Level Caches as Filters to Reduce the Pollution Caused by Speculative Memory References

Onur?Mutlu Email author Hyesoon?Kim David?N.?Armstrong Yale?N.?Patt 《International journal of parallel programming》2005,33(5):529-559

High-performance processors employ aggressive branch prediction and prefetching techniques to increase performance. Speculative memory references caused by these techniques sometimes bring data into the caches that are not needed by correct execution. This paper proposes the use of the first-level caches as filters that predict the usefulness of speculative memory references. With the proposed technique, speculative memory references bring data only into the first-level caches rather than all levels in the cache hierarchy. The processor monitors the use of the cache blocks in the first-level caches and decides which blocks to keep in the cache hierarchy based on the usefulness of cache blocks. It is shown that a simple implementation of this technique usually outperforms inclusive and exclusive baseline cache hierarchies commonly used by today’s processors and results in IPC performance improvements of up to 10% on the SPEC CPU2000 integer benchmarks. 相似文献

16.

SimK: A Large-Scale Parallel Simulation Engine

下载免费PDF全文

Jian-Wei Xu 《计算机科学技术学报》2009,24(6):1048-1060

Simulation is an important method to evaluate future computer systems. Currently microprocessor architecture has switched to parallel, but almost all simulators remained at sequential stage, and the advantages brought by multi-core or many-core processors cannot be utilized. This paper presents a parallel simulator engine (SimK) towards the prevalent SMP/CMP platform, aiming at large-scale fine-grained computer system simulation. In this paper, highly efficient synchronization, communication and buffer management policies used in SimK are introduced, and a novel lock-free scheduling mechanism that avoids using any atomic instructions is presented. To deal with the load fluctuation at light load case, a cooperated dynamic task migration scheme is proposed. Based on SimK, we have developed large-scale parallel simulators HppSim and HppNetSim, which simulate a full supercomputer system and its interconnection network respectively. Results show that HppSim and HppNetSim both gain sound speedup with multiple processors, and the best normalized speedup reaches 14.95X on a two-way quad-core server. 相似文献

17.

Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing

《Microprocessors and Microsystems》2016

To achieve higher performance and energy efficiency, GPGPU architectures have recently begun to employ hardware caches. Adding caches to GPGPUs, however, does not always guarantee improved performance and energy efficiency due to the thrashing in small caches shared by thousands of threads. While prior work has proposed warp-scheduling and cache-bypassing techniques to address this issue, relatively little work has been done in the context of advanced cache indexing (ACI).To bridge this gap, this work investigates the effectiveness of ACI for high-performance and energy-efficient GPGPU computing. We discuss the design and implementation of static and adaptive cache indexing schemes for GPGPUs. We then quantify the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation demonstrates that the ACI schemes are effective in that they provide significant performance and energy-efficiency gains over the conventional indexing scheme. Further, we investigate the performance sensitivity of ACI to key architectural parameters (e.g., indexing latency and cache associativity). Our experimental results show that the ACI schemes are promising in that they continue to provide significant performance gains even when additional indexing latency occurs due to the hardware complexity and the baseline cache is enhanced with high associativity or large capacity. 相似文献

18.

快速地址计算的自适应栈高速缓存

郇丹丹李祖松王剑章隆兵胡伟武刘志勇《计算机研究与发展》2007,44(1):169-176

随着存储系统的访问速度与处理器运算速度的差距越来越显著,访存性能已成为提高处理器性能的瓶颈.通过对程序的访存行为进行分析,提出快速地址计算的自适应栈高速缓存方案.该方案将栈访问从数据高速缓存的访问中分离出来,充分利用栈空间数据访问的特点,提高指令级并行度,减少数据高速缓存污染,降低数据高速缓存失效率,并采用快速地址计算策略,减少栈访问的命中时间.该栈高速缓存在发生栈溢出时能够自适应地关闭,以避免栈切换对处理器性能的影响.栈高速缓存标志中增加进程标识,进程切换时不需要将数据写到低层存储系统中,适用于多进程环境.SPEC CPU2000程序运行结果表明,采用快速地址计算的自适应栈高速缓存方案,25.8%的访存指令可以并行执行,数据高速缓存失效率平均降低9.4%,IPC值平均提高6.9%. 相似文献

19.

A queueing theoretic approach for performance evaluation of low-power multi-core embedded systems

Arslan Munir Ann Gordon-Ross Sanjay Ranka Farinaz Koushanfar 《Journal of Parallel and Distributed Computing》2014

With Moore’s law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multi-core to exploit this high transistor density for high performance. However, the optimal layout of these multiple cores along with the memory subsystem (caches and main memory) to satisfy power, area, and stringent real-time constraints is a challenging design endeavor. The short time-to-market constraint of embedded systems exacerbates this design challenge and necessitates the architectural modeling of embedded systems to reduce the time-to-market by expediting target applications to device/architecture mapping. In this paper, we present a queueing theoretic approach for modeling multi-core embedded systems that provides a quick and inexpensive performance evaluation both in terms of time and resources as compared to the development of multi-core simulators and running benchmarks on these simulators. We verify our queueing theoretic modeling approach by running SPLASH-2 benchmarks on the SuperESCalar simulator (SESC). Results reveal that our queueing theoretic model qualitatively evaluates multi-core architectures accurately with an average difference of 5.6% as compared to the architectures’ evaluations from the SESC simulator. Our modeling approach can be used for performance per watt and performance per unit area characterizations of multi-core embedded architectures, with varying number of processor cores and cache configurations, to provide a comparative analysis. 相似文献

20.

Design and evaluation of a hierarchical decoupled architecture

Won W. Ro Stephen P. Crago Alvin M. Despain Jean-Luc Gaudiot 《The Journal of supercomputing》2006,38(3):237-259

The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability for future accesses and often fail when memory accesses do not contain much locality. To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces 19.7% of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC improves the performance by 17.2%. 相似文献