Similar Articles
1.
An Arbitrary-Stride Pre-Promotion Technique for Non-Uniform Caches (Cited by 2; 0 self, 2 others)
As microelectronic process technology advances, large-capacity on-chip non-uniform caches have drawn wide research attention. This paper proposes an arbitrary-stride pre-promotion technique for non-uniform caches. It optimizes data placement within the cache so that data about to be accessed sits in cache banks closer to the processor, lowering memory access latency and improving system performance. The design of arbitrary-stride pre-promotion is described in detail, its differences from prefetching are analyzed, and a combination of the two is proposed. Full-system simulation of 11 benchmarks from NPB and SPEC2000 shows that the technique effectively reduces memory access latency: with memory-access prediction tables of 16 and 32 entries, system IPC rises by 4.17% and 4.91% on average, respectively; combining pre-promotion with prefetching lifts the average IPC gains to 8.84% and 11.06%.
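
A minimal C sketch of the idea, assuming a last-address-plus-stride predictor indexed by load PC; the 16-entry table layout and the promote_to_near_bank() hook are illustrative stand-ins, not the paper's exact design:

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 16

typedef struct {
    uint64_t pc;         /* load instruction address (table tag) */
    uint64_t last_addr;  /* last data address this load touched */
    int64_t  stride;     /* last observed stride, arbitrary size */
    int      confident;  /* nonzero once the same stride repeats */
} PredEntry;

static PredEntry pred_table[TABLE_SIZE];

/* Stand-in for the hardware action: migrate the block holding `addr`
 * to a cache bank close to the processor. */
static void promote_to_near_bank(uint64_t addr)
{
    printf("pre-promote block holding 0x%llx\n", (unsigned long long)addr);
}

/* Called on every committed load: update the predictor and, once the
 * stride is stable, pre-promote the block the next access will touch. */
void on_load(uint64_t pc, uint64_t addr)
{
    PredEntry *e = &pred_table[pc % TABLE_SIZE];
    if (e->pc == pc) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->confident = (stride != 0 && stride == e->stride);
        e->stride = stride;
        if (e->confident)
            promote_to_near_bank(addr + (uint64_t)stride);
    } else {
        e->pc = pc;
        e->stride = 0;
        e->confident = 0;
    }
    e->last_addr = addr;
}
```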

2.
Traditional caches do not check the state of data blocks when prefetching, causing unnecessary I/O and lowering the cache hit ratio. To address this, a cache management policy based on semantic information is proposed. The policy first collects semantic information so that the disk learns how the file system lays out data on disk, knows whether each disk block is live or dead, and derives an activity level for each disk partition. Guided by this information, dead blocks are never prefetched, the prefetch parameters are raised on highly active partitions, and dead blocks evicted from the cache are not written back to disk. Experiments show the policy noticeably improves the cache hit ratio and thus system throughput.
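
A sketch of the two decisions the policy describes, assuming per-block liveness and per-partition activity have already been collected from file-system metadata; the bitmap, the region granularity, and the 0.5 hotness threshold are hypothetical choices, not the paper's:

```c
#include <stdbool.h>
#include <stdint.h>

#define NBLOCKS       (1u << 20)          /* disk blocks tracked */
#define NREGIONS      64
#define REGION_BLOCKS (NBLOCKS / NREGIONS)

static uint8_t live_bitmap[NBLOCKS / 8];  /* 1 = block holds live file data */
static double  hotness[NREGIONS];         /* fraction of live blocks per region */

static bool block_is_live(uint32_t blk)
{
    return live_bitmap[blk / 8] & (1u << (blk % 8));
}

/* Read-ahead depth for an access starting at `blk`: deeper in hot
 * regions, and never past the first dead block. */
int prefetch_depth(uint32_t blk, int base_depth)
{
    int depth = hotness[blk / REGION_BLOCKS] > 0.5 ? 2 * base_depth : base_depth;
    int n = 0;
    while (n < depth && blk + 1 + n < NBLOCKS && block_is_live(blk + 1 + n))
        n++;
    return n;
}

/* On evicting a dirty cache block: dead blocks need no write to disk. */
bool must_write_back(uint32_t blk, bool dirty)
{
    return dirty && block_is_live(blk);
}
```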

3.
吴贞海, 刘福岩. 《计算机工程》, 2010, 36(10): 285-287.
On traditional x86 processors, switching address spaces usually requires flushing the TLB and cache, consuming a large amount of kernel time. By enabling the Fast Context Switch Extension of the ARM920T embedded processor, the low 32 MB of every process address space can be redirected in hardware to a stretch of virtual address space selected by that process's process identifier. Because these stretches never overlap, the address information held in the TLB and cache stays valid across a process switch; the needless TLB and cache flushes disappear, and embedded system performance improves.
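
A small sketch of the FCSE address relocation as documented for ARM9 cores: addresses below 32 MB are shifted into a slot selected by the 7-bit process ID before the TLB/cache lookup (the function name here is ours):

```c
#include <stdint.h>

#define FCSE_SLOT_SIZE (32u << 20)   /* each process owns a 32 MB slot */

/* Modified virtual address the ARM920T presents to the TLB and caches
 * when FCSE is enabled: low addresses are relocated by the process ID,
 * so per-process entries never collide and need no flush on a switch. */
uint32_t fcse_modified_va(uint32_t va, uint32_t pid /* 0..127 */)
{
    if (va < FCSE_SLOT_SIZE)
        return va + pid * FCSE_SLOT_SIZE;  /* MVA = VA + PID * 32 MB */
    return va;                             /* >= 32 MB passes through */
}
```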

4.
Research and Implementation of a High-Performance Cluster File System for SAN Environments (Cited by 1; 0 self, 1 other)
After studying today's mainstream cluster file systems, this paper proposes a high-performance, low-cost, large-capacity cluster file system model tailored to SAN environments, argues its correctness, and analyzes its performance. The model relies on centralized file metadata management, multi-level caching, and metadata preallocation and prefetching to improve system reliability and throughput. Experiments verify the model's efficiency.

5.
Cache Profiling Techniques (Cited by 1; 0 self, 1 other)
How to reduce and hide the latency of cache misses is a question of wide interest. To learn how cache accesses hit or miss, compilers usually run the program once on a simulator, which is very slow. To overcome this, this paper proposes doing cache profiling inside the compiler to collect cache access information. Like value profiling and stride profiling, cache profiling instruments the memory instructions; it is markedly faster and requires nothing beyond compiler support. The information it gathers can be used to improve instruction scheduling, software prefetching, cache hint generation, and helper threads.
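
A sketch of the kind of probe such instrumentation could insert before every memory instruction, here updating a direct-mapped cache model inline so no separate simulator run is needed; the geometry and counter layout are illustrative, not the paper's:

```c
#include <stdint.h>

#define SETS      1024
#define LINE_SIZE 64
#define MAX_SITES 4096

static uint64_t tag_array[SETS];   /* direct-mapped cache model */
static uint8_t  valid[SETS];
static uint64_t hits[MAX_SITES], misses[MAX_SITES];

/* Probe inserted before each instrumented load/store; `site` is the
 * compiler's ID for that memory instruction. */
void __profile_mem(uint32_t site, uint64_t addr)
{
    uint64_t line = addr / LINE_SIZE;
    uint32_t set  = (uint32_t)(line % SETS);
    if (valid[set] && tag_array[set] == line) {
        hits[site]++;
    } else {
        misses[site]++;
        tag_array[set] = line;     /* install the new line */
        valid[set] = 1;
    }
}
```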

6.
Unfair use of and contention for cache space directly hurt overall system performance. The default Linux scheduler cannot perceive program behavior, including cache miss counts, nor the differences that may exist among threads in memory access patterns and frequency, so it cannot schedule sensibly. This paper proposes, and implements on Linux, a cache-aware scheduling algorithm, CAS. By monitoring each task's shared-cache misses per thousand instructions, CAS gathers tasks with similar miss counts onto the same core so that tasks with widely different miss counts run on different cores, preventing several high-miss tasks from running on different cores at the same time and thereby easing the unfair use of and contention for cache space. Experiments show that in most cases CAS reduces the whole workload's shared-cache misses and raises average system throughput by about 5%.
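
A sketch of the grouping step, assuming per-task MPKI has already been sampled; the sort-and-pack placement is one plausible reading of the stated policy, not CAS's exact algorithm:

```c
#include <stdlib.h>

typedef struct { int tid; double mpki; int core; } Task;

static int by_mpki(const void *a, const void *b)
{
    double d = ((const Task *)a)->mpki - ((const Task *)b)->mpki;
    return (d > 0) - (d < 0);
}

/* Sort tasks by shared-cache misses per kilo-instructions, then pack
 * neighbors onto the same core: similar MPKI shares a core, very
 * different MPKI lands on different cores. */
void assign_cores(Task *tasks, int ntasks, int ncores)
{
    qsort(tasks, ntasks, sizeof *tasks, by_mpki);
    int per_core = (ntasks + ncores - 1) / ncores;
    for (int i = 0; i < ntasks; i++)
        tasks[i].core = i / per_core;
}
```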

7.
A Prefetching Policy Guided by the Memory-Miss Queue State (Cited by 1; 0 self, 1 other)
As the gap between memory speed and processor speed grows ever more pronounced, memory performance has become the bottleneck of overall computer performance. From an analysis of instruction cache and data cache miss behavior, this paper derives a prefetching policy that consults the state of the memory-miss queue. The policy preserves the order of instruction and data accesses, which helps extract prefetch streams, and it separates instruction-stream from data-stream prefetching so the two cannot evict each other. In choosing when to launch a prefetch, it weighs not only whether the bus is currently idle but also the state of the miss queue, limiting interference with the processor's normal memory requests. A stream filter raises prefetch accuracy and trims the memory bandwidth prefetching consumes. Results show that with this policy the processor's average memory latency falls by 30% and the IPC of the SPEC CPU2000 programs rises by 8.3% on average.
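
A sketch of the issue test implied by the policy; the queue size, the reserve of two entries, and the fixed-value status probes are all illustrative:

```c
#include <stdbool.h>

#define MSHR_ENTRIES 8
#define RESERVE      2   /* entries kept free for demand misses */

/* Stand-ins for hardware status probes. */
static bool bus_idle(void)       { return true; }
static int  mshr_occupancy(void) { return 3; }

/* A prefetch may issue only when the bus is idle AND the miss queue
 * has headroom, so demand misses are never crowded out. */
bool may_issue_prefetch(void)
{
    return bus_idle() && mshr_occupancy() <= MSHR_ENTRIES - RESERVE;
}
```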

8.
An Analysis of the Pentium 4 Processor's Memory Hierarchy (Cited by 2; 0 self, 2 others)
吴金, 齐欢. 《微机发展》, 2004, 14(7): 47-48, 51.
The efficiency of a processor's memory system weighs heavily on its overall performance. This article describes the memory architecture of the P4 processor: the L1 data cache, the L2 cache, and the trace cache; the role each part plays; and the prefetching mechanisms adopted to raise the hit ratio and cut access time. The P4 chiefly relies on a hierarchical memory design, a large L2 cache, and prefetching in the trace cache to raise the cache hit ratio and lower the miss penalty, shortening the processor's access time and so improving its overall performance.

9.
Using Data Prefetching to Reduce Memory Latency in a Block Execution Model (Cited by 1; 0 self, 1 other)
A block execution model mines the latent instruction-level parallelism of an application by cutting a sequential program into a series of instruction blocks that can run in parallel. Memory latency is one of the main obstacles to the instruction-level parallelism such a model can reach, whereas data prefetching, which effectively cuts memory latency in conventional execution models, adapts equally well to block execution. This paper analyzes the feasibility of adding data prefetching to a block execution model and verifies its effect through cache hit ratios and memory instruction latencies. Simulation results show that data prefetching effectively lowers memory latency in the block execution model.

10.
Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance, it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5% to 30% compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations is also examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy.

11.
To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25%, respectively.
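
A sketch of one way such adaptation could work; the usefulness thresholds (0.75/0.40), the cap of 16, and the floor of 1 are chosen for illustration rather than taken from the paper (the paper's scheme can also switch prefetching off entirely):

```c
static int K = 1;            /* blocks prefetched per miss (the degree) */
static int issued, useful;   /* prefetches issued / later referenced */

void on_prefetch_issued(void)       { issued++; }
void on_prefetched_block_used(void) { useful++; }

/* Run periodically: measure how many prefetched blocks were actually
 * referenced and grow or shrink the degree accordingly. */
void adapt_degree(void)
{
    if (issued == 0)
        return;
    double ratio = (double)useful / issued;
    if (ratio > 0.75 && K < 16)
        K *= 2;              /* prefetches are being used: reach further */
    else if (ratio < 0.40 && K > 1)
        K /= 2;              /* mostly wasted: back off */
    issued = useful = 0;
}
```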

12.
Lee Minsuk, Min Sang Lyul, Shin Heonshik, Kim Chong Sang, Park Chang Yun. Real-Time Systems, 1997, 13(1): 47-65.
Cache memories have been extensively used to bridge the speed gap between high speed processors and relatively slow main memory. However, they are not widely used in real-time systems due to their unpredictable performance. This paper proposes an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed threaded prefetching, an instruction block pointer called a thread is assigned to each instruction memory block and is made to point to the next block on the worst case execution path that is determined by a compile-time analysis. Also, the thread is not updated throughout the entire program execution to guarantee predictability. This paper also compares the worst case performances of various previous instruction prefetching schemes with that of the proposed threaded prefetching. By analyzing several benchmark programs, we show that the worst case performance of the proposed scheme is significantly better than those of previous instruction prefetching schemes. The results also show that when the block size is large enough the worst case performance of the proposed threaded prefetching scheme is almost as good as that of an instruction cache with 100% hit ratio.
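
A sketch of the run-time side of threaded prefetching; the table size and the -1 sentinel are illustrative, and filling thread[] is the job of the compile-time worst-case-path analysis:

```c
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 256

/* thread[b]: block to prefetch when block b begins executing; written
 * once by compile-time analysis and never updated at run time, so the
 * worst-case behavior stays analyzable. -1 means no prefetch. */
static int16_t thread[NBLOCKS];

static void prefetch_block(int b)   /* stand-in for the fetch engine */
{
    printf("prefetch instruction block %d\n", b);
}

void on_block_enter(int b)
{
    if (thread[b] >= 0)
        prefetch_block(thread[b]);  /* fixed target: predictable timing */
}
```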

13.
Different cache prefetching policies suit different access patterns. This paper reviews the state of research on cache prefetching in storage systems, constructs a triple model of access patterns from an analysis of those patterns, and evaluates on a disk array an adaptive cache prefetching policy aimed at complex environments. The results demonstrate that the adaptive policy obtains the disk array's best performance across different environments.

14.
Prefetching is one of several techniques for hiding and tolerating the large memory latencies of scalable multiprocessors. In this paper, we present a performance model for analyzing the limits and effectiveness of data prefetching. The model incorporates the effects of program behavior, network characteristics, cache coherency protocols, and memory consistency model. Our results indicate that, as long as there is enough extra network bandwidth, prefetching is very effective in hiding large latencies. In machines with sufficiently large caches to hold the program working set, the intra- and internode cache interference is low enough that it has no significant impact on prefetching performance. Furthermore, we reveal the fact that the effective prefetch distance plays a vital role and adapts extremely well to changes in cache miss rates and remote latencies, thus allowing prefetches to be more effective in hiding latency. An adaptive algorithm is provided to optimize the prefetch distance. This is based on the dynamic behavior of the application, interconnection network, and distributed caches and memories. This optimization of the prefetch distance constitutes a significant advantage of prefetching over other latency tolerating techniques, such as multithreading. We show that the prefetch distance can be chosen constant, program-dependent, or decided by performance information. The optimal distance could be adaptively determined using both compile-time and runtime conditions. Our results are therefore useful not only to compiler writers, but also for the development of runtime support systems in multiprocessors. In large-scale systems, in which network traffic control predominates the performance, the ultimate goal is to match program behavior with machine behavior.
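
A sketch of the classic distance rule such a model refines: issue prefetches d iterations ahead, where d covers the expected miss latency; feeding in a dynamically measured latency is what lets the distance adapt. The function and parameter names are ours:

```c
#include <math.h>

/* latency_cycles: current average remote miss latency
 * cycles_per_iter: average work per loop iteration    */
int prefetch_distance(double latency_cycles, double cycles_per_iter)
{
    int d = (int)ceil(latency_cycles / cycles_per_iter);
    return d < 1 ? 1 : d;   /* always at least one iteration ahead */
}
```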

15.
Data prefetching hides memory latency, easing the speed gap between microprocessors and DRAM. GCC, the widely used open-source compiler, implements prefetching for loop-level arrays on its tree-ssa framework. Building on a close analysis of the basic implementation of loop-level array prefetching in GCC 4.9, and of the three cost models, based on prefetch benefit and analysis time, under which prefetching is declined, this paper identifies the factors that govern the effect of loop array prefetching and measures that effect on typical test cases. The work offers guidance for improving GCC's existing loop-level array prefetch optimization.
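
For comparison, a hand-written analogue of the transformation the pass performs, using the real __builtin_prefetch intrinsic; the original loop would instead be compiled with gcc -O2 -fprefetch-loop-arrays, and the 64-element lookahead here is an illustrative constant, not the distance GCC's cost model would compute:

```c
#include <stddef.h>

double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* args: address, rw (0 = read), temporal locality (3 = high) */
        __builtin_prefetch(&a[i + 64], 0, 3);
        s += a[i];
    }
    return s;
}
```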

16.
A prefetch method that enables stride prefetching at the secondary cache without accessing the processor's internal resources is developed and evaluated. It uses a data-range-table that enables it to detect usable strides and memory access streams which fall into the same data range. Program-driven simulation of scientific applications in the context of shared-memory multiprocessors shows that the proposed method can reduce load stall times by an amount comparable to a conventional stride-driven prefetching method which requires access to the processor's instruction address register.

17.
Non-Sequential Instruction Prefetching at Basic-Block Granularity (Cited by 1; 0 self, 1 other)
Instruction fetch capability strongly influences microprocessor performance. Instruction prefetching effectively lowers the instruction cache miss rate, strengthens instruction fetch, and thereby raises processor performance. This paper proposes a non-sequential instruction prefetching technique that is guided by branch instructions and works at the granularity of basic blocks: each prefetch reads one complete basic block into the instruction cache. The method analyzes program behavior with a static policy and demands little hardware complexity. Simulation results show that it effectively raises the instruction cache hit ratio.
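
A sketch of the prefetch trigger at basic-block granularity, with the block-bounds lookup standing in for the static analysis the paper describes; the fixed 64-byte block is illustrative:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t start, nbytes; } BasicBlock;

/* Stand-in for the statically built table: bounds of the basic block
 * beginning at a branch target. */
static BasicBlock lookup_block(uint32_t target_pc)
{
    BasicBlock bb = { target_pc, 64 };
    return bb;
}

static void icache_fetch(uint32_t addr, uint32_t nbytes)
{
    printf("fetch %u bytes at 0x%x into I-cache\n",
           (unsigned)nbytes, (unsigned)addr);
}

/* When a branch is decoded, read the complete basic block at its
 * target in one operation, instead of line-by-line on demand. */
void on_branch_decoded(uint32_t target_pc)
{
    BasicBlock bb = lookup_block(target_pc);
    icache_fetch(bb.start, bb.nbytes);
}
```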

18.
Instruction prefetching techniques aimed at general-purpose computer systems cannot satisfy real-time applications. One important reason is that the instruction cache pollution caused by useless prefetches makes the WCET estimates of real-time tasks imprecise, lowering schedulability and seriously hurting system efficiency. To simplify WCET analysis and tighten WCET estimates, this paper proposes an instruction prefetching method based on program basic blocks. Prefetching at basic-block granularity avoids the useless prefetches that traditional instruction prefetching introduces; by simplifying the worst-case determination of instruction cache hits and misses, it streamlines WCET analysis and improves the WCET estimate. Results on real-time benchmarks show that, compared with no prefetching, the method lowers real-time tasks' WCET estimates by about 20% and improves average-case instruction cache performance by about 10%.
