Similar Documents
20 similar documents found.
1.
In this paper we propose and evaluate a new data-prefetching technique for cache-coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine, which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach occurs when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it prefetches with very little instruction overhead, which is a limitation of traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would require no additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead of less than 0.42%. The emulated prefetch engine generally removes less stall time at a higher instruction overhead.
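As an illustration of how such an engine might be driven, here is a hypothetical Python sketch: the compiler summarizes a strided access pattern in a descriptor, and the trap handler arms an engine loop that issues the prefetches with no per-access instruction overhead in the main program. The descriptor fields and names are invented for illustration, not taken from the paper.

```python
# Illustrative sketch (not the paper's implementation): a compiler-produced
# "prefetch program" for a strided access pattern, and the engine loop that
# would consume it. All names and fields are hypothetical.
from dataclasses import dataclass

@dataclass
class PrefetchDescriptor:
    base: int     # first address to prefetch
    stride: int   # bytes between consecutive prefetches
    count: int    # number of blocks to fetch ahead

def run_prefetch_engine(desc: PrefetchDescriptor, issue_prefetch):
    """Run autonomously after the miss trap arms the engine: issue one
    prefetch per block, with no per-access overhead in the main program."""
    addr = desc.base
    for _ in range(desc.count):
        issue_prefetch(addr)   # in hardware this would be a bus transaction
        addr += desc.stride

# Example: the trap handler for a miss in a 64-byte-stride loop might arm:
run_prefetch_engine(PrefetchDescriptor(base=0x1000, stride=64, count=8),
                    issue_prefetch=lambda a: print(f"prefetch 0x{a:x}"))
```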

2.
Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme.

3.
To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time of up to 78%, 58%, and 25%, respectively.
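A minimal sketch of the adaptive idea described in this abstract, with invented thresholds: the prefetch degree is raised when most recent prefetches were used and lowered when most were wasted.

```python
# Hypothetical sketch of the adaptive idea: raise or lower the number of
# sequentially prefetched blocks k based on the fraction of prefetches that
# were actually used during the last measurement interval. The thresholds
# are illustrative, not taken from the paper.
def adapt_degree(k, useful, issued, k_max=8,
                 raise_at=0.75, lower_at=0.25):
    if issued == 0:
        return k
    ratio = useful / issued
    if ratio > raise_at and k < k_max:
        return k + 1           # prefetching is paying off: fetch further ahead
    if ratio < lower_at and k > 0:
        return k - 1           # mostly useless traffic: back off (0 disables)
    return k

k = 1
for useful, issued in [(9, 10), (8, 10), (2, 10), (0, 10)]:
    k = adapt_degree(k, useful, issued)
    print(f"useful {useful}/{issued} -> degree {k}")
```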

4.
Intelligent Web Acceleration Based on Network Performance: Caching and Prefetching
Web traffic accounts for a large share of network traffic. When network bandwidth cannot be expanded, techniques are needed to use the available bandwidth efficiently and improve network performance. This paper studies intelligent Web acceleration based on network performance metrics such as RTT (round trip time). Based on an analysis of the traffic on a Web proxy server and measurements of network RTT, it proposes an intelligent prefetch control technique and a new cache replacement method. Simulations of the new algorithm show that the method improves the cache hit ratio. The study also shows that prefetching improves response time and effectively improves Web access performance without noticeably increasing network load.

5.
This paper proposes using a user-level memory thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: The correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
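The following toy sketch illustrates the general shape of a software correlation table (not the paper's multi-level design): each miss address remembers the misses that followed it, and a recurring miss triggers prefetches of its recorded successors.

```python
# Minimal sketch of the core idea behind correlation prefetching: a software
# table mapping a miss address to the misses that historically followed it.
# Table sizing, replacement, and the paper's multi-level chaining are omitted.
from collections import defaultdict, deque

class CorrelationTable:
    def __init__(self, successors_per_entry=2):
        self.table = defaultdict(lambda: deque(maxlen=successors_per_entry))
        self.last_miss = None

    def on_miss(self, addr, issue_prefetch):
        # Learn: record addr as a successor of the previous miss.
        if self.last_miss is not None:
            if addr not in self.table[self.last_miss]:
                self.table[self.last_miss].append(addr)
        self.last_miss = addr
        # Predict: prefetch the recorded successors of this miss.
        for succ in self.table[addr]:
            issue_prefetch(succ)

ct = CorrelationTable()
for a in [0x10, 0x80, 0x10, 0x80, 0x10]:   # a repeating miss pattern
    ct.on_miss(a, lambda s: print(f"miss 0x{a:x}: prefetch 0x{s:x}"))
```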

6.
Prefetching is one of several techniques for hiding and tolerating the large memory latencies of scalable multiprocessors. In this paper, we present a performance model for analyzing the limits and effectiveness of data prefetching. The model incorporates the effects of program behavior, network characteristics, cache coherency protocols, and the memory consistency model. Our results indicate that, as long as there is enough extra network bandwidth, prefetching is very effective in hiding large latencies. In machines with caches large enough to hold the program working set, intra- and internode cache interference is too low to have any significant impact on prefetching performance. Furthermore, we show that the effective prefetch distance plays a vital role and adapts extremely well to changes in cache miss rates and remote latencies, allowing prefetches to be more effective in hiding latency. An adaptive algorithm is provided to optimize the prefetch distance, based on the dynamic behavior of the application, the interconnection network, and the distributed caches and memories. This optimization of the prefetch distance constitutes a significant advantage of prefetching over other latency-tolerating techniques, such as multithreading. We show that the prefetch distance can be constant, program-dependent, or decided by performance information, and that the optimal distance can be determined adaptively using both compile-time and runtime conditions. Our results are therefore useful not only to compiler writers, but also for the development of runtime support systems in multiprocessors. In large-scale systems, in which network traffic dominates performance, the ultimate goal is to match program behavior with machine behavior.
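The abstract's central quantity, the prefetch distance, follows a standard first-order rule: fetch far enough ahead that the data arrives before it is needed. The sketch below shows that textbook rule, not the paper's exact adaptive model.

```python
# Standard first-order prefetch-distance rule (in the spirit of the adaptive
# scheme described above, though not the paper's exact model): the distance
# is the miss latency divided by the time per loop iteration, rounded up.
import math

def prefetch_distance(miss_latency_cycles: float,
                      cycles_per_iteration: float) -> int:
    """Iterations ahead to prefetch: d = ceil(latency / iteration time)."""
    return math.ceil(miss_latency_cycles / cycles_per_iteration)

# Remote latency 400 cycles, loop body ~50 cycles => prefetch 8 ahead.
print(prefetch_distance(400, 50))   # -> 8
# If latency doubles under contention, the distance should adapt upward.
print(prefetch_distance(800, 50))   # -> 16
```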

7.
Retrospective adaptive prefetching for interactive Web GIS applications
A major task of a Web GIS (Geographic Information Systems) system is to transfer map data to client applications over the Internet, which can be costly. Various solutions are available to improve this inefficient process. Caching the responses to requests on the client side is the most commonly implemented one. However, this method may not be adequate by itself. Besides caching responses, predicting a client's next likely requests and updating the cache with the responses to those requests provide a remarkable further performance improvement. This procedure is called "prefetching" and makes caching mechanisms more effective and efficient. This paper proposes an efficient prefetching algorithm called Retrospective Adaptive Prefetch (RAP), which is built on a heuristic that considers a given user's past actions. The algorithm reduces the user-perceived response time and improves user navigation efficiency. Additionally, it adjusts the cache size automatically, based on the memory size of the client's machine. RAP is compared with four other prefetching algorithms. The experiments show that RAP provides better performance enhancements than the other methods.
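RAP itself is not reproduced here; the sketch below shows one simple retrospective heuristic of the same flavor for map tiles, prefetching along the user's recent pan direction. The function and the tile model are assumptions, not the paper's algorithm.

```python
# Illustrative sketch only: a "retrospective" heuristic for map tiles that
# prefetches in the direction the user has been panning. The actual RAP
# algorithm and its automatic cache sizing are more involved.
def predict_next_tiles(history, fetch_ahead=2):
    """history: list of (x, y) tile coordinates the user requested."""
    if len(history) < 2:
        return []
    (x0, y0), (x1, y1) = history[-2], history[-1]
    dx, dy = x1 - x0, y1 - y0          # most recent pan direction
    if (dx, dy) == (0, 0):
        return []
    return [(x1 + i * dx, y1 + i * dy) for i in range(1, fetch_ahead + 1)]

print(predict_next_tiles([(5, 5), (6, 5), (7, 5)]))  # panning east -> [(8, 5), (9, 5)]
```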

8.
漆锋滨  王飞  李中升 《软件学报》2009,20(Z1):34-39
Traditional data prefetching based on static compilation mostly targets array accesses, but today's applications contain many pointer-based linked data structures, which are hard to optimize for prefetching with traditional compilation methods. Feedback-directed data prefetching is a state-of-the-art compiler optimization in high-performance computing that handles the prefetching of linked structures well. Building on a study of the feedback-directed optimization in the ORC compiler, this paper optimizes feedback-directed data prefetching for linked structures according to the characteristics of the Alpha architecture. SPEC2000 tests show an average performance improvement of 4.1%.

9.
A Moderately Greedy Integrated Cache and Prefetching Algorithm for Parallel File Systems
卢凯  金士尧  卢锡城 《计算机学报》1999,22(11):1172-1177
Caching and prefetching are two effective techniques for reducing access latency in traditional file systems. Under the I/O access patterns of parallel scientific computing applications, however, simple caching and prefetching can no longer deliver high cache hit ratios. Based on an analysis of these I/O patterns, this paper proposes a moderately greedy integrated cache and prefetching algorithm (PGI). The algorithm fully exploits the characteristics of the parallel file system environment and uses a moderately greedy dynamic sliding-window technique, which effectively eliminates thrashing during prefetching and reduces system processing overhead, while treating caching and prefetching in an integrated way.

10.
On-chip multicore processors have become the way to extend Moore's law, and parallel algorithm design, programming models, compilers, and runtime systems all need computational models for their analysis. Existing multicore models capture inter-thread contention for shared resources such as shared caches fairly accurately, but give little consideration to data sharing between threads. This paper introduces the concepts of the horizontal locality of inter-thread shared caches and the task sharing ratio, and on this basis extends the sequential memory-hierarchy model RAM(h) into MRAM(h), a multicore parallel computation model that takes the task sharing ratio into account.

11.
Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared-memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance, it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5% to 30% compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations is also examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy.

12.
Applying artificial intelligence to computer architecture is a promising research direction, and applying deep learning to data prefetching for multicore architectures in particular has become a research hotspot at home and abroad. This paper studies deep-learning-based cache prefetching and formally defines the deep learning cache prefetching model. After introducing common multicore cache architectures and prefetching techniques, it comprehensively analyzes the design ideas of existing representative deep-learning-based cache prefetchers. Work in this area mainly applies deep neural networks, recurrent neural networks, long short-term memory networks, and attention mechanisms. A comparative analysis of existing deep-learning-based prefetching models shows that deep-learning-based multicore cache prefetching still has limitations in computational cost, model optimization, and practicality; adaptive prefetching models and the practicality of neural-network prefetchers remain large open spaces for future research.

13.
A File Prefetching Algorithm Supporting Concurrent Access Streams
吴峰光  奚宏生  徐陈锋 《软件学报》2010,21(8):1820-1833
This paper designs and implements an on-demand prefetching (readahead) algorithm that uses a more relaxed sequentiality test and relies on the state of pages and the page cache as its decision basis. It can discover sequential accesses buried in random reads and prefetch for them effectively, and it supports the interleaved access patterns produced by concurrent accesses to a single file instance. Experimental results show that, compared with the original Linux readahead algorithm, the new algorithm improves sequential read performance under random interference by 29%; interleaved read performance is 4 to 27 times that of the traditional algorithm; and application-visible latency improves by up to a factor of 35. The algorithm has been adopted by the Linux 2.6.24 kernel.
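A greatly simplified sketch of the key idea as the abstract describes it: sequentiality is inferred from page-cache state rather than from a single per-file offset, so interleaved sequential streams are each recognized. Window sizing and the kernel's actual marker-page mechanism are omitted.

```python
# Simplified sketch: a read on page p is treated as sequential if page p-1
# is already cached, so two readers interleaving on one file each look
# sequential, instead of defeating a single per-file "last offset" check.
def on_read(page_cache: set, page: int, readahead_pages: int = 4) -> list:
    """Return the pages to read for a request on `page` (demand + readahead)."""
    sequential = (page - 1) in page_cache or page == 0
    to_fetch = [page] if page not in page_cache else []
    if sequential:
        to_fetch += [p for p in range(page + 1, page + 1 + readahead_pages)
                     if p not in page_cache]
    page_cache.update(to_fetch)
    return to_fetch

cache = set()
# Two interleaved sequential streams, at offsets 0.. and 100..:
for p in [0, 100, 1, 101, 2, 102]:
    print(p, "->", on_read(cache, p))
```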

14.
The number of web resources on the World Wide Web is rising, to a large extent due to the services and applications it provides. Because web traffic is heavy, access to these resources incurs user-perceived latency. Although this latency can never be eliminated, it can be minimized to a large extent. Web prefetching is a technique that anticipates a user's future requests and fetches them into the cache before an explicit request is made. Because web objects are of various types, a new algorithm is proposed that concentrates on prefetching embedded objects, including audio and video files. Further, clustering is employed using adaptive resonance theory (ART)2 in order to prefetch embedded objects as clusters. For a comparative study, the web objects are clustered using ART2, ART1, and other statistical techniques. The clustering results confirm the superiority of ART2, and prefetching web objects in clusters is thereby observed to produce a high hit rate.
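ART2 itself involves considerably more machinery; the sketch below shows only an ART-flavored vigilance test for grouping object feature vectors, to illustrate how clusters of embedded objects could then be prefetched together. The feature encoding and vigilance value are made up.

```python
# Not ART2 itself: a heavily simplified, ART-style vigilance test. Each
# input joins the first cluster prototype it resembles closely enough;
# otherwise it founds a new cluster (the "resonance or reset" idea).
import math

def art_like_cluster(vectors, vigilance=0.95):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    prototypes, labels = [], []
    for v in vectors:
        for i, p in enumerate(prototypes):
            if cos(v, p) >= vigilance:     # vigilance test passes: resonate
                labels.append(i)
                break
        else:                              # no prototype matched: new cluster
            prototypes.append(v)
            labels.append(len(prototypes) - 1)
    return labels

# Object features, e.g. (is_audio, is_video, size_class):
print(art_like_cluster([(1, 0, 2), (1, 0, 3), (0, 1, 9), (0, 1, 8)]))  # [0, 0, 1, 1]
```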

15.
Data deduplication has been widely utilized in large-scale storage systems, particularly backup systems. Data deduplication systems typically divide data streams into chunks and identify redundant chunks by comparing chunk fingerprints. Maintaining all fingerprints in memory is not cost-effective because fingerprint indexes are typically very large. Many data deduplication systems maintain a fingerprint cache in memory and exploit fingerprint prefetching to accelerate the deduplication process. Although fingerprint prefetching can improve the performance of data deduplication systems by leveraging the locality of workloads, inaccurately prefetched fingerprints may pollute the cache by evicting useful fingerprints. We observed that most of the prefetched fingerprints in a wide variety of applications are never used or used only once, which severely limits the performance of data deduplication systems. We introduce a prefetch-aware fingerprint cache management scheme for data deduplication systems (PreCache) to alleviate prefetch-related cache pollution. We propose three prefetch-aware fingerprint cache replacement policies (PreCache-UNU, PreCache-UOO, and PreCache-MIX) to handle different types of cache pollution. Additionally, we propose an adaptive policy selector to select suitable policies for prefetch requests. We implement PreCache on two representative data deduplication systems (Block Locality Caching and SiLo) and evaluate its performance utilizing three real-world workloads (Kernel, MacOS, and Homes). The experimental results reveal that PreCache improves deduplication throughput by up to 32.22% based on a reduction of on-disk fingerprint index lookups and improvement of the deduplication ratio by mitigating prefetch-related fingerprint cache pollution.
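A sketch in the spirit of one of the policies (PreCache-UNU, as we read the abstract): on eviction, prefer fingerprints that were prefetched but never used, so speculative entries cannot crowd out proven ones. The real system has three policies plus an adaptive selector; this is not its implementation.

```python
# Prefetch-aware replacement sketch: entries carry a "was used" flag, and
# eviction removes the oldest prefetched-but-never-used entry first,
# falling back to plain LRU when every entry has proven useful.
from collections import OrderedDict

class PrefetchAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()      # fingerprint -> was_used (LRU order)

    def _evict(self):
        for fp, used in self.entries.items():
            if not used:                  # oldest unused prefetch goes first
                del self.entries[fp]
                return
        self.entries.popitem(last=False)  # otherwise plain LRU

    def insert(self, fp, prefetched):
        if fp not in self.entries and len(self.entries) >= self.capacity:
            self._evict()
        # Demand-inserted entries count as used; prefetched ones start unused.
        self.entries[fp] = self.entries.get(fp, not prefetched)

    def lookup(self, fp):
        if fp in self.entries:
            self.entries[fp] = True       # a hit proves the entry useful
            self.entries.move_to_end(fp)
            return True
        return False

c = PrefetchAwareCache(2)
c.insert("fp1", prefetched=True)   # speculative
c.insert("fp2", prefetched=False)  # demand
c.insert("fp3", prefetched=False)  # evicts fp1, the unused prefetch
print(c.lookup("fp1"), c.lookup("fp2"))  # False True
```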

16.
Multiple prefetch adaptive disk caching
A new disk caching algorithm is presented that uses an adaptive prefetching scheme to reduce the average service time for disk references. Unlike schemes that simply prefetch the next sector or group of sectors, this method maintains information about the order of past disk accesses, which it uses to accurately predict future access sequences. The range of parameters of this scheme is explored, and its performance is evaluated through trace-driven simulation, using traces obtained from three different UNIX minicomputers. Unlike disk trace data previously described in the literature, these traces include time stamps for each reference. With this timing information, which is essential for evaluating any prefetching scheme, it is shown that a cache with the adaptive prefetching mechanism can reduce the average time to service a disk request by a factor of up to three, relative to an identical disk cache without prefetching.
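An illustrative reconstruction of the flavor of the scheme, not the paper's actual prediction structure: remember which sector historically followed each sector and, on an access, walk that successor chain to build a multi-sector prefetch.

```python
# Sketch: learn sector -> next-sector transitions from the access stream,
# then on each access walk the learned chain several steps ahead to form
# the prefetch set. Table management is simplified away.
class SequencePredictor:
    def __init__(self):
        self.next_of = {}      # sector -> sector that last followed it
        self.prev = None

    def access(self, sector, depth=3):
        if self.prev is not None:
            self.next_of[self.prev] = sector   # learn the transition
        self.prev = sector
        # Walk the learned chain to build a multi-sector prefetch.
        chain, cur = [], sector
        for _ in range(depth):
            cur = self.next_of.get(cur)
            if cur is None or cur == sector or cur in chain:
                break
            chain.append(cur)
        return chain

sp = SequencePredictor()
for s in [7, 12, 3, 7, 12, 3, 7]:
    print(f"access {s}: prefetch {sp.access(s)}")
```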

17.
To address the problem that the LARD scheduling algorithm for cluster servers can only exploit existing cache contents, this paper proposes a prefetching-based algorithm, Prefetch-LARD. The algorithm mines page-to-page transition probabilities from Web access logs to build a Markov chain model; when dispatching a request, it uses these probabilities to fetch the documents likely to be accessed next from a node's disk into the local cache in advance, improving the cache hit ratio of requests. The algorithm also uses a weighted overload test for nodes to improve load balance across the cluster. Experiments show that, in the same test environment, Prefetch-LARD improves the cache hit ratio by 26.9% over LARD, and system throughput correspondingly improves by 18.8%.
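The Markov-chain step might look like the following sketch (the log format and probability threshold are hypothetical): estimate P(next page | current page) from access logs, then prefetch the most likely successor when it is probable enough.

```python
# Sketch of the Markov-chain step: count page-to-page transitions in logged
# sessions, normalize to probabilities, and prefetch the most likely
# successor above a (made-up) confidence threshold.
from collections import Counter, defaultdict

def build_transitions(sessions):
    counts = defaultdict(Counter)
    for pages in sessions:
        for a, b in zip(pages, pages[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

def to_prefetch(trans, current, threshold=0.6):
    succ = trans.get(current)
    if not succ:
        return None
    page, p = max(succ.items(), key=lambda kv: kv[1])
    return page if p >= threshold else None

trans = build_transitions([["/", "/map", "/tile/1"],
                           ["/", "/map", "/tile/2"],
                           ["/", "/help"]])
print(to_prefetch(trans, "/"))      # "/map" (p = 2/3)
print(to_prefetch(trans, "/map"))   # None (no successor reaches 0.6)
```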

18.
An Adaptive Data Prefetching and Buffering Algorithm
Direct lookups over massive data sets are often extremely time-consuming and can hardly meet real-time requirements in practice, so using data prefetching and buffering to optimize lookup operations has become an important part of real systems. The adaptive data prefetching and buffering algorithm uses artificial intelligence techniques to analyze users' query habits, realizing a dynamic prefetching policy and buffering the prefetched data so as to speed up queries. This paper proposes two classes of intelligent algorithms for different data query demands. Experiments on single-user historical-query workloads and multi-user concurrent-query workloads show that the two classes of algorithms each perform well in their respective application scenarios.

19.
Helper-threaded prefetching on chip multiprocessors has been shown to reduce memory latency and improve overall system performance, and has been explored for accesses to linked data structures. In our earlier work, we proposed an effective threaded prefetching technique that balances delinquent loads between the main thread and the helper thread to improve the effectiveness of prefetching. In this paper, we analyze the memory access characteristics of specific applications to estimate the effective prefetch distance range for our threaded prefetching technique. The effect of hardware prefetchers on this estimation is also examined. We discuss key design issues of our proposed method and present preliminary experimental results. Our experimental evaluation indicates that the bounded range of effective prefetch distances can be determined using our method, and that the optimal prefetch distance can be chosen within the estimated range with a few trial runs.

20.
Registers are a scarce and precious hardware resource, which makes register allocation one of the most critical optimizations in a compiler. The key to improving register allocation efficiency is minimizing the cost of spills. To address this problem, the paper proposes adding a level of buffer registers on top of the traditional RISC memory hierarchy to handle spills, and, in the context of a compiler system the authors developed for a new RISC microprocessor, presents a register allocation optimization strategy based on these buffer registers. Experiments show a clear optimization effect.
