Similar Documents
20 similar documents found
1.
This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have the advantage that resources such as the processor and L1 cache are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system, and we present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluated on nine memory-intensive applications with the memory processor in DRAM, our scheme achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over hardware L2 prefetching alone for the subset of applications with high L2 cache misses per cycle.
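Below is a minimal sketch of the loop-based run-ahead throttling this kind of scheme relies on, assuming a shared progress counter and a fixed run-ahead distance; the names (`RUN_AHEAD`, `app_iter`, `helper_thread`) are illustrative and not taken from the paper.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RUN_AHEAD 64            /* max iterations the helper may lead by (assumed value) */

extern double data[];           /* array traversed by the application loop */
static atomic_size_t app_iter;  /* application progress, published every iteration */

/* Helper thread: prefetch data[i], but never run more than RUN_AHEAD
 * iterations ahead of the application thread. */
void *helper_thread(void *arg)
{
    size_t n = *(size_t *)arg;
    for (size_t i = 0; i < n; i++) {
        /* throttle: wait until the application has caught up enough */
        while (i > atomic_load_explicit(&app_iter, memory_order_acquire) + RUN_AHEAD)
            ;                                    /* spin (or yield) */
        __builtin_prefetch(&data[i], 0, 1);      /* read, low temporal locality */
    }
    return NULL;
}

/* Application thread: do the real work and publish its progress. */
void app_loop(size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* ... real computation on data[i] ... */
        atomic_store_explicit(&app_iter, i, memory_order_release);
    }
}
```

Bounding the helper thread's lead is what keeps prefetches timely while preventing them from evicting lines the application has not consumed yet.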

2.
Data deduplication has been widely utilized in large-scale storage systems, particularly backup systems. Data deduplication systems typically divide data streams into chunks and identify redundant chunks by comparing chunk fingerprints. Maintaining all fingerprints in memory is not cost-effective because fingerprint indexes are typically very large. Many data deduplication systems maintain a fingerprint cache in memory and exploit fingerprint prefetching to accelerate the deduplication process. Although fingerprint prefetching can improve the performance of data deduplication systems by leveraging the locality of workloads, inaccurately prefetched fingerprints may pollute the cache by evicting useful fingerprints. We observed that most of the prefetched fingerprints in a wide variety of applications are never used or used only once, which severely limits the performance of data deduplication systems. We introduce a prefetch-aware fingerprint cache management scheme for data deduplication systems (PreCache) to alleviate prefetch-related cache pollution. We propose three prefetch-aware fingerprint cache replacement policies (PreCache-UNU, PreCache-UOO, and PreCache-MIX) to handle different types of cache pollution. Additionally, we propose an adaptive policy selector to select suitable policies for prefetch requests. We implement PreCache on two representative data deduplication systems (Block Locality Caching and SiLo) and evaluate its performance utilizing three real-world workloads (Kernel, MacOS, and Homes). The experimental results reveal that PreCache improves deduplication throughput by up to 32.22% based on a reduction of on-disk fingerprint index lookups and improvement of the deduplication ratio by mitigating prefetch-related fingerprint cache pollution.
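As a rough illustration of the idea behind prefetch-aware replacement, the sketch below prefers to evict fingerprints that were prefetched but never (or only once) matched; all field and function names are hypothetical, and the actual PreCache-UNU/UOO/MIX policies and adaptive selector are more involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 4096            /* fingerprint cache capacity (assumed) */

struct fp_entry {
    uint8_t  fp[20];                /* chunk fingerprint (e.g. SHA-1) */
    bool     valid;
    bool     prefetched;            /* inserted by prefetch rather than on demand */
    uint32_t hits;                  /* times this fingerprint was actually matched */
    uint64_t last_use;              /* logical timestamp for LRU tie-breaking */
};

static struct fp_entry cache[CACHE_SLOTS];

/* Pick a victim slot: prefer prefetched-but-never-used entries, then
 * prefetched entries used only once, then fall back to plain LRU. */
static size_t pick_victim(void)
{
    size_t victim = 0;
    int best_class = 3;              /* lower class = better victim */
    uint64_t oldest = UINT64_MAX;

    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].valid)
            return i;                /* free slot, no eviction needed */
        int cls = 2;                 /* demand-inserted entry */
        if (cache[i].prefetched && cache[i].hits == 0)
            cls = 0;                 /* prefetched, never used */
        else if (cache[i].prefetched && cache[i].hits == 1)
            cls = 1;                 /* prefetched, used once */
        if (cls < best_class ||
            (cls == best_class && cache[i].last_use < oldest)) {
            best_class = cls;
            oldest = cache[i].last_use;
            victim = i;
        }
    }
    return victim;
}
```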

3.
Instruction prefetching techniques designed for general-purpose computer systems cannot meet the requirements of real-time systems. One important reason is that instruction cache pollution caused by useless prefetches makes the WCET estimates of real-time tasks imprecise, which degrades system schedulability and severely hurts system efficiency. Aiming to simplify WCET analysis of real-time tasks and lower their WCET estimates, an instruction prefetching method based on program basic blocks is proposed. The method performs instruction prefetching at basic-block granularity, avoiding the useless prefetches introduced by conventional instruction prefetching; by simplifying the determination of worst-case instruction access hits and misses, it simplifies the WCET analysis process and improves the WCET estimates. Evaluation on real-time benchmark programs shows that, compared with a conventional scheme without prefetching, the proposed method reduces the WCET estimates of real-time tasks by about 20% and improves average-case instruction cache access performance by about 10%.

4.
This paper proposes using a user-level memory thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: The correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
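A simplified software correlation table in the spirit of the approach above might look like the following; the table size, number of successors per entry, and hash are assumptions, and `issue_prefetch` merely stands in for pushing a line toward the main processor's L2.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 8192    /* correlation table entries (assumed) */
#define SUCCESSORS 4       /* successor misses remembered per entry (assumed) */

struct corr_entry {
    uintptr_t miss_addr;              /* address of the miss this entry describes */
    uintptr_t next[SUCCESSORS];       /* misses observed to follow it, most recent first */
};

static struct corr_entry table[TABLE_SIZE];
static uintptr_t last_miss;

static size_t hash_addr(uintptr_t a) { return (a >> 6) % TABLE_SIZE; }

/* Stand-in for pushing a line toward the main processor's L2. */
static void issue_prefetch(uintptr_t addr)
{
    __builtin_prefetch((const void *)addr, 0, 1);
}

/* Called on every observed L2 miss: learn the (last_miss -> addr) pair,
 * then prefetch the successors previously recorded for the current miss. */
void on_l2_miss(uintptr_t addr)
{
    struct corr_entry *prev = &table[hash_addr(last_miss)];
    if (prev->miss_addr == last_miss) {
        memmove(&prev->next[1], &prev->next[0],
                (SUCCESSORS - 1) * sizeof prev->next[0]);
        prev->next[0] = addr;
    } else {
        prev->miss_addr = last_miss;
        memset(prev->next, 0, sizeof prev->next);
        prev->next[0] = addr;
    }

    struct corr_entry *cur = &table[hash_addr(addr)];
    if (cur->miss_addr == addr)
        for (int i = 0; i < SUCCESSORS; i++)
            if (cur->next[i])
                issue_prefetch(cur->next[i]);

    last_miss = addr;
}
```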

5.
Memory affinity has become a key element to achieve scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes it harder to determine which data will be needed by each core and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism to improve memory affinity on STM applications that fit the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Then it employs data prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that our proposed mechanism can achieve performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.

6.
In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. The Cache Coherence With Data Prefetching (CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme on five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.

7.
The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding.

8.
With the rapid advance of integrated circuit manufacturing processes, it has become possible to implement large caches on chip, which greatly reduces the cache miss rate; at the same time, the penalty of a miss in such a large cache becomes more significant. By analyzing cache miss behavior, a new stride-adaptive prefetching mechanism for the L2 cache is designed. The mechanism takes full advantage of the fact that the L2 cache cannot see instruction addresses, and uses the miss address to index the prefetch table. Based on an analysis of the test results, suitable structural parameters were selected, effectively improving cache performance.
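A possible shape for such a miss-address-indexed stride table is sketched below; the region granularity, table size, and confidence thresholds are illustrative assumptions, not parameters from the paper.

```c
#include <stdint.h>

#define STRIDE_TABLE 1024
#define REGION_SHIFT 12       /* index by 4 KiB region of the miss address (assumed) */
#define MAX_CONF     3        /* maximum prefetch degree (assumed) */

struct stride_entry {
    uintptr_t region;         /* tag: region of the previous miss */
    uintptr_t last_addr;      /* previous miss address in this region */
    intptr_t  stride;         /* last observed stride */
    int       confidence;     /* consecutive repeats of the same stride */
};

static struct stride_entry st[STRIDE_TABLE];

/* Stand-in for issuing a prefetch into the L2. */
static void issue_prefetch(uintptr_t addr)
{
    __builtin_prefetch((const void *)addr, 0, 1);
}

/* Called on each L2 miss; the L2 sees no program counter, so the table
 * is indexed by the miss address (its memory region) instead. */
void l2_stride_prefetch(uintptr_t miss_addr)
{
    uintptr_t region = miss_addr >> REGION_SHIFT;
    struct stride_entry *e = &st[region % STRIDE_TABLE];

    if (e->region == region) {
        intptr_t stride = (intptr_t)(miss_addr - e->last_addr);
        if (stride != 0 && stride == e->stride) {
            if (e->confidence < MAX_CONF)
                e->confidence++;         /* stride confirmed again */
        } else {
            e->stride = stride;
            e->confidence = 0;           /* new or broken stride */
        }
        /* prefetch further ahead as confidence grows (adaptive degree) */
        for (int d = 1; d <= e->confidence; d++)
            issue_prefetch(miss_addr + (uintptr_t)(d * e->stride));
    } else {
        e->region = region;              /* new region: start learning */
        e->stride = 0;
        e->confidence = 0;
    }
    e->last_addr = miss_addr;
}
```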

9.
Journal of Systems Architecture, 1999, 45(12-13): 1047-1073
In this paper we provide a survey of hardware-based data cache prefetching strategies. We then present two new methods which improve both the accuracy and effectiveness of data cache prefetching. The first design ties data address prediction to the instruction prefetching logic, allowing data cache prefetching to work in tandem with dynamic branch prediction. The second mechanism prefetches link-based data structures, typically problematic data accesses for sequential prefetching schemes. Combining the two mechanisms we can improve data cache hit rates, while reducing memory bus traffic by as much as 50%.
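The paper's mechanisms are hardware designs; the snippet below only illustrates the underlying idea of prefetching link-based data structures, expressed in software with GCC/Clang's `__builtin_prefetch`.

```c
#include <stddef.h>

struct node {
    struct node *next;
    int payload;
};

/* Traverse a linked list while prefetching the next node, so the pointer
 * chase overlaps with the work performed on the current node. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        if (p->next != NULL)
            __builtin_prefetch(p->next, 0, 1);  /* fetch the next node early */
        sum += p->payload;
    }
    return sum;
}
```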

10.
A prefetch method that enables stride prefetching at the secondary cache without accessing the processor's internal resources is developed and evaluated. It uses a data-range-table that enables it to detect usable strides and memory access streams which fall into the same data range. Using program-driven simulation of scientific applications in the context of shared-memory multiprocessors, it is shown that the proposed method can reduce load stall times by an amount comparable to a conventional stride-driven prefetching method which requires access to the processor's instruction address register.

11.
Applying artificial intelligence techniques to computer architecture is a promising research direction; in particular, applying deep learning to data prefetching for multi-core architectures has become a research hotspot both in China and abroad. This paper studies cache prefetching based on deep learning and formally defines a deep learning cache prefetching model. After introducing common multi-core cache architectures and prefetching techniques, it comprehensively analyzes the design ideas of existing representative deep learning based cache prefetchers. Work on deep learning for multi-core cache prefetching mainly employs deep neural networks, recurrent neural networks, long short-term memory networks, and attention mechanisms. A comparative analysis of existing deep learning based data prefetching models shows that deep learning based multi-core cache prefetching still faces limitations in computational cost, model optimization, and practicality; adaptive prefetching models and the practicality of neural network prefetching models remain wide open directions for future research.

12.
李靖, 余建桥. 《计算机应用》, 2010, 30(7): 1950-1952
Data prefetching is key to caching in mobile databases. The CMIP prefetching strategy obtains prefetch data by mining association rules from the client's historical access records, which improves system performance. However, because it does not take the update rate and size of the data into account, cache invalidation occurs frequently. Building on this algorithm, the proposed approach adds checks on data update rate and size and sorts the candidate data before selecting the data to prefetch. The improvement lowers the cache invalidation rate and reduces data access time and energy consumption.
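A hedged sketch of how update rate and size might be folded into the candidate ranking is shown below; the scoring formula and field names are purely illustrative, not the paper's actual criteria.

```c
#include <stdlib.h>

/* Candidate item produced by association-rule mining (fields assumed). */
struct candidate {
    int    id;
    double confidence;    /* rule confidence, i.e. likelihood of access */
    double update_rate;   /* server-side updates per unit time */
    double size_kb;       /* item size in KB */
};

/* Rank candidates so that likely-accessed, rarely-updated, small items
 * come first; the weighting below is purely illustrative. */
static double score(const struct candidate *c)
{
    return c->confidence / ((1.0 + c->update_rate) * c->size_kb);
}

static int by_score_desc(const void *a, const void *b)
{
    double sa = score((const struct candidate *)a);
    double sb = score((const struct candidate *)b);
    return (sa < sb) - (sa > sb);
}

/* Sort the candidates and keep the best ones that fit the prefetch budget;
 * returns how many of cand[0..n) should be prefetched. */
size_t select_prefetch(struct candidate *cand, size_t n, double budget_kb)
{
    qsort(cand, n, sizeof *cand, by_score_desc);
    size_t kept = 0;
    double used = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (used + cand[i].size_kb > budget_kb)
            break;
        used += cand[i].size_kb;
        kept++;
    }
    return kept;
}
```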

13.
When caches aren't enough: data prefetching techniques
Vander Wiel, S.P.; Lilja, D.J. Computer, 1997, 30(7): 23-30
With data prefetching, memory systems call data into the cache before the processor needs it, thereby reducing memory-access latency. Using the most suitable techniques is critical to maximizing data prefetching's effectiveness. The authors review three popular prefetching techniques: software-initiated prefetching, sequential hardware-initiated prefetching, and prefetching via reference prediction tables.
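Of the three techniques reviewed, software-initiated prefetching is the one easily shown in source code; a minimal example with GCC/Clang's `__builtin_prefetch` and an assumed prefetch distance follows (the other two are hardware mechanisms).

```c
#define PDIST 16   /* prefetch distance in iterations; tuned per machine */

/* Software-initiated prefetching: prefetch instructions are inserted a
 * fixed distance ahead of the iteration currently being computed. */
void scale(double *dst, const double *src, long n, double k)
{
    for (long i = 0; i < n; i++) {
        if (i + PDIST < n) {
            __builtin_prefetch(&src[i + PDIST], 0, 1);  /* future read */
            __builtin_prefetch(&dst[i + PDIST], 1, 1);  /* future write */
        }
        dst[i] = k * src[i];
    }
}
```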

14.
As an active caching technique, prefetching can improve cache hit rates, but its effectiveness depends heavily on the prefetching policy adopted. This paper proposes an object-based bounded prefetching policy for proxy servers. By adjusting the size of the prefetch space in the proxy server, it prevents useless pages from occupying too much cache space, improving cache utilization and thus achieving a higher hit rate. Experiments show that the object-based bounded prefetching policy achieves a hit rate far higher than the LRU policy, and also a clear improvement over the object-based LRU policy.

15.
Helper-threaded prefetching on chip multiprocessors is a well-known approach to reducing memory latency and has been explored for linked data structure accesses. However, conventional helper-threaded prefetching often suffers from useless prefetches and cache thrashing, which affect its effectiveness. In this paper, we first analyze the shortcomings of conventional helper-threaded prefetching for linked data structures. Then we propose an improved helper-threaded prefetching, Skip Helper Threaded Prefetching, for hotspots with two-level data traversals. Our solution is to profile the applications and balance delinquent loads between the main thread and the prefetching thread based on the characteristics of the operations in their hotspots. Evaluations show that the proposed solution improves average performance by 8.9% (-O2) and 8.5% (-O3) over the conventional helper-threaded prefetching that greedily prefetches all delinquent loads. We also compare our proposal with active threaded prefetching, which synchronizes with the main thread via semaphores, and find that our proposal provides better performance for the targeted applications.

16.
Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance, it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5% to 30% compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations also is examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy.

17.
In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead.
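The sketch below gives one possible software view of a miss-trap-started prefetch engine; the descriptor layout and the `engine_start`/`lookup_desc` interfaces are hypothetical, and on the proposed hardware the engine would issue prefetches on its own rather than via `__builtin_prefetch`.

```c
#include <stddef.h>
#include <stdint.h>

/* Compiler-generated descriptor telling the engine what to fetch after a
 * second-level miss in a given code region (layout is assumed). */
struct prefetch_desc {
    intptr_t base;      /* offset of the first line relative to the miss address */
    intptr_t stride;    /* distance between prefetched lines */
    int      count;     /* number of lines to prefetch */
};

/* Hypothetical descriptor table filled in by the compiler, keyed by PC range. */
static const struct {
    uintptr_t pc_lo, pc_hi;
    struct prefetch_desc d;
} desc_table[2];

static const struct prefetch_desc *lookup_desc(uintptr_t pc)
{
    for (size_t i = 0; i < sizeof desc_table / sizeof desc_table[0]; i++)
        if (pc >= desc_table[i].pc_lo && pc < desc_table[i].pc_hi)
            return &desc_table[i].d;
    return NULL;
}

/* Software stand-in for the hardware prefetch engine. */
static void engine_start(uintptr_t addr, intptr_t stride, int count)
{
    for (int i = 0; i < count; i++)
        __builtin_prefetch((const void *)(addr + (uintptr_t)(i * stride)), 0, 1);
}

/* Second-level cache-miss trap handler: look up the compiler's descriptor
 * for the faulting PC, start the engine, and return to the program. */
void l2_miss_trap(uintptr_t miss_addr, uintptr_t pc)
{
    const struct prefetch_desc *d = lookup_desc(pc);
    if (d != NULL)
        engine_start((uintptr_t)((intptr_t)miss_addr + d->base), d->stride, d->count);
}
```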

18.
Different cache prefetching policies suit different access patterns. This paper surveys the state of the art in cache prefetching for storage systems. Starting from an analysis of access patterns, it constructs a triple model of access patterns and tests, on a disk array, an adaptive cache prefetching policy suited to complex environments. The results demonstrate that the adaptive policy can achieve optimal disk array performance across different environments.

19.
Non-sequential instruction prefetching at basic-block granularity
Instruction fetch capability has a major impact on microprocessor performance. Instruction prefetching can effectively reduce the instruction cache miss rate, improve the processor's ability to fetch instructions, and thus improve overall performance. This paper proposes a branch-directed, non-sequential instruction prefetching technique at basic-block granularity, in which each prefetch brings a complete basic block into the instruction cache. The method uses a static strategy to analyze program behavior and requires low hardware complexity to implement. Simulation results show that it effectively improves the instruction cache hit rate.

20.
A prefetching policy incorporating the state of the memory miss queue
As the gap between memory access speed and processor speed grows ever wider, memory performance has become the bottleneck for improving computer system performance. Based on an analysis of instruction cache and data cache miss behavior, a prefetching policy is proposed that incorporates the state of the memory miss queue. The policy preserves the order of instruction and data accesses, which helps in extracting prefetch streams, and separates instruction-stream and data-stream prefetching to avoid mutual replacement. When deciding when to issue prefetches, it considers not only whether the bus is currently idle but also the state of the miss queue, reducing interference with the processor's normal memory requests. A stream-filtering mechanism improves prefetch accuracy and lowers the memory bandwidth demanded by prefetching. Results show that with this policy the processor's average memory access latency is reduced by 30% and the IPC of SPEC CPU2000 programs improves by 8.3% on average.
