期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Lee Jaejin Jung Changhee Lim Daeseob Solihin Yan 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(9):1309-1324

This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle. 相似文献

2.

Maintaining Cache Coherence through Compiler-Directed Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》1998,53(2):170

In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. TheCache Coherence With Data Prefetching(CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme on five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency. 相似文献

3.

Multiple prefetch adaptive disk caching 总被引：1，自引：0，他引：1

Grimsrud K.S. Archibald J.K. Nelson B.E. 《Knowledge and Data Engineering, IEEE Transactions on》1993,5(1):88-103

A new disk caching algorithm is presented that uses an adaptive prefetching scheme to reduce the average service time for disk references. Unlike schemes which simply prefetch the next sector or group of sectors, this method maintains information about the order of past disk accesses which is used to accurately predict future access sequences. The range of parameters of this scheme is explored, and its performance is evaluated through trace-driven simulation, using traces obtained from three different UNIX minicomputers. Unlike disk trace data previously described in the literature, the traces used include time stamps for each reference. With this timing information-essential for evaluating any prefetching scheme-it is shown that a cache with the adaptive prefetching mechanism can reduce the average time to service a disk request by a factor of up to three, relative to an identical disk cache without prefetching 相似文献

4.

A PAB-Based Multi-Prefetcher Mechanism

Alexander Gendler Avi Mendelson Yitzhak Birk 《International journal of parallel programming》2006,34(2):171-188

Aggressive prefetching mechanisms improve performance of some important applications, but substantially increase bus traffic and “pressure” on cache tag arrays. They may even reduce performance of applications that are not memory bounded. We introduce a “feedback” mechanism, termed Prefetcher Assessment Buffer (PAB), which filters out requests that are unlikely to be useful. With this, applications that cannot benefit from aggressive prefetching will not suffer from their side-effects. The PAB is evaluated with different configurations, e.g., “all L1 accesses trigger prefetches” and “only misses to L1 trigger prefetches”. When compared with the non-selective concurrent use of multiple prefetchers, the PAB’s application to prefetching from main memory to the L2 cache can reduce the number of loads from main memory by up to 25% without losing performance. Application of more sophisticated techniques to prefetches between the L2- and L1-cache can increase IPC by 4% while reducing the traffic between the caches 8-fold. 相似文献

5.

同步数据触发体系结构中指令预取技术研究

下载免费PDF全文

郭建军戴葵王志英《计算机工程与科学》2009,31(8)

同步数据触发体系结构SDTA将传统指令级并行细化到微操作级并行,具有较高的数据处理能力,但其特殊的指令格式及指令特性,给指令Cache访问带来了挑战。指令预取技术能够有效地降低指令Cache的访问失效率,增强处理器取指能力,提高性能。本文分析了SDTA指令集特性,提出了一种适合SDTA指令集特性的软硬件相结合的混合指令预取机制,采用硬件预取引擎和软件提示相结合进行预取。该方法能够有效地提高指令Cache命中率,且具有实现简单、无效预取率低、不会增加代码体积等特点。相似文献

6.

Efficient Integration of Compiler-Directed Cache Coherence and Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》2001,61(12):1775

Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme. 相似文献

7.

Design and evaluation of a hierarchical decoupled architecture

Won W. Ro Stephen P. Crago Alvin M. Despain Jean-Luc Gaudiot 《The Journal of supercomputing》2006,38(3):237-259

The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability for future accesses and often fail when memory accesses do not contain much locality. To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces 19.7% of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC improves the performance by 17.2%. 相似文献

8.

The block distributed memory model 总被引：1，自引：0，他引：1

Jaja J.F. Kwan Woo Ryu 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(8):830-840

We introduce a computation model for developing and analyzing parallel algorithms on distributed memory machines. The model allows the design of algorithms using a single address space and does not assume any particular interconnection topology. We capture performance by incorporating a cost measure for interprocessor communication induced by remote memory accesses. The cost measure includes parameters reflecting memory latency, communication bandwidth, and spatial locality. Our model allows the initial placement of the input data and pipelined prefetching. We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, FFT, and matrix multiplication. We show that most of these algorithms achieve optimal or near optimal communication complexity while simultaneously guaranteeing an optimal speed-up in computational complexity. Ongoing experimental work in testing and evaluating these algorithms has thus far shown very promising results 相似文献

9.

Prefetch-aware fingerprint cache management for data deduplication systems

Mei LI Hongjun ZHANG Yanjun WU Chen ZHAO 《Frontiers of Computer Science》2019,13(3):500

Data deduplication has been widely utilized in large-scale storage systems, particularly backup systems. Data deduplication systems typically divide data streams into chunks and identify redundant chunks by comparing chunk fingerprints. Maintaining all fingerprints in memory is not cost-effective because fingerprint indexes are typically very large. Many data deduplication systems maintain a fingerprint cache in memory and exploit fingerprint prefetching to accelerate the deduplication process. Although fingerprint prefetching can improve the performance of data deduplication systems by leveraging the locality of workloads, inaccurately prefetched fingerprints may pollute the cache by evicting useful fingerprints. We observed that most of the prefetched fingerprints in a wide variety of applications are never used or used only once, which severely limits the performance of data deduplication systems. We introduce a prefetch-aware fingerprint cache management scheme for data deduplication systems (PreCache) to alleviate prefetch-related cache pollution. We propose three prefetch-aware fingerprint cache replacement policies (PreCache-UNU, PreCache-UOO, and PreCache-MIX) to handle different types of cache pollution. Additionally, we propose an adaptive policy selector to select suitable policies for prefetch requests. We implement PreCache on two representative data deduplication systems (Block Locality Caching and SiLo) and evaluate its performance utilizing three real-world workloads (Kernel, MacOS, and Homes). The experimental results reveal that PreCache improves deduplication throughput by up to 32.22% based on a reduction of on-disk fingerprint index lookups and improvement of the deduplication ratio by mitigating prefetch-related fingerprint cache pollution. 相似文献

10.

Improving Data Prefetching Efficacy in Multimedia Applications

Cucchiara Rita Prati Andrea Piccardi Massimo 《Multimedia Tools and Applications》2003,20(2):159-178

The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding. 相似文献

11.

Compile-time partitioning of iterative parallel loops to reducecache coherency traffic

Abraham S.G. Hudak D.E. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(3):318-328

Adaptive data partitioning (ADP) which reduces the execution time of parallel programs by reducing interprocessor communication for iterative parallel loops is discussed. It is shown that ADP can be integrated into a communication-reducing back end for existing parallelizing compilers or as part of a machine-specific partitioner for parallel programs. A multiprocessor model to analyze program execution factors that lead to interprocessor communication and a model for the iterative parallel loop to quantify communication patterns within a program are defined. A vector notation is chosen to quantify communication across a global data set. Communication parameters are computed by examining the indexes of array accesses and are adjusted to reflect the underlying system architecture by compensating for cache line sizes. These values are used to generate rectangular and hexagonal partitions that reduce interprocessor communication 相似文献

12.

A hybrid intelligent system to improve predictive accuracy for cache prefetching

Sohail Sarwar Zia Ul-Qayyum Owais Ahmed Malik 《Expert systems with applications》2012,39(2):1626-1636

Cache being the fastest medium in memory hierarchy has a vital role to play for fully exploiting available resources, concealing latencies in IO operations, languishing the impact of these latencies and hence in improving system response time. Despite plenty of efforts made, caches alone cannot comprehend larger storage requirements without prefetching. Cache prefetching is speculatively fetching data to restrain all delays. However, effective prefetching requires a strong prediction mechanism to load relevant data with higher degree of accuracy. In order to ameliorate the predictive performance of cache prefetching, we applied the hybrid of two AI approaches named case based reasoning (CBR) and artificial neural networks (ANN). CBR maintains the past experience and ANN are used in adaptation phase of CBR instead of employing static rule base. The novelty of technique in this domain is valued due to hybrid of two approaches as well as usage of suffix tree in populating the CBR’s case base. Suffix trees provide rich data patterns for populating case base and greatly enhanced the overall performance. A number of evaluations from different aspects with varying parameters are presented (along with some findings) where the efficacy of our technique is affirmed with improved predictive accuracy and reduced level of associated costs. 相似文献

13.

结合访存失效队列状态的预取策略 总被引：1，自引：0，他引：1

郇丹丹李祖松胡伟武刘志勇《计算机学报》2007,30(7):1104-1114

随着存储系统的访问速度与处理器的运算速度的差距越来越显著,访存性能已成为提高计算机系统性能的瓶颈.通过对指令Cache和数据Cache失效行为的分析,提出一种预取策略--结合访存失效队列状态的预取策略.该预取策略保持了指令和数据访问的次序,有利于预取流的提取.并将指令流和数据流的预取相分离,避免相互替换.在预取发起时机的选择上,不但考虑当前总线是否空闲,而且结合访存失效队列的状态,减小对处理器正常访存请求的影响.通过流过滤机制提高预取准确性,降低预取对访存带宽的需求.结果表明,采用结合访存失效队列状态的预取策略,处理器的平均访存延时减少30%,SPEC CPU2000程序的IPC值平均提高8.3%. 相似文献

14.

Stride prefetching for the secondary data cache

《Journal of Systems Architecture》2000,46(12):1093-1102

A prefetch method that enables stride prefetching at the secondary cache without accessing the processor's internal resources is developed and evaluated. It uses a data-range-table that enables it to detect usable strides and memory access streams which fall into the same data range. By using program driven simulation of scientific applications in the context of shared-memory multiprocessors, it is shown that the proposed method can reduce load stall times by an amount comparable to a conventional stride driven prefetching method which requires access to the processor's instruction address register. 相似文献

15.

Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications

Luís Fabrício Wanderley Góes Christiane Pousa Ribeiro Márcio Castro Jean-François Méhaut Murray Cole Marcelo Cintra 《International journal of parallel programming》2014,42(2):365-382

Memory affinity has become a key element to achieve scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes harder to determine which and when data will be needed by each core. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism to improve memory affinity on STM applications that fit the worklist pattern employing a two-level approach. First, it addresses memory affinity in the DRAM level by automatic selecting page allocation policies. Then it employs data prefetching helper threads to improve affinity in the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that our proposed mechanism can achieve performance improvements of up to 46 %, with an average of 11 %, over a baseline version on two NUMA multi-core machines. 相似文献

16.

Combining data prefetching with non-blocking loads to alleviate cache pollution effects

《Journal of Systems Architecture》1999,45(9):681-685

Data prefetching is a useful approach for reduction of memory access stalls in many scientific applications. However, it suffers from cache pollution severly in some applications. In this paper, we study the effectiveness of combining data prefetching with non-blocking loads on cache pollution and explain why it shows good result in our simulation. 相似文献

17.

Branch-directed and pointer-based data cache prefetching

《Journal of Systems Architecture》1999,45(12-13):1047-1073

In this paper we provide a survey of hardware-based data cache prefetching strategies. We then present two new methods which improve both the accuracy and effectiveness of data cache prefetching. The first design ties data address prediction to the instruction prefetching logic, allowing data cache prefetching to work in tandem with dynamic branch prediction. The second mechanism prefetches link-based data structures, typically problematic data accesses for sequential prefetching schemes. Combining the two mechanisms we can improve data cache hit rates, while reducing memory bus traffic by as much as 50%. 相似文献

18.

Web object-based storage management in proxy caches

《Future Generation Computer Systems》2006,22(1-2):16-31

Proxy caches are essential to improve the performance of the World Wide Web and to enhance user perceived latency. Appropriate cache management strategies are crucial to achieve these goals. In our previous work, we have introduced Web object-based caching policies. A Web object consists of the main HTML page and all of its constituent embedded files. Our studies have shown that these policies improve proxy cache performance substantially.In this paper, we propose a new Web object-based policy to manage the storage system of a proxy cache. We propose two techniques to improve the storage system performance. The first technique is concerned with prefetching the related files belonging to a Web object, from the disk to main memory. This prefetching improves performance as most of the files can be provided from the main memory rather than from the proxy disk. The second technique stores the Web object members in contiguous disk blocks in order to reduce the disk access time. We used trace-driven simulations to study the performance improvements one can obtain with these two techniques. Our results show that the first technique by itself provides up to 50% reduction in hit latency, which is the delay involved in providing a hit document by the proxy. An additional 5% improvement can be obtained by incorporating the second technique. 相似文献

19.

Stream data prefetcher for the GPU memory interface

Nuno Neves Pedro Tomás Nuno Roma 《The Journal of supercomputing》2018,74(6):2314-2328

相似文献

20.

深度学习在多核缓存预取中的应用研究综述

张建勋《计算机应用研究》2024,41(2)

当前人工智能技术应用于系统结构领域的研究前景广阔,特别是将深度学习应用于多核架构的数据预取研究已经成为国内外的研究热点。针对基于深度学习的缓存预取任务进行了研究,形式化地定义了深度学习缓存预取模型。在介绍当前常见的多核缓存架构和预取技术的基础上,全面分析了现有基于深度学习的典型缓存预取器的设计思路。深度学习神经网络在多核缓存预取领域的应用主要采用了深度神经网络、循环神经网络、长短期记忆网络和注意力机制等机器学习方法,综合对比分析现有基于深度学习的数据预取神经网络模型后发现,基于深度学习的多核缓存预取技术在计算成本、模型优化和实用性等方面还存在着局限性,未来在自适应预取模型以及神经网络预取模型的实用性方面还有很大的研究探索空间和发展前景。相似文献