期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Efficient Integration of Compiler-Directed Cache Coherence and Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》2001,61(12):1775

Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme. 相似文献

2.

基于CMP的指针数据预取方法

下载免费PDF全文

朱会东黄永雨宋宝卫《计算机工程》2011,37(6):71-73

针对现代计算机系统中的存储墙问题,提出一种适合于链式数据结构的数据预取方法——纯遍历推送方法。采用基于共享高速缓存的多核处理器平台CMP上的多线程技术,在主程序运行时分离出一个推送线程,由其将主线程需要的数据提前预取至处理器共享高速缓存中以隐藏主线程的存储器延迟。实验结果证明该方法在CMP架构下对以链式结构为主的内存受限程序的性能有一定的改进。相似文献

3.

结合访存失效队列状态的预取策略 总被引：1，自引：0，他引：1

郇丹丹李祖松胡伟武刘志勇《计算机学报》2007,30(7):1104-1114

随着存储系统的访问速度与处理器的运算速度的差距越来越显著,访存性能已成为提高计算机系统性能的瓶颈.通过对指令Cache和数据Cache失效行为的分析,提出一种预取策略--结合访存失效队列状态的预取策略.该预取策略保持了指令和数据访问的次序,有利于预取流的提取.并将指令流和数据流的预取相分离,避免相互替换.在预取发起时机的选择上,不但考虑当前总线是否空闲,而且结合访存失效队列的状态,减小对处理器正常访存请求的影响.通过流过滤机制提高预取准确性,降低预取对访存带宽的需求.结果表明,采用结合访存失效队列状态的预取策略,处理器的平均访存延时减少30%,SPEC CPU2000程序的IPC值平均提高8.3%. 相似文献

4.

Memory-side prefetching for linked data structures for processor-in-memory systems

《Journal of Parallel and Distributed Computing》2005,65(4):448-463

This paper studies a memory-side prefetching technique to hide latency incurred by inherently serial accesses to linked data structures (LDS). A programmable engine sits close to memory and traverses LDS independently from the processor. The engine can run ahead of the processor because of its low latency path to memory, allowing it to initiate data transfers earlier than the processor and pipeline multiple transfers over the network. We evaluate the proposed memory-side prefetching scheme for the Olden benchmarks on a processor-in-memory system. For the six benchmarks where LDS memory stall time is significant, the memory-side scheme reduces execution time by an average of 27% compared to a system without any prefetching. Compared to a state-of-the-art processor-side software prefetching scheme, the memory-side scheme reduces execution time in the range of 20–50% for three of the six applications, is about the same for two applications, and is worse by 18% for one application. We conclude that our memory-side scheme is effective, but a combination of the processor- and memory-side prefetching schemes is best and provide a qualitative framework to determine when either scheme should be used. 相似文献

5.

Improving Data Prefetching Efficacy in Multimedia Applications

Cucchiara Rita Prati Andrea Piccardi Massimo 《Multimedia Tools and Applications》2003,20(2):159-178

The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding. 相似文献

6.

Integrating Fine-Grained Message Passing in Cache Coherent Shared Memory Multiprocessors

David K. Poulsen Pen-Chung Yew 《Journal of Parallel and Distributed Computing》1996,33(2):172

This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor instruction overheads, cache miss ratios, and memory system bandwidth requirements, and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied by using a parallelizing compiler and are evaluated via execution-driven simulations of large, optimized, numerical application codes with loop-level and vector parallelism. 相似文献

7.

The impact of parallel loop scheduling strategies on prefetching ina shared memory multiprocessor

Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(6):573-584

Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance, it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5% to 30% compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations also is examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy 相似文献

8.

Cache Profiling技术 总被引：1，自引：0，他引：1

周谦冯晓兵张兆庆《计算机工程》2006,32(13):47-48

如何减少和隐藏cache失效的延迟，是人们关注的热点。编译器为了得到cache访问命中的情况，往往使用模拟器去跑一遍来得到结果，这样的速度很慢。为了克服以上缺点，提出了在编译器中作cache profiling来获取cache访问的信息。类似于value profiling和stride profiling，cache profiling对访存指令作插装，可以有效地提高速度，并且只需要编译器的支持即可。Cache profiling获得的信息可以用来改进指令调度、软件预取、生成cache hint和辅助线程等。相似文献

9.

Compiler analysis for cache coherence: interprocedural array data-flow analysis and its impact on cache performance

Choi L. Pen-Chung Yew 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(9):879-896

In this paper, we present compiler algorithms for detecting references to stale data in shared-memory multiprocessors. The algorithm consists of two key analysis techniques, state reference detection and locality preserving analysis. While the stale reference detection finds the memory reference patterns that may violate cache coherence, the locality preserving analysis minimizes the number of such stale references by analyzing both temporal and spatial reuses. By computing the regions referenced by arrays inside loops, we extend the previous scalar algorithms for more precise analysis. We develop a full interprocedural array data-flow algorithm, which performs both bottom-up side-effect analysis and top-down context analysis on the procedure call graph to further exploit locality across procedure boundaries. The interprocedural algorithm eliminates cache invalidations at procedure boundaries, which were assumed in the previous compiler algorithms. We have fully implemented the algorithm in the Polaris parallelizing compiler. Using execution-driven simulations on Perfect Club benchmarks, we demonstrate how unnecessary cache misses can be eliminated by the automatic stale reference detection. The algorithm can be used to implement cache coherence in the shared-memory multiprocessors that do not have hardware directories, such as Cray T3D. 相似文献

10.

集群下Cholesky分解的核外预取算法

刘凤刘青昆《微型机与应用》2011,30(4)

核外计算中,由于I/O操作速度比较慢,所以对文件的访阿时间占的比例较大.如果使文件操作和计算重叠则可以大幅度地提高运行效率.软件数据预取是一种有效的隐藏存储延迟的技术,通过预取使数据在实际使用之前从硬盘读到缓存中,提高了缓存(cache)的命中率,降低了读取数据的时间.通过设置两个缓冲区来轮流存放本次和下一次读入的数据块,实现访存完全命中cache的效果,使Cholesky分解并行程序执行核外计算的效率得到了大幅度的提高.同时,I/O操作的时间与CPU的执行时间的比例也是影响效率的主要因素. 相似文献

11.

Design and analysis of a scalable cache coherence scheme based onclocks and timestamps

Min S.L. Baer J.-L. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(1):25-44

A timestamp-based software-assisted cache coherence scheme that does not require any global communication to enforce the coherence of multiple private caches is proposed. It is intended for shared memory multiprocessors. The scheme is based on a compile-time marking of references and a hardware-based local incoherence detection scheme. The possible incoherence of a cache entry is detected and the associated entry is implicitly invalidated by comparing a clock (related to program flow) and a timestamp (related to the time of update in the cache). Results of a performance comparison, which is based on a trace-driven simulation using actual traces. between the proposed timestamp-based scheme and other software-assisted schemes indicate that the proposed scheme performs significantly better than previous software-assisted schemes, especially when the processors are carefully scheduled so as to maximize the reuse of cache contents. This scheme requires neither a shared resource nor global communication and is, therefore, scalable up to a large number of processors 相似文献

12.

The Performance Optimization of Threaded Prefetching for Linked Data Structures

Yan Huang Jie Tang Zhi-min Gu Min Cai Jianxun Zhang Ninghan Zheng 《International journal of parallel programming》2012,40(2):141-163

Helper threaded prefetching based on Chip Multiprocessor is a well known approach to reducing memory latency and has been explored in linked data structures accesses. However, conventional helper threaded prefetching often suffers from useless prefetches and cache thrashing, which affect its effectiveness. In this paper, we first analyzed the shortcomings of conventional helper threaded prefetching for linked data structures. Then we proposed an improved helper threaded prefetching, Skip Helper Threaded Prefetching, for hotspots with two level data traversals. Our solution is to profile the applications and balance delinquent loads between main thread and prefetching thread based on the characteristic of operations in their hotspots. Evaluations show that the proposed solution improves average performance by 8.9% (-O2) and 8.5% (-O3) over the conventional helper threaded prefetching that greedily prefetches all delinquent loads. We also compare our proposal with the active threaded prefetching which synchronizes with main thread by semaphore, and find that our proposal provides better performance for the targeted applications. 相似文献

13.

Using the First-Level Caches as Filters to Reduce the Pollution Caused by Speculative Memory References

Onur?Mutlu Email author Hyesoon?Kim David?N.?Armstrong Yale?N.?Patt 《International journal of parallel programming》2005,33(5):529-559

High-performance processors employ aggressive branch prediction and prefetching techniques to increase performance. Speculative memory references caused by these techniques sometimes bring data into the caches that are not needed by correct execution. This paper proposes the use of the first-level caches as filters that predict the usefulness of speculative memory references. With the proposed technique, speculative memory references bring data only into the first-level caches rather than all levels in the cache hierarchy. The processor monitors the use of the cache blocks in the first-level caches and decides which blocks to keep in the cache hierarchy based on the usefulness of cache blocks. It is shown that a simple implementation of this technique usually outperforms inclusive and exclusive baseline cache hierarchies commonly used by today’s processors and results in IPC performance improvements of up to 10% on the SPEC CPU2000 integer benchmarks. 相似文献

14.

基于弱一致性模型软件数据预取策略 总被引：1，自引：0，他引：1

窦勇周兴铭《软件学报》1997,8(2):81-86

本文针对分布共享存储器中存在的远地访问大延迟问题，提出了基于弱序一致性模型的存储访问优化策略，主要是利用并行程序中同步操作提供的信息，在同步点成块预取将要被使用的数据．该方法能够有效地掩盖远地存储访问的大延迟. 相似文献

15.

Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications

Luís Fabrício Wanderley Góes Christiane Pousa Ribeiro Márcio Castro Jean-François Méhaut Murray Cole Marcelo Cintra 《International journal of parallel programming》2014,42(2):365-382

Memory affinity has become a key element to achieve scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes harder to determine which and when data will be needed by each core. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism to improve memory affinity on STM applications that fit the worklist pattern employing a two-level approach. First, it addresses memory affinity in the DRAM level by automatic selecting page allocation policies. Then it employs data prefetching helper threads to improve affinity in the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that our proposed mechanism can achieve performance improvements of up to 46 %, with an average of 11 %, over a baseline version on two NUMA multi-core machines. 相似文献

16.

Predicting and precluding problems with memory latency

Boland K. Dollas A. 《Micro, IEEE》1994,14(4):59-67

By examining the rate at which successive generations of processor and DRAM cycle times have been diverging over time, we can track the latency problem of computer memory systems. Our research survey starts with the fundamentals of single-level caches and moves to the need for multilevel cache hierarchies. We look at some of the current techniques for boosting cache performance, especially compiler-based methods for code restructuring and instruction and data prefetching. These two areas will likely yield improvements for a much larger domain of applications in the future 相似文献

17.

快速地址计算的自适应栈高速缓存

郇丹丹李祖松王剑章隆兵胡伟武刘志勇《计算机研究与发展》2007,44(1):169-176

随着存储系统的访问速度与处理器运算速度的差距越来越显著,访存性能已成为提高处理器性能的瓶颈.通过对程序的访存行为进行分析,提出快速地址计算的自适应栈高速缓存方案.该方案将栈访问从数据高速缓存的访问中分离出来,充分利用栈空间数据访问的特点,提高指令级并行度,减少数据高速缓存污染,降低数据高速缓存失效率,并采用快速地址计算策略,减少栈访问的命中时间.该栈高速缓存在发生栈溢出时能够自适应地关闭,以避免栈切换对处理器性能的影响.栈高速缓存标志中增加进程标识,进程切换时不需要将数据写到低层存储系统中,适用于多进程环境.SPEC CPU2000程序运行结果表明,采用快速地址计算的自适应栈高速缓存方案,25.8%的访存指令可以并行执行,数据高速缓存失效率平均降低9.4%,IPC值平均提高6.9%. 相似文献

18.

High Performance Software Coherence for Current and Future Architectures

《Journal of Parallel and Distributed Computing》1995,29(2):179-195

Shared memory provides an attractive and intuitive programming model for large-scale parallel computing, but requires a coherence mechanism to allow caching for performance while ensuring that processors do not use stale data in their computation. Implementation options range from distributed shared memory emulations on networks of workstations to tightly coupled fully cache-coherent distributed shared memory multiprocessors. Previous work indicates that performance varies dramatically from one end of this spectrum to the other. Hardware cache coherence is fast, but also costly and time-consuming to design and implement, while DSM systems provide acceptable performance on only a limit class of applications. We claim that an intermediate hardware option-memory-mapped network interfaces that support a global physical address space, without cache coherence-can provide most of the performance benefits of fully cache-coherent hardware, at a fraction of the cost. To support this claim we present a software coherence protocol that runs on this class of machines, and use simulation to conduct a performance study. We look at both programming and architectural issues in the context of software and hardware coherence protocols. Our results suggest that software coherence on NCC-NUMA machines in a more cost-effective approach to large-scale shared-memory multiprocessing than either pure distributed shared memory or hardware cache coherence. 相似文献

19.

片上多处理器中基于步长和指针的预取 总被引：1，自引：1，他引：0

下载免费PDF全文

肖俊华冯子军章隆兵《计算机工程》2009,35(4):58-60

在对大量程序访存行为进行分析的基础上,提出基于步长和指针的预取方法。能捕获规整的数据访问模式和指针访问模式。在L2cache和内存之间采用全局历史缓存实现该预取方法。全系统模拟结果表明,该预取方法对商业应用测试程序的性能平均提高14％,对科学计算测试程序的性能平均提高34．5％。相似文献

20.

When caches aren't enough: data prefetching techniques 总被引：1，自引：0，他引：1

Vander Wiel S.P. Lilja D.J. 《Computer》1997,30(7):23-30

With data prefetching, memory systems call data into the cache before the processor needs it, thereby reducing memory-access latency. Using the most suitable techniques is critical to maximizing data prefetching's effectiveness. The authors review three popular prefetching techniques: software-initiated prefetching, sequential hardware-initiated prefetching, and prefetching via reference prediction tables 相似文献