期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

The Performance Optimization of Threaded Prefetching for Linked Data Structures

Yan Huang Jie Tang Zhi-min Gu Min Cai Jianxun Zhang Ninghan Zheng 《International journal of parallel programming》2012,40(2):141-163

Helper threaded prefetching based on Chip Multiprocessor is a well known approach to reducing memory latency and has been explored in linked data structures accesses. However, conventional helper threaded prefetching often suffers from useless prefetches and cache thrashing, which affect its effectiveness. In this paper, we first analyzed the shortcomings of conventional helper threaded prefetching for linked data structures. Then we proposed an improved helper threaded prefetching, Skip Helper Threaded Prefetching, for hotspots with two level data traversals. Our solution is to profile the applications and balance delinquent loads between main thread and prefetching thread based on the characteristic of operations in their hotspots. Evaluations show that the proposed solution improves average performance by 8.9% (-O2) and 8.5% (-O3) over the conventional helper threaded prefetching that greedily prefetches all delinquent loads. We also compare our proposal with the active threaded prefetching which synchronizes with main thread by semaphore, and find that our proposal provides better performance for the targeted applications. 相似文献

2.

反馈指导的链式数据结构预取优化

漆锋滨王飞李中升《软件学报》2009,20(Z1):34-39

传统的基于静态编译的数据预取大多针对数组访问,现在的应用程序中大量出现由指针构成的链式数据结构,依赖传统的编译方法难以进行预取优化.反馈式数据预取优化技术是当前高性能计算技术中前沿的一种编译优化手段,可以很好地解决链式结构的预取问题.在研究ORC编译器反馈式编译优化技术的基础上,针对Alpha结构的特点,对针对链式结构的反馈式数据预取进行了优化.SPEC2000测试表明,平均性能提高了4.1%. 相似文献

3.

Compiler Controlled Prefetching for Multiprocessors Using Low-Overhead Traps and Prefetch Engines

《Journal of Parallel and Distributed Computing》2000,60(5):585-615

In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead. 相似文献

4.

Threaded Prefetching: A New Instruction Memory Hierarchy for Real-Time Systems

Lee Minsuk Min Sang Lyul Shin Heonshik Kim Chong Sang Park Chang Yun 《Real-Time Systems》1997,13(1):47-65

Cache memories have been extensively used to bridge the speed gap between high speed processors and relatively slow main memory. However, they are not widely used in real-time systems due to their unpredictable performance. This paper proposes an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed threaded prefetching, an instruction block pointer called a thread is assigned to each instruction memory block and is made to point to the next block on the worst case execution path that is determined by a compile-time analysis. Also, the thread is not updated throughout the entire program execution to guarantee predictability. This paper also compares the worst case performances of various previous instruction prefetching schemes with that of the proposed threaded prefetching. By analyzing several benchmark programs, we show that the worst case performance of the proposed scheme is significantly better than those of previous instruction prefetching schemes. The results also show that when the block size is large enough the worst case performance of the proposed threaded prefetching scheme is almost as good as that of an instruction cache with 100 % hit ratio. 相似文献

5.

片上多处理器中时空结合的数据预取

胡同森刘敬伟《小型微型计算机系统》2012,33(8):1856-1861

片上多处理器中不同核的缺失地址序列之间通常存在一定的空间和时间相关性,为了充分利用该性质,本文提出时空结合的数据预取.该机制首先寻找核内缺失地址序列的相关性,在核内探索不到的情况下再寻找核间的相关性,因此可利用其它核的访存行为来预测本核可能即将发生的访存行为.实验结果表明,本文提出的数据预取机制可使测试程序的平均性能提高12.6%,与扩展应用在多核上的C/DC策略相比较,性能提高了3.8%. 相似文献

6.

Taxonomy of Data Prefetching for Multicore Processors

下载免费PDF全文

Surendra Byna Member IEEE Yong Chen 《计算机科学技术学报》2009,24(3):405-417

Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory.With hardware and/or software support,data prefetching brings data closer to a processor before it is actually needed.Many prefetching techniques have been developed for single-core processors.Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching t... 相似文献

7.

一种内存网格的数据预取算法

褚瑞卢锡城肖侬《软件学报》2006,17(11):2234-2244

内存网格(RAM(random access memory) grid)是一种面向广域网上内存资源共享的新型网格系统.它的主要目标是在物理内存不足的情况下,提高内存密集型应用或IO密集型应用的系统性能.内存网格的应用效果取决于网络通信开销.在减少或隐藏网络通信开销的情况下,其性能可以进一步提高.通过对内存网格的分析,设计了一种基于"推"数据的内存网格预取机制.借助数据挖掘领域中序列模式挖掘的方法,提出了相应的预取算法.通过基于真实运行状态的模拟,对预取算法进行了评估和验证. 相似文献

8.

Software Controlled Adaptive Pre-Execution for Data Prefetching

ákos Dudás Sándor Juhász Tamás Schrádi 《International journal of parallel programming》2012,40(4):381-396

Data prefetching mechanisms are widely used for hiding memory latency in data intensive applications. They mask the speed gap between CPUs and their memory systems by preloading data into the CPU caches, where accessing them is by at least one order of magnitude faster. Pre-execution is a combined prefetching method, which executes a slice of the original code preloading the code and its data at the same time. Pre-execution is often mentioned in the literature, but according to our knowledge, it has not been formally defined yet. We fill this void by presenting the formal definition of speculative and non-speculative pre-execution, and derive a lightweight software-based strategy which accelerates the main working thread by introducing an adaptive, non-speculative pre-execution helper thread. This helper thread acts as a perfect predictor, calculates memory addresses, prefetches the data and consumes cache misses early. The adaptive automatic control allows the helper thread to configure itself in run-time for best performance. The method is directly applicable to any data intensive application without requiring hardware modifications. Our method was able to achieve an average speedup of 10–30% in a real-life application. 相似文献

9.

Software Data Prefetching for Software Pipelined Loops

《Journal of Parallel and Distributed Computing》1999,58(2):236-259

This paper focuses on the interaction between software prefetching (both binding and nonbinding prefetch) and software pipelining for statically scheduled machines. First, it is shown that evaluating software pipelined schedules without considering memory effects can be rather inaccurate due to stalls caused by dependences with memory instructions (even if a lockup-free cache is considered). It is also shown that the penalty of the stalls is in general higher than the effect of spill code. Second, we show that, in general, binding schemes are more powerful than nonbinding ones for software pipelined schedules. Finally, the main contribution of this paper is an heuristic scheme that schedules some memory operations according to the locality estimated at compile time and other attributes of the dependence graph. The proposed scheme is shown to outperform other heuristic approaches since it achieves a better trade-off between compute and stall time than the others. 相似文献

10.

面向单线程应用的数据预取技术研究

欧国东张民选《计算机研究与发展》2007,44(Z1):140-147

多线程处理器的推广受限于应用,目前大部分应用尤其是桌面应用都是单线程程序,不能充分利用多线程处理器提供的多个现场,并行执行以提高速度.使用空闲现场加速单线程应用是目前研究的一个热点,研究主要集中在提高传统串行应用存储访问的效率和分支预测的精度.在基于线程的数据预取方法TDP中,数据预取线程是从主线程的执行踪迹中提取的,它们使用空闲的现场,和主线程并行执行.由于数据预取线程仅仅包括和预取相关的指令,它们比主线程执行要快,可以在主线程需要数据之前,把数据取到离处理器更近的存储层次.基于线程的数据预取方法能够有效地解决传统数据预取方法难以处理的诸多问题,如不规则内存访问模式.研究控制相关对TDP的影响,具体分析使用错误前瞻的数据预取方法:通过在预取线程中加入分支指令,并用它们控制预取线程的执行过程.通过研究发现,在某些情况下即使控制前瞻已经被证实是错误的,继续执行预取线程可以获得更好的预取效果.模拟结果显示,使用错误前瞻可以获得5%的性能提升. 相似文献