1.
To obtain significant execution speedups, GPUs rely heavily on the inherent data-level parallelism present in the targeted application. However, application programs may not always be able to fully utilize these parallel computing resources due to intrinsic data dependencies or complex data pointer operations. In this paper, we explore how to leverage aggressive software-based value prediction techniques on a GPU to accelerate programs that lack inherent data parallelism. This class of applications is typically difficult to map to parallel architectures because of the data dependencies and complex pointer manipulation they contain. Our experimental results show that, despite the overhead of software speculation and of communication between the CPU and GPU, we obtain up to 6.5× speedup on a selected set of kernels taken from the SPEC CPU2006, PARSEC, and Sequoia benchmark suites.
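The core idea, speculating past a loop-carried dependency by predicting the carried value and validating afterwards, can be sketched as follows. This is a minimal illustration of software value prediction in general, not the paper's actual GPU implementation; the stride predictor and the function names are assumptions.

```python
# Illustrative sketch: speculatively "parallelize" a loop with a
# loop-carried dependency (acc_i = acc_{i-1} + step(x_i)) by
# predicting each iteration's input value, then validating.

def run_speculative(xs, step, seed):
    n = len(xs)
    # 1. Predict the carried value at each iteration boundary with
    #    a simple stride predictor (assumed; real schemes vary).
    preds = [seed + i * step(xs[0]) for i in range(n)]
    # 2. Speculative phase: each iteration starts from its predicted
    #    input independently of the others, so all could run in parallel.
    outs = [preds[i] + step(xs[i]) for i in range(n)]
    # 3. Validation: where a predicted input disagrees with the real
    #    carried value, repair that iteration sequentially.
    acc = seed
    for i in range(n):
        if preds[i] != acc:              # misprediction detected
            outs[i] = acc + step(xs[i])  # re-execute this iteration
        acc = outs[i]
    return acc
```

When the predictor is right (e.g. a constant stride), no repair runs and every iteration was independent; mispredictions degrade gracefully to sequential re-execution.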
2.
Dongsoo Kang, Chen Liu, Jean-Luc Gaudiot, International Journal of Parallel Programming, 2008, 36(4):361-385
By executing two or more threads concurrently, Simultaneous MultiThreading (SMT) architectures are able to exploit both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) from the increased number of in-flight instructions fetched from multiple threads. However, due to incorrect control speculation, a significant number of these in-flight instructions are discarded from the pipelines of SMT processors, a direct consequence of these pipelines growing wider and deeper. Although increasing the accuracy of branch predictors may reduce the number of instructions discarded, prediction accuracy cannot easily be scaled up, since aggressive branch prediction schemes depend strongly on the predictability inherent in the application programs. In this paper, we present an efficient thread scheduling mechanism for SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling). It is easy to implement and allows an SMT processor to selectively perform speculative execution of threads according to the confidence level of branch predictions, thereby preventing wrong-path instructions from being fetched. SAFE-T reduces the number of discarded instructions by 57.9% on average and improves instructions-per-cycle (IPC) performance by 14.7% on average over the ICOUNT policy across the multi-programmed workloads we simulate.
This paper is an extended version of "Speculation Control for Simultaneous Multithreading," which appeared in the Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.
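The throttling policy the abstract describes, gating a thread's fetch while it has too many unresolved low-confidence branches, can be sketched like this. The limit of two in-flight low-confidence branches and the class structure are illustrative assumptions, not SAFE-T's exact hardware design.

```python
# Illustrative sketch of speculation-aware front-end throttling:
# a thread may not fetch while too many of its in-flight branches
# were predicted with low confidence.

LOW_CONF_LIMIT = 2   # assumed per-thread limit, not the paper's value

class Thread:
    def __init__(self):
        self.low_conf_inflight = 0   # unresolved low-confidence branches

    def may_fetch(self):
        return self.low_conf_inflight < LOW_CONF_LIMIT

    def on_branch_fetched(self, confident):
        if not confident:
            self.low_conf_inflight += 1

    def on_branch_resolved(self, confident):
        if not confident:
            self.low_conf_inflight -= 1

def pick_fetch_thread(threads):
    """Among threads allowed to fetch, prefer the one with the
    fewest doubtful branches outstanding (an ICOUNT-like tiebreak)."""
    eligible = [t for t in threads if t.may_fetch()]
    return min(eligible, key=lambda t: t.low_conf_inflight, default=None)
```

The effect is the one the abstract claims: a thread running down a doubtful path stops polluting the shared front end until its branches resolve.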
3.
Hisashi Hayashi, Kenta Cho, Akihiko Ohsuga, Electronic Notes in Theoretical Computer Science, 2002, 70(5)
In some multi-agent systems, when an agent cannot retrieve information from another agent, it makes an assumption and tentatively continues the computation; when the agent later discovers that the assumption was mistaken, the computation is revised. This kind of speculative computation is effective when the assumption is correct. However, once the agent has executed an action, these systems can no longer revise the computation. This paper shows how to integrate speculative computation and action execution through logic programming.
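The pattern of computing under a default assumption and revising when the real answer arrives can be sketched in a few lines. The paper formulates this in logic programming; the Python below is only a procedural analogue, and all names are illustrative.

```python
# Illustrative sketch of speculative computation with revision:
# proceed on a default assumption while the real answer is
# unavailable, then revise if the answer contradicts it.

def speculative_query(compute, default, fetch_real):
    tentative = compute(default)   # speculative phase: don't wait
    real = fetch_real()            # the other agent finally answers
    if real == default:
        return tentative           # assumption held: work is reused
    return compute(real)           # assumption failed: recompute
```

The abstract's key caveat is visible here: revision only works while the effects are recomputable; once an irreversible external action has been taken, `compute(real)` cannot undo it.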
4.
The MapReduce framework has become the de facto standard for big data processing thanks to its attractive features, one of which is that it automatically parallelizes a job into multiple tasks and transparently handles task execution on a large cluster of commodity machines. The increasing heterogeneity of distributed environments may result in a few straggling tasks, which prolong job completion. Speculative execution was proposed to mitigate stragglers, but the existing mechanism does not work efficiently, as many speculative tasks are still slower than their originals. In this paper, we explore an approach to increase the efficiency of speculative execution and thereby improve MapReduce performance. We propose the Partial Speculative Execution (PSE) strategy, which starts speculative tasks from a checkpoint. By leveraging the checkpoints of original tasks, PSE eliminates the cost of re-reading, re-copying, and re-computing already-processed data. We implement PSE in Hadoop and evaluate its job completion time and speculation efficiency under several classical workloads. Experimental results show that, in heterogeneous environments with stragglers, PSE completes jobs 56% faster than with no speculation and 12% faster than with LATE, an improved speculative execution algorithm. In addition, PSE improves the efficiency of speculative execution by 24% on average compared to LATE.
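The difference between classic speculation and PSE is where the backup task starts. A minimal sketch, assuming a toy task that sums records and checkpoints its index and partial result (Hadoop's actual checkpoint format and granularity are not this):

```python
# Illustrative sketch of Partial Speculative Execution: the backup
# copy resumes from the straggler's checkpoint instead of
# re-reading and re-computing the input from record zero.

def run_task(records, checkpoint=None, stop_at=None):
    """Sum `records`. `stop_at` simulates a straggler that stalls
    at that index, exposing its last checkpoint."""
    if checkpoint:
        i, acc = checkpoint["index"], checkpoint["sum"]
    else:
        i, acc = 0, 0
    while i < len(records):
        if stop_at is not None and i == stop_at:
            return None, {"index": i, "sum": acc}   # stalled
        acc += records[i]
        i += 1
    return acc, {"index": i, "sum": acc}

records = [1, 2, 3, 4, 5]
_, ckpt = run_task(records, stop_at=3)            # original task stalls
result, _ = run_task(records, checkpoint=ckpt)    # speculative copy resumes
```

A plain speculative task would redo records 0-2; the PSE-style copy pays only for the remaining work, which is exactly the saving the abstract quantifies.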
5.
Optimizing Speculative Multithreading with Continuous Two-Phase Online Profiling
Offline profiling, as used in current speculative multithreading optimization, is limited by its training input sets. To address this, we propose a dynamic optimization method that automatically transforms speculative multithreaded programs according to online profiling results. The method performs profiling and optimization at run time, requiring neither a separate profiling pass nor a general-purpose training input set. It also applies to programs whose run-time behavior changes in phases. Experiments show that, in guiding transaction partitioning and selecting loops for parallelization, the dynamic method achieves results comparable to static optimization, and can fully substitute for offline profiling when the latter fails.
6.
Many-core processors built from hundreds or thousands of cores provide abundant computing capacity, but also pose the challenge of using those resources efficiently. Applications with different degrees of parallelism need different numbers of cores; an unreasonable allocation either wastes resources (too many cores) or limits the exploitation of parallelism (too few). Targeting the core-allocation problem faced by thread-level speculative execution of sequential programs on many-core architectures, we propose a hardware-based mechanism for monitoring and evaluating speculation capability, and design three thread-level speculation capability evaluators. An evaluator adjusts the number of cores allocated to an application in real time, following the dynamic changes in the sequential program's speculation capability. Experimental results show that a single evaluator with very small hardware overhead suffices to guide resource allocation for thread-level speculation on a many-core platform and to strike an effective balance between performance and resource utilization.
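A simple feedback rule captures the evaluator's job: grow the allocation while speculative threads mostly commit, shrink it while they mostly squash. The window-based commit/squash ratio and the thresholds below are illustrative assumptions, not the paper's hardware design.

```python
# Illustrative sketch of speculation-capability-driven core
# allocation: resize the core count from observed commit/squash
# outcomes over the last monitoring window.

def adjust_cores(cores, commits, squashes,
                 lo=0.5, hi=0.8, min_cores=1, max_cores=64):
    """Return the new core allocation given one window's outcomes.
    `lo`/`hi` are assumed success-rate thresholds."""
    total = commits + squashes
    if total == 0:
        return cores                 # no evidence this window
    success = commits / total
    if success > hi and cores < max_cores:
        return cores + 1             # speculation is paying off: grow
    if success < lo and cores > min_cores:
        return cores - 1             # mostly squashes: free wasted cores
    return cores
```

Run once per window, this keeps cores away from programs whose speculation capability has collapsed, which is the performance/utilization balance the abstract reports.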
7.
Speculative parallel multithreading based on transactional execution is a new parallel programming and compilation technique well suited to future multi-core microprocessor architectures. Program execution under this model, however, is more complex, and simulating it becomes a key problem. This paper proposes functionally simulating speculative parallel multithreaded programs through dynamic binary instrumentation, and designs and implements a complete software platform that accurately simulates and monitors thread-level speculative execution, detecting memory-access conflicts and thereby realizing the semantics of speculative parallel multithreading. The platform also serves as a basis for further study of the real execution behavior of speculative multithreaded programs, and effectively supports the design and analysis of speculative parallelizing compilers.
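The conflict check such a simulator must perform can be reduced to comparing the read and write sets the instrumentation records. The set-based formulation below is an illustrative simplification; the platform itself works at the binary level, not on Python sets.

```python
# Illustrative sketch of the memory-conflict detection at the heart
# of thread-level speculation: a more-speculative thread must be
# squashed if it read a location that a less-speculative thread
# wrote, since under sequential semantics that read came too early.

def must_squash(earlier_writes, later_reads):
    """earlier_writes: addresses written by the less-speculative
    thread; later_reads: addresses read by the more-speculative one.
    Returns True on a read-after-write violation."""
    return bool(earlier_writes & later_reads)
```

An instrumentation platform populates these sets by intercepting every load and store of each speculative thread, then applies this check when the earlier thread commits.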
8.
Po-Yung Chang, Eric Hao, Yale N. Patt, Pohua P. Chang, International Journal of Parallel Programming, 1996, 24(3):209-234
Conditional branches incur a severe performance penalty in wide-issue, deeply pipelined processors. Speculative execution [1, 2] and predicated execution [3-9] are two mechanisms that have been proposed for reducing this penalty. Speculative execution can completely eliminate the penalty associated with a particular branch, but requires accurate branch prediction to be effective. Predicated execution does not require accurate branch prediction to eliminate the branch penalty, but is not applicable to all branches and can increase latencies within the program. This paper examines the performance benefit of using both mechanisms together: predicated execution handles the hard-to-predict branches, identified by profiling, while speculative execution handles the remaining branches. We show that this approach can significantly reduce the branch execution penalty suffered by wide-issue processors.
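What predication buys for a hard-to-predict branch can be shown with a software analogue of if-conversion: both arms are computed and a predicate selects the result, so no branch exists to mispredict. This bit-mask trick is only an illustration of the idea (restricted here to integer operands); real predicated execution is a hardware feature.

```python
# Illustrative contrast between a branch (which a predictor must
# guess) and its predicated, branch-free equivalent.

def with_branch(p, a, b):
    if p:              # a conditional branch; mispredicting it
        return a       # costs a pipeline flush
    return b

def predicated(p, a, b):
    """Branch-free select for integers a, b: both values are
    computed, and a mask derived from the predicate picks one."""
    m = -int(bool(p))          # all-ones mask when p is true, else 0
    return (a & m) | (b & ~m)  # no control flow to mispredict
```

The trade-off the abstract names is visible: `predicated` always does both arms' work (longer latency), which is why it is reserved for the branches profiling flags as hard to predict.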
9.
Zhang Jun (张骏), Journal of Chinese Computer Systems (《小型微型计算机系统》), 2012, 33(5):987-994
The "memory wall" has become the main obstacle to improving processor performance, and the cache pollution caused by data preloaded when the core speculatively executes memory instructions on predicted paths severely degrades performance. This paper proposes CSDA, a method for controlling the cache pollution caused by data preloaded during speculative execution. First, confidence estimation separates the paths most likely to be wrong from all predicted paths. Then, guided by a history table that identifies low-confidence polluting memory instructions, the memory instructions on low-confidence predicted paths are classified as prefetching or polluting; a low-priority load/store queue is established for the polluting instructions, and a dedicated pollution-data cache stores the polluted data. Simulation results show that in dual-core mode, relative to the baseline, CSDA reduces the L1 D-Cache miss rate by 9%-23% (17% on average) and the L2 cache miss rate by 1.02%-14.39% (5.67% on average), and improves IPC by 0.19%-5.59% (2.21% on average).
10.
As microprocessor designs move towards deeper pipelines and support for multiple instruction issue, steps must be taken to alleviate the negative impact of branch operations on processor performance. One approach is to use branch prediction hardware and perform speculative execution of the instructions following an unresolved branch. Another technique is to eliminate certain branch instructions altogether by translating the instructions following a forward branch into predicate form. Both these techniques are employed in many current processor designs. This paper investigates the relationship between branch prediction techniques and branch predication. In particular, we are interested in how using predication to remove a certain class of poorly predicted branches affects the prediction accuracy of the remaining branches. A variety of existing predication models for eliminating branch operations are presented, and the effect that eliminating branches has on branch prediction schemes ranging from simple prediction mechanisms to the newer more sophisticated branch predictors is studied. We also examine the impact of predication on basic block size, and how the two techniques used together affect overall processor performance.
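The measurement at the heart of this study can be reproduced in miniature: simulate a standard 2-bit saturating-counter predictor over a branch trace, with and without the hard-to-predict branches that predication would eliminate. The traces below are synthetic, made up purely to illustrate the effect.

```python
# Illustrative sketch: a 2-bit saturating-counter branch predictor,
# one counter per static branch. Predict taken when counter >= 2.

def predict_accuracy(trace):
    """trace: list of (branch_id, taken) outcomes, in program order."""
    counters, correct = {}, 0
    for pc, taken in trace:
        c = counters.get(pc, 2)                  # start weakly taken
        if (c >= 2) == taken:
            correct += 1
        # Saturating update toward the actual outcome.
        counters[pc] = min(3, c + 1) if taken else max(0, c - 1)
    return correct / len(trace)

biased = [("b0", True)] * 8                      # easy, always-taken branch
noisy  = [("b1", i % 2 == 0) for i in range(8)]  # hard, alternating branch
mixed  = biased + noisy
```

Comparing `predict_accuracy(mixed)` with `predict_accuracy(biased)` shows the headline question in miniature: predicating away the noisy branch raises the measured accuracy of what remains, even though no individual predictor improved.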