1.
《Journal of Systems Architecture》2014,60(5):420-430
General-purpose graphics processing units (GPGPUs) play an important role in massively parallel computing today. A GPGPU core typically holds thousands of threads, and the hardware threads are organized into warps. With the single instruction multiple thread (SIMT) pipeline, a GPGPU can achieve high performance. However, threads in the same warp that take different branches violate the SIMD style and cause branch divergence. To support this, a hardware stack is used to execute all branches sequentially, so branch divergence degrades performance. This article represents the PDOM (post dominator) stack as a binary tree in which each leaf corresponds to a branch target. We propose a new PDOM stack called PDOM-ASI, which can schedule all the tree leaves. The new stack can hide more long-latency operations by providing more schedulable warps, without the problem of warp over-subdivision. In addition, a multi-level warp scheduling policy is proposed, which lets some of the warps run ahead and creates more opportunities to hide latency. The simulation results show that our policies achieve a 10.5% performance improvement over baseline policies with only 1.33% hardware area overhead.
2.
Multi-core systems equipped with micro processing units and accelerators such as digital signal processors (DSPs) and graphics processing units (GPUs) have become a major trend in processor design in recent years, in an attempt to meet ever-increasing application performance requirements. Open Computing Language (OpenCL) is one of the programming languages that include new extensions proposed to exploit the computing power of these kinds of processors. Among the newly extended language features, single-instruction multiple-data (SIMD) linguistics and vector types are added to OpenCL to exploit hardware features of the accelerators. This addition makes it necessary to consider how traditional compiler data flow analysis can be adapted to meet the optimization requirements of vector linguistics. In this paper, we propose a calculus framework to support the data flow analysis of vector constructs for OpenCL programs, which compilers can use to perform SIMD optimizations. We model OpenCL vector operations as data access functions in the style of mathematical functions. We then show that the data flow analysis for OpenCL vector linguistics can be performed based on the data access functions. Based on the information gathered from data flow analysis, we illustrate a set of SIMD optimizations on OpenCL programs. The experimental results incorporating our calculus and our proposed compiler optimizations show that the proposed SIMD optimizations can provide average performance improvements of 22% on x86 CPUs and 4% on AMD GPUs. Of the 15 selected benchmarks, 11 are improved on x86 CPUs and six are improved on AMD GPUs. The proposed framework has the potential to be used to construct other SIMD optimizations on OpenCL programs. Copyright © 2015 John Wiley & Sons, Ltd.
3.
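GPUs are widely used in today's high-performance computing systems, but their performance is severely limited by divergent control flow at run time. This problem is usually addressed by dynamic warp regrouping, which combines threads from one or more warps that follow the same control-flow path into a new warp. However, such methods commonly perform unnecessary regrouping, which introduces considerable additional performance overhead. This paper analyzes the performance overhead of thread regrouping and proposes a performance optimization called "partial regrouping". While preserving regrouping efficiency, the method avoids regrouping warps that contain a large number of active threads, thereby effectively reducing the overhead that thread regrouping introduces. Test results show that partial regrouping delivers a clear performance improvement while preserving regrouping efficiency.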
4.
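To improve the quality of service of real-time multimedia communication, a composite indicator based on RTT is defined that jointly accounts for the effects of network delay and network jitter on real-time streaming media. On this basis, an improved adaptive control mechanism for real-time traffic is proposed. Simulation results show that, compared with algorithms based on packet loss rate and with RTT algorithms that consider delay only, the mechanism effectively improves the smoothness of the data stream and the utilization of bandwidth, and is more adaptive.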