期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Hong Jun Choi Dong Oh Son Jong Myon Kim Cheol Hong Kim 《The Journal of supercomputing》2014,69(1):330-356

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. Graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of GPU due to branch divergence. In this paper, we propose concurrent warp execution (CWE) technique to reduce the performance degradation of GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85 % over PDOM, 91 % over DWF) with little hardware overhead. 相似文献

2.

通过部分Warp重组消除GPGPU控制流的不一致性

沈立杨耀华王志英《计算机工程与科学》2019,41(8):1335-1342

GPU已被广泛应用于当前的高性能计算系统中,但其性能却受到程序运行时不同控制流方向的严重制约。这一问题通常通过动态Warp重组技术来解决,即将一个或多个Warp内沿相同控制流执行的线程组合在一起,构成一个新的Warp。但是,这类方法普遍存在一些不必要的重组,引入了较大的额外性能开销。分析了线程重组的性能开销,并提出了一种称作"部分重组"的性能优化方法。这种方法在保证重组效率的前提下,避免了对包含活跃线程数量较多的Warp的重组,从而有效减少了线程重组引入的性能开销。测试结果表明,部分重组能够在保证重组效率的前提下带来较为明显的性能提升。相似文献

3.

一种基于Inter-warp异构性的缓存管理与内存调度机制

方娟魏泽琳于婷雯《计算机工程与科学》2019,41(5):788-795

在GPU中,一个warp内的所有线程在锁步中执行相同的指令。某些线程的内存请求可以得到快速处理,而其余请求会经历较长时间。在最慢的请求完成之前,warp不能执行下一条指令,导致内存发散。对GPU中warp间的异构性进行了研究,实现并优化了一种基于inter warp异构性的缓存管理机制和内存调度策略,以减少内存发散和缓存排队延迟的负面影响。根据缓存命中率将warp分类,以驱动后面的3个组件：（1）基于warp类型的缓存旁路技术组件,使低缓存利用率的warp进入旁路,不访问L2缓存;（2）基于warp类型的缓存插入/提升策略组件,防止来自高缓存利用率warp的数据被过早清除;（3）基于warp类型的内存控制器组件,优先处理从高缓存利用率的warp接收到的请求,并优先处理来自相同warp的请求。基于warp间异构性的缓存管理和内存调度机制在8种不同的GPGPU应用中,与基准GPU相比,平均加速18.0％。相似文献

4.

Improved GPU SIMD control flow efficiency via hybrid warp size mechanism

《Microprocessors and Microsystems》2014,38(7):717-729

High single instruction multiple data (SIMD) efficiency and low power consumption have made graphic processing units (GPUs) an ideal platform for many complex computational applications. Thousands of threads can be created by programmers and grouped into fixed-size SIMD batches, known as warps. High throughput is then achieved by concurrently executing such warps with minimal control overhead. However, if a branch instruction occurs, which assigns different paths to different threads, one warp will be broken into multiple warps that have to be executed serially, consequently reducing the efficiency advantage of SIMD. In this paper, the contemporary fixed-size warp design is abandoned for a hybrid warp size (HWS) mechanism. Mixed-size warps are generated according to HWS and are scheduled and issued flexibly. The simulation results show that this mechanism yields an average speedup of 1.20 over the baseline architecture for a wide variety of general purpose GPU applications. The paper also integrates HWS with dynamic warp formation (DWF), which is a well-known branch handling mechanism used to improve SIMD utilization by forming new warps out of split warps in real time. The simulation results show that the combination of DWF and HWS generates an average speedup of 1.27 over the DWF-only platform with an estimated area increase of about 1% of DWF. 相似文献

5.

基于自适应线程束的GPU并行粒子群优化算法

张硕何发智周毅鄢小虎《计算机应用》2016,36(12):3274-3279

基于统一计算设备架构（CUDA）对图形处理器（GPU）下的并行粒子群优化（PSO）算法作改进研究。根据CUDA的硬件体系结构特点,可知Block是串行执行的,线程束（Warp）才是流多处理器（SM）调度和执行的基本单位。为了充分利用Block中线程的并行性,提出基于自适应线程束的GPU并行PSO算法：将粒子的维度和线程相对应;利用GPU的Warp级并行,根据维度的不同自适应地将每个粒子与一个或多个Warp相对应;自适应地将一个或多个粒子与每个Block相对应。与已有的粗粒度并行方法（将每个粒子和线程相对应）以及细粒度并行方法（将每个粒子和Block相对应）进行了对比分析,实验结果表明,所提出的并行方法相对前两种并行方法,CPU加速比最多提高了40。相似文献

6.

同时多线程处理器上的动态分支预测器设计方案研究

任建安虹路放梁博《计算机科学》2006,33(3):239-243

同时多线程处理器（SMT）每个周期能够从多个线程中发射指令执行，从而大大地提高了超标量微处理器的指令吞吐量，但多个线程的同时执行也带来了许多硬件资源的共享冲突问题.其中，多个线程共享分支预测硬件的方案会对分支预测精度产生较大的影响.研究SMT处理器中分支处理方案对于处理器整体性能的影响，对于指导SMT处理器的设计是十分重要的.本文利用SMT处理器模拟器，针对各线程运行独立应用的SMT结构实验评估了几种著名的分支预测方案;给出了在单线程和多线程情况下，分支预测方案对分支预测精度和处理器整体性能的影响的分析;总结出在这样的SMT结构中，各线程拥有独立的预测器是一种较好的选择，并且由于各独立预测器可以采用小而简单的结构，所以不会带来太多的硬件开销. 相似文献

7.

ParaC:面向GPU平台的图像处理领域的编程框架

卢兴敬刘雷贾海鹏冯晓兵武成岗《软件学报》2017,28(7):1655-1675

GPGPU加速器是当前提高图像处理算法性能的主流加速平台,但是,在GPGPU平台上,同一个程序充分利用硬件体系结构特征和软件特征的优化版本与简单实现版本在性能上会有数量级的差异。GPGPU加速器具有多维多层的大量执行线程和层次化存储体系结构,后者的不同层次具有不同的容量、带宽、延迟和访问权限。同时,图像处理应用程序具有复杂的计算操作、边界处理规则和数据访问特性。因此,任务的并发执行模式、线程的组织方式和并发任务到设备的映射不仅影响到程序的并发度、调度、通信和同步等特性,而且也会影响到访存的带宽、延迟等。因此,GPGPU平台上的程序优化是一个困难、复杂且效率较低的过程。本文提出基于语言扩展的领域编程模型：ParaC。ParaC编程环境利用高层语言扩展描述的程序语义信息,自动分析获取应用程序的操作信息、并发任务间的数据重用信息和访存信息等程序特征,同时结合硬件平台特征,利用基于领域先验知识驱动的编译优化模型自动生成GPGPU平台上的优化代码,最后,利用源源变换编译器生成标准OpenCL程序。本文在测试用例上的实验结果表明,ParaC在GPGPU平台上自动生成的优化版本相对于手工优化版本的加速比最高达到3.22倍,但代码行数只是后者的1.2%到39.68%。相似文献

8.

An instruction-systolic programmable shader architecture for multi-threaded 3D graphics processing

Jung-Wook Park Hoon-Mo Yang Gi-Ho Park Shin-Dug Kim Charles C. Weems 《Journal of Parallel and Distributed Computing》2010

In order to guarantee both performance and programmability demands in 3D graphics applications, vector and multithreaded SIMD architectures have been employed in recent graphics processing units. This paper introduces a novel instruction-systolic array architecture, which transfers an instruction stream in a pipelined fashion to efficiently share the expensive functional resources of a graphics processor. Specifically, cache misses and dynamic branches can cause additional latencies and complicated management in these parallel architectures. To address this problem, we combine a systolic execution scheme with on-demand warp activation that handles cache miss latency and branch divergence efficiently without significantly increasing hardware resources, either in terms of logic or register space. Simulation indicates that the proposed architecture offers 25% better performance than a traditional SIMD architecture with the same resources, and requires significantly fewer resources to match the performance of a typical modern vector multi-threaded GPU architecture. 相似文献

9.

Virtual-Cache: A cache-line borrowing technique for efficient GPU cache architectures

《Microprocessors and Microsystems》2021

GPUs provide megabytes of registers and shared memories to maintain the contexts for thousands of threads and enable fast data sharing amongst threads of a thread block, respectively. Besides, GPUs employ L1 cache to provide the high bandwidth service for memory requests. However, the average L1 cache capacity per thread is very limited, resulting in cache thrashing which in turn impairs the performance. Meanwhile, many registers and shared memories are unassigned to any warps or thread blocks. Moreover, registers and shared memories that are assigned can be idle when warps or thread blocks are finished. Exploiting the above insights, we propose Virtual-Cache to cost-effectively increase the effective size of L1 cache by utilizing the unassigned and released registers and shared memories as cache-lines in this paper. Specifically, we leverage the unassigned registers and shared memories to serve cache requests directly. Regarding the registers assigned to a warp, they can work as cache-lines after the warp completes the execution and before they are accessed again by a new launched warp. Regarding the shared memories of a thread block, they are enabled to serve cache requests when the thread block is finished till they are referenced by shared memory instructions of the relaunched thread block. The register file, shared memory and L1 cache are physically independent but logically unified as a large virtual cache with redesigned cache-line management. We develop the control and data path for the register file, making the register file accessible for cache requests by borrowing an operand collector to serve the cache requests. We also expand the control and data path for the shared memory to serve the cache requests. Our evaluation results show that Virtual-Cache makes the performance improved by 28% over the previously proposed cache management technique for cache-sensitive applications. 相似文献

10.

Efficient data management for incoherent ray tracing

Xin Yang Duan-qing Xu Lei Zhao 《Applied Soft Computing》2013,13(1):1-8

To obtain good performance on the GPU hardware, it is necessary to design algorithms to manage data, access memory under GPU memory hierarchy, and schedule more efficient threads. In this paper, we propose an efficient data management and task management designed for GPU based ray tracing. Due to the dynamic and uncertainty in ray tracing, we design data-management layer and task-management layer combined with fuzzy spatial analysis, use the two-level ray sorting and a ray bucket structure to reorganize ray data, then a warp's threads can be scheduled to access coherent geometry and nodes data, reduce memory bandwidth, and dispatch the data locally. We schedule tasks in data-driven execution according to coherent data, propose an adaptive ray compaction to eliminate inactive threads, maintain task efficiency of threads in a warp, and design two heuristics to decrease the compaction cost. On the basis of it, we also introduce a memory-optimized dynamic traversal management to reduce incoherent memory access, and avoid frequent sorting computation and compaction operations. Our experiments demonstrate all of these work combined can achieve good performance. 相似文献

11.

Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing

《Microprocessors and Microsystems》2016

To achieve higher performance and energy efficiency, GPGPU architectures have recently begun to employ hardware caches. Adding caches to GPGPUs, however, does not always guarantee improved performance and energy efficiency due to the thrashing in small caches shared by thousands of threads. While prior work has proposed warp-scheduling and cache-bypassing techniques to address this issue, relatively little work has been done in the context of advanced cache indexing (ACI).To bridge this gap, this work investigates the effectiveness of ACI for high-performance and energy-efficient GPGPU computing. We discuss the design and implementation of static and adaptive cache indexing schemes for GPGPUs. We then quantify the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation demonstrates that the ACI schemes are effective in that they provide significant performance and energy-efficiency gains over the conventional indexing scheme. Further, we investigate the performance sensitivity of ACI to key architectural parameters (e.g., indexing latency and cache associativity). Our experimental results show that the ACI schemes are promising in that they continue to provide significant performance gains even when additional indexing latency occurs due to the hardware complexity and the baseline cache is enhanced with high associativity or large capacity. 相似文献

12.

Modeling and characterizing GPGPU reliability in the presence of soft errors

Jingweijia Tan Yang Yi Fangyang Shen Xin Fu 《Parallel Computing》2013

The general-purpose computing on graphic processing units (GPGPUs) becomes increasingly popular due to its high computational throughput for data parallel applications. Modern GPU architectures have limited capability for error detection and fault tolerance since they are originally designed for graphics processing. However, the rigorous execution correctness is required for general-purpose applications, which makes reliability a growing concern in the GPGPU architecture design. With CMOS processing technologies continuously scaling down to the nano-scale, on-chip soft error rate (SER) has been predicted to increase exponentially. GPGPUs with hundreds of cores integrated into a single chip are prone to manifest high SER. This paper explores a first step to model and characterize GPGPU reliability in light of soft errors. We develop GPGPU-SODA (GPGPU SOftware Dependability Analysis), a framework to estimate the soft-error vulnerability of GPGPU microarchitecture. By using GPGPU-SODA, we observe that several microarchitecture structures in GPGPUs exhibit high soft-error susceptibility, and the structure vulnerability is sensitive to the workload characteristics (e.g. branch divergences, memory access pattern). We further investigate the impact of several architectural optimizations on GPU soft-error robustness. For example, we find that increasing the number of threads supported by GPU significantly affects the GPGPU soft-error robustness. However, changing the warp scheduling policy has little impact on the structure vulnerability. The observations made in this study provide designers the useful guidance to build resilient GPGPUs: a comprehensive resiliency solution for GPGPUs should consider the entire GPGPU design instead of solely focusing on a particular structure. 相似文献

13.

Using Predicated Execution to Improve the Performance of a Dynamically Scheduled Machine with Speculative Execution

Po-Yung Chang Eric Hao Yale N. Patt Pohua P. Chang 《International journal of parallel programming》1996,24(3):209-234

Conditional branches incur a severe performance penalty in wide-issue, deeply pipelined processors. Speculative execution^{(1, 2)} and predicated execution^(3–9) are two mechanisms that have been proposed for reducing this penalty. Speculative execution can completely eliminate the penalty associated with a particular branch, but requires accurate branch prediction to be effective. Predicated execution does not require accurate branch prediction to eliminate the branch penalty, but is not applicable to all branches and can increase the latencies within the program. This paper examines the performance benefit of using both mechanisms to reduce the branch execution penalty. Predicated execution is used to handle the hard-to-predict branches and speculative execution is used to handle the remaining branches. The hard-to-predict branches within the program are determined by profiling. We show that this approach can significantly reduce the branch execution penalty suffered by wide-issue processors. 相似文献

14.

使用取指策略控制同时多线程处理器中个体线程的性能

孙彩霞张民选《计算机学报》2008,31(2):309-317

当前,对同时多线程(Si multaneous Multithreading,SMT)处理器取指策略的研究大都集中在总体性能的优化上.文中提出一种新颖的SMT处理器取指策略(Controlling Performance of Individual Thread,CPIT),用于控制个体线程的执行.结果表明,对于模拟的所有负载,CPIT在94%以上的情况下都能保证受控线程获得期望性能.而对于失败的情况,受控线程的平均性能偏差不超过1.25%.此外,CPIT策略对处理器总体性能的影响并不大.与ICOUNT这种以优化性能为目标的取指策略相比,总体性能的平均降低不超过3%,而除受控线程外的其他线程的性能平均只降低了1.75%. 相似文献

15.

Enable back memory and global synchronization on LLC buffer

Licheng Yu Yulong Pei Tianzhou Chen Xueqing Lou Minghui Wu Tiefei Zhang 《The Journal of supercomputing》2017,73(12):5414-5439

The last-level cache (LLC) shared by heterogeneous processors such as CPU and general-purpose graphics processing unit (GPGPU) brings new opportunities to optimize data sharing among them. Previous work introduces the LLC buffer, which uses part of the LLC storage as a FIFO buffer to enable data sharing between CPU and GPGPU with negligible management overhead. However, the baseline LLC buffer’s capacity is limited and can lead to deadlock when the buffer is full. It also relies on inefficient CPU kernel relaunch and high overhead atomic operations on GPGPU for global synchronization. These limitations motivate us to enable back memory and global synchronization on the baseline LLC buffer and make it more practical. The back memory divides the buffer storage into two levels. While they are managed as a single queue, the data storage in each level is managed as individual circular buffer. The data are redirected to the memory level when the LLC level is full, and are loaded back to the LLC level when it has free space. The case study of n-queen shows that the back memory has a comparative performance with a LLC buffer of infinite LLC level. On the contrary, LLC buffer without back memory exhibits 10% performance degradation incurred by buffer space contention. The global synchronization is enabled by peeking the data about to be read from the buffer. Any request to read the data in LLC buffer after the global barrier is allowed only when all the threads reach the barrier. We adopt breadth-first search (BFS) as a case study and compare the LLC buffer with an optimized implementation of BFS on GPGPU. The results show the LLC buffer has speedup of 1.70 on average. The global synchronization time on GPGPU and CPU is decreased to 38 and 60–5%, respectively. 相似文献

16.

A Static Greedy and Dynamic Adaptive Thread Spawning Approach for Loop-Level Parallelism

下载免费PDF全文

李美蓉赵银亮陶悠王启明《计算机科学技术学报》2014,29(6):962-975

Thread-level speculation becomes more attractive for the exploitation of thread-level parallelism from irregular sequential applications. But it is common for speculative threads to fail to reach the expected parallel performance. The reason is that the performance of speculative threads is extremely complicated by the fact that it not only suffers from the imprecision of compiler-directed performance estimation due to ambiguous control and data dependences, but also depends on the underlying hardware configuration and program behaviors. Thus, this paper proposes a statically greedy and dynamically adaptive approach for loop-level speculation to dynamically determine the best loop level at runtime. It relies on the compiler to select and optimize all loop candidates greedily, which are then proceeded on the cost-benefit analysis of different loop nesting levels for the determination of the order of loop speculation. Under the runtime loop execution prediction, we dynamically schedule and update the order of loop speculation, and ensure the best loop level to be always parallelized. Two different policies are also examined to maximize overall performance. Compared with traditional static loop selection techniques, our approach （：an achieve comparable or better performance. 相似文献

17.

Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads

Teo Milanez Sylvain Collange Fernando Magno Quintão Pereira Wagner Meira Jr. Renato Ferreira 《Parallel Computing》2014

Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one architecture recently proposed that shares instruction decoding and execution between threads running the same program in an SMT processor, thereby generalizing the approach followed by Graphics Processing Units to general-purpose processors. In this paper we propose new ways to expose redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Our heuristic only inspects the program counter and the stack frame to reconverge threads; hence, it is amenable to efficient and inexpensive hardware implementation. Second, we demonstrate that this heuristic is able to reveal the existence of substantial regularity in inter-thread memory access patterns. We validate our results on data-parallel applications from the PARSEC and SPLASH suites. Our new reconvergence heuristic increases the throughput of our MMT model by 7%, when compared to a previous, and substantially more complex approach, due to Long et al. Moreover, it gives us an effective way to increase regularity in memory accesses. We have observed that over 70% of simultaneous memory accesses are either the same for all the threads, or are affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses. 相似文献

18.

Emerging technology enabled energy-efficient GPGPUs register file

《Microprocessors and Microsystems》2017

Modern Graphics Processing Units (GPGPUs) employ the fine-grained multi-threading among thousands of active threads, leading to the sizable register file (RF) with massive energy consumption. In this study, we explore the emerging technology (i.e., Tunnel FET (TFET)) enabled energy-efficient GPGPUs RF. TFET is much more energy-efficient than CMOS at the low voltage operations, but always using TFET at the low voltage (so that low frequency) causes significant performance degradation. In this study, we first design the hybrid CMOS-TFET based register file, and propose the memory-contention-aware TFET register allocation (MEM_RA). MEM_RA allocates TFET-based registers to threads whose execution progress can be delayed to some degree to avoid the memory contentions with other threads, and the CMOS-based registers are still used for threads requiring normal execution speed. We further observe the insufficient TFET register resources for the memory-intensive benchmarks when applying the MEM_RA technique. We then develop the TFET-register-utilization-aware block allocation (TUBA) and TFET-regsiter-request-aware warp scheduling (TRWS) mechanisms to effectively utilize the limited TFET registers and achieve the maximal energy savings. Our experimental results show that the proposed techniques achieve 40% energy (including both dynamic and leakage) reduction in GPGPUs register file with negligible performance overhead. 相似文献

19.

An Approach of Branch Multithreading Switch Mechanism

Lan Dong Zhenzhou Ji Mingzeng Hu Xinmin Tang 《通讯和计算机》2006,3(5):14-16,39

Branch misprediction is a considerable drawback to processor performance. The capacity of mulfithreading technology that can hide latency among threads brings up new opportunities to diminish the effect of misprediction. Since previous multithrending approaches either bring up additional pipeline lost or lack flexibility to implement. In this paper, an approach is proposed, which exploits instruction extension and instruction prefetch mechanism together with a parallel decoder. Not only the impact of misprediction delay is diminished, but also no accessional waste is caused by the thread switch mechanism. Experiment shows that IPC is improved by 10.2%-25% compared with Gshare branch prediction method for the most of the benchmarks. 相似文献

20.

Reconstructing permutation table to improve the Tabu Search for the PFSP on GPU

Kai-Cheng Wei Xue Sun Hsun Chu Chao-Chin Wu 《The Journal of supercomputing》2017,73(11):4711-4738

General-purpose computing on graphics processing unit (GPGPU) has been adopted to accelerate the running of applications which require long execution time in various problem domains. Tabu Search belonging to meta-heuristics optimization has been used to find a suboptimal solution for NP-hard problems within a more reasonable time interval. In this paper, we have investigated in how to improve the performance of Tabu Search algorithm on GPGPU and took the permutation flow shop scheduling problem (PFSP) as the example for our study. In previous approach proposed recently for solving PFSP by Tabu Search on GPU, all the job permutations are stored in global memory to successfully eliminate the occurrences of branch divergence. Nevertheless, the previous algorithm requires a large amount of global memory space, because of a lot of global memory access resulting in system performance degradation. We propose a new approach to address the problem. The main contribution of this paper is an efficient multiple-loop struct to generate most part of the permutation on the fly, which can decrease the size of permutation table and significantly reduce the amount of global memory access. Computational experiments on problems according with benchmark suite for PFSP reveal that the best performance improvement of our approach is about 100%, comparing with the previous work. 相似文献