Similar Literature
 20 similar documents retrieved (search time: 15 ms)
1.
To guarantee both the performance and programmability demands of 3D graphics applications, vector and multithreaded SIMD architectures have been employed in recent graphics processing units. This paper introduces a novel instruction-systolic array architecture, which transfers an instruction stream in a pipelined fashion to efficiently share the expensive functional resources of a graphics processor. In such parallel architectures, cache misses and dynamic branches can cause additional latencies and complicated management. To address this problem, we combine a systolic execution scheme with on-demand warp activation that handles cache-miss latency and branch divergence efficiently without significantly increasing hardware resources, in terms of either logic or register space. Simulation indicates that the proposed architecture offers 25% better performance than a traditional SIMD architecture with the same resources, and requires significantly fewer resources to match the performance of a typical modern vector multithreaded GPU architecture.

2.
Control architectures based on emotions are becoming promising solutions for the implementation of future robotic systems. The basic controllers of such an architecture are the emotional processes that decide which behaviors the robot must activate to fulfill its objectives. The number of emotional processes grows with the complexity of the application (up to hundreds of millions per second), which limits a single main processor's capacity to solve complex problems. Fortunately, emotional processes are inherently parallel, so executing them in parallel supplies the computing power needed to tackle complex dynamic problems. In this paper, graphics processing units (GPUs), multicore processors, and single-instruction multiple-data (SIMD) instructions are used to provide that parallelism. Different GPUs, multicore processors, and SIMD instruction sets are evaluated and compared to analyze their suitability for robotic applications. The applications are set up taking into account different environmental conditions, robot dynamics, and emotional states. Experimental results show that, despite the bottleneck in data transmission between the host and the device, the evaluated GTX 670 GPU delivers more than an order of magnitude higher performance than the initial single-core implementation of the architecture. Thus all of the complex proposed application problems can be solved using GPU technology, whereas the first prototype could solve only 55% of them. Using AVX SIMD instructions, the performance of the architecture increases by a factor of 3.25 relative to the first implementation, solving about 88.8% of the 27 proposed applications. With SSE SIMD instructions, performance is almost doubled and the robot can solve about 74% of the proposed application problems. AVX and SSE SIMD instructions provide almost the same performance as a quad-core and a dual-core, respectively, with the advantage that they add no additional hardware cost.

3.
Parallel Computing, 2013, 39(10): 586-602
Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor incorporates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector lengths needed for high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions limits current SIMD engines in handling memory alignment, data reorganization, and control flow: many supporting instructions, such as data permutations, address generation, and loop branches, are required to aid the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long-vector computation, as well as efficient SIMD hardware for the real computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). It includes the following features: (1) a simple coprocessor structure with an 8-way TTA; (2) cost-effective SIMD hardware capable of performing floating-point operations; (3) long-vector capabilities built upon the existing SIMD hardware, with a single register file and processor datapath for both scalar operands and vector elements; and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that the MCP can outperform conventional SIMD techniques by an average of 39% for multimedia kernels and 12% for applications.

4.
General-purpose graphics processing units (GPGPUs) play an important role in massively parallel computing today. A GPGPU core typically holds thousands of threads, with hardware threads organized into warps. With the single-instruction multiple-thread (SIMT) pipeline, GPGPUs can achieve high performance, but threads that take different branches within the same warp violate the SIMD execution style and cause branch divergence. To handle this, a hardware stack is used to execute all branch paths sequentially, so branch divergence leads to performance degradation. This article represents the PDOM (post-dominator) stack as a binary tree in which each leaf corresponds to a branch target, and proposes a new stack, PDOM-ASI, that can schedule all of the tree's leaves. The new stack can hide more long-operation latencies by keeping more warps schedulable, without the problem of warp over-subdivision. In addition, a multi-level warp scheduling policy is proposed that lets some warps run ahead, creating more opportunities to hide latencies. Simulation results show that our policies achieve a 10.5% performance improvement over baseline policies with only 1.33% hardware area overhead.
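For context, below is a minimal host-side sketch of the per-warp PDOM reconvergence stack that this line of work builds on. It models only the baseline mechanism (the paper's binary-tree PDOM-ASI variant is not reproduced), and the entry layout, 32-bit masks, and method names are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

// One entry per in-flight control-flow path of a warp (illustrative layout).
struct StackEntry {
    uint32_t pc;          // next instruction for this path
    uint32_t active_mask; // threads executing this path (32-thread warp)
    uint32_t reconv_pc;   // immediate post-dominator: pop when pc reaches it
};

struct WarpPDOMStack {
    std::vector<StackEntry> stack;

    // On a divergent branch, the current entry waits at the branch's
    // immediate post-dominator while both paths are pushed above it.
    void diverge(uint32_t taken_pc, uint32_t taken_mask,
                 uint32_t fallthru_pc, uint32_t fallthru_mask,
                 uint32_t ipdom_pc) {
        stack.back().pc = ipdom_pc;
        if (fallthru_mask) stack.push_back({fallthru_pc, fallthru_mask, ipdom_pc});
        if (taken_mask)    stack.push_back({taken_pc, taken_mask, ipdom_pc});
    }

    // After each instruction: when the top path reaches its reconvergence
    // point, pop it so the sibling path (and finally the full mask) resumes.
    void advance(uint32_t next_pc) {
        stack.back().pc = next_pc;
        while (!stack.empty() && stack.back().pc == stack.back().reconv_pc)
            stack.pop_back();
    }
};
```

Because only the top of this stack may issue, a warp split into many sub-warps has at most one runnable path at a time; PDOM-ASI's ability to schedule all leaves is what removes that serialization.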

5.
张鸿骏, 武延军, 张珩, 张立波. 软件学报 (Journal of Software), 2020, 31(10): 3038-3055
Hash tables, a class of data-indexing structures that provide efficient data access keyed by key values, are widely used across computer applications, especially in performance-critical system software, databases, and high-performance computing. In networking, cloud computing, and IoT services, hash tables have become core components of caching systems. However, as data volumes grow dramatically, systems built around multi-core CPU hash-table designs have begun to hit performance bottlenecks, and the performance and scalability of hash tables urgently need further improvement. With the growing adoption of general-purpose graphics processing units (GPUs) and the substantial increases in hardware compute capability and concurrency, many parallel-computing-centric system software tasks have been redesigned for and noticeably accelerated on GPUs. Because of sparsity and randomness, directly porting existing parallel hash-table structures to GPUs inevitably incurs high-frequency memory accesses and frequent bus data transfers, limiting hash-table performance on the GPU. This paper analyzes the memory accesses, hit ratios, and index overhead of hash-table indexes in caching systems, and proposes CCHT (Cache Cuckoo Hash Table), a GPU-adapted hybrid-access cache-index framework. CCHT provides two caching policies suited to different hit-ratio and index-overhead requirements and allows write and query operations to execute concurrently, maximizing the use of the GPU hardware's compute performance and concurrency while reducing memory accesses and bus transfers. Implementation and experiments on GPU hardware show that CCHT outperforms other hash tables used for cache indexing while maintaining the cache hit ratio.

6.
As hardware features have grown richer and software development environments have matured, GPUs have come to be used for general-purpose computing, helping the CPU accelerate programs. In pursuit of high performance, a GPU often contains hundreds to thousands of compute cores; this dense compute resource gives it far higher performance than a CPU but also higher power consumption, and power has become one of the key constraints on GPU development. Building on an in-depth study of the Fermi GPU architecture, this paper proposes a high-accuracy architecture-level power model. The model first derives the energy consumed by each type of native instruction and by each memory access; it then predicts an application's power from the instructions it executes on the hardware and the results obtained with a sampling tool. Comparing the model's predictions against measurements on 13 benchmark applications shows a prediction accuracy of about 90%.
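A minimal sketch of the energy-accounting idea behind such an architecture-level power model follows: per-event energies are assigned to each native instruction class and to each memory access, and an application's average power is predicted from its sampled event counts. All constants and profile values below are illustrative placeholders, not measured Fermi numbers.

```cpp
// Total energy = sum over instruction classes of (count x per-instruction
// energy) + (memory accesses x per-access energy); average power follows
// by dividing by the measured runtime.
#include <cstdint>
#include <cstdio>

struct KernelProfile {
    uint64_t int_ops, fp_ops, sfu_ops;     // executed native instructions
    uint64_t gmem_accesses, smem_accesses; // memory transactions
    double   runtime_s;                    // measured kernel runtime
};

double predict_avg_power(const KernelProfile& p) {
    // Hypothetical per-event energies in nanojoules.
    const double E_INT = 0.5, E_FP = 0.9, E_SFU = 1.8;
    const double E_GMEM = 30.0, E_SMEM = 2.0;
    double energy_nj = p.int_ops * E_INT + p.fp_ops * E_FP + p.sfu_ops * E_SFU
                     + p.gmem_accesses * E_GMEM + p.smem_accesses * E_SMEM;
    return energy_nj * 1e-9 / p.runtime_s; // average power in watts
}

int main() {
    KernelProfile prof{1'000'000'000, 500'000'000, 20'000'000,
                       80'000'000, 300'000'000, 0.05};
    std::printf("predicted avg power: %.1f W\n", predict_avg_power(prof));
}
```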

7.
The high computational power of GPUs (Graphics Processing Units) offers a promising accelerator for general-purpose computing. However, the need for dedicated programming environments has made using GPUs rather complicated, and a GPU cannot directly execute the binary code of a general-purpose application. This paper proposes a two-phase virtual execution environment (GXBIT) for automatically executing general-purpose binary applications on CPU/GPU architectures. GXBIT incorporates two execution phases: the first extracts parallel hot spots from the sequential binary code, and the second generates a hybrid executable (containing both CPU and GPU instructions) for execution. This virtual execution environment works well for any application that runs repeatedly. On a number of benchmarks, the CUDA (Compute Unified Device Architecture) code generated by GXBIT reaches about 63% of the performance of hand-tuned GPU code, and it achieves much better overall performance than the native platforms.

8.
Modern 3D graphics processors have evolved from fixed-function to programmable rendering pipelines with ever-increasing parallelism, so researching and designing high-performance 3D graphics processors is of great significance for 3D graphics processing. The shader is the core of a 3D graphics processor, so developing shaders that combine high performance, small area, low power, and easy extensibility plays an important role in 3D graphics processor development. The unified-architecture graphics processor proposed here is based on single-instruction multiple-thread (SIMT) and single-instruction multiple-data (SIMD) execution: SIMT raises the parallelism of graphics processing and thus its performance, while SIMD reduces design complexity, yielding a shader that is small, low-power, and easy to extend. Experimental results show that the proposed unified-architecture graphics processor achieves high performance with relatively small area and low power, and that the design scales well.

9.
General-purpose computing on graphics processing units (GPGPU) is becoming increasingly popular due to its high computational throughput for data-parallel applications. Modern GPU architectures have limited capability for error detection and fault tolerance, since they were originally designed for graphics processing; however, general-purpose applications require rigorous execution correctness, which makes reliability a growing concern in GPGPU architecture design. With CMOS processing technologies continuously scaling down to the nanoscale, the on-chip soft-error rate (SER) has been predicted to increase exponentially, and GPGPUs with hundreds of cores integrated into a single chip are prone to high SER. This paper takes a first step toward modeling and characterizing GPGPU reliability in light of soft errors. We develop GPGPU-SODA (GPGPU SOftware Dependability Analysis), a framework for estimating the soft-error vulnerability of the GPGPU microarchitecture. Using GPGPU-SODA, we observe that several microarchitecture structures in GPGPUs exhibit high soft-error susceptibility, and that structure vulnerability is sensitive to workload characteristics (e.g., branch divergence and memory access patterns). We further investigate the impact of several architectural optimizations on GPU soft-error robustness; for example, increasing the number of threads supported by the GPU significantly affects soft-error robustness, whereas changing the warp scheduling policy has little impact on structure vulnerability. The observations made in this study give designers useful guidance for building resilient GPGPUs: a comprehensive resiliency solution should consider the entire GPGPU design rather than focusing solely on a particular structure.
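The standard metric underlying this kind of vulnerability analysis is the architectural vulnerability factor (AVF): the average fraction of a structure's bits that hold ACE (architecturally correct execution) state per cycle. A minimal sketch follows; whether GPGPU-SODA computes it exactly this way is an assumption, and the trace values are made up.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// ace_bits[c] = number of ACE bits resident in the structure at cycle c.
// AVF = (sum of per-cycle ACE bits) / (structure size x cycles observed).
double avf(const std::vector<uint64_t>& ace_bits, uint64_t structure_bits) {
    uint64_t sum = 0;
    for (uint64_t b : ace_bits) sum += b;
    return static_cast<double>(sum) / (structure_bits * ace_bits.size());
}

int main() {
    std::vector<uint64_t> trace = {512, 640, 640, 128, 0, 896}; // per-cycle ACE bits
    std::printf("AVF = %.3f\n", avf(trace, 1024));              // 1024-bit structure
}
```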

10.
Hash tables, a type of data-indexing structure that provides efficient data access based on key values, are widely used in computer applications, especially in system software, databases, and high-performance computing, where extremely high performance is required. In network, cloud computing, and IoT services, hash tables have become core components of cache systems. However, with the massive growth in data volume, systems designed around multi-core CPU hash-table structures have gradually hit performance bottlenecks, and there is an urgent need to further improve the performance and scalability of hash tables. With the increasing popularity of general-purpose Graphic Processing Units (GPUs) and the substantial improvement in hardware computing capability and concurrency, many system software tasks centered on parallel computing have been optimized on GPUs and have achieved considerable performance gains. Due to sparseness and randomness, directly using existing parallel hash-table structures on GPUs inevitably brings high-frequency memory accesses and frequent bus data transfers, which limits hash-table performance on the GPU. This study analyzes the memory accesses, hit ratios, and index overhead of hash-table indexes in cache systems. A hybrid-access cache-indexing framework adapted to the GPU, CCHT (Cache Cuckoo Hash Table), is proposed: it provides two cache strategies suited to different hit-ratio and index-overhead requirements and allows write and query operations to execute concurrently, maximizing the use of the GPU hardware's computing performance and concurrency while reducing memory access and bus transfer overhead. GPU hardware implementation and experimental verification show that CCHT outperforms other hash tables used for cache indexing while maintaining the cache hit ratio.
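Below is a minimal CUDA sketch of the GPU cuckoo-hash insertion that a design like CCHT builds on: two tables, 64-bit key/value slots swapped with atomicExch, and evict-and-retry relocation. This is the textbook scheme rather than CCHT itself; the hash functions, table size, and the nonzero-key assumption are illustrative.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

constexpr int SLOTS = 1 << 20;           // slots per table (power of two)
constexpr int MAX_EVICT = 32;            // give up after this many relocations
constexpr unsigned long long EMPTY = 0;  // tables zero-initialized; keys nonzero

__device__ uint32_t h1(uint32_t k) { return (k * 2654435761u) & (SLOTS - 1); }
__device__ uint32_t h2(uint32_t k) { return ((k ^ 0x9e3779b9u) * 40503u) & (SLOTS - 1); }

// Key and value packed into one 64-bit word so a whole slot swaps atomically.
__global__ void cuckoo_insert(unsigned long long* t1, unsigned long long* t2,
                              const uint32_t* keys, const uint32_t* vals,
                              int n, int* failed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned long long entry = ((unsigned long long)keys[i] << 32) | vals[i];
    for (int round = 0; round < MAX_EVICT; ++round) {
        uint32_t k = (uint32_t)(entry >> 32);
        unsigned long long* slot = (round & 1) ? &t2[h2(k)] : &t1[h1(k)];
        entry = atomicExch(slot, entry);   // claim the slot, evicting any resident
        if (entry == EMPTY) return;        // inserted without eviction
        // Otherwise keep relocating the evicted entry into the other table.
    }
    atomicAdd(failed, 1); // eviction chain too long; a real system would rehash
}
```

Lookups only read the two candidate slots, which is why query and insert traffic can overlap; CCHT's contribution is the cache policies layered on top of this structure.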

11.
Graphics processing units and genetic programming: an overview
A top-end graphics card (GPU) plus a suitable SIMD interpreter can deliver a several-hundred-fold speedup, yet cost less than the computer holding it. We give highlights of AI and computational intelligence applications in the new field of general-purpose computing on graphics hardware (GPGPU). In particular, we survey the use of genetic programming (GP) with GPUs, give several applications from bioinformatics, and show how the fastest GP implementations are based on an interpreter rather than compilation. Finally, using GP to generate CUDA kernel C++ code for the GPU is sketched.
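The interpreter approach the survey highlights can be sketched as follows: one postfix GP program is interpreted by all threads in lockstep, each thread evaluating its own fitness case, so no per-program compilation is needed. The opcode set, constant encoding, and stack depth are illustrative assumptions.

```cpp
#include <cuda_runtime.h>

enum Op { PUSH_X, PUSH_CONST, ADD, MUL };

// prog/consts are parallel arrays: consts[pc] is read only when
// prog[pc] == PUSH_CONST. Every thread walks the same program, so there
// is no divergence across fitness cases.
__global__ void gp_eval(const int* prog, const float* consts, int prog_len,
                        const float* x, float* out, int n_cases) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_cases) return;
    float stack[16];   // assumed deep enough for the evolved programs
    int sp = 0;
    for (int pc = 0; pc < prog_len; ++pc) {
        switch (prog[pc]) {
            case PUSH_X:     stack[sp++] = x[i];       break;
            case PUSH_CONST: stack[sp++] = consts[pc]; break;
            case ADD: --sp; stack[sp - 1] += stack[sp]; break;
            case MUL: --sp; stack[sp - 1] *= stack[sp]; break;
        }
    }
    out[i] = stack[sp - 1]; // program's value on fitness case i
}
```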

12.
We present a development environment for distributed GPU computing targeted for multi-GPU systems, as well as graphics clusters. Our system is based on CUDA and logically extends its parallel programming model for graphics processors to higher levels of parallelism, namely, the PCI bus and network interconnects. While the extended API mimics the full function set of current graphics hardware—including the concept of global memory—on all distribution layers, the underlying communication mechanisms are handled transparently for the application developer. To allow for high scalability, in particular for network-interconnected environments, we introduce an automatic GPU-accelerated scheduling mechanism that is aware of data locality. This way, the overall amount of transmitted data can be heavily reduced, which leads to better GPU utilization and faster execution. We evaluate the performance and scalability of our system for bus and especially network-level parallelism on typical multi-GPU systems and graphics clusters.
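As a point of reference for the bus-level parallelism described here, the sketch below splits one array across all visible GPUs with the plain CUDA runtime API (cudaSetDevice plus per-device launches). The transparent communication layer and locality-aware scheduling the paper contributes are not shown; the kernel and sizes are illustrative.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

__global__ void scale(float* d, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

// Split host[0..n) across all visible GPUs and run the same kernel on each.
void scale_on_all_gpus(float* host, int n, float s) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return;
    int chunk = (n + ndev - 1) / ndev;
    std::vector<float*> dbuf(ndev, nullptr);
    for (int d = 0; d < ndev; ++d) {
        int lo = d * chunk, len = std::min(chunk, n - lo);
        if (len <= 0) break;
        cudaSetDevice(d);                  // all following calls target GPU d
        cudaMalloc(&dbuf[d], len * sizeof(float));
        cudaMemcpyAsync(dbuf[d], host + lo, len * sizeof(float),
                        cudaMemcpyHostToDevice);
        scale<<<(len + 255) / 256, 256>>>(dbuf[d], len, s);
        cudaMemcpyAsync(host + lo, dbuf[d], len * sizeof(float),
                        cudaMemcpyDeviceToHost);
    }
    for (int d = 0; d < ndev; ++d) {       // wait for every device, then free
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(dbuf[d]);
    }
}
```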

13.
MMX technology extension to the Intel architecture
Peleg, A.; Weiser, U. IEEE Micro, 1996, 16(4): 42-50
Designed to accelerate multimedia and communications software, MMX technology extends the Intel architecture (IA) with new data types and instructions that exploit the parallelism inherent in many multimedia, communications, and other numeric-intensive algorithms. It uses a SIMD (single-instruction, multiple-data) technique, yielding full-application performance 1.5 to 2 times that of the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU.
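The packed-integer idea MMX introduced is illustrated below: one instruction adds eight 16-bit samples at once. The sketch uses SSE2 intrinsics (MMX's direct successor, with the same saturating packed-add operation) because MMX's __m64 type is deprecated on modern x86-64 compilers; the function name is hypothetical.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Saturating add of two 16-bit sample streams, eight elements per instruction.
void add_samples(const int16_t* a, const int16_t* b, int16_t* out, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        // Packed saturating add: the media-style operation MMX popularized.
        _mm_storeu_si128((__m128i*)(out + i), _mm_adds_epi16(va, vb));
    }
    for (; i < n; ++i) { // scalar tail with the same saturation semantics
        int s = a[i] + b[i];
        out[i] = (int16_t)(s > 32767 ? 32767 : (s < -32768 ? -32768 : s));
    }
}
```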

14.
A method for partial vector-register reuse in SIMD optimization
SIMD architectures, used for multimedia acceleration, are now widely deployed in modern general-purpose processors. The data parallelism of SIMD architectures can greatly raise a processor's compute capability, but because the memory system is far too slow to match it, application performance is hard to improve further. Based on the memory-access characteristics of SIMD architectures, this paper therefore proposes a partial vector-register reuse method to improve memory-access efficiency, together with a corresponding program-transformation algorithm: using data-dependence analysis, it generates optimized code that partially reuses vector registers when an application is vectorized. Experimental results show that the algorithm significantly improves the performance of multimedia applications.
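The reuse idea can be sketched on a two-tap filter out[i] = a[i] + a[i+4]: the register loaded as a[i+4..i+7] in one iteration is exactly the a[i..i+3] operand of the next, so it is carried over instead of reloaded, halving the loads. The paper's method generalizes to partially overlapping registers, which this simplified sketch does not show.

```cpp
#include <emmintrin.h>

// out[i] = a[i] + a[i+4] for i in [0, n-5], four lanes per iteration.
void filter_reuse(const float* a, float* out, int n) {
    __m128 cur = _mm_loadu_ps(a);              // a[0..3], loaded once
    for (int i = 0; i + 8 <= n; i += 4) {
        __m128 next = _mm_loadu_ps(a + i + 4); // the only load per iteration
        _mm_storeu_ps(out + i, _mm_add_ps(cur, next));
        cur = next;  // reused next iteration as its a[i..i+3] operand
    }
    // Scalar tail for the remaining elements omitted for brevity.
}
```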

15.
An optimization method for computing the principal eigenvectors of large-scale sparse matrices
Computing the principal eigenvectors of a matrix (principal eigenvectors computing, PEC) is an important problem in scientific and engineering computation. With the rise of general-purpose computing on graphics processing units (GPGPU), using GPUs to accelerate PEC for large-scale sparse matrices has attracted wide attention. This paper analyzes the performance bottlenecks of PEC from two perspectives, the application's characteristics and the GPU's architectural features, proposes a GPU-oriented sparse-matrix storage format, GPU-ELL, together with a GPU-targeted thread-mapping optimization strategy, and designs a corresponding optimized PEC algorithm. Experimental results on an ATI Radeon HD 5850 show a speedup of up to about 200x over a traditional CPU and 2x over an existing GPU implementation.
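A minimal CUDA sketch of the two ingredients combined here is given below: an ELLPACK-style SpMV kernel (one thread per row, slot-major storage so a warp's loads coalesce) of the kind GPU-ELL refines, used inside power iteration to approach the principal eigenvector. The exact GPU-ELL layout and thread mapping in the paper may differ.

```cpp
#include <cuda_runtime.h>

// cols/vals hold n_rows x max_nnz entries, stored slot-major and padded
// with col = -1: the classic ELLPACK layout, so threads of a warp read
// consecutive addresses in each iteration of k.
__global__ void spmv_ell(const int* cols, const float* vals, int n_rows,
                         int max_nnz, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float acc = 0.0f;
    for (int k = 0; k < max_nnz; ++k) {
        int c = cols[k * n_rows + row];
        if (c >= 0) acc += vals[k * n_rows + row] * x[c];
    }
    y[row] = acc;
}
// Power-iteration driver (host): repeat y = A*x, x = y / ||y||_2 until the
// Rayleigh quotient stabilizes; x then approximates the principal eigenvector.
```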

16.
High single-instruction multiple-data (SIMD) efficiency and low power consumption have made graphics processing units (GPUs) an ideal platform for many complex computational applications. Programmers can create thousands of threads and group them into fixed-size SIMD batches, known as warps; high throughput is then achieved by concurrently executing such warps with minimal control overhead. However, when a branch instruction assigns different paths to different threads, a warp is broken into multiple warps that must be executed serially, eroding the efficiency advantage of SIMD. In this paper, the contemporary fixed-size warp design is abandoned in favor of a hybrid warp size (HWS) mechanism: mixed-size warps are generated according to HWS and are scheduled and issued flexibly. Simulation results show that this mechanism yields an average speedup of 1.20 over the baseline architecture for a wide variety of general-purpose GPU applications. The paper also integrates HWS with dynamic warp formation (DWF), a well-known branch-handling mechanism that improves SIMD utilization by forming new warps out of split warps in real time. The simulation results show that the combination of DWF and HWS generates an average speedup of 1.27 over the DWF-only platform, with an estimated area increase of about 1% of DWF.

17.
In this work, a parallel graphics processing unit (GPU) version of the Monte Carlo stochastic grid bundling method (SGBM) for pricing multi-dimensional early-exercise options is presented. To extend the method's applicability, the problem dimensions and the number of bundles are increased drastically, which makes SGBM very expensive in computational cost on conventional CPU-based hardware. A parallelization strategy for the method is developed, and the general-purpose computing on graphics processing units paradigm is used to reduce the execution time. An improved technique for bundling asset paths, which is more efficient on parallel hardware, is introduced. Thanks to the performance of the GPU version of SGBM, a general approach for computing the early-exercise policy is proposed. Comparisons between the sequential and GPU-parallel versions are presented.
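The Monte Carlo layer that SGBM parallelizes can be sketched as below: each CUDA thread simulates one geometric-Brownian-motion path of the underlying and records a discounted European call payoff. SGBM's bundling and regression stages for the early-exercise policy are omitted, and all parameters are illustrative.

```cpp
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void mc_euro_call(float s0, float K, float r, float sigma,
                             float T, int n_steps, int n_paths,
                             unsigned long long seed, float* payoffs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_paths) return;
    curandState rng;
    curand_init(seed, i, 0, &rng);        // one independent stream per path
    float dt = T / n_steps, s = s0;
    for (int t = 0; t < n_steps; ++t) {
        float z = curand_normal(&rng);    // standard normal increment
        s *= __expf((r - 0.5f * sigma * sigma) * dt + sigma * sqrtf(dt) * z);
    }
    payoffs[i] = __expf(-r * T) * fmaxf(s - K, 0.0f); // discounted payoff
}
// Host side: average payoffs[] (e.g. with a reduction) to get the price.
```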

18.
The rapid development of technologies and applications in recent years poses high demands and challenges for high-performance computing. Because of their competitive performance/price ratio, heterogeneous many-core architectures are widely used in high-performance computing, with GPUs and Xeon Phis being two popular general-purpose many-core accelerators. In this paper, we demonstrate how heterogeneous many-core architectures, powered by multi-core CPUs, CUDA-enabled GPUs, and Xeon Phis, can be used as an efficient computational platform to accelerate popular option-pricing algorithms. To make full use of the compute power of this architecture, we use a hybrid computing model with two levels of data parallelism: the worker level, which uses a distributed computing infrastructure for task distribution, and the device level, which uses both the multi-core CPUs and the many-core accelerators for fast option-pricing calculation. Experiments show that our implementations achieve good performance and scalability on this architecture and also outperform other state-of-the-art GPU-based solutions for Monte Carlo European/American option pricing and BSDE European option pricing.

19.
Power efficiency must be investigated at every level of a High Performance Computing (HPC) system because of the increasing computational demands of scientific and engineering applications. Focusing on the critical design constraints at the software level of a parallel system composed of huge numbers of power-hungry components, we optimize HPC program design to achieve the best possible power performance on the target hardware platform. The power performance of a CUDA Processing Element (PE) is determined both by hardware factors, including the power characteristics of each component (CPU, GPU, main memory, and PCI buses) and their interconnection architecture, and by software factors, including the algorithm design and the character of the executable instructions performed. In this paper, approaches to model and evaluate the power consumption of large-scale SIMD computation by CUDA PEs on multi-core and GPU platforms are introduced. The model yields design characteristic values at the early programming stage, benefiting programmers by providing the environment information needed to choose the most power-efficient alternative. Based on the model, CPU dynamic frequency scaling (DFS) can be applied to the CUDA PE architecture, adjusting the CPU frequency to enhance the power efficiency of the entire PE without compromising its computing performance. The power model and the power-efficiency improvements of the new designs have been validated by measuring the new programs on a real GPU multiprocessing system.

20.
Computing systems should be designed to exploit parallelism in order to improve performance. In general, a GPU (Graphics Processing Unit) can provide more parallelism than a CPU (Central Processing Unit), which has led to the wide use of heterogeneous computing systems that utilize the CPU and the GPU together. In such systems, the efficiency of the scheduling scheme that selects which device, CPU or GPU, executes each application is one of the most critical factors in determining performance. This paper proposes a dynamic scheduling scheme that selects the device based on estimated-execution-time information. The proposed scheme chooses between the CPU and the GPU so as to minimize completion time, resulting in better system performance, although it requires a training period to collect the execution history. According to our simulations, the proposed estimated-execution-time scheduling improves the utilization of the CPU and the GPU compared to existing scheduling schemes, reducing execution time and enhancing the energy efficiency of heterogeneous computing systems.
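A minimal host-side sketch of such estimated-execution-time scheduling follows: a running per-device average of observed runtimes is kept for each kernel class, the next instance goes to the device with the smaller estimate, and devices alternate during the training period. The class names, history structure, and warm-up threshold are illustrative assumptions.

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <string>

struct DeviceStats {
    double total_s = 0; int runs = 0;
    double estimate() const { return runs ? total_s / runs : 0; }
};

class EETScheduler {
    std::map<std::string, DeviceStats> cpu_, gpu_;
public:
    // Runs `task` on CPU or GPU, whichever the history predicts is faster;
    // alternates during the training period so both sides collect samples.
    void run(const std::string& kernel,
             std::function<void()> cpu_task, std::function<void()> gpu_task) {
        DeviceStats& c = cpu_[kernel];
        DeviceStats& g = gpu_[kernel];
        bool use_gpu = (c.runs < 2 || g.runs < 2) ? (g.runs <= c.runs)
                                                  : (g.estimate() < c.estimate());
        auto t0 = std::chrono::steady_clock::now();
        use_gpu ? gpu_task() : cpu_task();
        double dt = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();
        DeviceStats& s = use_gpu ? g : c;
        s.total_s += dt; s.runs++;     // feed the measurement back as history
    }
};
```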
