期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Enqiang Sun David Kaeli 《International journal of parallel programming》2014,42(1):30-48

To obtain significant execution speedups, GPUs rely heavily on the inherent data-level parallelism present in the targeted application. However, application programs may not always be able to fully utilize these parallel computing resources due to intrinsic data dependencies or complex data pointer operations. In this paper, we explore how to leverage aggressive software-based value prediction techniques on a GPU to accelerate programs that lack inherent data parallelism. This class of applications are typically difficult to map to parallel architectures due to the presence of data dependencies and complex data pointer manipulation present in these applications. Our experimental results show that, despite the overhead incurred due to software speculation and the communication overhead between the CPU and GPU, we obtain up to 6.5 $\times $ speedup on a selected set of kernels taken from the SPEC CPU2006, PARSEC and Sequoia benchmark suites. 相似文献

2.

Hybrid Access Cache Indexing Framework Adapted to GPU

下载免费PDF全文

Hongjun Zhang Yanjun Wu Heng Zhang Libo Zhang 《International Journal of Software and Informatics》2021,11(2):195-216

Hash tables, as a type of data indexing structure that provides efficient data access based on key values, are widely used in various computer applications, especially in system software, databases, and high-performance computing field that requires extremely high performance. In network, cloud computing and IoT services, hash tables have become the core system components of cache systems. However, with the large-scale increase in the amount of large-scale data, performance bottlenecks have gradually emerged in systems designed with a multi-core CPU as the core of the hash table structure. There is an urgent need to further improve the high performance and scalability of the hash tables. With the increasing popularity of general-purpose Graphic Processing Units (GPUs) and the substantial improvement of hardware computing capabilities and concurrency performance, various types of system software tasks with parallel computing as the core have been optimized on the GPU and have achieved considerable performance promotion. Due to the sparseness and randomness, using the existing parallel structure of the hash tables directly on the GPUs will inevitably bring high-frequency memory access and frequent bus data transmission, which affects the performance of the hash tables on the GPUs. This study focuses on the analysis of memory access, hit ratio, and index overhead of hash table indexes in the cache system. A hybrid access cache indexing framework CCHT (Cache Cuckoo Hash Table) adapted to GPU is proposed and provided. The cache strategy suitable to different requirements of hit ratios and index overheads allows concurrent execution of write and query operations, maximizing the use of the computing performance and concurrency characteristics of GPU hardware, reducing memory access and bus transferring overhead. Through GPU hardware implementation and experimental verification, CCHT has better performance than other cache indexing hash tables while ensuring cache hit ratios. 相似文献

3.

基于真实历史反馈的自适应值预测器的设计与优化

隋兵才《计算机工程与科学》2021,43(2):274-279

乱序超标量处理器所能获得的指令级并行能力越来越有限,为了获得更高的指令并行性,必须增加更多的乱序执行和控制资源.随着处理器架构的变化,值预测技术能够在现有主流处理器微架构的基础上以更少的硬件开销,获得更高的数据并行性,进一步提升处理器的乱序执行性能.提出了一种基于真实历史反馈的上下文值预测器(RH-VTAGE),通过设置失效列表和预测精度表来控制反馈RH-VTAGE的预测精度,减少预测失效时的流水线恢复开销.同时,在值预测器的最后阶段增加了真实历史反馈的控制计数器,并设计了自适应置信度控制逻辑,针对不同类型的指令按概率对置信度进行动态调整.实际测试结果表明,相对于其他预测器,RH-VTAGE的整数程序预测性能没有明显提升,但是对于浮点程序性能最大提升31.2％. 相似文献

4.

Coarse-grained thread pipelining: a speculative parallel executionmodel for shared-memory multiprocessors

Kazi I.H. Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(9):952-966

This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor 相似文献

5.

Design space exploration of a software speculative parallelization scheme 总被引：2，自引：0，他引：2

Cintra M. Llanos D.R. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(6):562-576

With speculative parallelization, code sections that cannot be fully analyzed by the compiler are optimistically executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and/or memory system. Software schemes require no changes to the hardware of existing shared-memory systems, but can suffer from significant overheads involved with the speculative execution. In fact, the performance of software schemes is highly dependent on application characteristics, the design and implementation of the scheme, and the system configuration and size. This paper explores the design space of a recently proposed software speculative parallelization scheme. In the process, we gain insight into the most beneficial features of software schemes for speculative parallelization, as well as the most influential application characteristics. For instance, experimental results show that, contrary to intuition, checking for data dependence violations on every speculative store, as opposed to at commit time, leads to little performance degradation in the worst case and to significantly better performance with large configurations. Also, scheduling policies based on windows can perform very close to fully dynamic policies with a fraction of the memory overhead. Finally, experimental results show consistent speedups in the execution of loops that cannot be parallelized at compile time, both with and without RAW data dependences, for 4 to 32 processors. 相似文献

6.

A Static Greedy and Dynamic Adaptive Thread Spawning Approach for Loop-Level Parallelism

下载免费PDF全文

李美蓉赵银亮陶悠王启明《计算机科学技术学报》2014,29(6):962-975

Thread-level speculation becomes more attractive for the exploitation of thread-level parallelism from irregular sequential applications. But it is common for speculative threads to fail to reach the expected parallel performance. The reason is that the performance of speculative threads is extremely complicated by the fact that it not only suffers from the imprecision of compiler-directed performance estimation due to ambiguous control and data dependences, but also depends on the underlying hardware configuration and program behaviors. Thus, this paper proposes a statically greedy and dynamically adaptive approach for loop-level speculation to dynamically determine the best loop level at runtime. It relies on the compiler to select and optimize all loop candidates greedily, which are then proceeded on the cost-benefit analysis of different loop nesting levels for the determination of the order of loop speculation. Under the runtime loop execution prediction, we dynamically schedule and update the order of loop speculation, and ensure the best loop level to be always parallelized. Two different policies are also examined to maximize overall performance. Compared with traditional static loop selection techniques, our approach （：an achieve comparable or better performance. 相似文献

7.

Achieving middleware execution efficiency: hardware-assisted garbage collection operations

Jie Tang Shaoshan Liu Zhimin Gu Xiao-Feng Li Jean-Luc Gaudiot 《The Journal of supercomputing》2012,59(3):1101-1119

Although virtualization technologies bring many benefits to cloud computing environments, as the virtual machines provide more features, the middleware layer has become bloated, introducing a high overhead. Our ultimate goal is to provide hardware-assisted solutions to improve the middleware performance in cloud computing environments. As a starting point, in this paper, we design, implement, and evaluate specialized hardware instructions to accelerate GC operations. We select GC because it is a common component in virtual machine designs and it incurs high performance and energy consumption overheads. We performed a profiling study on various GC algorithms to identify the GC performance hotspots, which contribute to more than 50% of the total GC execution time. By moving these hotspot functions into hardware, we achieved an order of magnitude speedup and significant improvement on energy efficiency. In addition, the results of our performance estimation study indicate that the hardware-assisted GC instructions can reduce the GC execution time by half and lead to a 7% improvement on the overall execution time. 相似文献

8.

嵌入式应用中的循环级线程推测并行性分析

卜得庆王耀彬李凌杨洋程一鸣刘志勤吴亚东《计算机应用研究》2019,36(9)

如何有效利用多核提供的丰富晶体管资源对串行程序的执行进行加速是当前研究中的热点问题。线程级推测（thread-level speculation,TLS）技术旨在充分利用多核资源,最大化地开发出串行代码中存在的潜在并行性。目前TLS技术已经在多种串行应用的并行化工作中得到有效利用,但嵌入式应用程序仍未在推测并行化方面进行有效的分析。因此,选取了八个具有代表性的嵌入式应用,对其在循环级推测并行化中的性能提升潜力和运行时特征（数据依赖、线程粒度和并行覆盖率）进行探讨。实验结果表明,利用线程级推测并行化嵌入式应用的加速效果优于指令级并行技术,实验中的最大加速比达到了13.29;在嵌入式应用领域,该技术可以有效地利用4到8核的计算资源。相似文献

9.

Using the Xeon Phi Platform to Run Speculatively-Parallelized Codes

Alvaro Estebanez Diego R. Llanos Arturo Gonzalez-Escribano 《International journal of parallel programming》2017,45(2):225-241

Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism of loops without the need of a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a software, state-of-the-art thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library being used as benchmark. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with respect to conventional multicore systems for the execution of speculatively parallelized code. 相似文献

10.

Efficient breadth first search on multi-GPU systems

Enrico Mastrostefano Massimo Bernaschi 《Journal of Parallel and Distributed Computing》2013

Simple algorithms for the execution of a Breadth First Search on large graphs lead, running on clusters of GPUs, to a situation of load unbalance among threads and un-coalesced memory accesses, resulting in pretty low performances. To obtain a significant improvement on a single GPU and to scale by using multiple GPUs, we resort to a suitable combination of operations to rearrange data before processing them. We propose a novel technique for mapping threads to data that achieves a perfect load balance by leveraging prefix-sum and binary search operations. To reduce the communication overhead, we perform a pruning operation on the set of edges that needs to be exchanged at each BFS level. The result is an algorithm that exploits at its best the parallelism available on a single GPU and minimizes communication among GPUs. We show that a cluster of GPUs can efficiently perform a distributed BFS on graphs with billions of nodes. 相似文献

11.

Exploiting the parallelism available in loops 总被引：1，自引：0，他引：1

Lilja D.J. 《Computer》1994,27(2):13-26

Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism. This article explains scheduling techniques and compares results on different architectures. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio to determine a loop's maximum possible parallelism, given infinite hardware. Then we look at specific architectures and techniques. Loops can provide a large portion of the parallelism available in an application program, since the iterations of a loop may be executed many times. To exploit this parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. The choice of technique depends on the underlying architecture of the parallel machine and the characteristics of each individual loop 相似文献

12.

Shared write buffer to boost applications on SpMT architecture

Ming Chen John Ye Tianzhou Chen Hongjun Dai 《The Journal of supercomputing》2017,73(8):3508-3525

With the trend of growing number of integrated processing cores on Chip Multiprocessors, researchers are working hard to increase the available parallelism of software programs so as to efficiently harness the growing computing power. One noticeable direction among these efforts is speculative multi-threading (SpMT), a.k.a thread level speculation, which aims to extract thread level parallelism by splitting a sequential execution thread into several finer ones and execute them on parallel. A SpMT thread is in speculative status before it “knows” all its input data are correct. A speculative thread needs to write to the L1 cache, but its output might be discarded if the speculation eventually fails. However, another speculative thread may have already read in such speculative output. Therefore, some mechanism is needed to support speculative read and write. And because the SpMT threads are extracted from a single thread, they usually share lots of data, thus there might be intense data coherence among the L1 caches. It would be very complicated to support data coherence and speculation together. This Paper proposes a shared write buffer among the SpMT cores. We are able to confine the speculative read and write in the SWB, thus the speculation will not interference with coherence, and the L1 cache design could be drastically simplified. Experiments show that the SWB can capture a big portion of inter-core data sharing, reduce cache coherence, and drastically improve data access performance of SpMT threads. 相似文献

13.

Performance,optimization, and fitness: Connecting applications to architectures

Mohammad A. Bhuiyan Melissa C. Smith Vivek K. Pallipuram 《Concurrency and Computation》2011,23(10):1066-1100

Recent trends involving multicore processors and graphical processing units (GPUs) focus on exploiting task‐ and thread‐level parallelism. In this paper, we have analyzed various aspects of the performance of these architectures including NVIDIA GPUs, and multicore processors such as Intel Xeon, AMD Opteron, IBM's Cell Broadband Engine. The case study used in this paper is a biological spiking neural network (SNN), implemented with the Izhikevich, Wilson, Morris–Lecar, and Hodgkin–Huxley neuron models. The four SNN models have varying requirements for communication and computation making them useful for performance analysis of the hardware platforms. We report and analyze the variation of performance with network (problem size) scaling, available optimization techniques and execution configuration. A Fitness performance model, that predicts the suitability of the architecture for accelerating an application, is proposed and verified with the SNN implementation results. The Roofline model, another existing performance model, has also been utilized to determine the hardware bottleneck(s) and attainable peak performance of the architectures. Significant speedups for the four SNN neuron models utilizing these architectures are reported; the maximum speedup of 574x was observed in our GPU implementation. Our results and analysis show that a proper match of architecture with algorithm complexity provides the best performance. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

14.

Containers on the Parallelization of General-Purpose Java Programs 总被引：1，自引：0，他引：1

Peng Wu David Padua 《International journal of parallel programming》2000,28(6):589-605

Static parallelization of general-purpose programs is still impossible, in general, due to their common use of pointers, irregular data structures, and complex control-flows. One promising strategy is to exploit parallelism at runtime. Runtime parallelization schemes, particularly data speculations, alleviate the need to statically prove independent computations at compile-time. However, studies show that many real-world applications exhibit limited speculative parallelism to offset the overhead and penalty of speculation schemes. This paper addresses this issue by using compiler analyses to compensate for speculative parallelizations. We focus on general-purpose Java programs with extensive use of Java container classes. In our scheme, compilers serve as a guideline of where to speculate by lazily detecting dependences that are mostly static, while leaving those that are more dynamic to runtime. We also propose techniques to enhance speculative parallelism in the programs. The experimental results show that, after eliminating static dependences, the four applications we study exhibit significant parallelism that can be gainfully exploited by a speculative parallelization system. 相似文献

15.

A Methodology for Granularity-Based Control of Parallelism in Logic Programs

P. LOPEZ M. HERMENEGILDO S. DEBRAY 《Journal of Symbolic Computation》1996,21(4-6)

Several types of parallelism can be exploited in logic programs while preserving correctness and efficiency, i.e. ensuring that the parallel execution obtains the same results as the sequential one and the amount of work performed is not greater. However, such results do not take into account a number of overheads which appear in practice, such as process creation and scheduling, which can induce a slow-down, or, at least, limit speedup, if they are not controlled in some way. This paper describes a methodology whereby the granularity of parallel tasks, i.e. the work available under them, is efficiently estimated and used to limit parallelism so that the effect of such overheads is controlled. The run-time overhead associated with the approach is usually quite small, since as much work is done at compile time as possible. Also, a number of run-time optimizations are proposed. Moreover, a static analysis of the overhead associated with the granularity control process is performed in order to decide its convenience. The performance improvements resulting from the incorporation of grain size control are shown to be quite good, specially for systems with medium to large parallel execution overheads. 相似文献

16.

VForce: An environment for portable applications on high performance systems with accelerators

Nicholas Moore Miriam Leeser Laurie Smith King 《Journal of Parallel and Distributed Computing》2012

Special Purpose Processors (SPPs), including Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), are increasingly being used to accelerate scientific applications. VForce aims to aid application programmers in using such accelerators with minimal changes in user code. VForce is an extensible middleware framework that enables VSIPL++ (the Vector Signal Image Processing Library extension) programs to transparently use Special Purpose Processors (SPPs) while maintaining portability across platforms with and without SPP hardware. The framework is designed to maintain a VSIPL++-like environment and hide hardware-specific details from the application programmer while preserving performance and productivity. VForce focuses on the interface between application code and accelerator code. The same application code can run in software on a general purpose processor or take advantage of SPPs if they are available. VForce is unique in that it supports calls to both FPGAs and GPUs while requiring no changes in user code. Results on systems with NVIDIA Tesla GPUs and Xilinx FPGAs are presented. This paper describes VForce, illustrates its support for portability, and discusses lessons learned for providing support for different hardware configurations at run time. Key considerations involve global knowledge about the relationship between processing steps for defining application mapping, memory allocation, and task parallelism. 相似文献

17.

Cache-efficient numerical algorithms using graphics hardware

《Parallel Computing》2007,33(10-11):663-684

We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth on GPUs. We further improve the performance of numerical algorithms by accounting for the same relative memory address accesses performed at data elements in nested loops. Based on the similarity of memory accesses performed at the data elements in the input array, we decompose the input arrays into sub-arrays with similar memory access patterns and execute on the sub-arrays for faster execution. Our approach achieves high memory performance on GPUs by tiling the computation and thereby improving the cache-efficiency. Overall, our formulation for GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer. This makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe 2–10× improvement in performance. 相似文献

18.

Compiling C for the EARTH multithreaded architecture

Laurie J. Hendren Xinan Tang Yingchun Zhu Shereen Ghobrial Guang R. Gao Xun Xue Haiying Cai Pierre Ouellet 《International journal of parallel programming》1997,25(4):305-338

Multithreaded architectures provide an opportunity for efficiently executing programs with irregular parallelism and/or irregular locality. This paper presents a strategy that makes use of the multithreaded execution model without exposing multithreading to the programmer. Our approach is to design simple extensions to C, and to provide compiler support that automatically translates high-level C programs into lower-level threaded programs. In this paper we present EARTH-C our extended C language which contains simple constructs for specifying control parallelism, data locality, shared variables and atomic operations. Based on EARTH-C, we describe compiler techniques that are used for translating to lower-level Threaded-C programs for the EARTH multithreaded architecture. We demonstrate our approach with six benchmark programs. We show that even naive EARTH-C programs can lead to reasonable performance, and that more advanced EARTH-C programs can give performance very close to hand-coded threated-C programs. This work supported, in part, by NSERC and FCAR. 相似文献

19.

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Hong Jun Choi Dong Oh Son Jong Myon Kim Cheol Hong Kim 《The Journal of supercomputing》2014,69(1):330-356

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. Graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of GPU due to branch divergence. In this paper, we propose concurrent warp execution (CWE) technique to reduce the performance degradation of GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85 % over PDOM, 91 % over DWF) with little hardware overhead. 相似文献

20.

一种适应GPU的混合访问缓存索引框架

张鸿骏武延军张珩张立波《软件学报》2020,31(10):3038-3055

散列表（hash table）作为一类根据关键码值（key value）提供高效数据访问的数据索引结构,其广泛应用于各类计算机应用中,尤其是在对性能要求极高的系统软件、数据库以及高性能计算领域.在网络、云计算和物联网服务方面,以散列表为核心结构已经成为缓存系统的重要系统组件.然而,随着大规模数据量的大幅度增加,以多核CPU为核心设计散列表结构的系统已经逐渐出现性能瓶颈,亟需进一步改进散列表的高性能和可扩展性.随着通用图形处理器（graphic processing unit,简称GPU）的日益普及以及硬件计算能力和并发性能的大幅度提升,各类以并行计算为核心的系统软件任务在GPU上进行了优化设计并得到可观的性能提升.由于存在稀疏性和随机性,采用现有散列表的并行结构直接在GPU上应用势必会带来高频次的内存访问和频繁的总线数据传输,影响了散列表在GPU上的性能发挥.重点分析了缓存系统中散列表索引的内存访问、命中率与索引开销,提出并设计了一种适应GPU的混合访问缓存索引框架CCHT（cache cuckoo hash table）,提供了两种适应不同命中率和索引开销要求的缓存策略,允许写入与查询操作并发执行,最大程度地利用了GPU硬件的计算性能与并发特性,减少了内存访问与总线传输.通过在GPU硬件上的实现与实验验证,CCHT在保证缓存命中率的同时,性能优于其他用于缓存索引的散列表. 相似文献