1.

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Hong Jun Choi Dong Oh Son Jong Myon Kim Cheol Hong Kim 《The Journal of supercomputing》2014,69(1):330-356

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence caused by branch instructions leaves computational resources underutilized, which degrades the performance of SIMD architecture. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, unlike graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of the GPU due to branch divergence. In this paper, we propose the concurrent warp execution (CWE) technique to reduce the performance degradation of the GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over PDOM, 91% over DWF) with little hardware overhead.
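The underutilization that this abstract attributes to branch divergence can be illustrated with a small toy model. The sketch below is not the CWE technique, PDOM, or DWF; it only assumes that a warp executes the two sides of a divergent branch one after the other, with inactive lanes idle. The function names, the bit-mask encoding of active lanes, and the tiny warp size are all hypothetical choices made for illustration.

```python
def divergent_branch_masks(taken_mask, then_len, else_len, warp_size=32):
    """Return the active-lane mask for each issued instruction of an if/else.

    Assumption: the warp serialises the two paths, running the 'then' path
    with only the taken lanes active and the 'else' path with the rest.
    A path with no active lanes is skipped entirely.
    """
    full = (1 << warp_size) - 1
    not_taken = full & ~taken_mask
    masks = []
    if taken_mask:
        masks += [taken_mask] * then_len
    if not_taken:
        masks += [not_taken] * else_len
    return masks


def simd_utilization(active_masks, warp_size=32):
    """Fraction of lane slots doing useful work over the issued instructions."""
    useful = sum(bin(mask).count("1") for mask in active_masks)
    return useful / (warp_size * len(active_masks))


# A 4-lane warp where half the lanes take the branch: each path runs with
# two of four lanes active, so half of the lane slots are wasted.
diverged = divergent_branch_masks(0b0011, then_len=2, else_len=2, warp_size=4)
print(simd_utilization(diverged, warp_size=4))  # 0.5

# When every lane takes the branch there is no divergence.
uniform = divergent_branch_masks(0b1111, then_len=2, else_len=2, warp_size=4)
print(simd_utilization(uniform, warp_size=4))  # 1.0
```

Under this model, any scheme that fills the idle lanes with threads from another warp raises the utilization figure, which is the general direction the abstract describes.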

2.

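Context-bounded reachability analysis of concurrent programs with Fork/Join parallelism

Qian Junyan Jia Shugui Cai Guoyong Zhao Lingzhong 《Computer Engineering and Science》2013,35(2):1-6

With the steady development of multicore technology, concurrent programs introduce Fork/Join parallelism to decompose tasks into finer-grained subtasks that execute in parallel, making full use of the computing power of multicore processors. Interleavings among concurrently executing threads can produce subtle programming errors, so the correctness of such concurrent programs needs to be analyzed. Context-bounded analysis is an efficient method for detecting subtle errors in concurrent programs: it computes the states reachable within a bounded number of thread context switches and determines whether an error state is reachable. The idea behind reachability analysis for concurrent programs with Fork/Join parallelism is as follows. First, the dynamic concurrent program is modeled as a dynamic concurrent pushdown system P that can simulate thread Fork/Join operations. Then a concurrent pushdown system Pk that simulates the k-bounded executions of P is extracted from P. Existing context-bounded reachability algorithms can solve the k-bounded reachability problem for the extracted concurrent pushdown system.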

3.

Multiparadigm distributed computing with TPVM

Adam Ferrari V. S. Sunderam 《Concurrency and Computation》1998,10(3):199-228

Distributed concurrent computing based on lightweight processes can potentially address performance and functionality limits in heterogeneous systems. The TPVM framework, based on the notion of ‘exportable services’, is an extension to the PVM message-passing system, but uses threads as units of computing, scheduling, and parallelism. TPVM facilitates and supports three different distributed concurrent programming paradigms: (a) the traditional, task-based, explicit message-passing model; (b) a data-driven instantiation model that enables straightforward specification of computation based on data dependencies; and (c) a partial shared-address space model via remote memory access, with naming and typing of distributed data areas. The latter models offer significantly different computing paradigms for network-based computing, while maintaining a close resemblance to, and building upon, the conventional PVM infrastructure in the interest of compatibility and ease of transition. The TPVM system comprises three basic modules: a library interface that provides access to thread-based distributed concurrent computing facilities, a portable thread interface module which abstracts the required thread-related services, and a thread server module which performs scheduling and system data management. System implementation as well as application experiences have been very encouraging, indicating the viability of the proposed models, the feasibility of portable and efficient thread systems for distributed computing, and the performance improvements that result from multithreaded concurrent computing. © 1998 John Wiley & Sons, Ltd.

4.

Mapping of option pricing algorithms onto heterogeneous many-core architectures

Shuai Zhang Zhao Wang Ying Peng Bertil Schmidt Weiguo Liu 《The Journal of supercomputing》2017,73(9):3715-3737

The rapid development of technologies and applications in recent years poses high demands and challenges for high-performance computing. Because of their competitive performance/price ratio, heterogeneous many-core architectures are widely used in high-performance computing areas. GPU and Xeon Phi are two popular general-purpose many-core accelerators. In this paper, we demonstrate how heterogeneous many-core architectures, powered by multi-core CPUs, CUDA-enabled GPUs and Xeon Phis can be used as an efficient computational platform to accelerate popular option pricing algorithms. In order to make full use of the compute power of this architecture, we have used a hybrid computing model which consists of two types of data parallelism: worker level and device level. The worker level data parallelism uses a distributed computing infrastructure for task distribution, while the device level data parallelism uses both the multi-core CPUs and many-core accelerators for fast option pricing calculation. Experiments show that our implementations achieve good performance and scalability on this architecture and also outperform other state-of-the-art GPU-based solutions for Monte Carlo European/American option pricing and BSDE European option pricing.

5.

Characterizing the challenges and evaluating the efficacy of a CUDA-to-OpenCL translator

Mark Gardner Paul Sathre Wu-chun Feng Gabriel Martinez 《Parallel Computing》2013

The proliferation of heterogeneous computing systems has led to increased interest in parallel architectures and their associated programming models. One of the most promising models for heterogeneous computing is the accelerator model, and one of the most cost-effective, high-performance accelerators currently available is the general-purpose graphics processing unit (GPU).

6.

Modeling and characterizing GPGPU reliability in the presence of soft errors

Jingweijia Tan Yang Yi Fangyang Shen Xin Fu 《Parallel Computing》2013

General-purpose computing on graphics processing units (GPGPU) is becoming increasingly popular due to its high computational throughput for data-parallel applications. Modern GPU architectures have limited capability for error detection and fault tolerance since they were originally designed for graphics processing. However, rigorous execution correctness is required for general-purpose applications, which makes reliability a growing concern in GPGPU architecture design. With CMOS processing technologies continuously scaling down to the nano-scale, the on-chip soft error rate (SER) has been predicted to increase exponentially. GPGPUs with hundreds of cores integrated into a single chip are prone to manifest high SER. This paper explores a first step to model and characterize GPGPU reliability in light of soft errors. We develop GPGPU-SODA (GPGPU SOftware Dependability Analysis), a framework to estimate the soft-error vulnerability of GPGPU microarchitecture. By using GPGPU-SODA, we observe that several microarchitecture structures in GPGPUs exhibit high soft-error susceptibility, and the structure vulnerability is sensitive to the workload characteristics (e.g. branch divergences, memory access pattern). We further investigate the impact of several architectural optimizations on GPU soft-error robustness. For example, we find that increasing the number of threads supported by the GPU significantly affects the GPGPU soft-error robustness. However, changing the warp scheduling policy has little impact on the structure vulnerability. The observations made in this study provide designers with useful guidance for building resilient GPGPUs: a comprehensive resiliency solution for GPGPUs should consider the entire GPGPU design instead of solely focusing on a particular structure.

7.

A comparative study of GPU programming models and architectures using neural networks

Vivek K. Pallipuram Mohammad Bhuiyan Melissa C. Smith 《The Journal of supercomputing》2012,61(3):673-718

Recently, General Purpose Graphical Processing Units (GP-GPUs) have been identified as an intriguing technology to accelerate numerous data-parallel algorithms. Several GPU architectures and programming models are beginning to emerge and establish their niche in the High-Performance Computing (HPC) community. New massively parallel architectures such as Nvidia's Fermi and AMD/ATi's Radeon pack tremendous computing power in their large number of multiprocessors. Their performance is unleashed using one of the two GP-GPU programming models: Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). Both of them offer constructs and features that have direct bearing on the application runtime performance. In this paper, we compare the two GP-GPU architectures and the two programming models using a two-level character recognition network. The two-level network is developed using four different Spiking Neural Network (SNN) models, each with different ratios of computation-to-communication requirements. To compare the architectures, we have chosen the two extremes of the SNN models for implementation of the aforementioned two-level network. An architectural performance comparison of the SNN application running on Nvidia's Fermi and AMD/ATi's Radeon is done using the OpenCL programming model, exhausting all of the optimization strategies plausible for the two architectures. To compare the programming models, we implement the two-level network on Nvidia's Tesla C2050 based on the Fermi architecture. We present a hierarchy of implementations, where we successively add optimization techniques associated with the two programming models. We then compare the two programming models at these different levels of implementation and also present the effect of the network size (problem size) on the performance. We report significant application speed-up, as high as 1095× for the most computation-intensive SNN neuron model, against a serial implementation on the Intel Core 2 Quad host. A comprehensive study presented in this paper establishes connections between programming models, architectures and applications.

8.

On the promise of general-purpose parallel computing

James J. Hack 《Parallel Computing》1989,10(3):261-275

It has become generally accepted that continued improvements in high-performance scientific computation will be achieved only through the ‘exploitation of parallelism’. Despite the nebulous nature of this expression, enthusiasm for the potential of parallel computing has led to calls for improvements in computational performance of more than a thousand-fold in the next few years, or for what is sometimes referred to as a Teraflop (one trillion floating-point operations per second) Computer. Such a system is envisioned as a general-purpose tool for accelerating progress in such widely varied applications as astronomy, biochemistry, circuit analysis, computational fluid dynamics, global economic modeling, high energy physics, materials science, structural analysis, and weather prediction.

Although parallel architectures appear to offer the greatest promise for significant improvements in overall computational performance, it is not yet clear whether a general-purpose parallel architecture can realize the large increases solicited by the scientific community. This note will take a practical look at the prospect for general-purpose parallel computation and will consider some of the potential limitations by using a simple parametric model of computational performance.

9.

A Superscalar software architecture model for Multi-Core Processors (MCPs)

Gyu Sang Choi Chita R. Das 《Journal of Systems and Software》2010,83(10):1823-1837

Design of high-performance servers has become a research thrust to meet the increasing demand of network-based applications. One approach to designing such architectures is to exploit the enormous computing power of Multi-Core Processors (MCPs) that are envisioned to become the state of the art in processor architecture. In this paper, we propose a new software architecture model, called SuperScalar, suitable for MCP machines. The proposed SuperScalar model consists of multiple pipelined thread pools, where each pipelined thread pool consists of multiple threads, and each thread takes a different role. The main advantages of the proposed model are global information sharing by the threads and minimal memory requirement due to fewer threads. We have conducted in-depth performance analyses of the proposed scheme along with three prior software architecture schemes (Multi-Process (MP), Multi-Thread (MT) and Event-Driven (ED)) via an analytical model. The performance results indicate that the proposed SuperScalar model shows the best performance across all system and workload parameters compared to the MP, MT and ED models. Although the MT model shows competitive performance with fewer processing cores and a smaller data cache size, the advantage of the SuperScalar model becomes obvious as the number of processing cores increases.

10.

Real-time 3D microtubule gliding simulation accelerated by GPU computing

Gregory Gutmann Daisuke Inoue Akira Kakugo Akihiko Konagaya 《International Journal of Automation and Computing》2016,13(2):108-116

A microtubule gliding assay is a biological experiment observing the dynamics of microtubules driven by motor proteins fixed on a glass surface. When appropriate microtubule interactions are set up in gliding assay experiments, microtubules often organize and create higher-level dynamics such as ring and bundle structures. In order to reproduce such higher-level dynamics on computers, we have been focusing on making a real-time 3D microtubule simulation. This real-time 3D microtubule simulation enables us to gain more knowledge of microtubule dynamics and their swarm movements by adjusting simulation parameters in real time. One of the technical challenges when creating a real-time 3D simulation is balancing the 3D rendering and the computing performance. Graphics processing unit (GPU) programming plays an essential role in balancing the millions of tasks, and makes this real-time 3D simulation possible. By the use of general-purpose computing on graphics processing units (GPGPU) programming we are able to run the simulation in a massively parallel fashion, even when dealing with more complex interactions between microtubules such as overriding and snuggling. Because performance is an important factor, a performance model has also been constructed from the analysis of the microtubule simulation, and it is consistent with the performance measurements on different GPGPU architectures with regard to the number of cores and clock cycles.

11.

Coping with Java threads

Sanden B. 《Computer》2004,37(4):20-27

A thread is a basic unit of program execution that can share a single address space with other threads - that is, they can read and write the same variables and data structures. Originally, only assembly programmers used threads. A few older programming languages such as PL/I supported thread concurrency, but newer languages such as C and C++ use libraries instead. Only recently have programming languages again begun to build in direct support for threads. Java and Ada are examples of industry-strength languages for multithreading. The Java thread model has its roots in traditional concurrent programming. As the "real-time specification for Java" sidebar describes, RTSJ attempts to remove some of the limitations relative to real-time applications - primarily by circumventing garbage collection. But RTSJ does not make the language safer. It retains standard Java's threading pitfalls and is a risky candidate for critical concurrent applications.
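The statement that threads read and write the same variables can be made concrete with a small sketch. The example is written in Python rather than Java, purely to keep one language across the illustrations on this page, and `run_counter` is a hypothetical helper, not something taken from the article. It shows several threads updating one shared counter, with a lock guarding the read-modify-write step that would otherwise be a typical threading pitfall.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Have several threads increment one shared counter and return its value."""
    counter = {"value": 0}    # shared state, visible to every thread
    lock = threading.Lock()   # serialises the read-modify-write below

    def worker():
        for _ in range(increments):
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()              # wait until every worker has finished
    return counter["value"]


print(run_counter(4, 1000))  # 4000
```

Without the lock, the increment is not guaranteed to be atomic, so updates from different threads could in principle be lost; the lock makes the final count deterministic.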

12.

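Automatic parallel code generation for the Sunway heterogeneous architecture

Tao Xiaohan Zhu Yu Pang Jianmin Zhao Jie Xu Jinlong 《Journal of Software》2023,34(4):1570-1593

Heterogeneous architectures are gradually becoming the mainstream in high-performance computing, but compared with homogeneous multicore architectures their hardware structure and memory hierarchy are more complex, and programs for them are harder to write. Advanced optimizing compilers can help developers produce more efficient code and reduce development complexity. The polyhedral compilation model abstracts a program into a representation as polyhedra in space, which allows it to combine multiple loop transformations with hardware mapping and to generate code for a specific architecture. This work designs and implements an automatic parallel code generation system for the domestically developed Sunway heterogeneous architecture. The system uses a source-to-source compilation mode and is built on the polyhedral compilation model. Based on the characteristics of the Sunway heterogeneous architecture, the system maps the program's computation onto the hardware and automatically manages data transfer and memory space. Experiments use the linear-algebra test cases of the Polybench suite. The results show that the heterogeneous parallel code generated by the system runs correctly on the Sunway heterogeneous platform and exploits its performance effectively: with 64 threads accelerating the computation on the Sunway heterogeneous platform, the average speedup reaches 539.16×.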

13.

Scout: a data-parallel programming language for graphics processors

《Parallel Computing》2007,33(10-11):648-662

Commodity graphics hardware has seen incredible growth in terms of performance, programmability, and arithmetic precision. Even though these trends have been primarily driven by the entertainment industry, the price-to-performance ratio of graphics processors (GPUs) has attracted the attention of many within the high-performance computing community. While the performance of the GPU is well suited for computational science, the programming interface and several hardware limitations have prevented its wide adoption. In this paper we present Scout, a data-parallel programming language for graphics processors that hides the nuances of both the underlying hardware and supporting graphics software layers. In addition to general-purpose programming constructs, the language provides extensions for scientific visualization operations that support the exploration of existing or computed data sets.

14.

Scala Actors: Unifying thread-based and event-based programming

Philipp Haller Martin Odersky 《Theoretical computer science》2009,410(2-3):202-220

There is an impedance mismatch between message-passing concurrency and virtual machines, such as the JVM. VMs usually map their threads to heavyweight OS processes. Without a lightweight process abstraction, users are often forced to write parts of concurrent applications in an event-driven style which obscures control flow, and increases the burden on the programmer. In this paper we show how thread-based and event-based programming can be unified under a single actor abstraction. Using advanced abstraction mechanisms of the Scala programming language, we implement our approach on unmodified JVMs. Our programming model integrates well with the threading model of the underlying VM.

15.

Implementation of a parallel spatial interpolation algorithm on graphics processors

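Zhao Yanwei Cheng Zhenlin Dong Hui Fang Jinyun 《Journal of Image and Graphics》2012,17(4):575-581

Spatial interpolation is a computationally complex and time-consuming operation in spatial analysis for geographic information systems (GIS), so it cannot meet real-time requirements. As the floating-point performance of graphics processing units (GPUs) has risen sharply, general-purpose GPU computing has become a research focus for handling complex computation in the GIS field, offering a good opportunity to make some traditionally inefficient algorithms run in real time. Exploiting the GPU's strength in parallel computing, the inverse distance weighting (IDW) interpolation algorithm is mapped onto the Compute Unified Device Architecture (CUDA) parallel programming framework. A two-level index is first built on the GPU so that the computation is partitioned into sensible levels, and a multithreaded blocking strategy is then used to perform the interpolation in parallel. Experiments show that the interpolation error relative to the CPU method can be kept to the order of 10^-6, and that with a large interpolation radius and many data points the algorithm achieves a speedup of more than 40×, which demonstrates the correctness and efficiency of the method.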
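For readers unfamiliar with inverse distance weighting, a minimal sequential sketch of the per-point computation follows. It is not the paper's CUDA implementation: the two-level index and the thread-blocking strategy are omitted, and the function name, the `power` exponent, and the optional `radius` cutoff are assumptions chosen for illustration. Each query point is computed independently of the others, which is the property that makes the method a natural fit for one-thread-per-point parallelism.

```python
import math


def idw_interpolate(samples, query, power=2.0, radius=None):
    """Estimate the value at `query` from (x, y, value) samples by IDW.

    Each sample is weighted by 1 / distance**power. Samples farther away
    than `radius` (if given) are ignored. A sample that coincides with the
    query point determines the result exactly.
    """
    qx, qy = query
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in samples:
        d = math.hypot(x - qx, y - qy)
        if d == 0.0:
            return value          # avoid division by zero at a sample point
        if radius is not None and d > radius:
            continue
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    if weight_total == 0.0:
        raise ValueError("no samples within the interpolation radius")
    return weighted_sum / weight_total


samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
print(idw_interpolate(samples, (1.0, 0.0)))  # 15.0, midway between two equally distant samples
```

A larger interpolation radius pulls more samples into each sum, which raises the cost per query point; that is the regime in which the abstract reports the largest speedup.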

16.

Gauss: A Framework for Verifying Scientific Computing Software

Robert Palmer Steve Barrus Yu Yang Ganesh Gopalakrishnan Robert M. Kirby 《Electronic Notes in Theoretical Computer Science》2006,144(3):95

High performance scientific computing software is of critical international importance as it supports scientific explorations and engineering. Software development in this area is highly challenging owing to the use of parallel/distributed programming methods and complex communication and synchronization libraries. There is very little use of formal methods to debug software in this area, given that the scientific computing community and the formal methods community have not traditionally worked together. The Utah Gauss project combines expertise from scientific computing and formal methods in addressing this problem. We currently focus on MPI programs, which are the kind that run on over 60% of the world's supercomputers. These are programs written in C / C++ / FORTRAN employing message passing concurrency supported by the Message Passing Interface (MPI) library. Large-scale MPI programs also employ shared memory threads to manage concurrency within smaller task sub-groups, capitalizing on the recent availability of small-scale (e.g. single-chip) shared memory multiprocessors; such mixed programming styles can result in additional bugs. MPI libraries themselves can be buggy as they strive to implement complex requirements employing aggressive techniques such as multi-threading. We have built a model extractor that extracts from MPI C programs a formal model consisting of communicating processes represented in Microsoft's Zing modeling language. MPI library functions are also being modeled in Zing. This allows us to run formal analysis on the models to detect bugs in the MPI programs being analyzed. Our preliminary results and future plans are described; in addition, our contribution is to expose the special needs of this area and suggest specific avenues for problem-driven advances in software model-checking applied to scientific computing software development and verification.

17.

Distributed transactional memory for metric-space networks

Maurice Herlihy Ye Sun 《Distributed Computing》2007,20(3):195-208

Transactional Memory is a concurrent programming API in which concurrent threads synchronize via transactions (instead of locks). Although this model has mostly been studied in the context of multiprocessors, it has attractive features for distributed systems as well. In this paper, we consider the problem of implementing transactional memory in a network of nodes where communication costs form a metric. The heart of our design is a new cache-coherence protocol, called the Ballistic protocol, for tracking and moving up-to-date copies of cached objects. For constant-doubling metrics, a broad class encompassing both Euclidean spaces and growth-restricted networks, this protocol has stretch logarithmic in the diameter of the network. Supported by NSF grant 0410042 and by grants from Intel Corporation and Sun Microsystems.

18.

Seamless hardware-software integration in reconfigurable computing systems (total citations: 3; self-citations: 0; citations by others: 3)

Vuletić M. Pozzi L. Ienne P. 《Design & Test of Computers, IEEE》2005,22(2):102-113

Ideally, reconfigurable-system programmers and designers should code algorithms and write hardware accelerators independently of the underlying platform. To realize this scenario, the authors propose a portable, hardware-agnostic programming paradigm, which delegates platform-specific tasks to a system-level virtualization layer. This layer supports a chosen programming model and hides platform details from users much as general-purpose computers do. We introduce a multithreaded programming model for reconfigurable computing based on a unified virtual-memory image for both software and hardware application parts. We also address the challenge of achieving seamless hardware-software interfacing and portability with minimal performance penalties.

19.

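Design and implementation of a reconfigurable computing system (total citations: 4; self-citations: 1; citations by others: 3)

Luo Yihui Li Renfa Xiong Shuchu 《Application Research of Computers》2006,23(1):154-156

Reconfigurable computing is a new approach to building computing systems. It makes up for the shortcomings of existing general-purpose processors and dedicated hardware computing systems: it remains programmable after manufacture while also delivering high computing performance and computational density. After briefly introducing the architecture of reconfigurable computing systems, this paper presents one way of implementing such a system through an example of an embedded real-time control system.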

20.

Compiler Techniques for the Superthreaded Architectures

Jenn-Yuan Tsai Zhenzhen Jiang Pen-Chung Yew 《International journal of parallel programming》1999,27(1):1-19

Several useful compiler and program transformation techniques for the superthreaded architectures are presented in this paper. The superthreaded architecture adopts a thread pipelining execution model to facilitate runtime data dependence checking between threads, and to maximize thread overlap to enhance concurrency. In this paper, we present some important program transformation techniques to facilitate concurrent execution among threads, and to manage critical system resources such as the memory buffers effectively. We evaluate the effectiveness of those program transformation techniques by applying them manually on several benchmark programs, and using a trace-driven, cycle-by-cycle superthreaded processor simulator. The simulation results show that a superthreaded processor can achieve promising speedup for most of the benchmark programs.