排序方式: 共有6条查询结果,搜索用时 0 毫秒
1
1.
2.
In this paper,we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared-memory.Unlike ea... 相似文献
3.
GPGPUs are increasingly being used to as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world’s fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method that recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based faulttolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC reduces significantly the fault recovery overheads incurred by FullRC, by 73.5% when errors occur earlier during execution and 74.6% when errors occur later on average. In addition, PartialRC also reduces error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens. 相似文献
4.
Wei Mi 《计算机科学技术学报》2009,24(6):1086-1097
DRAM row buffer conflicts can increase memory access latency significantly. This paper presents a new page-allocation-based
optimization that works seamlessly together with some existing hardware and software optimizations to eliminate significantly
more row buffer conflicts. Validation in simulation using a set of selected scientific and engineering benchmarks against
a few representative memory controller optimizations shows that our method can reduce row buffer miss rates by up to 76% (with
an average of 37.4%). This reduction in row buffer miss rates will be translated into performance speedups by up to 15% (with
an average of 5%). 相似文献
5.
Modern petascale and future exascale systems are massively heterogeneous architectures. Developing productive intra-node programming models is crucial toward addressing their programming challenge. We introduce a directive- based intra-node programming model, OpenMC, and show that this new model can achieve ease of programming, high performance, and the degree of portability desired for heterogeneous nodes, especially those in TianHe supercomputers. While existing models are geared towards oifloading computations to accelerators (typically one), OpenMC alms to more uniformly and adequately exploit the potential offered by multiple CPUs and accelerators in a compute node. OpenMC achieves this by providing a unified abstraction of hardware resources as workers and facilitating the exploitation of asynchronous task parallelism on the workers. We present an overview of OpenMC, a prototyping implementation, and results from some initial comparisons with OpenMP and hand-written code in developing six applications on two types of nodes from TianHe supercomputers. 相似文献
6.
本文论述了系统NET。该系统实现了RD11计算机与dual68000计算机之间的文件传输,并部分实现了RD11机对dual68000机的远程命命执行。 相似文献
1