A Network-Failure-Tolerant Message-Passing System for Terascale Clusters |
| |
Authors: | Graham, Richard L.; Choi, Sung-Eun; Daniel, David J.; Desai, Nehal N.; Minnich, Ronald G.; Rasmussen, Craig E.; Risinger, L. Dean; Sukalski, Mitchel W. |
| |
Affiliation: | (1) Advanced Computing Laboratory, MS-B287, Los Alamos National Laboratory, Los Alamos, New Mexico, 87545 |
| |
Abstract: | The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented. |
| |
This article is indexed in SpringerLink and other databases. |
|