
Task-Based Parallel Programming Model Supporting Fault Tolerance

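Citation:	WANG Yi-Zhuo, CHEN Xu, JI Wei-Xing, SU Yan, WANG Xiao-Jun, SHI Feng. Task-Based Parallel Programming Model Supporting Fault Tolerance[J]. Journal of Software (软件学报), 2016, 27(7): 1789-1804.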

Authors:	WANG Yi-Zhuo, CHEN Xu, JI Wei-Xing, SU Yan, WANG Xiao-Jun, SHI Feng

Affiliation:	School of Computer Science, Beijing Institute of Technology, Beijing 100081, China (all authors)

Funding:	National Natural Science Foundation of China (61300011)

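Abstract:	Task-based parallel programming models have become the mainstream approach to parallel programming; they improve the performance of parallel computer systems by exploiting task parallelism. This paper proposes a task-based parallel programming model that supports fault tolerance. By integrating fault-tolerance techniques into the task-based model, it improves system reliability while preserving performance. The model uses the task as the basic unit of scheduling, execution, error detection, and recovery, and provides fault tolerance at the application level. A Buffer-Commit computation model supports the detection of and recovery from transient errors; application-level diskless checkpointing provides recovery from permanent errors caused by node failures; and a fault-tolerant work-stealing task scheduling strategy achieves dynamic load balancing. Experimental results show that the model provides fault tolerance against hardware errors with low performance overhead.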
Keywords:	parallel programming; fault tolerance; task parallelism; work-stealing scheduling; load balancing
Received:	2014/12/31
Revised:	3/2/2015
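The abstract names the Buffer-Commit model but does not say how a transient error is detected. The Python sketch below illustrates the general buffer-then-commit idea under explicit assumptions: a task writes its results into a private buffer instead of shared state, the task is executed twice and the two buffers are compared to detect a transient error (duplicate execution is an assumption of this sketch, not something the abstract states), and the buffer is committed only when both copies agree; otherwise it is discarded and the task is re-executed. All names (`Task`, `run_once`, `execute_with_buffer_commit`, `make_flaky_square`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

# Committed (shared) state visible to all tasks; a hypothetical representation.
Store = Dict[str, int]


@dataclass
class Task:
    name: str
    # The body reads a snapshot of committed state and writes only to its buffer.
    body: Callable[[Store, Store], None]


def run_once(task: Task, store: Store) -> Store:
    """Execute the task body once, capturing its writes in a private buffer."""
    buffer: Store = {}
    task.body(dict(store), buffer)  # pass a copy so the body cannot touch committed state
    return buffer


def execute_with_buffer_commit(task: Task, store: Store, max_attempts: int = 3) -> int:
    """Run `task` until two executions agree, then commit; return the attempt count.

    Detecting errors by duplicate execution is an assumption of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        first = run_once(task, store)
        second = run_once(task, store)
        if first == second:
            store.update(first)  # commit: publish the buffered writes
            return attempt
        # Mismatch: a transient error was detected; drop both buffers and retry.
    raise RuntimeError(f"task {task.name!r} did not produce consistent results")


def make_flaky_square(src: str, dst: str, corrupt_calls: Set[int]) -> Callable[[Store, Store], None]:
    """Hypothetical task body: squares store[src], flipping the low bit on selected calls."""
    calls = {"n": 0}

    def body(view: Store, buffer: Store) -> None:
        calls["n"] += 1
        value = view[src] ** 2
        if calls["n"] in corrupt_calls:
            value ^= 1  # simulated single-bit transient fault
        buffer[dst] = value

    return body


if __name__ == "__main__":
    shared: Store = {"x": 7}
    task = Task("square", make_flaky_square("x", "y", corrupt_calls={2}))
    attempts = execute_with_buffer_commit(task, shared)
    print(attempts, shared)  # 2 {'x': 7, 'y': 49}
```

Because nothing is published until the commit step, a failed attempt leaves the shared store untouched, which is what makes re-executing the whole task a safe recovery action in this sketch.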

