Methods to enhance reliability and serviceability of parallel computing software on large scale clusters (提升大规模集群上并行计算软件系统可靠性和服务性的方法与实践)

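Cite this article:	LIN Yan-yu, CHEN Hu, MIAO Jun, HAN Jia-long-mei, LAI Lu-shuang. Methods to enhance reliability and serviceability of parallel computing software on large scale clusters [J]. Computer Engineering & Science (计算机工程与科学), 2015, 37(1): 1-6.

Authors:	LIN Yan-yu (林彦宇), CHEN Hu (陈虎), MIAO Jun (苗军), HAN Jia-long-mei (韩佳龙媚), LAI Lu-shuang (赖路双)

Affiliation:	(School of Software, South China University of Technology, Guangzhou 510006, Guangdong, China)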

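Abstract:	Parallel computing software on large-scale clusters must tolerate failures of individual nodes, network links, and similar components, and must also be serviceable: easy to manage, maintain, port, and scale. Targeting the star computing model, we designed and developed a parallel computing framework. Using mechanisms inside the scheduling node, such as a variable-granularity decomposer and associated queues, the framework achieves system-wide fault tolerance together with good usability, portability, and scalability. The system currently runs continuously for more than 150 h at a computing capability of 300 TFlops, and it can be scaled further.
Keywords:	reliability; scalability; serviceability; large-scale cluster; parallel computing software
Received:	2013-09-24
Revised:	2013-12-18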
Methods to enhance reliability and serviceability of parallel computing software on large scale clusters

LIN Yan-yu, CHEN Hu, MIAO Jun, HAN Jia-long-mei, LAI Lu-shuang. Methods to enhance reliability and serviceability of parallel computing software on large scale clusters [J]. Computer Engineering & Science, 2015, 37(1): 1-6.

Authors:	LIN Yan-yu CHEN Hu MIAO Jun HAN Jia-long-mei LAI Lu-shuang

Affiliation:	(School of Software, South China University of Technology, Guangzhou 510006, China)

Abstract:	Parallel computing software on large scale clusters requires not only fault tolerance against failures of individual nodes or the network, but also manageability, maintainability, portability and scalability. Based on the star model, we design a parallel computing framework that achieves system-wide fault tolerance, usability, portability and scalability, using mechanisms such as a variable-granularity decomposer and an associated queue on the scheduling nodes. Our system can run continuously for over 150 hours with 300 TFlops of computational capability, and it can be scaled further.

Keywords:	availability; scalability; serviceability; large scale cluster; parallel computing software
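
The abstract names a variable-granularity decomposer and an associated queue on the scheduling node but does not describe how they work. As a rough illustration only, and not the authors' implementation, the sketch below shows one way a star-model scheduler might cut a work range into chunks, track which worker holds which chunk, and requeue a failed worker's chunk at a finer granularity. Every name here (`Scheduler`, `Chunk`, `dispatch`, `report_failure`, `min_chunk`) is hypothetical, as is the choice to halve a chunk on failure.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    start: int  # inclusive
    stop: int   # exclusive

    @property
    def size(self) -> int:
        return self.stop - self.start


@dataclass
class Scheduler:
    """Hypothetical central (star-model) scheduler; not the paper's design."""
    total: int
    chunk_size: int
    min_chunk: int = 1
    pending: deque = field(default_factory=deque)  # chunks waiting for a worker
    in_flight: dict = field(default_factory=dict)  # worker id -> chunk it holds
    done: list = field(default_factory=list)       # completed chunks

    def __post_init__(self) -> None:
        # Initial decomposition: cut [0, total) into chunks of chunk_size.
        for start in range(0, self.total, self.chunk_size):
            stop = min(start + self.chunk_size, self.total)
            self.pending.append(Chunk(start, stop))

    def dispatch(self, worker: str):
        """Hand the next pending chunk to a worker, or None if none is pending."""
        if not self.pending:
            return None
        chunk = self.pending.popleft()
        self.in_flight[worker] = chunk
        return chunk

    def report_success(self, worker: str) -> None:
        self.done.append(self.in_flight.pop(worker))

    def report_failure(self, worker: str) -> None:
        """Requeue a failed worker's chunk, halving it when it is large enough."""
        chunk = self.in_flight.pop(worker)
        if chunk.size >= 2 * self.min_chunk:
            mid = chunk.start + chunk.size // 2
            # extendleft reverses its argument, so the lower half ends up first.
            self.pending.extendleft([Chunk(mid, chunk.stop), Chunk(chunk.start, mid)])
        else:
            self.pending.appendleft(chunk)

    def finished(self) -> bool:
        return not self.pending and not self.in_flight


# Usage: worker w1 fails, its chunk is split and picked up by worker w3.
sched = Scheduler(total=100, chunk_size=40)
sched.dispatch("w1")
sched.dispatch("w2")
sched.report_failure("w1")
sched.report_success("w2")
while sched.dispatch("w3") is not None:
    sched.report_success("w3")
covered = sorted((c.start, c.stop) for c in sched.done)
```

In this sketch no work is lost when a worker fails: the scheduler still owns the record of what was handed out, so the failed range goes back to the front of the queue. Halving the chunk is one possible reading of "variable granularity"; the paper may use a different rule.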
This article is indexed in Wanfang Data and other databases.