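大规模集群中一种自适应可扩展的RPC超时机制 (Adaptive Scalable RPC Timeout Mechanism for Large Scale Clusters)
Cite this article: 钱迎进, 肖侬, 金士尧. 大规模集群中一种自适应可扩展的RPC超时机制[J]. 软件学报, 2010, 21(12): 3199-3210. DOI: 10.3724/SP.J.1001.2010.03718
Authors: QIAN Ying-Jin (钱迎进), XIAO Nong (肖侬), JIN Shi-Yao (金士尧)
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60736013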
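Abstract (translated from the Chinese): In distributed systems built on RPC (remote procedure call), timeouts are a common means of failure detection. Stress testing of a very large Lustre storage cluster showed that the traditional fixed timeout mechanism is flawed because it causes many unnecessary timeouts. This paper proposes an Adaptive Scalable RPC Timeout mechanism (AST) that takes into account network conditions, server load, scalability, and performance. Under AST, the client's timeout value is adjusted dynamically according to the congestion of the network and the server, and the server can send an additional message notifying the client to revise the original timeout value. A series of simulations and validations shows that AST is a more suitable RPC failure detection model: it improves the responsiveness, reliability, and stability of the system without an excessive negative impact on performance.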

Keywords: remote procedure call; failure detection; timeout; large scale; scalability; responsiveness; reliability
Received: 2009-04-28
Revised: 2009-08-12

Adaptive Scalable RPC Timeout Mechanism for Large Scale Clusters
QIAN Ying-Jin, XIAO Nong, and JIN Shi-Yao. Adaptive Scalable RPC Timeout Mechanism for Large Scale Clusters[J]. Journal of Software, 2010, 21(12): 3199-3210. DOI: 10.3724/SP.J.1001.2010.03718
Authors:QIAN Ying-Jin  XIAO Nong  JIN Shi-Yao
Abstract: Timeouts, typically reported on a per-call basis, are commonly used for failure detection in systems based on RPC (remote procedure call). Stress testing on a very large cluster showed that the traditional fixed timeout mechanism leads to many unnecessary timeouts, especially when the server is heavily loaded. This paper proposes an Adaptive Scalable RPC Timeout (AST) mechanism that takes network conditions, server load, scalability, and performance into account. Under AST, the timeout value set by a client is adjusted dynamically according to the congestion of the network and the server. Moreover, the server can notify the client to modify the timeout value of an RPC. A series of simulations shows that AST is a more suitable failure detection mechanism for RPC models with timeouts, and that it improves system responsiveness, reliability, and stability without a significant negative impact on performance, even for large-scale cluster systems.
Keywords: RPC (remote procedure call); failure detection; timeout; large scale; scalability; responsiveness; reliability
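
The abstract names two ideas without giving formulas or message formats: a client-side timeout that follows observed network and server congestion, and a server notification that revises the timeout of an outstanding RPC. The following Python sketch illustrates those two ideas only. It is not the AST algorithm from the paper. The class names, the sliding-window maximum, the safety margin, and all parameters are assumptions made for this illustration.

```python
from collections import deque


class AdaptiveTimeout:
    """Hypothetical client-side estimator (not the paper's formula).

    Assumption: the timeout is a safety margin times the worst recent
    server service time plus one worst-case network round trip.
    """

    def __init__(self, initial=100.0, min_timeout=1.0, max_timeout=600.0,
                 window=8, margin=1.5):
        self.initial = initial
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout
        self.margin = margin
        # Sliding windows of recent observations, in seconds.
        self.service_times = deque(maxlen=window)
        self.net_latencies = deque(maxlen=window)

    def observe(self, service_time, net_latency):
        """Record the service time and one-way latency of a completed RPC."""
        self.service_times.append(service_time)
        self.net_latencies.append(net_latency)

    def timeout(self):
        """Return the timeout to use for the next RPC."""
        if not self.service_times:
            return self.initial  # no history yet: fall back to a fixed value
        estimate = self.margin * (
            max(self.service_times) + 2 * max(self.net_latencies)
        )
        # Clamp so one outlier cannot push the timeout outside sane bounds.
        return min(self.max_timeout, max(self.min_timeout, estimate))


class PendingRpc:
    """Hypothetical record of an outstanding RPC and its deadline."""

    def __init__(self, sent_at, timeout):
        self.deadline = sent_at + timeout

    def extend(self, now, extra):
        """Apply a server notification asking for `extra` more seconds."""
        self.deadline = max(self.deadline, now + extra)

    def expired(self, now):
        return now >= self.deadline
```

In this sketch a busy server that cannot answer in time would trigger `extend` on the client, so the client keeps waiting instead of declaring a failure and resending, which is the kind of unnecessary timeout the abstract attributes to fixed timeouts.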
This article is indexed in Wanfang Data (万方数据) and other databases.