Job failure prediction based on user behavior on supercomputers
Cite this article: TANG Yang-kun, XIAN Gang, YANG Wen-xiang, YU Jie, ZHANG Xiao-rong, WANG Yao-bin. Job failure prediction based on user behavior on supercomputers[J]. Computer Engineering & Science, 2022, 44(10): 1753-1761.
Authors: TANG Yang-kun  XIAN Gang  YANG Wen-xiang  YU Jie  ZHANG Xiao-rong  WANG Yao-bin
Affiliation: (1. School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan 621010; 2. Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang, Sichuan 621050; 3. College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073, China)
Funding: National Natural Science Foundation of China (61872304, 61802320); Foundation of the State Key Laboratory of Aerodynamics (SKLA20200203)
Abstract: Supercomputers keep growing in scale while scientific applications grow more complex, and as a result many jobs on supercomputers fail. Failed jobs waste resources and prolong the waiting time of queued jobs, which seriously degrades the execution efficiency of the system. If job failures can be predicted in advance, measures can be taken to improve system resource utilization and execution efficiency, which is very important for future exascale supercomputers. To this end, this paper studies how to predict job failures from known traditional features and constructed features, and identifies features, and ways of processing them, that reflect users' work behavior patterns and submission behavior patterns. By combining behavior features with traditional features, a comprehensive framework based on tree-structured models is proposed to predict job failure. Prediction experiments show that the comprehensive framework outperforms single-model prediction, and comparative experiments show that its prediction performance is better than that of other related methods.

Keywords: system execution efficiency  job log analysis  user behavior  job failure prediction  machine learning
Received: 2021-09-02
Revised: 2022-01-10
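
The idea of combining traditional job features with user-behavior features can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not name the features or the tree models used, so the job log, the two behavior features (a user's prior failure rate and the gap since their previous submission), and the depth-1 decision tree are hypothetical stand-ins, not the paper's method.

```python
from collections import defaultdict

# Hypothetical job log: (user, submit_time_s, requested_nodes, failed)
JOBS = [
    ("alice", 0, 4, 0), ("bob", 10, 64, 1), ("alice", 600, 4, 0),
    ("bob", 20, 64, 1), ("bob", 35, 64, 1), ("alice", 1300, 8, 0),
    ("carol", 50, 16, 0), ("bob", 2000, 32, 0), ("carol", 900, 16, 1),
    ("carol", 910, 16, 1),
]

def build_features(jobs):
    """One row per job: [nodes, prior_fail_rate, gap_since_last_submit].

    Only jobs submitted earlier by the same user feed the behavior features.
    Simplification: earlier jobs are assumed to have finished already.
    """
    history = defaultdict(lambda: {"n": 0, "fails": 0, "last": None})
    rows, labels = [], []
    for user, t, nodes, failed in sorted(jobs, key=lambda j: j[1]):
        h = history[user]
        fail_rate = h["fails"] / h["n"] if h["n"] else 0.0
        # inf marks "no previous submission by this user"
        gap = float(t - h["last"]) if h["last"] is not None else float("inf")
        rows.append([float(nodes), fail_rate, gap])
        labels.append(failed)
        h["n"] += 1
        h["fails"] += failed
        h["last"] = t
    return rows, labels

def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(rows, labels):
    """Fit a depth-1 decision tree by minimizing weighted Gini impurity."""
    best = None
    for f in range(len(rows[0])):
        for thr in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                # Each leaf predicts its majority class (ties go to "failed")
                left_label = int(sum(left) / len(left) >= 0.5)
                right_label = int(sum(right) / len(right) >= 0.5)
                best = (score, f, thr, left_label, right_label)
    return best[1:]

def predict(stump, row):
    """Predict 1 (failed) or 0 (succeeded) for one feature row."""
    f, thr, left_label, right_label = stump
    return left_label if row[f] <= thr else right_label

rows, labels = build_features(JOBS)
stump = fit_stump(rows, labels)
correct = sum(predict(stump, r) == y for r, y in zip(rows, labels))
print("chosen split:", stump, "training accuracy:", correct / len(rows))
```

A single split is far too weak for real logs. It is used here only to show how behavior features computed from a user's own history sit alongside a traditional feature such as the requested node count in one feature row.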