
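FD-LSTM: A fault analysis model based on large-scale system logs
Cite this article: FANG Jiao-li, ZUO Ke, HUANG Chun, LIU Jie, LI Sheng-guo, LU Kai. FD-LSTM: A fault analysis model based on large-scale system logs[J]. Computer Engineering & Science (计算机工程与科学), 2021, 43(1): 33-41.
Authors: FANG Jiao-li (方姣丽)  ZUO Ke (左克)  HUANG Chun (黄春)  LIU Jie (刘杰)  LI Sheng-guo (李胜国)  LU Kai (卢凯)
Affiliation: (College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073, China)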
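Abstract (translated from the Chinese original): Reliability is a classic problem in high-performance computing. As process technology and integration techniques advance, full-system scale is growing exponentially, which poses major challenges for reliability research and for fault analysis in particular. The authors collected 203,510,247 fault log entries produced by an independently developed high-performance computing system after it entered production, covering January 28, 2016 to December 6, 2016. They first classify the faults with K-Means clustering and analyze the fault distribution characteristics. Based on the clustering results, they then design a time-series fault analysis model, FD-LSTM, which is trained on structured logs and predicts when and where different fault types occur. The results show that FD-LSTM reaches a prediction accuracy of up to 80.56%. The study indicates that, compared with traditional fault analysis models, the log-based time-series model FD-LSTM offers practical guidance, in both time and space prediction, for improving fault analysis accuracy, making machine operation and maintenance more efficient, and making whole-system co-design more rational.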

Keywords: system log  LSTM  K-Means  fault analysis
Received: 2020-06-11
Revised: 2020-07-17

FD-LSTM: A fault analysis model based on large-scale system logs
FANG Jiao-li,ZUO Ke,HUANG Chun,LIU Jie,LI Sheng-guo,LU Kai.FD-LSTM: A fault analysis model based on large-scale system logs[J].Computer Engineering & Science,2021,43(1):33-41.
Authors:FANG Jiao-li  ZUO Ke  HUANG Chun  LIU Jie  LI Sheng-guo  LU Kai
Affiliation:(College of Computer Science and Technology,National University of Defense Technology,Changsha 410073,China)
Abstract: Reliability research is a classic problem in the field of high-performance computing. With the continuous development of process technology and integration technology, the scale of the entire system has grown exponentially, which brings great challenges to reliability research, especially fault analysis. This paper collects 203,510,247 fault log entries produced by an independently developed high-performance computing system after it entered production, from January 28, 2016 to December 6, 2016. Firstly, the K-Means clustering method is used to classify the faults and analyze the fault distribution characteristics. Secondly, based on the clustering results, a time-series fault analysis model, FD-LSTM, is designed. After training with structured logs, the occurrence time and location of different fault types are predicted. The results show that the accuracy of the proposed FD-LSTM prediction model can reach 80.56%. The research in this paper shows that, compared with traditional fault analysis models, the log-based time-series model FD-LSTM has practical guiding significance, in terms of both time prediction and spatial prediction, for improving the accuracy of fault analysis, enhancing the efficiency of machine operation and maintenance, and making collaborative whole-system design more rational.
Keywords:system log  long short-term memory  K-Means  fault analysis   
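
The abstract describes a two-stage pipeline: faults are first grouped with K-Means, and the resulting fault types are then fed to a time-series model. The sketch below illustrates only the data flow of that idea on toy data and is not the authors' implementation. The two-dimensional feature vectors, the farthest-first centroid initialization, the window length, and the names `kmeans` and `sliding_windows` are all assumptions made for illustration; the abstract does not specify the features or the FD-LSTM architecture, so no LSTM is built here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, iterations: int = 50) -> Tuple[List[int], List[Point]]:
    """Plain K-Means with a deterministic farthest-first initialization.

    The initialization is an assumption chosen to keep the toy example
    reproducible; it is not taken from the paper.
    """
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids: List[Point] = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(farthest)

    labels: List[int] = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


def sliding_windows(sequence: Sequence[int], window: int) -> List[Tuple[List[int], int]]:
    """Turn a time-ordered label sequence into (history, next_label) pairs,
    the usual input/target shape for a sequence model such as an LSTM."""
    return [
        (list(sequence[i:i + window]), sequence[i + window])
        for i in range(len(sequence) - window)
    ]


# Hypothetical feature vectors for six log entries, in time order.
toy_features: List[Point] = [
    (0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0),
]
fault_types, _ = kmeans(toy_features, k=2)
training_pairs = sliding_windows(fault_types, window=3)
```

Each pair in `training_pairs` holds the last three fault types and the type that followed them, which is the form a sequence model would consume for next-fault prediction.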
  
This article is indexed in Wanfang Data (万方数据) and other databases.