
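Chinese title: FD-LSTM:基于大规模系统日志的故障分析模型
Cite this article (Chinese): 方姣丽, 左克, 黄春, 刘杰, 李胜国, 卢凯. FD-LSTM:基于大规模系统日志的故障分析模型[J]. 计算机工程与科学, 2021, 43(1): 33-41. DOI: 10.3969/j.issn.1007-130X.2021.01.005
Authors (Chinese): 方姣丽  左克  黄春  刘杰  李胜国  卢凯
Affiliation (Chinese record): (College of Computer Science, National University of Defense Technology, Changsha, Hunan 410073)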
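Abstract (Chinese record): Reliability research is a classic problem in high-performance computing. As process technology and integration techniques continue to advance, full-system scale is growing exponentially, which poses great challenges for reliability research, especially fault analysis. The authors collected 203,510,247 fault log entries recorded after an independently developed high-performance computing system entered production, covering January 28, 2016 to December 6, 2016. First, the K-Means clustering method is used to classify the faults and to analyze the fault...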

Keywords (Chinese record): system log  LSTM  K-Means  fault analysis
Received: 2020-06-11
Revised: 2020-07-17

FD-LSTM: A fault analysis model based on large-scale system logs
FANG Jiao-li,ZUO Ke,HUANG Chun,LIU Jie,LI Sheng-guo,LU Kai. FD-LSTM: A fault analysis model based on large-scale system logs[J]. Computer Engineering & Science, 2021, 43(1): 33-41. DOI: 10.3969/j.issn.1007-130X.2021.01.005
Authors:FANG Jiao-li  ZUO Ke  HUANG Chun  LIU Jie  LI Sheng-guo  LU Kai
Affiliation:(College of Computer Science and Technology,National University of Defense Technology,Changsha 410073,China)
Abstract: Reliability is a classic problem in high-performance computing. As process technology and integration technology continue to advance, the scale of full systems has grown exponentially, which poses great challenges for reliability research, especially fault analysis. This paper collects 203,510,247 fault log entries recorded after an independently developed high-performance computing system entered production, covering January 28, 2016 to December 6, 2016. First, the K-Means clustering method is used to classify the faults and analyze their distribution characteristics. Second, based on the clustering results, a time-series-based fault analysis model, FD-LSTM, is designed. After training on structured logs, the model predicts when and where different fault types occur. The results show that the prediction accuracy of FD-LSTM can reach 80.56%. The study shows that, compared with traditional fault analysis models, the log-based time-series model FD-LSTM, in both temporal and spatial prediction, offers practical guidance for improving the accuracy of fault analysis, making machine operation and maintenance more efficient, and rationalizing the collaborative design of the whole system.
Keywords:system log  long short-term memory  K-Means  fault analysis   
  
This article is indexed in Wanfang Data (万方数据) and other databases.