
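A Bash Code Comment Generation Method Based on Dual Information Retrieval
Cite as: Chen Xiang, Yu Chi, Yang Guang, Pu Xuelian, Cui Zhanqi. A Bash code comment generation method based on dual information retrieval[J]. Journal of Software (软件学报), 2023, 34(3): 1310-1329
Authors: Chen Xiang  Yu Chi  Yang Guang  Pu Xuelian  Cui Zhanqi
Affiliations: School of Information Science and Technology, Nantong University, Nantong, Jiangsu 226019; State Key Laboratory of Information Security (Institute of Information Engineering, Chinese Academy of Sciences), Beijing 100093; School of Economics and Management, Nantong University, Nantong, Jiangsu 226019; School of Computer, Beijing Information Science and Technology University, Beijing 100101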
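Funding: National Natural Science Foundation of China (61872263, 61702041, 61202006); Open Project of the State Key Laboratory of Information Security (2020-MS-07); Jiangsu Province Frontier Leading Technology Basic Research Project (BK20202001); Jiangsu Province Key Industry Patent Navigation Project (DH20200072-10)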
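Abstract: Bash is the default shell command language on Linux and plays an important role in developing and maintaining Linux systems. For developers unfamiliar with Bash, understanding the purpose and functionality of Bash code is challenging. To address automatic comment generation for Bash code, this paper proposes ExplainBash, a method based on dual information retrieval. The method retrieves on both semantic similarity and lexical similarity in order to generate high-quality code comments. Semantic similarity is computed as the Euclidean distance between code semantic representations learned with CodeBERT and the BERT-whitening operation; lexical similarity is computed as the edit distance over the set of code tokens. A high-quality corpus is built from the corpus shared by the NL2Bash study, merged with the data shared by the NLC2CMD competition. Nine baseline methods from the field of automatic code comment generation are then selected, covering both information retrieval-based and deep learning-based methods. The results of an empirical study and a human study confirm the effectiveness of ExplainBash. Ablation experiments then analyze the rationale of the settings within ExplainBash (for example, the retrieval strategy and the BERT-whitening operation). Finally, a browser plug-in is developed on top of the proposed method to help users understand Bash code.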
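For readers unfamiliar with BERT-whitening, the operation as it is usually formulated can be summarized as below. This is the common textbook formulation, not a detail stated in the abstract, so the exact settings used in ExplainBash (for example, whether only the first $k$ columns of $W$ are kept) are an assumption here. The $x_i$ are taken to be $d$-dimensional row vectors, in this setting the CodeBERT embeddings of the $N$ code snippets in the corpus.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^{\top}(x_i - \mu), \qquad
\Sigma = U \Lambda U^{\top}

W = U \Lambda^{-1/2}, \qquad
\tilde{x}_i = (x_i - \mu)\, W

% Semantic distance between a query q and a corpus entry c after whitening:
d_{\mathrm{sem}}(q, c) = \lVert \tilde{x}_q - \tilde{x}_c \rVert_2
```

After this transform the whitened vectors have zero mean and identity covariance, which is what makes a plain Euclidean distance a reasonable measure of semantic closeness.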

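Keywords: program comprehension; Bash code; code comment generation; information retrieval; code semantics; code lexical information
Received: 2021-09-14
Revised: 2022-01-13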

Bash Code Comment Generation Method Based on Dual Information Retrieval
CHEN Xiang,YU Chi,YANG Guang,PU Xue-Lian,CUI Zhan-Qi. Bash Code Comment Generation Method Based on Dual Information Retrieval[J]. Journal of Software, 2023, 34(3): 1310-1329
Authors: CHEN Xiang  YU Chi  YANG Guang  PU Xue-Lian  CUI Zhan-Qi
Affiliation: School of Information Science and Technology, Nantong University, Nantong 226019, China; State Key Laboratory of Information Security (Institute of Information Engineering, Chinese Academy of Sciences), Beijing 100093, China; Economics and Management School, Nantong University, Nantong 226019, China; School of Computer, Beijing Information Science and Technology University, Beijing 100101, China
Abstract: Bash is the default shell command language on Linux and plays an important role in the development and maintenance of Linux systems. For developers unfamiliar with Bash, however, understanding the purpose and functionality of Bash code is challenging. This paper therefore proposes ExplainBash, a method for automatic Bash code comment generation based on dual information retrieval. The method retrieves on both semantic similarity and lexical similarity in order to generate high-quality code comments. For semantic similarity, CodeBERT and the BERT-whitening operation are used to learn a semantic representation of the code, and similarity is measured by Euclidean distance. For lexical similarity, code is represented as a set of code tokens, and similarity is measured by edit distance. A high-quality corpus is constructed from the corpus shared in the NL2Bash study together with the data shared in the NLC2CMD competition. Nine state-of-the-art baselines are then selected from the automatic code comment generation domain, covering both information retrieval-based and deep learning-based methods. The results of an empirical study and a human study verify the effectiveness of the proposed method. Ablation experiments are also designed to analyze the rationale of the settings in the proposed method (such as the retrieval strategy and the BERT-whitening operation). Finally, a browser plug-in is developed based on the proposed method to facilitate the comprehension of Bash code.
Keywords: program comprehension; Bash code; code comment generation; information retrieval; code semantics; code lexical information
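The dual retrieval described in the abstract can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details from the paper: the two-stage combination (take the `k` semantically nearest entries, then pick the lexically closest one), whitespace tokenization, the use of token sequences rather than sets, and the names `edit_distance`, `retrieve_comment`, and `corpus`. The two-dimensional vectors are toy stand-ins for whitened CodeBERT embeddings.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def retrieve_comment(query_code, query_vec, corpus, k=2):
    """Return the comment of the best match under a two-stage retrieval.

    Stage 1 (semantic): keep the k entries whose vectors are closest to
    query_vec in Euclidean distance.
    Stage 2 (lexical): among those, pick the entry whose tokens have the
    smallest edit distance to the query tokens.
    """
    candidates = sorted(
        corpus, key=lambda e: math.dist(query_vec, e["vec"])
    )[:k]
    q_tokens = query_code.split()  # assumed tokenizer: whitespace split
    best = min(
        candidates,
        key=lambda e: edit_distance(q_tokens, e["code"].split()),
    )
    return best["comment"]


# Toy corpus; the vectors are illustrative, not real CodeBERT output.
corpus = [
    {"code": "find . -name '*.txt'", "vec": [0.9, 0.1],
     "comment": "Find all .txt files under the current directory"},
    {"code": "find . -name '*.log' -delete", "vec": [0.8, 0.2],
     "comment": "Delete all .log files under the current directory"},
    {"code": "ls -la", "vec": [0.1, 0.9],
     "comment": "List all files in long format"},
]

print(retrieve_comment("find . -name '*.md'", [0.88, 0.12], corpus))
# prints: Find all .txt files under the current directory
```

In this sketch the lexical stage can override the semantic ranking: a query such as `find . -name '*.tmp' -delete` with the same vector is semantically nearest to the first entry but is one token edit away from the second, so the second entry's comment is returned.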