
The Influence of Version Mismatching and Data Leakage on Bug Report Based Bug Localization Models

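Cite this article:	ZHOU Hui-Cong, GUO Zhao-Qiang, MEI Yuan-Qing, LI Yan-Hui, CHEN Lin, ZHOU Yu-Ming. The Influence of Version Mismatching and Data Leakage on Bug Report Based Bug Localization Models[J]. Journal of Software, 2023, 34(5): 2196-2217.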

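Authors:	ZHOU Hui-Cong, GUO Zhao-Qiang, MEI Yuan-Qing, LI Yan-Hui, CHEN Lin, ZHOU Yu-Ming

Affiliation:	State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210093, Jiangsu, China; Department of Computer Science and Technology, Nanjing University, Nanjing 210023, Jiangsu, China

Funding:	National Natural Science Foundation of China (61772259, 61872177)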

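Abstract:	To reduce the labor cost of bug localization, researchers have proposed many information retrieval based bug localization models built on bug reports, including models that use traditional features and models that use deep learning features. In the experiments designed to evaluate these models, most existing studies overlook the "version mismatching" problem between the version a bug report belongs to and the version of the target source code, and/or the "data leakage" problem caused by the chronological order of bug reports when training and testing the models. This study reports the performance of existing models in more realistic application scenarios and analyzes how version mismatching and data leakage affect the assessment of each model's real performance. Six localization models that use traditional features (BugLocator, BRTracer, BLUiR, AmaLgam, BLIA, Locus) and one that uses deep learning features (CodeBERT) are selected as research objects. A systematic empirical analysis is conducted on eight open-source projects under five different experimental settings. First, applying CodeBERT directly to bug localization does not work well, and its localization accuracy depends on the number of versions and the source code size of the target project. Second, under the version matching setting, the models that use traditional features improve by up to 47.2% and 46.0% in mean average precision (MAP) and mean reciprocal rank (MRR), respectively, compared with the version mismatching setting; the effectiveness of the CodeBERT model is also...
Keywords:	bug localization; bug report; version mismatching; data leakage; information retrieval
Received:	2021/3/2
Revised:	2021/4/26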
Watch out for Version Mismatching and Data Leakage! A Case Study of Their Influence in Bug Report Based Bug Localization Models

ZHOU Hui-Cong, GUO Zhao-Qiang, MEI Yuan-Qing, LI Yan-Hui, CHEN Lin, ZHOU Yu-Ming. Watch out for Version Mismatching and Data Leakage! A Case Study of Their Influence in Bug Report Based Bug Localization Models[J]. Journal of Software, 2023, 34(5): 2196-2217.

Authors:	ZHOU Hui-Cong, GUO Zhao-Qiang, MEI Yuan-Qing, LI Yan-Hui, CHEN Lin, ZHOU Yu-Ming

Affiliation:	State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210093, China; Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China

Abstract:	To reduce the labor cost of bug localization, researchers have proposed various automated information retrieval based bug localization (IRBL) models, including models that leverage traditional features and models that leverage deep learning based features. When evaluating the effectiveness of IRBL models, most existing studies neglect one or both of the following problems: version mismatching between bug reports and the corresponding source code files in the testing data, and data leakage caused by the chronological order of bug reports when training and testing the models. This study investigates the performance of existing models under more realistic experimental settings and analyzes the impact of version mismatching and data leakage on the real performance of each model. Six traditional information retrieval-based models (BugLocator, BRTracer, BLUiR, AmaLgam, BLIA, and Locus) and one deep learning model (CodeBERT) are selected as the research objects, and an empirical analysis is conducted on eight open-source projects under five different experimental settings. First, the results show that directly applying CodeBERT to bug localization is not as effective as expected, since its accuracy depends on the number of versions and the source code size of the target project. Second, compared with the version mismatching setting, the traditional information retrieval-based models under the version matching setting improve by up to 47.2% and 46.0% in terms of MAP and MRR, respectively. The effectiveness of the CodeBERT model is also affected by both data leakage and version mismatching. These results suggest that the effectiveness of traditional information retrieval-based bug localization has been underestimated, while the application of the deep learning based CodeBERT to bug localization still needs more exploration.

Keywords:	bug localization; bug report; version mismatching; data leakage; information retrieval
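
The abstract reports results in terms of MAP and MRR. As a minimal sketch of how these two ranking metrics are commonly computed for file-level bug localization, the Python below scores each bug report's ranked file list against the set of files actually fixed. The function names and the sample file paths are hypothetical, and the study's exact evaluation scripts may differ (for example, in how reports with no retrieved buggy file are handled).

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked_files: Sequence[str], buggy_files: Set[str]) -> float:
    """Return 1 / position of the first buggy file in the ranking, or 0.0 if none."""
    for position, path in enumerate(ranked_files, start=1):
        if path in buggy_files:
            return 1.0 / position
    return 0.0


def average_precision(ranked_files: Sequence[str], buggy_files: Set[str]) -> float:
    """Average the precision values measured at each buggy file's position.

    Assumption: the sum is normalized by the total number of buggy files,
    so buggy files missing from the ranking lower the score.
    """
    if not buggy_files:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, path in enumerate(ranked_files, start=1):
        if path in buggy_files:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(buggy_files)


def mrr_and_map(queries: List[Tuple[Sequence[str], Set[str]]]) -> Tuple[float, float]:
    """Compute (MRR, MAP) over (ranked_files, buggy_files) pairs, one per bug report."""
    if not queries:
        return 0.0, 0.0
    mrr = sum(reciprocal_rank(r, b) for r, b in queries) / len(queries)
    mean_ap = sum(average_precision(r, b) for r, b in queries) / len(queries)
    return mrr, mean_ap


# Hypothetical example: two bug reports with made-up file names.
example_queries = [
    (["a.java", "b.java", "c.java", "d.java"], {"b.java", "d.java"}),
    (["x.java", "y.java"], {"x.java"}),
]
example_mrr, example_map = mrr_and_map(example_queries)
```

Both metrics depend on which source files are ranked and which bug reports are scored, which is why the version of the source code and the train/test ordering of bug reports can change the reported numbers.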
