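A Heterogeneous Defect Prediction Method Based on Simultaneous Semantic Alignment
Cite this article:LI Wei-Wei, CHEN Xiang, ZHANG Heng-Wei, HUANG Zhi-Qiu, JIA Xiu-Yi. A Heterogeneous Defect Prediction Method Based on Simultaneous Semantic Alignment[J]. Journal of Software (软件学报), 2023, 34(6): 2669-2689
Authors:LI Wei-Wei  CHEN Xiang  ZHANG Heng-Wei  HUANG Zhi-Qiu  JIA Xiu-Yi
Affiliations (in author order):College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China;School of Information Science and Technology, Nantong University, Nantong 226019, Jiangsu, China;College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China;School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, China;College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China
Funding:National Key Research and Development Program of China (2018YFB1003900);National Natural Science Foundation of China (61906090, 62176123);Fundamental Research Funds for the Central Universities (30920021131)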
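Abstract (translated from the Chinese):Heterogeneous defect prediction (HDP) performs defect prediction across projects with heterogeneous features, and can effectively address the problem that the source and target projects use different features. Most current HDP methods handle heterogeneous features by learning a domain-invariant feature subspace that reduces the difference between domains. However, the source and target domains usually exhibit large heterogeneity, so domain alignment works poorly. The reason is that these methods overlook the latent knowledge that a classifier should produce similar classification probability distributions for the same class in both domains, and they do not mine the intrinsic semantic information in the data. In addition, because collecting training data in newly launched or legacy projects relies on expert knowledge and is time-consuming, laborious, and error-prone, the paper explores the possibility of performing HDP with only a small number of labeled modules in the target project. Accordingly, an HDP method based on simultaneous semantic alignment (SHSSAN) is proposed. On the one hand, it explores the implicit knowledge learned from the labeled source project to transfer correlations between classes, achieving implicit semantic information transfer. On the other hand, to learn semantic representations of unlabeled target data, it performs centroid matching via target pseudo-labels, achieving explicit semantic alignment. SHSSAN also effectively addresses the class imbalance and linear inseparability problems common in heterogeneous defect datasets, and makes full use of the label information in the target project. Experiments on public heterogeneous datasets comprising 30 different projects show that, compared with the strong CTKCCA, CLSUP, MSMDA, KSETE, and CDAA methods, SHSSAN improves F-measure by 6.96%, 19.68%, 19.43%, 13.55%, and 9.32%, and AUC by 2.02%, 3.62%, 2.96%, 3.48%, and 2.47%, respectively.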

Keywords:heterogeneous defect prediction  semantic alignment  few-sample data  class imbalance  linear inseparability
Received:2021-04-12
Revised:2021-07-18

Heterogeneous Defect Prediction Based on Simultaneous Semantic Alignment
LI Wei-Wei,CHEN Xiang,ZHANG Heng-Wei,HUANG Zhi-Qiu,JIA Xiu-Yi. Heterogeneous Defect Prediction Based on Simultaneous Semantic Alignment[J]. Journal of Software, 2023, 34(6): 2669-2689
Authors:LI Wei-Wei  CHEN Xiang  ZHANG Heng-Wei  HUANG Zhi-Qiu  JIA Xiu-Yi
Affiliation:College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China;College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China;School of Information Science and Technology, Nantong University, Nantong 226019, China;School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Abstract:Heterogeneous defect prediction (HDP) can effectively address the problem that the source project and the target project use different features. It uses heterogeneous feature data from the source project to predict the defect proneness of software modules in the target project. HDP has made some progress, but its overall performance remains unsatisfactory. Most previous HDP methods learn a domain-invariant feature subspace to reduce the difference between domains. However, the source and target domains usually show large heterogeneity, which makes the domain alignment unsatisfactory. The reason is that these methods ignore the latent knowledge that a classifier should generate similar classification probability distributions for the same category in the two domains, and they fail to mine the intrinsic semantic information contained in the data. In addition, because collecting training data in newly launched or legacy projects relies on expert knowledge and is time-consuming, laborious, and error-prone, the paper explores the possibility of performing HDP with a small number of labeled modules in the target project. On this basis, a heterogeneous defect prediction method based on simultaneous semantic alignment (SHSSAN) is proposed. On the one hand, it explores the implicit knowledge learned from the labeled source projects to transfer correlations between categories and thereby achieve implicit semantic information transfer. On the other hand, to learn the semantic representation of unlabeled target data, it performs centroid matching through target pseudo-labels to achieve explicit semantic alignment. SHSSAN also effectively addresses the class imbalance and linear inseparability problems, and makes full use of the label information in the target project. Experiments on public heterogeneous datasets containing 30 different projects show that, compared with the strong CTKCCA, CLSUP, MSMDA, KSETE, and CDAA methods, SHSSAN improves F-measure by 6.96%, 19.68%, 19.43%, 13.55%, and 9.32%, and AUC by 2.02%, 3.62%, 2.96%, 3.48%, and 2.47%, respectively.
Keywords:heterogeneous defect prediction (HDP)  semantic alignment  few sample data  class imbalance  linearly inseparable
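
The explicit semantic alignment step mentioned in the abstract (centroid matching through target pseudo-labels) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes source and target modules have already been mapped into a shared feature space, that labels are 0 (clean) and 1 (defective), and that the alignment loss is the sum of squared Euclidean distances between per-class centroids. The function names `class_centroids` and `centroid_alignment_loss` are hypothetical.

```python
import numpy as np

def class_centroids(features, labels, num_classes=2):
    """Return per-class mean vectors and a mask of classes that have samples."""
    centroids = np.zeros((num_classes, features.shape[1]))
    present = np.zeros(num_classes, dtype=bool)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            centroids[c] = features[mask].mean(axis=0)
            present[c] = True
    return centroids, present

def centroid_alignment_loss(src_feat, src_labels, tgt_feat, tgt_pseudo, num_classes=2):
    """Sum of squared distances between source and target class centroids.

    Assumption: tgt_pseudo holds pseudo-labels predicted by a classifier for
    unlabeled target modules. Classes missing from either domain are skipped.
    """
    src_c, src_ok = class_centroids(src_feat, src_labels, num_classes)
    tgt_c, tgt_ok = class_centroids(tgt_feat, tgt_pseudo, num_classes)
    both = src_ok & tgt_ok
    if not both.any():
        return 0.0
    diffs = src_c[both] - tgt_c[both]
    return float(np.sum(diffs ** 2))

# Toy data in an assumed 2-D shared feature space.
src = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0]])
src_y = np.array([0, 0, 1])
tgt = np.array([[1.0, 1.0], [4.0, 3.0]])
tgt_pseudo = np.array([0, 1])
loss = centroid_alignment_loss(src, src_y, tgt, tgt_pseudo)
```

In a training loop, a term of this kind would typically be minimized together with the classification loss so that same-class modules from the two projects move closer in the shared space. How SHSSAN weights and combines its loss terms is not stated in the abstract.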
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).