
Code Clone Detection Based on Token Semantics (基于Token语义构建的代码克隆检测)

Cite this article:	WANG Wen-Jie, XU Yun. Code Clone Detection Based on Token Semantics[J]. Computer Systems & Applications (计算机系统应用), 2022, 31(11): 60-67

Authors:	WANG Wen-Jie (王文杰), XU Yun (徐云)

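Affiliation:	School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; Key Laboratory of High Performance Computing of Anhui Province, University of Science and Technology of China, Hefei 230027, China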

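Funding:	National Natural Science Foundation of China (61672480); 111 Project (overseas expertise introduction program) of the State Administration of Foreign Experts Affairs (BP0719016)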

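Abstract:	Traditional token-based clone detection methods exploit the sequential nature of code strings and can detect clones quickly in large code repositories. However, compared with methods based on the abstract syntax tree (AST) or the program dependence graph (PDG), they lack syntactic and semantic information and therefore struggle to detect clones whose text differs substantially. To address this, a token-based clone detection method enriched with semantic information is proposed. First, the abstract syntax tree is analyzed, and AST paths are used to abstract the semantic information of the tokens located at leaf nodes. Then, a low-cost index is built on the tokens that play function-name and type-name roles, so that candidate clone fragments can be filtered quickly and effectively. Finally, the semantically enriched tokens are used to judge the similarity between code blocks. Experiments on the public large-scale dataset BigCloneBench show that the method significantly outperforms mainstream methods, including NiCad, Deckard, and CCAligner, on Moderately Type-3 and Weakly Type-3/Type-4 clones, whose textual similarity is low, while requiring less detection time on large code repositories.
Keywords:	code clone detection; abstract syntax tree; semantic information; efficient index; source code
Received:	2022-02-24
Revised:	2022-03-28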
Code Clone Detection Based on Token Semantics

WANG Wen-Jie, XU Yun. Code Clone Detection Based on Token Semantics[J]. Computer Systems & Applications, 2022, 31(11): 60-67

Authors:	WANG Wen-Jie, XU Yun

Affiliation:	School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; Key Laboratory of High Performance Computing of Anhui Province, University of Science and Technology of China, Hefei 230027, China

Abstract:	Traditional token-based clone detection methods exploit the sequential nature of code strings to detect clones quickly in large code repositories. However, compared with methods based on the abstract syntax tree (AST) or the program dependence graph (PDG), they lack syntactic and semantic information and therefore struggle to detect code clones whose text differs substantially. This study therefore proposes a token-based clone detection method that attaches semantic information to tokens. First, the AST is analyzed, and the semantic information of the tokens located at its leaf nodes is abstracted using AST paths. Then, a low-cost index is built on the tokens that play function-name and type-name roles, so that candidate clone fragments can be filtered quickly and effectively. Finally, the similarity between code blocks is judged using the semantically enriched tokens. Experimental results on the public large-scale dataset BigCloneBench show that the method significantly outperforms mainstream methods, including NiCad, Deckard, and CCAligner, on Moderately Type-3 and Weakly Type-3/Type-4 clones, which have low text similarity, while requiring less detection time on large code repositories.

Keywords:	code clone detection; abstract syntax tree; semantic information; efficient index; source code
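
The three steps outlined in the abstract (AST-path semantics for leaf tokens, a cheap index on function-name and type-name tokens, and a similarity check on the enriched tokens) can be illustrated with a minimal sketch. This is not the authors' implementation. It uses Python's standard `ast` module for brevity, and every name in it (`KEY_ROLES`, `leaf_text`, `semantic_tokens`, `candidate_pairs`, `similarity`), the path depth of 2, the choice of key roles, and the overlap-ratio similarity are assumptions made for illustration only.

```python
import ast
from collections import Counter, defaultdict
from itertools import combinations

# Assumption: leaves in these AST positions play the "function name" or
# "type name" role. The paper's own role definitions are not given here.
KEY_ROLES = {"Call.func", "arg.annotation", "AnnAssign.annotation",
             "FunctionDef.returns"}


def leaf_text(node):
    """Return the token text of a leaf-like AST node, or None otherwise."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return None


def semantic_tokens(source, depth=2):
    """Bag of (AST path, text) pairs; text is kept only for key-role tokens."""
    bag = Counter()

    def visit(node, path):
        for field, value in ast.iter_fields(node):
            step = path + (f"{type(node).__name__}.{field}",)
            for child in (value if isinstance(value, list) else [value]):
                if not isinstance(child, ast.AST):
                    continue
                text = leaf_text(child)
                if text is not None:
                    # Other identifiers and literals are abstracted to their
                    # path, so renaming a variable does not change the token.
                    kept = text if step[-1] in KEY_ROLES else ""
                    bag[("/".join(step[-depth:]), kept)] += 1
                visit(child, step)

    visit(ast.parse(source), ())
    return bag


def candidate_pairs(blocks):
    """Inverted index on key-role tokens; pairs sharing a key are candidates."""
    index = defaultdict(set)
    for block_id, bag in blocks.items():
        for path, text in bag:
            if text:
                index[(path.rsplit("/", 1)[-1], text)].add(block_id)
    return {pair for ids in index.values()
            for pair in combinations(sorted(ids), 2)}


def similarity(a, b):
    """Multiset overlap of semantic tokens, normalised by the larger bag."""
    shared = sum((a & b).values())
    return shared / max(sum(a.values()), sum(b.values()), 1)


sources = {
    "a": "def total(xs):\n    s = 0\n    for x in xs:\n        s += abs(x)\n    return s\n",
    "b": "def acc(values):\n    r = 0\n    for v in values:\n        r += abs(v)\n    return r\n",
    "c": "def greet(name):\n    print(name)\n",
}
blocks = {name: semantic_tokens(src) for name, src in sources.items()}
for x, y in sorted(candidate_pairs(blocks)):
    print(x, y, similarity(blocks[x], blocks[y]))
```

In this toy input, blocks `a` and `b` differ only in identifier names, so they share the key token `abs` in the index and their path-abstracted token bags coincide. Block `c` calls a different function and is never paired with them, which is the kind of cheap filtering the index step is meant to provide.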

