Code Clone Detection Method for Large-Scale Source Code
Cite this article: GUO Ying, CHEN Fenghong, ZHOU Minghui. Code Clone Detection Method for Large-Scale Source Code[J]. Journal of Frontiers of Computer Science and Technology, 2014(4): 417-426.
Authors: GUO Ying, CHEN Fenghong, ZHOU Minghui
Affiliation: [1] Institute of Software, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; [2] Key Laboratory of High Confidence Software Technologies of Ministry of Education, Peking University, Beijing 100871, China
Fund program: Supported by the National Natural Science Foundation of China under Grant Nos. 91118004, 61073016; the National Basic Research Program of China (973 Program) under Grant No. 2011CB302604; the Joint Funds of the National Natural Science Foundation of China under Grant No. U1201252; and the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA011202.
Abstract: Code clone detection plays an important role in plagiarism detection, copyright infringement investigation, software evolution analysis, code compaction, error detection, bug finding, and the discovery of reuse patterns. Existing code clone detection tools rely on complex algorithms or consume large amounts of computing resources, and are therefore unsuited to very large bodies of code. To detect code clones on large-scale data, this paper proposes a new code clone detection algorithm. The algorithm combines the idea of content-defined chunking (CDC) from data de-duplication with the Simhash algorithm used in duplicate web page detection: code is first split into chunks and the chunks are then fuzzily matched. The algorithm is implemented on a data source covering a variety of open source projects, with more than 500 million code files and about 10 TB of code in total. Experiments compare the effect of different chunk lengths on the clone detection rate and the required time, and show that the new algorithm can be applied to large-scale code clone detection, detects some Type-3 clones, and achieves high precision.

Keywords: code clone; detection; large-scale code data
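The chunking step described in the abstract can be pictured with a small example. Below is a minimal sketch of content-defined chunking (CDC) in Python: a rolling hash over a sliding byte window cuts a chunk wherever the hash value matches a mask, subject to length bounds. The window size, mask, and minimum/maximum chunk lengths here are illustrative assumptions, not the parameters used in the paper.

BASE = 257
MOD = 1 << 32

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x0FFF,
               min_len: int = 128, max_len: int = 4096):
    # Cut a chunk when the rolling hash of the last `window` bytes hits the mask.
    pow_w = pow(BASE, window, MOD)                     # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD                       # add the incoming byte
        if i >= window:
            h = (h - data[i - window] * pow_w) % MOD   # drop the outgoing byte
        length = i - start + 1
        # Content-defined cut point: hash matches the mask (and the chunk is long
        # enough), or the chunk has reached its maximum allowed length.
        if (length >= min_len and (h & mask) == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                    # trailing remainder is the last chunk
    return chunks

# Example: split a small synthetic source file into content-defined chunks.
print([len(c) for c in cdc_chunks(b"int add(int a, int b) { return a + b; }\n" * 200)])

Because cut points depend on content rather than fixed offsets, a local edit near the top of a file shifts only nearby chunk boundaries and leaves most downstream chunks, and hence their fingerprints, unchanged.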

Code Clone Detection Method for Large-Scale Source Code
GUO Ying, CHEN Fenghong, ZHOU Minghui. Code Clone Detection Method for Large-Scale Source Code[J]. Journal of Frontiers of Computer Science and Technology, 2014(4): 417-426.
Authors: GUO Ying, CHEN Fenghong, ZHOU Minghui
Affiliation: 1. Institute of Software, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; 2. Key Laboratory of High Confidence Software Technologies of Ministry of Education, Peking University, Beijing 100871, China
Abstract: Detecting code clones helps with plagiarism detection, copyright infringement investigation, code compaction, error detection, and the discovery of usage patterns. Existing clone detection tools usually use complicated algorithms or need large amounts of computing resources, so they cannot be applied to detect code clones on large-scale code data. To implement code clone detection on massive data, this paper proposes a new code clone detection algorithm. The algorithm combines the idea of content-defined chunking (CDC) in data de-duplication with that of the Simhash algorithm for finding duplicate web pages, first chunking the code and then fuzzily matching the chunks. The algorithm is implemented on a data source which contains more than 500 million files, about 10 TB in total, from a variety of open source projects. This paper compares the influence of different chunk lengths on detection rate and detection time. The experimental results show that the new algorithm can not only be applied to detect code clones at large scale, but can also detect some Type-3 clones, with high detection precision.
Keywords: code clone; detection; large-scale code data
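The fuzzy-matching step rests on Simhash: each chunk is reduced to a short fingerprint, and two chunks whose fingerprints lie within a small Hamming distance are reported as clone candidates. The sketch below assumes whitespace tokens as features, an MD5-based 64-bit token hash, and a 3-bit threshold purely for illustration; the paper's feature extraction and threshold may differ.

import hashlib

def simhash(tokens, bits: int = 64) -> int:
    # Weighted bit voting: each token hash votes +1/-1 per bit position; the
    # sign of the total decides the fingerprint bit, so near-identical token
    # sets produce fingerprints that differ in only a few bits.
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def is_clone_candidate(chunk_a: str, chunk_b: str, threshold: int = 3) -> bool:
    # Chunks whose fingerprints differ in at most `threshold` bits are treated
    # as clone candidates (hypothetical threshold, chosen for illustration).
    return hamming(simhash(chunk_a.split()), simhash(chunk_b.split())) <= threshold

# Example: compare two slightly different chunks by fingerprint distance.
a = "int sum(int a, int b) { return a + b; }"
b = "int sum(int x, int y) { return x + y; }"
print(hamming(simhash(a.split()), simhash(b.split())))

At the scale the paper targets, pairwise comparison of all fingerprints would be replaced by indexing so that only near-duplicate fingerprints are ever compared; that engineering is outside this sketch.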