
A Code Search Approach Based on Vector Representation
Citation: MU Jianglin, LIU Kejian, LIN Han. A Code Search Approach Based on Vector Representation[J]. Journal of Xihua University (Natural Science Edition), 2019, 38(5): 106-112.
Authors: MU Jianglin, LIU Kejian, LIN Han
Affiliation: 1. School of Computer and Software Engineering, Xihua University, Chengdu, Sichuan 610039, China
Abstract: Software developers often rely on a large number of base packages written by other developers. To learn how such packages are used beyond what their development documentation covers, developers enter code keywords into a code search engine to find code snippets. This paper proposes a code search method based on vector representation. The method collects code snippets from GitHub and Stack Overflow datasets to train a skip-gram model for expanding code words, and uses this model to expand the search keywords, extracted from the search text, that are associated with code words. This yields a vector group of code snippets for the search-keyword context. That vector group and the vector group of the candidate code snippets are then encoded, and their cosine similarity is computed and sorted to produce the search results. The method was evaluated on both the GitHub dataset and the Stack Overflow dataset. On the Stack Overflow dataset, 58% of searches find the correct answer in the first result, 65% find it within the first five results, and 72% find it within the first ten, with some improvement in recall and F-measure as well. On the GitHub dataset, 59% of searches find the correct answer in the first result, 67% within the first five, and 74% within the first ten, again with some improvement in recall and F-measure. For code retrieval over large amounts of data, the proposed algorithm produces better search results than typical methods.

Keywords: code vector representation; code search; semantic encoding; cosine similarity
Received: 2019-01-07
Indexed in: CNKI and other databases
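
The pipeline outlined in the abstract (expand the query keywords, encode the query and the candidate snippets as vectors, rank by cosine similarity, then measure how often a correct answer appears in the top k results) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the toy 3-dimensional `EMBEDDINGS` stand in for a trained skip-gram model, mean pooling in `encode` stands in for the paper's unspecified encoding step, and the names `expand`, `encode`, `search`, `success_at_k` and `SNIPPETS` are hypothetical.

```python
import math

# Hypothetical 3-dimensional toy embeddings, invented for this sketch. In the
# paper they would come from a skip-gram model trained on code snippets.
EMBEDDINGS = {
    "read": [0.9, 0.1, 0.0],
    "file": [0.8, 0.2, 0.1],
    "open": [0.85, 0.15, 0.05],
    "socket": [0.1, 0.9, 0.2],
    "connect": [0.15, 0.85, 0.25],
    "sort": [0.0, 0.2, 0.9],
    "list": [0.1, 0.1, 0.8],
}
DIM = 3

# Hypothetical candidate snippets, already reduced to their code words.
SNIPPETS = {
    "read_file": ["open", "file", "read"],
    "tcp_client": ["socket", "connect"],
    "sort_items": ["sort", "list"],
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand(keywords, k=1):
    """Append the k nearest vocabulary words of each known keyword."""
    expanded = list(keywords)
    for word in keywords:
        if word not in EMBEDDINGS:
            continue
        candidates = [w for w in EMBEDDINGS if w not in expanded]
        candidates.sort(
            key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]),
            reverse=True,
        )
        expanded.extend(candidates[:k])
    return expanded


def encode(tokens):
    """Mean-pool the vectors of known tokens (an assumed encoding)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def search(query_tokens, snippets, k=1):
    """Return snippet names ordered by cosine similarity to the query."""
    query_vector = encode(expand(query_tokens, k))
    return sorted(
        snippets,
        key=lambda name: cosine(query_vector, encode(snippets[name])),
        reverse=True,
    )


def success_at_k(rankings, correct, k):
    """Fraction of queries with a correct answer among the first k results."""
    if not rankings:
        return 0.0
    hits = sum(
        1
        for query, results in rankings.items()
        if any(result in correct[query] for result in results[:k])
    )
    return hits / len(rankings)


if __name__ == "__main__":
    print(search(["read"], SNIPPETS))
```

Calling `success_at_k` with k set to 1, 5 and 10 over a labelled query set gives the kind of first-result, top-5 and top-10 percentages the abstract reports. The toy data above is far too small to reproduce those figures.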