Refcount Field Identification for Linux Kernel Based on Deep Learning

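Cite this article:	TAN Xin, YANG Xi-Yu, CAO Jia-Jun, ZHANG Yuan. Refcount Field Identification for Linux Kernel Based on Deep Learning[J]. Journal of Software (软件学报), 2022, 33(6): 2030-2046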

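Authors:	TAN Xin (谈心), YANG Xi-Yu (杨悉瑜), CAO Jia-Jun (曹家俊), ZHANG Yuan (张源)

Affiliation:	School of Computer Science, Fudan University, Shanghai 201203

Funding:	National Natural Science Foundation of China (U1836210, U1836213, U1736208, 61972099, 62172105); Natural Science Foundation of Shanghai (19ZR1404800); Shanghai Rising-Star Program (21QA1400700)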

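Abstract:	Reference counting is a common memory management technique in modern software. Reference counting errors often lead to serious security problems such as memory leaks and use-after-free. Existing work on hardening reference counting depends on first identifying the reference count (refcount) fields. However, because the code of software systems such as Linux is highly complex, identifying refcount fields in code is very difficult. Traditional identification methods based on code pattern matching have two limitations: they require experts to summarize rules by hand, which is labor-intensive, and the summarized patterns cannot cover all cases, so recall is low. To address these problems, this paper observes that the code behavior associated with a field and the name of the field can both characterize the field and help identify refcount fields. Based on these two kinds of features, it designs a refcount field identification method based on multimodal deep learning and implements a prototype system for the Linux kernel. Test data show that the prototype achieves a precision of 96.98% and a recall of 93.54%, whereas the traditional method based on code pattern matching did not identify any refcount fields. In addition, 61 refcount fields using insecure data types were found in the Linux kernel. Data type conversion patches were submitted to the Linux kernel community for 21 of them to improve the security of the refcount fields, and 6 of these patches have been merged into the main branch of the Linux kernel code.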
Keywords:	refcount identification; static program analysis; multimodal deep learning
Received:	2021-09-05
Revised:	2021-10-15
Refcount Field Identification for Linux Kernel Based on Deep Learning

TAN Xin,YANG Xi-Yu,CAO Jia-Jun,ZHANG Yuan. Refcount Field Identification for Linux Kernel Based on Deep Learning[J]. Journal of Software, 2022, 33(6): 2030-2046

Authors:	TAN Xin, YANG Xi-Yu, CAO Jia-Jun, ZHANG Yuan

Affiliation:	School of Computer Science, Fudan University, Shanghai 200438, China

Abstract:	Reference counting (refcount) is a common memory management technique in modern software. Refcount errors often lead to severe memory errors such as memory leaks and use-after-free. Many efforts to harden refcount security rely on known refcount fields as their input. However, because software such as the Linux kernel is highly complex, identifying refcount fields in source code is very challenging. Traditional methods for identifying refcount fields are mainly based on code pattern matching and have two major limitations: they require expert experience to summarize the patterns, which is laborious, and the manually summarized patterns do not cover all cases, resulting in low recall. To address these problems, this paper proposes to characterize a field by its name and by the code behaviour associated with it, and designs a multimodal deep learning based approach to identify refcount fields. The paper implements a prototype of the approach for Linux kernel code. In the evaluation, the prototype achieves a precision of 96.98% and a recall of 93.54%. In contrast, the traditional code-pattern-based identification method did not report any refcount fields on the testing set. In addition, the paper identifies 61 refcount fields that are implemented with insecure data types in the latest Linux kernel. So far, 21 of them have been reported to the Linux community, of which six have been confirmed.

Keywords:	Refcount field identification; Static program analysis; Multimodal deep learning
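
The precision and recall figures quoted in the abstract follow the standard definitions for a detector that reports a set of candidate fields. The sketch below is only an illustration of those definitions and is not the paper's evaluation code: the function name `precision_recall` and every field name in it are hypothetical, and the counts are made up for the example. It also shows why a detector that reports nothing, as the pattern-matching baseline did on the testing set, has a recall of zero and an undefined precision.

```python
def precision_recall(true_fields, reported_fields):
    """Return (precision, recall) for a set of reported refcount fields.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    A value is None when its denominator is zero (0/0 is undefined).
    """
    true_fields = set(true_fields)
    reported_fields = set(reported_fields)
    tp = len(true_fields & reported_fields)  # correctly reported fields
    precision = tp / len(reported_fields) if reported_fields else None
    recall = tp / len(true_fields) if true_fields else None
    return precision, recall


# Hypothetical ground truth and detector outputs (illustrative names only,
# not taken from the Linux kernel or from the paper's data set).
truth = {"obj_a.refcnt", "obj_b.users", "obj_c.count", "obj_d.ref"}
learned = {"obj_a.refcnt", "obj_b.users", "obj_c.count", "obj_e.size"}
pattern = set()  # a detector that reports no fields at all

print(precision_recall(truth, learned))  # (0.75, 0.75)
print(precision_recall(truth, pattern))  # (None, 0.0)
```

Returning `None` instead of 0 for the empty report keeps "no reports" distinguishable from "all reports wrong", which matters when comparing a silent baseline against a learned model.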
This article is indexed in Wanfang Data and other databases.