Lightweight Vision Transformer Based on Separable Structured Transformations
Citation: Huang Yanhui, Lan Hai, Wei Xian. Lightweight Vision Transformer Based on Separable Structured Transformations[J]. Computer and Modernization, 2022, 0(10): 75-81
Authors: Huang Yanhui, Lan Hai, Wei Xian
Affiliations: 1. College of Electrical Engineering and Automation, Fuzhou University; 2. Quanzhou Institute of Equipment Manufacturing, Fujian Institute of Research on the Structure of Matter, Chinese Academy of Sciences
Funding: Fujian Science & Technology Innovation Laboratory for Optoelectronic Information of China (Mindu Innovation Laboratory) (2021ZZ120); Fujian Provincial Science and Technology Program (2021T3003, 2021T3068); Quanzhou Science and Technology Program (2021C065L); Putian Science and Technology Program (2020HJSTS006)
Abstract: Vision Transformer models are difficult to deploy on terminal devices because of their large number of parameters and high floating-point operation counts. Owing to the low-rank bottleneck of the attention matrix, existing model compression algorithms and attention acceleration algorithms cannot properly balance the number of model parameters, inference speed, and model performance. To address these problems, this paper designs a lightweight ViT-SST model for image classification. First, converting the conventional fully connected layers into a separable structure greatly reduces the number of parameters and improves inference speed, while ensuring that the attention matrix does not lose expressive power through a low-rank collapse. Second, a Kronecker-product approximate decomposition based on SVD is proposed, which converts the publicly released pretrained parameters of ViT-Base into the ViT-Base-SST model; this slightly alleviates overfitting of the ViT model and improves its accuracy. Experiments on the widely used public image datasets of the CIFAR and Caltech series confirm that the proposed method outperforms the compared methods.
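The separable structure transformation described in the abstract can be illustrated with a short sketch. Below is a minimal PyTorch example, assuming a Kronecker-product parameterisation W = A ⊗ B of a fully connected layer with hypothetical factor shapes; it illustrates the general technique, not the paper's exact ViT-SST layer.

import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    """Linear layer whose weight is constrained to W = A kron B (illustrative shapes)."""
    def __init__(self, n1, n2, m1, m2, bias=True):
        super().__init__()
        self.n1, self.n2, self.m1, self.m2 = n1, n2, m1, m2
        # Two small factors replace the dense (m1*m2) x (n1*n2) weight matrix.
        self.A = nn.Parameter(torch.randn(m1, n1) / n1 ** 0.5)
        self.B = nn.Parameter(torch.randn(m2, n2) / n2 ** 0.5)
        self.bias = nn.Parameter(torch.zeros(m1 * m2)) if bias else None

    def forward(self, x):
        # x: (..., n1 * n2). The dense Kronecker weight is never materialised:
        # with row-major reshapes, (A kron B) x equals flatten(A @ X @ B^T).
        lead = x.shape[:-1]
        X = x.reshape(-1, self.n1, self.n2)                    # (batch, n1, n2)
        Y = torch.matmul(torch.matmul(self.A, X), self.B.t())  # (batch, m1, m2)
        y = Y.reshape(*lead, self.m1 * self.m2)
        return y if self.bias is None else y + self.bias

# Sanity check against the explicitly materialised Kronecker product.
layer = KroneckerLinear(n1=24, n2=32, m1=48, m2=64, bias=False)
x = torch.randn(5, 24 * 32)
dense = torch.kron(layer.A, layer.B)                           # (3072, 768)
print(torch.allclose(layer(x), x @ dense.T, atol=1e-4))        # expected: True

The parameter count drops from m1*m2*n1*n2 for the dense layer to m1*n1 + m2*n2 for the two factors, and the forward pass runs as two small matrix multiplications.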

Keywords: deep learning; computer vision; image classification; model compression
Received: 2022-10-21

Lightweight Vision Transformer Based on Separable Structured Transformations
Abstract: Because of its large number of parameters and high floating-point operation count, the Vision Transformer model is difficult to deploy on portable or terminal devices. Owing to the low-rank bottleneck of the attention matrix, model compression algorithms and attention acceleration algorithms cannot properly balance the number of model parameters, inference speed, and model performance. To address these problems, a lightweight ViT-SST model is designed. First, transforming the traditional fully connected layers into a separable structure greatly reduces the number of parameters and improves inference speed, while guaranteeing that the attention matrix does not lose expressive power through a low-rank collapse. Second, this paper proposes a Kronecker-product approximate decomposition based on SVD, which converts the publicly released pretrained parameters of ViT-Base into the ViT-Base-SST model; this slightly alleviates overfitting of the ViT-Base model and improves its accuracy. Experiments on five common public datasets show that the proposed method is better suited to Transformer-style models than traditional compression methods.
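As an illustration of the SVD-based Kronecker-product approximate decomposition mentioned above, the sketch below applies the classical Van Loan-Pitsianis nearest-Kronecker-product construction: the dense weight is rearranged so that an exact Kronecker product becomes a rank-1 matrix, and the leading singular pair yields the two factors. The 3072 x 768 shape and the 48 x 64 / 24 x 32 factorisation are assumptions for the example, not the paper's reported configuration.

import torch

def nearest_kronecker(W, m1, m2, n1, n2):
    """Best Frobenius-norm approximation W ≈ A kron B via a rank-1 SVD (Van Loan-Pitsianis)."""
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so that each (m2 x n2) block becomes one row; for an exact
    # Kronecker product this rearranged matrix has rank 1.
    R = W.reshape(m1, m2, n1, n2).permute(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    s = S[0].sqrt()
    A = (s * U[:, 0]).reshape(m1, n1)
    B = (s * Vh[0, :]).reshape(m2, n2)
    return A, B

# Example on a stand-in for a pretrained 3072 x 768 MLP weight (random here).
W = torch.randn(3072, 768)
A, B = nearest_kronecker(W, m1=48, m2=64, n1=24, n2=32)
print(A.numel() + B.numel(), "vs", W.numel())             # 3200 vs 2359296 parameters
print(torch.norm(W - torch.kron(A, B)) / torch.norm(W))   # relative approximation error

In this hedged reading of the method, such a decomposition is what would allow public ViT-Base checkpoints to initialise the separable factors instead of training them from scratch.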
Keywords: deep learning; computer vision; image classification; model compression