Fine-grained Image Classification Based on Data Augmentation Vision Transformer
Citation: HU Xiaobin, PENG Taile. Fine-grained Image Classification Based on Data Augmentation Vision Transformer[J]. Journal of Xihua University (Natural Science Edition), 2022, 41(6): 9-16. DOI: 10.12198/j.issn.1673-159X.4544
Authors: HU Xiaobin, PENG Taile
Affiliation: School of Computer Science and Technology, Huaibei Normal University, Huaibei 235000, Anhui, China
Funding: National Natural Science Foundation of China (61976101); Natural Science Research Project of Anhui Universities (KJ2017A843)
Abstract: Recently, the vision Transformer (ViT) has achieved breakthrough progress in image recognition. Its self-attention mechanism (MSA) can extract discriminative token information from different image patches and thereby improve classification accuracy, but the classification token in deep layers tends to overlook local features across layers; in addition, the embedding layer feeds fixed-size patches into the network, inevitably introducing extra image noise. To address this, this paper studies a data augmentation vision Transformer (DAVT) and proposes an attention-cropping data augmentation method, which crops images under the guidance of attention weights to strengthen the network's ability to learn key features. Furthermore, a hierarchical attention selection (HAS) method is proposed, which screens and fuses tokens across layers to improve the network's ability to learn discriminative tokens between layers. Experimental results show that the accuracy of this method on the two common datasets CUB-200-2011 and Stanford Dogs surpasses existing mainstream methods, exceeding the original ViT by 1.4% and 1.6%, respectively.
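The two mechanisms described in the abstract — attention-guided cropping and hierarchical attention selection — can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions (a patch-level attention map is available, cropping uses a fixed threshold, and fusion is a simple top-k concatenation); the function names and parameters are illustrative, not the paper's published implementation:

```python
import numpy as np

def attention_crop(image, attn, threshold=0.5):
    """Attention-guided cropping (conceptual sketch).

    image: (H, W, C) pixel array; attn: (h, w) patch-level attention map.
    Crops the smallest box covering all patches whose normalized
    attention exceeds `threshold`.
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    mask = attn / attn.max() > threshold          # keep high-attention patches
    ys, xs = np.nonzero(mask)
    py, px = H // h, W // w                       # pixels per patch
    y0, y1 = ys.min() * py, (ys.max() + 1) * py   # patch box -> pixel box
    x0, x1 = xs.min() * px, (xs.max() + 1) * px
    return image[y0:y1, x0:x1]

def select_tokens(layer_tokens, layer_attn, k=4):
    """Hierarchical attention selection (conceptual sketch).

    From each layer, keep the k patch tokens that the classification
    (CLS) token attends to most, then fuse them by concatenation.
    layer_tokens: list of (N, D) token matrices (row 0 = CLS token);
    layer_attn:   list of (N, N) attention matrices.
    """
    picked = []
    for tokens, attn in zip(layer_tokens, layer_attn):
        top = np.argsort(attn[0, 1:])[::-1][:k]   # CLS-to-patch attention
        picked.append(tokens[1:][top])            # screen the patch tokens
    return np.concatenate(picked, axis=0)         # fuse across layers
```

In this reading, `attention_crop` removes background noise introduced by fixed-size patch embedding by re-cropping around high-attention regions, while `select_tokens` preserves discriminative local features from intermediate layers that the deep-layer classification token would otherwise ignore.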

Keywords: fine-grained image classification; hierarchical attention selection; data augmentation mechanism; image recognition
Received: 2020-04-24
