Audio Scene Classification Based on Audio Spectrogram Transformer


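Cite this article:	Yuan Shuang, Yang Lidong, Guo Yong, Niu Dawei, Zhang Dandan. Audio Scene Classification Based on Audio Spectrogram Transformer[J]. Journal of Signal Processing (信号处理), 2023, 39(4): 730-736.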

Authors:	Yuan Shuang, Yang Lidong, Guo Yong, Niu Dawei, Zhang Dandan

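Affiliation:	School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China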

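Funding:	National Natural Science Foundation of China (62161040); Inner Mongolia Science and Technology Plan Project (2021GG0023); Natural Science Foundation of Inner Mongolia (2021MS06030); Support Program for Young Science and Technology Talents in Higher Education Institutions of Inner Mongolia Autonomous Region (NJYT22056)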

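Abstract:	Audio scene classification is an important part of scene understanding. Learning audio scene features and classifying them accurately strengthens a machine's ability to interact with its environment, and its importance in the era of big data is self-evident. Classification performance depends on dataset size, yet practical tasks often face a severe shortage of data. To address this, this paper proposes data augmentation and network pre-training strategies and applies the audio spectrogram transformer model to the audio scene classification task. First, the log-Mel energy spectrogram of the audio signal is extracted and fed to the model; next, the model's dynamic interaction capability strengthens the spatial relationships within the audio sequence; finally, the token vector performs the classification. The method was tested on the public DCASE2019 Task 1 and DCASE2020 Task 1 datasets, reaching classification accuracies of 96.489% and 93.227%, respectively, a clear improvement over existing algorithms. These results indicate that the method is suited to high-accuracy audio scene classification and lays a foundation for intelligent devices to perceive environmental content and detect environmental changes with high precision.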
Keywords:	audio scene classification; transformer; pre-training; data augmentation
Received:	2022-11-14
Audio Scene Classification Based on Audio Spectrogram Transformer

Affiliation:	School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China

Abstract:	Audio scene classification is an important part of scene understanding. Learning the characteristics of audio scenes and classifying them accurately can strengthen the interaction between machines and the environment, and its importance is self-evident in the age of big data. Because classification performance depends on the size of the dataset, while real tasks face a serious shortage of data, this paper proposes data augmentation and network pre-training strategies that combine the audio spectrogram transformer model with the audio scene classification task. First, the log-Mel energy spectrogram of the audio signal is extracted as the model input; then the spatial relationships of the audio sequence are strengthened through the dynamic interaction ability of the model; finally, the classification is completed by the token vector. The method is tested on the public DCASE2019 Task 1 and DCASE2020 Task 1 datasets, where the classification accuracies are 96.489% and 93.227%, respectively, a significant improvement over existing algorithms. This indicates that the method is applicable to high-precision audio scene classification and lays a foundation for high-precision intelligent devices to perceive environmental content and detect environmental dynamics.

Keywords:	audio scene classification; transformer; pre-training; data augmentation
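
The abstract's first step, extracting log-Mel energies from the audio signal, can be illustrated with a minimal standard-library sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the frame length `n_fft`, the hop size `hop`, the number of mel bands `n_mels`, the Hann window, and the common 2595·log10(1 + f/700) mel formula. The naive DFT keeps the example self-contained; a real pipeline would normally rely on an optimized audio library.

```python
import cmath
import math


def hz_to_mel(hz):
    # Widely used mel-scale formula (an assumption, not from the paper).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (one row per band)."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies: left edge, band centers, right edge.
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Power of the one-sided DFT of one windowed frame (naive O(n^2))."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8,
                        eps=1e-10):
    """Return a list of frames, each a list of n_mels log-mel energies."""
    # Hann window applied to every frame before the DFT.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / n_fft)
              for t in range(n_fft)]
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        power = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # eps avoids log(0) for silent bands.
        frames.append([math.log(e + eps) for e in energies])
    return frames


# Usage: a 440 Hz tone sampled at 8 kHz for 0.1 s (synthetic test signal).
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(800)]
features = log_mel_spectrogram(tone, sample_rate)
print(len(features), "frames x", len(features[0]), "mel bands")
```

The resulting time-by-band matrix is the kind of input a spectrogram transformer would split into patches; the patch size, embedding, and pre-training details are not given in the abstract and are left out of this sketch.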
