     

A Speech Recognition Method Based on an Improved Linear Attention Mechanism
Citation: Li Yiting, Qu Dan, Yang Xukui, Zhang Hao, Shen Xiaolong. A speech recognition method based on an improved linear attention mechanism[J]. Journal of Signal Processing, 2023, 39(3): 516-525.
Authors: Li Yiting, Qu Dan, Yang Xukui, Zhang Hao, Shen Xiaolong
Affiliation: 1. College of Information Systems Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou, Henan 450001, China
Funding: Supported by the National Natural Science Foundation of China (62171470), the Central Plains Science and Technology Innovation Leading Talent Project of Henan Province (234200510019), and the General Program of the Natural Science Foundation of Henan Province (232300421240)
Abstract: Owing to its superior performance, the Conformer model has attracted growing attention from researchers and has gradually become the mainstream model in speech recognition. However, because it extracts information from the input with an attention mechanism that computes pairwise interactions over all points in the input sequence, its computational complexity is quadratic in the sequence length, so recognizing long utterances consumes more computing resources and is slow. To address this problem, this paper proposes a speech recognition method based on a linear attention mechanism. First, a novel gated linear attention structure is proposed that replaces multi-head attention with a single head and reduces the attention complexity to a linear function of the sequence length, effectively cutting the cost of attention computation. Second, to compensate for the loss of modeling capacity caused by linear attention, local attention and global attention are combined within the linear attention computation, jointly with linear attention encoding, to improve recognition accuracy. Finally, to further improve recognition performance, the objective function fuses a guided attention loss and an intermediate connectionist temporal classification (CTC) loss with the standard attention and CTC losses. Experiments on the Mandarin Chinese AISHELL-1 dataset and the English LibriSpeech dataset show that the improved model clearly outperforms the baseline model, while GPU memory consumption decreases and training and recognition speed improve substantially.
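The quadratic-versus-linear distinction described in the abstract can be sketched numerically. Below is a minimal NumPy illustration, not the paper's gated structure: standard softmax attention materializes an N x N score matrix, whereas a kernelized linear attention (in the style of the elu(x)+1 feature map from the linear-transformers literature) reorders the computation as phi(Q) (phi(K)^T V), so the cost grows linearly in the sequence length N.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the N x N score matrix makes cost O(N^2 * d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention: computing phi(K).T @ V first yields a d x d
    summary independent of N, so the total cost is O(N * d^2)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 feature map
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                    # (d, d) summary of keys and values
    z = Qp @ Kp.sum(axis=0)          # (N,) per-query normalizer
    return (Qp @ kv) / (z[:, None] + eps)
```

Because the d x d summary does not depend on N, doubling the utterance length only doubles the work, instead of quadrupling it as in the softmax version.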

Key words: speech recognition; end-to-end; efficient attention; connectionist temporal classification; Conformer
Received: 2022-08-25

Speech Recognition Model Based on Improved Linear Attention Mechanism
Affiliation: 1. College of Information Systems Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou, Henan 450001, China; 2. Troops 95897 of PLA, Dalian, Liaoning 116001, China
Abstract: The Conformer model has drawn increasing attention from researchers and has gradually become the mainstream model in the field of speech recognition because of its superior performance. However, because it uses an attention mechanism to extract information from the input, which requires interaction computations over all sample points in the input sequence, the computational complexity of the network is quadratic in the input sequence length; it therefore consumes more computing resources when recognizing long speech, and its recognition speed is slow. To solve this problem, this paper proposes a speech recognition method with a linear attention mechanism. First, a novel gated linear attention structure is proposed to effectively reduce the complexity of attention computation: multi-head attention is replaced with single-head attention, and the complexity is reduced to a linear function of the sequence length. Second, to compensate for the decline in modeling ability caused by linear attention, local attention and global attention are combined with the help of linear attention positional coding. Finally, to further improve recognition performance, a guided attention loss and an intermediate connectionist temporal classification (CTC) loss are added to the objective function on top of the attention loss and CTC loss. Experimental results on the Mandarin Chinese AISHELL-1 dataset and the English LibriSpeech dataset show that the improved model significantly outperforms the baseline model, with reduced GPU memory consumption and greatly improved training and recognition speed.
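As a rough illustration of the fused training objective described in the abstract, the sketch below combines the four loss terms by weighted interpolation. The helper name `fused_loss` and the weight values are our own illustration, not taken from the paper.

```python
def fused_loss(att_loss, ctc_loss, inter_ctc_loss, guided_att_loss,
               ctc_weight=0.3, inter_weight=0.5, guide_weight=1.0):
    """Weighted fusion of the four objectives named in the abstract.
    The default weights are illustrative, not the paper's values."""
    # Interpolate the final-layer CTC loss with the intermediate-layer one.
    ctc_total = (1.0 - inter_weight) * ctc_loss + inter_weight * inter_ctc_loss
    # The guided attention loss regularizes the attention branch.
    att_total = att_loss + guide_weight * guided_att_loss
    # Standard hybrid CTC/attention interpolation of the two branches.
    return (1.0 - ctc_weight) * att_total + ctc_weight * ctc_total
```

In practice each argument would be a scalar loss produced by the corresponding network head for one minibatch; the same formula applies unchanged to framework tensors.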
Keywords: speech recognition; end-to-end; efficient attention; connectionist temporal classification; Conformer