
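End-to-end speech recognition method based on prosodic features (基于韵律特征辅助的端到端语音识别方法)
Cite this article:Cong LIU, Genshun WAN, Jianqing GAO, Zhonghua FU. End-to-end speech recognition method based on prosodic features[J]. Journal of Computer Applications (计算机应用), 2023, 43(2): 380-384.
Authors:Cong LIU  Genshun WAN  Jianqing GAO  Zhonghua FU
Affiliation:AI Institute, iFLYTEK Company Limited, Hefei 230088
Xi’an iFLYTEK Hyper-brain Information Technology Company Limited, Xi’an 710000
Funding:Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0103600)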
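Abstract:Traditional speech recognition systems are data-driven and rely on a language model to choose the optimal decoding path, which in some scenarios produces results where the pronunciation is right but the characters are clearly wrong. To address this problem, an end-to-end speech recognition method assisted by prosodic features is proposed, which uses the prosodic information in speech to raise the language-model probability of the correct character combinations. Building on an attention-based encoder-decoder speech recognition framework, the method first extracts prosodic features such as pronunciation interval and pronunciation energy from the coefficient distribution of the attention mechanism. It then combines these prosodic features with the decoder, which significantly improves recognition accuracy for utterances with identical or similar pronunciations and semantic ambiguity. Experimental results show that, on 1 000 h and 10 000 h speech recognition tasks, the method achieves relative accuracy improvements of 5.2% and 5.0%, respectively, over the end-to-end speech recognition baseline, further improving the intelligibility of the recognition results.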

Keywords:speech recognition  end-to-end  semantic ambiguity  attention mechanism  prosodic feature
Received:2022-01-06
Revised:2022-04-06

End-to-end speech recognition method based on prosodic features
Cong LIU, Genshun WAN, Jianqing GAO, Zhonghua FU. End-to-end speech recognition method based on prosodic features[J]. Journal of Computer Applications, 2023, 43(2): 380-384.
Authors:Cong LIU  Genshun WAN  Jianqing GAO  Zhonghua FU
Affiliation:AI Institute,iFLYTEK Company Limited,Hefei Anhui 230088,China
Xi’an iFLYTEK Hyper-brain Information Technology Company Limited,Xi’an Shaanxi 710000,China
Abstract:In traditional speech recognition systems, the optimal decoding path is determined by a data-driven language model that is limited by its training data. As a result, in some scenarios a correctly recognized pronunciation is mapped to the wrong characters. To use the prosodic information in speech to raise the language-model probability of the correct character combinations, an end-to-end speech recognition method based on prosodic features is proposed. Building on an attention-based encoder-decoder speech recognition framework, the method first uses the coefficient distribution of the attention mechanism to extract prosodic features such as pronunciation interval and pronunciation energy. The prosodic features are then combined with the decoder, which significantly improves recognition accuracy in cases of identical or similar pronunciation and semantic ambiguity. Experimental results show that, on 1 000 h and 10 000 h speech recognition tasks, the proposed method achieves relative accuracy improvements of 5.2% and 5.0%, respectively, over the baseline end-to-end speech recognition method, and improves the intelligibility of the recognition results.
Keywords:speech recognition  end-to-end  semantic ambiguity  attention mechanism  prosodic feature  
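
The abstract does not give the exact feature definitions, so the following is only a minimal sketch of how prosodic features might be read off an attention coefficient distribution. It assumes an attention matrix of shape (output tokens, encoder frames) whose rows sum to 1, a per-frame energy vector, and a 10 ms frame shift. The function names `prosodic_features` and `fuse_with_decoder`, the two feature definitions, and the fusion by concatenation are all hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def prosodic_features(attn, frame_energy, frame_shift_s=0.01):
    """Derive two illustrative prosodic features per output token.

    attn:          (U, T) attention weights, each row summing to 1 (assumed).
    frame_energy:  (T,) per-frame energy, e.g. log energy (assumed input).
    frame_shift_s: seconds between encoder frames (assumed 10 ms).
    Returns a (U, 2) array: [interval to previous token in s, weighted energy].
    """
    attn = np.asarray(attn, dtype=float)
    frame_energy = np.asarray(frame_energy, dtype=float)
    if attn.ndim != 2 or frame_energy.shape != (attn.shape[1],):
        raise ValueError("attn must be (U, T) and frame_energy must be (T,)")

    # Expected frame position of each token under its attention distribution.
    frame_index = np.arange(attn.shape[1], dtype=float)
    centers = attn @ frame_index

    # "Pronunciation interval": time gap between consecutive token centers.
    # The first token has no predecessor, so its interval is set to 0.
    intervals = np.diff(centers, prepend=centers[0]) * frame_shift_s

    # "Pronunciation energy": attention-weighted average of frame energy.
    energy = attn @ frame_energy

    return np.stack([intervals, energy], axis=1)


def fuse_with_decoder(decoder_states, features):
    """Hypothetical fusion: append the features to each decoder state."""
    decoder_states = np.asarray(decoder_states, dtype=float)
    return np.concatenate([decoder_states, features], axis=1)


# Toy example: 2 output tokens attending over 6 encoder frames.
attn = np.array([[0.7, 0.3, 0.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.2, 0.8, 0.0]])
frame_energy = np.array([0.9, 0.8, 0.1, 0.6, 1.0, 0.2])
features = prosodic_features(attn, frame_energy)
fused = fuse_with_decoder(np.zeros((2, 4)), features)
print(features.shape, fused.shape)
```

In this sketch a long interval between token centers stands in for a pause, which is the kind of cue that could separate two character sequences with the same pronunciation. How the published method actually defines and injects these features would need to be checked against the full text.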