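Research on speech recognition based on DL-T and transfer learning
Cite this article: ZHANG Wei, LIU Chen, FEI Hong-bo, LI Wei, YU Jing-hu, CAO Yi. Research on speech recognition based on DL-T and transfer learning[J]. Chinese Journal of Engineering, 2021, 43(3): 433-441. DOI: 10.13374/j.issn2095-9389.2020.01.12.001
Authors: ZHANG Wei  LIU Chen  FEI Hong-bo  LI Wei  YU Jing-hu  CAO Yi
Affiliation: 1. School of Mechanical Engineering, Jiangnan University, Wuxi 214122, China
Funding: Jiangsu Province Postgraduate Innovation Program; National Natural Science Foundation of China; Program of Introducing Talents of Discipline to Universities; Jiangsu Province "Six Talent Peaks" Program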
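Abstract:
To address the high prediction error rate and slow convergence of RNN–T in speech recognition, this paper proposes an acoustic modeling method based on DL–T. First, the RNN–T acoustic model is introduced. Second, DenseNet and LSTM networks are combined into a new acoustic modeling method, DL–T, which extracts high-dimensional information from the raw speech, thereby strengthening feature reuse and alleviating gradient problems so that information propagates more easily through deep layers; as a result, it offers both a low prediction error rate and fast convergence. Then, to further improve the accuracy of the acoustic model, a transfer learning method suited to DL–T is proposed. Finally, to validate these methods, speech recognition experiments were carried out with the DL–T acoustic model on the Aishell–1 dataset. The results show that DL–T reduces the prediction error rate by a relative 12.52% compared with RNN–T, and that the final error rate of the model reaches 10.34%. DL–T therefore significantly improves on the prediction error rate and convergence speed of RNN–T.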
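The feature reuse attributed to DenseNet comes from dense connectivity: each layer receives the concatenation of the block input and the outputs of all earlier layers. The following is a minimal pure-Python sketch of that connectivity pattern only. It is not the DL–T model itself; `dense_block` and `make_toy_layer` are hypothetical names, and the toy layer (ReLU of the input mean) merely stands in for a convolutional layer.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def dense_block(x: Vector, layers: List[Layer]) -> Vector:
    """Apply layers with DenseNet-style connectivity.

    Every layer sees the concatenation of the block input and the
    outputs of all earlier layers, so earlier features are reused
    rather than overwritten.
    """
    features = list(x)
    for layer in layers:
        new_features = layer(features)      # layer reads ALL features so far
        features = features + new_features  # concatenate, do not replace
    return features


def make_toy_layer(growth: int) -> Layer:
    """Hypothetical stand-in for a conv layer: emits `growth` values."""
    def layer(inputs: Vector) -> Vector:
        mean = sum(inputs) / len(inputs)
        return [max(0.0, mean)] * growth    # ReLU of the input mean
    return layer


# Three layers with growth rate 2 on a 3-dimensional input.
block_out = dense_block([1.0, 2.0, 3.0], [make_toy_layer(2) for _ in range(3)])
print(len(block_out))  # input size + 3 layers * growth 2
```

Because every layer output is appended rather than replaced, the input stays directly visible to the last layer. These short paths are the usual explanation for the easier gradient flow of densely connected networks.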

Keywords: deep learning; speech recognition; acoustic model; DL–T; transfer learning
Received: 2020-01-12

Research on automatic speech recognition based on a DL-T and transfer learning
ZHANG Wei, LIU Chen, FEI Hong-bo, LI Wei, YU Jing-hu, CAO Yi. Research on automatic speech recognition based on a DL-T and transfer learning[J]. Chinese Journal of Engineering, 2021, 43(3): 433-441. DOI: 10.13374/j.issn2095-9389.2020.01.12.001
Authors:ZHANG Wei  LIU Chen  FEI Hong-bo  LI Wei  YU Jing-hu  CAO Yi
Affiliation: 1. School of Mechanical Engineering, Jiangnan University, Wuxi 214122, China; 2. Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment and Technology, Wuxi 214122, China; 3. Suzhou Institute of Industrial Technology, Suzhou 215104, China
Abstract:
Speech is a natural and effective means of communication and is widely used in information communication and human–machine interaction. In recent years, various algorithms have been developed to make such communication more efficient. Automatic speech recognition (ASR), one of the key technologies in this field, converts the analog signal of input speech into the corresponding text. ASR approaches fall into two categories: those based on the hidden Markov model (HMM) and those based on end-to-end (E2E) models. Compared with the former, E2E models have a simpler modeling process and are easier to train, so research has moved toward developing E2E models for ASR. HMM-based speech recognition also has disadvantages in prediction error rate, generalization ability, and convergence speed. Therefore, the recurrent neural network–transducer (RNN–T), a typical E2E acoustic model that can model the dependencies between outputs and can be optimized jointly with a language model (LM), is taken as the starting point of this study. To address the high prediction error rate and slow convergence of the RNN–T, a new acoustic model, DL–T, based on DenseNet (dense convolutional network)–LSTM (long short-term memory)–Transducer, is proposed. First, the RNN–T is briefly introduced. Then, combining the merits of DenseNet and LSTM, the DL–T acoustic model is presented. DL–T can extract high-dimensional speech features and alleviate gradient problems, giving it a low character error rate (CER) and fast convergence. In addition, a transfer learning method suited to DL–T is proposed. Finally, DL–T is evaluated on speech recognition with the Aishell–1 dataset to validate these methods.
The experimental results show that DL–T reduces the CER by a relative 12.52% compared with RNN–T and reaches a final CER of 10.34%, demonstrating both a lower CER and faster convergence than RNN–T.
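For readers unfamiliar with the two figures above, CER is conventionally the character-level edit distance between the reference and the recognized text divided by the reference length, and a relative reduction is the CER drop divided by the baseline CER. The sketch below shows both calculations under that standard definition. The function names are illustrative, and the numbers in the usage lines are made up: the abstract does not state which pair of CER values yields the 12.52% figure.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))        # distances from the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]                          # deleting i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical example: one substituted character in a six-character reference.
print(cer("今天天气很好", "今天天汽很好"))
# Hypothetical CER values, not taken from the paper.
print(relative_reduction(0.20, 0.15))
```

A relative reduction is always larger than the corresponding absolute drop in percentage points when the baseline CER is below 100%, so the two should not be confused when comparing models.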
Keywords: deep learning; speech recognition; acoustic model; DL–T; transfer learning
This article is indexed in CNKI, Wanfang Data, and other databases.