A Mixture of Recurrent Neural Networks for Speaker Normalisation

Authors: Edmondo Trentin, Diego Giuliani

Affiliation: ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Povo (Trento), Italy
Abstract: In spite of recent advances in automatic speech recognition, the performance of state-of-the-art speech recognisers fluctuates depending on the speaker. Speaker normalisation aims at reducing the differences between the acoustic space of a new speaker and the training acoustic space of a given speech recogniser, thereby improving recognition performance. Normalisation is based on an acoustic feature transformation, to be estimated from a small amount of speech signal. This paper introduces a mixture of recurrent neural networks as an effective regression technique for this problem. A suitable Viterbi-based time-alignment procedure is proposed for generating the adaptation set. The mixture is compared with linear regression and single-model connectionist approaches. Speaker-dependent and speaker-independent continuous speech recognition experiments on a large-vocabulary task, using Hidden Markov Models, are presented. Results show that the mixture improves recognition performance, yielding a 21% relative reduction of the word error rate, comparable with that obtained with model-adaptation approaches.
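
The record above is abstract-only, so the paper's exact architecture is not given here. The following is a minimal, hypothetical PyTorch sketch of the general idea the abstract describes: a small mixture of recurrent networks used as a frame-level regressor that maps a new speaker's acoustic feature vectors towards the recogniser's training acoustic space, fitted on an adaptation set of (speaker frame, aligned target frame) pairs such as those a Viterbi alignment would provide. The layer sizes, the frame-level gating scheme, and the training loop are assumptions for illustration, not the authors' method.

    # Hypothetical sketch of a mixture of recurrent regressors for
    # feature-space speaker normalisation (not the paper's exact model).
    import torch
    import torch.nn as nn


    class RNNExpert(nn.Module):
        """One recurrent regressor: speaker frames -> normalised frames."""

        def __init__(self, feat_dim: int, hidden_dim: int):
            super().__init__()
            self.rnn = nn.RNN(feat_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, feat_dim)

        def forward(self, x):                  # x: (batch, time, feat_dim)
            h, _ = self.rnn(x)
            return self.out(h)                 # (batch, time, feat_dim)


    class MixtureOfRNNs(nn.Module):
        """Gated combination of recurrent experts (gating scheme assumed)."""

        def __init__(self, feat_dim: int, hidden_dim: int = 32, n_experts: int = 4):
            super().__init__()
            self.experts = nn.ModuleList(
                RNNExpert(feat_dim, hidden_dim) for _ in range(n_experts)
            )
            self.gate = nn.Linear(feat_dim, n_experts)   # frame-level gating

        def forward(self, x):
            w = torch.softmax(self.gate(x), dim=-1)      # (batch, time, n_experts)
            y = torch.stack([e(x) for e in self.experts], dim=-1)
            return (y * w.unsqueeze(-2)).sum(dim=-1)     # weighted expert sum


    if __name__ == "__main__":
        # Toy adaptation set: the targets stand in for frames aligned against
        # the speaker-independent models via a Viterbi-based procedure.
        feat_dim, frames = 13, 200
        x = torch.randn(1, frames, feat_dim)             # new speaker's features
        target = torch.randn(1, frames, feat_dim)        # aligned reference frames

        model = MixtureOfRNNs(feat_dim)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(50):                              # brief adaptation loop
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), target)
            loss.backward()
            opt.step()

At recognition time, such a transformation would be applied to the new speaker's feature vectors before they are passed to the unchanged Hidden Markov Model recogniser.
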
Keywords: Mixture of neural networks; Multivariate regression; Recurrent neural network; Speaker adaptation; Speaker normalisation; Speech recognition