Using LSTM neural networks for cross-lingual phonetic speech segmentation with an iterative correction procedure |
| |
Authors: | Zdeněk Hanzlíček, Jindřich Matoušek, Jakub Vít |
| |
Affiliation: | NTIS–New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic |
| |
Abstract: | This article describes experiments on speech segmentation using long short-term memory (LSTM) recurrent neural networks. The main part of the paper deals with multi-lingual and cross-lingual segmentation; in the cross-lingual case, segmentation is performed on a language different from the one on which the model was trained. The experimental data comprise large Czech, English, German, and Russian speech corpora intended for speech synthesis. For optimal multi-lingual modeling, a compact phonetic alphabet was proposed by sharing and clustering the phones of the individual languages. Many experiments were performed to explore various experimental conditions and data combinations. We proposed a simple procedure that iteratively adapts the inaccurate default model to a new voice or language. Segmentation accuracy was evaluated against a reference segmentation created by a well-tuned hidden Markov model-based framework with additional manual corrections. The resulting segmentation was also employed in a unit selection text-to-speech system, and the quality of the generated speech was compared, in a preference listening test, with that of speech generated using the reference segmentation. |
| |
Keywords: | LSTM neural networks; multi-lingual and cross-lingual modeling; speech segmentation |