Speech-to-Lip Movement Synthesis by Maximizing Audio-Visual Joint Probability Based on the EM Algorithm |
| |
Authors: | Satoshi Nakamura and Eli Yamamoto |
| |
Affiliation: | (1) ATR Spoken Language Translation Research Laboratories, 2-2 Hikaridai, Seika-cho Soraku-gun Kyoto, 619-0288, Japan;(2) Faculty of Systems Engineering, Wakayama University, 930 Sakaedani, Wakayama, 640-8510, Japan |
| |
Abstract: | In this paper, we investigate a Hidden Markov Model (HMM)-based method to drive a lip movement sequence with input speech. In a previous study, we investigated a mapping method based on the Viterbi decoding algorithm, which converts an input speech signal to a lip movement sequence through the most likely HMM state sequence using audio HMMs. However, that method can produce errors due to incorrectly decoded HMM states. This paper proposes a method that re-estimates the visual parameters using audio-visual joint-probability HMMs and the Expectation-Maximization (EM) algorithm. In the experiments, the proposed mapping method achieves a 26% error reduction over the Viterbi-based algorithm for incorrectly decoded bilabial consonants. |
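The abstract contrasts a hard Viterbi mapping (each audio frame drives the visual parameter of its single best HMM state) with an EM-style re-estimation that weights the visual parameters by state posteriors. The following is a minimal sketch of that contrast, not the paper's actual model: the state visual means and posterior values are invented toy numbers, and the posteriors `gamma` stand in for the output of a forward-backward E-step.

```python
import numpy as np

# Hypothetical toy setup: 3 HMM states, each associated with one visual
# parameter (e.g., a lip-opening value). These numbers are illustrative only.
visual_means = np.array([0.1, 0.5, 0.9])

# gamma[t, j] = P(state j | audio frame t), as a forward-backward (E-step)
# pass of the EM algorithm would produce. Toy values for 3 frames.
gamma = np.array([
    [0.80, 0.15, 0.05],
    [0.30, 0.60, 0.10],
    [0.05, 0.20, 0.75],
])

# Viterbi-style mapping: hard-assign each frame to its single best state,
# then emit that state's visual parameter. A wrong state decision passes
# its full error into the synthesized lip movement.
viterbi_states = gamma.argmax(axis=1)
viterbi_lip = visual_means[viterbi_states]

# EM-style mapping: expected visual parameter under the state posterior.
# Soft weighting attenuates the effect of incorrectly decoded states.
em_lip = gamma @ visual_means

print(viterbi_lip)  # hard decisions
print(em_lip)       # posterior-weighted estimates
```

The key design point is the soft assignment: where the Viterbi path commits to one state per frame, the posterior-weighted estimate blends neighboring states, which is where the reported error reduction on misdecoded frames comes from.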
| |
Keywords: | lip movement synthesis; speech recognition; EM algorithm; audio-visual joint probability |
|