A comparison of models for fusion of the auditory and visual sensors in speech perception
Authors: Jordi Robert-Ribes, Jean-Luc Schwartz, Pierre Escudier
Affiliation: Institut de la Communication Parlée, CNRS UA 368, INPG/ENSERG, Université Stendhal, 46 Av. Félix Viallet, 38031 Grenoble Cedex 1, France
Abstract: Although a large amount of psychological and physiological evidence of audio-visual integration in speech has been collected over the last 20 years, there is no agreement about the nature of the fusion process. We present the main experimental data and describe the various models proposed in the literature, together with a number of studies in the field of automatic audio-visual speech recognition. We discuss these models in relation to general proposals arising from psychology, in the field of intersensory interaction, and from vision and robotics, in the field of sensor fusion. We then examine the characteristics of four main models in the light of psychological data and formal properties, and present the results of a modelling study on audio-visual recognition of French vowels in noise. We conclude in favor of the relative superiority of a model in which the auditory and visual inputs are projected onto and fused in a common representation space related to the motor properties of speech objects, the fused representation then being classified for lexical access.
Keywords: audiovisual speech perception; sensor fusion; noisy speech recognition; intersensory interactions; vowel processing
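As a rough illustration of the data flow in the architecture the abstract favors (projection of both sensor inputs into a common motor-related space, fusion in that space, then classification of the fused representation), here is a minimal sketch in Python. Everything in it is a placeholder assumption: the feature dimensions, the linear projections, the reliability weighting, and the nearest-prototype classifier are illustrative choices, not the authors' actual model or parameters.

```python
# Minimal sketch of a "common motor space" fusion architecture: auditory and
# visual features are projected into a shared representation space, fused
# there, and the fused vector is classified. All dimensions, projection
# matrices, weights, and vowel prototypes are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

N_AUDIO, N_VIDEO, N_MOTOR = 12, 6, 3    # assumed feature dimensions
VOWELS = ["i", "y", "u", "e", "a"]      # subset of French vowels, for illustration

# Hypothetical linear projections from each sensor space to the common space.
W_audio = rng.normal(size=(N_MOTOR, N_AUDIO))
W_video = rng.normal(size=(N_MOTOR, N_VIDEO))

# Hypothetical vowel prototypes located in the common motor-related space.
prototypes = {v: rng.normal(size=N_MOTOR) for v in VOWELS}

def fuse_and_classify(audio_feat, video_feat, audio_reliability=0.5):
    """Project both inputs into the common space, fuse them by a
    reliability-weighted average (the weight would drop as acoustic noise
    increases), and label the fused vector with the nearest vowel prototype."""
    m_audio = W_audio @ audio_feat
    m_video = W_video @ video_feat
    fused = audio_reliability * m_audio + (1.0 - audio_reliability) * m_video
    return min(prototypes, key=lambda v: np.linalg.norm(fused - prototypes[v]))

# Example call with random "observations"; in a real system the features would
# come from acoustic analysis and lip-shape measurements.
print(fuse_and_classify(rng.normal(size=N_AUDIO), rng.normal(size=N_VIDEO)))
```

In the study itself, the common space and the noise-dependent weighting would presumably be derived from articulatory properties of the vowels and from the measured reliability of the acoustic channel; the sketch only fixes the order of operations (project, fuse, classify).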
|