Voice activity detection and speaker localization using audiovisual cues
Authors: Dante A. Blauth, Bowon Lee
Affiliations: a Applied Computing - UNISINOS, Av. Unisinos, 950, São Leopoldo 93022-000, RS, Brazil
b Institute of Informatics - UFRGS, Av. Bento Gonçalves, 9500, Porto Alegre 91501-970, RS, Brazil
c Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94304, USA
d Huawei Innovation Center US R&D, 2330 Central Expressway, Santa Clara, CA 95050, USA
Abstract: This paper proposes a multimodal approach to distinguish silence from speech situations, and to identify the location of the active speaker in the latter case. In our approach, a video camera is used to track the faces of the participants, and a microphone array is used to estimate the Sound Source Location (SSL) using the Steered Response Power with the phase transform (SRP-PHAT) method. The audiovisual cues are combined, and two competing Hidden Markov Models (HMMs) are used to detect silence or the presence of a person speaking. If speech is detected, the corresponding HMM also provides the spatio-temporally coherent location of the speaker. Experimental results show that incorporating the HMM improves the results over the unimodal SRP-PHAT, and the inclusion of video cues provides even further improvements.
Keywords: User interfaces; Voice activity detection; Speaker localization; Multimodal analysis; Hidden Markov Models
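
The idea of two competing HMMs mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' model: the discrete direction bins, the transition and emission probabilities, and the names `forward_loglik`, `make_speech_hmm`, `make_silence_hmm`, and `classify` are all assumptions made for illustration. Each observation is taken to be the index of the direction bin where a localizer (such as SRP-PHAT) found its peak. The speech model expects the peak to stay near one location over time, the silence model expects it to wander uniformly, and the model with the higher likelihood wins.

```python
import math

def forward_loglik(obs, start, trans, emit):
    """Scaled forward algorithm.

    Returns (log P(obs | model), filtered state posterior at the last step).
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                emit[s][obs[t]] * sum(alpha[r] * trans[r][s] for r in range(n))
                for s in range(n)
            ]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return loglik, alpha

def make_speech_hmm(k, stay=0.9, hit=0.85):
    """Hypothetical speech model: one hidden state per direction bin.

    `stay` (probability the speaker location persists) and `hit` (probability
    the observed peak falls in the true bin) are illustrative values only.
    """
    start = [1.0 / k] * k
    trans = [[stay if r == s else (1 - stay) / (k - 1) for s in range(k)]
             for r in range(k)]
    emit = [[hit if o == s else (1 - hit) / (k - 1) for o in range(k)]
            for s in range(k)]
    return start, trans, emit

def make_silence_hmm(k):
    """Hypothetical silence model: a single state with uniform emissions."""
    return [1.0], [[1.0]], [[1.0 / k] * k]

def classify(obs, k=4):
    """Pick the model with the higher log-likelihood for the sequence.

    Returns ("speech", most likely bin at the last frame) or ("silence", None).
    """
    ll_speech, posterior = forward_loglik(obs, *make_speech_hmm(k))
    ll_silence, _ = forward_loglik(obs, *make_silence_hmm(k))
    if ll_speech > ll_silence:
        return "speech", max(range(k), key=lambda s: posterior[s])
    return "silence", None

if __name__ == "__main__":
    print(classify([2, 2, 2, 2, 2, 2]))        # stable peak direction
    print(classify([0, 3, 1, 2, 0, 3, 1, 2]))  # peak jumps between bins
```

In this sketch a stable peak direction favors the speech model, and the final filtered posterior gives a location estimate. A peak that jumps between bins favors the silence model. The approach described in the abstract additionally fuses video cues from face tracking, which this toy example leaves out.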
This article is indexed in ScienceDirect and other databases.