Research on Robustness of Audio-Visual Speaker Recognition Based on Articulatory Features
Citation: CHEN Yan-xiang, LIU Ming. Research on Robustness of Audio-Visual Speaker Recognition Based on Articulatory Features [J]. Acta Electronica Sinica, 2010, 38(12): 2920-2924.
Authors: CHEN Yan-xiang, LIU Ming
Affiliation: 1. School of Computer Science and Information, Hefei University of Technology, Hefei, Anhui 230009, China; 2. Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Illinois 61801, USA
Funding: National Natural Science Foundation of China; Anhui Provincial Science Fund for Outstanding Young Scholars
Abstract: Human speech perception is multimodal, shaped simultaneously by auditory and visual cues. This paper centers on the fusion of speech and its visual features. Drawing on the articulatory mechanism that underlies the deeper causes of audio-visual asynchrony, the apparent asynchrony observed between audio and video is represented by the asynchronous coupling of multiple articulatory feature streams. A joint speech and lip-motion model based on a Dynamic Bayesian Network is proposed, and multi-level audio-visual fusion is applied to improve the robustness of a speaker recognition system. Experiments on an audio-visual bimodal corpus show that multi-level fusion achieves better performance at all acoustic signal-to-noise ratios (SNRs) from 0 to 30 dB.
Keywords: articulatory features; audio-visual; speaker recognition; dynamic Bayesian network
Received: 2009-08-06
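The paper performs its fusion at multiple levels inside a Dynamic Bayesian Network; the full model is in the paper itself. As a rough, simplified sketch of the general idea behind SNR-aware fusion of the two modalities (the function name, the linear weighting scheme, and the 0–30 dB mapping are illustrative assumptions, not the paper's method):

```python
import numpy as np

def fuse_scores(audio_score, visual_score, snr_db, snr_lo=0.0, snr_hi=30.0):
    """Score-level fusion with an SNR-dependent weight: trust the audio
    stream at high SNR and lean on the visual stream as noise increases.
    The linear interpolation between snr_lo and snr_hi is a hypothetical
    weighting rule used only for illustration."""
    w = float(np.clip((snr_db - snr_lo) / (snr_hi - snr_lo), 0.0, 1.0))
    return w * audio_score + (1.0 - w) * visual_score
```

At 30 dB the fused score equals the audio score alone; at 0 dB (or below) it falls back entirely to the visual stream, mirroring the intuition that visual speech features remain informative when the acoustic channel degrades.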
This article is indexed by Wanfang Data and other databases.