期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

Jianrong Wang Ju Zhang Kiyoshi Honda Jianguo Wei Jianwu Dang 《Multimedia Systems》2016,22(3):315-323

Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, the problems of region-of-interest detection and feature extraction may influence the recognition performance due to the visual speech information obtained typically from planar video data. In this paper, we deviate from the traditional visual speech information and propose an AVSR system integrating 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. The different feature extraction and selection algorithms were applied to planar images and 3D lip information, so as to fuse the planar images and 3D lip feature into the visual-3D lip joint feature. For automatic speech recognition (ASR), the fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two stream Hidden Markov Model. The experimental results demonstrated that our AVSR system integrating 3D lip information improved the recognition performance of traditional ASR and AVSR system in acoustic noise environments. 相似文献

2.

Multiple cameras audio visual speech recognition using active appearance model visual features in car environment

Astik Biswas P. K. Sahu Mahesh Chandra 《International Journal of Speech Technology》2016,19(1):159-171

Consideration of visual speech features along with traditional acoustic features have shown decent performance in uncontrolled auditory environment. However, most of the existing audio-visual speech recognition (AVSR) systems have been developed in the laboratory conditions and rarely addressed the visual domain problems. This paper presents an active appearance model (AAM) based multiple-camera AVSR experiment. The shape and appearance information are extracted from jaw and lip region to enhance the performance in vehicle environments. At first, a series of visual speech recognition (VSR) experiments are carried out to study the impact of each camera on multi-stream VSR. Four cameras in car audio-visual corpus is used to perform the experiments. The individual camera stream is fused to have four-stream synchronous hidden Markov model visual speech recognizer. Finally, optimum four-stream VSR is combined with single stream acoustic HMM to build five-stream AVSR. The dual modality AVSR system shows more robustness compared to acoustic speech recognizer across all driving conditions. 相似文献

3.

多噪声环境下的层级语音识别模型

曹晶晶许洁萍邵聖淇《计算机应用》2018,38(6):1790-1794

针对多噪声环境下的语音识别问题,提出了将环境噪声作为语音识别上下文考虑的层级语音识别模型。该模型由含噪语音分类模型和特定噪声环境下的声学模型两层组成,通过含噪语音分类模型降低训练数据与测试数据的差异,消除了特征空间研究对噪声稳定性的限制,并且克服了传统多类型训练在某些噪声环境下识别准确率低的弊端,又通过深度神经网络（DNN）进行声学模型建模,进一步增强声学模型分辨噪声的能力,从而提高模型空间语音识别的噪声鲁棒性。实验中将所提模型与多类型训练得到的基准模型进行对比,结果显示所提层级语音识别模型较该基准模型的词错率（WER）相对降低了20.3%,表明该层级语音识别模型有利于增强语音识别的噪声鲁棒性。相似文献

4.

Improved features and dynamic stream weight adaption for robust Audio-Visual Speech Recognition framework

《Digital Signal Processing》2019

This paper investigates the enhancement of a speech recognition system that uses both audio and visual speech information in noisy environments by presenting contributions in two main system stages: front-end and back-end. The double use of Gabor filters is proposed as a feature extractor in the front-end stage of both modules to capture robust spectro-temporal features. The performance obtained from the resulted Gabor Audio Features (GAFs) and Gabor Visual Features (GVFs) is compared to the performance of other conventional features such as MFCC, PLP, RASTA-PLP audio features and DCT2 visual features. The experimental results show that a system utilizing GAFs and GVFs has a better performance, especially in a low-SNR scenario. To improve the back-end stage, a complete framework of synchronous Multi-Stream Hidden Markov Model (MSHMM) is used to solve the dynamic stream weight estimation problem for Audio-Visual Speech Recognition (AVSR). To demonstrate the usefulness of the dynamic weighting in the overall performance of AVSR system, we empirically show the preference of Late Integration (LI) compared to Early Integration (EI) especially when one of the modalities is corrupted. Results confirm the superior recognition accuracy for all SNR levels the superiority of the AVSR system with the Late Integration. 相似文献

5.

遮挡情况下刚体位姿估计的自适应无迹卡尔曼分布式融合

冯远静黄良鹏张文安《控制理论与应用》2020,37(1):69-80

针对视觉目标位姿估计系统中常出现的因为特征点遮挡而造成系统估计结果不准确的问题,本文提出了一种利用自适应无迹卡尔曼滤波(AUKF)作为局部滤波器的分布式融合估计方法.通过引入改进的Sage-Husa噪声估计器自适应过程噪声.根据特征点识别量将遮挡情况分为部分遮挡和严重遮挡,对部分遮挡子系统根据先验信息修复缺失观测点后进行局部滤波估计,严重遮挡子系统不参与融合,利用当前时刻整体估计结果对其进行初始化.通过仿真获取了区分遮挡情况的阈值,实验结果表明所提方法能够提升系统在遮挡情况下的估计精度与鲁棒性. 相似文献

6.

An audio-visual corpus for multimodal automatic speech recognition

Andrzej Czyzewski Bozena Kostek Piotr Bratoszewski Jozef Kotus Marcin Szykulski 《Journal of Intelligent Information Systems》2017,49(2):167-192

相似文献

7.

自适应视听信息融合用于抗噪语音识别

梁冰陈德运程慧《控制理论与应用》2011,28(10):1461-1466

为了提高噪音环境中语音识别的准确性和鲁棒性,提出了基于自适应视听信息融合的抗噪语音识别方法,视听信息在识别过程中具有变化的权重,动态的自适应于环境输入的信噪比．根据信噪比和反馈的识别性能,通过学习自动机计算视觉信息的最优权重;根据视听信息的特征向量,利用隐马尔科夫模型进行视听信息的模式匹配,并根据最优权重组合视觉和声音隐马尔科夫模型的决策,获得最终的识别结果．实验结果表明,在各种噪音水平下,自适应权重比不变权重的视听信息融合的语音识别性能更优．相似文献

8.

抗噪声语音识别及语音增强算法的应用 总被引：1，自引：0，他引：1

汤玲戴斌《计算机仿真》2006,23(9):80-82,143

提高语音识别系统的鲁棒性是语音识别技术一个重要的研究课题。语音识别系统往往由于训练环境下的数据和识别环境下的数据不匹配造成系统的识别性能下降，为了让语音识别系统在含噪的环境下获得令人满意的工作性能，该文根据人耳听觉特性提出了一种鲁棒语音特征提取方法。在MFCC特征提取之前先对含噪语音特征进行掩蔽特性处理，同时结合语音增强方法对特征进行处理，最后得到鲁棒语音特征。通过4种不同试验结果分析表明，将这种方法用于抗噪声分析可以提高系统的抗噪声能力；同时这种特征的处理方法对不同噪声在不同信噪比有很好的适应性。相似文献

9.

复杂环境下基于自适应深度神经网络的鲁棒语音识别

张开生赵小芬《计算机工程与科学》2022,44(6):1105-1113

在连续语音识别系统中,针对复杂环境（包括说话人及环境噪声的多变性）造成训练数据与测试数据不匹配导致语音识别率低下的问题,提出一种基于自适应深度神经网络的语音识别算法。结合改进正则化自适应准则及特征空间的自适应深度神经网络提高数据匹配度;采用融合说话人身份向量i-vector及噪声感知训练克服说话人及环境噪声变化导致的问题,并改进传统深度神经网络输出层的分类函数,以保证类内紧凑、类间分离的特性。通过在TIMIT英文语音数据集和微软中文语音数据集上叠加多种背景噪声进行测试,实验结果表明,相较于目前流行的GMM-HMM和传统DNN语音声学模型,所提算法的识别词错误率分别下降了5.151%和3.113%,在一定程度上提升了模型的泛化性能和鲁棒性。相似文献

10.

Speech enhancement based on stationary bionic wavelet transform and maximum a posterior estimator of magnitude-squared spectrum

Talbi Mourad 《International Journal of Speech Technology》2017,20(1):75-88

Numerous efforts have focused on the problem of reducing the impact of noise on the performance of various speech systems such as speech coding, speech recognition and speaker recognition. These approaches consider alternative speech features, improved speech modeling, or alternative training for acoustic speech models. In this paper, we propose a new speech enhancement technique, which integrates a new proposed wavelet transform which we call stationary bionic wavelet transform (SBWT) and the maximum a posterior estimator of magnitude-squared spectrum (MSS-MAP). The SBWT is introduced in order to solve the problem of the perfect reconstruction associated with the bionic wavelet transform. The MSS-MAP estimation was used for estimation of speech in the SBWT domain. The experiments were conducted for various noise types and different speech signals. The results of the proposed technique were compared with those of other popular methods such as Wiener filtering and MSS-MAP estimation in frequency domain. To test the performance of the proposed speech enhancement system, four objective quality measurement tests [signal to noise ratio (SNR), segmental SNR, Itakura–Saito distance and perceptual evaluation of speech quality] were conducted for various noise types and SNRs. Experimental results and objective quality measurement test results proved the performance of the proposed speech enhancement technique. It provided sufficient noise reduction and good intelligibility and perceptual quality, without causing considerable signal distortion and musical background noise. 相似文献