Similar Documents
16 similar documents found (search time: 406 ms)
1.
Compared with performing speaker recognition on speech reconstructed after decoding, extracting speech feature parameters directly from the VoIP stream is easier to implement. Targeting G.729 coded-domain data, this work studies a fast speaker recognition method based on the DTW algorithm. Experimental results show that, for the speaker recognition tasks considered, DTW greatly improves recognition accuracy and efficiency over GMM.
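The DTW template matching used in entry 1 can be sketched as follows; this is a minimal generic dynamic-time-warping distance between two frame-feature sequences, with a Euclidean local cost as an illustrative assumption, not the paper's exact configuration.

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic-time-warping distance between two sequences of
    equal-dimension feature vectors (lists of floats)."""
    INF = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = min accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean local distance between frame features
            d = sum((x - y) ** 2 for x, y in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

For speaker identification, the test utterance's feature sequence would be matched against each enrolled speaker's template, picking the speaker with the smallest warped distance.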

2.
Compressed-domain automatic speaker recognition (CD-ASR) extracts compression parameters directly from compressed speech data for speaker recognition, without parameter decoding or waveform synthesis. This paper proposes probability-histogram-based speaker recognition algorithms for the VoIP compressed domain, comprising two methods: vector-quantization statistical histograms and Gaussian-mixture-model statistical histograms. After presenting compressed-domain feature extraction schemes for G.729, G.723.1 (6.3 kb/s), and G.723.1 (5.3 kb/s) bit-streams, speaker recognition is performed with the VQ histograms and GMM histograms as recognition models. Experimental results show that the probability-histogram methods greatly improve the recognition rate over a GMM model using the same recognition parameters extracted from the compressed bit-stream.

3.
Speaker Recognition from VoIP Compressed Bit-streams   Cited: 1 (self: 0, others: 1)
This paper studies a speaker recognition algorithm for VoIP compressed bit-streams based on a micro-clustering algorithm. Recognition parameters are extracted directly from the bit-streams of G.729, G.723.1 (6.3 kb/s), and G.723.1 (5.3 kb/s) compressed speech, with the micro-clustering algorithm serving as the recognition structure. Experimental results show that, compared with a GMM model using the same recognition parameters from the compressed bit-stream, the micro-clustering algorithm greatly improves both recognition accuracy and efficiency.

4.
Silence Detection Based on Wavelet Variable-Resolution Spectral Features   Cited: 1 (self: 0, others: 1)
薛卫, 都思丹, 叶迎宪. 《计算机工程》, 2009, 35(13): 232-233
A silence-detection algorithm based on wavelet variable-resolution spectral features is proposed. The algorithm makes an initial silence decision with a multi-threshold zero-crossing rate, then combines several perceptual speech features with Mel-frequency cepstral coefficients (MFCC) computed from the wavelet variable-resolution spectrum into a feature vector, which a binary support vector machine classifies to detect silence. Test results show that the algorithm achieves higher speech recognition accuracy at various signal-to-noise ratios than the G.729b and MFCC-feature silence-detection algorithms, and that a videoconferencing server based on it carries a lower computational load than a video system using G.729b silence detection.
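The initial per-frame decision described in entry 4 can be sketched as below. This is a simplified pre-judgment combining short-time energy with a single zero-crossing-rate gate; the paper's actual multi-threshold ZCR scheme and the threshold values here are illustrative assumptions, and the final decision is left to the SVM classifier.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame of float samples."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    pairs = zip(frame, frame[1:])
    return sum((a >= 0) != (b >= 0) for a, b in pairs) / (len(frame) - 1)

def prejudge(frame, energy_thr=1e-4, zcr_thr=0.25):
    """Provisional label for one frame: 'silence' when both energy and
    zero-crossing rate are low; anything else is passed on as 'speech'
    for the downstream classifier to refine."""
    if short_time_energy(frame) < energy_thr and zero_crossing_rate(frame) < zcr_thr:
        return "silence"
    return "speech"
```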

5.
Silence Detection with Wavelet Variable-Resolution Spectral Features and a Short-Time Adaptive Mixing Algorithm   Cited: 1 (self: 0, others: 1)
The silence-detection algorithm combines two perceptual speech features with Mel-frequency cepstral coefficients from the variable-resolution spectrum into an audio feature vector, makes an initial silence decision with a multi-threshold zero-crossing rate, and classifies the combined feature with a binary support vector machine. The real-time mixing algorithm uses the short-time energy of each audio stream as its mixing weight. Tests show that the silence-detection algorithm achieves higher speech recognition accuracy than G.729b silence detection at various signal-to-noise ratios, and that the real-time mixing algorithm outperforms traditional algorithms in listening tests while keeping mixing delay low enough for real-time network transmission. With both algorithms applied to a videoconferencing system, the server's computational load is lower than that of a video system using G.729b silence detection.
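The energy-weighted mixing rule of entry 5 can be sketched as follows; this minimal version assumes parallel, equal-length frames of float samples and normalizes each stream's short-time energy into a mixing weight.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def mix_frames(frames):
    """Weighted sum of parallel frames from several conference streams,
    with each stream's weight proportional to its short-time energy,
    so louder (active) talkers dominate the mix."""
    energies = [short_time_energy(f) for f in frames]
    total = sum(energies)
    if total == 0.0:                      # all streams silent
        return [0.0] * len(frames[0])
    weights = [e / total for e in energies]
    return [sum(w * f[i] for w, f in zip(weights, frames))
            for i in range(len(frames[0]))]
```

A silent stream gets weight zero, so it contributes nothing to the mixed output.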

6.
An Audio Information-Hiding Method Resistant to Desynchronization Attacks   Cited: 4 (self: 0, others: 4)
王让定, 李倩. 《传感技术学报》, 2006, 19(4): 1023-1028
Based on audio information-hiding techniques, a secure speech communication method that effectively resists malicious desynchronization attacks is proposed. The secret speech is compressed, and the intra-frame independent coding property of the G.729 standard is exploited to achieve intra-frame self-synchronization of the speech bit-stream; a quantization method embeds the speech information into the wavelet domain of the cover audio; and a PN sequence serves as a time-domain synchronization frame to locate the hidden information. The algorithm has low complexity, its hiding capacity satisfies normal speech communication requirements, and detecting and extracting the secret speech does not require the original audio. Experiments show that the algorithm withstands audio processing (noise addition, MP3 compression, resampling, random cropping, etc.) well; in particular, it is more robust to malicious cropping attacks on the audio signal than comparable methods.

7.
In the G.729 speech coding algorithm, line spectral frequencies are quantized with predictive vector quantization. When frames are lost during transmission, this method accumulates errors at the decoder, degrading speech quality. To reduce the impact of error accumulation, this paper proposes a new vector quantization method. Experimental results show that the method clearly outperforms G.729 in preventing error accumulation.

8.
To address the problem that speaker recognition performance is easily affected by emotional factors, a method that jointly learns segment-level and frame-level features is proposed. A long short-term memory network performs the speaker recognition task, and its temporal outputs are taken as segment-level emotional speaker features, preserving the original information of the speech-frame features while strengthening the expression of emotional information. A fully connected network then further learns the speaker information in each feature frame of the segment-level features to strengthen the speaker representation of the frame-level features. Finally, the segment-level and frame-level features are concatenated into the final speaker feature to enhance its representational power. Experiments on the Mandarin Affective Speech Corpus (MASC) verify the effectiveness of the proposed method and explore how the number of speech frames in a segment-level feature and different emotional states affect emotional speaker recognition.

9.
Speaker recognition research shows that a speech signal comprises silent, unvoiced, and voiced segments, and that a speaker's individual characteristics reside mainly in the voiced segments; including the silent and unvoiced segments in recognition markedly lowers the recognition rate. Extensive experiments have likewise shown that removing the silent and unvoiced segments with endpoint detection clearly improves the recognition rate. Accordingly, this work points out that after endpoint detection the feature distribution of the speech conforms more closely to a Gaussian distribution. Experimental results show that the more accurate the endpoint detection, the better the recognition performance.

10.
Compressed-Domain Speech Steganalysis Based on Statistical Features of Pulse Position Parameters   Cited: 1 (self: 1, others: 0)
丁琦, 平西建. 《计算机科学》, 2011, 38(1): 217-220
Targeting steganography in speech compressed with analysis-by-synthesis coding, a steganalysis method based on statistical features of the pulse position parameters is proposed. Taking G.729-compressed speech as an example, the differences between cover and stego speech in the pulse position histogram and in the occurrence probabilities of 0 and 1 are analyzed. On this basis a histogram flatness feature is proposed; the flatness of the corrected pulse position histogram, the centroid and variance of the characteristic function, and the difference between the occurrence probabilities of 0 and 1 in the bit-stream are used as classification features, and an SVM classifier performs the steganalysis. Experimental results show that the method detects steganography in G.729-compressed speech with high accuracy, and that it also applies to steganalysis of other compressed speech coded with analysis-by-synthesis techniques.
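A histogram flatness feature of the kind entry 10 describes can be sketched as below. The abstract does not give the paper's exact definition, so flatness is computed here as the ratio of the geometric to the arithmetic mean of the (smoothed) histogram bins, a common flatness measure; the smoothing constant is an illustrative assumption.

```python
import math

def histogram(values, bins):
    """Counts of non-negative integer-valued parameters into `bins` bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v), bins - 1)] += 1
    return counts

def flatness(counts, eps=1e-9):
    """Geometric mean over arithmetic mean of the bin counts: close to 1
    for a flat histogram, close to 0 for a strongly peaked one."""
    vals = [c + eps for c in counts]          # smooth to avoid log(0)
    geo = math.exp(sum(math.log(v) for v in vals) / len(vals))
    ari = sum(vals) / len(vals)
    return geo / ari
```

Embedding that randomizes pulse positions tends to flatten their histogram, so a higher flatness value would point toward stego content.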

11.
This paper proposes a novel low-complexity lip contour model for high-level optic feature extraction in noise-robust audiovisual (AV) automatic speech recognition systems. The model is based on weighted least-squares parabolic fitting of the upper and lower lip contours, does not require the assumption of symmetry across the horizontal axis of the mouth, and is therefore realistic. The proposed model does not depend on the accurate estimation of specific facial points, as do other high-level models. Also, we present a novel low-complexity algorithm for speaker normalization of the optic information stream, which is compatible with the proposed model and does not require parameter training. The use of the proposed model with speaker normalization results in noise robustness improvement in AV isolated-word recognition relative to using the baseline high-level model.
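The weighted least-squares parabolic fitting in entry 11 can be sketched as follows; this is a generic fit of y = a·x² + b·x + c to contour points via the normal equations, with uniform default weights as an illustrative assumption, not the paper's exact procedure.

```python
def fit_parabola(xs, ys, ws=None):
    """Weighted least-squares fit of y = a*x^2 + b*x + c to points
    (xs, ys) with per-point weights ws.  Returns (a, b, c)."""
    if ws is None:
        ws = [1.0] * len(xs)
    # Weighted sums for the 3x3 normal equations
    s = lambda f: sum(w * f(x, y) for x, y, w in zip(xs, ys, ws))
    M = [[s(lambda x, y: x ** 4), s(lambda x, y: x ** 3), s(lambda x, y: x ** 2)],
         [s(lambda x, y: x ** 3), s(lambda x, y: x ** 2), s(lambda x, y: x)],
         [s(lambda x, y: x ** 2), s(lambda x, y: x),      s(lambda x, y: 1.0)]]
    r = [s(lambda x, y: y * x ** 2), s(lambda x, y: y * x), s(lambda x, y: y)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        r[col], r[piv] = r[piv], r[col]
        for i in range(col + 1, 3):
            f = M[i][col] / M[col][col]
            for j in range(col, 3):
                M[i][j] -= f * M[col][j]
            r[i] -= f * r[col]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):        # back substitution
        coef[i] = (r[i] - sum(M[i][j] * coef[j] for j in range(i + 1, 3))) / M[i][i]
    return tuple(coef)
```

Fitting the upper and lower contours independently is what removes the need for symmetry across the mouth's horizontal axis.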

12.
A novel approach based on robust regression with normalized score fusion (Normalized Scores following Robust Regression Fusion, NSRRF) is proposed to enhance speaker recognition over IP networks; it can be used in both Network Speaker Recognition (NSR) and Distributed Speaker Recognition (DSR) systems. In this framework, the speech is assumed to be encoded by the G729 coder on the client side and then transmitted to a server, where the ASR systems are located. The Universal Background Gaussian Mixture Model (GMM-UBM) and Gaussian Supervector (GMM-SVM) with normalized scores are used for speaker recognition. Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC), both derived from the Line Spectral Pairs (LSP) extracted from the G729 bit-stream over IP, constitute the feature vectors. Experimental results, obtained with the LIA SpkDet system based on the ALIZE platform using the ARADIGITS database, show first that the proposed method using features extracted directly from the G729 bit-stream significantly reduces the error rate and outperforms the baseline ASR-over-IP system based on the resynthesized (reconstructed) speech obtained from the G729 decoder. In addition, the results show that the proposed approach, based on score normalization following the robust regression fusion technique, achieves the best result and outperforms conventional ASR over an IP network.

13.
Speaker space-based adaptation methods for automatic speech recognition have been shown to provide significant performance improvements for tasks where only a few seconds of adaptation speech is available. However, these techniques are not widely used in practical applications because they require large amounts of speaker-dependent training data and large amounts of computer memory. The authors propose a robust, low-complexity technique within this general class that has been shown to reduce word error rate, reduce the large storage requirements associated with speaker space approaches, and eliminate the need for large numbers of utterances per speaker in training. The technique is based on representing speakers as a linear combination of clustered linear basis vectors, and a procedure is presented for maximum-likelihood estimation of these vectors from training data. Significant word error rate reduction was obtained using these methods relative to speaker-independent performance for the Resource Management and Wall Street Journal task domains.

14.
Speaker Recognition Based on Least-Squares Support Vector Machines   Cited: 1 (self: 0, others: 1)
When building speaker templates, a speaker recognition system usually must screen the speech frames because there are too many of them; this selection is typically a large, repetitive enumeration-based extraction process that is complex, time-consuming, and often suboptimal. The support vector machine (SVM), grounded in statistical learning theory, overcomes this shortcoming. This paper studies speaker recognition with an improved SVM, the least-squares support vector machine (LSSVM). The study shows that LSSVM-based speaker recognition has lower computational complexity and higher efficiency than traditional SVM-based recognition and adapts well to the speaker recognition task.

15.
Speech coding compresses speech without perceptual loss, but it eliminates or degrades the speech- and speaker-specific features used in a wide range of applications such as automatic speaker and speech recognition, biometric authentication, and prosody evaluation. The present work investigates the effect of speech coding on the quality of features extracted from codec-compressed speech, including Mel Frequency Cepstral Coefficients, Gammatone Frequency Cepstral Coefficients, Power-Normalized Cepstral Coefficients, Perceptual Linear Prediction Cepstral Coefficients, Rasta-Perceptual Linear Prediction Cepstral Coefficients, Residue Cepstrum Coefficients, and Linear Predictive Coding-derived cepstral coefficients. The codecs selected for this study are G.711, G.729, G.722.2, Enhanced Voice Services, Mixed Excitation Linear Prediction, and three codecs based on the compressive sensing framework. The analysis also covers the variation in feature quality across the bit-rates supported by Enhanced Voice Services, G.722.2, and the compressive sensing codecs, as well as the quality of the epochs, fundamental frequency, and formants estimated from codec-compressed speech. Among the selected codecs, the Mixed Excitation Linear Prediction codec introduces the least variation in the extracted features, owing to its unique representation of excitation. For the compressive sensing codecs, the quality of the extracted features improves drastically as the bit rate increases, due to the waveform-type coding they employ. For the popular Code Excited Linear Prediction codec, based on the analysis-by-synthesis coding paradigm, the impact of the Linear Predictive Coding order on feature extraction is investigated: feature quality improves with the order of linear prediction, and optimum performance is obtained for orders between 20 and 30, varying with gender and the statistical characteristics of the speech.
Although the basic motive of a codec is to compress a single voice source, the performance of the codecs in a multi-speaker environment, the most common environment in speech processing applications, is also studied. Here a two-speaker environment is considered, and the quality of the individual speech signals improves as the diversity of the mixtures passed through the codecs increases. The perceptual quality of the individual speech signals extracted from the codec-compressed speech is almost the same for the Mixed Excitation Linear Prediction and Enhanced Voice Services codecs, but regarding the preservation of features, the Mixed Excitation Linear Prediction codec shows superior performance.

16.
This research explores various indicators of non-verbal cues in speech and provides a method of building a paralinguistic profile of these speech characteristics that determines the emotional state of the speaker. Since a major part of human communication consists of vocalization, a robust approach is presented that classifies and segments an audio stream into silent and voiced regions and develops a paralinguistic profile for it. The data, which contains disruptions, is first segmented into frames and analyzed using short-term acoustic features, temporal characteristics of speech, and measures of verbal productivity. A matrix is then developed relating the paralinguistic properties of average pitch, energy, rate of speech, silence duration, and loudness to their respective contexts. Happy and confident states exhibited high energy and speech rate with little silence, whereas tense and sad states showed low energy and speech rate with long periods of silence. Paralanguage was found to be an important cue for deciphering the implicit meaning in a speech sample.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号