Similar literature
1.
A two-stage hybrid embedded speech/audio coding structure and algorithm are proposed. The first stage consists of a core speech coder, which provides a minimum output bitrate and acceptable performance on clean speech inputs. The second stage is a perceptual/transform-based coder, which provides a separate, optional bitstream for enhancing the core stage output. The two-stage structure can be used to enhance the quality of an existing codec without modifying the original coding algorithm. In this regard it can be considered a value-added option for use with a standard (existing) system. The structure can also be used in systems in which many users or systems force the coding algorithm to work simultaneously under multiple constraints on bitrate, complexity, delay, and coding quality. Informal testing of the algorithm has been done using ITU-T standard G.723.1 at 5.3 kb/s as the core coder. The maximum combined bitrate of the core and enhancement stages in the tests is 16 kb/s. The tests show that the second stage significantly improves the quality of the core output for music and for speech with background noise. Compared with the non-embedded, fixed-rate standard LD-CELP G.728 at 16 kb/s, the quality of the two-stage structure is generally lower on these inputs; the embedded feature does affect quality. On clean speech, the quality of the two-stage structure at 16 kb/s is close to, if not better than, that of G.728 at 16 kb/s.

2.
Prosody modification involves changing the pitch and duration of speech without affecting the message or its naturalness. This paper proposes a method for prosody (pitch and duration) modification using the instants of significant excitation of the vocal tract system during the production of speech. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations, such as the onset of a burst, in the case of nonvoiced speech. Instants of significant excitation are computed from the linear prediction (LP) residual of speech signals by using the average group-delay property of minimum-phase signals. The pitch and duration are modified by manipulating the LP residual using knowledge of the instants of significant excitation. The modified residual is used to excite the time-varying filter, whose parameters are derived from the original speech signal. The perceptual quality of the synthesized speech is good, without any significant distortion. The proposed method is evaluated using waveforms, spectrograms, and listening tests. Its performance is compared with that of the linear prediction pitch-synchronous overlap-and-add (LP-PSOLA) method, another method for prosody manipulation based on modification of the LP residual. The original and the synthesized speech signals obtained by the proposed method and by the LP-PSOLA method are available for listening at http://speech.cs.iitm.ernet.in/Main/result/prosody.html.

3.
In this paper, a robust audio watermarking scheme for the MPEG-1 Audio Layer II compressed domain is proposed. The scheme is implemented by modifying the subband coefficients using adaptive quantization index modulation. The watermarking procedure exploits the perceptual frequency and temporal masking model of the human auditory system (HAS) already used by the MPEG coder to satisfy the requirements of robustness, security and transparency, which also reduces the computational complexity of the proposed scheme. The paper investigates the use of an elevated masking threshold to improve detection and achieve higher robustness against re-encoding and AWGN attacks. Experimental results show a high capacity of 6,840 bps at an ODG of −0.5 without altering the MPEG audio bitrate.
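
The quantization index modulation step named in this abstract can be illustrated with a minimal sketch. It assumes a fixed step size and scalar coefficients; the paper adapts the step to the masking threshold, which is not reproduced here. The names `qim_embed`, `qim_extract` and `step` are hypothetical and not taken from the paper.

```python
def qim_embed(coeff, bit, step):
    """Quantize coeff onto the lattice selected by bit (0 or 1).

    Bit 0 uses multiples of step; bit 1 uses the same lattice shifted
    by step / 2. A fixed step is an assumption made for this sketch.
    """
    offset = bit * step / 2.0
    return round((coeff - offset) / step) * step + offset


def qim_extract(coeff, step):
    """Return the bit whose lattice lies closest to the received coeff."""
    d0 = abs(coeff - qim_embed(coeff, 0, step))
    d1 = abs(coeff - qim_embed(coeff, 1, step))
    return 0 if d0 <= d1 else 1


# Embed four bits, perturb each coefficient slightly, then recover the bits.
bits = [1, 0, 1, 1]
coeffs = [0.37, -1.42, 2.05, 0.0]
marked = [qim_embed(c, b, 0.5) for c, b in zip(coeffs, bits)]
noisy = [m + 0.1 for m in marked]
recovered = [qim_extract(x, 0.5) for x in noisy]
```

In this sketch a bit survives any perturbation smaller than a quarter of the step, so a larger step buys robustness at the cost of a larger change to the coefficient.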

4.
Exploiting the residual redundancy in a source coder output stream during decoding has been proven to be a bandwidth-efficient way to combat noisy-channel degradations. In this paper, we consider soft reconstruction of the speech spectrum in GSM adaptive multirate and IS-641 vocoders, transmitted over a channel disturbed by noise and/or packet loss. Several schemes are presented that exploit different levels of intraframe and interframe residual redundancy for improved source decoding at the receiver. A packetization strategy is proposed that is matched to the presented error concealment units. For decoders that exploit the residual redundancy, high complexity has been a serious concern, especially as the quantizer bitrate increases. In this paper, a novel method is presented for constructing reduced-complexity algorithms. The proposed methodology is based on classification of the signal domain and efficient approximation of the residual redundancy, that is, of the a priori transition probabilities. The presented schemes provide high-quality error concealment solutions for code excited linear prediction (CELP) coders.

5.
This paper investigates the use of sparse overcomplete decompositions for audio coding. Audio signals are decomposed over a redundant union of modified discrete cosine transform (MDCT) bases having eight different scales. This approach produces a sparser decomposition than the traditional MDCT-based orthogonal transform and allows better coding efficiency at low bitrates. Unlike state-of-the-art low-bitrate coders, which are based on pure parametric or hybrid representations, our approach is able to provide transparency. Moreover, we use a bitplane encoding approach, which yields a fine-grain scalable coder that can operate seamlessly from very low bitrates up to transparency. Objective evaluation and listening tests show that the performance of our coder is significantly better than that of a state-of-the-art transform coder at very low bitrates and similar at high bitrates. We provide a link to test sound files and source code to allow better evaluation and reproducibility of the results.

6.
This paper presents an analysis of the speaker discrimination power of vocal-source-related features, in comparison with conventional vocal-tract-related features. The vocal source features, named wavelet octave coefficients of residues (WOCOR), are extracted by a pitch-synchronous wavelet transform of the linear predictive (LP) residual signals. Using a series of controlled experiments, it is shown that WOCOR is less sensitive to spoken content than the conventional MFCC features and thus more discriminative when the amount of training data is limited. These advantages of WOCOR are exploited in the task of speaker segmentation for telephone conversations, in which statistical speaker models need to be built from short speech segments. Experimental results show that the proposed use of WOCOR leads to a noticeable reduction in segmentation errors.

7.
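In current coding standards (such as MPEG-4), wavelet-based compression algorithms have become a major technique. Building on mature algorithms such as the wavelet transform and zerotree coding, this paper analyzes the characteristics of the wavelet transform and of its coefficients and, through repeated experiments, arrives at a reversible smoothing algorithm that preprocesses the image before the wavelet transform. Experiments show that this algorithm greatly improves image compression efficiency on top of existing wavelet compression coding.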

8.
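Pitch period detection for noisy speech based on pre-filtering and the wavelet transform   Total citations: 10, self-citations: 0, citations by others: 10
Exploiting the facts that the pitch period of speech lies within a limited range and that the speech signal changes sharply at the instant of glottal closure, a pitch period detection method based on pre-filtering and the wavelet transform is proposed. The noisy speech signal is first filtered by a third-order elliptic low-pass filter. A one-level wavelet transform, with a quadratic spline wavelet as the wavelet function, then detects the sharp-change points of the speech signal, from which the pitch period is computed. Experiments show that, compared with the average magnitude difference function (AMDF) and autocorrelation function (ACF) methods, the proposed method improves the accuracy of pitch period extraction; compared with pitch detection based on the multi-scale wavelet transform, it reduces the computational load and weakens the influence of noise and of speech formants on pitch period detection.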
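
For orientation, the autocorrelation function (ACF) baseline that this abstract compares against can be sketched as follows. This is not the proposed wavelet method, only a minimal illustration of the baseline; the function name `acf_pitch_period` and the 60 to 400 Hz search range are assumptions made for the sketch.

```python
import math


def acf_pitch_period(signal, fs, f_min=60.0, f_max=400.0):
    """Return the pitch period in samples at the autocorrelation peak.

    Only lags corresponding to f_min..f_max are searched, reflecting the
    limited range of plausible pitch periods. The default range is an
    assumption of this sketch, not a value from the paper.
    """
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        val = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


# Synthetic voiced-like frame: 100 Hz fundamental plus its second harmonic.
fs = 8000
frame = [math.sin(2 * math.pi * 100 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
period = acf_pitch_period(frame, fs)
```

On this clean synthetic frame the estimate should land at or next to 80 samples (8000 Hz / 100 Hz); on noisy speech the ACF peak is less reliable, which is the weakness the abstract's method addresses.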

9.
Numerous efforts have focused on reducing the impact of noise on the performance of speech systems such as speech coding, speech recognition and speaker recognition. These approaches consider alternative speech features, improved speech modeling, or alternative training for acoustic speech models. In this paper, we propose a new speech enhancement technique that integrates a newly proposed wavelet transform, which we call the stationary bionic wavelet transform (SBWT), with the maximum a posteriori estimator of the magnitude-squared spectrum (MSS-MAP). The SBWT is introduced to solve the perfect reconstruction problem associated with the bionic wavelet transform. MSS-MAP estimation is used to estimate the speech in the SBWT domain. The experiments were conducted for various noise types and different speech signals. The results of the proposed technique were compared with those of other popular methods such as Wiener filtering and MSS-MAP estimation in the frequency domain. To test the performance of the proposed speech enhancement system, four objective quality measures [signal-to-noise ratio (SNR), segmental SNR, Itakura–Saito distance and perceptual evaluation of speech quality] were computed for various noise types and SNRs. The experimental results and objective quality measurements confirmed the performance of the proposed technique: it provided sufficient noise reduction and good intelligibility and perceptual quality, without causing considerable signal distortion or musical background noise.
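
Two of the objective measures listed in this abstract, SNR and segmental SNR, can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: non-overlapping frames, no per-frame clamping, and no handling of silent or perfectly reconstructed frames. The function names and the `frame_len` parameter are hypothetical.

```python
import math


def snr_db(clean, enhanced):
    """Global SNR in dB: clean-signal energy over residual-error energy."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((c - e) ** 2 for c, e in zip(clean, enhanced))
    return 10.0 * math.log10(signal_energy / error_energy)


def segmental_snr_db(clean, enhanced, frame_len):
    """Mean of per-frame SNRs over non-overlapping frames of frame_len."""
    frame_snrs = [
        snr_db(clean[i:i + frame_len], enhanced[i:i + frame_len])
        for i in range(0, len(clean) - frame_len + 1, frame_len)
    ]
    return sum(frame_snrs) / len(frame_snrs)


clean = [1.0, -1.0, 1.0, -1.0]
enhanced = [0.9, -0.9, 0.9, -0.9]
global_snr = snr_db(clean, enhanced)
seg_snr = segmental_snr_db(clean, enhanced, 2)
```

Averaging per frame keeps loud frames from dominating the score, which is why segmental SNR is usually reported alongside the global figure.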

10.
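To improve the coding and reconstruction performance of deep models, this paper adds a cross-entropy-based reconstruction error constraint to conventional contrastive divergence (CD). The improved algorithm is used to train a reconstructive deep auto-encoder (RDAE), and the RDAE replaces the vector quantization of the LSF parameters in the mixed excitation linear prediction (MELP) speech coder. Test results show that the improved algorithm gains reconstruction performance at the cost of some model likelihood. With the RDAE hidden layer set to 19 bits, the weighted LSF distance, reconstructed speech quality, and spectral distortion measured for the proposed method are all better than those of 25-bit vector quantization on both the training and test sets. In other words, a MELP coder improved with this method can lower the coding rate from 2.4 kb/s to 2.1 kb/s, a 12.5% reduction, without degrading speech quality.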
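
The rate figures quoted in this abstract can be checked with a short worked calculation. The first line uses only numbers from the abstract. The second line additionally assumes the standard 22.5 ms MELP frame, which the abstract does not state, to show how saving 25 − 19 = 6 bits per frame leads to roughly the quoted rate.

```latex
\[
\frac{2.4 - 2.1}{2.4} = 0.125 = 12.5\%
\]
\[
\frac{(25 - 19)\ \text{bits}}{22.5\ \text{ms}} \approx 0.27\ \text{kb/s},
\qquad 2.4 - 0.27 \approx 2.1\ \text{kb/s}
\]
```

Under that frame-length assumption the unrounded rate is about 2.13 kb/s, so the 12.5% figure corresponds to the rounded value of 2.1 kb/s.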

11.
A novel software-based video compression algorithm, the Popular Video Coder (PVC), is presented in this paper, and a video phone system, the Popular Phone, is implemented on top of it. The PVC simplifies the traditional video coder by removing the transform and motion estimation parts and by modifying the quantizer and entropy coder. Two novel coding algorithms, the adaptive quantizer and the modified windowed Huffman-like coder, are used in the PVC to encode video data with good picture quality at a high compression ratio. The video quality of the proposed coder is as good as that of the Moving Picture Experts Group (MPEG) coder when the input is a low-resolution, slow-motion video, and the computational complexity of the PVC is much lower than that of MPEG. Since the PVC needs no compression hardware to encode and decode video data, the cost and complexity of developing multimedia applications, such as video phone and multimedia e-mail systems, can be greatly reduced. Furthermore, some networking issues, such as error control and flow control, are discussed in connection with applying the PVC to implement the Popular Phone.

12.
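This paper describes how, in a networked voice transmission system based on G.723.1 speech compression, where the encoder processes speech and other audio signals frame by frame, a compensation algorithm applied after voice packet loss can improve speech quality.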

13.
The paper addresses a bitstream-scalable coder based on the MPEG-4 scalable lossless (SLS) coding system in which, in contrast to SLS, the bitrate of the enhancement layer is not fixed; instead, an attempt is made to create a fixed-quality enhancement layer. With a PCM audio input, the proposed structure is able to produce an audio version of near-transparent quality on top of the existing low-quality version. In particular, the proposed fixed-quality enhancing process, with its checking procedures, provides the minimum amount of enhancement needed for the low-quality version to reach a near-transparent quality that is almost indistinguishable from CD quality. In addition, a bitrate estimation model is proposed. The model enables direct estimation of the enhancing bitrate from two parameters extracted from the encoding process of the low-quality version. Evaluation results indicate that a better-defined quality level is guaranteed than with a fixed bitrate setting and that, on average, a lower bitrate (by approximately 20%) is attained. It is also shown that the proposed estimation model accurately predicts the necessary enhancing bitrate while reducing the complexity by around 17%.

14.
The objective of this paper is to propose an efficient model-based bit allocation process that optimizes the performance of a wavelet coder for semiregular meshes. More precisely, this process should compute the quantizers for the wavelet coefficient subbands that minimize the reconstructed mean square error for a specific target bitrate. In order to design a fast, low-complexity allocation process, we propose an approximation of the reconstructed mean square error for the coding of semiregular mesh geometry. This error is expressed directly in terms of the quantization errors of each coefficient subband. For that purpose, we have to take into account the influence of the wavelet filters on the quantized coefficients. Furthermore, we propose a specific approximation for wavelet transforms based on lifting schemes. Experimentally, we show that, in comparison with a "naive" approximation (depending only on the subband levels), using the proposed approximation as the distortion criterion during the model-based allocation process improves the performance of a wavelet-based coder for any model, any bitrate, and any lifting scheme.

15.
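A new fast MDCT algorithm   Total citations: 5, self-citations: 0, citations by others: 5
The modified discrete cosine transform (MDCT), a good time-frequency analysis tool, is widely used in audio coding. This paper proposes a fast MDCT algorithm based on the fast DCT; compared with the algorithms in other literature, its computational cost is markedly lower.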
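
As a reference point for what any fast MDCT algorithm must compute, the transform can be written directly from its definition. The sketch below is the slow O(N²) form with no window, not the fast DCT-based algorithm of the paper; the function names and the 1/N inverse scaling are choices made for this illustration.

```python
import math


def mdct(block):
    """Direct O(N^2) MDCT: 2N input samples give N coefficients."""
    n_half = len(block) // 2
    return [
        sum(block[n] * math.cos(math.pi / n_half
                                * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(2 * n_half))
        for k in range(n_half)
    ]


def imdct(coeffs):
    """Direct inverse MDCT with 1/N scaling: N coefficients give 2N samples."""
    n_half = len(coeffs)
    return [
        sum(coeffs[k] * math.cos(math.pi / n_half
                                 * (n + 0.5 + n_half / 2) * (k + 0.5))
            for k in range(n_half)) / n_half
        for n in range(2 * n_half)
    ]


# Two 50%-overlapping blocks; the shared middle part is recovered by
# adding the second half of block 1 to the first half of block 2.
N = 4
x = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, 0.0, -0.9, 0.4, 1.5, -2.2, 0.8]
y1 = imdct(mdct(x[0:2 * N]))
y2 = imdct(mdct(x[N:3 * N]))
middle = [y1[N + i] + y2[i] for i in range(N)]
```

Each inverse-transformed block on its own contains time-domain aliasing; the aliasing terms cancel only in the overlap-add, so `middle` should match `x[N:2*N]` up to rounding.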

16.
This paper proposes a new speaker-dependent coding algorithm to efficiently compress a large speech database for corpus-based concatenative text-to-speech (TTS) engines while maintaining high fidelity. To achieve a high compression ratio and meet the fundamental requirements of concatenative TTS synthesizers, such as partial segment decoding and random access capability, we adopt a nonpredictive analysis-by-synthesis scheme for speaker-dependent parameter estimation and quantization. The spectral coefficients are quantized using a memoryless split vector quantization (VQ) approach that does not use frame correlation. Considering that the excitation signals of a specific speaker show low intra-variation, especially in the voiced regions, the conventional adaptive codebook for pitch prediction is replaced by a speaker-dependent pitch-pulse codebook trained on a corpus of single-speaker speech signals. To further improve the coding efficiency, the proposed coder flexibly combines nonpredictive and predictive methods, taking the structure of the TTS system into account. By applying the proposed algorithm to a Korean TTS system, we obtained quality comparable to that of the G.729 speech coder and satisfied all the requirements of the TTS system. The results are verified by both objective and subjective quality measurements. In addition, the decoding complexity of the proposed coder is around 55% lower than that of G.729 Annex A.

17.
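Multi-rate zerotree wavelet audio compression method based on a psychoacoustic model   Total citations: 3, self-citations: 0, citations by others: 3
He Dongmei, Gao Wen. Chinese Journal of Computers (《计算机学报》), 2000, 23(3): 278-284
The MPEG-4 audio coding standard not only places higher demands on bitrate and audio quality but also requires the encoder to offer multiple functionalities to meet the needs of different applications. Exploiting the self-similarity of wavelet coefficients across scales and the masking effect of the human ear, this paper proposes a zerotree wavelet audio coding algorithm based on a psychoacoustic model. The algorithm not only delivers transparent-quality CD audio at a low bitrate (56 kb/s) but also produces an embedded bitstream that supports multi-rate scalable coding in an optimal sense, making it a promising coding scheme for multimedia communication and related fields.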

18.
Automatic speaker verification (ASV) systems are highly vulnerable to spoofing attacks. Anti-spoofing, that is, determining whether a speech signal is natural/genuine or spoofed, is very important for improving the reliability of ASV systems. Spoofing attacks that use speech signals generated by speech synthesis and voice conversion have recently received great interest owing to the 2015 edition of the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015). In this paper, we propose using linear prediction (LP) residual based features for anti-spoofing. Three different features extracted from the LP residual signal were compared using the ASVspoof 2015 database. Experimental results indicate that LP residual phase cepstral coefficients (LPRPC) and LP residual Hilbert envelope cepstral coefficients (LPRHEC), obtained from the analytic signal of the LP residual, yield promising results for anti-spoofing. The proposed features are found to outperform standard Mel-frequency cepstral coefficients (MFCC) and Cosine Phase (CosPhase) features. LPRPC and LPRHEC features give the smallest equal error rates (EER) for eight of the ten spoofing attacks in comparison with MFCC and CosPhase features.
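
The equal error rate (EER) used as the metric in this abstract can be approximated from two lists of detector scores. The sketch below is a simple threshold sweep, assuming higher scores mean "more likely genuine"; it is not the challenge's official scoring tool, and the function name `equal_error_rate` is hypothetical.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumes a higher score means the trial looks more genuine. Returns the
    mean of the two error rates at the threshold where they are closest.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


genuine = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, spoofed)  # 0.25 for these toy scores
```

With only a handful of trials the two error curves may never cross exactly, which is why the sketch averages them at the closest point; larger evaluations typically interpolate along the curve instead.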

19.
We present a new speech enhancement scheme for a single-microphone system to meet the demand for quality noise reduction algorithms capable of operating at a very low signal-to-noise ratio. A psychoacoustic model is incorporated into the generalized perceptual wavelet denoising method to reduce the residual noise and improve the intelligibility of speech. The proposed method is a generalized time-frequency subtraction algorithm that exploits the wavelet multirate signal representation to preserve critical transient information. Simultaneous masking and temporal masking of the human auditory system are modeled by the perceptual wavelet packet transform via the frequency and temporal localization of speech components. The wavelet coefficients are used to calculate the Bark spreading energy and temporal spreading energy, from which a time-frequency masking threshold is deduced to adaptively adjust the subtraction parameters of the proposed method. An unvoiced speech enhancement algorithm is also integrated into the system to improve the intelligibility of speech. Objective and subjective evaluations show that the proposed speech enhancement system is capable of reducing noise with little speech degradation in adverse noise environments and that its overall performance is superior to that of several competing methods.

