Similar literature
20 similar documents retrieved
1.
Lombard and Clear speech represent two acoustically and perceptually distinct speaking styles that humans employ to increase intelligibility. For Lombard speech, increased spectral energy in a band spanning the formant range is consistently observed, effectively augmenting loudness, while vowel space expansion is exhibited in Clear speech, indicating greater articulation. On the other hand, analyses in the first part of this work illustrate that Clear speech does not exhibit significant spectral energy boosting, nor does the Lombard effect invoke an expansion of vowel space. Accordingly, though the respective intelligibility gains of the two styles are largely attributed to these acoustic phenomena, the present analyses suggest that the phenomena are mutually exclusive in human speech production. However, these phenomena can be used to inspire signal processing algorithms that seek to exploit and ultimately compound their respective intelligibility gains, as is explored in the second part of this work. While Lombard-inspired spectral shaping has been shown to successfully increase intelligibility, Clear speech-inspired modifications to expand vowel space are rarely explored. With this in mind, the latter part of this work focuses mainly on a novel frequency warping technique that is shown to achieve vowel space expansion. The frequency warping is then incorporated into an established Lombard-inspired Spectral Shaping method that pairs with dynamic range compression to maximize speech audibility (SSDRC). Finally, objective and subjective evaluations are presented in order to assess and compare the intelligibility gains of the different styles and their inspired modifications.
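As a rough illustration of the kind of frequency warping that expands the vowel space (not the paper's actual warping function), the following NumPy sketch re-samples a short-time magnitude spectrum on a warped frequency axis so that spectral detail moves away from an assumed vowel-space centre; centre_hz and strength are hypothetical parameters.

import numpy as np

def expand_warp(mag, sample_rate, centre_hz=1500.0, strength=0.15):
    """Re-sample a magnitude spectrum on a warped frequency axis so that
    spectral detail moves away from centre_hz (0 Hz and Nyquist stay fixed)."""
    nyquist = sample_rate / 2.0
    f = np.linspace(0.0, nyquist, len(mag))
    # Monotonic warp that leaves the band edges in place
    warped = f + strength * (f - centre_hz) * f * (nyquist - f) / nyquist**2
    return np.interp(f, warped, mag)

Keeping the band edges fixed avoids having to truncate or pad the spectrum, and the single strength parameter controls how far content is pushed away from the assumed centre.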

2.
Post-filtering can be used in mobile communications to improve the quality and intelligibility of speech. Energy reallocation with a high-pass type filter has been shown to work effectively in improving the intelligibility of speech in difficult noise conditions. This paper introduces a post-filtering algorithm that adapts to the background noise level as well as to the fundamental frequency of the speaker and models the spectral effects observed in natural Lombard speech. The introduced method and another post-filtering technique were compared to unprocessed telephone speech in subjective listening tests in terms of intelligibility and quality. The results indicate that the proposed method outperforms the reference method in difficult noise conditions.
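As a loose illustration of noise-adaptive post-filtering (not the algorithm of the paper, and omitting its dependence on the speaker's fundamental frequency), the sketch below applies a first-order high-frequency emphasis whose strength grows with the estimated background noise level; the mapping from noise level to emphasis coefficient is a made-up placeholder.

import numpy as np
from scipy.signal import lfilter

def adaptive_postfilter(speech, noise_level_db, max_emphasis=0.9):
    """Apply y[n] = x[n] - a * x[n-1], with a scaled by the noise level."""
    # Map an assumed 40..80 dB noise level onto an emphasis coefficient in [0, max_emphasis]
    a = max_emphasis * np.clip((noise_level_db - 40.0) / 40.0, 0.0, 1.0)
    return lfilter([1.0, -a], [1.0], speech)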

3.
This paper presents a study on the importance of short-term speech parameterizations for expressive statistical parametric synthesis. Assuming a source-filter model of speech production, the analysis is conducted over spectral parameters, here defined as features which represent a minimum-phase synthesis filter, and some excitation parameters, which are features used to construct a signal that is fed to the minimum-phase synthesis filter to generate speech. In the first part, different spectral and excitation parameters that are applicable to statistical parametric synthesis are tested to determine which ones are the most emotion dependent. The analysis is performed through two methods proposed to measure the relative emotion dependency of each feature: one based on K-means clustering, and another based on Gaussian mixture modeling for emotion identification. Two commonly used forms of parameters for the short-term speech spectral envelope, the Mel cepstrum and the Mel line spectral pairs, are utilized. As excitation parameters, the anti-causal cepstrum, the time-smoothed group delay, and band-aperiodicity coefficients are considered. According to the analysis, the line spectral pairs are the most emotion dependent parameters. Among the excitation features, the band-aperiodicity coefficients present the highest correlation with the speaker's emotion. The most emotion dependent parameters according to this analysis were selected to train an expressive statistical parametric synthesizer using a speaker and language factorization framework. Subjective test results indicate that the spectral parameters considered have a greater impact on the emotion of the synthesized speech than the excitation parameters.
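A rough sketch of the GMM-based emotion-dependency measure described above might look as follows: one Gaussian mixture is trained per emotion on a single feature stream, and the frame-level emotion-identification accuracy serves as the dependency score. The data layout, mixture size and scikit-learn implementation are assumptions, not the paper's exact protocol.

import numpy as np
from sklearn.mixture import GaussianMixture

def emotion_dependency_score(train_by_emotion, test_by_emotion, n_components=8):
    """train_by_emotion / test_by_emotion: dict emotion -> (frames, dims) array."""
    emotions = sorted(train_by_emotion)
    gmms = {e: GaussianMixture(n_components, covariance_type="diag", random_state=0)
                .fit(train_by_emotion[e]) for e in emotions}
    correct = total = 0
    for true_e in emotions:
        X = test_by_emotion[true_e]
        # Log-likelihood of every test frame under every emotion model
        ll = np.stack([gmms[e].score_samples(X) for e in emotions], axis=1)
        correct += int(np.sum(np.argmax(ll, axis=1) == emotions.index(true_e)))
        total += len(X)
    return correct / total  # higher accuracy suggests a more emotion-dependent feature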

4.
A method to improve the voicing decision using glottal activity features is proposed for statistical parametric speech synthesis. In existing methods, the voicing decision relies mostly on the fundamental frequency (F0), which may result in errors when the prediction is inaccurate. Even though F0 is a glottal activity feature, other features that characterize this activity may help in improving the voicing decision. The glottal activity features used in this work are the strength of excitation (SoE), the normalized autocorrelation peak strength (NAPS), and higher-order statistics (HOS). These features are obtained from approximate source signals such as the zero-frequency filtered signal and the integrated linear prediction residual. To improve the voicing decision and to avoid a heuristic classification threshold, the glottal activity features are used to train different statistical classifiers, namely the k-nearest neighbor, the support vector machine (SVM), and the deep belief network. The voicing decision works best with the SVM classifier, and its effectiveness is tested within statistical parametric speech synthesis: the glottal activity features SoE, NAPS, and HOS are modeled along with F0 and Mel-cepstral coefficients in hidden Markov model and deep neural network frameworks to obtain the voicing decision. The objective and subjective evaluations demonstrate that the proposed method improves the naturalness of synthetic speech.
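A minimal sketch of the SVM-based voicing classifier, assuming the per-frame glottal activity features (SoE, NAPS, higher-order statistics) have already been extracted into an array with binary voiced/unvoiced labels; the kernel and hyperparameters are placeholders rather than the authors' settings.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_voicing_classifier(glottal_features, voicing_labels):
    """Train an SVM that predicts the voiced/unvoiced decision per frame."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    clf.fit(glottal_features, voicing_labels)
    return clf  # clf.predict(new_features) then yields 0/1 voicing decisions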

5.
The highest-quality synthetic voices remain scarce in both parametric and concatenative synthesis systems. Much synthetic speech lacks naturalness, pleasantness and flexibility. While great strides have been made over the past few years in the quality of synthetic speech, there is still much work that needs to be done. The major challenges now facing developers are how to provide optimal size, performance, extensibility, and flexibility, together with developing improved signal processing techniques. This paper focuses on issues of performance and flexibility against a background containing a brief evolution of speech synthesis; some acoustic, phonetic and linguistic issues; and the merits and demerits of two commonly used synthesis techniques: parametric and concatenative. Shortcomings of both techniques are reviewed. Methodological developments in the variable size, selection and specification of the speech units used in concatenative systems are explored and shown to provide a more positive outlook for more natural, more bearable synthetic speech. Differentiating considerations in making and improving concatenative systems are explored and evaluated. Acoustic and sociophonetic criteria are reviewed for the improvement of variable synthetic voices, and a ranking of their relative importance is suggested. Future rewards are weighed against current technical and developmental challenges. The conclusion indicates some of the current and future applications of TTS.

6.
This paper describes speech intelligibility enhancement for Hidden Markov Model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the glimpse proportion – an objective measure of the intelligibility of speech in noise – increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1–4 kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.
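For reference, a toy version of a glimpse-proportion-style measure is sketched below: it counts the fraction of time-frequency cells in which the speech level exceeds the noise level by a local SNR criterion. The 3 dB threshold is the commonly quoted value and the spectrogram front end is left unspecified, so this is an assumption-laden illustration rather than the exact measure optimized in the paper.

import numpy as np

def glimpse_proportion(speech_spec, noise_spec, local_snr_db=3.0):
    """Fraction of time-frequency cells where speech exceeds noise by local_snr_db."""
    eps = 1e-12
    snr_db = 10.0 * np.log10((speech_spec**2 + eps) / (noise_spec**2 + eps))
    return float(np.mean(snr_db > local_snr_db))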

7.
A speech pre-processing algorithm is presented that improves the speech intelligibility in noise for the near-end listener. The algorithm improves intelligibility by optimally redistributing the speech energy over time and frequency according to a perceptual distortion measure, which is based on a spectro-temporal auditory model. Since this auditory model takes into account short-time information, transients will receive more amplification than stationary vowels, which has been shown to be beneficial for intelligibility of speech in noise. The proposed method is compared to unprocessed speech and two reference methods using an intelligibility listening test. Results show that the proposed method leads to significant intelligibility gains while still preserving quality. Although one of the methods used as a reference obtained higher intelligibility gains, this happened at the cost of decreased quality. Matlab code is provided.
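As an illustrative sketch of energy redistribution under a fixed-energy constraint (not the paper's optimization, whose spectro-temporal auditory model is not reproduced here), the snippet below reweights time-frequency energy by an externally supplied perceptual importance weight and then renormalizes so the total energy is unchanged.

import numpy as np

def redistribute_energy(power_spec, importance, exponent=0.5):
    """power_spec, importance: (frames, bins) arrays; returns rescaled power."""
    gains = (importance + 1e-12) ** exponent
    boosted = power_spec * gains
    # Renormalize so overall energy is preserved, mirroring the explicit constraint above
    boosted *= power_spec.sum() / (boosted.sum() + 1e-12)
    return boosted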

8.
This paper introduces speech synthesis technology based on acoustic statistical modeling, focusing on the innovative work of the iFLYTEK Speech Laboratory at the University of Science and Technology of China in this frontier direction of speech synthesis. The contributions include: fusing articulatory movement parameters with acoustic parameters to make acoustic parameter generation more flexible; replacing the maximum likelihood criterion with a minimum generation error criterion to improve the quality of the synthesized speech; and replacing parametric vocoder reconstruction with unit selection and waveform concatenation to remedy the quality shortcomings of parametric synthesizers. These innovations further improve the naturalness, expressiveness, flexibility and multilingual applicability of speech synthesis systems, and have promoted the wide application of speech synthesis technology in call-center information services, human-machine speech interaction on mobile embedded devices, intelligent speech-based teaching and other areas.

9.
This study investigated whether the signal-to-noise ratio (SNR) of the interlocutor (speech partner) influences a speaker's vocal intensity in conversational speech. Twenty participants took part in artificial conversations with controlled levels of interlocutor speech and background noise. Three different levels of background noise were presented over headphones while the participant engaged in a “live interaction” with the experimenter. The experimenter's vocal intensity was manipulated in order to modify the SNR, and the participants' vocal intensity was measured. As observed previously, vocal intensity increased as background noise level increased. However, the SNR of the interlocutor did not have a significant effect on participants' vocal intensity. These results suggest that increasing the signal level of the other party at the earpiece would not reduce the tendency of telephone users to talk loudly.

10.
11.
With growing amounts of training data, conventional hidden Markov models struggle to further improve the prediction quality of parametric speech synthesis. Long short-term memory (LSTM) networks learn long-range dependencies within a sequence and, with large-scale parallel numerical computation, yield more accurate duration models and more coherent spectral models, but they also contain computations that can be simplified. This work first analyzes the functional structure of the bidirectional LSTM, then removes the forget gate and the output gate, and finally models the mapping from textual phoneme information to the cepstrum. Comparative experiments on a Mandarin corpus show that the simplified bidirectional LSTM halves the computational cost and reduces the Mel-cepstral distortion from 3.4661 with the hidden Markov model to 1.9459.
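A hedged sketch of the simplification described above: a recurrent cell with the forget and output gates removed, so that only the input gate remains. This illustrates the idea rather than the authors' exact formulation, and the parameter names are placeholders.

import numpy as np

def simplified_lstm_step(x, h_prev, c_prev, params):
    """One time step; params holds weight matrices W*, U* and biases b*."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(params["Wi"] @ x + params["Ui"] @ h_prev + params["bi"])   # input gate
    g = np.tanh(params["Wg"] @ x + params["Ug"] @ h_prev + params["bg"])   # candidate update
    c = c_prev + i * g          # no forget gate: the old cell state is kept as-is
    h = np.tanh(c)              # no output gate: the hidden state is just tanh(c)
    return h, c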

12.
What makes speech produced in the presence of noise (Lombard speech) more intelligible than conversational speech produced in quiet conditions? This study investigates the hypothesis that speakers modify their speech in the presence of noise in such a way that acoustic contrasts between their speech and the background noise are enhanced, which would improve speech audibility. Ten French speakers were recorded while playing an interactive game, first in a quiet condition and then in two types of noisy conditions with different spectral characteristics: a broadband noise (BB) and a cocktail-party noise (CKTL), both played over loudspeakers at 86 dB SPL. Similarly to Lu and Cooke (2009b), our results suggest no systematic “active” adaptation of the whole speech spectrum or vocal intensity to the spectral characteristics of the ambient noise. Regardless of the type of noise, the gender or the type of speech segment, the primary strategy was to speak louder in noise, with a greater adaptation in BB noise and an emphasis on vowels rather than any type of consonant. Active strategies were evidenced, but were subtle and of second order to the primary strategy of speaking louder: for each gender, fundamental frequency (f0) and first formant frequency (F1) were modified in cocktail-party noise in a way that optimized the release in energetic masking induced by this type of noise. Furthermore, speakers showed two additional modifications as compared to shouted speech, which therefore cannot be interpreted in terms of vocal effort only: they enhanced the modulation of their speech in f0 and vocal intensity, and they boosted their speech spectrum specifically around 3 kHz, in the region of maximum ear sensitivity associated with the actor's or singer's formant.

13.
Corpus pruning, or removing redundancy from the speech corpus, is an important problem in large-corpus speech synthesis. The concept of virtual variable-length substitution is proposed to compensate for the loss of variable-length units. Combined with the frequency with which variants are used in synthesis, a corpus pruning algorithm, StaRp-VPA, is constructed. The algorithm can prune the corpus at any ratio. Experiments show that when the pruning rate is below 50%, the naturalness of the synthesized speech hardly decreases, and even above 50% the naturalness is not severely degraded.
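A toy sketch of frequency-driven corpus pruning (not StaRp-VPA itself, whose virtual variable-length substitution step is not reproduced here): units are ranked by how often they are selected during synthesis, and the least-used ones are dropped until the requested pruning ratio is reached.

def prune_corpus(units, usage_counts, prune_ratio=0.5):
    """units: list of unit ids; usage_counts: dict id -> selection frequency."""
    ranked = sorted(units, key=lambda u: usage_counts.get(u, 0), reverse=True)
    keep = max(1, int(round(len(ranked) * (1.0 - prune_ratio))))
    return ranked[:keep]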

14.
The Diplomat rapid-deployment speech-translation system is intended to allow naïve users to communicate across a language barrier, without strong domain restrictions, despite the error-prone nature of current speech and translation technologies. In addition, it should be deployable for new languages an order of magnitude more quickly than traditional technologies. Achieving this ambitious set of goals depends in large part on allowing the users to correct recognition and translation errors interactively. We present the Multi-Engine Machine Translation (MEMT) architecture, describing how it is well suited for such an application. We then discuss our approaches to rapid-deployment speech recognition and synthesis. Finally we describe our incorporation of interactive error correction throughout the system design. We have already developed working bidirectional Croatian-English and Spanish-English systems, and have Haitian Creole-English and Korean-English versions under development.

15.
Statistical analysis of fundamental frequency parameter variation in emotional speech across multiple languages
田岚, 姜晓庆, 侯正信. 《控制与决策》, 2005, 20(11): 1311-1313
To study the fundamental frequency (F0) parameters that best reflect emotional information among prosodic features, seven typical emotional states and fixed sentence patterns were selected, and statistical analyses of mean F0, F0 dynamic range, F0 jitter and related parameters were carried out on multilingual speech samples (Chinese, English and Japanese) from the same speaker. The results show that the F0 structure of emotional speech changes markedly with emotional state, and that these structural changes are fairly consistent across languages.

16.
In recent years, speech synthesis systems have allowed for the production of very high-quality voices, so research in this domain is now turning to the problem of integrating emotions into speech. However, constructing a separate speech synthesizer for each emotion has some limitations. First, this approach often requires an emotional-speech data set with many sentences, and such data sets are very time-intensive and labor-intensive to complete. Second, training each of these models requires computers with large computational capabilities and a lot of effort and time for model tuning. In addition, a model built for one emotion fails to take advantage of the data sets of the other emotions. In this paper, we propose a new method to synthesize emotional speech in which the latent expressions of emotions are learned from a small data set of professional actors through a Flowtron model. In addition, we provide a new method to build a speech corpus that is scalable and whose quality is easy to control. To produce a high-quality speech synthesis model, we used this data set to train a Tacotron 2 model, which then served as the pre-trained model for training the Flowtron model. We applied this method to synthesize Vietnamese speech with sadness and happiness. Mean opinion score (MOS) assessments yielded scores of 3.61 for sadness and 3.95 for happiness. In conclusion, the proposed method proves effective for highly automated, fast emotional sentence generation from a small emotional-speech data set.

17.
This paper first analyzes the time-frequency characteristics of wavelets and, on that basis, applies wavelet-domain filtering to the speech signal to extract the frequency components that are effective for auditory perception; the signal is then segmented with a parametric filtering method. The basic idea of parametric filtering is to filter the signal with a varying parameter so as to obtain the signal's components in different frequency bands. It can be shown that if the filtering parameter varies according to a certain rule, the first-order autocorrelations of these filtered components represent the correlation structure of the signal. Experiments show that phoneme segmentation based on parametric filtering of the wavelet-filtered frequency components yields fairly accurate segmentation results.

18.
Speech synthesis technology is becoming increasingly mature. To improve the quality of synthesized emotional speech, a method combining end-to-end emotional speech synthesis with prosody modification is proposed: prosodic parameters are adjusted on top of the emotional speech synthesized by a Tacotron model, enhancing the emotional expressiveness of the synthesis system. A Tacotron model is first trained on a large neutral corpus and then further trained on a small emotional corpus to synthesize emotional speech. The Praat acoustic analysis tool is then used to analyze the prosodic features of the emotional speech in the corpus and to summarize the parameter patterns of the different emotional states. Finally, guided by these patterns, the fundamental frequency, duration and energy of the corresponding Tacotron-synthesized emotional speech are modified so that the emotional expression becomes more accurate. Objective emotion-recognition experiments and subjective evaluations show that the method can synthesize more natural and more expressive emotional speech.
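A hedged sketch of the post-hoc prosody adjustment step, assuming target pitch, duration and energy ratios have already been derived elsewhere (for example from a Praat analysis of emotional recordings); the specific values and the use of librosa here are illustrative assumptions, not the paper's implementation.

import numpy as np
import librosa

def adjust_prosody(wav, sr, pitch_semitones=2.0, duration_ratio=0.9, energy_gain=1.2):
    """Return a prosody-modified copy of a synthesized waveform."""
    y = librosa.effects.pitch_shift(wav, sr=sr, n_steps=pitch_semitones)
    y = librosa.effects.time_stretch(y, rate=1.0 / duration_ratio)  # rate > 1 shortens
    y = y * energy_gain
    return np.clip(y, -1.0, 1.0)  # keep the signal within full scale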

19.
On the basis of an analysis and review of existing speech coding schemes, a five-layer structural model for speech coding systems is proposed, together with the concept of “obtaining the excitation code at the receiving end using side information”.

20.
To address the robustness of speech recognition under vocal effort variation, a GMM-based vocal effort detector is built on the basis of an analysis of how sound intensity, duration, frame energy distribution and spectral tilt change with vocal effort. The effect of vocal effort variation on recognition accuracy is also studied, and a speech recognition algorithm based on a multi-model framework is proposed. Experiments on Mandarin isolated-word recognition show that, apart from a slight drop in accuracy for the normal mode, the recognition accuracy for the other four vocal effort modes improves substantially. The results indicate that the intensity, duration, frame energy distribution and spectral tilt of the speech signal can be used to identify the vocal effort mode, and that the multi-model framework is an effective way to address vocal-effort-related robustness in speech recognition.
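A rough sketch of the two-stage idea: classify the vocal effort mode with per-mode GMMs over frame-level features (intensity, duration statistics, energy distribution, spectral tilt), then route the utterance to the acoustic model trained for that mode. The mode names and the recognizer interface are assumptions for illustration, not the paper's exact system.

from sklearn.mixture import GaussianMixture

MODES = ["whispered", "soft", "normal", "loud", "shouted"]  # assumed five-mode setup

def detect_mode(frame_features, mode_gmms):
    """mode_gmms: dict mode -> fitted GaussianMixture; returns the best-scoring mode."""
    scores = {m: gmm.score(frame_features) for m, gmm in mode_gmms.items()}
    return max(scores, key=scores.get)

def recognize(frame_features, mode_gmms, recognizers):
    """recognizers: dict mode -> callable(features) returning a word hypothesis."""
    mode = detect_mode(frame_features, mode_gmms)
    return recognizers[mode](frame_features)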
