20 similar documents found.
1.
《Computer Speech and Language》2014,28(2):629-647
Lombard and Clear speech represent two acoustically and perceptually distinct speaking styles that humans employ to increase intelligibility. For Lombard speech, increased spectral energy in a band spanning the range of formants is consistently observed, effectively augmenting loudness, while vowel space expansion is exhibited in Clear speech, indicating greater articulation. On the other hand, analyses in the first part of this work illustrate that Clear speech does not exhibit significant spectral energy boosting, nor does the Lombard effect invoke an expansion of vowel space. Accordingly, though these two acoustic phenomena are largely credited with the respective intelligibility gains of the styles, the present analyses suggest that they are mutually exclusive in human speech production. However, these phenomena can be used to inspire signal processing algorithms that seek to exploit and ultimately compound their respective intelligibility gains, as is explored in the second part of this work. While Lombard-inspired spectral shaping has been shown to successfully increase intelligibility, Clear speech-inspired modifications to expand vowel space are rarely explored. With this in mind, the latter part of this work focuses mainly on a novel frequency warping technique that is shown to achieve vowel space expansion. The frequency warping is then incorporated into an established Lombard-inspired Spectral Shaping method that pairs with Dynamic Range Compression to maximize speech audibility (SSDRC). Finally, objective and subjective evaluations are presented in order to assess and compare the intelligibility gains of the different styles and their inspired modifications.
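The paper's actual warping function is not reproduced in the abstract; as a hedged illustration, vowel-space-expanding warps generally take the shape of a piecewise-linear map that stretches the formant band and compresses the band above it so the Nyquist frequency is preserved. The band edges and expansion factor below (`f_lo`, `f_hi`, `alpha`) are illustrative assumptions, not values from the paper.

```python
def piecewise_warp(f, f_lo=300.0, f_hi=2500.0, alpha=1.2, f_nyq=8000.0):
    """Map input frequency f (Hz) through a piecewise-linear warp that
    stretches the band [f_lo, f_hi] by factor alpha (expanding the
    formant region) and compresses the band above it so that the
    Nyquist frequency maps to itself."""
    if f <= f_lo:
        return f
    if f <= f_hi:
        return f_lo + alpha * (f - f_lo)
    # Compress the upper band so f_nyq still maps to f_nyq.
    band_top = f_lo + alpha * (f_hi - f_lo)      # warped position of f_hi
    top_in = f_nyq - f_hi                        # input width of upper band
    top_out = f_nyq - band_top                   # output width of upper band
    return band_top + (f - f_hi) * top_out / top_in
```

Applying this map to the frequency axis of the spectral envelope before resynthesis stretches the region containing the first two formants, which is the effect the paper associates with vowel space expansion.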
2.
《Computer Speech and Language》2014,28(2):619-628
Post-filtering can be used in mobile communications to improve the quality and intelligibility of speech. Energy reallocation with a high-pass type filter has been shown to work effectively in improving the intelligibility of speech in difficult noise conditions. This paper introduces a post-filtering algorithm that adapts to the background noise level as well as to the fundamental frequency of the speaker and models the spectral effects observed in natural Lombard speech. The introduced method and another post-filtering technique were compared to unprocessed telephone speech in subjective listening tests in terms of intelligibility and quality. The results indicate that the proposed method outperforms the reference method in difficult noise conditions.
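The paper's post-filter adapts to the noise level and the speaker's fundamental frequency; those adaptive details are not in the abstract. As a minimal stand-in for the high-pass-type energy reallocation it describes, a first-order pre-emphasis filter shows the basic idea; the coefficient value is an assumption for illustration.

```python
def pre_emphasis(x, coeff=0.9):
    """First-order high-pass (pre-emphasis) filter y[n] = x[n] - coeff*x[n-1].
    This boosts high-frequency energy relative to low, a crude sketch of
    the high-pass-type reallocation described in the abstract; the actual
    post-filter adapts coeff-like behavior to noise level and F0."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]
```

A constant (DC-like) input is strongly attenuated after the first sample, while rapidly alternating (high-frequency) input passes nearly unattenuated.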
3.
《Computer Speech and Language》2014,28(5):1209-1232
This paper presents a study on the importance of short-term speech parameterizations for expressive statistical parametric synthesis. Assuming a source-filter model of speech production, the analysis is conducted over spectral parameters, here defined as features which represent a minimum-phase synthesis filter, and some excitation parameters, which are features used to construct a signal that is fed to the minimum-phase synthesis filter to generate speech. In the first part, different spectral and excitation parameters that are applicable to statistical parametric synthesis are tested to determine which ones are the most emotion-dependent. The analysis is performed through two methods proposed to measure the relative emotion dependency of each feature: one based on K-means clustering, and another based on Gaussian mixture modeling for emotion identification. Two commonly used forms of parameters for the short-term speech spectral envelope, the Mel cepstrum and the Mel line spectral pairs, are utilized. As excitation parameters, the anti-causal cepstrum, the time-smoothed group delay, and band-aperiodicity coefficients are considered. According to the analysis, the line spectral pairs are the most emotion-dependent parameters. Among the excitation features, the band-aperiodicity coefficients present the highest correlation with the speaker's emotion. The most emotion-dependent parameters according to this analysis were selected to train an expressive statistical parametric synthesizer using a speaker and language factorization framework. Subjective test results indicate that the considered spectral parameters have a bigger impact on the synthesized speech emotion when compared with the excitation ones.
4.
A method to improve the voicing decision using glottal activity features is proposed for statistical parametric speech synthesis. In existing methods, the voicing decision relies mostly on the fundamental frequency (F0), which may result in errors when the prediction is inaccurate. Even though F0 is a glottal activity feature, other features that characterize this activity may help in improving the voicing decision. The glottal activity features used in this work are the strength of excitation (SoE), normalized autocorrelation peak strength (NAPS), and higher-order statistics (HOS). These features are obtained from approximated source signals such as the zero-frequency filtered signal and the integrated linear prediction residual. To improve the voicing decision and to avoid a heuristic threshold for classification, the glottal activity features are used to train different statistical classifiers, namely the k-nearest neighbor, support vector machine (SVM), and deep belief network. The voicing decision works best with the SVM classifier, and its effectiveness is tested using statistical parametric speech synthesis. The glottal activity features SoE, NAPS, and HOS are modeled along with F0 and Mel-cepstral coefficients in hidden Markov model and deep neural network frameworks to obtain the voicing decision. Objective and subjective evaluations demonstrate that the proposed method improves the naturalness of synthetic speech.
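The paper found SVM to perform best; to keep this sketch dependency-free, it uses the k-nearest-neighbor classifier, which the paper also evaluated. The feature vectors below (one `[SoE, NAPS]` pair per frame) are synthetic toy values, not data from the paper.

```python
import math

def knn_voicing(train, labels, x, k=3):
    """Classify a frame as voiced (1) or unvoiced (0) by majority vote
    among its k nearest training frames in glottal-feature space."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    votes = [labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

# Toy frames: [strength of excitation, autocorrelation peak strength].
# Voiced frames tend to have high values of both features.
frames = [[0.90, 0.80], [0.80, 0.90], [0.85, 0.75],   # voiced
          [0.10, 0.20], [0.15, 0.10], [0.05, 0.25]]   # unvoiced
labels = [1, 1, 1, 0, 0, 0]
print(knn_voicing(frames, labels, [0.80, 0.80]))  # a voiced-like frame -> 1
```

Learning the decision boundary from data, rather than thresholding a single F0 track, is the core of the approach regardless of which classifier is plugged in.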
5.
Caroline Henton 《International Journal of Speech Technology》2002,5(2):117-131
The highest-quality synthetic voices remain scarce in both parametric and concatenative synthesis systems. Much synthetic speech lacks naturalness, pleasantness and flexibility. While great strides have been made over the past few years in the quality of synthetic speech, there is still much work that needs to be done. Now the major challenges facing developers are how to provide optimal size, performance, extensibility, and flexibility, together with developing improved signal processing techniques. This paper focuses on issues of performance and flexibility against a background containing a brief evolution of speech synthesis; some acoustic, phonetic and linguistic issues; and the merits and demerits of two commonly used synthesis techniques: parametric and concatenative. Shortcomings of both techniques are reviewed. Methodological developments in the variable size, selection and specification of the speech units used in concatenative systems are explored and shown to provide a more positive outlook for more natural, bearable synthetic speech. Differentiating considerations in making and improving concatenative systems are explored and evaluated. Acoustic and sociophonetic criteria are reviewed for the improvement of variable synthetic voices, and a ranking of their relative importance is suggested. Future rewards are weighed against current technical and developmental challenges. The conclusion indicates some of the current and future applications of TTS.
6.
《Computer Speech and Language》2014,28(2):665-686
This paper describes speech intelligibility enhancement for Hidden Markov Model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the glimpse proportion – an objective measure of the intelligibility of speech in noise – increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1–4 kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.
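The glimpse proportion measure is properly computed from an auditory (gammatone filterbank) spectro-temporal excitation pattern; the simplified sketch below keeps only its core definition, the fraction of time-frequency cells where speech exceeds the noise by a local SNR threshold. The 3 dB threshold is a common choice in the glimpsing literature, not necessarily the value used in this paper.

```python
def glimpse_proportion(speech_db, noise_db, local_snr_threshold=3.0):
    """Fraction of spectro-temporal cells where the speech level exceeds
    the noise level by at least local_snr_threshold (dB). Inputs are
    2-D lists indexed [time][frequency], with levels in dB."""
    glimpsed = total = 0
    for s_row, n_row in zip(speech_db, noise_db):
        for s, n in zip(s_row, n_row):
            total += 1
            if s - n >= local_snr_threshold:
                glimpsed += 1
    return glimpsed / total

# Toy 2-frame x 3-band example (levels in dB).
speech = [[60, 55, 40], [50, 65, 45]]
noise  = [[50, 54, 45], [48, 50, 30]]
print(glimpse_proportion(speech, noise))  # 3 of 6 cells glimpsed -> 0.5
```

The optimization the paper describes adjusts Mel cepstral coefficients so that this proportion rises at fixed total speech energy.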
7.
《Computer Speech and Language》2014,28(4):858-872
A speech pre-processing algorithm is presented that improves the speech intelligibility in noise for the near-end listener. The algorithm improves intelligibility by optimally redistributing the speech energy over time and frequency according to a perceptual distortion measure, which is based on a spectro-temporal auditory model. Since this auditory model takes into account short-time information, transients will receive more amplification than stationary vowels, which has been shown to be beneficial for intelligibility of speech in noise. The proposed method is compared to unprocessed speech and two reference methods using an intelligibility listening test. Results show that the proposed method leads to significant intelligibility gains while still preserving quality. Although one of the methods used as a reference obtained higher intelligibility gains, this happened at the cost of decreased quality. Matlab code is provided.
8.
This paper reviews speech synthesis technology based on statistical acoustic modeling, focusing on the innovative work of the iFLYTEK Speech Laboratory at the University of Science and Technology of China in this frontier area of speech synthesis. Specifically, this includes: fusing articulatory features with acoustic parameters to increase the flexibility of acoustic parameter generation; replacing the maximum likelihood criterion with a minimum generation error criterion to improve the quality of synthesized speech; and replacing parametric vocoder reconstruction with unit selection and waveform concatenation to remedy the limited audio quality of parametric synthesizers. These innovations further improve the naturalness, expressiveness, flexibility, and multilingual capability of speech synthesis systems, and have promoted the wide application of speech synthesis technology in call-center information services, human-machine voice interaction on mobile embedded devices, intelligent speech-based teaching, and other fields.
9.
《Computer Speech and Language》2014,28(2):572-579
This study investigated whether the signal-to-noise ratio (SNR) of the interlocutor (speech partner) influences a speaker's vocal intensity in conversational speech. Twenty participants took part in artificial conversations with controlled levels of interlocutor speech and background noise. Three different levels of background noise were presented over headphones while the participant engaged in a "live interaction" with the experimenter. The experimenter's vocal intensity was manipulated in order to modify the SNR, and the participants' vocal intensity was measured. As observed previously, vocal intensity increased as the background noise level increased. However, the SNR of the interlocutor did not have a significant effect on participants' vocal intensity. These results suggest that increasing the signal level of the other party at the earpiece would not reduce the tendency of telephone users to talk loudly.
11.
As training data grows, conventional hidden Markov models struggle to further improve the prediction quality of parametric speech synthesis. Long short-term memory (LSTM) networks learn long-range dependencies within a sequence and, with large-scale parallel numerical computation, yield more accurate duration models and more coherent spectral models, but they also contain computation that can be simplified. This work first analyzes the functional structure of the bidirectional LSTM network, then removes the forget gate and the output gate, and finally models the mapping from textual phoneme information to the cepstrum. Comparative experiments on a Mandarin corpus show that the simplified bidirectional LSTM halves the computational cost and reduces the Mel-cepstral distortion from 3.4661 with the hidden Markov model to 1.9459.
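The abstract reports Mel-cepstral distortion (MCD) dropping from 3.4661 to 1.9459. The standard per-frame MCD formula, in dB, can be sketched as follows; whether this paper excludes the 0th (energy) coefficient, as is conventional, is an assumption here.

```python
import math

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Mean Mel-cepstral distortion in dB between reference and
    synthesized Mel-cepstra, excluding the 0th (energy) coefficient.
    Each argument is a list of frames; each frame is a list of MCEPs.
    Per frame: (10/ln 10) * sqrt(2 * sum_d (c_ref[d] - c_syn[d])^2)."""
    const = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(mc_ref, mc_syn):
        sq = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += const * math.sqrt(sq)
    return total / len(mc_ref)
```

Identical cepstra give an MCD of 0; lower values mean the synthesized spectra are closer to the reference, which is the sense in which 1.9459 improves on 3.4661.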
12.
《Computer Speech and Language》2014,28(2):580-597
What makes speech produced in the presence of noise (Lombard speech) more intelligible than conversational speech produced in quiet conditions? This study investigates the hypothesis that speakers modify their speech in the presence of noise in such a way that acoustic contrasts between their speech and the background noise are enhanced, which would improve speech audibility. Ten French speakers were recorded while playing an interactive game, first in a quiet condition and then in two types of noisy conditions with different spectral characteristics: a broadband noise (BB) and a cocktail-party noise (CKTL), both played over loudspeakers at 86 dB SPL. Similar to Lu and Cooke (2009b), our results suggest no systematic "active" adaptation of the whole speech spectrum or vocal intensity to the spectral characteristics of the ambient noise. Regardless of the type of noise, the gender of the speaker, or the type of speech segment, the primary strategy was to speak louder in noise, with greater adaptation in BB noise and an emphasis on vowels rather than on any type of consonant. Active strategies were evidenced, but were subtle and secondary to the primary strategy of speaking louder: for each gender, fundamental frequency (f0) and first formant frequency (F1) were modified in cocktail-party noise in a way that optimized the release from energetic masking induced by this type of noise. Furthermore, speakers showed two additional modifications as compared to shouted speech, which therefore cannot be interpreted in terms of vocal effort alone: they enhanced the modulation of their speech in f0 and vocal intensity, and they boosted their speech spectrum specifically around 3 kHz, in the region of maximum ear sensitivity associated with the actor's or singer's formant.
14.
Robert Frederking Alexander Rudnicky Christopher Hogan Kevin Lenzo 《Machine Translation》2000,15(1-2):27-42
The Diplomat rapid-deployment speech-translation system is intended to allow naïve users to communicate across a language barrier, without strong domain restrictions, despite the error-prone nature of current speech and translation technologies. In addition, it should be deployable for new languages an order of magnitude more quickly than traditional technologies. Achieving this ambitious set of goals depends in large part on allowing the users to correct recognition and translation errors interactively. We present the Multi-Engine Machine Translation (MEMT) architecture, describing how it is well suited for such an application. We then discuss our approaches to rapid-deployment speech recognition and synthesis. Finally we describe our incorporation of interactive error correction throughout the system design. We have already developed working bidirectional Croatian–English and Spanish–English systems, and have Haitian Creole–English and Korean–English versions under development.
16.
In recent years, speech synthesis systems have allowed for the production of very high-quality voices. Therefore, research in this domain is now turning to the problem of integrating emotions into speech. However, the approach of constructing a separate speech synthesizer for each emotion has several limitations. First, it often requires an emotional-speech data set with many sentences, and such data sets are very time-intensive and labor-intensive to complete. Second, training each of these models requires computers with large computational capabilities and a lot of effort and time for model tuning. In addition, each per-emotion model fails to take advantage of the data sets of the other emotions. In this paper, we propose a new method to synthesize emotional speech in which the latent expressions of emotions are learned from a small data set of professional actors through a Flowtron model. In addition, we provide a new method to build a speech corpus that is scalable and whose quality is easy to control. Next, to produce a high-quality speech synthesis model, we used this data set to train the Tacotron 2 model, which we then used as a pre-trained model to train the Flowtron model. We applied this method to synthesize Vietnamese speech with sadness and happiness. Mean opinion score (MOS) assessment results show that the MOS is 3.61 for sadness and 3.95 for happiness. In conclusion, the proposed method proves to be more effective for a high degree of automation and fast emotional sentence generation, using a small emotional-speech data set.
17.
This paper first analyzes the time-frequency properties of wavelets and, based on these properties, applies wavelet-domain filtering to the speech signal to extract the frequency components that are effective for auditory perception; it then segments the signal using a parametric filtering method. The basic idea of parametric filtering is to filter the signal with a varying parameter so as to obtain its components in different frequency bands. It can be shown that if the filtering parameter varies according to a certain rule, the first-order autocorrelations of these filtered components represent the correlation structure of the signal. Experiments show that phoneme segmentation based on parametric filtering of the wavelet-filtered frequency components yields fairly accurate segmentation results.
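The abstract states that first-order autocorrelations of the parametrically filtered components capture the signal's correlation structure. The parametric filter bank itself is not specified in the abstract; as a minimal sketch of the statistic involved, the normalized lag-1 autocorrelation of a band component can be computed as follows.

```python
def lag1_autocorr(x):
    """Normalized lag-1 autocorrelation of a signal segment (mean removed).
    Values near +1 indicate a smooth, low-frequency-dominated component;
    values near -1 indicate a rapidly alternating, high-frequency one."""
    m = sum(x) / len(x)
    d = [v - m for v in x]
    num = sum(a * b for a, b in zip(d, d[1:]))
    den = sum(v * v for v in d)
    return num / den if den else 0.0
```

Tracking how this statistic changes from frame to frame across the filtered components is one way such correlation-structure cues can mark phoneme boundaries.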
18.
Speech synthesis technology is maturing. To improve the quality of synthesized emotional speech, this paper proposes a method that combines end-to-end emotional speech synthesis with prosody modification: the prosodic parameters of emotional speech synthesized by a Tacotron model are adjusted to strengthen the system's emotional expressiveness. The Tacotron model is first trained on a large neutral-speech corpus and then further trained on a small emotional corpus to synthesize emotional speech. The Praat acoustic analysis tool is then used to analyze the prosodic features of the emotional speech in the corpus and to summarize parameter patterns for different emotional states. Finally, guided by these patterns, the fundamental frequency, duration, and energy of the corresponding Tacotron-synthesized emotional speech are corrected to make the emotional expression more precise. Objective emotion-recognition experiments and subjective evaluations show that the method synthesizes fairly natural and more expressive emotional speech.
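The correction step adjusts fundamental frequency, duration, and energy. The F0 and duration terms require pitch-synchronous processing (e.g. PSOLA-style modification) that is beyond a short sketch; as a hedged illustration of the energy term alone, a uniform dB gain can be applied to the synthesized waveform. The gain value would come from the Praat-derived patterns, which are not given in the abstract.

```python
def scale_energy(samples, gain_db):
    """Apply a uniform amplitude correction, specified in dB, to a
    waveform (a list of float samples). This is a stand-in for the
    energy term of the prosody-correction rules only; the F0 and
    duration terms need pitch-synchronous processing not shown here."""
    g = 10.0 ** (gain_db / 20.0)  # dB -> linear amplitude factor
    return [s * g for s in samples]
```

A gain of 0 dB leaves the waveform unchanged; +6 dB roughly doubles the amplitude, which is the direction a "happy" energy rule might push relative to neutral speech.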