20 similar documents retrieved.
1.
《Computer Speech and Language》2014,28(2):629-647
Lombard and Clear speech represent two acoustically and perceptually distinct speaking styles that humans employ to increase intelligibility. For Lombard speech, increased spectral energy in a band spanning the range of formants is consistently observed, effectively augmenting loudness, while vowel space expansion is exhibited in Clear speech, indicating greater articulation. On the other hand, analyses in the first part of this work illustrate that Clear speech does not exhibit significant spectral energy boosting, nor does the Lombard effect invoke an expansion of vowel space. Accordingly, though these two acoustic phenomena are largely credited with the respective intelligibility gains of the styles, the present analyses suggest that they are mutually exclusive in human speech production. However, these phenomena can be used to inspire signal processing algorithms that seek to exploit and ultimately compound their respective intelligibility gains, as is explored in the second part of this work. While Lombard-inspired spectral shaping has been shown to successfully increase intelligibility, Clear speech-inspired modifications to expand vowel space are rarely explored. With this in mind, the latter part of this work focuses mainly on a novel frequency warping technique that is shown to achieve vowel space expansion. The frequency warping is then incorporated into an established Lombard-inspired Spectral Shaping method that pairs with dynamic range compression to maximize speech audibility (SSDRC). Finally, objective and subjective evaluations are presented in order to assess and compare the intelligibility gains of the different styles and their inspired modifications.
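The abstract does not specify the warping function itself; the following is a minimal, hypothetical Python sketch of the general idea behind frequency warping of a spectral envelope (a piecewise-linear warp), with the pivot frequency and warp factor chosen arbitrarily rather than taken from the paper. The SSDRC-style shaping and dynamic range compression stages are not shown.

    import numpy as np

    def warp_spectrum(mag, fs, pivot_hz=1500.0, alpha=1.2):
        # Resample a magnitude spectral envelope along a piecewise-linear
        # warped frequency axis: energy below the pivot is pushed up by
        # `alpha`, the remainder is compressed so the Nyquist edge maps to
        # itself.  Illustrative only -- not the paper's warping function.
        n = len(mag)
        nyq = fs / 2.0
        freqs = np.linspace(0.0, nyq, n)
        warped = np.where(
            freqs < pivot_hz,
            freqs * alpha,
            alpha * pivot_hz
            + (freqs - pivot_hz) * (nyq - alpha * pivot_hz) / (nyq - pivot_hz),
        )
        # The warped spectrum at frequency f takes the original value at
        # the pre-image of f under the warp (warped axis is monotonic).
        return np.interp(freqs, warped, mag)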
2.
《Computer Speech and Language》2014,28(5):1209-1232
This paper presents a study on the importance of short-term speech parameterizations for expressive statistical parametric synthesis. Assuming a source-filter model of speech production, the analysis is conducted over spectral parameters, here defined as features which represent a minimum-phase synthesis filter, and some excitation parameters, which are features used to construct a signal that is fed to the minimum-phase synthesis filter to generate speech. In the first part, different spectral and excitation parameters that are applicable to statistical parametric synthesis are tested to determine which ones are the most emotion dependent. The analysis is performed through two methods proposed to measure the relative emotion dependency of each feature: one based on K-means clustering, and another based on Gaussian mixture modeling for emotion identification. Two commonly used parameterizations of the short-term speech spectral envelope, the Mel cepstrum and the Mel line spectrum pairs, are utilized. As excitation parameters, the anti-causal cepstrum, the time-smoothed group delay, and band-aperiodicity coefficients are considered. According to the analysis, the line spectral pairs are the most emotion dependent parameters. Among the excitation features, the band-aperiodicity coefficients present the highest correlation with the speaker's emotion. The most emotion dependent parameters according to this analysis were selected to train an expressive statistical parametric synthesizer using a speaker and language factorization framework. Subjective test results indicate that the considered spectral parameters have a bigger impact on the synthesized speech emotion than the excitation ones.
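The abstract names the two dependency measures but not their exact formulation; below is a rough sketch of the K-means-based idea, assuming per-frame feature matrices and integer emotion labels: cluster one parameter stream and score how well the clusters align with the emotion labels.

    import numpy as np
    from sklearn.cluster import KMeans

    def emotion_dependency(features, emotion_labels, n_emotions):
        # Cluster frames of a single parameter stream into one cluster per
        # emotion, then measure cluster purity against the true labels.
        # Higher purity suggests a more emotion-dependent parameter.  This
        # is a stand-in for the paper's measure, not its definition.
        assignments = KMeans(n_clusters=n_emotions, n_init=10,
                             random_state=0).fit_predict(features)
        correct = 0
        for c in range(n_emotions):
            members = emotion_labels[assignments == c]
            if members.size:
                # Frames belonging to the dominant emotion in this cluster.
                correct += np.bincount(members).max()
        return correct / len(emotion_labels)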
3.
《Computer Speech and Language》2014,28(2):665-686
This paper describes speech intelligibility enhancement for Hidden Markov Model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the glimpse proportion – an objective measure of the intelligibility of speech in noise – increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1–4 kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.
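As a crude illustration of the constraint involved (boosting a band while holding total energy fixed), here is a hypothetical Python sketch that operates directly on a magnitude spectrum, rather than on the Mel cepstral coefficients and glimpse-proportion objective used in the paper; the band edges and gain are arbitrary.

    import numpy as np

    def boost_band_energy_fixed(mag, fs, lo_hz=1000.0, hi_hz=4000.0, gain_db=6.0):
        # Apply a gain to one frequency band of a magnitude spectrum, then
        # rescale so the total energy is unchanged.  A simplified stand-in
        # for the glimpse-proportion-driven cepstral modification.
        freqs = np.linspace(0.0, fs / 2.0, len(mag))
        gain = np.ones_like(mag)
        band = (freqs >= lo_hz) & (freqs <= hi_hz)
        gain[band] = 10.0 ** (gain_db / 20.0)
        boosted = mag * gain
        # Renormalise so the sum of squared magnitudes is preserved.
        scale = np.sqrt(np.sum(mag ** 2) / np.sum(boosted ** 2))
        return boosted * scale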
4.
A method to improve the voicing decision using glottal activity features is proposed for statistical parametric speech synthesis. In existing methods, the voicing decision relies mostly on the fundamental frequency (F0), which may result in errors when the prediction is inaccurate. Even though F0 is a glottal activity feature, other features that characterize this activity may help in improving the voicing decision. The glottal activity features used in this work are the strength of excitation (SoE), normalized autocorrelation peak strength (NAPS), and higher-order statistics (HOS). These features are obtained from approximated source signals such as the zero-frequency filtered signal and the integrated linear prediction residual. To improve the voicing decision and to avoid a heuristic classification threshold, classifiers are trained on the glottal activity features using different statistical learning methods: the k-nearest neighbor, the support vector machine (SVM), and the deep belief network. The voicing decision works best with the SVM classifier, and its effectiveness is tested using statistical parametric speech synthesis. The glottal activity features SoE, NAPS, and HOS are modeled along with F0 and Mel-cepstral coefficients in hidden Markov model and deep neural network frameworks to obtain the voicing decision. The objective and subjective evaluations demonstrate that the proposed method improves the naturalness of synthetic speech.
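A minimal Python sketch of the classifier stage, assuming the per-frame glottal features (SoE, NAPS, HOS) and reference voiced/unvoiced labels have already been extracted; the RBF kernel and hyperparameters are illustrative assumptions, not the paper's settings.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def train_voicing_classifier(glottal_feats, voiced_labels):
        # glottal_feats: (n_frames, 3) array of [SoE, NAPS, HOS] per frame.
        # voiced_labels: (n_frames,) array of 0/1 voicing decisions.
        # A learned decision boundary replaces a hand-tuned threshold.
        clf = make_pipeline(StandardScaler(),
                            SVC(kernel="rbf", C=1.0, gamma="scale"))
        clf.fit(glottal_feats, voiced_labels)
        return clf

    # Usage on new frames:
    # voiced = train_voicing_classifier(train_x, train_y).predict(test_x)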
5.
Caroline Henton 《International Journal of Speech Technology》2002,5(2):117-131
Highest quality synthetic voices remain scarce in both parametric synthesis systems and in concatenative ones. Much synthetic speech lacks naturalness, pleasantness and flexibility. While great strides have been made over the past few years in the quality of synthetic speech, there is still much work that needs to be done. Now the major challenges facing developers are how to provide optimal size, performance, extensibility, and flexibility, together with developing improved signal processing techniques. This paper focuses on issues of performance and flexibility against a background containing a brief evolution of speech synthesis; some acoustic, phonetic and linguistic issues; and the merits and demerits of two commonly used synthesis techniques: parametric and concatenative. Shortcomings of both techniques are reviewed. Methodological developments in the variable size, selection and specification of the speech units used in concatenative systems are explored and shown to provide a more positive outlook for more natural, bearable synthetic speech. Differentiating considerations in making and improving concatenative systems are explored and evaluated. Acoustic and sociophonetic criteria are reviewed for the improvement of variable synthetic voices, and a ranking of their relative importance is suggested. Future rewards are weighed against current technical and developmental challenges. The conclusion indicates some of the current and future applications of TTS.
6.
Jordi Robert-Ribes Jean-Luc Schwartz Pierre Escudier 《Artificial Intelligence Review》1995,9(4-5):323-346
Though a large amount of psychological and physiological evidence of audio-visual integration in speech has been collected in the last 20 years, there is no agreement about the nature of the fusion process. We present the main experimental data, and describe the various models proposed in the literature, together with a number of studies in the field of automatic audiovisual speech recognition. We discuss these models in relation to general proposals arising from psychology in the field of intersensory interaction, or from the field of vision and robotics in the field of sensor fusion. Then we examine the characteristics of four main models, in the light of psychological data and formal properties, and we present the results of a modelling study on audio-visual recognition of French vowels in noise. We conclude in favor of the relative superiority of a model in which the auditory and visual inputs are projected and fused in a common representation space related to motor properties of speech objects, the fused representation being further classified for lexical access.
7.
To help more people learn about and use speech products for minority languages, to effectively overcome the language barriers between China's ethnic-minority regions and other areas, and to promote communication among ethnic groups, this study surveys domestic speech products for Uyghur, Mongolian and Tibetan that are based on speech recognition technology. A review of their development and application shows that the products developed so far are concentrated in three areas: speech input methods, speech translation software and transcription products. On this basis, the impact of using these products is analyzed and the development prospects of related speech products are discussed.
8.
In recent years, convolutional neural networks have been widely applied to speech enhancement. However, the commonly used skip-connection mechanism introduces noise components when passing feature information, which inevitably degrades denoising performance; in addition, the fixed-shape convolution kernels in common use are inefficient at handling diverse voiceprint information. Motivated by these observations, an end-to-end encoder-decoder network, CADNet, is proposed that combines a cross-dimensional collaborative attention mechanism with deformable convolution modules. Specifically, a cross-dimensional collaborative attention module is introduced into the skip connections to further strengthen information control, and a deformable convolution layer is added after each standard convolution layer to better match the natural characteristics of the voiceprint. Experiments on the public TIMIT dataset verify the effectiveness of the proposed method in terms of speech quality and intelligibility metrics.
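The abstract lists the architectural ingredients without layer-level detail; the following is a simplified PyTorch sketch of one of them, gating a skip connection with an attention mask computed jointly from encoder and decoder features. The deformable convolution layers (available as torchvision.ops.DeformConv2d) are omitted, and the module name, shapes and 1x1 mask convolution are assumptions, not the CADNet design.

    import torch
    import torch.nn as nn

    class AttentionSkip(nn.Module):
        # Gate an encoder skip connection with a mask derived from both the
        # encoder and decoder feature maps, so noisy components are
        # attenuated before concatenation.  A simplified stand-in for the
        # cross-dimensional collaborative attention module described above.
        def __init__(self, channels):
            super().__init__()
            self.mask = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, enc_feat, dec_feat):
            gate = self.mask(torch.cat([enc_feat, dec_feat], dim=1))
            return torch.cat([enc_feat * gate, dec_feat], dim=1)

    # e.g. skip = AttentionSkip(64); fused = skip(enc64, dec64)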
9.
Moataz El Ayadi, Mohamed S. Kamel 《Pattern recognition》2011,44(3):572-587
Recently, increasing attention has been directed to the study of the emotional content of speech signals, and hence, many systems have been proposed to identify the emotional content of a spoken utterance. This paper is a survey of speech emotion classification addressing three important aspects of the design of a speech emotion recognition system. The first one is the choice of suitable features for speech representation. The second issue is the design of an appropriate classification scheme and the third issue is the proper preparation of an emotional speech database for evaluating system performance. Conclusions about the performance and limitations of current speech emotion recognition systems are discussed in the last section of this survey. This section also suggests possible ways of improving speech emotion recognition systems.
10.
Speech segmentation is an indispensable foundation of speech recognition and speech synthesis, and its quality has a major impact on downstream systems. Manual segmentation and annotation are accurate but time-consuming, labor-intensive and dependent on experienced domain experts, so automatic speech segmentation has become a research focus in speech processing. This survey first reviews the current progress in automatic speech segmentation and the different ways of categorizing segmentation methods; it then covers alignment-based methods and boundary-detection-based methods, with a detailed account of neural-network segmentation approaches that can be applied within both frameworks. Emerging segmentation techniques based on biologically inspired signals and on game theory are then introduced, and the performance metrics widely used in the field are presented, compared and analyzed. Finally, the survey summarizes the field and points out important directions for future research on speech segmentation.
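As a toy example of the boundary-detection family of methods surveyed here (far simpler than the neural and alignment-based approaches the survey covers), the Python sketch below flags candidate boundaries at large jumps in short-time log energy; frame sizes and the threshold are arbitrary assumptions.

    import numpy as np

    def energy_boundaries(signal, frame_len=400, hop=160, threshold=6.0):
        # Flag candidate segment boundaries where the log frame energy
        # jumps by more than `threshold` dB between neighbouring frames.
        # Purely illustrative of the boundary-detection framework.
        n_frames = 1 + (len(signal) - frame_len) // hop
        log_e = np.array([
            10.0 * np.log10(np.sum(signal[i * hop:i * hop + frame_len] ** 2) + 1e-12)
            for i in range(n_frames)
        ])
        jumps = np.abs(np.diff(log_e))
        # Boundary positions in samples, centred on the frames involved.
        return np.where(jumps > threshold)[0] * hop + frame_len // 2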
11.
《Computer Speech and Language》2014,28(2):607-618
In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.
12.
This paper describes techniques for applying the low-cost TMS5220 speech synthesis processor in digital systems. Taking its application in a 128-channel temperature measurement and annunciation system as an example, it discusses several technical issues involved in obtaining high-quality synthesized speech sentences in a digital display system.
13.
Animated virtual human characters are a common feature in interactive graphical applications, such as computer and video games, online virtual worlds and simulations. Due to the dynamic nature of such applications, character animation must be responsive and controllable in addition to looking as realistic and natural as possible. Though procedural and physics-based animation provide a great amount of control over motion, they still look too unnatural to be of use in all but a few specific scenarios, which is why interactive applications nowadays still rely mainly on recorded and hand-crafted motion clips. The challenge faced by animation system designers is to dynamically synthesize new, controllable motion by concatenating short motion segments into sequences of different actions or by parametrically blending clips that correspond to different variants of the same logical action. In this article, we provide an overview of research in the field of example-based motion synthesis for interactive applications. We present methods for automated creation of supporting data structures for motion synthesis and describe how they can be employed at run-time to generate motion that accurately accomplishes tasks specified by the AI or human user.
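The survey spans many synthesis techniques; as one generic illustration of parametric blending between two time-aligned variants of the same logical action, the Python sketch below interpolates per-frame joint rotations stored as unit quaternions. The data layout is an assumption for the example, not taken from any particular system.

    import numpy as np

    def slerp(q0, q1, t):
        # Spherical linear interpolation between two unit quaternions.
        dot = np.dot(q0, q1)
        if dot < 0.0:            # take the shorter arc
            q1, dot = -q1, -dot
        if dot > 0.9995:         # nearly parallel: fall back to lerp
            q = q0 + t * (q1 - q0)
            return q / np.linalg.norm(q)
        theta = np.arccos(np.clip(dot, -1.0, 1.0))
        return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

    def blend_clips(clip_a, clip_b, weight):
        # clip_a, clip_b: (n_frames, n_joints, 4) unit-quaternion arrays for
        # two time-aligned variants of the same action; weight in [0, 1].
        # Returns a new clip interpolated joint-by-joint.
        out = np.empty_like(clip_a)
        for f in range(clip_a.shape[0]):
            for j in range(clip_a.shape[1]):
                out[f, j] = slerp(clip_a[f, j], clip_b[f, j], weight)
        return out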
14.
Machine learning of grapheme segmentation rules based on finite generalization for English speech synthesis
In English speech synthesis, English has an almost unlimited vocabulary, so it is impossible to build a lexicon covering every word. For English words not contained in the lexicon, automatically generating their phonetic transcription with a letter-to-phoneme (L2P) algorithm is the best solution, and the first task of L2P is grapheme segmentation. To this end, this paper proposes a finite generalization algorithm (FGA), a machine learning method for learning grapheme segmentation rules. The dictionary used for learning contains 27,040 words, 90% of which are used for rule learning and the remaining 10% for testing. Over 10-fold cross-validation, the average instance segmentation accuracy is 99.84% on the learning instances and 97.88% on the test instances, the average word segmentation accuracy is 99.72% and 96.35%, and the average number of rules is 472.
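The FGA rule learner itself is not reproduced in the abstract; the Python sketch below only illustrates how a learned grapheme inventory might be applied with greedy longest-match segmentation and scored for word-level accuracy. The example grapheme set and test pair are hypothetical.

    def segment(word, graphemes):
        # Greedy longest-match segmentation of a word into graphemes.
        # `graphemes` is the inventory produced by some rule learner;
        # single letters are always allowed as a fallback.
        out, i = [], 0
        while i < len(word):
            for size in range(min(4, len(word) - i), 0, -1):
                if word[i:i + size] in graphemes or size == 1:
                    out.append(word[i:i + size])
                    i += size
                    break
        return out

    def word_accuracy(test_pairs, graphemes):
        # test_pairs: list of (word, reference segmentation) tuples.
        hits = sum(segment(w, graphemes) == ref for w, ref in test_pairs)
        return hits / len(test_pairs)

    # Hypothetical example:
    # graphemes = {"ph", "th", "ch", "ee", "oo"}
    # word_accuracy([("cheese", ["ch", "ee", "s", "e"])], graphemes)  # -> 1.0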
15.
《Computer Speech and Language》2014,28(2):687-707
Hypo and hyperarticulation refer to the production of speech with, respectively, a reduction and an increase of articulatory effort compared to the neutral style. Produced consciously or not, these variations of articulatory effort depend upon the surrounding environment, the communication context and the motivation of the speaker with regard to the listener. The goal of this work is to integrate hypo and hyperarticulation into speech synthesizers, so that they become more realistic by automatically adapting their way of speaking to the contextual situation, as humans do. Based on our preliminary work, this paper provides a thorough and detailed study on the analysis and synthesis of hypo and hyperarticulated speech. It is divided into three parts. In the first one, we focus on both the acoustic and phonetic modifications due to articulatory effort changes. The second part aims at developing an HMM-based speech synthesizer allowing a continuous control of the degree of articulation. This requires first tackling the issue of speaking style adaptation in order to derive hypo and hyperarticulated speech from the neutral synthesizer. Once this is done, an interpolation and extrapolation of the resulting models makes it possible to finely tune the voice so that it is generated with the desired articulatory effort. Finally, the third part focuses on a perceptual study of speech with a variable articulation degree, in which we analyze how intelligibility and various other voice dimensions are affected.
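The paper interpolates and extrapolates adapted HMMs; the Python sketch below reduces that idea to its simplest form, a weighted combination of mean-vector arrays for the hypo- and hyperarticulated models. A real implementation would also interpolate covariances, mixture weights and duration models, so this is only an assumption-laden illustration.

    def articulation_control(hypo_means, hyper_means, alpha):
        # alpha = 0 -> hypoarticulated model means, alpha = 1 ->
        # hyperarticulated means; values outside [0, 1] extrapolate beyond
        # the recorded styles, giving a continuous articulation control.
        # hypo_means and hyper_means are NumPy arrays of matching shape.
        return (1.0 - alpha) * hypo_means + alpha * hyper_means

    # e.g. generate speech from articulation_control(mu_hypo, mu_hyper, 1.3)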
16.
《Ergonomics》2012,55(11):1036-1051
The objective of this study was to define the quantitative relationship between external dynamic shoulder torques and calibrated perceived muscular effort levels for load delivery tasks, for application in job analyses. Subjects performed a series of loaded reaches and, following each exertion, rated their perceived shoulder muscular effort. Motion and task physical requirements data were processed with a biomechanical upper extremity model to calculate external dynamic shoulder torques. Calculated torque values were then statistically compared to reported calibrated perceived muscular effort scores. Individual subject torque profiles were significantly positively correlated with perceived effort scores (r2 = 0.45–0.77), with good population agreement (r2 = 0.50). The accuracy of the general regression model improved (r2 = 0.72) with inclusion of factors specific to task geometry and individual subjects. This suggests two major conclusions: 1) the perception of muscular shoulder effort integrates several factors, and this interplay should be considered when evaluating tasks for their impact on the shoulder region; 2) the torque/perception relationship may be usefully leveraged in job design and analysis.
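A minimal Python sketch of the kind of analysis reported, regressing calibrated perceived-effort ratings on model-computed shoulder torque and reading off r-squared; the array shapes and variable names are assumptions.

    from sklearn.linear_model import LinearRegression

    def torque_effort_r2(torques, effort_ratings):
        # torques: (n_trials, 1) external shoulder torques from the
        # biomechanical model; effort_ratings: (n_trials,) calibrated
        # perceived-effort scores.  Returns the r-squared of a simple
        # linear fit, analogous to the per-subject correlations reported.
        model = LinearRegression().fit(torques, effort_ratings)
        return model.score(torques, effort_ratings)

    # Adding task-geometry and subject factors as extra columns of the
    # design matrix mirrors the improved general model (r2 ~ 0.72).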
17.
《Ergonomics》2012,55(9):991-999
Rockfall accidents are an important determinant of safety performance in deep level mining. Previous research, reviewed here, suggests that a lack of skill in perceiving warnings of imminent danger may be an important factor contributing to this kind of accident. The perceptual task of mine workers is described briefly and a survey of the visual characteristics of dangerous rock is reported upon. The visual search performance of mine workers in a stope simulator, designed on the basis of the survey results, was studied. The findings show not only that novice mine workers, when compared with experienced men, lack the ability to search adequately in a simulation of dangerous rock, but also that their search skills can be improved significantly by training.
18.
19.
20.
In speech synthesis research, emotional speech synthesis is a current hot topic. Among the many factors studied, building an appropriate prosody model and selecting good prosodic parameters are key, and how correctly they are described directly affects the output quality of emotional speech synthesis. To tackle the difficult problem of improving the naturalness of emotional speech, the prosodic parameters that affect emotional speech synthesis are analyzed and a fundamental-frequency prosody model for emotional speech based on association rules is built. By studying association rules and improving the Apriori data mining algorithm, this paper derives rules for fundamental-frequency variation among the prosodic parameters, providing guidance for unit selection in emotional speech synthesis.
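The paper's improved Apriori variant is not specified in the abstract; the Python sketch below is a plain frequent-itemset pass over discretized prosodic attributes (the transaction encoding is assumed), from which association rules describing F0 behaviour could subsequently be derived.

    def apriori_frequent_itemsets(transactions, min_support):
        # transactions: list of sets of discretised prosodic attributes,
        # e.g. {"emotion=angry", "stress=high", "f0_slope=rising"}.
        # Returns {frozenset: support} for all itemsets meeting min_support.
        # Plain Apriori, not the paper's improved variant.
        n = len(transactions)
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result, k = dict(frequent), 2
        while frequent:
            # Join step: candidate k-itemsets from frequent (k-1)-itemsets.
            candidates = {a | b for a in frequent for b in frequent
                          if len(a | b) == k}
            counts = {c: sum(1 for t in transactions if c <= t)
                      for c in candidates}
            frequent = {s: cnt / n for s, cnt in counts.items()
                        if cnt / n >= min_support}
            result.update(frequent)
            k += 1
        return result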