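We propose an objective intelligibility measure that correlates highly with subjective ratings. Conventional intelligibility measures based on frequency-domain segmental SNR correlate poorly with subjective ratings because they do not compute spectral attenuation distortion and spectral amplification distortion separately. To overcome this shortcoming, the enhanced speech is decomposed into three parts: attenuation distortion, amplification distortion below 6.02 dB, and amplification distortion above 6.02 dB. A frequency-domain SNR is computed for each part, and the three distortion values are combined by multiple linear regression so as to maximize the correlation with subjective intelligibility scores. Experiments show that the correlation between the sentence intelligibility predicted by this method and subjective ratings reaches 0.91.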
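
The three-way decomposition in this abstract can be illustrated with a minimal sketch. It assumes that per-bin magnitude spectra of the clean and enhanced speech are already available, and it uses the fact that 6.02 dB corresponds to a factor of 2 in amplitude. The function names and the exact SNR formula are illustrative assumptions, not the authors' implementation, and the regression step is not shown.

```python
import math

def split_regions(clean_mag, enhanced_mag):
    """Assign each bin index to one of three distortion regions."""
    regions = {"attenuation": [], "amp_below_6dB": [], "amp_above_6dB": []}
    for i, (x, y) in enumerate(zip(clean_mag, enhanced_mag)):
        if y <= x:
            regions["attenuation"].append(i)      # enhanced bin is attenuated
        elif y <= 2.0 * x:                        # 20*log10(2) is about 6.02 dB
            regions["amp_below_6dB"].append(i)
        else:
            regions["amp_above_6dB"].append(i)
    return regions

def region_snr_db(clean_mag, enhanced_mag, indices, eps=1e-12):
    """Frequency-domain SNR (dB) over the given bins (assumed formula)."""
    signal = sum(clean_mag[i] ** 2 for i in indices)
    error = sum((clean_mag[i] - enhanced_mag[i]) ** 2 for i in indices)
    return 10.0 * math.log10((signal + eps) / (error + eps))

clean = [1.0, 1.0, 1.0]
enhanced = [0.5, 1.5, 3.0]
regions = split_regions(clean, enhanced)
snr_attenuation = region_snr_db(clean, enhanced, regions["attenuation"])
```

In a full measure, the three per-region SNR values would serve as predictors in a regression against subjective scores.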

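Conventional speech evaluation measures such as SNR correlate poorly with speech intelligibility. Studies have shown that different parts of speech contribute differently to intelligibility, and that voiced onset segments have a comparatively large influence. We propose a speech evaluation measure with a relatively higher correlation with intelligibility: before the segmental SNR is computed, the speech segments are screened and the onset segments are selected. The intelligibility predicted by the proposed method is compared with subjective scores. The experiments show that incorporating a speech onset detection algorithm raises the correlation with subjective ratings by 0.11 for consonants and 0.06 for sentences, which also indirectly supports the finding that speech onsets have a large influence on intelligibility.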
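
Restricting segmental SNR to onset frames can be sketched as below. The onset detector here is a deliberately simple stand-in (a frame counts as an onset when its energy rises by at least `rise_db` over the previous frame). That rule, the clamping range of -10 to 35 dB, and all names are assumptions for illustration and not the detector used in the paper.

```python
import math

def frame_energies(signal, frame_len):
    """Energy of each non-overlapping frame."""
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def onset_frames(energies, rise_db=6.0, eps=1e-12):
    """Indices of frames whose energy rises sharply (hypothetical onset rule)."""
    onsets = []
    for k in range(1, len(energies)):
        rise = 10.0 * math.log10((energies[k] + eps) / (energies[k - 1] + eps))
        if rise >= rise_db:
            onsets.append(k)
    return onsets

def segmental_snr_db(clean, processed, frame_len, frames,
                     lo=-10.0, hi=35.0, eps=1e-12):
    """Mean of clamped per-frame SNRs over the selected frames only."""
    values = []
    for k in frames:
        start = k * frame_len
        c = clean[start:start + frame_len]
        p = processed[start:start + frame_len]
        sig = sum(x * x for x in c)
        err = sum((x - y) ** 2 for x, y in zip(c, p))
        snr = 10.0 * math.log10((sig + eps) / (err + eps))
        values.append(min(hi, max(lo, snr)))
    return sum(values) / len(values) if values else float("nan")

clean = [0.01] * 4 + [1.0] * 8          # quiet frame, then two loud frames
processed = [0.9 * x for x in clean]    # uniformly attenuated copy
onsets = onset_frames(frame_energies(clean, 4))
score = segmental_snr_db(clean, processed, 4, onsets)
```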

Against a background of incorporating a talking head into a role-playing simulator, enhancements are proposed for users of the simulator and of text-to-speech systems in general. The first is the ability to generate vocal emotion in synthetic speech using a limited number of prosodic parameters with a concatenative speech synthesizer. The second enhancement allows for vocal emotions to be included during the authoring of text for output by the text-to-speech system. Vocal emotions can be represented visually, and can be manipulated directly by the user. Applications such as training simulators that use synthetic speech can be made more human by the addition of emotions. A graphical editor for specifying and directly manipulating the speech improves the authoring environment of these applications.

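Speech intelligibility is an important attribute of a speech signal. Building on the normalized covariance measure (NCM), the speech signal is segmented using the relative root-mean-square (RMS) level as a threshold, and intelligibility is evaluated separately for the segments above and below the RMS level. A new intelligibility evaluation model is also proposed that combines the relative contributions of these two kinds of segments to evaluate intelligibility jointly. Experiments show that high-RMS segments contribute more to intelligibility than low-RMS segments, and that recombining the evaluation results of the two kinds of segments with the new model significantly improves evaluation performance.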

In the future, speech will increasingly become the preferred medium of communication between man and machine. For this to be achieved successfully, attention must be given to the structuring of man-machine speech dialogues. The present paper outlines a preliminary classification of such dialogues in terms of 'dialogue acts' and 'subdialogues'. The general features necessary for an interactive information service, using an isolated word recogniser and synthetic speech output, are suggested and a hierarchy of appropriate modes of data entry is proposed.

The Internet has rarely been used in auditory perception studies due to concerns about standardisation and calibration across different systems and settings. However, not all auditory research is based on the investigation of fine-grained differences in auditory thresholds. Where meaningful ‘real-world’ listening, for instance the perception of speech, is concerned, the Internet may be a more appropriate and ecologically valid setting to collect data. This study compared affective ratings of low-pass-filtered infant-, foreigner- and British adult-directed speech obtained with traditional methods in the laboratory, with those obtained from an Internet sample. Dropout rates and demographic distribution of participants in the Internet condition were also assessed. The results show that affective ratings were similar for both the Internet and laboratory samples. These findings indicate the viability of Internet-based research into affective speech perception and suggest that precise acoustic environmental control may not always be necessary.

Acoustic modifications of loudspeaker announcements were investigated in a simulated aircraft cabin to improve passengers’ speech intelligibility and quality of communication in this specific setting. Four experiments with 278 participants in total were conducted in an acoustic laboratory using a standardised speech test and subjective rating scales. In experiments 1 and 2 the sound pressure level (SPL) of the announcements was varied (ranging from 70 to 85 dB(A)). Experiments 3 and 4 focused on frequency modification (octave bands) of the announcements. All studies used a background noise with the same SPL (74 dB(A)), but recorded at different seat positions in the aircraft cabin (front, rear). The results quantify speech intelligibility improvements with increasing signal-to-noise ratio and amplification of particular octave bands, especially the 2 kHz and the 4 kHz band. Thus, loudspeaker power in an aircraft cabin can be reduced by using appropriate filter settings in the loudspeaker system.

Noise abatement in office environments often focuses on the reduction of background speech intelligibility and noise level, as attainable with frequency-specific insulation. However, only limited empirical evidence exists regarding the effects of reducing speech intelligibility on cognitive performance and subjectively perceived disturbance. Three experiments tested the impact of low background speech (35 dB(A)) of both good and poor intelligibility, in comparison to silence and highly intelligible speech not lowered in level (55 dB(A)). The disturbance impact of the latter speech condition on verbal short-term memory (n = 20) and mental arithmetic (n = 24) was significantly reduced during soft and poorly intelligible speech, but not during soft and highly intelligible speech. No effect of background speech on verbal-logical reasoning performance (n = 28) was found. Subjective disturbance ratings, however, were consistent over all three experiments with, for example, soft and poorly intelligible speech rated as the least disturbing speech condition but still disturbing in comparison to silence. It is concluded, therefore, that a combination of objective performance tests and subjective ratings is desirable for the comprehensive evaluation of acoustic office environments and their alterations.

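In HMM-based speech synthesis, a post-filter with fixed parameters cannot adapt to spectra with different degrees of distortion, which lowers the naturalness of the synthesized speech. To address this, an improved speech synthesis algorithm based on adaptive post-filter parameters is proposed. The method adaptively selects the optimal short-term filter parameters according to the flatness of the speech spectrum in order to enhance the formant regions of the synthesized spectrum, and uses a long-term post-filter to optimize the pitch harmonic structure of the synthesized speech, reducing discontinuities in its fundamental frequency. Simulation results show that the method effectively reduces spectral over-smoothing, and subjective tests show that the naturalness of the synthesized speech is improved.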
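
The abstract does not say how spectral flatness is computed. One common definition, assumed here, is the ratio of the geometric mean to the arithmetic mean of the power spectrum, which approaches 1 for a flat spectrum and 0 for a strongly peaked one. The sketch below shows only this measure; how the paper maps flatness to post-filter parameters is not specified, so no mapping is attempted.

```python
import math

def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    n = len(power_spectrum)
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    geometric = math.exp(log_mean)
    arithmetic = sum(power_spectrum) / n + eps
    return geometric / arithmetic

flat = spectral_flatness([2.0] * 8)                    # over-smoothed, flat spectrum
peaky = spectral_flatness([100.0, 0.01, 0.01, 0.01])   # one strong formant-like peak
```

Under this definition, an over-smoothed synthetic spectrum would score closer to 1 than a natural spectrum with clear formant peaks.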

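Single-channel speech enhancement algorithms obtain enhanced speech by estimating and suppressing the noise component of noisy speech. However, noise estimation algorithms tend to overestimate, so that some of the estimated noise energy values are larger than the actual values. Although these overestimates can be removed by compensation, the errors this introduces also lower the overall quality of the enhanced speech. To address this problem, a time-frequency mask estimation and optimization algorithm based on computational auditory scene analysis (CASA) is proposed. First, the a priori signal-to-noise ratio (SNR) is estimated with the decision-directed (DD) algorithm and an initial mask is computed. Second, the cross-correlation (ICC) coefficient between the noise and the noisy speech within Gammatone frequency bands is used to compute the noise presence probability, which is combined with the energy spectrum of the noisy speech to obtain a new noise estimate with fewer overestimated components. Third, an optimization algorithm iterates on the initial mask to reduce the errors caused by noise overestimation and to increase its target speech content; the iteration stops once the stopping condition is met, yielding a new mask. Finally, the enhanced speech is synthesized with the new mask. Experiments show that, under different background noises, the new mask gives the enhanced speech higher perceptual speech quality (PESQ) and speech intelligibility (STOI) scores than before optimization, improving both the perceived quality and the intelligibility of the speech.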
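
The first step of this pipeline, the decision-directed a priori SNR estimate, can be sketched for a single frequency bin across frames. This is a minimal sketch under stated assumptions: the noise power is taken as known and constant, the initial mask is assumed to be a Wiener-style gain xi / (1 + xi), and the smoothing constant 0.98 and all names are illustrative rather than taken from the paper. The later steps (noise presence probability, mask iteration) are not shown.

```python
def decision_directed_mask(noisy_power, noise_power, a=0.98):
    """Per-frame gains for one frequency bin using the decision-directed rule."""
    prev_clean_power = 0.0   # clean-speech power estimate from the previous frame
    gains = []
    for p in noisy_power:
        gamma = p / noise_power                       # a posteriori SNR
        xi = (a * prev_clean_power / noise_power
              + (1.0 - a) * max(gamma - 1.0, 0.0))    # a priori SNR estimate
        g = xi / (1.0 + xi)                           # Wiener-style gain (assumed mask)
        prev_clean_power = g * g * p
        gains.append(g)
    return gains

noise_only = decision_directed_mask([1.0, 1.0, 1.0], 1.0)
speech_like = decision_directed_mask([11.0, 11.0], 1.0)
```

The recursion makes the gain build up gradually when speech persists across frames, which is the smoothing behaviour usually cited as the reason for using the decision-directed estimate.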

Though a large amount of psychological and physiological evidence of audio-visual integration in speech has been collected in the last 20 years, there is no agreement about the nature of the fusion process. We present the main experimental data, and describe the various models proposed in the literature, together with a number of studies in the field of automatic audiovisual speech recognition. We discuss these models in relation to general proposals arising from psychology on intersensory interaction, or from vision and robotics on sensor fusion. Then we examine the characteristics of four main models, in the light of psychological data and formal properties, and we present the results of a modelling study on audio-visual recognition of French vowels in noise. We conclude in favor of the relative superiority of a model in which the auditory and visual inputs are projected and fused in a common representation space related to motor properties of speech objects, the fused representation being further classified for lexical access.

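The naturalness of synthesized speech still needs improvement. Based on the current state of research, this paper proposes an objective method for evaluating the naturalness of synthesized speech. Starting from the main parameters of prosodic features, the method computes the differences in fundamental frequency, duration, intensity, and other parameters between natural and synthesized speech from the same speaker. Because the F0 contours of the two kinds of speech are not aligned in time, the DTW (Dynamic Time Warping) algorithm is used to time-warp and align them. The computed results are then compared with subjective mean opinion scores (MOS). The experimental data show that the proposed F0-contour distortion measure correlates strongly with MOS, and that an evaluation from the perspective of prosodic features can gauge the naturalness of synthesized speech.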
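
The DTW alignment of two F0 contours mentioned in this abstract can be sketched with the textbook dynamic-programming recurrence. The absolute difference in Hz as the local cost and the unconstrained step pattern are assumptions. The paper's actual distortion measure may normalize or weight the path differently.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best DTW alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local F0 difference
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

natural_f0 = [100.0, 120.0, 140.0]
synthetic_f0 = [100.0, 100.0, 120.0, 140.0]   # same contour, first value held longer
stretched = dtw_distance(natural_f0, synthetic_f0)
shifted = dtw_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A contour that differs only in timing aligns at zero cost, so what remains after warping reflects differences in the F0 values themselves.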


Machine induction has been extensively used in order to develop knowledge bases for decision support systems and predictive systems. The extent to which developers and domain experts can comprehend these knowledge structures and gain useful insights into the basis of decision making has become a challenging research issue. This article examines the knowledge structures generated by the C4.5 induction technique in a fault diagnostic task and proposes to use a model of human learning in order to guide the process of making the results of machine induction comprehensible. The model of learning is used to generate hierarchical representations of diagnostic knowledge by adjusting the level of abstraction and varying the goal structures between 'shallow' and 'deep' ones. Comprehensibility is assessed in a global way in an experimental comparison where subjects are required to acquire the knowledge structures and transfer them to new tasks. This method of addressing the issue of comprehensibility appears promising, especially for machine induction techniques that are rather inflexible with regard to the number and sorts of interventions allowed to system developers.

Hypo- and hyperarticulation refer to the production of speech with, respectively, a reduction and an increase of articulatory effort compared to the neutral style. Produced consciously or not, these variations of articulatory effort depend upon the surrounding environment, the communication context and the motivation of the speaker with regard to the listener. The goal of this work is to integrate hypo- and hyperarticulation into speech synthesizers, so that they are more realistic by automatically adapting their way of speaking to the contextual situation, as humans do. Based on our preliminary work, this paper provides a detailed study of the analysis and synthesis of hypo- and hyperarticulated speech. It is divided into three parts. In the first, we focus on both acoustic and phonetic modifications due to changes in articulatory effort. The second part aims at developing an HMM-based speech synthesizer allowing continuous control of the degree of articulation. This requires first tackling the issue of speaking style adaptation to derive hypo- and hyperarticulated speech from the neutral synthesizer. Once this is done, interpolation and extrapolation of the resulting models make it possible to finely tune the voice so that it is generated with the desired articulatory effort. Finally, the third part focuses on a perceptual study of speech with a variable degree of articulation, analyzing how intelligibility and various other voice dimensions are affected.

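An adaptive wavelet filtering algorithm based on the characteristics of human hearing is proposed. The method decomposes the noisy speech signal with an auditory perceptual wavelet transform, which preserves the auditory characteristics of the signal's frequency and amplitude, and feeds the noise components separated by the transform to an adaptive filter. A recursive least squares (RLS) algorithm is used to achieve optimal filtering for separating signal from noise, so that correlated noise in the signal is removed. The results show that the method achieves optimal estimation of the noise component and the useful signal within the same frequency band for non-stationary signals, and improves the clarity and intelligibility of the speech.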
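
The RLS stage of this method can be illustrated with a standard RLS adaptive noise canceller. The sketch assumes a noise reference signal is already available (in the paper it comes from the perceptual wavelet decomposition, which is not reproduced here). The filter order, forgetting factor, initialization constant, and all names are illustrative assumptions.

```python
import math

def rls_filter(reference, primary, order=2, lam=0.99, delta=100.0):
    """Standard RLS: adapt weights so that w . u tracks the noise in `primary`.

    Returns the final weights and the a priori error sequence, which is the
    cleaned output in a noise-cancellation setup.
    """
    w = [0.0] * order
    # Inverse correlation matrix, initialised to delta * I
    P = [[delta if i == j else 0.0 for j in range(order)] for i in range(order)]
    errors = []
    for n in range(len(primary)):
        # Tap vector: current and past reference samples (zero before start)
        u = [reference[n - k] if n - k >= 0 else 0.0 for k in range(order)]
        Pu = [sum(P[i][j] * u[j] for j in range(order)) for i in range(order)]
        denom = lam + sum(u[i] * Pu[i] for i in range(order))
        gain = [x / denom for x in Pu]
        e = primary[n] - sum(w[i] * u[i] for i in range(order))   # a priori error
        w = [w[i] + gain[i] * e for i in range(order)]
        uP = [sum(u[i] * P[i][j] for i in range(order)) for j in range(order)]
        P = [[(P[i][j] - gain[i] * uP[j]) / lam for j in range(order)]
             for i in range(order)]
        errors.append(e)
    return w, errors

# Synthetic check: the primary channel contains only filtered reference noise.
ref = [math.sin(0.7 * n) + 0.3 * math.cos(2.1 * n) for n in range(200)]
prim = [0.5 * ref[n] - 0.2 * (ref[n - 1] if n > 0 else 0.0) for n in range(200)]
weights, residual = rls_filter(ref, prim)
```

In this synthetic case the filter should recover the two coefficients that generated the primary channel, leaving a residual close to zero.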

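To make minority-language speech products known to and used by more people, to effectively address the language barriers between China's ethnic minority regions and other regions, and to promote communication among ethnic groups, this survey collects material on domestic speech products for Uyghur, Mongolian, and Tibetan that are based on speech recognition technology and reviews their development and application. It finds that the products developed so far fall mainly into three categories: voice input methods, speech translation software, and transcription products. On this basis, the impact of using these products is analyzed and the prospects for such speech products are discussed.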

The aim of this study was to determine the effects of three different sound environments on the performance of cognitive tasks of varying complexity. These three sound environments were ‘speech’, ‘masked speech’ and ‘continuous noise’. They corresponded to poor, acceptable and perfect acoustical privacy in an open-plan office, respectively. The speech transmission indices were 0.00, 0.30 and 0.80, respectively. Sound environments were presented at 48 dBA. The laboratory experiment on 36 subjects lasted for 4 h for each subject. Proofreading performance deteriorated in the ‘speech’ environment (p < 0.05) compared to the other two sound environments. Reading comprehension and computer-based tasks (simple and complex reaction time, subtraction, proposition, Stroop and vigilance) remained unaffected. Subjects assessed ‘speech’ as the most disturbing, most disadvantageous and least pleasant environment (p < 0.01). ‘Continuous noise’ annoyed the least. Subjective arousal was highest in ‘masked speech’ and lowest in ‘continuous noise’ (p < 0.05). Performance in real open-plan offices could be improved by reducing speech intelligibility, e.g. by attenuating speech level and using an appropriate masking environment.

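Speech intelligibility is a very important aspect of evaluating the performance of speech masking systems in open-plan offices, and the objective measure commonly used at present is the STI. A signal obtained by reversing a speech signal in time within frames of a given length is called time-reversed speech, and it has been used as an effective masking signal. For speech masked by stationary noise, the STI correlates well with subjectively assessed intelligibility, but studies have found that the STI is not suitable for estimating the intelligibility of speech masked by time-reversed speech. This paper analyzes the STI, PESQ, and mNCM objective measures and evaluates them experimentally. The results show that PESQ and mNCM still estimate intelligibility well for speech masked by reversed speech. Based on the objective results, the paper further compares how the parameters of the reversed-speech masking algorithm (reversal frame length and signal-to-noise ratio) affect intelligibility, and finds that increasing the reversal frame length and lowering the SNR lead to lower intelligibility.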
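
Time-reversed speech as defined in this abstract (reversal within frames of a fixed length) can be sketched in a few lines. The function name and the handling of a final partial frame, which is simply reversed as well, are assumptions.

```python
def locally_time_reverse(samples, frame_len):
    """Reverse the sample order inside each consecutive frame of frame_len samples."""
    out = []
    for start in range(0, len(samples), frame_len):
        out.extend(reversed(samples[start:start + frame_len]))
    return out

masker = locally_time_reverse([1, 2, 3, 4, 5, 6, 7], 3)
```

The frame length passed here corresponds to the reversal frame length that the abstract identifies as one of the two parameters affecting intelligibility.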

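To address the shortcomings of conventional spectral subtraction, an improved spectral subtraction method based on joint maximum a posteriori (MAP) estimation is proposed. Conventional spectral subtraction takes the magnitude difference between the noisy speech and the noise and reuses the phase of the noisy speech to reconstruct the speech signal. The subtraction produces "musical noise", and the inaccurate phase estimate leads to poor enhancement at low SNR. The proposed method therefore introduces multi-band spectral subtraction and phase estimation: the spectrum is divided into sub-bands and spectral subtraction is performed in each, which effectively reduces the impact of musical noise. In addition, a MAP-based phase estimator is constructed that couples the signal magnitude and phase functions and obtains the phase estimate through repeated alternating iterations. Experiments show that, compared with conventional spectral subtraction, the algorithm effectively improves the perceived quality and intelligibility of the enhanced speech at low SNR.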
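
The multi-band subtraction step can be sketched on magnitude spectra as follows. This is a simplified illustration, not the paper's algorithm: the band edges, the per-band over-subtraction factors, the spectral floor, and the choice to subtract magnitudes rather than powers are all assumptions, and the MAP phase estimator is not shown.

```python
def multiband_spectral_subtraction(noisy_mag, noise_mag, band_edges, alphas,
                                   floor=0.02):
    """Subtract a scaled noise estimate per band, with a spectral floor.

    band_edges: list of (start, stop) bin ranges, stop exclusive.
    alphas: over-subtraction factor for each band.
    """
    out = list(noisy_mag)
    for (lo, hi), alpha in zip(band_edges, alphas):
        for k in range(lo, hi):
            diff = noisy_mag[k] - alpha * noise_mag[k]
            # The floor avoids negative magnitudes and limits musical noise
            out[k] = max(diff, floor * noisy_mag[k])
    return out

enhanced = multiband_spectral_subtraction(
    noisy_mag=[1.0, 1.0, 0.2, 0.2],
    noise_mag=[0.1, 0.1, 0.1, 0.1],
    band_edges=[(0, 2), (2, 4)],
    alphas=[1.0, 3.0],     # subtract more aggressively in the low-SNR band
)
```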

The intelligibility of a speech output device is an important predictor of user acceptability. The Diagnostic Rhyme Test (DRT) is an ANSI standard for measuring speech intelligibility (ANSI S3.2-1989). In the DRT, respondents hear a word and choose its equivalent from two visually presented words. The two words differ only in their initial consonant (e.g., veal-feel), and the two consonants differ only in a single distinctive acoustic-phonetic feature (e.g., voicing). To define distinctive features, the DRT uses a minimal distinctive feature system, loosely based on the work of Jakobson et al. (1963) and Miller and Nicely (1955). These studies carefully analyzed natural speech errors in various noise environments. Whether or not these studies can be freely applied to alternative forced-choice tests of coded or synthesized speech is an empirical issue. In the present study, the results of a Consonant Identification (CI) task were compared to a previously conducted DRT using the same coding algorithms. The CI data indicated that the low-bit-rate coded speech yielded significantly more multifeature confusions than the uncoded speech. Moreover, the multifeature confusions could not be easily predicted from the single-feature confusions. A fundamental assumption of the DRT is that speech errors are adequately diagnosed by testing single-feature confusions. The results of the present study contradict that assumption. In conclusion, we argue that the application of the DRT (and more generally, any closed-response choice procedure) to coded or synthesized speech is questionable.

