Similar Documents
20 similar documents found (search time: 31 ms)
1.
Is computer-synthesized speech as persuasive as the human voice when presenting an argument? After completing an attitude pretest, 193 participants were randomly assigned to listen to a persuasive appeal under three conditions: a high-quality synthesized speech system (DECtalk Express), a low-quality synthesized speech system (Monologue), and a tape recording of a human voice. Following the appeal, participants completed a posttest attitude survey and a series of questionnaires designed to assess perceptions of speech qualities, perceptions of the speaker, and perceptions of the message. The human voice was generally perceived more favorably than the computer-synthesized voice, and the speaker was perceived more favorably when the voice was a human voice than when it was computer synthesized. There was, however, no evidence that computerized speech, as compared with the human voice, affected persuasion or perceptions of the message. Actual or potential applications of this research include issues that should be considered when designing synthetic speech systems.

2.
Inspired by research in acoustics, and drawing on how the human brain and ear process speech, we built a complete speech-separation model that simulates the central auditory system. First, a peripheral auditory model performs multi-band spectral analysis of the speech signal; next, a coincidence-neuron model extracts features from the signal; finally, separation is completed in a neuron model of the inferior colliculus. Most existing speech recognition methods work only in single-source, low-noise environments; this model addresses that limitation well. Experimental results show that the model can separate speech in multi-source environments and is highly robust. As research deepens, speech-separation models based on the characteristics of human hearing should find broad application.
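As an illustration of the first stage described above, here is a minimal sketch of a peripheral auditory (cochlear) analysis: a 4th-order gammatone filterbank that decomposes a signal into band-limited channels. The channel count, frequency range and ERB formulas follow the standard Glasberg-Moore convention; none of the specific parameters come from the paper.

```python
# Minimal gammatone filterbank sketch for the peripheral-auditory stage.
import numpy as np
from scipy.signal import fftconvolve

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4):
    """Impulse response of one gammatone channel centred at fc."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)  # bandwidth scaling
    return t**(order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gammatone_analysis(x, fs, n_channels=32, fmin=80.0, fmax=8000.0):
    """Decompose x into n_channels band-limited signals (cochleagram input)."""
    # centre frequencies spaced uniformly on the ERB-rate scale
    erb_lo = 21.4 * np.log10(4.37 * fmin / 1000.0 + 1.0)
    erb_hi = 21.4 * np.log10(4.37 * fmax / 1000.0 + 1.0)
    erb_pts = np.linspace(erb_lo, erb_hi, n_channels)
    fcs = (10**(erb_pts / 21.4) - 1.0) * 1000.0 / 4.37
    return np.stack([fftconvolve(x, gammatone_ir(fc, fs), mode="same") for fc in fcs])
```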

3.
This paper presents a technique to transform high-effort voices into breathy voices using adaptive pre-emphasis linear prediction (APLP). The primary benefit of this technique is that it estimates a spectral emphasis filter that can be used to manipulate the perceived vocal effort. The other benefit of APLP is that it estimates a formant filter that is more consistent across varying voice qualities. This paper describes how constant pre-emphasis linear prediction (LP) estimates a voice source with a constant spectral envelope even though the spectral envelope of the true voice source varies over time. A listening experiment demonstrates how differences in vocal effort and breathiness are audible in the formant filter estimated by constant pre-emphasis LP. APLP is presented as a technique to estimate a spectral emphasis filter that captures the combined influence of the glottal source and the vocal tract upon the spectral envelope of the voice. A final listening experiment demonstrates how APLP can be used to effectively transform high-effort voices into breathy voices. The techniques presented here are relevant to researchers in voice conversion, voice quality, singing, and emotion.
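For readers unfamiliar with the baseline that the paper improves on, the following is a minimal sketch of constant pre-emphasis LP: a fixed first-order pre-emphasis filter followed by per-frame LPC, yielding a formant-filter estimate per frame. APLP itself adapts the pre-emphasis over time; the frame length, hop and LPC order here are illustrative assumptions.

```python
# Constant pre-emphasis LP baseline: fixed pre-emphasis, then per-frame LPC.
import numpy as np
import librosa

def constant_preemphasis_lp(y, sr, alpha=0.97, order=18,
                            frame_len=0.025, hop=0.010):
    """Return per-frame LPC coefficients of the pre-emphasised signal."""
    y_pre = np.append(y[0], y[1:] - alpha * y[:-1])  # fixed pre-emphasis filter
    n, h = int(frame_len * sr), int(hop * sr)
    frames = librosa.util.frame(y_pre, frame_length=n, hop_length=h).T
    window = np.hanning(n)
    # each row: [1, a1, ..., a_order] defining the all-pole formant filter
    return np.stack([librosa.lpc(f * window, order=order) for f in frames])
```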

4.
Attending to a single voice when multiple voices are present is a challenging but common occurrence. An experiment was conducted to determine (a) whether presenting a video display of the target speaker aided speech comprehension in an environment with competing voices, and (b) whether the "ventriloquism effect" could be used to enhance comprehension, as found by Driver (1996), using ecologically valid stimuli. Participants listened for target words from videos of an actress reading while simultaneously ignoring the voices of 2 to 4 different actresses. Target-word detection declined as participants had to ignore more distracting voices; however, this decline was reduced when a video display of the target speaker was provided. Neither a signal-detection analysis of performance data nor a gaze-contingent analysis revealed a ventriloquism effect. Providing a video display of a speaker when competing voices are present improves comprehension, but obtaining the ventriloquism effect appears elusive in naturalistic circumstances. Actual or potential applications of this research include those circumstances in which a listener must filter a relevant stream of speech from among multiple, competing voices, such as air traffic control and military environments.
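The signal-detection analysis mentioned above can be summarised in a few lines: sensitivity d' is the difference between the z-transformed hit and false-alarm rates. The sketch below is a standard implementation, not the authors' code; the log-linear correction for extreme rates is an assumed convention.

```python
# Standard signal-detection computation of d' and the criterion c.
from scipy.stats import norm

def dprime(hits, misses, false_alarms, correct_rejections):
    # log-linear correction avoids infinite z-scores at rates of 0 or 1
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d = z_hit - z_fa            # sensitivity
    c = -(z_hit + z_fa) / 2.0   # response criterion
    return d, c

# e.g. target-word detection with a video display present (invented counts):
d_video, c_video = dprime(hits=85, misses=15, false_alarms=10, correct_rejections=90)
```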

5.
For individuals with severe speech impairment, accurate spoken communication can be difficult and require considerable effort. Some may choose to use a voice output communication aid (or VOCA) to support their spoken communication needs. A VOCA typically takes input from the user through a keyboard or switch-based interface and produces spoken output using either synthesised or recorded speech. The type and number of synthetic voices that can be accessed with a VOCA is often limited, and this has been implicated as a factor in rejection of the devices. Therefore, there is a need to provide voices that are more appropriate and acceptable for users. This paper reports on a study that utilises recent advances in speech synthesis to produce personalised synthetic voices for 3 speakers with mild to severe dysarthria, one of the most common speech disorders. Using a statistical parametric approach to synthesis, an average voice trained on data from several unimpaired speakers was adapted using recordings of the impaired speech of the 3 dysarthric speakers. By careful selection of the speech data and the model parameters, several exemplar voices were produced for each speaker. A qualitative evaluation was conducted with the speakers and with listeners who were familiar with each speaker. The evaluation showed that for one of the 3 speakers a voice could be created which conveyed many of his personal characteristics, such as regional identity, sex and age.
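The adaptation step described above can be illustrated with a deliberately simplified sketch: estimating one global linear transform that maps average-voice acoustic means towards the target speaker's features, in the spirit of MLLR mean adaptation. Real statistical parametric systems use per-regression-class transforms with occupancy weighting; this least-squares version is only an assumption for illustration.

```python
# Global linear (MLLR-style) mean adaptation, estimated by least squares.
import numpy as np

def estimate_global_transform(avg_means, target_feats):
    """avg_means, target_feats: paired (N, D) arrays. Returns W, b."""
    X = np.hstack([avg_means, np.ones((len(avg_means), 1))])  # append bias column
    # solve min ||X @ Wb - target_feats||^2 for the extended transform
    Wb, *_ = np.linalg.lstsq(X, target_feats, rcond=None)
    return Wb[:-1].T, Wb[-1]  # W: (D, D), b: (D,)

def adapt(avg_means, W, b):
    """Shift average-voice means towards the target speaker."""
    return avg_means @ W.T + b
```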

6.
7.
The majority of previous studies on vocal expression have been conducted on posed expressions. In contrast, we utilized a large corpus of authentic affective speech recorded from real-life voice controlled telephone services. Listeners rated a selection of 200 utterances from this corpus with regard to level of perceived irritation, resignation, neutrality, and emotion intensity. The selected utterances came from 64 different speakers who each provided both neutral and affective stimuli. All utterances were further automatically analyzed regarding a comprehensive set of acoustic measures related to F0, intensity, formants, voice source, and temporal characteristics of speech. Results first showed that several significant acoustic differences were found between utterances classified as neutral and utterances classified as irritated or resigned using a within-persons design. Second, listeners’ ratings on each scale were associated with several acoustic measures. In general the acoustic correlates of irritation, resignation, and emotion intensity were similar to previous findings obtained with posed expressions, though the effect sizes were smaller for the authentic expressions. Third, automatic classification (using LDA classifiers both with and without speaker adaptation) of irritation, resignation, and neutral performed at a level comparable to human performance, though human listeners and machines did not necessarily classify individual utterances similarly. Fourth, clearly perceived exemplars of irritation and resignation were rare in our corpus. These findings were discussed in relation to future research.
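A minimal sketch of the automatic classification step follows: an LDA classifier over utterance-level acoustic measures, cross-validated so that no speaker appears in both training and test folds. The grouping scheme and fold count are illustrative assumptions, not the study's exact protocol.

```python
# Speaker-independent LDA classification of affective labels.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GroupKFold, cross_val_score

def classify_affect(X, y, speakers):
    """X: (n_utterances, n_acoustic_measures) array of F0, intensity,
    formant, voice-source and temporal features; y: labels in
    {irritated, resigned, neutral}; speakers: speaker id per utterance."""
    clf = LinearDiscriminantAnalysis()
    cv = GroupKFold(n_splits=5)  # ensures no speaker spans train and test
    return cross_val_score(clf, X, y, groups=speakers, cv=cv).mean()
```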

8.
Wu Xing, Ji Sihui, Wang Jianjia & Guo Yike. Applied Intelligence (2022) 52(13): 14839–14852

Human beings are capable of imagining a person’s voice from his or her appearance because different people have different voice characteristics. Although researchers have made great progress in single-view speech synthesis, there are few studies on multi-view speech synthesis, especially speech synthesis from face images. On the basis of the implicit relationship between a speaker’s face image and his or her voice, we propose a multi-view speech synthesis method called SSFE (Speech Synthesis with Face Embeddings). The proposed SSFE consists of three parts: a voice encoder, a face encoder and an improved multi-speaker text-to-speech (TTS) engine. On the one hand, the voice encoder generates voice embeddings from the speaker’s speech and the face encoder extracts voice features from the speaker’s face as f-voice embeddings. On the other hand, the multi-speaker TTS engine synthesizes speech conditioned on either the voice embeddings or the f-voice embeddings. We have conducted extensive experiments to evaluate the proposed SSFE on synthesized speech quality and face-voice matching degree, in which the Mean Opinion Score of the SSFE is more than 3.7 and the matching degree is about 1.7. The experimental results show that the proposed SSFE method outperforms state-of-the-art methods in terms of speech quality and face-voice matching degree.

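A minimal PyTorch sketch of the SSFE idea follows: a face encoder is trained so that its f-voice embedding for a speaker's image lands near the voice encoder's embedding of that speaker's speech, after which either embedding can condition the multi-speaker TTS engine. The layer sizes and cosine-matching loss are illustrative assumptions, not the paper's exact architecture.

```python
# Face encoder trained to match speech-derived speaker embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceEncoder(nn.Module):
    """Maps a face image to a unit-norm 'f-voice' embedding."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, face_img):  # face_img: (B, 3, H, W)
        return F.normalize(self.net(face_img), dim=-1)

def embedding_match_loss(f_voice_emb, voice_emb):
    """Pull the face-derived embedding towards the speech-derived one."""
    return 1.0 - F.cosine_similarity(f_voice_emb, voice_emb).mean()

# quick check with random tensors standing in for real data
faces = torch.randn(4, 3, 64, 64)
voice_emb = F.normalize(torch.randn(4, 256), dim=-1)
loss = embedding_match_loss(FaceEncoder()(faces), voice_emb)
# at synthesis time the TTS engine is conditioned on the f-voice embedding
# in place of the usual speech-derived speaker embedding
```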

9.
This paper reports progress in the synthesis of conversational speech, from the viewpoint of work carried out on the analysis of a very large corpus of expressive speech in normal everyday situations. With recent developments in concatenative techniques, speech synthesis has overcome the barrier of realistically portraying extra-linguistic information by using the actual voice of a recognizable person as a source for units, combined with minimal use of signal processing. However, the technology still faces the problem of expressing paralinguistic information, i.e., the variety in the types of speech and laughter that a person might use in everyday social interactions. Paralinguistic modification of an utterance portrays the speaker's affective states and shows his or her relationship with the listener through variations in the manner of speaking, by means of prosody and voice quality. These inflections are carried on the propositional content of an utterance, and can perhaps be modeled by rule, but they are also expressed through nonverbal utterances, the complexity of which may be beyond the capabilities of many current synthesis methods. We suggest that this problem may be solved by the use of phrase-sized utterance units taken intact from a large corpus.
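The phrase-sized unit selection suggested above might look like the following sketch: given a corpus of whole-phrase recordings annotated with paralinguistic descriptors, return the phrase whose descriptor vector best matches the target. The descriptor representation and the target-cost-only search are illustrative assumptions; a real system would also weight join and context costs.

```python
# Nearest-neighbour selection of a whole-phrase unit by paralinguistic descriptors.
import numpy as np

def select_phrase_unit(target_desc, corpus_descs, corpus_wavs):
    """target_desc: (D,); corpus_descs: (N, D); corpus_wavs: list of N waveforms.
    Returns the best-matching phrase recording, used intact (no signal processing)."""
    dists = np.linalg.norm(corpus_descs - target_desc, axis=1)  # target cost only
    return corpus_wavs[int(np.argmin(dists))]
```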

10.
Can synthetic speech be used in foreign language learning in the same way as natural speech? In this paper, we evaluated synthetic speech from the learners’ point of view in order to find an answer. The results showed that learners do not perceive notable differences between synthetic and natural voices for words containing short or long vowels when trying to understand the meaning of what they hear. The data also show that synthetic utterances of full sentences are easier to understand and more acceptable to learners than synthetic utterances of isolated words. In addition, ratings of both synthetic and natural voices depend strongly on the learners’ listening comprehension abilities. We conclude that some synthetic speech with specific vowel pronunciations may be suitable for listening materials, and suggest that evaluating TTS systems by comparing synthetic speech with natural speech, and building a lexical database of synthetic speech that closely approximates natural speech, will help teachers readily use many existing CALL tools.

11.
Advanced Robotics (2013) 27(1-2): 105–120
We developed a three-dimensional mechanical vocal cord model for Waseda Talker No. 7 (WT-7), an anthropomorphic talking robot, for generating speech sounds with various voice qualities. The vocal cord model is a cover model that has two thin folds made of thermoplastic material. The model self-oscillates by airflow exhausted from the lung model and generates the glottal sound source, which is fed into the vocal tract for generating the speech sound. Using the vocal cord model, breathy and creaky voices, as well as the modal (normal) voice, were produced in a manner similar to the human laryngeal control. The breathy voice is characterized by a noisy component mixed with the periodic glottal sound source and the creaky voice is characterized by an extremely low-pitch vibration. The breathy voice was produced by adjusting the glottal opening and generating the turbulence noise by the airflow just above the glottis. The creaky voice was produced by adjusting the vocal cord tension, the sub-glottal pressure and the vibration mass so as to generate a double-pitch vibration with a long pitch interval. The vocal cord model used to produce these voice qualities was evaluated in terms of the vibration pattern as measured by a high-speed camera, the glottal airflow and the acoustic characteristics of the glottal sound source, as compared to the data for a human.
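A minimal DSP sketch of the two source qualities described: a breathy source approximated as a periodic pulse train mixed with aspiration noise, and a creaky source as a very low-F0 train with alternating pulse spacing (the double-pitch pattern). All sampling rates, F0 values and mixing levels are illustrative assumptions; the robot itself produces these sources mechanically.

```python
# Crude glottal-source approximations of breathy and creaky phonation.
import numpy as np

def pulse_train(periods, fs):
    """Place unit impulses separated by the given period lengths (seconds)."""
    out = np.zeros(int(np.sum(periods) * fs) + 1)
    t = 0.0
    for p in periods:
        out[int(t * fs)] = 1.0
        t += p
    return out

fs = 16000
breathy = pulse_train([1 / 120.0] * 60, fs)               # 120 Hz modal pulses
breathy = breathy + 0.05 * np.random.randn(len(breathy))  # aspiration noise

# creaky: ~40 Hz average with alternating long/short periods (double-pitch)
creaky = pulse_train([1 / 50.0, 1 / 33.0] * 15, fs)
```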

12.
In this study, the effect of user choice on social responses to computer-synthesized speech is investigated. Three previous findings about social responses to computer-synthesized speech (i.e., social identification, proximate source orientation, and similarity attraction) were tested using the choice paradigm. Social identification and proximate source orientation effects were found even when users had chosen a computer voice at their discretion. In addition, the primacy effect in user choice prevailed: participants were more likely to select whichever voice they heard first between the two options. The similarity attraction effect, however, was negated by the cognitive dissonance effect after user choices. The robustness of social responses, its implications for human–computer interaction, and the importance of user choice in voice-interface design are discussed.

13.
The authors present new results on concatenative segment synthesis of voice information with prosody and vocal utterance, on computer modeling of human voice signals based on joint models of the human voice source and vocal tract, and on speech signal preprocessing for automated documenting systems. The experiments show the efficiency of the proposed approaches.

14.
This paper presents an expressive voice conversion model (DeBi-HMM) as the post-processing of a text-to-speech (TTS) system for expressive speech synthesis. DeBi-HMM is named for its duration-embedded characteristic of the two HMMs modeling the source and target speech signals, respectively. Joint estimation of the source and target HMMs is exploited for spectrum conversion from neutral to expressive speech. A Gamma distribution is embedded as the duration model for each state in the source and target HMMs. Expressive style-dependent decision trees achieve prosodic conversion. The STRAIGHT algorithm is adopted for the analysis and synthesis process. A set of small-sized speech databases for each expressive style was designed and collected to train the DeBi-HMM voice conversion models. Several experiments with statistical hypothesis testing were conducted to evaluate the quality of the synthetic speech as perceived by human subjects. Compared with previous voice conversion methods, the proposed method exhibits encouraging potential in expressive speech synthesis.
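The duration-embedding idea can be illustrated with a short sketch: fit a Gamma distribution to the observed dwell times of an HMM state and use its density as the state-duration model. The example durations are invented for illustration; fixing the location parameter at zero keeps durations strictly positive.

```python
# Gamma state-duration model fitted to observed dwell times.
import numpy as np
from scipy.stats import gamma

durations = np.array([3, 5, 4, 6, 5, 7, 4, 5])   # frames spent in one state
shape, _, scale = gamma.fit(durations, floc=0)   # fix location at zero

def duration_prob(d):
    """Density of staying d frames in this state under the fitted model."""
    return gamma.pdf(d, shape, loc=0, scale=scale)
```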

15.
In the last few years, the number of systems and devices that use voice-based interaction has grown significantly. For continued use of these systems, the interface must be reliable and pleasant in order to provide an optimal user experience. However, there are currently very few studies that try to evaluate how pleasant a voice is, from a perceptual point of view, when the final application is a speech-based interface. In this paper we present an objective definition of voice pleasantness based on the composition of a representative feature subset, and a new automatic system for voice pleasantness classification and intensity estimation. Our study is based on a database of European Portuguese female voices, but the methodology can be extended to male voices or to other languages. In the objective performance evaluation the system achieved a 9.1% error rate for voice pleasantness classification and a 15.7% error rate for voice pleasantness intensity estimation.
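A minimal sketch of the two tasks such a system performs, assuming generic models rather than the paper's: a classifier for the pleasantness class and a regressor for pleasantness intensity, both trained on the selected acoustic feature subset.

```python
# Pleasantness classification plus intensity regression over acoustic features.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, SVR

def train_pleasantness_models(X, y_class, y_intensity):
    """X: (N, D) acoustic feature subset; y_class: pleasantness labels;
    y_intensity: continuous pleasantness ratings."""
    X_tr, X_te, yc_tr, yc_te, yi_tr, yi_te = train_test_split(
        X, y_class, y_intensity, test_size=0.2, random_state=0)
    clf = SVC().fit(X_tr, yc_tr)              # pleasantness class
    reg = SVR().fit(X_tr, yi_tr)              # pleasantness intensity
    err_class = 1.0 - clf.score(X_te, yc_te)  # classification error rate
    return clf, reg, err_class
```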

16.
Evidence from the study of human language understanding is presented suggesting that our ability to perceive visible speech can greatly influence our ability to understand and remember spoken language. A view of the speaker's face can greatly aid in the perception of ambiguous or noisy speech and can aid cognitive processing of speech leading to better understanding and recall. Some of these effects have been replicated using computer synthesized visual and auditory speech. Thus, it appears that when giving an interface a voice, it may be best to give it a face too.

17.
A video database of moving faces and people
We describe a database of static images and video clips of human faces and people that is useful for testing algorithms for face and person recognition, head/eye tracking, and computer graphics modeling of natural human motions. For each person there are nine static "facial mug shots" and a series of video streams. The videos include a "moving facial mug shot," a facial speech clip, one or more dynamic facial expression clips, two gait videos, and a conversation video taken at a moderate distance from the camera. Complete data sets are available for 284 subjects and duplicate data sets, taken subsequent to the original set, are available for 229 subjects.

18.
Many commercial applications use synthetic speech for conveying information. In many cases the structure of the information is hierarchical (e.g. menus). In this article, we describe the results of two experiments that examine the possibility of conveying hierarchies (a family of trees) using multiple synthetic voices. We postulate that if hierarchical structures can be conveyed using synthetic speech, then navigation in these hierarchies can be improved. In the first experiment, hierarchies containing 10 nodes, with a depth of 3 levels, were created. We used synthetic voices to represent nodes in these hierarchies. A within-subjects study (N = 12) was conducted to compare multiple synthetic voices against single synthetic voices for locating the positions of nodes in a hierarchy. Multiple synthetic voices were created by manipulating synthetic voice parameters according to a set of design principles. Results of the first experiment showed that the subjects performed the tasks significantly better with multiple synthetic voices than with single synthetic voices. To investigate the effect of multiple synthetic voices on complex hierarchies, a second experiment was conducted. A hierarchy of 27 nodes was created and a between-subjects study (N = 16) was carried out. The results of this experiment showed that the participants recalled 84.38% of the nodes accurately. Results from these studies imply that multiple synthetic voices can be effectively used to represent and provide navigation cues in interfaces structured as hierarchies.
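One way to realise "manipulating synthetic voice parameters according to a set of design principles" is sketched below using the pyttsx3 library: a node's branch selects the voice and its depth lowers the speaking rate. The specific mapping is an illustrative assumption, not the principles used in the study.

```python
# Map hierarchy position to synthetic-voice parameters with pyttsx3.
import pyttsx3

engine = pyttsx3.init()
voices = engine.getProperty("voices")

def speak_node(label, depth, branch):
    engine.setProperty("voice", voices[branch % len(voices)].id)  # branch -> voice
    engine.setProperty("rate", 200 - 25 * depth)                  # deeper -> slower
    engine.say(label)

speak_node("Inbox", depth=1, branch=0)
speak_node("Drafts", depth=1, branch=1)
engine.runAndWait()
```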

19.
The effects of message type (navigation, E-mail, news story), voice type (text-to-speech, natural human speech), and earcon cueing (present, absent) on message comprehension and driving performance were examined. Twenty-four licensed drivers (12 under 30, 12 over 65, both groups equally divided by gender) participated in the experiment. They drove the UMTRI driving simulator on a road consisting of straight sections and constant-radius curves, thus yielding two levels of low driving workload. In addition, as a control condition, data were collected while participants were parked. In all conditions, participants were presented with three types of messages. Each message was immediately followed by a series of questions to assess comprehension. Navigation messages were about 4 seconds long (about 9 words), E-mail messages were about 40 seconds long (about 100 words), and news messages were about 80 seconds long (about 225 words). For all message types, comprehension of text-to-speech messages, as determined by accuracy of response to questions and by subjective ratings, was significantly worse than comprehension of natural speech (79 versus 83 percent correct answers; 7.7/10 versus 8.6/10 subjective rating). Driving workload did not affect comprehension. Interestingly, neither the speech used (synthesized or natural) nor the message type (navigation, E-mail, news) had a significant effect on basic driving performance as measured by the standard deviations of lateral lane position and steering wheel angle.

20.
Three experiments are reported that use new experimental methods for the evaluation of text-to-speech (TTS) synthesis from the user's perspective. Experiment 1, using sentence stimuli, and Experiment 2, using discrete “call centre” word stimuli, investigated the effect of voice gender and signal quality on the intelligibility of three concatenative TTS synthesis systems. Accuracy and search time were recorded as on-line, implicit indices of intelligibility during phoneme detection tasks. It was found that both voice gender and noise affect intelligibility. Results also indicate interactions of voice gender, signal quality, and TTS synthesis system on accuracy and search time. In Experiment 3 the method of paired comparisons was used to yield ranks of naturalness and preference. As hypothesized, preference and naturalness ranks were influenced by TTS system, signal quality and voice, in isolation and in combination. The pattern of results across the four dependent variables – accuracy, search time, naturalness, preference – was consistent. Natural speech surpassed synthetic speech, and TTS system C elicited relatively high scores across all measures. Intelligibility, judged naturalness and preference are modulated by several factors and there is a need to tailor systems to particular commercial applications and environmental conditions.
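The method of paired comparisons used in Experiment 3 can be turned into ranks with a simple tally, sketched below: count each system's wins across all pairwise preference judgements and rank by win share. A Bradley-Terry model fit would be the model-based alternative; this win-count version is an assumed simplification.

```python
# Rank TTS systems from pairwise preference judgements by win share.
from collections import Counter

def rank_systems(pair_judgements):
    """pair_judgements: list of (winner, loser) system labels."""
    wins = Counter(winner for winner, _ in pair_judgements)
    totals = Counter()
    for w, l in pair_judgements:
        totals[w] += 1
        totals[l] += 1
    return sorted(totals, key=lambda s: wins[s] / totals[s], reverse=True)

ranking = rank_systems([("C", "A"), ("C", "B"), ("B", "A"), ("C", "A")])
# -> ['C', 'B', 'A']
```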
