Similar Documents
20 similar documents found.
1.
2.
Three experiments are reported that use new experimental methods for the evaluation of text-to-speech (TTS) synthesis from the user's perspective. Experiment 1, using sentence stimuli, and Experiment 2, using discrete “call centre” word stimuli, investigated the effect of voice gender and signal quality on the intelligibility of three concatenative TTS synthesis systems. Accuracy and search time were recorded as on-line, implicit indices of intelligibility during phoneme detection tasks. It was found that both voice gender and noise affect intelligibility. Results also indicate interactions of voice gender, signal quality, and TTS synthesis system on accuracy and search time. In Experiment 3 the method of paired comparisons was used to yield ranks of naturalness and preference. As hypothesized, preference and naturalness ranks were influenced by TTS system, signal quality and voice, in isolation and in combination. The pattern of results across the four dependent variables (accuracy, search time, naturalness, preference) was consistent. Natural speech surpassed synthetic speech, and TTS system C elicited relatively high scores across all measures. Intelligibility, judged naturalness and preference are modulated by several factors and there is a need to tailor systems to particular commercial applications and environmental conditions.
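The paired-comparison method in Experiment 3 derives ordinal ranks from listeners' pairwise preferences. As a minimal sketch (not the authors' actual scoring procedure), Copeland-style win counting over entirely hypothetical listener counts might look like this:

```python
from itertools import combinations

def rank_from_pairwise(systems, wins):
    """Rank items by total pairwise wins (Copeland-style scoring).

    wins[(a, b)] holds how many listeners preferred `a` over `b`
    in the paired-comparison trials for that pair.
    """
    score = {s: 0 for s in systems}
    for a, b in combinations(systems, 2):
        if wins.get((a, b), 0) > wins.get((b, a), 0):
            score[a] += 1
        elif wins.get((b, a), 0) > wins.get((a, b), 0):
            score[b] += 1
    # Highest score ranks first.
    return sorted(systems, key=lambda s: -score[s])

# Hypothetical listener counts for three TTS systems and natural speech.
systems = ["natural", "A", "B", "C"]
wins = {("natural", "A"): 18, ("A", "natural"): 2,
        ("natural", "B"): 17, ("B", "natural"): 3,
        ("natural", "C"): 14, ("C", "natural"): 6,
        ("C", "A"): 15, ("A", "C"): 5,
        ("C", "B"): 13, ("B", "C"): 7,
        ("A", "B"): 11, ("B", "A"): 9}
print(rank_from_pairwise(systems, wins))
```

With these toy counts, natural speech ranks first and system C ranks highest among the synthetic voices, mirroring the pattern the abstract reports.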

3.
Highest quality synthetic voices remain scarce in both parametric synthesis systems and in concatenative ones. Much synthetic speech lacks naturalness, pleasantness and flexibility. While great strides have been made over the past few years in the quality of synthetic speech, there is still much work that needs to be done. Now the major challenges facing developers are how to provide optimal size, performance, extensibility, and flexibility, together with developing improved signal processing techniques. This paper focuses on issues of performance and flexibility against a background containing a brief evolution of speech synthesis; some acoustic, phonetic and linguistic issues; and the merits and demerits of two commonly used synthesis techniques: parametric and concatenative. Shortcomings of both techniques are reviewed. Methodological developments in the variable size, selection and specification of the speech units used in concatenative systems are explored and shown to provide a more positive outlook for more natural, bearable synthetic speech. Differentiating considerations in making and improving concatenative systems are explored and evaluated. Acoustic and sociophonetic criteria are reviewed for the improvement of variable synthetic voices, and a ranking of their relative importance is suggested. Future rewards are weighed against current technical and developmental challenges. The conclusion indicates some of the current and future applications of TTS.

4.
5.
The ECESS consortium (European Center of Excellence in Speech Synthesis) aims to speed up progress in speech synthesis technology by providing an appropriate evaluation framework. The key element of the evaluation framework is the partition of a text-to-speech synthesis system into distributed TTS modules; text processing, prosody generation, and acoustic synthesis modules have been specified so far. Splitting the system into modules has the advantage that developers at an institution active in ECESS can concentrate their efforts on a single module and test its performance within a complete system, with the missing modules supplied by other institutions. In this way, complete TTS systems can be built from high-performance modules developed at different institutions. To evaluate the modules and connect them efficiently, a remote evaluation platform, the Remote Evaluation System (RES), built on the existing internet infrastructure, has been developed within ECESS. The RES has a client–server architecture. It consists of RES module servers, which encapsulate the developers' modules; a RES client, which sends data to and receives data from the RES module servers; and a RES server, which connects the RES module servers and organizes the flow of information. Developers can use the RES to select, over the internet, a module server that provides a TTS module they lack, in order to test and improve the performance of their own modules. Finally, the RES allows the evaluation of TTS modules running at different institutions worldwide: the institution performing the evaluation can set up and perform various evaluation tasks by sending test data via the RES client and receiving results from the RES module servers. ELDA is currently setting up an evaluation using the RES client, which will then be extended into an evaluation client specialized for the envisaged evaluation tasks.
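The RES organizes a pipeline in which text flows through text processing, prosody generation, and acoustic synthesis modules hosted by different institutions. The control flow can be mimicked locally with stub callables standing in for remote module servers; all names and behaviours below are illustrative, not the ECESS protocol:

```python
def run_tts_pipeline(text, modules):
    """Chain distributed TTS modules in the fixed ECESS stage order.

    `modules` maps a stage name to a callable standing in for a remote
    RES module server; the RES server's role here reduces to passing
    each stage's output on to the next stage.
    """
    data = text
    for stage in ("text_processing", "prosody_generation",
                  "acoustic_synthesis"):
        data = modules[stage](data)
    return data

# Stub modules, as if contributed by different institutions.
modules = {
    "text_processing": lambda t: t.lower().split(),
    "prosody_generation": lambda toks: [(w, len(w)) for w in toks],
    "acoustic_synthesis": lambda pros: f"<waveform:{len(pros)} units>",
}
print(run_tts_pipeline("Hello ECESS", modules))
```

A developer with only a prosody module could slot it into `modules` while borrowing the other two stages, which is the workflow the RES is built to support.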

6.
What makes speech produced in the presence of noise (Lombard speech) more intelligible than conversational speech produced in quiet conditions? This study investigates the hypothesis that speakers modify their speech in the presence of noise in such a way that acoustic contrasts between their speech and the background noise are enhanced, which would improve speech audibility. Ten French speakers were recorded while playing an interactive game, first in a quiet condition, then in two noisy conditions with different spectral characteristics: a broadband noise (BB) and a cocktail-party noise (CKTL), both played over loudspeakers at 86 dB SPL. Consistent with Lu and Cooke (2009b), our results suggest no systematic “active” adaptation of the whole speech spectrum or vocal intensity to the spectral characteristics of the ambient noise. Regardless of the type of noise, the gender of the speaker or the type of speech segment, the primary strategy was to speak louder in noise, with a greater adaptation in BB noise and an emphasis on vowels rather than on any type of consonant. Active strategies were evidenced, but they were subtle and secondary to the primary strategy of speaking louder: for each gender, fundamental frequency (f0) and first formant frequency (F1) were modified in cocktail-party noise in a way that optimized the release from energetic masking induced by this type of noise. Furthermore, speakers showed two additional modifications compared to shouted speech, which therefore cannot be interpreted in terms of vocal effort alone: they enhanced the modulation of their speech in f0 and vocal intensity, and they boosted their speech spectrum specifically around 3 kHz, in the region of maximum ear sensitivity associated with the actor's or singer's formant.
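The reported spectral boost around 3 kHz can be quantified by comparing the energy in that band between conditions. A minimal sketch, assuming NumPy and synthetic test tones rather than real recordings:

```python
import numpy as np

def band_energy_db(signal, fs, lo=2500.0, hi=3500.0):
    """Energy (dB) in the [lo, hi] Hz band of `signal`, via the FFT."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    return 10.0 * np.log10(spectrum[band].sum() + 1e-12)

fs = 16000
t = np.arange(fs) / fs
# Toy "quiet" signal: energy at 200 Hz only.
quiet = np.sin(2 * np.pi * 200 * t)
# Toy "Lombard" signal: same tone plus a boosted 3 kHz component.
lombard = quiet + 0.5 * np.sin(2 * np.pi * 3000 * t)
boost = band_energy_db(lombard, fs) - band_energy_db(quiet, fs)
print(f"3 kHz band boost: {boost:.1f} dB")
```

On real speech one would average such band energies over many frames per condition; the toy tones here only demonstrate the measurement.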

7.
Noise abatement in office environments often focuses on the reduction of background speech intelligibility and noise level, as attainable with frequency-specific insulation. However, only limited empirical evidence exists regarding the effects of reducing speech intelligibility on cognitive performance and subjectively perceived disturbance. Three experiments tested the impact of low background speech (35 dB(A)) of both good and poor intelligibility, in comparison to silence and highly intelligible speech not lowered in level (55 dB(A)). The disturbance impact of the latter speech condition on verbal short-term memory (n=20) and mental arithmetic (n=24) was significantly reduced during soft and poorly intelligible speech, but not during soft and highly intelligible speech. No effect of background speech on verbal-logical reasoning performance (n=28) was found. Subjective disturbance ratings, however, were consistent over all three experiments with, for example, soft and poorly intelligible speech rated as the least disturbing speech condition but still disturbing in comparison to silence. It is concluded, therefore, that a combination of objective performance tests and subjective ratings is desirable for the comprehensive evaluation of acoustic office environments and their alterations.

8.
《Ergonomics》2012,55(5):719-736
(Same abstract as item 7 above.)

9.
Speech synthesis by rule has made considerable advances and is used today in numerous text-to-speech synthesis systems. Current systems are able to synthesise pleasant-sounding voices at high intelligibility levels. However, because their synthetic speech quality is still inferior to that of fluently produced human speech, it has not found wide acceptance and has instead been restricted mainly to applications for the handicapped or to restricted tasks in telecommunications. The problems with automatic speech synthesis relate to the methods of controlling speech synthesizer models in order to mimic the varying properties of the human speech production system during discourse. In this paper, artificial neural networks are developed for the control of a formant synthesizer. A set of common words comprising larynx-produced phonemes was analysed and used to train a neural network cluster. The system was able to produce intelligible speech for certain phonemic combinations of new and unfamiliar words.

10.
A general method that combines formant synthesis by rule and time-domain concatenation is proposed. This method exploits the advantages of both techniques, maintaining naturalness while minimizing difficulties such as prosodic modification and spectral discontinuities at the point of concatenation. An integrated sampled natural glottal source (Matsui et al., 1991) and sampled voiceless consonants were incorporated into a real-time text-to-speech formant synthesizer. In special cases, voicing amplitude envelopes and formant transitions derived from natural speech were also utilized. Several listening tests were performed to evaluate these methods. We obtained a significant overall improvement in intelligibility over our previous formant synthesizer. Similar improvements in intelligibility were previously obtained with a Japanese text-to-speech system using a related hybrid system (Kamai and Matsui, 1993), indicating the applicability of this method to multilingual synthesis. The results of subjective analyses showed that these methods can also improve naturalness and listenability.

11.
English lexical stress is acoustically realized through a combination of duration, intensity, fundamental frequency (F0) and vowel quality. Errors in any or all of these correlates could interfere with production of the stress contrast, but it is unknown which correlates are most difficult for L1 Bengali speakers to acquire. This study compares the use of these correlates in the production of English lexical stress contrasts by 10 L1 English and 20 L1 Bengali speakers. The results showed that L1 Bengali speakers produced significantly less native-like stress patterns, although they used all four acoustic correlates to distinguish stressed from unstressed syllables. L1 English speakers reduced vowel duration significantly more in unstressed vowels than L1 Bengali speakers did, and the increase in intensity and F0 in stressed vowels was greater for L1 English speakers than for L1 Bengali speakers. There were also significant differences in formant patterns across speaker groups: L1 Bengali speakers produced English-like vowel reduction in certain unstressed syllables, but in other cases had a tendency either not to reduce, or to incorrectly reduce, vowels in unstressed syllables. The results suggest that L1 Bengali speakers' production of the English lexical stress contrast is influenced by their L1 language experience and L1 phonology.
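The stressed/unstressed contrast on each correlate is typically summarized as a ratio (duration, F0) or a difference (intensity in dB). A sketch with entirely hypothetical per-vowel means, not the study's data:

```python
def stress_ratios(stressed, unstressed):
    """Stressed/unstressed contrasts for three acoustic correlates.

    Each argument is a dict of per-vowel means: duration (ms),
    intensity (dB), f0 (Hz). Ratios above 1 (or a positive dB
    difference) indicate that the correlate marks stress.
    """
    return {
        "duration_ratio": stressed["duration"] / unstressed["duration"],
        "f0_ratio": stressed["f0"] / unstressed["f0"],
        "intensity_diff_db": stressed["intensity"] - unstressed["intensity"],
    }

# Hypothetical values loosely shaped like an L1 English pattern.
l1_english = stress_ratios({"duration": 180, "intensity": 68, "f0": 210},
                           {"duration": 90, "intensity": 62, "f0": 180})
print(l1_english)
```

Comparing such summaries between speaker groups (and adding formant-based vowel-quality measures) is the kind of analysis the abstract describes.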

12.
This paper describes a new Korean Text-to-Speech (TTS) system based on a large speech corpus. Conventional concatenative TTS systems still produce machine-like synthetic speech; the poor naturalness is caused by excessive prosodic modification applied to a small speech database. To cope with this problem, we utilized a dynamic unit selection method based on a large speech database without prosodic modification. The proposed TTS system adopts triphones as synthesis units. We designed a new sentence set maximizing the phonetic and prosodic coverage of Korean triphones. All the utterances were segmented automatically into phonemes using a speech recognizer. With the segmented phonemes, we assign a synthesis unit cost of zero when two synthesis units occur consecutively in an utterance. This reduces the number of concatenation points at which audible mismatches may occur. In this paper, we present data concerning the realization of major prosodic variations through a consideration of prosodic phrase break strength. Phrase breaks were divided into four strength levels based on pause length. Using phrase break strength, triphones were further classified to reflect major prosodic variations. To predict phrase break strength from text, we adopted an HMM-like Part-of-Speech (POS) sequence model, which achieved 73.5% accuracy for 4-level break strength prediction. For unit selection, a Viterbi beam search is performed to find the most appropriate triphone sequence, i.e. the one with the minimum continuation cost of prosody and spectrum at concatenation boundaries. An informal listening test showed that the proposed Korean corpus-based TTS system is more natural than the conventional demisyllable-based one.
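The unit-selection search described above can be sketched as a plain Viterbi pass (without the beam pruning), with a zero join cost for units that were adjacent in a recorded utterance. The unit representation and cost functions below are illustrative stand-ins for the paper's prosody/spectrum continuation costs:

```python
def select_units(candidates, join_cost, target_cost):
    """Viterbi search over candidate units for each target position.

    candidates: list of lists; candidates[i] are the unit instances
    available for position i. Returns the minimum-cost unit sequence.
    """
    # best[u] = (cumulative cost, best path ending in u)
    best = {u: (target_cost(u), [u]) for u in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for u in position:
            prev = min(best, key=lambda p: best[p][0] + join_cost(p, u))
            cost = best[prev][0] + join_cost(prev, u) + target_cost(u)
            new_best[u] = (cost, best[prev][1] + [u])
        best = new_best
    final = min(best, key=lambda u: best[u][0])
    return best[final][1]

# Toy triphone candidates: (utterance id, position in that utterance).
cands = [[("u1", 0), ("u2", 5)],
         [("u1", 1), ("u2", 3)],
         [("u1", 2), ("u2", 4)]]

def join_cost(a, b):
    # Zero cost when the units were adjacent in the recorded utterance.
    return 0.0 if a[0] == b[0] and b[1] == a[1] + 1 else 1.0

def target_cost(u):
    return 0.0  # uniform target cost in this toy example

print(select_units(cands, join_cost, target_cost))
```

The search correctly prefers the run of three consecutive units from utterance `u1`, which has no concatenation points at all.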

13.
14.
One of the major problems in concatenative synthesis is the occurrence of audible discontinuities between two successive concatenative units. Several studies have attempted to discover objective distance measures that predict the audibility of these discontinuities. In this paper, we investigate mid-vowel joins for three vowels with a range of post-vocalic consonant contexts typical of diphone databases. A first perceptual experiment uses a pairwise comparison procedure to find two subsets of unit combinations: those with versus those without audible discontinuities. A second perceptual experiment uses these two subsets in a procedure where formant resynthesis is used to manipulate three sources of discontinuity separately: formant frequencies, formant bandwidths, and overall energy. Results show that mismatch in formant frequencies provides the largest contribution to audible discontinuity, followed by mismatch in overall energy.
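A crude objective join measure of the kind such studies evaluate is a weighted formant-frequency mismatch at the concatenation point. A sketch with hypothetical formant values (the weighting and thresholding are illustrative, not the paper's measure):

```python
def formant_mismatch(left, right, weights=(1.0, 1.0, 1.0)):
    """Weighted absolute formant-frequency mismatch (Hz) at a join.

    `left` and `right` are (F1, F2, F3) measured at the join point of
    the two units; larger values should predict audible discontinuity.
    """
    return sum(w * abs(a - b) for w, a, b in zip(weights, left, right))

# Hypothetical mid-vowel joins: a close match vs. a poor one.
smooth = formant_mismatch((700, 1200, 2500), (710, 1215, 2490))
abrupt = formant_mismatch((700, 1200, 2500), (550, 1500, 2600))
print(smooth, abrupt)
```

Calibrating such a measure against listener judgments of which joins are audible is exactly the goal of the perceptual experiments described above.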

15.
This paper describes the design and evaluation of prosodically-sensitive concatenative units for a Korean text-to-speech (TTS) synthesis system. The diphones used are prosodically conditioned in the sense that a single conventional diphone is stored as different versions taken directly from the different prosodic domains of the prosodically labeled, read sentences. The four levels of the Korean prosodic hierarchy were observed in the diphone selection process, thereby selecting four different versions of each diphone: three edge diphones from the prosodic domains of the intonational phrase (IP), accentual phrase (AP) and prosodic word (PW), and a non-edge diphone from the domain of the prosodic word. Due to the size of the corpus that we employed, our system covers only 36.4% of the 6503 possible diphones. A listening experiment designed to evaluate the quality of the diphone database showed that listeners preferred stimuli composed of prosodically appropriate diphones. We interpret this as supporting the view that segments carry prosodic domain information.

16.
This paper presents the design and development of an unrestricted text-to-speech synthesis (TTS) system for the Bengali language. An unrestricted TTS system is capable of synthesizing good-quality speech in different domains. In this work, syllables are used as the basic units for synthesis, and the Festival framework has been used for building the TTS system. Speech collected from a female artist is used as the speech corpus. Initially, speech was collected from five speakers and a prototype TTS was built from each. The best speaker among the five was selected through subjective and objective evaluation of natural and synthesized waveforms. The unrestricted TTS system was then developed by addressing the issues involved at each stage in producing a good-quality synthesizer. Evaluation was carried out in four stages by conducting objective and subjective listening tests on the synthesized speech. At the first stage, the TTS system was built with the basic Festival framework; in the following stages, additional features were incorporated into the system and the quality of synthesis was evaluated. The subjective and objective measures indicate that the proposed features and methods improved the quality of the synthesized speech from stage 2 to stage 4.

17.
Smart speakers with voice assistants like Google Home or Amazon Alexa are increasingly popular and essential in our daily lives due to the convenience of issuing voice commands. Ensuring that these voice assistants are equitable across population subgroups is crucial. In this paper, we present the first framework, AudioAcc, for evaluating their performance on various accents. AudioAcc takes in videos from YouTube and generates composite commands. We further propose a new metric called Consistency of Results (COR) that developers can use to detect incorrect transcriptions of the produced results and rewrite the skill to improve its Word Error Rate (WER) performance. We evaluate AudioAcc on complete sentences extracted from YouTube videos; the results reveal that the composite sentences generated by AudioAcc are close to the complete sentences. Our evaluation across diverse audiences shows, first, that speech from native speakers, particularly Americans, exhibits the best WER performance, by 9.5% in comparison to speech from other native and nonnative speakers. Second, speech from American professional speakers is treated significantly more fairly and shows the best WER performance, by 8.3% in comparison to speech from German professional speakers and from German and American amateur speakers. Moreover, we show that the COR metric can help developers rewrite a skill to improve WER accuracy, which we used to improve the accuracy for the Russian accent.
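Word Error Rate, the metric AudioAcc reports, is the word-level Levenshtein distance between the reference command and the assistant's transcription, normalized by the reference length. A standard implementation (the example sentences are illustrative, not from the paper's dataset):

```python
def word_error_rate(reference, hypothesis):
    """WER via Levenshtein edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[-1][-1] / len(ref)

print(word_error_rate("turn on the living room lights",
                      "turn on the leaving room light"))
```

Here two of the six reference words are substituted, giving a WER of 1/3; comparing such scores across accent groups is the fairness analysis the abstract describes.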

18.
The paper presents theoretical support for and describes the use of a fuzzy paradigm in implementing a TTS system for the Romanian language, employing a rule-based formant synthesizer. Within the framework of classic TTS systems, we propose a new approach to improve formant trace computation, aiming to increase the perceptual quality of the synthetic speech. A fuzzy system is proposed to handle phonemes that require multiple definitions in rule-based speech synthesis. In the introductory section, we briefly present the background of the problem and our previous results in speech synthesis. In the second section, we deal with the problem of context-dependent phonemes at the letter-to-sound module level of our TTS system. We then discuss the case of the phoneme /l/ and the solution adopted to define it for different contexts. A fuzzy system is associated with each parameter (denoted F1 and F2) to implement the results of a complete analysis of the behavior of /l/; the knowledge used in implementing the fuzzy module is acquired from natural speech analysis. In the third section, we exemplify the computation of the synthesis parameters F1 and F2 of the phoneme /l/ in the context of two syllable sequences, and contrast the parameter values with those obtained from spectrogram analysis of the natural speech sequences. The last section presents the main conclusions and further research objectives.

19.
This paper studies methods for building and expanding the knowledge base required for front-end text analysis in an English speech synthesis system. Speech synthesis systems are already widely used in areas such as voice announcements, but in English multimedia teaching the occasional pronunciation errors they produce must still be resolved. Because the coverage of the built-in knowledge base is insufficient, input text currently has to be processed manually to eliminate pronunciation errors, and the speed and efficiency of manual analysis and processing restrict the application of speech synthesis systems in English teaching. Supported by an English lexical knowledge base, computer-aided text analysis can be used to filter and classify the words in the input text, identify the words or symbols that cause pronunciation errors, and, through expansion, conversion or annotation, enable a high-quality English speech synthesis system to meet the requirements of teaching and training.

20.
The level of quality that can be attained in concatenative text-to-speech (TTS) synthesis is primarily governed by the inventory of units used in unit selection. This has led to the collection of ever larger corpora in the quest for ever more natural synthetic speech. As operational considerations limit the size of the unit inventory, however, pruning is critical to removing any instances that prove either spurious or superfluous. This paper proposes a novel pruning strategy based on a data-driven feature extraction framework separately optimized for each unit type in the inventory. A single distinctiveness/redundancy measure can then address, in a consistent manner, the two different problems of outliers and redundant units. Detailed analysis of an illustrative case study exemplifies the typical behavior of the resulting unit pruning procedure, and listening evidence suggests that both moderate and aggressive inventory pruning can be achieved with minimal degradation in perceived TTS quality. These experiments underscore the benefits of unit-centric feature mapping for database optimization in concatenative synthesis.
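A distinctiveness/redundancy pruning pass removes both outliers (spurious instances) and near-duplicates (superfluous instances) among the examples of one unit type. The sketch below is a simplified 1-D stand-in for the paper's data-driven measure; features, thresholds and the greedy scheme are all illustrative:

```python
def prune_units(features, outlier_dist=5.0, min_gap=0.1):
    """Greedy pruning of one unit type's instances (1-D feature sketch).

    Drops outliers (far from the median feature value) and redundant
    near-duplicates (closer than `min_gap` to an already-kept
    instance), so one pass handles both pruning problems.
    """
    ordered = sorted(features)
    median = ordered[len(ordered) // 2]
    kept = []
    for f in ordered:
        if abs(f - median) > outlier_dist:
            continue  # spurious outlier
        if kept and f - kept[-1] < min_gap:
            continue  # superfluous near-duplicate
        kept.append(f)
    return kept

feats = [0.0, 0.01, 0.02, 0.5, 0.51, 1.0, 20.0]  # 20.0 is an outlier
print(prune_units(feats))
```

In practice each unit would be represented by a multi-dimensional feature vector optimized per unit type, with listening tests confirming that the pruned inventory does not degrade perceived quality.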


Copyright©北京勤云科技发展有限公司  京ICP备09084417号