Similar literature
1.
The quality of text-to-speech systems can be assessed effectively only through reliable and valid listening tests of overall system performance. A mean opinion scale (MOS) has been the recommended measure of synthesized speech quality [ITU-T Recommendation P.85, 1994. Telephone transmission quality subjective opinion tests. A method for subjective performance assessment of the quality of speech voice output devices]. We assessed this MOS scale and developed and tested a modified measure of speech quality that adds new items specific to text-to-speech systems. Our research was motivated by the lack of clear evidence for both the conceptual content and the psychometric properties of the MOS scale. We present conceptual arguments and empirical evidence for the reliability and validity of the modified scale. Moreover, we employ state-of-the-art psychometric techniques such as confirmatory factor analysis to provide strong tests of psychometric properties. The modified scale is better suited to appraising synthesis systems because it includes items specific to the artifacts found in synthesized speech. We believe that the speech synthesis research community will find this modified scale a better fit for listening tests of synthesized speech.
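
Since MOS recurs throughout this list, a minimal sketch of how a mean opinion score is commonly summarised may help. It assumes listener ratings on a 1-5 scale and a normal-approximation 95% confidence interval; the function name `mos_summary` and the sample ratings are hypothetical and are not taken from the paper or from ITU-T P.85.

```python
import math
import statistics


def mos_summary(ratings):
    """Return (mean, half_width) for a list of listener ratings.

    half_width is the half-width of an approximate 95% confidence
    interval (normal approximation, an assumption of this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(ratings)
    half_width = 1.96 * sd / math.sqrt(len(ratings))
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical ratings from five listeners for one system.
    mean, hw = mos_summary([4, 5, 3, 4, 4])
    print(f"MOS = {mean:.2f} +/- {hw:.2f}")
```

A single averaged score like this is what the abstract argues is insufficient on its own, which is why the authors examine the individual scale items and their psychometric properties.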

2.
This paper presents a method for separating the speech of individual speakers from the combined speech of two speakers. The main objective of this work is to demonstrate the significance of combining excitation-source-based temporal processing with short-time-spectrum-based spectral processing for the separation of speech produced by individual speakers. Speech in a two-speaker environment is collected simultaneously over two spatially separated microphones. The speech signals are first subjected to temporal processing based on excitation source information (the linear prediction residual). In temporal processing, the speech of each speaker is enhanced with respect to the other by relatively emphasizing the speech around the instants of significant excitation of the desired speaker, using a derived speaker-specific weight function. To further improve the separation, the temporally processed speech is subjected to spectral processing. This involves enhancing the regions around the pitch and harmonic peaks of short-time spectra computed from the temporally processed speech; the pitch estimate is also obtained from the temporally processed speech. The performance of the proposed method is evaluated using (i) objective quality measures: percentage of energy loss, percentage of noise residue, signal-to-noise ratio (SNR) gain and perceptual evaluation of speech quality (PESQ), and (ii) a subjective quality measure: mean opinion score (MOS). Experimental results are reported for both real and synthetic speech mixtures. The SNR gain and MOS values show that the proposed combined temporal and spectral processing method provides average improvements of 5.83% and 8.06%, respectively, over the best-performing individual temporal or spectral processing method.

3.
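This paper proposes an objective algorithm for estimating the subjective quality of speech. The algorithm computes the distortion between the original and the reconstructed speech on the basis of the Bark spectrum, and takes into account the effect of weak (low-energy) frames and noise frames on the quality assessment. The paper also gives a speech quality evaluation formula that combines Bark spectral distortion with the ratio of weak and noise frames, and compares the computed results with the mean opinion score (MOS). Numerical experiments show that the proposed enhanced Bark spectral distortion measure (IBSD) correlates strongly with MOS, can objectively evaluate the subjective quality of speech signals, and is suitable for a wide range of speech coding and speech communication systems.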
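
The abstract above reports a strong correlation between an objective distortion measure and MOS. As a rough illustration of how such a claim is typically checked, the sketch below computes the Pearson correlation coefficient between per-condition distortion values and MOS values. The function name `pearson` and the sample numbers are hypothetical; the paper's own data and exact statistic are not given in this listing.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values: one distortion score and one MOS per codec condition.
    distortion = [0.10, 0.25, 0.40, 0.60]
    mos = [4.3, 3.8, 3.1, 2.4]
    # A good distortion measure should correlate strongly and negatively with MOS.
    print(round(pearson(distortion, mos), 3))
```

For a distortion measure, a coefficient close to -1 is the desirable outcome, since higher distortion should go with lower listener ratings.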

4.
Successful operation of the Synchronous Overlap and Add (SOLA) algorithm for Time Scale Modification (TSM) of speech depends closely on the proper choice of parameters. This paper investigates the quality of time-scale-modified speech under different values of the primary parameters. Based on Mean Opinion Score (MOS) tests and the Bark Spectral Distortion (BSD) measure, proper choices of the synthesis shift (Ss) and the duration of the shift search interval (K_max) are determined experimentally. The conclusions can help in operating the SOLA algorithm for time scale modification of speech.

5.
Since it is impractical to prerecord human speech for dynamic content such as email messages and news, many commercial speech applications use recorded human speech for fixed content (e.g. system prompts) and synthetic speech for dynamic content. However, mixing human speech and synthetic speech may not be optimal from a consistency perspective. A two-condition between-participants experiment (N = 24) was conducted to compare two versions of a telephony application for Personal Information Management (PIM). In the first condition, all the system output was delivered with synthetic speech. In the second condition, users heard a mix of human speech and synthetic speech. Users managed several email and calendar tasks. Users' task performance was rated by two independent judges. Their self-ratings of task performance and attitudinal responses were also measured by means of questionnaires. Users interacting with the interface that used only synthetic speech performed the task significantly better, while users interacting with the mixed-speech interface thought they did better and had more positive attitudinal responses. A consistency framework drawn from human psychological processing is offered to explain the difference in task performance. Cognitive processing and attitudinal response are differentiated. Design implications and directions for future research are suggested.

6.
A set of perception experiments, using reiterant and lexicalised speech, was designed to perform a diagnostic of the relative implication of prosody in the segmentation and hierarchisation of speech. Both natural and synthetic intonation were evaluated. Then, two distance measures—correlation and root-mean-square distance on the acoustic parameters (F0, syllabic duration and intensity)—were applied to match the perception results. This objective vs. subjective comparison underlines which acoustic cues are used by listeners to judge the adequacy of prosody in performing a given function such as demarcation. Results can be summarized by a scale of the perceptual distance between two demarcation functions. This study also points out the ability of listeners to retrieve pertinent information on the basis of pure prosodic stimuli.

7.
Many commercial applications use synthetic speech for conveying information. In many cases the structure of the information is hierarchical (e.g. menus). In this article, we describe the results of two experiments that examine the possibility of conveying hierarchies (family of trees) using multiple synthetic voices. We postulate that if hierarchical structures can be conveyed using synthetic speech, then navigation in these hierarchies can be improved. In the first experiment, hierarchies containing 10 nodes, with a depth of 3 levels, were created. We used synthetic voices to represent nodes in these hierarchies. A within-subjects study (N = 12) was conducted to compare multiple synthetic voices against single synthetic voices for locating the positions of nodes in a hierarchy. Multiple synthetic voices were created by manipulating synthetic voice parameters according to a set of design principles. Results of the first experiment showed that the subjects performed the tasks significantly better with multiple synthetic voices than with single synthetic voices. To investigate the effect of multiple synthetic voices on complex hierarchies a second experiment was conducted. A hierarchy of 27 nodes was created and a between-subjects study (N = 16) was carried out. The results of this experiment showed that the participants recalled 84.38% of the nodes accurately. Results from these studies imply that multiple synthetic voices can be effectively used to represent and provide navigation cues in interfaces structured as hierarchies.

8.
This paper examines issues underpinning the potential move in aviation away from real speech radiotelephony (R/T) communications towards datalink communications involving text and synthetic speech communications. Using a novel air traffic control (ATC) task, two experiments are reported. Experiment 1 compared the use of speech and text while Experiment 2 compared the use of real and synthetic speech communications. Results indicated that generally there were no significant differences between speech and text communications and that either type could be used without any main effects on performance. However, a number of specific differences were observed across the different phases of the scenarios indicating that workload levels may be more varied when speech communications are used. Experiment 2 illustrated that participants placed a greater level of trust in real speech than synthetic speech, and trusted true communications more than false communications (regardless of whether they were real or synthetic voices). The findings are considered in terms of datalink initiatives for future air traffic management, the importance placed on real speech R/T communications, and the need to develop more natural synthetic speech in this application area.

9.
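An Objective Measure of the Naturalness of Synthetic Speech   Total citations: 2, self-citations: 1, citations by others: 1
The naturalness of synthetic speech still needs improvement. Based on the current state of research, this paper proposes an objective method for evaluating the naturalness of synthetic speech. The method starts from the main parameters of prosodic features and computes the differences in fundamental frequency (F0), duration, intensity and other parameters between natural speech and synthetic speech from the same speaker. Because the F0 contours of the two kinds of speech are not aligned in time, the DTW (Dynamic Time Warping) algorithm is used to time-warp and align them. Finally, the computed results are compared with the results of subjective evaluation (MOS). Experimental data show that the proposed F0 contour distortion measure correlates strongly with MOS, and that an evaluation made from the perspective of prosodic features can gauge the naturalness of synthetic speech.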
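
The DTW alignment step mentioned in this abstract can be sketched as follows. This is a textbook dynamic time warping implementation, not the authors' code: the absolute-difference local cost, the unconstrained step pattern, the path-length normalisation in `f0_distortion`, and both function names are assumptions made for illustration.

```python
def dtw(a, b):
    """Align sequences a and b with classic dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs and the local cost is the absolute difference |a[i] - b[j]|.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance a only
                cost[i][j - 1],      # advance b only
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def f0_distortion(natural_f0, synthetic_f0):
    """Mean absolute F0 difference along the DTW path (hypothetical measure)."""
    total, path = dtw(natural_f0, synthetic_f0)
    return total / len(path)


if __name__ == "__main__":
    # Hypothetical F0 contours in Hz; the second holds its first value longer.
    natural = [100, 120, 140]
    synthetic = [100, 100, 120, 140]
    print(dtw(natural, synthetic))
```

In this toy case the two contours differ only in timing, so the alignment cost is zero; a distortion measure computed without warping would wrongly penalise the timing offset as a pitch difference.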

10.
We report work on the first component of a two-stage speech recognition architecture based on phonological features rather than phones. This paper reports experiments on three phonological feature systems: (1) the Sound Pattern of English (SPE) system, which uses binary features, (2) a multi-valued (MV) feature system, which uses traditional phonetic categories such as manner, place, etc., and (3) Government Phonology (GP), which uses a set of structured primes. All experiments used recurrent neural networks to perform feature detection. In these networks the input layer is a standard framewise cepstral representation, and the output layer represents the values of the features. The system effectively produces a representation of the most likely phonological features for each input frame. All experiments were carried out on the TIMIT speaker-independent database. The networks performed well in all cases, with the average accuracy for a single feature ranging from 86% to 93%. We describe these experiments in detail, and discuss the justification and potential advantages of using phonological features rather than phones as the basis of speech recognition.

11.
This paper investigates the problem of speech enhancement when only a single microphone is used and the statistics of the interfering noise and speech are not available a priori. It thus seeks to address a pitfall of many current enhancement techniques and looks towards a system applicable in the real world. The paper focuses on a Log Gabor Wavelet (LGW) based long-term squared spectral amplitude estimator using the maximum a posteriori (MAP) criterion. First, a long-term cepstral mean subtraction technique with LGW is proposed to suppress telephone channel and handset effects in the speech signals. Then a novel speech enhancer based on a MAP Bayesian bivariate model is developed to suppress the background noise. This work also introduces an inter-scale dependency between the coefficients and their parents through a circularly symmetric probability density function related to the family of spherically invariant random processes (SIRPs). The corresponding joint estimator is derived by MAP estimation theory. The inter-scale noise variance of the coefficients is kept constant, which gives a closed-form solution. Consideration of a speech presence uncertainty (SPU) estimator is a further contribution. The main contributions of this paper are therefore: (i) the combination of LGW, SIRPs and SPU for background noise reduction; (ii) LGW and long-term cepstral mean subtraction to reduce the effects of both telephone channel and handsets; (iii) a circularly symmetric probability density function that exploits the inter-scale dependency between the coefficients and their parents, with the corresponding joint estimators derived by MAP estimation theory; (iv) an inter-scale noise variance of the coefficients that is kept constant, which gives a closed-form solution; and (v) refinement of the magnitude estimates by scaling them by the SPU probability.
Extensive comparisons are made between the proposed and existing speech enhancement algorithms on the NOIZEUS speech database, which contains different types of noise. We report subjective and objective evaluations of the proposed methods against four classes of algorithms: spectral subtractive, subspace, statistical model based, and Wiener type. Experimental results show that the proposed estimator yields a higher improvement in segmental SNR (SSNR), lower Log Area Ratio (LAR) and Weighted Spectral Slope (WSS) distortion, and higher Perceptual Evaluation of Speech Quality (PESQ) and Mean Opinion Score (MOS) than the existing speech enhancement algorithms. On the SSNR measure, the proposed methods show a 2 dB improvement over existing methods for almost every noise source. On the MOS measure, the proposed methods also improve on existing methods for almost every noise source. The proposed methods therefore aim to enhance speech quality and intelligibility at the same time.

12.
Previous comprehension studies using postperceptual memory tests have often reported negligible differences in performance between natural speech and several kinds of synthetic speech produced by rule, despite large differences in segmental intelligibility. The present experiments investigated the comprehension of natural and synthetic speech using two different on-line tasks: word monitoring and sentence-by-sentence listening. On-line task performance was slower and less accurate for passages of synthetic speech than for passages of natural speech. Recognition memory performance in both experiments was less accurate following passages of synthetic speech than of natural speech. Monitoring performance, sentence listening times, and recognition memory accuracy all showed moderate correlations with intelligibility scores obtained using the Modified Rhyme Test. The results suggest that poorer comprehension of passages of synthetic speech is attributable in part to the greater encoding demands of synthetic speech. In contrast to earlier studies, the present results demonstrate that on-line tasks can be used to measure differences in comprehension performance between natural and synthetic speech.

13.
Although many discrete Fourier transform (DFT) domain-based speech enhancement methods rely on stochastic models, such as the Gaussian and Laplace distributions, to derive clean speech estimators, certain speech sounds clearly show a more deterministic character. In this paper, we study the use of a deterministic model in combination with the well-known stochastic models for speech enhancement. We derive a minimum mean-square error (MMSE) estimator under a combined stochastic-deterministic speech model with speech presence uncertainty and show that, for different distributions of the DFT coefficients, the combined stochastic-deterministic speech model improves segmental signal-to-noise ratio (SNR) by approximately 0.8 dB over the use of a stochastic model alone. Evaluation with perceptual evaluation of speech quality (PESQ) shows performance improvements of approximately 0.15 on an MOS scale.
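
Segmental SNR, the measure quoted in this abstract, is usually computed frame by frame and then averaged. The sketch below shows one common formulation; the frame length, the non-overlapping framing, the clamping range of -10 to 35 dB, and the function name `segmental_snr` are assumptions of this example and may differ from the paper's setup.

```python
import math


def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Average per-frame SNR in dB between a clean and an enhanced signal.

    Frames are non-overlapping; each frame's SNR is clamped to [lo, hi]
    so that silent or perfectly reconstructed frames do not dominate.
    """
    if len(clean) != len(enhanced):
        raise ValueError("signals must have the same length")
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        est = enhanced[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        error_energy = sum((x - y) ** 2 for x, y in zip(ref, est))
        if signal_energy == 0:
            continue  # skip frames with no clean-speech energy
        if error_energy == 0:
            snr = hi  # perfect reconstruction: use the upper clamp
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        values.append(min(max(snr, lo), hi))
    if not values:
        raise ValueError("no usable frames")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical signals: the estimate is the clean signal scaled by 0.9.
    clean = [1.0] * 8
    enhanced = [0.9] * 8
    print(round(segmental_snr(clean, enhanced, frame_len=4), 2))
```

Averaging in the dB domain per frame, instead of computing one global ratio, keeps loud segments from masking errors in quiet ones, which is why the measure is popular for comparing enhancement algorithms.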

14.
We present a framework for the analysis and evaluation of travel, or viewpoint motion control, techniques for use in immersive virtual environments (VEs). In previous work, we presented a taxonomy of travel techniques and a set of experiments mapping parts of the taxonomy to various performance metrics. Since these initial experiments, we have expanded the framework to allow evaluation of not only the effects of different travel techniques, but also the effects of many outside factors simultaneously. Combining this expanded framework with the measurement of multiple response variables epitomises the philosophy of testbed evaluation. This experimental philosophy leads to a deeper understanding of the interaction and the technique(s) in question, as well as to broadly generalisable results. We also present an example experiment within this expanded framework, which evaluates the user's ability to gather information while travelling through a virtual environment. Results indicate that, of the variables tested, the complexity of the environment is by far the most important factor.

15.
This paper describes work on improving the performance of a Tamil speech recognition system using Time Scale Modification (TSM) and Vocal Tract Length Normalization (VTLN) techniques. The speech recognition system for the Tamil language was developed using a new approach to text-independent speech segmentation, with a phoneme-based language model for recognition. Speech recognition performance degrades because of variations in speaking rate and vocal tract shape among speakers. To improve the performance of the system, both TSM and VTLN normalization techniques were used in this work. TSM was implemented using the phase vocoder approach, and VTLN was implemented using a speaker-specific bark/mel scale in the bark/mel domain. Applying both TSM and VTLN normalization improved the performance of the Tamil speech recognition system.

16.
Even the highest quality synthetic speech generated by rule sounds unlike human speech. As the intelligibility of rule-based synthetic speech improves, and the number of applications for synthetic speech increases, the naturalness of synthetic speech will become an important factor in determining its use. In order to improve this aspect of the quality of synthetic speech it is necessary to have diagnostic tests that can measure naturalness. Currently, none of the available metrics for evaluating the acceptability of synthetic speech distinguishes sufficiently between measuring overall acceptability (including naturalness) and simply measuring the ability of listeners to extract intelligible information from the signal. In this paper we propose a new methodology for measuring the naturalness of particular aspects of synthesized speech, independent of the intelligibility of the speech. Although naturalness is a multidimensional, subjective quality of speech, this methodology makes it possible to assess the separate contributions of prosodic, segmental, and source characteristics of the utterance. In two experiments, listeners reliably differentiated the naturalness of speech produced by two male talkers and two text-to-speech systems. Furthermore, they reliably differentiated between the two text-to-speech systems. The results of these experiments demonstrate that perception of naturalness is affected by information contained within the smallest part of speech, the glottal pulse, and by information contained within the prosodic structure of a syllable. These results show that this new methodology does provide a solid basis for measuring and diagnosing the naturalness of synthetic speech.

17.
Disclosure of personal information is valuable to individuals, governments, and corporations. This experiment explores the role interface design plays in maximizing disclosure. Participants (N = 100) were asked to disclose personal information to a telephone-based speech user interface (SUI) in a 3 (recorded speech vs. synthesized speech vs. text-based interface) by 2 (gender of participant) by 2 (gender of voice) between-participants experiment (with no voice manipulation in the text conditions). Synthetic speech participants exhibited significantly less disclosure and less comfort with the system than text-based or recorded-speech participants. Females were more sensitive to differences between synthetic and recorded speech. There were significant interactions between modality and gender of speech, while there were no gender identification effects. Implications for the design of speech-based information-gathering systems are outlined.

18.
Chen Lijiang, Ren Jie, Chen Pengfei, Mao Xia, Zhao Qi. Applied Intelligence, 2022, 52(13): 15193-15209

This paper proposes a framework that uses only the EGG signal for speech synthesis in a scenario with a limited set of content categories. EGG is a physiological signal that reflects the trends of vocal cord movement. Because EGG is acquired differently from speech signals, we explore its application to speech synthesis in two scenarios: (1) synthesizing speech under high-noise conditions, where clean speech signals are unavailable; and (2) enabling people who cannot speak but retain vocal cord vibration to speak again. Our study consists of two stages, EGG to text and text to speech. The first stage is a text content recognition model based on Bi-LSTM, which converts each EGG signal sample into the corresponding text within a limited class of contents. This model achieves 91.12% accuracy on the validation set in a 20-class content recognition experiment. The second stage synthesizes speech from the corresponding text and the EGG signal. Based on a modified Tacotron-2, our model attains a Mel cepstral distortion (MCD) of 5.877 and a mean opinion score (MOS) of 3.87, which is comparable with state-of-the-art performance and achieves an improvement of 0.42 and a smaller model size relative to the original Tacotron-2. To introduce the speaker characteristics contained in EGG into the final synthesized speech, we put forward a fine-grained fundamental frequency modification method, which adjusts the fundamental frequency according to the EGG signals and achieves a lower MCD of 5.781 and a higher MOS of 3.94 than without modification.
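
For readers unfamiliar with the MCD figures quoted here, the following is a minimal sketch of the widely used definition of Mel cepstral distortion in dB. It assumes the reference and synthesized frames are already time-aligned and that coefficient 0 (energy) is excluded; the abstract does not say which variant the authors used, and the function name `mel_cepstral_distortion` is hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean Mel cepstral distortion (dB) over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients. Coefficient 0
    (energy) is skipped, following common practice (an assumption here).
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need equal, non-zero numbers of frames")
    # Constant factor of the standard definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)


if __name__ == "__main__":
    # Hypothetical 3-coefficient frames; only coefficients 1 and 2 are compared.
    reference = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]]
    synthesized = [[5.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
    print(round(mel_cepstral_distortion(reference, synthesized), 3))
```

Lower values mean the synthesized spectral envelope is closer to the reference, which is why the abstract reports the drop from 5.877 to 5.781 as an improvement.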


19.
This study presents a different approach to the classification of Mesoscale Oceanic Structures (MOS) present in the Northwest African area, based on their location. The main improvement stems from the partition of this area into four large zones that are clearly differentiated by their morphological characteristics, with attention to seafloor topography and coastal relief. This decomposition makes it easier to recognize structures under adverse conditions, chiefly the presence of clouds partly hiding them. This is observed particularly well in upwellings, which are usually very large structures with a different morphology and genesis in each zone. This approach not only improves the classification of the upwellings, but also makes it possible to analyse changes in the MOS over time, thereby improving the prediction of their morphological evolution. To identify and label the MOS classified in the Sea-viewing Wide Field-of-view Sensor (SeaWiFS) and Aqua-MODIS (Moderate Resolution Imaging Spectroradiometer) chlorophyll-a and temperature images, we used a tool that our group designed specifically for this purpose and that has again shown its validity in this new proposal.

20.
This paper presents the development and implementation of a variable-rate time-scaling expansion system for speech signals, based on pitch information, in which only the voiced segments are expanded while the unvoiced and silence segments are left unchanged. The proposed system was first evaluated by computer simulation and then implemented on a digital signal processor (DSP). Time-domain, frequency-domain, mean opinion score (MOS) and diagnostic rhyme test (DRT) evaluations were carried out to test the actual performance of the developed algorithm; they show that the proposed system improves the learning level of foreign language students as well as the understanding ability of elderly people. Objective tests were also carried out to examine the similarity between the original and the expanded signals. Iterative refinement of the C source code made a real-time implementation possible. The currently implemented algorithm requires 11 kwords of program memory and about 9 million floating-point operations per second (9 MFLOPS).
