期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Phoneme Intelligibility of Four Text-to-Speech Products to Nonnative Speakers of English in Noise

H.?S.?Venkatagiri Email author 《International Journal of Speech Technology》2005,8(4):313-321

The study investigated the segmental intelligibility of four text-to-speech (TTS) products under 0 dB and 5 dB signal-to-noise ratios in a group of native and nonnative speakers of English. Each product—AT&T Next-Gen™, Festival version 1.4.2, FlexVoice™ 2, and IBM ViaVoice™ Version 5.1—uses a different algorithm for generating speech from text. The results, which benefit developers of TTS technology as well as developers of products that utilize TTS, showed that (1) all TTS products were less intelligible to nonnative speakers of English than native speakers, (2) the “hybrid” TTS product that combined concatenative and formant synthesis methods was the least intelligible of the four products investigated, (3) the remaining three products, which used formant, concatenative diphone based LPC, and concatenative waveform synthesis methods respectively, were equally intelligible to nonnative speakers, (4) none of the four TTS products was better at resisting intelligibility loss due to noise than others, and (5) listening to currently available unrestricted TTS under high noise conditions would probably require a greater amount of cognitive resources on the part of both native and nonnative speakers of English and may be difficult when other demanding activities are concurrently performed. 相似文献

2.

母语与非母语语音识别声学建模

下载免费PDF全文

曾定刘加《计算机工程》2010,36(8):170-172

为了兼容母语与非母语说话人之间的发音变化,提出一种新的声学模型建模方法。分析中国人受母语影响产生的英语发音变化,利用中国人英语发音数据库自适应得到语音模型,采用声学模型融合技术构建融合2种发音规律的识别模型。实验结果证明,中国人英语发音的语音识别率提高了13.4%,但标准英语的语音识别率仅下降1.1%。相似文献

3.

Multilingual recognition of non-native speech using acoustic model transformation and pronunciation modeling

G. Bouselmi D. Fohr I. Illina 《International Journal of Speech Technology》2012,15(2):203-213

This article presents an approach for the automatic recognition of non-native speech. Some non-native speakers tend to pronounce phonemes as they would in their native language. Model adaptation can improve the recognition rate for non-native speakers, but has difficulties dealing with pronunciation errors like phoneme insertions or substitutions. For these pronunciation mismatches, pronunciation modeling can make the recognition system more robust. Our approach is based on acoustic model transformation and pronunciation modeling for multiple non-native accents. For acoustic model transformation, two approaches are evaluated: MAP and model re-estimation. For pronunciation modeling, confusion rules (alternate pronunciations) are automatically extracted from a small non-native speech corpus. This paper presents a novel approach to introduce confusion rules in the recognition system which are automatically learned through pronunciation modelling. The modified HMM of a foreign spoken language phoneme includes its canonical pronunciation along with all the alternate non-native pronunciations, so that spoken language phonemes pronounced correctly by a non-native speaker could be recognized. We evaluate our approaches on the European project HIWIRE non-native corpus which contains English sentences pronounced by French, Italian, Greek and Spanish speakers. Two cases are studied: the native language of the test speaker is either known or unknown. Our approach gives better recognition results than the classical acoustic adaptation of HMM when the foreign origin of the speaker is known. We obtain 22% WER reduction compared to the reference system. 相似文献

4.

Automatic pronunciation scoring of words and sentences independent from the non-native’s first language

Tobias Cincarek Rainer Gruhn Christian Hacker Elmar Nth Satoshi Nakamura 《Computer Speech and Language》2009,23(1):65-88

This paper describes an approach for automatic scoring of pronunciation quality for non-native speech. It is applicable regardless of the foreign language student’s mother tongue. Sentences and words are considered as scoring units. Additionally, mispronunciation and phoneme confusion statistics for the target language phoneme set are derived from human annotations and word level scoring results using a Markov chain model of mispronunciation detection. The proposed methods can be employed for building a part of the scoring module of a system for computer assisted pronunciation training (CAPT). Methods from pattern and speech recognition are applied to develop appropriate feature sets for sentence and word level scoring. Besides features well-known from and approved in previous research, e.g. phoneme accuracy, posterior score, duration score and recognition accuracy, new features such as high-level phoneme confidence measures are identified. The proposed method is evaluated with native English speech, non-native English speech from German, French, Japanese, Indonesian and Chinese adults and non-native speech from German school children. The speech data are annotated with tags for mispronounced words and sentence level ratings by native English teachers. Experimental results show, that the reliability of automatic sentence level scoring by the system is almost as high as the average human evaluator. Furthermore, a good performance for detecting mispronounced words is achieved. In a validation experiment, it could also be verified, that the system gives the highest pronunciation quality scores to 90% of native speakers’ utterances. Automatic error diagnosis based on a automatically derived phoneme mispronunciation statistic showed reasonable results for five non-native speaker groups. The statistics can be exploited in order to provide the non-native feedback on mispronounced phonemes. 相似文献

5.

The effect of code-mixing on accent identification accuracy

Thomas Niesler Febe de Wet 《Computer Speech and Language》2009,23(4):435-443

We investigate whether accent identification is more effective for English utterances embedded in a different language as part of a mixed code than for English utterances that are part of a monolingual dialogue. Our focus is on Xhosa and Zulu, two South African languages for which code-mixing with English is very common. In order to carry out our investigation, we extract English utterances from mixed-code Xhosa and Zulu speech corpora, as well as comparable utterances from an English-only corpus by Xhosa and Zulu mother-tongue speakers. Experiments using automatic accent identification systems show that identification is substantially more accurate for the utterances originating from the mixed-code speech. These findings are supported by a corresponding set of perceptual experiments in which human subjects were asked to identify the accents of recorded utterances. We conclude that accent identification is more successful for these utterances because accents are more pronounced for English embedded in mother-tongue speech than for English spoken as part of a monolingual dialogue by non-native speakers. Furthermore we find that this is true for human listeners as well as for automatic identification systems. 相似文献

6.

Discourse prosody planning in native (L1) and nonnative (L2) (L1-Bengali) English: a comparative study

Shambhu Nath Saha Shyamal Kr Das Mandal 《International Journal of Speech Technology》2017,20(2):305-326

This paper conducts a comparative study between L1 and L2 (L1 Bengali) English discourse level speech planning to investigate differences between L1 and L2 English speaker groups in the organization of discourse-level speech planning. For this purpose, English speech of 10 L1 English and 40 L1 Bengali speakers of the same discourse are analyzed in terms of using prosodic and acoustic cues by applying hierarchical discourse prosody framework. From this analysis, between-group differences in discourse level speech planning are found through the speech rate, locations of discourse boundary breaks as well as size and scope of speech planning and chunking units. Result of analysis shows that the speech rate of L1 English speakers is higher than that of L2 English speakers, L2 English speakers contain more break boundary than that of the L1 English speakers at every discourse level in the organization, which exhibit the fact that L2 English speakers use more intermediate chunking units and larger scale planning units than that of L1 English speakers. Between-group differences are also found through the analysis of phrase component at prosodic phrase level and accent component at the prosodic word level. These findings can be attributed to L2 English speakers’ improper phrasing, improper word level prominence and the ambiguous difference between content words and function words. The study concludes that the deficiencies in English strategy for L1 Bengali speakers’ discourse-level speech planning compared to L1 English speakers are due to the influence of L1 (Bengali) prosody at the L2 discourse level. 相似文献

7.

Bitter Pills to Swallow. ASR and TTS have Drug Problems

Caroline?Henton Email author 《International Journal of Speech Technology》2005,8(3):247-257

相似文献

8.

Warning signal words: connoted strength and understandability by children,elders, and non-native English speakers

《Ergonomics》2012,55(11):2188-2206

Signal words, such as DANGER, WARNING and CAUTION, are commonly used in sign and product label warnings for the purpose of conveying different levels of hazard. Previous research has focused on whether people's perceptions of connoted hazard are consistent with the levels suggested by design standards and guidelines. Most investigations have used college students to evaluate the terms; other populations who may be at greater risk have not been adequately studied. One purpose of the present research was to determine whether young children, the elderly, and non-native English speakers perceive similar connoted hazard levels from the terms as undergraduates and published guidelines. A second purpose was to assess the terms' comprehensibility using various metrics such as missing values (i.e. ratings left blank) and understandability ratings. A third purpose was to develop a list of potential signal words that probably would be understandable to members of special populations. In the first experiment, 298 fourth- to eighth-grade students and 70 undergraduates rated 43 potential signal words on how careful they would be after seeing each term. The undergraduates also rated the terms on strength and understandability. In the second experiment, 98 elders and 135 non-native English speakers rated the same set of terms. The rank ordering of the words was found to be consistent across the participant groups. In general, the younger students gave higher carefulness ratings than the undergraduates. The words that the younger children and the non-native English speakers frequently left blank were given lower understandibility ratings. Finally, a short list of terms was derived that 95% or 99% of the youngest students (fourth- and fifth-graders) and 80% of the non-native English speakers understood. Implications of hazard communication are discussed. 相似文献

9.

Study on pharyngeal and uvular consonants in foreign accented Arabic for ASR

Yousef Ajami Alotaibi Ghulam Muhammad 《Computer Speech and Language》2010,24(2):219-231

This paper investigates the unique pharyngeal and uvular consonants of Arabic from the point of view of automatic speech recognition (ASR). Comparisons of the recognition error rates for these phonemes are analyzed in five experiments that involve different combinations of native and non-native Arabic speakers. The most three confusing consonants for every investigated consonant are discussed. All experiments use the Hidden Markov Model Toolkit (HTK) and the Language Data Consortium (LDC) WestPoint Modern Standard Arabic (MSA) database. Results confirm that these Arabic distinct consonants are a major source of difficulty for Arabic ASR. While the recognition rate for certain of these unique consonants such as // can drop below 35% when uttered by non-native speakers, there is advantage to include non-native speakers in ASR. Besides, regional differences in pronunciation of MSA by native Arabic speakers require the attention of Arabic ASR research. 相似文献

10.

Listening to Natural and Synthesized Speech while Driving: Effects on User Performance

Omer Tsimhoni Paul Green Jennifer Lai 《International Journal of Speech Technology》2001,4(2):155-169

The effects of message type (navigation, E-mail, news story), voice type (text-to-speech, natural human speech), and earcon cueing (present, absent) on message comprehension and driving performance were examined. Twenty-four licensed drivers (12 under 30, 12 over 65, both equally divided by gender) participated in the experiment. They drove the UMTRI driving simulator on a road consisting of straight sections and constant radius curves, thus yielding two levels of low driving-workload. In addition, as a control condition, data were collected while participants were parked. In all conditions, participants were presented with three types of messages. Each message was immediately followed by a series of questions to assess comprehension. Navigation messages were about 4 seconds long (about 9 words). E-mail messages were about 40 seconds long (about 100 words) and news messages were about 80 seconds long (about 225 words). For all message types, comprehension of text-to-speech messages, as determined by accuracy of response to questions, and by subjective ratings, was significantly worse than comprehension of natural speech (79 versus 83 percent correct answers; 7.7/10 versus 8.6/10 subjective rating). Driving workload did not affect comprehension. Interestingly, neither the speech used (synthesized or natural) nor the message type (navigation, E-mail, news) had a significant effect on basic driving performance measured by the standard deviations of lateral lane position and steering wheel angle. 相似文献

11.

英语语音合成系统超前端文本分析知识库的构建

马立东《电脑与信息技术》2013,(5):47-51

研究英语语音合成系统超前端文本分析所需知识库的构建和扩充方法。语音合成系统在语音播报等领域已经得到了广泛应用。但是在英语多媒体教学领域,还需解决偶尔出现的发音错误问题。由于内置知识库覆盖面不足,目前必须通过人工处理输入的文本,消除发音错误。人工分析和处理的速度及效率制约了语音合成系统在英语教学领域的应用。在英语词汇知识库的支持下,利用计算机辅助文本分析技术,对输入文本中的词语进行筛选和分类,找出产生发音错误的单词或符号,经扩展、转换或标注处理,可使优秀的英语语音合成系统达到教学和训练的要求。相似文献

12.

BBN TransTalk: Robust multilingual two-way speech-to-speech translation for mobile platforms

Rohit Prasad Prem Natarajan David Stallard Shirin Saleem Shankar Ananthakrishnan Stavros Tsakalidis Chia-lin Kao Fred Choi Ralf Meermeier Mark Rawls Jacob Devlin Kriste Krstovski Aaron Challenner 《Computer Speech and Language》2013,27(2):475-491

In this paper we present a speech-to-speech (S2S) translation system called the BBN TransTalk that enables two-way communication between speakers of English and speakers who do not understand or speak English. The BBN TransTalk has been configured for several languages including Iraqi Arabic, Pashto, Dari, Farsi, Malay, Indonesian, and Levantine Arabic. We describe the key components of our system: automatic speech recognition (ASR), machine translation (MT), text-to-speech (TTS), dialog manager, and the user interface (UI). In addition, we present novel techniques for overcoming specific challenges in developing high-performing S2S systems. For ASR, we present techniques for dealing with lack of pronunciation and linguistic resources and effective modeling of ambiguity in pronunciations of words in these languages. For MT, we describe techniques for dealing with data sparsity as well as modeling context. We also present and compare different user confirmation techniques for detecting errors that can cause the dialog to drift or stall. 相似文献

13.

New Applications & Recent Research

Michalopoulos D.A. 《Computer》1979,12(2):97-98

The Kurzweil Reading Machine converts print to speech, and is designed as a reading tool for the blind and visually handicapped. The system handles ordinary printed material–books, letters, reports, memoranda, etc.–in most common styles and sizes of type. The output produced is a synthetic voice using full-word English speech. The reader operates the device by placing printed material face down on the glass plate forming the top surface of the scanning unit; he then presses the "page" button on the contol panel, and listens to the synthetic speech produced as an electronic camera scans the page and transmits its image to a minicomputer housed within the device. The computer separates the image into discrete character forms, recognizes the letters, groups the letters into words, computes the pronunciation of each word, and then produces the speech sounds associated with each phoneme. The machine operates at normal speech rates, about 150 words per minute. 相似文献

14.

Voice comparison between smokers and non-smokers using HMM speech recognition system

Hassan Satori Ouissam Zealouk Khalid Satori Fatima ElHaoussi 《International Journal of Speech Technology》2017,20(4):771-777

Automatic speech recognition is a technology that allows a computer to transcribe in real time spoken words into readable text. In this work an HMM automatic speech recognition system was created to detect smoker speaker. This research project is carried out using Amazigh language for comparison of the voice of normal persons to smokers one. To achieve this goal, two experiments were performed, the first one to test the performance of the system for non-smokers for different parameters. The second experiment concern smokers speakers. The corpus used in this system is collected from two groups of speaker, non-smokers and smokers native Morocan tarifit speakers aged between 25 and 55 years. Our experimental results show that we can use our system to make diagnostic for smoking people and confirm that a speaker is smoker when the observed recognition rate is below 50%. 相似文献

15.

Babble Noise: Modeling, Analysis, and Applications

《IEEE transactions on audio, speech, and language processing》2009,17(7):1394-1407

Speech babble is one of the most challenging noise interference for all speech systems. Here, a systematic approach to model its underlying structure is proposed to further the existing knowledge of speech processing in noisy environments. This paper establishes a working foundation for the analysis and modeling of babble speech. We first address the underlying model for multiple speaker babble speech—considering the number of conversations versus the number of speakers contributing to babble. Next, based on this model, we develop an algorithm to detect the range of the number of speakers within an unknown babble speech sequence. Evaluation is performed using 110 h of data from the Switchboard corpus. The number of simultaneous conversations ranges from one to nine, or one to 18 subjects speaking. A speaker conversation stream detection rate in excess of 80% is achieved with a speaker window size of ${pm}1$ speakers. Finally, the problem of in-set/out-of-set speaker recognition is considered in the context of interfering babble speech noise. Results are shown for test durations from 2–8 s, with babble speaker groups ranging from two to nine subjects. It is shown that by choosing the correct number of speakers in the background babble an overall average performance gain of 6.44% equal error rate can be obtained. This study represents effectively the first effort in developing an overall model for speech babble, and with this, contributions are made for speech system robustness in noise. 相似文献

16.

Small-vocabulary speech recognition using surface electromyography

Bradley J. Betts Kim Binsted Charles Jorgensen 《Interacting with computers》2006,18(6):1242-1259

We present results of electromyographic (EMG) speech recognition on a small vocabulary of 15 English words. EMG speech recognition holds promise for mitigating the effects of high acoustic noise on speech intelligibility in communication systems, including those used by first responders (a focus of this work). We collected 150 examples per word of single-channel EMG data from a male subject, speaking normally while wearing a firefighter’s self-contained breathing apparatus. The signal processing consisted of an activity detector, a feature extractor, and a neural network classifier. Testing produced an overall average correct classification rate on the 15 words of 74% with a 95% confidence interval of (71%, 77%). Once trained, the subject used a classifier as part of a real-time system to communicate to a cellular phone and to control a robotic device. These tasks were performed under an ambient noise level of approximately 95 decibels. We also describe ongoing work on phoneme-level EMG speech recognition. 相似文献

17.

Two sepedi-english code-switched speech corpora

Modipa Thipe I. Davel Marelie H. 《Language Resources and Evaluation》2022,56(3):703-727

We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process to develop and verify both corpora and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.

相似文献

18.

基于praat软件的俄语读音分析

赵芳丽《计算机工程与应用》2012,48(11):133-136,177

用语音合成与分析软件praat分析了中国学生俄语读音的一些特点。通过对语音信号的波形图、语图谱、基音、共振峰、音高、音强等声学特性的分析,研究了中国学生在音素、音节、重音、音调、节奏、语调等方面存在的差异,为纠正其不良的发音、读句习惯提供技术帮助。相似文献

19.

复杂环境下基于自适应深度神经网络的鲁棒语音识别

张开生赵小芬《计算机工程与科学》2022,44(6):1105-1113

在连续语音识别系统中,针对复杂环境(包括说话人及环境噪声的多变性)造成训练数据与测试数据不匹配导致语音识别率低下的问题,提出一种基于自适应深度神经网络的语音识别算法。结合改进正则化自适应准则及特征空间的自适应深度神经网络提高数据匹配度;采用融合说话人身份向量i-vector及噪声感知训练克服说话人及环境噪声变化导致的问题,并改进传统深度神经网络输出层的分类函数,以保证类内紧凑、类间分离的特性。通过在TIMIT英文语音数据集和微软中文语音数据集上叠加多种背景噪声进行测试,实验结果表明,相较于目前流行的GMM-HMM和传统DNN语音声学模型,所提算法的识别词错误率分别下降了5.151%和3.113%,在一定程度上提升了模型的泛化性能和鲁棒性。相似文献

20.

Development of syllable-based text to speech synthesis system in Bengali

N. P. Narendra K. Sreenivasa Rao Krishnendu Ghosh Ramu Reddy Vempada Sudhamay Maity 《International Journal of Speech Technology》2011,14(3):167-181

This paper presents the design and development of unrestricted text to speech synthesis (TTS) system in Bengali language. Unrestricted TTS system is capable to synthesize good quality of speech in different domains. In this work, syllables are used as basic units for synthesis. Festival framework has been used for building the TTS system. Speech collected from a female artist is used as speech corpus. Initially five speakers’ speech is collected and a prototype TTS is built from each of the five speakers. Best speaker among the five is selected through subjective and objective evaluation of natural and synthesized waveforms. Then development of unrestricted TTS is carried out by addressing the issues involved at each stage to produce good quality synthesizer. Evaluation is carried out in four stages by conducting objective and subjective listening tests on synthesized speech. At the first stage, TTS system is built with basic festival framework. In the following stages, additional features are incorporated into the system and quality of synthesis is evaluated. The subjective and objective measures indicate that the proposed features and methods have improved the quality of the synthesized speech from stage-2 to stage-4. 相似文献