Effect of acoustic and linguistic contexts on human and machine speech recognition
Affiliation: 1. Department of Media Science, Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8603, Japan; 2. Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580, Japan
Abstract: We compared the performance of an automatic speech recognition system using n-gram language models, HMM acoustic models, and combinations of the two with the word recognition performance of human subjects who had access to acoustic information only, to local linguistic context only, or to a combination of both. All speech recordings used were taken from Japanese narration and spontaneous speech corpora.

Humans have difficulty recognizing isolated words taken out of context, especially words taken from spontaneous speech, partly because of word-boundary coarticulation. Human recognition performance improves dramatically when one or two preceding words are added. Short words in Japanese mainly consist of post-positional particles (e.g. wa, ga, wo, ni), which are function words located immediately after content words such as nouns and verbs. Given the one or two preceding words, the predictability of these short words is very high, and their recognition improves drastically. Providing even more context further improves human prediction performance under text-only conditions (without acoustic signals); it also improves speech recognition, but the improvement is relatively small.

Recognition experiments using an automatic speech recognizer were conducted under conditions almost identical to those of the experiments with humans. The performance of the acoustic models without any language model, or with only a unigram language model, was greatly inferior to human recognition performance with no context. In contrast, prediction performance using a trigram language model was comparable or superior to human performance when given one preceding and one succeeding word. These results suggest that, to make automatic speech recognizers comparable to humans under conditions of limited linguistic context, we must improve the acoustic models rather than the language models.
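To make the role of n-gram context concrete, here is a minimal sketch, assuming a tiny invented Japanese toy corpus (it is not the recognizer, the corpora, or the models used in the study). It contrasts a maximum-likelihood unigram estimate with a trigram estimate for the particle wo, showing how knowing the two preceding words can make a short function word nearly deterministic:

```python
from collections import defaultdict

# Illustrative sketch only: maximum-likelihood unigram vs. trigram
# estimates on an invented toy corpus, showing why short Japanese
# particles (wa, ga, wo, ni, ...) become highly predictable once
# one or two preceding words are known.

corpus = [
    ["watashi", "wa", "hon", "wo", "yomu"],    # "I read a book"
    ["kare", "ga", "hon", "wo", "kau"],        # "he buys a book"
    ["watashi", "wa", "gakkou", "ni", "iku"],  # "I go to school"
]

unigram = defaultdict(int)
trigram = defaultdict(int)
context = defaultdict(int)  # counts of (w_{i-2}, w_{i-1}) pairs

for sentence in corpus:
    padded = ["<s>", "<s>"] + sentence
    for i in range(2, len(padded)):
        unigram[padded[i]] += 1
        trigram[(padded[i - 2], padded[i - 1], padded[i])] += 1
        context[(padded[i - 2], padded[i - 1])] += 1

total = sum(unigram.values())

def p_unigram(w):
    """P(w): no linguistic context at all."""
    return unigram[w] / total

def p_trigram(w2, w1, w):
    """P(w | w2 w1): the two preceding words are known."""
    return trigram[(w2, w1, w)] / context[(w2, w1)] if context[(w2, w1)] else 0.0

# In isolation the particle "wo" is just one short token among many,
# but after a content word in the right frame it is nearly certain:
print(f"P(wo)          = {p_unigram('wo'):.2f}")               # low  (0.13)
print(f"P(wo | ga hon) = {p_trigram('ga', 'hon', 'wo'):.2f}")  # high (1.00)
```

The jump from the unconditioned to the conditioned probability mirrors the pattern reported above: a word or two of preceding context is enough to make short function words drastically easier to recognize, for humans and for an n-gram language model alike.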
Keywords: Continuous speech recognition; Human speech recognition ability; Acoustic model; Language model
This document is indexed in ScienceDirect and other databases.