

Random forests and the data sparseness problem in language modeling
Abstract: Language modeling is the problem of predicting words based on histories containing words already hypothesized. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The solution of both aspects is hindered by the data sparseness problem. Application of random forests (RFs) to language modeling deals with the two aspects simultaneously. We develop a new smoothing technique based on randomly grown decision trees (DTs) and apply the resulting RF language models to automatic speech recognition. This new method is complementary to many existing techniques for dealing with the data sparseness problem. We study our RF approach in the context of n-gram type language modeling, in which n − 1 words are present in a history. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are longer than four words. We show that our RF language models are superior to the best known smoothing technique, interpolated Kneser–Ney smoothing, in reducing both the perplexity (PPL) and word error rate (WER) in large vocabulary state-of-the-art speech recognition systems. In particular, we show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its many language models.
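
As a rough illustration of the forest-averaging idea sketched in the abstract, the following Python snippet builds several randomly grown "decision trees", each of which clusters n-gram histories into equivalence classes, and takes the forest probability as the average of the per-tree probabilities. The random choice of history positions and the add-one smoothing inside each class are placeholder assumptions made here for illustration only; they are not the paper's actual tree-growing or Kneser–Ney-style smoothing procedures.

import random
from collections import defaultdict

class RandomHistoryTree:
    """Toy 'decision tree': each tree randomly picks which history positions
    it looks at, so different trees induce different history equivalence
    classes over the n-1 history words."""
    def __init__(self, order, seed):
        rng = random.Random(seed)
        positions = list(range(order - 1))
        k = rng.randint(1, len(positions))
        self.positions = tuple(sorted(rng.sample(positions, k)))
        self.counts = defaultdict(lambda: defaultdict(int))
        self.class_totals = defaultdict(int)
        self.vocab = set()

    def _cls(self, history):
        # Map a full history onto this tree's equivalence class.
        return tuple(history[i] for i in self.positions)

    def train(self, sentences, order):
        for sent in sentences:
            padded = ["<s>"] * (order - 1) + sent + ["</s>"]
            for i in range(order - 1, len(padded)):
                hist, w = padded[i - order + 1:i], padded[i]
                c = self._cls(hist)
                self.counts[c][w] += 1
                self.class_totals[c] += 1
                self.vocab.add(w)

    def prob(self, history, word):
        # Add-one smoothing within the equivalence class (a stand-in for the
        # paper's smoothing at tree leaves).
        c = self._cls(history)
        v = len(self.vocab) or 1
        return (self.counts[c][word] + 1) / (self.class_totals[c] + v)

class RandomForestLM:
    """Forest probability = average of the individual tree probabilities."""
    def __init__(self, order=3, n_trees=10, seed=0):
        self.order = order
        self.trees = [RandomHistoryTree(order, seed + t) for t in range(n_trees)]

    def train(self, sentences):
        for tree in self.trees:
            tree.train(sentences, self.order)

    def prob(self, history, word):
        return sum(t.prob(history, word) for t in self.trees) / len(self.trees)

if __name__ == "__main__":
    data = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
    lm = RandomForestLM(order=3, n_trees=5, seed=42)
    lm.train(data)
    print(lm.prob(["the", "cat"], "sat"))

Because each tree sees histories through a different (randomly chosen) equivalence classification, the averaged forest estimate can assign reasonable probabilities to histories that any single classification would treat as unseen, which is the intuition behind the paper's robustness to data sparseness.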