A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling

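Cite this article:	Luo Weihua, Liu Qun, Bai Shuo. A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling [J]. Journal of Computer Research and Development, 2009, 46(10).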

Authors:	Luo Weihua, Liu Qun, Bai Shuo

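Affiliations:	1. Graduate University of Chinese Academy of Sciences, Beijing 100049; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190 2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190 3. Shanghai Securities Exchange, Shanghai 200120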

Funding:	National High Technology Research and Development Program of China (863 Program) (2007AA01Z438)

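Abstract:	N-gram language models are an important tool in many areas of natural language processing research, such as statistical machine translation, information retrieval, and speech recognition. Because enlarging the training corpus and increasing the model order both help improve system performance, and because available corpora are growing rapidly, training and using high-order language models (e.g., N ≥ 5) on large-scale training corpora has become a new research focus. This paper surveys the latest progress on this problem, covering several representative approaches: an integrated method that combines data divide-and-conquer, compression, and memory mapping; representations based on randomized access models; and language model training and querying on distributed parallel architectures. It presents their performance in statistical machine translation and compares the strengths and weaknesses of these methods.
Keywords:	language model; data compression; randomized access model; Bloom filter; distributed parallel architecture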
A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling

Luo Weihua, Liu Qun, Bai Shuo. A Review of the State-of-the-Art of Research on Large-Scale Corpora Oriented Language Modeling [J]. Journal of Computer Research and Development, 2009, 46(10).

Authors:	Luo Weihua, Liu Qun, Bai Shuo

Affiliation:	Graduate University of Chinese Academy of Sciences, Beijing 100049; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; Shanghai Securities Exchange, Shanghai 200120

Abstract:	The N-gram language model (LM) is a key component in many research areas of natural language processing, such as statistical machine translation, information retrieval, and speech recognition. Using higher-order models and more training data can significantly improve application performance. However, because system resources (e.g., memory and CPU) are limited, the cost of training and accessing large-scale LMs becomes prohibitive as more and more monolingual corpora become available. Therefore, th...

Keywords:	language model; data compression; randomized access model; Bloom filter; distributed parallel architecture
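
The abstract and keywords name Bloom filters as one basis for randomized representations of large language models. As a rough illustration only, the following Python sketch stores n-grams in a Bloom filter so that membership can be tested in fixed memory. It is not the method of any paper covered by the review: the class `NgramBloomFilter`, the helper `ngrams`, the hashing scheme, and all parameter values are assumptions chosen for the example, and the sketch records only whether an n-gram was seen, not its count or probability.

```python
import hashlib
from typing import Iterator, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NgramBloomFilter:
    """Approximate set of n-grams backed by a fixed-size bit array.

    Lookups never miss an inserted n-gram, but may report an unseen
    n-gram as present (a false positive).
    """

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, ngram: Tuple[str, ...]) -> Iterator[int]:
        # Join tokens with a separator unlikely to occur inside a token.
        key = "\x1f".join(ngram).encode("utf-8")
        for seed in range(self.num_hashes):
            # One salted BLAKE2b digest per hash function (an assumed scheme).
            digest = hashlib.blake2b(
                key, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, ngram: Tuple[str, ...]) -> None:
        for pos in self._positions(ngram):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, ngram: Tuple[str, ...]) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(ngram)
        )


# Usage: index the trigrams of a toy sentence, then query one of them.
bf = NgramBloomFilter(num_bits=1 << 16, num_hashes=4)
for gram in ngrams("the cat sat on the mat".split(), 3):
    bf.add(gram)
print(("the", "cat", "sat") in bf)  # True: inserted n-grams are always found
```

The memory footprint is set by `num_bits` alone and does not grow with the number of stored n-grams. The cost is a false-positive rate that rises as the filter fills.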
This article is indexed in databases including CNKI and Wanfang Data.
