Research on Statistical Modeling of Raw Mongolian Corpora

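Cite this article:	BAI Shuangcheng. Research on Statistical Modeling of Raw Mongolian Corpora[J]. Journal of Chinese Information Processing, 2017, 31(1): 118-125.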

Author:	BAI Shuangcheng

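Affiliation:	1. Mongolian Language Information Technology R&D Center, Inner Mongolia Academy of Social Sciences, Hohhot, Inner Mongolia 010020; 2. Inner Mongolia Menksoft Software Co., Ltd., Hohhot, Inner Mongolia 010011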

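Funding:	National Electronics Development Fund, Mongolian-script special projects for 2010 and 2011; National Natural Science Foundation of China (61163020); Natural Science Foundation of Inner Mongolia Autonomous Region (2011MS0918)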

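Abstract:	The many-to-many mapping between Mongolian character codes and glyphs, together with non-standard input practices and other factors, leaves raw corpora with severe spelling variation and glyph-level spelling errors, which has become a bottleneck for large-scale data processing. Taking a Mongolian input method as an example, this paper uses a large lexicon and a shape-code generator to convert the best-path search over a word lattice, which originally depended on correct pronunciations, into a path search over a shape-code word lattice, which effectively solves the problem of statistical modeling on raw text. Experimental results show that this method, together with a model optimization based on glyph merging, significantly improves input efficiency, and that it offers broad reference value for all research on Mongolian "pronunciation-to-word" and "shape-to-word" conversion.
Keywords:	Mongolian; raw text; statistical modeling; pronunciation errors; glyph errors; intelligent input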
Study of Mongolian Raw Text Modeling

BAI Shuangcheng. Study of Mongolian Raw Text Modeling[J]. Journal of Chinese Information Processing, 2017, 31(1): 118-125.

Authors:	BAI Shuangcheng

Affiliation:	1. Inner Mongolia Academy of Social Science, Hohhot, Inner Mongolia 010020, China; 2. Inner Mongolia Menksoft Co., Ltd, Hohhot, Inner Mongolia 010011, China

Abstract:	Statistical language modeling of Mongolian text is complicated by the fact that the same character can carry different codes, owing to the different pronunciations of the character in various contexts. To address this issue for spelling input, this paper adopts a large dictionary with correct pronunciations and trains a statistical spelling model that finds the most probable pronunciation sequence directly from the candidate code sequence. Experiments indicate that a more efficient spelling input method is achieved, and the approach is also instructive for "pronunciation-to-word" conversion and "spelling-to-word" conversion.

Keywords:	Mongolian; corpus; statistical language model; pronunciation error; spelling error; intelligent input method
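
The abstract describes training on raw text by way of shape codes and then searching a shape-code word lattice for the best path. The sketch below illustrates that general idea only; it is not the paper's implementation. The merge table, the Latin placeholder words, the add-one smoothed bigram model, and the names `shape_code`, `train_bigrams`, and `best_path` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Toy stand-in for a shape-code generator (an assumption, not the paper's):
# letters that would render as the same glyph are merged into one symbol.
MERGE = str.maketrans({"u": "o", "e": "a", "d": "t"})

def shape_code(word):
    return word.translate(MERGE)

def train_bigrams(raw_sentences):
    """Count history and bigram frequencies over shape codes, so spelling
    variants that look identical contribute to the same counts."""
    uni, bi = defaultdict(int), defaultdict(int)
    for sent in raw_sentences:
        codes = ["<s>"] + [shape_code(w) for w in sent.split()]
        for prev, cur in zip(codes, codes[1:]):
            uni[prev] += 1
            bi[prev, cur] += 1
    return uni, bi

def log_prob(prev, cur, uni, bi, vocab_size):
    # Add-one smoothing keeps unseen bigrams finite.
    return math.log((bi.get((prev, cur), 0) + 1) / (uni.get(prev, 0) + vocab_size))

def best_path(code_string, lexicon_codes, uni, bi):
    """Viterbi search over the lattice of lexicon shape codes tiling code_string."""
    vocab_size = len(lexicon_codes)
    max_len = max(map(len, lexicon_codes))
    # best[(end position, last code)] = (log score, path so far)
    best = {(0, "<s>"): (0.0, [])}
    for end in range(1, len(code_string) + 1):
        for start in range(max(0, end - max_len), end):
            cand = code_string[start:end]
            if cand not in lexicon_codes:
                continue
            for (pos, prev), (score, path) in list(best.items()):
                if pos != start:
                    continue
                new = score + log_prob(prev, cand, uni, bi, vocab_size)
                key = (end, cand)
                if key not in best or new > best[key][0]:
                    best[key] = (new, path + [cand])
    finals = [v for (pos, _), v in best.items() if pos == len(code_string)]
    return max(finals, key=lambda v: v[0])[1] if finals else []

# Placeholder lexicon of "correct" spellings (hypothetical words).
lexicon = ["nom", "unsi", "no", "munsi"]
shape_index = defaultdict(list)
for w in lexicon:
    shape_index[shape_code(w)].append(w)

# Raw corpus with variant spellings that share the same shape codes.
uni, bi = train_bigrams(["nom unsi", "num onsi", "nom onsi"])

codes = best_path("nomonsi", set(shape_index), uni, bi)
words = [shape_index[c][0] for c in codes]  # first lexicon candidate per code
print(words)  # ['nom', 'unsi']
```

Because the counts are keyed on shape codes rather than on the spellings found in the raw text, the three differently spelled training sentences reinforce a single bigram. The lattice search then prefers the segmentation `nom` + `onsi` over `no` + `monsi`, and the lexicon maps each code back to a correctly spelled word.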
