Statistics and Analysis on Spell Errors of Tibetan Syllables Based on a Large Scale Web Corpus

Cite this article:	LIU Huidan, HONG Jinling, NUO Minghua, WU Jian. Statistics and Analysis on Spell Errors of Tibetan Syllables Based on a Large Scale Web Corpus[J]. Journal of Chinese Information Processing, 2017, 31(2): 61-70.

Authors:	LIU Huidan, HONG Jinling, NUO Minghua, WU Jian

Affiliation:	Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

Funding:	National Natural Science Foundation of China (61202219, 61303165); Informatization Special Project of the Chinese Academy of Sciences (XXH12504-1-10); Major Science and Technology Project of Press and Publication (0610-1041BJNF 2328/23)

Abstract:	This paper examines a Tibetan text corpus collected from the web, comprising 190 thousand web pages, 4.27 million sentences, and 93.28 million syllables. Predefined rules are applied to detect spelling errors in Tibetan syllables. Of the 20,743 distinct Tibetan syllables (types) that occur in the corpus, 9,700 contain spelling errors, or 46.7628% of all types. At the token level, however, the misspelt syllables occur only 27,427 times, just 0.0308% of syllable tokens, which indicates that the text quality of the corpus is quite high. The paper also reports the share of each form of misspelt syllable and identifies four main causes of the errors: (1) extra vowel signs; (2) a missing syllable delimiter or a missing space at the end of a sentence; (3) letter stacks or characters that can be encoded in more than one form; and (4) similar-looking characters used in place of the correct ones.
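The gap between the type-level and token-level figures in the abstract can be illustrated with a small sketch. The code below is not the authors' rule set: `has_doubled_vowel` is a hypothetical stand-in rule (two basic vowel signs in a row, loosely modelled on the "extra vowel sign" cause), and the helper names `split_syllables` and `error_rates` are assumptions made for illustration. Only the Unicode code points for the tsheg, the shad, and the four basic vowel signs are taken from the Tibetan block of the Unicode Standard.

```python
from collections import Counter

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan clause/sentence delimiter
# The four basic vowel signs: i, u, e, o
BASIC_VOWELS = {"\u0f72", "\u0f74", "\u0f7a", "\u0f7c"}


def split_syllables(text):
    """Split text on tsheg, shad and whitespace; drop empty pieces."""
    for ch in (SHAD, "\n", "\t", " "):
        text = text.replace(ch, TSHEG)
    return [s for s in text.split(TSHEG) if s]


def has_doubled_vowel(syllable):
    """Hypothetical rule: flag two basic vowel signs in a row."""
    return any(
        a in BASIC_VOWELS and b in BASIC_VOWELS
        for a, b in zip(syllable, syllable[1:])
    )


def error_rates(counts, is_misspelt):
    """Return (type-level rate, token-level rate) of flagged syllables."""
    bad = {s: n for s, n in counts.items() if is_misspelt(s)}
    type_rate = len(bad) / len(counts)
    token_rate = sum(bad.values()) / sum(counts.values())
    return type_rate, token_rate


# Toy input: "bod yig bod yiig" followed by a shad; the last syllable
# carries a doubled vowel sign i.
bod = "\u0f56\u0f7c\u0f51"
yig = "\u0f61\u0f72\u0f42"
yiig = "\u0f61\u0f72\u0f72\u0f42"
sample = TSHEG.join([bod, yig, bod, yiig]) + SHAD

counts = Counter(split_syllables(sample))
type_rate, token_rate = error_rates(counts, has_doubled_vowel)
print(f"type-level: {type_rate:.2%}, token-level: {token_rate:.2%}")

# The type-level figure reported in the abstract: 9,700 of 20,743 types.
reported_type_rate = 9700 / 20743
print(f"reported type-level rate: {reported_type_rate:.4%}")
```

Because a misspelt form tends to be rare while correct forms repeat many times, the share of flagged types can be large even when the share of flagged tokens is tiny, which is the pattern the abstract reports.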

Keywords:	Tibetan spell check; spell check; corpus statistics; Tibetan information processing; Chinese information processing
