Automatic Tibetan Text Error Checking Based on Rules and Statistics (基于规则与统计相结合的藏文文本自动查错方法研究)

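Cite as:	Pema Tashi (完么扎西), Nima Tashi (尼玛扎西). Automatic Tibetan Text Error Checking Based on Rules and Statistics[J]. Journal of Chinese Information Processing (中文信息学报), 2022, 36(2): 69-75.

Authors:	Pema Tashi (完么扎西), Nima Tashi (尼玛扎西)

Affiliations:	1. Minority Normal College, Qinghai Normal University, Xining, Qinghai 810008; 2. School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000

Funding:	National Social Science Fund of China (19XYY021)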

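Abstract (translated from the Chinese):	To address the shortcomings of current automatic error-checking methods for Tibetan text, this paper proposes a method that combines rules and statistics. First, based on Tibetan spelling grammar and drawing on formal language and automata theory, 37 deterministic finite automata are constructed to recognize modern Tibetan characters. Next, dictionary lookup is used to recognize Sanskrit-derived Tibetan characters. Finally, statistical measures such as mutual information and t-test difference are used to find real-word errors, such as word collocation errors and grammatical errors, thereby achieving automatic error checking of Tibetan text...
Keywords:	Tibetan text; automatic error checking; non-word errors; real-word errors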
Automatic Tibetan Text Error Checking Based on Rules and Statistics

Pema Tashi, Nima Tashi. Automatic Tibetan Text Error Checking Based on Rules and Statistics[J]. Journal of Chinese Information Processing, 2022, 36(2): 69-75.

Authors:	Pema Tashi, Nima Tashi

Affiliation:	1.Minority Normal College, Qinghai Normal University, Xining, Qinghai 810008, China; 2.School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000, China

Abstract:	This paper proposes a method for automatic Tibetan text error checking that combines rules and statistics. First, based on Tibetan spelling grammar, 37 deterministic finite automata are constructed to recognize modern Tibetan characters. Second, a dictionary is used to identify Sanskrit-derived Tibetan characters. Finally, mutual information and t-test difference are used to identify real-word errors in Tibetan text, including word collocation errors and grammatical errors. The test set consists of 100 news articles containing 49 errors. Experiments show that the proposed method finds both non-word errors and real-word errors, with a recall of 83.7%, a detection accuracy (precision) of 70.7%, and an F-measure of 76.7%.
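The three reported scores can be cross-checked against each other. The sketch below assumes that the F-measure is the balanced harmonic mean (F1) and that the reported "detection accuracy" plays the role of precision. The abstract states neither assumption explicitly, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract; reading "detection accuracy"
# as precision is an assumption, not something the abstract says.
precision = 0.707
recall = 0.837

f1 = f_measure(precision, recall)
print(f"F-measure: {f1:.1%}")  # prints "F-measure: 76.7%"
```

Under these assumptions the computed value matches the reported 76.7%, which supports reading the detection accuracy as precision.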

Keywords:	Tibetan text; automatic error checking; non-word error; real-word error
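The statistical step named in the abstract can be illustrated with a minimal sketch of pointwise mutual information over adjacent word pairs. This is not the authors' implementation. The function names, the threshold, and the idea of flagging low-scoring pairs are assumptions made for illustration. The tokens are Latin placeholders rather than real Tibetan words, and the t-test difference measure is left out because the abstract does not define it.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Pointwise mutual information for each adjacent word pair in a corpus.

    `sentences` is a list of already-segmented sentences (lists of words).
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def flag_weak_pairs(words, scores, threshold=0.0):
    """Return adjacent pairs that are unseen or score below `threshold`.

    The threshold is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    return [
        (x, y)
        for x, y in zip(words, words[1:])
        if scores.get((x, y), float("-inf")) < threshold
    ]


# Toy training corpus with placeholder tokens.
corpus = [["a", "b"], ["a", "b"], ["c", "d"]]
scores = pmi_scores(corpus)

print(flag_weak_pairs(["a", "b"], scores))  # pair seen in the corpus: not flagged
print(flag_weak_pairs(["a", "d"], scores))  # unseen pair: flagged
```

A pair that never co-occurs in the training corpus gets a score of negative infinity here, so it is always flagged. A practical checker would need smoothing and a much larger corpus to avoid flagging every rare but valid collocation.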
