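Research on an Automatic Tibetan Text Error-Checking Method Combining Rules and Statistics
Cite this article: Pema Tashi (完么扎西), Nima Tashi (尼玛扎西). Research on an Automatic Tibetan Text Error-Checking Method Combining Rules and Statistics [J]. Journal of Chinese Information Processing (中文信息学报), 2022, 36(2): 69-75.
Authors: Pema Tashi, Nima Tashi
Affiliations: 1. Minority Normal College, Qinghai Normal University, Xining, Qinghai 810008;
2. School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000
Funding: National Social Science Fund of China (19XYY021)
Abstract: To address the shortcomings of existing automatic error-checking methods for Tibetan text, this paper proposes an automatic error-checking method that combines rules and statistics. First, on the basis of Tibetan spelling grammar and drawing on formal language and automata theory, 37 deterministic finite automata are constructed to recognize modern Tibetan characters. Next, dictionary lookup is used to recognize Sanskrit-derived Tibetan characters. Finally, statistical measures such as mutual information and t-test difference are used to find real-word errors, including Tibetan word collocation errors and grammatical errors, thereby achieving automatic error checking of Tibetan text...

Keywords: automatic Tibetan text error checking; non-character errors; real-word errors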
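
The rule-based stage described in the abstract can be illustrated with a single toy deterministic finite automaton. The sketch below is not one of the authors' 37 automata: it assumes each syllable has already been decomposed into the traditional slot labels (prefix, superscript, root, subscript, vowel, suffix, postsuffix) and checks only the order of those slots. The names `TRANSITIONS`, `ACCEPTING`, and `accepts` are hypothetical, and the sketch ignores which specific letters may legally combine.

```python
# Toy DFA over slot labels of a Tibetan syllable (an illustrative assumption,
# not the automata from the paper). A state records the last slot consumed.
TRANSITIONS = {
    ("start", "prefix"): "prefix",
    ("start", "superscript"): "superscript",
    ("start", "root"): "root",
    ("prefix", "superscript"): "superscript",
    ("prefix", "root"): "root",
    ("superscript", "root"): "root",
    ("root", "subscript"): "subscript",
    ("root", "vowel"): "vowel",
    ("root", "suffix"): "suffix",
    ("subscript", "vowel"): "vowel",
    ("subscript", "suffix"): "suffix",
    ("vowel", "suffix"): "suffix",
    ("suffix", "postsuffix"): "postsuffix",
}

# A syllable is well formed only once a root letter has been read.
ACCEPTING = {"root", "subscript", "vowel", "suffix", "postsuffix"}


def accepts(slots):
    """Return True if the slot sequence follows the allowed ordering."""
    state = "start"
    for slot in slots:
        state = TRANSITIONS.get((state, slot))
        if state is None:  # no transition defined: reject as a non-character
            return False
    return state in ACCEPTING


print(accepts(["prefix", "root", "vowel", "suffix"]))  # a well-ordered sequence
print(accepts(["root", "postsuffix"]))                 # postsuffix without suffix
```

A sequence rejected by such an automaton would be reported as a non-character error; a practical checker would operate on Unicode code points and encode letter-level co-occurrence constraints as well.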

Automatic Tibetan Text Error Checking Based on Rules and Statistics
Pema Tashi, Nima Tashi. Automatic Tibetan Text Error Checking Based on Rules and Statistics [J]. Journal of Chinese Information Processing, 2022, 36(2): 69-75.
Authors:Pema Tashi  Nima Tashi
Affiliation:1.Minority Normal College, Qinghai Normal University, Xining, Qinghai 810008, China;
2.School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000, China
Abstract: This paper proposes a method for automatic Tibetan text error checking that combines rules and statistics. First, based on Tibetan spelling grammar, 37 deterministic finite automata are constructed to recognize modern Tibetan characters. Then, dictionary lookup is used to identify Sanskrit-derived Tibetan characters. Finally, mutual information and t-test difference are used to identify real-word errors, including word collocation errors and grammatical errors, in Tibetan texts. The test set consists of 100 news articles containing 49 errors. Experiments show that the proposed method effectively finds non-character errors and real-word errors, achieving a recall of 83.7%, a detection accuracy of 70.7%, and an F-measure of 76.7%.
Keywords: Tibetan text automatic error checking; non-word error; real-word error
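
The statistical stage mentioned in the abstract can be sketched with two standard collocation measures. The example below computes pointwise mutual information and the textbook collocation t-score for adjacent word pairs, then flags pairs that score below a threshold. It is a minimal sketch under several assumptions: the t-score is the common collocation t-test, which is not necessarily the t-test difference statistic the authors use; the tokens `w1` to `w4` stand in for segmented Tibetan words; the reference corpus is assumed non-empty; and the function names and zero thresholds are hypothetical.

```python
import math
from collections import Counter


def count_ngrams(tokens):
    """Count unigrams and adjacent bigrams in a list of word tokens."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def pmi(x, y, unigrams, bigrams, n):
    """Pointwise mutual information: log2(P(x, y) / (P(x) * P(y)))."""
    joint = bigrams[(x, y)] / n
    if joint == 0:
        return float("-inf")  # pair never observed in the reference corpus
    return math.log2(joint / ((unigrams[x] / n) * (unigrams[y] / n)))


def t_score(x, y, unigrams, bigrams, n):
    """Collocation t-score, approximating the sample variance by P(x, y)."""
    observed = bigrams[(x, y)] / n
    if observed == 0:
        return float("-inf")
    expected = (unigrams[x] / n) * (unigrams[y] / n)
    return (observed - expected) / math.sqrt(observed / n)


def flag_pairs(text_tokens, ref_tokens, pmi_min=0.0, t_min=0.0):
    """Return (index, x, y) for adjacent pairs scoring below either threshold."""
    unigrams, bigrams = count_ngrams(ref_tokens)
    n = len(ref_tokens)  # assumed > 0
    flagged = []
    for i, (x, y) in enumerate(zip(text_tokens, text_tokens[1:])):
        low_pmi = pmi(x, y, unigrams, bigrams, n) < pmi_min
        low_t = t_score(x, y, unigrams, bigrams, n) < t_min
        if low_pmi or low_t:
            flagged.append((i, x, y))
    return flagged


# Placeholder tokens standing in for segmented words (hypothetical data).
reference = "w1 w2 w3 w1 w2 w3 w1 w2 w4".split()
flagged = flag_pairs(["w1", "w2", "w3", "w2"], reference)
print(flagged)
```

With this toy reference corpus, only the pair (`w3`, `w2`) is flagged, because it never occurs in the reference data. In practice the thresholds would be tuned on held-out text, and the counts would come from a large segmented corpus.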