
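A Model and Algorithm for Automatic Error Detection in Chinese Text Based on Combining Rules and Statistics
Cite this article: ZHANG Yang-sen, CAO Yuan-da, YU Shi-wen. A model and algorithm for automatic error detection in Chinese text based on combining rules and statistics [J]. Journal of Chinese Information Processing (中文信息学报), 2006, 20(4): 3-7, 55.
Authors: ZHANG Yang-sen  CAO Yuan-da  YU Shi-wen
Affiliations: 1. Institute of Computational Linguistics, Peking University; 2. Department of Computer Science and Engineering, Beijing Institute of Technology; 3. Department of Computer and Automation, Beijing Information Science and Technology University
Funding: National Research and Development Fund; National Key Science and Technology Project; China Postdoctoral Science Foundation
Abstract: Automatic proofreading of Chinese text is a challenging research topic in natural language processing. This paper proposes a model and algorithm for automatic error detection in Chinese text that combines rules with statistics. Based on the distribution patterns of single-character words in correct text after word segmentation, and on the concept of the "non-multi-character word error", a set of error-detection rules is proposed. These rules are combined with character bigram and trigram models built for the scattered single-character strings left after segmentation, together with part-of-speech bigram and trigram models, to form the automatic error-detection model and the algorithm that implements it. In an experiment on 30 texts containing 578 error test points, the proposed algorithm achieves a recall of 86.85% and a precision of 69.43%, with a false-alarm rate of 30.57%.

Keywords: computer application  Chinese information processing  automatic error detection in Chinese text  combination of rules and statistics  non-multi-character word error  real multi-character word error
Article ID: 1003-0077(2006)04-0001-07
Received: 2005-07-07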
Revised: 2006-06-02
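
The abstract describes locating runs of single-character tokens left after word segmentation and scoring them with character n-gram statistics. The following is only a minimal sketch of that general idea, not the authors' algorithm: the function names, the threshold, and training the bigram counts on raw sentences rather than on segmented single-character strings are all assumptions made for illustration, and the paper's error-detection rules, trigram models, and part-of-speech models are not reproduced.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def single_char_runs(tokens: List[str], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges of consecutive single-character tokens."""
    runs = []
    start = None
    for i, tok in enumerate(tokens):
        if len(tok) == 1:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sentence.
    if start is not None and len(tokens) - start >= min_len:
        runs.append((start, len(tokens)))
    return runs


def train_char_bigrams(corpus: Iterable[str]) -> Counter:
    """Count adjacent character pairs in correct text (simplified training step)."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return counts


def suspicious_pairs(tokens: List[str], bigrams: Counter,
                     threshold: int = 1) -> List[Tuple[int, str]]:
    """Flag adjacent single characters whose bigram count is below the threshold."""
    flagged = []
    for start, end in single_char_runs(tokens):
        for i in range(start, end - 1):
            pair = (tokens[i], tokens[i + 1])
            if bigrams[pair] < threshold:  # Counter returns 0 for unseen pairs
                flagged.append((i, tokens[i] + tokens[i + 1]))
    return flagged
```

For example, if 今天 is mistyped as 今夭, a segmenter will typically fail to find a dictionary word and emit 今 and 夭 as separate single-character tokens; that run is then the candidate region checked against the bigram counts. In practice the threshold would be tuned on held-out data rather than fixed at 1.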

A Hybrid Model of Combining Rule-based and Statistics-based Approaches for Automatic Detecting Errors in Chinese Text
ZHANG Yang-sen,CAO Yuan-da,YU Shi-wen.A Hybrid Model of Combining Rule-based and Statistics-based Approaches for Automatic Detecting Errors in Chinese Text[J].Journal of Chinese Information Processing,2006,20(4):3-7,55.
Authors:ZHANG Yang-sen  CAO Yuan-da  YU Shi-wen
Affiliation: 1. Institute of Computational Linguistics, Peking University; 2. Department of Computer Science and Engineering, Beijing Institute of Technology; 3. Department of Computer and Automation, Beijing Information Science & Technology University
Abstract: Automatic proofreading of Chinese text is a challenging research subject in NLP. This article proposes a hybrid model that combines rules and statistics. Based on the distribution of single-character words in Chinese text after word segmentation and on the concept of the "non-multi-character word error", we propose a group of rules for finding errors in text. We then construct an automatic error-detection model and implement its algorithm by combining these rules with character bigram and trigram models for scattered single characters and with part-of-speech bigram and trigram models. An experiment on 30 texts containing 578 error test points shows a recall of 86.85% and a precision of 69.43%, with a false-alarm rate of 30.57%.
Keywords: Computer application  Chinese information processing  Chinese text automatic error-detecting  Combining rule-based and statistics-based approaches  non-multi-character word error  real-multi-character word error
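
The three reported figures are related: under the usual definitions, the false-alarm rate is the complement of precision (69.43% + 30.57% = 100%). The sketch below makes that arithmetic explicit. The counts 502 (correctly detected errors) and 723 (total flagged points) are not reported in the abstract; they are back-calculated here as an assumption, chosen because they reproduce the published percentages for 578 error test points.

```python
def detection_metrics(true_hits: int, flagged: int, total_errors: int) -> dict:
    """Compute recall, precision and false-alarm rate as percentages.

    true_hits    -- flagged points that are real errors
    flagged      -- all points the detector flagged
    total_errors -- real errors present in the test texts
    """
    return {
        "recall": round(100 * true_hits / total_errors, 2),
        "precision": round(100 * true_hits / flagged, 2),
        "false_alarm": round(100 * (flagged - true_hits) / flagged, 2),
    }


# Hypothetical counts (assumed, not from the paper) consistent with the abstract.
metrics = detection_metrics(true_hits=502, flagged=723, total_errors=578)
```

With these assumed counts the function yields a recall of 86.85, a precision of 69.43, and a false-alarm rate of 30.57, matching the abstract; other nearby counts could round to the same values, so the figures should not be read as the paper's raw data.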
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.