
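A Hybrid Model and Algorithm Combining Rules and Statistics for Automatic Error Detection in Chinese Text (基于规则与统计相结合的中文文本自动查错模型与算法)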

Cite this article:	ZHANG Yang-sen, CAO Yuan-da, YU Shi-wen. 基于规则与统计相结合的中文文本自动查错模型与算法 [J]. 中文信息学报 (Journal of Chinese Information Processing), 2006, 20(4): 3-7, 55.

Authors:	ZHANG Yang-sen (张仰森), CAO Yuan-da (曹元大), YU Shi-wen (俞士汶)

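Affiliations:	1. Institute of Computational Linguistics, Peking University; 2. Department of Computer Science and Engineering, Beijing Institute of Technology; 3. Department of Computer and Automation, Beijing Information Science and Technology University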

Funding:	National Research and Development Fund; National Key Science and Technology Project; China Postdoctoral Science Foundation

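Abstract:	Automatic proofreading of Chinese text is a challenging research topic in natural language processing. This paper proposes a model and algorithm for automatic error detection in Chinese text that combines rules with statistics. Based on the patterns with which single-character words occur in correct text after word segmentation, and on the concept of a "non-multi-character-word error", a set of error-discovery rules is proposed. These rules are combined with character bigram and trigram statistical models and part-of-speech bigram and trigram statistical models, built for the scattered strings of single characters left after segmentation, to form the automatic error-detection model and its implementation algorithm. In an experiment on 30 texts containing 578 error test points, the proposed algorithm achieved a recall of 86.85%, a precision of 69.43%, and a false-alarm rate of 30.57%.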
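Keywords:	computer applications; Chinese information processing; automatic error detection in Chinese text; combination of rules and statistics; non-multi-character-word errors; real multi-character-word errors
Article ID:	1003-0077(2006)04-0001-07
Received:	2005-07-07
Revised:	2005-07-07; 2006-06-02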
A Hybrid Model of Combining Rule-based and Statistics-based Approaches for Automatic Detecting Errors in Chinese Text

ZHANG Yang-sen, CAO Yuan-da, YU Shi-wen. A Hybrid Model of Combining Rule-based and Statistics-based Approaches for Automatic Detecting Errors in Chinese Text [J]. Journal of Chinese Information Processing, 2006, 20(4): 3-7, 55.

Authors:	ZHANG Yang-sen, CAO Yuan-da, YU Shi-wen

Affiliation:	1. Institute of Computational Linguistics, Peking University; 2. Department of Computer Science and Engineering, Beijing Institute of Technology; 3. Department of Computer and Automation, Beijing Information Science and Technology University

Abstract:	Automatic proofreading of Chinese text is an important and challenging research subject in NLP. This article proposes a hybrid model that combines rules and statistics. Based on the distribution of single-character words in correct Chinese text after word segmentation and on the concept of a "non-multi-character word error", we propose a group of rules for finding errors in text. We then construct the automatic error-detection model and implement its algorithm by combining these rules with character bigram and trigram models and part-of-speech bigram and trigram models built over the scattered single-character strings. An experiment on 30 texts containing 578 error test points shows a recall of 86.85%, a precision of 69.43%, and a false-alarm rate of 30.57%.
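The abstract does not give the rules, the smoothing method, or any thresholds, so the following is only a minimal sketch of one ingredient it names: scoring runs of single-character tokens left by segmentation with a character bigram model. Every function name, the add-one smoothing, and the threshold value are illustrative assumptions, not the authors' method.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def train_char_bigrams(corpus: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count characters and adjacent character pairs in (assumed correct) text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def single_char_runs(tokens: List[str], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges of consecutive one-character tokens."""
    runs, start = [], None
    # A two-character sentinel closes a run that reaches the end of the list.
    for i, tok in enumerate(tokens + ["\0\0"]):
        if len(tok) == 1:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


def suspicious_pairs(tokens, unigrams, bigrams, threshold=0.1):
    """Flag adjacent characters inside single-character runs whose
    add-one smoothed P(b | a) falls below an assumed threshold."""
    vocab = max(len(unigrams), 1)
    flagged = []
    for start, end in single_char_runs(tokens):
        for i in range(start, end - 1):
            a, b = tokens[i], tokens[i + 1]
            p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
            if p < threshold:
                flagged.append((i, a + b, p))
    return flagged


# Toy usage: "学升" is a typo for "学生", so the segmenter leaves single characters.
uni, bi = train_char_bigrams(["我是学生", "他是老师", "我在学校"])
flags = suspicious_pairs(["我", "是", "学", "升"], uni, bi, threshold=0.1)
```

A trigram or part-of-speech variant would follow the same shape with different keys in the count tables; how the paper weights or combines these signals with its rules is not stated in the abstract.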

Keywords:	computer application; Chinese information processing; Chinese text automatic error detection; combining rule-based and statistics-based approaches; non-multi-character word error; real-multi-character word error
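The three figures reported in the abstract can be related with a short calculation. The sketch below assumes the usual definitions (recall = correct flags / true errors, precision = correct flags / all flags, false-alarm rate = 1 - precision). Only the 578 error points come from the abstract; the counts 723 and 502 are not stated there and are simply the integers that reproduce the reported percentages under these assumed definitions.

```python
def detection_metrics(true_errors: int, flagged: int, correct_flags: int) -> dict:
    """Recall, precision and false-alarm rate, as percentages rounded to 2 places."""
    recall = correct_flags / true_errors
    precision = correct_flags / flagged
    false_alarm = (flagged - correct_flags) / flagged  # equals 1 - precision
    return {
        "recall": round(100 * recall, 2),
        "precision": round(100 * precision, 2),
        "false_alarm": round(100 * false_alarm, 2),
    }


# 578 is from the abstract; 723 and 502 are inferred, hypothetical counts.
metrics = detection_metrics(true_errors=578, flagged=723, correct_flags=502)
```

Under these definitions precision and false-alarm rate always sum to 100%, which matches the reported 69.43% and 30.57%.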
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.