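A Study of Automatic Proofreading Methods for Part-of-Speech Tagging of Chinese Corpora
Cite this article: QIAN Yi-li, ZHENG Jia-heng. A Study of Automatic Proofreading Methods for Part-of-Speech Tagging of Chinese Corpora [J]. Journal of Chinese Information Processing (中文信息学报), 2004, 18(2): 31-36.
Authors: QIAN Yi-li (钱揖丽), ZHENG Jia-heng (郑家恒)
Affiliation: Department of Computer Science, Shanxi University
Funding: National High Technology Research and Development Program of China (863 Program)
Abstract: Disambiguating the word class of multi-category words is a difficult problem in part-of-speech tagging of Chinese corpora, and it seriously degrades tagging quality. To address this problem, this paper proposes an automatic proofreading method for the part-of-speech tags of multi-category words. The method uses data mining to extract useful information from a correctly tagged training corpus, automatically generates part-of-speech correction rules for multi-category words, and applies those rules to correct a corpus that has been initially tagged by machine, thereby improving the tagging quality of multi-category words. Closed and open tests were each run on a Chinese corpus of 500,000 characters; the results show that, after correction, the tagging accuracy for multi-category words improves by 11.32% and 5.97%, respectively.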

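Keywords: computer applications; Chinese information processing; multi-category words; Chinese part-of-speech tagging; automatic proofreading; rough sets
Article ID: 1003-0077(2004)02-0030-06
Revised manuscript received: August 6, 2003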

Research on the Method of Automatic Correction of Chinese Part-of-Speech Tagging
QIAN Yi-li, ZHENG Jia-heng. Research on the Method of Automatic Correction of Chinese Part-of-Speech Tagging [J]. Journal of Chinese Information Processing, 2004, 18(2): 31-36.
Authors: QIAN Yi-li, ZHENG Jia-heng
Affiliation: Department of Computer Science, Shanxi University
Abstract: Disambiguating multi-category words is one of the main difficulties in part-of-speech tagging of Chinese text, and it greatly affects the quality of tagged corpora. To address this problem, the paper describes an approach to automatically correcting the part-of-speech tags of multi-category words. Using rough sets and data mining, it acquires correction rules for multi-category words from correctly tagged corpora, and then applies these rules to correct machine-tagged corpora automatically. In closed and open tests on a corpus of 500,000 Chinese characters, the part-of-speech tagging accuracy for multi-category words increased by 11.32% and 5.97%, respectively.
Keywords: computer application; Chinese information processing; multi-category word; Chinese part-of-speech tagging; automatic correction; rough sets
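
The abstract describes a two-stage pipeline: mine correction rules from a correctly tagged corpus, then apply them to a machine-tagged corpus. The sketch below illustrates that general idea only and is not the authors' method. The abstract does not give the rule attributes or the rough-set reduction, so this example substitutes a simple frequency-based rule keyed on (word, previous tag, next tag), filtered by hypothetical `min_support` and `min_confidence` thresholds. The function names, the toy sentences, and the tag labels (`r`, `v`, `n`, `q`, `d`, `a`) are all illustrative assumptions.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def context(sent, i):
    """Return (previous tag, next tag) around position i of a tagged sentence."""
    prev_tag = sent[i - 1][1] if i > 0 else START
    next_tag = sent[i + 1][1] if i + 1 < len(sent) else END
    return prev_tag, next_tag


def mine_rules(gold, min_support=2, min_confidence=0.9):
    """Mine (word, prev_tag, next_tag) -> tag rules from a correctly tagged corpus.

    This is a plain frequency filter standing in for the rough-set based
    rule extraction named in the abstract, whose details are not given there.
    """
    # A multi-category word is one seen with more than one tag.
    tags_of = defaultdict(set)
    for sent in gold:
        for word, tag in sent:
            tags_of[word].add(tag)
    ambiguous = {w for w, tags in tags_of.items() if len(tags) > 1}

    # Count which tag each ambiguous word takes in each tag context.
    counts = defaultdict(Counter)
    for sent in gold:
        for i, (word, tag) in enumerate(sent):
            if word in ambiguous:
                counts[(word, *context(sent, i))][tag] += 1

    # Keep only contexts that are frequent and nearly unambiguous.
    rules = {}
    for key, tag_counts in counts.items():
        best_tag, n = tag_counts.most_common(1)[0]
        if n >= min_support and n / sum(tag_counts.values()) >= min_confidence:
            rules[key] = best_tag
    return rules


def correct(sent, rules):
    """Apply mined rules to one machine-tagged sentence.

    Context tags are read from the initial tagging, so an error in a
    neighbouring tag can prevent a rule from firing.
    """
    out = list(sent)
    for i, (word, _tag) in enumerate(sent):
        new_tag = rules.get((word, *context(sent, i)))
        if new_tag is not None:
            out[i] = (word, new_tag)
    return out


# Toy training corpus: 研究 appears as both a verb (v) and a noun (n).
gold = [
    [("我们", "r"), ("研究", "v"), ("语言", "n")],
    [("他们", "r"), ("研究", "v"), ("问题", "n")],
    [("这", "r"), ("项", "q"), ("研究", "n"), ("很", "d"), ("重要", "a")],
    [("那", "r"), ("项", "q"), ("研究", "n"), ("很", "d"), ("有趣", "a")],
]
rules = mine_rules(gold)

# A machine-tagged sentence in which 研究 was wrongly tagged as a noun.
fixed = correct([("我们", "r"), ("研究", "n"), ("问题", "n")], rules)
```

Raising `min_confidence` trades coverage for precision: fewer rules fire, but each one is less likely to overwrite a tag that was already correct.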
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.