
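Basic Processing Specification for the Peking University Modern Chinese Corpus (北京大学现代汉语语料库基本加工规范)
Cite this article: YU Shi-wen, DUAN Hui-ming, ZHU Xue-feng, SUN Bin. Basic Processing Specification for the Peking University Modern Chinese Corpus [J]. Journal of Chinese Information Processing (中文信息学报), 2002, 16(5): 51-66.
Authors: YU Shi-wen (俞士汶), DUAN Hui-ming (段慧明), ZHU Xue-feng (朱学锋), SUN Bin (孙斌)
Affiliation: Department of Computer Science, Peking University; Institute of Computational Linguistics, Peking University
Funding: National Natural Science Foundation of China (69483003); 973 Program (G1998030507-4); 863 Program (2001AA1140)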
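Abstract: The Institute of Computational Linguistics at Peking University has completed the basic processing of a modern Chinese corpus of 27 million Chinese characters. Besides word segmentation and part-of-speech tagging, the processing covers the tagging of proper nouns (person names, place names, names of organizations and institutions, and so on), morpheme subcategories, and the special usages of verbs and adjectives. The smooth completion of this large-scale language engineering project owes much to a specification that was drawn up in advance and refined continually. The specification is published here to open the discussion and to solicit comments more widely from experts and peers, so that it can be revised further.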

Keywords: modern Chinese, corpus, word segmentation, part-of-speech tagging, specification

The Basic Processing of Contemporary Chinese Corpus at Peking University SPECIFICATION
YU Shi-wen, DUAN Hui-ming, ZHU Xue-feng, SUN Bin. The Basic Processing of Contemporary Chinese Corpus at Peking University SPECIFICATION [J]. Journal of Chinese Information Processing, 2002, 16(5): 51-66.
Authors: YU Shi-wen, DUAN Hui-ming, ZHU Xue-feng, SUN Bin
Affiliation: Institute of Computational Linguistics, Peking University, Beijing 100871, China
Abstract: The Institute of Computational Linguistics, Peking University, has completed the basic processing of a contemporary Chinese corpus of 27 million Chinese characters. In addition to word segmentation and part-of-speech tagging, the processing involves the tagging of proper nouns (person names, place names, organization names, and so on), morpheme subcategories, and the special usages of verbs and adjectives. The success of this large-scale language engineering project is attributed to the SPECIFICATION, which was drawn up beforehand and refined continually while in use. We introduce the SPECIFICATION through this publication and invite comments from experts and colleagues so that it can be improved further.
Keywords: contemporary Chinese, corpus, word segmentation, part-of-speech tagging, specification
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.