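Research on Corpus-Based Automatic Extraction of Lettered-words
Cite this article: ZHENG Ze-zhi, ZHANG Pu, YANG Jian-guo. Research on Corpus-Based Automatic Extraction of Lettered-words [J]. Journal of Chinese Information Processing, 2005, 19(2): 79-86.
Authors: ZHENG Ze-zhi (郑泽之), ZHANG Pu (张普), YANG Jian-guo (杨建国)
Affiliation: 1. Department of Computer Science, Taiyuan Teacher's College, Taiyuan, Shanxi 030012; 2. DCC Doctoral Research Lab, Beijing Language and Culture University, Beijing 100083
Funding: Project of the National Language Resources Monitoring and Research Center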
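Abstract: At present, many of the newest terms and proper nouns first enter Chinese in the form of lettered-words, and their use is growing steadily. Most lettered-words are unknown (out-of-vocabulary) words for automatic Chinese word segmentation, so recognizing them correctly helps improve the quality of applications such as Chinese word segmentation, information retrieval, search engines, and machine translation. Building on a preliminary survey of lettered-words, this paper analyzes the complex features of their composition and the difficulties of recognizing them automatically. Combining various statistical features of lettered-words with their one distinctive property, the letter-string "anchor", it proposes an algorithm for automatic lettered-word extraction that expands from the center toward both sides, using rules assisted by statistics. The bilingual co-occurrence of lettered-words is also handled. The algorithm is simple but effective: recall is 100% and precision is above 80%.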

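Keywords: artificial intelligence; natural language processing; lettered-word; automatic extraction
Article ID: 1003-0077(2005)02-0078-08
Revised manuscript received: June 16, 2004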

The Research on Lettered-word Extraction in Chinese Texts
ZHENG Ze-zhi, ZHANG Pu, YANG Jian-guo. The Research on Lettered-word Extraction in Chinese Texts [J]. Journal of Chinese Information Processing, 2005, 19(2): 79-86.
Authors: ZHENG Ze-zhi, ZHANG Pu, YANG Jian-guo
Affiliation: 1. Taiyuan Teacher's College, Taiyuan, Shanxi 030012, China; 2. DCC Lab, Beijing Language and Culture University, Beijing 100083, China
Abstract: Nowadays, more and more lettered-words are used in Chinese texts, most of them new terms or proper nouns, and this trend is becoming increasingly apparent. Lettered-words are usually unknown words in automatic Chinese segmentation, so identifying them correctly will improve the quality of Chinese segmentation, information retrieval, search technology, machine translation, etc. Based on observation of lettered-words in our Chinese corpus, this paper analyzes the complex features of Chinese lettered-words and discusses the difficulties in extracting them. It then presents an algorithm for the automatic identification of Chinese lettered-words, which uses a letter string as the anchor and searches its left and right contexts for the boundaries of the lettered-word. The algorithm is simple but effective. Our experiment on the corpus of the People's Daily (Year 2002) shows a precision of 80% and a recall of 100%.
Keywords: artificial intelligence; natural language processing; lettered-word; automatic extracting
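
The anchor-and-expand idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the paper's rules or statistics, so the `ANCHOR` pattern, the `LEFT_PARTS` and `RIGHT_PARTS` vocabularies, and all function names below are hypothetical choices made for illustration. The statistical filtering and the bilingual co-occurrence handling mentioned in the abstract are left out.

```python
import re

# Anchor: a run of Latin letters, optionally continued by digits or
# hyphen-joined segments (e.g. "MP3", "CD-ROM"). This pattern is an assumption.
ANCHOR = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*")

# Hypothetical expansion vocabularies; the paper's real rules are not given
# in the abstract.
LEFT_PARTS = ("卡拉",)                  # e.g. 卡拉OK
RIGHT_PARTS = ("光", "超", "恤", "射线")  # e.g. X光, B超, T恤, X射线
DIGITS = "0123456789"


def expand_left(text, start):
    """Move the left boundary outward over digits and known left parts."""
    while True:
        if start > 0 and text[start - 1] in DIGITS:
            start -= 1
            continue
        for part in LEFT_PARTS:
            if start >= len(part) and text.startswith(part, start - len(part)):
                start -= len(part)
                break
        else:
            # No digit and no known part to the left: boundary found.
            return start


def expand_right(text, end):
    """Move the right boundary outward over known right parts."""
    while True:
        for part in RIGHT_PARTS:
            if text.startswith(part, end):
                end += len(part)
                break
        else:
            # No known part to the right: boundary found.
            return end


def extract_lettered_words(text):
    """Find each letter-string anchor and expand it in both directions."""
    words = []
    for match in ANCHOR.finditer(text):
        start = expand_left(text, match.start())
        end = expand_right(text, match.end())
        words.append(text[start:end])
    return words
```

Calling `extract_lettered_words("他去医院做B超,然后唱卡拉OK,还买了MP3和T恤。")` returns `["B超", "卡拉OK", "MP3", "T恤"]` under these assumed vocabularies. A fixed vocabulary like this is only a stand-in for the rule-plus-statistics boundary decisions the abstract describes.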
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.