Automatic Extraction of Multi-Word Domain Term in Uyghur Texts
Original title (Chinese): 维吾尔语多词领域术语的自动抽取
Cite this article: TIAN Shengwei (田生伟), ZHONG Jun (钟军), YU Long (禹龙). Automatic Extraction of Multi-Word Domain Term in Uyghur Texts[J]. Journal of Chinese Information Processing (中文信息学报), 2015, 29(2): 133-141
Authors: TIAN Shengwei  ZHONG Jun  YU Long
Affiliation: 1. School of Software, Xinjiang University, Urumqi, Xinjiang 830008, China;
2. Information Science and Engineering Technology Institute, Xinjiang University, Urumqi, Xinjiang 830046, China;
3. Net Center, Xinjiang University, Urumqi, Xinjiang 830046, China
Funding: National Natural Science Foundation of China (60963017, 60963018, 61262064, 61331011); National Social Science Foundation of China (10BTQ045, 11XTQ007)
Abstract: Multi-word domain term extraction is an important and difficult problem in natural language processing. Drawing on the linguistic features of Uyghur, this paper proposes a method for automatically extracting Uyghur multi-word domain terms that combines rules and statistics. The method has four phases: ① corpus preprocessing, including stop-word filtering and part-of-speech (POS) tagging; ② taking N-gram substrings of each word string, measuring the internal association strength of each substring with a modified mutual information measure and the log-likelihood ratio, and applying POS composition rules to build the set of candidate terms; ③ using the relative frequency difference to recover as many Uyghur multi-word domain terms as possible; ④ selecting the final terms by C_value and post-processing them. In experiments the method reaches a precision of 85.08% and a recall of 73.19%, which supports its effectiveness for Uyghur multi-word domain term extraction.
Keywords: Uyghur   multi-word domain term   mutual information (MI)   log-likelihood ratio (LLR)   relative frequency difference (RFD)
This article is indexed in CNKI and other databases.