
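Application of improved point-wise mutual information in term extraction (互信息改进方法在术语抽取中的应用)
Cite this article:DU Liping, LI Xiaoge, ZHOU Yuanzhe, SHAO Chunchang. Application of improved point-wise mutual information in term extraction[J]. Journal of Computer Applications (计算机应用), 2015, 35(4): 996-1000. DOI: 10.11772/j.issn.1001-9081.2015.04.0996
Authors:DU Liping (杜丽萍)  LI Xiaoge (李晓戈)  ZHOU Yuanzhe (周元哲)  SHAO Chunchang (邵春昌)
Affiliation:1. College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121; 2. College of Science, Minzu University of China, Beijing 100081
Funding:National Natural Science Foundation of China; Graduate Innovation Fund of Xi'an University of Posts and Telecommunications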
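Abstract:Point-wise mutual information (PMI) overestimates the bonding strength between two low-frequency strings that always occur together. This work determines the value of the parameter k at which the improved method (PMIk) overcomes that shortcoming. It also addresses two further problems: some terms cannot be extracted when a term extraction system works on a segmented corpus that contains segmentation errors, and such systems port poorly to new domains. To this end, an algorithm is proposed that combines PMIk with two basic filtering rules to extract terms from an unsegmented corpus. First, PMIk is used to compute the bonding strength between two characters and to determine the 2-gram seeds to be extended. Second, PMIk is used to compute the bonding strength between each 2-gram seed and the character on its left and the character on its right, which determines whether the 2-gram can be extended to a 3-gram; iterating this step yields multi-gram candidate terms. Finally, the two basic filtering rules remove garbage strings from the candidate terms to give the final result. Theoretical analysis shows that PMIk overcomes the shortcoming of PMI when k≥3 (k∈N+). Experiments on a 1 GB Sina Finance blog corpus and a 300 MB Baidu Tieba corpus confirm the theoretical analysis: PMIk achieves higher precision than PMI, and the algorithm shows good portability.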
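The abstract states the k≥3 result without the formula behind it. The following is a short worked sketch, assuming the commonly used definition of PMIk in which the joint probability is raised to the power k; the paper's own notation and derivation are not shown on this page, so the symbols p, x, and y here are illustrative.

```latex
\begin{align*}
\mathrm{PMI}(x, y)   &= \log \frac{p(x, y)}{p(x)\, p(y)}, &
\mathrm{PMI}^{k}(x, y) &= \log \frac{p(x, y)^{k}}{p(x)\, p(y)}.
\end{align*}
% Worst case for PMI: x and y always occur together,
% so p(x) = p(y) = p(x, y) = p.
\begin{align*}
\mathrm{PMI}(x, y)     &= \log \frac{p}{p^{2}} = -\log p, \\
\mathrm{PMI}^{k}(x, y) &= \log \frac{p^{k}}{p^{2}} = (k - 2)\, \log p.
\end{align*}
% k = 1: the score -log p grows as p shrinks, so rarer pairs score higher.
% k = 2: the score is 0 for every p, so frequency is ignored.
% k >= 3: the score (k - 2) log p grows with p, so frequent pairs score higher.
```

Under this assumed definition, k = 3 is the smallest positive integer for which an always-co-occurring pair is no longer favoured for being rare, which is consistent with the k≥3 condition stated in the abstract.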

Keywords:term extraction  technical term  knowledge acquisition  mutual information
Received:2014-10-30
Revised:2015-01-13

Application of improved point-wise mutual information in term extraction
DU Liping,LI Xiaoge,ZHOU Yuanzhe,SHAO Chunchang. Application of improved point-wise mutual information in term extraction[J]. Journal of Computer Applications, 2015, 35(4): 996-1000. DOI: 10.11772/j.issn.1001-9081.2015.04.0996
Authors:DU Liping  LI Xiaoge  ZHOU Yuanzhe  SHAO Chunchang
Affiliation:1. College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an, Shaanxi 710121, China;
2. College of Science, Minzu University of China, Beijing 100081, China
Abstract:The traditional Point-wise Mutual Information (PMI) method has the shortcoming of overvaluing the bonding strength of two low-frequency strings that always occur together. To find the value of the parameter k at which the improved PMI method, named PMIk, overcomes this shortcoming, to solve the problem that some terms cannot be extracted from a segmented corpus because of segmentation errors, and to improve the portability of the term extraction system, a new method that combines PMIk with two fundamental filtering rules was put forward to identify terms in an unsegmented corpus. Firstly, the 2-gram seeds to be extended were determined by computing the bonding strength of two adjoining characters with the PMIk method. Secondly, whether a 2-gram seed could be extended to a 3-gram was determined by computing the bonding strength between the seed and the character in front of it and between the seed and the character behind it; multi-gram term candidates were then obtained by iterating this step. Finally, garbage strings among the term candidates were filtered out using the two fundamental rules to obtain the terms. The theoretical analysis shows that PMIk can overcome the shortcoming of PMI when k≥3 (k∈N+). Experiments on a 1 GB Sina Finance blog corpus and a 300 MB Baidu Tieba corpus verify the theoretical analysis; PMIk achieves higher precision than PMI, and the algorithm has good portability.
Keywords:term extraction  technical term  knowledge acquisition  Point-wise Mutual Information (PMI)
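
The seed-and-extend procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the common definition PMIk = log(p(xy)^k / (p(x) p(y))), estimates every probability as a raw count divided by the corpus length, and uses a single hypothetical `threshold` for both seeding and extension. The function names, `max_n`, and the decision to keep shorter candidates alongside their extensions are all assumptions. The two filtering rules are not specified on this page, so the sketch stops at candidate generation.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in raw, unsegmented text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def pmi_k(left, right, counts, total, k=3):
    """Assumed definition: log(p(left+right)^k / (p(left) * p(right))).

    k=1 gives plain PMI. Probabilities are counts divided by `total`.
    """
    joint = counts[left + right]
    if joint == 0:
        return float("-inf")  # the two strings never occur side by side
    p_xy = joint / total
    p_x = counts[left] / total
    p_y = counts[right] / total
    return k * math.log(p_xy) - math.log(p_x) - math.log(p_y)


def extract_candidates(text, threshold, k=3, max_n=4):
    """Return candidate strings of length 2..max_n (before any filtering)."""
    counts = ngram_counts(text, max_n)
    total = len(text)

    # Step 1: 2-gram seeds are adjacent character pairs that bond strongly.
    frontier = {
        g for g in counts
        if len(g) == 2 and pmi_k(g[0], g[1], counts, total, k) >= threshold
    }
    candidates = set(frontier)

    # Step 2: grow each accepted (n-1)-gram by one character on either side.
    for n in range(3, max_n + 1):
        grown = set()
        for g in counts:
            if len(g) != n:
                continue
            prefix, last = g[:-1], g[-1]   # extension to the right
            first, suffix = g[0], g[1:]    # extension to the left
            if prefix in frontier and pmi_k(prefix, last, counts, total, k) >= threshold:
                grown.add(g)
            elif suffix in frontier and pmi_k(first, suffix, counts, total, k) >= threshold:
                grown.add(g)
        candidates |= grown
        frontier = grown
    return candidates
```

Calling `pmi_k` with `k=1` and `k=3` on the same counts shows the effect the abstract describes: for two pairs that each always occur together, `k=1` ranks the rarer pair higher, while `k=3` ranks the more frequent pair higher. A real run would pass the corpus text and a tuned threshold to `extract_candidates`, then apply the paper's filtering rules to the returned set.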
This article is indexed in Wanfang Data (万方数据) and other databases.