
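New word discovery algorithm based on mutual information and adjacency entropy
Cite this article: Liu Weitong, Liu Peiyu, Liu Wenfeng, Li Nana. New word discovery algorithm based on mutual information and adjacency entropy[J]. Application Research of Computers (计算机应用研究), 2019, 36(5).
Authors: Liu Weitong (刘伟童), Liu Peiyu (刘培玉), Liu Wenfeng (刘文锋), Li Nana (李娜娜)
Affiliations: School of Information Science and Engineering, Shandong Normal University, Jinan 250358; Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan 250358; School of Information Science and Engineering, Shandong Normal University, Jinan 250358; School of Computer Science, Heze University, Heze, Shandong 274015
Funding: National Natural Science Foundation of China (61373148, 61502151); Shandong Province Social Science Planning Project (17CHLJ18, 17CHLJ33, 17CHLJ30); Natural Science Foundation of Shandong Province (ZR2014FL010); Shandong Provincial Department of Education Fund (J15LN34)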
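Abstract: Identifying new words quickly and efficiently is an important task in natural language processing. To address the problems in current new word discovery, this paper proposes an algorithm that discovers new words character by character, from left to right, in an unsegmented microblog corpus. The algorithm computes the mutual information between a candidate word and its right-adjacent character and extends the candidate one character at a time to obtain candidate new words. It then filters the candidates by computing adjacency entropy, deleting stop words at the head and tail of each candidate, and removing existing words, which yields the final new word set. This resolves two problems: new words that go unrecognized because of word segmentation errors, and the large number of repeated and garbage strings that n-gram methods identify as new words. Experiments verify the effectiveness of the algorithm.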

Keywords: new word discovery; mutual information; adjacency entropy; microblog corpus
Received: 2017/11/20
Revised: 2019/3/31

New word discovery algorithm based on mutual information and branch entropy
Liu Weitong, Liu Peiyu, Liu Wenfeng, Li Nana. New word discovery algorithm based on mutual information and branch entropy[J]. Application Research of Computers, 2019, 36(5).
Authors: Liu Weitong, Liu Peiyu, Liu Wenfeng, Li Nana
Affiliation: School of Information Science and Engineering, Shandong Normal University
Abstract: Identifying new words quickly and efficiently is an important task in natural language processing. To address the problems in existing new word discovery methods, this paper proposes an algorithm that finds new words character by character, from left to right, in an unsegmented Weibo corpus. Candidate new words are obtained by computing the mutual information between a candidate word and its right-adjacent character and extending the candidate one character at a time. The candidates are then filtered to produce the new word set by calculating branch entropy, deleting stop words at the beginning or end of each candidate, and deleting old words from the candidate set. The algorithm solves the problem that some new words cannot be recognized because of word segmentation errors, and the problem that the many repetitive and garbage word strings generated by the n-gram method are identified as new words. Experiments verify the effectiveness of the algorithm.
Keywords: new word discovery; mutual information; branch entropy; microblog corpus
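
The two-stage procedure described in the abstract (extend a candidate to the right while mutual information is high, then filter by branch entropy, stop words, and known words) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of pointwise mutual information with base-2 logarithms, the use of the minimum of left and right entropy as the filter score, the treatment of stop words as single characters, and all thresholds are assumptions made here, since the abstract does not specify them.

```python
import math
from collections import Counter
from typing import Set, Tuple


def count_occurrences(text: str, s: str) -> int:
    """Count occurrences of s in text, allowing overlaps."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def pmi(text: str, left: str, right: str) -> float:
    """Pointwise mutual information of `left` immediately followed by `right`.

    Probabilities are estimated as count / len(text) (an assumption).
    """
    joint = count_occurrences(text, left + right)
    if joint == 0:
        return float("-inf")
    n = len(text)
    p_joint = joint / n
    p_left = count_occurrences(text, left) / n
    p_right = count_occurrences(text, right) / n
    return math.log2(p_joint / (p_left * p_right))


def extend_candidates(text: str, pmi_threshold: float) -> Set[str]:
    """Scan left to right, growing a candidate one character at a time
    while its PMI with the right-adjacent character meets the threshold."""
    candidates = set()
    i = 0
    while i < len(text) - 1:
        j = i + 1
        while j < len(text) and pmi(text, text[i:j], text[j]) >= pmi_threshold:
            j += 1
        if j - i > 1:
            candidates.add(text[i:j])
        i = j  # resume scanning after the candidate (assumed policy)
    return candidates


def entropy(counts: Counter) -> float:
    """Shannon entropy (base 2) of a frequency table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def adjacency_entropy(text: str, word: str) -> Tuple[float, float]:
    """Return (left entropy, right entropy) of the characters adjacent to `word`."""
    left, right = Counter(), Counter()
    for i in range(len(text) - len(word) + 1):
        if text.startswith(word, i):
            if i > 0:
                left[text[i - 1]] += 1
            end = i + len(word)
            if end < len(text):
                right[text[end]] += 1
    return entropy(left), entropy(right)


def filter_candidates(text, candidates, entropy_threshold,
                      stop_chars="", known_words=frozenset()):
    """Strip stop characters from both ends, drop known words, and keep
    candidates whose smaller adjacency entropy meets the threshold."""
    kept = set()
    for word in candidates:
        if stop_chars:
            word = word.strip(stop_chars)
        if len(word) < 2 or word in known_words:
            continue
        left_h, right_h = adjacency_entropy(text, word)
        if min(left_h, right_h) >= entropy_threshold:
            kept.add(word)
    return kept
```

On a real corpus one would call `extend_candidates` on the raw, unsegmented text and pass its result to `filter_candidates` together with a stop-word list and an existing dictionary. The repeated substring counting here is quadratic and only suitable for small examples; a practical version would precompute substring frequencies.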
This article is indexed in Wanfang Data and other databases.