
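一种基于主动学习的中文新词识别算法 (A Chinese New Word Recognition Algorithm Based on Active Learning)
Cite this article: WANG Bo, DAI Xiang, SHI Cong, LIU Yang. 一种基于主动学习的中文新词识别算法[J]. 电讯技术 (Telecommunication Engineering), 2020(11).
Authors: WANG Bo (王 博), DAI Xiang (代 翔), SHI Cong (时 聪), LIU Yang (刘 洋)
Affiliations: 1. Beijing Institute of Information Technology, Beijing 100094; 2. Southwest China Institute of Electronic Technology, Chengdu 610036
Funding: National Natural Science Foundation of China (U19A2078); Sichuan Science and Technology Program, Key Research and Development Project (2020YFG0009)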
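Abstract (translated from the Chinese): Word segmentation is a fundamental step in Chinese natural language processing, and the continual emergence of new words is its greatest difficulty. To address two practical problems, the unclear definition of new word recognition and the shortage of corpora, the paper proposes a new word recognition algorithm built on a large-scale pre-trained neural network model and combined with active learning and hand-crafted rules. The pre-trained model identifies candidate new words efficiently; an active learning strategy that selects samples by uncertainty and representativeness assists the annotation of new words; and popularity, burstiness, and compositionality rules identify and filter the new word discovery results. To address inconsistent evaluation standards for new word recognition, the paper defines two standardized test metrics, general accuracy and restricted accuracy. In experiments against the best existing algorithm, the proposed algorithm improves the two metrics by 16% and 4%, respectively.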

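Keywords (translated from the Chinese): Chinese natural language processing; Chinese new word recognition; active learning; deep neural network; hand-crafted rules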

Chinese New Words Recognition Based on Active Learning
WANG Bo, DAI Xiang, SHI Cong, LIU Yang. Chinese New Words Recognition Based on Active Learning[J]. Telecommunication Engineering, 2020(11).
Authors:WANG Bo  DAI Xiang  SHI Cong  LIU Yang
Affiliation: 1. Beijing Institute of Information Technology, Beijing 100094, China; 2. Southwest China Institute of Electronic Technology, Chengdu 610036, China
Abstract: Word segmentation is an essential basis of Chinese natural language processing, and the continuous emergence of new words is its biggest problem. To address the unclear definition of new word recognition and the lack of corpora, a new word recognition algorithm is proposed that builds on a large-scale pre-trained neural network model and combines it with active learning and hand-crafted rules. The pre-trained model is used to identify new word candidates efficiently. An active learning strategy based on uncertainty and representative sample selection is used to assist new word labelling. A heat (popularity) rule, a burst rule, and a compound rule are used to identify and filter the new word discovery results. Two standard test metrics, the general accuracy rate and the restricted accuracy rate, are given to address inconsistent evaluation criteria. In an experimental comparison with the best existing algorithm, the proposed algorithm improves the two metrics by 16% and 4%, respectively.
Keywords: Chinese natural language processing; Chinese new word identification; active learning; deep neural network; hand-crafted rules
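
The abstract says that samples are chosen for annotation by uncertainty and representativeness but gives no formulas. The following is only a minimal sketch of that general idea, not the authors' method. It assumes uncertainty is the normalized entropy of the model's predicted label distribution, representativeness is the mean cosine similarity to the other candidates, and the two are mixed with a hypothetical weight `alpha`. The function names and the `(word, probs, embedding)` candidate layout are likewise illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted label distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_for_annotation(candidates, k, alpha=0.5):
    """Return the k candidate words with the highest combined score.

    candidates: list of (word, probs, embedding) tuples (assumed layout).
    alpha: hypothetical weight between uncertainty and representativeness.
    """
    scored = []
    for i, (word, probs, emb) in enumerate(candidates):
        # Assumption: uncertainty = entropy scaled to [0, 1] by its maximum.
        max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
        uncertainty = entropy(probs) / max_entropy
        # Assumption: representativeness = mean similarity to all other candidates.
        sims = [cosine(emb, other_emb)
                for j, (_, _, other_emb) in enumerate(candidates) if j != i]
        representativeness = sum(sims) / len(sims) if sims else 0.0
        score = alpha * uncertainty + (1 - alpha) * representativeness
        scored.append((score, word))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:k]]


# Toy, made-up inputs: two ambiguous, mutually similar candidates and one
# confidently classified outlier.
toy_candidates = [
    ("cand_a", [0.50, 0.50], [1.0, 0.0]),
    ("cand_b", [0.55, 0.45], [0.9, 0.1]),
    ("cand_c", [0.99, 0.01], [0.0, 1.0]),
]
picked = select_for_annotation(toy_candidates, k=2)
```

With these toy inputs, the two ambiguous and mutually similar candidates are sent to the annotator, while the confidently classified outlier is left out. In practice the probabilities and embeddings would come from the pre-trained model, and `alpha` would need tuning.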