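Confidence Based Active Learning Model for Tibetan Person Name Recognition
Cite this article:WANG Zhijuan, LIU Feifei, ZHAO Xiaobing, SONG Wei. Confidence Based Active Learning Model for Tibetan Person Name Recognition[J]. Journal of Chinese Information Processing, 2019, 33(8): 53-59.
Authors:WANG Zhijuan  LIU Feifei  ZHAO Xiaobing  SONG Wei
Affiliations:1.School of Information Engineering, Minzu University of China, Beijing 100081, China;
2.National Language Resource Monitoring & Research Center of Minority Languages, Beijing 100081, China;
3.Tomorrow Advancing Life Education Group, Beijing 100080, China
Funding:National Natural Science Foundation of China (61331013, 61501529)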
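Abstract:The cost of annotating training corpora is a major problem for research on low-resource language processing. Active learning can select informative, non-redundant data for manual annotation and thereby greatly reduce annotation cost. Based on the labeling confidence output by a CRF model, this paper proposes four active learning methods and determines their parameters experimentally. The experiments show that when data with confidence below 0.7 are selected for manual annotation, and iteration continues until the labeling results of the new and old models differ by less than 0.01%, only 6 iterations are needed. With 3.2MB of manually annotated data, the F-measure of Tibetan person name recognition reaches 88%. To reach the same performance, a supervised CRF model needs about 10MB of annotated data, so the active learning method reduces the annotation scale by about 66%.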

Keywords:Tibetan person name recognition  active learning  confidence

Confidence Based Active Learning Model for Tibetan Person Name Recognition
WANG Zhijuan,LIU Feifei,ZHAO Xiaobing,SONG Wei.Confidence Based Active Learning Model for Tibetan Person Name Recognition[J].Journal of Chinese Information Processing,2019,33(8):53-59.
Authors:WANG Zhijuan  LIU Feifei  ZHAO Xiaobing  SONG Wei
Affiliation:1.School of Information Engineering, Minzu University of China, Beijing 100081, China;
2.National Language Resource Monitoring & Research Center of Minority Languages, Beijing 100081, China;
3.Tomorrow Advancing Life Education Group, Beijing 100080, China
Abstract:Annotating training data is costly for low-resource languages; active learning alleviates this by selecting informative, non-redundant data for manual labeling. Four active learning methods based on the labeling confidence of a CRF model are proposed, with their parameters determined empirically. Experiments show that selecting data with confidence below 0.7, and iterating until the labeling results of the new and old models differ by less than 0.01%, takes only 6 iterations and 3.2MB of manually labeled data to reach an F-measure of 0.88 for Tibetan person name recognition. A supervised CRF model needs about 10MB of labeled data to reach the same performance, so the active learning approach reduces the annotation scale by about 66%.
Keywords:Tibetan person name recognition  active learning  confidence  
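
The selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not spell out the four methods or how the new/old model difference is computed, so the difference measure here (fraction of changed labels on a fixed item set) is an assumption, and `train_centroid` is a hypothetical toy classifier standing in for the CRF. Only the two thresholds (0.7 and 0.01%) come from the abstract; all function and variable names are illustrative.

```python
from typing import List

CONF_THRESHOLD = 0.7     # abstract: annotate items tagged with confidence below 0.7
DIFF_THRESHOLD = 0.0001  # abstract: stop once old/new outputs differ by less than 0.01%


def label_difference(old: List[object], new: List[object]) -> float:
    """Fraction of positions whose label changed (an assumed difference measure)."""
    if not old:
        return 0.0
    return sum(a != b for a, b in zip(old, new)) / len(old)


def active_learning(labeled, pool, train, oracle,
                    conf_threshold=CONF_THRESHOLD,
                    diff_threshold=DIFF_THRESHOLD,
                    max_rounds=20):
    """Return (model, labeled, rounds) after confidence-based selection.

    train(labeled) must return a callable: item -> (label, confidence).
    oracle(item) plays the role of the human annotator.
    Items are assumed to be hashable.
    """
    labeled, pool = list(labeled), list(pool)
    eval_items = list(pool)  # fixed set used to compare successive models
    model = train(labeled)
    prev = [model(x)[0] for x in eval_items]
    rounds = 0
    while pool and rounds < max_rounds:
        # Select the items the current model is least sure about.
        uncertain = [x for x in pool if model(x)[1] < conf_threshold]
        if not uncertain:
            break
        labeled += [(x, oracle(x)) for x in uncertain]  # "manual" annotation
        chosen = set(uncertain)
        pool = [x for x in pool if x not in chosen]
        model = train(labeled)  # retrain on the enlarged labeled set
        rounds += 1
        curr = [model(x)[0] for x in eval_items]
        if label_difference(prev, curr) < diff_threshold:
            break
        prev = curr
    return model, labeled, rounds


def train_centroid(labeled):
    """Toy stand-in for the CRF: nearest-centroid classifier on floats.

    Assumes the labeled data contains at least one item of each class.
    """
    pos = [x for x, y in labeled if y]
    neg = [x for x, y in labeled if not y]
    c_pos, c_neg = sum(pos) / len(pos), sum(neg) / len(neg)

    def model(x):
        d_pos, d_neg = abs(x - c_pos), abs(x - c_neg)
        total = d_pos + d_neg
        # Confidence lies in [0.5, 1]: the closer to one centroid, the higher.
        conf = max(d_pos, d_neg) / total if total else 0.5
        return d_pos <= d_neg, conf

    return model


# Toy run: the true label is "x >= 5.0"; the seed set has two labeled items.
seed = [(0.0, False), (10.0, True)]
pool = [1.0, 2.0, 4.0, 4.9, 5.1, 6.0, 8.0, 9.0]
model, labeled, rounds = active_learning(seed, pool, train_centroid,
                                         lambda x: x >= 5.0)
```

In a real setting, `train` would fit a CRF tagger and return per-sentence labeling confidence, and `oracle` would be replaced by human annotation of the selected sentences.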
This article is indexed in VIP (维普) and other databases.