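Named Entity Recognition Algorithm Based on Active Learning
Cite this article: ZHANG Cen-fang. Named Entity Recognition Algorithm Based on Active Learning[J]. Computer and Modernization, 2021, 0(7): 18-22.
Author: ZHANG Cen-fang
Affiliation: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, China
Funding: Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX19_0054)
Abstract: Named entity recognition (NER) aims to identify the boundaries and categories of entity mentions in text. Training an NER model usually requires a large number of labeled samples. This paper implements effective selection algorithms that pick, from a large pool of samples, those best suited to updating the model, which reduces the annotation workload. Five groups of comparison experiments verify that an effective selection algorithm yields a better sample set and makes annotation more targeted. Experiments designed on a microblog dataset verify that the proposed stream-based active learning algorithm can select a more suitable sample set from large volumes of Internet text, effectively reducing the cost of manual annotation. The paper uses two models to handle boundary extraction and category classification separately: a sequence labeling model extracts the positions of entities in the sequence, and an entity classification model classifies the labeled results. Active learning is used to train on unlabeled datasets. The training method is evaluated on two datasets. Experiments on the Weibo dataset show that the algorithm can learn text features from an unlabeled dataset. Results on the MSRA dataset show that once the proportion of pre-training data reaches 40% or more, the model's F1 score on the test set stabilizes at about 90%, close to the result obtained with the full dataset, indicating that the model has some ability to extract features from unlabeled data.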

Keywords: named entity recognition  active learning  deep learning  Bi-LSTM
Received: 2021-08-02
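
The abstract names a stream-based active learning algorithm but does not state its selection criterion. The sketch below is therefore only an illustration of the general idea, assuming a least-confidence score and a fixed threshold: each unlabeled sentence arrives once, and it is sent for annotation only if the current model is uncertain about it. The names `least_confidence`, `select_from_stream`, `predict`, `threshold`, and `budget` are hypothetical and do not come from the paper.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical type: a model that returns, for each token, a probability
# distribution over labels. The paper's actual model interface is not given.
Predictor = Callable[[List[str]], Sequence[Sequence[float]]]


def least_confidence(token_probs: Sequence[Sequence[float]]) -> float:
    """Sentence uncertainty: 1 minus the mean of each token's top label probability."""
    if not token_probs:
        return 0.0
    top = [max(p) for p in token_probs]
    return 1.0 - sum(top) / len(top)


def select_from_stream(
    stream: Iterable[List[str]],
    predict: Predictor,
    threshold: float = 0.3,   # assumed value, chosen only for illustration
    budget: int = 100,        # assumed cap on the number of annotation requests
) -> Iterator[List[str]]:
    """Yield sentences whose uncertainty reaches the threshold, up to the budget.

    Each sentence is examined once, in arrival order, which is what
    distinguishes stream-based selection from ranking a whole pool.
    """
    selected = 0
    for sentence in stream:
        if selected >= budget:
            break
        if least_confidence(predict(sentence)) >= threshold:
            selected += 1
            yield sentence
```

In a full loop, the yielded sentences would be labeled by an annotator and used to update the model before further sentences are scored. A fixed threshold keeps the decision local to each sentence, so the stream never has to be held in memory.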

Named Entity Recognition Algorithm Based on Active Learning
ZHANG Cen-fang.Named Entity Recognition Algorithm Based on Active Learning[J].Computer and Modernization,2021,0(7):18-22.
Authors:ZHANG Cen-fang
Abstract:The purpose of named entity recognition is to identify the boundaries and categories of entities in text. Training a named entity recognition model usually requires a large number of labeled samples. By implementing effective selection algorithms, this paper selects from a large pool the samples best suited to updating the model, which reduces the labeling workload. Five sets of comparison experiments verify that an effective selection algorithm yields a better sample set and allows samples to be annotated in a targeted way. Experiments designed on a microblog dataset verify that the proposed stream-based active learning algorithm can select more appropriate sample sets from large amounts of Internet text, effectively reducing the cost of manual labeling. This paper uses two models to perform boundary extraction and classification of entities. The sequence labeling model extracts the position of each entity in the sequence, the entity classification model classifies the labeled results, and active learning is used to train on unlabeled data sets. The training method is evaluated on two data sets. Experiments on the Weibo data set show that the algorithm can learn text features from an unlabeled data set. The experimental results on the MSRA data set show that when the proportion of the pre-training data set reaches 40% or more, the F1 score of the model on the test data set is stable at about 90%, which is close to the result of using the whole data set, indicating that the model has some ability to extract features from unlabeled data sets.
Keywords:named entity recognition  active learning  deep learning  Bi-LSTM
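
The two-model design described in the abstract (one model for entity boundaries, one for entity categories) can be illustrated with a minimal sketch. It assumes the sequence labeling model emits untyped B/I/O tags. The abstract does not specify the tag scheme, so this is an assumption, and `extract_spans`, `recognize`, `tag_boundaries`, and `classify_span` are hypothetical names standing in for the paper's models.

```python
from typing import Callable, List, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert untyped B/I/O tags into [start, end) entity spans."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # a new entity begins right after another
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # tolerate an I with no preceding B
                start = i
        else:                          # "O" (or any other tag) closes an open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # entity running to the end of the sentence
        spans.append((start, len(tags)))
    return spans


def recognize(
    tokens: List[str],
    tag_boundaries: Callable[[List[str]], List[str]],   # stage 1: boundary model
    classify_span: Callable[[List[str]], str],          # stage 2: category model
) -> List[Tuple[int, int, str]]:
    """Run boundary extraction, then classify each extracted span."""
    tags = tag_boundaries(tokens)
    return [(s, e, classify_span(tokens[s:e])) for s, e in extract_spans(tags)]
```

Separating the stages this way means the boundary model only has to learn three tags regardless of how many entity categories exist, while the classifier sees whole spans rather than single tokens.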
This article is indexed in Wanfang Data and other databases.