An Active Labeling Method for Text Data Based on Nearest Neighbor and Information Entropy

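Cite this article:	Zhu Yan, Jing Liping, Yu Jian. An Active Labeling Method for Text Data Based on Nearest Neighbor and Information Entropy[J]. Journal of Computer Research and Development, 2012, 49(6): 1306-1312.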

Authors:	Zhu Yan, Jing Liping, Yu Jian

Affiliation:	Department of Computer Science, Beijing Jiaotong University, Beijing 100044

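Funding:	Fundamental Research Funds for the Central Universities; Beijing Jiaotong University Outstanding Doctoral Student Science and Technology Innovation Fund; National Natural Science Foundation of China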

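Abstract:	Because labeling text data at large scale is time-consuming and laborious, semi-supervised text classification, which uses a small number of labeled samples and a large number of unlabeled samples, has developed rapidly. In semi-supervised text classification, the few labeled samples are mainly used to initialize the classification model, and how reasonably they are chosen affects the performance of the final model. To make the labeled samples match the distribution of the original data as closely as possible, a method is proposed that avoids the K nearest neighbors of already labeled samples when drawing the next group of candidate samples for labeling, so that samples in different regions have more chances of being labeled. On this basis, to obtain more category information, the sample with the maximum information entropy among the candidates is chosen as the final sample to label. Experiments on real text data demonstrate the effectiveness of the proposed method.
Keywords:	semi-supervised text classification; active learning; nearest neighbor; information entropy; labeling method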

Zhu Yan, Jing Liping, Yu Jian. An Active Labeling Method for Text Data Based on Nearest Neighbor and Information Entropy[J]. Journal of Computer Research and Development, 2012, 49(6): 1306-1312.

Authors:	Zhu Yan, Jing Liping, Yu Jian

Affiliation:	Department of Computer Science, Beijing Jiaotong University, Beijing 100044

Abstract:	Because labeling text documents on a large scale is time-consuming, classification methods that need only a few labeled data are required, and semi-supervised text classification has emerged and developed rapidly in response. Unlike traditional classification, semi-supervised text classification requires only a small set of labeled data and a large set of unlabeled data to train a classifier. In most cases the small labeled set is used to initialize the classification model, so how well it is chosen affects the performance of the final classifier. To make the distribution of the labeled data more consistent with that of the original data, a sampling method is proposed that excludes the K nearest neighbors of already labeled data from the next set of candidates for labeling. As a result, data located in different regions have more opportunities to be labeled. Moreover, to obtain more category information from the very few labeled data, the method compares the information entropy of the candidates and chooses the datum with the highest information entropy as the next one to be labeled manually. Experiments on real text data sets suggest that this approach is effective.
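
The two-stage selection described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which distance measure is used or how the information entropy is computed, so the sketch assumes Euclidean distance between document vectors and Shannon entropy over class-probability estimates from some current model. The names `shannon_entropy`, `k_nearest`, `select_next`, `X`, and `proba` are all hypothetical, as is the fallback used when every unlabeled sample has been excluded.

```python
import numpy as np


def shannon_entropy(proba):
    """Row-wise Shannon entropy (natural log) of class-probability estimates."""
    p = np.clip(proba, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


def k_nearest(X, idx, k):
    """Indices of the k nearest neighbors of X[idx], excluding idx itself.

    Assumption: Euclidean distance; the abstract does not name a metric.
    """
    d = np.linalg.norm(X - X[idx], axis=1)
    d[idx] = np.inf  # a sample is not its own neighbor
    return np.argsort(d)[:k]


def select_next(X, proba, labeled, k):
    """Return the index of the next sample to label, or None if none remain.

    Stage 1: drop the labeled samples and their k nearest neighbors.
    Stage 2: among the remaining candidates, pick the highest-entropy sample.
    """
    excluded = set(labeled)
    for i in labeled:
        excluded.update(k_nearest(X, i, k).tolist())
    candidates = [i for i in range(len(X)) if i not in excluded]
    if not candidates:
        # Hypothetical fallback (not from the source): use any unlabeled sample.
        candidates = [i for i in range(len(X)) if i not in set(labeled)]
    if not candidates:
        return None
    entropy = shannon_entropy(proba[candidates])
    return candidates[int(np.argmax(entropy))]


if __name__ == "__main__":
    # Toy data: two well-separated groups of three points each.
    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
    # Made-up class-probability estimates for a two-class problem.
    proba = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2],
                      [0.9, 0.1], [0.5, 0.5], [0.7, 0.3]])
    print(select_next(X, proba, labeled=[0], k=2))
```

In the toy data, sample 1 is as uncertain as sample 4, but it lies next to the already labeled sample 0 and is therefore excluded in the first stage, so the selection moves to the other region.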

Keywords:	semi-supervised text classification; active learning; nearest neighbor; information entropy; labeling strategy
This article is indexed in CNKI, Wanfang Data, and other databases.
