
Aspects extraction based on semi-supervised self-training
Citation: QU Zhaowei, WU Chunye, WANG Xiaoru. Aspects extraction based on semi-supervised self-training (半监督自训练的方面提取) [J]. CAAI Transactions on Intelligent Systems (智能系统学报), 2019, 14(4): 635-641.
Authors: QU Zhaowei (曲昭伟), WU Chunye (吴春叶), WANG Xiaoru (王晓茹)
Affiliation: 1. Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China; 2. College of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract: Aspect extraction is a key step in opinion mining and sentiment analysis. As social networks have grown, users increasingly rely on reviews when making decisions and pay more attention to the fine-grained information in those reviews, so quickly mining aspect information from massive volumes of online reviews matters for fast decision-making. Most methods based on topic models and clustering extract aspects with poor consistency. Traditional supervised learning methods perform well but need large amounts of annotated text as training data, and annotation is labor-intensive. To address these problems, this paper proposes a method for aspects extraction based on semi-supervised self-training (AESS). The method exploits the large amount of existing unlabeled data: a word vector model is used to find words in the unlabeled dataset that are similar to the aspect seed words, and for each aspect a set of representative words most relevant to the dataset is built. The approach avoids large-scale text annotation, makes full use of the unlabeled data, and performs well on both Chinese and English datasets.

Keywords: aspect extraction; word vector; semi-supervised; self-training; unlabeled data; opinion mining; seed words; similar words
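
The abstract describes growing each aspect's word set from a few seed words by looking up similar words in a word vector model. A minimal sketch of that idea follows, under stated assumptions. The toy vectors, the cosine-to-centroid scoring, the `threshold` value, and the function name `expand_aspects` are all hypothetical; the abstract does not specify the paper's actual similarity measure, stopping rule, or embedding model.

```python
import math

# Toy 3-dimensional "word vectors". The values are invented for illustration;
# in practice they would come from an embedding model trained on unlabeled reviews.
VECTORS = {
    "food":    (0.90, 0.10, 0.00),
    "pizza":   (0.80, 0.20, 0.10),
    "pasta":   (0.85, 0.15, 0.05),
    "service": (0.10, 0.90, 0.00),
    "waiter":  (0.20, 0.80, 0.10),
    "staff":   (0.15, 0.85, 0.05),
    "price":   (0.00, 0.10, 0.90),
    "cheap":   (0.10, 0.20, 0.80),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(words, vectors):
    """Component-wise mean of the vectors of the given words."""
    words = sorted(words)
    dims = len(vectors[words[0]])
    return tuple(sum(vectors[w][i] for w in words) / len(words) for i in range(dims))


def expand_aspects(seeds, vectors, threshold=0.9, max_rounds=5):
    """Self-training loop (assumed form): each round, assign every unassigned
    word to its most similar aspect if the similarity clears the threshold,
    then recompute the aspect centroids from the enlarged word sets."""
    aspects = {name: set(words) for name, words in seeds.items()}
    for _ in range(max_rounds):
        assigned = set().union(*aspects.values())
        centroids = {name: centroid(ws, vectors) for name, ws in aspects.items()}
        added = False
        for word, vec in vectors.items():
            if word in assigned:
                continue
            best = max(centroids, key=lambda name: cosine(vec, centroids[name]))
            if cosine(vec, centroids[best]) >= threshold:
                aspects[best].add(word)
                added = True
        if not added:  # no confident additions left, so stop early
            break
    return aspects


if __name__ == "__main__":
    seeds = {"food": ["food"], "service": ["service"], "price": ["price"]}
    for name, words in expand_aspects(seeds, VECTORS).items():
        print(name, sorted(words))
```

Recomputing the centroid after each round is what makes the loop self-training: words accepted with high confidence in one round influence which words are accepted in the next. A stricter `threshold` trades coverage for cleaner aspect sets.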