
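Unsupervised Off-topic Essay Detection Based on Local Density
Cite this article:LI Xia, WEN Qifan. Unsupervised Off-topic Essay Detection Based on Local Density[J]. Journal of Chinese Information Processing, 2017, 31(6): 205-213.
Authors:LI Xia  WEN Qifan
Affiliation:1. Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510006; 2. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510006
Funding:National Natural Science Foundation of China (61402119); Guangdong Provincial Science and Technology Innovation Project for Regular Higher Education Institutions (2013KJCX0071)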
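Abstract:Existing unsupervised off-topic essay detection methods represent an essay by its content vector, which carries noise from non-topic words and makes the similarity estimate inaccurate. To address this problem, this paper proposes an unsupervised off-topic essay detection method based on essay topic-word extraction and local-density threshold selection. First, the LDA topic model is used to mine the topic words of each essay under test, and distributed representation vectors are used to find words semantically similar to the terms in the essay prompt, which serve as a topic-word expansion of the prompt. On this basis, the proposed on-topic score calculation computes the on-topic score of each essay under test, and the proposed threshold extraction method, based on the local density of on-topic scores across the essay set, dynamically selects the on-topic threshold. The result is an unsupervised off-topic essay detection method that needs no training set and is independent of the topic. Experiments on 8 essay sets totaling 9,381 essays, written by learners who are native English speakers and learners who are native Chinese speakers, show that the proposed method identifies off-topic essays effectively. With spell-checking added as preprocessing, the average F1 is 79.64%, and the best F1 for a single essay prompt is 96.1%.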

Keywords:off-topic essay detection  topic word extraction  on-topic score  threshold selection  
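
The abstract names the steps (prompt expansion with word vectors, then an on-topic score) but does not give the formulas. The following is a minimal sketch under stated assumptions, not the authors' method: the essay's topic words are taken as already extracted (for example by LDA), the vectors in `TOY_VECTORS` are invented two-dimensional stand-ins for trained word embeddings, and both `expand_prompt` and the scoring rule in `on_topic_score` (mean of each topic word's best cosine match against the expanded prompt) are hypothetical choices made for illustration.

```python
import math

# Invented 2-D vectors standing in for trained word embeddings (assumption).
TOY_VECTORS = {
    "school": [1.0, 0.0],
    "teacher": [0.9, 0.1],
    "exam": [0.8, 0.2],
    "football": [0.0, 1.0],
    "goal": [0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_prompt(prompt_words, vectors, k=1):
    """Return the known prompt words plus the k nearest neighbours of each one.

    Hypothetical expansion rule; the paper's actual rule is not given in the abstract.
    """
    expanded = {w for w in prompt_words if w in vectors}
    for word in prompt_words:
        if word not in vectors:
            continue  # skip out-of-vocabulary prompt words
        candidates = [w for w in vectors if w not in prompt_words]
        candidates.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
        expanded.update(candidates[:k])
    return expanded


def on_topic_score(topic_words, expanded_prompt, vectors):
    """Mean, over the essay's topic words, of the best cosine match in the expanded prompt.

    Assumed scoring formula, used here only to illustrate the idea.
    """
    known = [w for w in topic_words if w in vectors]
    if not known or not expanded_prompt:
        return 0.0
    best = [max(cosine(vectors[t], vectors[p]) for p in expanded_prompt) for t in known]
    return sum(best) / len(best)


expanded = expand_prompt(["school"], TOY_VECTORS, k=1)
score_on = on_topic_score(["teacher", "exam"], expanded, TOY_VECTORS)
score_off = on_topic_score(["football", "goal"], expanded, TOY_VECTORS)
```

With these toy vectors the prompt word `school` is expanded with `teacher`, and the essay whose topic words are `teacher` and `exam` scores well above the one whose topic words are `football` and `goal`. Scoring topic words rather than the full content vector is the point the abstract makes about reducing noise from non-topic words.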

Unsupervised Off-topic Essay Detection Based on Local Density
LI Xia,WEN Qifan.Unsupervised Off-topic Essay Detection Based on Local Density[J].Journal of Chinese Information Processing,2017,31(6):205-213.
Authors:LI Xia  WEN Qifan
Affiliation:1. Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510006, China; 2. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510006, China
Abstract:Existing off-topic essay detection methods mainly represent an essay by its content vector, which sometimes results in low accuracy because of noise words. In this paper, we propose an unsupervised off-topic essay detection method based on topic words and local density thresholds. First, Latent Dirichlet Allocation is used to predict each essay's topic distribution, and topic words are extracted according to the weights of the topics. Second, we use distributed word vector representations to find similar words as an expansion of the title, and then compute the on-topic score of every test essay using our new similarity calculation method. Finally, we propose a local density threshold extraction method that extracts the off-topic threshold automatically and determines which essays are off-topic. Experimental results on eight sets totaling 9381 essays show that our algorithm significantly improves the F-measure over the baseline method. With spelling correction added as preprocessing, the average F-measure over all essay sets reaches 79.64%, and the best F-measure among the eight sets is 96.1%.
Keywords:off-topic essay detection  topic word extraction  on-topic score  threshold extraction  
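
The abstract says the off-topic threshold is chosen from the local density of on-topic scores but does not define that density or the selection rule. The sketch below is one plausible reading, not the authors' algorithm: `local_density` counts the scores within a fixed `radius` of a point, and `density_threshold` picks, among the midpoints between adjacent sorted scores, the one lying in the sparsest region. The function names, the `radius` parameter, and the example scores are all hypothetical.

```python
def local_density(scores, point, radius):
    """Number of scores lying within `radius` of `point` (assumed density definition)."""
    return sum(1 for s in scores if abs(s - point) <= radius)


def density_threshold(scores, radius=0.1):
    """Pick the midpoint between adjacent sorted scores that has the lowest local density."""
    ordered = sorted(set(scores))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct scores")
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    # min() keeps the first minimum, so ties resolve to the lowest candidate.
    return min(candidates, key=lambda c: local_density(scores, c, radius))


def flag_off_topic(scores, radius=0.1):
    """Mark every essay whose on-topic score falls below the selected threshold."""
    threshold = density_threshold(scores, radius)
    return [s < threshold for s in scores]


# Invented on-topic scores for one essay set: three low scores, five high ones.
example_scores = [0.12, 0.15, 0.18, 0.71, 0.74, 0.78, 0.80, 0.83]
example_threshold = density_threshold(example_scores)
example_flags = flag_off_topic(example_scores)
```

For the invented scores above, the threshold falls in the empty region between 0.18 and 0.71, so the three low-scoring essays are flagged. Because the cut is derived from each essay set's own score distribution, no labeled training data or per-prompt tuning is needed, which matches the unsupervised, topic-independent goal stated in the abstract. The `radius` value is a tuning assumption: if it is too small, many candidates tie at zero density and the lowest one is chosen.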