
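Research of Automatic Topic Detection Based on Incremental Clustering
Cite this article: ZHANG Xiao-Ming, LI Zhou-Jun, CHAO Wen-Han. Research of Automatic Topic Detection Based on Incremental Clustering [J]. Journal of Software, 2012, 23(6): 1578-1587.
Authors: ZHANG Xiao-Ming  LI Zhou-Jun  CHAO Wen-Han
Affiliation: Department of Computer Science and Engineering, Beihang University, Beijing 100191
Funding: National Natural Science Foundation of China; Doctoral Program Foundation of the Ministry of Education of China; State Key Laboratory Foundation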
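Abstract: With the rapid growth of information on the Web, collecting and organizing relevant information has become increasingly difficult. Topic detection and tracking (TDT) is a research direction proposed to address this problem. Topic detection is one of the important research tasks in TDT; its main goal is to cluster stories that discuss the same topic. Although topic detection has been studied for many years, constantly changing Web information makes it more challenging. This paper proposes an automatic topic detection method based on incremental clustering, which aims to improve the efficiency of topic detection and can automatically detect the number of topics in a document corpus. An improved weighting algorithm is used to compute feature weights, and text features with strong topic-discriminating power are refined adaptively to improve document clustering accuracy. During clustering, BIC is used to determine the number of topic categories, and the continuity of topics is exploited to pre-cluster documents, which speeds up topic detection. Experimental results on the TDT-4 corpus show that the method substantially improves both the efficiency and the accuracy of topic detection.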

Keywords: topic detection and tracking  TDT  topic detection  incremental clustering  weight calculation
Received: 8/7/2009
Revised: 9/1/2011

Research of Automatic Topic Detection Based on Incremental Clustering
ZHANG Xiao-Ming, LI Zhou-Jun, CHAO Wen-Han. Research of Automatic Topic Detection Based on Incremental Clustering [J]. Journal of Software, 2012, 23(6): 1578-1587.
Authors: ZHANG Xiao-Ming  LI Zhou-Jun  CHAO Wen-Han
Affiliation: Department of Computer Science and Engineering, Beihang University, Beijing 100191, China
Abstract: With the exponential growth of information on the Internet, it has become increasingly difficult to find and organize relevant material. Topic detection and tracking (TDT) is a research area that addresses this problem. As one of the basic tasks of TDT, topic detection is the problem of grouping stories according to the topics they discuss. This paper proposes a new topic detection method (TPIC) based on an incremental clustering algorithm. The method aims for high accuracy and for the ability to estimate the true number of topics in the document corpus. A term reweighting algorithm is used to cluster the given document corpus accurately and efficiently, and a self-refinement process that identifies discriminative features is proposed to improve clustering performance. Furthermore, the "aging" nature of topics is used to precluster stories, and the Bayesian information criterion (BIC) is used to estimate the true number of topics. Experimental results on the Linguistic Data Consortium (LDC) TDT-4 dataset show that the proposed model improves both efficiency and accuracy compared to other models.
Keywords: topic detection and tracking  TDT  topic detection  incremental clustering  reweighting
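
The abstract does not give the details of TPIC, so the following is only a minimal sketch of generic single-pass incremental clustering, the family of techniques the method builds on. It is not the authors' algorithm: it uses plain term frequencies instead of the paper's term reweighting, omits the BIC-based estimate of the topic count and the "aging" preclustering, and the names `tf_vector`, `cosine`, `incremental_cluster` and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Lowercased bag-of-words term-frequency vector (no reweighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def incremental_cluster(stories, threshold=0.3):
    """Single-pass clustering: each story joins the most similar existing
    topic, or starts a new topic if no similarity reaches `threshold`.
    The threshold is an assumed value, not one taken from the paper."""
    centroids = []  # one summed term vector per topic
    labels = []     # topic index assigned to each story, in arrival order
    for text in stories:
        vec = tf_vector(text)
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            centroids[best].update(vec)  # fold the story into the topic
            labels.append(best)
        else:
            centroids.append(Counter(vec))  # open a new topic
            labels.append(len(centroids) - 1)
    return labels
```

Because stories are processed once in arrival order, the number of topics is not fixed in advance; here it is governed entirely by the assumed threshold, whereas the paper reports using BIC for that purpose.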
This article is indexed in CNKI, Wanfang Data, and other databases.