
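Topic detection based on multi-vector and secondary clustering
Cite this article: WANG Zhen-yu, WU Ze-heng, TANG Yuan-hua. Topic detection based on multi-vector and secondary clustering[J]. Computer Engineering and Design, 2012, 33(8): 3214-3218
Authors: WANG Zhen-yu  WU Ze-heng  TANG Yuan-hua
Affiliations: 1. School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006
2. School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006
Funding: Guangdong Provincial Science and Technology Plan Project (2010B010600017)
Abstract: Topic detection is the foundation of mining news hotspots on the Internet. Traditional topic detection makes little use of the category information and named-entity information in reports to improve detection performance. To address this, a topic detection method based on multi-vector similarity calculation and secondary clustering is proposed. Reports are classified hierarchically according to the hierarchy of the site on which they appear, and named entities in the news text, such as locations and persons, are used to distinguish news reports. Exploiting the tendency of reports on a topic to cluster in time, reports from the same day are first clustered locally and the resulting clusters are then merged with the existing topics. Experimental results show that the method reaches a normalized detection cost of 0.197, an improvement of about 8% over traditional topic detection algorithms.

Keywords: topic detection  news hotspot  named entity  similarity calculation  clustering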
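
The abstract gives only an outline of the method, so the following is a minimal Python sketch of the idea rather than the authors' implementation. It assumes three vectors per report (content terms, persons, locations), a weighted sum of cosine similarities, a site category used as a hard filter, and single-pass clustering at both stages. The weights, thresholds, and all function names (`make_report`, `similarity`, `single_pass`, `detect_topics`) are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict

# Assumed weights for the three vectors; the source does not give values.
WEIGHTS = {"content": 0.5, "person": 0.25, "location": 0.25}


def make_report(rid, day, category, text, persons=(), locations=()):
    """Build a report with one term-frequency vector per information type."""
    return {
        "ids": [rid],
        "day": day,
        "category": category,
        "content": Counter(text.lower().split()),
        "person": Counter(persons),
        "location": Counter(locations),
    }


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def similarity(x, y):
    """Multi-vector similarity: weighted sum of per-vector cosines."""
    return sum(w * cosine(x[k], y[k]) for k, w in WEIGHTS.items())


def single_pass(items, clusters, threshold):
    """Assign each item to its most similar same-category cluster, or open a new one."""
    for item in items:
        best, best_sim = None, threshold
        for cluster in clusters:
            if cluster["category"] != item["category"]:
                continue  # assumed: category acts as a hard filter
            s = similarity(cluster, item)
            if s >= best_sim:
                best, best_sim = cluster, s
        if best is None:
            clusters.append({
                "ids": list(item["ids"]),
                "category": item["category"],
                **{k: Counter(item[k]) for k in WEIGHTS},
            })
        else:
            best["ids"].extend(item["ids"])
            for k in WEIGHTS:
                best[k].update(item[k])  # centroid kept as summed counts
    return clusters


def detect_topics(reports, local_threshold=0.5, merge_threshold=0.5):
    """Stage 1: cluster each day's reports locally. Stage 2: merge into old topics."""
    by_day = defaultdict(list)
    for report in reports:
        by_day[report["day"]].append(report)
    topics = []
    for day in sorted(by_day):
        local = single_pass(by_day[day], [], local_threshold)
        single_pass(local, topics, merge_threshold)
    return topics
```

Calling `detect_topics` on a list of reports built with `make_report` returns the topic clusters, each with an `ids` list of its member reports. Because the second stage compares a day's local clusters with existing topics rather than comparing every report individually, a day with many reports on the same event triggers far fewer topic comparisons.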

Topic detection based on multi-vector and secondary clustering
WANG Zhen-yu, WU Ze-heng, TANG Yuan-hua. Topic detection based on multi-vector and secondary clustering[J]. Computer Engineering and Design, 2012, 33(8): 3214-3218
Authors:WANG Zhen-yu    WU Ze-heng    TANG Yuan-hua
Affiliation:1. School of Software Engineering, South China University of Technology, Guangzhou 510006, China; 2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
Abstract:Topic detection is the basis of news hotspot mining on the Internet. Traditional topic detection methods make little use of the category information and named entities in reports. To address this, a new topic detection method based on multi-vector similarity calculation and secondary clustering is proposed. It classifies reports according to their site hierarchy and uses named entities such as persons and locations to distinguish topics. Furthermore, it exploits the temporal aggregation of reports by first clustering the reports from the same day locally and then merging the results with the existing topics. Experimental results show that the normalized detection cost (CDet)Norm of the new method reaches 0.197, and that its performance is about 8% better than that of traditional methods.
Keywords:topic detection  news hotspot  named entity  similarity calculation  clustering
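
The abstract reports (CDet)Norm without defining it. Assuming it refers to the normalized detection cost used in Topic Detection and Tracking (TDT) evaluations, the sketch below shows how such a figure is computed from a miss rate and a false-alarm rate. The default cost parameters (`c_miss=1.0`, `c_fa=0.1`, `p_target=0.02`) are the customary TDT settings and are an assumption here; the page does not say which values the authors used, and the function name is hypothetical.

```python
def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost under the assumed TDT definition (lower is better)."""
    # Expected cost of misses plus expected cost of false alarms.
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Normalize by the cheaper of the two trivial systems:
    # one that accepts nothing and one that accepts everything.
    return c_det / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

Under this definition a score of 1.0 is no better than a trivial system, so a value such as 0.197 would indicate a cost well below that baseline.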
This article is indexed in databases including CNKI and Wanfang Data.