Incremental Text Clustering Algorithm for Hot Topic Detection
Cite this article: GUO Ying, XUE Tao, HU Wei-Hua. Incremental Text Clustering Algorithm for Hot Topic Detection[J]. Computer Systems & Applications, 2022, 31(9): 280-286.
Authors: GUO Ying  XUE Tao  HU Wei-Hua
Affiliations: School of Computer Science, Xi'an Polytechnic University, Xi'an 710600; School of Humanities and Social Sciences, Xi'an Polytechnic University, Xi'an 710600
Funding: National Social Science Fund of China (18XYY010)
Abstract: The traditional Single-Pass clustering algorithm is overly sensitive to the order of its input data and has low accuracy. To address these problems, an incremental text clustering algorithm (SP-HTD) is proposed that works at subtopic granularity and takes into account the dynamics, timeliness, and contextual semantic features of news texts. First, the LDA2Vec topic model is used to jointly train document vectors and word vectors, yielding context vectors that capture the semantic features and importance relationships of the text. Then, building on the Single-Pass algorithm, subtopics are partitioned according to the extracted hot-topic feature words, and a time threshold is set to check the timeliness of each cluster center; the mined semantic features are combined with the task to update the cluster centers dynamically. Finally, with time characteristics as an auxiliary signal, the topic centroid vectors are updated to improve the accuracy of text similarity computation. The results show that the F-measure of the proposed method reaches up to 89.3% and that, while clustering precision is maintained, the miss rate and false detection rate are markedly lower than those of the traditional algorithm, so the method effectively improves the accuracy of topic detection.
Keywords: Single-Pass  text representation  text clustering  text similarity  hot topic detection
Received: 2021/12/7
Revised: 2022/1/4
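
The abstract describes a Single-Pass clustering step with a time threshold and dynamically updated cluster centers, but does not give the formulas. The sketch below is a minimal, generic illustration of that idea, not the authors' SP-HTD implementation. Everything specific in it is an assumption: cosine similarity as the similarity measure, a running-mean centroid update, the rule that a cluster is skipped once it has gone longer than `time_threshold` without a new document, the threshold values, and the names `Cluster` and `single_pass`. Document vectors are taken as given; in the paper they would come from LDA2Vec.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Cluster:
    centroid: list    # running mean of member vectors
    size: int         # number of documents assigned so far
    last_seen: float  # timestamp of the most recent member


def single_pass(docs, sim_threshold=0.8, time_threshold=3.0):
    """Assign each (vector, timestamp) pair, in arrival order, to a cluster.

    Hypothetical rule set: a document joins the most similar cluster that is
    still "fresh" (last updated within time_threshold) if the similarity is
    at least sim_threshold; otherwise it starts a new cluster.
    """
    clusters, labels = [], []
    for vec, t in docs:
        best, best_sim = None, -1.0
        for i, c in enumerate(clusters):
            if t - c.last_seen > time_threshold:
                continue  # cluster is stale, so it is not a candidate
            sim = cosine(vec, c.centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= sim_threshold:
            c = clusters[best]
            # Running-mean update of the centroid with the new document.
            c.centroid = [(m * c.size + x) / (c.size + 1)
                          for m, x in zip(c.centroid, vec)]
            c.size += 1
            c.last_seen = t
            labels.append(best)
        else:
            clusters.append(Cluster(list(vec), 1, t))
            labels.append(len(clusters) - 1)
    return labels, clusters


# Toy stream: two similar documents, one different, one similar but much later.
stream = [([1.0, 0.0], 0), ([0.9, 0.1], 1), ([0.0, 1.0], 2), ([1.0, 0.0], 10)]
labels, clusters = single_pass(stream)
print(labels)  # [0, 0, 1, 2]
```

In the toy stream the last document is identical to the first, yet it opens a new cluster because every existing cluster has exceeded the time threshold. This shows how a timeliness check changes plain Single-Pass behavior. Plain Single-Pass is also order-sensitive, since each assignment depends on the centroids built from earlier documents.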