
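Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN
Cite this article: LIU Kun, ZENG Xi, QIU Zi-heng, CHEN Zhou-guo. Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN[J]. Computer and Modernization, 2022, 0(3): 48-52.
Authors: LIU Kun  ZENG Xi  QIU Zi-heng  CHEN Zhou-guo
Affiliations: The 30th Research Institute of China Electronics Technology Group Corporation, Chengdu, Sichuan 610000; The 30th Research Institute of China Electronics Technology Group Corporation, Chengdu, Sichuan 610000; 深圳市网联安瑞网络科技有限公司 (a network technology company in Shenzhen), Shenzhen, Guangdong 518000; 深圳市网联安瑞网络科技有限公司, Shenzhen, Guangdong 518000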
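Abstract: In a big data environment, extracting hot topics from massive volumes of Internet data is the basis for studying public opinion and sentiment online. Text clustering is one of the most common ways to obtain hot topics, and it consists of two steps: text vectorization and clustering. In the vectorization step, however, traditional text representation models cannot accurately capture the contextual information of texts such as news articles and posts. In the clustering step, K-Means and DBSCAN are the most widely used algorithms, but the way they partition data does not match how topic data is actually distributed, so existing text clustering algorithms perform poorly in real Internet settings. Based on the data distribution of topics on the Internet, this paper proposes a text clustering algorithm built on RoBERTa-WWM and HDBSCAN. First, the pre-trained language model RoBERTa-WWM produces a text vector for each document. Next, the t-SNE algorithm reduces the dimensionality of the high-dimensional text vectors. Finally, HDBSCAN, a hierarchical density-based clustering algorithm, clusters the low-dimensional text vectors. Experimental results show that, compared with existing text clustering algorithms, the proposed algorithm substantially improves clustering quality on datasets that contain noise and are unevenly distributed.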

Keywords: text clustering  pre-trained language model  visualization-based dimensionality reduction  density-based clustering
Received: 2022-04-29

Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN
LIU Kun, ZENG Xi, QIU Zi-heng, CHEN Zhou-guo. Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN[J]. Computer and Modernization, 2022, 0(3): 48-52.
Authors: LIU Kun  ZENG Xi  QIU Zi-heng  CHEN Zhou-guo
Abstract: In the big data environment, obtaining hot topics from massive Internet data is the basis for studying public opinion and sentiment on the Internet. Text clustering is one of the most common methods for finding hot topics, and it can be divided into two steps: text vectorization and clustering. In the vectorization step, however, traditional text representation models cannot accurately represent the contextual information of texts such as news and posts. In the clustering step, the K-Means and DBSCAN algorithms are most commonly used, but the way they cluster data is not consistent with the actual distribution of topic data, so existing text clustering algorithms perform poorly in the real Internet environment. This paper therefore proposes a text clustering algorithm based on RoBERTa-WWM and HDBSCAN that follows the data distribution of topics on the Internet. First, the pre-trained language model RoBERTa-WWM is used to obtain a text vector for each text. Second, the t-SNE algorithm is used to reduce the dimensionality of the high-dimensional text vectors. Finally, HDBSCAN, a hierarchical density-based clustering algorithm, is used to cluster the low-dimensional text vectors. The experimental results show that, compared with existing text clustering algorithms, the proposed algorithm greatly improves clustering quality on datasets that contain noisy data and are unevenly distributed.
Keywords: text clustering  pre-trained language model  visual dimensionality reduction  density clustering
This article is indexed by Wanfang Data (万方数据) and other databases.