
Study on News Header Clustering Based on Singular Value Decomposition (基于奇异值分解的新闻标题聚类研究)
Cite this article: WEN Xiao-yi, HAO Cheng-cheng. Study on News Header Clustering Based on Singular Value Decomposition[J]. Computer Technology and Development (计算机技术与发展), 2020, 0(2): 42-46
Authors: WEN Xiao-yi (文晓艺), HAO Cheng-cheng (郝程程)
Affiliation: School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201600, China
Funding: Shanghai College Students' Innovation Training Program (201810273116)
Abstract: Chinese word segmentation and text clustering are important topics in natural language processing and are widely used in text information organization, summarization, and navigation. As an unsupervised learning method, text clustering rests on the clustering hypothesis: documents in the same category are highly similar, while documents in different categories are less similar. This paper studies the application of Chinese text clustering algorithms to news headlines. First, the collected news headlines are segmented into words and features are extracted, and the segmented text is converted into a term matrix. Next, the term matrix is weighted with TF-IDF to obtain a new term matrix based on term weights. Singular value decomposition of the weighted matrix yields a principal component score matrix; the principal components are used to analyze text features, and K-means and hierarchical clustering are performed on the score matrix. Finally, the clustering results are displayed as word clouds and the quality of the clustering is evaluated. The empirical results show that singular value decomposition of the term matrix reduces the dimensionality of the vector space and improves both the accuracy and the speed of clustering.

Keywords: Chinese word segmentation; word cloud; singular value decomposition; latent semantic analysis; K-means clustering
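
The pipeline outlined in the abstract (term matrix, TF-IDF weighting, singular value decomposition, K-means on the score matrix) can be sketched roughly as follows. This is only an illustrative sketch under several assumptions, not the authors' implementation: the headlines are made-up toy data that are already segmented into tokens, the IDF formula is one common variant (the abstract does not state which weighting was used), the matrix is not centered before the SVD, and the function names `tfidf_matrix`, `svd_scores`, and `kmeans` are hypothetical. Hierarchical clustering and the word cloud step are omitted.

```python
import numpy as np

# Toy, already-segmented headlines (made up for illustration, not the paper's data).
docs = [
    ["stock", "market", "rises"],
    ["stock", "market", "falls"],
    ["stock", "market", "rebounds"],
    ["team", "wins", "match"],
    ["team", "loses", "match"],
    ["team", "draws", "match"],
]

def tfidf_matrix(docs):
    """Build a document-by-term TF-IDF matrix from lists of tokens."""
    vocab = sorted({tok for doc in docs for tok in doc})
    col = {tok: j for j, tok in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for tok in doc:
            counts[i, col[tok]] += 1
    tf = counts / counts.sum(axis=1, keepdims=True)  # term frequency per headline
    df = (counts > 0).sum(axis=0)                    # headlines containing each term
    idf = np.log(len(docs) / df) + 1.0               # assumed IDF variant
    return tf * idf, vocab

def svd_scores(X, k):
    """Project documents onto the top-k singular directions (no centering, as in LSA)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def kmeans(Z, k, n_iter=100):
    """Lloyd's k-means with a deterministic farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster is empty
                new_centers[c] = Z[labels == c].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

X, vocab = tfidf_matrix(docs)   # 6 headlines x 10 terms
Z = svd_scores(X, k=2)          # 6 headlines x 2 components
labels = kmeans(Z, k=2)
print(X.shape, Z.shape, labels.tolist())
```

On this toy input the ten-column term matrix is reduced to two score columns before clustering, and the headlines on the two made-up topics should land in different clusters. For real Chinese headlines, a word segmentation step would have to come first.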
This article is indexed in VIP (维普) and other databases.