首页 | 本学科首页   官方微博 | 高级检索  
     


Concept Decompositions for Large Sparse Text Data Using Clustering
Authors:Dhillon  Inderjit S  Modha  Dharmendra S
Affiliation:(1) Department of Computer Science, University of Texas, Austin, TX 78712, USA;(2) IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA
Abstract:Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors–a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain ldquofractal-likerdquo and ldquoself-similarrdquo behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized ldquobasisrdquo for text data sets.
Keywords:concept vectors  fractals  high-dimensional data  information retrieval  k-means algorithm  least-squares  principal angles  principal component analysis  self-similarity  singular value decomposition  sparsity  vector space models  text mining
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号