An effective web document clustering algorithm based on bisection and merge |
| |
Authors: | Ingyu Lee, Byung-Won On |
| |
Affiliation: | (1) Department of Media Technology, College of Information Science and Engineering, Ritsumeikan University, Kyoto, Japan; (2) Faculty of Culture and Information Science, Doshisha University, Kyoto, Japan; (3) Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan; (4) Faculty of Informatics, Nara Sangyo University, Nara, Japan |
| |
Abstract: | To cluster web documents, all of which have the same name entities, we first tried existing clustering algorithms such
as K-means and spectral clustering. Unexpectedly, these algorithms turned out to be ineffective at clustering such web documents.
Our investigation found that clustering these web pages is more complicated because (1) the number
of clusters (known from the ground truth) is larger than the two or three clusters typical of general clustering problems and (2) the clusters
in the data set have extremely skewed size distributions. To overcome these problems, in this paper
we propose an effective clustering algorithm that boosts the accuracy of the K-means and spectral clustering algorithms. In particular, to deal with the skewed distribution of cluster sizes, our algorithm
performs both bisection and merge steps based on normalized cuts of the similarity graph G to cluster web documents correctly. Our experimental results show that our algorithm improves performance by approximately
56% compared to spectral bisection and 36% compared to K-means. |
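The abstract names normalized cuts of the similarity graph G as the criterion behind the bisection and merge steps but does not spell out how the criterion is computed or applied. As a rough illustration only, the sketch below evaluates the standard normalized cut of a two-way split, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V). The function name `normalized_cut`, the list-of-lists matrix layout, and the toy similarity values are assumptions made for this example and are not taken from the paper.

```python
def normalized_cut(weights, part_a):
    """Standard normalized cut of the split (part_a, remaining nodes).

    weights: symmetric matrix (list of lists) of non-negative similarities,
             a hypothetical stand-in for the similarity graph G.
    part_a:  node indices on one side of the split.
    """
    n = len(weights)
    a = set(part_a)
    b = set(range(n)) - a
    if not a or not b:
        raise ValueError("both sides of the split must be non-empty")

    # Total similarity crossing the split.
    cut = sum(weights[i][j] for i in a for j in b)
    # Total similarity from each side to every node in the graph.
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    if assoc_a == 0 or assoc_b == 0:
        raise ValueError("each side needs at least one weighted edge")

    return cut / assoc_a + cut / assoc_b


# Toy graph (assumed values): documents 0-1 and 2-3 form two tight pairs.
similarity = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
]

good_split = normalized_cut(similarity, [0, 1])  # separates the two pairs
bad_split = normalized_cut(similarity, [0, 2])   # breaks both pairs apart
print(good_split < bad_split)
```

A lower value indicates a split that severs little similarity relative to the total similarity attached to each side. One plausible reading is that a bisection step looks for splits with a low value, while a merge step would rejoin two clusters whose separating cut scores high; the paper's actual rules may differ.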
| |
Keywords: | |
This article is indexed in SpringerLink and other databases. |
|