A three-phase approach to document clustering based on topic significance degree
Affiliation:1. Department of Electrical Power Systems, Kaunas University of Technology, Studentu 50, LT-51368 Kaunas, Lithuania;2. Department of Information Systems, Kaunas University of Technology, Studentu 50, LT-51368 Kaunas, Lithuania;3. Intelligent Systems Laboratory, Centre for Applied Intelligent Systems Research, Halmstad University, Kristian IV:s väg 3, PO Box 823, S-301 18 Halmstad, Sweden;4. Marine Science and Technology Centre, Klaipeda University, Herkaus Manto 84, LT-92294 Klaipeda, Lithuania;5. Department of Marine Research, Environmental Protection Agency, Taikos Av. 26, LT-91144 Klaipeda, Lithuania;1. Department of Laboratory Medicine, Severance Hospital, Yonsei University College of Medicine, Seoul, Republic of Korea;2. Department of Medicine, Yonsei University Graduate School of Medicine, Seoul, Republic of Korea;3. Division of Endocrinology and Metabolism, Department of Internal Medicine, Severance Hospital, Yonsei University College of Medicine, Seoul, Republic of Korea
Abstract: A topic model can project documents into a topic space, which facilitates effective document clustering. Selecting a good topic model and improving clustering performance are two highly correlated problems in topic-based document clustering. In this paper, we propose a three-phase approach to topic-based document clustering. In the first phase, we determine the best topic model: we present a formal concept of the significance degree of topics, together with some topic selection criteria, through which we can find the best number of the most suitable topics from the original topic model discovered by LDA. In the second phase, we choose the initial clustering centers using the k-means++ algorithm. In the third phase, we take the obtained initial clustering centers and use the k-means algorithm for document clustering. Three clustering solutions based on the three-phase approach are used for document clustering. Experiments on the three solutions compare them and illustrate the effectiveness and efficiency of our approach.
Keywords: Document clustering; Topic model; K-means; K-means++
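
The three phases described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes a document-topic weight matrix has already been produced by LDA, and because the abstract does not define the significance degree, the mean topic weight across documents is used as a hypothetical stand-in score. All function names (`select_topics`, `kmeanspp_init`, `kmeans`) and the toy data are assumptions made for this example.

```python
import random

def select_topics(doc_topic, keep):
    """Phase 1 (stand-in): keep the `keep` topics with the largest mean weight.
    The paper's significance degree is NOT reproduced here; this is a proxy."""
    n_topics = len(doc_topic[0])
    score = [sum(row[t] for row in doc_topic) / len(doc_topic)
             for t in range(n_topics)]
    chosen = sorted(sorted(range(n_topics), key=lambda t: -score[t])[:keep])
    return [[row[t] for t in chosen] for row in doc_topic], chosen

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """Phase 2: k-means++ seeding. Each new center is drawn with probability
    proportional to its squared distance from the nearest existing center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with a center already
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2)[0]))
    return centers

def kmeans(points, centers, max_iter=100):
    """Phase 3: standard k-means (Lloyd iterations) from the given centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # leave an empty cluster's center unchanged
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy document-topic matrix (6 documents, 4 topics), assumed to come from LDA.
doc_topic = [
    [0.80, 0.10, 0.06, 0.04],
    [0.85, 0.05, 0.05, 0.05],
    [0.75, 0.15, 0.04, 0.06],
    [0.10, 0.80, 0.05, 0.05],
    [0.15, 0.75, 0.06, 0.04],
    [0.05, 0.85, 0.04, 0.06],
]
rng = random.Random(0)
reduced, chosen = select_topics(doc_topic, keep=2)
labels, centers = kmeans(reduced, kmeanspp_init(reduced, 2, rng))
print(chosen, labels)
```

In this toy run the two low-weight topics are dropped before clustering, so k-means operates in a smaller topic space; in practice the selection score and the number of retained topics would follow the criteria defined in the paper itself.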
This article is indexed in ScienceDirect and other databases.