On principal component analysis, cosine and Euclidean measures in information retrieval
Authors: Tuomo Korenius
Affiliation: Department of Computer Sciences, 33014 University of Tampere, Kanslerinrinne 1, Tampere, Finland
Abstract: Clustering groups document objects represented as vectors. A high-dimensional vector space can make these methods difficult to apply, so the vector space was reduced with principal component analysis (PCA). The conventional cosine measure is not the only choice with PCA, which involves mean-correction of the data. Since mean-correction changes the location of the origin, the angles between the document vectors also change. To avoid this, we used a connection between the cosine measure and the Euclidean distance in association with PCA, and grounded searching on the latter. We applied single linkage, complete linkage, and Ward clustering to Finnish documents, utilizing their relevance assessment as a new feature. After normalization of the data, PCA was run and the relevant documents were clustered.
Keywords: Information retrieval; Cosine measure; Euclidean distance measure; Principal component analysis; Clustering; Documents
This article is indexed in ScienceDirect and other databases.
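
The abstract does not spell out the connection between the cosine measure and the Euclidean distance. A common form of it, assumed here, is that for vectors scaled to unit length the squared Euclidean distance equals 2(1 - cosine). The sketch below checks that identity with NumPy and also shows that mean-correction changes the cosine but leaves pairwise Euclidean distances intact. The toy matrix `docs` and the helper names are hypothetical, and this is an illustration of the general idea, not the paper's actual procedure.

```python
import numpy as np


def unit_rows(X):
    """Scale each row of X to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy term-weight matrix: 4 documents x 3 terms.
docs = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 2.0, 2.0],
])

X = unit_rows(docs)
a, b = X[0], X[1]

cos_ab = cosine(a, b)
sq_dist_ab = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)).
identity_holds = bool(np.isclose(sq_dist_ab, 2.0 * (1.0 - cos_ab)))

# Mean-correction (the centering step of PCA) moves the origin.
Xc = X - X.mean(axis=0)
cos_centered = cosine(Xc[0], Xc[1])
sq_dist_centered = float(np.sum((Xc[0] - Xc[1]) ** 2))

# Differences between rows are unchanged by subtracting a common mean,
# so distances survive centering while angles generally do not.
distance_preserved = bool(np.isclose(sq_dist_centered, sq_dist_ab))
cosine_changed = not bool(np.isclose(cos_centered, cos_ab))

print(identity_holds, distance_preserved, cosine_changed)
```

Under this assumption, ranking unit-length documents by Euclidean distance gives the same order as ranking them by cosine before centering, which is one way to read the abstract's decision to ground searching on the Euclidean distance.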