
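Document Clustering Fused with Semantic Resources and Key Words
Cite this article:WU Shun-yao, SHAO Feng-jing, WANG Jin-long, SUN Ren-cheng, WANG Ying. Document Clustering Fused with Semantic Resources and Key Words[J]. Computer Engineering, 2014(4):223-227.
Authors:WU Shun-yao  SHAO Feng-jing  WANG Jin-long  SUN Ren-cheng  WANG Ying
Affiliations:[1] College of Automation Engineering, Qingdao University, Qingdao, Shandong 266071 [2] College of Information Engineering, Qingdao University, Qingdao, Shandong 266071 [3] School of Computer Engineering, Qingdao Technological University, Qingdao, Shandong 266033
Funding:National Natural Science Foundation of China (91130035); National Public Welfare Industry Research Special Fund (200905030-2); Key Program of the Shandong Provincial Natural Science Foundation (ZR2012FZ003); Shandong Provincial Natural Science Foundation (ZR2012FQ017); Qingdao Science and Technology Program (13-1-4-12-jch, 12-1-4-4-(8)-jch).
Abstract:Incorporating attribute-level knowledge in the form of key words can effectively improve the quality of document clustering, but initializing the cluster centers when key words are incorporated remains an open problem. To address this, a document clustering method that fuses semantic resources and key words is proposed. The method identifies the topics of a document collection using Wikipedia semantics and applies a network inference strategy based on resource allocation to discover latent semantic relatedness from the collaborative relationships among articles, so as to select the important documents that best represent each topic as initial cluster centers. It then incorporates key words into clustering through a strategy that combines soft constraints with metric learning. Experimental results on the 20Newsgroup collection show that, compared with k-means and a clustering method that incorporates only key words, the proposed method effectively improves clustering quality; on the News_Different_3 dataset in particular, normalized mutual information (NMI) improves by up to about 20%.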

Keywords:key words  document clustering  Wikipedia semantics  initialization of cluster center  network inference  important document
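
The abstract reports clustering quality as normalized mutual information (NMI). As a rough illustration of that metric, the following sketch computes NMI between a reference labeling and a clustering result. The record does not say which normalization the authors used; the geometric-mean form I(U;V) / sqrt(H(U) H(V)) is an assumption here, and the function name is hypothetical.

```python
from collections import Counter
from math import log, sqrt


def normalized_mutual_information(labels_true, labels_pred):
    """NMI between two labelings, normalized by sqrt(H(U) * H(V)).

    The choice of normalization is an assumption; other variants divide
    by the arithmetic mean, the minimum, or the maximum of the entropies.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical distribution.
        return -sum((c / n) * log(c / n) for c in counts.values())

    h_true = entropy(count_true)
    h_pred = entropy(count_pred)
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_true == h_pred else 0.0

    # Mutual information from the joint and marginal counts.
    mi = sum(
        (c / n) * log(c * n / (count_true[u] * count_pred[v]))
        for (u, v), c in count_joint.items()
    )
    return mi / sqrt(h_true * h_pred)


if __name__ == "__main__":
    # Cluster ids need not match the reference labels, only the grouping.
    print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))
    print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because NMI depends only on how items are grouped, a clustering that reproduces the reference partition under different cluster ids still scores as a perfect match.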

Document Clustering Fused with Semantic Resources and Key Words
WU Shun-yao, SHAO Feng-jing, WANG Jin-long, SUN Ren-cheng, WANG Ying. Document Clustering Fused with Semantic Resources and Key Words[J]. Computer Engineering, 2014(4):223-227.
Authors:WU Shun-yao  SHAO Feng-jing  WANG Jin-long  SUN Ren-cheng  WANG Ying
Affiliation:1 a. College of Automation Engineering; 1 b. College of Information Engineering, Qingdao University, Qingdao 266071, China; 2. School of Computer Engineering, Qingdao Technological University, Qingdao 266033, China
Abstract:Fusing attribute-level knowledge in the form of key words can effectively improve the performance of document clustering. However, initializing the cluster centers when key words are incorporated is still an open issue. Therefore, this paper utilizes Wikipedia semantics to identify semantic themes, and adopts a network-based inference strategy with dynamic resource allocation to find hidden semantic relatedness according to the collaborative relationships among articles, so as to select the most important documents (initial points) that reflect the semantic themes. It incorporates key words into document clustering by combining metric learning and soft-constraint strategies. Comparisons with k-means and a semi-supervised clustering method with key words on the 20Newsgroup collection demonstrate that this initialization for document clustering with key words can effectively improve clustering quality. In particular, on News_Different_3 the improvement is about 20% under the Normalized Mutual Information (NMI) index.
Keywords:key words  document clustering  Wikipedia semantics  initialization of cluster center  network inference  important document
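
The abstract mentions network inference based on resource allocation for picking important documents. The record gives no algorithmic detail, so the following is only a minimal sketch of the generic two-step resource-spreading idea on a bipartite graph, not the authors' method. It assumes, hypothetically, that each document is linked to a set of Wikipedia concepts; the function name, the `doc_concepts` structure, and the uniform initial resource are all illustrative assumptions.

```python
from collections import defaultdict


def resource_allocation_scores(doc_concepts, initial=None):
    """Two-step resource spreading on a document-concept bipartite graph.

    doc_concepts: dict mapping a document id to an iterable of concept ids
                  (a hypothetical input format, assumed for illustration).
    initial:      optional dict of starting resource per document;
                  defaults to one unit per document.
    Returns a dict mapping each document id to its final resource.
    """
    # Deduplicate concepts so each link is counted once.
    doc_concepts = {doc: set(cs) for doc, cs in doc_concepts.items()}
    if initial is None:
        initial = {doc: 1.0 for doc in doc_concepts}

    # Invert the graph: which documents are linked to each concept.
    concept_docs = defaultdict(set)
    for doc, concepts in doc_concepts.items():
        for concept in concepts:
            concept_docs[concept].add(doc)

    # Step 1: every document splits its resource equally among its concepts.
    # A document with no concepts passes nothing on and ends with score 0.
    concept_resource = defaultdict(float)
    for doc, concepts in doc_concepts.items():
        if concepts:
            share = initial.get(doc, 0.0) / len(concepts)
            for concept in concepts:
                concept_resource[concept] += share

    # Step 2: every concept splits what it received equally among its documents.
    scores = {doc: 0.0 for doc in doc_concepts}
    for concept, docs in concept_docs.items():
        share = concept_resource[concept] / len(docs)
        for doc in docs:
            scores[doc] += share
    return scores


if __name__ == "__main__":
    links = {"d1": {"a", "b"}, "d2": {"a"}, "d3": {"b", "c"}}
    scores = resource_allocation_scores(links)
    # The highest-scoring document is one candidate for an initial center.
    print(max(scores, key=scores.get), scores)
```

Under these assumptions, documents that share many concepts with other documents accumulate more resource, which is one plausible way to rank candidates for initial cluster centers within a topic.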
This article is indexed in CNKI, VIP, and other databases.