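A New Parallelizable Algorithm for Clustering Chinese Co-Topic Words
Cite this article: SHEN Xiao-yan, CHEN Jun-liang, MENG Xiang-wu, ZHANG Yu-jie, ZHANG Lei. A New Parallelizable Algorithm for Clustering Chinese Co-Topic Words [J]. Journal of Beijing University of Posts and Telecommunications, 2009, 32(4): 122-127.
Authors: 沈筱彦 (SHEN Xiao-yan), 陈俊亮 (CHEN Jun-liang), 孟祥武 (MENG Xiang-wu), 张玉洁 (ZHANG Yu-jie), 张磊 (ZHANG Lei)
Affiliation (all five authors): State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
Funding: National Natural Science Foundation of China; National Basic Research Program of China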
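Abstract (translated from the Chinese): An efficient algorithm is proposed for automatically clustering Chinese words by topic. The algorithm splits corpus sentences at the enumeration comma (、) to extract coordinated Chinese words, and builds a co-citation graph with the extracted words as nodes. It then generates several locality sensitive hashing (LSH) signature combinations for each word node. Finally, any two Chinese words that share at least one identical LSH signature combination are labeled as belonging to the same topic class. Experiments show that the algorithm is fast and easy to parallelize and that, given a massive corpus, it runs efficiently and clusters well.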

Keywords: Chinese word clustering; co-citation graph; locality sensitive hashing (LSH) signature; parallelization
Received: 2008-11-20
Revised: 2009-06-01

A Parallable Algorithm for Chinese Co-Topic Words Clustering
SHEN Xiao-yan,CHEN Jun-liang,MENG Xiang-wu,ZHANG Yu-jie,ZHANG Lei.A Parallable Algorithm for Chinese Co-Topic Words Clustering[J].Journal of Beijing University of Posts and Telecommunications,2009,32(4):122-127.
Authors:SHEN Xiao-yan  CHEN Jun-liang  MENG Xiang-wu  ZHANG Yu-jie  ZHANG Lei
Abstract: A simple but powerful algorithm for automatically clustering Chinese co-topic words is presented. The method first uses the punctuation mark '、' to split sentences from a corpus and extract the paratactic Chinese words within them, then constructs a co-citation graph by treating the Chinese words as nodes. Second, the method generates several locality sensitive hashing (LSH) signature combinations for each node in the co-citation graph. Nodes that share at least one LSH signature combination are grouped together, and most of them may belong to the same topic. The main advantages of the algorithm are its fast computation and its ease of parallel implementation. Experimental results indicate high efficiency and a good clustering effect.
Keywords:Chinese word clustering  co-citation graph  connected component  LSH signature  parallable
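
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the abstract does not name an LSH family, so MinHash is used here; a word's feature set is taken to be itself plus every word it is listed with; list boundaries are found with a toy rule (runs of CJK characters joined by '、'); the `bands` and `rows` parameters are arbitrary; and groups are merged with union-find, in line with the "connected component" keyword. All function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Toy boundary rule (assumption): a list is a run of CJK words joined by '、'.
LIST_RUN = re.compile(r"[\u4e00-\u9fff]+(?:、[\u4e00-\u9fff]+)+")


def extract_lists(sentence):
    """Return the '、'-separated word lists found in one sentence."""
    return [run.split("、") for run in LIST_RUN.findall(sentence)]


def build_graph(sentences):
    """Map each word to itself plus every word it is listed with."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        for words in extract_lists(sentence):
            for word in words:
                neighbours[word].update(words)
    return neighbours


def stable_hash(seed, item):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def band_keys(features, bands=4, rows=2):
    """MinHash signature cut into `bands` combinations of `rows` values each."""
    sig = [min(stable_hash(seed, f) for f in features)
           for seed in range(bands * rows)]
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]


def cluster(sentences, bands=4, rows=2):
    """Group words that share at least one signature combination."""
    graph = build_graph(sentences)
    parent = {word: word for word in graph}

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    # Each word's keys are computed independently of every other word.
    buckets = defaultdict(list)
    for word, features in graph.items():
        for key in band_keys(features, bands, rows):
            buckets[key].append(word)

    # Words that landed in the same bucket are merged into one class.
    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    groups = defaultdict(set)
    for word in graph:
        groups[find(word)].add(word)
    return list(groups.values())


if __name__ == "__main__":
    corpus = [
        "水果:苹果、香蕉、橘子。",
        "我买了:苹果、香蕉、葡萄。",
        "城市:北京、上海、广州。",
    ]
    for group in cluster(corpus):
        print(sorted(group))
```

In this sketch the signature step touches one word at a time and the grouping step only needs words with equal keys to meet, which is one plausible reading of why the abstract calls the method easy to parallelize. Words with identical feature sets always share every key; words with partially overlapping sets share a key only with some probability that depends on `bands` and `rows`.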
This article is indexed in Wanfang Data (万方数据) and other databases.