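Document Clustering Based on Word Sense Cluster
Cite this article: TANG Guoyu, XIA Yunqing, ZHANG Min, ZHENG Fang. Document Clustering Based on Word Sense Cluster[J]. Journal of Chinese Information Processing, 2013, 27(3): 113-120.
Authors: TANG Guoyu  XIA Yunqing  ZHANG Min  ZHENG Fang
Affiliations: 1. Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology,
Center for Speech and Language Technologies, Research Institute of Information Technology, Tsinghua University,
Department of Computer Science and Technology, Tsinghua University, Beijing 100084; 2. Institute for Infocomm Research, Singapore 138632
Funding: Supported by the National Natural Science Foundation of China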
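Abstract: Document representation is an important component of document clustering, and this paper aims to improve document clustering by improving document representation. Synonymy and polysemy are major challenges for document representation. To address them, the paper proposes the Sense Cluster Model (SCM), which represents documents in a word sense cluster space. SCM first constructs the word sense cluster space and then represents each document in that space, obtaining the probability of each document over each word sense cluster. To construct the space, word sense induction is first used to discover word senses automatically from text, and word sense clustering is then applied to identify identical or similar senses, which yields the word sense clusters. Once the space has been constructed, word sense disambiguation is performed and its results are used to represent documents in the word sense cluster space. Experiments show that SCM outperforms the baseline systems and the classic topic model LDA on a standard test set.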

Keywords: document clustering  document representation  topic model

Document Clustering Based on Word Sense Cluster
TANG Guoyu , XIA Yunqing , ZHANG Min , ZHENG Fang.Document Clustering Based on Word Sense Cluster[J].Journal of Chinese Information Processing,2013,27(3):113-120.
Authors:TANG Guoyu  XIA Yunqing  ZHANG Min  ZHENG Fang
Affiliation:1. Center for Speech and Language Technologies, Division of Technical Innovation and Development,
Tsinghua National Laboratory for Information Science and Technology,
Center for Speech and Language Technologies, Research Institute of Information Technology,
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
2. Institute for Infocomm Research, A*STAR, Singapore 138632
Abstract: Document representation is a key part of document clustering. In this paper, we aim to improve document clustering by improving document representation. Synonymy and polysemy are two challenging issues in document representation. Inspired by the observation that synonymy and polysemy are mainly related to word sense, we present a novel model, referred to as the Sense Cluster Model (SCM), which addresses both issues by representing documents with word sense clusters. In SCM, word sense clusters are first constructed from the development dataset in two steps: 1) word sense induction, which automatically discovers the different senses of each word from raw text; and 2) word sense clustering, which recognizes identical or similar senses. After word sense disambiguation, a probability distribution over word sense clusters is generated to represent each document. Experiments conducted on benchmark data show that SCM outperforms both the baseline and the classic topic model, LDA, in the task of document clustering.
Keywords:word sense  document representation  topic model
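
The document-representation step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that word sense induction, word sense clustering, and word sense disambiguation have already been run, so each document arrives as a list of sense tags and each sense is already mapped to a cluster id. The function name `sense_cluster_distribution`, the `sense_to_cluster` mapping, the sense labels, and the relative-frequency estimate (with optional additive smoothing) are all assumptions made for illustration.

```python
from collections import Counter


def sense_cluster_distribution(sense_tags, sense_to_cluster, num_clusters, smoothing=0.0):
    """Estimate P(cluster | document) from disambiguated sense tags.

    sense_tags       -- hypothetical output of word sense disambiguation,
                        one sense label per content word in the document
    sense_to_cluster -- hypothetical output of word sense clustering,
                        mapping each sense label to a cluster id in [0, num_clusters)
    smoothing        -- additive smoothing constant (an assumption, not from the paper)
    """
    counts = Counter()
    for sense in sense_tags:
        cluster = sense_to_cluster.get(sense)
        if cluster is not None:  # skip senses that were never clustered
            counts[cluster] += 1

    total = sum(counts.values()) + smoothing * num_clusters
    if total == 0:
        # No usable evidence: fall back to a uniform distribution.
        return [1.0 / num_clusters] * num_clusters
    return [(counts[c] + smoothing) / total for c in range(num_clusters)]


# Toy data: two senses of "bank" fall into different clusters (polysemy),
# while "bank#2" and "lender#1" share a cluster (synonymy).
sense_to_cluster = {"bank#1": 0, "shore#1": 0, "bank#2": 1, "lender#1": 1}
doc = ["bank#2", "lender#1", "bank#2", "shore#1"]

dist = sense_cluster_distribution(doc, sense_to_cluster, num_clusters=2)
print(dist)
```

Vectors of this kind, one per document, could then be passed to any standard clustering algorithm. The abstract does not say which algorithm the authors used.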
This article is indexed in Wanfang Data and other databases.