
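A Document Representation Model Based on Word Co-occurrence
Cite this article: CHANG Peng, FENG Nan. A Document Representation Model Based on Word Co-occurrence [J]. Journal of Chinese Information Processing, 2012, 26(1): 51-58.
Authors: CHANG Peng  FENG Nan
Affiliations: 1. School of Management, Tianjin University, Tianjin 300072; 2. Network and Information Center, Tianjin University, Tianjin 300072
Funding: Project supported by the National Natural Science Foundation of China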
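Abstract: Document representation models are the foundation of automatic text processing and an effective means of converting unstructured text into structured data. However, the widely used Vector Space Model (VSM) represents documents by individual words; because it ignores the associations between words, the accuracy of text mining is difficult to improve substantially. Based on word co-occurrence analysis, this paper discusses the latent connection between a document's topic and the second-order relations among words, and then defines quantitative measures for the degree of word co-occurrence and for its relevance to the document topic. Word co-occurrence combinations are extracted from the document collection with an association rule algorithm, and a topic-oriented document vector representation model built on these combinations, the Co-occurrence Term based Vector Space Model (CTVSM), is proposed, together with a CTVSM-based document similarity. Experiments show that CTVSM accurately reflects the relatedness between documents and discriminates topics better than the classical VSM.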

Keywords: document modeling  word co-occurrence  document similarity  text mining

A Co-occurrence based Vector Space Model for Document Indexing
CHANG Peng, FENG Nan. A Co-occurrence based Vector Space Model for Document Indexing[J]. Journal of Chinese Information Processing, 2012, 26(1): 51-58.
Authors:CHANG Peng  FENG Nan
Affiliation:1. School of Management, Tianjin University, Tianjin 300072, China;
2. Department of Information & Network Center, Tianjin University, Tianjin 300072, China
Abstract: This paper presents a novel co-occurrence term based vector space model (CTVSM) for automatic document indexing, inspired by the Vector Space Model (VSM). In contrast to the traditional VSM, which represents a document as a bag of words regardless of the position of those words in the text, the proposed technique uses co-occurring terms instead of single terms. First, pairs of clearly co-occurring terms are extracted from the document set by association rules; then a similarity between documents is defined on this representation. The experiments indicate substantial and consistent improvements of CTVSM over the standard VSM.
Keywords:document model  co-occurrence  document similarity  text mining
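
The pipeline outlined in the abstract (extract co-occurring term pairs, represent each document over those pairs, compare documents) can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not give the paper's formulas for co-occurrence degree or topic relevance, so this sketch assumes a plain document-level support threshold in place of the association rule step, binary pair weights, and cosine similarity. The function names, the toy corpus, and the `min_support` value are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def frequent_pairs(docs, min_support=0.5):
    """Return term pairs that co-occur in at least min_support of the documents.

    Assumption: a simple support count stands in for association rule mining.
    """
    counts = Counter()
    for tokens in docs:
        # Count each unordered pair at most once per document.
        counts.update(combinations(sorted(set(tokens)), 2))
    threshold = min_support * len(docs)
    return sorted(pair for pair, n in counts.items() if n >= threshold)


def pair_vector(tokens, pairs):
    """Binary vector over pairs: 1.0 if both terms occur in the document."""
    present = set(tokens)
    return [1.0 if a in present and b in present else 0.0 for a, b in pairs]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: the word "model" appears in three documents,
# but only two of them share a topic.
docs = [
    ["text", "mining", "document", "model"],
    ["text", "mining", "vector", "model"],
    ["stock", "market", "price", "model"],
    ["stock", "market", "trading"],
]

pairs = frequent_pairs(docs, min_support=0.5)
vectors = [pair_vector(d, pairs) for d in docs]

print(pairs)
print(cosine(vectors[0], vectors[1]))  # two text-mining documents
print(cosine(vectors[0], vectors[2]))  # text-mining vs. stock-market document
```

In this toy corpus the first and third documents share the single term "model", so a word-level vector would give them a nonzero similarity, while in the pair space they share no dimension. That is only an illustration of the intuition behind using term pairs; it says nothing about the experimental results reported in the paper.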
This article is indexed in CNKI, Wanfang Data, and other databases.