Multi-Document Automatic Summarization Based on LSA and pLSA

Cite this article:	YU Hui. Multi-Document Automatic Summarization Based on LSA and pLSA[J]. Computer Engineering & Science, 2009, 31(9).

Author:	YU Hui

Affiliation:	College of Computer and Communication Engineering, China University of Petroleum, Dongying, Shandong 257061

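Abstract:	This paper proposes a multi-document automatic summarization strategy based on LSA and pLSA. First, the documents are split into natural paragraphs, which serve as the units for clustering. A new feature extraction method is used to construct the word-paragraph matrix, and LSA applies singular value decomposition to that matrix, so that the high-dimensional representation in the vector space model becomes a low-dimensional representation in the latent semantic space. Next, pLSA converts the data into a probabilistic statistical model for computation. During summary generation, a centroid-based method selects the summary sentences, which are then output as the summary. Experiments show that the proposed method effectively improves the quality of the generated summaries.
Keywords:	multi-document automatic summarization; latent semantic analysis; singular value decomposition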
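
The LSA step and the centroid-based selection described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: raw term counts stand in for the unspecified feature extraction method, the clustering step is skipped so that all paragraphs are treated as a single cluster, and paragraphs rather than sentences are ranked against the centroid. The function names and the choice of `k` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def build_term_paragraph_matrix(paragraphs):
    """Raw term-frequency matrix: rows are words, columns are paragraphs.
    (Assumption: plain counts replace the paper's unspecified features.)"""
    vocab = sorted({w for p in paragraphs for w in p.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(paragraphs)))
    for j, p in enumerate(paragraphs):
        for w in p.lower().split():
            A[index[w], j] += 1.0
    return A, vocab

def lsa_paragraph_vectors(A, k):
    """Project each paragraph into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    # Column j of diag(s_k) @ Vt_k represents paragraph j; transpose so rows are paragraphs.
    return (s[:k, None] * Vt[:k, :]).T

def rank_by_centroid(vectors):
    """Order paragraph indices by cosine similarity to the mean vector, best first.
    (Assumption: one cluster containing every paragraph.)"""
    centroid = vectors.mean(axis=0)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    sims = vectors @ centroid / np.where(norms == 0, 1.0, norms)
    return [int(i) for i in np.argsort(-sims)], sims

if __name__ == "__main__":
    paragraphs = ["cat sat mat", "cat sat rug", "stock market fell"]
    A, vocab = build_term_paragraph_matrix(paragraphs)
    vectors = lsa_paragraph_vectors(A, k=2)
    order, sims = rank_by_centroid(vectors)
    print(order, sims)
```

Keeping only the top `k` singular values is what discards the low-weight dimensions; the ranking is unaffected by the arbitrary sign of each singular vector, because a sign flip changes the paragraph vectors and the centroid in the same way.
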
Multi-Documentation Summarization Based on LSA and pLSA

YU Hui. Multi-Documentation Summarization Based on LSA and pLSA[J]. Computer Engineering & Science, 2009, 31(9).

Authors:	YU Hui

Abstract:	This paper proposes a new strategy for multi-document summarization based on latent semantic analysis (LSA) and probabilistic latent semantic analysis (pLSA). First, all documents are split into paragraphs, which serve as the units for clustering. New features are used to construct word-paragraph matrices. LSA, which stems from linear algebra, performs a singular value decomposition of the word-paragraph matrices, so that unimportant information is filtered out and the high-dimensional representation in the vector space model becomes a low-dimensional representation in the latent semantic space. pLSA then converts the co-occurrence data into a probabilistic model. In the summary generation stage, a centroid-based method selects the sentences that form the summary. The experimental results show that the approach improves summarization performance.

Keywords:	multi-document summarization; latent semantic analysis; singular value decomposition
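
The abstract does not say how the pLSA model is fitted or how its output feeds the sentence selection. As a hedged illustration only, the sketch below fits the standard asymmetric pLSA model, P(w|d) = Σ_z P(w|z) P(z|d), with the usual EM updates, using only the Python standard library. The function names, the random initialization, and the fixed iteration count are assumptions; note also that `counts` is indexed paragraph-first, the transpose of the word-paragraph matrix mentioned in the abstract.

```python
import math
import random

def _normalize(row):
    """Scale a non-negative list to sum to 1 (uniform if it sums to 0)."""
    total = sum(row)
    return [x / total for x in row] if total > 0 else [1.0 / len(row)] * len(row)

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM. counts[d][w] is the count of word w in text unit d.
    Returns (p_w_z, p_z_d): P(w|z) per topic and P(z|d) per unit."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [_normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [_normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: posterior P(z | d, w) is proportional to P(w|z) * P(z|d).
                post = _normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_w_z[z][w] += n_dw * post[z]
                    new_z_d[d][z] += n_dw * post[z]
        # M-step: renormalize the expected counts into distributions.
        p_w_z = [_normalize(row) for row in new_w_z]
        p_z_d = [_normalize(row) for row in new_z_d]
    return p_w_z, p_z_d

def log_likelihood(counts, p_w_z, p_z_d):
    """Log-likelihood of the observed counts under the fitted model."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))
                ll += n_dw * math.log(p)
    return ll

if __name__ == "__main__":
    counts = [[2, 2, 0, 0], [3, 1, 0, 0], [0, 0, 2, 2], [0, 0, 1, 3]]
    p_w_z, p_z_d = plsa(counts, n_topics=2)
    print(p_z_d, log_likelihood(counts, p_w_z, p_z_d))
```

The rows of `p_z_d` give each text unit a low-dimensional topic distribution, which could play the same role as the LSA coordinates in a centroid-based selection step; whether the paper combines the two models this way is not stated in the abstract.
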
This article is indexed in Wanfang Data and other databases.
