
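Multi-Document Automatic Summarization Based on LSA and pLSA
Cite this article: YU Hui. Multi-Document Automatic Summarization Based on LSA and pLSA [J]. Computer Engineering & Science, 2009, 31(9).
Author: YU Hui
Affiliation: College of Computer and Communication Engineering, China University of Petroleum, Dongying, Shandong 257061
Abstract: This paper proposes a multi-document automatic summarization strategy based on LSA and pLSA. First, the documents are split into natural paragraphs, which serve as the clustering units. A new feature extraction method is used to build the word-paragraph matrix, and LSA applies singular value decomposition to that matrix, so that the high-dimensional representation in the vector space model becomes a low-dimensional representation in the latent semantic space. Next, pLSA converts the data into a probabilistic statistical model for computation. During summary generation, a centroid-based sentence selection method picks the summary sentences, and the summary is output. Experiments show that the proposed method effectively improves the quality of the generated summaries.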

Keywords: multi-document automatic summarization  latent semantic analysis  singular value decomposition
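The LSA step and the centroid-based selection described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the new feature extraction method, so raw term counts stand in for it, and ranking whole paragraphs by cosine similarity to the mean latent vector is only one plausible reading of "centroid-based" selection. The function names `build_term_paragraph_matrix`, `lsa_project`, and `rank_by_centroid` are hypothetical.

```python
import numpy as np

def build_term_paragraph_matrix(paragraphs):
    """Return (vocab, matrix); matrix[i, j] counts term i in paragraph j.

    Assumption: whitespace tokenization and raw counts replace the
    paper's unspecified feature extraction method.
    """
    tokenized = [p.lower().split() for p in paragraphs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(paragraphs)))
    for j, toks in enumerate(tokenized):
        for w in toks:
            matrix[index[w], j] += 1.0
    return vocab, matrix

def lsa_project(matrix, k):
    """Truncated SVD: one k-dimensional latent vector per paragraph (rows)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    # Keep the k largest singular values; columns of S_k V_k^T are paragraphs.
    return (np.diag(s[:k]) @ vt[:k, :]).T

def rank_by_centroid(vectors):
    """Rank rows by cosine similarity to their centroid, most central first."""
    centroid = vectors.mean(axis=0)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    # Guard against (near-)zero vectors so their similarity stays near 0.
    sims = (vectors @ centroid) / np.maximum(norms, 1e-12)
    return np.argsort(-sims).tolist(), sims

paragraphs = [
    "latent semantic analysis reduces dimensions",
    "semantic analysis uses singular value decomposition",
    "latent semantic space has low dimensions",
    "sunny weather",
]
vocab, matrix = build_term_paragraph_matrix(paragraphs)
latent = lsa_project(matrix, k=2)
ranking, sims = rank_by_centroid(latent)
print(ranking)  # paragraph indices, most central first
```

In this toy input the last paragraph shares no terms with the others, so it contributes nothing to the two retained latent dimensions and falls to the bottom of the ranking.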

Multi-Documentation Summarization Based on LSA and pLSA
YU Hui. Multi-Documentation Summarization Based on LSA and pLSA [J]. Computer Engineering & Science, 2009, 31(9).
Authors: YU Hui
Abstract: This paper proposes a new strategy for multi-document summarization based on latent semantic analysis (LSA) and probabilistic latent semantic analysis (pLSA). First, all documents are split into paragraphs, which serve as the units for clustering. New features are used to construct word-paragraph matrices. Latent semantic analysis, which stems from linear algebra, performs a singular value decomposition of the word-paragraph matrices, so that unimportant information is filtered out and the high-dimensional representation in the vector space model is reduced to a low-dimensional representation in the latent semantic space. Probabilistic latent semantic analysis then converts the co-occurrence data into a probabilistic model. In the summarization stage, a centroid-based method is used to generate the summary. The experimental results show that summarization performance is improved.
Keywords: multi-document summarization  latent semantic analysis  singular value decomposition
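The pLSA step mentioned in the abstract, which turns co-occurrence counts into a probabilistic model, can be sketched with the standard EM updates for the asymmetric formulation P(w|d) = Σ_z P(w|z) P(z|d). This is a generic illustration under assumptions, not the paper's procedure: the function name `plsa`, the random initialization, the fixed iteration count, and the toy count matrix are all hypothetical.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a (n_words, n_docs) count matrix.

    Returns p_w_z with shape (n_words, n_topics), holding P(w | z),
    and p_z_d with shape (n_topics, n_docs), holding P(z | d).
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = counts.shape
    eps = 1e-12  # avoids division by zero

    # Random initialization, normalized into valid distributions.
    p_w_z = rng.random((n_words, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, n_docs))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z | w, d) is proportional to P(w | z) P(z | d).
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]  # (words, topics, docs)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: weight the posteriors by the observed counts n(w, d).
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + eps
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + eps
    return p_w_z, p_z_d

# Toy word-paragraph counts: 4 words, 4 paragraphs (illustrative only).
counts = np.array([
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 3.0, 2.0],
    [1.0, 0.0, 2.0, 3.0],
])
p_w_z, p_z_d = plsa(counts, n_topics=2)
p_w_d = p_w_z @ p_z_d  # reconstructed P(w | d), one column per paragraph
```

Each column of `p_z_d` is a topic mixture for one paragraph, which could feed a clustering or selection step in place of the raw counts.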
This article is indexed in Wanfang Data and other databases.