Similar Articles
20 similar articles found (search time: 15 ms)
1.
The SBGA system treats multi-document automatic summarization as a combinatorial optimization process that extracts sentences from the source document set, and uses an evolutionary algorithm to find a near-optimal solution. Compared with clustering-based sentence extraction, evolutionary sentence extraction is oriented toward the summary as a whole and therefore yields better near-optimal summaries. The evaluation function of the evolutionary algorithm considers four criteria for a good summary: length that matches the user's requirement, high information coverage, retention of as much of the important information conveyed by the source text as possible, and absence of redundancy. In addition, to improve the precision of term-frequency computation, SBGA adopts an improved term-frequency calculation method, TFS, which adds the weighted frequencies of a word's synonyms to the word's own frequency. Experimental results on the DUC2004 test set show that evolutionary sentence extraction performs very well: its ROUGE-1 score is only 0.55% below that of the best participating system in DUC2004. The improved term-frequency calculation method TFS also contributes noticeably to summary quality.
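A minimal sketch of the idea behind evolutionary sentence extraction (not the SBGA implementation): candidate summaries are bit vectors over the sentence set, and a fitness function rewards term coverage while penalizing redundancy and deviation from a length budget. All function names and the toy fitness weights below are illustrative assumptions.

```python
import random

def fitness(mask, sentences, doc_terms, max_words):
    """Toy fitness: term coverage minus redundancy, with a length penalty."""
    chosen = [s for s, bit in zip(sentences, mask) if bit]
    if not chosen:
        return 0.0
    words = [w for s in chosen for w in s.lower().split()]
    coverage = len(set(words) & doc_terms) / len(doc_terms)      # information coverage
    redundancy = 1.0 - len(set(words)) / len(words)              # repeated words in the summary
    length_penalty = max(0, len(words) - max_words) / max_words  # exceeding the length budget
    return coverage - redundancy - length_penalty

def evolve(sentences, max_words=25, pop_size=30, generations=200, seed=0):
    rng = random.Random(seed)
    doc_terms = {w for s in sentences for w in s.lower().split()}
    pop = [[rng.randint(0, 1) for _ in sentences] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, sentences, doc_terms, max_words), reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(sentences))   # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(child))            # bit-flip mutation
            child[i] ^= 1
            children.append(child)
        pop = survivors + children
    best = max(pop, key=lambda m: fitness(m, sentences, doc_terms, max_words))
    return [s for s, bit in zip(sentences, best) if bit]
```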

2.
We present an optimization-based unsupervised approach to automatic document summarization. In the proposed approach, text summarization is modeled as a Boolean programming problem. This model generally attempts to optimize three properties, namely, (1) relevance: the summary should contain informative textual units that are relevant to the user; (2) redundancy: the summary should not contain multiple textual units that convey the same information; and (3) length: the summary is bounded in length. The approach proposed in this paper is applicable to both single- and multi-document summarization. In both tasks, documents are split into sentences during preprocessing. We select salient sentences from the document(s) to generate a summary. Finally, the summary is generated by threading all the selected sentences in the order in which they appear in the original document(s). We implemented our model on the multi-document summarization task. When comparing our method to several existing summarization methods on the open DUC2005 and DUC2007 data sets, we found that it improves the summarization results significantly. This is because, first, when extracting summary sentences, the method focuses not only on the relevance scores of sentences with respect to the whole sentence collection but also on how representative the sentences are of the topics. Second, when generating a summary, the method also deals with the problem of repeated information. The methods were evaluated using the ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics. In this paper, we also demonstrate that the summarization result depends on the similarity measure. The experimental results showed that a combination of symmetric and asymmetric similarity measures yields better results than either measure used separately.
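As a rough illustration of the Boolean formulation (not the authors' code), the sketch below enumerates 0/1 assignments for a handful of sentences and scores each assignment by relevance minus redundancy, subject to a length bound; a real system would hand the same objective to a solver rather than enumerate. All helper names are assumptions.

```python
from itertools import product

def bag(text):
    """Bag-of-words dict for a text."""
    b = {}
    for w in text.lower().split():
        b[w] = b.get(w, 0) + 1
    return b

def cosine(a, b):
    """Cosine similarity between two bag-of-words dicts."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def boolean_summary(sentences, max_words=30):
    bags = [bag(s) for s in sentences]
    doc = bag(" ".join(sentences))                        # whole collection as the relevance target
    best_score, best_pick = float("-inf"), []
    for mask in product((0, 1), repeat=len(sentences)):   # every 0/1 assignment (small inputs only)
        pick = [i for i, bit in enumerate(mask) if bit]
        if not pick or sum(len(sentences[i].split()) for i in pick) > max_words:
            continue                                      # length constraint
        relevance = sum(cosine(bags[i], doc) for i in pick)
        redundancy = sum(cosine(bags[i], bags[j]) for i in pick for j in pick if i < j)
        score = relevance - redundancy
        if score > best_score:
            best_score, best_pick = score, pick
    return [sentences[i] for i in best_pick]
```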

3.
In this paper, we propose an unsupervised text summarization model that generates a summary by extracting salient sentences from the given document(s). In particular, we model text summarization as an integer linear programming problem. One of the advantages of this model is that it can directly discover key sentences in the given document(s) and cover the main content of the original document(s). The model also guarantees that the summary cannot contain multiple sentences that convey the same information. The proposed model is quite general and can be used for both single- and multi-document summarization. We implemented our model on the multi-document summarization task. Experimental results on the DUC2005 and DUC2007 datasets showed that the proposed approach outperforms the baseline systems.
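One way to write such an integer linear program down concretely is with an off-the-shelf solver. The sketch below uses PuLP as an assumed stand-in (the abstract does not prescribe a solver): it maximizes the summed relevance of the selected sentences under a word budget, with pairwise constraints forbidding near-duplicate sentences as an approximation of the "no repeated information" guarantee. The scoring inputs are assumed to be precomputed.

```python
# pip install pulp
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def ilp_summary(sentences, relevance, similarity, max_words=100, sim_threshold=0.6):
    """relevance[i]: score of sentence i; similarity[i][j]: pairwise similarity in [0, 1]."""
    n = len(sentences)
    x = [LpVariable(f"x{i}", cat=LpBinary) for i in range(n)]
    prob = LpProblem("extractive_summary", LpMaximize)
    prob += lpSum(relevance[i] * x[i] for i in range(n))                               # total relevance
    prob += lpSum(len(sentences[i].split()) * x[i] for i in range(n)) <= max_words     # length budget
    for i in range(n):                                                                 # forbid near-duplicate pairs
        for j in range(i + 1, n):
            if similarity[i][j] >= sim_threshold:
                prob += x[i] + x[j] <= 1
    prob.solve()
    return [sentences[i] for i in range(n) if x[i].varValue == 1]
```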

4.
This paper proposes a constraint-driven document summarization approach emphasizing two requirements: (1) diversity in summarization, which seeks to reduce redundancy among sentences in the summary, and (2) sufficient coverage, which focuses on avoiding the loss of the document's main information when generating the summary. By tuning the constraint parameters, the constraint-driven summarization models can balance content coverage and diversity in a summary. The models are formulated as a quadratic integer programming (QIP) problem, which we solve with a discrete PSO algorithm. The models are implemented on the multi-document summarization task. Comparative results showed that the proposed models outperform other methods on the DUC2005 and DUC2007 datasets.
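A bare-bones sketch of the discrete PSO idea used to search over binary sentence-selection vectors (illustrative only: the objective is a placeholder supplied by the caller, not the paper's QIP objective, and the parameter values are assumptions).

```python
import math
import random

def binary_pso(n_bits, objective, swarm_size=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize objective(bits) over {0,1}^n_bits with a sigmoid-transfer binary PSO."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(swarm_size)]
    vel = [[0.0] * n_bits for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = max(range(swarm_size), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(swarm_size):
            for d in range(n_bits):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # sigmoid transfer: velocity becomes the probability of the bit being 1
                pos[i][d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-vel[i][d])) else 0
            val = objective(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```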

5.
This paper proposes an optimization-based model for generic document summarization. The model generates a summary by extracting salient sentences from documents. The approach uses the sentence-to-document-collection, summary-to-document-collection and sentence-to-sentence relations to select salient sentences from the given document collection and to reduce redundancy in the summary. An improved differential evolution algorithm was created to solve the optimization problem; the algorithm adjusts its crossover rate adaptively according to the fitness of individuals. We implemented the proposed model on the multi-document summarization task. Experiments were performed on the DUC2002 and DUC2004 data sets. The experimental results provide strong evidence that the proposed optimization-based approach is a viable method for document summarization.
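The sketch below shows a generic differential evolution loop with a crossover rate scaled by each individual's relative fitness, which is one simple way to realize "adaptive crossover". The adaptation rule and all parameter values are assumptions; the paper's exact scheme is not reproduced.

```python
import random

def differential_evolution(objective, dim, bounds=(0.0, 1.0), pop_size=30,
                           iters=200, f_weight=0.8, seed=0):
    """Maximize objective(vector) with DE/rand/1 and a fitness-scaled crossover rate."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    fit = [objective(ind) for ind in pop]
    for _ in range(iters):
        f_min, f_max = min(fit), max(fit)
        for i in range(pop_size):
            # assumed adaptive rule: fitter individuals are perturbed less
            cr = 0.1 + 0.8 * (f_max - fit[i]) / (f_max - f_min + 1e-12)
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)
            trial = []
            for j in range(dim):
                if j == j_rand or rng.random() < cr:
                    v = pop[a][j] + f_weight * (pop[b][j] - pop[c][j])   # DE/rand/1 mutation
                    trial.append(min(hi, max(lo, v)))
                else:
                    trial.append(pop[i][j])
            t_fit = objective(trial)
            if t_fit >= fit[i]:                                          # greedy selection
                pop[i], fit[i] = trial, t_fit
    best = max(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]
```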

6.
Multi-document summarization is the process of extracting salient information from a set of source texts and presenting that information to the user in a condensed form. In this paper, we propose a multi-document summarization system that generates an extractive generic summary with maximum relevance and minimum redundancy by representing each sentence of the input documents as a vector of words over the proper-noun, noun, verb and adjective sets. Five features associated with the sentences, namely TF_ISF, aggregate cross-sentence similarity, title similarity, proper nouns and sentence length, are extracted, and scores are assigned to sentences based on these features. The weights assigned to the different features may vary with the nature of the document, and it is hard to discover the most appropriate weight for each feature, which makes generating a good summary a very tough task without human intelligence. The multi-document summarization problem has a large number of decision parameters and a large space of possible solutions from which the most suitable summary must be generated, and the generated summary may fall short of the required quality and be far from an ideal human-written summary. To address this issue, we propose a population-based multicriteria optimization method with multiple objective functions. Three objective functions are selected to determine an optimal summary, with maximum relevance, diversity and novelty, from a global population of summaries, considering both the statistical and the semantic aspects of the documents. The semantic aspects are captured by Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF). Experiments have been performed on the DUC 2002, DUC 2004 and DUC 2006 datasets using the ROUGE toolkit. Experimental results show that our system outperforms state-of-the-art works in terms of recall and precision.
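A small sketch of the feature-and-weights style of sentence scoring described above, using only three of the listed features (TF-ISF, title similarity and sentence length). The weights are arbitrary placeholders, which illustrates exactly the difficulty the paper's population-based optimization is meant to avoid.

```python
import math

def tokenize(text):
    return [w.strip(".,;:!?").lower() for w in text.split() if w.strip(".,;:!?")]

def score_sentences(sentences, title, weights=(0.5, 0.3, 0.2)):
    """Score each sentence by TF-ISF, title overlap, and normalized length (toy weights)."""
    n = len(sentences)
    tokens = [tokenize(s) for s in sentences]
    # inverse sentence frequency: words rare across sentences count more
    vocab = {w for t in tokens for w in t}
    isf = {}
    for w in vocab:
        sf = sum(1 for t in tokens if w in t)
        isf[w] = math.log(n / sf) + 1.0
    title_words = set(tokenize(title))
    max_len = max(len(t) for t in tokens)
    w1, w2, w3 = weights
    scores = []
    for t in tokens:
        tf_isf = sum(t.count(w) * isf[w] for w in set(t)) / (len(t) or 1)
        title_sim = len(set(t) & title_words) / (len(title_words) or 1)
        length = len(t) / max_len
        scores.append(w1 * tf_isf + w2 * title_sim + w3 * length)
    return scores
```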

7.
Text summarization is the process of extracting salient information from a source text and presenting that information to the user in a condensed form while preserving its main content. Two of the most difficult problems in text summarization are providing wide topic coverage and diversity in a summary. Research based on clustering, optimization, and evolutionary algorithms for text summarization has recently shown good results, making this a promising area. In this paper, a two-stage sentence selection model for text summarization based on clustering and optimization techniques, called COSUM, is proposed. At the first stage, to discover all topics in a text, the sentence set is clustered using the k-means method. At the second stage, an optimization model is proposed for selecting salient sentences from the clusters. This model optimizes an objective function expressed as the harmonic mean of objective functions enforcing the coverage and diversity of the selected sentences in the summary. To ensure readability of the summary, the model also controls the length of the sentences selected for the candidate summary. For solving the optimization problem, an adaptive differential evolution algorithm with a novel mutation strategy is developed. COSUM was compared with 14 state-of-the-art methods (DPSO-EDASum; LexRank; CollabSum; UnifiedRank; 0-1 non-linear; query, cluster, summarize; support vector machine; fuzzy evolutionary optimization model; conditional random fields; MA-SingleDocSum; NetSum; manifold ranking; ESDS-GHS-GLO; and differential evolution) using the ROUGE toolkit on the DUC2001 and DUC2002 data sets. Experimental results demonstrated that COSUM outperforms the state-of-the-art methods in terms of the ROUGE-1 and ROUGE-2 measures.
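A compressed sketch of the two-stage cluster-then-select pattern (not the COSUM objective itself): sentences are clustered with k-means over TF-IDF vectors, and the sentence closest to each cluster centroid is taken as that topic's representative. The use of scikit-learn here is an assumption about tooling; the paper's harmonic-mean objective and adaptive differential evolution solver are not reproduced.

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_then_select(sentences, n_topics=3, seed=0):
    """Stage 1: k-means over TF-IDF vectors. Stage 2: pick the most central sentence per cluster."""
    X = TfidfVectorizer().fit_transform(sentences)
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=seed).fit(X)
    summary = []
    for c in range(n_topics):
        members = np.where(km.labels_ == c)[0]
        center = km.cluster_centers_[c]
        # distance of each member sentence to its cluster centroid
        dists = np.linalg.norm(X[members].toarray() - center, axis=1)
        summary.append(sentences[members[np.argmin(dists)]])
    return summary
```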

8.
The information explosion is a pervasive problem of the information age. To quickly extract valuable information from massive text data, automatic summarization has become a research focus in natural language processing (NLP). The goal of multi-document summarization is to distill the important content from a group of documents on the same topic and help users obtain key information quickly. To address the problems of incomplete information and high redundancy in current multi-document summarization, this paper proposes an extractive summarization method based on multi-granularity semantic interaction, which combines a multi-granularity semantic interaction network with maximal marginal relevance (MMR). Sentence representations are trained through semantic interaction at different granularities so that key information at each granularity is captured, ensuring the completeness of the summary, while an improved MMR keeps the redundancy of the summary low; learning to rank scores every sentence of the input documents, and the summary sentences are then extracted. Experimental results on the Multi-News dataset show that the extractive multi-document summarization model based on multi-granularity semantic interaction outperforms baseline models such as LexRank and TextRank.
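For reference, the classic MMR selection rule referred to above can be sketched as follows (a generic MMR, not the paper's improved variant); the relevance scores and pairwise similarities are assumed to be precomputed.

```python
def mmr_select(relevance, similarity, k, lam=0.7):
    """Pick k items maximizing lam * relevance - (1 - lam) * max similarity to already-selected items."""
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            diversity_penalty = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * diversity_penalty
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```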

9.
We propose a novel text summarization model based on a 0-1 non-linear programming problem. The proposed model covers the main content of the given document(s) through sentence assignment. We implemented our model on the multi-document summarization task. When comparing our method to several existing summarization methods on the open DUC2001 and DUC2002 datasets, we found that the proposed method improves the summarization results significantly. The methods were evaluated using the ROUGE-1, ROUGE-2 and ROUGE-W metrics.

10.
The core problems in extractive summarization are modeling sentences properly and judging sentence importance correctly. This paper proposes a method for computing the topic importance of a sentence: by analyzing the semantic relation between a sentence and the topic, it judges whether the sentence describes important information about the topic. To address the lack of reference summaries as training data for automatic summarization, the paper proposes a semi-supervised training framework based on learning to rank, which trains the model on a large-scale unannotated news corpus. Experimental results on the DUC2004 multi-document summarization task show that the proposed topic-importance feature is an effective complement to traditional heuristic features and improves summary quality.

11.
The massive quantity of data available today on the Internet has reached such a huge volume that it has become infeasible for humans to efficiently sift useful information from it. One solution to this problem is offered by text summarization techniques. Text summarization, the process of automatically creating a shorter version of one or more text documents, is an important way of finding relevant information in large text libraries or on the Internet. This paper presents a multi-document summarization system that concisely extracts the main aspects of a set of documents while trying to avoid the typical problems of this type of summarization: information redundancy and lack of diversity. This is achieved through a new sentence clustering algorithm based on a graph model that makes use of statistical similarities and linguistic treatment. The DUC 2002 dataset was used to assess the performance of the proposed system, which surpassed the DUC competitors by a 50% margin of F-measure in the best case.
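The graph-model clustering idea can be illustrated with a much simpler stand-in than the paper's algorithm: connect sentences whose similarity exceeds a threshold and treat connected components as clusters. The linguistic treatment used by the authors is omitted; the similarity matrix is assumed to be precomputed.

```python
def similarity_graph_clusters(similarity, threshold=0.3):
    """Group sentence indices into connected components of the thresholded similarity graph."""
    n = len(similarity)
    parent = list(range(n))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)   # union the two components
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```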

12.
The goal of query-focused multi-document summarization is, given a query and a set of documents, to create a concise summary that expresses the key content of these documents and is relevant to the given query. A given document set usually contains several topics, each represented by a class of sentences, and a good summary should cover the most important topics. Most existing methods build a model to score sentences and then select the highest-scoring sentences to generate the summary. Unlike these methods, we focus on the topics of the documents rather than on the sentences, and treat summary generation as a problem of topic discovery, ranking, and representation. We are the first to introduce dominant sets clustering (DSC) to discover topics; we then build a model to evaluate the importance of each topic, and finally select sentences from the topics, balancing representativeness and non-redundancy, to form the summary. Experiments on the DUC2005, DUC2006 and DUC2007 benchmark datasets demonstrate the effectiveness of the method.

13.
Sentence-based multi-document summarization is the task of generating a succinct summary of a document collection, consisting of the most salient document sentences. In recent years, the increasing availability of semantics-based models (e.g., ontologies and taxonomies) has prompted researchers to investigate their usefulness for improving summarizer performance. However, semantics-based document analysis is often applied as a preprocessing step, rather than integrating the discovered knowledge into the summarization process. This paper proposes a novel summarizer, namely the Yago-based Summarizer, that relies on an ontology-based evaluation and selection of the document sentences. To capture the actual meaning and context of the document sentences and generate sound document summaries, an established entity recognition and disambiguation step based on the Yago ontology is integrated into the summarization process. The experimental results, achieved on the DUC'04 benchmark collections, demonstrate the effectiveness of the proposed approach compared to a large number of competitors, as well as the qualitative soundness of the generated summaries.

14.
Multi-document summarization via submodularity
Multi-document summarization is becoming an important issue in the Information Retrieval community. It aims to distill the most important information from a set of documents to generate a compressed summary. Given a set of documents as input, most existing multi-document summarization approaches utilize different sentence selection techniques to extract a set of sentences from the document set as the summary. The submodularity hidden in term coverage and textual-unit similarity motivates us to incorporate this property into our solution to multi-document summarization tasks. In this paper, we propose a new principled and versatile framework for different multi-document summarization tasks using submodular functions (Nemhauser et al. in Math. Prog. 14(1):265-294, 1978) based on term coverage and textual-unit similarity, which can be efficiently optimized through an improved greedy algorithm. We show that four known summarization tasks, including generic, query-focused, update, and comparative summarization, can be modeled as different variations derived from the proposed framework. Experiments on benchmark summarization data sets (e.g., the DUC04-06, TAC08 and TDT2 corpora) are conducted to demonstrate the efficacy and effectiveness of the proposed framework for general multi-document summarization tasks.
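A minimal sketch of the greedy routine that makes the submodular view attractive: with a monotone submodular coverage function, repeatedly picking the sentence with the best marginal gain per cost enjoys the familiar constant-factor approximation guarantee. The coverage function below is a simple word-coverage placeholder, not the paper's combined term-coverage/similarity objective.

```python
def word_set(sentence):
    return set(sentence.lower().split())

def greedy_submodular_summary(sentences, budget=30):
    """Greedy maximization of covered-word count under a total word-count budget."""
    covered = set()
    chosen, used = [], 0
    remaining = list(range(len(sentences)))
    while remaining:
        def gain_per_cost(i):
            gain = len(word_set(sentences[i]) - covered)   # marginal coverage gain
            cost = len(sentences[i].split())
            return gain / cost
        best = max(remaining, key=gain_per_cost)
        cost = len(sentences[best].split())
        if used + cost <= budget and gain_per_cost(best) > 0:
            chosen.append(sentences[best])
            covered |= word_set(sentences[best])
            used += cost
        remaining.remove(best)
    return chosen
```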

15.
A New Approach for Multi-Document Update Summarization


    16.
    The goal of automatic summarization is to compress lengthy documents into a few short paragraphs that present the information to the user comprehensively and concisely, improving the efficiency and accuracy of information access. The proposed method builds on LDA (Latent Dirichlet Allocation): it uses Gibbs sampling to estimate the topic-word and sentence-topic probability distributions, and combines the LDA parameters with a spectral clustering algorithm to extract a multi-document summary. The method integrates sentence weights with a linear formula and extracts a multi-document summary of 400 characters. Summary quality was evaluated on the DUC2002 dataset with the ROUGE automatic evaluation toolkit; the results show that the method effectively improves summary quality.
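As a rough analogue of the LDA-based pipeline (using scikit-learn's variational LDA rather than Gibbs sampling, so this is an assumption about tooling, not a reproduction of the method): infer a sentence-topic distribution, then take the highest-weighted sentence for each topic.

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topic_summary(sentences, n_topics=3, seed=0):
    """Fit LDA over sentences and pick the most topic-representative sentence per topic."""
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    sent_topic = lda.fit_transform(counts)         # normalized sentence-topic weights
    picks = []
    for t in range(n_topics):
        best = int(np.argmax(sent_topic[:, t]))    # sentence with the highest weight on topic t
        if sentences[best] not in picks:
            picks.append(sentences[best])
    return picks
```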

    17.
    To address the main challenges of Chinese news summarization for microblogs, this paper proposes an automatic news summarization method that combines matrix factorization with submodular maximization. The method first uses an orthogonal matrix factorization model to obtain latent semantic vectors of the news texts, which alleviates the sparsity of short texts and makes the projection directions approximately orthogonal so as to reduce redundancy. It then evaluates candidate sentence sets in terms of relevance and diversity with an objective composed of several monotone submodular functions and a non-submodular function measuring sentence dissimilarity, and finally uses a greedy algorithm to generate the summary. Experimental results on the NLPCC2015 dataset show that the proposed method effectively improves the quality of microblog-oriented news summarization, with ROUGE scores exceeding those of the other baseline systems.

    18.
    This article describes the design of a multi-document automatic summarization system that combines sub-topic partitioning with query relevance. It first computes sentence semantic similarity using a Chinese synonym thesaurus (同义词词林) and clusters the sentences to obtain sub-topics, then ranks the sub-topics by importance according to the user's query. On this basis, a dynamic sentence-scoring strategy is used to extract sentences from each sub-topic to generate the summary. Experimental results show that the generated summaries have little redundancy and comprehensive information coverage.

    19.
    This work proposes a multi-document automatic summarization method based on topic and sub-event extraction. Going beyond traditional term-frequency statistics, the method considers not only word frequency and position but also whether a word describes a topic or sub-event of the document set, extracting eight basic features in total. A logistic regression model predicts the influence of these features on word weights, from which the word weights are computed. Sentences are scored with a sentence vector space model, and the summary is produced by combining sentence scores with redundancy. Comparative experiments on summary quality were conducted for three evaluation parameters (N-gram co-occurrence frequency, topic-word coverage, and high-frequency-word coverage) across three summarization systems: Coverage Baseline, Centroid-Based Summary, and Word Mining based Summary (WMS). The results show that the WMS system performs better in many respects.

    20.
    Multi-document automatic summarization helps people obtain information automatically and quickly, and is currently a hot research topic. Compared with single-document summarization, multi-document summarization must pay more attention to the correlation between documents and to the redundancy among pieces of document information, so controlling information redundancy is a key issue in multi-document summarization. Taking the characteristics of summaries into account, this paper proposes a redundancy-control model that decides sentence selection by computing the similarity between text units over their topic probability distributions, thereby controlling redundancy. Experimental results show that the method effectively reduces redundancy and that its overall performance is better than that of existing automatic summarization systems.
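A minimal sketch of the redundancy-control idea: skip a candidate sentence if its topic probability distribution is too close to that of any sentence already in the summary. The use of Jensen-Shannon divergence and the threshold value are assumptions, not the paper's exact similarity measure.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete probability distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_non_redundant(ranked_ids, topic_dists, max_sentences=5, threshold=0.2):
    """Walk sentences in rank order, keeping one only if its topic distribution
    differs enough from every already-selected sentence."""
    selected = []
    for i in ranked_ids:
        if all(js_divergence(topic_dists[i], topic_dists[j]) > threshold for j in selected):
            selected.append(i)
        if len(selected) == max_sentences:
            break
    return selected
```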

