首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This paper proposes a constraint-driven document summarization approach emphasizing the following two requirements: (1) diversity in summarization, which seeks to reduce redundancy among sentences in the summary and (2) sufficient coverage, which focuses on avoiding the loss of the document’s main information when generating the summary. The constraint-driven document summarization models with tuning the constraint parameters can drive content coverage and diversity in a summary. The models are formulated as a quadratic integer programming (QIP) problem. To solve the QIP problem we used a discrete PSO algorithm. The models are implemented on multi-document summarization task. The comparative results showed that the proposed models outperform other methods on DUC2005 and DUC2007 datasets.  相似文献   

2.
Multi‐document summarization is a process of automatic creation of a compressed version of a given collection of documents that provides useful information to users. In this article we propose a generic multi‐document summarization method based on sentence clustering. We introduce five clustering methods, which optimize various aspects of intra‐cluster similarity, inter‐cluster dissimilarity and their combinations. To solve the clustering problem a modification of discrete particle swarm optimization algorithm has been proposed. The experimental results on open benchmark data sets from DUC2005 and DUC2007 show that our method significantly outperforms the baseline methods for multi‐document summarization.  相似文献   

3.
In paper, we propose an unsupervised text summarization model which generates a summary by extracting salient sentences in given document(s). In particular, we model text summarization as an integer linear programming problem. One of the advantages of this model is that it can directly discover key sentences in the given document(s) and cover the main content of the original document(s). This model also guarantees that in the summary can not be multiple sentences that convey the same information. The proposed model is quite general and can also be used for single- and multi-document summarization. We implemented our model on multi-document summarization task. Experimental results on DUC2005 and DUC2007 datasets showed that our proposed approach outperforms the baseline systems.  相似文献   

4.
This paper proposes an optimization-based model for generic document summarization. The model generates a summary by extracting salient sentences from documents. This approach uses the sentence-to-document collection, the summary-to-document collection and the sentence-to-sentence relations to select salient sentences from given document collection and reduce redundancy in the summary. To solve the optimization problem has been created an improved differential evolution algorithm. The algorithm can adjust crossover rate adaptively according to the fitness of individuals. We implemented the proposed model on multi-document summarization task. Experiments have been performed on DUC2002 and DUC2004 data sets. The experimental results provide strong evidence that the proposed optimization-based approach is a viable method for document summarization.  相似文献   

5.
With the rapid growth of information on the Internet and electronic government recently, automatic multi-document summarization has become an important task. Multi-document summarization is an optimization problem requiring simultaneous optimization of more than one objective function. In this study, when building summaries from multiple documents, we attempt to balance two objectives, content coverage and redundancy. Our goal is to investigate three fundamental aspects of the problem, i.e. designing an optimization model, solving the optimization problem and finding the solution to the best summary. We model multi-document summarization as a Quadratic Boolean Programing (QBP) problem where the objective function is a weighted combination of the content coverage and redundancy objectives. The objective function measures the possible summaries based on the identified salient sentences and overlap information between selected sentences. An innovative aspect of our model lies in its ability to remove redundancy while selecting representative sentences. The QBP problem has been solved by using a binary differential evolution algorithm. Evaluation of the model has been performed on the DUC2002, DUC2004 and DUC2006 data sets. We have evaluated our model automatically using ROUGE toolkit and reported the significance of our results through 95% confidence intervals. The experimental results show that the optimization-based approach for document summarization is truly a promising research direction.  相似文献   

6.
We proposed a novel text summarization model based on 0–1 non-linear programming problem. This proposed model covers the main content of the given document(s) through sentence assignment. We implemented our model on multi-document summarization task. When comparing our method to several existing summarization methods on an open DUC2001 and DUC2002 datasets, we found that the proposed method could improve the summarization results significantly. The methods were evaluated using ROUGE-1, ROUGE-2 and ROUGE-W metrics.  相似文献   

7.
Modern information retrieval (IR) systems consist of many challenging components, e.g. clustering, summarization, etc. Nowadays, without browsing the whole volume of datasets, IR systems present users with clusters of documents they are interested in, and summarize each document briefly which facilitates the task of finding the desired documents. This paper proposes a fuzzy evolutionary optimization modeling (FEOM) and its applications to unsupervised categorization and extractive summarization. In view of the nature of biological evolution, we take advantage of several fuzzy control parameters to adaptively regulate the behaviors of the evolutionary optimization, which can effectively prevent premature convergence to a local optimal solution. As a portable, modular and extensively executable model, FEOM is firstly implemented for clustering text documents. The searching capability of FEOM is exploited to explore appropriate partitions of documents such that the similarity metric of the resulting clusters is optimized. In order to further investigate its effectiveness as a generic data clustering model, FEOM is then applied to sentence clustering based extractive document summarization. It selects the most important sentence from each cluster to represent the overall meaning of document. We demonstrate the improved performance by a series of experiments using standard test sets, e.g. Reuter document collection, 20-newsgroup corpus, DUC01 and DUC02, as evaluated by some commonly used metrics, i.e. F-measure and ROUGE. The experimental results show that FEOM achieves performance as good as or better than state of arts of clustering and summarizing systems.  相似文献   

8.
The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization is to produce a summary delivering the majority of information content from a set of documents about an explicit or implicit main topic. In our study we focus on sentence based extractive document summarization. We propose the generic document summarization method which is based on sentence clustering. The proposed approach is a continue sentence-clustering based extractive summarization methods, proposed in Alguliev [Alguliev, R. M., Aliguliyev, R. M., Bagirov, A. M. (2005). Global optimization in the summarization of text documents. Automatic Control and Computer Sciences 39, 42–47], Aliguliyev [Aliguliyev, R. M. (2006). A novel partitioning-based clustering method and generic document summarization. In Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology (WI–IAT 2006 Workshops) (WI–IATW’06), 18–22 December (pp. 626–629) Hong Kong, China], Alguliev and Alyguliev [Alguliev, R. M., Alyguliev, R. M. (2007). Summarization of text-based documents with a determination of latent topical sections and information-rich sentences. Automatic Control and Computer Sciences 41, 132–140] Aliguliyev, [Aliguliyev, R. M. (2007). Automatic document summarization by sentence extraction. Journal of Computational Technologies 12, 5–15.]. The purpose of present paper to show, that summarization result not only depends on optimized function, and also depends on a similarity measure. The experimental results on an open benchmark datasets from DUC01 and DUC02 show that our proposed approach can improve the performance compared to sate-of-the-art summarization approaches.  相似文献   

9.
Multi-document summarization is the process of extracting salient information from a set of source texts and present that information to the user in a condensed form. In this paper, we propose a multi-document summarization system which generates an extractive generic summary with maximum relevance and minimum redundancy by representing each sentence of the input document as a vector of words in Proper Noun, Noun, Verb and Adjective set. Five features, such as TF_ISF, Aggregate Cross Sentence Similarity, Title Similarity, Proper Noun and Sentence Length associated with the sentences, are extracted, and scores are assigned to sentences based on these features. Weights that can be assigned to different features may vary depending upon the nature of the document, and it is hard to discover the most appropriate weight for each feature, and this makes generation of a good summary a very tough task without human intelligence. Multi-document summarization problem is having large number of decision parameters and number of possible solutions from which most optimal summary is to be generated. Summary generated may not guarantee the essential quality and may be far from the ideal human generated summary. To address this issue, we propose a population-based multicriteria optimization method with multiple objective functions. Three objective functions are selected to determine an optimal summary, with maximum relevance, diversity, and novelty, from a global population of summaries by considering both the statistical and semantic aspects of the documents. Semantic aspects are considered by Latent Semantic Analysis (LSA) and Non Negative Matrix Factorization (NMF) techniques. Experiments have been performed on DUC 2002, DUC 2004 and DUC 2006 datasets using ROUGE tool kit. Experimental results show that our system outperforms the state of the art works in terms of Recall and Precision.  相似文献   

10.
We present an optimization-based unsupervised approach to automatic document summarization. In the proposed approach, text summarization is modeled as a Boolean programming problem. This model generally attempts to optimize three properties, namely, (1) relevance: summary should contain informative textual units that are relevant to the user; (2) redundancy: summaries should not contain multiple textual units that convey the same information; and (3) length: summary is bounded in length. The approach proposed in this paper is applicable to both tasks: single- and multi-document summarization. In both tasks, documents are split into sentences in preprocessing. We select some salient sentences from document(s) to generate a summary. Finally, the summary is generated by threading all the selected sentences in the order that they appear in the original document(s). We implemented our model on multi-document summarization task. When comparing our methods to several existing summarization methods on an open DUC2005 and DUC2007 data sets, we found that our method improves the summarization results significantly. This is because, first, when extracting summary sentences, this method not only focuses on the relevance scores of sentences to the whole sentence collection, but also the topic representative of sentences. Second, when generating a summary, this method also deals with the problem of repetition of information. The methods were evaluated using ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics. In this paper, we also demonstrate that the summarization result depends on the similarity measure. Results of the experiment showed that combination of symmetric and asymmetric similarity measures yields better result than their use separately.  相似文献   

11.
自动文摘技术的目标是致力于将冗长的文档内容压缩成较为简短的几段话,将信息全面、简洁地呈现给用户,提高用户获取信息的效率和准确率。所提出的方法在LDA(Latent Dirichlet Allocation)的基础上,使用Gibbs抽样估计主题在单词上的概率分布和句子在主题上的概率分布,结合LDA参数和谱聚类算法提取多文档摘要。该方法使用线性公式来整合句子权重,提取出字数为400字的多文档摘要。使用ROUGE自动摘要评测工具包对DUC2002数据集评测摘要质量,结果表明,该方法能有效地提高摘要的质量。  相似文献   

12.
A document-sensitive graph model for multi-document summarization   总被引:1,自引:1,他引:0  
In recent years, graph-based models and ranking algorithms have drawn considerable attention from the extractive document summarization community. Most existing approaches take into account sentence-level relations (e.g. sentence similarity) but neglect the difference among documents and the influence of documents on sentences. In this paper, we present a novel document-sensitive graph model that emphasizes the influence of global document set information on local sentence evaluation. By exploiting document–document and document–sentence relations, we distinguish intra-document sentence relations from inter-document sentence relations. In such a way, we move towards the goal of truly summarizing multiple documents rather than a single combined document. Based on this model, we develop an iterative sentence ranking algorithm, namely DsR (Document-Sensitive Ranking). Automatic ROUGE evaluations on the DUC data sets show that DsR outperforms previous graph-based models in both generic and query-oriented summarization tasks.  相似文献   

13.
Automatic document summarization aims to create a compressed summary that preserves the main content of the original documents. It is a well-recognized fact that a document set often covers a number of topic themes with each theme represented by a cluster of highly related sentences. More important, topic themes are not equally important. The sentences in an important theme cluster are generally deemed more salient than the sentences in a trivial theme cluster. Existing clustering-based summarization approaches integrate clustering and ranking in sequence, which unavoidably ignore the interaction between them. In this paper, we propose a novel approach developed based on the spectral analysis to simultaneously clustering and ranking of sentences. Experimental results on the DUC generic summarization datasets demonstrate the improvement of the proposed approach over the other existing clustering-based approaches.  相似文献   

14.
Sentence-based multi-document summarization is the task of generating a succinct summary of a document collection, which consists of the most salient document sentences. In recent years, the increasing availability of semantics-based models (e.g., ontologies and taxonomies) has prompted researchers to investigate their usefulness for improving summarizer performance. However, semantics-based document analysis is often applied as a preprocessing step, rather than integrating the discovered knowledge into the summarization process.This paper proposes a novel summarizer, namely Yago-based Summarizer, that relies on an ontology-based evaluation and selection of the document sentences. To capture the actual meaning and context of the document sentences and generate sound document summaries, an established entity recognition and disambiguation step based on the Yago ontology is integrated into the summarization process.The experimental results, which were achieved on the DUC’04 benchmark collections, demonstrate the effectiveness of the proposed approach compared to a large number of competitors as well as the qualitative soundness of the generated summaries.  相似文献   

15.
Recently, automation is considered vital in most fields since computing methods have a significant role in facilitating work such as automatic text summarization. However, most of the computing methods that are used in real systems are based on graph models, which are characterized by their simplicity and stability. Thus, this paper proposes an improved extractive text summarization algorithm based on both topic and graph models. The methodology of this work consists of two stages. First, the well-known TextRank algorithm is analyzed and its shortcomings are investigated. Then, an improved method is proposed with a new computational model of sentence weights. The experimental results were carried out on standard DUC2004 and DUC2006 datasets and compared to four text summarization methods. Finally, through experiments on the DUC2004 and DUC2006 datasets, our proposed improved graph model algorithm TG-SMR (Topic Graph-Summarizer) is compared to other text summarization systems. The experimental results prove that the proposed TG-SMR algorithm achieves higher ROUGE scores. It is foreseen that the TG-SMR algorithm will open a new horizon that concerns the performance of ROUGE evaluation indicators.  相似文献   

16.
应用图模型来研究多文档自动摘要是当前研究的一个热点,它以句子为顶点,以句子之间相似度为边的权重构造无向图结构。由于此模型没有充分考虑句子中的词项权重信息以及句子所属的文档信息,针对这个问题,该文提出了一种基于词项—句子—文档的三层图模型,该模型可充分利用句子中的词项权重信息以及句子所属的文档信息来计算句子相似度。在DUC2003和DUC2004数据集上的实验结果表明,基于词项—句子—文档三层图模型的方法优于LexRank模型和文档敏感图模型。  相似文献   

17.
Nowadays, it is necessary that users have access to information in a concise form without losing any critical information. Document summarization is an automatic process of generating a short form from a document. In itemset-based document summarization, the weights of all terms are considered the same. In this paper, a new approach is proposed for multidocument summarization based on weighted patterns and term association measures. In the present study, the weights of the terms are not equal in the context and are computed based on weighted frequent itemset mining. Indeed, the proposed method enriches frequent itemset mining by weighting the terms in the corpus. In addition, the relationships among the terms in the corpus have been considered using term association measures. Also, the statistical features such as sentence length and sentence position have been modified and matched to generate a summary based on the greedy method. Based on the results of the DUC 2002 and DUC 2004 datasets obtained by the ROUGE toolkit, the proposed approach can outperform the state-of-the-art approaches significantly.  相似文献   

18.
自动文摘是自然语言处理领域的一个重要研究话题,基于机器学习的自动文摘方法则是该项研究中的一个热点。然而,自动文摘问题中的数据分布有一个重要现象,即文摘句子与非文摘句子的数量相差非常悬殊,该现象将给传统机器学习算法的应用效果带来负面影响。为此,本文针对自动文摘中句子类别分布严重不平衡这一现象,以支持向量机算法为基础,设计了两种有效的处理非平衡自动文摘数据的分类方法。在第一种方法中,将传统支持向量机中正负类平衡的分类间隔转换为不平衡的分类间隔;在第二种方法中,通过将数据集进行切分,设计了一种支持向量机集成学习算法。通过在DUC2001数据集上的实验证明,本文设计的两种基于非平衡数据分类的单文档自动文摘方法显著优于基于传统分类算法的自动文摘方法。  相似文献   

19.
A New Approach for Multi-Document Update Summarization   总被引:1,自引:1,他引:0       下载免费PDF全文
1958(2)
  • Manifold-ranking based topic-focused multi-document summarization 2007
  • An Introduction to Kolmogorov Complexity and Its Applications 1997
  • The use of MMR,diversity-based reranking for reordering documents and producing summaries 1998
  • Centroid-based summarization of multiple documents 2004(6)
  • A trainable document summarizer 1995
  • Impact of linguistic analysis on the semantic graph coverage and learning of document extracts 2005
  • Document summarization using conditional random fields 2007
  • Adasum:An adaptive model for summarization 2008
  • Lexpagerank:Prestige in multidocument text summarization 2004
  • Mihalcea R.Taran P Textrank-Bring order into texts 2004
  • Mihalcea R.Tarau P A language independent algorithm for single and multiple document summarization 2005
  • Wan X.Yang J.Xiao J Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction 2007
  • Wan X An exploration of document impact on graph-based multi-document summarization 2008
  • Bennett C H.Gács P.Li M.Vitányi P M,Zurek W H Information distance 1998(4)
  • Li M.Badger J H.Chen X.Kwong S,Kearney P,Zhang H An information-based sequence distance and its application to whole mitochondrial genome phylogeny 2001(2)
  • Li M.Chen X.Li X.Ma B Vitányi P M The similarity metric 2004(12)
  • Long C.Zhu X.Li M.Ma B Information shared by many objects 2008
  • Benedetto D.Caglioti E.Loreto V Language trees and zipping 2002(4)
  • Bennett C H.Li M.Ma B Chain letters and evolutionary histories 2003(6)
  • Cilibrasi R L.Vitányi P M The Google similarity distance 2007(3)
  • Zhang X.Hao Y.Zhu X.Li M Information distance from a question to an answer 2007
  • Ziv J.Lempel A A universal algorithm for sequential data compression 1977(3)
  • Lin C Y.Hovy E Automatic evaluation of summaries using n-gram co-occurrence statistics 2003
  • Nenkova A.Passonneau R.Mckeown K The pyramid method:Incorporating human content selection variation in summarization evaluation 2007(2)
  • >>更多...  相似文献   


    20.
    多文本摘要的目标是对给定的查询和多篇文本(文本集),创建一个简洁明了的摘要,要求该摘要能够表达这些文本的关键内容,同时和给定的查询相关。一个给定的文本集通常包含一些主题,而且每个主题由一类句子来表示,一个优秀的摘要应该要包含那些最重要的主题。如今大部分的方法是建立一个模型来计算句子得分,然后选择得分最高的部分句子来生成摘要。不同于这些方法,我们更加关注文本的主题而不是句子,把如何生成摘要的问题看成一个主题的发现,排序和表示的问题。我们首次引入dominant sets cluster(DSC)来发现主题,然后建立一个模型来对主题的重要性进行评估,最后兼顾代表性和无重复性来从各个主题中选择句子组成摘要。我们在DUC2005、2006、2007三年的标准数据集上进行了实验,最后的实验结果证明了该方法的有效性。  相似文献   

    设为首页 | 免责声明 | 关于勤云 | 加入收藏

    Copyright©北京勤云科技发展有限公司  京ICP备09084417号