Similar Documents
Found 20 similar documents (search time: 31 ms)
1.
Sentence extraction is a widely adopted text summarization technique in which the most important sentences are extracted from one or more documents and presented as a summary. The first step in sentence extraction is to rank sentences by their importance to the summary. This paper proposes a novel graph-based ranking method, iSpreadRank, to perform this task. iSpreadRank models a set of topic-related documents as a sentence similarity network. On this network, iSpreadRank exploits spreading activation theory to apply a general concept from social network analysis: the importance of a node (here, a sentence) is determined not only by the number of nodes it connects to, but also by the importance of those nodes. The algorithm recursively re-weights the importance of sentences by spreading their sentence-specific feature scores throughout the network to adjust the importance of other sentences, yielding a ranking that reflects the relative importance of the sentences. The paper also develops an approach that produces a generic extractive summary from the inferred ranking. The proposed summarization method is evaluated on the DUC 2004 data set and performs well: it obtains a ROUGE-1 score of 0.38068, only 0.00156 below that of the best participant in the DUC 2004 evaluation.
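The recursive re-weighting described above can be sketched as follows. This is a minimal illustration of spreading feature scores over a sentence-similarity graph, assuming a toy 3-sentence network and an invented decay factor; it is not the authors' iSpreadRank implementation.

```python
def spread_rank(sim, feature_scores, decay=0.85, iters=50):
    """sim: n x n sentence-similarity matrix (list of lists).
    feature_scores: initial per-sentence scores from surface features."""
    n = len(feature_scores)
    scores = list(feature_scores)
    for _ in range(iters):
        new = []
        for i in range(n):
            # each sentence keeps its own feature score and receives a
            # decayed share of its neighbours' current importance
            spread = sum(sim[j][i] * scores[j] for j in range(n) if j != i)
            new.append(feature_scores[i] + decay * spread)
        total = sum(new) or 1.0
        scores = [s / total for s in new]  # normalise to prevent blow-up
    return scores

# toy network: sentence 0 has the strongest features and strong ties
sim = [[0.0, 0.6, 0.1],
       [0.6, 0.0, 0.4],
       [0.1, 0.4, 0.0]]
scores = spread_rank(sim, [0.5, 0.3, 0.2])
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Well-connected sentences with strong feature scores accumulate importance from their neighbours, which is the mutual-reinforcement idea the abstract describes.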

2.
The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Document summarization now plays an important role in information retrieval: given a large volume of documents, presenting the user with a summary of each one greatly facilitates finding the desired documents. Document summarization is the process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization produces a summary delivering the majority of the information content of a set of documents about an explicit or implicit main topic. In this study we focus on sentence-based extractive document summarization and propose a generic document summarization method based on sentence clustering. The proposed approach continues the line of sentence-clustering-based extractive summarization methods proposed in Alguliev [Alguliev, R. M., Aliguliyev, R. M., Bagirov, A. M. (2005). Global optimization in the summarization of text documents. Automatic Control and Computer Sciences 39, 42–47], Aliguliyev [Aliguliyev, R. M. (2006). A novel partitioning-based clustering method and generic document summarization. In Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology (WI–IAT 2006 Workshops) (WI–IATW’06), 18–22 December (pp. 626–629) Hong Kong, China], Alguliev and Alyguliev [Alguliev, R. M., Alyguliev, R. M. (2007). Summarization of text-based documents with a determination of latent topical sections and information-rich sentences. Automatic Control and Computer Sciences 41, 132–140], and Aliguliyev [Aliguliyev, R. M. (2007). Automatic document summarization by sentence extraction. Journal of Computational Technologies 12, 5–15]. The purpose of the present paper is to show that the summarization result depends not only on the optimized function but also on the similarity measure.
Experimental results on the open benchmark datasets DUC01 and DUC02 show that the proposed approach improves performance compared to state-of-the-art summarization approaches.

3.
We present an optimization-based unsupervised approach to automatic document summarization in which text summarization is modeled as a Boolean programming problem. The model optimizes three properties: (1) relevance, meaning the summary should contain informative textual units relevant to the user; (2) redundancy, meaning the summary should not contain multiple textual units that convey the same information; and (3) length, meaning the summary is bounded in length. The approach applies to both single- and multi-document summarization. In both tasks, documents are split into sentences during preprocessing, salient sentences are selected from the document(s), and the summary is generated by threading the selected sentences in the order in which they appear in the original document(s). We implemented the model on the multi-document summarization task. Compared with several existing summarization methods on the open DUC2005 and DUC2007 data sets, our method improves the summarization results significantly. This is because, first, when extracting summary sentences, the method considers not only the relevance of sentences to the whole sentence collection but also how representative they are of the topics; second, when generating a summary, it also addresses repetition of information. The methods were evaluated using the ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics. We also demonstrate that the summarization result depends on the similarity measure: combining symmetric and asymmetric similarity measures yields better results than using either separately.
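As a toy illustration of the Boolean programming view (summary selection as a 0/1 assignment balancing relevance, redundancy and length), the sketch below enumerates all subsets exactly. The scores, similarities and length bound are invented; a real system would use an ILP solver or an approximation algorithm rather than brute force.

```python
from itertools import combinations

def best_subset(relevance, sim, lengths, max_len):
    """Exhaustively solve the tiny 0/1 selection problem: maximise total
    relevance minus pairwise redundancy under a total-length bound."""
    n = len(relevance)
    best, best_val = [], float("-inf")
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            if sum(lengths[i] for i in subset) > max_len:
                continue  # violates the length constraint
            rel = sum(relevance[i] for i in subset)
            red = sum(sim[i][j] for i in subset for j in subset if i < j)
            if rel - red > best_val:
                best_val, best = rel - red, list(subset)
    return best

# toy instance: sentences 0 and 1 are near-duplicates (similarity 0.9)
sim = [[0.0, 0.9, 0.1],
       [0.9, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
summary = best_subset([0.9, 0.8, 0.5], sim, lengths=[3, 4, 3], max_len=7)
```

The redundancy penalty keeps the near-duplicate pair from both entering the summary even though each is individually highly relevant.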

4.
Zhou Kai, Li Fang. 《计算机应用与软件》 (Computer Applications and Software), 2009, 26(6): 231-232, 255
This paper investigates event summarization methods in depth and proposes a mechanism for summarizing Chinese emergency events based on sentence features and fuzzy inference. The mechanism generates a summary for each news article by jointly considering the feature-based importance of sentences and their intrinsic relevance to the user's needs, then clusters, ranks, and extracts sentences from the summaries of all news articles about an event to produce a multi-topic event summary. Experiments on a Chinese emergency-event corpus show that the mechanism generates effective summaries for Chinese emergency events.

5.
In this paper, state-of-the-art methods of automatic text summarization that build summaries in the form of generic extracts are considered. The original text is represented as a numerical matrix whose columns correspond to sentences, each sentence being a vector in the term space. Latent semantic analysis is then applied to this matrix to construct a representation of the sentences in a topic space whose dimensionality is much smaller than that of the initial term space. The most important sentences are chosen on the basis of their representation in the topic space, with the number of important sentences determined by the required summary length. The paper also presents a new generic text summarization method that uses nonnegative matrix factorization to estimate sentence relevance; the estimation is based on normalizing the topic space and then weighting each topic using the sentence representations in that space. The proposed method shows better summarization quality and performance than state-of-the-art methods on the DUC 2001 and DUC 2002 standard data sets.
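To illustrate ranking sentences in a low-dimensional topic space, the sketch below extracts only the dominant topic of a toy term-frequency matrix via power iteration on the sentence Gram matrix; a real LSA- or NMF-based system would retain several topics via truncated SVD or nonnegative matrix factorization, and the matrix here is invented.

```python
def dominant_topic_scores(A, iters=100):
    """A: sentence-by-term frequency matrix (list of row lists).
    Returns each sentence's strength along the dominant topic, obtained
    by power iteration on the Gram matrix S = A times A-transpose."""
    n = len(A)
    S = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(n)]
         for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]  # keep the iterate at unit length
    # |v_i| measures how strongly sentence i expresses the dominant topic
    return [abs(x) for x in v]

# sentences 0 and 1 share terms; sentence 2 is off-topic
A = [[2, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
scores = dominant_topic_scores(A)
```

Sentences aligned with the dominant topic receive the largest scores, while the off-topic sentence's component decays toward zero.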

6.
Text summarization is either extractive or abstractive. Extractive summarization selects the most salient pieces of information (words, phrases, and/or sentences) from a source document without adding any external information. Abstractive summarization builds an internal representation of the source document so as to produce a faithful summary of the source; in this case, external text can be inserted into the generated summary. Because of the complexity of the abstractive approach, the vast majority of work in text summarization has adopted an extractive approach.

In this work, we focus on concept fusion and generalization, i.e. replacing different concepts appearing in a sentence with one concept that covers the meanings of all of them. This is one operation that can be used as part of an abstractive text summarization system. The main goal of this contribution is to enrich research on abstractive text summarization with a novel approach that generalizes sentences using semantic resources. The work should be useful in intelligent systems more generally, since it introduces a means to shorten sentences by producing more general sentences (hence abstractions of them). It could be used, for instance, to display shorter texts in applications for mobile devices, and it should improve the quality of generated summaries by mentioning key (general) concepts. One can also imagine using the approach in reasoning systems, where different concepts appearing in the same context are related to one another with the aim of finding a more general representation of the concepts: for example in goal formulation, expert systems, scenario recognition, and cognitive reasoning more generally.

We present our methodology for the generalization and fusion of concepts that appear in sentences. This is achieved through (1) the detection and extraction of what we define as generalizable sentences and (2) the generation and reduction of the space of generalization versions. We introduce two approaches designed to select the best sentences from the space of generalization versions. Using four NLTK corpora, the first approach estimates the "acceptability" of a given generalization version; the second is machine-learning-based and uses contextual and specific features. The recall, precision and F1-score measures resulting from the evaluation of the concept generalization and fusion approach are presented.

7.
To address the problem that existing multi-document extraction methods fail to make good use of the topic and semantic information of sentences, this paper proposes a multi-document summarization method based on a sentence graph model that fuses multiple sources of information. First, a sentence graph is built with sentences as nodes. Then, the sentence-topic probability distribution obtained from a sentence-level Bayesian topic model is fused with the sentence semantic similarity obtained from a word-vector model to yield the final sentence relatedness, which combines topic and semantic information as the edge weights of the graph. Finally, the summary is produced with a minimum-dominating-set summarization method on the sentence graph. By fusing multiple information sources in the graph model, the method combines the topic, semantic, and relational information between sentences. Experimental results show that the method effectively improves the overall performance of extractive summarization.
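The minimum-dominating-set step mentioned above can be approximated greedily: keep picking the sentence that "covers" the most not-yet-covered sentences until everything is dominated. The adjacency structure below is invented for illustration; in the paper's setting, edges would come from thresholding the fused topic/semantic weights.

```python
def greedy_dominating_set(adj):
    """adj: dict mapping each node to the set of its neighbours.
    Greedily picks nodes until every node is dominated (covered)."""
    uncovered = set(adj)
    chosen = []
    while uncovered:
        # pick the node whose closed neighbourhood covers the most
        # still-uncovered nodes
        best = max(adj, key=lambda v: len((adj[v] | {v}) & uncovered))
        chosen.append(best)
        uncovered -= adj[best] | {best}
    return chosen

# toy sentence graph: 0 is similar to 1 and 2; 3 and 4 form a pair
adj = {0: {1, 2}, 1: {0}, 2: {0}, 3: {4}, 4: {3}}
chosen = greedy_dominating_set(adj)
```

The chosen nodes form a compact set of sentences such that every other sentence is similar to at least one of them, which is the intuition behind dominating-set summaries.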

8.
The goal of multi-document summarization is, given a query and a set of documents, to create a concise summary that expresses the key content of the documents while remaining relevant to the query. A document set usually contains several topics, each represented by a class of sentences, and a good summary should cover the most important topics. Most existing methods build a model to score sentences and then select the highest-scoring sentences to form the summary. Unlike these methods, we focus on topics rather than sentences, casting summary generation as a problem of topic discovery, ranking, and representation. We introduce dominant sets clustering (DSC) for topic discovery for the first time, then build a model to assess topic importance, and finally select sentences from each topic, balancing representativeness and non-redundancy, to compose the summary. Experiments on the standard DUC 2005, 2006, and 2007 datasets demonstrate the effectiveness of the method.

9.
Multi-document summarization via submodularity
Multi-document summarization is becoming an important issue in the information retrieval community. It aims to distill the most important information from a set of documents into a compressed summary. Given a set of documents as input, most existing multi-document summarization approaches use different sentence selection techniques to extract a set of sentences from the document set as the summary. The submodularity hidden in term coverage and textual-unit similarity motivates us to incorporate this property into our solution to multi-document summarization tasks. In this paper, we propose a new principled and versatile framework for different multi-document summarization tasks using submodular functions (Nemhauser et al. in Math. Prog. 14(1):265–294, 1978) based on term coverage and textual-unit similarity, which can be efficiently optimized through an improved greedy algorithm. We show that four known summarization tasks, namely generic, query-focused, update, and comparative summarization, can be modeled as variations derived from the proposed framework. Experiments on benchmark summarization data sets (e.g., the DUC04-06, TAC08, and TDT2 corpora) demonstrate the efficacy and effectiveness of the proposed framework for general multi-document summarization tasks.
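The greedy optimization at the heart of such submodular frameworks can be sketched with a simple term-coverage objective: f(S) = number of distinct terms covered by the chosen sentences is monotone submodular, so the classic greedy algorithm achieves a (1 - 1/e) approximation under a cardinality constraint. The term sets below are invented and this is not the paper's exact objective.

```python
def greedy_submodular(term_sets, k):
    """Greedy maximisation of the coverage function
    f(S) = |union of the term sets in S|, subject to |S| <= k."""
    covered, chosen = set(), []
    for _ in range(k):
        best, best_gain = None, 0
        for i, terms in enumerate(term_sets):
            if i in chosen:
                continue
            gain = len(terms - covered)  # marginal coverage gain
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining sentence adds new terms
            break
        chosen.append(best)
        covered |= term_sets[best]
    return chosen, covered

# each set holds the content terms of one sentence
term_sets = [{"a", "b", "c"}, {"b", "c"}, {"d", "e"}, {"a"}]
chosen, covered = greedy_submodular(term_sets, k=2)
```

Diminishing returns is what makes greedy work here: once a term is covered, covering it again contributes nothing, so each step picks the sentence with the largest marginal gain.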

10.
Timeline summarization generates a summary in chronological order, tracing the evolution of a topic. Existing studies either ignore or fail to accurately identify the sub-topic information implicit in sentences. To address this problem, this paper builds a new topic model, the word-pair Dirichlet process, and proposes a timeline summarization method based on it. First, the sub-topic distribution of each sentence is obtained through model inference; this distribution is then used to compute the relevance and novelty of sentences; finally, sentences that are both topic-relevant and highly novel are extracted in chronological order to form the timeline summary. Experimental results show that the method produces higher-quality timeline summaries than representative existing methods.

11.
12.

Text summarization presents several challenges, such as accounting for semantic relationships among words and dealing with redundancy and information diversity. Seeking to overcome these problems, we propose a new graph-based Arabic summarization system that combines statistical and semantic analysis. The proposed approach utilizes an ontology's hierarchical structure and relations to provide a more accurate similarity measurement between terms, improving the quality of the summary. The method is based on a two-dimensional graph model that makes use of statistical and semantic similarities: the statistical similarity is based on the content overlap between two sentences, while the semantic similarity is computed from semantic information extracted from a lexical database, which enables the system to reason by measuring the semantic distance between real human concepts. The weighted ranking algorithm PageRank is run on the graph to produce a significance score for every document sentence, and the final score of each sentence is obtained by adding other statistical features. In addition, we address redundancy and information diversity using an adapted version of the Maximal Marginal Relevance method. Experimental results on EASC and our own datasets showed the effectiveness of the proposed approach over existing summarization systems.


13.
The goal of abstractive multi-document summarization is to automatically produce a condensed version of the document text while maintaining the significant information. Most graph-based extractive methods represent sentences as bags of words and use a content similarity measure, which may fail to detect semantically equivalent redundant sentences. On the other hand, graph-based abstractive methods depend on a domain expert to build a semantic graph from a manually created ontology, which requires time and effort. This work presents a semantic graph approach with an improved ranking algorithm for abstractive summarization of multiple documents. The semantic graph is built from the source documents so that graph nodes denote predicate argument structures (PASs), the semantic structure of sentences, identified automatically by semantic role labeling, while graph edges represent similarity weights computed from the semantic similarity of PASs. To reflect the impact of both the document and the document set on PASs, the edges are further augmented with PAS-to-document and PAS-to-document-set relationships. The important graph nodes (PASs) are ranked using the improved graph ranking algorithm, redundant PASs are reduced by re-ranking them with maximal marginal relevance, and summary sentences are finally generated from the top-ranked PASs using language generation. Experiments were conducted on DUC-2002, a standard dataset for document summarization. The findings show that the proposed approach outperforms other summarization approaches.
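The maximal-marginal-relevance re-ranking used above for redundancy reduction can be sketched generically as follows; this is plain MMR over item scores, not the paper's PAS-specific variant, and the scores, similarities and lambda value are invented.

```python
def mmr_rerank(relevance, sim, k, lam=0.7):
    """Select k items, trading off relevance against redundancy with
    the already-selected items (Maximal Marginal Relevance)."""
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def mmr(i):
            # redundancy = similarity to the closest already-chosen item
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        remaining.remove(best)
        selected.append(best)
    return selected

# items 0 and 1 are highly similar, so after picking 0, MMR prefers 2
sim = [[0.0, 0.95, 0.05],
       [0.95, 0.0, 0.05],
       [0.05, 0.05, 0.0]]
order = mmr_rerank([0.9, 0.85, 0.4], sim, k=2, lam=0.5)
```

Lowering lam shifts the balance toward diversity; with lam=1 the procedure degenerates to plain relevance ranking.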

14.
Text summarization condenses the information content of a text by extracting its main points into a summary. Existing summarization models usually treat content selection and summary generation separately; although this effectively improves sentence compression and fusion, part of the textual information is lost during extraction, reducing accuracy. Building on a pre-trained model and a Transformer-based document-level sentence encoder, this paper proposes a staged summarization model that combines content extraction with summary generation. A BERT model is trained by self-supervised learning on a large corpus to obtain word representations rich in semantic information. On top of the Transformer architecture, a fully connected classifier assigns each sentence one of three labels to extract the set of source sentences corresponding to each summary sentence. A pointer-generator network then compresses each sentence set into a single summary sentence, shortening both the output and input sequences. Experimental results show that, compared with generating the full summary directly, the model raises the average F1 of ROUGE-1, ROUGE-2 and ROUGE-L on generated sentences by 1.69 percentage points, effectively improving sentence-level accuracy.

15.
Traditional summarization methods use only the internal information of a Web document while ignoring its social information, such as tweets from Twitter, which can provide readers with a perspective on an event. This paper proposes a framework named SoRTESum that takes advantage of social information, such as reflections of document content, to extract summary sentences and social messages. Summarization is formulated in two steps: scoring and ranking. In the scoring step, the score of a sentence or social message is computed using intra-relations and inter-relations, which integrate the support of local and social information in a mutually reinforcing form; 16 features are proposed to calculate these relations. After scoring, the summary is generated by selecting the top-m ranked sentences and social messages. SoRTESum was extensively evaluated on two datasets. Promising results show that (i) SoRTESum obtains significant ROUGE-score improvements over state-of-the-art baselines and results competitive with a learning-to-rank approach trained by RankBoost, and (ii) combining intra-relations and inter-relations benefits single-document summarization.

16.
Case public-opinion summarization extracts, from a cluster of news texts about a specific legal case, a few sentences that capture its topical information as a summary. It can be viewed as domain-specific multi-document summarization; unlike general summarization tasks, the topical information can be characterized by case elements that run through the entire text cluster. Within a cluster, sentences are related to one another, and case elements are related to sentences to varying degrees; these relations play an important role in extracting summary sentences. This paper proposes a case-text summarization method based on graph convolution over a case-element sentence-relation graph. The multi-document cluster is modeled as a graph in which sentences are primary nodes and words and case elements are auxiliary nodes that strengthen the relations between sentences; the relations between different node types are computed from multiple features. A graph convolutional network then learns the sentence-relation graph and classifies sentences to obtain candidate summary sentences, and deduplication and ranking yield the final summary. Experiments on a collected case public-opinion summarization dataset show that the proposed method outperforms the baseline models and that introducing case elements and the sentence-relation graph benefits case multi-document summarization.

17.
Research on optimized sentence selection in multi-document summarization
Building on sub-topic partitioning for multi-document summarization, this paper proposes a method for the optimized selection of summary sentences across sub-topics. First, the sub-topics of the document set are formed on the basis of sentence-similarity computation, and the extraction order of the sub-topics is determined by scoring them. Taking the coverage of effective words in the summary as the optimization criterion, summary sentences are selected within each sub-topic. By reducing information redundancy both between and within sub-topics, the information coverage of the summary is greatly improved. Experiments show that the generated summaries are satisfactory.

18.
Multi-document summarization is the process of extracting salient information from a set of source texts and presenting that information to the user in condensed form. In this paper, we propose a multi-document summarization system that generates an extractive generic summary with maximum relevance and minimum redundancy by representing each sentence of the input documents as a vector of words in the proper noun, noun, verb and adjective sets. Five features associated with the sentences (TF_ISF, aggregate cross-sentence similarity, title similarity, proper nouns and sentence length) are extracted, and scores are assigned to sentences based on these features. The weights assigned to the features may vary with the nature of the document, and it is hard to discover the most appropriate weight for each feature, which makes generating a good summary a very tough task without human intelligence. The multi-document summarization problem has a large number of decision parameters and possible solutions from which the optimal summary must be generated, and a generated summary may not attain the essential quality and may be far from an ideal human-generated summary. To address this issue, we propose a population-based multicriteria optimization method with multiple objective functions. Three objective functions are used to determine an optimal summary, with maximum relevance, diversity, and novelty, from a global population of summaries, considering both the statistical and semantic aspects of the documents; the semantic aspects are captured by Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF) techniques. Experiments were performed on the DUC 2002, DUC 2004 and DUC 2006 datasets using the ROUGE toolkit. The results show that our system outperforms state-of-the-art works in terms of recall and precision.
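Feature-based sentence scoring of the kind described above amounts to a weighted combination of normalized feature values. The feature names below mirror those in the abstract, but the weights are invented placeholders, not the paper's tuned or optimized values (the paper argues precisely that fixing such weights by hand is hard).

```python
# illustrative weights only: real systems tune these per corpus
FEATURE_WEIGHTS = {"tf_isf": 0.3, "title_sim": 0.25, "proper_noun": 0.2,
                   "cross_sim": 0.15, "length": 0.1}

def score_sentence(features):
    """features: dict of normalised feature values in [0, 1]."""
    return sum(FEATURE_WEIGHTS[name] * value
               for name, value in features.items())

s = score_sentence({"tf_isf": 0.8, "title_sim": 0.5, "proper_noun": 1.0,
                    "cross_sim": 0.2, "length": 0.6})
# weighted sum: 0.24 + 0.125 + 0.2 + 0.03 + 0.06 = 0.655
```

A population-based optimizer, as proposed in the abstract, would search over such weightings (or over candidate summaries directly) instead of fixing them a priori.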

19.
Existing Chinese automatic summarization methods mainly use information from the text itself and cannot fully exploit information such as the semantic relatedness between words. This paper therefore proposes an improved Chinese summarization method that incorporates external-corpus information, in the form of word vectors, into the TextRank algorithm. By combining TextRank with word2vec, each word in a sentence is mapped into a high-dimensional vocabulary space to form sentence vectors. The method fully considers factors such as inter-sentence similarity, keyword coverage, and the similarity between each sentence and the title, computes the mutual influence weights of sentences accordingly, and selects the top-ranked sentences, reordered, as the summary. Experimental results show that the method performs well on our dataset and extracts Chinese summaries better than the original method.

20.
The task of automatic document summarization aims at generating short summaries for originally long documents. A good summary should cover the most important information of the original document or cluster of documents while being coherent, non-redundant and grammatically readable. Numerous approaches to automatic summarization have been developed to date. In this paper we give a self-contained, broad overview of recent progress made in document summarization within the last five years. Specifically, we emphasize significant contributions that represent the state of the art of document summarization, including progress on modern sentence extraction approaches that improve concept coverage, information diversity and content coherence, as well as summarization frameworks that integrate sentence compression, and more abstractive systems that are able to produce completely new sentences. In addition, we review progress made in document summarization for domains, genres and applications that differ from traditional settings. We also point out some of the latest trends and highlight a few possible future directions.
