Similar Documents
Found 20 similar documents.
1.
In recent years, algebraic methods, more precisely matrix decomposition approaches, have become a key tool for tackling the document summarization problem. Typical algebraic methods used in multi-document summarization (MDS) range from soft and hard clustering approaches to low-rank approximations. In this paper, we present AASum, a novel summarization method that employs archetypal analysis for generic MDS. Archetypal analysis (AA) is a promising unsupervised learning tool that combines the advantages of clustering with the flexibility of matrix factorization. In document summarization, given a content-graph data matrix representation of a set of documents, positively and/or negatively salient sentences are values on the data set boundary. These extreme values, archetypes, can be computed using AA. While each sentence in a data set is estimated as a mixture of archetypal sentences, the archetypes themselves are restricted to being sparse mixtures, i.e., convex combinations of the original sentences. Since AA in this way readily offers soft clustering, we propose using it as a method for simultaneous sentence clustering and ranking. Another important argument in favor of AA in MDS is that, in contrast to other factorization methods, which extract prototypical, characteristic, even basic sentences, AA selects distinct (archetypal) sentences and thus induces variability and diversity in the produced summaries. Experimental results on the DUC generic summarization data sets demonstrate the improvement of the proposed approach over other closely related methods.

2.
Query-focused multi-document summarization faces two difficulties. First, keeping the summary closely related to the query tends to make its content repetitive and insufficiently comprehensive. Second, the original query rarely describes the query intent completely, so query expansion is needed, yet most existing query expansion methods rely on external semantic resources. To address these problems, this paper proposes a query-focused multi-document summarization method that uses topic analysis to identify the subtopics under the current topic and selects summary sentences by jointly considering the relevance between a sentence's subtopic and the query and the importance of that subtopic. Based on word co-occurrence information across subtopics, query expansion is performed without using any external knowledge. Experimental results on the DUC 2006 evaluation corpus show that, compared with the baseline system, the proposed system achieves higher ROUGE scores, and the subtopic-based query expansion method further improves summary quality.
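The knowledge-free expansion idea can be sketched with simple sentence-level co-occurrence counts. This is a minimal simplification of the paper's subtopic-level method, not its actual formulation; the function name `expand_query` and the scoring scheme are illustrative assumptions:

```python
from collections import Counter

def expand_query(query_terms, sentences, top_k=2):
    """Expand a query with the words that most often co-occur with
    query terms inside the same sentence (no external resources)."""
    qset = set(query_terms)
    cooc = Counter()
    for sent in sentences:
        words = set(sent.lower().split())
        overlap = len(words & qset)
        if overlap:
            # credit every non-query word in this sentence
            for w in words - qset:
                cooc[w] += overlap
    return list(query_terms) + [w for w, _ in cooc.most_common(top_k)]
```

Here "power" co-occurs with "solar" in two sentences, so it is added first: `expand_query(["solar"], ["solar power plant output", "solar panel power grid", "weather report today"], top_k=1)` yields `["solar", "power"]`.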

3.
Graph models have been widely applied in document summarization, using sentences as graph nodes and the similarity between sentences as edges. In this paper, a novel graph model for document summarization is presented that exploits not only sentence relevance but also the relevance of the phrases contained in sentences. In short, we construct a phrase-sentence two-layer graph model (PSG) to summarize documents. We use this model for both generic document summarization and query-focused summarization. The experimental results show that our model greatly outperforms existing work.
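The core of graph-based sentence ranking can be sketched as a PageRank-style power iteration over a sentence similarity matrix. This is only the single-layer base case, not the paper's two-layer phrase-sentence model; the function name and damping parameter are illustrative:

```python
import numpy as np

def rank_nodes(sim, d=0.85, iters=100):
    """PageRank-style ranking over a symmetric similarity matrix."""
    n = sim.shape[0]
    # row-normalize so each row becomes a transition distribution
    row_sums = sim.sum(axis=1, keepdims=True)
    P = sim / np.where(row_sums == 0, 1, row_sums)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        # standard damped update: random jump + walk along edges
        r = (1 - d) / n + d * (P.T @ r)
    return r
```

A hub sentence connected to all others ends up ranked highest, which is the intuition the two-layer model refines with phrase nodes.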

4.
The goal of multi-document summarization is, given a query and a set of documents, to create a concise summary that expresses the key content of those documents while remaining relevant to the query. A given document set usually contains several topics, each represented by a group of sentences, and a good summary should cover the most important topics. Most current methods build a model to score sentences and then select the highest-scoring ones to form the summary. Unlike these methods, we focus on topics rather than sentences, casting summary generation as a problem of topic discovery, ranking, and representation. We are the first to introduce dominant sets clustering (DSC) to discover topics; we then build a model to evaluate topic importance, and finally select sentences from each topic, balancing representativeness against redundancy, to compose the summary. Experiments on the standard DUC 2005, 2006, and 2007 data sets demonstrate the effectiveness of this method.

5.
Automatic summarization based on user query expansion
This paper proposes a new automatic document summarization method. Non-negative matrix factorization is used to represent the original document as a linear combination of semantic feature vectors; similarity computation then identifies the semantic feature vectors most relevant to the user query, and sentences with large projection coefficients on those vectors are extracted as the summary. Throughout this process, relevance feedback is repeatedly applied to expand and refine the user query. Experiments show that the resulting summaries highlight the document's themes while reflecting the user's needs and interests, effectively improving the efficiency of information retrieval.

6.
In this paper, state-of-the-art methods of automatic text summarization that build summaries in the form of generic extracts are considered. The original text is represented as a numerical matrix: matrix columns correspond to sentences, and each sentence is represented as a vector in the term space. Latent semantic analysis is then applied to this matrix to construct a representation of the sentences in a topic space, whose dimensionality is much lower than that of the initial term space. The most important sentences are chosen on the basis of their representation in the topic space, with the number of important sentences determined by the length of the required summary. This paper also presents a new generic text summarization method that uses non-negative matrix factorization to estimate sentence relevance. The proposed relevance estimate is based on normalizing the topic space and then weighting each topic using the sentence representations in that space. The proposed method shows better summarization quality and performance than state-of-the-art methods on the DUC 2001 and DUC 2002 standard data sets.
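The LSA selection step can be sketched in a few lines, assuming a simple term-by-sentence count matrix and Steinberger-style scoring by topic-space vector length; the paper's exact scoring (topic normalization and weighting) may differ:

```python
import numpy as np

def lsa_select(A, k, n_sentences):
    """A: term-by-sentence matrix. Project sentences into a k-dim
    topic space via truncated SVD, score each sentence by its
    topic-space vector length, and return top-sentence indices."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # sentence j in topic space is s[:k] * Vt[:k, j]
    scores = np.sqrt(((s[:k, None] * Vt[:k, :]) ** 2).sum(axis=0))
    return sorted(np.argsort(scores)[::-1][:n_sentences].tolist())
```

With k equal to the full rank, the score reduces to the column norm, so the term-richest sentence wins; smaller k filters out minor topics.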

7.
Multi-document summarization via submodularity
Multi-document summarization is becoming an important issue in the information retrieval community. It aims to distill the most important information from a set of documents to generate a compressed summary. Given a set of documents as input, most existing multi-document summarization approaches use different sentence selection techniques to extract a set of sentences from the document set as the summary. The submodularity hidden in term coverage and textual-unit similarity motivates us to incorporate this property into our solution to multi-document summarization tasks. In this paper, we propose a new principled and versatile framework for different multi-document summarization tasks using submodular functions (Nemhauser et al. in Math. Prog. 14(1):265–294, 1978) based on term coverage and textual-unit similarity, which can be efficiently optimized through an improved greedy algorithm. We show that four known summarization tasks, including generic, query-focused, update, and comparative summarization, can be modeled as variations derived from the proposed framework. Experiments on benchmark summarization data sets (e.g., the DUC04-06, TAC08, and TDT2 corpora) demonstrate the efficacy and effectiveness of the proposed framework for general multi-document summarization tasks.
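Greedy optimization of a monotone submodular objective carries a (1 - 1/e) approximation guarantee, which is what makes this framework practical. A minimal sketch with plain term-set coverage standing in for the paper's full objective (the function name is an assumption):

```python
def greedy_coverage(sentence_terms, budget):
    """Greedily pick sentences by marginal term-coverage gain.
    sentence_terms: list of sets of terms, one set per sentence."""
    covered, chosen = set(), []
    remaining = set(range(len(sentence_terms)))
    while len(chosen) < budget and remaining:
        # marginal gain = number of not-yet-covered terms added
        best = max(remaining, key=lambda i: len(sentence_terms[i] - covered))
        if not sentence_terms[best] - covered:
            break  # nothing new to cover
        chosen.append(best)
        covered |= sentence_terms[best]
        remaining.discard(best)
    return chosen
```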

8.
To further improve the speed and accuracy of judging web page relevance, a new sentence weighting method for query-focused snippets is proposed. Based on the result set returned for a query, phrases in the input query are recognized by computing the mutual information between keywords; the page structure is analyzed using the tags in the page text, and keyword phrase and page structure information are incorporated into sentence weighting. Experimental results show that query summaries generated by this algorithm outperform existing methods in both the speed and the accuracy of relevance judgments.
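Phrase recognition via mutual information between adjacent keywords can be sketched with pointwise mutual information (PMI) over bigram counts. This is a generic illustration, not the paper's exact formulation; the threshold value is an assumption:

```python
import math
from collections import Counter

def pmi(bigram, unigrams, bigrams, n_tokens, n_pairs):
    """Pointwise mutual information of an adjacent word pair."""
    w1, w2 = bigram
    p_xy = bigrams[bigram] / n_pairs
    p_x = unigrams[w1] / n_tokens
    p_y = unigrams[w2] / n_tokens
    return math.log(p_xy / (p_x * p_y))

def find_phrases(tokens, threshold=1.0):
    """Adjacent pairs whose PMI exceeds the threshold are phrases."""
    unigrams = Counter(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    bigrams = Counter(pairs)
    return {b for b in bigrams
            if pmi(b, unigrams, bigrams, len(tokens), len(pairs)) > threshold}
```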

9.
The goal of abstractive multi-document summarization is to automatically produce a condensed version of the document text while maintaining its significant information. Most graph-based extractive methods represent a sentence as a bag of words and use content similarity measures, which may fail to detect semantically equivalent redundant sentences. On the other hand, graph-based abstractive methods depend on domain experts to build a semantic graph from a manually created ontology, which requires time and effort. This work presents a semantic graph approach with an improved ranking algorithm for abstractive summarization of multiple documents. The semantic graph is built from the source documents so that graph nodes denote predicate argument structures (PASs), the semantic structures of sentences, identified automatically by semantic role labeling, while graph edges carry similarity weights computed from PAS-to-PAS semantic similarity. To reflect the influence of both the document and the document set on each PAS, the edges of the semantic graph are further augmented with PAS-to-document and PAS-to-document-set relationships. The important graph nodes (PASs) are ranked using the improved graph ranking algorithm. Redundant PASs are reduced by re-ranking them with maximal marginal relevance, and summary sentences are finally generated from the top-ranked PASs using language generation. The experiments in this work use DUC-2002, a standard data set for document summarization. The findings show that the proposed approach outperforms other summarization approaches.
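The maximal-marginal-relevance re-ranking step used to reduce redundancy is a standard technique. A minimal sketch over generic relevance scores and a similarity matrix (indices stand in for PASs; the trade-off parameter `lam` is an assumption):

```python
def mmr_select(relevance, sim, k, lam=0.7):
    """Maximal Marginal Relevance: trade off relevance against
    similarity to already-selected items. sim is symmetric."""
    selected = []
    candidates = set(range(len(relevance)))
    while len(selected) < k and candidates:
        def score(i):
            # penalty = similarity to the closest already-chosen item
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.discard(best)
    return selected
```

With two near-duplicate top candidates, MMR keeps one and skips to the next diverse item.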

10.
A summary sentence selection strategy centered on keyword extraction
For query-focused multi-document summarization, this paper proposes a summary sentence selection strategy centered on keyword extraction. Query-relevance features of the words in the related document set are obtained through query expansion techniques, topic-relevance features are obtained by maximum likelihood estimation over the corpus, and the two feature values are fused into a word importance score used to determine keywords. Candidate sentences are then scored by the importance of the keywords they contain, and the scores are further adjusted with an improved MMR (Maximal Marginal Relevance) technique before the summary is generated. The paper introduces feature fusion at the word level, and tests on the DUC 2005 corpus achieve good results.

11.
A query-focused summary concentrates on the parts of a document that the user cares about. The proposed query-focused summarization method uses a search engine as the question-query tool and computes sentence importance with a Hanning window function, which reflects the density of words shared by the question and the expected answer. The window slides from the beginning to the end of the text to compute sentence weights, and the highest-weighted sentences are selected as the summary. Experimental results show that the summaries produced by this method are better than Google's snippets.
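The Hanning-window density idea can be sketched as a convolution of query-term indicator values with a Hanning window, so that tightly clustered query terms outscore scattered ones. The aggregation here (taking the maximum windowed density per sentence) and the window size are assumptions, not the paper's exact definitions:

```python
import numpy as np

def hanning_score(tokens, query_terms, win=5):
    """Score a token sequence by the peak Hanning-windowed density
    of query terms: clustered hits score higher than scattered ones."""
    hits = np.array([1.0 if t in query_terms else 0.0 for t in tokens])
    if len(hits) < win:
        return float(hits.sum())
    w = np.hanning(win)  # symmetric taper: [0, 0.5, 1, 0.5, 0] for win=5
    return float(max(np.convolve(hits, w, mode="valid")))
```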

12.
Topic-focused multi-document summarization is challenging because the created summary must be biased toward the given topic or query. Existing methods treat the given topic as a single coarse unit and directly incorporate the relevance between each sentence and this single topic into the sentence evaluation process. However, the given topic is usually not well defined and consists of a few explicit or implicit subtopics. In this study, the related subtopics are discovered from the topic's narrative text or document set through topic analysis techniques. The sentence relationships with respect to each subtopic are then treated as an individual modality, and a multimodality manifold-ranking method is proposed to evaluate and rank sentences by fusing the multiple modalities. Experimental results on the DUC benchmark data sets show the promise of the proposed methods.

13.
Low-rank matrix factorization is one of the most useful tools in scientific computing, data mining, and computer vision. Among its techniques, non-negative matrix factorization (NMF) has received considerable attention for producing a parts-based representation of the data. Recent research has shown that not only do the observed data lie on a nonlinear low-dimensional manifold (the data manifold), but the features also lie on a manifold (the feature manifold). In this paper, we propose a novel algorithm, graph dual regularization non-negative matrix factorization (DNMF), which simultaneously considers the geometric structures of both the data manifold and the feature manifold. We also present a graph dual regularization non-negative matrix tri-factorization algorithm (DNMTF) as an extension of DNMF. Moreover, we develop iterative updating optimization schemes for DNMF and DNMTF, and provide convergence proofs for both schemes. Experimental results on UCI benchmark data sets, several image data sets, and a radar HRRP data set demonstrate the effectiveness of both DNMF and DNMTF.
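As background, plain NMF with Lee-Seung multiplicative updates looks as follows. DNMF extends these updates with graph-Laplacian regularization terms for the data and feature manifolds, which this sketch omits:

```python
import numpy as np

def nmf(X, k, iters=200, eps=1e-9, seed=0):
    """Plain multiplicative-update NMF: X (m x n, non-negative) ~ W @ H.
    The updates keep W and H non-negative and monotonically reduce
    the Frobenius reconstruction error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 0.1
    H = rng.random((k, n)) + 0.1
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```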

14.
To address the problem that existing multi-document extractive methods make poor use of sentence topic and semantic information, a multi-document summarization method based on a sentence graph fusing multiple kinds of information is proposed. First, a sentence graph is built with sentences as nodes. Then, the sentence topic probability distributions obtained from a sentence-level Bayesian topic model and the sentence semantic similarities obtained from a word embedding model are fused into a final sentence relevance, so that topic and semantic information together serve as the edge weights of the sentence graph. Finally, the multi-document summary is produced with a minimum-dominating-set summarization method over the sentence graph. By fusing multiple kinds of information in the sentence graph, the method combines the topic, semantic, and relational information between sentences. Experimental results show that it effectively improves the overall quality of the extracted summaries.
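Finding a minimum dominating set is NP-hard, so in practice it is approximated greedily. The sketch below works on a plain sentence adjacency structure and ignores the fused edge weights described above:

```python
def greedy_dominating_set(adj):
    """Greedy approximation of a minimum dominating set: repeatedly
    pick the node that dominates the most not-yet-covered nodes.
    adj: list of neighbor sets (a thresholded similarity graph)."""
    n = len(adj)
    uncovered = set(range(n))
    dom = []
    while uncovered:
        # a node covers itself plus its neighbors
        best = max(range(n), key=lambda v: len(({v} | adj[v]) & uncovered))
        dom.append(best)
        uncovered -= {best} | adj[best]
    return dom
```

On a star-shaped similarity graph, the hub sentence alone dominates everything, matching the intuition that the chosen sentences "cover" the document.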

15.
This paper proposes a graph-regularized $L_p$ smooth non-negative matrix factorization (GSNMF) method that incorporates graph regularization and an $L_p$ smoothing constraint, which respects the intrinsic geometric information of a data set and produces smooth and stable solutions. The main contributions are as follows: first, graph regularization is added to NMF to discover hidden semantics while respecting the intrinsic geometric structure of the data set. Second, the $L_p$ smoothing constraint is incorporated into NMF to combine the merits of isotropic ($L_{2}$-norm) and anisotropic ($L_{1}$-norm) diffusion smoothing, producing a smooth and more accurate solution to the optimization problem. Finally, the update rules and a proof of convergence for GSNMF are given. Experiments on several data sets show that the proposed method outperforms related state-of-the-art methods.

16.
Case-related public opinion summarization extracts, from a cluster of news texts about a specific legal case, a few sentences that capture its main topic as the summary. It can be viewed as domain-specific multi-document summarization: compared with general summarization tasks, the topic can be characterized by case elements that run through the entire text cluster. Within the cluster, sentences are related to one another, and case elements are related to sentences in varying degrees; these relations play an important role in selecting summary sentences. This paper proposes a case text summarization method based on graph convolution over a case-element-and-sentence association graph. A graph models the multi-text cluster, with sentences as primary nodes and words and case elements as auxiliary nodes that strengthen the relations between sentences; several features are used to compute the relations between different kinds of nodes. A graph convolutional network then learns over the sentence association graph and classifies sentences to obtain candidate summary sentences, and the final summary is produced by deduplication and ranking. Experiments on a collected case-related summarization data set show that the proposed method outperforms the baseline models, and that introducing case elements and the sentence association graph clearly benefits case-related multi-document summarization.

17.
This paper describes the design of a multi-document summarization system that combines subtopic partitioning with queries. Sentence semantic similarity is first computed using a synonym thesaurus (Tongyici Cilin), and subtopics are obtained by clustering the sentences. The subtopics are then ranked by importance with respect to the user's query, and on this basis a dynamic sentence scoring strategy extracts sentences from each subtopic to generate the summary. Experimental results show that the generated summaries have little redundancy and comprehensive coverage.

18.
Text summarization is a process of extracting salient information from a source text and presenting that information to the user in a condensed form while preserving its main content. Among the most difficult problems in text summarization are providing wide topic coverage and diversity in a summary. Research based on clustering, optimization, and evolutionary algorithms for text summarization has recently shown good results, making this a promising area. In this paper, a two-stage sentence selection model for text summarization based on clustering and optimization techniques, called COSUM, is proposed. At the first stage, to discover all topics in a text, the sentence set is clustered using the k-means method. At the second stage, an optimization model is proposed for selecting salient sentences from the clusters. This model optimizes an objective function expressed as the harmonic mean of objective functions enforcing the coverage and diversity of the selected sentences in the summary. To provide readability, the model also controls the length of the sentences selected for the candidate summary. For solving the optimization problem, an adaptive differential evolution algorithm with a novel mutation strategy is developed. COSUM was compared with 14 state-of-the-art methods: DPSO-EDASum; LexRank; CollabSum; UnifiedRank; 0-1 non-linear; query, cluster, summarize; support vector machine; fuzzy evolutionary optimization model; conditional random fields; MA-SingleDocSum; NetSum; manifold ranking; ESDS-GHS-GLO; and differential evolution, using the ROUGE toolkit on the DUC2001 and DUC2002 data sets. Experimental results demonstrate that COSUM outperforms the state-of-the-art methods in terms of ROUGE-1 and ROUGE-2 measures.
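The first stage (k-means clustering of sentence vectors) can be sketched with a plain Lloyd's-algorithm implementation; the second-stage differential evolution optimizer is not shown, and the initialization here is simple random row selection rather than anything COSUM-specific:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means (Lloyd's algorithm) on the row vectors of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each row to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each center to the mean of its assigned rows
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers
```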

19.
The paper describes the rank-1 weighted factorization solution to the structure-from-motion problem. This method recovers 3D structure by factoring a data matrix that is rank 1 rather than rank 3. The matrix collects the estimates of the 2D motions of a set of feature points of the rigid object. These estimates are weighted by the inverse of their error standard deviation, so that the 2D motion estimates for "sharper" features, which are usually well estimated, receive more weight, while the noisier motion estimates for "smoother" features are weighted less. We analyze the performance of the rank-1 weighted factorization algorithm to determine which 3D shapes and 3D motions are most suitable for recovering the 3D structure of a rigid object from the 2D motions of its features. Our approach is developed for the orthographic camera model. It avoids expensive singular value decompositions by using the power method and is suitable for handling dense sets of feature points and long video sequences. Experimental studies with synthetic and real data illustrate the good performance of our approach.
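The power-method trick that replaces the full SVD can be sketched for the dominant eigenpair of a square matrix; applied to M.T @ M, it yields the dominant right singular vector of M, which is all a rank-1 factorization needs. The iteration count is an illustrative choice:

```python
import numpy as np

def dominant_eig(A, iters=200):
    """Power method: dominant eigenvalue and eigenvector of a
    square matrix, avoiding a full eigendecomposition or SVD."""
    v = np.ones(A.shape[0])
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)  # renormalize each step
    return v @ A @ v, v            # Rayleigh quotient, eigenvector
```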

20.
To address the main challenges of Chinese news summarization for microblogs, an automatic news summarization method combining matrix factorization with submodular maximization is proposed. The method first uses an orthogonal matrix factorization model to obtain latent semantic vectors of the news texts, alleviating the sparsity of short texts and keeping the projection directions nearly orthogonal to reduce redundancy. Sets of news sentences are then evaluated for relevance and diversity with an objective composed of several monotone submodular functions and one non-submodular function measuring sentence dissimilarity. Finally, a greedy algorithm is designed to generate the summary. Experimental results on the NLPCC 2015 data set show that the proposed method effectively improves the quality of microblog-oriented automatic news summarization, with ROUGE scores exceeding the other baseline systems.
