Similar Literature
20 similar documents found (search time: 159 ms).
1.
Automatic document summarization is an active research topic in natural language processing. This paper proposes a multi-document summarization method based on extracting key sentences from local topics. Each document in the collection is first segmented into several local topics; the local topics from different documents are then clustered, and the required summary sentences are finally extracted from the resulting clusters. Experiments demonstrate the effectiveness of the method.
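The pipeline described here (segment, cluster, extract) maps naturally onto a cluster-then-extract loop. Below is a minimal sketch of that loop, assuming TF-IDF sentence vectors and k-means as stand-ins for the paper's own local-topic segmentation and clustering steps; the function and parameter names are illustrative.

```python
# A minimal cluster-then-extract sketch: group sentences into local-topic
# clusters, then take the sentence closest to each cluster centroid.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cluster_extract_summary(sentences, n_topics=5):
    """Cluster sentences into local-topic groups and pick one representative each."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(tfidf)
    summary = []
    for k in range(n_topics):
        idx = [i for i, lab in enumerate(km.labels_) if lab == k]
        if not idx:
            continue
        # Representative = sentence whose vector is closest to the centroid.
        sims = cosine_similarity(tfidf[idx], km.cluster_centers_[k].reshape(1, -1))
        summary.append(sentences[idx[int(sims.argmax())]])
    return summary
```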

2.
For query-oriented multi-document summarization, this paper mixes the query sentence in among the sentences of the document collection and applies the efficient soft clustering algorithm SSC to cluster all sentences. Summary sentences are then extracted from the clusters in round-robin fashion to produce the final summary. The method performed well when tested on the DUC 2005 corpus.

3.
An Automatic Summarization Method Based on Key-Sentence Extraction from Local Topics   Cited: 2 (self-citations: 1, other citations: 1)
徐超  王萌  何婷婷  张勇 《计算机工程》2008,34(22):49-51
Automatic summarization is an important component of language information processing. This paper proposes a Chinese automatic summarization method based on extracting key sentences from local topics. The document is segmented into topics by hierarchical segmentation, and a fixed number of sentences are extracted from each local-topic unit as the summary. Performing semantic analysis on the document beforehand effectively avoids data redundancy and the tendency to overlook sparsely distributed topics. Experimental results demonstrate the effectiveness of the method.

4.
Research on Optimized Sentence Selection for Multi-Document Summarization   Cited: 2 (self-citations: 0, other citations: 2)
Building on subtopic partitioning for multi-document summarization, this paper proposes a method for optimized selection of summary sentences across subtopics. Subtopics of the document collection are first formed from sentence-similarity computations, and each subtopic is scored to determine the extraction order. With coverage of the effective (content) words in the summary as the optimization objective, summary sentences are selected within each subtopic, reducing information redundancy both between and within subtopics and greatly improving the information coverage of the summary; a sketch of this coverage objective follows. Experiments show that the generated summaries are satisfactory.
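A minimal sketch of the coverage objective described above, assuming greedy selection of whichever sentence adds the most not-yet-covered content words; the whitespace tokenizer and stopword filter are simplifying assumptions, not the paper's exact procedure.

```python
# Greedy word-coverage selection: repeatedly pick the sentence that adds the
# most new content words, stopping when nothing new would be covered.
def greedy_coverage_summary(sentences, max_sentences=5, stopwords=frozenset()):
    covered = set()
    chosen = []
    candidates = {i: set(s.split()) - stopwords for i, s in enumerate(sentences)}
    for _ in range(max_sentences):
        # Pick the sentence adding the most not-yet-covered content words.
        best = max(candidates, key=lambda i: len(candidates[i] - covered), default=None)
        if best is None or not (candidates[best] - covered):
            break  # no sentence adds new information: stop to limit redundancy
        covered |= candidates[best]
        chosen.append(sentences[best])
        del candidates[best]
    return chosen
```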

5.
Multi-document summarization helps people obtain information automatically and quickly. Building a multi-document summarization system with a topic model, here Latent Dirichlet Allocation (LDA), is a new attempt. LDA is a multi-level generative probabilistic model that can detect the topic distribution of documents. The method models the document collection with LDA, takes the similarity between probability distributions over the topics as the measure of sentence importance, and extracts summary sentences by importance. Experimental results show that the summaries produced outperform those of traditional methods.
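One reading of this scoring is sketched below with scikit-learn's LDA: each sentence's topic distribution is compared (by cosine similarity, an assumption) against the topic distribution of the whole collection, and the top-scoring sentences are extracted.

```python
# LDA-based sentence scoring: importance = cosine similarity between a
# sentence's topic mix and the collection's overall topic mix.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_summary(sentences, n_topics=10, n_extract=3):
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    sent_topics = lda.transform(counts)                        # per-sentence topic mix
    doc_topics = lda.transform(np.asarray(counts.sum(axis=0))) # collection topic mix
    scores = sent_topics @ doc_topics.ravel()
    scores /= np.linalg.norm(sent_topics, axis=1) * np.linalg.norm(doc_topics)
    top = np.argsort(scores)[::-1][:n_extract]
    return [sentences[i] for i in sorted(top)]                 # keep original order
```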

6.
邓箴  包宏 《计算机与应用化学》2012,29(11):1384-1386
This paper proposes a multi-document summary generation method that extracts representative terms of a text based on lexical-chain extraction and grammatical analysis. Lexical chains are built by computing word-sense similarity, and representative terms are selected by combining term frequency and position features. Sentences containing high-weight terms are clustered to form the set of candidate summary sentences, from which centroid sentences are extracted and ordered to generate the multi-document summary. The method considers not only the semantic information between words but also how representative each term is of the text, improving summary-sentence extraction. Experimental results show considerable gains in both recall and precision over methods that determine the summary from keywords alone.

7.
Multi-Document Summarization with the LDA Topic Model   Cited: 3 (self-citations: 0, other citations: 3)
Representing the multi-document summarization problem with probabilistic topic models has attracted researchers' attention in recent years. LDA (Latent Dirichlet Allocation) is one of the representative generative probabilistic topic models. This paper proposes an LDA-based summarization method that determines the number of topics by perplexity, obtains the sentence-topic and topic-word probability distributions by Gibbs sampling, measures each topic's importance by the sum of its weights over sentences, and derives two different sentence-weighting models from the topic and sentence probability distributions of the LDA model. Using the ROUGE evaluation standard on the generic multi-document summarization test set DUC 2002, the method is compared with the state-of-the-art SumBasic method and two other LDA-based multi-document summarization methods; the results show that the proposed method outperforms SumBasic on all ROUGE metrics and also compares favorably with the other LDA-based summarizers.
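The perplexity-driven choice of topic number can be sketched as a simple model-selection loop. The sketch below uses scikit-learn's variational LDA as a stand-in for the paper's Gibbs-sampled model; the candidate topic counts are arbitrary.

```python
# Pick the LDA topic count that minimizes perplexity on the corpus.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def pick_n_topics(texts, candidates=(5, 10, 20, 40)):
    counts = CountVectorizer().fit_transform(texts)
    best_k, best_pp = None, np.inf
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
        pp = lda.perplexity(counts)   # lower perplexity = better fit
        if pp < best_pp:
            best_k, best_pp = k, pp
    return best_k
```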

8.
王萌  徐超  李春贵  何婷婷 《计算机工程》2011,37(12):158-160
To address the excessive dimensionality and sparsity of term-frequency matrices, this paper proposes a multi-document summarization method based on subtopic partitioning. HowNet is used for concept acquisition to build a concept vector space model in place of the traditional term-frequency vector space model. On top of this model, an improved hierarchical segmentation method partitions the document collection into subtopics, and a set number of sentences is extracted from each subtopic as the summary. Experimental results verify the effectiveness of the method.

9.
Probabilistic topic models have received wide attention from researchers in recent years. LDA (Latent Dirichlet Allocation) is one of the representative generative probabilistic topic models and can detect the latent topics of a text. This paper proposes an LDA-based topic feature that computes the distance between the document's topic distribution and each sentence's topic distribution. Combined with features commonly used in traditional multi-document summarization, sentence weights are computed and sentences are extracted by score to form the summary. Experimental results show that adding the LDA topic feature significantly improves summarization performance.
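A hedged sketch of the combined scoring step: here the topic feature is one minus the Jensen-Shannon distance between sentence and document topic distributions, while the remaining features and all weights are illustrative placeholders for the paper's "traditional" features.

```python
# Weighted combination of an LDA topic feature with conventional features.
import numpy as np
from scipy.spatial.distance import jensenshannon

def sentence_score(sent_topics, doc_topics, position, n_sents,
                   w_topic=0.5, w_pos=0.3, w_len=0.2, length_ratio=1.0):
    # Topic feature: closeness of the sentence's topic mix to the document's.
    topic_feature = 1.0 - jensenshannon(sent_topics, doc_topics)
    # Position feature: earlier sentences score higher (a common heuristic).
    position_feature = 1.0 - position / n_sents
    return w_topic * topic_feature + w_pos * position_feature + w_len * length_ratio
```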

10.
This paper surveys the main research methods and strategies of automatic summarization and groups them into three categories: automatic extraction, summarization based on information extraction, and summarization based on understanding. Automatic extraction forms a summary from important sentences extracted from the text; information-extraction-based summarization fills prepared frames with information extracted from the text and outputs the content through templates; understanding-based summarization generates summaries with natural language processing techniques. The paper focuses on automatic extraction for single-topic and multi-topic documents and, after comparing the strengths and weaknesses of several algorithms, proposes a new multi-topic partitioning method.

11.
肖升  何炎祥 《计算机应用研究》2012,29(12):4507-4511
Chinese extractive summarization is a convenient way to realize Chinese automatic summarization: it selects several sentences from the original text according to extraction rules and assembles them directly into a summary. By optimizing the input matrix and the key-sentence selection algorithm, this paper proposes an improved latent semantic analysis (LSA) method for Chinese extractive summarization. The method first builds a multi-valued input matrix based on the vector space model, then applies LSA to the input matrix to derive the semantic relatedness between sentences and latent concepts (abstract representations of topic information), and finally selects key sentences with an improved selection algorithm. Experimental results show average precision, recall, and F-measure of 75.9%, 71.8%, and 73.8% respectively; compared with existing methods of the same kind, the improved method is fully unsupervised throughout, substantially more efficient overall, and therefore more practical to apply.
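The LSA pipeline (matrix, SVD, concept-based scoring) can be sketched as follows; the TF-IDF input and the singular-value-weighted sentence score are common LSA-summarization choices and stand in for the paper's own multi-valued matrix and improved selection algorithm.

```python
# LSA extraction sketch: SVD of a term-sentence matrix, then score each
# sentence by its strength across the leading latent concepts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_summary(sentences, n_concepts=3, n_extract=3):
    a = TfidfVectorizer().fit_transform(sentences).T.toarray()  # terms x sentences
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Column j of vt[:n_concepts] holds sentence j's loadings on the concepts;
    # weight each concept by its singular value.
    scores = np.sqrt((s[:n_concepts, None] ** 2 * vt[:n_concepts] ** 2).sum(axis=0))
    top = np.argsort(scores)[::-1][:n_extract]
    return [sentences[i] for i in sorted(top)]
```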

12.
Patents are a type of intellectual property with ownership and monopolistic rights: publicly accessible published documents, often with illustrations, registered with governments and international organizations. Registration allows people familiar with the domain to understand how to re-create the new and useful invention, but restricts manufacturing unless the owner licenses the patent or enters into a legal agreement to sell its ownership. Patents reward the costly research and development efforts of inventors while spreading new knowledge and accelerating innovation. This research uses natural language processing, deep learning techniques, and machine learning algorithms to extract the essential knowledge of patent documents within a given domain as a means to evaluate their worth and technical advantage. Manual patent abstraction is a time-consuming, labor-intensive, and subjective process that becomes cost- and outcome-ineffective as the size of the patent knowledge domain increases. This research develops an intelligent patent summarization methodology using machine learning approaches that allows patent domains of extremely large size to be summarized effectively and objectively, especially where the cost and time requirements of manual summarization are infeasible. The system learns to automatically summarize patent documents with natural language texts for any given technical domain. The machine learning solution identifies key technical terminology (words, phrases, and sentences) in the context of the semantic relationships among training patents and corresponding summaries as the core of the summarization system. To ensure high performance of the proposed methodology, ROUGE metrics are used to evaluate the precision, recall, accuracy, and consistency of the knowledge generated by the summarization system. The smart machinery technologies domain, under the sub-domains of control intelligence, sensor intelligence, and intelligent decision-making, provides the case studies for training the patent summarization system. The cases use 1708 training pairs of patents and summaries, while testing uses 30 randomly selected patents. Implementation and verification show the summary reports achieve 90% average precision and 84% average recall.
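For reference, the ROUGE-1 precision and recall reported above reduce to unigram overlap counted with multiplicity, as in this minimal sketch; production evaluations normally use a packaged ROUGE implementation with stemming and multiple references.

```python
# ROUGE-1: unigram overlap between a candidate summary and a reference.
from collections import Counter

def rouge_1(candidate_tokens, reference_tokens):
    # Overlap counts each unigram at most min(candidate count, reference count).
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    precision = overlap / len(candidate_tokens) if candidate_tokens else 0.0
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    return precision, recall
```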

13.
Text summarization condenses the information content of a text and extracts its main content to form a summary. Existing summarization models usually treat content selection and summary generation independently; although this improves sentence compression and fusion, part of the textual information is lost during extraction, lowering accuracy. Based on a pre-trained model and a Transformer document-level sentence encoder, this paper proposes a staged summarization model that combines content extraction with summary generation. BERT is trained on a large corpus with self-supervision to obtain word representations rich in semantic information. Built on the Transformer architecture, a fully connected classifier assigns each sentence one of three labels, extracting the set of source sentences corresponding to each summary sentence. A pointer-generator network then compresses each sentence set into a single summary sentence, shortening both the output and input sequences. Experimental results show that, compared with generating the full summary directly, the model improves the average F1 of ROUGE-1, ROUGE-2, and ROUGE-L on generated sentences by 1.69 percentage points, effectively improving the accuracy of generated sentences.
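The content-selection stage described here, sentence-level three-way labeling on top of BERT, might look like the following sketch using Hugging Face Transformers; the model name, the label semantics, and the omission of the pointer-generator fusion stage are all simplifications, and the classification head must be fine-tuned before its labels mean anything.

```python
# Three-way sentence labeling with a BERT encoder and classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
# num_labels=3 adds an (untrained) 3-class head; fine-tune before real use.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=3)

def label_sentences(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits           # (n_sentences, 3)
    return logits.argmax(dim=-1).tolist()        # one selection label per sentence
```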

14.
With the continuous growth of online news articles, an efficient abstractive summarization technique is needed to address the problem of information overload. Abstractive summarization is highly complex and requires deeper understanding and proper reasoning to come up with its own summary outline. The abstractive summarization task is framed as seq2seq modeling. Existing seq2seq methods perform better on short sequences, but for long sequences performance degrades due to high computation; hence a two-phase self-normalized deep neural document summarization model, consisting of an improvised extractive cosine-normalization phase and a seq2seq abstractive phase, is proposed in this paper. The novelty is to parallelize sequence-computation training by incorporating a feed-forward, self-normalized neural network in the extractive phase using Intra Cosine Attention Similarity (Ext-ICAS) with sentence dependency position; no explicit normalization technique is required. The proposed abstractive Bidirectional Long Short-Term Memory (Bi-LSTM) encoder sequence model performs better than a Bidirectional Gated Recurrent Unit (Bi-GRU) encoder, with lower training loss and faster convergence. The proposed model was evaluated on the Cable News Network (CNN)/Daily Mail dataset, achieving an average ROUGE score of 0.435, and the similarity computations in extractive-phase training were reduced by 59% on average.

15.
We present methods of extractive query-oriented single-document summarization using a deep auto-encoder (AE) to compute a feature space from the term-frequency (tf) input. Our experiments explore both local and global vocabularies. We investigate the effect of adding small random noise to the local tf input representation of the AE, and propose an ensemble of such noisy AEs, which we call the Ensemble Noisy Auto-Encoder (ENAE). ENAE is a stochastic version of an AE that adds noise to the input text and selects the top sentences from an ensemble of noisy runs; in each individual run of the ensemble, a different randomly generated noise is added to the input representation. This architecture changes the AE from a deterministic feed-forward network into a stochastic runtime model. Experiments show that the AE using local vocabularies provides a clearly more discriminative feature space and improves recall by 11.2% on average. The ENAE makes further improvements, particularly in selecting informative sentences. To cover a wide range of topics and structures, we perform experiments on two publicly available email corpora specifically designed for text summarization. We use ROUGE as a fully automatic metric for text summarization and report the average ROUGE-2 recall for all experiments.
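A hedged PyTorch sketch of the ENAE idea: train several auto-encoders, each on a tf matrix perturbed with independent random noise, and average per-sentence scores over the ensemble. The hidden size, noise scale, epoch count, and the code-versus-mean-code scoring rule are illustrative assumptions, not the paper's exact architecture.

```python
# Ensemble of noisy auto-encoders over a (sentences x vocab) tf matrix.
import torch
import torch.nn as nn

class TfAutoEncoder(nn.Module):
    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def enae_scores(tf_matrix, n_runs=5, noise_std=0.01, epochs=100):
    """Average sentence scores over an ensemble of noisy AE runs."""
    scores = torch.zeros(tf_matrix.shape[0])
    for _ in range(n_runs):
        noisy = tf_matrix + noise_std * torch.randn_like(tf_matrix)
        model = TfAutoEncoder(tf_matrix.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):                      # denoising reconstruction
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(noisy), tf_matrix)
            loss.backward()
            opt.step()
        with torch.no_grad():
            codes = model.encoder(noisy)             # sentences in feature space
        # Score each sentence by similarity of its code to the mean code.
        doc_code = codes.mean(dim=0, keepdim=True)
        scores += nn.functional.cosine_similarity(codes, doc_code)
    return scores / n_runs                           # rank sentences by this
```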

16.
Traditional graph-model approaches to text summarization consider only statistical features or shallow semantic features and fail to mine and exploit deeper topic-level semantics. This paper proposes MDSR (multi-dimension summarization rank), an automatic summarization method that incorporates topic features into a multi-dimensional ranking. The LDA topic model first mines the topic-level semantics of the text, and a topic-importance measure is defined to quantify how topic features affect sentence importance. The construction of the probability transition matrix of the graph model is then improved by combining topic features, statistical features, and inter-sentence similarity. Finally, summaries are extracted and evaluated according to sentence-node weights. Experiments show that MDSR achieves its best ROUGE scores when the weights of topic features, statistical features, and inter-sentence similarity are in the ratio 3:4:3, reaching 53.35% ROUGE-1, 35.18% ROUGE-2, and 33.86% ROUGE-SU4, outperforming the comparison methods and demonstrating that incorporating topic features effectively improves the accuracy of summary extraction.
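The multi-dimensional transition matrix can be sketched as a weighted blend of per-sentence feature vectors and an inter-sentence similarity matrix, ranked by power iteration. The construction below (rank-one feature "jumps" plus the paper's 3:4:3 weights) is one illustrative reading, not the paper's exact matrix; inputs are assumed non-negative.

```python
# Blend topic features, statistical features, and similarity into one
# row-stochastic transition matrix, then rank sentences PageRank-style.
import numpy as np

def mdsr_rank(sim, topic_feat, stat_feat, w=(0.3, 0.4, 0.3), d=0.85, iters=50):
    n = sim.shape[0]
    # Broadcast the per-sentence feature vectors into rank-one jump matrices.
    m = w[0] * np.tile(topic_feat, (n, 1)) + w[1] * np.tile(stat_feat, (n, 1)) \
        + w[2] * sim
    m /= m.sum(axis=1, keepdims=True)       # normalize rows to probabilities
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * r @ m         # damped power-iteration update
    return np.argsort(r)[::-1]              # sentence indices by importance
```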

17.
Xiong Yu, Zhou Xiangmin, Zhang Yifei, Feng Shi, Wang Daling 《Multimedia Tools and Applications》2019,78(6):6409-6440

Effectively and efficiently summarizing social media is crucial and non-trivial for social media analysis. On social streams, events, the central concepts behind semantically similar social messages, often bring us firsthand stories of daily news. However, with traditional methods it is almost impossible to plough through millions of multi-modal messages one by one to identify the valuable news. Thus, it is urgent to summarize events with a few representative data samples on the streams. In this paper, we provide a vivid textual-visual media summarization approach for microblog streams, which exploits incremental latent semantic analysis (LSA) of detected events. First, with a novel weighting scheme for keyword relationships, we detect and track daily sub-events effectively on a keyword relation graph (WordGraph) of microblog streams. Then, to summarize the stream with representative texts and images, we use cross-modal fusion to analyze the semantics of microblog texts and images incrementally and separately, with a novel incremental cross-modal LSA algorithm. Experimental results on a real microblog dataset show that our method is at least 1.31% better and 23.67% faster than existing state-of-the-art methods, and that cross-modal fusion improves summarization performance by 4.16% on average.


18.
This paper proposes a new automatic speech summarization method. A set of words maximizing a summarization score is extracted from automatically transcribed speech, according to a target compression ratio, using a dynamic programming (DP) technique; the extracted words are then connected to build a summary sentence. The summarization score consists of a word significance measure, a confidence measure, linguistic likelihood, and a word concatenation probability. The word concatenation score is determined by the dependency structure of the original speech, given by a stochastic dependency context-free grammar (SDCFG). Japanese broadcast news transcribed by a large-vocabulary continuous speech recognition (LVCSR) system is summarized with the proposed method and compared with manual summarization by human subjects. The manual summarization results are combined into a word network, which is used to calculate the word accuracy of each automatic summary against the most similar word string in the network. Experimental results show that the proposed method effectively extracts relatively important information by removing redundant and irrelevant information.
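If the SDCFG concatenation term is dropped, the constrained extraction reduces to a knapsack-style DP over scored units under a length budget set by the compression ratio, as in this simplified sketch; the unit scores and lengths are assumed given.

```python
# 0/1 knapsack DP: pick units maximizing total significance score under a
# length budget; the paper's word-concatenation term is omitted here.
def dp_extract(scores, lengths, budget):
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]                        # skip unit i-1
            if lengths[i - 1] <= b:                            # or take it
                take = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                best[i][b] = max(best[i][b], take)
    # Backtrack to recover which units were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)
```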

19.
This paper proposes a multi-document summarization strategy based on LSA and pLSA. Each document is first split into paragraphs, which serve as the clustering units. A new feature-extraction method builds a term-paragraph matrix, and LSA performs singular value decomposition on it, turning the high-dimensional vector space model representation into a low-dimensional representation in the latent semantic space. pLSA then converts the data into a probabilistic statistical model for computation. During summary generation, a centroid-based sentence selection method chooses and outputs the summary sentences. Experiments show that the proposed method effectively improves the quality of the generated summaries.

20.
As the Internet grows, it becomes essential to find efficient tools for dealing with all the available information. The question answering (QA) and text summarization (TS) research fields focus on presenting the information requested by users in a more concise way. In this paper, the appropriateness and benefits of using summaries in semantic QA are analyzed. For this purpose, a combined approach is developed in which a TS component is integrated into a Web-based semantic QA system. The main goal is to determine to what extent TS can help semantic QA approaches when summaries are used instead of search-engine snippets as the corpus for answering questions. In particular, three issues are analyzed: (i) the appropriateness of query-focused (QF) rather than generic summarization for the QA task, (ii) the suitable summary length, comparing short and long summaries, and (iii) the benefits of using TS instead of snippets for finding answers, tested within two semantic QA approaches (named entities and semantic roles). The results show that QF summarization is better than generic (58% improvement), short summaries are better than long (6.3% improvement), and the use of TS within semantic QA improves performance for both named-entity-based (10%) and, especially, semantic-role-based QA (47.5%). © 2011 Wiley Periodicals, Inc.
