20 similar documents found (search time: 0 ms)
1.
With the exponential growth of online web pages, automatic summarization has attracted increasing attention. Extractive summarization rarely performs semantic analysis of the text, so the extracted sentences may drift from the topic. To address these shortcomings, and drawing on the characteristics of single-document summarization, this paper proposes an English automatic summarization method, TLETS (TF-ISF and LexRank based English Text Summarization). The method uses WordNet to map the feature words of the vector space model to concepts, computes each concept's TF-ISF value as its weight, and finally computes each sentence's LexRank score, extracting the highest-scoring sentences as the summary. Experimental results show that TLETS produces good summaries.
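The two stages the abstract above combines, TF-ISF term weighting followed by LexRank scoring over a sentence-similarity graph, can be sketched as follows. The toy sentences, the ISF smoothing, and the damping constant are illustrative assumptions, not details from the paper (which additionally maps words to WordNet concepts before weighting):

```python
import math

# Toy "sentences" as term lists (assumption; TLETS would first map words
# to WordNet concepts and weight the concepts instead).
sentences = [
    ["summarization", "extracts", "key", "sentences"],
    ["lexrank", "ranks", "sentences", "by", "graph", "centrality"],
    ["tf", "isf", "weights", "concept", "terms"],
]

def tf_isf(sents):
    """Weight each term by term frequency x inverse sentence frequency."""
    n = len(sents)
    weights = []
    for sent in sents:
        w = {}
        for term in sent:
            tf = sent.count(term) / len(sent)
            sf = sum(1 for s in sents if term in s)  # sentences containing term
            w[term] = tf * math.log(n / sf + 1.0)    # smoothed ISF (assumption)
        weights.append(w)
    return weights

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def lexrank(sents, damping=0.85, iters=50):
    """Power iteration over the row-normalized sentence-similarity matrix."""
    vecs = tf_isf(sents)
    n = len(vecs)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(sim[j][i] / (sum(sim[j]) or 1.0) * scores[j]
                                  for j in range(n))
                  for i in range(n)]
    return scores

scores = lexrank(sentences)  # the highest-scoring sentences form the summary
```

The scores stay a probability distribution over sentences, so the summary is simply the top-ranked few.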
2.
In multi-domain text generation, data typically differ across domains, and introducing a new domain also brings data scarcity. Traditional supervised methods require large amounts of labeled data in the target domain to train a deep neural text generation model, and the trained model generalizes poorly to new domains. To address the data-difference and data-scarcity problems of multi-domain settings, and inspired by transfer learning, this work designs a comprehensive transfer-based text generation method that reduces the differences between text data from different domains and exploits the semantic relatedness between text in existing and new domains to help the deep neural text generation model generalize to the new domain. Experiments on public datasets verify the effectiveness of the proposed method for domain transfer in multi-domain settings: the model performs well when generating text in a new domain and improves on existing transfer-based text generation methods across all text generation evaluation metrics.
3.
4.
Extractive summarization aims to automatically produce a short summary of a document by concatenating several sentences taken exactly from the original material. Due to their simplicity and ease of use, extractive summarization methods have become the dominant paradigm in the realm of text summarization. In this paper, we address the sentence scoring technique, a key step of extractive summarization. Specifically, we propose a novel word-sentence co-ranking model named CoRank, which combines the word-sentence relationship with a graph-based unsupervised ranking model. CoRank is quite concise in terms of matrix operations, and its convergence can be theoretically guaranteed. Moreover, a redundancy elimination technique is presented as a supplement to CoRank, so that the quality of automatic summarization can be further enhanced. As a result, CoRank can serve as an important building block of intelligent summarization systems. Experimental results on two real-life datasets comprising nearly 600 documents demonstrate the effectiveness of the proposed methods.
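The word-sentence mutual reinforcement behind CoRank can be illustrated with a small HITS-style iteration: a sentence is important if it contains important words, and a word is important if it appears in important sentences. The toy data and the specific update below are illustrative assumptions, not the paper's exact model:

```python
import math

# Hypothetical toy corpus with a binary word-sentence incidence matrix.
sentences = [["data", "mining"],
             ["text", "mining", "summarization"],
             ["graph", "ranking"]]
words = sorted({w for s in sentences for w in s})

def corank(sents, vocab, iters=30):
    """Alternate word and sentence score updates, normalizing each round."""
    W = [[1.0 if w in s else 0.0 for w in vocab] for s in sents]  # |S| x |V|
    s_score = [1.0] * len(sents)
    w_score = [1.0] * len(vocab)
    for _ in range(iters):
        s_new = [sum(W[i][j] * w_score[j] for j in range(len(vocab)))
                 for i in range(len(sents))]
        w_new = [sum(W[i][j] * s_score[i] for i in range(len(sents)))
                 for j in range(len(vocab))]
        s_norm = math.sqrt(sum(v * v for v in s_new)) or 1.0
        w_norm = math.sqrt(sum(v * v for v in w_new)) or 1.0
        s_score = [v / s_norm for v in s_new]
        w_score = [v / w_norm for v in w_new]
    return s_score, w_score

s_scores, w_scores = corank(sentences, words)
```

On this toy data the second sentence wins: it shares "mining" with the first sentence and contributes the most reinforced words.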
5.
Existing Chinese automatic summarization methods mainly use information from the text itself; their drawback is that they cannot fully exploit information such as the semantic relatedness between words. This paper therefore proposes an improved Chinese text summarization method that injects external corpus information, in the form of word vectors, into the TextRank algorithm. By combining TextRank with word2vec, every word in a sentence is mapped into a high-dimensional vector space and composed into sentence vectors. The method accounts for inter-sentence similarity, keyword coverage, and each sentence's similarity to the title when computing the influence weights between sentences, then selects and reorders the top-ranked sentences as the summary. Experimental results show that the method performs well on our dataset and extracts Chinese summaries better than the original method.
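The sentence-vector construction this abstract relies on can be sketched as averaging word vectors and ranking sentences by total similarity to the others. The hand-made 3-d vectors below stand in for trained word2vec embeddings, and degree centrality stands in for the full TextRank iteration; both are simplifying assumptions:

```python
import math

# Toy stand-in for word2vec: a real system would load vectors trained on
# an external corpus (these 3-d values are invented for illustration).
word_vecs = {
    "summary":  [0.9, 0.1, 0.0],
    "abstract": [0.8, 0.2, 0.1],
    "ranking":  [0.1, 0.9, 0.0],
    "graph":    [0.0, 0.8, 0.3],
    "title":    [0.5, 0.5, 0.1],
}

def sentence_vec(tokens):
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [word_vecs[w] for w in tokens if w in word_vecs]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(3)]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

sents = [["summary", "abstract"], ["ranking", "graph"], ["summary", "title"]]
vecs = [sentence_vec(s) for s in sents]
# Degree-centrality shortcut for TextRank: score = total similarity to others.
scores = [sum(cosine(vecs[i], vecs[j]) for j in range(len(vecs)) if j != i)
          for i in range(len(vecs))]
```

The third sentence scores highest here because it sits semantically between the other two.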
6.
Traditional graph-model approaches to text summarization consider only statistical features or shallow semantic features, failing to mine and exploit deeper topic-level semantics. To address this, we propose MDSR (multi-dimension summarization rank), an automatic summarization method that measures sentences along multiple dimensions after incorporating topic features. First, the LDA topic model mines the topic-level semantics of the text, and a topic importance measure is defined to quantify how much topic features contribute to sentence importance. Then topic features, statistical features, and inter-sentence similarity are combined to improve the construction of the probability transition matrix of the graph model. Finally, summaries are extracted and scored according to sentence node weights. Experiments show that when the weights of topic features, statistical features, and inter-sentence similarity are in the ratio 3:4:3, MDSR achieves its best ROUGE scores: ROUGE-1, ROUGE-2, and ROUGE-SU4 reach 53.35%, 35.18%, and 33.86%, respectively, outperforming the compared methods and demonstrating that incorporating topic features improves the accuracy of summary extraction.
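The blended transition matrix at the heart of MDSR can be sketched as a weighted sum of the three feature matrices followed by row normalization; the toy matrices below are hypothetical, and only the 3:4:3 ratio comes from the abstract:

```python
def transition_matrix(topic, stat, sim, w=(0.3, 0.4, 0.3)):
    """Blend topic-feature, statistical-feature and similarity matrices
    (default weights follow the abstract's best 3:4:3 ratio), then
    row-normalize so each row is a probability distribution."""
    n = len(topic)
    M = [[w[0] * topic[i][j] + w[1] * stat[i][j] + w[2] * sim[i][j]
          for j in range(n)] for i in range(n)]
    return [[v / (sum(row) or 1.0) for v in row] for row in M]

# Toy 3x3 feature matrices with invented values (diagonal left at zero).
topic = [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
stat  = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
sim   = [[0, 0.5, 0.2], [0.5, 0, 0.8], [0.2, 0.8, 0]]
M = transition_matrix(topic, stat, sim)
```

A PageRank-style iteration over `M` would then yield the sentence node weights used for extraction.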
7.
To address the underuse of semantic information and the limited accuracy of current abstractive text summarization methods, this paper proposes a dual-encoder summarization method. First, the dual encoder supplies the sequence-to-sequence (Seq2Seq) architecture with richer semantic information, and the attention mechanism fused with the two semantic channels and a decoder with an accompanying empirical distribution are optimized. Then, position embeddings are fused with word embeddings in the embedding layer, term frequency-inverse document frequency (TF-IDF), part of speech (POS), and a keyword score (Soc) are added as features, and the word embedding dimensions are optimized. The proposed method improves the traditional Seq2Seq architecture and word feature representation, strengthening the model's semantic understanding while raising summary quality. Experimental results show that the method improves ROUGE scores by 10-13 percentage points over a traditional recurrent neural network with attention (RNN+atten) and a multi-layer bidirectional recurrent neural network with attention (Bi-MulRNN+atten), yields semantically more accurate summaries, and has better application prospects.
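One way the extra word features described above might enter the input representation is by concatenation with the word and position embeddings; the dimensions, values, and one-hot POS encoding below are invented for illustration, not taken from the paper:

```python
def augment(word_vec, pos_emb, tf_idf, pos_tag_onehot, soc):
    """Concatenate the word embedding, position embedding, and the scalar
    TF-IDF, one-hot POS, and keyword-score (Soc) features into one vector
    (a sketch of feature fusion, not the paper's exact layout)."""
    return word_vec + pos_emb + [tf_idf] + pos_tag_onehot + [soc]

# Toy dimensions: 2-d word vector, 2-d position embedding, 3-tag POS one-hot.
v = augment([0.1, 0.2], [0.0, 1.0], 0.37, [0, 1, 0], 0.8)
```

The encoder then consumes these augmented vectors instead of plain word embeddings.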
8.
V. A. Yatsko T. N. Vishnyakov 《Automatic Documentation and Mathematical Linguistics》2007,41(5):185-193
The paper classifies automatic text summarization systems and considers in detail the architecture and algorithms of surface-level systems. It formulates a universal criterion for the selection of linguistic units during information search, which is an interpretation of Zipf's law. Problems and prospects in the development of this domain are considered.
9.
V. A. Yatsko T. N. Vishnyakov 《Automatic Documentation and Mathematical Linguistics》2007,41(3):93-103
Four modern systems of automatic text summarization are tested on the basis of a model vocabulary composed by subjects. Distribution of terms of the vocabulary in the source text is compared with their distribution in summaries of different length generated by the systems. Principles for evaluation of the efficiency of the current systems of automatic text summarization are described.
10.
Language Resources and Evaluation - Due to the exponential growth in the number of documents on the Web, accessing the salient information relevant to a user need is gaining importance, which...
11.
DU Jia-li YU Ping-fang ZHAO Hong-yan XU Jing 《通讯和计算机》2008,5(9):54-60
Based on an English literary corpus, this paper devises a verifiable formula for calculating SAS, which comprises the average word length in a sentence (L), the number of multi-syllable words per 100 words (H), the number of sentences extracted from the text (S1), the total number of sentences in the text (S), the number of words extracted from the text (W1), and the total number of words in the text (W). The formula bears much relationship to the intersection between these quantities and supports the conclusion that different sampling ratios do not produce significant deviation in SAS, which in turn provides strong evidence of the controllability of SAS. Despite domain limitation, domain simplicity, and the relativity of online evaluation, the formula helps literary critics with access to an English literary corpus to analyze texts correctly and effectively by extracting some pages or passages from the corpus even when no whole-text extraction is involved.
12.
13.
Ramiz M. Aliguliyev 《Expert systems with applications》2009,36(4):7764-7772
The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization is to produce a summary delivering the majority of information content from a set of documents about an explicit or implicit main topic. In our study we focus on sentence-based extractive document summarization. We propose a generic document summarization method based on sentence clustering. The proposed approach continues the sentence-clustering-based extractive summarization methods proposed in Alguliev [Alguliev, R. M., Aliguliyev, R. M., Bagirov, A. M. (2005). Global optimization in the summarization of text documents. Automatic Control and Computer Sciences 39, 42–47], Aliguliyev [Aliguliyev, R. M. (2006). A novel partitioning-based clustering method and generic document summarization. In Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology (WI–IAT 2006 Workshops) (WI–IATW'06), 18–22 December (pp. 626–629) Hong Kong, China], Alguliev and Alyguliev [Alguliev, R. M., Alyguliev, R. M. (2007). Summarization of text-based documents with a determination of latent topical sections and information-rich sentences. Automatic Control and Computer Sciences 41, 132–140], and Aliguliyev [Aliguliyev, R. M. (2007). Automatic document summarization by sentence extraction. Journal of Computational Technologies 12, 5–15]. The purpose of the present paper is to show that the summarization result depends not only on the optimized function but also on the similarity measure.
The experimental results on the open benchmark datasets DUC01 and DUC02 show that our proposed approach can improve performance compared to state-of-the-art summarization approaches.
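Sentence-clustering-based extraction of the kind this abstract describes can be sketched as k-means over sentence vectors, taking the sentence nearest each centroid as a summary representative. The 2-d points below are stand-ins for real sentence vectors, and plain Euclidean k-means is only one of the similarity measures the paper compares:

```python
def kmeans(points, k=2, iters=20):
    """Naive k-means with the first k points as initial centroids."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                              + (p[1] - centroids[c][1]) ** 2)
            clusters[idx].append(p)
        centroids = [(sum(p[0] for p in cl) / len(cl),
                      sum(p[1] for p in cl) / len(cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters, centroids

def representatives(clusters, centroids):
    """Summary = the sentence vector nearest each cluster centroid."""
    return [min(cl, key=lambda p: (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2)
            for cl, c in zip(clusters, centroids) if cl]

# Toy 2-d "sentence vectors" (hypothetical; real ones would be term-based).
points = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9)]
clusters, centroids = kmeans(points)
reps = representatives(clusters, centroids)
```

Picking one representative per cluster is what keeps the extracted summary both covering and non-redundant.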
14.
This work proposes an approach to improving content selection in automatic text summarization using statistical tools. The approach is a trainable summarizer that takes into account several features to generate summaries: sentence position, positive keywords, negative keywords, sentence centrality, sentence resemblance to the title, inclusion of named entities, inclusion of numerical data, relative sentence length, the Bushy path of the sentence, and aggregated similarity for each sentence. First, we investigate the effect of each sentence feature on the summarization task. Then we use all features in combination to train genetic algorithm (GA) and mathematical regression (MR) models to obtain a suitable combination of feature weights. Moreover, we use all feature parameters to train a feed-forward neural network (FFNN), a probabilistic neural network (PNN), and a Gaussian mixture model (GMM) in order to construct a text summarizer for each model. Furthermore, we use models trained on one language to test summarization performance in the other language. The performance of the proposed approach is measured at several compression rates on a data corpus composed of 100 Arabic political articles and 100 English religious articles. The results of the proposed approach are promising, especially for the GMM approach.
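The scoring step of such a trainable summarizer reduces to a weighted linear combination of sentence features; the feature names, values, and weights below are hypothetical (in the paper, GA and MR search for the weights rather than fixing them by hand):

```python
# Hypothetical per-sentence feature values in [0, 1] (a real system would
# compute position, title resemblance, keyword scores, etc. from the text).
features = {
    "s1": {"position": 1.0, "title_sim": 0.6, "keywords": 0.4},
    "s2": {"position": 0.5, "title_sim": 0.9, "keywords": 0.7},
    "s3": {"position": 0.2, "title_sim": 0.1, "keywords": 0.3},
}
# Assumed weights; GA or MR training would supply these.
weights = {"position": 0.3, "title_sim": 0.4, "keywords": 0.3}

def score(feats, w):
    """Linear combination of feature values and learned weights."""
    return sum(w[f] * v for f, v in feats.items())

ranked = sorted(features, key=lambda s: score(features[s], weights),
                reverse=True)
```

The summary is then built from the top of `ranked` until the target compression rate is reached.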
15.
This paper proposes and implements an improved automatic summarization system based on lexical chain analysis. The system uses syntactic analysis to compute the context window of candidate words, improving the quality of lexical chains built with a greedy strategy, and it mitigates the relatively fixed summary length produced by traditional lexical chain analysis. Experiments show that, compared with an automatic summarization system that builds lexical chains with the original greedy strategy, the system produces summaries of noticeably higher quality.
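Greedy lexical chaining of the kind this system improves upon can be sketched as attaching each word to the first chain whose last member is related and falls within the context window, otherwise starting a new chain. The relatedness set and window size below are toy assumptions (the paper derives the window from syntactic analysis, and a real system would use WordNet-style relations):

```python
def build_chains(words, related, window=3):
    """Greedy lexical chaining: append a word to an existing chain if a
    related (or identical) word occurred within `window` positions,
    else start a new chain."""
    chains = []
    for i, w in enumerate(words):
        placed = False
        for chain in chains:
            last_i, last_w = chain[-1]
            if i - last_i <= window and (
                w == last_w or (w, last_w) in related or (last_w, w) in related
            ):
                chain.append((i, w))
                placed = True
                break
        if not placed:
            chains.append([(i, w)])
    return chains

# Toy word sequence and an assumed symmetric relatedness set.
words = ["car", "engine", "road", "car", "sky"]
related = {("car", "engine"), ("car", "road")}
chains = build_chains(words, related)
```

Strong chains (long, dense ones) then indicate the sentences worth extracting.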
16.
The challenges of automatic summarization
Summarization, the art of abstracting key content from one or more information sources, has become an integral part of everyday life. Researchers are investigating summarization tools and methods that automatically extract or abstract content from a range of information sources, including multimedia. Researchers are looking at approaches which roughly fall into two categories. Knowledge-poor approaches rely on not having to add new rules for each new application domain or language. Knowledge-rich approaches assume that if you grasp the meaning of the text, you can reduce it more effectively, thus yielding a better summary. Some approaches use a hybrid. In both methods, the main constraint is the compression requirement. High reduction rates pose a challenge because they are hard to attain without a reasonable amount of background knowledge. Another challenge is how to evaluate summarizers. If you are to trust that the summary is indeed a reliable substitute for the source, you must be confident that it does in fact reflect what is relevant in that source. Hence, methods for creating and evaluating summaries must complement each other.
17.
In recent years, Twitter has become one of the most important microblogging services of the Web 2.0. Among its possible uses, it can be employed for communicating and broadcasting information in real time. The goal of this research is to analyze the task of automatic tweet generation from a text summarization perspective in the context of the journalism genre. To achieve this, different state-of-the-art summarizers are selected and employed for producing multi-lingual tweets in two languages (English and Spanish). A wide experimental framework is proposed, comprising the creation of a new corpus, the generation of the automatic tweets, and their assessment through a quantitative and a qualitative evaluation, where informativeness, indicativeness, and interest are key criteria that should be ensured in the proposed context. From the results obtained, it was observed that although the original tweets were considered model tweets with respect to their informativeness, they were not among the most interesting ones from a human viewpoint. Therefore, relying only on these tweets may not be the ideal way to communicate news through Twitter, especially if news is to be reported in a more personalized and catchy way. In contrast, we showed that recent text summarization techniques may be more appropriate, reflecting a balance between indicativeness and interest, even if their content differed from the tweets delivered by the news providers.
18.
A text is a word together with an (additional) linear ordering. Each text has a generic tree representation, called its shape. Texts are considered in a logical and in an algebraic framework. It is proved that, for texts of bounded primitivity, the class of monadic second-order definable text languages coincides with both the class of recognizable text languages and the class of text languages generated by right-linear text grammars. In particular, it is demonstrated that the construction of the shape of a text can be formalized in terms of our monadic second-order logic. This approach can be extended to the case of graphs.
This research was supported by the EBRA Working Group ASMICS 2.
19.
V. A. Yatsko M. S. Starikov A. V. Butakov 《Automatic Documentation and Mathematical Linguistics》2010,44(3):111-120
This paper describes an experimental method for automatic text genre recognition based on 45 statistical, lexical, syntactic, positional, and discursive parameters. The suggested method includes: (1) the development of software permitting heterogeneous parameters to be normalized and clustered using the k-means algorithm; (2) the verification of parameters; (3) the selection of the parameters that are the most significant for scientific, newspaper, and artistic texts using two-factor analysis algorithms. Adaptive summarization algorithms have been developed based on these parameters.
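The normalization step that makes the 45 heterogeneous parameters clusterable can be sketched as per-column min-max scaling before k-means; the three parameter columns and their values below are invented for illustration:

```python
def normalize(rows):
    """Min-max normalize each parameter column to [0, 1] so heterogeneous
    scales (counts, ratios, lengths) become comparable before clustering."""
    cols = list(zip(*rows))
    ranges = [(min(c), max(c)) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, ranges)] for row in rows]

# Hypothetical parameter vectors for three texts: (average sentence length,
# type-token ratio, discourse-marker count).
texts = [[22.0, 0.45, 3.0], [8.0, 0.70, 12.0], [15.0, 0.55, 6.0]]
norm = normalize(texts)
```

After scaling, Euclidean distances in k-means are no longer dominated by the largest-valued parameter.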