Found 20 similar documents (search time: 171 ms)
1.
The popularity of social media sites gives people new ways to share their experiences and convey their opinions, leading to explosive growth in user-generated content. Text data, owing to the expressiveness of natural language, is of great value for exploring many kinds of knowledge. However, much user-generated text is longer than a reader expects, which makes automatic document summarization necessary to facilitate knowledge digestion. In this paper, we focus on review-like, sentiment-oriented textual data. We propose the concept of Sentiment-preserving Document Summarization (SDS), which aims to summarize a long textual document into a shorter version while preserving its main sentiments and without sacrificing readability. To tackle this problem, we devise an end-to-end, weakly supervised extractive framework based on deep neural networks, consisting of a hierarchical document encoder, a sentence extractor, a sentiment classifier, and a discriminator that distinguishes the extracted summaries from natural short reviews. The framework is weakly supervised in that no ground-truth summaries are used for training, while sentiment labels are available to supervise the generated summary so that it preserves the sentiments of the original document. In particular, the sentence extractor is trained to generate summaries that (i) make the sentiment classifier predict the same sentiment category as for the original longer document, and (ii) fool the discriminator into recognizing them as human-written short reviews. Experimental results on two public datasets validate the effectiveness of our framework.
2.
Multi-objective clustering can serve as a competent technique for problems ranging from decision making to machine learning and pattern recognition. It aims to place similar objects into the same groups according to conflicting objectives, which supports the use of game theory to reach a resolution. Based on these observations, this paper proposes Enriched Game Theory K-means (EGTKMeans), a novel multi-objective clustering technique based on game theory. EGTKMeans is specifically designed to optimize two intrinsically conflicting objectives, namely compaction and equi-partitioning. The key contributions of the proposed approach are threefold. First, it formulates a novel payoff definition that considers both objectives with equal priority; this payoff function incorporates a desirable fairness into the final clustering results. Second, EGTKMeans exploits the advantages of mixed strategies as well as pure ones, given that a mixed Nash equilibrium exists in every game. Third, EGTKMeans approaches the optimal solution by optimizing both objectives simultaneously. The experimental results suggest that the proposed approach significantly outperforms rival methods on real-world and synthetic data sets with reasonable time complexity.
3.
Automatic document summarization aims to generate short summaries of originally long documents. A good summary should cover the most important information in the original document or cluster of documents while being coherent, non-redundant, and grammatically readable. Numerous approaches to automatic summarization have been developed to date. In this paper we give a self-contained, broad overview of progress in document summarization over the last 5 years. Specifically, we emphasize significant recent contributions that represent the state of the art, including modern sentence extraction approaches that improve concept coverage, information diversity, and content coherence, summarization frameworks that integrate sentence compression, and more abstractive systems that can produce completely new sentences. In addition, we review progress on document summarization in domains, genres, and applications that differ from traditional settings. We also point out some of the latest trends and highlight a few possible future directions.
4.
Neural Computing and Applications - Video summarization is the process of refining the original video into a more concise form without losing valuable information. Both efficient storage and...
5.
Applied Intelligence - Most previous abstractive summarization models generate the summary in a left-to-right manner without making the most use of target-side global information. Recently, many...
6.
This paper proposes a constraint-driven document summarization approach that emphasizes two requirements: (1) diversity, which seeks to reduce redundancy among the sentences in the summary, and (2) sufficient coverage, which focuses on avoiding the loss of the document’s main information when generating the summary. Tuning the constraint parameters of the models controls the content coverage and diversity of a summary. The models are formulated as a quadratic integer programming (QIP) problem, which is solved with a discrete particle swarm optimization (PSO) algorithm. The models are applied to the multi-document summarization task. Comparative results show that the proposed models outperform other methods on the DUC2005 and DUC2007 datasets.
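The trade-off between coverage and diversity can be illustrated with a small sketch. It is not the paper's model: the abstract gives neither the exact objective nor the constraints, so the scoring below (cosine coverage against the whole document minus a weighted pairwise redundancy term) and the names `objective`, `best_summary`, and `lam` are assumptions. Exhaustive search over sentence subsets stands in for the discrete PSO solver, which is only practical for tiny inputs.

```python
import math
from itertools import combinations


def bag(text):
    """Lower-cased bag-of-words counts for a whitespace-tokenised string."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def objective(selected, sent_vecs, doc_vec, lam):
    """Assumed objective: coverage (linear) minus lam * redundancy (quadratic)."""
    coverage = sum(cosine(sent_vecs[i], doc_vec) for i in selected)
    redundancy = sum(
        cosine(sent_vecs[i], sent_vecs[j]) for i, j in combinations(selected, 2)
    )
    return coverage - lam * redundancy


def best_summary(sentences, k, lam=0.5):
    """Brute-force the k-sentence subset with the highest objective value."""
    vecs = [bag(s) for s in sentences]
    doc_vec = bag(" ".join(sentences))
    return max(
        combinations(range(len(sentences)), k),
        key=lambda sel: objective(sel, vecs, doc_vec, lam),
    )
```

With two identical sentences and one different sentence, a high `lam` makes the search avoid picking the duplicates together, while `lam=0` picks them because only coverage counts.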
7.
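The development of information technology, represented by the Internet, has made obtaining information more convenient than ever, and has at the same time raised the challenge of how to use that information effectively. Automatic summarization, which automatically selects representative sentences from a document, can greatly improve the efficiency of information use. In recent years, automatic summarization for English and Chinese has received wide attention and made considerable progress, whereas research on automatic summarization for minority languages such as Uyghur remains insufficient. This paper builds an automatic summarization system for Uyghur. Documents are first preprocessed using linguistic knowledge of Uyghur; keywords are then extracted, and these keywords are used for extractive summarization. Two keyword extraction algorithms, one based on TF-IDF and one based on TextRank, are compared, showing that the keywords extracted by TextRank are better suited to automatic summarization. The study shows that, when the linguistic information of Uyghur is fully taken into account, keyword-based automatic summarization can achieve satisfactory results.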
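For readers unfamiliar with TextRank keyword extraction, a minimal sketch follows. It assumes the input is already tokenised and filtered (the Uyghur-specific preprocessing described in the abstract is not reproduced), and the function name `textrank_keywords` and its defaults for `window`, `damping`, and `iterations` are illustrative choices, not the system's settings.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbours[word].add(tokens[j])
                neighbours[tokens[j]].add(word)

    # Unnormalised PageRank: each word shares its score equally among neighbours.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[u] / len(neighbours[u]) for u in neighbours[word])
            for word in neighbours
        }

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

A word that co-occurs with many different words, such as a hub term repeated between other terms, ends up with the highest score; a TF-IDF ranking would instead depend on corpus-level document frequencies.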
8.
In this paper, we propose a new semi-supervised co-clustering algorithm, Orthogonal Semi-Supervised Nonnegative Matrix Factorization (OSS-NMF), for document clustering. In this approach, clustering is carried out by incorporating both prior domain knowledge of data points (documents), in the form of pair-wise constraints, and category knowledge of features (words) into the NMF co-clustering framework. Under this framework, the clustering problem is formulated as finding a local minimizer of the objective function, taking into account the dual prior knowledge. The update rules are derived, and an iterative algorithm is designed for the co-clustering process. We prove the correctness and convergence of our algorithm theoretically, demonstrating its mathematical rigor. Our experimental evaluations show that the proposed document clustering model achieves remarkable performance improvements with those constraints.
9.
Recent advances in technology have made tremendous amounts of multimedia information available to the general population. An efficient way of dealing with this development is to build browsing tools that distill multimedia data into information-oriented summaries. Such an approach will not only suit resource-poor environments such as wireless and mobile, but also enhance browsing on the wired side for applications like digital libraries and repositories. Automatic summarization and indexing techniques give users an opportunity to browse and select multimedia documents of their choice for complete viewing later. In this paper, we present a technique for automatically gathering the frames of interest in a video for the purpose of summarization. The proposed technique represents frame contents as multi-dimensional point data and clusters them using Delaunay triangulation. The resulting Delaunay clusters yield good-quality summaries with fewer frames and less redundancy than other schemes. In contrast to many other clustering techniques, the Delaunay clustering algorithm is fully automatic, with no user-specified parameters, and is well suited to batch processing. We demonstrate these and other desirable properties of the proposed algorithm by testing it on a collection of videos from the Open Video Project. We provide a meaningful comparison of the results of the proposed summarization technique with the Open Video storyboard and K-means clustering, and evaluate the results using metrics that measure the content representational value of the proposed technique.
10.
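The application of spectral clustering has begun to extend from image segmentation to text mining, with some success. Building on automatic determination of the number of clusters, this paper combines fuzzy theory with the spectral clustering algorithm and proposes a fuzzy clustering algorithm for multi-document clustering. The algorithm mainly describes how to realize a fuzzy spectral clustering method in which a single text can belong to several text classes at the same time. Simulation results show that the algorithm achieves good clustering performance.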
11.
Social media platforms have become paramount for gathering relevant information during natural disasters. Twitter has emerged as a platform that is heavily used for communication during disaster events. It is therefore necessary to design a technique that can summarize the relevant tweets and thus support the decision-making process of disaster management authorities. In this paper, the problem of summarizing relevant tweets is posed as an optimization problem in which a subset of tweets is selected using the search capability of multi-objective binary differential evolution (MOBDE) by optimizing different perspectives of the summary. MOBDE maintains a set of solutions in its population, and each solution encodes a subset of tweets. Three versions of the proposed approach, namely MOOTS1, MOOTS2, and MOOTS3, are developed in this paper. They differ in the way they work and in the adaptive selection of parameters. A recently developed genetic operator based on self-organizing maps is explored in the optimization process, as are two measures capturing the similarity/dissimilarity between tweets: word mover distance and BM25. The proposed approaches are evaluated on four datasets related to disaster events containing only relevant tweets. All versions of the developed MOBDE framework are observed to outperform the state-of-the-art (SOA) techniques. Our best approach (MOOTS3) improves on the existing techniques by 8.5% and 3.1% in terms of ROUGE-2 and ROUGE-L, respectively, and these improvements are further validated using a statistical significance t-test.
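BM25, one of the two tweet similarity measures named above, can be sketched as follows. This is the standard Okapi BM25 scoring function with the commonly used non-negative IDF variant; the abstract does not say which variant or parameters the authors used, so `k1=1.5`, `b=0.75`, whitespace tokenisation, and the function name `bm25_scores` are assumptions. The function expects a non-empty list of documents.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenised = [d.lower().split() for d in docs]
    n = len(tokenised)
    avg_len = sum(len(d) for d in tokenised) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenised for term in set(d))

    scores = []
    for d in tokenised:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF with +1 inside the log so it never becomes negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Treating one tweet as the query and the remaining tweets as the documents gives an asymmetric similarity score between tweets.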
12.
Document clustering techniques mostly rely on single-term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features, including phrases and their weights, are particularly important. Document clustering is useful in many applications, such as automatic categorization of documents, grouping search engine results, and building a taxonomy of documents. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it can revert to a compact representation of the vector space model if phrases are not indexed. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
13.
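Extractive automatic summarization must extract a certain number of important sentences from a document to produce a short text that covers the gist of the original. To address this problem, a single-document automatic summarization algorithm based on word-sentence co-ranking is proposed, which integrates word-sentence relationships into the graph-ranking-based computation of sentence weights. The framework for the word-sentence collaborative computation is given first; it is then converted into a concise matrix form, and its convergence is proved theoretically; finally, a redundancy-removal method further improves the quality of the summaries. Experiments on real datasets show that the word-sentence co-ranking algorithm improves ROUGE scores by 13% to 30% over the classic TextRank algorithm and effectively improves the quality of the generated summaries.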
14.
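An automatic summarization method guided by web page structure is proposed. When parsing the page source, the structural information of the document is used to build a DOM tree, on the basis of which the document is divided into topics. The value of web page markup for topic word extraction and sentence importance calculation is also fully exploited. Finally, taking topic blocks as units, sentence weights are adjusted according to the similarity between sentences, and the summary is generated dynamically. Experimental results show that the method effectively addresses the unbalanced distribution of document summaries and reduces redundancy in the summary content.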
15.
Graph models have been widely applied in document summarization, with sentences as graph nodes and the similarity between sentences as edges. In this paper, a novel graph model for document summarization is presented that uses not only the relevance of sentences but also the relevance of the phrases they contain. Specifically, we construct a phrase-sentence two-layer graph structure model (PSG) to summarize document(s). We use this model for generic document summarization and query-focused summarization. The experimental results show that our model greatly outperforms existing work.
17.
Most existing automatic text summarization algorithms target multiple documents of relatively short length, and are thus difficult to apply directly to novels, which are long and free in structure. In this paper, we propose a topic-modeling-based approach to extractive automatic summarization of novels, so as to achieve a good balance among compression ratio, summarization quality, and machine readability. First, based on topic modeling, we extract the candidate sentences associated with topic words from a preprocessed novel. Second, with compression ratio and topic diversity as goals, we design an importance evaluation function to select the most important sentences from the candidates and thus generate an initial summary of the novel. Finally, we smooth the initial summary to overcome the semantic confusion caused by ambiguous or synonymous words, so as to improve the readability of the summary. We evaluate the proposed approach experimentally on a real dataset of novels. The results show that, compared with other candidate algorithms, each automatic summary generated by our approach has not only a higher compression ratio but also better summarization quality.
18.
Fast and high-quality document clustering is a crucial task in organizing information and search engine results, enhancing web crawling, and information retrieval or filtering. Recent studies have shown that the most commonly used partition-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm can converge to a locally optimal solution. In this paper we propose a novel Harmony K-means Algorithm (HKA) for document clustering based on the Harmony Search (HS) optimization method. It is proved by means of finite Markov chain theory that HKA converges to the global optimum. To demonstrate the effectiveness and speed of HKA, we have applied it to some standard datasets. We also compare HKA with other meta-heuristic and model-based document clustering approaches. Experimental results reveal that HKA converges to the best known optimum faster than other methods and that the quality of its clusters is comparable.
19.
This paper introduces a novel pairwise-adaptive dissimilarity measure for large, high-dimensional document datasets that improves unsupervised clustering quality and speed compared to the original cosine dissimilarity measure. The measure dynamically selects a number of important features of the compared pair of document vectors. Two approaches for selecting the number of features when applying the measure are discussed. The proposed feature selection process makes this dissimilarity measure especially applicable to large, high-dimensional document collections. Its performance is validated on several test sets originating from standardized datasets. The dissimilarity measure is compared to the well-known cosine dissimilarity measure using the average F-measures of the hierarchical agglomerative clustering result. The new dissimilarity measure yields an improved clustering result in less computational time.
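One plausible reading of "dynamically selects a number of important features of the compared pair" is sketched below: keep the `k` highest-weighted terms of each vector, take their union, and compute cosine dissimilarity on that reduced feature set only. The abstract does not give the formula, so this construction, the function names, and the tie-breaking rule are assumptions rather than the paper's definition.

```python
import math


def top_features(vec, k):
    """The k highest-weighted terms of a sparse vector (ties broken by name)."""
    return set(sorted(vec, key=lambda t: (-vec[t], t))[:k])


def pairwise_adaptive_dissimilarity(a, b, k):
    """Cosine dissimilarity restricted to the union of each vector's top-k terms."""
    keep = top_features(a, k) | top_features(b, k)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in keep)
    norm_a = math.sqrt(sum(a.get(t, 0.0) ** 2 for t in keep))
    norm_b = math.sqrt(sum(b.get(t, 0.0) ** 2 for t in keep))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no retained weight on one side: treat as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)
```

Because low-weight terms are dropped per pair, two documents that agree on their dominant terms are judged closer than under plain cosine dissimilarity, and each comparison touches at most `2 * k` features.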
20.
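k-means is a commonly used text clustering algorithm. Its main drawback is that the final number of clusters k and the corresponding initial centers must be specified manually. To address this drawback, an initialization method based on reference regions is proposed that automatically generates the initial partition for k-means; during the generation of the reference regions, a method that finds the maximum slope (in absolute value) is designed to determine the threshold automatically. Theoretical analysis and experimental results show that the improved algorithm effectively increases the accuracy of text clustering with feasible efficiency.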
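The dependence on manually supplied inputs is visible in a plain implementation of standard k-means (Lloyd's algorithm), sketched here. This is the baseline algorithm only, not the reference-region initialization from the abstract; the function names and the `iterations` default are illustrative. The caller has to pass the initial centres, and their count fixes k.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def assign(points, centres):
    """Group points by their nearest centre; one list per centre."""
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: sq_dist(p, centres[i]))
        clusters[nearest].append(p)
    return clusters


def mean(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(dim) / len(members) for dim in zip(*members))


def kmeans(points, centres, iterations=20):
    """Lloyd's algorithm; k is implied by the number of initial centres."""
    centres = [tuple(c) for c in centres]
    for _ in range(iterations):
        clusters = assign(points, centres)
        # An empty cluster keeps its previous centre.
        updated = [mean(m) if m else c for c, m in zip(centres, clusters)]
        if updated == centres:
            break
        centres = updated
    return centres, assign(points, centres)
```

Different initial centres can lead to different final partitions, which is the sensitivity that automatic initialization schemes aim to remove.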