首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In the paper, the most state-of-the-art methods of automatic text summarization, which build summaries in the form of generic extracts, are considered. The original text is represented in the form of a numerical matrix. Matrix columns correspond to text sentences, and each sentence is represented in the form of a vector in the term space. Further, latent semantic analysis is applied to the matrix obtained to construct sentences representation in the topic space. The dimensionality of the topic space is much less than the dimensionality of the initial term space. The choice of the most important sentences is carried out on the basis of sentences representation in the topic space. The number of important sentences is defined by the length of the demanded summary. This paper also presents a new generic text summarization method that uses nonnegative matrix factorization to estimate sentence relevance. Proposed sentence relevance estimation is based on normalization of topic space and further weighting of each topic using sentences representation in topic space. The proposed method shows better summarization quality and performance than state-of-the-art methods on the DUC 2001 and DUC 2002 standard data sets.  相似文献   

2.
In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in the application of genetic algorithms (GAs) for document clustering is thousands or even tens of thousands of dimensions in feature space which is typical for textual data. Because the most straightforward and popular approach represents texts with the vector space model (VSM), that is, each unique term in the vocabulary represents one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval which attempts to explore the latent semantics implied by a query or a document through representing them in a dimension-reduced space. Meanwhile, LSI takes into account the effects of synonymy and polysemy, which constructs a semantic structure in textual data. GA belongs to search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable string length genetic algorithm which has been exploited for automatically evolving the proper number of clusters as well as providing near optimal data set clustering. GA can be used in conjunction with the reduced latent semantic structure and improve clustering efficiency and accuracy. The superiority of GAL approach over conventional GA applied in VSM model is demonstrated by providing good Reuter document clustering results.  相似文献   

3.
Text categorization has gained increasing popularity in the last years due the explosive growth of multimedia documents. As a document can be associated with multiple non-exclusive categories simultaneously (e.g., Virus, Health, Sports, and Olympic Games), text categorization provides many opportunities for developing novel multi-label learning approaches devoted specifically to textual data. In this paper, we propose an ensemble multi-label classification method for text categorization based on four key ideas: (1) performing Latent Semantic Indexing based on distinct orthogonal projections on lower-dimensional spaces of concepts; (2) random splitting of the vocabulary; (3) document bootstrapping; and (4) the use of BoosTexter as a powerful multi-label base learner for text categorization to simultaneously encourage diversity and individual accuracy in the committee. Diversity of the ensemble is promoted through random splits of the vocabulary that leads to different orthogonal projections on lower-dimensional latent concept spaces. Accuracy of the committee members is promoted through the underlying latent semantic structure uncovered in the text. The combination of both rotation-based ensemble construction and Latent Semantic Indexing projection is shown to bring about significant improvements in terms of Average Precision, Coverage, Ranking loss and One error compared to five state-of-the-art approaches across 14 real-word textual data sets covering a wide variety of topics including health, education, business, science and arts.  相似文献   

4.
自动文摘技术应尽可能获取准确的相似度以确定句子或段落的权重,但目前常用的基于向量空间模型的计算方法却忽视句子、段落、文本中词的顺序.提出了一种新的基于相邻词序组的相似度度量方法并应用于文本的自动摘要,采用基于聚类的方法实现了词序组的向量表示并以此刻画句子、段落、文本,通过线性插值将基于不同长度词序组的相似度结果予以综合.同时,提出了新的基于含词序组重要性累计度的句子或段落的权重指标.实验证明利用词序信息可有效提高自动文摘质量.  相似文献   

5.
基于隐含语义的个性化信息检索   总被引:3,自引:0,他引:3  
随着互连网上信息资源的极度膨胀,出现了各种各样的信息搜集工具给用户提供信息服务,但是目前的信息搜集系统在给用户提供信息服务时,很难根据用户的个人信息实现个性化的信息服务,不同的用户相同的查询请求,返回的查询结果是相同的,这给用户的使用带来了很大的不便,而隐含语义检索(LSI)可以利用关键词之间的语义信息完成信息的搜索。提出了利用LSI来进行个性化信息检索算法的几种实现方法,并通过实验验证了算法的有效性。  相似文献   

6.
传统的文本分类都是根据文本的外在特征进行的,最常见的就是基于向量空间模型的方法,使用空间向量表示文本,通过相似度比较来确定分类。为了克服向量空间模型中的词条独立性假设,文章提出了一种基于潜在语义索引的文本分类模型,通过对大量的文本集进行统计分析,揭示了词语的上下文使用含义,通过奇异值分解有效地降低了向量空间的维数,消除了同义词、多义词的影响,从而提高了文本分类的精度。  相似文献   

7.
基于潜在语义索引的文本特征词权重计算方法   总被引:1,自引:0,他引:1  
李媛媛  马永强 《计算机应用》2008,28(6):1460-1462
潜在语义索引具有可计算性强,需要人参与少等优点。对其中重要的优化过程--权重计算,进行了深入分析。针对目前应用最广泛的TF-IDF方法中,采用线性处理的不合理性以及难以突出对文本内容起关键性作用的特征的缺点,提出了一种基于"Sigmiod函数"和"位置因子"的新权重方案。突出了文本中不同特征词的重要程度,更有利于潜在语义空间的构造。通过实验平台"中文潜在语义索引分析系统"的测试结果表明,该权重方法更利于基于潜在语义的检索性能的提高。  相似文献   

8.
针对自然语言处理(NLP)生成式自动摘要领域的语义理解不充分、摘要语句不通顺和摘要准确度不够高的问题,提出了一种新的生成式自动摘要解决方案,包括一种改进的词向量生成技术和一个生成式自动摘要模型。改进的词向量生成技术以Skip-Gram方法生成的词向量为基础,结合摘要的特点,引入词性、词频和逆文本频率三个词特征,有效地提高了词语的理解;而提出的Bi-MulRnn+生成式自动摘要模型以序列映射(seq2seq)与自编码器结构为基础,引入注意力机制、门控循环单元(GRU)结构、双向循环神经网络(BiRnn)、多层循环神经网络(MultiRnn)和集束搜索,提高了生成式摘要准确性与语句流畅度。基于大规模中文短文本摘要(LCSTS)数据集的实验结果表明,该方案能够有效地解决短文本生成式摘要问题,并在Rouge标准评价体系中表现良好,提高了摘要准确性与语句流畅度。  相似文献   

9.
卢玲  杨武  曹琼 《计算机应用》2016,36(2):432-436
传统自动文摘一般对字数没有明确限制,运用传统技术进行短文摘提取时,受字数限制,难以获取均衡的性能。针对该问题,提出一种多重映射的自动短文摘方法。通过计算关联度映射值、长度映射值、标题映射值和位置映射值,分别形成多个候选文摘句子集;再运用多重映射策略,将多个候选子集映射到文摘句子集中,同时使用提取文本中心句的方法提高召回率。实验表明,多重映射可在短文摘提取上获得稳定的性能。在NLP&CC2015评测中,该方法的ROUGE-1测试F值达到0.49,ROUGE-2测试F值达到0.35,均优于评测的平均水平,表明了该方法的有效性。  相似文献   

10.
讨论了现有的自动文摘评价方法,并具体分析了内部评价方法的缺陷,由此提出了基于文本相似度的自动文摘评价方法.同时,通过基于VSM(支持向量机)相似度和基于语义相似度两种相似度方法来比较评价方法的性能.实验表明,基于相似度的方法实现简单、效果良好,是一种更接近自然模型的评价方法.  相似文献   

11.

Text summarization presents several challenges such as considering semantic relationships among words, dealing with redundancy and information diversity issues. Seeking to overcome these problems, we propose in this paper a new graph-based Arabic summarization system that combines statistical and semantic analysis. The proposed approach utilizes ontology hierarchical structure and relations to provide a more accurate similarity measurement between terms in order to improve the quality of the summary. The proposed method is based on a two-dimensional graph model that makes uses statistical and semantic similarities. The statistical similarity is based on the content overlap between two sentences, while the semantic similarity is computed using the semantic information extracted from a lexical database whose use enables our system to apply reasoning by measuring semantic distance between real human concepts. The weighted ranking algorithm PageRank is performed on the graph to produce significant score for all document sentences. The score of each sentence is performed by adding other statistical features. In addition, we address redundancy and information diversity issues by using an adapted version of Maximal Marginal Relevance method. Experimental results on EASC and our own datasets showed the effectiveness of our proposed approach over existing summarization systems.

  相似文献   

12.
隐含语义检索技术是一种基于概念的检索方法,本文介绍了隐含语义检索的原理,并考虑到不同词条对文档内容描述重要程度不同,通过提高特征词、关键词的权重改进了隐含语义检索系统。工作中对检索系统中不同重要程度的词条采用了不同的权重算法计算权重,并以化学学科信息门户中的西文期刊简介页作为测试文档进行了检索测试,分析了权重算法改进前后检索测试的数据,结果表明,改进后的隐含语义检索系统的检索效果有了较大的提高。  相似文献   

13.
Many national and international governments establish organizations for applied science research funding. For this, several organizations have defined procedures for identifying relevant projects that based on prioritized technologies. Even for applied science research projects, which combine several technologies it is difficult to identify all corresponding technologies of all research-funding organizations. In this paper, we present an approach to support researchers and to support research-funding planners by classifying applied science research projects according to corresponding technologies of research-funding organizations. In contrast to related work, this problem is solved by considering results from literature concerning the application based technological relationships and by creating a new approach that is based on latent semantic indexing (LSI) as semantic text classification algorithm. Technologies that occur together in the process of creating an application are grouped in classes, semantic textual patterns are identified as representative for each class, and projects are assigned to one of these classes. This enables the assignment of each project to all technologies semantically grouped by use of LSI. This approach is evaluated using the example of defense and security based technological research. This is because the growing importance of this application field leads to an increasing number of research projects and to the appearance of many new technologies.  相似文献   

14.
This publication shows how the gap between the HTML based internet and the RDF based vision of the semantic web might be bridged, by linking words in texts to concepts of ontologies. Most current search engines use indexes that are built at the syntactical level and return hits based on simple string comparisons. However, the indexes do not contain synonyms, cannot differentiate between homonyms (‘mouse’ as a pointing vs. ‘mouse’ as an animal) and users receive different search results when they use different conjugation forms of the same word. In this publication, we present a system that uses ontologies and Natural Language Processing techniques to index texts, and thus supports word sense disambiguation and the retrieval of texts that contain equivalent words, by indexing them to concepts of ontologies.

For this purpose, we developed fully automated methods for mapping equivalent concepts of imported RDF ontologies (for this prototype WordNet, SUMO and OpenCyc). These methods will thus allow the seamless integration of domain specific ontologies for concept based information retrieval in different domains.

To demonstrate the practical workability of this approach, a set of web pages that contain synonyms and homonyms were indexed and can be queried via a search engine like query frontend. However, the ontology based indexing approach can also be used for other data mining applications such text clustering, relation mining and for searching free text fields in biological databases. The ontology alignment methods and some of the text mining principles described in this publication are now incorporated into the ONDEX system http://ondex.sourceforge.net/.  相似文献   


15.
In this paper, the problem of automatic document classification by a set of given topics is considered. The method proposed is based on the use of the latent semantic analysis to retrieve semantic dependencies between words. The classification of document is based on these dependencies. The results of experiments performed on the basis of the standard test data set TREC (Text REtrieval Conference) confirm the attractiveness of this approach. The relatively low computational complexity of this method at the classification stage makes it possible to be applied to the classification of document streams.  相似文献   

16.
Large-scale information retrieval with latent semantic indexing   总被引:9,自引:0,他引:9  
As the amount of electronic information increases, traditional lexical (or Boolean) information retrieval techniques will become less useful. Large, heterogeneous collections will be difficult to search since the sheer volume of unranked documents returned in response to a query will overwhelm the user. Vector-space approaches to information retrieval, on the other hand, allow the user to search for concepts rather than specific words, and rank the results of the search according to their relative similarity to the query. One vector-space approach, Latent Semantic Indexing (LSI), has achieved up to 30% better retrieval performance than lexical searching techniques by employing a reduced-rank model of the term-document space. However, the original implementation of LSI lacked the execution efficiency required to make LSI useful for large data sets. A new implementation of LSI, LSI++, seeks to make LSI efficient, extensible, portable, and maintainable. The LSI++ Application Programming Interface (API) allows applications to immediately use LSI without knowing the implementation details of the underlying system. LSI++ supports both serial and distributed searching of large data sets, providing the same programming interface regardless of the implementation actually executing. In addition, a World Wide Web interface was created to allow simple, intuitive searching of document collections using LSI++. Timing results indicate that the serial implementation of LSI++ searches up to six times faster than the original implementation of LSI, while the parallel implementation searches nearly 180 times faster on large document collections.  相似文献   

17.
Sprinkling方法是一种集成了训练样本类别信息的监督潜在语义模型。但是该方法特征权重采用词频,降低了文本分类效果,同时该模型并没有考虑不同样本对分类的贡献能力,而是认为样本对分类的贡献相同,另外,该模型采用多个特征映射一个类别来加强类别知识对分类的贡献。为此,文章在Sprinkling方法的基础上提出了一种新的监督潜在语义模型。实验结果表明,该文方法的总体性能优于原始的Sprinkling方法,在特征数为1 100时,获得了最高分类精度,提高幅度达到1.71%。  相似文献   

18.
Multilevel security (MLS) is specifically created to protect information from unauthorized access. In MLS, documents are assigned to a security label by a trusted subject e.g. an authorized user and based on this assignment; the access to documents is allowed or denied. Using a large number of security labels lead to a complex administration in MLS based operating systems. This is because the manual assignment of documents to a large number of security labels by an authorized user is time-consuming and error-prone. Thus in practice, most MLS based operating systems use a small number of security labels. However, information that is normally processed in an organization consists of different sensitivities and belongs to different compartments. To depict this information in MLS, a large number of security labels is necessary.The aim of this paper is to show that the use of latent semantic indexing is successful in assigning textual information to security labels. This supports the authorized user by his manual assignment. It reduces complexity by the administration of a MLS based operating system and it enables the use of a large number of security labels. In future, the findings probably will lead to an increased usage of these MLS based operating systems in organizations.  相似文献   

19.
主题模型LDA的多文档自动文摘   总被引:3,自引:0,他引:3  
近年来使用概率主题模型表示多文档文摘问题受到研究者的关注.LDA (latent dirichlet allocation)是主题模型中具有代表性的概率生成性模型之一.提出了一种基于LDA的文摘方法,该方法以混乱度确定LDA模型的主题数目,以Gibbs抽样获得模型中句子的主题概率分布和主题的词汇概率分布,以句子中主题权重的加和确定各个主题的重要程度,并根据LDA模型中主题的概率分布和句子的概率分布提出了2种不同的句子权重计算模型.实验中使用ROUGE评测标准,与代表最新水平的SumBasic方法和其他2种基于LDA的多文档自动文摘方法在通用型多文档摘要测试集DUC2002上的评测数据进行比较,结果表明提出的基于LDA的多文档自动文摘方法在ROUGE的各个评测标准上均优于SumBasic方法,与其他基于LDA模型的文摘相比也具有优势.  相似文献   

20.
提出并实现了一种改进的基于词汇链分析的自动摘要系统,该系统结合语法分析的特点计算候选词的上下文窗口,提高了采用贪心策略构建词汇链的质量;并针对传统的词汇链分析方法生成的摘要长度的相对固定性给予了改善.实验表明,该系统与采用原始贪心策略构建词汇链的自动摘要系统相比较,生成的文摘质量有明显提高.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号