Similar Articles
20 similar articles found (search time: 31 ms).
1.
We present a novel evolutionary model for knowledge discovery from texts (KDT), which deals in an integrated way with issues of shallow text representation and processing for mining purposes. Its aim is to look for novel and interesting explanatory knowledge across text documents. The approach uses natural language technology and genetic algorithms to produce novel explanatory hypotheses. The proposed approach is interdisciplinary, involving concepts not only from evolutionary algorithms but also from many kinds of text mining methods. Accordingly, new genetic operations suited to text mining are proposed. The principles behind the representation and a new proposal for multiobjective evaluation at the semantic level are described. Some promising results and their assessment by human experts are also discussed, which indicate the plausibility of the model for effective KDT.
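As a rough illustration of how genetic operations might drive hypothesis search over text-derived concepts, here is a minimal sketch. It is not the paper's actual model: the vocabulary, scores, crossover/mutation operators, and single-scalar fitness below are invented stand-ins for its semantic-level multiobjective evaluation.

```python
import random

random.seed(0)

# Hypothetical concept vocabulary with invented "interestingness" scores.
VOCAB = ["enzyme", "protein", "binds", "inhibits", "cell", "membrane"]
SCORE = dict(zip(VOCAB, [3, 2, 5, 4, 1, 2]))

def fitness(hypothesis):
    # Scalar proxy: sum of term scores, penalizing overly long hypotheses.
    return sum(SCORE[w] for w in hypothesis) - 0.5 * len(hypothesis)

def crossover(a, b):
    # Text-oriented operator: keep a prefix of one hypothesis, merge in the
    # other's concepts that are not already present.
    cut = len(a) // 2
    return a[:cut] + [w for w in b if w not in a[:cut]]

def mutate(h):
    h = list(h)
    h[random.randrange(len(h))] = random.choice(VOCAB)
    return h

def evolve(pop_size=20, generations=30):
    pop = [random.sample(VOCAB, 3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

In the paper, evaluation happens at the semantic level over several objectives; collapsing it to one score as above is only to keep the evolutionary loop visible.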

2.
Identifying syntactical information from natural-language texts requires sophisticated parsing techniques, mainly based on statistical and machine-learning methods. However, due to complexity and efficiency issues, many intensive natural-language processing applications that use full syntactic analysis may not be effective when processing large amounts of text. These tasks can adequately be performed by identifying partial syntactic information through shallow parsing (or chunking) techniques. In this work, a new approach to natural-language chunking using an evolutionary model is proposed. It uses previously captured training information to guide the evolution of the model. In addition, a multiobjective optimization strategy produces a single quality value from objective functions covering the internal and external quality of the chunking. Experiments and the main results obtained with the model and state-of-the-art approaches are discussed.

3.
Although extractive summarization is currently the mainstream approach to automatic summarization and has made considerable progress, summaries produced by extractive methods often lack proper coreference between sentences or discourse structure, which hurts coherence and hence readability. To improve readability, this paper applies discourse rhetorical structure information to Chinese automatic summarization. First, summaries are extracted based on Chinese discourse rhetorical structure; then an LSTM-based model of text coherence is built and used to evaluate the coherence of the summaries. Experimental results show that, for summary extraction, rhetorical-structure-based summarization achieves better ROUGE scores than traditional extractive methods, and that under the LSTM coherence model, discourse structure information helps distill the gist of a document while yielding more coherent summaries.

4.
5.
When traditional text similarity measures are applied directly to short texts, the brevity of short texts leads to data sparsity and biased results. This paper represents short texts as complex networks and proposes a new short-text similarity measure. The method first preprocesses the short texts, then builds a complex network model for each text and computes complex-network feature values for its words; the semantic similarity between words is computed with an external tool, and the similarity between short texts is then defined by combining the word-level semantic similarities with the network features. Finally, clustering experiments on benchmark datasets verify that, by the F-measure, the proposed short-text similarity measure outperforms the traditional TF-IDF method and another measure based on term semantic similarity.
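A minimal sketch of the general idea: build a word co-occurrence network per text, use a network feature as a term weight, and combine it with a pluggable word-level semantic similarity function. The window size, degree-based weighting, and the exact-match `sem` below are illustrative assumptions, not the paper's definitions.

```python
from collections import defaultdict

def build_network(tokens, window=2):
    """Co-occurrence graph: words are nodes; words within `window` are linked."""
    adj = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                adj[w].add(v)
                adj[v].add(w)
    return adj

def degree_weight(adj):
    """Network feature used as a term weight (degree centrality here)."""
    total = max(len(adj), 1)
    return {w: len(nbrs) / total for w, nbrs in adj.items()}

def similarity(t1, t2, sem):
    """Weighted soft match: each word of t1 aligned to its best match in t2."""
    w1 = degree_weight(build_network(t1))
    num = sum(w1[a] * max(sem(a, b) for b in t2) for a in t1)
    den = sum(w1.values()) or 1.0
    return num / den
```

In practice `sem` would be backed by an external semantic resource, as the abstract describes; exact string match is used here only so the sketch is self-contained.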

6.
Rule Discovery from Incomplete Information Systems Based on GDT
A GDT-based method for rule discovery from incomplete information systems is proposed. Using the idea of GDT, and through the concepts of generalization strength, rule confidence, and rule strength, the method fully accounts for the uncertainty caused by incomplete data and noise, and derives concise, practical rules directly from an incomplete information system without changing its size. Finally, an example illustrates the procedure, and the method is analyzed and compared with several other rule-discovery approaches mentioned that cannot operate directly on incomplete information systems. The analysis shows that the method is a new and effective way to extract rules from incomplete information systems.

7.
Automatic text summarization (ATS) has recently achieved impressive performance thanks to advances in deep learning and the availability of large-scale corpora. However, there is still no guarantee that generated summaries are grammatical, concise, and convey all the salient information of the original documents. To make summarization results more faithful, this paper presents an unsupervised approach that combines rhetorical structure theory, deep neural models, and domain knowledge for ATS. The architecture contains three main components: domain knowledge base construction based on representation learning, an attentional encoder–decoder model for rhetorical parsing, and a subroutine-based model for text summarization. Domain knowledge can be used effectively for unsupervised rhetorical parsing, so that a rhetorical structure tree can be derived for each document. In the unsupervised rhetorical parsing module, the idea of translation is adopted to alleviate data scarcity. The subroutine-based summarization model depends purely on the derived rhetorical structure trees and can generate content-balanced results. To evaluate summaries without a gold standard, we propose an unsupervised evaluation metric whose hyper-parameters are tuned by supervised learning. Experimental results on a large-scale Chinese dataset show that the proposed approach obtains performance comparable to existing methods.

8.
In this paper, a new multi-document summarization framework which combines rhetorical roles and corpus-based semantic analysis is proposed. The approach is able to capture the semantic and rhetorical relationships between sentences so as to combine them to produce coherent summaries. Experiments were conducted on datasets extracted from web-based news using standard evaluation methods. Results show the promise of our proposed model as compared to state-of-the-art approaches.

9.
A novel approach is introduced in this paper for the implementation of a question–answering-based tool for the extraction of information and knowledge from texts. This effort resulted in the computer implementation of a system answering bilingual questions directly from a text using Natural Language Processing. The system uses domain knowledge concerning categories of actions and implicit semantic relations. The present state of the art in information extraction is based on the template approach, which relies on a predefined user model. The model guides the extraction of information and the instantiation of a template that is similar to a frame or set of attribute–value pairs as the result of the extraction process. Our question–answering-based approach aims to create flexible information extraction tools accepting natural language questions and generating answers that contain information extracted from text either directly or after applying deductive inference. Our approach also addresses the problem of implicit semantic relations occurring either in the questions or in the texts from which information is extracted. These relations are made explicit with the use of domain knowledge. Examples of application of our methods are presented concerning four domains of quite different nature: oceanography, medical physiology, aspirin pharmacology, and ancient Greek law. Questions are expressed both in Greek and English. Another important point of our method is to process text directly, avoiding any kind of formal representation when inference is required for the extraction of facts not mentioned explicitly in the text. This idea of using text as a knowledge base was first presented in Kontos [7] and further elaborated in [9,11,12] as the ARISTA method. This is a new method for knowledge acquisition from texts that is based on using natural language itself for knowledge representation.

10.
Research on Effective Methods for Extracting Text from HTML
Extracting the main body text from news and blog pages is a meaningful research problem, but most pages contain large amounts of noise unrelated to the main text, making it hard to obtain the correct text. This paper analyzes the characteristics of the main text in Chinese news and blog pages and shows experimentally that the HTML-to-text density ratio can be used to identify and extract the text. Three HTML text-extraction methods, based on machine learning, statistical estimation, and FDR respectively, are proposed and compared in extensive experiments. The results show that the algorithms filter noise effectively at low computational cost, striking a good balance between efficiency and quality.
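The density idea can be illustrated with a toy line-based extractor: lines whose visible text dominates their markup are kept, tag-heavy navigation lines are dropped. The regex tag stripping, the 20-character-per-tag penalty, and the 0.5 threshold are invented for illustration; the paper's actual methods use machine learning, statistical estimation, and FDR.

```python
import re

def extract_main_text(html, threshold=0.5):
    """Keep lines whose text-to-markup density is high (a simplified
    illustration of the HTML/text density ratio idea)."""
    kept = []
    for line in html.splitlines():
        tags = len(re.findall(r"<[^>]+>", line))
        text = re.sub(r"<[^>]+>", "", line).strip()
        if not text:
            continue
        # Penalize tag-heavy lines: each tag counts as 20 characters of markup.
        density = len(text) / (len(text) + 20 * tags)
        if density >= threshold:
            kept.append(text)
    return "\n".join(kept)

page = (
    '<div><a href="#">Home</a> | <a href="#">About</a></div>\n'
    '<p>This is a long paragraph of article body text that should clearly be kept.</p>'
)
main = extract_main_text(page)
```

Real pages are not conveniently line-structured, so a production extractor would compute densities over DOM nodes rather than source lines.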

11.
In intelligent medical consultation, to help doctors quickly pose reasonable follow-up questions and improve the efficiency of doctor–patient dialogue, a deep-neural-network-based method for generating follow-up questions is proposed. First, a large number of doctor–patient dialogue texts are collected and annotated. Then two classification models, a text recurrent neural network (TextRNN) and a text convolutional neural network (TextCNN), are used to classify the doctor's statements, and a bidirectional TextRNN (TextRNN-B) and BERT are used for question triggering. Six question–answer selection schemes are designed to simulate medical-consultation scenarios, and the open-source neural machine translation model OpenNMT is used to generate the follow-up questions, which are then evaluated comprehensively. Experimental results show that TextRNN outperforms TextCNN for classification, BERT outperforms TextRNN-B for question triggering, and OpenNMT under the Window-top scheme achieves the best results as measured by BLEU and perplexity (PPL). The proposed method demonstrates the effectiveness of deep neural networks for follow-up question generation and can effectively solve this problem in intelligent consultation.

12.
Topic Detection Based on Graph Analysis and Cosine Similarity
Automatically extracting valuable topic information from massive text collections has become an important technical challenge. Most existing approaches assume that topics are mutually independent, whereas in reality topics are intricately interrelated. To address this, correlation theory is combined with an improved graph-analysis method, and topic detection is modeled on topic correlation and term co-occurrence: high-precision semantic information and latent co-occurrence relations are used jointly to discover important and meaningful topics and trends. Simulation experiments verify the effectiveness of the model.
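A minimal sketch of the cosine-similarity half of such a pipeline: represent each topic by its term frequencies and link topics whose distributions correlate, instead of treating them as independent. The 0.3 threshold and toy term counts are invented; the paper's improved graph-analysis method is not reproduced here.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_graph(topics, threshold=0.3):
    """Edges connect correlated topics rather than assuming independence."""
    edges = []
    names = list(topics)
    for i, u in enumerate(names):
        for v in names[i + 1 :]:
            s = cosine(topics[u], topics[v])
            if s >= threshold:
                edges.append((u, v, round(s, 3)))
    return edges

topics = {
    "ml": Counter(model=3, training=2, data=2),
    "dl": Counter(model=2, network=3, data=1),
    "cooking": Counter(recipe=3, oven=2),
}
edges = topic_graph(topics)
```

The resulting edge list can then feed standard graph analysis (e.g. community detection) to surface topic trends.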

13.
Knight: A General-Purpose Knowledge Mining Tool
Existing knowledge mining systems generally suffer from poor generality and support only a single discovery method.

14.
This article aims to show the effectiveness of evolutionary algorithms in automatically parsing sentences of real texts. Parsing methods based on complete search techniques are limited by the exponential increase of the size of the search space with the size of the grammar and the length of the sentences to be parsed. Approximate methods, such as evolutionary algorithms, can provide approximate results, adequate to deal with the indeterminism that ambiguity introduces in natural language processing. This work investigates different alternatives to implement an evolutionary bottom-up parser. Different genetic operators have been considered and evaluated. We focus on statistical parsing models to establish preferences among different parses. It is not our aim to propose a new statistical model for parsing but a new algorithm to perform the parsing once the model has been defined. The training data are extracted from syntactically annotated corpora (treebanks), which provide sets of lexical and syntactic tags as well as the grammar on which the parsing is based. We have tested the system with two corpora, Susanne and Penn Treebank, obtaining very encouraging results.

15.
The Internet contains a vast amount of easily obtainable natural-language address descriptions that carry rich spatial information. Given their unstructured nature, this paper proposes a method for automatically extracting lexical and syntactic information from Chinese natural-language address descriptions so that spatial knowledge can be mined in depth. First, a Chinese word segmentation algorithm that requires no gazetteer is designed from the statistical regularities of character-string co-occurrence in an address corpus, and a predefined list of common words that play indicative or delimiting roles in address text is used to improve segmentation and assist part-of-speech tagging. After segmentation, a finite-state machine model expressing the common syntax of Chinese address descriptions is defined and used to automatically match and recognize the syntactic structure of address text. Finally, segmentation and syntax-recognition experiments on a large real-world corpus demonstrate the usability and effectiveness of the method.
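The finite-state idea can be sketched over coarse address-token tags. The tag set, transitions, and accepting states below are invented illustrations of a "large unit before small unit" address pattern; they are not the paper's actual model.

```python
# Hypothetical token tags: P=province, C=city, D=district, R=road, N=house number.
# Transitions encode the common "large unit before small unit" address order.
TRANSITIONS = {
    ("start", "P"): "prov", ("start", "C"): "city",
    ("prov", "C"): "city",
    ("city", "D"): "dist", ("city", "R"): "road",
    ("dist", "R"): "road",
    ("road", "N"): "num",
}
ACCEPTING = {"city", "dist", "road", "num"}

def accepts(tags):
    """Run the FSM over a tag sequence; True if it matches address syntax."""
    state = "start"
    for t in tags:
        state = TRANSITIONS.get((state, t))
        if state is None:
            return False
    return state in ACCEPTING
```

In the paper's setting, the tags would come from the segmentation and tagging stage, and the automaton would cover a much richer set of address constructions.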

16.
A large-scale project produces a great deal of text data during construction, commonly archived as various management reports. Having the right information at the right time can help the project team understand the project status and manage the construction process more efficiently. However, text information is presented in unstructured or semi-structured formats, and extracting useful information from such a large text warehouse is a challenge. A manual process is costly and often cannot deliver the right information to the right person at the right time. This research proposes an integrated intelligent approach based on natural language processing (NLP) technology, which involves three main stages. First, a text classification model based on a convolutional neural network (CNN) is developed to classify construction on-site reports by analyzing and extracting report text features. At the second stage, the classified construction report texts are analyzed with term frequency–inverse document frequency (TF-IDF) improved by mutual information to identify and mine construction knowledge. At the third stage, a relation network based on the co-occurrence matrix of the knowledge is presented for visualization and better understanding of the construction on-site information. Actual construction reports are used to verify the feasibility of this approach. The study provides a new approach for handling construction on-site text data, which can enhance management efficiency and practical knowledge discovery for project management.
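As a small illustration of the term-weighting stage, here is plain TF-IDF over tokenized documents; the paper's mutual-information reweighting and CNN classifier are omitted, and the token lists in the usage example are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Plain TF-IDF over tokenized documents.

    The approach described above further reweights terms by mutual
    information with the report class; that step is not shown here.
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

For example, `tf_idf([["crane", "delay", "site"], ["crane", "safety", "site"], ["budget", "delay"]])` gives "safety" a higher weight in the second report than the more widespread "site".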

17.
18.
Microblog Topic Detection Based on Two-Layer Clustering over Thread Trees
As a new mode of information publishing, microblogging has greatly enhanced the openness and interactivity of online information, but it has also caused explosive growth in the volume of information in the microblog space. Topic detection techniques that organize microblog posts by topic can help users efficiently obtain personalized information or hot topics in a dynamically changing information environment. Targeting the short length, semi-structured nature, and rich context of microblog text, this paper proposes a two-layer clustering method for topic detection based on thread trees: a Temporal-Author-Topic (TAT) model incorporating temporal features and author information performs local clustering within thread trees, which also serves to filter spam posts, and the consolidated thread trees are then used for global topic detection. Experimental results show that the method handles data sparsity well, achieving an F-measure of 31.2% for topic detection.

19.
Recent scholarship points to the rhetorical role of the aesthetic in multimodal composition and new media contexts. In this article, I examine the aesthetic as a rhetorical concept in writing studies and imagine the ways in which this concept can be useful to teachers of multimodal composition. My treatment of the concept begins with a return to the ancient Greek aisthetikos (relating to perception by the senses) in order to discuss the aesthetic as a meaningful mode of experience. I then review European conceptions of the aesthetic and finally draw from John Dewey and Bruno Latour to help shape this concept into a pragmatic and useful approach that can complement multimodal teaching and learning. The empirical approach I construct adds to an understanding of aesthetic experience with media in order to render more transparent the ways in which an audience creates knowledge—or takes and makes meaning—via the senses. Significantly, this approach to meaning making supports learning in digital environments where students are increasingly asked to both produce and consume media convergent texts that combine multiple modalities including sound, image, and user interaction.

20.
This paper discusses a fundamental problem in natural language generation: how to organize the content of a text in a coherent and natural way. In this research, we set out to determine the semantic content and the rhetorical structure of texts and to develop heuristics to perform this process automatically within a text generation framework. The study was performed on a specific language and textual genre: French instructional texts. From a corpus analysis of these texts, we determined nine senses typically communicated in instructional texts and seven rhetorical relations used to present these senses. From this analysis, we then developed a set of presentation heuristics that determine how the senses to be communicated should be organized rhetorically in order to create a coherent and natural text. The heuristics are based on five types of constraints: conceptual, semantic, rhetorical, pragmatic, and intentional constraints. To verify the heuristics, we developed the spin natural language generation system, which performs all steps of text generation but focuses on the determination of the content and the rhetorical structure of the text.
