Similar Articles
20 similar articles found.
1.
A Self-Generating Method for Concept Spaces (total citations: 5; self-citations: 2; citations by others: 5)
This paper proposes a method for automatically generating a concept space. First, texts are clustered with a SOM neural network; concepts reflecting the content of each cluster are then extracted from the results and used to label the text categories. Fuzzy clustering then automatically abstracts and generalizes these concepts to form a concept space for managing texts. The SOM itself learns in an unsupervised manner: once its parameters are set, training automatically produces a mapping between the text space and the concept space. Experiments show that the concept space supports effective category-based management of texts and facilitates text retrieval.
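A minimal sketch of the SOM step described above, using the third-party minisom library and scikit-learn TF-IDF vectors; the toy corpus, grid size, and training budget are illustrative assumptions, not the paper's setup:

```python
# Cluster TF-IDF document vectors with a SOM, then read each map cell's
# highest-weight terms as a rough "concept" label for that cluster.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from minisom import MiniSom  # pip install minisom

docs = ["neural network training of text models",
        "text retrieval and indexing",
        "fuzzy clustering builds concept hierarchies",
        "document classification with neural networks"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = np.array(vec.get_feature_names_out())

som = MiniSom(3, 3, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=42)
som.train_random(X, 500)                    # unsupervised training

for i, doc_vec in enumerate(X):
    cell = som.winner(doc_vec)              # best-matching map cell
    top = terms[np.argsort(som.get_weights()[cell])[-3:]]
    print(f"doc {i} -> cell {cell}: concept terms {list(top)}")
```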

2.
In this paper we investigate the use of a multimodal feature learning approach, using neural-network-based models such as Skip-gram and denoising autoencoders, to address sentiment analysis of micro-blogging content, such as Twitter short messages, that is composed of a short text and, possibly, an image. The approach is motivated by recent advances in: i) training language models based on neural networks, which have proved extremely efficient on web-scale text corpora and perform very well on syntactic and semantic word similarities; ii) unsupervised learning, with neural networks, of robust visual features that are recoverable from partial observations caused by occlusions or by noisy and heavily modified images. We propose a novel architecture that incorporates these neural networks, test it on several standard Twitter datasets, and show that the approach is efficient and obtains good classification results.
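A minimal sketch of a denoising autoencoder of the kind referenced above, in PyTorch; input size, corruption rate, and optimizer settings are illustrative assumptions, not the paper's configuration:

```python
# Corrupt the input with Gaussian noise, then train the network to
# reconstruct the clean input, yielding robust feature representations.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_in=1024, n_hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x, noise_std=0.3):
        corrupted = x + noise_std * torch.randn_like(x)  # partial observation
        return self.decoder(self.encoder(corrupted))     # reconstruct clean x

model = DenoisingAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 1024)                  # stand-in image feature batch
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
opt.step()
```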

3.
Knowledge discovery through directed probabilistic topic models: a survey (total citations: 1; self-citations: 0; citations by others: 1)
Graphical models have become the basic framework for topic-based probabilistic modeling. Models with latent variables in particular have proved effective at capturing hidden structures in data. In this paper, we survey an important subclass, Directed Probabilistic Topic Models (DPTMs), with soft clustering abilities, and their applications to knowledge discovery in text corpora. From an unsupervised learning perspective, “topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling”. In topic modeling, a document consists of different hidden topics, and the topic probabilities provide an explicit representation of the document that smooths the data at the semantic level. This has been an active area of research over the last decade, and many models have been proposed for handling text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We present basic concepts, advantages and disadvantages in chronological order, classify existing models into categories, and describe their parameter-estimation and inference algorithms together with measures for evaluating model performance. We also discuss their applications, open challenges and future directions in this dynamic area of research.
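A minimal sketch of fitting the canonical DPTM, latent Dirichlet allocation, with gensim; the toy corpus and topic count are assumptions:

```python
# Fit LDA on bag-of-words documents; each learned topic is a probabilistic
# word cluster, and a document's topic mixture is its smoothed representation.
from gensim import corpora
from gensim.models import LdaModel

texts = [["bank", "loan", "money"], ["river", "bank", "water"],
         ["loan", "interest", "money"], ["water", "river", "flow"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)                     # probabilistic word clusters
print(lda.get_document_topics(bow_corpus[0])) # explicit topic-level view
```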

4.
Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on the resulting logical structure we finally extract the body text and the table of contents of a scientific article. We evaluate the individual stages of our pipeline separately on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.
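A rough sketch of the first stage (text blocks plus geometry and font cues), approximated with pdfminer.six rather than the authors' own extractor; the file name and the font-size heading heuristic are illustrative assumptions:

```python
# Walk the PDF layout tree, collect each text block's average font size and
# bounding box, and apply a crude unsupervised cue: large font ~ heading.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

for page in extract_pages("article.pdf"):       # hypothetical input file
    for block in page:
        if not isinstance(block, LTTextContainer):
            continue
        sizes = [ch.size for line in block for ch in line
                 if isinstance(ch, LTChar)]
        if not sizes:
            continue
        font = sum(sizes) / len(sizes)
        kind = "heading?" if font > 12 else "body?"
        print(round(font, 1), kind, block.bbox, block.get_text()[:40])
```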

5.
Automatic text classification based on the vector space model (VSM), artificial neural networks (ANN), K-nearest neighbor (KNN), Naive Bayes (NB) and support vector machines (SVM) has been applied to English-language documents and has gained popularity among text mining and information retrieval (IR) researchers. This paper proposes the application of VSM and ANN to the classification of Tamil-language documents. Tamil is a morphologically rich classical Dravidian language. The growth of the internet has led to an exponential increase in the number of electronic documents, not only in English but also in other regional languages, yet the automatic classification of Tamil documents has not been explored in detail so far. In this paper, a corpus is used to construct and test the VSM and ANN models. Methods of document representation and of assigning weights that reflect the importance of each term are discussed. In a traditional word-matching based categorization system, the most popular document representation is the VSM, which needs a high-dimensional space to represent the documents; the ANN classifier requires a smaller number of features. The experimental results show that the ANN model achieves 93.33% accuracy on Tamil document classification, better than the 90.33% yielded by the VSM.
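A minimal sketch contrasting a VSM-style matcher with an ANN classifier in scikit-learn; the toy English documents and labels stand in for the Tamil corpus (an assumption), and NearestCentroid stands in for word-matching VSM retrieval:

```python
# Same TF-IDF features feed two classifiers: a centroid-based VSM matcher
# and a small multilayer perceptron (ANN).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

docs = ["sports match score", "election vote result",
        "team wins final", "party wins seats"]
labels = ["sports", "politics", "sports", "politics"]

vsm = make_pipeline(TfidfVectorizer(), NearestCentroid())
ann = make_pipeline(TfidfVectorizer(),
                    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500))
for name, clf in [("VSM", vsm), ("ANN", ann)]:
    clf.fit(docs, labels)
    print(name, clf.predict(["final match score"]))
```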

6.
Legal text retrieval traditionally relies upon external knowledge sources such as thesauri and classification schemes, and accurate indexing of the documents is often done manually. As a result, not all legal documents can be effectively retrieved. However, a number of current artificial intelligence techniques are promising for legal text retrieval. They support the acquisition of knowledge and the knowledge-rich processing of the content of document texts and information needs, and of their matching. Techniques for learning information needs, learning concept attributes of texts, information extraction, text classification and clustering, and text summarization deserve study in legal text retrieval because of their potential for improving retrieval and decreasing the cost of manual indexing. The resulting query and text representations are semantically much richer than a set of key terms, and their use allows for more refined retrieval models in which some reasoning can be applied. This paper gives an overview of the state of the art of these innovative techniques and their potential for legal text retrieval.

7.

Purpose

Extracting comprehensible classification rules is one of the most emphasized goals in data mining research. To obtain accurate and comprehensible classification rules from databases, a new approach is proposed that combines the advantages of artificial neural networks (ANN) and swarm intelligence.

Method

Artificial neural networks (ANNs) are a group of very powerful tools applied to prediction, classification and clustering in different domains. The main disadvantage of this general-purpose tool is its limited interpretability and comprehensibility. To eliminate these disadvantages, a novel approach is developed to uncover and decode the information hidden in the black-box structure of ANNs. This paper therefore presents a study on knowledge extraction from trained ANNs for classification problems. The proposed approach uses the particle swarm optimization (PSO) algorithm to transform the behavior of trained ANNs into accurate and comprehensible classification rules. Particle swarm optimization with time-varying inertia weight and acceleration coefficients is designed to explore the best attribute-value combination by optimizing the ANN output function.

Results

The weights hidden in trained ANNs are turned into a comprehensible classification rule set with higher test accuracy than traditional rule-based classifiers.
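A minimal sketch of the search idea: PSO with time-varying inertia explores attribute values that maximize a trained network's output for one class, and the best particle is read off as a rule antecedent. The stand-in "network", bounds, and PSO coefficients are assumptions:

```python
# PSO over attribute values; the global best becomes an IF-THEN rule.
import numpy as np

def ann_output(x):                       # stand-in for the trained ANN
    return 1.0 / (1.0 + np.exp(-(2 * x[:, 0] - 3 * x[:, 1] + 1)))

n, dim, iters = 30, 2, 100
pos = np.random.uniform(0, 1, (n, dim))
vel = np.zeros((n, dim))
pbest, pbest_val = pos.copy(), ann_output(pos)
gbest = pbest[pbest_val.argmax()]

for t in range(iters):
    w = 0.9 - 0.5 * t / iters            # inertia weight decays over time
    r1, r2 = np.random.rand(n, dim), np.random.rand(n, dim)
    vel = w * vel + 2.0 * r1 * (pbest - pos) + 2.0 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0, 1)
    val = ann_output(pos)
    better = val > pbest_val
    pbest[better], pbest_val[better] = pos[better], val[better]
    gbest = pbest[pbest_val.argmax()]

print("IF attributes ~", gbest.round(2), "THEN class = 1")
```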

8.
In recent years, with the successful application of word embeddings and various neural network models to natural language processing, neural-network-based text classification has become the mainstream line of research. However, when the training data for different classes are imbalanced, the trained network is dominated by the majority classes and its predictions are biased toward them, which greatly degrades classification performance. To address this, this paper introduces class-label weights into the loss function during convolutional neural network training, strengthening the influence of the minority classes on the model parameters. Tests on the Fudan University text classification dataset show that the proposed method improves macro-averaged F1 by 4.49% over the baseline system, effectively mitigating the imbalanced-data classification problem.
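A minimal sketch of the weighted-loss idea in PyTorch; the class counts and inverse-frequency weighting scheme are illustrative assumptions, not the paper's exact weights:

```python
# Weight cross-entropy by inverse class frequency so minority-class errors
# pull harder on the gradients than majority-class errors.
import torch
import torch.nn as nn

class_counts = torch.tensor([900.0, 100.0])        # majority vs. minority
weights = class_counts.sum() / (2 * class_counts)  # inverse-frequency weights
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2, requires_grad=True)     # stand-in CNN outputs
targets = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
loss = criterion(logits, targets)                  # minority errors cost more
loss.backward()
```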

9.
An increasing number of computational and statistical approaches have been used for text classification, including nearest-neighbor classification, naïve Bayes classification, support vector machines, decision tree induction, rule induction, and artificial neural networks. Among these, naïve Bayes classifiers are widely used because of their simplicity: thanks to the simplicity of the Bayes formula, the naïve Bayes classification algorithm requires relatively little training data and less time in both the training and classification stages compared to other classifiers. A major shortcoming of this technique, however, is that the classifier picks the single highest-probability category as the one to which the document is assigned, which is tantamount to classifying using only one dimension of a multi-dimensional data set. The main aim of this work is to utilize the strengths of the self-organizing map (SOM) to overcome the inadvertent dimensionality reduction that results from using only the Bayes formula to classify. Combining the hybrid system with new ranking techniques further improves the performance of the proposed document classification approach. This work describes the implementation of an enhanced hybrid classification approach that affords better classification accuracy through the utilization of two familiar algorithms: the naïve Bayes classification algorithm, used to vectorize the document using a probability distribution, and the self-organizing map (SOM) clustering algorithm, used as the multi-dimensional unsupervised classifier.
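A minimal sketch of the hybrid: naive Bayes turns each document into a full class-probability vector, and a SOM clusters those vectors rather than keeping only the argmax class. The toy data and grid size are assumptions:

```python
# NB predict_proba preserves the multi-dimensional view; the SOM then
# organizes documents by their whole probability distributions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from minisom import MiniSom  # pip install minisom

docs = ["goal match team", "vote party seat",
        "team final win", "party poll vote"]
labels = [0, 1, 0, 1]
X = CountVectorizer().fit_transform(docs)
nb = MultinomialNB().fit(X, labels)

proba = nb.predict_proba(X)          # probability vector, not just argmax
som = MiniSom(2, 2, proba.shape[1], random_seed=1)
som.train_random(proba, 200)
print([som.winner(p) for p in proba])  # documents mapped to SOM cells
```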

10.
Inspired by classical text document analysis employing the concept of (key) words, this paper presents an unsupervised approach to discover (key) audio elements in general audio documents. The (key) audio elements can be considered the equivalents of text (key) words, and enable content-based audio analysis and retrieval by analogy to proven text analysis theories and methods. Since general audio signals usually show complicated and strongly varying distribution and density in the feature space, we propose an iterative spectral clustering method with context-dependent scaling factors to decompose an audio data stream into audio elements. Using this clustering method, temporal signal segments with similar low-level features are grouped into natural clusters that we adopt as audio elements. To detect the audio elements that are most representative of the semantic content, that is, the key audio elements, two cases are considered. First, if only one audio document is available for analysis, a number of heuristic importance indicators are defined and employed to detect the key audio elements. For the case where multiple audio documents are available, more sophisticated measures of audio element importance are proposed, including expected term frequency (ETF), expected inverse document frequency (EIDF), expected term duration (ETD) and expected inverse document duration (EIDD). Our experiments showed encouraging results regarding the quality of the obtained (key) audio elements and their potential applicability for content-based audio document analysis and retrieval.
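A rough sketch of the decomposition step, approximating the paper's iterative, context-dependent spectral clustering with scikit-learn's stock SpectralClustering; the random stand-in features (real systems would use spectral features such as MFCCs) and cluster count are assumptions:

```python
# Cluster per-segment feature vectors; each cluster of similar segments is
# adopted as one "audio element", later ranked by TF/IDF-style importance.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
segments = np.vstack([rng.normal(0, 1, (40, 12)),   # e.g. speech-like frames
                      rng.normal(4, 1, (40, 12))])  # e.g. music-like frames

labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10).fit_predict(segments)
print(np.bincount(labels))   # segment counts per discovered audio element
```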

11.
The introduction of Bidirectional Encoder Representations from Transformers (BERT) has changed how neural networks approach sentence-level sentiment analysis. However, BERT itself is pre-trained in an unsupervised fashion and relies on downstream tasks to complete the inference and decision stages, so it lacks target-domain knowledge. This paper proposes a multi-level collaborative convolutional neural network model (Multi-level Convolutional Neural Network, MCNN) that learns sentiment features at different levels to supplement domain knowledge. A pre-trained BERT model supplies the word vectors, and dynamically adjusting BERT's learning capacity embeds each sentence's true sentiment orientation into the model. Finally, the features output by the different levels are fused with the output of a bidirectional long short-term memory network to compute the final sentiment polarity of the text. Experiments show that, even on corpora in different languages, the model clearly improves sentiment-polarity classification over traditional neural networks and recently proposed BERT-based deep learning models.
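A rough sketch of the fusion idea: BERT token vectors feed parallel convolutions of different widths (multi-level features) and a BiLSTM, with the outputs concatenated before the classifier. The model name, layer sizes, and example sentence are illustrative assumptions, not the paper's MCNN configuration:

```python
# BERT features -> multi-width CNNs + BiLSTM -> fused feature -> classifier.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")

convs = nn.ModuleList([nn.Conv1d(768, 64, k) for k in (2, 3, 4)])
bilstm = nn.LSTM(768, 64, bidirectional=True, batch_first=True)
clf = nn.Linear(3 * 64 + 2 * 64, 2)

inputs = tok("这部电影非常好看", return_tensors="pt")
H = bert(**inputs).last_hidden_state            # (1, seq_len, 768)
c = [conv(H.transpose(1, 2)).max(dim=2).values for conv in convs]
h, _ = bilstm(H)
fused = torch.cat(c + [h[:, -1, :]], dim=1)     # fuse CNN + BiLSTM features
print(clf(fused).softmax(dim=1))                # sentiment polarity
```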

12.
刘金硕, 张智. 计算机科学 (Computer Science), 2016, 43(12): 277-280
To address the difficulty of representing features of Chinese food-safety texts, which causes loss of semantic information and hence low classifier accuracy, a cross-text-granularity sentiment classification model based on deep neural networks is proposed. Taking food-safety news reports as the target corpus, an unsupervised shallow neural network first initializes word-level embeddings. A recursive neural network is then introduced: the pre-trained word vectors form the input layer of a lower-level Recursive Neural Network, which computes sentence feature vectors that capture the semantic associations between words, outputs sentence-level sentiment orientations, and dynamically adjusts the word-vector features so that they better match the true semantics of the food-safety domain. The sentence vectors output by the recursive network are then fed, in temporal order, into an upper-level Recurrent Neural Network, which further captures contextual semantic associations across sentence structure to perform document-level sentiment analysis. Experimental results show that the joint deep model performs well on sentiment classification of food-safety news reports, achieving 86.7% accuracy and an F1 of 85.9%, a significant improvement over a bag-of-words SVM model.

13.
Sentiment analysis of financial news helps enterprises and investors assess investment risk and improve returns, and thus has high practical value. For financial news texts, this paper proposes a sentiment analysis method that applies dependency parsing within a graph convolutional network (Dependency Analysis-based Graph Convolutional Network, DA-GCN). The method analyses the dependency relations between words to capture word-order information and the important sentence constituents in a document, and uses word co-occurrence within the document to propagate information and update the graph's parameters. Experiments on a financial news dataset show that the proposed method achieves significant improvements over traditional deep learning methods on all evaluation metrics.
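A minimal sketch of one graph-convolution step over a dependency-derived word graph, computing the standard normalized propagation A_hat X W; the tiny graph and feature sizes are assumptions:

```python
# One GCN layer: symmetrically normalized adjacency propagates each word's
# features to its dependency/co-occurrence neighbors.
import torch

# Adjacency for 4 words; an edge marks a dependency or co-occurrence link.
A = torch.tensor([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
A_hat = A + torch.eye(4)                  # add self-loops
D_inv_sqrt = torch.diag(A_hat.sum(1).rsqrt())
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization

X = torch.randn(4, 16)                    # word feature vectors
W = torch.randn(16, 8)                    # layer weights
H = torch.relu(A_norm @ X @ W)            # neighbors exchange information
print(H.shape)                            # torch.Size([4, 8])
```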

14.
Abstract

Retrieving relevant information from Twitter is always a challenging task given its vocabulary mismatch, sheer volume and noise. Representing the content of text tweets is a critical part of any microblog retrieval model, and deep neural networks can be used to learn good representations of text data that lead to better matching. In this paper, we are interested in improving both representation and retrieval effectiveness in microblogs. To that end, a Hybrid Deep Neural Network (HDNN) based text representation model is proposed to extract effective feature representations for clustering-oriented microblog retrieval. HDNN combines recurrent and feedforward neural network architectures: using a bi-directional LSTM, we first generate a deep contextualized word representation that incorporates character n-grams from FastText. However, these contextual embeddings live in a high-dimensional space and not all of them are important; some are redundant, correlated and sometimes noisy, making the learning models overfit, grow complex and become less interpretable. To deal with these problems, we propose a Hybrid Regularized Autoencoder based method that combines an autoencoder with Elastic Net regularization for effective unsupervised feature selection and extraction. Our experimental results show that the performance of clustering, and especially of information retrieval in microblogs, depends heavily on the feature representation.
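A minimal sketch of the regularization idea: an autoencoder whose encoder weights carry an Elastic Net (L1 + L2) penalty, pushing redundant input dimensions toward zero weight. Sizes and penalty coefficients are assumptions:

```python
# Reconstruction loss plus L1 (sparsity/selection) and L2 (stability under
# correlated inputs) penalties on the encoder weights.
import torch
import torch.nn as nn

enc = nn.Linear(1024, 128)                # 1024-dim contextual embeddings in
dec = nn.Linear(128, 1024)
x = torch.rand(16, 1024)

recon = dec(torch.relu(enc(x)))
l1 = enc.weight.abs().sum()               # sparsity: selects few features
l2 = enc.weight.pow(2).sum()              # shrinks correlated weights together
loss = nn.functional.mse_loss(recon, x) + 1e-4 * l1 + 1e-4 * l2
loss.backward()
```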

15.
Automatic text summarization (ATS) has recently achieved impressive performance thanks to advances in deep learning and the availability of large-scale corpora. However, there is still no guarantee that generated summaries are grammatical, concise, and convey all the salient information of the original documents. To make summarization results more faithful, this paper presents an unsupervised approach that combines rhetorical structure theory, a deep neural model, and domain knowledge for ATS. The architecture contains three main components: domain knowledge base construction based on representation learning, an attentional encoder–decoder model for rhetorical parsing, and a subroutine-based model for text summarization. Domain knowledge can be used effectively for unsupervised rhetorical parsing, so a rhetorical structure tree can be derived for each document. In the unsupervised rhetorical parsing module, the idea of translation is adopted to alleviate the problem of data scarcity. The subroutine-based summarization model depends purely on the derived rhetorical structure trees and can generate content-balanced results. To evaluate summaries without a gold standard, we propose an unsupervised evaluation metric whose hyper-parameters are tuned by supervised learning. Experimental results show that, on a large-scale Chinese dataset, our approach obtains performance comparable with existing methods.

16.
Various studies have demonstrated that convolutional neural networks (CNNs) can be applied directly to different levels of text embedding, such as the character, word, or document level. However, the reported effectiveness of the different embeddings is limited, and there is a lack of clear guidance on some aspects of their use, including choosing the proper level of embedding and switching word semantics from one domain to another when appropriate. In this paper, we propose a new CNN architecture based on multiple representations for text classification: multiple input planes are constructed so that more information can be fed into the network, such as different parts of the text obtained through named entity recognition or part-of-speech tagging tools, different levels of text embedding, or contextual sentences. Various large-scale, domain-specific datasets are used to validate the proposed architecture on tasks including ontology document classification, biomedical event categorization, and sentiment analysis, showing that multi-representational CNNs, which learn to focus attention on specific representations of text, can obtain further performance gains over state-of-the-art deep neural network models.
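A minimal sketch of the multi-plane idea: several representations of the same text stacked as input channels of a 2-D convolution, analogous to image channels. The shapes and the three-plane choice are illustrative assumptions:

```python
# Each plane holds a different representation of the same 50-word text
# (e.g. word embeddings, POS-tagged embeddings, entity-masked embeddings).
import torch
import torch.nn as nn

batch, planes, seq_len, emb = 8, 3, 50, 100
x = torch.randn(batch, planes, seq_len, emb)   # one plane per representation

conv = nn.Conv2d(in_channels=planes, out_channels=64,
                 kernel_size=(3, emb))         # slide over 3-word windows
features = torch.relu(conv(x)).squeeze(3)      # (8, 64, 48)
pooled = features.max(dim=2).values            # (8, 64) per-text feature
print(pooled.shape)
```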

17.
Document clustering is text processing that groups documents with similar concepts. It is usually considered an unsupervised learning approach because there is no teacher to guide the training process, and topical information is often assumed to be unavailable. A guided approach to document clustering that integrates linguistic top-down knowledge from WordNet into text vector representations, based on the extended significance vector weighting technique, improves both classification accuracy and average quantization error. In our guided self-organization approach we integrate topical and semantic information from WordNet. Because a document training set with pre-classified information implies relationships between a word and its preference class, we propose a novel document vector representation approach to extract these relationships for document clustering. Furthermore, by merging statistical methods, competitive neural models, and semantic relationships from symbolic WordNet, our hybrid learning approach is robust and scales up to a real-world task of clustering 100,000 news documents.
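A minimal sketch of injecting WordNet top-down knowledge into a document vector by counting hypernyms alongside surface words (a simplification of the paper's significance-vector weighting); the example words and the first-sense shortcut are assumptions:

```python
# Add each word's WordNet hypernyms to its term counts so semantically
# related documents share vector mass even without shared surface words.
from collections import Counter
from nltk.corpus import wordnet as wn   # requires nltk + the wordnet data

def enriched_terms(words):
    terms = list(words)
    for w in words:
        for syn in wn.synsets(w)[:1]:        # take the dominant sense
            for hyper in syn.hypernyms():
                terms.append(hyper.name())   # e.g. 'canine.n.02' for 'dog'
    return Counter(terms)

print(enriched_terms(["dog", "wolf"]))
# Both words contribute to 'canine.n.02', so documents about dogs and
# wolves move closer in vector space.
```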

18.
There are several data-based methods in the field of artificial intelligence that are nowadays frequently used for analyzing classification problems in the context of medical applications. As we show in this paper, the application of enhanced evolutionary computation techniques to classification problems has the potential to evolve classifiers of even higher quality than those trained by standard machine learning methods. On the basis of five medical benchmark classification problems taken from the UCI repository, as well as the Melanoma data set (prepared by members of the Department of Dermatology of the Medical University Vienna), we document that the enhanced genetic programming approach presented here is able to produce comparable or even better results than linear modeling methods, artificial neural networks, kNN classification, support vector machines and various other genetic programming approaches.

19.

In view of the exponential growth of online document corpora, even perfect retrieval will fetch too much material for a user to cope with. One way to reduce this problem is automatic domain-specific summarization tailored to the user's needs, which is a kind of high-level data cleaning. This requires some method of discovering classes of similar items that may be grouped into predetermined domains. We explore whether there exists a synergistic relation between systems for classification and those for summarization by composing those subsystems; in other words, we examine whether prior summarization increases the performance of the classifier system and vice versa. In both cases the answer is affirmative, as we show in this paper. We propose a text-mining framework in which these subsystems are treated as constituents of a knowledge discovery process for text corpora.

20.
Text Retrieval from Document Images Based on Word Shape Analysis (total citations: 2; self-citations: 1; citations by others: 2)
In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We extract image features directly instead of using optical character recognition. Document images are segmented into word units, and features called vertical bar patterns are extracted from these word units through the detection of local extrema points. All vertical bar patterns are used to build document vectors. Finally, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images obtained with this method was compared with the results of the N-gram algorithm on the ASCII versions of those documents.
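A minimal sketch of the final matching step: pair-wise similarity as the scalar product of normalized document vectors. The toy vectors are assumptions standing in for vertical-bar-pattern counts:

```python
# Normalize document vectors, then take scalar products to get a cosine
# similarity matrix over all document pairs.
import numpy as np

docs = np.array([[3., 0, 2, 5],        # vertical-bar-pattern counts per doc
                 [2., 1, 2, 4],
                 [0., 6, 1, 0]])
norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
similarity = norm @ norm.T             # pair-wise similarity matrix
print(similarity.round(2))             # docs 0 and 1 score highest
```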
