Similar literature
Found 20 similar documents (search time: 15 ms)
1.
李超  严馨 《计算机应用研究》2021,38(11):3283-3288
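To address the slow progress in Khmer sentence-level sentiment analysis caused by scarce annotated data and corpora, this paper proposes a Khmer sentence-level sentiment polarity classification method based on a deep semi-supervised CNN (convolutional neural network). The method uses a CNN with separate convolutions that incorporates lexicon embeddings, exploiting the small amount of existing Khmer sentiment lexicon resources to improve sentence-level sentiment classification. First, word embeddings and lexicon embeddings are built for each Khmer sentence, and different convolution kernels are applied to the two embeddings separately, so that existing sentiment lexicon information is integrated into the CNN model; max-over-time pooling then yields the maximum output features, and the two sets of features are concatenated as input to the fully connected layer. Next, the proposed deep network is trained with a semi-supervised learning method, temporal ensembling, on both labeled and unlabeled corpora, which reduces the need for labeled data and further improves classification accuracy. The results show that, with the same amount of manually labeled data, training with temporal ensembling improves accuracy on Khmer sentence-level sentiment classification by 3.89% over the supervised method.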
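
The temporal ensembling step named in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used formulation, in which each sample keeps an exponential moving average of its past predictions, corrected for start-up bias and used as a consistency target; the paper's actual settings are not given here, and `update_targets`, `consistency_loss`, and `alpha` are hypothetical names.

```python
def update_targets(ensemble, predictions, alpha, epoch):
    """Update per-sample running averages and return bias-corrected targets.

    ensemble:    list of lists, running average of past predictions per sample
    predictions: list of lists, class probabilities from the current epoch
    alpha:       momentum in [0, 1) (assumed hyperparameter)
    epoch:       1-based epoch counter
    """
    correction = 1.0 - alpha ** epoch  # removes the bias of the zero initialisation
    new_ensemble, targets = [], []
    for running, current in zip(ensemble, predictions):
        updated = [alpha * r + (1.0 - alpha) * c for r, c in zip(running, current)]
        new_ensemble.append(updated)
        targets.append([u / correction for u in updated])
    return new_ensemble, targets


def consistency_loss(predictions, targets):
    """Mean squared difference between current predictions and ensemble targets."""
    total, count = 0.0, 0
    for current, target in zip(predictions, targets):
        for p, t in zip(current, target):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In a training loop under these assumptions, the consistency loss would be computed for labeled and unlabeled samples alike and added to the supervised loss on the labeled subset.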

2.
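A sentiment lexicon with broad coverage and good domain adaptability can effectively improve text sentiment analysis. This paper designs Chinese sentiment lexicon expansion algorithms based on the linguistic features of conjunctions and on the statistical features of part-of-speech feature vectors, and proposes a hybrid-feature algorithm that combines the two. The algorithm computes fine-grained positive and negative polarity values for words, expands a general-purpose sentiment lexicon within a domain to improve coverage, and adjusts the lexicon within the domain to improve adaptability. Experimental results show that the lexicon obtained by in-domain expansion has better coverage and adaptability than the general-purpose sentiment lexicon, and that its performance on sentiment classification approaches that of supervised methods.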

3.
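Text sentiment analysis is currently a hot research topic in natural language processing, with broad practical value and theoretical significance. Sentiment lexicon construction is a basic task of text sentiment analysis: classifying words as positive, neutral, or negative according to their sentiment orientation. However, Chinese sentiment lexicon construction faces two main problems: 1) many sentiment words are polysemous or ambiguous, i.e., a word's semantic orientation differs across contexts, which complicates computing word sentiment; 2) as related work in China and abroad shows, relatively few resources are available for building Chinese sentiment lexicons. Considering that English sentiment analysis research offers abundant corpora and lexicons, this paper uses a machine translation system, combines constraint information from bilingual resources, and applies a label propagation (LP) algorithm to compute the sentiment information of words. Experimental results in four domains show that the method yields a Chinese sentiment lexicon with high classification accuracy that covers domain contexts.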
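
As a rough illustration of the label propagation idea mentioned above, the sketch below spreads polarity scores from a few seed words over a weighted word graph. It is a generic version only: the bilingual constraints and machine translation step of the paper are not modelled, and the graph layout, the `propagate` function, and the toy words are assumptions made for the example.

```python
def propagate(graph, seeds, iterations=20):
    """Spread polarity scores from seed words across a weighted graph.

    graph: dict mapping each word to a dict {neighbor: edge weight};
           every neighbor must also appear as a key of `graph`
    seeds: dict mapping seed words to fixed scores (+1 positive, -1 negative)
    """
    scores = {word: seeds.get(word, 0.0) for word in graph}
    for _ in range(iterations):
        updated = {}
        for word, neighbors in graph.items():
            total_weight = sum(neighbors.values())
            if word in seeds:
                updated[word] = seeds[word]  # seed labels stay clamped
            elif total_weight > 0:
                weighted = sum(w * scores[n] for n, w in neighbors.items())
                updated[word] = weighted / total_weight
            else:
                updated[word] = scores[word]  # isolated word keeps its score
        scores = updated
    return scores


toy_graph = {
    "good": {"fine": 1.0},
    "fine": {"good": 1.0, "bad": 0.5},
    "bad": {"fine": 0.5},
}
toy_scores = propagate(toy_graph, {"good": 1.0, "bad": -1.0})
```

Here "fine" ends up with a positive score because its edge to the positive seed is stronger than its edge to the negative one.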

4.
Semi-supervised learning has attracted a significant amount of attention in pattern recognition and machine learning. Most previous studies have focused on designing special algorithms to effectively exploit the unlabeled data in conjunction with labeled data. Our goal is to improve the classification accuracy of any given supervised learning algorithm by using the available unlabeled examples. We call this the semi-supervised improvement problem, to distinguish the proposed approach from the existing approaches. We design a meta semi-supervised learning algorithm that wraps around the underlying supervised algorithm and improves its performance using unlabeled data. This problem is particularly important when we need to train a supervised learning algorithm with a limited number of labeled examples and a multitude of unlabeled examples. We present a boosting framework for semi-supervised learning, termed SemiBoost. The key advantages of the proposed semi-supervised learning approach are: 1) performance improvement of any supervised learning algorithm with a multitude of unlabeled data, 2) efficient computation by the iterative boosting algorithm, and 3) exploiting both the manifold and cluster assumptions in training classification models. An empirical study on 16 different data sets and text categorization demonstrates that the proposed framework improves the performance of several commonly used supervised learning algorithms, given a large number of unlabeled examples. We also show that the performance of the proposed algorithm, SemiBoost, is comparable to the state-of-the-art semi-supervised learning algorithms.

5.
Sentiment analysis involves detecting the sentiment content of text using natural language processing. Natural language processing is a very challenging task due to syntactic ambiguities, named entity recognition, use of slang, jargon, sarcasm, abbreviations and contextual sensitivity. Sentiment analysis can be performed using supervised as well as unsupervised approaches. As the amount of data grows, unsupervised approaches become vital as they cut down on the learning time and the need for a labelled dataset. Sentiment lexicons provide an easy application of unsupervised algorithms for text classification. SentiWordNet is a lexical resource widely employed by many researchers for sentiment analysis and polarity classification. However, the reported performance levels need improvement. The proposed research is focused on raising the performance of SentiWordNet3.0 by using it as a labelled corpus to build another sentiment lexicon, named Senti-CS. The part-of-speech information, usage-based ranks and sentiment scores are used to calculate a Chi-Square-based feature weight for each unique subjective term/part-of-speech pair extracted from SentiWordNet3.0. This weight is then normalized to a range of -1 to +1 using min-max normalization. A Senti-CS based sentiment analysis framework is presented and applied to a large dataset of 50000 movie reviews. These results are then compared with baseline SentiWordNet, Mutual Information and Information Gain techniques. A state-of-the-art comparison is performed for the Cornell movie review dataset. The analysis of the results indicates that the proposed approach outperforms state-of-the-art classifiers.
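
The min-max normalization step mentioned above can be sketched as follows. This is the generic rescaling formula only; how the Chi-Square-based weights themselves are computed, and whether the authors treat signs separately, is not stated in the abstract. The function name `min_max_scale` and the sample weights are hypothetical.

```python
def min_max_scale(weights, low=-1.0, high=1.0):
    """Linearly rescale raw weights so the smallest maps to `low` and the largest to `high`."""
    smallest, largest = min(weights), max(weights)
    if largest == smallest:
        # All weights identical: map everything to the middle of the target range.
        return [(low + high) / 2.0 for _ in weights]
    span = largest - smallest
    return [low + (w - smallest) * (high - low) / span for w in weights]


scaled = min_max_scale([0.0, 5.0, 10.0])  # illustrative raw weights
```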

6.
Semi-supervised clustering is gaining importance because neither supervised nor unsupervised learning methods provide satisfactory results on their own. Existing semi-supervised clustering techniques are mostly based on pair-wise constraints, which can be misleading. These semi-supervised clustering algorithms also fail to handle attributes with different weights. In most real-life applications, not all attributes are equally important, so the same weight cannot be assigned to every attribute. In this paper, a novel distance-based semi-supervised clustering algorithm is proposed, which uses a functional link neural network (FLNN) to find attribute weights from a small amount of labeled data for further use in the parametric Minkowski’s model for clustering. In FLNN, the nonlinearity is captured by enhancing the input using orthonormal basis functions. The effectiveness of the approach is illustrated on a number of datasets taken from the UCI machine learning repository. Comparative performance evaluation demonstrates that the proposed approach outperforms existing semi-supervised clustering algorithms. The proposed approach has also been successfully used to cluster crime locations and to find crime hot spots in India using data provided by the National Crime Records Bureau (NCRB).

7.
付治  王红军  李天瑞  滕飞  张继 《软件学报》2020,31(4):981-990
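Clustering is a research hotspot in machine learning, and weakly supervised learning is an important research direction within semi-supervised learning with a wide range of applications. In studying clustering and weakly supervised learning, this paper proposes a weakly supervised learning framework based on k labeled samples. The framework first uses clustering and clustering confidence to expand the set of labeled samples. Second, the energy function of the restricted Boltzmann machine is modified, and a restricted Boltzmann machine learning model based on k labeled samples is proposed. Finally, the inference for the model is derived and the corresponding algorithm is designed. To test the framework and model, comparative experiments are conducted on public datasets; the results show that the weakly supervised learning framework based on k labeled samples performs well.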

8.
Twitter messages are increasingly used to determine consumer sentiment towards a brand. The existing literature on Twitter sentiment analysis uses various feature sets and methods, many of which are adapted from more traditional text classification problems. In this research, we introduce an approach to supervised feature reduction using n-grams and statistical analysis to develop a Twitter-specific lexicon for sentiment analysis. We augment this reduced Twitter-specific lexicon with brand-specific terms for brand-related tweets. We show that the reduced lexicon set, while significantly smaller (only 187 features), reduces modeling complexity, maintains a high degree of coverage over our Twitter corpus, and yields improved sentiment classification accuracy. To demonstrate the effectiveness of the devised Twitter-specific lexicon compared to a traditional sentiment lexicon, we develop comparable sentiment classification models using SVM. We show that the Twitter-specific lexicon is significantly more effective in terms of classification recall and accuracy metrics. We then develop sentiment classification models using the Twitter-specific lexicon and the DAN2 machine learning approach, which has demonstrated success in other text classification problems. We show that DAN2 produces more accurate sentiment classification results than SVM while using the same Twitter-specific lexicon.

9.
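Sentiment classification predicts the sentiment orientation conveyed by data by analyzing the sentiment information it contains. Combining linguistic lexicons with generative classifiers to build classification models with prior knowledge is an important research topic. By studying the domain dependence of sentiment words and their differing weights, this paper proposes a new sentiment classification method that incorporates sentiment prior knowledge. Domain-specific sentiment words and their weights are constructed through automatic analysis and, as sentiment prior knowledge, are incorporated into the generative classification model...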

10.
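Sentiment analysis has become a hot topic in natural language processing. Research on automated, semi-supervised sentiment analysis of text has broad theoretical and practical value. Sentiment orientation analysis based on sentiment lexicons is an important approach to text sentiment analysis. However, the sentiment orientation of Chinese words differs across domains, and polysemy is pronounced. Sentiment words in different domains are also specialized and domain-specific. To address these problems, this paper proposes a semi-supervised sentiment polarity judgment algorithm based on word vector similarity (Sentiment orientation from word vector, SO-WV), and designs a cross-domain Chinese sentiment lexicon construction method based on it. Experiments show that the proposed lexicon construction method can effectively judge the sentiment orientation of sentiment words. The algorithm is highly portable across domains when building sentiment lexicons, while also capturing their specialized, domain-specific characteristics.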
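
One plausible reading of a word-vector-similarity polarity score is sketched below: a word's orientation is its mean cosine similarity to positive seed words minus its mean cosine similarity to negative seed words. The abstract does not give the SO-WV formula, so this scoring rule, the function names, and the toy two-dimensional vectors are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def orientation(word_vec, positive_seeds, negative_seeds):
    """Positive result suggests positive polarity, negative result negative polarity."""
    pos = sum(cosine(word_vec, s) for s in positive_seeds) / len(positive_seeds)
    neg = sum(cosine(word_vec, s) for s in negative_seeds) / len(negative_seeds)
    return pos - neg


# Toy vectors standing in for trained word embeddings.
positive_seeds = [[1.0, 0.0]]
negative_seeds = [[0.0, 1.0]]
score = orientation([1.0, 0.0], positive_seeds, negative_seeds)
```

Under this reading, building a lexicon for a new domain would mean retraining the embeddings on that domain's corpus while reusing the same seed words.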

11.
Sentiment analysis is a challenging task that has attracted increasing interest in recent years. The availability of online data, along with the business interest in keeping up with consumer feedback, generates a constant demand for online analysis of user-generated content. Domain-specific lexicons of opinion words play a key role in this task, as they enable algorithms to classify short snippets of text into sentiment classes (positive, negative). This process is known as dictionary-based sentiment analysis. Related work tends to solve this lexicon identification problem either by exploiting a corpus and a thesaurus or by manually defining a set of patterns that extract opinion words. In this work, we propose an unsupervised approach for discovering patterns that extract a domain-specific dictionary. Our approach (DidaxTo) utilizes opinion modifiers, sentiment consistency theories, polarity assignment graphs and pattern similarity metrics. The outcome is compared against lexicons extracted by state-of-the-art approaches on a sentiment analysis task. Experiments on user reviews from a diverse set of products demonstrate the utility of the proposed method. An implementation of the proposed approach, in an easy-to-use application for extracting opinion words from any domain and evaluating their quality, is also presented.

12.
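To combine the advantages of lexicon-based and machine-learning-based sentiment classification, this paper proposes an aspect-level sentiment classification method based on sentiment vocabulary and machine learning. A small number of sentiment words whose orientation is independent of the evaluation target are selected to classify the sentiment of evaluation collocations. A machine learning classifier is then built, and the share of mutual information between the evaluation phrase and each class is used as a weight on the classifier's class probabilities; the class with the highest weighted probability is chosen as the sentiment orientation of the evaluation collocation. Experiments on a Chinese review dataset show that the method effectively improves sentiment classification performance.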

13.
Extreme learning machine (ELM) works for generalized single-hidden-layer feedforward networks (SLFNs), and its essence is that the hidden layer of SLFNs need not be tuned. But ELM only utilizes labeled data to carry out the supervised learning task. In order to exploit unlabeled data in the ELM model, we first extend the manifold regularization (MR) framework and then demonstrate the relation between the extended MR framework and ELM. Finally, a manifold regularized extreme learning machine is derived from the proposed framework, which maintains the properties of ELM and is applicable to large-scale learning problems. Experimental results show that the proposed semi-supervised extreme learning machine is the most cost-efficient method. It tends to scale better and to achieve satisfactory generalization performance at a faster learning speed than traditional semi-supervised learning algorithms.

14.
The literature in sentiment analysis has widely assumed that semantic relationships between words cannot be effectively exploited to produce satisfactory sentiment lexicon expansions. This assumption stems from the fact that words considered to be “close” in a semantic space (e.g., word embeddings) may present completely opposite polarities, which might suggest that sentiment information in such spaces is either too faint, or at least not readily exploitable. Our main contribution in this paper is a rigorous and robust challenge to this assumption: by proposing a set of theoretical hypotheses and corroborating them with strong experimental evidence, we demonstrate that semantic relationships can be effectively used for good lexicon expansion. Based on these results, our second contribution is a novel, simple, and yet effective lexicon-expansion strategy based on semantic relationships extracted from word embeddings. This strategy is able to substantially enhance the lexicons, whilst overcoming the major problem of lexicon coverage. We present an extensive experimental evaluation of sentence-level sentiment analysis, comparing our approach to sixteen state-of-the-art (SOTA) lexicon-based and five lexicon expansion methods, over twenty datasets. Results show that in the vast majority of cases our approach outperforms the alternatives, achieving coverage of almost 100% and gains of about 26% against the best baselines. Moreover, our unsupervised approach performed competitively against SOTA supervised sentiment analysis methods, mainly in scenarios with scarce information. Finally, in a cross-dataset comparison, our approach turned out to be as competitive as (i.e., statistically tie with) state-of-the-art supervised solutions such as pre-trained transformers (BERT), even without relying on any training (labeled) data. Indeed in small datasets or in datasets with scarce information (short messages), our solution outperformed the supervised ones by large margins.  

15.
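Software defect prediction helps improve software development quality and ensures that testing resources are allocated effectively. To address the difficulty of obtaining class-labeled data and the class imbalance problem in software defect prediction, this paper proposes a sampling-based semi-supervised support vector machine prediction model. The model uses an unsupervised sampling technique to ensure that the number of defective samples among the labeled data is not too low, and uses a semi-supervised support vector machine to build the prediction model from a small amount of labeled data together with information from unlabeled data. Simulation experiments are conducted on the public NASA software defect prediction datasets. The results show that the proposed method outperforms existing semi-supervised methods on both F-measure and recall, and achieves performance comparable to supervised methods with fewer training samples.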

16.
罗浩然  杨青 《计算机应用》2022,42(4):1099-1107
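Sentiment analysis, a subfield of natural language processing (NLP), has evolved from sentiment lexicons to machine learning and then to deep learning. Generic deep learning models used as text classifiers show low accuracy on domain-specific online review text and overfit during training, while sentiment lexicons suffer from low coverage and a heavy compilation workload. To address these problems, a sentiment analysis model based on a sentiment lexicon and a stacked residual bidirectional long short-term memory (Bi-LSTM) network is proposed. First, the sentiment words in the lexicon are designed to cover the specialized vocabulary of the "educational robot" research field, compensating for the Bi-LSTM model's insufficient accuracy on such text. Then, Bi-LSTM and SnowNLP are used to reduce the size of the lexicon that has to be compiled. The "memory gate" and "forget gate" structure of the long short-term memory (LSTM) network takes full account of the relationships between preceding and following words in a review while forgetting some already-analyzed words when appropriate, thereby avoiding the gradient explosion problem during backpropagation. Introducing the stacked residual Bi-LSTM not only deepens the model to 8 layers; the residual connections also avoid the "degradation" problem caused by stacking LSTMs. Finally, the score weights of the two parts are set and tuned, the total score is normalized to the interval [0,1] with the Sigmoid activation function, and the intervals [0,0.5] and (0.5,1] denote negative and positive sentiment respectively, completing the classification. Experimental results show that, on the "educational robot" review dataset, the proposed model improves sentiment classification accuracy by about 4.5 percentage points over the standard LSTM model and by about 2.0 percentage points over BERT. In summary, the proposed model generalizes sentiment classification methods that combine sentiment lexicons with deep learning models; by modifying the sentiment words in the lexicon and adjusting the structure and depth of the deep learning model, it can be applied to accurate sentiment analysis of reviews for all kinds of products on e-commerce platforms, helping companies understand consumers' shopping psychology and market demand while also giving consumers a reference for product quality.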
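
The final fusion step described above (weighting the two scores, squashing the total with a sigmoid, and splitting at 0.5) can be sketched as follows. The default weights, the function name `classify`, and the input score scale are assumptions; the abstract does not report the values the authors used.

```python
import math


def classify(lexicon_score, model_score, w_lexicon=0.5, w_model=0.5):
    """Combine a lexicon score and a network score into a sentiment label.

    The weighted total is mapped to [0, 1] with a sigmoid; values in
    [0, 0.5] are labelled negative and values in (0.5, 1] positive.
    """
    total = w_lexicon * lexicon_score + w_model * model_score
    # Numerically stable sigmoid: avoids overflow for large negative totals.
    if total >= 0:
        prob = 1.0 / (1.0 + math.exp(-total))
    else:
        e = math.exp(total)
        prob = e / (1.0 + e)
    label = "positive" if prob > 0.5 else "negative"
    return prob, label
```

A total of exactly zero gives a probability of 0.5, which falls in the closed interval [0, 0.5] and is therefore labelled negative, matching the interval split in the abstract.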

17.
Semi-supervised dimensionality reduction has attracted an increasing amount of attention in this big-data era. Many algorithms have been developed that use a small number of pairwise constraints to achieve performance comparable to that of fully supervised methods. However, one challenging problem with semi-supervised approaches is the appropriate choice of the constraint set, including its cardinality and composition, which to a large extent affects the performance of the resulting algorithm. In this work, we address the problem by incorporating ensemble subspaces and active learning into dimensionality reduction and propose a new global and local scatter based semi-supervised dimensionality reduction method with active constraint selection. Unlike traditional methods that select the supervised information in one subspace, we select pairwise constraints in ensemble subspaces, where a novel active learning algorithm is designed with both exploration and filtering to generate informative pairwise constraints. The automatic constraint selection approach proposed in this paper can be generalized to all constraint-based semi-supervised learning algorithms. Comparative experiments are conducted on four face databases, and the results validate the effectiveness of the proposed method.

18.
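Feature selection aims to reduce the dimensionality of the data to be processed and remove redundant features; it is one of the key problems in machine learning. Existing semi-supervised feature selection methods generally rely on graph models to extract the clustering structure of a dataset, but the extracted clustering structure lacks clear boundaries, which degrades feature selection. To address this, a semi-supervised feature selection method based on sparse graph representation is proposed. It builds a joint learning model of clustering structure and feature selection, constrains the graph model with the l_1 norm to obtain a clear clustering structure, and introduces the l_2,1 norm to avoid interference from noise and improve the accuracy of feature selection. To verify the method's effectiveness, it is compared with several currently popular feature selection methods; the experimental results demonstrate its effectiveness.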
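
For readers unfamiliar with the two norms named above, the sketch below computes them for a small matrix stored as a list of rows. It uses the standard definitions (the l_2,1 norm is the sum of the Euclidean norms of the rows; the l_1 norm is the sum of absolute entries). Which matrices the paper applies them to is only loosely indicated in the abstract, so the example matrix is purely illustrative.

```python
import math


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows; small when whole rows are zero."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)


def l1_norm(matrix):
    """Sum of the absolute values of all entries; small when individual entries are zero."""
    return sum(abs(x) for row in matrix for x in row)


# Illustrative matrix: one active row and one all-zero row.
W = [[3.0, 4.0], [0.0, 0.0]]
row_sparsity = l21_norm(W)
entry_sparsity = l1_norm(W)
```

Penalising the l_2,1 norm drives entire rows to zero, which is why it is commonly used when each row corresponds to one feature and unselected features should drop out entirely.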

19.
李志恒 《计算机应用研究》2021,38(2):591-594,599
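To address the mismatch between the probability distributions of training and test samples in machine learning, a semi-supervised domain adaptation method based on dropout regularization is proposed to transfer a neural network's feature representation from a label-rich source domain to an unlabeled target domain. From a semi-supervised learning perspective, the method adds a small amount of labeled target-domain data to the source-domain data, so that the network learns the feature distribution of the target domain as well as that of the source domain. Guided by this prior knowledge, the network can still fit the target-domain data well even without abundant label information. Experimental results show that the algorithm outperforms other existing algorithms on domain adaptation tasks among several typical digit datasets (SVHN, MNIST, and USPS), and is robust on domain adaptation tasks between CIFAR-10 and STL-10, real-world datasets covering a wide range of natural categories.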

20.
A survey of automatic sentiment lexicon construction methods   Total citations: 13 (self-citations: 1, citations by others: 12)
王科  夏睿 《自动化学报》2016,42(4):495-511
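As an important tool for judging the sentiment orientation of words and texts, the sentiment lexicon and its automatic construction have become an important research topic in sentiment analysis and opinion mining. This paper surveys existing Chinese and English sentiment lexicon resources, summarizes existing construction methods for English and Chinese sentiment lexicons from the perspectives of knowledge bases, corpora, and the combination of the two, analyzes the strengths and weaknesses of each method, and identifies several difficult problems in sentiment lexicon construction. We then review methods for evaluating sentiment lexicon performance and related evaluation campaigns. Finally, we summarize the outlook for sentiment lexicon construction and some problems that urgently need to be solved.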
