Similar literature
 20 similar documents found (search time: 109 ms)
1.
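Given that sentiment sentences of the same type tend to share the same or similar syntactic and semantic expression patterns, this paper proposes an automatic text sentiment classification method based on sentiment sentence patterns. First, sentence patterns related to sentiment expression are manually divided into 3 major categories and 105 second-level subcategories. Next, a pattern acquisition method that uses dependency, syntactic, and synonym features is designed to extract sentiment sentence patterns semi-automatically from annotated sentiment sentences. Finally, text sentiment classification is performed by classifying each input sentence according to its sentiment sentence pattern. Experiments on the NLP&CC2013 Chinese microblog emotion classification evaluation corpus and the RenCECps blog corpus show that the proposed method achieves significantly higher accuracy than a support vector machine classifier based on word features.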

2.
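This paper proposes an ensemble learning framework for sentiment classification of Chinese online reviews. First, to handle the complexity and diversity of Chinese online reviews, part-of-speech combination patterns, frequent word sequence patterns, and order-preserving submatrix patterns are used as input features. Then, a random subspace algorithm based on information gain is adopted to cope with the large number of text features while improving the classification performance of the base classifiers. Finally, a base-classifier construction algorithm based on product attributes aggregates the sentiment information of each attribute in the review text and determines the sentence-level sentiment orientation of the review. Experimental results demonstrate the effectiveness of the framework on sentiment classification of Chinese online reviews; in particular, accuracy reaches 90.3% with the Logistic Regression classification algorithm.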

3.
方丁, 王刚. 《计算机系统应用》 (Computer Systems & Applications), 2012, 21(7): 177-181, 248
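With the rapid development of Web 2.0, more and more users are willing to share their opinions and experiences on the Internet. This review information is growing rapidly, and manual methods alone cannot cope with collecting and processing the massive volume of online information. Computer-based text sentiment classification has therefore emerged, and one focus of research is improving classification accuracy. Because ensemble learning is an effective way to improve classification accuracy and has outperformed single classifiers in many fields, a text sentiment classification method based on ensemble learning is proposed. Experimental results show that three commonly used ensemble learning methods, Bagging, Boosting, and Random Subspace, all improve the accuracy of the base classifiers, and that, across different base classifiers, Random Subspace is statistically superior to Bagging and Boosting. These results further confirm the effectiveness of applying ensemble learning to text sentiment classification.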
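To make the Random Subspace idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the nearest-centroid base learner, the function names, and the default parameters are all assumptions chosen to keep the example self-contained. Each base learner sees only a random subset of feature indices, and the ensemble predicts by majority vote.

```python
import random
from collections import Counter


def train_centroids(X, y, features):
    """Fit a nearest-centroid base learner on a subset of feature indices."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, features, row):
    """Return the label whose centroid is closest on the selected features."""
    def dist(centroid):
        return sum((row[f] - centroid[j]) ** 2 for j, f in enumerate(features))
    return min(centroids, key=lambda label: dist(centroids[label]))


def random_subspace_fit(X, y, n_estimators=5, subspace_size=2, seed=0):
    """Train n_estimators base learners, each on a random feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_estimators):
        features = rng.sample(range(n_features), subspace_size)
        ensemble.append((features, train_centroids(X, y, features)))
    return ensemble


def random_subspace_predict(ensemble, row):
    """Combine the base learners' predictions by majority vote."""
    votes = Counter(predict_centroid(c, f, row) for f, c in ensemble)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated classes in four dimensions.
X = [[0, 0, 0, 0], [1, 1, 1, 1], [10, 10, 10, 10], [9, 9, 9, 9]]
y = [0, 0, 1, 1]
ensemble = random_subspace_fit(X, y)
print(random_subspace_predict(ensemble, [0.2, 0.2, 0.2, 0.2]))
```

Bagging would instead resample the training rows for each base learner, and Boosting would reweight them sequentially; only the sampling step differs from the sketch above.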

4.
As an important attribute of proteins, protein subcellular location(s) can provide valuable information about their functions. Determining protein subcellular locations using experimental methods is usually expensive and time-consuming. Over the years, a variety of computational approaches have been developed to predict protein subcellular locations based on knowledge of known protein locations. However, the problem is inherently hard, especially for proteins that can exist at multiple subcellular locations, and further studies are still greatly needed in this area. In this paper, we propose an ensemble learning framework that utilizes a modified Weighted K-Nearest Neighbors (WKNN) as the basic learning algorithm. Two different types of features are extracted from the training data, based on protein amino acid compositions (Amphiphilic Pseudo Amino Acid Composition, or AmPseAAC) and protein sequence similarities (Protein Similarity Measure, or PSM), respectively. Two individual classifiers are trained separately on these two types of features, and each assigns a probability distribution over different locations to a query protein. Based on the outputs of the two base classifiers, a novel ensemble strategy named Maximized Probability on Label (MPoL) is proposed. The strategy produces a final set of protein locations for each protein by integrating the prediction results of the base classifiers through an optimization procedure. To measure the prediction quality of the proposed approach, two different types of evaluation metrics, example-based metrics and label-based metrics, are used. To evaluate the performance of our approach objectively, we compare its results with those predicted by another popular method named iLoc-Animal on a benchmark dataset through cross-validation. Results show that, in terms of absolute true success rate on multi-location prediction, MPoL achieves much better results than iLoc-Animal. This suggests that the proposed method has some potential to solve a diverse set of multi-label learning problems.
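The "absolute true success rate" mentioned above is commonly understood as the fraction of proteins whose predicted label set matches the true label set exactly. The sketch below illustrates that reading together with one other example-based metric (mean Jaccard overlap); the function names and the toy labels are assumptions, and the paper's exact metric definitions may differ.

```python
def absolute_true_rate(true_sets, predicted_sets):
    """Fraction of examples whose predicted label set equals the true set exactly."""
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true_sets and predicted_sets must have the same length")
    exact = sum(1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p))
    return exact / len(true_sets)


def mean_jaccard(true_sets, predicted_sets):
    """Example-based accuracy: mean of |T & P| / |T | P| over all examples."""
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true_sets and predicted_sets must have the same length")
    total = 0.0
    for t, p in zip(true_sets, predicted_sets):
        t, p = set(t), set(p)
        union = t | p
        # Two empty label sets are treated as a perfect match (an assumption).
        total += len(t & p) / len(union) if union else 1.0
    return total / len(true_sets)


# Hypothetical labels for two proteins; the second prediction misses one location.
truth = [{"nucleus"}, {"cytoplasm", "nucleus"}]
preds = [{"nucleus"}, {"cytoplasm"}]
print(absolute_true_rate(truth, preds), mean_jaccard(truth, preds))
```

The exact-match metric is strict: a prediction that recovers one of two true locations counts as a miss, which is why it is a demanding measure for multi-location proteins.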

5.
李卫疆, 漆芳. 《中文信息学报》 (Journal of Chinese Information Processing), 2019, 33(12): 119-128
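A large body of linguistic knowledge and sentiment resources now exists, but this sentiment-specific information is not fully exploited in deep-learning-based sentiment analysis. To address this problem, this paper proposes a sentiment analysis model based on a multi-channel bidirectional long short-term memory network (multi-channels bidirectional long short term memory network, Multi-Bi-LSTM). The model encodes the existing linguistic knowledge and sentiment resources for the sentiment analysis task into different feature channels so that it can fully learn the sentiment information in a sentence. Compared with a CNN, the Bi-LSTM used in the model takes into account the dependencies within the word sequence and can capture the contextual semantics of a sentence, giving the model access to more sentiment information. Experiments on the Chinese COAE2014 dataset and the English MR and SST datasets show better performance than a plain Bi-LSTM, a convolutional neural network combined with sentiment sequence features, and traditional classifiers.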

6.
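In recent years, graph neural networks have attracted wide attention for their rich representation and reasoning capabilities. However, current research focuses on adjusting convolution strategies and network structures to obtain higher performance, and is therefore inevitably constrained by the limitations of a single model. Inspired by ensemble learning, an ensemble learning framework for graph neural networks (EL-GNN) is proposed. Unlike conventional text and image data, graph data contain rich topological structure information in addition to feature information. EL-GNN therefore not only fuses the predictions of different base classifiers but also supplements structural information at the ensemble stage. In addition, based on the prior assumption that nodes with similar features or nodes that are structural neighbors usually have similar labels, the ensemble strategy is further optimized through feature graph reconstruction, balancing the feature and structural information of nodes. Extensive experiments show that the proposed ensemble strategy is effective and that EL-GNN significantly outperforms existing models on node classification tasks.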

7.
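Sentiment classification has long been an important research topic in natural language processing and is widely applied in scenarios such as e-commerce reviews, popular forums, and public opinion analysis. Improving the performance of sentiment classification models remains a key research problem in sentiment analysis. Ensemble learning is an effective way to improve overall model performance by combining several classifiers. Based on granular computing and three-way decisions, and drawing on the advantages of ensemble learning, a multi-granularity sequential three-way decision model combined with ensemble learning is constructed. An N-gram language model is used to build a multi-granularity structure of the text, which forms the basis for sequential three-way sentiment classification; at each granularity, three classification algorithms are combined to improve classification at that granularity. The proposed method is validated experimentally on four datasets. The results show that the method not only improves overall classification performance but also reduces classification cost.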
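The sequential three-way decision described above can be sketched as follows. This is only an illustration under stated assumptions: the thresholds `alpha` and `beta`, the function names, and the forced two-way decision at the finest granularity are hypothetical, and the positive-class probability at each granularity is assumed to be supplied (for example, averaged over the three classifiers combined at that level).

```python
def three_way(prob, alpha, beta):
    """Accept, reject, or defer based on the positive-class probability."""
    if prob >= alpha:
        return "positive"
    if prob <= beta:
        return "negative"
    return "deferred"  # boundary region: not enough evidence at this granularity


def sequential_three_way(probs_by_granularity, alpha=0.8, beta=0.2):
    """Walk from coarse to fine granularity until a definite decision is reached.

    probs_by_granularity: positive-class probabilities, coarsest level first,
    e.g. from unigram, bigram, and trigram representations of the same text.
    Returns (decision, level_index).
    """
    if not probs_by_granularity:
        raise ValueError("at least one granularity level is required")
    for level, prob in enumerate(probs_by_granularity):
        decision = three_way(prob, alpha, beta)
        if decision != "deferred":
            return decision, level
    # Finest level reached without a definite decision: fall back to two-way.
    last = len(probs_by_granularity) - 1
    final = "positive" if probs_by_granularity[last] >= 0.5 else "negative"
    return final, last


print(sequential_three_way([0.5, 0.9]))
```

Texts decided at a coarse level never reach the finer, more expensive levels, which is one plausible way such a scheme reduces classification cost.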

8.
In this paper we introduce a framework for making statistical inference on the asymptotic prediction of parallel classification ensembles. The validity of the analysis is fairly general. It only requires that the individual classifiers are generated in independent executions of some randomized learning algorithm, and that the final ensemble prediction is made via majority voting. Given an unlabeled test instance, the predictions of the classifiers in the ensemble are obtained sequentially. As the individual predictions become known, Bayes' theorem is used to update an estimate of the probability that the class predicted by the current ensemble coincides with the classification of the corresponding ensemble of infinite size. Using this estimate, the voting process can be halted when the confidence in the asymptotic prediction is sufficiently high. An empirical investigation on several benchmark classification problems shows that most of the test instances require querying only a small number of classifiers to converge to the infinite ensemble prediction with a high degree of confidence. For these instances, the difference between the generalization error of the finite ensemble and the infinite ensemble limit is very small, often negligible.
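A minimal sketch of this kind of early stopping, assuming a binary problem and a uniform prior on the unknown fraction p of classifiers that vote for class 1 (both assumptions are ours, not stated in the abstract). After k votes for class 1 out of n, the posterior on p is Beta(k+1, n-k+1), and for integer parameters P(p > 1/2) reduces to a binomial sum. The function names are hypothetical.

```python
from math import comb


def prob_infinite_majority_is_class1(k, n):
    """P(p > 0.5 | k of n votes for class 1) under a uniform prior on p.

    Uses the identity P(Beta(k+1, n-k+1) > 0.5) = P(Binomial(n+1, 0.5) <= k).
    """
    return sum(comb(n + 1, j) for j in range(k + 1)) / 2 ** (n + 1)


def query_until_confident(votes, confidence=0.99):
    """Poll classifiers one at a time; stop once the posterior is confident.

    votes: iterable of 0/1 predictions from independently trained classifiers.
    Returns (predicted_class, classifiers_queried, posterior_confidence).
    """
    k = n = 0
    p1 = 0.5
    for vote in votes:
        n += 1
        k += vote
        p1 = prob_infinite_majority_is_class1(k, n)
        if max(p1, 1.0 - p1) >= confidence:
            break
    return (1 if p1 >= 0.5 else 0), n, max(p1, 1.0 - p1)


# With unanimous votes the posterior is 1 - 2**-(n+1), so six votes reach 99%.
print(query_until_confident([1] * 100))
```

Instances on which the classifiers disagree need more queries before the posterior clears the threshold, which matches the abstract's observation that most, but not all, instances converge after only a few votes.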

9.
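Machine learning and deep learning techniques can be used to solve many problems in medical classification and prediction; some classification algorithms achieve high prediction accuracy, while others have limited accuracy. An ensemble learning algorithm based on the C-AdaBoost model is proposed to predict breast cancer, and the optimal feature combinations for determining whether breast cancer recurs and whether a breast tumor is benign are identified. The existing features are selected a second time by stepwise regression, and combining this with the C-AdaBoost model further improves prediction. Extensive experiments show that the prediction accuracy of the C-AdaBoost-based algorithm exceeds that of machine learning classifiers such as SVM, Naive Bayes, RandomForest, and traditional ensemble learning models by up to 19.5%, which can better support doctors in clinical decision-making.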

10.

Human-centric driver assistance systems with integrated sensing, processing and networking aim to find solutions for traffic accidents and other relevant issues. The key technology for developing such a system is the capability of automatically understanding and characterizing driver behaviors. This paper proposes a novel driving posture recognition approach, which consists of an efficient combined feature extraction and a random subspace ensemble of multilayer perceptron classifiers. A Southeast University Driving Posture Database (SEU-DP Database) has been created for training and testing the proposed approach. The data set contains driver images of (1) grasping the steering wheel, (2) operating the shift lever, (3) eating a cake and (4) talking on a cellular phone. Combining spatial scale features and histogram-based features, comparative holdout and cross-validation experiments on driving posture classification are conducted. The experimental results indicate that the proposed combined feature extraction approach with a random subspace ensemble of multilayer perceptron classifiers outperforms the two individual feature extraction approaches. The experiments also suggest that talking on a cellular phone is the most difficult of the four predefined postures to classify. Using the proposed approach, the classification accuracy for talking on a cellular phone is over 89% in both holdout and cross-validation experiments. These results show the effectiveness of the proposed combined feature extraction approach and random subspace ensemble of multilayer perceptron classifiers in automatically understanding and characterizing driver behaviors toward human-centric driver assistance systems.

11.
乔善平, 闫宝强. 《计算机应用》 (Journal of Computer Applications), 2016, 36(8): 2150-2156
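Because the application of multi-label learning and ensemble learning to predicting multiple subcellular locations of proteins is still immature, a prediction method based on ensemble multi-label learning is studied. First, from the perspective of combining multi-label learning with ensemble learning, a three-layer ensemble multi-label learning system framework is proposed. The framework classifies learning algorithms and classifiers hierarchically and integrates binary classification, multi-class classification, multi-label learning, and ensemble learning into a general-purpose three-layer ensemble multi-label learning model. Second, the system model is designed using object-oriented technology and the Unified Modeling Language (UML), giving the system good extensibility so that its functionality and performance can be enhanced through extension. Finally, the model is extended in Java to implement a learning system, which is successfully applied to predicting multiple subcellular locations of proteins. Tests on a Gram-positive bacteria dataset verify the operability of the system's functions and its good prediction performance; the system can serve as an effective tool for this prediction problem.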

12.
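Deep semantic analysis of text has become a hot topic in natural language processing in recent years, and information extraction and attribute recognition are important tasks in text semantic analysis. Following the recent success of machine learning in natural language processing, some researchers have applied it to information extraction in the medical domain and obtained better results on standard test sets than traditional statistical methods; however, these models still suffer from problems such as insufficient information acquisition. Building on existing work, this paper therefore proposes a deep neural network model that integrates a bidirectional LSTM with an MLP. For the medical event extraction and event attribute prediction tasks of SemEval 2016, the model treats the part-of-speech and named-entity descriptions of the medical text as additional attributes and uses a bidirectional LSTM to learn hidden features of the text, overcoming the poor generality of traditional methods and their inability to capture implicit information in the surrounding context. Fully connected layers are then used to decide whether a candidate word is a medical event and to identify its related attributes. Experimental results show that the proposed neural network model extracts information from medical text more effectively than previous methods.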

13.
In this paper, the concept of finding an appropriate classifier ensemble for named entity recognition is posed as a multiobjective optimization (MOO) problem. Our underlying assumption is that, instead of searching for the best-fitting feature set for a particular classifier, ensembling several classifiers that are trained using different feature representations could be a more fruitful approach, but it is crucial to determine the subset of classifiers that is most suitable for the ensemble. We use three heterogeneous classifiers, namely maximum entropy, conditional random field, and support vector machine, to build a number of models depending on the various representations of the available features. The proposed MOO-based ensemble technique is evaluated for three resource-constrained languages, namely Bengali, Hindi, and Telugu. Evaluation results yield recall, precision, and F-measure values of 92.21, 92.72, and 92.46%, respectively, for Bengali; 97.07, 89.63, and 93.20%, respectively, for Hindi; and 80.79, 93.18, and 86.54%, respectively, for Telugu. We also evaluate our proposed technique on the CoNLL-2003 shared task English data sets, which yield recall, precision, and F-measure values of 89.72, 89.84, and 89.78%, respectively. Experimental results show that the classifier ensemble identified by our proposed MOO-based approach outperforms all the individual classifiers, two different conventional baseline ensembles, and the classifier ensemble identified by a single-objective-based approach. In a part of the paper, we formulate the problem of feature selection in any classifier under the MOO framework and show that our proposed classifier ensemble attains superior performance to it.
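The reported F-measure values are consistent with the standard balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes it from the recall and precision figures quoted above; the function name and the dictionary layout are ours, and we assume the balanced (beta = 1) form.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-measure), all in percent, as quoted in the abstract.
reported = {
    "Bengali": (92.21, 92.72, 92.46),
    "Hindi": (97.07, 89.63, 93.20),
    "Telugu": (80.79, 93.18, 86.54),
    "English (CoNLL-2003)": (89.72, 89.84, 89.78),
}

for language, (recall, precision, reported_f) in reported.items():
    computed = f_measure(precision, recall)
    print(f"{language}: computed {computed:.2f}, reported {reported_f:.2f}")
```

Note how the harmonic mean penalizes imbalance: Telugu has the highest precision of the three Indian languages but the lowest F-measure because its recall is much lower.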

14.
Yi Mao, Guy Lebanon. Machine Learning, 2009, 77(2-3): 225-248
Conditional random fields are one of the most popular structured prediction models. Nevertheless, the problem of incorporating domain knowledge into the model is poorly understood and remains an open issue. We explore a new approach for incorporating a particular form of domain knowledge through generalized isotonic constraints on the model parameters. The resulting approach has a clear probabilistic interpretation and efficient training procedures. We demonstrate the applicability of our framework with an experimental study on sentiment prediction and information extraction tasks.

15.
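Click fraud has been one of the most common forms of cybercrime in recent years, and the Internet advertising industry suffers huge losses from it every year. To detect fraudulent clicks effectively among massive numbers of clicks, a variety of features that fully combine ad clicks with temporal attributes are constructed, and an ensemble learning framework for click fraud detection, the CAT-RFE ensemble learning framework, is proposed. The framework consists of three parts: a base classifier, recursive feature elimination (RFE), and voting ensemble learning. CatBoost (categorical boosting), a gradient boosting model suited to categorical features, serves as the base classifier; RFE is a greedy feature selection method that can pick good feature combinations from multiple feature sets; and voting ensemble learning combines the results of multiple base classifiers by voting. The framework uses CatBoost and RFE to obtain several good feature combinations in the feature space, then combines the models trained on those feature combinations by voting to produce the ensemble click fraud detection result. Because the framework uses the same base classifier and ensemble method throughout, it avoids the poor ensemble results that arise when widely differing classifiers constrain one another, and it also mitigates the tendency of RFE to fall into local optima during feature selection, giving it better detection capability. Performance evaluation and comparative experiments on a real-world Internet click fraud dataset show that the CAT-RFE ensemble learning framework detects click fraud better than the CatBoost model, the combined CatBoost and RFE model, and other machine learning models, demonstrating that the framework is competitive. The framework offers a feasible solution for click fraud detection in Internet advertising.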
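The two generic building blocks named above, greedy recursive feature elimination and hard voting, can be sketched without any particular model library. This is an illustration only: in the paper the importances would come from a CatBoost model refit on each candidate feature set, whereas here `importance_fn` is a hypothetical callback and the feature names and scores are invented for the example.

```python
from collections import Counter


def recursive_feature_elimination(features, importance_fn, n_keep):
    """Greedily drop the least important feature until n_keep remain.

    importance_fn(selected) stands in for refitting a model on the current
    feature subset and returning {feature: importance}.
    """
    selected = list(features)
    while len(selected) > n_keep:
        scores = importance_fn(selected)
        weakest = min(selected, key=lambda f: scores[f])
        selected.remove(weakest)
    return selected


def majority_vote(predictions):
    """Hard voting: return the most common prediction among base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical, fixed importances; a real run would recompute them per subset.
fixed = {"click_hour": 0.4, "ip": 0.3, "device": 0.1, "app": 0.2}
kept = recursive_feature_elimination(
    list(fixed), lambda feats: {f: fixed[f] for f in feats}, n_keep=2
)
print(kept)                      # features surviving elimination
print(majority_vote([1, 0, 1]))  # 1 = fraud under this toy labeling
```

Running the elimination several times with different targets or starting sets yields several feature combinations; training one base classifier per combination and voting over them is the overall shape the abstract describes.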

16.
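With the growth of online shopping, the Web has accumulated a large volume of product review text that contains rich evaluative knowledge. Effectively extracting product features and sentiment words from these reviews, and then obtaining feature-level sentiment orientation, is the key to fine-grained sentiment analysis of product reviews. Based on the characteristics of Chinese product review text, this paper obtains semantic relations between words from several perspectives, including syntactic analysis, word-sense understanding, and context, and embeds them as constraint knowledge into a topic model. The resulting semantic relation constrained topic model, SRC-LDA (semantic relation constrained LDA), performs semantically guided extraction of fine-grained topic words with LDA. Because SRC-LDA improves on standard LDA's ability to understand and recognize the semantics of topic words, it increases the relatedness of topic words assigned to the same topic and the distinctness of topic words assigned to different topics, and can discover more fine-grained feature words, sentiment words, and the semantic associations between them. Experiments show that SRC-LDA performs well at discovering and extracting fine-grained features and sentiment words.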

17.
The global prediction of a homogeneous ensemble of classifiers generated in independent applications of a randomized learning algorithm on a fixed training set is analyzed within a Bayesian framework. Assuming that majority voting is used, it is possible to estimate with a given confidence level the prediction of the complete ensemble by querying only a subset of classifiers. For a particular instance that needs to be classified, the polling of ensemble classifiers can be halted when the probability that the predicted class will not change when taking into account the remaining votes is above the specified confidence level. Experiments on a collection of benchmark classification problems using representative parallel ensembles, such as bagging and random forests, confirm the validity of the analysis and demonstrate the effectiveness of the instance-based ensemble pruning method proposed.
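For a finite ensemble, the stopping probability can be sketched with a beta-binomial predictive distribution. The assumptions here are ours: two classes, a uniform prior on the fraction of classifiers voting for class 1, and an odd ensemble size so that ties cannot occur. Given k votes for class 1 among the n classifiers polled so far, the function returns the exact probability that class 1 wins the full vote of `total` classifiers.

```python
from fractions import Fraction
from math import comb, factorial


def beta_int(a, b):
    """Beta function B(a, b) for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def prob_final_majority_is_class1(k, n, total):
    """P(class 1 wins the full vote | k of the first n votes are for class 1).

    The number m of class-1 votes among the remaining total - n classifiers
    follows a beta-binomial distribution with parameters (k + 1, n - k + 1).
    """
    remaining = total - n
    norm = beta_int(k + 1, n - k + 1)
    prob = Fraction(0)
    for m in range(remaining + 1):
        if 2 * (k + m) > total:  # class 1 holds a strict majority
            weight = beta_int(k + 1 + m, n - k + 1 + remaining - m) / norm
            prob += comb(remaining, m) * weight
    return prob


# One vote for class 1 seen, two classifiers still to poll in an ensemble of 3.
print(prob_final_majority_is_class1(1, 1, 3))
```

Polling would stop for an instance as soon as this probability, or its complement for class 0, exceeds the chosen confidence level, so the remaining classifiers never need to be evaluated for that instance.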

18.
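As software systems grow ever larger, predicting program defects quickly and efficiently has become a research hotspot. Recent studies have introduced deep learning models that use neural networks to extract code features and build classifiers for defect prediction. Existing neural networks extract code features at only a single level and a single granularity, so the features are not rich enough and prediction accuracy is low. To address this, a software defect prediction framework based on feature fusion is proposed. Programs are parsed into two different representations, an abstract syntax tree (AST) and a token sequence; a tree-based convolutional neural network and a text convolutional neural network extract structural and semantic features of the code respectively, and the features are fused to obtain richer code features for defect prediction. The extraction of ASTs and token sequences is also improved to reduce model complexity. Public datasets from the PROMISE repository are used as experimental data, and a softmax classifier produces the final prediction. Experimental results show that the framework achieves a higher F1-score than existing methods on the experimental datasets.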

19.
Electronic dictionaries and lexical knowledge representation   Total citations: 3 (self-citations: 0; citations by others: 3)
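Representing and acquiring lexical knowledge is a problem that natural language processing urgently needs to solve. This paper proposes a preliminary framework and a mechanism for extracting common-sense knowledge. A language processing system uses the word as its unit of information processing, and the information recorded under a lexical entry can include statistical, syntactic, semantic, and common-sense information. A language analysis system uses words as indexes to retrieve the syntactic, semantic, and common-sense information of the relevant words in an input sentence, which gives the system better focus and helps resolve word segmentation ambiguity and structural ambiguity. For common-sense knowledge that is difficult to compile manually, this paper also proposes a strategy for automatic machine learning that incrementally accumulates semantic relations between concepts to improve the analytical capability of the language system. Several key technologies make this strategy feasible: (1) identification of unknown words and their automatic syntactic and semantic classification, (2) word sense analysis, and (3) a parsing system that applies syntactic, semantic, and common-sense knowledge.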

20.
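Software defect prediction identifies potentially defective program modules in a project under test in advance, which helps optimize the allocation of testing resources and improve software quality. This paper studies cross-project defect prediction in depth. When selecting instances from source projects, three different instance similarity measures are considered, and the defect prediction results of these measures are found to be diverse. A Box-Cox-transformation-based ensemble method for cross-project software defect prediction, BCEL, is therefore proposed. Specifically, different training sets are first selected from the candidate set using the different instance similarity measures; a Box-Cox transformation tailored to each dataset is then applied, and a specific classification method is used to build different base classifiers; finally, the three base classifiers are combined into an ensemble. The effectiveness of BCEL is verified on datasets from real projects, and the influence of the factors within BCEL on defect prediction performance is analyzed in depth.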
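For reference, the standard one-parameter Box-Cox transformation maps a positive value x to (x^λ - 1) / λ when λ ≠ 0 and to ln x when λ = 0. The sketch below shows only that formula; how BCEL chooses λ for each training set is not described in the abstract, so the λ values used here are arbitrary, and the function names are ours.

```python
import math


def box_cox(x, lam):
    """Standard one-parameter Box-Cox transform of a positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0; shift the data first if needed")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def box_cox_column(values, lam):
    """Apply the transform to every value of one (positive) software metric."""
    return [box_cox(v, lam) for v in values]


# Hypothetical metric values (e.g. lines of code) with an arbitrary lambda.
print(box_cox_column([1.0, 4.0, 9.0], 0.5))
```

Software metrics are typically right-skewed, and a transform of this family compresses large values so that instances from different projects become more comparable before a classifier is trained.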
