首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
面向非平衡文本情感分类的TSF特征选择方法   总被引:1,自引:1,他引:0  
王杰  李德玉  王素格 《计算机科学》2016,43(10):206-210, 224
非平衡数据中样本数量的不平衡分布往往伴随着特征分布的不平衡,在多数类文本中经常出现的特征,在少数类中却很少出现。针对非平衡数据特征分布的特点,提出了一种新的双边fisher特征选择算法TSF。该方法通过显式地组合正相关和负相关特征,缓解了特征层面的非平衡性,较好地表示了文本的信息。TSF方法在图书评论和COAE2014微博非平衡数据上进行实验,结果验证了该方法是可行的。  相似文献   

2.
齿轮是传动机械中的重要部件,也是在运行过程中产生故障的主要原因之一,因此对齿轮进行故障诊断研究就具有十分重要的意义。但是在齿轮故障诊断数据集中,故障样本数通常比非故障样本数要少很多,由此引发了数据不均衡问题下故障诊断的问题。以往的研究很少关注这种数据不均衡问题对故障诊断的影响。此外,在故障数据集中有一些冗余甚至是不相关的特征,这些特征降低了学习器的泛化能力。为解决这类问题,提出了一种基于Relief的EasyEnsemble算法来解决故障诊断中的数据不均衡问题。在UCI数据集和齿轮数据集上的实验结果表明新算法提高了分类器在不均衡数据集上的分类性能和预报能力。  相似文献   

3.
Efficient and robust feature extraction by maximum margin criterion   总被引:15,自引:0,他引:15  
In pattern recognition, feature extraction techniques are widely employed to reduce the dimensionality of data and to enhance the discriminatory information. Principal component analysis (PCA) and linear discriminant analysis (LDA) are the two most popular linear dimensionality reduction methods. However, PCA is not very effective for the extraction of the most discriminant features, and LDA is not stable due to the small sample size problem . In this paper, we propose some new (linear and nonlinear) feature extractors based on maximum margin criterion (MMC). Geometrically, feature extractors based on MMC maximize the (average) margin between classes after dimensionality reduction. It is shown that MMC can represent class separability better than PCA. As a connection to LDA, we may also derive LDA from MMC by incorporating some constraints. By using some other constraints, we establish a new linear feature extractor that does not suffer from the small sample size problem, which is known to cause serious stability problems for LDA. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in the paper. Our extensive experiments demonstrate that the new feature extractors are effective, stable, and efficient.  相似文献   

4.
文本分类中数据集的不均衡问题是一个在实际应用中普遍存在的问题。从特征选择优化和分类器性能提升两方面出发,提出了一种组合的不均衡数据集文本分类方法。在特征选择方面,综合考虑特征项与类别的正负相关特性及类别区分强度对传统CHI统计特征选择方法予以改进。在数据层上,采用数据重取样方法对不均衡训练语料的不平衡性过滤减少其对分类性能的影响。实验结果表明该方法对不均衡数据集上文本可达到较好分类效果。  相似文献   

5.
Financial distress prediction (FDP) is of great importance to both inner and outside parts of companies. Though lots of literatures have given comprehensive analysis on single classifier FDP method, ensemble method for FDP just emerged in recent years and needs to be further studied. Support vector machine (SVM) shows promising performance in FDP when compared with other single classifier methods. The contribution of this paper is to propose a new FDP method based on SVM ensemble, whose candidate single classifiers are trained by SVM algorithms with different kernel functions on different feature subsets of one initial dataset. SVM kernels such as linear, polynomial, RBF and sigmoid, and the filter feature selection/extraction methods of stepwise multi discriminant analysis (MDA), stepwise logistic regression (logit), and principal component analysis (PCA) are applied. The algorithm for selecting SVM ensemble's base classifiers from candidate ones is designed by considering both individual performance and diversity analysis. Weighted majority voting based on base classifiers’ cross validation accuracy on training dataset is used as the combination mechanism. Experimental results indicate that SVM ensemble is significantly superior to individual SVM classifier when the number of base classifiers in SVM ensemble is properly set. Besides, it also shows that RBF SVM based on features selected by stepwise MDA is a good choice for FDP when individual SVM classifier is applied.  相似文献   

6.
The curse of high dimensionality in text classification is a worrisome problem that requires efficient and optimal feature selection (FS) methods to improve classification accuracy and reduce learning time. Existing filter-based FS methods evaluate features independently of other related ones, which can then lead to selecting a large number of redundant features, especially in high-dimensional datasets, resulting in more learning time and less classification performance, whereas information theory-based methods aim to maximize feature dependency with the class variable and minimize its redundancy for all selected features, which gradually becomes impractical when increasing the feature space. To overcome the time complexity issue of information theory-based methods while taking into account the redundancy issue, in this article, we propose a new feature selection method for text classification termed correlation-based redundancy removal, which aims to minimize the redundancy using subsets of features having close mutual information scores without sequentially seeking already selected features. The idea is that it is not important to assess the redundancy of a dominant feature having high classification information with another irrelevant feature having low classification information and vice-versa since they are implicitly weakly correlated. Our method, tested on seven datasets using both traditional classifiers (Naive Bayes and support vector machines) and deep learning models (long short-term memory and convolutional neural networks), demonstrated strong performance by reducing redundancy and improving classification compared to ten competitive metrics.  相似文献   

7.
8.
基于双向语义的中文实体关系联合抽取方法   总被引:1,自引:0,他引:1  
禹克强  黄芳  吴琪  欧阳洋 《计算机工程》2023,49(1):92-99+112
现有中文实体关系抽取方法通常利用实体间的单向关系语义特征进行关系抽取,然而仅靠单向语义特征并不能完全利用实体间的语义关系,从而使得实体关系抽取的有效性受到影响。提出一种基于双向语义的中文实体关系联合抽取方法。利用RoBERTa预训练模型获取具有上下文信息的文本字向量表征,通过首尾指针标注识别句子中可能存在关系的实体。为了同时利用文本中的双向关系语义信息,将实体分别作为关系中的主体与客体来建立正负关系,并利用两组全连接神经网络构建正负关系映射器,从而对每一个输入实体同时从正关系与负关系的角度构建候选关系三元组。将候选关系三元组分别在正负关系下的概率分布序列与实体位置嵌入特征相结合,以对候选三元组进行判别,从而确定最终的关系三元组。在DuIE数据集上进行对比实验,结果表明,该方法的精确率与召回率优于MultiR、CoType等基线模型,其F1值达到0.805,相较基线模型平均提高了12.8%。  相似文献   

9.
一种面向非平衡数据的邻居词特征选择方法   总被引:1,自引:0,他引:1  
在非平衡数据情况下,由于传统特征选择方法,如信息增益(Information Gain,IG)和相关系数(Correlation Coefficient,CC),或者不考虑负特征对分类的作用,或者不能显式地均衡正负特征比例,导致特征选择的结果下降.本文提出一种新的特征选择方法(Positive-Negative feature selection,PN),用于邻居词的选择,实现了文本中术语的自动抽取.本文提出的PN特征选择方法和CC特征选择方法相比,考虑了负特征;和IG特征选择方法相比,从特征t出现在正(负)训练文本的文本数占所有出现特征t的训练文本数比例的角度,分别显式地均衡了正特征和负特征的比例.通过计算特征t后面所跟的不同(非)领域概念个数占总(非)领域概念个数比值分别考察正、负特征t的重要性,解决了IG特征选择方法正特征偏置问题.实验结果表明,本文提出的PN特征选择方法优越于IG特征选择方法和CC特征选择.  相似文献   

10.
Correlated information between multiple views can provide useful information for building robust classifiers. One way to extract correlated features from different views is using canonical correlation analysis (CCA). However, CCA is an unsupervised method and can not preserve discriminant information in feature extraction. In this paper, we first incorporate discriminant information into CCA by using random cross-view correlations between within-class examples. Because of the random property, we can construct a lot of feature extractors based on CCA and random correlation. So furthermore, we fuse those feature extractors and propose a novel method called random correlation ensemble (RCE) for multi-view ensemble learning. We compare RCE with existing multi-view feature extraction methods including CCA and discriminant CCA (DCCA) which use all cross-view correlations between within-class examples, as well as the trivial ensembles of CCA and DCCA which adopt standard bagging and boosting strategies for ensemble learning. Experimental results on several multi-view data sets validate the effectiveness of the proposed method.  相似文献   

11.
《Journal of Process Control》2014,24(6):1015-1023
This study addresses classification methodology for the automatic inspection of a range of defects on the surface of glass substrates in thin film transistor liquid crystal display glass substrate manufacturing. The proposed methodology consisted of four stages: (1) feature extraction by calculating the wavelet co-occurrence signature from the substrate images, (2) handling of imbalanced dataset using the Synthetic Minority Over-sampling TEchnique (SMOTE), (3) reduction of the feature's dimension by principal component analysis, and (4) finally choosing the best classifier between three different methods: Classification And Regression Tree (CART), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM). In training the SVM and MLP classifiers, the simulated annealing algorithm was used to obtain the optimal tuning parameters for the classifiers. From the industrial case study, the proposed feature extraction algorithm could remove the defect-irrelevant image features and SMOTE increased the accuracy of all three methods. Furthermore, the optimized SVM and MLP models were more accurate than the CART model whereas a higher accuracy of 89.5% was observed for the proposed SVM model.  相似文献   

12.
针对microRNA识别方法中过多注重新特征、忽略弱分类能力特征和冗余特征,导致敏感性和特异性指标不佳或两者不平衡的问题,提出一种基于特征聚类和随机子空间的集成算法CLUSTER-RS。该算法采用信息增益率剔除部分弱分类能力的特征后,利用信息熵度量特征之间相关性,对特征进行聚类,再从每个特征簇中随机选取等量特征组成特征集用于构建基分类器,最后将基分类器集成用于microRNA识别。通过调整参数、选择基分类器实现算法最优化后,在microRNA最新数据集上与经典方法Triplet-SVM、miPred、MiPred、microPred和HuntMi进行对比实验,结果显示CLUSTER-RS在识别中敏感性不及microPred但优于其他模型,特异性为六者最优,而且从整体性能指标准确性和马修兹系数可以看出,CLUSTER-RS比其他算法具有优势。结果表明,CLUSTER-RS取得了较好的识别效果,在敏感性和特异性上实现了很好的平衡,即在性能指标平衡方面优于对比方法。  相似文献   

13.
针对传统基于远程监督的关系抽取方法中存在噪声和负例数据利用不足的问题,提出结合从句级远程监督和半监督集成学习的关系抽取方法.首先通过远程监督构建关系实例集,使用基于从句识别的去噪算法去除关系实例集中的噪声.然后抽取关系实例的词法特征并转化为分布式表征向量,构建特征数据集.最后选择特征数据集中所有正例数据和部分负例数据组成标注数据集,其余的负例数据组成未标注数据集,通过改进的半监督集成学习算法训练关系分类器.实验表明,相比基线方法,文中方法可以获得更高的分类准确率和召回率.  相似文献   

14.
王林  郭娜娜 《计算机应用》2017,37(4):1032-1037
针对传统分类技术对不均衡电信客户数据集中流失客户识别能力不足的问题,提出一种基于差异度的改进型不均衡数据分类(IDBC)算法。该算法在基于差异度分类(DBC)算法的基础上改进了原型选择策略。在原型选择阶段,利用改进型的样本子集优化方法从整体数据集中选择最具参考价值的原型集,从而避免了随机选择所带来的不确定性;在分类阶段,分别利用训练集和原型集、测试集和原型集样本之间的差异性构建相应的特征空间,进而采用传统的分类预测算法对映射到相应特征空间内的差异度数据集进行学习。最后选用了UCI数据库中的电信客户数据集和另外6个普通的不均衡数据集对该算法进行验证,相对于传统基于特征的不均衡数据分类算法,DBC算法对稀有类的识别率平均提高了8.3%,IDBC算法对稀有类的识别率平均提高了11.3%。实验结果表明,所提IDBC算法不受类别分布的影响,而且对不均衡数据集中稀有类的识别能力优于已有的先进分类技术。  相似文献   

15.
With the advent of technology in various scientific fields, high dimensional data are becoming abundant. A general approach to tackle the resulting challenges is to reduce data dimensionality through feature selection. Traditional feature selection approaches concentrate on selecting relevant features and ignoring irrelevant or redundant ones. However, most of these approaches neglect feature interactions. On the other hand, some datasets have imbalanced classes, which may result in biases towards the majority class. The main goal of this paper is to propose a novel feature selection method based on the interaction information (II) to provide higher level interaction analysis and improve the search procedure in the feature space. In this regard, an evolutionary feature subset selection algorithm based on interaction information is proposed, which consists of three stages. At the first stage, candidate features and candidate feature pairs are identified using traditional feature weighting approaches such as symmetric uncertainty (SU) and bivariate interaction information. In the second phase, candidate feature subsets are formed and evaluated using multivariate interaction information. Finally, the best candidate feature subsets are selected using dominant/dominated relationships. The proposed algorithm is compared with some other feature selection algorithms including mRMR, WJMI, IWFS, IGFS, DCSF, IWFS, K_OFSD, WFLNS, Information Gain and ReliefF in terms of the number of selected features, classification accuracy, F-measure and algorithm stability using three different classifiers, namely KNN, NB, and CART. The results justify the improvement of classification accuracy and the robustness of the proposed method in comparison with the other approaches.  相似文献   

16.
化工过程故障诊断中样本数据分布不均衡现象普遍存在.在使用不均衡样本作为训练集建立各类故障诊断分类器时,易出现分类器的识别率偏置于多数类样本的结果,由此产生虽正常状态易识别,但更受关注的故障状态却难以被诊断的现象.针对该问题,本文提出一种基于Easy Ensemble思想的主元分析–支持向量机(Easy Ensemble based principle component analysis–support vector machine,EEPS)故障诊断算法,通过欠采样方法抽取多数类样本子集组建多个新的均衡数据样本集,使用主元分析(principle component analysis,PCA)进行特征提取并使用支持向量机(support vector machine,SVM)算法进行训练,得到多个基于SVM的故障诊断分类器,然后使用Adaboost算法集成最终的分类,从而提高故障诊断准确性.所提方法被用于TE(Tenessee Eastman)化工过程,实验结果表明,EEPS算法能够有效提高分类器在不均衡数据集上的诊断性能和预报能力.  相似文献   

17.
网络作弊检测是搜索引擎的重要挑战之一,该文提出基于遗传规划的集成学习方法 (简记为GPENL)来检测网络作弊。该方法首先通过欠抽样技术从原训练集中抽样得到t个不同的训练集;然后使用c个不同的分类算法对t个训练集进行训练得到t*c个基分类器;最后利用遗传规划得到t*c个基分类器的集成方式。新方法不仅将欠抽样技术和集成学习融合起来提高非平衡数据集的分类性能,还能方便地集成不同类型的基分类器。在WEBSPAM-UK2006数据集上所做的实验表明无论是同态集成还是异态集成,GPENL均能提高分类的性能,且异态集成比同态集成更加有效;GPENL比AdaBoost、Bagging、RandomForest、多数投票集成、EDKC算法和基于Prediction Spamicity的方法取得更高的F-度量值。  相似文献   

18.
针对背景知识数据集中存在的类不平衡对分类器的影响,根据背景知识数据集样本量小、数据维数高的特性分析了目前各种方法在解决背景知识数据中的类不平衡问题时的缺陷,提出了一种基于分类后处理的改进SVM算法。改进算法引入权重参数调整SVM的分类决策函数,提高少类样本对分类器的贡献,使分类平面向多类样本倾斜,从而解决类不平衡对SVM造成的影响。在MAROB数据集上的实验表明,改进算法对少类的预测效果要优于传统的机器学习算法。  相似文献   

19.
在线知识蒸馏通过同时训练两个或多个模型的集合,并使之相互学习彼此的提取特征,从而实现模型性能的共同提高.已有方法侧重于模型间特征的直接对齐,从而忽略了决策边界特征的独特性和鲁棒性.利用一致性正则化来指导模型学习决策边界的判别性特征.具体地说,网络中每个模型由特征提取器和一对任务特定的分类器组成,通过正则化同一模型不同分类器间以及不同模型对应分类器间的分布距离来度量模型内和模型间的一致性,这两类一致性共同用于更新特征提取器和决策边界的特征.此外,模型内一致性将作为自适应权重,与每个模型的平均输出加权生成集成预测值,进而指导所有分类器与之相互学习.在多个公共数据集上,该算法均取得了较好的表现性能.  相似文献   

20.
Hybrid models based on feature selection and machine learning techniques have significantly enhanced the accuracy of standalone models. This paper presents a feature selection‐based hybrid‐bagging algorithm (FS‐HB) for improved credit risk evaluation. The 2 feature selection methods chi‐square and principal component analysis were used for ranking and selecting the important features from the datasets. The classifiers were built on 5 training and test data partitions of the input data set. The performance of the hybrid algorithm was compared with that of the standalone classifiers: feature selection‐based classifiers and bagging. The hybrid FS‐HB algorithm performed best for qualitative dataset with less features and tree‐based unstable base classifier. Its performance on numeric data was also better than other standalone classifiers, whereas comparable to bagging with only selected features. Its performance was found better on 70:30 data partition and the type II error, which is very significant in risk evaluation was also reduced significantly. The improved performance of FS‐HB is attributed to the important features used for developing the classifier thereby reducing the complexity of the algorithm and the use of ensemble methodology, which added to the classical bias variance trade‐off and performed better than standalone classifiers.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号