Similar literature
20 similar documents found.
1.
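An intrusion detection algorithm based on evolutionary semi-supervised fuzzy clustering   Total citations: 3, self-citations: 0, citations by others: 3
In intrusion detection systems, unlabeled data are easy to obtain while labeled data are hard to obtain. To address this, an intrusion detection algorithm based on evolutionary semi-supervised fuzzy clustering is proposed. The algorithm lets the labeled data play the role of chromosomes, guiding the evolution of each fuzzy partition of the unlabeled data. It can build an intrusion detection classifier from a small amount of labeled data and a large amount of unlabeled data, can handle fuzzy class labels, is not prone to getting trapped in local optima, and is suited to parallel implementation. Experimental results show that the algorithm achieves a high detection rate.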

2.
Fuzzy feature selection   Total citations: 2, self-citations: 0, citations by others: 2
In fuzzy classifier systems the classification is obtained by a number of fuzzy If–Then rules including linguistic terms such as Low and High that fuzzify each feature. This paper presents a method by which a reduced linguistic (fuzzy) set of a labeled multi-dimensional data set can be identified automatically. After the projection of the original data set onto a fuzzy space, the optimal subset of fuzzy features is determined using conventional search techniques. The applicability of this method has been demonstrated by reducing the number of features used for the classification of four real-world data sets. This method can also be used to generate an initial rule set for a fuzzy neural network.
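To make the idea of fuzzifying a feature with linguistic terms such as Low and High concrete, here is a minimal sketch. The linear "shoulder" membership functions, the function name `fuzzify`, and the bounds `lo` and `hi` are assumptions made for illustration; the abstract does not say which membership functions the paper uses.

```python
from typing import Dict


def fuzzify(x: float, lo: float, hi: float) -> Dict[str, float]:
    """Map a crisp feature value to membership degrees in 'Low' and 'High'.

    Assumed (hypothetical) membership shape: membership in 'High' rises
    linearly from 0 at `lo` to 1 at `hi`; 'Low' is its complement.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Clamp the normalised position of x into [0, 1].
    t = min(1.0, max(0.0, (x - lo) / (hi - lo)))
    return {"Low": 1.0 - t, "High": t}


# One crisp feature becomes two fuzzy features.
print(fuzzify(2.5, 0.0, 10.0))  # {'Low': 0.75, 'High': 0.25}
```

Applying such a mapping to every feature gives the projection onto a fuzzy space; a search procedure could then keep only the most useful fuzzy features.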

3.
Learning with partly labeled data   Total citations: 2, self-citations: 0, citations by others: 2
Learning with partly labeled data aims at combining labeled and unlabeled data in order to boost the accuracy of a classifier. This paper outlines the two main classes of learning methods that deal with partly labeled data: pre-labeling-based learning and semi-supervised learning. Concretely, we introduce and discuss three methods from each class. The first three are two-stage methods that consist of selecting the data to be labeled and then training the classifier using the pre-labeled and the originally labeled data. The last three show how labeled and unlabeled data can be combined in a symbiotic way during training. The empirical evaluation of these methods shows that: (1) pre-labeling methods tend to be better than semi-supervised learning methods, (2) both labeled and unlabeled data have a positive effect on the classification accuracy of each of the proposed methods, (3) the combination of all the methods improves the accuracy, and (4) the proposed methods compare very well with the state-of-the-art methods.

4.
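Semi-supervised learning based on one-class classifiers   Total citations: 1, self-citations: 0, citations by others: 1
An Ensemble one-class semi-supervised learning algorithm is proposed that combines the advantages of one-class learners and ensemble learning. The algorithm first builds two one-class classifiers, one for each of the two classes in a small labeled data set. The two classifiers then jointly identify the unlabeled samples, and the identified unlabeled samples are used to adjust and optimize the two decision surfaces. Finally, the identified unlabeled data and the labeled data are pooled to train a base classifier, and multiple base classifiers are combined to vote on the test samples. Experiments on 5 UCI data sets show that the algorithm improves average recognition accuracy by 4.5% compared with the tri-training algorithm, and by 8.9% compared with a one-class classifier that uses only labeled data. The experimental results indicate that the algorithm is effective for semi-supervised problems.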

5.
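Active co-training semi-supervised rough set classification model   Total citations: 1, self-citations: 0, citations by others: 1
Rough set theory is a supervised learning model that generally requires an adequate amount of labeled data to train a classifier. In many real-world problems, however, unlabeled data are abundant while labeled data are scarce because labeling is expensive. Combining active learning and co-training, this paper proposes a semi-supervised rough set model that can effectively exploit unlabeled data to improve classification performance. The model uses a semi-supervised attribute reduction algorithm to extract two highly diverse reducts, from which base classifiers are built. Then, following the idea of active learning, it selects the unlabeled samples on which the two classifiers disagree most for manual labeling, and the updated classifiers learn from each other through co-training. Comparative experiments on UCI data sets show that the model can markedly improve classification performance and can even reach the best value for a data set.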

6.
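Data stream classification is one of the important research tasks in data mining. Most existing data stream classification algorithms are trained on labeled data sets, yet in practical applications very little of the data in a stream is labeled. Labeled data can be obtained by manual annotation, but this is expensive and time-consuming. Since unlabeled data are plentiful and carry a great deal of information, this paper proposes a Tri-training-based ensemble classification algorithm for data streams that exploits this information while maintaining accuracy. The algorithm uses a sliding window to split the data stream into chunks. On the first k chunks, which contain both unlabeled and labeled data, it trains base classifiers with Tri-training, updating them through iterative weighted voting until all unlabeled data have been labeled. It then uses the k Tri-training ensemble models to predict chunk k+1, discards classifiers with a high classification error rate, and rebuilds new classifiers on the current chunk to update the model. Experiments on 10 UCI data sets show that, compared with classical algorithms, the proposed algorithm significantly improves classification accuracy on data streams containing 80% unlabeled data.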

7.
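A semi-supervised fuzzy clustering method based on an improved differential evolution algorithm is proposed, which learns from and classifies both labeled and unlabeled data. A small portion of a large data set is first selected and labeled; the labeled data then guide the evolutionary process to classify the unlabeled data. Borrowing the idea of the inertia weight from particle swarm optimization, an inertia weighting coefficient is introduced that maintains population diversity early in the computation and speeds up convergence later, effectively improving the algorithm's performance. Experimental results on remote sensing image data show that the method can improve classification accuracy.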

8.
Exploiting semantic resources for large scale text categorization   Total citations: 1, self-citations: 0, citations by others: 1
The traditional supervised classifier for Text Categorization (TC) is learned from a set of hand-labeled documents. However, manual data labeling is labor intensive and time consuming, especially for a complex TC task with hundreds or thousands of categories. To address this issue, many semi-supervised methods have been reported that use both labeled and unlabeled documents for TC, but they still need a small set of labeled data for each category. In this paper, we propose a Fully Automatic Categorization approach for Text (FACT), where no manual labeling effort is required. In FACT, lexical databases serve as semantic resources for category name understanding. It combines semantic analysis of category names with statistical analysis of the unlabeled document set for fully automatic training data construction. With the support of lexical databases, we first use the category name to generate a set of features as a representative profile for the corresponding category. Then, a set of documents is labeled according to the representative profile. To reduce the possible bias originating from the category name and the representative profile, document clustering is used to refine the quality of the initial labeling. The training data are subsequently constructed to train the discriminative classifier. The empirical experiments show that one variant of our FACT approach significantly outperforms the state-of-the-art unsupervised TC approach. It can achieve more than 90% of the F1 performance of the baseline SVM methods, which demonstrates the effectiveness of the proposed approach.

9.
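Semi-supervised learning trains a classifier using labeled and unlabeled training samples together. This idea is incorporated into the standard SVM training method by adding mixed data near the decision surface, yielding a new SVM-based classifier design method, which is applied to the classification of small-sample data. Experiments show that, compared with the traditional SVM, the new SVM-based classifier achieves a considerably higher classification accuracy, while the deviation is somewhat reduced.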

10.
Learning cost-sensitive models from datasets with few labeled data and plentiful unlabeled data is a practical and challenging problem, because labeled data are sometimes very difficult, time consuming and/or expensive to obtain. To address this problem, we propose two classification strategies, based on Expectation Maximization (EM), for learning a cost-sensitive classifier from training datasets with both labeled and unlabeled data. The first method, Direct-EM, uses EM to build a semi-supervised classifier, then directly computes the optimal class label for each test example using the class probability produced by the learning model. The second method, CS-EM, modifies EM by incorporating misclassification cost into the probability estimation process. We conducted extensive experiments to evaluate the efficiency of the two methods, and the results show that, when using only a small number of labeled training examples, CS-EM outperforms the other competing methods on the majority of the selected UCI data sets across different cost ratios, especially when the cost ratio is high.
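The step of computing an optimal class label from class probabilities and misclassification costs can be sketched as a minimum-expected-cost decision. This is the standard cost-sensitive decision rule, not necessarily the exact procedure of Direct-EM; the function name, the cost-matrix layout and the example numbers are all hypothetical.

```python
from typing import Sequence


def min_expected_cost_label(probs: Sequence[float],
                            cost: Sequence[Sequence[float]]) -> int:
    """Return the class index with the lowest expected misclassification cost.

    probs[j]   : estimated probability that the true class is j
    cost[i][j] : cost of predicting class i when the true class is j
                 (assumed layout, chosen for this sketch)
    """
    # Expected cost of each possible prediction i.
    expected = [sum(p * c for p, c in zip(probs, row)) for row in cost]
    # Pick the prediction with the smallest expected cost.
    return min(range(len(expected)), key=expected.__getitem__)


# Hypothetical two-class costs: missing class 1 costs ten times a false alarm.
cost_matrix = [[0.0, 10.0],   # predict 0
               [1.0, 0.0]]    # predict 1
print(min_expected_cost_label([0.8, 0.2], cost_matrix))  # 1
```

With these costs, class 1 is predicted even though class 0 is more probable, because predicting 0 has expected cost 2.0 and predicting 1 has expected cost 0.8.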

11.
The accuracy of a classification-based surrogate model for reliability assessment can be improved by augmenting the training data (labeled data, or data with known responses) with a large number of unlabeled data (data with unknown responses) in semi-supervised learning methods. In this research, an enhanced Probabilistic Neural Network (PNN) algorithm is proposed in which the Gaussians at each labeled point are not assumed to be spherical. Each Gaussian has a ‘full’ covariance matrix instead of a ‘spherical’ one. First, the Expectation-Maximization algorithm is applied to the labeled and unlabeled data, assuming that the number of ‘full’ Gaussians is equal to the number of labeled datapoints. The contribution of each of these ‘full’ Gaussians at a particular datapoint is found by using Bayes' theorem. The Bayes decision criterion is then used in the final output layer of the PNN to classify test patterns into either the safe or the failure class. The primary benefit of the proposed method comes from utilizing unlabeled data for better estimation of the ‘full’ covariance matrices of the Gaussian clusters that make up the underlying data, which are then used to estimate the Probability Density Functions of the classes for classification. This procedure does not require additional computational cost to improve the accuracy of the classification results, since the cost of unlabeled data is negligible in general. Two examples, an analytic problem and a truss problem, are presented to validate the proposed reliability estimation process. The results reflect considerable improvements of the classifier performance for estimating reliability while maintaining sufficient accuracy.
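As a hedged illustration of the Bayes' theorem step, the contribution of Gaussian $k$ at a datapoint $x$ can be written as a posterior responsibility. The symbols $\pi_k$ (mixing weight), $\mu_k$ (mean), $\Sigma_k$ (covariance) and $K$ (number of Gaussians, here the number of labeled datapoints) are assumed notation and do not come from the abstract:

```latex
\[
P(k \mid x) \;=\;
\frac{\pi_k \,\mathcal{N}(x \mid \mu_k, \Sigma_k)}
     {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(x \mid \mu_j, \Sigma_j)},
\qquad
\text{spherical case: } \Sigma_k = \sigma_k^2 I .
\]
```

In the ‘full’ case, $\Sigma_k$ is an unrestricted symmetric positive-definite matrix, so each Gaussian can stretch and rotate to follow the local shape of the data.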

12.
Video Annotation Based on Kernel Linear Neighborhood Propagation   Total citations: 1, self-citations: 0, citations by others: 1
The insufficiency of labeled training data for representing the distribution of the entire dataset is a major obstacle in automatic semantic annotation of large-scale video databases. Semi-supervised learning algorithms, which attempt to learn from both labeled and unlabeled data, are a promising way to solve this problem. In this paper, a novel graph-based semi-supervised learning method named kernel linear neighborhood propagation (KLNP) is proposed and applied to video annotation. This approach combines the consistency assumption, which is the basic assumption in semi-supervised learning, with the locally linear embedding (LLE) method in a nonlinear kernel-mapped space. KLNP improves on a recently proposed method, linear neighborhood propagation (LNP), by tackling the limitation of its local linear assumption on the distribution of semantics. Experiments conducted on the TRECVID data set demonstrate that this approach outperforms other popular graph-based semi-supervised learning methods for video semantic annotation.

13.
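A co-training semi-supervised active learning algorithm with noise filtering   Total citations: 1, self-citations: 0, citations by others: 1
When a semi-supervised classifier is trained on unlabeled samples, noise may be introduced and classification performance may degrade. To address this, the paper proposes a co-training semi-supervised active learning algorithm with noise filtering. The algorithm performs collaborative semi-supervised learning with three fuzzy buried Markov models and, at appropriate times, actively introduces human-computer interaction to supply class labels. This avoids both rejecting samples when the classifiers' decisions differ and wrongly accepting a label as correct merely because the initial decisions agree. A noise filtering mechanism is also added to filter out machine-labeled samples that may be noise. The algorithm is applied to facial expression recognition. Experimental results show that it effectively increases the utilization of unlabeled samples, reduces the noise introduced by semi-supervised learning, and improves the accuracy of expression recognition.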

14.
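To address the insufficient diversity among classifiers in ensemble learning and the scarcity of labeled samples, a new semi-supervised ensemble learning algorithm is proposed. It introduces semi-supervised methods into ensemble learning, using the information in a large number of unlabeled samples to refine each base classifier and to construct more diverse base classifiers. First, suitable unlabeled samples are selected by a multi-view method, which is also used to group the large and complex set of feature attributes, and different dimensionality reduction methods are applied to the different views...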

15.
Graph-based learning provides a useful approach for modeling data in classification problems. In this modeling scenario, the relationship between labeled and unlabeled data impacts the construction and performance of classifiers, and therefore a semi-supervised learning framework is adopted. We propose a graph classifier based on kernel smoothing. A regularization framework is also introduced, and it is shown that the proposed classifier optimizes certain loss functions. Its performance is assessed on several synthetic and real benchmark data sets with good results, especially in settings where only a small fraction of the data are labeled.

16.
The following two-stage approach to learning from dissimilarity data is described: (1) embed both labeled and unlabeled objects in a Euclidean space; then (2) train a classifier on the labeled objects. The use of linear discriminant analysis for (2), which naturally invites the use of classical multidimensional scaling for (1), is emphasized. The choice of the dimension of the Euclidean space in (1) is a model selection problem; too few or too many dimensions can degrade classifier performance. The question of how the inclusion of unlabeled objects in (1) affects classifier performance is investigated. In the case of spherical covariances, including unlabeled objects in (1) is demonstrably superior. Several examples are presented.

17.
翟俊海, 张素芳, 王聪, 沈矗, 刘晓萌. 《计算机应用》 (Journal of Computer Applications), 2018, 38(10): 2759-2763
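To address the problem that traditional active learning algorithms can only handle small and medium-sized data sets, a MapReduce-based active learning algorithm for big data is proposed. First, a classifier is trained with the extreme learning machine (ELM) algorithm on an initial training set with class labels, and its output is transformed into a posterior probability distribution by the softmax function. Then, the large unlabeled data set is divided into l subsets, which are deployed to l cloud computing nodes. On each node, the trained classifier computes, in parallel, the information entropy of the samples in its subset, and the q samples with the highest entropy are selected for labeling; the resulting l×q labeled samples are added to the labeled training set. These steps are repeated until a predefined stopping condition is met. The algorithm was compared with the ELM-based active learning algorithm on four data sets: Artificial, Skin, Statlog and Poker. The proposed algorithm completed active sample selection on all four data sets, whereas the ELM-based active learning algorithm did so only on the smallest. The experimental results show that the proposed algorithm outperforms the ELM-based active learning algorithm.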
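The per-node selection step described above (softmax over classifier outputs, entropy of the resulting distribution, top-q selection) can be sketched as follows. This is a single-node illustration only: the function names are hypothetical, the raw outputs are made-up numbers, and neither the ELM training nor the MapReduce distribution across nodes is shown.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw classifier outputs into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_top_q(outputs: Sequence[Sequence[float]], q: int) -> List[int]:
    """Return indices of the q samples whose predictions are most uncertain."""
    scored = [(entropy(softmax(o)), i) for i, o in enumerate(outputs)]
    scored.sort(key=lambda t: t[0], reverse=True)  # highest entropy first
    return [i for _, i in scored[:q]]


# Hypothetical raw outputs for three unlabeled samples on one node.
outputs = [[5.0, 0.0], [0.1, 0.0], [2.0, 0.0]]
print(select_top_q(outputs, 2))  # [1, 2]
```

In a distributed setting, each of the l nodes would run `select_top_q` on its own subset, and the l×q selected samples would be labeled and merged into the training set.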

18.
This paper presents a method for designing semi-supervised classifiers trained on labeled and unlabeled samples. We focus on probabilistic semi-supervised classifier design for multi-class and single-labeled classification problems, and propose a hybrid approach that takes advantage of generative and discriminative approaches. In our approach, we first consider a generative model trained by using labeled samples and introduce a bias correction model, where these models belong to the same model family, but have different parameters. Then, we construct a hybrid classifier by combining these models based on the maximum entropy principle. To enable us to apply our hybrid approach to text classification problems, we employed naive Bayes models as the generative and bias correction models. Our experimental results for four text data sets confirmed that the generalization ability of our hybrid classifier was much improved by using a large number of unlabeled samples for training when there were too few labeled samples to obtain good performance. We also confirmed that our hybrid approach significantly outperformed generative and discriminative approaches when the performance of the generative and discriminative approaches was comparable. Moreover, we examined the performance of our hybrid classifier when the labeled and unlabeled data distributions were different.

19.
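An active text labeling method using nearest neighbors and information entropy   Total citations: 1, self-citations: 0, citations by others: 1
Because labeling text data on a large scale is time-consuming and laborious, semi-supervised text classification, which uses a small number of labeled samples and a large number of unlabeled samples, has developed rapidly. In semi-supervised text classification, the few labeled samples are mainly used to initialize the classification model, so how well they are chosen affects the performance of the final model. To make the labeled samples match the distribution of the original data as closely as possible, a method is proposed that draws the next group of candidate samples for labeling while avoiding the K nearest neighbors of already labeled samples, so that samples in different regions have more chances of being labeled. On this basis, to obtain more class information, the samples with the highest information entropy among the candidates are chosen as the final samples to label. Experiments on real text data demonstrate the effectiveness of the proposed method.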

20.
Most of the existing classification methods used for voice pathology assessment are built on labeled pathological and normal voice signals. This paper studies the problem of building a classifier using labeled and unlabeled data. We propose a novel learning technique, called Partitioning and Biased Support Vector Machine Classification (PBSVM), which tries to utilize all the available data in two steps: (1) a new heuristic partition-based algorithm, which extracts high-quality pathological and normal samples from an unlabeled set, and (2) a more principled approach based on a biased formulation of the support vector machine, which is fairly robust to mislabeling and unbalanced data. Experiments with wavelet-based energy features extracted from sustained vowels show that the new recognition scheme is highly feasible and significantly outperforms the baseline classical SVM classifier, especially when the labeled training set is small.
