Similar Documents
20 similar documents found.
1.
In classification, noise may deteriorate system performance and increase the complexity of the models built. To mitigate its consequences, several approaches have been proposed in the literature. Among them, noise filtering, which removes noisy examples from the training data, is one of the most widely used techniques. This paper proposes a new noise filtering method that combines several filtering strategies in order to increase the accuracy of the classification algorithms used after the filtering process. The filtering is based on the fusion of the predictions of several classifiers used to detect the presence of noise. We translate the idea behind multiple classifier systems, where the information gathered from different models is combined, to noise filtering: we consider a combination of classifiers, instead of a single one, to detect noise. Additionally, the proposed method follows an iterative noise filtering scheme that avoids using detected noisy examples in each new iteration of the filtering process. Finally, we introduce a noisy score to control the filtering sensitivity, so that the number of noisy examples removed in each iteration can be adapted to the needs of the practitioner. The first two strategies (the use of multiple classifiers and iterative filtering) improve the filtering accuracy, whereas the last one (the noisy score) controls how conservative the filter is when removing potentially noisy examples. The validity of the proposed method is studied in an exhaustive experimental study. We compare the new filtering method against several state-of-the-art methods for dealing with datasets with class noise and study their efficacy on three classifiers with different sensitivities to noise.
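A minimal sketch of the three strategies above (multi-classifier voting, iterative filtering, and a tunable noise score), assuming scikit-learn; the classifier pool, the 0.5 threshold, and the injected label noise are illustrative choices, not the paper's exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
flipped = rng.choice(len(y), size=50, replace=False)   # inject 10% label noise
y[flipped] = 1 - y[flipped]

classifiers = [LogisticRegression(max_iter=1000),
               DecisionTreeClassifier(random_state=0),
               KNeighborsClassifier()]
threshold = 0.5           # "noisy score": fraction of classifiers that must disagree
keep = np.arange(len(y))  # indices still considered clean

for _ in range(3):        # iterative filtering: re-filter the surviving examples
    votes = np.zeros(len(keep))
    for clf in classifiers:
        # out-of-sample predictions, so each example is judged by models
        # that did not train on it
        pred = cross_val_predict(clf, X[keep], y[keep], cv=5)
        votes += (pred != y[keep])
    score = votes / len(classifiers)
    clean = score < threshold      # flagged examples are dropped...
    if clean.all():
        break
    keep = keep[clean]             # ...and excluded from later iterations

print(f"kept {len(keep)} of {len(y)} examples")
```

Raising the threshold makes the filter more conservative (fewer removals); lowering it approaches a consensus-style filter.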

2.
Crowdsourcing is an emerging method for collecting labels for datasets. Although it is inexpensive, the quality of the collected labels cannot be guaranteed, and the labels become even less reliable when objective factors cause crowd workers to perform poorly. This paper therefore proposes a feature-augmentation-based method for improving crowdsourcing quality (the FA-method). The basic idea is as follows: first, a small portion of the data is labeled by experts; a model trained on the crowd-labeled dataset then predicts the labels of the expert set, and the predictions are appended to the expert set as a new feature. A model trained on this feature-augmented expert set is used to estimate the probability that each instance is noisy, together with an upper bound on the number of noisy instances, in order to filter out the subset of potentially mislabeled data; the same feature-augmentation step is applied again to the filtered high-quality set to further correct noisy labels. Validation on 8 UCI datasets shows that, compared with existing crowdsourcing labeling methods that combine noise identification and correction, the proposed method performs well both when the number of repeated labels is small and when labeling quality is low.
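A compact sketch of the feature-augmentation step, assuming scikit-learn models, simulated crowd noise, and a fixed noise-count cap; the names (crowd_model, noise_score, n_noise_cap) are hypothetical, not from the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Crowd-labeled pool (labels may be noisy) plus a small expert-labeled set.
X, y_true = make_classification(n_samples=600, random_state=1)
rng = np.random.default_rng(1)
expert_idx = rng.choice(len(y_true), size=60, replace=False)
crowd_idx = np.setdiff1d(np.arange(len(y_true)), expert_idx)
y_crowd = y_true.copy()
flip = rng.choice(crowd_idx, size=100, replace=False)  # simulate unreliable workers
y_crowd[flip] = 1 - y_crowd[flip]

# Step 1: a model trained on crowd labels predicts the expert set; its
# prediction becomes an extra feature column ("feature augmentation").
crowd_model = RandomForestClassifier(random_state=1).fit(X[crowd_idx], y_crowd[crowd_idx])
X_expert_aug = np.column_stack([X[expert_idx],
                                crowd_model.predict_proba(X[expert_idx])[:, 1]])

# Step 2: a model trained on the augmented expert set scores every crowd
# instance; the least plausible labels are treated as potential noise.
expert_model = RandomForestClassifier(random_state=1).fit(X_expert_aug, y_true[expert_idx])
X_crowd_aug = np.column_stack([X[crowd_idx],
                               crowd_model.predict_proba(X[crowd_idx])[:, 1]])
proba = expert_model.predict_proba(X_crowd_aug)
noise_score = 1 - proba[np.arange(len(crowd_idx)), y_crowd[crowd_idx]]
n_noise_cap = 100                          # assumed upper bound on noisy labels
suspects = np.argsort(noise_score)[-n_noise_cap:]
print("flagged", len(suspects), "potentially mislabeled crowd instances")
```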

3.
To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and classifiers learned from the major set are used to identify noise in the minor set. The drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it can be physically impossible or too time-consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be technically infeasible or forbidden (for security or privacy reasons) to download data from other sites. Therefore, these approaches have severe limitations in conducting effective global data cleansing on large, distributed datasets. In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed effort identifies and eliminates mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets, or partition a large dataset into subsets, each of which is regarded as a local subset small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset and use the good rules to evaluate the whole dataset. For a given instance I_k, two error count variables record the number of times it has been identified as noise by all data subsets; instances with higher error counts have a higher probability of being mislabeled examples. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach. A preliminary version of this paper was published in the Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 2003, pp. 920-927.
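A small sketch of the local-analysis/global-incorporation scheme, assuming scikit-learn; shallow decision trees stand in for the "good rules" learned from each local subset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=900, random_state=2)
n_subsets = 3
subsets = np.array_split(np.random.default_rng(2).permutation(len(y)), n_subsets)

# Each local subset yields a model (here a small tree approximating the
# "good rules"); every local model then evaluates the WHOLE dataset.
error_count = np.zeros(len(y), dtype=int)
for idx in subsets:
    local = DecisionTreeClassifier(max_depth=5, random_state=2).fit(X[idx], y[idx])
    error_count += (local.predict(X) != y)

majority = error_count > n_subsets / 2    # flagged by more than half the models
non_objection = error_count == n_subsets  # flagged by every model (stricter)
print("majority flags:", majority.sum(), " non-objection flags:", non_objection.sum())
```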

4.
Yeqiu, Jianming, Ling, Yahagi. Neurocomputing, 2009, 72(13-15): 2884.
In this paper, a new type of multineural network filter (MNNF) is presented that is trained for restoration and enhancement of digital radiological images. In medical radiographs, noise is categorized as quantum mottle, which is related to the incident X-ray exposure, and artificial noise, which is caused by the grid, etc. The MNNF consists of several neural network filters (NNFs). A novel analysis method is proposed to make the characteristics of the trained MNNF clear. In the proposed method, a characteristics judgement system decides which NNF to execute, based on the standard deviation of the pixels in the input region. The new approach was tested on nine clinical medical X-ray images and five synthesized noisy X-ray images. In all cases, the proposed MNNF produced better results in terms of peak signal-to-noise ratio (PSNR), mean-to-standard-deviation ratio (MSR) and contrast-to-noise ratio (CNR) measures than the original NNF, a linear inverse filter and a nonlinear median filter.
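A rough sketch of the dispatch rule, with classical median and Gaussian filters standing in for the trained NNFs (the paper trains neural network sub-filters; the patch size and standard-deviation threshold here are assumptions):

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

# Toy "X-ray" image: a smooth signal plus additive noise.
rng = np.random.default_rng(3)
img = np.outer(np.hanning(128), np.hanning(128)) + rng.normal(0, 0.05, (128, 128))

patch, out = 8, np.empty_like(img)
for i in range(0, 128, patch):
    for j in range(0, 128, patch):
        region = img[i:i + patch, j:j + patch]
        # Judgement rule: the local standard deviation decides which
        # sub-filter handles the region (threshold is illustrative).
        if region.std() > 0.08:
            out[i:i + patch, j:j + patch] = median_filter(region, size=3)
        else:
            out[i:i + patch, j:j + patch] = gaussian_filter(region, sigma=1)
```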

5.
To address the high dimensionality, small sample size, high redundancy, and high noise of microarray gene expression data, a classification algorithm based on FCBF feature selection and ensemble optimization learning, FICS-EKELM, is proposed. First, the fast correlation-based filter (FCBF) removes irrelevant features and noise and identifies a feature set highly correlated with the class labels. Second, sampling is used to generate multiple training subsets, and on each subset an improved crow search algorithm simultaneously performs optimal feature-subset selection and parameter optimization of a kernel extreme learning machine (KELM) classifier. An ensemble classification model built from these base classifiers then classifies the target data. In addition, multi-threaded parallelism on a multi-core platform further improves computational efficiency. Experimental results on six gene datasets show that the algorithm achieves better classification with fewer feature genes, and that its results are significantly better than those of existing and similar methods, making it an effective method for high-dimensional data classification.

6.
Weakly supervised relation extraction uses existing relation-entity pairs to automatically acquire training data from a text corpus, effectively alleviating the shortage of training data. To address the noise, insufficient features, and class imbalance of weakly supervised training data, which lead to poor relation extraction performance, this paper proposes the NF-Tri-training (Tri-training with Noise Filtering) weakly supervised relation extraction algorithm. It uses undersampling to resolve class imbalance, iteratively learns new samples from unlabeled data based on Tri-training to improve the generalization ability of the classifiers, and employs data-editing techniques to identify and remove mislabeled samples from the initial training data and from those produced in each iteration. Experimental results on a dataset collected from Hudong Baike show that the NF-Tri-training algorithm effectively improves the performance of the relation classifier.

7.
The accuracy of machine learners is affected by the quality of the data on which the learners are induced. In this paper, the quality of the training dataset is improved by removing instances detected as noisy by the Partitioning Filter. The fit dataset is first split into subsets, and different base learners are induced on each of these splits. The predictions are combined in such a way that an instance is identified as noisy if it is misclassified by a certain number of base learners. Two versions of the Partitioning Filter are used: the Multiple-Partitioning Filter and the Iterative-Partitioning Filter. The number of instances removed by the filters is tuned via the filter's voting scheme and the number of iterations. The primary aim of this study is to compare the predictive performance of the final models built on the filtered and unfiltered training datasets. A case study of software measurement data from a high-assurance software project is performed. It is shown that the predictive performance of models built on the filtered fit datasets and evaluated on a noisy test dataset is generally better than that of models built on the noisy (unfiltered) fit dataset. However, predictive performance based on certain aggressive filters is affected by the presence of noise in the evaluation dataset.

8.
Instance selection aims at filtering out noisy data (or outliers) from a given training set, which not only reduces the need for storage space but can also ensure that a classifier trained on the reduced set provides similar or better performance than the baseline classifier trained on the original set. However, since there are numerous instance selection algorithms, there is no single winner that is best across problem domains; instance selection performance is algorithm and dataset dependent. One main reason for this is that it is very hard to define what the outliers are across different datasets. It should be noted that, with a specific instance selection algorithm, over-selection may occur: too many 'good' data samples are filtered out, leaving a classifier that performs worse than the baseline. In this paper, we introduce a dual classification (DuC) approach, which aims to deal with this potential drawback of over-selection. Specifically, after performing instance selection over a given training set, two classifiers are trained using the 'good' and 'noisy' sets, respectively, identified by the instance selection algorithm. Each test sample is then compared for similarity with the data in the good and noisy sets, and this comparison routes the test sample to one of the two classifiers. The experiments are conducted using 50 small-scale and 4 large-scale datasets, and the results demonstrate the superior performance of the proposed DuC approach over the baseline instance selection approach.
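A minimal sketch of the DuC routing idea, assuming scikit-learn; out-of-sample disagreement stands in for the instance selection algorithm, and 1-nearest-neighbor distance serves as the similarity measure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

X, y = make_classification(n_samples=600, flip_y=0.1, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

# Instance selection stand-in: out-of-sample disagreement marks "noisy" data.
pred = cross_val_predict(KNeighborsClassifier(), X_tr, y_tr, cv=5)
good, noisy = pred == y_tr, pred != y_tr

clf_good = KNeighborsClassifier().fit(X_tr[good], y_tr[good])
clf_noisy = KNeighborsClassifier().fit(X_tr[noisy], y_tr[noisy])
nn_good = NearestNeighbors(n_neighbors=1).fit(X_tr[good])
nn_noisy = NearestNeighbors(n_neighbors=1).fit(X_tr[noisy])

# Route each test sample to the classifier whose training set it resembles.
d_good = nn_good.kneighbors(X_te)[0].ravel()
d_noisy = nn_noisy.kneighbors(X_te)[0].ravel()
y_hat = np.where(d_good <= d_noisy, clf_good.predict(X_te), clf_noisy.predict(X_te))
print("accuracy:", (y_hat == y_te).mean())
```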

9.
Data in the intrusion detection domain are typically high-dimensional and nonlinear, and contain large amounts of noise, redundancy, and continuous attributes, so general pattern classification methods cannot handle them effectively. To further improve intrusion detection, an ensemble intrusion detection algorithm based on neighborhood rough sets is proposed. Bagging is used to generate multiple training subsets with large diversity. Given the continuous nature of intrusion detection data, neighborhood rough set models with different radii are applied to each training subset for attribute reduction, removing redundancy and noise to improve the classification performance of the attribute subsets while further increasing the diversity of the training subsets. SVMs are trained as base classifiers, and the detection accuracy of each base classifier is used as its weight in a weighted ensemble. Simulation results on the KDD99 dataset show that the algorithm effectively improves the accuracy and efficiency of intrusion detection and exhibits good generalization and stability.
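A simplified sketch of the accuracy-weighted ensemble, assuming scikit-learn; univariate feature selection with varying k stands in for neighborhood rough set attribute reduction with varying radii:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, n_features=40, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)
rng = np.random.default_rng(5)

models, weights = [], []
for k in (10, 15, 20):                    # varying k mimics varying radii
    boot = rng.choice(len(y_tr), size=len(y_tr), replace=True)  # Bagging
    m = make_pipeline(SelectKBest(f_classif, k=k), SVC(probability=True))
    m.fit(X_tr[boot], y_tr[boot])
    models.append(m)
    weights.append(m.score(X_tr, y_tr))   # weight = detection accuracy (proxy)

# Accuracy-weighted soft vote over the base SVMs.
proba = sum(w * m.predict_proba(X_te) for w, m in zip(weights, models))
print("ensemble accuracy:", (proba.argmax(1) == y_te).mean())
```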

10.
This paper proposes applying machine learning techniques to predict students' performance on two real-world educational data-sets. The first data-set is used to predict the response of students with autism while they learn a specific task, whereas the second is used to predict students' failure at a secondary school. The two data-sets suffer from two major problems that can negatively impact the ability of classification models to predict the correct label: class imbalance and class noise. A series of experiments has been carried out to improve the quality of the training data, and hence improve prediction results. In this paper, we propose two noise filter methods to eliminate noisy instances of the majority class located inside the borderline area. Our methods combine the SMOTE over-sampling technique with a thresholding technique to balance the training data and choose the best boundary between classes. We then apply a noise detection approach to identify the noisy instances. We have used the two data-sets to assess the efficacy of class-imbalance approaches as well as both proposed methods. Results for different classifiers show that the AUC scores improve significantly when the two proposed methods are combined with existing class-imbalance techniques.
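A minimal sketch of the SMOTE-plus-thresholding idea, assuming scikit-learn and imbalanced-learn; the threshold grid and the borderline rule for majority-class instances are illustrative simplifications:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], flip_y=0.05,
                           random_state=6)

# Balance the classes with SMOTE, then pick the decision threshold that best
# separates the classes on out-of-sample probabilities.
X_bal, y_bal = SMOTE(random_state=6).fit_resample(X, y)
proba = cross_val_predict(RandomForestClassifier(random_state=6),
                          X_bal, y_bal, cv=5, method="predict_proba")[:, 1]
thresholds = np.linspace(0.1, 0.9, 17)
best_t = max(thresholds, key=lambda t: ((proba >= t) == y_bal).mean())

# Noise detection: majority-class instances on the wrong side of the chosen
# boundary sit in the borderline area and are removed.
noisy = (y_bal == 0) & (proba >= best_t)
X_clean, y_clean = X_bal[~noisy], y_bal[~noisy]
print(f"threshold={best_t:.2f}, removed {noisy.sum()} borderline majority instances")
```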

11.
Bayesian model averaging (BMA) can resolve the overfitting problem by explicitly incorporating the model uncertainty into the analysis procedure. Hence, it can be used to improve the generalization performance of Bayesian network classifiers. Until now, BMA of Bayesian network classifiers has only been performed in some restricted forms, e.g., the model is averaged given a single node-order, because of its heavy computational burden. However, it can be hard to obtain a good node-order when the available training dataset is sparse. To alleviate this problem, we propose BMA of Bayesian network classifiers over several distinct node-orders obtained using the Markov chain Monte Carlo sampling technique. The proposed method was examined using two synthetic problems and four real-life datasets. First, we show that the proposed method is especially effective when the given dataset is very sparse. The classification accuracy of averaging over multiple node-orders was higher in most cases than that achieved using a single node-order in our experiments. We also present experimental results for test datasets with unobserved variables, where the quality of the averaged node-order is more important. Through these experiments, we show that the difference in classification performance between the cases of multiple node-orders and single node-order is related to the level of noise, confirming the relative benefit of averaging over multiple node-orders for incomplete data. We conclude that BMA of Bayesian network classifiers over multiple node-orders has an apparent advantage when the given dataset is sparse and noisy, despite the method's heavy computational cost.

12.
In signal peptide prediction, signal peptide sequences vary in length and are diverse in amino acid composition, so previous methods usually rely on a sliding window, which leads to information loss and data imbalance. To improve prediction for the minority class, the training data are preprocessed: the majority-class samples are partitioned, and each resulting group is merged with the minority-class samples to form several data subsets. Under two protein encoding schemes, probabilistic neural networks are used to build multiple classifiers on these subsets, and weighted voting combines the classifiers to predict signal peptides. Experiments on the widely used Neilsen dataset show that the method is effective.

13.
Noisy and large data sets are extremely difficult to handle and especially to predict. Time series prediction is a problem frequently addressed by researchers in many engineering fields. This paper presents a hybrid approach to handling a large and noisy data set: a Self-Organizing Map (SOM) combined with multiple recurrent neural networks (RNNs) is trained to predict the components of such a data set. The SOM incrementally constructs a set of clusters, each represented by a subset of the data that is used to train one recurrent neural network. Backpropagation through time is used to train the set of recurrent neural networks. To show the performance of the proposed approach, a problem of instruction address prefetching is treated.

14.
An ensemble classification method for microarray data
To address the low classification accuracy of existing ensemble classification methods for microarray data, a Bagging-PCA-SVM method is proposed. The method first applies Bootstrap resampling to the training set to form a large number of training subsets; feature selection and principal component analysis are then performed on each subset to remove noisy and redundant genes; finally, support vector machines are used as classifiers, and majority voting predicts the class of each sample. Tests on three datasets demonstrate the effectiveness and feasibility of the method.
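A minimal sketch of Bagging-PCA-SVM, assuming scikit-learn; the per-subset feature-selection step from the abstract is folded into PCA here, and the member count and component count are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in for microarray data: many features, few samples.
X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)
rng = np.random.default_rng(7)

preds = []
for _ in range(15):                                     # Bootstrap resampling
    boot = rng.choice(len(y_tr), size=len(y_tr), replace=True)
    model = make_pipeline(PCA(n_components=10), SVC())  # PCA removes noise/redundancy
    model.fit(X_tr[boot], y_tr[boot])
    preds.append(model.predict(X_te))

# Majority vote across the bagged PCA+SVM members (binary labels: mean > 0.5).
y_hat = (np.mean(preds, axis=0) > 0.5).astype(int)
print("accuracy:", (y_hat == y_te).mean())
```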

15.
For large-scale machine learning problems containing noise and outliers, the non-convex Ramp loss function is adopted to suppress the influence of noisy and corrupted data, and a fast learning method for non-convex linear support vector machines based on stochastic optimization is proposed, effectively improving training speed and prediction accuracy. Experimental results show that the method reduces learning time; on the MNIST dataset, training time is four orders of magnitude lower than that of conventional learning methods. It also improves prediction speed to some extent and effectively enhances the classifier's generalization on noisy datasets.
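A worked sketch of stochastic subgradient descent on the ramp loss, assuming labels in {-1, +1}; the clipping point s, learning rate, and regularization strength are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y01 = make_classification(n_samples=2000, n_features=20, flip_y=0.15,
                             random_state=8)
y = 2 * y01 - 1                  # labels in {-1, +1}

# Ramp loss R_s(z) = min(1 - s, max(0, 1 - z)) with z = y * w.x: the hinge
# loss clipped at 1 - s, so badly misclassified (likely noisy) points stop
# contributing gradient.
s, lr, lam = -1.0, 0.01, 1e-4
rng = np.random.default_rng(8)
w = np.zeros(X.shape[1])

for epoch in range(10):
    for i in rng.permutation(len(y)):     # stochastic: one point at a time
        z = y[i] * (X[i] @ w)
        grad = lam * w                    # L2 regularization
        if s < z < 1:                     # only the un-clipped region
            grad -= y[i] * X[i]           # contributes a hinge subgradient
        w -= lr * grad

print("training accuracy:", (np.sign(X @ w) == y).mean())
```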

16.

Empirical studies on ensemble learning that combines multiple classifiers have shown that it is an effective technique for improving the accuracy and stability of a single classifier. In this paper, we propose a novel method for dynamically building diversified sparse ensembles. We first apply canonical correlation analysis (CCA) to model the relationship between the input data variables and the outputs of the base classifiers. The canonical (projected) classifier outputs and input training data variables are encoded globally through a multi-linear CCA projection, decreasing the impact of noisy input data and incorrect classifiers to a minimum degree in this global view. Secondly, based on the projection, a sparse regression method combined with a classifier diversity measure is used to prune the ensemble down to representative classifiers. We evaluate the proposed approach on several datasets, such as UCI and handwritten digit recognition. Experimental results show that the proposed approach achieves better accuracy than other ensemble methods such as QFWEC, Simple Vote Rule, Random Forest, Drep and Adaboost.
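A loose sketch of the CCA-plus-sparse-regression pruning, assuming scikit-learn; CCA's predict re-expresses the base-classifier outputs through the shared subspace, and Lasso's zero coefficients prune members (the diversity measure from the paper is omitted):

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=9)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=9)
rng = np.random.default_rng(9)

# A pool of base classifiers trained on bootstrap samples.
pool = []
for _ in range(20):
    b = rng.choice(len(y_tr), len(y_tr), replace=True)
    pool.append(DecisionTreeClassifier(max_depth=3, random_state=9).fit(X_tr[b], y_tr[b]))
O = np.column_stack([c.predict_proba(X_val)[:, 1] for c in pool])  # output matrix

# CCA links input variables and classifier outputs; re-expressing the outputs
# through the shared subspace damps noisy inputs and erratic classifiers.
cca = CCA(n_components=5).fit(X_val, O)
O_smooth = cca.predict(X_val)

# Sparse regression over the smoothed outputs: classifiers whose columns get
# zero weight are pruned from the ensemble.
coef = Lasso(alpha=0.01).fit(O_smooth, y_val).coef_
selected = np.flatnonzero(np.abs(coef) > 1e-6)
print(f"kept {len(selected)} of {len(pool)} classifiers")
```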


17.
To address the "curse of dimensionality" and the imbalanced classification problem in web spam detection, a binary classifier algorithm based on immune clonal feature selection and undersampling (US) ensembles is proposed. First, undersampling draws from the majority class several sample sets whose sizes are close to that of the minority class, and each is merged with the minority-class samples to form multiple balanced training subsets. An immune clonal algorithm is then designed to select multiple optimal feature subsets, and the balanced subsets are projected onto these optimal feature subsets to generate multiple views of the balanced data. Finally, random forest (RF) classifiers classify the test samples, and simple voting determines the final class of each test sample. Experimental results on the WEBSPAM UK-2006 dataset show that, applied to web spam detection, the ensemble classifier improves accuracy, F1-measure, and AUC by more than 11% compared with the random forest algorithm and its Bagging and AdaBoost ensemble variants; compared with the best published results, it improves the F1-measure by 2% and achieves the best AUC.

18.
In this work a novel technique for building ensembles of classifiers for spectrogram classification is presented. We propose a simple approach for classifying signals from a large database of plant echoes. These echoes are highly complex stochastic signals; nevertheless, their spectrograms contain enough information to extract a good set of features for training the proposed ensemble of classifiers. The proposed ensemble is a modified version of a recent feature-transform-based ensemble method, the Input Decimated Ensemble. In the proposed variant, different subsets of randomly extracted training patterns are used to create a set of different Neighborhood Preserving Embedding (NPE) subspace projections. These feature transformations are applied to the whole dataset, and a set of decision trees is trained on the transformed spaces. Finally, the scores of this set of classifiers are combined by the sum rule. Experiments carried out on a previously proposed dataset show the superiority of this method with respect to other approaches: it outperforms the previously proposed combination of principal component analysis and support vector machine (SVM) on the tested dataset. Moreover, we show that the fusion of the proposed ensemble and the SVM-based system outperforms both stand-alone methods.
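A rough sketch of the input-decimated subspace ensemble, assuming scikit-learn; LocallyLinearEmbedding (the nonlinear counterpart of NPE, which scikit-learn does not ship) stands in for the NPE projections, and the digits dataset stands in for spectrogram features:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=10)
rng = np.random.default_rng(10)

scores = np.zeros((len(X_te), 10))
for _ in range(5):
    # Input decimation: each member fits its neighborhood-preserving
    # projection on a random subset of the training patterns...
    idx = rng.choice(len(X_tr), size=len(X_tr) // 2, replace=False)
    emb = LocallyLinearEmbedding(n_components=10, n_neighbors=12).fit(X_tr[idx])
    # ...and the transformation is then applied to the whole dataset.
    tree = DecisionTreeClassifier(random_state=10).fit(emb.transform(X_tr), y_tr)
    scores += tree.predict_proba(emb.transform(X_te))   # sum rule

print("ensemble accuracy:", (scores.argmax(1) == y_te).mean())
```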

19.
To address the feature misalignment problem in object detection, the concept of multi-configuration feature bags is proposed to characterize the different misalignments that the same feature may exhibit. When learning the object classifier, a Boosting algorithm selects the most discriminative feature bags, each corresponding to a single feature and its misalignment configurations, and the object classifier is a linear combination of the optimal feature-bag classifiers. Furthermore, multiple-instance learning is introduced to effectively evaluate the discriminative power of the feature bags and to learn the feature-bag classifiers. Experiments on face datasets show that, by accounting for feature misalignment, the proposed algorithm achieves better detection performance than conventional methods. Compared with a fixed bag-generation scheme, multi-configuration feature bags better model feature misalignment, improving the detection rate while yielding a smaller detector.

20.
An independent component analysis (ICA) based least squares support vector machine (LS-SVM) is proposed for multi-step-ahead independent prediction of time series. ICA estimates the independent components (ICs) of the predictor variables, and the time series is reconstructed from the noise-free ICs. The k-nearest-neighbor (k-NN) method reduces the size of the training set, a new distance function is proposed to reduce the computational complexity of LS-SVM training, and constraints are used to post-process the predicted values. Comparative prediction experiments on several time series using the ICA-based LS-SVM, a standard LS-SVM, and a back-propagation neural network (BP-ANN) show that the prediction performance of the ICA-based LS-SVM is superior to that of the standard LS-SVM and the BP-ANN.
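A minimal sketch under stated assumptions: FastICA denoises a lag-embedded predictor matrix, and kernel ridge regression (which shares the LS-SVM's squared-loss, L2-regularized form) stands in for the LS-SVM; the k-NN training-set reduction and the custom distance function are omitted:

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.kernel_ridge import KernelRidge

# Noisy toy series and its lag-embedded predictor matrix.
rng = np.random.default_rng(11)
t = np.arange(1200)
series = np.sin(0.05 * t) + 0.3 * np.sin(0.21 * t) + rng.normal(0, 0.2, t.size)
lags = 8
X = np.column_stack([series[i:i - lags] for i in range(lags)])
y = series[lags:]

# ICA on the lagged predictors; the lowest-energy sources are treated as
# noise and zeroed before reconstructing the predictors.
ica = FastICA(n_components=lags, random_state=11)
S = ica.fit_transform(X)
energy = (S ** 2).mean(axis=0) * (ica.mixing_ ** 2).sum(axis=0)
S[:, np.argsort(energy)[:2]] = 0          # drop the two weakest components
X_clean = ica.inverse_transform(S)

# Kernel ridge regression as the LS-SVM stand-in, fit on the denoised inputs.
model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X_clean[:1000], y[:1000])
pred = model.predict(X_clean[1000:])
print("test MSE:", ((pred - y[1000:]) ** 2).mean())
```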

