Similar articles
20 similar articles found (search time: 46 ms)
1.
In classification, noise can deteriorate system performance and increase the complexity of the models built. To mitigate its consequences, several approaches have been proposed in the literature; among them, noise filtering, which removes noisy examples from the training data, is one of the most widely used techniques. This paper proposes a new noise filtering method that combines several filtering strategies to increase the accuracy of the classification algorithms used after the filtering process. The filtering is based on fusing the predictions of several classifiers used to detect the presence of noise: we translate the idea behind multiple classifier systems, where information gathered from different models is combined, to noise filtering, using a combination of classifiers rather than a single one to detect noise. Additionally, the proposed method follows an iterative noise filtering scheme that avoids using detected noisy examples in each new iteration of the filtering process. Finally, we introduce a noise score to control the filtering sensitivity, so that the number of noisy examples removed in each iteration can be adapted to the needs of the practitioner. The first two strategies (multiple classifiers and iterative filtering) improve the filtering accuracy, whereas the last one (the noise score) controls how conservatively the filter removes potentially noisy examples. The validity of the proposed method is studied in an exhaustive experimental study: we compare the new filtering method against several state-of-the-art methods for datasets with class noise, and study their efficacy on three classifiers with different sensitivity to noise.
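The core of the method above is a noise score per example, derived from the disagreement between an ensemble's predictions and the given labels, applied iteratively. A minimal sketch of that fusion-and-filter logic is below; it takes the ensemble predictions as fixed inputs, whereas the actual method retrains the classifiers on the surviving data each round, and the 0.5 threshold is only an illustrative choice:

```python
import numpy as np

def noise_scores(labels, preds):
    """Noise score per example: the fraction of ensemble members whose
    prediction disagrees with the given label.  preds has shape
    (n_classifiers, n_examples)."""
    return (preds != labels).mean(axis=0)

def iterative_filter(y, preds, threshold=0.5, max_iter=10):
    """Iteratively drop examples whose noise score exceeds `threshold`.
    Returns a boolean mask of the examples kept.  Since predictions are
    fixed here, the loop converges after one effective pass; it is kept
    to mirror the paper's iterative scheme."""
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_iter):
        scores = noise_scores(y[keep], preds[:, keep])
        noisy = scores > threshold
        if not noisy.any():
            break
        idx = np.flatnonzero(keep)
        keep[idx[noisy]] = False
    return keep

# Toy data: example 1 is mislabeled (all three classifiers disagree with it).
y = np.array([0, 0, 1, 1, 1])
preds = np.array([[0, 1, 1, 1, 1],
                  [0, 1, 1, 1, 0],
                  [0, 1, 1, 1, 1]])
kept = iterative_filter(y, preds, threshold=0.5)
```

With these inputs only example 1 is removed; example 4 draws one dissenting vote (score 1/3), which the threshold treats as acceptable disagreement rather than noise.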

2.
Yuan  Weiwei  Guan  Donghai  Zhu  Qi  Ma  Tinghuai 《Neural computing & applications》2018,29(10):673-683

As a kind of noise, mislabeled training data exist in many applications. Because of their negative effects on learning, many filter techniques have been proposed to identify and eliminate them. The ensemble learning-based filter (EnFilter), which employs an ensemble of classifiers, is the most widely used. In EnFilter, the noisy training dataset is first divided into several subsets; each noisy subset is then checked by multiple classifiers trained on the other noisy subsets. Since the data used to train these classifiers are themselves noisy, the quality of the classifiers cannot be guaranteed, which may yield poor noise identification; the problem is more serious when the noise ratio in the training dataset is high. To solve it, a straightforward but effective approach is proposed in this work: instead of using noisy data to train the classifiers, nearly noise-free (NNF) data are used, since they are expected to produce more reliable classifiers. To this end, a novel NNF data extraction approach is also proposed. Experimental results on a set of benchmark datasets illustrate the utility of the proposed approach.
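One simple proxy for a "nearly noise-free" subset is unanimity: keep only examples on which every ensemble member agrees with the given label, then train the filter's classifiers on that subset. The selection rule below is an illustrative stand-in, not the paper's actual NNF extraction procedure:

```python
import numpy as np

def extract_nnf(y, preds):
    """Unanimity proxy for a nearly noise-free subset: keep examples on
    which every ensemble member's prediction matches the given label.
    preds has shape (n_classifiers, n_examples)."""
    return (preds == y).all(axis=0)

y = np.array([0, 0, 1, 1])
preds = np.array([[0, 1, 1, 1],
                  [0, 0, 1, 1]])
nnf_mask = extract_nnf(y, preds)  # example 1 draws a dissenting vote
```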


3.
This paper applies machine learning techniques to predict students' performance on two real-world educational datasets. The first dataset is used to predict the responses of students with autism while they learn a specific task, whereas the second is used to predict students' failure at a secondary school. Both datasets suffer from two major problems that can negatively impact the ability of classification models to predict the correct label: class imbalance and class noise. A series of experiments has been carried out to improve the quality of the training data and hence the prediction results. We propose two noise filter methods to eliminate noisy instances of the majority class located inside the borderline area. Our methods combine SMOTE over-sampling with a thresholding technique to balance the training data and choose the best boundary between classes; a noise detection approach is then applied to identify the noisy instances. We use the two datasets to assess the efficacy of class-imbalance approaches as well as of both proposed methods. Results for different classifiers show that AUC scores improve significantly when the two proposed methods are combined with existing class-imbalance techniques.
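The SMOTE step mentioned above synthesizes new minority-class points by interpolating each seed toward one of its nearest minority neighbours. A minimal numpy sketch of that idea (not the paper's full pipeline, and `k=2` is just an illustrative setting):

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic minority examples, SMOTE-style: pick a
    random minority seed, pick one of its k nearest minority neighbours,
    and interpolate a random fraction of the way toward it."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # skip the seed point itself
        j = rng.choice(nbrs)
        lam = rng.random()              # interpolation fraction in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
X_new = smote_like(X_min, n_new=5)
```

Each synthetic point lies on a segment between two existing minority points, so the new samples stay inside the minority class's convex hull.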

4.
Crowdsourcing is an emerging way to collect labels for datasets. Although it is inexpensive, label quality cannot be guaranteed, and labels become even less reliable when objective factors cause crowd workers to perform poorly. This paper therefore proposes a method for improving crowdsourced label quality via feature augmentation (the FA-method). The basic idea is as follows: first, a small portion of the data is labeled by experts; a model trained on the crowd-labeled data then predicts the expert set, and its predictions are appended to the expert set as new features. A model trained on this feature-augmented expert set then predicts the crowd-labeled data, and the probability of each instance being noisy, together with an upper bound on the amount of noise, is used to filter out instances with potentially noisy labels. The same feature-augmentation step is applied again to the filtered high-quality set to further correct noise. Validation on 8 UCI datasets shows that, compared with existing crowdsourced-labeling methods that combine noise identification and correction, the proposed method performs well both when repeated labels are few and when labeling quality is low.

5.
The ensemble method is a powerful data mining paradigm that builds a classification model by integrating multiple diversified component learners. Bagging, one of the most successful ensemble methods, trains classifiers on bootstrap replicates of the training set and aggregates their predictions. However, in bagging, the bootstrapped training sets become more and more similar as redundancy increases. Besides redundancy, any training set is usually subject to noise, and may moreover be imbalanced; thus each training instance has a different impact on the learning process. This paper explores some properties of the ensemble margin and its use in improving the performance of bagging. We introduce a new approach, based on margin theory, to measure the importance of training data in learning, and then propose a new bagging method that concentrates on critical instances. This method is more accurate than bagging and more robust than boosting; compared to bagging, it reduces the bias while generally keeping the same variance. Our findings suggest that (a) examples with low margins tend to be more critical for classifier performance; (b) examples with higher margins tend to be more redundant; and (c) misclassified examples with high margins tend to be noisy. Our experimental results on 15 data sets show that the generalization error of bagging can be reduced by up to 2.5% and its resilience to noise strengthened by iteratively removing both typical and noisy training instances, reducing the training set size by up to 75%.
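A common definition of the ensemble margin of an example is the vote share of the true class minus that of the most-voted other class, a value in [-1, 1]; low margins mark critical examples, strongly negative margins on misclassified examples suggest noise. A small sketch under that definition (one of several margin definitions in the literature):

```python
from collections import Counter

def ensemble_margin(predictions, true_label):
    """Ensemble margin in [-1, 1]: (votes for the true class minus votes
    for the most-voted other class) divided by the ensemble size."""
    c = Counter(predictions)
    v_true = c.get(true_label, 0)
    v_other = max((v for lbl, v in c.items() if lbl != true_label), default=0)
    return (v_true - v_other) / len(predictions)
```

For instance, four voters splitting 3-1 in favour of the true class give a margin of 0.5, while a unanimous wrong vote gives -1.0, the profile the paper associates with noisy examples.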

6.
In the real world, it is extremely difficult to avoid errors; for instance, a doctor may misdiagnose patients. In other words, databases are never free from data entry or other related errors, and many kinds of mistakes are unavoidable in real-world data sets. In existing approaches to pattern recognition, handling noisy data in the learning process generally yields better generalization performance than ignoring the noise. In this article, a novel adaptive weighting mechanism for learning from noisy data is proposed, especially for boosting approaches, preventing the algorithm from concentrating on unreasonably noisy learning samples. Several experiments on the UC Irvine Machine Learning Repository and a facial expression data set demonstrate the effectiveness of our method.
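Standard boosting exponentially up-weights misclassified samples, which lets persistent label noise dominate training. One simple way to prevent that concentration, sketched below, is to cap each weight at a multiple of the uniform weight before renormalizing; the cap value and this particular update are our illustrative choices, not the paper's exact mechanism:

```python
import numpy as np

def update_weights(w, miss, alpha, cap=10.0):
    """AdaBoost-style reweighting with a cap: a sample's weight cannot
    exceed `cap` times the uniform weight 1/n, so repeatedly misclassified
    (possibly noisy) samples cannot dominate.  miss[i] is 1 if sample i
    was misclassified, else 0."""
    w = w * np.exp(alpha * miss)
    w = np.minimum(w, cap / len(w))
    return w / w.sum()

w0 = np.full(4, 0.25)
w1 = update_weights(w0, np.array([1, 0, 0, 0]), alpha=1.0)
```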

7.
Purpose: Non-switching random-valued impulse noise (RVIN) denoising models built with deep convolutional neural networks (DCNN) outperform mainstream switching RVIN denoising algorithms in both denoising quality and execution efficiency. In practice, however, the performance of such training-based (data-driven) denoising models depends on accurately measuring how severely the image to be denoised is corrupted by noise (a data-dependence problem). To this end, a fast RVIN noise ratio estimation (NRE) model based on a shallow convolutional neural network is proposed. Method: The main task of the NRE model is to detect the noise ratio of the image to be denoised as an indicator of the severity of noise corruption. Based on its output, a corresponding pre-trained interval-specific DCNN denoising model is adaptively invoked, completing the denoising task quickly and with high quality. Results: Tests were conducted on two sets, 10 commonly used images and 50 texture images, and compared with the detection modules of existing mainstream RVIN denoising algorithms. On the common-image set, the proposed NRE model is the most accurate: its root-mean-square error of the estimated noise ratio is 0.6%–2.4% lower than that of the second-best algorithm. On the 50 texture images, the NRE model shows the smallest fluctuation in root-mean-square error, indicating the best stability. Comparing execution efficiency by the overall average execution time on a 512×512-pixel image, the NRE model needs only 0.02 s. The experimental data show that the proposed NRE model can quickly and stably measure the severity of RVIN corruption in natural images over a wide range of noise ratios, so a non-blind DCNN denoising model combined with it seamlessly becomes a blind denoising algorithm. Conclusion: The proposed RVIN noise ratio prediction model has robust accuracy at all noise ratios and, used together with a DCNN-based non-switching RVIN denoising model, properly resolves the data-dependence problem inherent in DCNN models.

8.
Purpose: Most image denoising algorithms are non-blind, and achieving good denoising performance with them requires an accurate estimate of the image noise level. However, existing noise level estimation (NLE) algorithms suffer from insufficient descriptive power in their noise-level-aware feature (NLAF) extraction module and limited accuracy in the module that maps the features to a noise level. An improved algorithm is therefore proposed that extracts NLAF features automatically with a convolutional neural network (CNN) and maps them to the corresponding noise level with a boosted BP (back propagation) neural network. Method: In the training stage, a CNN model is trained first, and those outputs of the fully connected layer that correlate most strongly with the noise level form the NLAF feature vector. Then, with the support of AdaBoost, several BP neural networks with relatively weak mapping ability are combined into a boosted BP neural network predictor with stronger non-linear mapping ability, which maps the NLAF feature vector directly to a noise level. In the prediction stage, several patches are randomly selected from the given noisy image and fed into the CNN model; after the NLAF features of each patch have been extracted, the pre-trained BP network maps them to noise levels, and the median of these estimates is taken as the final estimate of the image noise level. Results: For noisy images with various noise levels and content structures, the error between the estimated and true noise levels is below 0.5 and the root-mean-square error below 0.9, showing good prediction accuracy and stability. The algorithm is also efficient, needing only about 13.9 ms to estimate the noise level of a 512×512-pixel image. Conclusion: The experimental data show that the proposed algorithm offers stable prediction accuracy and high efficiency at high, medium, and low noise levels; its overall performance is better than that of existing mainstream NLE algorithms, and it is well suited to applications that require the noise level as a key parameter.

9.
Noise is one of the main factors degrading the quality of original multichannel remote sensing data, and its presence affects classification efficiency, object detection, etc. Thus, pre-filtering is often used to remove noise and improve the solution of the final tasks of multichannel remote sensing. Recent studies indicate that the classical additive noise model is not adequate for images formed by modern multichannel sensors operating in the visible and infrared bands; however, this fact is often ignored by researchers designing noise removal methods and algorithms. We therefore focus on the classification of multichannel remote sensing images when signal-dependent noise is present in the component images. Three approaches to filtering multichannel images under this noise model are analysed, all based on the discrete cosine transform in blocks. The study is carried out not only in terms of conventional filtering efficiency metrics (MSE) but also in terms of multichannel data classification accuracy (probability of correct classification, confusion matrix). The proposed classification system combines a pre-processing stage, in which a DCT-based filter processes the blocks of the multichannel remote sensing image, with a classification stage. Two modern classifiers are employed: a radial basis function neural network and support vector machines. Simulations are carried out on a three-channel image from the Landsat TM sensor. Different cases of learning are considered: using noise-free samples of the test multichannel image, the noisy multichannel image, and the pre-filtered one. It is shown that training on the pre-filtered image produces better classification than training on the noisy image, and that the best results for both groups of quantitative criteria are obtained when the proposed 3D discrete cosine transform filter equipped with a variance stabilizing transform is applied.
The classification results obtained for data pre-filtered in different ways agree for both classifiers. A comparison of classifier performance is carried out as well: the radial basis function neural network classifier is less sensitive to noise in the original images, but after pre-filtering the performance of both classifiers is approximately the same.

10.
To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and the classifiers learned from the major set are used to identify noise in the minor set. The drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it can be physically impossible or too time consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be technically infeasible or explicitly forbidden to download data from other sites (for security or privacy reasons). These approaches therefore have severe limitations in conducting effective global data cleansing on large, distributed datasets. In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed effort identifies and eliminates mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets, or partition a large dataset into subsets, each of which is regarded as a local subset small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset and use them to evaluate the whole dataset. For a given instance I_k, two error count variables count the number of times it has been identified as noise across all data subsets; instances with higher error counts have a higher probability of being mislabeled. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples.
Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach. A preliminary version of this paper was published in the Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 2003, pp. 920–927.
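The two threshold schemes reduce to simple rules over the per-instance error counts: "majority" flags an instance judged noisy by more than half of the subset models, while "non-objection" requires every subset model to agree. A minimal sketch of that voting step (the surrounding rule-learning machinery is omitted):

```python
import numpy as np

def flag_noise(error_counts, n_subsets, scheme="majority"):
    """error_counts[i]: how many of the n_subsets local models judged
    instance i to be noisy.  'majority' flags instances voted noisy by
    more than half the models; 'non-objection' only flags instances
    every model agrees on."""
    counts = np.asarray(error_counts)
    if scheme == "majority":
        return counts > n_subsets / 2
    return counts == n_subsets
```

With 3 subsets, an instance needs 2 noise votes under majority but all 3 under non-objection, so non-objection is the more conservative cleanser.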

11.
12.
In this work, a first approach to a robust phoneme recognition task by means of a biologically inspired feature extraction method is presented. The proposed technique approximates the representation of the speech signal at the auditory cortical level. It is based on an optimal dictionary of atoms, estimated from auditory spectrograms, and the Matching Pursuit algorithm to approximate the cortical activations. This provides a sparse coding with intrinsic noise robustness, which can be exploited when using the system in adverse environments. The recognition task consisted of classifying a set of 5 easily confused English phonemes, in both clean and noisy conditions. Multilayer perceptrons were trained as classifiers, and the performance was compared to other classic and robust parameterizations: the auditory spectrogram, probabilistic optimum filtering on Mel frequency cepstral coefficients, and perceptual linear prediction coefficients. Results showed a significant improvement in the recognition rate of clean and noisy phonemes with the cortical representation over these other parameterizations.

13.
Many techniques have been proposed for credit risk prediction, from statistical models to artificial intelligence methods. However, very few research efforts have been devoted to dealing with the presence of noise and outliers in the training set, which may strongly affect the performance of the prediction model. Accordingly, the aim of the present paper is to systematically investigate whether the application of filtering algorithms leads to an increase in accuracy of instance-based classifiers in the context of credit risk assessment. The experimental results with 20 different algorithms and 8 credit databases show that the filtered sets perform significantly better than the non-preprocessed training sets when using the nearest neighbour decision rule. The experiments also allow us to identify which techniques are most robust and accurate when confronted with noisy credit data.

14.
Soft fuzzy rough sets for robust feature evaluation and selection
The fuzzy dependency function proposed in the fuzzy rough set model is widely employed in feature evaluation and attribute reduction. In this paper, it is shown that this function is not robust to noisy information. As datasets in real-world applications are usually contaminated by noise, the robustness of data analysis models is very important in practice. In this work, we develop a new model of fuzzy rough sets, called soft fuzzy rough sets, which can reduce the influence of noise. We discuss the properties of the model and construct a new dependence function from it, which we then use to evaluate and select features. The presented experimental results show the effectiveness of the new model.

15.
Machine learning techniques often have to deal with noisy data, which may affect the accuracy of the resulting data models. Therefore, effectively dealing with noise is a key aspect of supervised learning for obtaining reliable models from data. Although several authors have studied the effect of noise on particular learners, comparisons of its effect across different learners are lacking. In this paper, we address this issue by systematically comparing how different degrees of noise affect four supervised learners that belong to different paradigms: the Naïve Bayes probabilistic classifier, the C4.5 decision tree, the IBk instance-based learner, and the SMO support vector machine. These four methods enable us to contrast different learning paradigms, and they are considered to be four of the top ten algorithms in data mining (Yu et al. 2007). We test them on a collection of data sets that are perturbed with noise in the input attributes and noise in the output class. As an initial hypothesis, we assign the techniques to two groups based on their expected sensitivity to noise: NB with C4.5 and IBk with SMO, the first group being the least sensitive. The analysis enables us to extract key observations about the effect of different types and degrees of noise on these learning techniques. In general, we find that Naïve Bayes appears to be the most robust algorithm, and SMO the least, relative to the other two techniques. However, the underlying empirical behavior of the techniques is more complex, and varies depending on the noise type and the specific data set being processed. In general, noise in the training data set is found to give the learners the most difficulty.
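Comparisons like the one above rest on a controlled noise-injection step: flipping a known fraction of class labels before training. A minimal sketch of uniform class-noise injection for binary labels (the study also perturbs input attributes, which this sketch omits):

```python
import numpy as np

def inject_label_noise(y, rate, seed=0):
    """Flip a `rate` fraction of binary labels chosen uniformly at random.
    Returns the noisy labels and the indices that were flipped, so the
    corruption is reproducible and auditable."""
    rng = np.random.default_rng(seed)
    n_flip = int(round(rate * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy = y.copy()
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy, idx

y = np.zeros(100, dtype=int)
y_noisy, flipped = inject_label_noise(y, rate=0.1)
```

Training each learner on `y_noisy` at several rates, and testing on clean labels, yields the kind of robustness curves the study compares.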

16.
A novel framework for context modeling based on the probability of co-occurrence of objects and scenes is proposed. The modeling is quite simple and builds upon the availability of robust appearance classifiers. Images are represented by their posterior probabilities with respect to a set of contextual models, built upon the bag-of-features image representation, through two layers of probabilistic modeling. The first layer represents the image in a semantic space, where each dimension encodes an appearance-based posterior probability with respect to a concept. Due to the inherent ambiguity of classifying image patches, this representation suffers from a certain amount of contextual noise. The second layer enables robust inference in the presence of this noise by modeling the distribution of each concept in the semantic space. A thorough and systematic experimental evaluation of the proposed context modeling is presented. It is shown to capture the contextual "gist" of natural images. Scene classification experiments show that contextual classifiers outperform their appearance-based counterparts, irrespective of the precise choice and accuracy of the latter. The effectiveness of the proposed approach to context modeling is further demonstrated through a comparison to existing approaches on scene classification and image retrieval, on benchmark data sets. In all cases, the proposed approach achieves superior results.

17.
Nearest prototype classification of noisy data
Nearest prototype approaches offer a common way to design classifiers. However, when data is noisy, the success of this sort of classifier depends on parameters that the designer needs to tune, such as the number of prototypes. In this work, we study the ENPC technique, based on the nearest prototype approach, on noisy datasets. Previous experimentation with this algorithm had shown that it does not require any parameter tuning to obtain good solutions on problems where class limits are well defined and data is not noisy. Here, we show that the algorithm obtains solutions with high classification success even when data is noisy. A comparison with optimal (hand-made) solutions and other classification algorithms demonstrates the good performance of the ENPC algorithm in accuracy and number of prototypes as the noise level increases. We have performed experiments on four different datasets, each with different characteristics.
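The decision rule underlying such classifiers is simple: assign a query the label of its nearest prototype. The sketch below shows only this rule; ENPC's evolutionary mechanism for finding the prototypes themselves is beyond the sketch:

```python
import numpy as np

def nearest_prototype(x, prototypes, proto_labels):
    """Classify x with the label of its nearest prototype (Euclidean
    distance).  prototypes: (n_prototypes, n_features)."""
    d = np.linalg.norm(prototypes - np.asarray(x, dtype=float), axis=1)
    return proto_labels[int(np.argmin(d))]

prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])
proto_labels = ["a", "b"]
```

With noisy data, the number and placement of prototypes govern how much the decision boundary bends toward mislabeled points, which is why prototype-count tuning matters for other methods.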

18.
Ye Yuxin, Xue Huan, Wang Lu, Ouyang Dantong. Journal of Software, 2020, 31(4):1025-1038
The greatest advantage of distantly supervised relation extraction is that labeled data are generated by automatically aligning a knowledge base with natural-language text. While this simple automatic alignment frees people from heavy annotation work, it inevitably produces mislabeled data, which hampers building high-quality relation extraction models. To address the label-noise problem in distantly supervised relation extraction, this paper proposes the hypothesis that "the label produced by the final sentence alignment is a noisy observation generated from the true label by some unknown factors." On this basis, a new relation extraction model is built, consisting of an encoding layer, a noise-distribution-based attention layer, a true-label output layer, and a noisy-observation layer. The model learns the transition probability from true labels to noisy labels from the automatically labeled data, and at test time the final relation classification is obtained from the true-label output layer. The paper then studies how to combine the noisy-observation model with deep neural networks, focusing on the noise-distribution attention mechanism over deep-neural-network encodings and on denoising imbalanced samples within the deep-learning framework, further improving the accuracy and robustness of the noisy-observation model for distantly supervised relation extraction. Finally, validation experiments are conducted on public benchmark datasets under identical parameter settings; by analyzing the distribution of sample noise, the noisy-observation model is evaluated under various noise distributions and compared with mainstream baseline methods. The results show that the proposed model achieves higher precision and recall.
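The noisy-observation layer amounts to projecting the model's distribution over true labels through a learned label-noise transition matrix. A minimal sketch of that projection, with a hand-chosen 2-label transition matrix standing in for the learned one:

```python
import numpy as np

def noisy_observation(p_true, T):
    """Project a distribution over true labels through a label-noise
    transition matrix T, where T[i, j] = P(observed label j | true label i).
    Returns the induced distribution over observed (noisy) labels."""
    return np.asarray(p_true) @ np.asarray(T)

# Illustrative transition matrix: label 0 is observed correctly 90% of
# the time, label 1 only 80% of the time.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
p_obs = noisy_observation([1.0, 0.0], T)
```

During training the model matches `p_obs` against the noisy alignment labels; at test time the pre-projection distribution over true labels is read out directly.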

19.
Removing random-valued impulse noise from images using geometric structure detection
Although the median filter and its many variants are effective at removing random-valued impulse noise from images, most denoising methods suffer from difficult threshold selection and over-smoothing of edge and texture structures. To address this problem, a new method based on geometric structure is proposed for detecting and removing random-valued impulse noise. The method first estimates the impulse noise ratio from the image histogram; then, based on the noise ratio and the histogram of the detail image, it adaptively determines two classification thresholds. Using these two thresholds, the pixels of the detail image are classified as "uncorrupted", "undecided", or "noisy". Since the "undecided" class consists mainly of edge and texture pixels together with noise pixels, a geometric structure detection step is introduced to distinguish them. Based on each pixel's type, the detail image is used to correct the result of median filtering. Experimental results show that the new method preserves edge structure well while removing impulse noise, and has a clear advantage over existing methods.
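The detail image driving the classification is essentially the deviation of each pixel from a local median. The sketch below flags impulse candidates with a single fixed threshold on that deviation; the actual method derives two adaptive thresholds from histograms and adds the geometric structure check, both omitted here:

```python
import numpy as np

def detect_impulses(img, thresh=50):
    """Flag pixels whose absolute deviation from the 3x3 local median
    exceeds `thresh` -- a crude, fixed-threshold stand-in for the paper's
    adaptive classification of the detail image."""
    pad = np.pad(img, 1, mode="edge")
    med = np.empty(img.shape)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            med[i, j] = np.median(pad[i:i + 3, j:j + 3])
    return np.abs(img.astype(float) - med) > thresh

img = np.full((5, 5), 100)
img[2, 2] = 255          # one impulse in a flat region
mask = detect_impulses(img)
```

On flat regions this rule is exact; near edges it misfires, which is precisely the failure mode the paper's "undecided" class and geometric detection are designed to handle.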

20.
Feature selection helps increase the random diversity among the members of an ensemble classifier and thus improves generalization accuracy. This paper studies two ensemble construction algorithms based on feature selection, Random Subspace and Rotation Forest, and analyzes the relationship between their feature-selection schemes and the degree of random diversity. By injecting noise into UCI datasets, the classification accuracy of the two methods under noisy conditions is compared. The experimental results show that as noise increases and feature correlation decreases, both the base learning algorithm and the noise level affect the ensemble's performance; once the noise grows beyond a certain level, the performance of the ensemble converges to that of a single classifier.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号