Similar Documents
20 similar documents found (search time: 31 ms)
1.
Cluster ensembles in collaborative filtering recommendation
Recommender systems, which suggest items likely to be of interest to users and filter out less favored ones, have been widely developed. Collaborative filtering is a widely used recommendation technique, based on the assumption that people who share the same preferences on some items tend to share the same preferences on others. Clustering techniques are commonly used for collaborative filtering recommendation. While cluster ensembles have been shown to outperform many single clustering techniques in the literature, their performance for recommendation has not been fully examined. Thus, the aim of this paper is to assess the applicability of cluster ensembles to collaborative filtering recommendation. In particular, two well-known clustering techniques (self-organizing maps (SOM) and k-means) and three ensemble methods (the cluster-based similarity partitioning algorithm (CSPA), the hypergraph partitioning algorithm (HGPA), and majority voting) are used. Experimental results on the Movielens dataset show that cluster ensembles can provide better recommendation performance than single clustering techniques in terms of accuracy and precision. In addition, there are no statistically significant differences among the three SOM ensembles or among the three k-means ensembles; either the SOM or the k-means ensembles could serve in the future as a baseline collaborative filtering technique.
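As a rough illustration of the consensus step behind CSPA, the sketch below builds a co-association matrix from repeated k-means runs on a user-item rating matrix and re-clusters it. This is a minimal sketch under assumed inputs; the function names, parameter values, and random data are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def cluster_ensemble(ratings, n_clusters=5, n_members=10, seed=0):
    """CSPA-style consensus: co-association matrix from repeated k-means."""
    n_users = ratings.shape[0]
    coassoc = np.zeros((n_users, n_users))
    rng = np.random.RandomState(seed)
    for _ in range(n_members):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=rng.randint(1 << 30)).fit_predict(ratings)
        # Count co-membership: 1 where two users share a cluster in this run.
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= n_members
    # Re-cluster the consensus similarity; 1 - coassoc acts as a distance.
    # (`metric="precomputed"` requires scikit-learn >= 1.2.)
    final = AgglomerativeClustering(n_clusters=n_clusters, metric="precomputed",
                                    linkage="average").fit_predict(1.0 - coassoc)
    return final

# Usage: rows are users; recommendations would then be drawn from each
# user's consensus cluster.
ratings = np.random.rand(100, 40)
print(cluster_ensemble(ratings)[:10])
```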

2.
This paper addresses object category classification by committees, or ensembles, of classifiers, each based on its own distinct codebook. Two methods of constructing visual codebook ensembles are proposed. The first introduces diverse individual codebooks by using different clustering algorithms; the second uses codebooks of different sizes to construct an ensemble with high diversity. Codebook ensembles are trained to capture and convey image properties from different aspects, and different types of image representations can be acquired from them. A classifier ensemble can then be trained on the different representations derived from the same training image set, and using this ensemble to categorize new images can improve performance. Detailed experimental analysis on a Pascal VOC challenge dataset reveals that the ensemble approach performs well, consistently improves the performance of visual object classifiers, and achieves state-of-the-art categorization performance.
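The sketch below illustrates the second construction: codebooks of several sizes built by k-means over local descriptors, one bag-of-words classifier per codebook, fused by majority vote. It is a minimal sketch with stand-in random descriptors (real pipelines would use SIFT or similar); all names and sizes are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def encode(descriptors_per_image, codebook):
    # Histogram of nearest-codeword assignments for each image.
    hists = []
    for d in descriptors_per_image:
        words = codebook.predict(d)
        hists.append(np.bincount(words, minlength=codebook.n_clusters))
    return np.array(hists, dtype=float)

rng = np.random.default_rng(0)
train_desc = [rng.normal(size=(50, 16)) for _ in range(60)]  # stand-in descriptors
y_train = rng.integers(0, 3, size=60)

ensemble = []
for size in (32, 64, 128):                                   # diverse codebook sizes
    all_desc = np.vstack(train_desc)
    codebook = KMeans(n_clusters=size, n_init=4, random_state=0).fit(all_desc)
    clf = LinearSVC().fit(encode(train_desc, codebook), y_train)
    ensemble.append((codebook, clf))

test_desc = [rng.normal(size=(50, 16)) for _ in range(10)]
votes = np.stack([clf.predict(encode(test_desc, cb)) for cb, clf in ensemble])
# Majority vote over the per-codebook classifiers.
y_pred = np.array([np.bincount(col).argmax() for col in votes.T])
print(y_pred)
```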

3.
To address the problem that crowdsourced labels still contain noise after label aggregation, a self-training-based label noise correction algorithm (STLNC) is proposed. STLNC proceeds in three stages. In the first stage, a filter partitions the crowdsourced dataset with aggregated labels into a noisy set and a clean set. In the second stage, a weighted density-peak clustering algorithm constructs the spatial structure linking low-density instances to high-density instances in the dataset. In the third stage, a noise-instance selection strategy is designed based on the discovered spatial structure; an ensemble classifier trained on the clean set then corrects the selected noisy instances according to the designed correction strategy, the corrected instances are added to the clean set, and the ensemble classifier is retrained. Instance selection and correction are repeated until all instances in the noisy set have been corrected, and the ensemble classifier obtained in the final round then corrects all instances. Experimental results on simulated standard datasets and real crowdsourced datasets show that STLNC outperforms five other state-of-the-art noise correction algorithms on two metrics: noise ratio and model quality.
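The sketch below shows only the outer self-training loop: train on the clean set, correct the most confident noisy instances, fold them back in, and repeat. It is a simplified stand-in — the paper's weighted density-peak structure is replaced here by plain confidence ranking, and all parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_training_correction(X_clean, y_clean, X_noisy, batch=20):
    """Iteratively relabel noisy instances with an ensemble trained on clean data."""
    X_c, y_c = X_clean.copy(), y_clean.copy()
    remaining = X_noisy.copy()
    while len(remaining) > 0:
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_c, y_c)
        proba = clf.predict_proba(remaining)
        conf = proba.max(axis=1)
        pick = np.argsort(-conf)[:batch]              # most confident instances first
        new_y = clf.classes_[proba[pick].argmax(axis=1)]
        X_c = np.vstack([X_c, remaining[pick]])       # grow the clean set
        y_c = np.concatenate([y_c, new_y])
        remaining = np.delete(remaining, pick, axis=0)
    return X_c, y_c                                   # final round retrains on everything
```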

4.
Using all samples in the optimization process does not produce robust results on datasets with label noise, because gradients computed from the losses of noisy samples push the optimization in the wrong direction. In this paper, we recommend using only the samples in the mini-batch whose loss is below a threshold determined during optimization, instead of all of them. Our proposed method, Adaptive-k, aims to exclude label-noise samples from the optimization process and thereby make it robust. On noisy datasets, we found that a threshold-based approach such as Adaptive-k produces better results than using all samples or a fixed number of low-loss samples per mini-batch. On the basis of our theoretical analysis and experimental results, we show that Adaptive-k comes closest to the performance of the Oracle, in which noisy samples are removed from the dataset entirely. Adaptive-k is a simple but effective method: it requires no prior knowledge of the dataset's noise ratio, no additional model training, and does not significantly increase training time. In the experiments, we also show that Adaptive-k is compatible with different optimizers such as SGD, SGDM, and Adam. The code for Adaptive-k is available at GitHub.
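The numpy sketch below illustrates the general threshold idea on logistic regression: each mini-batch update uses only samples whose loss falls below a running threshold. The threshold rule here (a moving average of batch-loss medians) is an illustrative stand-in, not the paper's exact adaptive rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_thresholded(X, y, epochs=20, lr=0.1, batch=32, seed=0):
    """Mini-batch SGD that skips high-loss (suspected noisy) samples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    threshold = None                                  # first batch uses all samples
    for _ in range(epochs):
        perm = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            i = perm[start:start + batch]
            p = sigmoid(X[i] @ w)
            loss = -(y[i] * np.log(p + 1e-9) + (1 - y[i]) * np.log(1 - p + 1e-9))
            keep = np.ones(len(i), bool) if threshold is None else loss < threshold
            if keep.any():                            # update only on low-loss samples
                w -= lr * X[i][keep].T @ (p[keep] - y[i][keep]) / keep.sum()
            med = 2.0 * np.median(loss)               # illustrative adaptive rule
            threshold = med if threshold is None else 0.9 * threshold + 0.1 * med
    return w
```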

5.
Label noise can severely degrade the performance of deep network models. To address this problem, this paper proposes a contrastive-learning-based method for classifying images with noisy labels. The method comprises an adaptive threshold, a contrastive learning module, and a class-prototype-based label denoising module. First, contrastive learning extracts robust image features by maximizing the similarity between two augmented views of the same image. Next, a novel adaptive threshold filters the training samples, with the per-class threshold adjusted dynamically during training according to how well each class has been learned. A class-prototype-based label denoising module is then introduced, which updates pseudo-labels by computing the similarity between sample feature vectors and prototype vectors, thereby avoiding the influence of label noise. Comparative experiments on the public datasets CIFAR-10 and CIFAR-100 and the real-world dataset ANIMAL10 show that, under synthetic noise, the proposed method outperforms conventional methods. Updating pseudo-labels from the similarity between robust image features and the class prototypes reduces the negative impact of noisy labels and improves the model's noise resistance, confirming the model's effectiveness.
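A minimal sketch of the prototype-based denoising step is shown below: prototypes are the normalized mean feature of each class, and a sample's pseudo-label is the class of its most similar prototype. It assumes features already extracted by a contrastive encoder and that every class appears in `labels`; the inputs and names are illustrative.

```python
import numpy as np

def update_pseudo_labels(features, labels, n_classes):
    """Reassign each sample to the class of its nearest prototype (cosine similarity)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    # One prototype per class: mean of that class's normalized features.
    prototypes = np.stack([f[labels == c].mean(axis=0) for c in range(n_classes)])
    prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = f @ prototypes.T                 # cosine similarity to each prototype
    return sims.argmax(axis=1)              # updated pseudo-labels
```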

6.
Delivering an accurate estimate of software development effort plays a decisive role in the successful management of a software project, and several effort estimation techniques have been proposed, including analogy-based ones. Despite the large number of proposed techniques, none has outperformed the others in all circumstances, and previous studies have recommended generating estimates from ensembles of various single techniques rather than relying on one solo technique. Hence, this paper proposes, for the first time, two types of homogeneous ensembles based on single Classical Analogy or single Fuzzy Analogy. To evaluate this proposal, we conducted an empirical study with 100 variants of Classical Analogy and 60 variants of Fuzzy Analogy. These variants were assessed using standardized accuracy and effect-size criteria over seven datasets, then clustered using the Scott-Knott statistical test and ranked using four unbiased error measures. Moreover, three linear combiners were used to combine the single estimates. The results show that there is no best single Classical/Fuzzy Analogy technique across all datasets; the constructed ensembles are often ranked first and generally perform better than the single techniques. Furthermore, Fuzzy Analogy ensembles achieve better performance than Classical Analogy ensembles, while there is no best Classical/Fuzzy ensemble across all datasets and no evidence concerning the best combiner.
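The sketch below shows a homogeneous Classical Analogy ensemble in its simplest form: several k-NN analogy estimators that differ only in k, combined with two linear combiners (mean and median; the abstract mentions a third, omitted here). Euclidean similarity and the choice of k values are assumptions, not the paper's setup.

```python
import numpy as np

def analogy_estimate(X_hist, effort_hist, x_new, k):
    """Classical analogy: mean effort of the k most similar past projects."""
    d = np.linalg.norm(X_hist - x_new, axis=1)
    nearest = np.argsort(d)[:k]
    return effort_hist[nearest].mean()

def ensemble_estimate(X_hist, effort_hist, x_new, ks=(1, 2, 3, 4, 5)):
    """Homogeneous ensemble: same technique, varied k, linearly combined."""
    singles = np.array([analogy_estimate(X_hist, effort_hist, x_new, k) for k in ks])
    return {"mean": singles.mean(), "median": np.median(singles)}
```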

7.
The Corona Virus Disease 2019 (COVID-19) has been declared a worldwide pandemic, and chest X-ray imaging is a key method for diagnosing it. Applying convolutional neural networks to medical imaging helps diagnose the disease accurately, and label quality plays an important role in the classification of COVID-19 chest X-rays. However, most existing classification methods ignore the fact that labels are rarely completely correct, and noisy labels significantly degrade the performance of image classification frameworks. In addition, because lesions are widely distributed and COVID-19 chest X-ray images contain many local features, existing label recovery algorithms face the bottleneck that noisy samples are difficult to reuse. This paper therefore introduces a general classification framework for COVID-19 chest X-ray images with noisy labels and proposes a noisy label recovery algorithm based on subset label iterative propagation and replacement (SLIPR). Specifically, the proposed algorithm first draws random subsets of the samples multiple times. It then integrates several techniques, including principal component analysis, low-rank representation, neighborhood graph regularization, and k-nearest neighbors, for feature extraction and image classification. Finally, multi-level weight distribution and replacement are performed on the labels to cleanse the noise. For the label-recovered dataset, high-confidence samples are further selected as the training set to improve the stability and accuracy of the classification framework without affecting its inherent performance. Three typical datasets are chosen for extensive experiments and comparisons with existing algorithms under different metrics. Results on three publicly available COVID-19 chest X-ray datasets show that the proposed algorithm effectively recovers noisy labels and improves classification accuracy by 18.9% on the Tawsifur dataset, 19.92% on the Skytells dataset, and 16.72% on the CXRs dataset. Compared to state-of-the-art algorithms, the gain in classification accuracy of SLIPR on the three datasets reaches 8.67%-19.38%, and the algorithm also offers some scalability while preserving data integrity.
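The sketch below captures only the subset-vote skeleton of this kind of algorithm: labels are repeatedly re-predicted on random subsets by a k-NN model trained on other samples of the subset, votes are accumulated, and labels are replaced where the vote is strong. PCA stands in for the paper's fuller feature-extraction pipeline (low-rank representation, graph regularization); the voting threshold and parameters are assumptions. It also assumes more than 32 samples.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def subset_label_recovery(X, y_noisy, n_classes, rounds=30, subset=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Xp = PCA(n_components=min(32, X.shape[1])).fit_transform(X)
    votes = np.zeros((len(X), n_classes))
    for _ in range(rounds):
        idx = rng.choice(len(X), size=int(subset * len(X)), replace=False)
        half = len(idx) // 2
        knn = KNeighborsClassifier(5).fit(Xp[idx[:half]], y_noisy[idx[:half]])
        proba = knn.predict_proba(Xp[idx[half:]])
        votes[np.ix_(idx[half:], knn.classes_)] += proba   # accumulate subset votes
    y_new = y_noisy.copy()
    # Replace a label only when one class clearly dominates the votes.
    confident = votes.max(axis=1) > 0.8 * votes.sum(axis=1).clip(min=1e-9)
    y_new[confident] = votes[confident].argmax(axis=1)
    return y_new
```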

8.

The algorithm selection problem is defined as identifying the best-performing machine learning (ML) algorithm for a given combination of dataset, task, and evaluation measure. The human expertise required to evaluate the increasing number of ML algorithms available has resulted in the need to automate the algorithm selection task. Various approaches have emerged to handle the automatic algorithm selection challenge, including meta-learning. Meta-learning is a popular approach that leverages accumulated experience for future learning and typically involves dataset characterization. Existing meta-learning methods often represent a dataset using predefined features and thus cannot be generalized across different ML tasks, or alternatively, learn a dataset’s representation in a supervised manner and therefore are unable to deal with unsupervised tasks. In this study, we propose a novel learning-based task-agnostic method for producing dataset representations. Then, we introduce TRIO, a meta-learning approach, that utilizes the proposed dataset representations to accurately recommend top-performing algorithms for previously unseen datasets. TRIO first learns graphical representations for the datasets, using four tools to learn the latent interactions among dataset instances and then utilizes a graph convolutional neural network technique to extract embedding representations from the graphs obtained. We extensively evaluate the effectiveness of our approach on 337 datasets and 195 ML algorithms, demonstrating that TRIO significantly outperforms state-of-the-art methods for algorithm selection for both supervised (classification and regression) and unsupervised (clustering) tasks.
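A much-simplified sketch of the meta-learning recommendation step follows. TRIO's learned graph-based dataset embeddings are replaced here by plain statistical meta-features; only the recommend-by-nearest-dataset mechanism is shown, and every name is illustrative. It assumes datasets with at least two numeric columns.

```python
import numpy as np

def meta_features(X):
    # Cheap task-agnostic characterization of a dataset (stand-in for
    # learned graph embeddings).
    return np.array([X.shape[0], X.shape[1], X.mean(), X.std(),
                     np.abs(np.corrcoef(X, rowvar=False)).mean()])

def recommend(seen_feats, seen_scores, X_new, k=3, top=2):
    """Recommend the algorithms that did best on the k nearest seen datasets.

    seen_feats: (n_datasets, n_meta_features); seen_scores: (n_datasets, n_algorithms).
    """
    d = np.linalg.norm(seen_feats - meta_features(X_new), axis=1)
    neighbours = np.argsort(d)[:k]
    avg = seen_scores[neighbours].mean(axis=0)     # average score per algorithm
    return np.argsort(-avg)[:top]                  # indices of recommended algorithms
```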


9.
10.
Noisy labels are pervasive in real-world datasets and can severely impair the learning of deep neural networks. To address this problem, a method for identifying and relabeling noisy-label data based on label-difference learning is proposed. The method designs two pseudo-label generation strategies, uses the clean data identified by a base network to generate an artificial noisy dataset, and computes that dataset's label-difference vector or label-difference matrix. With the goal of strengthening the associations between similar classes, two noise-learning networks — a label-difference vector network and a label-difference matrix network, built from fully connected layers and single-row convolution kernels — directly learn the noise probability of each sample. A threshold linearly related to the noise rate then separates clean data from noisy data. Experiments analyze the factors affecting recognition performance, including the pseudo-label generation strategy, the network structure, and the number of training iterations. Tests on public datasets show that, under a variety of noise distributions, the algorithm markedly improves the precision and recall on noisy data — by up to 16.45% and 21.01%, respectively — while keeping the precision and recall on clean data essentially stable.
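The sketch below condenses the label-difference idea: inject artificial noise into data the base model classifies cleanly, describe each sample by the difference between the model's predicted distribution and its (possibly corrupted) one-hot label, and train a detector on those difference vectors. Logistic regression stands in for the paper's vector/matrix networks; the base model is assumed to be already fitted, and all parameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_noise_detector(base_model, X_clean, y_clean, noise_rate=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n_classes = len(np.unique(y_clean))
    y_art = y_clean.copy()
    flip = rng.random(len(y_art)) < noise_rate          # artificially corrupted subset
    y_art[flip] = rng.integers(0, n_classes, flip.sum())
    proba = base_model.predict_proba(X_clean)           # base_model assumed fitted
    diff = proba - np.eye(n_classes)[y_art]             # label-difference vectors
    return LogisticRegression(max_iter=1000).fit(diff, flip.astype(int))

# Usage: score new samples with detector.predict_proba(diff_new)[:, 1] and
# compare against a threshold chosen as a linear function of the noise rate.
```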

11.
In this paper, we present a novel meta-feature generation method in the context of meta-learning, which is based on rules that compare the performance of individual base learners in a one-against-one manner. In addition to these new meta-features, we also introduce a new meta-learner called Approximate Ranking Tree Forests (ART Forests) that performs very competitively when compared with several state-of-the-art meta-learners. Our experimental results are based on a large collection of datasets and show that the proposed new techniques can improve the overall performance of meta-learning for algorithm ranking significantly. A key point in our approach is that each performance figure of any base learner for any specific dataset is generated by optimising the parameters of the base learner separately for each dataset.
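As a minimal sketch of the one-against-one idea, the code below derives one meta-feature per pair of base learners by comparing their cross-validated accuracy on the same dataset. The learner choices and the win/lose encoding are assumptions; the ART Forests meta-learner itself is not shown.

```python
import numpy as np
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def pairwise_meta_features(X, y):
    """One meta-feature per learner pair: does learner a beat learner b here?"""
    learners = [DecisionTreeClassifier(random_state=0), GaussianNB(),
                KNeighborsClassifier()]
    scores = [cross_val_score(m, X, y, cv=3).mean() for m in learners]
    return np.array([float(scores[a] > scores[b])
                     for a, b in combinations(range(len(learners)), 2)])
```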

12.
An ensemble in machine learning is defined as a set of models (such as classifiers or predictors) that are induced individually from data by one or more machine learning algorithms for a given task and then work collectively in the hope of producing improved decisions. In this paper we investigate the factors that influence ensemble performance, chiefly the accuracy of individual classifiers, the diversity between classifiers, the number of classifiers in an ensemble, and the decision fusion strategy. Among them, diversity is believed to be a key factor but is more complex and difficult to measure quantitatively, so it was chosen as the focus of this study, together with its relationships to the other factors. A technique was devised to build ensembles of decision trees induced with randomly selected features. Three sets of experiments were performed on 12 benchmark datasets, and the results indicate that (i) a high level of diversity indeed makes an ensemble more accurate and robust than individual models; and (ii) small ensembles can produce results as good as, or better than, large ensembles provided appropriate (e.g. more diverse) models are selected for inclusion. This implies that, when scaling up to larger databases, the greater efficiency of smaller ensembles becomes all the more significant and beneficial. As a test case study, ensembles were built on these findings for a real-world application, osteoporosis classification; in each of the three datasets used, the ensembles out-performed individual decision trees consistently and reliably.
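The sketch below mirrors the experimental setup described above: decision trees induced on randomly selected feature subsets, combined by majority vote, plus a simple pairwise-disagreement measure of diversity. The parameters and the disagreement measure are illustrative choices; integer class labels are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_feature_ensemble(X, y, n_trees=15, n_feats=5, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_trees):
        feats = rng.choice(X.shape[1], size=n_feats, replace=False)
        tree = DecisionTreeClassifier(random_state=0).fit(X[:, feats], y)
        members.append((feats, tree))
    return members

def predict_and_diversity(members, X):
    preds = np.stack([t.predict(X[:, f]) for f, t in members])
    vote = np.array([np.bincount(c).argmax() for c in preds.T])  # majority vote
    # Diversity: average pairwise disagreement between member predictions.
    n = len(preds)
    dis = np.mean([(preds[i] != preds[j]).mean()
                   for i in range(n) for j in range(i + 1, n)])
    return vote, dis
```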

13.
Yuan Weiwei, Guan Donghai, Zhu Qi, Ma Tinghuai. Neural Computing & Applications, 2018, 29(10): 673-683.

As a kind of noise, mislabeled training data exist in many applications. Because of their negative effects on learning, many filter techniques have been proposed to identify and eliminate them. The ensemble learning-based filter (EnFilter), which employs ensemble classifiers, is the most widely used. In EnFilter, the noisy training dataset is first divided into several subsets; each noisy subset is then checked by multiple classifiers trained on the other noisy subsets. Since the training data used to train these classifiers are themselves noisy, their quality cannot be guaranteed, which might yield poor noise identification results. This problem is more serious when the noise ratio in the training dataset is high. To solve it, a straightforward but effective approach is proposed in this work: instead of using noisy data to train the classifiers, nearly noise-free (NNF) data are used, since they should produce more reliable classifiers. To this end, a novel NNF data extraction approach is also proposed. Experimental results on a set of benchmark datasets illustrate the utility of our proposed approach.
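The sketch below shows the baseline EnFilter scheme the paper starts from: split the noisy training set into subsets, train one checker per complementary subset, and flag an instance as mislabeled when a majority of checkers disagree with its label. The paper's refinement would train these checkers on nearly noise-free data instead; the classifier choice and k are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ensemble_filter(X, y, k=5, seed=0):
    """Flag instances misclassified by a majority of cross-subset checkers."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    noisy = np.zeros(len(y), dtype=bool)
    for i, check in enumerate(folds):
        wrong = np.zeros(len(check), dtype=int)
        for j, train in enumerate(folds):
            if i == j:
                continue                              # never check with own subset
            clf = DecisionTreeClassifier(random_state=0).fit(X[train], y[train])
            wrong += (clf.predict(X[check]) != y[check]).astype(int)
        noisy[check] = wrong > (k - 1) / 2            # majority of checkers disagree
    return noisy
```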


14.
In classification, noise may deteriorate system performance and increase the complexity of the models built. To mitigate its consequences, several approaches have been proposed in the literature. Among them, noise filtering, which removes noisy examples from the training data, is one of the most widely used. This paper proposes a new noise filtering method that combines several filtering strategies to increase the accuracy of the classification algorithms applied after filtering. The filtering is based on fusing the predictions of several classifiers used to detect the presence of noise: we translate the idea behind multiple classifier systems, where information gathered from different models is combined, to noise filtering, considering a combination of classifiers rather than a single one to detect noise. Additionally, the proposed method follows an iterative noise filtering scheme that avoids using detected noisy examples in subsequent iterations of the filtering process. Finally, we introduce a noisy score to control the filtering sensitivity, so that the number of noisy examples removed in each iteration can be adapted to the practitioner's needs. The first two strategies (multiple classifiers and iterative filtering) improve the filtering accuracy, whereas the last (the noisy score) controls how conservatively the filter removes potentially noisy examples. The validity of the proposed method is studied in an exhaustive experimental study: we compare the new filtering method against several state-of-the-art methods for datasets with class noise and study their efficacy with three classifiers of differing sensitivity to noise.
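The sketch below combines the three strategies in miniature: several different classifiers vote on whether each remaining instance looks mislabeled, a noisy score (the fraction of dissenting votes) is compared to a sensitivity threshold, and detected instances are excluded from the next iteration. The classifier choices, cv folds, and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def iterative_fusion_filter(X, y, score_threshold=0.67, max_iter=5):
    keep = np.arange(len(y))
    models = [DecisionTreeClassifier(random_state=0), GaussianNB(),
              KNeighborsClassifier()]
    for _ in range(max_iter):
        votes = np.zeros(len(keep))
        for m in models:
            pred = cross_val_predict(m, X[keep], y[keep], cv=3)
            votes += (pred != y[keep]).astype(float)
        score = votes / len(models)               # the "noisy score"
        noisy = score >= score_threshold
        if not noisy.any():
            break                                 # nothing left to remove
        keep = keep[~noisy]                       # drop detected noise this round
    return keep                                   # indices of retained instances
```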

15.
This paper addresses the task of identifying nonlinear dynamic systems from measured data. The discrete-time variant of this task is commonly reformulated as a regression problem. As tree ensembles have proven to be a successful predictive modeling approach, we investigate their use for solving this regression problem. While different variants of tree ensembles have been proposed and used, they are mostly limited to regression trees as base models; we introduce ensembles of fuzzified model trees with split attribute randomization and evaluate them for nonlinear dynamic system identification. Models of dynamic systems built for control purposes are usually evaluated by a more stringent procedure based on the output, i.e., the simulation error. Taking this into account, we perform ensemble pruning to optimize the output error of the tree ensemble models. The proposed Model-Tree Ensemble method is empirically evaluated on input-output data disturbed by noise, and compared to representative state-of-the-art approaches on one synthetic dataset with artificially introduced noise and one real-world noisy dataset. The evaluation shows that the method is suitable for modeling dynamic systems and produces models with output error performance comparable to the other approaches. The method is also resilient to noise, as its performance does not deteriorate even when up to 20% noise is added.
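The sketch below shows the regression reformulation and the stricter output-error evaluation: past inputs and outputs become regressors, a tree ensemble predicts the next output, and simulation feeds the model's own predictions back in. RandomForestRegressor is a stand-in for the paper's fuzzified model trees; the toy system and lag are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_regressors(u, y, lag=2):
    # Lagged inputs and outputs as features (NARX-style reformulation).
    rows = [np.r_[u[t-lag:t], y[t-lag:t]] for t in range(lag, len(y))]
    return np.array(rows), y[lag:]

def simulate(model, u, y_init, lag=2):
    y_sim = list(y_init[:lag])
    for t in range(lag, len(u)):
        x = np.r_[u[t-lag:t], y_sim[t-lag:t]]
        y_sim.append(model.predict(x.reshape(1, -1))[0])  # feed back own output
    return np.array(y_sim)

rng = np.random.default_rng(0)
u = rng.normal(size=400)
y = np.zeros(400)
for t in range(2, 400):                           # a toy nonlinear system plus noise
    y[t] = 0.6*y[t-1] - 0.1*y[t-2] + np.tanh(u[t-1]) + 0.05*rng.normal()

X, target = make_regressors(u, y)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, target)
print(np.mean((simulate(model, u, y) - y) ** 2))  # output (simulation) error
```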

16.
Feature selection helps increase the random diversity among members of an ensemble classifier and thus improves generalization accuracy. This paper studies two ensemble construction algorithms based on feature selection, Random Subspace (RandomSubspace) and Rotation Forest (RotationForest), and analyzes the relationship between their feature selection schemes and the degree of random diversity. By injecting noise into UCI datasets, the classification accuracy of the two methods in noisy environments is compared. The experimental results show that as noise increases and feature correlation decreases, both the base learning algorithm and the noise level affect the ensemble's performance; once the noise grows strong enough, the performance of the ensemble converges to that of a single classifier.
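The sketch below reproduces the noise-injection comparison in miniature: flip a fraction of the training labels, then compare a single tree with a random-subspace ensemble. scikit-learn has no built-in Rotation Forest, so that side of the comparison is omitted; load_iris is a stand-in for the UCI datasets, and all parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)
for noise in (0.0, 0.1, 0.3):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise           # inject label noise
    y_noisy[flip] = rng.integers(0, 3, flip.sum())
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy)
    # bootstrap=False + max_features < 1.0 turns bagging into random subspace.
    rs = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=30, max_features=0.5,
                           bootstrap=False, random_state=0).fit(X_tr, y_noisy)
    print(noise, tree.score(X_te, y_te), rs.score(X_te, y_te))
```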

17.
In general, the analysis of microarray data requires two steps: feature selection and classification. Given the variety of feature selection methods and classifiers, it is difficult to find optimal ensembles composed of feature-classifier pairs. This paper proposes a novel method based on an evolutionary algorithm (EA) to form sophisticated ensembles of features and classifiers that achieve high classification performance. Despite the exponential number of possible ensembles of individual feature-classifier pairs, an EA can produce the best ensemble in a reasonable amount of time. The chromosome is encoded with real values that decide the weight of each feature-classifier pair in an ensemble. Experimental results on two well-known microarray datasets, in terms of time and classification rate, indicate that the proposed method produces ensembles superior to individual classifiers as well as to ensembles optimized by random and greedy strategies.
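A toy version of the EA encoding is sketched below: a real-valued chromosome assigns one weight per feature-classifier pair, fitness is the weighted-vote accuracy on a validation split, and a simple mutation-only (mu + lambda) loop searches the weight space. Everything here — the operators, parameters, and input format — is an illustrative stand-in for the paper's setup.

```python
import numpy as np

def evolve_weights(pair_probas, y_val, gens=50, pop=20, sigma=0.3, seed=0):
    """pair_probas: (n_pairs, n_samples, n_classes) validation predictions,
    one slice per feature-classifier pair."""
    rng = np.random.default_rng(seed)
    n_pairs = pair_probas.shape[0]

    def fitness(w):
        # Weighted soft vote across pairs, then accuracy of the argmax class.
        combined = np.tensordot(np.maximum(w, 0), pair_probas, axes=1)
        return (combined.argmax(axis=1) == y_val).mean()

    population = rng.random((pop, n_pairs))
    for _ in range(gens):
        children = population + sigma * rng.normal(size=population.shape)
        both = np.vstack([population, children])
        scores = np.array([fitness(w) for w in both])
        population = both[np.argsort(-scores)[:pop]]   # (mu + lambda) selection
    return population[0]
```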

18.
The accuracy of machine learners is affected by the quality of the data on which they are induced. In this paper, the quality of the training dataset is improved by removing instances detected as noisy by the Partitioning Filter. The fit dataset is first split into subsets, and different base learners are induced on each of these splits. The predictions are combined in such a way that an instance is identified as noisy if it is misclassified by a certain number of base learners. Two versions of the Partitioning Filter are used: the Multiple-Partitioning Filter and the Iterative-Partitioning Filter. The number of instances removed by the filters is tuned by the filter's voting scheme and the number of iterations. The primary aim of this study is to compare the predictive performance of the final models built on the filtered and the unfiltered training datasets. A case study of software measurement data from a high-assurance software project shows that models built on the filtered fit datasets and evaluated on a noisy test dataset generally perform better than those built on the noisy (unfiltered) fit dataset. However, predictive performance under certain aggressive filters is affected by the presence of noise in the evaluation dataset.
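The small sketch below isolates the tuning knob described above — the voting scheme. Given each base learner's verdict on every instance, a consensus scheme removes an instance only when all learners misclassify it, while a majority scheme is more aggressive; the verdicts would come from learners induced on the data splits.

```python
import numpy as np

def partition_filter_decision(misclassified, scheme="majority"):
    """misclassified: boolean array (n_learners, n_instances)."""
    votes = misclassified.sum(axis=0)
    if scheme == "consensus":
        return votes == misclassified.shape[0]   # all learners must agree
    return votes > misclassified.shape[0] / 2    # simple majority
```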

19.
Bagging, boosting, rotation forest, and random subspace methods are well-known re-sampling ensemble methods that generate and combine a diversity of learners using the same learning algorithm for the base classifiers. Boosting and rotation forest are considered stronger than bagging and random subspace methods on noise-free data; however, there are strong empirical indications that bagging and random subspace methods are much more robust than boosting and rotation forest in noisy settings. For this reason, in this work we built an ensemble of bagging, boosting, rotation forest, and random subspace ensembles with 6 sub-classifiers each, and used a voting methodology for the final prediction. We compared it with simple bagging, boosting, rotation forest, and random subspace ensembles with 25 sub-classifiers, as well as other well-known combining methods, on standard benchmark datasets, and the proposed technique had better accuracy in most cases.
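The sketch below assembles this "ensemble of ensembles": several re-sampling ensembles with 6 sub-classifiers each, fused by majority voting. Rotation Forest has no scikit-learn implementation, so only three of the four member ensembles appear here; the parameters are illustrative.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier(random_state=0)
members = [
    ("bagging", BaggingClassifier(base, n_estimators=6, random_state=0)),
    ("boosting", AdaBoostClassifier(n_estimators=6, random_state=0)),
    # bootstrap=False + max_features < 1.0 yields a random-subspace ensemble.
    ("subspace", BaggingClassifier(base, n_estimators=6, bootstrap=False,
                                   max_features=0.5, random_state=0)),
]
combined = VotingClassifier(members, voting="hard")   # majority vote across ensembles

# Usage: combined.fit(X_train, y_train); combined.predict(X_test)
```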

20.
Most existing visual saliency analysis algorithms assume that the input image is clean, but this is not always the case. In this paper, we provide an extensive evaluation of visual saliency analysis algorithms on noisy images. We analyze their noise immunity by evaluating performance on noisy images with increasing noise scales and by studying the effects of applying different denoising methods before saliency analysis. We use 10 state-of-the-art saliency analysis algorithms and 7 typical image denoising methods on 4 eye-fixation datasets and 2 salient object detection datasets. Our experiments show that the performance of saliency analysis algorithms generally decreases as image noise scales increase; an exception is the nonlinear features (NF) integrated algorithm, which shows good noise immunity. We also find that image denoising methods can greatly improve the noise immunity of the algorithms. Our results show that the combination of NF and Median denoising works best on eye-fixation datasets, while the combination of saliency optimization (SO) and color block-matching and 3D filtering (C-BM3D) works best on salient object detection datasets. The combination of SO and Average denoising works best when time efficiency is a major concern, for both types of datasets.
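The sketch below mirrors the evaluation pipeline in miniature: add noise at increasing scales, optionally denoise (a median filter here), then run a saliency method and compare against the clean-image result. The "saliency" is a toy center-surround difference standing in for the 10 published algorithms; the image and scales are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def toy_saliency(img):
    # Center-surround: fine-scale blur minus coarse-scale blur.
    return np.abs(gaussian_filter(img, 1) - gaussian_filter(img, 8))

rng = np.random.default_rng(0)
img = rng.random((64, 64))                      # stand-in for a real image
ref = toy_saliency(img)                         # saliency of the clean image
for scale in (0.0, 0.1, 0.3):                   # increasing noise scales
    noisy = np.clip(img + rng.normal(0, scale, img.shape), 0, 1)
    s_raw = toy_saliency(noisy)                 # saliency straight on noisy input
    s_dn = toy_saliency(median_filter(noisy, size=3))  # denoise first
    print(scale, float(np.abs(s_raw - ref).mean()),
          float(np.abs(s_dn - ref).mean()))
```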

