It is well known that, to build a strong ensemble, the component learners should be both highly diverse and highly accurate. If perturbing the training set causes significant changes in the component learners constructed, then Bagging can effectively improve accuracy. However, for stable learners such as nearest neighbor classifiers, perturbing the training set can hardly produce diverse component learners, so Bagging does not work well. This paper adapts Bagging to nearest neighbor classifiers by injecting randomness into the distance metrics. In constructing the component learners, both the training set and the distance metric used to identify the neighbors are perturbed. A large-scale empirical study reported in this paper shows that the proposed BagInRand algorithm can effectively improve the accuracy of nearest neighbor classifiers.
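
The idea of perturbing both the training set and the distance metric can be sketched as follows. This is a minimal illustration, not the paper's BagInRand implementation: the abstract does not say how the metric is randomized, so the sketch assumes each component draws a Minkowski order `p` at random. The names `train_ensemble`, `predict`, and `orders` are hypothetical.

```python
import random
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def train_ensemble(X, y, n_members=11, orders=(1, 2, 3), seed=0):
    """Build components that each hold a bootstrap sample and a random metric.

    Assumption: randomness is injected by picking the Minkowski order p
    from `orders`; the abstract does not specify the actual mechanism.
    """
    rng = random.Random(seed)
    n = len(X)
    members = []
    for _ in range(n_members):
        # Perturb the training set: sample n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        # Perturb the distance metric: choose a random order p.
        p = rng.choice(orders)
        members.append(([X[i] for i in idx], [y[i] for i in idx], p))
    return members


def predict(members, query):
    """Majority vote over the 1-nearest-neighbor prediction of each component."""
    votes = []
    for Xb, yb, p in members:
        nearest = min(range(len(Xb)), key=lambda i: minkowski(Xb[i], query, p))
        votes.append(yb[nearest])
    return Counter(votes).most_common(1)[0][0]
```

With bootstrap sampling alone, the 1-NN components would rarely disagree; the random choice of `p` is what gives each component a different notion of "nearest".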

Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a base learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization.

In many domains, important events are not represented by the common scenario but by deviations from the rule. The importance and impact of these particular, outnumbered, deviant, and sometimes previously unseen events depend directly on the application domain (e.g., breast cancer detection, satellite image classification). The detection of these rare events, or outliers, has recently been gaining popularity, as evidenced by the wide variety of algorithms currently available. These algorithms rest on different assumptions about what constitutes an outlier, a characteristic that suggests integrating them in an ensemble to improve their individual detection rates. However, two factors limit the use of current ensemble outlier detection approaches: first, in most cases outliers are not detectable in the full-dimensional space but are instead located in specific subspaces of the data; and second, despite the expected improvement in detection rate achieved by an ensemble of detectors, the computational cost of the ensemble increases linearly with the number of components. In this article, we propose an ensemble approach that identifies outliers based on different subsets of features and subsamples of data, providing more robust results while improving on the computational efficiency of similar ensemble outlier detection approaches.

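The pellet sintering process in a grate-kiln is a typical thermal process characterized by nonlinearity, strong coupling, and large time delays, which makes an accurate and reliable mechanistic model very difficult to build. In addition, the simplifications and assumptions of such models often deviate from the actual process, so modeling pellet sintering by purely mechanistic methods has clear limitations. Given the complexity of the process and the limitations of a purely mechanistic model, a grey-box model is built on top of the mechanistic model using a neural network ensemble: BP neural networks serve as the individual networks of the ensemble, and Bagging is used to generate the sample sets on which the individual networks are trained. The results show that the hybrid model achieves higher accuracy and is the better model.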

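In an environment where network protocol flows are imbalanced, changes in the distribution of flow samples strongly affect the accuracy and stability of machine-learning-based traffic classifiers, so choosing a machine learning algorithm suited to online traffic classification under such imbalance is particularly important. To this end, a single-factor experimental design is first used to verify that, for three classification algorithms, the C4.5 decision tree, naive Bayes with kernel estimation (NBK), and the support vector machine (SVM), statistics of the first four packets of a TCP connection are sufficient to classify traffic. Next, the performance of the three algorithms is compared: the C4.5 decision tree has the shortest testing time, and SVM is the most stable. Bagging is then applied to traffic classification. Experimental results show that the Bagging classifier is about as stable as SVM, while its testing and model-building times are close to those of the C4.5 decision tree, making it better suited to online traffic classification.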

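A collaborative classification algorithm that uses both global and local information, C2M (Collaborative classification machine with local and global information), is proposed. The algorithm uses the covariance of each of the two classes as global orientation information to obtain two separating hyperplanes that carry both global and local information, and combines them with the averaging rule for classifier combination to obtain the final optimal decision hyperplane. The algorithm can be solved with two QP (Quadratic programming) problems; its time complexity is O(2N^3), far lower than the O(N^4) of M4 (Maxi-min margin machine), and with a linear kernel its classification accuracy is higher than that of the support vector machine (SVM), which uses only local information. It is proved theoretically that, when the classes overlap heavily, C2M can exploit global information more effectively than M4, and four criteria are proposed for judging whether global information contributes to classification. Comparative experiments against M4 and SVM on synthetic data and standard datasets demonstrate the effectiveness of the algorithm.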

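Starting from diversity, this work studies ensemble learning algorithms based on feature-set techniques (training sets are formed by selecting different feature sets according to some strategy) and on data techniques (different training sets are selected by sampling), and analyzes how each of the two kinds of ensemble algorithm generates diversity. For decision tree and neural network models, the performance of the ensemble algorithms is studied experimentally on standard datasets. The results show that ensemble performance depends on factors such as the characteristics of the dataset and the method used to generate diversity. In terms of overall performance, data-based ensemble algorithms outperform feature-set-based ensemble algorithms on most datasets.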

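The idea of categorizing Laplacians by the number of times a similarity weight is computed for each pair of neighboring points is introduced, and it is proved theoretically that a Laplacian construction involving multiple similarity-weight computations characterizes the local geometric structure of the data more finely than one that computes the similarity weight only once or twice. On this basis, a new Laplacian construction better suited to natural image matting is proposed. For any pair of neighboring pixels, the method builds a linear learning model in each of several different local neighborhoods to reconstruct different similarity weights. Combined with partial foreground and background label constraints supplied by the user, a semi-supervised quadratic optimization objective for matting is derived. When the colors of unknown pixels are estimated by sampling the foreground and background, the objective can be solved iteratively. More significantly, the iterative method successfully extends other previously constructed Laplacians to matting problems in which only sparse indicative strokes are provided. Theoretical analysis and experimental results both confirm that the constructed Laplacian expresses the intrinsic structure among image pixels more fully and constrains the propagation of foreground and background proportions, rather than just labels, in a finer way, yielding better matting results.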

In the class-imbalanced learning scenario, traditional machine learning algorithms that optimize overall accuracy tend to achieve poor classification performance, especially for the minority class, which is usually the class of most interest. Many effective approaches have been proposed to solve this problem. Among them, bagging ensemble methods integrated with under-sampling techniques have demonstrated better performance than several alternatives, including bagging ensemble methods integrated with over-sampling techniques and cost-sensitive methods. Although these under-sampling techniques promote diversity among the generated base classifiers through random partitioning or sampling of the majority class, they take no measure to ensure the individual classification performance, which limits the achievable ensemble performance. On the other hand, evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to search for the best majority-class subset for training a well-performing nearest neighbor classifier. Inspired by EUS, in this paper we introduce it into the under-sampling bagging framework and propose an EUS-based bagging ensemble method, EUS-Bag, by designing a new fitness function that considers three factors to make EUS better suited to the framework. With this fitness function, EUS-Bag can generate a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured using recall, geometric mean, and AUC all demonstrate its superior performance.

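This paper analyzes how the generalization error of the member neural networks in a neural network ensemble, and the diversity among them, affect the generalization error of the ensemble, and proposes a parallel-learning neural network ensemble method. For the member networks taking part in the ensemble, a parallel training method is given that satisfies not only the accuracy requirement on each member network but also the requirement that it differ from the other member networks. In addition, a method is given for determining the weights of the ensemble's member networks in parallel. Experimental results show that the proposed member-network training method and ensemble method can build effective neural network ensemble systems.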

This paper presents an efficient prediction model for a good learning environment using a Random Forest (RF) classifier. It consists of a series of modules: data preprocessing, data normalization, data splitting, and finally classification or prediction by the RF classifier. The preprocessed data are normalized using min-max normalization, which is often applied before model fitting. Because the input variables are measured on different scales, they must be normalized so that they contribute equally to the model fit. The RF classifier, an ensemble learning method, is then employed for course selection, and k-fold cross-validation (k = 10) is used to validate the model. The proposed Prediction Model for Course Selection (PMCS) system treats the task as a multi-class problem that predicts the course for a particular learner at one of three complexity levels: low, medium, and high. It operates in two modes, local and global: the former considers the gender of the learner and the latter does not. The database comprises learner opinions from 75 males and 75 females per category (low, medium, and high), so a total of 450 samples is used to evaluate the PMCS system. Results show that the local (gender-wise) system performs slightly better than the global system. The RF classifier with 75 decision trees provides an average accuracy of 97.6% in the global system, whereas in the local system it is 97% (male) and 97.6% (female). The overall performance of the RF classifier with 75 trees is better than with 25, 50, or 100 decision trees in both local and global systems.
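
The two preprocessing steps named above, min-max normalization and 10-fold cross-validation, can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, the folds are contiguous and unshuffled (an assumption; the abstract does not describe how folds are formed), and the Random Forest itself is left out.

```python
def min_max_normalize(rows):
    """Rescale every column of `rows` to the range [0, 1].

    Each value v becomes (v - min) / (max - min) for its column;
    a constant column is mapped to 0.0 to avoid dividing by zero.
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumption: no shuffling or stratification is applied here.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

For the 450 samples mentioned in the abstract, `k_fold_indices(450, 10)` gives ten folds of 45 test samples each, with the remaining 405 used for training.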

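With the rise of open-source software and the spread of software development support tools, a large amount of open software development activity data has accumulated on the Internet, and more and more practitioners and researchers are trying to derive from it insights for improving development efficiency and product quality. To make data analysis more efficient and to ease the reproduction and comparison of results, many works have proposed building and using shared datasets. However, the construction of existing software development activity datasets has poor traceability and a narrow range of applicability, and gives little consideration to how the data change over time and across environments. These shortcomings directly threaten data quality and the validity of analysis results. To address this problem, a layered, multi-version approach to building and using software development activity datasets is proposed. Layered means that the dataset includes the raw, intermediate, and final data obtained from collection and subsequent processing, which establishes the dataset's traceability and broadens its applicability. Multi-version means that the data are collected multiple times in multiple ways, so that data users can observe how the data change, creating the conditions for verifying and improving data quality and the validity of analysis results. The approach is demonstrated with a Mozilla issue-tracking dataset built using it, and it is verified that the approach helps data users use the data efficiently.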

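Ensemble learning is a machine learning technique in which multiple base classifiers are combined to make decisions jointly; training diverse base classifiers on different sample sets yields an ensemble classifier that can effectively improve learning performance. During the training of the base classifiers, imbalanced data can be handled through cost-sensitive techniques and data sampling. Because of the advantages of ensemble learning for classifying imbalanced data, ensemble classification algorithms for imbalanced data have been widely studied. This paper analyzes in detail the state of research on ensemble classification algorithms for imbalanced data, compares the differences among existing algorithms together with their respective strengths and problems, and identifies and analyzes issues that require further study.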

In today's digital world, millions of individuals are linked to one another via the Internet and social media, which opens up new avenues for exchanging information. Sentiment analysis (SA) has received a lot of attention during the last decade. In this study, we analyse the challenges of SA in Marathi, one of the Asian regional languages, by providing a benchmark setup in which we first produce an annotated dataset of Marathi text acquired from microblogging websites such as Twitter. Domain experts manually annotate the Marathi microblogging posts with positive, negative, and neutral polarity. In addition, to show how the annotated dataset can be used effectively, an ensemble-based model for sentiment analysis is created. Compared with other machine learning classifiers, the ensemble classifier achieves better performance under 10-fold cross-validation (cv), with an accuracy of 97.77% and an F-score of 97.89%.

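The k-nearest-neighbor learner maps a complex global nonlinear relationship to a combination of many local linear relationships; it is easy to interpret, easy to extend, and robust to noise, and has been widely and successfully applied in speaker recognition. Ensemble learning algorithms have attracted researchers in many fields because of their strong generalization ability and ease of application, but studies show that ensemble algorithms that create training-set differences through resampling do not effectively improve the generalization ability of k-nearest-neighbor learner systems. A new sampling algorithm, BagWithProb, is proposed for generating training sets. Experiments show that the algorithm effectively increases the differences among training sets and improves the performance of the ensemble system. In addition, an algorithm based on ring-region stratified sampling is proposed to speed up the k-nearest-neighbor recognition algorithm in the recognition stage.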

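Data stream classification is widely applied in practical domains such as sensor networks and network monitoring. However, imbalanced class distributions and large numbers of missing class labels in real data streams make the classification problem considerably harder. To address these two problems, an ensemble classification method based on distance and a sampling mechanism is proposed. The method first labels unlabeled examples as positive or negative by computing the distance from the unlabeled data to the center points of the labeled positive-class and negative-class data chunks; it then reorganizes the data stream chunks by oversampling positive-class examples and undersampling negative-class examples to balance the class distribution of each chunk, and builds an ensemble classification model on the rebalanced chunks. Experiments on simulated, incompletely labeled data streams with imbalanced class distributions show that, compared with classical algorithms of the same kind, the proposed method improves the classification accuracy on incompletely labeled data streams while reducing the influence of the imbalanced class distribution.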
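
The two chunk-level steps described above, labeling by distance to class centers and then rebalancing by sampling, might look roughly like this. It is a sketch under stated assumptions: Euclidean distance, the positive class as the minority, and a balanced target of half the chunk size per class are all choices made here for illustration, and every function name is hypothetical.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def label_by_nearest_centroid(unlabeled, positives, negatives):
    """Label each unlabeled example 1 (positive) or 0 (negative).

    Assumption: Euclidean distance to the two class centers decides the label.
    """
    cp, cn = centroid(positives), centroid(negatives)
    return [1 if math.dist(x, cp) <= math.dist(x, cn) else 0 for x in unlabeled]


def rebalance(positives, negatives, seed=0):
    """Oversample positives and undersample negatives toward equal sizes.

    Assumption: positives are the minority class, and each class is
    resampled to half of the combined chunk size.
    """
    rng = random.Random(seed)
    target = (len(positives) + len(negatives)) // 2
    # Oversampling draws with replacement, so minority examples may repeat.
    pos = [rng.choice(positives) for _ in range(target)]
    # Undersampling draws without replacement from the majority class.
    neg = rng.sample(negatives, min(target, len(negatives)))
    return pos, neg
```

A base classifier would then be trained on each rebalanced chunk and added to the ensemble; that step is omitted here because the abstract does not name the base learner.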

This letter presents a novel cooperative neural network ensemble learning method based on Negative Correlation learning. It enables easy integration of various network models and reduces communication bandwidth significantly for effective parallel speedup. Comparison with the best Negative Correlation learning method reported demonstrates comparable performance at significantly reduced communication overhead.

Electrochemical sensors such as ion-sensitive field-effect transistors (ISFETs) are electronic devices that merge solid-state electronic technology with chemical sensors so as to be sensitive to the concentration of a particular ion in a solution. However, as previously reported, their response depends not only on a single ion but is also affected by several interfering ions found in the solution to be measured. These interfering ions can be considered noise, and consequently a post-processing stage that increases the signal-to-noise ratio (SNR) is required. Our work shows how ensemble learning methods can be used with an array of chemical sensors to deal with this problem. In particular, we introduce a novel neural learning architecture for ISFET arrays that employs ISFET models as prior knowledge. The proposed ensemble learning systems are RBF-like solutions based on bagging and optimal linear combination. Several experimental results are included, which demonstrate the interest and viability of the proposed solution.

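Dynamic changes in the distribution of network traffic features give rise to concept drift, which degrades the accuracy of machine-learning-based network traffic classification models. Updating the classification model periodically is time-consuming and cannot guarantee the model's generalization ability. A divergence-based method for classifying network flows under concept drift, ensemble classification based on divergence detection (ECDD), is therefore proposed. Using a double-layer window mechanism and taking an information-entropy perspective, it measures the difference between the data distributions within sliding windows by the Jensen-Shannon divergence (JSD) of the traffic feature distributions, and thereby detects concept drift. Borrowing from incremental ensemble learning, when drift is detected a new classifier is trained on the new samples; the classifiers are then ranked by weight, the better-performing ones are retained, and their weighted, combined outputs are used to classify samples. Traffic from common network applications is captured, concept-drift datasets are built according to differences in the applications' feature distributions, and the method is compared experimentally with common concept drift detection methods. The results show that the method detects concept drift and updates the classifiers effectively, and exhibits good classification performance.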
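
The drift test at the core of this description, comparing the feature distributions of two windows by Jensen-Shannon divergence, can be sketched as follows. The sketch makes several assumptions not found in the abstract: a single numeric feature, fixed histogram bins given by `edges`, base-2 logarithms, and an arbitrary `threshold` of 0.1. The double-layer window logic and the classifier update are not shown, and the function names are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
    With base-2 logarithms the value lies in [0, 1].
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u_i == 0 contribute nothing; v_i > 0 whenever u_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def histogram(values, edges):
    """Relative frequencies of `values` over the bins defined by `edges`.

    Bins are half-open [edges[i], edges[i+1]) except the last, which also
    includes its right edge. Values outside the range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the histogram range")
    return [c / total for c in counts]


def drift_detected(reference, current, edges, threshold=0.1):
    """Flag drift when the JSD between the two windows exceeds `threshold`.

    Assumption: `threshold` is an illustrative value, not one from the paper.
    """
    return js_divergence(histogram(reference, edges), histogram(current, edges)) > threshold
```

Unlike the Kullback-Leibler divergence, JSD is symmetric and stays finite when a bin is empty in one window, which makes it convenient for comparing sliding windows.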

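One of the main challenges in judicial artificial intelligence is identifying the key elements of a case. Existing methods treat case elements merely as a named entity recognition task, so most of the information they identify is irrelevant. They also fail to make effective use of the global information of the text and the local information of the vocabulary, which leads to poor recognition of element boundaries. To address these problems, a method for recognizing key case elements that fuses global and local information is proposed. The method first uses the BERT model as a shared input layer for judicial text to extract text features. On top of the shared layer, three subtasks are then set up for joint learning: judicial case-element recognition, judicial text classification (global information), and judicial Chinese word segmentation (local information). Finally, the method is evaluated on two public datasets. The results show that its F1 scores exceed those of existing state-of-the-art methods, and that it improves the accuracy of element entity classification and reduces boundary recognition errors.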

