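Synthetic oversampling method for boundary minority samples based on Tomek links
Cite this article: Tao Jiaqing, He Zuowei, Leng Qiangkui, Zhai Junchang, Meng Xiangfu. Synthetic oversampling method for boundary minority samples based on Tomek links[J]. Application Research of Computers, 2023, 40(2).
Authors: Tao Jiaqing  He Zuowei  Leng Qiangkui  Zhai Junchang  Meng Xiangfu
Affiliations: Bohai University, Bohai University, Liaoning Technical University, Bohai University, Liaoning Technical University
Funding: National Natural Science Foundation of China (61602056, 61772249); Natural Science Foundation of Liaoning Province (2019-ZD-0493); Scientific Research Project of the Education Department of Liaoning Province (LQ2019012)
Abstract: In class-imbalanced datasets, samples near the class boundary are more likely to be misclassified, so accurately identifying boundary samples is important for classification. Existing methods usually use K-nearest neighbors to mark boundary samples, and their accuracy leaves room for improvement. To address this problem, a synthetic oversampling method for boundary minority samples based on Tomek links is proposed. First, pairs of samples from different classes that are each other's nearest neighbor are computed to form Tomek links. Then, the minority samples located at the inter-class boundary are identified from the Tomek links. Next, the linear interpolation mechanism of the synthetic minority oversampling technique (SMOTE) is used to oversample between the boundary samples and their minority-class neighbors, which finally balances the dataset. The experiments compared eight sampling methods, and the results show that the proposed method achieved higher G-mean and F1 values on most of the datasets.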

Keywords: imbalanced data classification  synthetic oversampling  boundary samples  K-nearest neighbors  Tomek links
Received: 2022/7/4
Revised: 2023/1/13
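
The three steps named in the abstract (form Tomek links, pick the minority samples that take part in them, interpolate toward minority neighbors) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes a binary problem, Euclidean distance, the standard Tomek-link definition (two samples of opposite classes that are each other's nearest neighbor), and a hypothetical neighbor count `k=5`; the function names are invented for this example.

```python
import numpy as np


def tomek_link_boundary_minority(X, y, minority_label):
    """Return indices of minority samples that take part in a Tomek link."""
    # Pairwise Euclidean distances; the diagonal is masked so that a
    # sample is never reported as its own nearest neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)              # nearest neighbor of every sample
    idx = np.arange(len(X))
    mutual = nn[nn] == idx             # i and nn[i] are mutually nearest
    opposite = y[nn] != y              # and they carry different labels
    return idx[mutual & opposite & (y == minority_label)]


def oversample_boundary(X, y, minority_label, k=5, rng=None):
    """Balance a binary dataset by interpolating from boundary minority samples.

    Assumption: enough synthetic samples are generated to match the
    majority-class count; the source only says the dataset ends up balanced.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = np.flatnonzero(y == minority_label)
    boundary = tomek_link_boundary_minority(X, y, minority_label)
    n_new = int((y != minority_label).sum() - minority.size)
    if boundary.size == 0 or minority.size < 2 or n_new <= 0:
        return X, y                    # nothing to interpolate from
    k = min(k, minority.size - 1)
    synthetic = np.empty((n_new, X.shape[1]))
    for s in range(n_new):
        seed = rng.choice(boundary)
        others = minority[minority != seed]
        dist = np.linalg.norm(X[others] - X[seed], axis=1)
        neighbors = others[np.argsort(dist)[:k]]   # k nearest minority samples
        mate = rng.choice(neighbors)
        gap = rng.random()             # SMOTE-style factor in [0, 1)
        synthetic[s] = X[seed] + gap * (X[mate] - X[seed])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_out, y_out
```

Because every synthetic point is a convex combination of a boundary sample and one of its minority neighbors, new samples stay on line segments inside the minority class while being concentrated near the class boundary.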

Synthetic oversampling method for boundary minority samples based on Tomek links
Tao Jiaqing, He Zuowei, Leng Qiangkui, Zhai Junchang, Meng Xiangfu. Synthetic oversampling method for boundary minority samples based on Tomek links[J]. Application Research of Computers, 2023, 40(2).
Authors: Tao Jiaqing  He Zuowei  Leng Qiangkui  Zhai Junchang  Meng Xiangfu
Affiliations: Bohai University, Bohai University, Liaoning Technical University, Bohai University, Liaoning Technical University
Abstract: In a class-imbalanced dataset, samples close to the class boundary are more likely to be misclassified, so accurately identifying boundary samples is important for classification. Existing methods usually use K-nearest neighbors to identify boundary samples, but their accuracy needs to be improved. To address this problem, this paper proposes a synthetic oversampling method for boundary minority samples based on Tomek links. The method first finds pairs of samples from different classes that are each other's nearest neighbor, which form Tomek links. Then, it identifies the minority samples located at the inter-class boundary according to the Tomek links. Next, it uses the linear interpolation mechanism of the synthetic minority oversampling technique (SMOTE) to oversample between the boundary samples and their minority neighbors, thereby balancing the dataset. A comparison with eight sampling algorithms shows that the proposed method obtains higher G-mean and F1 values on most of the datasets.
Keywords: classification of imbalanced data  synthetic oversampling  boundary samples  K-nearest neighbors  Tomek links
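
The abstract reports G-mean and F1 without defining them. The sketch below uses the usual binary-classification definitions, which are an assumption here since the page does not state them: G-mean is the geometric mean of recall on the positive class and recall on the negative class, and F1 is the harmonic mean of precision and recall. The minority class is treated as positive, and the function name is hypothetical.

```python
import math


def gmean_and_f1(y_true, y_pred, positive=1):
    """Compute G-mean and F1 for a binary problem (minority = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Guard every ratio so an empty class yields 0.0 rather than an error.
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0

    g_mean = math.sqrt(recall * specificity)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return g_mean, f1
```

Both metrics drop to zero when a classifier ignores the minority class entirely, which is why they are preferred over plain accuracy on imbalanced data.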