Synthetic oversampling method for boundary minority samples based on Tomek links

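Cite this article:	Tao Jiaqing, He Zuowei, Leng Qiangkui, Zhai Junchang, Meng Xiangfu. Synthetic oversampling method for boundary minority samples based on Tomek links[J]. Application Research of Computers, 2023, 40(2).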

Authors:	Tao Jiaqing, He Zuowei, Leng Qiangkui, Zhai Junchang, Meng Xiangfu

Affiliations:	Bohai University, Bohai University, Liaoning Technical University, Bohai University, Liaoning Technical University

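Funding:	National Natural Science Foundation of China (61602056, 61772249); Natural Science Foundation of Liaoning Province (2019-ZD-0493); Scientific Research Project of the Education Department of Liaoning Province (LQ2019012)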

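Abstract:	In class-imbalanced datasets, samples near the class boundary are more easily misclassified, so accurately identifying boundary samples matters for classification. Existing methods usually identify boundary samples with K-nearest neighbors, and their accuracy leaves room for improvement. To address this problem, a synthetic oversampling method for boundary minority samples based on Tomek links is proposed. First, samples from different classes that are each other's nearest neighbor are computed to form Tomek links. Then, the minority samples located at the inter-class boundary are identified from the Tomek links. Next, the linear interpolation mechanism of the synthetic minority oversampling technique (SMOTE) is used to oversample between the boundary samples and their minority-class neighbors, which finally balances the dataset. The experiments compared eight sampling methods, and the results show that the proposed method achieves higher G-mean and F1 values on most datasets.
Keywords:	imbalanced data classification; synthetic oversampling; boundary samples; K-nearest neighbors; Tomek links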
Received:	2022/7/4
Revised:	2023/1/13
Synthetic oversampling method for boundary minority samples based on Tomek links

Tao Jiaqing, He Zuowei, Leng Qiangkui, Zhai Junchang and Meng Xiangfu. Synthetic oversampling method for boundary minority samples based on Tomek links[J]. Application Research of Computers, 2023, 40(2).

Authors:	Tao Jiaqing, He Zuowei, Leng Qiangkui, Zhai Junchang and Meng Xiangfu

Affiliations:	Bohai University, Bohai University, Liaoning Technical University, Bohai University, Liaoning Technical University

Abstract:	In a class-imbalanced dataset, samples close to the class boundary are more likely to be misclassified, so accurately identifying boundary samples is important for classification. Existing methods usually use K-nearest neighbors to identify boundary samples, but their accuracy needs to be improved. To address this problem, this paper proposes a synthetic oversampling method for boundary minority samples based on Tomek links. The method first finds samples from different classes that are each other's nearest neighbor to form Tomek links. Then, it identifies the minority samples located at the inter-class boundary according to the Tomek links. Next, it uses the linear interpolation mechanism of the synthetic minority oversampling technique (SMOTE) to oversample between the boundary samples and their minority neighbors, thereby balancing the dataset. A comparison with eight sampling algorithms shows that the proposed method obtains higher G-mean and F1 values on most of the datasets.

Keywords:	classification of imbalanced data; synthetic oversampling; boundary samples; K-nearest neighbors; Tomek links
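
The three steps named in the abstract (form Tomek links, mark the minority samples on those links as boundary samples, interpolate between each boundary sample and its minority neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the classical Tomek-link definition (two samples of different classes that are mutual nearest neighbors over the whole dataset), a binary problem, and Euclidean distance. The function names, the neighbor count `k`, the `seed` parameter, and the toy data are all hypothetical.

```python
import math
import random


def nearest_index(points, i):
    """Index of the sample closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_index(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def boundary_minority(points, labels, minority):
    """Indices of minority samples that take part in at least one Tomek link."""
    members = {k for pair in tomek_links(points, labels) for k in pair}
    return sorted(k for k in members if labels[k] == minority)


def oversample(points, labels, minority, k=3, seed=0):
    """Synthetic minority samples that bring a binary dataset to equal class sizes."""
    rng = random.Random(seed)
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    # Assumption: two classes, so every non-minority sample is a majority sample.
    n_new = (len(labels) - len(min_idx)) - len(min_idx)
    seeds = boundary_minority(points, labels, minority)
    if not seeds or len(min_idx) < 2 or n_new <= 0:
        return []
    synthetic = []
    for _ in range(n_new):
        s = rng.choice(seeds)
        # The (at most) k nearest minority neighbors of the chosen boundary sample.
        neigh = sorted((j for j in min_idx if j != s),
                       key=lambda j: math.dist(points[s], points[j]))[:k]
        t = rng.choice(neigh)
        gap = rng.random()  # SMOTE-style interpolation weight in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(points[s], points[t])))
    return synthetic


# Toy data: six majority samples (label 0) and three minority samples (label 1).
points = [(0.0, 0.0), (0.1, 0.3), (0.4, 0.1), (0.3, 0.4), (0.5, 0.6),
          (1.0, 1.0), (1.2, 1.0), (2.0, 2.0), (2.2, 2.1)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]

print(tomek_links(points, labels))           # [(5, 6)]
print(boundary_minority(points, labels, 1))  # [6]
print(oversample(points, labels, minority=1))
```

In the toy data only samples 5 and 6 are mutual nearest neighbors from different classes, so sample 6 is the single boundary minority sample and every synthetic point lies on a segment from it to another minority sample. The brute-force neighbor search is quadratic in the number of samples, which is acceptable for an illustration but would normally be replaced by a spatial index on larger datasets.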
