
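Boundary Hybrid Sampling Based on Heterogeneous k-Distance in Imbalanced Data (不平衡数据中基于异类k距离的边界混合采样)
Cite this article: Yu Yanli, Jiang Kaizhong, Sheng Jingwen. Boundary hybrid sampling based on heterogeneous k-distance in imbalanced data [J]. Computer Applications and Software (计算机应用与软件), 2021, 38(2): 299-304, 310.
Authors: Yu Yanli (于艳丽)  Jiang Kaizhong (江开忠)  Sheng Jingwen (盛静文)
Affiliation: School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai 201620 (all three authors)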
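Abstract: In imbalanced data sets, samples contribute differently to the decision boundary depending on where they lie in the distribution, yet traditional sampling methods do not treat samples differently according to their position. To address this, a boundary hybrid sampling algorithm based on heterogeneous k-distance (BHSK) is proposed for imbalanced data. The boundary set is first identified using the heterogeneous k-distance. Boundary minority-class samples are then subdivided into three groups according to their support, and each group is oversampled with its own oversampling method and oversampling rate, so that oversampling reflects the differing importance of minority-class samples and generates more informative sample points. Finally, some non-boundary majority-class samples are deleted according to the heterogeneous k-distance. Experimental results show that, with a minimum-distance classifier, the algorithm improves minority-class recognition performance by 1%~11% over several common sampling algorithms, which verifies its effectiveness.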

Keywords: Boundary points  Imbalanced data  Classification  Hybrid sampling  Heterogeneous k-distance

BOUNDARY MIXED SAMPLING BASED ON HETEROGENEOUS K DISTANCE IN UNBALANCED DATA
Yu Yanli, Jiang Kaizhong, Sheng Jingwen. BOUNDARY MIXED SAMPLING BASED ON HETEROGENEOUS K DISTANCE IN UNBALANCED DATA [J]. Computer Applications and Software, 2021, 38(2): 299-304, 310.
Authors: Yu Yanli  Jiang Kaizhong  Sheng Jingwen
Affiliation: School of Mathematics, Physics & Statistics, Shanghai University of Engineering Science, Shanghai 201620, China
Abstract: In an unbalanced data set, samples differ in their influence on the decision boundary according to their position in the distribution, but traditional sampling methods do not sample differently based on sample position. To this end, a boundary hybrid sampling algorithm based on heterogeneous k-distance (BHSK) is proposed for unbalanced data. It identifies the boundary set by heterogeneous k-distance and subdivides the boundary minority samples into three categories according to their support degree. Different oversampling methods and oversampling rates are applied to these categories, so that oversampling follows the differing importance of the minority-class samples and generates more informative sample points. Some non-boundary majority-class samples are then deleted according to the heterogeneous k-distance. Experiments show that, under the minimum distance classification method, the minority-class recognition performance of BHSK is 1%~11% higher than that of several common sampling algorithms, which demonstrates the effectiveness of the algorithm.
Keywords:Boundary points  Imbalanced data set  Classification  Mixed sampling  Different k distance
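
The abstract does not give the formula for the heterogeneous k-distance or the rule that turns it into a boundary set, so the following is only a minimal sketch under stated assumptions. It assumes that the heterogeneous k-distance of a sample is the Euclidean distance to its k-th nearest sample of a different class, and that a sample belongs to the boundary set when this distance falls in the lowest fraction of its own class. The function names `hetero_k_distance` and `boundary_mask` and the `quantile` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def hetero_k_distance(X, y, k=3):
    """Distance from each sample to its k-th nearest sample of another class.

    Assumed definition for illustration only; the paper's exact formula
    is not stated in the abstract.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dist = np.empty(len(X))
    for label in np.unique(y):
        same = y == label
        other = X[~same]
        if len(other) < k:
            raise ValueError("k exceeds the number of other-class samples")
        # Pairwise Euclidean distances from this class to all other-class samples.
        d = np.linalg.norm(X[same][:, None, :] - other[None, :, :], axis=2)
        # k-th smallest distance in each row (index k-1 after partial sorting).
        dist[same] = np.partition(d, k - 1, axis=1)[:, k - 1]
    return dist

def boundary_mask(X, y, k=3, quantile=0.3):
    """Flag samples whose heterogeneous k-distance is small within their class.

    The per-class quantile threshold is an assumption, not the paper's rule.
    """
    y = np.asarray(y)
    d = hetero_k_distance(X, y, k)
    mask = np.zeros(len(y), dtype=bool)
    for label in np.unique(y):
        same = y == label
        mask[same] = d[same] <= np.quantile(d[same], quantile)
    return mask
```

Under these assumptions, minority-class samples with `boundary_mask` set would be the candidates for the three-way oversampling step, and majority-class samples with the mask unset would be the candidates for deletion. The support-based subdivision and the oversampling rates are not sketched here because the abstract gives no details on them.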
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).