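Study on a Set-Valued Data Balancing Method Incorporating the Semi-Monolayer Covering Rough Set
Cite this article: WU Zhengjiang, YANG Tian, ZHENG Ailing, MEI Qiuyu, ZHANG Yaning. Study on a set-valued data balancing method incorporating the semi-monolayer covering rough set[J]. 计算机工程与应用 (Computer Engineering and Applications), 2022, 58(19): 166-173.
Authors: WU Zhengjiang  YANG Tian  ZHENG Ailing  MEI Qiuyu  ZHANG Yaning
Affiliation: School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454003
Abstract: Imbalanced data now arise in every area of life, and how to classify them effectively has become a research hotspot. Traditional over-sampling and under-sampling methods can balance the data, but they cannot overcome the effects that data distribution and noise have on classification. To reduce the influence of data distribution and noise on the classification of imbalanced data in set-valued information systems, a model that combines over-sampling and under-sampling on the basis of the semi-monolayer covering rough set is proposed. The DA0] and DE0] lower approximations of the semi-monolayer covering rough set divide the data into two main parts: the part belonging to the lower approximation set is over-sampled with BorderlineSMOTE, the part not belonging to the lower approximation set is under-sampled with ClusterCentroids, and the two results are merged to form the final data set. The semi-monolayer covering rough set is a model for set-valued information systems that offers high approximation quality and fast computation; the high approximation quality lets it retain as much reliable data as possible, which supports the generalization ability of the model. The hybrid treatment not only reduces the impact of noisy data on BorderlineSMOTE but also, through ClusterCentroids, preserves to a great extent the information carried by the filtered-out data. Comparative experiments with ExtraTree, DecisionTree, FGCNN and other methods verify the effectiveness of the model.

Keywords: semi-monolayer covering rough set  imbalanced data  approximation set  hybrid processing  over-sampling  under-sampling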

Study on Set-Valued Data Balancing Method by Semi-Monolayer Covering Rough Set
WU Zhengjiang, YANG Tian, ZHENG Ailing, MEI Qiuyu, ZHANG Yaning. Study on Set-Valued Data Balancing Method by Semi-Monolayer Covering Rough Set[J]. Computer Engineering and Applications, 2022, 58(19): 166-173.
Authors: WU Zhengjiang  YANG Tian  ZHENG Ailing  MEI Qiuyu  ZHANG Yaning
Affiliation: School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454003, China
Abstract: Imbalanced data exist in all areas of life, and how to classify them effectively has become a hot research topic. Traditional over-sampling and under-sampling methods can balance the data, but they cannot overcome the effects that data distribution and noise have on classification. To reduce the influence of data distribution and noise on the classification of imbalanced data in set-valued information systems, a new method that combines over-sampling and under-sampling based on the semi-monolayer covering rough set is proposed. The data are divided into two main parts by the DA0] and DE0] lower approximations of the semi-monolayer covering rough set: the part belonging to the lower approximation set is over-sampled by BorderlineSMOTE, the part not belonging to the lower approximation set is under-sampled by ClusterCentroids, and the two are then combined to form the final data set. The semi-monolayer covering rough set is a fast-to-compute model with high approximation quality that is suitable for set-valued information systems. The high approximation quality allows it to retain as much reliable data as possible, which helps ensure the generalization capability of the model. The hybrid approach not only reduces the impact of noisy data on BorderlineSMOTE but also, through ClusterCentroids, largely preserves the information carried by the filtered-out data. Finally, the effectiveness of the model is verified through comparative experiments using ExtraTree, DecisionTree and FGCNN.
Keywords: semi-monolayer covering rough set  imbalanced data  approximation set  hybrid approach  over-sampling  under-sampling
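
The partition step described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the semi-monolayer covering model and its lower approximations are not reproduced here. The sketch swaps in the classical tolerance relation for set-valued data (two objects are tolerant when their value sets intersect on every attribute) and the lower approximation it induces. The toy data, the attribute names, and every function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# One object of a set-valued information system: attribute -> set of values.
SetValuedObject = Dict[str, Set[str]]


def tolerant(x: SetValuedObject, y: SetValuedObject) -> bool:
    """True when x and y share at least one value on every attribute."""
    return all(x[a] & y[a] for a in x)


def tolerance_class(i: int, objects: List[SetValuedObject]) -> Set[int]:
    """Indices of all objects tolerant with object i (including i itself)."""
    return {j for j, other in enumerate(objects) if tolerant(objects[i], other)}


def lower_approximation(objects: List[SetValuedObject],
                        labels: List[str], target: str) -> Set[int]:
    """Objects whose whole tolerance class carries the target label."""
    target_set = {i for i, lab in enumerate(labels) if lab == target}
    return {i for i in range(len(objects))
            if tolerance_class(i, objects) <= target_set}


def split_by_lower_approximation(objects: List[SetValuedObject],
                                 labels: List[str]) -> Tuple[List[int], List[int]]:
    """Split indices into (inside some class's lower approximation, the rest)."""
    inside: Set[int] = set()
    for target in set(labels):
        inside |= lower_approximation(objects, labels, target)
    rest = set(range(len(objects))) - inside
    return sorted(inside), sorted(rest)


# Hypothetical toy data: two minority objects and three majority objects.
objects = [
    {"lang": {"en"}, "sport": {"run"}},
    {"lang": {"en", "fr"}, "sport": {"run", "swim"}},
    {"lang": {"de"}, "sport": {"ski"}},
    {"lang": {"de", "fr"}, "sport": {"ski", "swim"}},
    {"lang": {"de"}, "sport": {"ski", "golf"}},
]
labels = ["min", "min", "maj", "maj", "maj"]

inside, rest = split_by_lower_approximation(objects, labels)
print(inside, rest)  # [0, 2, 4] [1, 3]
```

Objects 1 and 3 overlap with objects of the other class, so they fall outside both lower approximations. In the pipeline the abstract describes, the first index list would then be over-sampled and the second under-sampled. One way to do that, assuming the set values are first encoded as numeric features, would be the `BorderlineSMOTE` and `ClusterCentroids` classes of the imbalanced-learn library.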