
Learning sample extraction method based on convex boundary
Citation: GU Yiyi, TAN Xuntao, YUAN Yubo. Learning sample extraction method based on convex boundary[J]. Journal of Computer Applications, 2019, 39(8): 2281-2287.
Authors: GU Yiyi, TAN Xuntao, YUAN Yubo
Affiliation: School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
Funding: Zhejiang Provincial Key Research and Development Program (2019C03004).
Abstract: The quality and quantity of learning samples are crucial for intelligent data classification systems, yet there is no general, effective method for finding meaningful samples in such systems. Motivated by this, the concept of the convex boundary of a dataset was proposed, and a fast method for discovering a meaningful sample set was given. Firstly, abnormal and incomplete samples in the learning sample set were cleaned with the box-plot function. Secondly, the concept of the data cone was proposed, and the normalized learning samples were partitioned into cones. Finally, each cone-shaped sample subset was centered, and the samples whose distances to the convex boundary differ only minimally were extracted to form the convex boundary sample set. Experiments were carried out on 12 UCI datasets with six classical data classification algorithms: Gaussian Naive Bayes (GNB), Classification And Regression Tree (CART), Linear Discriminant Analysis (LDA), Adaptive Boosting (AdaBoost), Random Forest (RF) and Logistic Regression (LR). The results show that the convex boundary sample set significantly shortens the training time of every algorithm while maintaining classification performance. In particular, for datasets containing a large amount of noisy data, such as the Caesarian Section, Electrical Grid Stability and Car Evaluation datasets, the convex boundary sample set can improve classification performance. To evaluate the efficiency of the convex boundary sample set more objectively, the sample cleaning efficiency was defined as the ratio of the sample-size change rate to the classification-performance change rate; a cleaning efficiency greater than 1 indicates that the method is effective, and the higher the value, the better the effect of using convex boundary samples as learning samples. On the pulsar dataset HTRU2, the cleaning efficiency of the proposed method for the GNB algorithm exceeds 68, demonstrating its strong performance.

Keywords: machine learning; data classification; sample selection; convex cone; boundary sample
Received: 2019-01-22
Revised: 2019-03-30
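The sample cleaning efficiency described in the abstract is the ratio of the sample-size change rate to the classification-performance change rate, with values above 1 indicating an effective reduction. A minimal sketch is given below; the use of absolute change rates and the guard against a zero performance change are assumptions, and the numbers in the usage example are hypothetical, not results from the paper.

```python
def cleaning_efficiency(n_original, n_cleaned, perf_original, perf_cleaned):
    """Ratio of the sample-size change rate to the classification-performance
    change rate, as described in the abstract. The epsilon guard for an
    unchanged performance is an assumption of this sketch."""
    sample_change = abs(n_original - n_cleaned) / n_original
    perf_change = abs(perf_original - perf_cleaned) / perf_original
    return sample_change / max(perf_change, 1e-12)

# Hypothetical illustration only: a 70% reduction in samples with a ~1%
# change in accuracy yields an efficiency of roughly 63 (> 1, i.e. effective).
print(cleaning_efficiency(n_original=10000, n_cleaned=3000,
                          perf_original=0.90, perf_cleaned=0.91))
```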
