
Research on Discretization Methods for Continuous Attributes in Text Categorization

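Cite this article:	DONG Le-hong, GENG Guo-hua, ZHOU Ming-quan. Research on Discretization Methods for Continuous Attributes in Text Categorization [J]. Mini-micro Systems, 2009, 30(11).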

Authors:	DONG Le-hong, GENG Guo-hua, ZHOU Ming-quan

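Affiliations:	1. School of Information Science and Technology, Northwest University, Xi'an, Shaanxi 710069; 2. School of Information Science and Technology, Northwest University, Xi'an, Shaanxi 710069; School of Information Science and Technology, Beijing Normal University, Beijing 100875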

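Funding:	Key Program of the National Natural Science Foundation of China; Natural Science Special Project of the Shaanxi Provincial Department of Education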

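Abstract:	Some classification algorithms in machine learning cannot handle continuous attributes. To address this, the paper proposes a multi-interval discretization method for continuous attributes that combines term presence with information gain. The algorithm defines a discretization procedure that discretizes the non-binary term space generated by traditional information retrieval weighting techniques. It then determines whether each term in the original feature space belongs to a given subinterval, converting the problem into a binary representation so that these classification algorithms can be applied to continuous attribute values. Experimental results show that the discretization process is simple and efficient, with high prediction accuracy and good interpretability.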
Keywords:	machine learning; text categorization; information gain; continuous attribute discretization; Boosting algorithm
Design of Auto Text Categorization Classifier Based on Boosting Algorithm

DONG Le-hong, GENG Guo-hua, ZHOU Ming-quan. Design of Auto Text Categorization Classifier Based on Boosting Algorithm [J]. Mini-micro Systems, 2009, 30(11).

Authors:	DONG Le-hong, GENG Guo-hua, ZHOU Ming-quan

Abstract:	To address the problem that some effective machine learning algorithms cannot deal with continuous attributes, the paper puts forward a multi-interval discretization method based on term presence and information gain. The method defines a discretization procedure that discretizes the non-binary term space produced by classical information retrieval weighting techniques. The problem is then transformed into a binary representation by judging whether each original term belongs to a given subinterval, which makes these algorithms applicable to continuous attributes. The evaluation results show that the method is simple, efficient, accurate, and easy to understand.

Keywords:	machine learning; text categorization; information gain; continuous attribute discretization; boosting algorithm
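
The abstract does not give the details of the discretization procedure, so the following Python sketch only illustrates the general idea under stated assumptions: cut points for one term's weights are chosen by a generic recursive information-gain split, a zero weight is treated as "term absent", and each document's weight is then replaced by one binary indicator per subinterval. The function names (`entropy`, `best_cut`, `find_cuts`, `binarize`) and the stopping rules `min_gain` and `max_depth` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_cut(values, labels):
    """Return (gain, cut) for the boundary with the highest information gain.

    Returns None when all values are equal and no boundary exists.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal weights
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - split
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
        if best is None or gain > best[0]:
            best = (gain, cut)
    return best


def find_cuts(values, labels, min_gain=0.1, max_depth=2):
    """Recursively split into several intervals; returns sorted cut points.

    min_gain and max_depth are assumed stopping rules, not the paper's.
    """
    if max_depth == 0 or len(set(labels)) < 2:
        return []
    found = best_cut(values, labels)
    if found is None or found[0] < min_gain:
        return []
    cut = found[1]
    low = [(v, y) for v, y in zip(values, labels) if v <= cut]
    high = [(v, y) for v, y in zip(values, labels) if v > cut]
    cuts = find_cuts([v for v, _ in low], [y for _, y in low], min_gain, max_depth - 1)
    cuts.append(cut)
    cuts += find_cuts([v for v, _ in high], [y for _, y in high], min_gain, max_depth - 1)
    return cuts


def binarize(weight, cuts):
    """One 0/1 indicator per subinterval; an absent term (weight 0) sets none."""
    if weight == 0:
        return [0] * (len(cuts) + 1)
    bounds = [0.0] + cuts + [math.inf]
    return [int(lo < weight <= hi) for lo, hi in zip(bounds, bounds[1:])]


# Toy weights of a single term across seven documents (made-up values).
weights = [0.0, 0.125, 0.25, 2.0, 3.0, 8.0, 16.0]
labels = ["a", "a", "a", "b", "b", "a", "a"]
present = [(w, y) for w, y in zip(weights, labels) if w > 0]
cuts = find_cuts([w for w, _ in present], [y for _, y in present])
```

On this toy data the sketch finds two cut points, so every weight of the term becomes three binary features, which is the kind of input a classifier restricted to binary attributes can consume.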
This article is indexed in Wanfang Data and other databases.
