
Conditional-probability zone transformation (CZT) coding method for non-numerical features

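Cite this article:	He Liang, Xu Zhengguo, Li Yun, Shen Chao. Conditional-probability zone transformation (CZT) coding method for non-numerical features[J]. Application Research of Computers, 2020, 37(5): 1400-1405.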

Authors:	He Liang, Xu Zhengguo, Li Yun, Shen Chao

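Affiliations:	National Key Laboratory of Science and Technology on Blind Signal Processing, Chengdu 610041; Ministry of Education Key Laboratory for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049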

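Abstract:	Non-numerical (categorical) features appear frequently in data, and encoding them effectively is key to solving problems with machine learning models. The widely used one-hot coding method produces highly sparse codes whose values still carry no clear physical meaning. To address these problems, this paper proposes CZT (conditional-probability-based zone transformation coding), a zone-partitioning coding algorithm based on conditional probability. The method first computes conditional probabilities for the features, partitions the features into zones according to those probabilities, and encodes each zone by its joint conditional probability. CZT coding is then compared analytically with one-hot coding: it is derived and proved theoretically that the compression ratio of CZT coding is at least the average size of the features' value spaces, and that the problem obtained after CZT coding has a simpler optimization objective, which benefits the design of subsequent machine learning algorithms. Finally, neural networks of identical structure are used as classifiers on the Titanic dataset to compare how CZT-coded and one-hot-coded data affect classifier performance. The results show that CZT-coded data improves both classification accuracy and stability.
Keywords:	deep learning; non-numerical features; feature engineering; joint conditional probability; coding
Received:	2018/10/15
Revised:	2020/4/27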
Conditional-probability zone transformation coding for categorical features

He Liang, Xu Zhengguo, Li Yun and Shen Chao. Conditional-probability zone transformation coding for categorical features[J]. Application Research of Computers, 2020, 37(5): 1400-1405.

Authors:	He Liang, Xu Zhengguo, Li Yun and Shen Chao

Affiliation:	National Key Laboratory of Science and Technology on Blind Signal Processing

Abstract:	Categorical features are common in datasets, and coding them is a key issue in solving problems efficiently with machine learning models. One-hot coding is a widely accepted method for converting such features into numerical values, but it produces a sparse feature space, and the coded values carry no clear meaning. To improve coding performance, this paper designs a novel coding method, called CZT coding, which is based on conditional probability after dividing the features into zones. CZT coding calculates the conditional probability of each feature, then divides the features into several zones, and finally codes the features in each zone. The paper proves mathematically that, compared with the state-of-the-art one-hot coding method, CZT coding reduces the code length by a factor of at least the mean size of the feature spaces, and that after CZT coding the problem becomes an easier one for the subsequent machine learning model. Finally, using the same neural network as the classifier, the performance of CZT coding and one-hot coding is compared on the Titanic dataset; the results show that CZT coding makes the classifier perform better in both accuracy and stability.

Keywords:	deep learning; categorical features; feature engineering; conditional probability
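
The abstract describes the pipeline only in outline (conditional probabilities, zones of features, one code per zone), so the following Python sketch is an illustration under stated assumptions, not the authors' algorithm. It assumes a binary label, takes the empirical frequency of the positive label as the conditional probability, and uses hand-picked zones, because the abstract does not say how zones are chosen. All function names (`one_hot_width`, `fit_zone_codes`, `encode`) are hypothetical, and the toy rows are invented for the example; they are not taken from the Titanic dataset.

```python
from collections import defaultdict


def one_hot_width(rows, feature_names):
    """Columns one-hot coding would need: one per distinct value of each feature."""
    return sum(len({row[f] for row in rows}) for f in feature_names)


def fit_zone_codes(rows, labels, zones):
    """For each zone (a tuple of feature names), estimate
    P(label == 1 | joint values of the zone's features) by counting."""
    tables = []
    for zone in zones:
        counts = defaultdict(lambda: [0, 0])  # key -> [positives, total]
        for row, y in zip(rows, labels):
            key = tuple(row[f] for f in zone)
            counts[key][0] += y
            counts[key][1] += 1
        tables.append({k: pos / tot for k, (pos, tot) in counts.items()})
    return tables


def encode(row, zones, tables, default):
    """One number per zone; unseen value combinations fall back to `default`."""
    return [
        table.get(tuple(row[f] for f in zone), default)
        for zone, table in zip(zones, tables)
    ]


# Invented toy data with three categorical features and a binary label.
rows = [
    {"sex": "female", "pclass": "1", "embarked": "S"},
    {"sex": "female", "pclass": "3", "embarked": "C"},
    {"sex": "male", "pclass": "1", "embarked": "S"},
    {"sex": "male", "pclass": "3", "embarked": "S"},
    {"sex": "female", "pclass": "1", "embarked": "C"},
    {"sex": "male", "pclass": "3", "embarked": "Q"},
]
labels = [1, 1, 0, 0, 1, 0]

# Assumption: zones are chosen by hand here, not derived from the probabilities.
zones = [("sex", "pclass"), ("embarked",)]

prior = sum(labels) / len(labels)  # fallback for unseen combinations
tables = fit_zone_codes(rows, labels, zones)
code = encode(rows[0], zones, tables, prior)
width = one_hot_width(rows, ["sex", "pclass", "embarked"])

print("one-hot width:", width)
print("zone code length:", len(code))
print("code for first row:", code)
```

On this toy data, one-hot coding needs 7 columns while the zone code has 2 entries, each of which reads directly as an estimated probability of the positive label. That is the kind of length reduction and interpretability the abstract claims, although the sketch does not reproduce the paper's proof or its zone-selection rule.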
This article is indexed in Wanfang Data and other databases.