
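Similarity Measure of Categorical Data Based on Information Theory
Cite this article: ZHENG Bi-ru, WU Guang-chao. Similarity Measure of Categorical Data Based on Information Theory[J]. Computer and Modernization, 2018, 0(5): 30.
Authors: ZHENG Bi-ru (郑碧如), WU Guang-chao (吴广潮)
Funding: General Program of the National Natural Science Foundation of China (61370102)
Abstract: Measuring the distance or similarity between two instances plays an important role in data mining and machine learning. Commonly used distance measures are mainly suited to numerical data; for categorical data, this paper proposes a data-driven similarity measure. The method uses the information carried by attribute values and class labels, combining the class-conditional probabilities of attribute values with information theory to measure the similarity of categorical data. To compare it with previously proposed similarity measures, each measure is combined with the k-nearest neighbor algorithm to classify several categorical data sets, and the resulting error rates are compared using ten-fold cross-validation. The experiments show that the proposed measure, combined with k-nearest neighbors, yields a comparatively low classification error rate.

Keywords: similarity; categorical data; information theory; conditional probability
Received: 2018-06-13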

Similarity Measure of Categorical Data Based on Information Theory
ZHENG Bi-ru, WU Guang-chao. Similarity Measure of Categorical Data Based on Information Theory[J]. Computer and Modernization, 2018, 0(5): 30.
Authors: ZHENG Bi-ru, WU Guang-chao
Abstract: Measuring the distance or similarity between two instances plays an important role in data mining and machine learning. Common distance measures are mainly suited to numerical data; for categorical data, this paper proposes a data-driven similarity measure. The method uses the information in attribute values and class labels, measuring the similarity of categorical data by combining the class-conditional probabilities of attribute values with information theory. To compare it with previously proposed similarity measures, the paper combines 8 measures with the k-nearest neighbor algorithm to classify several categorical data sets and compares the resulting error rates using ten-fold cross-validation. Experiments show that the proposed measure, combined with the k-nearest neighbor method, yields a lower classification error rate.
Keywords: similarity; categorical data; information theory; conditional probability
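
The abstract does not give the formula, so the following is only a minimal sketch of the general idea, not the authors' measure. It assumes that each attribute value is represented by a Laplace-smoothed class distribution P(class | value), that two values are compared with the Jensen-Shannon divergence, and that per-attribute scores are averaged. The function names (`class_conditionals`, `js_divergence`, `similarity`, `knn_predict`) and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_conditionals(rows, labels):
    """Estimate P(class | attribute j = value v) with Laplace smoothing (an assumption)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # (attribute index, value) -> class counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(j, v)][y] += 1
    probs = {}
    for key, ctr in counts.items():
        total = sum(ctr.values()) + len(classes)
        probs[key] = [(ctr[c] + 1) / total for c in classes]
    return classes, probs


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical distributions, at most 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity(x, y, probs, n_classes):
    """Average over attributes of 1 - JS divergence between the values' class distributions."""
    uniform = [1 / n_classes] * n_classes  # fallback for values unseen in training
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        p = probs.get((j, a), uniform)
        q = probs.get((j, b), uniform)
        total += 1 - js_divergence(p, q)
    return total / len(x)


def knn_predict(x, rows, labels, probs, n_classes, k=3):
    """Majority vote among the k training instances most similar to x."""
    ranked = sorted(
        range(len(rows)),
        key=lambda i: similarity(x, rows[i], probs, n_classes),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): colour determines the class, size carries no information.
rows = [("red", "small"), ("red", "large"), ("blue", "small"), ("blue", "large")]
labels = ["a", "a", "b", "b"]
classes, probs = class_conditionals(rows, labels)
prediction = knn_predict(("red", "large"), rows, labels, probs, len(classes), k=3)
```

In this sketch, two values of an attribute count as similar when they imply similar class distributions, so an uninformative attribute such as size contributes the same score to every pair. An evaluation like the one the abstract describes would wrap `class_conditionals` and `knn_predict` in a ten-fold cross-validation loop, estimating the probabilities on the training folds only.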