Similarity Measure of Categorical Data Based on Information Theory

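Cite this article:	ZHENG Bi-ru, WU Guang-chao. Similarity Measure of Categorical Data Based on Information Theory[J]. Computer and Modernization, 2018, 0(5): 30. DOI: 10.3969/j.issn.1006-2475.2018.05.007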

Authors:	ZHENG Bi-ru, WU Guang-chao

Funding:	General Program of the National Natural Science Foundation of China (61370102)

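Abstract:	The measure of distance or similarity between two instances plays an important role in data mining and machine learning. Common distance measures are designed mainly for numerical data; for categorical data, this paper proposes a data-driven similarity measure. The method uses the information carried by attribute values and class labels, combining the class-conditional probabilities of attribute values with information theory to measure the similarity of categorical data. To compare it with previously proposed similarity measures, each measure is combined with the k-nearest neighbor algorithm to classify multiple categorical data sets, and the resulting error rates are compared by ten-fold cross-validation. Experiments show that the proposed measure combined with the k-nearest neighbor method yields a lower classification error rate.
Keywords:	similarity; categorical data; information theory; conditional probability
Received:	2018-06-13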
Similarity Measure of Categorical Data Based on Information Theory

ZHENG Bi-ru,WU Guang-chao. Similarity Measure of Categorical Data Based on Information Theory[J]. Computer and Modernization, 2018, 0(5): 30. DOI: 10.3969/j.issn.1006-2475.2018.05.007

Authors:	ZHENG Bi-ru, WU Guang-chao

Abstract:	The measure of distance or similarity between two instances plays an important role in data mining and machine learning. Common distance measures are mainly suited to numerical data; for categorical data, this paper proposes a data-driven similarity measure. The method uses the information in attribute values and class labels, measuring the similarity of categorical data by combining the class-conditional probabilities of attribute values with information theory. To compare it with previously proposed similarity measures, the paper combines eight similarity measures with the k-nearest neighbor algorithm to classify several categorical data sets, and compares the resulting error rates through ten-fold cross-validation. Experiments show that the proposed measure combined with the k-nearest neighbor method yields a lower classification error rate.

Keywords:	similarity; categorical data; information theory; conditional probability
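
The abstract does not state the paper's actual formula, so the following is only an illustrative sketch of the general idea it describes: estimate class probabilities conditioned on each attribute value, compare two values with an information-theoretic quantity, and plug the resulting similarity into a k-nearest neighbor classifier. Everything specific here is an assumption rather than the authors' method: the use of P(class | attribute value) with Laplace smoothing, the choice of Jensen-Shannon divergence as the information-theoretic term, the averaging over attributes, and all function names and the toy data.

```python
from collections import Counter, defaultdict
from math import log2


def fit_class_posteriors(X, y, alpha=1.0):
    """Estimate P(class | attribute j = value) with Laplace smoothing (assumed form)."""
    classes = sorted(set(y))
    counts = defaultdict(Counter)  # (attribute index, value) -> class counts
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            counts[(j, value)][label] += 1
    posteriors = {}
    for key, class_counts in counts.items():
        total = sum(class_counts.values()) + alpha * len(classes)
        posteriors[key] = [(class_counts[c] + alpha) / total for c in classes]
    return classes, posteriors


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical distributions, at most 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity(a, b, classes, posteriors):
    """Mean over attributes of 1 - JS divergence between the values' class distributions."""
    uniform = [1 / len(classes)] * len(classes)  # fallback for values unseen in training
    total = 0.0
    for j, (u, v) in enumerate(zip(a, b)):
        if u == v:
            total += 1.0  # identical values are maximally similar
        else:
            p = posteriors.get((j, u), uniform)
            q = posteriors.get((j, v), uniform)
            total += 1.0 - js_divergence(p, q)
    return total / len(a)


def knn_predict(query, X, y, classes, posteriors, k=3):
    """Majority vote among the k training instances most similar to the query."""
    ranked = sorted(
        zip(X, y),
        key=lambda pair: similarity(query, pair[0], classes, posteriors),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two categorical attributes (color, shape) and a class label.
X_train = [
    ("red", "round"), ("red", "round"), ("green", "round"),
    ("yellow", "long"), ("yellow", "long"), ("green", "long"),
]
y_train = ["apple", "apple", "apple", "banana", "banana", "banana"]

classes, posteriors = fit_class_posteriors(X_train, y_train)
print(knn_predict(("red", "round"), X_train, y_train, classes, posteriors))  # prints: apple
```

Because the similarity is learned from the labels, two different attribute values that predict the same classes come out as similar, which a plain match/mismatch (overlap) measure cannot express. The ten-fold cross-validation mentioned in the abstract would wrap this in a loop that refits the probabilities on each training fold so that no test labels leak into the measure.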

