规范化相似度的符号序列层次聚类
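Cite this article: 张 豪, 陈黎飞, 郭躬德. 规范化相似度的符号序列层次聚类[J]. 计算机科学, 2015, 42(5): 114-118, 141.
Authors: 张 豪, 陈黎飞, 郭躬德
Affiliation: Fujian Provincial Key Laboratory of Network Security and Cryptology, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007
Funding: Supported by the National Natural Science Foundation of China (61175123) and the Shenzhen Basic Research (Key) Project (JCYJ20120617120716224).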
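Abstract (translated from the Chinese original): A categorical sequence consists of a finite number of symbols arranged in a particular order. Such sequences are common in many application domains of data mining, for example gene sequences, protein sequences, and speech sequences. As one of the main approaches to sequence mining, sequence clustering is valuable for identifying the intrinsic structure of sequence data; at the same time, because similarity between categorical sequences is difficult to measure, sequence clustering remains an open problem. This paper first proposes a new similarity measure for categorical sequences, which introduces a length-normalization factor to address the sensitivity of existing measures to sequence length and thereby makes the similarity measure more effective. On this basis, a new clustering method is proposed: a connected acyclic graph is built from the similarities between samples, and the categorical sequences are clustered hierarchically by partitioning this graph. Experimental results on several real-world datasets show that the new method, using the normalized measure, effectively improves the clustering accuracy for categorical sequences.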

Keywords (translated from the Chinese original): categorical sequence; clustering; similarity; normalization factor

Hierarchical Clustering of Categorical Sequences by Similarity Normalization
ZHANG Hao, CHEN Li-fei, GUO Gong-de. Hierarchical Clustering of Categorical Sequences by Similarity Normalization[J]. Computer Science, 2015, 42(5): 114-118, 141.
Authors: ZHANG Hao, CHEN Li-fei and GUO Gong-de
Affiliation (all authors): Fujian Provincial Key Laboratory of Network Security and Cryptology, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China
Abstract: A categorical sequence is composed of a finite number of symbols arranged in a certain order. Categorical sequences, such as gene sequences, protein sequences, and speech sequences, are common in many application domains of data mining. As a major method for sequence data mining, sequence clustering is valuable for identifying the intrinsic structure of sequence data, but it remains an open problem because the similarity between sequences is difficult to measure. This paper proposes a new similarity measure for categorical sequences, which introduces a length-normalization factor to address the sensitivity of existing measures to sequence length and thus improves the effectiveness of the similarity measure. Based on the new measure, a new clustering method is proposed, in which a connected acyclic graph is constructed from the similarities between samples and a hierarchical clustering of the categorical sequences is obtained by partitioning this graph. Experimental results on several real-world datasets show that the new method, based on the normalized similarity measure, effectively improves clustering accuracy.
Keywords: Categorical sequence; Clustering; Similarity; Normalization factor
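
The pipeline outlined in the abstract (a length-normalized similarity, a connected acyclic graph over the samples, then clustering by graph partitioning) can be illustrated with a minimal sketch. The abstract does not give the paper's actual measure or partitioning rule, so everything concrete here is an assumption made for illustration only: cosine similarity over bigram counts stands in for the normalized measure, a maximum spanning tree stands in for the connected acyclic graph, and cutting the weakest tree edges stands in for the partitioning step. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def ngram_counts(seq, n=2):
    """Count the length-n substrings of a symbol sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def normalized_similarity(a, b, n=2):
    """Cosine similarity of n-gram counts (assumed stand-in measure).

    Dividing by the two vector norms plays the role of a
    length-normalization factor; it is not the paper's formula.
    """
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def maximum_spanning_tree(seqs, n=2):
    """Kruskal's algorithm on similarities, strongest edges first.

    Returns (similarity, i, j) edges in descending similarity order;
    the result is a connected graph without cycles over all samples.
    """
    edges = sorted(
        ((normalized_similarity(seqs[i], seqs[j], n), i, j)
         for i, j in combinations(range(len(seqs)), 2)),
        reverse=True,
    )
    parent = list(range(len(seqs)))
    tree = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:  # adding this edge does not create a cycle
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def cluster(seqs, k, n=2):
    """Cut the k-1 weakest tree edges; each remaining component is a cluster."""
    if not 1 <= k <= len(seqs):
        raise ValueError("k must be between 1 and the number of sequences")
    tree = maximum_spanning_tree(seqs, n)
    kept = tree[:len(seqs) - k]  # tree is already sorted, strongest first
    parent = list(range(len(seqs)))
    for _, i, j in kept:
        parent[find(parent, i)] = find(parent, j)
    relabel = {}
    return [relabel.setdefault(find(parent, i), len(relabel))
            for i in range(len(seqs))]


if __name__ == "__main__":
    samples = ["ACGTACGTACGT", "ACGTACGT", "TTTTGGGGTTTT", "GGGGTTTT"]
    print(cluster(samples, k=2))
```

Cutting the tree edges one at a time, weakest first, yields a nested sequence of partitions, which is one simple way a tree over the samples can produce a hierarchy. Whether the paper splits its graph in this order or by another criterion is not stated in the abstract.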