Similar literature
20 similar articles found
1.
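Xu Kunpeng, Chen Lifei, Sun Haojun, Wang Beizhan. Journal of Software (《软件学报》), 2020, 31(11): 3492-3505
Most existing subspace clustering methods for categorical data assume that features are mutually independent and ignore linear or nonlinear correlations among attributes. This paper proposes a kernel subspace clustering method for categorical data. First, kernel functions originally designed for continuous data are introduced to project categorical data into a kernel space, and a feature-weighted similarity measure for categorical data is defined in that space. Second, an objective function for kernel subspace clustering of categorical data is derived from this measure, and an efficient optimization method for solving it is proposed. Finally, a kernel subspace clustering algorithm for categorical data is defined. The algorithm not only accounts for relationships among attributes in a nonlinear space but also assigns each attribute a feature weight during clustering that measures its relevance to each cluster, achieving embedded feature selection for categorical attributes. A cluster validity index is also defined to evaluate the quality of categorical clustering results. Experiments on synthetic and real data sets show that, compared with existing subspace clustering algorithms, the kernel subspace clustering algorithm can uncover nonlinear relationships among categorical attributes and effectively improves clustering quality.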

2.
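Chen Lifei, Guo Gongde. Journal of Software (《软件学报》), 2013, 24(11): 2628-2641
Categorical data are widespread in bioinformatics and many other application domains, and their discrete values make clustering them a difficult task in statistical machine learning. Current mainstream methods rely on the modes of categorical attributes for clustering optimization and for computing the weights of relevant attributes. This paper proposes a non-mode statistical clustering method for categorical data. First, based on a newly defined dissimilarity measure, an attribute-weighted objective function for categorical data clustering is derived. The function is based on the average distance between objects and clusters, which avoids the problems caused by the mode-centered design of existing methods. Second, a soft subspace clustering algorithm for categorical data is defined. During clustering, the algorithm assigns each attribute a weight measuring its relevance to each cluster according to the overall distribution of the attribute's values rather than its mode alone, achieving automatic feature selection. Experiments on synthetic and real-world data sets show that the algorithm effectively improves clustering quality compared with existing mode-based clustering algorithms and with other non-mode algorithms based on Monte Carlo optimization.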

3.
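The traditional K-modes algorithm uses simple attribute matching to compute the distance between different values of the same attribute, and gives all attributes equal weight when computing the distance between samples. Building on this, and taking into account the order relation among values of ordinal categorical data, the similarity between different values of nominal categorical data, and the relationships among attributes, an improved clustering algorithm better suited to mixed categorical data is proposed. The algorithm uses different distance measures for nominal and ordinal categorical data and assigns the corresponding weights by average entropy. Experimental results show that the improved algorithm clusters better than the K-modes algorithm and its improved variants on both artificial and real data sets.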

4.
Clustering is one of the most popular techniques in data mining. The goal of clustering is to identify distinct groups in a dataset. Many clustering algorithms have been published so far, but often limited to numeric or categorical data. However, most real world data are mixed, numeric and categorical. In this paper, we propose a clustering algorithm CAVE which is based on variance and entropy, and is capable of mining mixed data. The variance is used to measure the similarity of the numeric part of the data. To express the similarity between categorical values, distance hierarchy has been proposed. Accordingly, the similarity of the categorical part is measured based on entropy weighted by the distances in the hierarchies. A new validity index for evaluating the clustering results has also been proposed. The effectiveness of CAVE is demonstrated by a series of experiments on synthetic and real datasets in comparison with that of several traditional clustering algorithms. An application of mining a mixed dataset for customer segmentation and catalog marketing is also presented.

5.
In cluster analysis, one of the most challenging and difficult problems is the determination of the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. To solve this problem, many algorithms have been proposed for either numerical or categorical data sets. However, these algorithms are not very effective for a mixed data set containing both numerical attributes and categorical attributes. To overcome this deficiency, a generalized mechanism is presented in this paper by integrating Rényi entropy and complement entropy together. The mechanism is able to uniformly characterize within-cluster entropy and between-cluster entropy and to identify the worst cluster in a mixed data set. In order to evaluate the clustering results for mixed data, an effective cluster validity index is also defined in this paper. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. The performance of the algorithm has been studied on several synthetic and real world data sets. The comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results.

6.
K-means type clustering algorithms for mixed data that consists of numeric and categorical attributes suffer from the cluster center initialization problem. The final clustering results depend upon the initial cluster centers. Random cluster center initialization is a popular initialization technique. However, clustering results are not consistent with different cluster center initializations. The K-Harmonic means clustering algorithm tries to overcome this problem for pure numeric data. In this paper, we extend the K-Harmonic means clustering algorithm for mixed datasets. We propose a definition for a cluster center and a distance measure. These cluster centers and the distance measure are used with the cost function of the K-Harmonic means clustering algorithm in the proposed algorithm. Experiments were carried out with pure categorical datasets and mixed datasets. Results suggest that the proposed clustering algorithm is quite insensitive to the cluster center initialization problem. Comparative studies with other clustering algorithms show that the proposed algorithm produces better clustering results.

7.
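An effective dynamic conceptual clustering algorithm for data mining   Cited by: 11 (self-citations: 0, by others: 11)
Guo Jiansheng, Zhao Yi, Shi Pengfei. Journal of Software (《软件学报》), 2001, 12(4): 582-591
Conceptual clustering suits data mining tasks in which domain knowledge is incomplete or lacking. A semantics-based distance function is defined, and continuous attribute values are conceptualized with the help of domain knowledge. For data objects described by a mixture of categorical and numeric attributes, a dynamic conceptual clustering algorithm, DDCA (domain-based dynamic clustering algorithm), is proposed. The algorithm determines the number of clusters automatically, adjusts cluster centers according to the frequency of attribute values within each cluster, and, through concept induction, explains the clustering output with conjunctive concept expressions. The study shows that DDCA, which is based on the semantic distance function and on domain knowledge, is effective.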

8.
Almost all subspace clustering algorithms proposed so far are designed for numeric datasets. In this paper, we present a k-means type clustering algorithm that finds clusters in data subspaces in mixed numeric and categorical datasets. In this method, we compute attribute contributions to different clusters. We propose a new cost function for a k-means type algorithm. One of the advantages of this algorithm is its complexity, which is linear with respect to the number of data points. This algorithm is also useful in describing the cluster formation in terms of attribute contributions to different clusters. The algorithm is tested on various synthetic and real datasets to show its effectiveness. The clustering results are explained by using attribute weights in the clusters. The clustering results are also compared with published results.

9.
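Research on a clustering ensemble algorithm for categorical attribute data*   Cited by: 1 (self-citations: 1, by others: 0)
A single clustering algorithm tends to give inaccurate and highly random results, and existing algorithms introduce error when they convert categorical data to numeric form before clustering. To address these problems, a clustering ensemble algorithm for categorical attribute data is proposed. The algorithm generates ensemble members from the differences among the original categorical attribute values, then partitions the data with a similarity-based method, simplifying the clustering process by seeking the partition that minimizes the objective function. The algorithm was validated on UCI data sets; the results show that its efficiency and accuracy exceed those of existing algorithms, indicating that its design and update strategy are effective.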

10.
Dimensionality reduction is a useful technique to cope with the high dimensionality of real-world data. However, traditional methods were studied in the context of datasets with only numeric attributes. With the demand of analyzing datasets involving categorical attributes, an extension to the recent dimensionality-reduction technique t-SNE is proposed. The extension enables t-SNE to handle mixed-type datasets. Each attribute of the data is associated with a distance hierarchy which allows the distance between numeric values and between categorical values to be measured in a unified manner. More importantly, domain knowledge regarding distance considering semantics embedded in categorical values can be specified via the hierarchy. Consequently, the extended t-SNE can project the high-dimensional, mixed data to a low-dimensional space with topological order which reflects the user's intuition.

11.
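The traditional K-modes algorithm is widely used for clustering categorical attributes, but it does not distinguish ordinal categorical attributes from nominal ones. Distinguishing the two, a new distance formula is proposed and the algorithm flow is optimized. A reasonable range for the distance between adjacent values of an ordinal attribute is determined from the distance values of nominal attributes. Using the order relation inherent in ordinal attributes, a distance formula for ordinal attributes is constructed. When computing the distance between a sample point and a centroid, the proportion of each attribute value within the cluster is introduced as an important parameter of the overall distance formula. In summary, the new distance formula captures the distance for ordinal categorical attributes well and balances the differences between the distance formulas for the two kinds of categorical attributes. Experimental results show that the improved algorithm and distance formula achieve significantly better results on real UCI data sets than the original K-modes algorithm and its improved variants.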

12.
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
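
The two building blocks named in this abstract, the simple matching dissimilarity and the frequency-based mode update, can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation. The abstract does not give the combined k-prototypes measure; the form below (squared Euclidean distance on the numeric part plus a weighted mismatch count) and the weight name `gamma` are assumptions, as are all function names.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Frequency-based mode: the most frequent value of each attribute.

    `records` is a non-empty list of equal-length tuples of categorical values.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def k_prototypes_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Assumed combined measure: squared Euclidean part plus gamma * mismatches.

    `gamma` (hypothetical name) balances the numeric and categorical parts.
    """
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric_part + gamma * matching_dissimilarity(cat_a, cat_b)
```

A full k-modes loop would alternate between assigning each record to the mode with the smallest `matching_dissimilarity` and recomputing each mode with `cluster_mode`, in the same way k-means alternates between assignment and mean updates.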

13.
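A clustering algorithm for mixed-attribute data streams   Cited by: 5 (self-citations: 0, by others: 5)
Yang Chunyu, Zhou Jie. Chinese Journal of Computers (《计算机学报》), 2007, 30(8): 1364-1371
Data stream clustering is an important problem in data stream mining. Real-world data streams often have both continuous and nominal attributes, but existing algorithms handle only one kind and simply discard the other. No existing algorithm clusters mixed-attribute data streams at the algorithmic level. This paper proposes a clustering algorithm for mixed-attribute data streams: it establishes a Poisson process model of data stream arrival, describes discrete attributes with frequency histograms, and gives algorithms for creating, updating, merging, and deleting micro-clusters under mixed attributes. Experiments on public data sets show that the proposed algorithm performs robustly.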

14.
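Traditional density-based algorithms cannot handle mixed-attribute data, and most current mixed-attribute clustering algorithms give low clustering quality. To address these problems, a mixed-attribute clustering algorithm based on density and a mixed distance measure is proposed. By analyzing the characteristics of mixed-attribute data, the algorithm divides such data into three types (numeric-dominant, categorical-dominant, and balanced) and selects a distance measure suited to each. Using preset parameters, it finds dense regions of the data and determines core points, then uses the core points to find density-connected objects and so obtains the final clustering. Experiments on a variety of data sets show that the algorithm achieves high clustering quality and handles mixed-attribute data effectively.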

15.
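Chen Jinyin, He Huihao. Acta Automatica Sinica (《自动化学报》), 2015, 41(10): 1798-1813
Mixed-attribute data are widespread, yet most existing mixed-attribute clustering algorithms suffer from low clustering quality, strong dependence on parameters, and an inability to determine the number of clusters and the cluster centers accurately and automatically. To address these problems, this paper proposes a density-based clustering algorithm for mixed-attribute data that determines cluster centers automatically. By analyzing the characteristics of mixed-attribute data, the algorithm divides such data into three types (numeric-dominant, categorical-dominant, and balanced) and selects a distance measure suited to each. From the distribution of density and distance over all points in the data set, the following rule is derived: a point with high density that lies far from any point of higher density is the most likely to be a cluster center. Singular points are identified by a linear regression model and residual analysis and are shown theoretically to be the cluster centers, so the centers are determined automatically. Particle swarm optimization (PSO) is used to find the optimal value of dc; given dc, the density of every data object and its minimum distance to a point of higher density can be computed. The center of each cluster is determined by the automatic method, and every other point is assigned to the cluster of its nearest higher-density neighbor, which completes the clustering. Finally, the proposed algorithm is compared with several existing mixed-attribute clustering algorithms on multiple data sets, confirming that it achieves high clustering quality.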
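
The two per-point quantities this abstract relies on, the density within the cutoff dc and the minimum distance to a point of higher density, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: it takes a precomputed distance matrix (the paper's mixed-attribute distance is not given in the abstract), uses a simple neighbor count as the density, and gives a point with no higher-density neighbor its largest distance to any point. The PSO search for dc and the regression-based selection of centers are not shown. The function name `density_and_delta` is hypothetical.

```python
def density_and_delta(dist, dc):
    """Compute density (rho) and distance to a higher-density point (delta).

    dist: square list-of-lists distance matrix (any mixed-attribute distance).
    dc:   cutoff distance; rho[i] counts the other points closer than dc.
    """
    n = len(dist)
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < dc)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Assumed convention: a point with no denser point gets its largest
        # distance to any point, so it stands out as a center candidate.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

Points for which both `rho` and `delta` are large are the center candidates described in the abstract; every remaining point would then join the cluster of its nearest higher-density neighbor.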

16.
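Because categorical and numeric attributes differ in nature, clustering algorithms for mixed-type data usually have to treat the two kinds of attributes separately, which makes them harder to design and implement. In addition, attributes differ in the amount of information they carry, yet existing algorithms usually treat all attributes equally. A mixed-type data clustering algorithm that combines simplex mapping with information-entropy weighting is proposed. Based on simplex theory, categorical attributes are mapped to high-dimensional numeric attribute vectors; information entropy theory is used to assign a weight to each attribute and build a similarity measure; and applying this measure within the K-Means framework yields the clustering algorithm. Experiments on six mixed UCI data sets show that the proposed algorithm outperforms the traditional mapping-based clustering algorithm and the K-Prototype algorithm, improving accuracy by 2.70% and 18.33%, respectively.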

17.
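A BIRCH-based clustering method for mixed-attribute data   Cited by: 2 (self-citations: 1, by others: 1)
Data clustering is an important research topic in data mining. Real-world data often have both continuous and discrete attributes, but most existing algorithms handle only one kind and simply discard the other, which loses clustering information and lowers clustering quality. Some algorithms that can handle mixed attributes process too many attributes, which greatly increases the computational cost. A clustering algorithm for mixed-attribute data based on the BIRCH algorithm is proposed; experiments on UCI data sets show that it performs well.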

18.
Clustering consists in partitioning a set of objects into disjoint and homogeneous clusters. For many years, clustering methods have been applied in a wide variety of disciplines and they also have been utilized in many scientific areas. Traditionally, clustering methods deal with numerical data, i.e. objects represented by a conjunction of numerical attribute values. However, nowadays commercial or scientific databases usually contain categorical data, i.e. objects represented by categorical attributes. In this paper we present a dissimilarity measure which is capable of dealing with tree-structured categorical data. Thus, it can be used for extending the various versions of the very popular k-means clustering algorithm to deal with such data. We discuss how such an extension can be achieved. Moreover, we empirically show that the proposed dissimilarity measure is accurate, compared to other well-known (dis)similarity measures for categorical data.

19.
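Arbitrary-shape clustering that handles mixed attributes   Cited by: 1 (self-citations: 1, by others: 0)
Clustering is a very active research area in data mining, and clustering of arbitrary shapes remains an open problem. A between-cluster dissimilarity measure that incorporates the frequency of categorical attribute values and a definition of similarity between an object and a cluster are proposed. On this basis, a clustering algorithm is proposed that handles arbitrary shapes and mixed-attribute data sets. The algorithm was tested on synthetic and real data sets and compared with related algorithms; the experimental results show that it is effective and feasible.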

20.
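Mixed-type data are usually clustered by evaluating samples separately according to attribute type. Dividing the attributes of a sample into different subspaces and measuring them separately, however, breaks the original unity of the attributes and introduces inconsistent bias into the similarity evaluation of individual samples. To address this problem, a new clustering algorithm is proposed that encodes sample attributes in binary and measures the attribute codes uniformly by Hamming difference. By measuring the similarity of mixed-type data within a unified framework, the new algorithm avoids splitting the attributes of a sample; on this basis, it assigns different weights to attributes according to their properties and uses them to evaluate the similarity between individual samples. Experimental results show that the new algorithm clusters mixed-type data effectively and achieves better clustering accuracy and stability than other existing clustering algorithms.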
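
The idea of encoding every attribute in binary and comparing samples by a weighted Hamming difference can be illustrated as below. This is a hedged sketch only: the abstract does not specify the encoding or the weighting scheme, so the one-hot code for categorical values, the thermometer code for numeric values, and the caller-supplied `weights` are all assumptions, and every function name is hypothetical.

```python
def one_hot(value, categories):
    """Assumed categorical code: one bit per category."""
    return [1 if value == c else 0 for c in categories]


def thermometer(value, low, high, bits):
    """Assumed numeric code: the first `level` bits are set, so values that
    are close on [low, high] differ in few bit positions."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    level = round((clipped - low) / (high - low) * bits)
    return [1 if i < level else 0 for i in range(bits)]


def weighted_hamming(codes_a, codes_b, weights):
    """Sum over attributes of weight * number of differing bits.

    codes_a, codes_b: one bit list per attribute; weights: one per attribute.
    """
    total = 0.0
    for a, b, w in zip(codes_a, codes_b, weights):
        total += w * sum(x != y for x, y in zip(a, b))
    return total
```

Because both attribute types end up as bit strings, a single distance function covers the whole sample, which is the property the abstract emphasizes; the per-attribute weights stand in for the property-based weights it mentions.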
