Found 20 similar documents (search time: 15 ms)
1.
2.
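Existing algorithms for categorical data need to scan the database multiple times; for the large data volumes that data mining routinely handles, multi-pass I/O is a heavy system overhead. CACD (clustering algorithm for categorical data) is a clustering algorithm for data with categorical attributes. It uses a compression technique to reduce the amount of data to be processed and thereby improve efficiency, and it proposes a new criterion, based on the compressed data structure, for measuring the similarity of categorical data. CACD scans the database only once. Both theoretical and experimental analysis show that it is more efficient than comparable clustering algorithms for categorical data, and that compression has little effect on the quality of the clustering results.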
3.
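The traditional K-modes algorithm uses simple attribute matching to compute the distance between different values of the same attribute, and gives all attributes equal weight when computing the distance between samples. Building on this, and taking into account the order relation among values of ordinal categorical data, the similarity between different values of nominal categorical data, and the relationships among attributes, an improved clustering algorithm better suited to mixed categorical data is proposed. The algorithm uses different distance measures for nominal and ordinal categorical data and assigns the corresponding weights using average entropy. Experimental results show that the improved algorithm clusters better than K-modes and its improved variants on both synthetic and real data sets.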
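The abstract does not give the weighting formula, so the following is only a minimal sketch of one way entropy could replace the equal attribute weights of K-modes. It assumes each attribute's weight is its Shannon entropy normalized over all attributes; the function names and that normalization are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def attribute_entropy(column):
    """Shannon entropy (in bits) of the value distribution of one attribute."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_weights(rows):
    """Assumed scheme: per-attribute entropies normalized so they sum to 1."""
    columns = list(zip(*rows))
    entropies = [attribute_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]


def weighted_matching_distance(x, y, weights):
    """Simple matching distance where each mismatch costs its attribute weight."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


rows = [("a", "x"), ("a", "y"), ("a", "x"), ("a", "y")]
weights = entropy_weights(rows)
print(weights, weighted_matching_distance(rows[0], rows[1], weights))
```

In this toy data the first attribute is constant, so it carries no weight and only the second attribute contributes to the distance.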
4.
5.
Ming Lei Pilian He Zhichao Li 《通讯和计算机》2006,3(8):20-24
Most earlier work on clustering focuses on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits it from being used for clustering categorical data. This paper shows how to apply the notion of "cluster centers" to a dataset of categorical objects, and introduces a k-means-like algorithm for clustering categorical data.
6.
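In this paper we propose VBCCD, a new clustering algorithm for non-numerical data. VBCCD computes one-dimensional partitions of a relation from the relational table, constructs a hypergraph from these partitions, and then applies a hypergraph partitioning algorithm to partition the constructed hypergraph optimally, which yields the final clustering. Experimental results show that the algorithm performs better than traditional clustering algorithms designed for numerical data.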
7.
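The density peaks clustering algorithm struggles to produce good clusterings on categorical data. This paper analyzes the causes in detail: an overlap problem in distance computation and an aggregation problem in density computation. To address them, a density peaks clustering algorithm for categorical data is proposed (Cauchy kernel-based density peaks clustering for categorical data, CDPCD). The algorithm first notes that ordinal characteristics (the order relation among the values of a categorical attribute) are rarely considered when measuring distance between categorical data, and then proposes a weighted ordinal distance measure based on probability distributions to alleviate the overlap problem. By incorporating the Cauchy kernel function, it re-evaluates data density values on top of the shared-nearest-neighbor density peaks clustering algorithm, improving the density computation and the secondary assignment scheme, increasing the diversity of densities, and reducing the impact of the aggregation problem. Experiments on multiple real data sets show that CDPCD achieves better clustering results than traditional partition-based and density-based clustering algorithms.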
8.
Cesario E. Manco G. Ortale R. 《Knowledge and Data Engineering, IEEE Transactions on》2007,19(12):1607-1624
A parameter-free, fully automatic approach to clustering high-dimensional categorical data is proposed. The technique is based on a two-phase iterative procedure, which attempts to improve the overall quality of the whole partition. In the first phase, cluster assignments are given, and a new cluster is added to the partition by identifying and splitting a low-quality cluster. In the second phase, the number of clusters is fixed, and an attempt is made to optimize cluster assignments. On the basis of such features, the algorithm attempts to improve the overall quality of the whole partition and finds clusters in the data, whose number is naturally established on the basis of the inherent features of the underlying data set rather than being previously specified. Furthermore, the approach is parametric to the notion of cluster quality: here, a cluster is defined as a set of tuples exhibiting a sort of homogeneity. We show how a suitable notion of cluster homogeneity can be defined in the context of high-dimensional categorical data, from which an effective instance of the proposed clustering scheme immediately follows. Experiments on both synthetic and real data prove that the devised algorithm scales linearly and achieves nearly optimal results in terms of compactness and separation.
9.
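Clustering is an important method in data mining. To address the shortcomings of existing clustering algorithms for data sets with categorical and mixed attributes, such as the k-modes, k-prototypes, and fuzzy k-prototypes algorithms, a new method is proposed: categorical attribute decomposition. The method is more stable and reliable and effectively reduces randomness.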
10.
Chen Hung-Leng Chen Ming-Syan Lin Su-Chen 《Knowledge and Data Engineering, IEEE Transactions on》2009,21(5):652-665
Although the problem of clustering numerical time-evolving data is well explored, the problem of clustering categorical time-evolving data remains a challenging issue. In this paper, we propose a generalized clustering framework which utilizes existing clustering algorithms and adopts a sliding window technique to detect whether there is a drifting concept in the incoming sliding window. The framework is composed of two algorithms: the Drifting Concept Detecting (DCD) algorithm, which detects changes in cluster distributions between the current sliding window and the last clustering result, and the Cluster Relationship Analysis (CRA) algorithm, which analyzes the relationship between clustering results at different times. In DCD, the concept is said to drift if quite a large number of outliers are found in the current sliding window, or if the ratio of data points varies in quite a large number of clusters. The drifted sliding window is re-clustered to capture the recent concept. In CRA, a visualizing method is devised to facilitate the observation of the evolving clustering results. The framework is validated on real and synthetic data sets, and is shown not only to accurately detect drifting concepts but also to attain clustering results of better quality.
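The two drift conditions described for DCD can be sketched as a small predicate. This is an illustrative sketch only: the function name, the three threshold parameters, and their default values are hypothetical, since the abstract says only "quite a large number" and gives no concrete thresholds.

```python
def concept_drifts(prev_ratios, curr_ratios, outlier_ratio,
                   outlier_threshold=0.1, ratio_change=0.1,
                   cluster_threshold=0.5):
    """Return True if the current window looks like a drifted concept.

    prev_ratios / curr_ratios map a cluster label to the fraction of data
    points assigned to that cluster in the last clustering result and in
    the current sliding window. All thresholds are hypothetical defaults.
    """
    # Condition 1: too many points in the window fit no existing cluster.
    if outlier_ratio > outlier_threshold:
        return True

    # Condition 2: too many clusters changed their share of data points.
    labels = set(prev_ratios) | set(curr_ratios)
    if not labels:
        return False
    changed = sum(
        1 for c in labels
        if abs(curr_ratios.get(c, 0.0) - prev_ratios.get(c, 0.0)) > ratio_change
    )
    return changed / len(labels) > cluster_threshold


previous = {"c1": 0.5, "c2": 0.5}
print(concept_drifts(previous, {"c1": 0.5, "c2": 0.45}, outlier_ratio=0.05))
print(concept_drifts(previous, {"c1": 0.9, "c2": 0.1}, outlier_ratio=0.0))
```

When the predicate returns True, the framework described above would re-cluster the current window; that step is left out of the sketch.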
11.
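A random projection clustering algorithm for high-dimensional symbolic data  Cited by: 1 (self-citations: 0, citations by others: 1)
Real-world data often lie in high-dimensional spaces, and viewed over the whole vector space the relationships among such data are very sparse, so reducing dimensionality in order to cluster high-dimensional data has attracted wide attention from researchers. This paper introduces a random projection clustering algorithm suited to high-dimensional symbolic data. It selects the dimensions relevant to a cluster by frequency, generates cluster centers and their relevant dimensions at random and keeps the best according to the projected clustering quality, and thereby extends projected clustering to the symbolic data space. Experimental results confirm the practicality and effectiveness of the algorithm.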
12.
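The traditional K-modes algorithm is widely used for clustering categorical attributes, but it does not distinguish ordinal from nominal categorical attributes. Distinguishing the two, this paper proposes a new distance formula and optimizes the algorithm's workflow. Based on the distance values for nominal attributes, a reasonable range is determined for the distance between adjacent values of an ordinal attribute. Using the order relation implied by ordinal attributes, a distance formula for ordinal attributes is constructed. When computing the distance between a sample and a centroid, the proportion of each attribute value within the cluster is introduced as an important parameter of the overall distance formula. In sum, the new distance formula captures the distance of ordinal attributes well and balances the differences between the distance formulas for the two kinds of categorical attributes. Experimental results show that the improved algorithm and distance formula perform significantly better on real UCI data sets than the original K-modes algorithm and its improved variants.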
13.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values  Cited by: 76 (self-citations: 0, citations by others: 76)
Zhexue Huang 《Data mining and knowledge discovery》1998,2(3):283-304
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values
prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms
which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes
algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with
modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function.
With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The
k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes
algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean
disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on
two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large
data sets, which is critical to data mining applications.
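The three ingredients named in this abstract (simple matching dissimilarity, modes in place of means, and a frequency-based mode update) can be sketched as follows. This is a simplified batch variant written for illustration, not the paper's exact procedure; the function names and the choice to pass initial modes explicitly are assumptions.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """Simple matching: the number of attributes on which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def cluster_mode(members):
    """Frequency-based mode: the most frequent value of every attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


def k_modes(rows, initial_modes, max_iter=100):
    """Batch k-modes sketch: assign rows to the nearest mode, then update modes."""
    modes = [tuple(m) for m in initial_modes]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(modes)),
                key=lambda k: matching_dissimilarity(row, modes[k]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the modes are stable
        assignment = new_assignment
        for k in range(len(modes)):
            members = [r for r, a in zip(rows, assignment) if a == k]
            if members:  # keep the old mode if a cluster becomes empty
                modes[k] = cluster_mode(members)
    return assignment, modes


rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "r"), ("b", "y", "r")]
assignment, modes = k_modes(rows, initial_modes=[rows[0], rows[2]])
print(assignment, modes)
```

Because the mode is built attribute by attribute, it need not coincide with any object actually present in the cluster.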
14.
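Categorical data clustering is widely applied in many real-world fields, such as medical science and computer science. It is usually studied on the basis of dissimilarity measures; for data sets with different characteristics, the clustering results are affected by the data set's own characteristics and by noise. In addition, categorical data clustering based on representation learning is complex to implement, and its results depend heavily on the learned representation. Based on the co-occurrence matrix, this paper proposes a clustering method that directly considers the associations in the original information of categorical data: categorical data clustering based on extracting associations from the co-occurrence matrix (CDCBCM). The co-occurrence matrix can be viewed as a summary of the information associations in the original data space. The method builds the co-occurrence matrix by computing the co-occurrence frequencies of different objects in each attribute subspace, removes some noise from the matrix, and then applies normalized cut to obtain the clustering. The method is tested on 16 public data sets from different fields, compared with 8 existing methods, and evaluated with the F1-score. Experimental results show that it performs best on 7 data sets and has the highest average rank, so it handles the categorical data clustering task better.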
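As a rough illustration of the first two steps, the sketch below builds an object-by-object co-occurrence matrix and zeroes out weak entries. It rests on two assumptions the abstract does not confirm: that the co-occurrence frequency of two objects is the fraction of attributes on which they share a value, and that noise removal is a plain threshold. The normalized cut step is omitted.

```python
def cooccurrence_matrix(rows):
    """Assumed definition: entry (i, j) is the fraction of attributes on
    which objects i and j take the same value."""
    n = len(rows)
    m = len(rows[0])
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(rows[i], rows[j]) if a == b)
            matrix[i][j] = matrix[j][i] = shared / m
    return matrix


def drop_weak_links(matrix, threshold):
    """Hypothetical noise filter: zero out entries below the threshold."""
    return [[v if v >= threshold else 0.0 for v in row] for row in matrix]


rows = [("a", "x"), ("a", "y"), ("b", "z")]
matrix = cooccurrence_matrix(rows)
print(matrix)
print(drop_weak_links(matrix, threshold=0.6))
```

The filtered matrix could then serve as the affinity matrix for a graph-partitioning step such as normalized cut.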
15.
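Using multi-attribute frequency weights and a multi-objective cluster-set quality criterion, a subspace clustering algorithm for categorical data is proposed. Using equivalence classes from rough set theory, the algorithm defines a multi-attribute weight computation that effectively improves the attributes' ability to discriminate clusters. On the basis of a multi-objective cluster-set quality function, it adopts a hierarchical agglomerative strategy that iteratively merges sub-clusters, effectively measuring clusters at various scales. It uses interval dispersion to avoid the parameter problem caused by deleting noise points with a threshold, and uses the degree to which attributes adhere to a cluster to determine each cluster's relevant attribute subspace, improving the interpretability of clusters. Finally, experiments on synthetic, UCI, and stellar spectrum data sets verify the feasibility and effectiveness of the algorithm.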
16.
Among the huge diversity of ideas that arise in graph theory, one that has become very popular is the concept of graph labelings. Graph labelings give valuable mathematical models for a wide range of applications in high technology (cryptography, astronomy, data security, various coding theory problems, communication networks, etc.). A labeling or a valuation of a graph is any mapping that sends a certain set of graph elements to a certain set of numbers subject to certain conditions. Graph labeling is a mapping of elements of the graph, i.e., vertices and/or edges, to a set of numbers (usually positive integers), called labels. If the domain is the vertex set or the edge set, the labelings are called vertex labelings or edge labelings, respectively. Similarly, if the domain is V(G) ∪ E(G), the labeling is called a total labeling. A reflexive edge irregular k-labeling of a graph, introduced by Tanna et al., is a total labeling χ of the graph such that for any two different edges ab and a'b' of the graph their weights wt_χ(ab) = χ(a) + χ(ab) + χ(b) and wt_χ(a'b') = χ(a') + χ(a'b') + χ(b') are distinct. The smallest value of k for which such a labeling exists is called the reflexive edge strength of the graph and is denoted by res(G). In this paper we have found the exact value of the reflexive edge irregularity strength of the categorical product of two paths Pa × Pb for any choice of a ≥ 3 and b ≥ 3.
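The edge-weight condition can be checked mechanically. The sketch below computes the weight of each edge as the sum of the labels of its two endpoints and of the edge itself, then tests whether all weights are distinct. It covers only that distinctness condition; any constraints on which label values are allowed are not modeled, and the function names and the small path example are illustrative assumptions.

```python
def edge_weights(vertex_labels, edge_labels):
    """Weight of edge ab = label(a) + label(ab) + label(b)."""
    return {
        (a, b): vertex_labels[a] + label + vertex_labels[b]
        for (a, b), label in edge_labels.items()
    }


def is_edge_irregular(vertex_labels, edge_labels):
    """True if no two edges share the same weight."""
    weights = list(edge_weights(vertex_labels, edge_labels).values())
    return len(weights) == len(set(weights))


# A path on three vertices 0 - 1 - 2 with both edges labeled 1.
edges = {(0, 1): 1, (1, 2): 1}
print(is_edge_irregular({0: 0, 1: 0, 2: 2}, edges))  # weights 1 and 3
print(is_edge_irregular({0: 0, 1: 0, 2: 0}, edges))  # weights 1 and 1
```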
17.
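When unevenly distributed data are labeled through traditional cluster analysis, the clustering tends to favor classes with many samples, which leads to labeling errors. To address this, a labeling method based on an improved clustering algorithm with balance constraints is proposed, which fairly effectively improves the accuracy of cluster-based labeling of imbalanced data. The method comprises initial clustering of the data, adjustment using expert knowledge, data balancing, and balance-constrained clustering. Initial clustering assigns initial class labels to the imbalanced data; expert-knowledge adjustment corrects the labels of some mislabeled data; balancing produces a balanced data set; and balance-constrained clustering makes the final, precise label assignment on the balanced data. Simulation confirms that the method fairly effectively improves labeling accuracy on imbalanced data.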
18.
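A subspace clustering algorithm for high-dimensional categorical attributes  Cited by: 3 (self-citations: 0, citations by others: 3)
Handling high-dimensional categorical data has long been a major challenge for data mining research. Traditional clustering algorithms mainly target low-dimensional continuous data and have difficulty with high-dimensional categorical data sets. This paper proposes a subspace clustering algorithm for high-dimensional categorical data sets (FP-Tree-based SUBspace clustering algorithm, FPSUB). It uses a frequent pattern tree to turn the clustering problem into the problem of discovering frequent patterns of attribute values; the resulting frequent patterns are the candidate subspaces, and clustering is then performed on these subspaces. Experiments on real data sets show that FPSUB is more accurate than other algorithms.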
19.
《IEEE transactions on audio, speech, and language processing》2009,17(1):138-149
Automatic speech recognition (ASR) systems rely almost exclusively on short-term segment-level features (MFCCs), while ignoring higher-level suprasegmental cues that are characteristic of human speech. However, recent experiments have shown that categorical representations of prosody, such as those based on the Tones and Break Indices (ToBI) annotation standard, can be used to enhance speech recognizers. Such categorical prosody models are nonetheless severely limited in scope and coverage due to the lack of large corpora annotated with the relevant prosodic symbols (such as pitch accent, word prominence, and boundary tone labels). In this paper, we first present an architecture for augmenting a standard ASR with symbolic prosody. We then discuss two novel, unsupervised adaptation techniques for improving, respectively, the quality of the linguistic and acoustic components of our categorical prosody models. Finally, we implement the augmented ASR by enriching ASR lattices with the adapted categorical prosody models. Our experiments show that the proposed unsupervised adaptation techniques significantly improve the quality of the prosody models; the adapted prosodic language and acoustic models reduce binary pitch accent (presence versus absence) classification error rate by 13.8% and 4.3%, respectively (relative to the seed models) on the Boston University Radio News Corpus, while the prosody-enriched ASR exhibits a 3.1% relative reduction in word error rate (WER) over the baseline system.
20.
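Self-organizing neural networks for categorical data  Cited by: 1 (self-citations: 2, citations by others: 1)
As an excellent tool for clustering and dimensionality reduction, the self-organizing neural network SOM (Self-Organizing Feature Maps) has been widely applied. Its shortcoming is that it suits only numerical data, which is insufficient for data mining applications that often need to handle categorical-valued data or mixed numeric and categorical-valued data. This paper proposes a new overlap-based distance function and uses it in SOM training. Experimental results show that good clustering can be achieved without increasing time or space overhead.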