Similar literature
20 similar documents found; search time: 970 ms
1.
Clustering methods are a powerful tool for discovering patterns in a data set by organizing the data into subsets of objects that share common features. Motivated by the independent use of several partition criteria and by theoretical and empirical analysis of some of their properties, in this paper we introduce an incremental nested partition method that combines these criteria to find the inner structure of static and dynamic datasets. To this end, we prove that nesting relationships hold between the partitions obtained from these criteria, and we rigorously study the sensitivity of the partitions when a new object arrives in the dataset. Our algorithm exploits these mathematical properties to obtain the hierarchy of clusterings. Moreover, we carry out a theoretical and experimental comparison of our method with classical hierarchical clustering methods such as single-link and complete-link, and with other more recently introduced methods. Experimental results on databases from the UCI repository and on the AFP and TDT2 news collections show the usefulness and capability of our method to reveal different levels of information hidden in datasets.

2.
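MapReduce, a distributed programming model, is widely used to process large-scale, high-dimensional datasets. It partitions data with a plain hash function, so data skew often arises when the data are unevenly distributed. MapReduce-based clustering algorithms require many iterations, and the distribution of the input to Reduce at each stage is unknown, so existing remedies for data skew do not apply. To address the unbalanced partitioning, a strategy is proposed that changes the indices of the remaining partitions when data skew is present. While the Map tasks run, the method counts the amount of data to be assigned to each reducer; the JobTracker monitors the global partition information and dynamically modifies the original partition function according to a data-skew model. In the subsequent partitioning, the Partitioner re-indexes partitions that are about to cause skew to other, more lightly loaded reducers, so that the load on the nodes becomes balanced. On Zipf-distributed and real datasets, the proposed algorithm is compared with existing methods for handling data skew. The results show that the proposed strategy resolves data skew in MapReduce clustering and outperforms Hash and sampling-based dynamic partitioning in stability and execution time.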
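
The re-indexing idea in this abstract can be sketched in a few lines of Python (not Hadoop's Java API). This is only an illustration under stated assumptions: the function names `hash_partition` and `rebalance`, the `threshold` parameter, and the skew test "load exceeds threshold times the mean load" are all hypothetical, since the abstract does not specify the skew model.

```python
import zlib


def hash_partition(key, num_reducers):
    # Deterministic stand-in for a default hash partition function.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def rebalance(key_counts, num_reducers, threshold=1.2, partition_fn=hash_partition):
    """Re-index keys from overloaded reducers to the least loaded reducer.

    key_counts maps each key to the number of records counted for it during
    the Map phase. The skew test (load > threshold * mean load) is an
    assumption made for this sketch, not the paper's skew model.
    """
    assignment = {k: partition_fn(k, num_reducers) for k in key_counts}
    load = [0] * num_reducers
    for key, reducer in assignment.items():
        load[reducer] += key_counts[key]
    limit = threshold * sum(load) / num_reducers

    # Visit the heaviest keys first; they contribute most to the skew.
    for key in sorted(key_counts, key=key_counts.get, reverse=True):
        src = assignment[key]
        if load[src] <= limit:
            continue  # this reducer is not considered skewed
        dst = min(range(num_reducers), key=load.__getitem__)
        # Move the key only if the destination ends up lighter than the source was.
        if load[dst] + key_counts[key] < load[src]:
            load[src] -= key_counts[key]
            load[dst] += key_counts[key]
            assignment[key] = dst
    return assignment, load


if __name__ == "__main__":
    counts = {"a": 50, "b": 30, "c": 20, "d": 10}
    # A deliberately bad partition function that sends every key to reducer 0.
    print(rebalance(counts, 2, partition_fn=lambda key, n: 0))
```

Every key keeps a single target reducer, so all records with the same key still meet at one reducer; only the key-to-reducer index changes.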

3.
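A fast fuzzy C-means clustering method for color image segmentation   Total citations: 33 (self-citations: 3; citations by others: 33)
Fuzzy C-means (FCM) clustering is simple, intuitive, and easy to implement for color image segmentation, but its clustering performance depends on the initialization of the centers and its computational cost is high. A fast fuzzy clustering method (FFCM) is therefore proposed. The method uses hierarchical subtractive clustering to divide the image data into a number of subsets of similar color. The subset centers are used to initialize the cluster centers, and fuzzy clustering is then performed on the subset centers and their distribution densities. Because the number of samples to cluster is greatly reduced and hierarchical subtractive clustering is cheap, the fuzzy C-means algorithm runs much faster, which in turn allows a cluster validity index to be used to determine the number of clusters quickly. Experiments show that the method does not require the number of clusters to be fixed in advance and, with the optimized clustering performance unchanged, markedly speeds up fuzzy clustering, achieving fast segmentation of color images.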
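
As a rough illustration of clustering subset centers instead of individual pixels, the sketch below runs the standard fuzzy C-means updates on a small set of representative points, each carrying a weight (for example, the number of pixels in its subset). The weighted center update and the names `weighted_fcm`, `weights`, and `eps` are assumptions made for this example; the abstract does not give the exact formulas used by FFCM.

```python
import math


def weighted_fcm(points, weights, centers, m=2.0, iters=50, eps=1e-12):
    """Fuzzy C-means on weighted representative points.

    points  : list of coordinate tuples (e.g. subset centers in color space)
    weights : one positive weight per point (e.g. subset size); an assumption
    centers : initial cluster centers, one tuple per cluster
    Returns the final centers and the last membership matrix u[j][i].
    """
    centers = [tuple(c) for c in centers]
    dim = len(points[0])
    power = 2.0 / (m - 1.0)
    u = []
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2 / (m - 1)).
        u = []
        for x in points:
            d = [max(math.dist(x, c), eps) for c in centers]
            u.append([1.0 / sum((di / dk) ** power for dk in d) for di in d])
        # Center update: mean of the points weighted by w_j * u_ij^m.
        new_centers = []
        for i in range(len(centers)):
            coef = [w * row[i] ** m for w, row in zip(weights, u)]
            total = sum(coef)
            new_centers.append(tuple(
                sum(cf * x[a] for cf, x in zip(coef, points)) / total
                for a in range(dim)
            ))
        centers = new_centers
    return centers, u


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
    print(weighted_fcm(pts, [1, 1, 1, 1], [(0.0,), (11.0,)])[0])
```

With a few hundred representatives instead of every pixel, each iteration touches far fewer samples, which is the source of the speed-up the abstract describes.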

4.
In this study, we introduce a novel clustering architecture in which several subsets of patterns can be processed together with the objective of finding a common structure. The structure revealed at the global level is determined by exchanging prototypes of the subsets of data and by moving prototypes of the corresponding clusters toward each other. The required communication links are thereby established at the level of cluster prototypes and partition matrices, without compromising security requirements. A detailed clustering algorithm is developed by integrating the advantages of both fuzzy sets and rough sets, and a quantitative measure for analyzing the experimental results is provided for synthetic and real-world data.

5.
Rough-fuzzy collaborative clustering.   Total citations: 3 (self-citations: 0; citations by others: 3)
In this study, we introduce a novel clustering architecture in which several subsets of patterns can be processed together with the objective of finding a common structure. The structure revealed at the global level is determined by exchanging prototypes of the subsets of data and by moving prototypes of the corresponding clusters toward each other. The required communication links are thereby established at the level of cluster prototypes and partition matrices, without compromising security requirements. A detailed clustering algorithm is developed by integrating the advantages of both fuzzy sets and rough sets, and a quantitative measure for analyzing the experimental results is provided for synthetic and real-world data.

6.
Discovering interesting patterns or substructures in data streams is an important challenge in data mining. Clustering algorithms are very often applied to identify single substructures although they are designed to partition a data set. Another problem of clustering algorithms is that most of them are not designed for data streams. This paper discusses a recently introduced procedure that deals with both problems. The procedure explores ideas from cluster analysis, but was designed to identify single clusters without the necessity to partition the whole data set into clusters. The new extended version of the algorithm is an incremental clustering approach applicable to stream data. It identifies new clusters formed by the incoming data and updates the data space partition. Clustering of artificial and real data sets illustrates the abilities of the proposed method.

7.
In this paper, we introduce a new algorithm for clustering and aggregating relational data (CARD). We assume that data is available in a relational form, where we only have information about the degrees to which pairs of objects in the data set are related. Moreover, we assume that the relational information is represented by multiple dissimilarity matrices. These matrices could have been generated using different sensors, features, or mappings. CARD is designed to aggregate pairwise distances from multiple relational matrices, partition the data into clusters, and learn a relevance weight for each matrix in each cluster simultaneously. The cluster dependent relevance weights offer two advantages. First, they guide the clustering process to partition the data set into more meaningful clusters. Second, they can be used in subsequent steps of a learning system to improve its learning behavior. The performance of the proposed algorithm is illustrated by using it to categorize a collection of 500 color images. We represent the pairwise image dissimilarities by six different relational matrices that encode color, texture, and structure information.

8.
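Building on the principles of traditional algorithms for determining the number of clusters in a dataset, a new algorithm, MHC, is proposed. It uses a bottom-up strategy to generate partitions of the dataset at different levels, computes the clustering quality of the partition at each level, and selects the best number of clusters according to that quality. A new validity index, the BIP index, is also designed to measure the clustering quality of different partitions; it relies mainly on the geometric structure of the dataset. Experimental results show that the algorithm accurately determines the best number of clusters in multidimensional datasets.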

9.
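Solving the data clustering optimization problem with a partheno-genetic algorithm based on an immune mechanism   Total citations: 4 (self-citations: 0; citations by others: 4)
Data clustering is an important topic in data mining. The data clustering problem can be transformed into a graph partitioning optimization problem. Based on the characteristics of this problem, the paper constructs a partheno-genetic algorithm based on an immune mechanism. The algorithm not only converges faster but also avoids being trapped in local minima, so it converges quickly to the global optimum. Simulation results show that the algorithm is effective.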

10.
To incorporate domain knowledge or application-dependent parameters into data mining systems, constraint-based mining has recently attracted considerable research attention. In this paper, the attributes employed to model the constraints are called constraint attributes, and the attributes involved in the objective function to be optimized are called optimization attributes. The constrained clustering considered in this paper optimizes the objective function of the optimization attributes subject to the condition that the imposed constraint is satisfied. Specifically, we address the problem of constrained clustering with numerical constraints, in which the constraint attribute values of any two data items in the same cluster are required to be within the corresponding constraint range. This numerical constrained clustering problem, however, cannot be handled by conventional clustering algorithms. Consequently, we devise several effective and efficient algorithms to solve it. Due to the intrinsic nature of numerical constrained clustering, the clustering depends on the order in which merges are performed, which in many cases degrades the clustering results. In view of this, we devise a progressive constraint relaxation technique to remedy this drawback and improve the overall quality of the clustering results. Specifically, by using a smaller (tighter) constraint range in earlier merge iterations, we leave more room to relax the constraint and seek better solutions in subsequent iterations. It is empirically shown that the progressive constraint relaxation technique improves not only the execution efficiency but also the clustering quality.
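
The interplay between the constraint range and progressive relaxation can be sketched as a small agglomerative procedure. This is a toy illustration, not the authors' algorithm: the function name `constrained_agglomerative`, the centroid-distance merge criterion on a single optimization attribute, and the linear relaxation schedule are all assumptions, because the abstract specifies none of them.

```python
def constrained_agglomerative(items, k, final_range, steps=3):
    """Merge clusters bottom-up under a numerical constraint range.

    items       : list of (optimization_value, constraint_value) pairs
    k           : stop once this many clusters remain
    final_range : largest allowed spread of constraint values in a cluster
    steps       : number of relaxation stages (assumed linear schedule)
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(items))]

    def centroid(cluster):
        return sum(items[i][0] for i in cluster) / len(cluster)

    def spread(c1, c2):
        values = [items[i][1] for i in c1 + c2]
        return max(values) - min(values)

    for step in range(1, steps + 1):
        # Start with a tight range and relax it toward final_range.
        allowed = final_range * step / steps
        while len(clusters) > k:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    if spread(clusters[a], clusters[b]) > allowed:
                        continue  # merging would violate the constraint range
                    d = abs(centroid(clusters[a]) - centroid(clusters[b]))
                    if best is None or d < best[0]:
                        best = (d, a, b)
            if best is None:
                break  # no feasible merge at this range; relax and retry
            _, a, b = best
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
    return clusters


if __name__ == "__main__":
    data = [(0.0, 1.0), (0.1, 1.5), (0.2, 9.0), (5.0, 9.5)]
    print(constrained_agglomerative(data, k=2, final_range=1.0, steps=2))
```

In the sample data, the third item is close to the first two in the optimization attribute but far from them in the constraint attribute, so the constraint keeps it out of their cluster.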

11.
The unprecedentedly large size and high dimensionality of existing geographic datasets make the complex patterns that potentially lurk in the data hard to find. Clustering is one of the most important techniques for geographic knowledge discovery. However, existing clustering methods have two severe drawbacks for this purpose. First, spatial clustering methods focus on the specific characteristics of distributions in 2- or 3-D space, while general-purpose high-dimensional clustering methods have limited power in recognizing spatial patterns that involve neighbors. Second, clustering methods in general are not geared toward allowing the human-computer interaction needed to effectively tease out complex patterns. In the current paper, an approach is proposed to open up the black box of the clustering process for easy understanding, steering, focusing and interpretation, and thus to support an effective exploration of large and high-dimensional geographic data. The proposed approach involves building a hierarchical spatial cluster structure within the high-dimensional feature space, and using this combined space for discovering multi-dimensional (combined spatial and non-spatial) patterns with efficient computational clustering methods and highly interactive visualization techniques. More specifically, this includes the integration of: (1) a hierarchical spatial clustering method to generate a 1-D spatial cluster ordering that preserves the hierarchical cluster structure, and (2) a density- and grid-based technique to effectively support the interactive identification of interesting subspaces and subsequent searching for clusters in each subspace. The proposed approach is implemented in a fully open and interactive manner supported by various visualization techniques.

12.
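Unsupervised image clustering is the process of partitioning an entire image collection into subsets using only the image data, without any prior information. Because the intrinsic dimensionality of images is high, image processing runs into the "curse of dimensionality". In view of the characteristics of unsupervised image clustering, a diffuse-interface unsupervised clustering algorithm for images is proposed. Images are encoded as points in a high-dimensional observation space and then mapped by a projection to a low-dimensional feature space, where a diffuse-interface unsupervised clustering model is built. A dimensionality-reduction operator is introduced into the model, and an iterative algorithm optimizes the energy function of the diffuse-interface model. Based on the optimal diffuse interface, the image collection is clustered into different subsets. Experimental results show that the algorithm outperforms the traditional K-means, DBSCAN, and Spectral Clustering algorithms, clusters images better without supervision, and achieves higher accuracy under the same conditions.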

13.
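A grid-based subspace clustering algorithm for high-dimensional data streams   Total citations: 4 (self-citations: 0; citations by others: 4)
Based on an analysis of grid-based clustering methods, a subspace clustering algorithm that processes high-dimensional data streams online is designed by combining the bottom-up and top-down grid methods. By exploiting the data-compression ability of the bottom-up grid method and the ability of the top-down grid method to handle high-dimensional data, the algorithm quickly identifies clusters lying in different subspaces in a single scan of the data stream. Theoretical analysis and experiments on several datasets show that the algorithm achieves high accuracy and computational efficiency.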

14.
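朱林, 雷景生, 毕忠勤, 杨杰. Journal of Software (软件学报), 2013, 24(11): 2610-2627
Research on clustering high-dimensional data shows that the samples in different clusters are often associated with particular subsets of the data features, so subspace clustering techniques have attracted growing attention. However, existing soft subspace clustering algorithms are all batch algorithms and are not well suited to clustering high-dimensional data streams or large-scale data. To this end, a fuzzy scalable clustering framework is combined with an entropy-weighting soft subspace clustering algorithm to yield an effective entropy-weighting soft subspace clustering algorithm for streaming data, EWSSC (entropy-weighting streaming subspace clustering). The algorithm retains the properties of traditional soft subspace clustering and uses the fuzzy scalable clustering strategy to apply soft subspace clustering to streaming data. Experimental results show that, on high-dimensional data streams, EWSSC produces results approximately consistent with those of batch soft subspace clustering methods.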

15.
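A fast fuzzy C-means clustering approach to color image segmentation   Total citations: 4 (self-citations: 0; citations by others: 4)
When FCM is used for color image segmentation, the number of clusters must be fixed in advance and computation is slow. A fast fuzzy C-means clustering method (FFCM) is therefore proposed. First, a watershed transform based on the gradient map is applied to the original color image, dividing the image data into subsets with consistent color. Fuzzy clustering is then performed using the sizes and centers of these subsets. Because FFCM clusters far fewer samples, the fuzzy C-means algorithm runs much faster, which in turn makes it practical to determine the number of clusters with a cluster validity index. Experiments show that the method does not require the number of clusters to be fixed in advance and, with clustering validity unchanged, markedly speeds up fuzzy clustering, achieving fast segmentation of color images.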

16.
Traditional minimum spanning tree-based clustering algorithms only make use of information about edges contained in the tree to partition a data set. As a result, with limited information about the structure underlying a data set, these algorithms are vulnerable to outliers. To address this issue, this paper presents a simple yet efficient MST-inspired clustering algorithm. It works by finding a local density factor for each data point during the construction of an MST and discarding outliers, i.e., those whose local density factor is larger than a threshold, to increase the separation between clusters. This algorithm is easy to implement, requiring an implementation of iDistance as the only k-nearest neighbor search structure. Experiments performed on both small low-dimensional data sets and large high-dimensional data sets demonstrate the efficacy of our method.

17.
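A fast simulated annealing algorithm and its application to data clustering   Total citations: 12 (self-citations: 3; citations by others: 12)
The paper transforms the data clustering problem into a graph partitioning optimization problem and proposes a fast simulated annealing algorithm. Experimental results show that the fast simulated annealing algorithm has a short annealing time and converges quickly, and that it yields good clustering results when applied to data clustering.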

18.
Information granules form an abstract and efficient characterization of large volumes of numeric data. Fuzzy clustering is a commonly encountered information granulation approach. Reconstruction (degranulation) decodes information granules back into numeric data. In this study, to enhance the quality of reconstruction, we augment the generic data reconstruction approach by introducing a transformation mapping of the originally produced partition matrix and setting up an adjustment mechanism that modifies the location of the prototypes. We engage several population-based search algorithms to optimize the interaction matrices and prototypes. A series of experimental results on both synthetic and publicly available data sets is reported to show the improvement in data reconstruction performance provided by the proposed method.

19.
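With the rapid development of information technology and the arrival of the big-data era, data exhibit complex characteristics such as high dimensionality and nonlinearity. For high-dimensional data it is often difficult to find regions in the full-dimensional space that reflect the distribution pattern, and most traditional clustering algorithms scale well only to low-dimensional data. As a result, the clusterings that traditional algorithms produce on high-dimensional data may not meet current needs. Subspace clustering algorithms search for clusters in subspaces of high-dimensional data: they divide the original feature space into different feature subsets, reduce the influence of irrelevant features, and retain the main features of the original data. Subspace clustering can uncover information that is hard to see in high-dimensional data, and visualization can then show the inner structure of data attributes and dimensions, providing an effective means for visual analysis of high-dimensional data. This paper summarizes recent progress on subspace-clustering-based visual analysis of high-dimensional data, describes three kinds of methods (based on feature selection, on subspace exploration, and on subspace clustering), analyzes their interactive analysis methods and applications, and discusses future trends in visual analysis of high-dimensional data.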

20.
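A cluster ensemble algorithm based on a voting mechanism   Total citations: 1 (self-citations: 0; citations by others: 1)
Using a one-pass clustering algorithm as the base algorithm for partitioning data, the paper discusses the cluster ensemble problem. The one-pass algorithm is applied repeatedly with randomly chosen thresholds and data input orders to obtain different clusterings; these clusterings are mapped to a co-association matrix between patterns, and a voting mechanism on that matrix yields the final partition. The proposed cluster ensemble algorithm is tested on real and synthetic datasets and compared with related clustering algorithms; the experimental results show that it is effective and feasible.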
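
The pipeline in this abstract (repeated one-pass clustering, a co-association matrix, then voting) can be sketched as follows. Several details are assumptions made for illustration: the leader-style one-pass rule with running-mean centers, the threshold range `t_range`, and the final step of linking pairs that are grouped together in more than half of the runs and taking connected components. The paper's exact voting rule is not given in the abstract.

```python
import math
import random


def one_pass_clustering(points, order, threshold):
    """Single scan: join the nearest cluster within `threshold`, else start a new one."""
    centers, sizes, labels = [], [], {}
    for idx in order:
        x = points[idx]
        best, best_d = None, None
        for c, center in enumerate(centers):
            d = math.dist(x, center)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            n = sizes[best]
            # Update the cluster center as a running mean.
            centers[best] = tuple((n * a + b) / (n + 1) for a, b in zip(centers[best], x))
            sizes[best] = n + 1
            labels[idx] = best
        else:
            centers.append(tuple(x))
            sizes.append(1)
            labels[idx] = len(centers) - 1
    return labels


def voting_ensemble(points, runs=20, t_range=(1.0, 3.0), seed=0):
    """Combine many one-pass clusterings by majority vote on co-association counts."""
    rng = random.Random(seed)
    n = len(points)
    votes = [[0] * n for _ in range(n)]  # votes[i][j]: runs that grouped i with j
    for _ in range(runs):
        order = list(range(n))
        rng.shuffle(order)  # random input order
        labels = one_pass_clustering(points, order, rng.uniform(*t_range))
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1

    # Assumed voting rule: link pairs co-clustered in more than half of the
    # runs, then read off connected components with a small union-find.
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if 2 * votes[i][j] > runs:
                parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
    print(voting_ensemble(pts))
```

Because each run sees a different threshold and input order, individual clusterings may disagree; the vote keeps only the groupings that most runs support.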
