Similar Documents
 A total of 20 similar documents were retrieved (search time: 31 ms)
1.
An Adaptive Soft Subspace Clustering Algorithm   Total citations: 6 (self-citations: 0, citations by others: 6)
陈黎飞  郭躬德  姜青山 《软件学报》2010,21(10):2513-2523
Soft subspace clustering is an important technique for high-dimensional data analysis. Existing algorithms usually require the user to set several global key parameters in advance and do not consider optimization of the subspaces themselves. This paper proposes a new optimization objective for soft subspace clustering that minimizes the within-cluster compactness of the subspace clusters while maximizing the projected subspace occupied by each cluster. A new local feature-weighting scheme is derived from this objective, and on that basis an adaptive k-means-type soft subspace clustering algorithm is proposed. During clustering, the algorithm dynamically computes the optimal parameter values from the data set and its current partition. Experimental results on real applications and synthetic data sets show that the algorithm substantially improves clustering accuracy and the stability of the clustering results.
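For reference, the sketch below illustrates the general idea described in this abstract — a k-means-style iteration that also maintains per-cluster feature weights, so that dimensions on which a cluster is compact receive larger weights. It is a minimal entropy-weighted illustration, not the authors' exact objective or derivation; the function name soft_subspace_kmeans and the gamma parameter are illustrative choices.

import numpy as np

def soft_subspace_kmeans(X, k, gamma=1.0, n_iter=50, seed=0):
    # Illustrative sketch, not the paper's algorithm: entropy-weighted,
    # k-means-style soft subspace clustering with per-cluster feature weights.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    weights = np.full((k, d), 1.0 / d)            # per-cluster feature weights

    for _ in range(n_iter):
        # assign each point to the cluster with the smallest weighted distance
        dist = np.array([((X - centers[l]) ** 2 * weights[l]).sum(axis=1) for l in range(k)])
        labels = dist.argmin(axis=0)
        for l in range(k):
            members = X[labels == l]
            if len(members) == 0:
                continue
            centers[l] = members.mean(axis=0)
            D = ((members - centers[l]) ** 2).sum(axis=0)   # per-dimension dispersion
            w = np.exp(-(D - D.min()) / gamma)              # compact dimensions get larger weights
            weights[l] = w / w.sum()
    return labels, centers, weights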

2.
Since 1998, a graphical representation used in visual clustering called the reordered dissimilarity image or cluster heat map has appeared in more than 4000 biological or biomedical publications. These images are typically used to visually estimate the number of clusters in a data set, which is the most important input to most clustering algorithms, including the popularly chosen fuzzy c-means and crisp k-means. This paper presents a new formulation of a matrix reordering algorithm, coVAT, which is the only known method for providing visual clustering information on all four types of cluster structure in rectangular relational data. Finite rectangular relational data are an m × n array R of relational values between m row objects O_r and n column objects O_c. R presents four clustering problems: clusters in O_r, O_c, O_{r∪c}, and coclusters containing some objects from each of O_r and O_c. coVAT1 is a clustering tendency algorithm that provides visual estimates of the number of clusters to seek in each of these problems by displaying reordered dissimilarity images. We provide several examples where coVAT1 fails to do its job. These examples justify the introduction of coVAT2, a modification of coVAT1 based on a different reordering scheme. We offer several examples to illustrate that coVAT2 may detect coclusters in R when coVAT1 does not. Furthermore, coVAT2 is not limited to just relational data R. The R matrix can also take the form of feature data, such as gene microarray data where each data element is a real number: Positive values indicate upregulation, and negative values indicate downregulation. We show examples of coVAT2 on microarray data that indicate coVAT2 shows cluster tendency in these data. © 2012 Wiley Periodicals, Inc.
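The reordered dissimilarity images that coVAT builds on start from the basic VAT ordering of a square dissimilarity matrix. The sketch below shows only that square-matrix reordering step, assuming a symmetric matrix D; the rectangular coVAT/coVAT2 constructions described in the abstract are not reproduced.

import numpy as np

def vat_reorder(D):
    # Illustrative sketch of the basic (square-matrix) VAT ordering; the
    # rectangular coVAT construction is not reproduced here.
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    i, _ = np.unravel_index(np.argmax(D), D.shape)   # start at an endpoint of the largest dissimilarity
    order = [i]
    remaining = set(range(n)) - {i}
    while remaining:
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]                  # distances from the ordered set to the rest
        _, col = np.unravel_index(np.argmin(sub), sub.shape)
        nxt = rem[col]                               # closest remaining object to the ordered set
        order.append(nxt)
        remaining.remove(nxt)
    order = np.array(order)
    return order, D[np.ix_(order, order)]            # dark diagonal blocks suggest clusters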

3.
The attributes describing a data set may often be arranged in meaningful subsets, each of which corresponds to a different aspect of the data. An unsupervised algorithm (SCAD) that simultaneously performs fuzzy clustering and aspects weighting was proposed in the literature. However, SCAD may fail and halt given certain conditions. To fix this problem, its steps are modified and then reordered to reduce the number of parameters required to be set by the user. In this paper we prove that each step of the resulting algorithm, named ASCAD, globally minimizes its cost-function with respect to the argument being optimized. The asymptotic analysis of ASCAD leads to a time complexity which is the same as that of fuzzy c-means. A hard version of the algorithm and a novel validity criterion that considers aspect weights in order to estimate the number of clusters are also described. The proposed method is assessed over several artificial and real data sets.

4.
An improved ant clustering algorithm based on the symmetric point distance is proposed. Instead of using the Euclidean distance to measure the similarity of objects within a cluster, the algorithm uses a new symmetric point distance, which allows it to effectively identify the number of clusters and a suitable partition for data sets that exhibit symmetry. In the algorithm, artificial ants represent the data objects and search for the most suitable partition according to the given clustering rules. Finally, clustering experiments were carried out on different data sets with both the proposed algorithm and the standard ant clustering algorithm. The experimental results confirm the effectiveness of the algorithm.
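One common formulation of a point-symmetry distance — reflect a point about the cluster center and measure how well the reflection is matched by another data point — is sketched below for reference. The exact distance used in this paper may differ, and the ant-based search itself is not shown; the function name symmetry_distance is illustrative.

import numpy as np

def symmetry_distance(xj, c, data, j):
    # One common point-symmetry distance (illustrative; the paper's exact
    # formulation may differ): reflect xj about center c and check how well
    # the reflection is matched by some other data point.
    xj, c, data = np.asarray(xj, float), np.asarray(c, float), np.asarray(data, float)
    best = np.inf
    for i, xi in enumerate(data):
        if i == j:
            continue
        num = np.linalg.norm((xj - c) + (xi - c))        # distance of xi to the mirror image of xj
        den = np.linalg.norm(xj - c) + np.linalg.norm(xi - c)
        if den > 0:
            best = min(best, num / den)
    return best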

5.
The first stage of knowledge acquisition and reduction of complexity concerning a group of entities is to partition or divide the entities into groups or clusters based on their attributes or characteristics. Clustering algorithms normally require both a method of measuring proximity between patterns and prototypes and a method for aggregating patterns. However, sometimes feature vectors or patterns may not be available for objects and only the proximities between the objects are known. Even if feature vectors are available, some of the features may not be numeric and it may not be possible to find a satisfactory method of aggregating patterns for the purpose of determining prototypes. Clustering of objects, however, can be performed on the basis of data describing the objects in terms of feature vectors or on the basis of relational data. The relational data is in terms of proximities between objects. Clustering of objects on the basis of relational data rather than individual object data is called relational clustering. The premise of this paper is that the proximities between the membership vectors, which are obtained as the objective of clustering, should be proportional to the proximities between the objects. The values of the components of the membership vector corresponding to an object are the membership degrees of the object in the various clusters. The membership vector is just a type of feature vector. Based on this premise, this paper describes another fuzzy relational clustering method for finding a fuzzy membership matrix. The method involves solving a rather challenging optimization problem, since the objective function has many local minima. This makes the use of a global optimization method such as particle swarm optimization (PSO) attractive for determining the membership matrix for the clustering. To minimize computational effort, a Bayesian stopping criterion is used in combination with a multi-start strategy for the PSO. Other relational clustering methods generally find only a local optimum of their objective function.

6.
We introduce an efficient synchronization model that organizes a population of integrate-and-fire oscillators into stable and structured groups. Each oscillator fires synchronously with all the others within its group, but the groups themselves fire with a constant phase difference. The structure of the synchronized groups depends on the choice of the coupling function. We show that by defining the interaction between oscillators according to the relative distance between them, our model can be used as a general clustering algorithm. Unlike existing models, our model incorporates techniques from relational and prototype-based clustering methods and results in a clustering algorithm that is simple, efficient, robust, unbiased by the size of the clusters, and that can find an arbitrary number of clusters. In addition to helping the model self-organize into stable groups, the synergy between clustering and synchronization reduces the computational complexity significantly. The resulting clustering algorithm has several advantages over conventional clustering techniques. In particular, it can generate a nested sequence of partitions and it can determine the optimum number of clusters in an efficient manner. Moreover, since our approach does not involve optimizing an objective function, it is not sensitive to initialization and it can incorporate nonmetric similarity measures. We illustrate the performance of our algorithms with several synthetic and real data sets.

7.
A Semi-Supervised K-Means Clustering Algorithm for Multi-Relational Data   Total citations: 1 (self-citations: 0, citations by others: 1)
高滢  刘大有  齐红  刘赫 《软件学报》2008,19(11):2814-2821
A semi-supervised K-means clustering algorithm for multi-relational data is proposed. Building on the standard K-means algorithm, it extends the way initial clusters are selected and the way object similarity is measured, so that it can be used for semi-supervised learning on multi-relational data. To achieve high performance, the algorithm makes full use of labeled data, object attributes, and the various relation information during clustering. Experimental results on the multi-relational database Movie verify the effectiveness of the algorithm.
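A minimal sketch of the seeded-initialization idea follows, assuming flat feature vectors and that every class contributes at least one labeled seed; the paper's relation-aware similarity measure is not reproduced, and the function name seeded_kmeans is illustrative.

import numpy as np

def seeded_kmeans(X, k, seed_labels, n_iter=100):
    # Illustrative sketch of seeded (semi-supervised) k-means on flat feature
    # vectors; assumes every class 0..k-1 has at least one labelled seed.
    # seed_labels[i] is the class of X[i], or -1 if unlabelled.
    X = np.asarray(X, dtype=float)
    seed_labels = np.asarray(seed_labels)
    centers = np.array([X[seed_labels == c].mean(axis=0) for c in range(k)])
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        labels[seed_labels >= 0] = seed_labels[seed_labels >= 0]   # labelled points keep their class
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                                for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers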

8.
Rapid technological advances imply that the amount of data stored in databases is rising very fast. However, data mining can discover helpful implicit information in large databases. How to detect this implicit and useful information with low time cost, high correctness, and a high noise-filtering rate in large databases is a priority concern in data mining, which explains why numerous clustering schemes have been proposed in recent decades. This investigation presents a new data clustering approach called PHD, which is an enhanced version of KIDBSCAN. PHD is a hybrid density-based algorithm, which partitions the data set by K-means and then clusters the resulting partitions with IDBSCAN. Finally, the closest pairs of clusters are merged until the natural number of clusters of the data set is reached. Experimental results reveal that the proposed algorithm can perform the entire clustering and efficiently reduce the run-time cost. They also indicate that the proposed new clustering algorithm performs better than several existing well-known schemes such as the K-means, DBSCAN, IDBSCAN and KIDBSCAN algorithms. Consequently, the proposed PHD algorithm is efficient and effective for data clustering in large databases.
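A minimal pipeline sketch of the partition-then-density idea follows, using scikit-learn's KMeans and DBSCAN as stand-ins for the paper's IDBSCAN stage; the final merging of closest cluster pairs is only indicated, not implemented, and parameter values are illustrative.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def phd_like(X, n_partitions=8, eps=0.5, min_samples=5):
    # Illustrative pipeline sketch: coarse K-means partitions, then a density-based
    # pass inside each partition (DBSCAN stands in for the paper's IDBSCAN).
    # The final merging of closest cluster pairs is only indicated.
    X = np.asarray(X, dtype=float)
    coarse = KMeans(n_clusters=n_partitions, n_init=10, random_state=0).fit_predict(X)
    local_clusters = []
    for p in range(n_partitions):
        part = X[coarse == p]
        if len(part) < min_samples:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(part)
        for lab in set(labels) - {-1}:           # -1 marks DBSCAN noise points
            local_clusters.append(part[labels == lab])
    return local_clusters                        # merge closest pairs until the natural cluster count is reached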

9.
孙伟鹏 《计算机应用研究》2020,37(1):163-166,171
The FSDP clustering algorithm has a high overall time complexity because computing each data object's local density and minimum distance requires traversing the entire data set. To address this problem, a Spark-based parallel FSDP clustering algorithm, SFSDP, is proposed. First, spatial grid partitioning divides the data set to be clustered into multiple partitions of relatively balanced size; then an improved FSDP clustering algorithm performs cluster analysis on the data within each partition in parallel; finally, the local clusters of the partitions are merged to produce the global clusters. Experimental results show that, compared with FSDP, SFSDP can effectively perform cluster analysis on large-scale data sets and performs well in terms of both accuracy and scalability.
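FSDP-style (density-peaks) clustering rests on two per-object quantities: the local density rho and the distance delta to the nearest denser object. The single-machine sketch below computes only those two quantities with a simple cut-off kernel; the Spark grid partitioning and cluster merging of SFSDP are not reproduced, and the function name is illustrative.

import numpy as np

def density_peaks_quantities(X, dc):
    # Single-machine sketch of the two quantities behind density-peaks (FSDP-style)
    # clustering; the Spark grid partitioning and merging of SFSDP are not reproduced.
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (dist < dc).sum(axis=1) - 1            # local density, cut-off kernel, excluding the point itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = np.where(rho > rho[i])[0]
        delta[i] = dist[i, denser].min() if len(denser) else dist[i].max()
    return rho, delta                            # points with both large rho and large delta are center candidates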

10.
Different extensions of fuzzy c-means (FCM) clustering have been developed to approximate FCM clustering in very large (unloadable) image (eFFCM) and object vector (geFFCM) data. Both extensions share three phases: (1) progressive sampling of the VL data, terminated when a sample passes a statistical goodness of fit test; (2) clustering with (literal or exact) FCM; and (3) noniterative extension of the literal clusters to the remainder of the data set. This article presents a comparable method for the remaining case of interest, namely, clustering in VL relational data. We will propose and discuss each of the four phases of eNERF and our algorithm for this last case: (1) finding distinguished features that monitor progressive sampling, (2) progressively sampling a square N × N relation matrix R_N until an n × n sample relation R_n passes a statistical test, (3) clustering R_n with literal non-Euclidean relational fuzzy c-means, and (4) extending the clusters in R_n to the remainder of the relational data. The extension phase in this third case is not as straightforward as it was in the image and object data cases, but our numerical examples suggest that eNERF has the same approximation qualities that eFFCM and geFFCM do. © 2006 Wiley Periodicals, Inc. Int J Int Syst 21: 817–841, 2006.

11.
This article presents a multi-objective genetic algorithm which considers the problem of data clustering. A given dataset is automatically assigned into a number of groups in appropriate fuzzy partitions through the fuzzy c-means method. This work has tried to exploit the advantage of fuzzy properties which provide the capability to handle overlapping clusters. However, most fuzzy methods are based on compactness and/or separation measures which use only centroid information. The calculation from centroid information only may not be sufficient to differentiate the geometric structures of clusters. The overlap-separation measure using an aggregation operation of fuzzy membership degrees is better equipped to handle this drawback. For another key consideration, we need a mechanism to identify appropriate fuzzy clusters without prior knowledge on the number of clusters. From this requirement, an optimization with a single criterion may not be feasible for different cluster shapes. A multi-objective genetic algorithm is therefore appropriate to search for fuzzy partitions in this situation. Apart from the overlap-separation measure, the well-known fuzzy J_m index is also optimized through genetic operations. The algorithm simultaneously optimizes the two criteria to search for optimal clustering solutions. A string of real-coded values is encoded to represent cluster centers. A number of strings with different lengths varied over a range correspond to variable numbers of clusters. These real-coded values are optimized and the Pareto solutions corresponding to a tradeoff between the two objectives are finally produced. As shown in the experiments, the approach provides promising solutions in well-separated, hyperspherical and overlapping clusters from synthetic and real-life data sets. This is demonstrated by the comparison with existing single-objective and multi-objective clustering techniques.

12.
In this paper, we introduce a new algorithm for clustering and aggregating relational data (CARD). We assume that data is available in a relational form, where we only have information about the degrees to which pairs of objects in the data set are related. Moreover, we assume that the relational information is represented by multiple dissimilarity matrices. These matrices could have been generated using different sensors, features, or mappings. CARD is designed to aggregate pairwise distances from multiple relational matrices, partition the data into clusters, and learn a relevance weight for each matrix in each cluster simultaneously. The cluster dependent relevance weights offer two advantages. First, they guide the clustering process to partition the data set into more meaningful clusters. Second, they can be used in subsequent steps of a learning system to improve its learning behavior. The performance of the proposed algorithm is illustrated by using it to categorize a collection of 500 color images. We represent the pairwise image dissimilarities by six different relational matrices that encode color, texture, and structure information.

13.
In several application domains, high-dimensional observations are collected and then analysed in search for naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets for which PSC often provides state-of-the-art results.
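A minimal k-subspaces-style sketch of the cluster-wise PCA idea follows: assign each point to the cluster whose rank-q PCA model reconstructs it best, then refit each cluster's PCA. It assumes q is smaller than the data dimensionality; PSC's influence measure, penalisation, and convergence machinery are not reproduced, and the function name is illustrative.

import numpy as np

def cluster_wise_pca(X, k, q=2, n_iter=30, seed=0):
    # Illustrative k-subspaces-style sketch (not PSC itself): assign each point to the
    # cluster whose rank-q PCA model reconstructs it best, then refit each cluster's PCA.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        errors = np.full((len(X), k), np.inf)
        for c in range(k):
            members = X[labels == c]
            if len(members) <= q:
                continue
            mu = members.mean(axis=0)
            _, _, Vt = np.linalg.svd(members - mu, full_matrices=False)
            V = Vt[:q].T                               # top-q principal directions of cluster c
            resid = (X - mu) - (X - mu) @ V @ V.T      # reconstruction residual of every point under this model
            errors[:, c] = (resid ** 2).sum(axis=1)
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels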

14.
Abstract

Two fuzzy versions of the k-means optimal, least squared error partitioning problem are formulated for finite subsets X of a general inner product space. In both cases, the extremizing solutions are shown to be fixed points of a certain operator T on the class of fuzzy, k-partitions of X, and simple iteration of T provides an algorithm which has the descent property relative to the least squared error criterion function. In the first case, the range of T consists largely of ordinary (i.e. non-fuzzy) partitions of X and the associated iteration scheme is essentially the well known ISODATA process of Ball and Hall. However, in the second case, the range of T consists mainly of fuzzy partitions and the associated algorithm is new; when X consists of k compact well separated (CWS) clusters, X_i, this algorithm generates a limiting partition with membership functions which closely approximate the characteristic functions of the clusters X_i. However, when X is not the union of k CWS clusters, the limiting partition is truly fuzzy in the sense that the values of its component membership functions differ substantially from 0 or 1 over certain regions of X. Thus, unlike ISODATA, the “fuzzy” algorithm signals the presence or absence of CWS clusters in X. Furthermore, the fuzzy algorithm seems significantly less prone to the “cluster-splitting” tendency of ISODATA and may also be less easily diverted to uninteresting locally optimal partitions. Finally, for data sets X consisting of dense CWS clusters embedded in a diffuse background of strays, the structure of X is accurately reflected in the limiting partition generated by the fuzzy algorithm. Mathematical arguments and numerical results are offered in support of the foregoing assertions.
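The fuzzy iteration described in this abstract is what later became known as fuzzy c-means. A minimal sketch of the standard alternating updates (memberships and prototypes, fuzzifier m > 1, Euclidean distances) follows; parameter names are conventional rather than taken from the paper, and the ISODATA comparison is not reflected in it.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    # Sketch of the standard fuzzy c-means alternating updates (fuzzifier m > 1,
    # Euclidean distances); parameter names are conventional, not from the paper.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                     # fuzzy memberships, rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # membership-weighted prototypes
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        ratio = dist[:, :, None] / dist[:, None, :]       # d_ik / d_jk for every pair of clusters
        U_new = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers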

15.
Comparing, clustering and merging ellipsoids are problems that arise in various applications, e.g., anomaly detection in wireless sensor networks and motif-based patterned fabrics. We develop a theory underlying three measures of similarity that can be used to find groups of similar ellipsoids in p-space. Clusters of ellipsoids are suggested by dark blocks along the diagonal of a reordered dissimilarity image (RDI). The RDI is built with the recursive iVAT algorithm using any of the three (dis)similarity measures as input and performs two functions: (i) it is used to visually assess and estimate the number of possible clusters in the data; and (ii) it offers a means for comparing the three similarity measures. Finally, we apply the single linkage and CLODD clustering algorithms to three two-dimensional data sets using each of the three dissimilarity matrices as input. Two data sets are synthetic, and the third is a set of real WSN data that has one known second order node anomaly. We conclude that focal distance is the best measure of elliptical similarity, iVAT images are a reliable basis for estimating cluster structures in sets of ellipsoids, and single linkage can successfully extract the indicated clusters.

16.
For the problem of image threshold segmentation based on interval-valued fuzzy sets, an interval-valued fuzzy set thresholding algorithm based on center perturbation is proposed. By perturbing the object and background centers, the method accounts for the influence of uncertain and imprecise information on the class centers of the image, and a restricted equivalence function is used to build the interval-valued fuzzy set model of the image. A discrimination measure on interval-valued fuzzy sets is then proposed, and an objective function built on it is used to search for the optimal segmentation threshold. Simulation experiments on three types of image data show that the proposed method yields generally good results both visually and in terms of quantitative indices, demonstrating the effectiveness of the algorithm.

17.
In this paper a new multiobjective (MO) clustering technique (GenClustMOO) is proposed which can automatically partition the data into an appropriate number of clusters. Each cluster is divided into several small hyperspherical subclusters and the centers of all these small sub-clusters are encoded in a string to represent the whole clustering. For assigning points to different clusters, these local sub-clusters are considered individually. For the purpose of objective function evaluation, these sub-clusters are merged appropriately to form a variable number of global clusters. Three objective functions, one reflecting the total compactness of the partitioning based on the Euclidean distance, the other reflecting the total symmetry of the clusters, and the last reflecting the cluster connectedness, are considered here. These are optimized simultaneously using AMOSA, a newly developed simulated annealing based multiobjective optimization method, in order to detect the appropriate number of clusters as well as the appropriate partitioning. The symmetry present in a partitioning is measured using a newly developed point symmetry based distance. Connectedness present in a partitioning is measured using the relative neighborhood graph concept. Since AMOSA, as well as any other MO optimization technique, provides a set of Pareto-optimal solutions, a new method is also developed to determine a single solution from this set. Thus the proposed GenClustMOO is able to detect the appropriate number of clusters and the appropriate partitioning from data sets having either well-separated clusters of any shape or symmetrical clusters with or without overlaps. The effectiveness of the proposed GenClustMOO in comparison with another recent multiobjective clustering technique (MOCK), a single objective genetic algorithm based automatic clustering technique (VGAPS-clustering), K-means and single linkage clustering techniques is comprehensively demonstrated for nineteen artificial and seven real-life data sets of varying complexities. In a part of the experiment the effectiveness of AMOSA as the underlying optimization technique in GenClustMOO is also demonstrated in comparison to another evolutionary MO algorithm, PESA2.

18.
Data warehouses provide an efficient information management platform for decision support over massive data, and ROLAP, exploiting the manipulation flexibility and technical maturity of relational data warehouses, provides effective access, modeling, and manipulation methods for warehouse-oriented analysis and decision making. However, traditional relational access methods pose a severe challenge to the I/O effectiveness of ROLAP. This paper first analyzes the characteristics of DSS applications, describes the temporal behavior of relational operators accessing base-table attributes, and defines the temporal local access of operators to attributes. By parsing a sample set of queries, a temporal access mapping matrix between operators and attributes is built, and with effective gain as the attribute clustering criterion a temporal access model, PD, is obtained. Finally, a rough set algorithm for solving this model is given, together with a vertical partitioning scheme for base-table attributes designed from the clustering results. Experiments show that, in decision-support applications, this method is more efficient than other comparable partition-optimization methods.

19.
This paper is concerned with the computational efficiency of fuzzy clustering algorithms when the data set to be clustered is described by a proximity matrix only (relational data) and the number of clusters must be automatically estimated from such data. A fuzzy variant of an evolutionary algorithm for relational clustering is derived and compared against two systematic (pseudo-exhaustive) approaches that can also be used to automatically estimate the number of fuzzy clusters in relational data. An extensive collection of experiments involving 18 artificial and two real data sets is reported and analyzed.

20.
Mao YiMin, Gan DeJin, Mwakapesa D. S., Nanehkaran Y. A., Tao Tao, Huang XueYu 《The Journal of Supercomputing》2022, 78(4): 5181-5202

Partitioning-based k-means is one of the most important clustering algorithms. In a big data environment, however, it suffers from the random selection of initial cluster centers, expensive communication overhead among MapReduce nodes, and data skew across data partitions, among other problems. To solve these problems, this paper proposes a parallel clustering algorithm based on grid density and a locality-sensitive hash function (MR-PGDLSH), which combines the advantages of MapReduce and LSH (locality-sensitive hashing). In MR-PGDLSH, a grid density strategy (GDS) is first designed to obtain relatively reasonable initial cluster centers. Then, DP-LSH (data partitioning based on a locality-sensitive hash function) is proposed to divide the data set into multiple segments, so that related data objects are mapped to the same sub-data set, and a similarity function is designed to generate clusters, thereby reducing frequent communication overhead between nodes. Next, an adaptive grouping strategy (AGS) is applied to distribute the data evenly across the nodes, which solves the problem of data skew. Finally, MR-PGDLSH mines the cluster centers in parallel to obtain the final clustering results. Both theoretical analysis and experimental results show that MR-PGDLSH is superior to existing clustering algorithms.
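A minimal sketch of the locality-sensitive-hashing side of this idea follows — random-hyperplane hashing, so that nearby objects tend to land in the same bucket, which can then be clustered locally. The paper's GDS, DP-LSH similarity function, and AGS load balancing are not reproduced, and the function name lsh_partition is illustrative.

import numpy as np

def lsh_partition(X, n_bits=4, seed=0):
    # Illustrative random-hyperplane LSH bucketing (the general idea behind DP-LSH:
    # nearby objects tend to hash to the same bucket, which can then be clustered
    # locally). The paper's GDS, similarity function and AGS are not reproduced.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    planes = rng.normal(size=(X.shape[1], n_bits))        # random hyperplanes through the origin
    bits = (X @ planes > 0).astype(int)                   # one sign bit per hyperplane
    keys = (bits * (2 ** np.arange(n_bits))).sum(axis=1)  # integer bucket id per point
    return {key: np.where(keys == key)[0] for key in np.unique(keys)}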

