Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
In this paper, we introduce a new algorithm for clustering and aggregating relational data (CARD). We assume that data is available in a relational form, where we only have information about the degrees to which pairs of objects in the data set are related. Moreover, we assume that the relational information is represented by multiple dissimilarity matrices. These matrices could have been generated using different sensors, features, or mappings. CARD is designed to aggregate pairwise distances from multiple relational matrices, partition the data into clusters, and learn a relevance weight for each matrix in each cluster simultaneously. The cluster dependent relevance weights offer two advantages. First, they guide the clustering process to partition the data set into more meaningful clusters. Second, they can be used in subsequent steps of a learning system to improve its learning behavior. The performance of the proposed algorithm is illustrated by using it to categorize a collection of 500 color images. We represent the pairwise image dissimilarities by six different relational matrices that encode color, texture, and structure information.
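The aggregation step in entry 1 can be pictured with a minimal sketch. The abstract does not give CARD's objective function or its weight-update rule, so the function below only shows how several dissimilarity matrices might be combined once one cluster's relevance weights are known. The function name, the fixed weights, and the toy matrices are illustrative assumptions, not part of the paper.

```python
def aggregate_dissimilarities(matrices, weights):
    """Weighted sum of equally sized square dissimilarity matrices.

    Hypothetical helper: the weights stand in for the relevance weights
    of a single cluster and are supplied by the caller, not learned here.
    """
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per matrix")
    n = len(matrices[0])
    aggregated = [[0.0] * n for _ in range(n)]
    for matrix, weight in zip(matrices, weights):
        for i in range(n):
            for j in range(n):
                aggregated[i][j] += weight * matrix[i][j]
    return aggregated


# Toy data (assumed): two objects described by two relational matrices.
color = [[0.0, 1.0], [1.0, 0.0]]
texture = [[0.0, 3.0], [3.0, 0.0]]
combined = aggregate_dissimilarities([color, texture], [0.75, 0.25])
```

With these toy weights the colour matrix dominates the combined dissimilarity; a cluster with different weights would see the same pair of objects at a different distance.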

2.
3.
4.
Unsupervised clustering of datasets that contain severe outliers is a difficult task. In this paper, we propose a cluster-dependent multi-metric clustering approach that is robust to severe outliers. A dataset is modeled as a set of clusters, each contaminated by noise of a cluster-dependent, unknown level that accounts for the outliers of that cluster. With such a model, a multi-metric Lp-norm transformation is proposed and learnt, which maps each cluster to the most Gaussian distribution by minimizing a non-Gaussianity measure. The approach is composed of two consecutive phases: multi-metric location estimation (MMLE) and multi-metric iterative chi-square cutoff (ICSC). Algorithms for MMLE and ICSC are proposed. It is proved that the MMLE algorithm searches for the solution of a multi-objective optimization problem and in fact learns a cluster-dependent multi-metric Lq-norm distance and/or a cluster-dependent multi-kernel defined in data space for each cluster. Experiments on heavy-tailed alpha-stable mixture datasets, on Gaussian mixture datasets with radial and diffuse outliers added respectively, and on the real Wisconsin breast cancer and lung cancer datasets show that the proposed method is superior to many existing robust clustering and outlier detection methods in both clustering and outlier detection performance.

5.
Categorical data clustering is a difficult and challenging task because categorical attributes have no natural order. This study therefore proposes a two-stage method for categorical data, named the partition-and-merge based fuzzy genetic clustering algorithm (PM-FGCA). In the first stage, PM-FGCA uses a fuzzy genetic clustering algorithm to partition the dataset into a maximum number of clusters. The merge stage then selects two of the clusters generated in the first stage, based on their inter-cluster distances, and merges them into one cluster. This procedure is repeated until the number of clusters equals the predetermined number of clusters. Thereafter, some particular instances in each cluster are considered for re-assignment to other clusters based on the intra-cluster distances. The proposed PM-FGCA is applied to ten categorical datasets from the UCI machine learning repository. To evaluate the clustering performance, PM-FGCA is compared with existing methods such as the k-modes algorithm, the fuzzy k-modes algorithm, the genetic fuzzy k-modes algorithm, and the non-dominated sorting genetic algorithm using fuzzy membership chromosomes. The Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and the Davies–Bouldin (DB) index are selected as clustering validation indices, covering both external indices (ARI and NMI) and an internal index (DB). The experimental results show that the proposed PM-FGCA outperforms the benchmark methods in terms of the tested indices.
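For readers unfamiliar with the ARI used as an external validation index in entry 5, the following is a minimal pure-Python sketch of the standard pair-counting formula. It is not taken from the paper; the function name and the handling of degenerate inputs (returning 1.0 when both labelings are trivially identical in structure) are assumptions made here for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two flat labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # assumed convention for fewer than two objects
    # Pairs that are grouped together in both, in the first, in the second.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # assumed convention: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Same grouping under different label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that split every true pair: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to how clusters are named, which is why it suits comparisons between a clustering result and ground-truth classes.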

6.
Clustering aims to partition a data set into homogeneous groups which gather similar objects. Object similarity, or more often object dissimilarity, is usually expressed in terms of some distance function. This approach, however, is not viable when dissimilarity is conceptual rather than metric. In this paper, we propose to extract the dissimilarity relation directly from the available data. To this aim, we train a feedforward neural network with some pairs of points with known dissimilarity. Then, we use the dissimilarity measure generated by the network to guide a new unsupervised fuzzy relational clustering algorithm. An artificial data set and a real data set are used to show how the clustering algorithm based on the neural dissimilarity outperforms some widely used (possibly partially supervised) clustering algorithms based on spatial dissimilarity.

7.
Clustering is an underspecified task: there are no universal criteria for what makes a good clustering. This is especially true for relational data, where similarity can be based on the features of individuals, the relationships between them, or a mix of both. Existing methods for relational clustering have strong and often implicit biases in this respect. In this paper, we introduce a novel dissimilarity measure for relational data. It is the first approach to incorporate a wide variety of types of similarity, including similarity of attributes, similarity of relational context, and proximity in a hypergraph. We experimentally evaluate the proposed dissimilarity measure on both clustering and classification tasks using data sets of very different types. Considering the quality of the obtained clustering, the experiments demonstrate that (a) using this dissimilarity in standard clustering methods consistently gives good results, whereas other measures work well only on data sets that match their bias; and (b) on most data sets, the novel dissimilarity outperforms even the best among the existing ones. On the classification tasks, the proposed method outperforms the competitors on the majority of data sets, often by a large margin. Moreover, we show that learning the appropriate bias in an unsupervised way is a very challenging task, and that the existing methods offer a marginal gain compared to the proposed similarity method, and can even hurt performance. Finally, we show that the asymptotic complexity of the proposed dissimilarity measure is similar to the existing state-of-the-art approaches. The results confirm that the proposed dissimilarity measure is indeed versatile enough to capture relevant information, regardless of whether that comes from the attributes of vertices, their proximity, or connectedness of vertices, even without parameter tuning.

8.
The first stage of knowledge acquisition and reduction of complexity concerning a group of entities is to partition the entities into groups or clusters based on their attributes or characteristics. Clustering algorithms normally require both a method of measuring proximity between patterns and prototypes and a method for aggregating patterns. However, sometimes feature vectors or patterns are not available for objects and only the proximities between the objects are known. Even if feature vectors are available, some of the features may not be numeric, and it may not be possible to find a satisfactory method of aggregating patterns for the purpose of determining prototypes. Clustering of objects can, however, be performed either on the basis of data describing the objects in terms of feature vectors or on the basis of relational data, that is, proximities between objects. Clustering of objects on the basis of relational data rather than individual object data is called relational clustering. The premise of this paper is that the proximities between the membership vectors, which are obtained as the objective of clustering, should be proportional to the proximities between the objects. The components of the membership vector corresponding to an object are the membership degrees of the object in the various clusters. The membership vector is just a type of feature vector. Based on this premise, this paper describes another fuzzy relational clustering method for finding a fuzzy membership matrix. The method involves solving a rather challenging optimization problem, since the objective function has many local minima. This makes the use of a global optimization method such as particle swarm optimization (PSO) attractive for determining the membership matrix for the clustering. To minimize computational effort, a Bayesian stopping criterion is used in combination with a multi-start strategy for the PSO. Other relational clustering methods generally find only a local optimum of their objective function.

9.
EVCLUS: evidential clustering of proximity data   (total citations: 1; self-citations: 0; citations by others: 1)
A new relational clustering method is introduced, based on the Dempster-Shafer theory of belief functions (or evidence theory). Given a matrix of dissimilarities between n objects, this method, referred to as evidential clustering (EVCLUS), assigns a basic belief assignment (or mass function) to each object in such a way that the degree of conflict between the masses given to any two objects reflects their dissimilarity. A notion of credal partition is introduced, which subsumes those of hard, fuzzy, and possibilistic partitions and allows deeper insight into the structure of the data. Experiments with several sets of real data demonstrate the good performance of the proposed method compared with several state-of-the-art relational clustering techniques.
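The "degree of conflict" between two mass functions in entry 9 can be illustrated with the usual Dempster-Shafer definition: the total mass assigned to pairs of focal sets with an empty intersection. The sketch below assumes that definition and represents each mass function as a dictionary from frozensets of cluster labels to masses; the representation and the function name are choices made here, and the abstract does not describe how EVCLUS fits the masses to the dissimilarities.

```python
def degree_of_conflict(mass_a, mass_b):
    """Sum of m_a(A) * m_b(B) over focal sets A, B with empty intersection.

    Each mass function is a dict {frozenset_of_clusters: mass} whose
    masses are assumed to sum to 1 (not checked here).
    """
    return sum(
        weight_a * weight_b
        for set_a, weight_a in mass_a.items()
        for set_b, weight_b in mass_b.items()
        if not (set_a & set_b)
    )


# Two objects that certainly belong to different clusters conflict fully.
certain_1 = {frozenset({1}): 1.0}
certain_2 = {frozenset({2}): 1.0}
# Objects with half their mass on "cluster 1 or 2" conflict only partly.
vague_1 = {frozenset({1}): 0.5, frozenset({1, 2}): 0.5}
vague_2 = {frozenset({2}): 0.5, frozenset({1, 2}): 0.5}

full = degree_of_conflict(certain_1, certain_2)
partial = degree_of_conflict(vague_1, vague_2)
```

Under this reading, highly dissimilar objects should end up with strongly conflicting masses, while similar objects should have masses that agree.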

10.
This paper presents a new technique for clustering either object or relational data. First, the data are represented as a matrix D of dissimilarity values. D is reordered to D* using a visual assessment of cluster tendency algorithm. If the data contain clusters, they are suggested by visually apparent dark squares arrayed along the main diagonal of an image I(D*) of D*. The suggested clusters in the object set underlying the reordered relational data are found by defining an objective function that recognizes this blocky structure in the reordered data. The objective function is optimized when the boundaries in I(D*) are matched by those in an aligned partition of the objects. The objective function combines measures of contrast and edginess and is optimized by particle swarm optimization. We prove that the set of aligned partitions is exponentially smaller than the set of partitions that needs to be searched if clusters are sought in D. Six numerical examples are given to illustrate various facets of the algorithm. © 2009 Wiley Periodicals, Inc.

11.
The first stage of organizing objects is to partition them into groups or clusters. The clustering is generally done on individual object data representing the entities, such as feature vectors, or on object relational data incorporated in a proximity matrix. This paper describes another method for finding a fuzzy membership matrix that provides cluster membership values for all the objects based strictly on the proximity matrix. This is generally referred to as relational data clustering. The fuzzy membership matrix is found by first finding a set of vectors whose inter-vector Euclidean distances approximately equal the proximities that are provided. These vectors can be of very low dimension, such as 5 or less. Fuzzy c-means (FCM) is then applied to these vectors to obtain a fuzzy membership matrix. In addition, two-dimensional vectors are created to provide a visual representation of the proximity matrix. This allows the result of automatic clustering to be compared with visual clustering. The method proposed here is compared to other relational clustering methods, including NERFCM, Roubens' method, and Windham's A-P method. Various clustering quality indices are also calculated for the comparison, using various proximity matrices as input. Simulations show the method to be very effective and no more computationally expensive than other relational data clustering methods. The membership matrices produced by the proposed method are less crisp than those produced by NERFCM and more representative of the proximity matrix that is used as input to the clustering process.
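Entry 11 applies fuzzy c-means (FCM) to the embedded vectors. As a reminder of what the resulting memberships look like, here is a minimal sketch of the standard FCM membership update for a single object, given its distances to the cluster centres. The fuzzifier default m = 2 and the tie-handling for zero distances are assumptions made here; the embedding step and the centre update are not shown.

```python
def fcm_memberships(distances, m=2.0):
    """Standard FCM membership of one object in each cluster.

    distances: distance from the object to every cluster centre.
    m: fuzzifier, must be greater than 1 (m = 2 is an assumed default).
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    # An object sitting exactly on one or more centres belongs only to them.
    on_centre = [k for k, d in enumerate(distances) if d == 0.0]
    if on_centre:
        share = 1.0 / len(on_centre)
        return [share if k in on_centre else 0.0 for k in range(len(distances))]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# An object three times closer to the first centre than to the second.
memberships = fcm_memberships([1.0, 3.0])
```

Larger values of m flatten the memberships towards equal shares, which is one way a result can come out "less crisp".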

12.
This paper presents variable-wise kernel hard clustering algorithms in the feature space in which dissimilarity measures are obtained as sums of squared distances between patterns and centroids computed individually for each variable by means of kernels. The methods proposed in this paper are supported by the fact that a kernel function can be written as a sum of kernel functions evaluated on each variable separately. The main advantage of this approach is that it allows the use of adaptive distances, which are suitable to learn the weights of the variables on each cluster, providing a better performance. Moreover, various partition and cluster interpretation tools are introduced. Experiments with synthetic and benchmark datasets show the usefulness of the proposed algorithms and the merit of the partition and cluster interpretation tools.

13.
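The basic idea of the K-means algorithm is to assign every element to exactly one cluster through iteration, so that points within the same cluster have minimal dissimilarity and elements in different clusters have maximal dissimilarity. However, this approach also assigns points that lie in the overlapping regions between clusters to a single cluster, and therefore cannot express the cross-cluster nature of some elements. This paper proposes a K-means algorithm based on fuzzy logic, which uses fuzzy logic to compute the weight with which a point in an overlapping region belongs to each cluster, so that the cross-cluster characteristics of points can be described effectively while the clustering result is obtained. Experimental results show that the algorithm is effective.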

14.
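陈黎飞 (Chen Lifei), 郭躬德 (Guo Gongde). 《软件学报》 (Journal of Software), 2013, 24(11): 2628-2641
Categorical data are widespread in many application domains such as bioinformatics, and their discrete values make clustering categorical data a difficult task in statistical machine learning. Current mainstream methods rely on the modes of categorical attributes for clustering optimization and for computing the weights of relevant attributes. This paper proposes a non-mode statistical clustering method for categorical data. First, based on a newly defined dissimilarity measure, an attribute-weighted objective function for clustering categorical data is derived. The function is based on the average distance between objects and clusters, which avoids the problems caused by the mode-centered design of existing methods. Second, a soft subspace clustering algorithm for categorical data is defined. During clustering, the algorithm assigns each attribute a weight that measures its relevance to the cluster according to the overall distribution of the attribute's values, rather than only the attribute's mode, and thus performs automatic feature selection. Experimental results on synthetic and real-world data sets show that, compared with existing mode-based clustering algorithms and other non-mode algorithms based on Monte Carlo optimization, the algorithm effectively improves the quality of the clustering results.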

15.
Semisupervised clustering algorithms partition a given data set using limited supervision from the user. The success of these algorithms depends on the type of supervision and also on the kind of dissimilarity measure used while creating partitions of the space. This paper proposes a clustering algorithm that uses supervision in terms of relative comparisons, viz., x is closer to y than to z. The proposed clustering algorithm simultaneously learns the underlying dissimilarity measure while finding compact clusters in the given data set using relative comparisons. Through our experimental studies on high-dimensional textual data sets, we demonstrate that the proposed algorithm achieves higher accuracy and is more robust than similar algorithms using pairwise constraints for supervision.

16.
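Data with mixed numeric and categorical attributes are often encountered in the real world, and k-prototypes is one of the main algorithms for clustering such data. To address the shortcomings of existing mixed-attribute clustering algorithms, an improved k-prototypes algorithm based on distributed centroids and a new dissimilarity measure is proposed. In the new algorithm, distributed centroids are first introduced to represent the cluster centers of categorical attributes. Means and distributed centroids are then combined to represent the cluster centers of mixed attributes, and a new dissimilarity measure is proposed to compute the distance between data objects and cluster centers. The new measure takes into account the importance of different attributes in the clustering process. Simulation experiments on three real data sets show that the proposed algorithm achieves higher clustering accuracy than traditional clustering algorithms, which verifies its effectiveness.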
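As background for entry 16, the sketch below shows the classical k-prototypes dissimilarity that the paper sets out to improve: squared Euclidean distance on the numeric attributes plus a weighted count of mismatches on the categorical attributes. This is the baseline measure only, not the distributed-centroid measure proposed in the paper (which the abstract does not specify); the function name, the weight name gamma, and the toy values are assumptions.

```python
def k_prototypes_dissimilarity(obj_num, obj_cat, proto_num, proto_cat, gamma=1.0):
    """Classical mixed-attribute dissimilarity between an object and a prototype.

    obj_num / proto_num: numeric attribute values.
    obj_cat / proto_cat: categorical attribute values.
    gamma: weight balancing categorical mismatches against numeric distance.
    """
    if len(obj_num) != len(proto_num) or len(obj_cat) != len(proto_cat):
        raise ValueError("object and prototype must have matching attributes")
    numeric_part = sum((x - p) ** 2 for x, p in zip(obj_num, proto_num))
    mismatches = sum(1 for x, p in zip(obj_cat, proto_cat) if x != p)
    return numeric_part + gamma * mismatches


# Toy object and prototype (assumed values): one of two categories differs.
distance = k_prototypes_dissimilarity(
    [1.0, 2.0], ["red", "S"],
    [0.0, 0.0], ["red", "M"],
    gamma=0.5,
)
```

Because every categorical mismatch counts the same here regardless of how common a value is in the cluster, a frequency-based centre for categorical attributes is a natural refinement.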

17.
Learning from Cluster Examples   总被引:2,自引:0,他引:2  
Learning from cluster examples (LCE) is a hybrid task combining features of two common grouping tasks: learning from examples and clustering. In LCE, each training example is a partition of objects. The task is then to learn from a training set, a rule for partitioning unseen object sets. A general method for learning such partitioning rules is useful in any situation where explicit algorithms for deriving partitions are hard to formalize, while individual examples of correct partitions are easy to specify. In the past, clustering techniques have been applied to such problems, despite being essentially unsuited to the task. We present a technique that has qualitative advantages over standard clustering approaches. We demonstrate these advantages by applying our method to problems in two domains; one with dot patterns and one with more realistic vector-data images.

18.
19.
This paper describes a system for the automatically learned partitioning of visual patterns in 2D images, based on sophisticated band-pass filtering with fixed scale and orientation sensitivity. The visual patterns are defined as the features which have the highest degree of alignment in the statistical structure across different frequency bands. The analysis reorganizes the image according to an invariance constraint in statistical structure and consists of three stages: pre-attentive stage, integration stage, and learning stage. The first stage takes the input image and performs filtering with log-Gabor filters. Based on their responses, activated filters which are selectively sensitive to patterns in the image are shortlisted. In the integration stage, common ground between several activated sensors is explored. The filtered responses are analyzed through a family of statistics. For any two activated filters, a distance between them is derived via distances between their statistics. The third stage performs cluster partitioning for learning the subspace of log-Gabor filters needed to partition the image data. The clustering is based on a dissimilarity measure intended to highlight scale and orientation invariance of the responses. The technique is illustrated on real and simulated data sets. Finally, this paper presents a computational visual distinctness measure computed from the image representational model based on visual patterns. Experiments are performed to investigate its relation to distinctness as measured by human observers.

20.
Clustering is an unsupervised learning problem: a procedure that partitions data objects into matching clusters. The data objects in the same cluster are quite similar to each other and dissimilar to those in other clusters. Density-based clustering algorithms find clusters based on the density of data points in a region. The DBSCAN algorithm is one such density-based clustering algorithm. It can discover clusters with arbitrary shapes and requires only two input parameters. DBSCAN has proved to be very effective for analyzing large and complex spatial databases. However, DBSCAN needs a large amount of memory and often has difficulties with high-dimensional data and with clusters of very different densities. The partitioning-based DBSCAN algorithm (PDBSCAN) was proposed to solve these problems, but PDBSCAN gives poor results when the density of the data is non-uniform. Moreover, DBSCAN and PDBSCAN are both, to some extent, sensitive to the initial parameters. In this paper, we propose a new hybrid algorithm based on PDBSCAN. We use a modified ant clustering algorithm (ACA) and design a new partitioning algorithm based on 'point density' (PD) in the data preprocessing phase. We name the new hybrid algorithm PACA-DBSCAN. The performance of PACA-DBSCAN is compared with that of DBSCAN and PDBSCAN on five data sets. Experimental results indicate the superiority of the PACA-DBSCAN algorithm.
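The baseline DBSCAN procedure referred to in entry 20, with its two input parameters, can be sketched in a few lines of plain Python. This is a minimal brute-force version of the textbook algorithm, not PDBSCAN or the proposed PACA-DBSCAN, whose partitioning and ant-clustering steps the abstract does not detail. The parameter names eps and min_pts and the toy points are assumptions.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region_query(points, j, eps)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Two tight groups and one isolated point (toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Because a single eps and min_pts apply to the whole data set, clusters of very different densities are hard to recover with one setting, which is the sensitivity the abstract points out.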
