Similar literature
Found 20 similar documents (search time: 31 ms)
1.
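This work introduces a data clustering technique based on fuzzy logic and discusses the fuzzy C-means clustering method. The fuzzy C-means algorithm combines fuzzy logic theory with the idea of clustering to partition n samples into c classes, so that similarity among objects assigned to the same cluster is maximized while similarity between different clusters is minimized.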
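
The fuzzy C-means idea summarized above can be sketched in a few lines. This is a minimal illustration, not the implementation from the work itself: it assumes Euclidean distance, the common fuzzifier value m = 2, and random initial memberships, and the names `fuzzy_c_means`, `iters`, `tol`, and `seed` are hypothetical.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length tuples."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial membership matrix; every row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    exponent = 2.0 / (m - 1.0)
    for _ in range(iters):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / total
                for d in range(dim)))
        # Membership update from the relative distances to all centers.
        shift = 0.0
        for i in range(n):
            dists = [math.dist(points[i], ctr) for ctr in centers]
            if min(dists) == 0.0:
                # The point coincides with one or more centers.
                hits = [1.0 if dk == 0.0 else 0.0 for dk in dists]
                new = [h / sum(hits) for h in hits]
            else:
                new = [1.0 / sum((dists[k] / dists[j]) ** exponent
                                 for j in range(c))
                       for k in range(c)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, u[i])))
            u[i] = new
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u
```

Each row of the returned membership matrix sums to 1, so a hard label, if one is needed, can be taken as the index of the largest entry in the row.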

2.
Conventional clustering ensemble algorithms employ a set of primary results, each of which includes a set of clusters that emerge from the data. Given a large number of available clusters, one faces the following questions: (a) can we obtain results of the same quality with a smaller number of clusters instead of the full ensemble? (b) If so, which subset of clusters is most efficient to use in the ensemble? This paper answers these two questions. We explore a clustering ensemble approach combined with a cluster stability criterion and a dataset simplicity criterion to discover the best subset of base clusters for each kind of dataset. A novel method is also proposed to accumulate the selected clusters and extract the final partitioning. Although performance is expected to decrease as the ensemble shrinks, our experimental results show that our selection mechanism generally leads to superior results.

3.
Clustering is a widely used unsupervised data mining technique. It identifies structures in collections of objects by grouping them into classes, named clusters, in such a way that the similarity of objects within any cluster is maximized and the similarity of objects belonging to different clusters is minimized. In density-based clustering, a cluster is defined as a connected dense component and grows in the direction driven by the density. The basic structure of density-based clustering has some common drawbacks: (i) parameters have to be set; (ii) the behavior of the algorithm is sensitive to the density of the starting object; and (iii) adjacent clusters of different densities cannot be properly identified. In this paper, we address all of the above problems. Our method, based on the concept of space stratification, efficiently identifies the different densities in the dataset and ranks the objects of the original space accordingly. Next, it exploits this knowledge by projecting the original data into a space with one more dimension. It then performs a density-based clustering that takes into account the reverse nearest neighbors of the objects. Our method also reduces the number of input parameters by giving a guideline for setting them suitably. Experimental results indicate that our algorithm can deal with clusters of different densities and outperforms the most popular algorithms, DBSCAN and OPTICS, on all the standard benchmark datasets.

4.
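Lu Linhua (陆林花). Computer Simulation (《计算机仿真》), 2009, 26(7): 122-125, 158
To perform cluster analysis when the number of clusters is not known, a new dynamic clustering algorithm combining nearest-neighbor clustering with a genetic algorithm is proposed. The new algorithm has two stages. In the first stage, a nearest-neighbor clustering algorithm assigns the most similar instances to the same cluster and filters out noisy data according to similarity or dissimilarity measures, yielding an initial set of clusters. The second stage is genetic optimization: a dynamic clustering evaluation function is used to merge the initial clusters dynamically and obtain a near-optimal solution. Finally, the algorithm is evaluated in simulation experiments; the results show that the method clusters effectively without prior knowledge of the number of clusters.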

5.
Clustering consists of partitioning a set of objects into disjoint and homogeneous clusters. For many years, clustering methods have been applied in a wide variety of disciplines and scientific areas. Traditionally, clustering methods deal with numerical data, i.e. objects represented by a conjunction of numerical attribute values. Nowadays, however, commercial and scientific databases usually contain categorical data, i.e. objects represented by categorical attributes. In this paper we present a dissimilarity measure that is capable of dealing with tree-structured categorical data. It can therefore be used to extend the various versions of the very popular k-means clustering algorithm to such data. We discuss how such an extension can be achieved. Moreover, we show empirically that the proposed dissimilarity measure is accurate compared to other well-known (dis)similarity measures for categorical data.

6.
Clustering is the process of grouping objects that are similar, where similarity between objects is usually measured by a distance metric. The groups formed by a clustering method are referred to as clusters. Clustering is a widely used activity with applications ranging from biology to economics. Each clustering technique has advantages and disadvantages. Some clustering algorithms also require input parameters that strongly affect the result. In most cases, it is not possible to choose the best distance metric, the best clustering method, and the best input argument values for an input data set. Multiple clusterings can therefore be obtained from several distance metrics, several clustering methods, and several input argument values, and these clusterings can be combined into a new final clustering of better quality. We propose a family of algorithms for combining multiple clusterings that are memory efficient, scalable, robust, and intuitive. By working at the cluster level, our new algorithms offer large speed gains and low memory requirements while producing final clusters of very good quality. Extensive experimental evaluations on some very challenging artificially generated and real data sets from a diverse set of domains establish the usefulness of our methods.

7.
Robust projected clustering   Total citations: 4, self-citations: 2, citations by others: 2
Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical data are that they rely on parameters whose appropriate values are difficult to set or that they are unable to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need the number of projected clusters as input, and can discover, under very general conditions, the true number of projected clusters. P3C is effective in detecting very low-dimensional projected clusters embedded in high-dimensional spaces. P3C positions itself between projected and subspace clustering in that it can compute either disjoint or overlapping clusters. P3C is the first projected clustering algorithm for both numerical and categorical data.

8.
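A rough clustering algorithm based on data fields   Total citations: 1, self-citations: 1, citations by others: 1
Cluster analysis is an active research topic in data mining. Traditional clustering algorithms assign each object to exactly one cluster, so the boundaries between classes are crisp. With the development of Web mining, clustering algorithms that assign every object exactly face great challenges. Drawing on the ability of data field theory and classical rough set theory to handle imprecise and uncertain data, a new rough clustering algorithm based on data fields is proposed. It uses potential values as the basis for assigning objects, rather than the Euclidean-distance-based assignment that traditional rough clustering algorithms have always used. The algorithm first partitions the data objects coarsely and then refines the partition iteratively until stable clusters form. In the experimental analysis, the proposed algorithm is compared with the rough K-means and rough K-medoids algorithms; the results show that it clusters overlapping data sets well and converges quickly.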

9.
Cluster validity indices are used for estimating the quality of partitions produced by clustering algorithms and for determining the number of clusters in data. Cluster validation is a difficult task because, for the same data set, several partitions may exist that fit its natural groupings at different levels of detail. Even though several cluster validity indices exist, they are ineffective when clusters differ widely in density or size. We propose a clustering validity index that addresses these issues. It is based on compactness and overlap measures. The overlap measure, which indicates the degree of overlap between fuzzy clusters, is obtained by calculating the overlap rate of all data objects that belong strongly enough to two or more clusters. The compactness measure, which indicates the degree of similarity of data objects in a cluster, is calculated from the membership values of data objects that are strongly enough associated with one cluster. We propose ratio-type and summation-type indices using the same compactness and overlap measures. The maximal value of the index denotes the optimal fuzzy partition, which is expected to have high compactness and a low degree of overlap among clusters. Tests of many well-known, previously formulated indices and the proposed index on well-known data sets showed the superior reliability and effectiveness of the proposed index, especially when evaluating partitions with clusters that differ widely in size or density.

10.
Appropriately defining and efficiently calculating similarities from large data sets are often essential in data mining, both for gaining an understanding of data and generating processes and for building tractable representations. Given a set of objects and their correlations, we rely here on the premise that each object is characterized by its context, i.e., its correlations to the other objects. The similarity between two objects can then be expressed in terms of the similarity between their contexts. In this way, similarity pertains to the general notion that objects are similar if they are exchangeable in the data. We propose a scalable approach for calculating all relevant similarities among objects by relating them in a correlation graph that is transformed to a similarity graph. These graphs can express rich structural properties among objects. Specifically, we show that concepts (abstractions of objects) are constituted by groups of similar objects that can be discovered by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of fields and are demonstrated here in three domains: computational linguistics, music, and molecular biology, where the numbers of objects and correlations range from small to very large.

11.
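Unsupervised anomaly detection based on partitioning and agglomerative hierarchical clustering   Total citations: 3, self-citations: 1, citations by others: 2
Li Na (李娜), Zhong Cheng (钟诚). Computer Engineering (《计算机工程》), 2008, 34(2): 120-123
Information entropy theory is applied to the clustering problem in intrusion detection. Definitions are given, under mixed attributes, for the distance between data items, between a data item and a cluster, and between clusters. Using an overall-similarity clustering quality criterion as the cluster merging strategy, an unsupervised anomaly detection algorithm based on partitioning and agglomerative hierarchical clustering is proposed. Algorithm analysis and experimental results show that the algorithm has good detection performance and can effectively detect unknown intrusions.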

12.
Chameleon: hierarchical clustering using dynamic modeling   Total citations: 8, self-citations: 0, citations by others: 8
Clustering is a discovery process in data mining. It groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between different clusters. Many advanced algorithms have difficulty dealing with highly variable clusters that do not follow a preconceived model. By basing its selections on both interconnectivity and closeness, the Chameleon algorithm yields accurate results for these highly variable clusters. Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores the information about the aggregate interconnectivity of items in two clusters. Another set of schemes (the ROCK algorithm, the group averaging method, and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters. By considering only interconnectivity or only closeness, these algorithms can select and merge the wrong pair of clusters. Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters. Chameleon finds the clusters in the data set by using a two-phase algorithm. During the first phase, Chameleon uses a graph partitioning algorithm to cluster the data items into several relatively small subclusters. During the second phase, it uses an algorithm to find the genuine clusters by repeatedly combining these subclusters.

13.
Rough k-means clustering describes uncertainty by assigning some objects to more than one cluster. A rough cluster quality index based on decision theory is applicable to the evaluation of rough clustering. In this paper we analyze rough k-means clustering with respect to the selection of the threshold, the value of the risk of assigning an object, and the uncertainty of objects. According to the analysis, clusters presented as interval sets with lower and upper approximations in rough k-means clustering are not adequate to describe clusters. This paper proposes an interval set clustering based on decision theory. Lower and upper approximations in the proposed algorithm are hierarchical and constructed as outer-level and inner-level approximations. The uncertainty of objects in the outer-level upper approximation is described by the assignment of objects among different clusters. Accordingly, the ambiguity of objects in the inner-level upper approximation is represented by local uniform factors of objects. In addition, interval set clustering can be improved to obtain a satisfactory clustering result with the optimal number of clusters, as well as optimal parameter values, by taking advantage of the rough cluster quality index in the evaluation of clustering. The experimental results on synthetic and standard data demonstrate how to construct clusters with satisfactory lower and upper approximations in the proposed algorithm. The experiments with a promotional campaign for retail data illustrate the usefulness of interval set clustering for improving rough k-means clustering results.
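
The threshold-based assignment that rough k-means relies on can be illustrated as follows. This is a sketch of the commonly used ratio rule, not the interval set algorithm proposed in the paper: an object goes to the lower approximation of its nearest cluster unless another center is almost as close, in which case it goes only to the upper approximations of all such clusters. The name `rough_assign` and the default `threshold=1.3` are assumptions made for illustration.

```python
import math


def rough_assign(points, centers, threshold=1.3):
    """Return (lower, upper): lists of point indices per cluster."""
    k = len(centers)
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for idx, p in enumerate(points):
        dists = [math.dist(p, c) for c in centers]
        nearest = min(range(k), key=dists.__getitem__)
        d_min = dists[nearest]
        # Other clusters whose center is at most `threshold` times as far away.
        close = [j for j in range(k)
                 if j != nearest and dists[j] <= threshold * d_min]
        if close:
            # Ambiguous object: upper approximations only.
            for j in [nearest] + close:
                upper[j].append(idx)
        else:
            # Certain object: lower approximation, which is part of the upper.
            lower[nearest].append(idx)
            upper[nearest].append(idx)
    return lower, upper
```

With centers at (0, 0) and (10, 0), a point at (5, 0) is equally far from both and so appears in both upper approximations and in neither lower approximation. Larger thresholds make more objects ambiguous, which is why the choice of threshold matters.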

14.
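The enhanced intrusion detection method based on GCA (gravity-based clustering approach) first clusters the training set with GCA. Then, following the idea of agglomerative hierarchical clustering, it merges some of the clusters produced by GCA, using inter-cluster dissimilarity and overall similarity as the clustering quality criteria; after merging, cluster centers are more concentrated and the objects within each cluster are more tightly grouped. A labeling algorithm then marks which clusters are normal and which are anomalous, and finally a detection algorithm is applied to the test set. Experiments show that the method improves the detection of unknown attacks and, in particular, effectively lowers the false alarm rate.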

15.
Cluster analysis deals with the problem of organization of a collection of objects into clusters based on a similarity measure, which can be defined using various distance functions. The use of different similarity measures allows one to find different cluster structures in a data set. In this article, an algorithm is developed to solve clustering problems where the similarity measure is defined using the L1-norm. The algorithm is designed using the nonsmooth optimization approach to the clustering problem. Smoothing techniques are applied to smooth both the clustering function and the L1-norm. The algorithm computes clusters sequentially and finds global or near global solutions to the clustering problem. Results of numerical experiments using 12 real-world data sets are reported, and the proposed algorithm is compared with two other clustering algorithms.
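
To make the L1-norm formulation concrete, the sketch below evaluates a clustering objective (the average distance from each point to its nearest center) with the exact L1 distance and with a smoothed one. The smoothing shown, replacing |t| with sqrt(t^2 + eps^2), is one common choice and is an assumption here; the article's own smoothing of the norm and of the min term may differ. All function names are hypothetical.

```python
import math


def l1(a, b):
    """Exact L1 (Manhattan) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def smooth_l1(a, b, eps=1e-3):
    """Differentiable surrogate: |t| is replaced by sqrt(t*t + eps*eps)."""
    return sum(math.sqrt((x - y) ** 2 + eps ** 2) for x, y in zip(a, b))


def clustering_objective(points, centers, dist=l1):
    """Average distance from each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points) / len(points)
```

Because sqrt(t^2 + eps^2) exceeds |t| by at most eps, the smoothed objective overestimates the exact one by at most eps times the dimension. Only the norm is smoothed in this sketch; the min over centers remains nonsmooth.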

16.
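For the identification of unknown protocols in bit streams, an improved agglomerative hierarchical clustering algorithm is proposed to address the problem of separating captured multi-protocol data frames into single-protocol data frames. Building on the traditional agglomerative hierarchical clustering algorithm and the characteristics of bit-stream data frames, the algorithm defines the similarity between data frames and between clusters, and extracts clusters that meet the requirements while clustering proceeds, so that data frames are clustered quickly and effectively. The algorithm also determines the number of clusters automatically, and the resulting clusters carry a similarity evaluation index. Tests on the data set published by Lincoln Laboratory show that the algorithm clusters protocol data frames with high accuracy.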

17.
As one of the most fundamental yet important methods of data clustering, the center-based partitioning approach clusters the dataset into k subsets, each of which is represented by a centroid or medoid. In this paper, we propose a new medoid-based k-partitions approach called Clustering Around Weighted Prototypes (CAWP), which works with a similarity matrix. In CAWP, each cluster is characterized by multiple objects with different representative weights. With this new cluster representation scheme, CAWP aims to simultaneously produce clusters of improved quality and a set of ranked representative objects for each cluster. An efficient algorithm is derived to alternately update the clusters and the representative weights of objects with respect to each cluster. An annealing-like optimization procedure is incorporated to alleviate the local optimum problem for better clustering results and, at the same time, to make the algorithm less sensitive to parameter settings. Experimental results on benchmark document datasets show that CAWP achieves favorable effectiveness and efficiency in clustering, and also provides useful information for cluster-specific analysis.

18.
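An improved rough clustering method based on a genetic algorithm   Total citations: 2, self-citations: 0, citations by others: 2
Traditional clustering algorithms partition data objects by hard computing, yet in practice objects of different classes usually have no clear boundary. Rough set theory provides a way to handle the uncertainty of boundary objects, so rough set theory is combined with the k-means method. In addition, the traditional k-means method requires the number of clusters k to be given in advance, although k is hard to determine in practice; and while traditional k-means has strong local search ability, it easily falls into local optima. A genetic algorithm can reach the global optimum but converges too quickly. An improved rough clustering method based on a genetic algorithm is therefore proposed. The algorithm generates the number of k-means clusters dynamically, produces the initial cluster centers by the max-min principle, and handles boundary objects with the upper and lower approximations of rough set theory. Finally, the algorithm is validated on the Iris data set from UCI. The experimental results show that the algorithm has high accuracy and more stable overall performance.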
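
The max-min principle for choosing initial cluster centers can be sketched as below. This is a generic farthest-first illustration under stated assumptions, not the paper's procedure: it takes a fixed k (the paper generates k dynamically), uses Euclidean distance, and starts from an arbitrarily chosen first point. The name `max_min_centers` and the `first` parameter are hypothetical.

```python
import math


def max_min_centers(points, k, first=0):
    """Pick k initial centers: each new center is the point farthest
    from its nearest already-chosen center."""
    chosen = [first]
    # min_d[i] is the distance from point i to its nearest chosen center.
    min_d = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_d.__getitem__)
        chosen.append(nxt)
        min_d = [min(d, math.dist(p, points[nxt]))
                 for d, p in zip(min_d, points)]
    return [points[i] for i in chosen]
```

Spreading the initial centers this way avoids starting k-means with several centers inside the same dense region, at the cost of some sensitivity to outliers.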

19.
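Cluster analysis is a method frequently used in data mining to analyze relationships among data. It partitions a set of data objects into several groups, or clusters, such that objects within a cluster are more similar to each other than to objects in other clusters. Density peaks clustering is a novel clustering algorithm recently published in Science; it clusters by evaluating two attributes of each data object, the density value ρ and the separation value δ. Compared with traditional clustering algorithms, its advantages include interactivity, the absence of iteration, and independence from the data distribution. However, computing the density and separation values of every data object requires O(N^2) distance computations, which severely limits efficiency on massive high-dimensional data. To improve the efficiency and scalability of the algorithm, an efficient distributed density peaks clustering algorithm, EDDPC (efficient distributed density peaks clustering), is proposed. It uses Voronoi partitioning together with careful data replication and filtering to avoid a large amount of unnecessary distance computation and data transfer. Experimental results show that EDDPC achieves roughly a 40-fold performance improvement over a simple MapReduce distributed implementation.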
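
The two per-object quantities, and the O(N^2) cost that motivates EDDPC, can be seen in a naive single-machine sketch. It assumes the simple cutoff-kernel density (the number of other points closer than a cutoff `d_c`) and Euclidean distance; the function name `density_peaks` is hypothetical, and none of the Voronoi partitioning, replication, or filtering of EDDPC is shown.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta) for every point using all pairwise distances."""
    n = len(points)
    # Full distance matrix: this is the O(N^2) step.
    d = [[math.dist(points[i], points[j]) for j in range(n)]
         for i in range(n)]
    # rho[i]: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all points of strictly higher density.
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point of maximal density gets its largest distance instead.
        delta.append(min(higher) if higher else max(d[i]))
    return rho, delta
```

Points with both a large ρ and a large δ are the candidates for cluster centers; in the original method they are typically picked interactively from a plot of δ against ρ.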

20.
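Graph clustering is an important variant of data clustering. On the one hand, a graph can often represent the similarities among the data in a data set; on the other hand, the analysis of large complex networks is attracting growing attention. Moreover, clustering a graph improves its visual clarity and aids visual analysis, observation, and navigation. Applying the basic idea of the max-min method to the clustering of unweighted graphs, a fast clustering method for undirected, connected, unweighted graphs is proposed. The method is simple, has a short clustering time and high running efficiency, and adapts well to the clustering of large static graphs.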
