Similar literature
Found 20 similar documents; search time 62 ms.
1.
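Cluster validity indices fall into two broad categories: external and internal. Existing external indices do not account for class skew in clustering results, and existing internal indices often fail to identify the best number of clusters. To address these shortcomings, the paper proposes external indices that consider both positive-class and negative-class information, based respectively on the contingency table and on sample pairs, for evaluating clustering results on datasets of arbitrary distribution. It also proposes an internal index that uses variance to measure within-cluster compactness and between-cluster separation, and takes the ratio of separation to compactness as the validity measure. Experiments on UCI datasets and synthetic datasets show that the new internal index effectively finds the true number of clusters, and that the proposed contingency-table-based and sample-pair-based external indices effectively evaluate clustering results in the presence of class skew and noisy data.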
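The idea of an internal index built as a ratio of between-cluster separation to within-cluster compactness, both measured by variance, can be illustrated with a minimal sketch. The abstract does not give the paper's exact formula, so the definitions below (mean within-cluster variance for compactness, variance of centroids around the grand centroid for separation) and the function names are assumptions made for illustration only.

```python
from statistics import fmean


def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    return tuple(fmean(axis) for axis in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variance_ratio(clusters):
    """Illustrative index (an assumption, not the paper's definition):
    between-cluster separation divided by within-cluster compactness.
    Larger values indicate tighter, better separated clusters.
    Undefined (division by zero) if every cluster has zero variance.
    """
    centers = [centroid(c) for c in clusters]
    grand = centroid([p for c in clusters for p in c])
    # Compactness: average within-cluster variance around each centroid.
    compactness = fmean(
        fmean(sq_dist(p, m) for p in c) for c, m in zip(clusters, centers)
    )
    # Separation: variance of the cluster centroids around the grand centroid.
    separation = fmean(sq_dist(m, grand) for m in centers)
    return separation / compactness


good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]  # tight, far apart
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]   # same points, poor grouping
print(variance_ratio(good), variance_ratio(bad))
```

Evaluating such a ratio for several candidate numbers of clusters and picking the maximum is one common way to estimate the cluster count; whether the paper selects it this way is not stated in the abstract.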

2.
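Density-based clustering (DBSCAN) is one of the most effective methods for trajectory data mining, but density-based clustering algorithms are often limited by the choice of input parameters. In trajectory data mining, clustering results are affected not only by intra-cluster and inter-cluster distances but also by the number of coordinate points in each cluster. The paper therefore proposes a new cluster validity index, based on internal and external duty ratios, to balance these three factors. The index can automatically select the input parameters of density clustering and produce valid clusters on different datasets, and the optimized clustering method can be applied to in-depth analysis and mining of travelers' behavior trajectories. Experimental results show that, compared with traditional validity indices, the proposed duty-ratio-based index optimizes the input parameters and yields better clustering of traveler location information.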

3.

Clustering is concerned with grouping a collection of input objects. Conventional clustering algorithms cluster unlabelled objects. We argue that there are useful applications that involve clustering of labelled objects. We propose an approach for clustering of labelled objects. The proposed approach makes use of the domain knowledge represented in the form of a directed acyclic graph for clustering. We also propose a set of proper axioms in logic as a basis for the proposed algorithm. We study some of the properties of the approach such as order-independence and describe in detail an application of the proposed algorithm in the context of document retrieval.

4.
This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.

5.
With the rapid growth of text documents, document clustering techniques are emerging for efficient document retrieval and better document browsing. Recently, some methods have been proposed to resolve the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels by using frequent itemsets derived from association rule mining for clustering documents. In order to improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is applied to discover generalized frequent itemsets as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Classic4, Re0, R8, and WebKB datasets. Our experimental results show that our proposed approach indeed provides more accurate clustering results than prior influential clustering methods presented in recent literature.

6.
Hierarchical clustering algorithms provide a set of nested partitions called a cluster hierarchy. Since the hierarchy is usually too complex, it is reduced to a single partition by using cluster validity indices. We show that the classical method is often not useful and we propose SEP, a new method that efficiently searches in an extended partition set. Furthermore, we propose a new cluster validity index, COP, since many of the commonly used indices cannot be used with SEP. Experiments performed with 80 synthetic and 7 real datasets confirm that SEP/COP is superior to the method currently used and, furthermore, that it is less sensitive to noise.

7.

In the current paper, we have developed two bio-inspired fuzzy clustering algorithms by incorporating two optimization techniques, namely differential evolution and particle swarm optimization. Both clustering techniques can detect symmetrical-shaped clusters utilizing the established point symmetry-based distance measure. Both proposed approaches are automatic in nature and can detect the number of clusters automatically from a given dataset. A symmetry-based cluster validity measure, F-Sym-index, is used as the objective function to be optimized in order to automatically determine the correct partitioning by both approaches. The effectiveness of the proposed approaches is shown for automatically clustering some artificial and real-life datasets as well as for clustering some real-life gene expression datasets. The current paper presents a comparative analysis of some meta-heuristic-based clustering approaches, namely the two newly proposed techniques and the already existing automatic genetic clustering techniques VGAPS, GCUK, and HNGA. The obtained results are compared with respect to some external cluster validity indices. Moreover, some statistical significance tests, as well as biological significance tests, are also conducted. Finally, results on gene expression datasets have been visualized by using some visualization tools, namely the Eisen plot and the cluster profile plot.


8.
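Different clustering algorithms adopt their own strategies, yet each technique has limitations on particular datasets. Choosing an appropriate discrimination information method (DIM) helps document clustering proceed. To address these problems, the paper proposes a DIM for document clustering based on consensus and classification (DCCC). First, discrimination information maximization clustering (CDIM) is chosen to generate initial clusterings of the dataset, and two different CDIM methods are used to produce two initial clusterings. Second, the two initial clusterings are re-initialized with different parameter settings, and a consensus is established from the relationships among their cluster labels so as to maximize the sum of the documents' discrimination scores. Finally, discriminative text weight classification (DTWC) is chosen as the text classifier to assign new cluster labels to the consensus; training the text classifier modifies the base partitions, and the final partition is generated from the predicted labels. Experiments use eight web datasets, with BCubed precision and recall as the cluster validation measures. The results show that the proposed consensus-and-classification method produces better clusterings than the compared methods.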
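BCubed precision and recall, the validation measures named in this abstract, can be computed per item and then averaged. The sketch below follows the standard BCubed definition for hard (non-overlapping) clusterings; the function name and input layout (two parallel lists) are choices made here for illustration and are not taken from the paper.

```python
from collections import Counter


def bcubed(clusters, labels):
    """Return (precision, recall) for parallel lists of cluster ids and gold labels.

    For each item, precision is the fraction of items in its cluster that share
    its gold label, and recall is the fraction of items with its gold label that
    share its cluster (the item itself is counted). Both are averaged over items.
    """
    cluster_size = Counter(clusters)
    label_size = Counter(labels)
    pair_size = Counter(zip(clusters, labels))  # items per (cluster, label) cell
    n = len(clusters)
    precision = sum(
        pair_size[(c, l)] / cluster_size[c] for c, l in zip(clusters, labels)
    ) / n
    recall = sum(
        pair_size[(c, l)] / label_size[l] for c, l in zip(clusters, labels)
    ) / n
    return precision, recall


# Hypothetical toy data: four documents, two clusters, two gold classes.
print(bcubed([0, 0, 0, 1], ["a", "a", "b", "b"]))
```

Because the scores are averaged over items rather than over clusters, large clusters weigh more than small ones, which is one reason BCubed is often preferred for document clustering.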

9.
A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results. This ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set and the clusters evaluated relative to the known partition according to the distributions, and then the errors are averaged over the point sets composing the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation is based on calculating properties of the resulting clusters; relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation compares the partition generated by the clustering algorithm and a given partition of the data. To quantify the degree of similarity between the validation indices and the clustering errors, we use Kendall's rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not. We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm.
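The comparison described in this abstract pairs each validity score with the corresponding clustering error and measures how consistently the two rank the same cases. A minimal sketch of Kendall's rank correlation follows; it implements the tau-a variant (tied pairs count as neither concordant nor discordant), which is an assumption, since the abstract does not say which variant the authors use. The sample values are invented for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length sequences of numbers."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both sequences order this pair the same way
        elif sign < 0:
            discordant += 1   # the sequences disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values: validity scores and error rates for four clusterings.
validity = [0.2, 0.4, 0.6, 0.9]
error = [0.30, 0.25, 0.28, 0.05]
print(kendall_tau(validity, error))
```

For an index where higher is better, a value near -1 against the error rate means the index ranks clusterings the way the true error does, and a value near 0 means it carries little information about the error.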

10.
Cluster analysis is used to explore structure in unlabeled batch data sets in a wide range of applications. An important part of cluster analysis is validating the quality of computationally obtained clusters. A large number of different internal indices have been developed for validation in the offline setting. However, this concept cannot be directly extended to the online setting because streaming algorithms do not retain the data, nor maintain a partition of it, both needed by batch cluster validity indices. In this paper, we develop two incremental versions (with and without forgetting factors) of the Xie-Beni and Davies-Bouldin validity indices, and use them to monitor and control two streaming clustering algorithms (sk-means and online ellipsoidal clustering). In this context, our new incremental validity indices are more accurately viewed as performance monitoring functions. We also show that incremental cluster validity indices can send a distress signal to online monitors when evolving structure leads an algorithm astray. Our numerical examples indicate that the incremental Xie-Beni index with a forgetting factor is superior to the other three indices tested.
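For orientation, the batch Davies-Bouldin index that this abstract takes as a starting point can be sketched as below. This is the standard offline form, which needs all points and their partition; the incremental versions with forgetting factors proposed in the paper are not reproduced here, and the function name is an assumption.

```python
from math import dist
from statistics import fmean


def davies_bouldin(clusters):
    """Batch Davies-Bouldin index for a list of clusters of coordinate tuples.

    Lower is better. Assumes at least two clusters with distinct centroids.
    """
    centers = [tuple(fmean(axis) for axis in zip(*c)) for c in clusters]
    # Scatter of a cluster: mean Euclidean distance of its points to the centroid.
    scatter = [fmean(dist(p, m) for p in c) for c, m in zip(clusters, centers)]
    k = len(clusters)
    # For each cluster, take its worst (largest) similarity to any other cluster.
    return fmean(
        max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    )


good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]
print(davies_bouldin(good), davies_bouldin(bad))
```

The dependence on every stored point in `scatter` is what prevents this form from being used directly on a stream, which is the gap the abstract describes.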

11.
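A survey of cluster validity evaluation   Total citations: 11; self-citations: 3; citations by others: 8
Applications of cluster analysis urgently need an objective and fair quality evaluation method to judge the validity of clustering results. To this end, the paper surveys commonly used cluster validity evaluation methods from three perspectives (external, internal, and relative evaluation), and discusses evaluation methods for fuzzy clustering and the automatic determination of the optimal number of clusters.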

12.
Automatic document summarization aims to create a compressed summary that preserves the main content of the original documents. It is well recognized that a document set often covers a number of topic themes, with each theme represented by a cluster of highly related sentences. More importantly, topic themes are not equally important. The sentences in an important theme cluster are generally deemed more salient than the sentences in a trivial theme cluster. Existing clustering-based summarization approaches integrate clustering and ranking in sequence, which unavoidably ignores the interaction between them. In this paper, we propose a novel approach, based on spectral analysis, that clusters and ranks sentences simultaneously. Experimental results on the DUC generic summarization datasets demonstrate the improvement of the proposed approach over the other existing clustering-based approaches.

13.
In this paper, we define a validity measure for fuzzy criterion clustering, a novel approach to fuzzy clustering that, in addition to being non-distance-based, addresses the cluster validity problem. The model is then recast as a bilevel fuzzy criterion clustering problem. We propose an algorithm for this model that solves both the validity and clustering problems. Our approach is validated via some sample problems.

14.
Many validity indices have been proposed for quantitatively assessing the performance of clustering algorithms. One limitation of existing indices is their lack of generalizability, due to their dependence on the specific algorithms and structures of the data space. To handle large-scale datasets with arbitrary structures, this study proposes a new cluster separation measure for improving the effectiveness of existing validity indices. This is achieved by partitioning the original data space into a grid-based structure, which allows the introduction of a new measurement for assessing the true data distribution between any two clusters instead of the distance between the two cluster prototypes. To validate the effectiveness of the proposed separation measure, we adopt two commonly used validity indices, the Davies-Bouldin function (DB) and Tibshirani’s Gap statistic (GS). These indices are denoted as R-DB-1 and R-GS-1 for clusters with sphere-shaped structures and R-DB-2 and R-GS-2 for irregular-shaped structures. This integration enables the indices to evaluate both partitional algorithms and hierarchical algorithms. Partitional algorithms, including C-Means (CM) and Fuzzy C-Means (FCM), and hierarchical algorithms, including DBSCAN and CLIQUE, are used to test the performance of the new indices. Two synthetic datasets with spherical structures and four synthetic datasets with irregular shapes are first compared. Five real datasets from the UCI machine learning repository are then used to further test the measure’s performance. The experimental results provide evidence that the new indices outperform the original indices.

15.
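To address the excessive time cost of current clustering algorithms on large datasets, the paper proposes a dataset compression algorithm based on nearest-neighbor similarity. It groups several of the most similar nearest-neighbor data points into a cluster and randomly selects a cluster head from each cluster to form a new dataset, greatly reducing the data size. The compressed dataset is then clustered with the k-means algorithm and the AP algorithm. Experimental results show that, compared with clustering the original dataset, clustering the compressed dataset keeps accuracy essentially unchanged while effectively reducing clustering time and improving clustering performance, which demonstrates the effectiveness and reliability of the compression algorithm for cluster analysis.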

16.

Text document clustering is used to separate a collection of documents into several clusters so that the documents in a cluster are substantially similar. The documents in one cluster are distinct from documents in other clusters. The high-dimensional sparse document term matrix reduces the efficiency of the clustering process. This study proposes a new way of clustering documents using domain ontology and WordNet ontology. The main objective of this work is to increase cluster output quality. This work aims to investigate and examine the method of selecting feature dimensions to minimize the features of the document term matrix. The sports documents are clustered using conventional K-Means with the dimension reduction features selection process and density-based clustering. A novel approach named ontology-based document clustering is proposed for grouping the text documents. Three critical steps were used in order to develop this technique. The initial step for an ontology-based clustering approach starts with data pre-processing, and the characteristics of the DR method are reduced with the Info-Gain collection. The documents are clustered using two clustering methods: K-Means and Density-Based clustering with DR Feature Selection Process. These methods validate the findings of ontology-based clustering, and this study compared them using the measurement metrics. The second step of this study examines the sports field ontology development and describes the principles and relationship of the terms using sports-related documents. The semantic web rational process is used to test the ontology for validation purposes. An algorithm for the synonym retrieval of the sports domain ontology terms has been proposed and implemented. The retrieved terms from the documents and sport ontology concepts are mapped to the retrieved synonym set words from the WordNet ontology. The suggested technique is based on synonyms of mapped concepts. The proposed ontology approach employs the reduced feature set to cluster the text documents. The results are compared with two traditional approaches on two datasets. The proposed ontology-based clustering approach is found to be effective in clustering the documents with high precision, recall, and accuracy. In addition, this study also compared the different RDF serialization formats for sports ontology.


17.
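Optimal selection of fuzzy rules in neuro-fuzzy systems   Total citations: 5; self-citations: 0; citations by others: 5
Jia Li (贾立), Yu Jinshou (俞金寿). Control and Decision (《控制与决策》), 2002, 17(3): 306-309
The paper proposes a self-organizing neuro-fuzzy system based on a two-level clustering algorithm. The system applies two-level clustering (an improved nearest-neighbor clustering algorithm and the Gustafson-Kessel fuzzy clustering algorithm) to fuzzily cluster the input/output data, determines the optimal partition from the partition entropy of the fuzzy clustering, and builds a fuzzy model whose accuracy can be further improved by gradient descent. Simulation results show that this neuro-fuzzy system has a simple structure, few rules, fast learning, and high modeling accuracy.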
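The partition entropy mentioned in this abstract is commonly defined, following Bezdek, as the negative average of u·ln(u) over all membership degrees. The sketch below uses that common definition; whether the paper uses exactly this form or a normalized variant is not stated in the abstract, so treat it as an assumption. The membership matrices are invented for illustration.

```python
from math import log


def partition_entropy(memberships):
    """Partition entropy of a fuzzy partition.

    `memberships` holds one row per sample; each row lists that sample's
    membership degrees across clusters and is assumed to sum to 1.
    The value is 0 for a crisp partition and ln(c) when every sample
    belongs equally to all c clusters; lower means a crisper partition.
    """
    n = len(memberships)
    # Terms with u == 0 contribute nothing (the limit of u*ln(u) is 0).
    return -sum(u * log(u) for row in memberships for u in row if u > 0) / n


crisp = [[1.0, 0.0], [0.0, 1.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5]]
print(partition_entropy(crisp), partition_entropy(fuzzy))
```

Selecting the candidate partition with the lowest entropy is one way to pick the number of rules, although partition entropy tends to decrease trivially as the number of clusters approaches 1, so it is usually compared over a bounded range of cluster counts.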

18.
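To improve the accuracy of clustering categorical datasets and its adaptability to a wide range of datasets, three kernel functions are introduced, and kernel K-means based on the mountain method is then used to cluster categorical data. The kernel functions map categorical data into a high-dimensional feature space, thereby giving categorical data, which lack a metric, the metric of numerical data. The improved methods were evaluated experimentally on several public datasets, and the results show that they are effective for clustering categorical data.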

19.
Identification of the correct number of clusters and the appropriate partitioning technique are important considerations in clustering, where several cluster validity indices, primarily utilizing the Euclidean distance, have been used in the literature. In this paper a new measure of connectivity is incorporated in the definitions of seven cluster validity indices, namely DB-index, Dunn-index, Generalized Dunn-index, PS-index, I-index, XB-index and SV-index, thereby yielding seven new cluster validity indices which are able to automatically detect clusters of any shape, size or convexity as long as they are well-separated. Here connectivity is measured using a novel approach following the concept of relative neighborhood graph. It is empirically established that incorporation of the property of connectivity significantly improves the capabilities of these indices in identifying the appropriate number of clusters. Two well-known clustering techniques, single linkage and K-means, are used as the underlying partitioning algorithms. Results on eight artificially generated and three real-life data sets show that the connectivity-based Dunn-index performs the best as compared to all the other six indices. Comparisons are made with the original versions of these seven cluster validity indices.
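As a point of reference for the indices listed in this abstract, the original Euclidean Dunn index can be sketched as the smallest distance between points of different clusters divided by the largest cluster diameter. This is the classical form only; the connectivity-based distance built on the relative neighborhood graph, which is the paper's contribution, is not reproduced here, and the function name is an assumption.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Classical Dunn index with Euclidean distance; higher is better.

    Assumes at least two clusters and at least one cluster with two or
    more distinct points (otherwise the diameter is undefined or zero).
    """
    # Smallest distance between any two points from different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between any two points inside the same cluster.
    max_diameter = max(
        dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]
print(dunn_index(good), dunn_index(bad))
```

Because both terms rely on straight-line distance, this form penalizes elongated or non-convex clusters, which is the limitation the connectivity measure in the abstract is meant to remove.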

20.
Based on the clonal selection mechanism of the immune system, a dynamic local search based immune automatic clustering algorithm (DLSIAC) is proposed to automatically evolve the number of clusters as well as a proper partition of datasets. The real-based antibody encoding consists of the activation thresholds and the clustering centers. Then, based on the special structures of chromosomes, a particular dynamic local search scheme is proposed to exploit the neighborhood of each antibody as much as possible so as to realize automatic variation of the antibody length during evolution. The dynamic local search scheme includes four basic operations, namely the external cluster swapping, the internal cluster swapping, the cluster addition and the cluster decrease. Moreover, a neighborhood structure based clonal mutation is adopted to further improve the performance of the algorithm. The proposed algorithm has been extensively compared with five state-of-the-art automatic clustering techniques over a suite of datasets. Experimental results indicate that DLSIAC is superior to the other five clustering algorithms in the optimum number of clusters found and in clustering accuracy. In addition, DLSIAC is applied to a real problem, namely image segmentation, with good performance.
