Similar documents
 20 similar documents found (search time: 78 ms)
1.
This study applies the K-means method, the fuzzy c-means clustering method and the bagged clustering algorithm to the analysis of customer value for an outfitter in Taipei, Taiwan. These three techniques share a similar philosophy for data classification, so it is of interest to know which of them performs best in a real-world case of evaluating customer value. Using cluster quality assessment, this study concludes that the bagged clustering algorithm outperforms the other two methods. Finally, the study suggests marketing strategies for each cluster based on the results generated by the bagged clustering technique.

2.
The self-organizing map (SOM) has been widely used in many industrial applications. Classical clustering methods based on the SOM often fail to deliver satisfactory results, especially when clusters have arbitrary shapes. In this paper, after applying preprocessing techniques to filter out noise and outliers, we propose a new two-level SOM-based clustering algorithm that uses a clustering validity index based on inter-cluster and intra-cluster density. Experimental results on synthetic and real data sets demonstrate that the proposed algorithm clusters data better than the classical SOM-based clustering algorithms and finds an optimal number of clusters.

3.
“Best K”: critical clustering structures in categorical datasets   Cited by: 2 (self-citations: 2, citations by others: 0)
The demand for cluster analysis of categorical data has continued to grow over the last decade. A well-known problem in categorical clustering is to determine the best number of clusters, K. Although several categorical clustering algorithms have been developed, surprisingly, none has satisfactorily addressed the problem of the best K for categorical clustering. Since categorical data do not have an inherent distance function to serve as the similarity measure, traditional cluster validation techniques based on geometric shapes and density distributions are not appropriate for categorical data. In this paper, we study the entropy property between the clustering results of categorical data with different numbers of clusters K, and propose the BKPlot method to address three important cluster validation problems: (1) How can we determine whether there is significant clustering structure in a categorical dataset? (2) If there is significant clustering structure, what is the set of candidate “best Ks”? (3) If the dataset is large, how can we efficiently and reliably determine the best Ks?
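To make the entropy-based view in this abstract more concrete, the following minimal sketch computes a size-weighted entropy for a partition of categorical records. It is an assumption about what a typical entropy criterion for categorical clustering looks like, not the BKPlot method itself; the names `column_entropy` and `partition_entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy (in bits) of one categorical column."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def partition_entropy(records, labels):
    """Size-weighted sum of per-attribute entropies over all clusters.

    Hypothetical criterion: lower values mean more homogeneous clusters.
    """
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    n = len(records)
    total = 0.0
    for members in clusters.values():
        # zip(*members) iterates over the attribute columns of the cluster.
        per_cluster = sum(column_entropy(col) for col in zip(*members))
        total += len(members) / n * per_cluster
    return total


records = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
print(partition_entropy(records, [0, 0, 1, 1]))  # two homogeneous clusters
print(partition_entropy(records, [0, 0, 0, 0]))  # one mixed cluster
```

Comparing this quantity across partitions with different K is one plausible way to look for a K beyond which the entropy stops improving noticeably.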

4.
Most well-known clustering methods based on distance measures, distance metrics and similarity functions share two main problems: they tend to get stuck in local optima, and their performance depends strongly on the initial values of the cluster centers. This paper presents a new approach to clustering that uses the bio-inspired Cuttlefish Algorithm (CFA) to search for the cluster centers that minimize the clustering metrics. Various UCI Machine Learning Repository datasets are used to test and evaluate the performance of the proposed method. For comparison, we have also analysed several algorithms, namely K-means, the Genetic Algorithm and the Particle Swarm Optimization (PSO) algorithm. The simulations and results demonstrate that the proposed CFA-Clustering method outperforms the counterpart algorithms in most cases. Therefore, the CFA can be considered an alternative stochastic method for solving clustering problems.

5.
Identifying the correct number of clusters and the appropriate partitioning technique are important considerations in clustering, for which several cluster validity indices, primarily utilizing the Euclidean distance, have been used in the literature. In this paper a new measure of connectivity is incorporated in the definitions of seven cluster validity indices, namely the DB-index, Dunn-index, Generalized Dunn-index, PS-index, I-index, XB-index and SV-index, thereby yielding seven new cluster validity indices that are able to automatically detect clusters of any shape, size or convexity as long as they are well-separated. Here connectivity is measured using a novel approach based on the concept of the relative neighborhood graph. It is empirically established that incorporating the property of connectivity significantly improves the capabilities of these indices in identifying the appropriate number of clusters. Two well-known clustering techniques, single linkage and K-means, are used as the underlying partitioning algorithms. Results on eight artificially generated and three real-life data sets show that the connectivity-based Dunn-index performs best compared with the other six indices. Comparisons are also made with the original versions of these seven cluster validity indices.
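For reference, the classical Euclidean Dunn index that this abstract takes as a starting point can be sketched as follows. This is the standard textbook form (smallest between-cluster distance divided by largest cluster diameter), not the connectivity-based variant proposed in the paper; the function name `dunn_index` is chosen here for illustration, and the sketch assumes at least two clusters.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(points, labels):
    """Classical Dunn index: min inter-cluster distance / max cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())

    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (dist(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(groups, 2)
        for p in a
        for q in b
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(dunn_index(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(dunn_index(points, [0, 1, 0, 1]))  # stretched, interleaved clusters
```

Larger values indicate compact, well-separated clusters, so the index can be compared across candidate values of K.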

6.
XML (eXtensible Markup Language) is a standard that is widely applied in data representation and data exchange. However, DTD (Document Type Definition), an important concept of XML, is not taken full advantage of in current applications. In this paper, a new method for clustering DTDs is presented, which can be used in XML document clustering. The two-level method clusters the elements in DTDs and clusters the DTDs themselves separately. Element clustering forms the first level and provides element clusters, which are generalizations of relevant elements. DTD clustering utilizes the generalized information and forms the second level of the whole clustering process. The two-level method has the following advantages: 1) it takes into consideration both the content and the structure within DTDs; 2) the generalized information about elements is more useful than the separate words in the vector model; 3) the two-level method facilitates the search for outliers. The experiments show that this method is able to categorize the relevant DTDs effectively.

7.
Clustering trajectory data discovers and visualizes available structure in the movement patterns of mobile objects and has numerous potential applications in traffic control, urban planning, astronomy, and animal science. In this paper, an automated technique for clustering trajectory data using a Particle Swarm Optimization (PSO) approach is proposed, with Dynamic Time Warping (DTW), one of the most commonly used distance measures for trajectory data, as the distance. The proposed technique is able to find a (near-)optimal number of clusters as well as (near-)optimal cluster centers during the clustering process. To reduce the dimensionality of the search space and improve the performance of the proposed method (in terms of a certain performance index), a Discrete Cosine Transform (DCT) representation of cluster centers is considered. The proposed method can use various cluster validity indexes as the objective function for optimization. Experimental results on both synthetic and real-world datasets indicate the superiority of the proposed technique over fuzzy C-means, fuzzy K-medoids, and two evolutionary clustering techniques proposed in the literature.
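The DTW distance mentioned in this abstract can be illustrated with the standard dynamic-programming recurrence. The sketch below is the textbook, unconstrained form applied to trajectories given as lists of coordinate tuples; it is not taken from the paper, and the function name `dtw_distance` is an illustrative choice.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two trajectories.

    a, b: sequences of equal-dimension coordinate tuples, possibly of
    different lengths. Uses the full O(len(a) * len(b)) table with no
    warping-window constraint.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = DTW cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats a point
                cost[i][j - 1],      # b advances, a repeats a point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


slow = [(0, 0), (1, 0), (1, 0), (2, 0)]  # same path, sampled unevenly
fast = [(0, 0), (1, 0), (2, 0)]
print(dtw_distance(slow, fast))
```

Because the alignment may repeat points, two trajectories that follow the same path at different speeds receive a small distance, which is why DTW is a common choice for trajectory data.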

8.
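Co-clustering is a class of algorithms that cluster the rows and the columns of a data matrix simultaneously. This paper introduces the idea of two-level weighting into co-clustering and proposes a two-level subspace weighting co-clustering algorithm (TLWCC). TLWCC assigns one level of weights to the co-clusters and a second level of weights to the rows and columns, and it computes these three sets of weights (for co-clusters, rows and columns) automatically during its iterations. TLWCC considers the distance between each co-cluster, row and column and the corresponding co-cluster, row and column center: the larger the distance, the noisier the element is considered to be and the smaller the weight it receives; conversely, the weaker the noise, the larger the weight. By giving small weights to noisy information, TLWCC effectively reduces the interference caused by noise and improves clustering quality. Four groups of experiments demonstrate TLWCC's ability to identify noisy information, the influence of parameter selection on the clustering results, and the algorithm's clustering performance and running time.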

9.
Recently, Fourier Transform Infrared (FTIR) spectroscopic imaging has been used as a tool to detect changes in cellular composition that may reflect the onset of a disease. This approach has been investigated as a means of monitoring changes in the biochemical composition of cells and of providing a diagnostic tool for various human cancers and other diseases. Discrimination between different types of tissue based on spectroscopic data is often achieved using various multivariate clustering techniques. However, the number of clusters is commonly unknown for clustering methods such as hierarchical cluster analysis, k-means and fuzzy c-means. In this study, we apply an FCM-based clustering algorithm to obtain the best number of clusters as given by the minimum validity index value. This often results in an excessive number of clusters being created, owing to the complexity of this biochemical system. A novel method to automatically merge clusters was developed to address this problem. Three lymph node tissue sections were examined to evaluate the new method. The results showed that this approach can merge clusters that have similar biochemistry. Consequently, the overall algorithm automatically identifies clusters that accurately match the main tissue types independently determined by the clinician.

10.
Web sites contain an ever-increasing amount of information within their pages. As the amount of information increases, so does the complexity of the structure of the web site. Consequently, it has become difficult for visitors to find the information relevant to their needs. To overcome this problem, various clustering methods have been proposed to cluster data in an effort to help visitors find the relevant information. These clustering methods have typically focused on either the content or the context of the web pages. In this paper we propose a method based on Kohonen’s self-organizing map (SOM) that utilizes both content and context mining clustering techniques to help visitors identify relevant information more quickly. The input of the content mining is the set of web pages of the web site, whereas the source of the context mining is the access logs of the web site. The SOM can be used to identify clusters of web sessions with similar context and also clusters of web pages with similar content. It can also provide a means of visualizing the outcome of this processing. We show how this two-level clustering can help visitors identify the relevant information faster. The procedure has been tested on the access logs and web pages of the Department of Informatics and Telecommunications of the University of Athens.

11.
Data clustering has been proven to be an effective method for discovering structure in medical datasets. The majority of clustering algorithms produce exclusive clusters, meaning that each sample can belong to one cluster only. However, most real-world medical datasets have inherently overlapping information, which could be best explained by overlapping clustering methods that allow one sample to belong to more than one cluster. One of the simplest and most efficient overlapping clustering methods is overlapping k-means (OKM), an extension of the traditional k-means algorithm. Being an extension of k-means, the OKM method also suffers from sensitivity to the initial cluster centroids. In this paper, we propose a hybrid method that combines the k-harmonic means and overlapping k-means algorithms (KHM-OKM) to overcome this limitation. The main idea behind the KHM-OKM method is to use the output of the KHM method to initialize the cluster centers of the OKM method. We have tested the proposed method using the FBCubed metric, which has been shown to be the most effective measure for evaluating overlapping clustering algorithms with regard to homogeneity, completeness, rag bag, and the cluster size-quantity tradeoff. According to results from ten publicly available medical datasets, the KHM-OKM algorithm outperforms the original OKM algorithm and can be used as an efficient method for clustering medical datasets.

12.
A similarity-based robust clustering method   Cited by: 6 (self-citations: 0, citations by others: 6)
This paper presents an alternating optimization clustering procedure called the similarity-based clustering method (SCM). It is an effective and robust approach to clustering on the basis of a total similarity objective function related to approximate density shape estimation. We show that the data points in SCM can self-organize a locally optimal cluster number and cluster volumes without using cluster validity functions or a variance-covariance matrix. The proposed clustering method is also robust to noise and outliers, as shown by the influence function and gross error sensitivity analysis. Therefore, SCM exhibits three robust clustering characteristics: 1) robustness to the initialization (cluster number and initial guesses), 2) robustness to cluster volumes (the ability to detect clusters of different volumes), and 3) robustness to noise and outliers. Several numerical data sets and actual data are used to demonstrate these properties of SCM. The computational complexity of SCM is also analyzed. Experimental comparisons of the proposed SCM with existing methods show the superiority of SCM.

13.
A Novel Density-Based Clustering Framework by Using Level Set Method   Cited by: 1 (self-citations: 0, citations by others: 1)
In this paper, a new density-based clustering framework is proposed by adopting the assumption that the cluster centers in data space can be regarded as target objects in image space. First, level set evolution is adopted to find an approximation of the cluster centers by using a new initial boundary formation scheme. Accordingly, three types of initial boundaries are defined so that each of them can evolve to approach the cluster centers in a different way. To avoid the long iteration time of level set evolution in data space, an efficient termination criterion is presented that stops the evolution process when no more cluster centers can be found. Then, a new effective density representation called level set density (LSD) is constructed from the evolution results. Finally, valley seeking clustering is used to group data points into the corresponding clusters based on the LSD. Experiments on synthetic and real data sets demonstrate the efficiency and effectiveness of the proposed clustering framework. Comparisons with the DBSCAN, OPTICS, and valley seeking clustering methods further show that the proposed framework can successfully avoid the overfitting phenomenon and resolve the confusion between cluster boundary points and outliers.

14.
The unprecedentedly large size and high dimensionality of existing geographic datasets make the complex patterns that potentially lurk in the data hard to find. Clustering is one of the most important techniques for geographic knowledge discovery. However, existing clustering methods have two severe drawbacks for this purpose. First, spatial clustering methods focus on the specific characteristics of distributions in 2- or 3-D space, while general-purpose high-dimensional clustering methods have limited power in recognizing spatial patterns that involve neighbors. Second, clustering methods in general are not geared toward allowing the human-computer interaction needed to effectively tease out complex patterns. In the current paper, an approach is proposed to open up the black box of the clustering process for easy understanding, steering, focusing and interpretation, and thus to support an effective exploration of large and high-dimensional geographic data. The proposed approach involves building a hierarchical spatial cluster structure within the high-dimensional feature space, and using this combined space for discovering multi-dimensional (combined spatial and non-spatial) patterns with efficient computational clustering methods and highly interactive visualization techniques. More specifically, this includes the integration of: (1) a hierarchical spatial clustering method to generate a 1-D spatial cluster ordering that preserves the hierarchical cluster structure, and (2) a density- and grid-based technique to effectively support the interactive identification of interesting subspaces and the subsequent search for clusters in each subspace. The proposed approach is implemented in a fully open and interactive manner supported by various visualization techniques.

15.
Grouping customer transactions into segments may help in understanding customers better. The marketing literature has concentrated on identifying important segmentation variables (e.g., customer loyalty) and on using cluster analysis and mixture models for segmentation. The data mining literature has provided various clustering algorithms for segmentation without focusing specifically on clustering customer transactions. Building on the notion that observable customer transactions are generated by latent behavioral traits, in this paper we investigate a pattern-based clustering approach to grouping customer transactions. We define an objective function that we maximize in order to achieve a good clustering of customer transactions and present an algorithm, GHIC, that groups customer transactions such that the itemsets generated from each cluster, while similar to each other, are different from those generated from other clusters. We present experimental results from user-centric Web usage data that demonstrate that GHIC generates a highly effective clustering of transactions.

16.
Pairwise-nearest-neighbor (PNN) is an effective data clustering method that generates good clustering results but has high computational complexity. The fast exact PNN (FPNN) algorithm proposed by Fränti et al. is an effective way to speed up PNN and generates the same clustering results as PNN. In this paper, we present a novel method to improve the FPNN algorithm. Our algorithm uses the property that the cluster distance increases as the cluster merge process proceeds and adopts a fast search algorithm to reject impossible candidate clusters. Experimental results show that the proposed method effectively reduces the number of distance calculations and the computation time of the FPNN algorithm. Compared with FPNN, the proposed approach reduces the computation time and the number of distance calculations by factors of 24.8 and 146.4, respectively, for the data set from three real images. Our method generates the same clustering results as those produced by PNN and FPNN.
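As background for this abstract, the baseline PNN procedure can be sketched as an exhaustive agglomerative loop: start with one cluster per point and repeatedly merge the pair with the smallest merge cost. The sketch assumes the size-weighted squared centroid distance that is commonly used as the PNN merge cost; it is the slow baseline only, not FPNN and not the accelerated search proposed in the paper, and the names `merge_cost` and `pnn` are hypothetical.

```python
from itertools import combinations


def merge_cost(na, ca, nb, cb):
    """Increase in squared error caused by merging two clusters.

    Assumed cost: na*nb/(na+nb) * squared distance between centroids.
    """
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq


def pnn(points, k):
    """Merge clusters pairwise until only k remain (exhaustive search)."""
    # Each cluster is stored as (size, centroid).
    clusters = [(1, tuple(float(x) for x in p)) for p in points]
    while len(clusters) > k:
        # Scan every pair and pick the cheapest merge.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(*clusters[ij[0]], *clusters[ij[1]]),
        )
        (na, ca), (nb, cb) = clusters[i], clusters[j]
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append((na + nb, centroid))
    return clusters


print(pnn([(0, 0), (0, 2), (10, 0), (10, 2)], 2))
```

Every iteration rescans all cluster pairs, which is the cost that fast variants aim to avoid by pruning candidate pairs without changing the merge order.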

17.
A mobile ad hoc network (MANET) is dynamic in nature and is composed of wirelessly connected nodes that perform hop-by-hop routing without the help of any fixed infrastructure. One of the important requirements of a MANET is energy efficiency, which increases the lifetime of the network. Researchers have proposed several techniques to achieve this goal; one of them is clustering, which can help provide an energy-efficient solution. Clustering involves the selection of a cluster-head (CH) for each cluster, and fewer CHs result in greater energy efficiency because these nodes drain more power than non-cluster-heads. In the literature, several clustering techniques based on optimization and evolutionary methods are available, but they provide a single solution at a time. In this paper, we propose a multi-objective solution that uses the multi-objective particle swarm optimization (MOPSO) algorithm to optimize the number of clusters in an ad hoc network as well as the energy dissipation in nodes, in order to provide an energy-efficient solution and reduce the network traffic. In the proposed solution, inter-cluster and intra-cluster traffic is managed by the cluster-heads. The proposed algorithm takes into consideration the degree of nodes, the transmission power, and the battery power consumption of the mobile nodes. The main advantage of this method is that it provides a set of solutions at a time, obtained through the optimal Pareto front. We compare the results of the proposed approach with two other well-known clustering techniques, WCA and CLPSO-based clustering, using different performance metrics. Extensive simulations show that the proposed approach is effective for clustering in mobile ad hoc network environments and performs better than the other two approaches.

18.
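GAO Quansheng (高全胜), HONG Bingrong (洪炳熔). Journal of Software (《软件学报》), 2007, 18(9): 2356-2364
Using motion capture data, a statistical model of virtual human motion is obtained by learning, in order to create realistic and controllable virtual human motion. A method is proposed in which the original motion data are clustered to extract local dynamic motion features, called dynamic textures, each of which is described by a linear dynamic system; linear dynamic systems that have a clear meaning are selectively annotated, and an annotated dynamic texture graph is constructed. With this statistical model, highly realistic and controllable virtual human motion can be generated. The results show that the method generates smooth, natural human motion in interactive environments.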

19.
Cluster ensembles in collaborative filtering recommendation   Cited by: 1 (self-citations: 0, citations by others: 1)
Recommender systems recommend items of information that are likely to be of interest to users and filter out less favored items. Collaborative filtering is a widely used recommendation technique. It is based on the assumption that people who share the same preferences on some items tend to share the same preferences on other items. Clustering techniques are commonly used for collaborative filtering recommendation. While cluster ensembles have been shown to outperform many single clustering techniques in the literature, their performance for recommendation has not been fully examined. Thus, the aim of this paper is to assess the applicability of cluster ensembles to collaborative filtering recommendation. In particular, two well-known clustering techniques (self-organizing maps (SOM) and k-means) and three ensemble methods (the cluster-based similarity partitioning algorithm (CSPA), the hypergraph partitioning algorithm (HGPA), and majority voting) are used. The experimental results on the Movielens dataset show that cluster ensembles can provide better recommendation performance than single clustering techniques in terms of recommendation accuracy and precision. In addition, there are no statistically significant differences among the three SOM ensembles or among the three k-means ensembles. Either the SOM or the k-means ensembles could be considered in the future as the baseline collaborative filtering technique.
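One building block of ensemble methods such as CSPA is a co-association (pairwise similarity) matrix, which records how often two objects are placed in the same cluster across the base clusterings. The sketch below computes only that matrix; the subsequent graph partitioning step of CSPA is omitted, and the function name `co_association` is an illustrative choice rather than anything from the paper.

```python
def co_association(labelings):
    """Fraction of base clusterings in which each pair of objects co-occurs.

    labelings: list of label lists, one per base clustering, all of the
    same length. Label values need not be aligned across clusterings,
    because only within-clustering equality is used.
    """
    runs = len(labelings)
    n = len(labelings[0])
    return [
        [sum(lab[i] == lab[j] for lab in labelings) / runs for j in range(n)]
        for i in range(n)
    ]


# Three hypothetical base clusterings of three users.
base = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
for row in co_association(base):
    print(row)
```

Because the matrix depends only on whether two objects share a label within each clustering, it sidesteps the label-alignment problem that a majority-voting ensemble has to solve first.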

20.
In this paper we explore the efficient clustering of market-basket data. Unlike traditional data, market-basket data are known to be high-dimensional and sparse. Because they do not explicitly consider the presence of the taxonomy, most prior efforts on clustering market-basket data can be viewed as dealing with items at the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is of great importance for marketing strategies as well as for representing the results of clustering techniques for market-basket data. In view of the features of market-basket data, we devise a novel measurement, called the category-based adherence, and use it to perform the clustering. With this measurement, we develop an efficient clustering algorithm, called algorithm k-todes, for market-basket data with the objective of minimizing the category-based adherence. The distance of an item to a given cluster is defined as the number of links between this item and its nearest tode. The category-based adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster. A validation model based on information gain is also devised to assess the quality of clustering for market-basket data. Experimental results on both real and synthetic datasets show that, with the taxonomy information, algorithm k-todes significantly outperforms prior work in both execution efficiency and clustering quality as measured by information gain, indicating the usefulness of category-based adherence in market-basket data clustering.
