Similar Documents (20 results)
1.
In this work we consider the spatial clustering problem with no a priori information. The number of clusters is unknown, and clusters may have arbitrary shapes and density differences. The proposed clustering methodology addresses several challenges of the clustering problem including solution evaluation, neighborhood construction, and data set reduction. In this context, we first introduce two objective functions, namely adjusted compactness and relative separation. Each objective function evaluates the clustering solution with respect to the local characteristics of the neighborhoods. This allows us to measure the quality of a wide range of clustering solutions without a priori information. Next, using the two objective functions we present a novel clustering methodology based on Ant Colony Optimization (ACO-C). ACO-C works in a multi-objective setting and yields a set of non-dominated solutions. ACO-C has two pre-processing steps: neighborhood construction and data set reduction. The former extracts the local characteristics of data points, whereas the latter is used for scalability. We compare the proposed methodology with other clustering approaches. The experimental results indicate that ACO-C outperforms the competing approaches. The multi-objective evaluation mechanism relative to the neighborhoods enhances the extraction of arbitrarily shaped clusters with density variations.
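As a rough illustration of the multi-objective evaluation described above, the sketch below keeps only the non-dominated (Pareto-optimal) clustering solutions among a set of candidates scored on two objectives. The score names and the assumption that both objectives are maximized are illustrative; they are not the paper's definitions of adjusted compactness and relative separation.

```python
# Keep only non-dominated clustering solutions in a two-objective setting.
# Assumption (illustrative): higher compactness and higher separation are both better.
def pareto_front(solutions):
    """solutions: list of dicts with 'compactness' and 'separation' scores."""
    front = []
    for s in solutions:
        dominated = any(
            o["compactness"] >= s["compactness"] and o["separation"] >= s["separation"]
            and (o["compactness"] > s["compactness"] or o["separation"] > s["separation"])
            for o in solutions
        )
        if not dominated:
            front.append(s)
    return front

candidates = [
    {"id": 1, "compactness": 0.82, "separation": 0.40},
    {"id": 2, "compactness": 0.75, "separation": 0.61},
    {"id": 3, "compactness": 0.70, "separation": 0.35},  # dominated by candidate 1
]
print([s["id"] for s in pareto_front(candidates)])  # -> [1, 2]
```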

2.
This paper presents a new k-means type algorithm for clustering high-dimensional objects in subspaces. In high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. For example, in text clustering, clusters of documents of different topics are categorized by different subsets of terms or keywords. The keywords for one cluster may not occur in the documents of other clusters. This is a data sparsity problem faced in clustering high-dimensional data. In the new algorithm, we extend the k-means clustering process to calculate a weight for each dimension in each cluster and use the weight values to identify the subsets of important dimensions that categorize different clusters. This is achieved by including the weight entropy in the objective function that is minimized in the k-means clustering process. An additional step is added to the k-means clustering process to automatically compute the weights of all dimensions in each cluster. The experiments on both synthetic and real data have shown that the new algorithm can generate better clustering results than other subspace clustering algorithms. The new algorithm is also scalable to large data sets.
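A minimal sketch of one iteration of an entropy-weighted k-means of the kind described above: per-cluster, per-dimension weights are derived from within-cluster dispersion and then used in the assignment step. The specific update formulas, the regularization parameter gamma, and the toy data are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def ew_kmeans_step(X, labels, k, gamma=1.0):
    """One assumed iteration: update centers, per-cluster dimension weights, then labels."""
    centers = np.vstack([X[labels == l].mean(axis=0) for l in range(k)])
    # Per-cluster, per-dimension dispersion (assumes no cluster becomes empty).
    D = np.vstack([((X[labels == l] - centers[l]) ** 2).sum(axis=0) for l in range(k)])
    # Entropy-regularized weights: compact dimensions get larger weight.
    W = np.exp(-(D - D.min(axis=1, keepdims=True)) / gamma)  # shift is only for numerical stability
    W /= W.sum(axis=1, keepdims=True)
    # Reassign points using the weighted distances.
    dist = np.stack([(W[l] * (X - centers[l]) ** 2).sum(axis=1) for l in range(k)], axis=1)
    return dist.argmin(axis=1), centers, W

# Two clusters that differ only in the first two dimensions; the rest is noise.
rng = np.random.default_rng(0)
informative = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
X = np.hstack([informative, rng.normal(0, 3, (100, 3))])
labels = rng.integers(0, 2, len(X))
for _ in range(10):
    labels, centers, W = ew_kmeans_step(X, labels, k=2, gamma=100.0)
print(np.round(W, 3))  # larger weights on the two informative dimensions
```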

3.
This paper reviews some frequently used methods to initialize a radial basis function (RBF) network and presents systematic design procedures for pre-processing unit(s) to initialize an RBF network from available input–output data sets. The pre-processing units are computationally hybrid two-step training algorithms, namely (1) construction of the initial structure and (2) coarse-tuning of the free parameters. In the first step, the number and the locations of the initial centers of the RBF network are determined; an orthogonal least squares algorithm and a modified counter propagation network can be employed for this purpose. In the second step, a coarse-tuning of the free parameters is achieved by using clustering procedures; the Gustafson–Kessel and the fuzzy C-means clustering methods are evaluated for the coarse-tuning. These first two steps behave like a pre-processing unit for the last stage (the fine-tuning stage, a gradient descent algorithm). The initialization ability of the four proposed pre-processing units (modular combinations of the existing methods) is compared on three non-linear benchmarks in terms of root mean square errors. Finally, the proposed hybrid pre-processing units can initialize a fairly accurate, IF–THEN-wise readable initial model automatically and efficiently with minimal user interference.

4.

Data clustering is an important unsupervised learning technique and has wide application in various fields including pattern recognition, data mining, image analysis and bioinformatics. A vast number of clustering algorithms have been proposed in the past decades. However, existing algorithms still face many problems in practical applications. One typical problem is parameter dependence, which means that user-specified parameters are required as input and the clustering results are influenced by these parameters. Another problem is that many algorithms are not able to generate clusters of non-spherical shapes. In this paper, a cluster merging method is proposed to solve the above-mentioned problems based on a decision threshold and the dominant sets algorithm. Firstly, the influence of the similarity parameter on dominant sets clustering results is studied, and it is found that the obtained clusters become larger as the similarity parameter increases. We analyze the reason behind this behavior and propose to generate small initial clusters in the first step and then merge the initial clusters to improve the clustering results. Specifically, we select a similarity parameter which generates small but not too small clusters. Then, we calculate pairwise merging decisions among the initial clusters and obtain a merging decision threshold. Based on this threshold, we merge the small clusters and obtain the final clustering results. Experiments on several datasets are used to validate the effectiveness of the proposed algorithm.


5.
In the identification of unknown protocols from bit streams, an improved agglomerative hierarchical clustering algorithm is proposed to address the problem of separating the captured multi-protocol data frames into single-protocol groups. Building on the idea of traditional agglomerative hierarchical clustering and the characteristics of bit-stream data frames, the algorithm defines similarity measures between data frames and between clusters, and extracts qualifying clusters while the clustering proceeds, so that data frames can be clustered quickly and effectively. The algorithm also determines the number of clusters automatically, and each resulting cluster carries a similarity evaluation index. Tests on the data sets released by Lincoln Laboratory show that the algorithm clusters protocol data frames with high accuracy.
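A minimal sketch of threshold-driven agglomerative clustering of bit-string frames in the spirit of the approach above. The frame similarity (fraction of matching bits over the shorter frame length), the average-linkage cluster similarity, and the merge threshold are all assumptions for illustration; the paper defines its own similarity measures.

```python
# Agglomerative clustering of bit-string frames with a similarity threshold.
def frame_similarity(a, b):
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n

def cluster_frames(frames, merge_threshold=0.9):
    clusters = [[f] for f in frames]
    while True:
        best, pair = merge_threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average-linkage similarity between two clusters.
                sim = sum(
                    frame_similarity(a, b) for a in clusters[i] for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if sim >= best:
                    best, pair = sim, (i, j)
        if pair is None:          # no pair above the threshold: stop merging
            return clusters       # the number of clusters falls out automatically
        i, j = pair
        clusters[i].extend(clusters.pop(j))

frames = ["101100111000", "101100111001", "000011110000", "000011110001"]
print(cluster_frames(frames))  # two clusters of two frames each
```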

6.
Improving session identification quality with image-type log records
范纯龙  姜宏飞  李华 《计算机应用》2010,30(4):1056-1058
Data pre-processing is the foundation of Web log mining, and session identification is the key step of the pre-processing; its quality strongly affects the results of Web log mining. After analyzing existing session identification methods, this paper proposes to reuse the image requests and other log records that are normally discarded during pre-processing and, combined with an extended Web graph structure, to improve session identification quality in two respects: page grouping rules and a path completion algorithm. Experiments confirm that the method is effective in improving session identification quality.
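A loose sketch of the idea of keeping image requests during session identification instead of discarding them: sessions are split on an inactivity timeout and each image request is attached to the page that referred it. The timeout value, the field names and the file-extension rule are assumptions for illustration, not the paper's page grouping or path completion rules.

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # assumed inactivity timeout
IMAGE_EXT = (".gif", ".jpg", ".png")      # assumed image request filter

def identify_sessions(log):
    """log: list of dicts with 'user', 'time' (datetime), 'url', 'referrer', sorted by time."""
    sessions = {}
    for r in log:
        user_sessions = sessions.setdefault(r["user"], [])
        if not user_sessions or r["time"] - user_sessions[-1]["end"] > SESSION_TIMEOUT:
            user_sessions.append({"pages": [], "end": r["time"]})
        s = user_sessions[-1]
        if r["url"].endswith(IMAGE_EXT):
            # Group the image under its referring page instead of discarding it.
            s["pages"].append({"page": r["referrer"], "image": r["url"]})
        else:
            s["pages"].append({"page": r["url"], "image": None})
        s["end"] = r["time"]
    return sessions
```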

7.

Text document clustering is used to separate a collection of documents into several clusters such that the documents within a cluster are highly similar, while the documents in one cluster are distinct from the documents in other clusters. The high-dimensional, sparse document-term matrix reduces the efficiency of the clustering process. This study proposes a new way of clustering documents using a domain ontology and the WordNet ontology. The main objective of this work is to increase the quality of the cluster output. This work investigates and examines the method of selecting feature dimensions to reduce the features of the document-term matrix. The sports documents are clustered using conventional K-Means with a dimensionality-reduction (DR) feature selection process and with density-based clustering. A novel approach named ontology-based document clustering is proposed for grouping the text documents. Three critical steps were used to develop this technique. The first step of the ontology-based clustering approach starts with data pre-processing, where the feature dimensions are reduced with Info-Gain based selection. The documents are clustered using two clustering methods, K-Means and density-based clustering, with the DR feature selection process. These methods validate the findings of ontology-based clustering, and this study compares them using measurement metrics. The second step examines the development of the sports-domain ontology and describes the principles and relationships of the terms using sports-related documents. A Semantic Web reasoning process is used to test the ontology for validation purposes. An algorithm for synonym retrieval of the sports domain ontology terms has been proposed and implemented. The terms retrieved from the documents and the sport ontology concepts are mapped to the synonym set words retrieved from the WordNet ontology. The suggested technique is based on the synonyms of the mapped concepts. The proposed ontology approach employs the reduced feature set to cluster the text documents. The results are compared with two traditional approaches on two datasets. The proposed ontology-based clustering approach is found to be effective in clustering the documents with high precision, recall, and accuracy. In addition, this study also compares the different RDF serialization formats for the sports ontology.
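As a small illustration of the synonym-retrieval step described above, the sketch below maps sports terms to WordNet synonym sets with NLTK. This assumes the nltk package and the WordNet corpus (via nltk.download('wordnet')) are available; the example terms are illustrative and this is not the paper's algorithm.

```python
from nltk.corpus import wordnet as wn

def synonym_set(term):
    """Collect all lemma names from the WordNet synsets of a term."""
    synonyms = set()
    for synset in wn.synsets(term):
        synonyms.update(lemma.replace("_", " ") for lemma in synset.lemma_names())
    return synonyms

for term in ["football", "goal", "referee"]:
    print(term, "->", sorted(synonym_set(term))[:5])
```

In a pipeline of this kind, the retrieved synonym sets would then be matched against the ontology concepts before the reduced feature set is passed to K-Means or density-based clustering.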


8.
Data clustering is a very well studied problem in machine learning, data mining, and related disciplines. Most of the existing clustering methods have focused on optimizing a single clustering objective. However, several recent application areas, such as robot team deployment, ad hoc networks, multi-agent systems, and facility location, need to consider multiple, often conflicting, criteria during clustering. Motivated by this, in this paper we propose a sequential game theoretic approach for multi-objective clustering, called ClusSMOG-II. It is specially designed to simultaneously optimize intrinsically conflicting objectives, namely inter-cluster/intra-cluster inertia and connectivity. This technique has the advantage of keeping the number of clusters dynamic. The approach consists of three main steps. The first step sets initial clusters with their representatives, whereas the second step calculates the correct number of clusters by resolving a sequence of multi-objective multi-act sequential two-player games for conflict-clusters. Finally, the third step constructs homogeneous clusters by resolving a sequential two-player game between each cluster representative and the representative of its nearest neighbor. For each game, we define payoff functions that correspond to the model objectives. We use a methodology based on backward induction to calculate a pure Nash equilibrium for each game. Experimental results confirm the effectiveness of the proposed approach over state-of-the-art clustering algorithms.

9.
Dynamic clustering using combinatorial particle swarm optimization
Combinatorial Particle Swarm Optimization (CPSO) is a relatively recent technique for solving combinatorial optimization problems. CPSO has been used in different applications, e.g., partitional clustering and project scheduling problems, and it has shown very good performance. In the partitional clustering problem, CPSO needs the number of clusters to be determined in advance. However, in many clustering problems, the correct number of clusters is unknown and usually impossible to estimate. In this paper, an improved version, called CPSOII, is proposed as a dynamic clustering algorithm, which automatically finds the best number of clusters and simultaneously categorizes data objects. CPSOII uses a renumbering procedure as a preprocessing step and several extended PSO operators to increase population diversity and remove redundant particles. Using the renumbering procedure increases the diversity of the population, the speed of convergence and the quality of solutions. For performance evaluation, we have examined CPSOII using both artificial and real data. Experimental results show that CPSOII is very effective, robust and can solve clustering problems successfully with both known and unknown numbers of clusters. Comparing the results obtained from CPSOII with CPSO and other clustering techniques such as KCPSO, CGA and K-means reveals that CPSOII yields promising results. For example, it improves the value of the DBI criterion by 9.26 % for the Hepato data set.
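The renumbering idea mentioned above can be illustrated with a canonical relabeling of a particle's cluster-label vector: two particles that encode the same partition with permuted labels become identical, and one of them can be dropped as redundant. This is a generic sketch of such a procedure, not CPSOII's exact operator.

```python
# Canonical renumbering of a cluster-label vector: labels are renamed in
# first-appearance order, so label-permuted encodings of the same partition coincide.
def renumber(labels):
    mapping, canonical = {}, []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
        canonical.append(mapping[lab])
    return canonical

print(renumber([2, 2, 0, 1]))  # [0, 0, 1, 2]
print(renumber([0, 0, 1, 2]))  # [0, 0, 1, 2] -> same partition, redundant particle
```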

10.
This paper presents a general approach for the identification of objects in procedural programs. The approach is based on neural architectures that perform unsupervised learning of clusters. We describe two such neural architectures, explain how to use them to identify objects in software systems and briefly describe a prototype tool which implements the clustering algorithms. With the aid of several examples, we explain how our approach can identify abstract data types as well as groups of routines which reference a common set of data. The clustering results are compared to the results of many other object identification techniques. Finally, several case studies were performed on existing programs to evaluate the object identification approach. Results concerning two representative programs and their generated clusters are discussed.

11.
Clustering, as an unsupervised learning method, usually requires the number of clusters to be supplied by the user. When prior knowledge is lacking, specifying clustering parameters manually is unrealistic. Cluster validity indices, studied in recent years, are used to estimate the number of clusters and the quality of a clustering. This paper proposes a new clustering algorithm based on a validity index that needs no clustering parameters. At each step, the algorithm merges the two clusters whose merger increases the validity index the most or decreases it the least. A gravity model is used to measure similarity, and possible outliers are handled with a homogenization treatment. Experiments show that the algorithm correctly discovers the number of clusters in the test data, yields clustering results with a lower error rate than other clustering methods, and outperforms typical validity-index-based clustering algorithms in efficiency.
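A rough sketch of the greedy merge loop described above. Here the standard silhouette score from scikit-learn stands in for the validity index (the paper uses its own gravity-model-based index), and the starting point is an over-segmented k-means partition; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def merge_by_validity(X, labels, min_clusters=2):
    """Repeatedly merge the pair of clusters that keeps the validity index highest."""
    labels = np.asarray(labels).copy()
    best_labels, best_score = labels.copy(), silhouette_score(X, labels)
    while len(np.unique(labels)) > min_clusters:
        ids = np.unique(labels)
        candidates = []
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                trial = np.where(labels == b, a, labels)   # merge cluster b into a
                candidates.append((silhouette_score(X, trial), trial))
        score, labels = max(candidates, key=lambda c: c[0])  # best (or least bad) merge
        if score > best_score:
            best_score, best_labels = score, labels.copy()
    return best_labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
over_segmented = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
print(len(np.unique(merge_by_validity(X, over_segmented))))  # ideally 2
```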

12.
This article describes a clustering technique that can automatically detect any number of well-separated clusters which may be of any shape, convex and/or non-convex. This is in contrast to most other techniques which assume a value for the number of clusters and/or a particular cluster structure. The proposed technique is based on an iterative partitioning of the relative neighborhood graph, coupled with a post-processing step for merging small clusters. Techniques for improving the efficiency of the proposed scheme are implemented. The clustering scheme is able to detect outliers in data. It is also able to indicate the inherent hierarchical nature of the clusters present in a data set. Moreover, the proposed technique is also able to identify the situation when the data do not have any natural clusters at all. Results demonstrating the effectiveness of the clustering scheme are provided for several data sets.
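For reference, the relative neighborhood graph underlying the technique above connects two points exactly when no third point is closer to both of them than they are to each other. The brute-force construction below illustrates that definition; it is O(n^3) and only meant as a sketch, not the paper's partitioning algorithm.

```python
import numpy as np
from scipy.spatial.distance import cdist

def relative_neighborhood_graph(X):
    """Edge (a, b) iff no point c satisfies max(d(a, c), d(b, c)) < d(a, b)."""
    D = cdist(X, X)
    n = len(X)
    edges = []
    for a in range(n):
        for b in range(a + 1, n):
            lune_empty = all(
                max(D[a, c], D[b, c]) >= D[a, b]
                for c in range(n) if c not in (a, b)
            )
            if lune_empty:
                edges.append((a, b))
    return edges

X = np.array([[0, 0], [1, 0], [2, 0], [10, 0]])
print(relative_neighborhood_graph(X))  # [(0, 1), (1, 2), (2, 3)]
```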

13.
Clustering provides a knowledge acquisition method for intelligent systems. This paper proposes a novel data-clustering algorithm that combines a new initialization technique, the K-means algorithm and a new gradual data transformation approach to provide more accurate clustering results than the K-means algorithm and its variants by increasing the clusters' coherence. The proposed data transformation approach solves the problem of generating empty clusters, which frequently occurs with other clustering algorithms. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed to determine the number of clusters. Several different data sets are used to evaluate the efficacy of the proposed method in dealing with the empty cluster generation problem, as well as its accuracy and computational performance in comparison with other K-means based initialization techniques and clustering methods. The developed estimation method for determining the number of clusters is also evaluated and compared with other estimation algorithms. The significance of the proposed method lies in addressing the limitations of K-means based clustering and improving clustering accuracy, an important concern in the field of data mining and expert systems. Applying the proposed method to knowledge acquisition in time series data such as wind, solar, electric load and stock market data provides a pre-processing tool for selecting the most appropriate data to feed into neural networks or other estimators used for forecasting such time series. In addition, using the knowledge discovered by the proposed K-means clustering to develop rule-based expert systems is one of the main impacts of the proposed method.
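As a simple stand-in for the cluster-count estimation described above, the sketch below projects the data with PCA and picks the number of clusters that maximizes the standard silhouette score. The paper uses a modified silhouette algorithm; the toy data, the k range and the PCA dimensionality here are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def estimate_k(X, k_max=10, n_components=2):
    """Return the k in [2, k_max] with the best silhouette score on PCA-reduced data."""
    Z = PCA(n_components=n_components).fit_transform(X)
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
        scores[k] = silhouette_score(Z, labels)
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (60, 4)) for c in (0, 3, 6)])
best_k, _ = estimate_k(X)
print(best_k)  # expected: 3 for three well-separated blobs
```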

14.

Graphs are commonly used to express relationships in various kinds of data. When the data are uncertain, we obtain probabilistic graphs. As a fundamental problem on such graphs, clustering has many applications in analyzing uncertain data. In this paper, we propose a novel method based on ensemble clustering for large probabilistic graphs. To generate the ensemble clusters, we sample a set of probable possible worlds of the initial probabilistic graph. Then, we present a probabilistic co-association matrix as a consensus function to integrate the base clustering results. It relies on co-occurrences of node pairs, based on the probabilities of the corresponding common cluster graphs. We also apply two improvements, one before and one after ensemble generation. In the step before, we append neighborhood information based on node features to the initial graph to obtain a more accurate estimate of the probability between nodes. In the step after, we use a supervised, metric-learning-based Mahalanobis distance to automatically learn a metric from the ensemble clusters, aiming to capture the crucial features of the base clustering results. We evaluate our work using five real-world datasets and three clustering evaluation metrics, namely the Dunn index, the Davies–Bouldin index, and the Silhouette coefficient. The results show the strong performance of the method in clustering large probabilistic graphs.
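A small sketch of a probability-weighted co-association consensus matrix in the spirit of the description above: entry (i, j) accumulates, over the sampled possible worlds, the normalized probability of the worlds in which nodes i and j end up in the same base cluster. The toy labelings and probabilities are illustrative.

```python
import numpy as np

def co_association(base_labelings, world_probs):
    """base_labelings: one label vector per sampled possible world; world_probs: their probabilities."""
    n = len(base_labelings[0])
    C = np.zeros((n, n))
    total = sum(world_probs)
    for labels, p in zip(base_labelings, world_probs):
        labels = np.asarray(labels)
        same_cluster = (labels[:, None] == labels[None, :]).astype(float)
        C += (p / total) * same_cluster
    return C

worlds = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]   # base clusterings of 4 nodes
probs = [0.5, 0.3, 0.2]                                # possible-world probabilities
print(np.round(co_association(worlds, probs), 2))
```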


15.
As the first major step in any object-oriented feature extraction approach, segmentation plays an essential role as a preliminary stage towards further and higher levels of image processing. The primary objective of this paper is to illustrate the potential of Polarimetric Synthetic Aperture Radar (PolSAR) features extracted from Compact Polarimetry (CP) SAR data for image segmentation using Markov Random Fields (MRF). The proposed method takes advantage of both spectral and spatial information to segment the CP SAR data. In the first step of the proposed method, k-means clustering is applied to over-segment the image using appropriate features optimally selected by a Genetic Algorithm (GA). As a similarity criterion in each cluster, a probabilistic distance is used for an agglomerative hierarchical merging of small clusters into an appropriate number of larger clusters. In the agglomerative clustering approach, the estimate of the appropriate number of clusters obtained with the data log-likelihood algorithm differs depending on the distance criterion used in the algorithm. In particular, the Wishart Chernoff distance, which is independent of the samples (pixels), tends to yield a larger appropriate number of clusters than the Wishart test statistic distance, because the Wishart Chernoff distance preserves detailed data information corresponding to small clusters. The probabilistic distance used in this study is the Wishart Chernoff distance, which evaluates the similarity of clusters by measuring the distance between their complex Wishart probability density functions. The output of this step, the initial segmentation of the image, is fed to a Markov Random Field model to improve the final segmentation using vicinity information. The method combines Wishart clustering and the enhanced initial clusters to form the posterior MRF energy function. The contextual image classifier adopts the Iterated Conditional Modes (ICM) approach to converge to a local minimum, representing a good trade-off between segmentation accuracy and computational burden. The results show that the PolSAR features extracted from the CP mode can provide acceptable overall segmentation accuracy when compared with full polarimetry (FP) and dual polarimetry (DP) data. Moreover, the results indicate that the proposed algorithm is superior to existing image segmentation techniques in terms of segmentation accuracy.

16.
Mining geo-tagged social photo media has recently received a great deal of attention from researchers. Points of interest (POI) mining from a collection of geo-tagged photos is one of these problems. POI mining refers to the processes of pattern recognition (namely clustering), extraction and semantic annotation. However, with unsupervised clustering methods, many POIs might not be mined. Additionally, properly annotating the resulting clusters semantically is a great challenge. In practice, many applications, such as POI recommendation, require accurate semantic annotation and high-quality pattern recognition. In this paper, we study POI mining from a collection of geo-tagged photos in combination with proper semantic annotation, using additional POI information from high-coverage external POI databases. We propose a novel POI mining framework based on two-level clustering, using random walk and constrained clustering. In the random walk clustering step, we separate a large-scale collection of geo-tagged photos into many clusters. In the constrained clustering step, we further divide the clusters that include many POIs into sub-clusters, where the geo-tagged photos in a sub-cluster are associated with a particular POI. Experimental results on two datasets of geo-tagged Flickr photos from two cities in California, USA, show that the proposed method substantially outperforms existing approaches adapted to handle the problem.

17.
This paper presents an efficient approach for automatic speaker identification based on cepstral features and the Normalized Pitch Frequency (NPF). Most relevant speaker identification methods adopt a cepstral strategy. Including the pitch frequency as an additional feature in the speaker identification process is expected to enhance the identification accuracy. In the proposed framework for speaker identification, a neural classifier with a single hidden layer is used. Different transform domains are investigated for reliable feature extraction from the speech signal. Moreover, a noise reduction pre-processing step is used prior to the feature extraction process to enhance the performance of the speaker identification system. Simulation results show that using the NPF as a feature enhances the performance of the speaker identification system, especially with the Discrete Cosine Transform (DCT) and the wavelet denoising pre-processing step.

18.
Although a lot of clustering algorithms are available in the literature, existing algorithms are usually afflicted by practical problems of one form or another, including parameter dependence and the inability to generate clusters of arbitrary shapes. In this paper we aim to solve these two problems by merging the merits of dominant sets and density-based clustering algorithms. We first apply histogram equalization to eliminate the parameter dependence problem of the dominant sets algorithm. Since the clusters obtained this way are usually smaller than the real ones, a density-threshold-based cluster growing step is then used to improve the clustering results, where the involved parameters are determined from the initial clusters. This is followed by a second cluster growing step which makes use of the density relationship between neighboring data points. Data clustering experiments and comparisons with other algorithms validate the effectiveness of the proposed algorithm.
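One way to read the histogram equalization step above is as a rank transform of the pairwise similarities, which removes the dependence on the scale of the original similarity function. The sketch below equalizes a symmetric similarity matrix this way; it is an interpretation for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def equalize_similarities(S):
    """Replace each off-diagonal similarity by its normalized rank (approximately uniform)."""
    n = S.shape[0]
    iu = np.triu_indices(n, k=1)
    vals = S[iu]
    ranks = vals.argsort().argsort().astype(float)
    equalized = ranks / (len(vals) - 1) if len(vals) > 1 else ranks
    E = np.zeros_like(S)
    E[iu] = equalized
    E = E + E.T
    np.fill_diagonal(E, 1.0)
    return E

S = np.array([[1.0, 0.9, 0.2], [0.9, 1.0, 0.4], [0.2, 0.4, 1.0]])
print(equalize_similarities(S))
```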

19.
Person identification in TV broadcasts is one of the main tools for indexing this type of video. The classical way is to use biometric face and speaker models but, to cover a decent number of persons, costly annotations are needed. In recent years, several works have proposed using other sources of names for identifying people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, both to name clusters and to prevent the fusion of two clusters named differently. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Then agglomerative clustering is performed on this matrix with the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2 % for speaker identification and 60.2 % for face identification. Adding a few biometric models improves the results and leads to 82.4 % and 65.6 % for speaker and face identification, respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data and additional TV and radio data provides a 67.8 % F-measure, while 908 face models provide only a 30.5 % F-measure.
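The constraint described above can be sketched as an agglomerative loop that never merges two clusters carrying different written names and propagates a name when an unnamed cluster is merged with a named one. The distance matrix, the average linkage, the stopping threshold and the data layout below are assumptions for illustration, not the paper's actual diarization system.

```python
import numpy as np

def constrained_agglomerative(dist, names, threshold):
    """dist: pairwise distance matrix between tracks; names[i]: written name or None."""
    clusters = [{"items": [i], "name": names[i]} for i in range(len(names))]

    def cluster_dist(a, b):  # average linkage over the precomputed track distances
        return np.mean([dist[i, j] for i in a["items"] for j in b["items"]])

    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if a["name"] and b["name"] and a["name"] != b["name"]:
                    continue  # cannot-link: differently named clusters stay separate
                d = cluster_dist(a, b)
                if d < best:
                    best, pair = d, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        a, b = clusters[i], clusters.pop(j)
        a["items"] += b["items"]
        a["name"] = a["name"] or b["name"]  # propagate the written name to the merged cluster
```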

20.
Model-based approaches, and in particular finite mixture models, are widely used for data clustering, which is a crucial step in several applications of practical importance. Indeed, many pattern recognition, computer vision and image processing applications can be approached as feature space clustering problems. For complex high-dimensional data, however, the use of these approaches presents several challenges, such as the presence of many irrelevant features, which may affect the speed and compromise the accuracy of the learning algorithm used. Another problem is the presence of outliers, which potentially influence the resulting model's parameters. For this purpose, we propose and discuss an algorithm that partitions a given data set without a priori information about the number of clusters, the saliency of the features or the number of outliers. We illustrate the performance of our approach using different applications involving synthetic data, real data and object shape clustering.
