Similar Documents
20 similar documents retrieved; search time 114 ms
1.
In this paper, we propose a word sense learning algorithm capable of unsupervised feature selection and cluster number identification. Feature selection for word sense learning is built on an entropy-based filter and formalized as a constrained optimization problem whose output is a set of important features. Cluster number identification is built on a Gaussian mixture model with an MDL-based criterion, and the optimal model order is inferred by minimizing the criterion. To evaluate closeness between the learned sense clusters and the ground-truth classes, we introduce a weighted F-measure that models the effort needed to reconstruct the classes from the clusters. Experiments show that the algorithm retrieves important features, roughly estimates the class numbers automatically, and outperforms other algorithms in terms of the weighted F-measure. In addition, we apply the algorithm to the task of adding new words to a Chinese thesaurus.
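The model-order selection step described above can be sketched with a Gaussian mixture and BIC, a penalized-likelihood criterion closely related to MDL; this is an illustration of the idea (using scikit-learn and synthetic data), not the authors' exact criterion or features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_n_clusters(X, max_k=6, seed=0):
    """Fit GMMs of increasing order and return the order minimizing
    BIC, a penalized-likelihood criterion closely related to MDL."""
    bics = [GaussianMixture(n_components=k, random_state=seed).fit(X).bic(X)
            for k in range(1, max_k + 1)]
    return int(np.argmin(bics)) + 1

# Two well-separated blobs: the criterion should recover 2 components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
print(estimate_n_clusters(X))
```

Larger orders fit the data slightly better but pay a complexity penalty, so the minimum of the criterion lands at the true number of components.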

2.
An evaluation of four clustering methods and four external criterion measures was conducted with respect to the effect of the number of clusters, dimensionality, and relative cluster sizes on the recovery of true cluster structure. The four methods were the single link, complete link, group average (UPGMA), and Ward's minimum variance algorithms. The results indicated that the four criterion measures were generally consistent with each other, among which two highly similar pairs were identified. The first pair consisted of the Rand and corrected Rand statistics, and the second pair consisted of the Jaccard and the Fowlkes and Mallows indexes. With respect to the methods, recovery was found to improve as the number of clusters increased and as the number of dimensions increased. The relative cluster size factor produced differential performance effects, with Ward's procedure providing the best recovery when the clusters were of equal size. The group average method gave equivalent or better recovery when the clusters were of unequal size.
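The external criterion measures compared above are all pair-counting indexes: they classify every pair of points by whether the two labelings agree on it. A minimal self-contained sketch (the corrected Rand statistic is omitted for brevity):

```python
from itertools import combinations

def pair_counts(a, b):
    """Classify every point pair by whether each labeling puts it
    in the same cluster."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            ss += 1        # same cluster in both labelings
        elif same_a:
            sd += 1        # same in a, different in b
        elif same_b:
            ds += 1        # different in a, same in b
        else:
            dd += 1        # different in both
    return ss, sd, ds, dd

def rand_index(a, b):
    ss, sd, ds, dd = pair_counts(a, b)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(a, b):
    ss, sd, ds, _ = pair_counts(a, b)
    return ss / (ss + sd + ds)

def fowlkes_mallows(a, b):
    ss, sd, ds, _ = pair_counts(a, b)
    return ss / ((ss + sd) * (ss + ds)) ** 0.5

truth = [0, 0, 0, 1, 1, 1]
pred  = [0, 0, 1, 1, 1, 1]
print(rand_index(truth, pred), jaccard_index(truth, pred),
      fowlkes_mallows(truth, pred))
```

The Rand index counts joint agreements (ss + dd), while Jaccard and Fowlkes-Mallows ignore the dd pairs, which is why those two behave as a similar pair in the study above.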

3.
Multiple resolution segmentation of textured images (total citations: 15, self-citations: 0, citations by others: 15)
A multiple resolution algorithm is presented for segmenting images into regions with differing statistical behavior. In addition, an algorithm is developed for determining the number of statistically distinct regions in an image and estimating the parameters of those regions. Both algorithms use a causal Gaussian autoregressive model to describe the mean, variance, and spatial correlation of the image textures. Together, the algorithms can be used to perform unsupervised texture segmentation. The multiple resolution segmentation algorithm first segments images at coarse resolution and then progresses to finer resolutions until individual pixels are classified. This method results in accurate segmentations and requires significantly less computation than some previously known methods. The field containing the classification of each pixel in the image is modeled as a Markov random field. Segmentation at each resolution is then performed by maximizing the a posteriori probability of this field subject to the resolution constraint. At each resolution, the a posteriori probability is maximized by a deterministic greedy algorithm which iteratively chooses the classification of individual pixels or pixel blocks. The unsupervised parameter estimation algorithm determines both the number of textures and their parameters by minimizing a global criterion based on the AIC information criterion. Clusters corresponding to the individual textures are formed by alternately estimating the cluster parameters and repartitioning the data into those clusters. Concurrently, the number of distinct textures is estimated by combining clusters until a minimum of the criterion is reached.

4.
A common procedure for evaluating hierarchical cluster techniques is to compare the input data, in terms of for example a matrix of similarities or dissimilarities, with the output hierarchy expressed in matrix form. If an ordinary product-moment correlation is used for this comparison, the technique is known as that of cophenetic correlations, frequently used by numerical taxonomists. A high correlation between the input similarities and the output dendrogram has been regarded as a criterion of a successful classification. This paper contains a Monte Carlo study of the characteristics of the cophenetic correlation and a related measure of agreement, both of which have been interpreted in terms of generalized variance for some different hierarchical cluster algorithms. The generalized variance criterion chosen for this study is Wilks' lambda, whose sampling distribution under the null hypothesis of identical group centroids is used in this context to define the degree of separation between clusters. Thus, a probabilistic approach is introduced into the evaluation procedure. With the above definition of presence of clusters, use of the cophenetic correlation and related measures of agreement as criteria of goodness-of-fit is shown to be quite misleading in most cases. This is due to their large variability for low separation of clusters.
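The cophenetic correlation itself is straightforward to compute with SciPy; a minimal sketch on synthetic data (note the study's point stands: a high value alone does not certify that real cluster structure is present):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Two tight, well-separated groups: the input dissimilarities and the
# dendrogram agree closely, so the cophenetic correlation is high.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(4, 0.1, (20, 2))])
d = pdist(X)                      # condensed input dissimilarity matrix
Z = linkage(d, method='average')  # UPGMA hierarchy
c, coph_d = cophenet(Z, d)        # correlation of d with dendrogram distances
print(round(float(c), 3))
```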

5.
Conventional clustering ensemble algorithms employ a set of primary results; each result includes a set of clusters that emerge from the data. Given a large number of available clusters, one is faced with the following questions: (a) can we obtain the same quality of results with a smaller number of clusters instead of the full ensemble? (b) If so, which subset of clusters is most efficient to use in the ensemble? In this paper, we answer these two questions. We explore a clustering ensemble approach combined with a cluster stability criterion and a dataset simplicity criterion to discover the finest subset of base clusters for each dataset. We also propose a novel method to accumulate the selected clusters and extract the final partitioning. Although one would expect performance to decrease as the size of the ensemble is reduced, our experimental results show that our selection mechanism generally leads to superior results.

6.
We propose a semi-supervised agglomerative hierarchical clustering algorithm based on pairwise constraints. The main purpose of applying semi-supervised processing to the clusters is to improve learning performance on otherwise unsupervised data through the judicious use of supervision information about the samples. With current techniques, semi-supervised cluster analysis commonly relies on constraint relations built on must-link and cannot-link pairs during sample clustering. From this viewpoint, to make the expression of inter-cluster distance relations more accurate, the pairwise constraints are applied jointly to adjust and optimize the distances between clusters.
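The must-link/cannot-link machinery can be illustrated with a toy single-link agglomeration: must-link pairs are merged up front, and any merge that would join a cannot-link pair is skipped. This is a hypothetical sketch of the constraint mechanics, not the algorithm proposed in the abstract:

```python
import numpy as np

def constrained_agglomerative(X, n_clusters, must_link=(), cannot_link=()):
    """Toy semi-supervised single-link agglomeration (illustrative only)."""
    clusters = [{i} for i in range(len(X))]
    for a, b in must_link:                       # enforce must-link first
        ca = next(c for c in clusters if a in c)
        cb = next(c for c in clusters if b in c)
        if ca is not cb:
            clusters.remove(cb)
            ca |= cb
    def dist(c1, c2):                            # single-link distance
        return min(np.linalg.norm(X[i] - X[j]) for i in c1 for j in c2)
    def violates(c1, c2):                        # would a merge break cannot-link?
        return any((a in c1 and b in c2) or (a in c2 and b in c1)
                   for a, b in cannot_link)
    while len(clusters) > n_clusters:
        pairs = [(dist(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters)
                 if i < j and not violates(c1, c2)]
        if not pairs:
            break                                # no feasible merge left
        _, i, j = min(pairs)
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

X = np.array([[0.0, 0], [0.1, 0], [1.0, 0], [1.1, 0]])
print(constrained_agglomerative(X, 2, cannot_link=[(0, 1)]))
```

Without the constraint, points 0 and 1 would be merged first; the cannot-link pair forces the clustering into a different, supervision-consistent partition.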

7.
In current speaker clustering, the speech segments produced by speaker segmentation are taken as the initial classes, and directly clustering this large number of segments is computationally very expensive. To reduce the computation in speaker clustering, we propose a method for generating the initial classes. We extract the feature parameters of the segmented speech and the centroids of those parameters, and combine hierarchical clustering with the Bayesian information criterion to "pre-cluster" the segments under a relaxed stopping criterion, producing the initial classes. Compared with directly clustering the segments produced by speaker segmentation, the method reduces computation time by 40.04% while maintaining the original clustering performance, and by more than 60.03% when a slight degradation in clustering performance is acceptable.

8.
We analyze mathematical aspects of one of the fundamental data analysis problems consisting in the search (selection) for the subset with the largest number of similar elements among a collection of objects. In particular, the problem appears in connection with the analysis of data in the form of time series (discrete signals). One of the problems in modeling this challenge is considered, namely, the problem of finding the cluster of the largest size (cardinality) in a 2-partition of a finite sequence of points in Euclidean space into two clusters (subsequences) under two constraints. The first constraint is on the choice of the indices of elements included in the clusters. This constraint simulates the set of time-admissible configurations of similar elements in the observed discrete signal. The second constraint is imposed on the value of the quadratic clustering function. This constraint simulates the level of intracluster proximity of objects. The clustering function under the second constraint is the sum (over both clusters) of the intracluster sums of squared distances between the cluster elements and its center. The center of one of the clusters is unknown and defined as the centroid (the arithmetic mean over all elements of this cluster). The center of the other cluster is the origin. Under the first constraint, the difference between any two subsequent indices of elements contained in a cluster with an unknown center is bounded above and below by some constants. It is established in the paper that the optimization problem under consideration, which models one of the simplest significant problems of data analysis, is strongly NP-hard. We propose an exact algorithm for the case of a problem with integer coordinates of its input points. If the dimension of the space is bounded by a constant, then the algorithm is pseudopolynomial.

9.
A new cluster isolation criterion based on dissimilarity increments (total citations: 3, self-citations: 0, citations by others: 3)
This paper addresses the problem of cluster defining criteria by proposing a model-based characterization of interpattern relationships. Taking a dissimilarity matrix between patterns as the basic measure for extracting group structure, dissimilarity increments between neighboring patterns within a cluster are analyzed. Empirical evidence suggests modeling the statistical distribution of these increments by an exponential density; we propose to use this statistical model, which characterizes context, to derive a new cluster isolation criterion. The integration of this criterion in a hierarchical agglomerative clustering framework produces a partitioning of the data, while exhibiting data interrelationships in terms of a dendrogram-type graph. The analysis of the criterion is undertaken through a set of examples, showing the versatility of the method in identifying clusters with arbitrary shape and size; the number of clusters is intrinsically found without requiring ad hoc specification of design parameters or engaging in a computationally demanding optimization procedure.
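The flavor of an increments-based isolation rule can be sketched as follows: if successive merge distances grow smoothly, keep merging; flag the first increment that is implausibly large relative to the increments seen so far. The threshold form below (a fixed multiple of the running mean) is a hypothetical simplification, not the paper's exponential-density test:

```python
import numpy as np

def increment_cut(merge_dists, alpha=3.0):
    """Flag the first merge whose dissimilarity increment exceeds
    alpha times the mean of the previous increments (toy rule in the
    spirit of the increments criterion; alpha is a made-up knob)."""
    inc = np.diff(np.sort(merge_dists))
    for k in range(1, len(inc)):
        mean_prev = inc[:k].mean()
        if mean_prev > 0 and inc[k] > alpha * mean_prev:
            return k + 1   # index of the isolating merge
    return None            # no abnormal increment found

# Smooth small increments followed by a jump: the jump is flagged.
print(increment_cut([0.10, 0.12, 0.13, 0.15, 0.90]))
```

Because the rule adapts to the local scale of increments, it needs no global distance threshold, which is what lets such criteria find clusters of arbitrary shape and size.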

10.
This paper examines a method of clustering within a fully decentralized multi-agent system. Our goal is to group agents with similar objectives or data, as is done in traditional clustering. However, we add the additional constraint that agents must remain in place on a network, instead of first being collected into a centralized database. To do this, we connect agents in a random overlay network and have them search in a peer-to-peer fashion for other similar agents. We thus aim to tackle the basic clustering problem on an Internet scale, and create a method by which agents themselves can be grouped, forming coalitions. In order to investigate the feasibility of this decentralized approach, this paper presents simulation experiments that look into the quality of the clusters discovered. First, the clusters found by the agent method are compared to those created by k-means clustering for two-dimensional spatial data points. Results show that the decentralized agent method produces a better clustering than the centralized k-means algorithm, placing 95% to 99% of points correctly. A further experiment explores how agents can be used to cluster a straightforward text document set, demonstrating that agents can discover clusters and keywords that are reasonable estimates of those identified by the central word vector space approach.
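The centralized baseline in the comparison above is plain k-means; a self-contained sketch of that baseline and of the "fraction of points placed correctly" measurement on 2D blobs (a generic reconstruction, not the paper's experimental setup):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: the centralized baseline the
    decentralized agent method is measured against."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
labels = kmeans(X, 2)
# Fraction of points grouped with their own blob, up to a label swap.
acc = max(np.mean(np.concatenate([labels[:50] == 0, labels[50:] == 1])),
          np.mean(np.concatenate([labels[:50] == 1, labels[50:] == 0])))
print(acc)
```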

11.
To fully exploit the cooperative advantages of a robot swarm and overcome the limited capability of a single robot, we design a swarm motion control system using the theory of PDE (partial differential equation) constraints. The communication network among the robots in the swarm is extended, and the robots' sensors, motion controllers, and drive motors are upgraded. Supported by this hardware, a mathematical model of the robot swarm is established from its mechanical structure and its kinematic and dynamic working principles. Motion tasks are assigned to the swarm, the swarm's formation paths are planned with partial differential equations, and the planned paths are set as constraints on the swarm's motion. Control quantities are computed for both position and attitude angle, and swarm motion control is realized through the generation and conversion of control commands. System tests lead to the following conclusion: compared with a traditional motion control system, under the optimized system the swarm's motion tracking control error is 13 cm and the average number of collisions during swarm motion is 15, both markedly reduced, i.e., the optimized system achieves good control performance.

12.
In this article, we address the problem of automatic constraint selection to improve the performance of constraint-based clustering algorithms. To this aim we propose a novel active learning algorithm that relies on a k-nearest neighbors graph and a new constraint utility function to generate queries to the human expert. This mechanism is paired with propagation and refinement processes that limit the number of constraint candidates and introduce a minimal diversity in the proposed constraints. Existing constraint selection heuristics are based on a random selection or on a min–max criterion and thus are either inefficient or more adapted to spherical clusters. Contrary to these approaches, our method is designed to be beneficial for all constraint-based clustering algorithms. Comparative experiments conducted on real datasets and with two distinct representative constraint-based clustering algorithms show that our approach significantly improves clustering quality while minimizing the number of human expert solicitations.

13.
Fuzzy clustering algorithms are becoming the major technique in cluster analysis. In this paper, we consider fuzzy clustering based on objective functions. These can be divided into two categories, possibilistic and probabilistic approaches, leading to two different function families depending on the conditions required to state that the fuzzy clusters form a fuzzy c-partition of the input data. Recently, we presented in Menard and Eboueya (Fuzzy Sets and Systems, 27, to be published) an axiomatic derivation of the Possibilistic and Maximum Entropy Inference (MEI) clustering approaches, based upon a unifying principle of physics, that of extreme physical information (EPI) defined by Frieden (Physics from Fisher information, A unification, Cambridge University Press, Cambridge, 1999). Here, using the same formalism, we explicitly give a new criterion in order to provide a theoretical justification of the objective functions, constraint terms, membership functions, and weighting exponent m used in probabilistic and possibilistic fuzzy clustering. Moreover, we propose a unified framework including the two procedures. This approach is inspired by the work of Frieden and Plastino and Plastino and Miller (Physics A 235, 577) extending the principle of extremal information in the framework of non-extensive thermostatistics. Then, we show how, with the help of EPI, one can propose extensions of the FCM and possibilistic algorithms.
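The probabilistic family discussed above is exemplified by standard fuzzy c-means, in which each point's memberships across the c clusters sum to one (the possibilistic family drops exactly this constraint). A minimal FCM sketch on synthetic data:

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means with weighting exponent m: alternate
    between weighted-mean center updates and the membership update
    that enforces the probabilistic (sum-to-one) constraint."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m                                     # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted means
        d = np.linalg.norm(X[:, None] - centers, axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)       # sum-to-one constraint
    return U, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(4, 0.1, (20, 2))])
U, centers = fcm(X, 2)
print(round(float(U.max(axis=1).min()), 3))
```

On well-separated data every point ends up with a dominant membership near one; in a possibilistic variant the normalization step would be replaced by a typicality function that does not force the memberships to sum to one.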

14.
On the basis of cluster size and cluster cohesion, we propose a generalized cluster-reliability (CR) measure, which indicates the overall reliability of the arguments in a cluster. Taking the reliability of clusters as order-inducing variables, we introduce a generalized cluster-reliability-induced ordered weighted averaging (CRI-OWA) operator from the viewpoint of combining representative arguments of clusters. Furthermore, we propose a grid-based cohesion measure for grid-based clusters. On the basis of this cohesion measure, we obtain the special CR measure and CRI-OWA operator for grid-based clusters. We then introduce two other special CR measures for graph-based and prototype-based clusters, respectively. Taking the CR computed by these two measures as order-inducing variables, we obtain two other kinds of CRI-OWA operators for graph-based and prototype-based clusters, respectively. © 2012 Wiley Periodicals, Inc.

15.
Generally, abnormal points (noise and outliers) cause cluster analysis to produce low accuracy, especially in fuzzy clustering. Such data not only join clusters but also pull the centroids away from their true positions. Traditional fuzzy clustering such as Fuzzy C-Means (FCM) always assigns data to all clusters, which is not reasonable in some circumstances. By reformulating the objective function as an exponential equation, the algorithm aggressively selects data into the clusters. However, noisy data and outliers cannot be properly handled by the clustering process; they are forced into a cluster because of the general probabilistic constraint that the sum of membership degrees across all clusters is one. To address this weakness, the possibilistic approach relaxes this condition to improve membership assignment. Nevertheless, possibilistic clustering algorithms generally suffer from coincident clusters because their membership equations ignore the distance to other clusters. Although there are possibilistic clustering approaches that do not generate coincident clusters, most require the right combination of multiple parameters to work. In this paper, we theoretically study Possibilistic Exponential Fuzzy Clustering (PXFCM), which integrates the possibilistic approach with exponential fuzzy clustering. PXFCM has only one parameter and not only partitions the data but also filters noisy data or detects them as outliers. Comprehensive experiments show that PXFCM produces high accuracy in both clustering results and outlier detection without generating coincident clusters.

16.
H.T. Toivonen 《Automatica》1983,19(4):415-418
A self-tuning regulator for a variance-constrained optimal control problem is given. The control criterion is to minimize the stationary variance of the output. In cases where the minimum-variance regulator requires excessively large control signals, an inequality constraint on the input variance is introduced. In practice it is easier to select a constraint on the input variance than to choose the relative weights in a quadratic loss function. The self-tuning regulator applies the Robbins-Monro scheme to adjust the Lagrange multiplier of the variance-constrained control problem. The behaviour of the algorithm is illustrated by a simulated example, and the asymptotic behaviour of the regulator is studied using a set of associated ordinary differential equations.
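The Robbins-Monro adjustment of the Lagrange multiplier can be illustrated on a toy plant: the multiplier is nudged by the noisy constraint error at each step with a decreasing gain. The plant model v(lam) = 1/(1 + lam) and the gain 4/k are invented for illustration; they stand in for the closed-loop input variance of the paper's system, which is not specified here:

```python
import numpy as np

def tune_lambda(target, iters=2000, seed=0):
    """Robbins-Monro tuning of a Lagrange multiplier lam so that a
    noisily measured variance v(lam) meets the constraint v = target.
    Toy plant: v(lam) = 1/(1 + lam) + measurement noise."""
    rng = np.random.default_rng(seed)
    lam = 0.0
    for k in range(1, iters + 1):
        v = 1.0 / (1.0 + lam) + 0.05 * rng.standard_normal()  # noisy measurement
        lam = max(0.0, lam + (4.0 / k) * (v - target))        # stochastic step
    return lam

# For target 0.5 the toy plant's root is lam = 1.
print(tune_lambda(0.5))
```

The 1/k gain sequence satisfies the Robbins-Monro conditions (divergent sum, convergent sum of squares), which is what drives the multiplier to the constraint-satisfying value despite the measurement noise.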

17.
Automatic network clustering is an important technique for mining the meaningful communities (or clusters) of a network. Communities in a network are clusters of nodes with high intra-cluster connection density and low inter-cluster connection density. The most popular scheme of automatic network clustering partitions all the nodes into clusters by maximizing a criterion function known as modularity. However, modularity is known to suffer from the resolution limit problem, which remains an open challenge. In this paper, automatic network clustering is formulated as a constrained optimization problem: maximizing a criterion function subject to a density constraint. With this scheme, the resulting algorithm is free from the resolution limit problem. Furthermore, the density constraint is found to improve the detection accuracy of modularity optimization. The efficiency of the proposed scheme is verified by comparative experiments on large-scale benchmark networks.
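The modularity criterion referred to above compares the observed intra-community edge weight with the expectation under a degree-preserving random graph. A minimal sketch of computing Newman modularity Q from an adjacency matrix (the density constraint itself is not reproduced here):

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q of a partition of an undirected graph:
    Q = (1/2m) * sum_ij [A_ij - k_i * k_j / (2m)] * delta(c_i, c_j)."""
    m = A.sum() / 2.0                              # number of edges
    k = A.sum(axis=1)                              # node degrees
    same = labels[:, None] == labels[None, :]      # same-community mask
    return ((A - np.outer(k, k) / (2 * m)) * same).sum() / (2 * m)

# Two triangles joined by a single bridge edge: the natural 2-way split.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))
```

The resolution limit arises because the k_i*k_j/(2m) null term shrinks as the network grows, which can make merging small, well-defined communities look profitable; the density constraint in the abstract is one way to rule such merges out.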

18.
In some applications of industrial robots, the robot manipulator must traverse a pre-specified Cartesian path with its hand tip while the links of the robot move safely among obstacles cluttered in the robot's scene (environment). In order to reduce the cost of collision detection, one approach is to reduce the number of collision checks by enclosing a few real obstacles with a larger (artificial) bounding volume (a cluster), e.g., by their convex hull [4, 14], without cutting the specified path. In this paper, we propose a recursive algorithm composed of four procedures to tackle the problem of clustering convex polygons cluttered around a specified path in a dynamic environment. A key observation is that the number k of clusters is actually determined by the specified path, not by any criterion used in clustering. Based on this fact, an initial set of k clusters can be rapidly generated. The initial set of clusters and its number are then refined to satisfy the minimum Euclidean distance criterion imposed in clustering. Compared to the heuristic algorithm in [14], the complexity of the proposed algorithm is reduced by one order with respect to the number n of obstacles. Simulations are performed in both static and dynamic environments, showing that the recursive algorithm is very efficient and yields a smaller number k of clusters.

19.
Clustering is a popular technique for analyzing microarray data sets with n genes and m experimental conditions. As explored by biologists, there is a real need to identify coregulated gene clusters, which include both positively and negatively regulated gene clusters. Existing pattern-based and tendency-based clustering approaches cannot directly be applied to find such coregulated gene clusters, because they are designed for finding positively regulated gene clusters. In this paper, in order to cluster coregulated genes, we propose a coding scheme that allows us to place two genes in the same cluster if they have the same code, where two genes with the same code can be either positively or negatively regulated. Based on the coding scheme, we propose a new algorithm with new pruning techniques for finding maximal subspace coregulated gene clusters. A maximal subspace coregulated gene cluster groups a set of genes on a condition sequence such that the cluster is not included in any other subspace coregulated gene cluster. We conduct extensive experimental studies. Our approach effectively and efficiently finds maximal subspace coregulated gene clusters, and it outperforms existing approaches for finding positively regulated gene clusters.

20.
Cluster validity indexes are important tools designed for two purposes: comparing the performance of clustering algorithms and determining the number of clusters that best fits the data. These indexes are generally constructed by combining a measure of compactness with a measure of separation. A classical measure of compactness is the variance. For separation, the distance between cluster centers is typically used. However, such a distance does not always reflect the quality of the partition between clusters and sometimes gives misleading results. In this paper, we propose a new cluster validity index in which Jeffrey divergence is used to measure separation between clusters. Experiments are conducted on different types of data, and comparison with widely used cluster validity indexes demonstrates that the proposed index outperforms them.
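Jeffrey divergence is the symmetrized Kullback-Leibler divergence; used as a separation measure it compares the fitted cluster densities rather than only their centers. A sketch for univariate Gaussian cluster models (an illustration of the separation term only, not the paper's full index):

```python
import math

def kl_gauss(m0, s0, m1, s1):
    """KL divergence KL(N(m0,s0^2) || N(m1,s1^2)) between two
    univariate Gaussians."""
    return math.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

def jeffrey_gauss(m0, s0, m1, s1):
    """Jeffrey divergence = KL(p||q) + KL(q||p), used here as a
    cluster separation measure."""
    return kl_gauss(m0, s0, m1, s1) + kl_gauss(m1, s1, m0, s0)

# Separation grows as the fitted cluster densities move apart.
near = jeffrey_gauss(0.0, 1.0, 1.0, 1.0)
far = jeffrey_gauss(0.0, 1.0, 5.0, 1.0)
print(near, far)
```

Unlike a plain center-to-center distance, this measure also responds to the spreads s0 and s1, so two wide, overlapping clusters score as less separated than two narrow ones with the same centers.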

