1.

An agglomerative clustering algorithm using a dynamic k-nearest-neighbor list

Jim Z.C. Lai 《Information Sciences》2011,181(9):1722-3410

In this paper, a new algorithm is developed to reduce the computational complexity of Ward’s method. The proposed approach uses a dynamic k-nearest-neighbor list to avoid the determination of a cluster’s nearest neighbor at some steps of the cluster merge. The double linked algorithm (DLA) can significantly reduce the computing time of the fast pairwise nearest neighbor (FPNN) algorithm by obtaining an approximate solution of hierarchical agglomerative clustering. In this paper, we propose a method to resolve the problem of a non-optimal solution for DLA while keeping the corresponding advantage of low computational complexity. The computational complexity of the proposed method DKNNA + FS (dynamic k-nearest-neighbor algorithm with a fast search) in terms of the number of distance calculations is O(N²), where N is the number of data points. Compared to FPNN with a fast search (FPNN + FS), the proposed method using the same fast search algorithm (DKNNA + FS) can reduce the computing time by a factor of 1.90-2.18 for the data set from a real image. In comparison with FPNN + FS, DKNNA + FS can reduce the computing time by a factor of 1.92-2.02 using the data set generated from three images. Compared to DLA with a fast search (DLA + FS), DKNNA + FS can decrease the average mean square error by 1.26% for the same data set.
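For orientation, the baseline that the abstract above sets out to accelerate can be sketched as follows. This is a minimal full-search version of Ward-style agglomerative clustering, assuming squared Euclidean distance; it does not implement the dynamic k-nearest-neighbor list, DLA, or the fast search, and the names `merge_cost` and `ward_cluster` are hypothetical.

```python
def merge_cost(na, ca, nb, cb):
    """Increase in total squared error caused by merging two clusters
    with sizes na, nb and centroids ca, cb (Ward's criterion)."""
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq


def ward_cluster(points, k):
    """Repeatedly merge the cheapest pair until k clusters remain.

    Full search over all pairs at every step (the slow baseline);
    returns a list of member-index lists.
    """
    # Each cluster is (size, centroid, member indices).
    clusters = [(1, tuple(float(v) for v in p), [i]) for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i][0], clusters[i][1],
                                  clusters[j][0], clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        na, ca, ma = clusters[i]
        nb, cb, mb = clusters[j]
        n = na + nb
        centroid = tuple((na * x + nb * y) / n for x, y in zip(ca, cb))
        clusters[i] = (n, centroid, ma + mb)
        del clusters[j]  # j > i, so index i is unaffected
    return [members for _, _, members in clusters]
```

Each merge step here rescans every pair of clusters, which is the cost that nearest-neighbor bookkeeping schemes such as the one in the abstract aim to avoid.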

2.

Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing

Hisashi Koga Tetsuo Ishibashi Toshinori Watanabe 《Knowledge and Information Systems》2007,12(1):25-53

The single linkage method is a fundamental agglomerative hierarchical clustering algorithm. This algorithm regards each point as a single cluster initially. In the agglomeration step, it connects a pair of clusters such that the distance between the nearest members is the shortest. This step is repeated until only one cluster remains. The single linkage method can efficiently detect clusters in arbitrary shapes. However, a drawback of this method is a large time complexity of O(n²), where n represents the number of data points. This time complexity makes this method infeasible for large data. This paper proposes a fast approximation algorithm for the single linkage method. Our algorithm reduces the time complexity to O(nB) by rapidly finding the near clusters to be connected by Locality-Sensitive Hashing, a fast algorithm for the approximate nearest neighbor search. Here, B represents the maximum number of points going into a single hash entry and it practically diminishes to a small constant as compared to n for sufficiently large hash tables. Experimentally, we show that (1) the proposed algorithm obtains clustering results similar to those obtained by the single linkage method and (2) it runs faster for large data than the single linkage method.

Hisashi Koga received the M.S. and Ph.D. degree in information science in 1995 and 2002, respectively, from the University of Tokyo. From 1995 to 2003, he worked as a researcher at Fujitsu Laboratories Ltd. Since 2003, he has been a faculty member at the University of Electro-Communications, Tokyo (Japan). Currently, he is an associate professor at the Graduate School of Information Systems, University of Electro-Communications. His research interest includes various kinds of algorithms such as clustering algorithms, on-line algorithms, and algorithms in network communications. Tetsuo Ishibashi received the M.E. degree in information systems design from the Graduate School of Information Systems at the University of Electro-Communications in 2004. Presently, he is a system engineer at Fujitsu Broad Solution & Consulting Inc. Toshinori Watanabe received the B.E. degree in aeronautical engineering in 1971 and the D.E. degree in 1985, both from the University of Tokyo. In 1971, he worked at Hitachi as a researcher in the field of information systems design. His experience includes demand forecasting, inventory and production management, VLSI design automation, knowledge-based nonlinear optimizer, and a case-based evolutionary learning system nicknamed TAMPOPO. He also engaged in FGCS (Fifth Generation Computer System) project of Japan and developed a new hierarchical message-passing parallel cooperative VLSI layout problem solver that ran on PIM (Parallel Inference Machine) in 1991. Since 1992, he has been a professor at the Graduate School of Information Systems, University of Electro-Communications, Tokyo, Japan. His areas of interest include media analysis, learning intelligence, and the semantics of information systems. He is a member of the IEEE.
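The exact single linkage procedure described in the abstract for this entry (merge the two clusters whose nearest members are closest, and repeat) can be sketched with a sorted pair list and a union-find structure. This is the slow exact baseline, not the Locality-Sensitive Hashing approximation; the function name `single_linkage` and the stopping parameter `k` are assumptions made for illustration.

```python
import math
from itertools import combinations


def single_linkage(points, k):
    """Exact single linkage: join the closest pair of points that lie in
    different clusters until only k clusters remain.

    Returns one cluster label per point (labels are arbitrary root ids).
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Enumerating and sorting all pairs is the quadratic bottleneck.
    pairs = sorted(
        combinations(range(len(points)), 2),
        key=lambda ij: math.dist(points[ij[0]], points[ij[1]]),
    )
    remaining = len(points)
    for i, j in pairs:
        if remaining <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            remaining -= 1
    return [find(i) for i in range(len(points))]
```

An approximate method would replace the exhaustive pair list with candidate pairs drawn from shared hash buckets, which is where the O(nB) bound quoted in the abstract comes from.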

3.

Fast agglomerative clustering using a k-nearest neighbor graph

Fränti P Virmajoki O Hautamäki V 《IEEE transactions on pattern analysis and machine intelligence》2006,28(11):1875-1881

We propose a fast agglomerative clustering method using an approximate nearest neighbor graph for reducing the number of distance calculations. The time complexity of the algorithm is improved from O(τN²) to O(τN log N) at the cost of a slight increase in distortion; here, τ denotes the number of nearest neighbor updates required at each iteration. According to the experiments, a relatively small neighborhood size is sufficient to maintain the quality close to that of the full search.

4.

Applying agglomerative hierarchical clustering algorithms to component identification for legacy systems

Jian Feng Cui Heung Seok Chae 《Information and Software Technology》2011,53(6):601-614

Context

Component identification, the process of evolving a legacy system into a finely organized component-based software system, is a critical part of software reengineering. Many component identification approaches have been developed based on agglomerative hierarchical clustering algorithms. However, there has been no thorough investigation of which algorithm is appropriate for component identification.

Objective

This paper focuses on analyzing agglomerative hierarchical clustering algorithms in software reengineering, and then identifying their respective strengths and weaknesses in order to apply them effectively for future practical applications.

Method

A series of experiments were conducted for 18 clustering strategies combined according to various similarity measures, weighting schemes and linkage methods. Eleven subject systems with different application domains and source code sizes were used in the experiments. The component identification results are evaluated by the proposed size, coupling and cohesion criteria.

Results

The experimental results suggested that the employed similarity measures, weighting schemes and linkage methods can have various effects on component identification results with respect to the proposed size, coupling and cohesion criteria, so the hierarchical clustering algorithms produced quite different clustering results.

Conclusions

According to the experimental results, it can be concluded that it is difficult to produce perfectly satisfactory results for a given clustering algorithm. Nevertheless, these algorithms demonstrated varied capabilities to identify components with respect to the proposed size, coupling and cohesion criteria.

5.

An accelerated K-means clustering algorithm using selection and erasure rules

Suiang-Shyan LEE Ja-Chen LIN 《Journal of Zhejiang University: Series C, English edition (浙江大学学报:C卷英文版)》2012,(10):761-768

The K-means method is a well-known clustering algorithm with an extensive range of applications, such as biological classification, disease analysis, data mining, and image compression. However, the plain K-means method is not fast when the number of clusters or the number of data points becomes large. A modified K-means algorithm was presented by Fahim et al. (2006). The modified algorithm produced clusters whose mean square error was very similar to that of the plain K-means, but the execution time was shorter. In this study, we try to further increase its speed. There are two rules in our method: a selection rule, used to acquire a good candidate as the initial center to be checked, and an erasure rule, used to delete one or many unqualified centers each time a specified condition is satisfied. Our clustering results are identical to those of Fahim et al. (2006). However, our method further cuts computation time when the number of clusters increases. The mathematical reasoning used in our design is included.

6.

Paraphrase Extraction using fuzzy hierarchical clustering

《Applied Soft Computing》2015


7.

Hierarchical agglomerative clustering procedure

Alena Lukasov 《Pattern recognition》1979,11(5-6):365-381

In this paper the hierarchical agglomerative clustering procedure with the dissimilarity coefficient D [HACP, D] and the definite hierarchical clustering procedure [DHACP, D] including some of the published hierarchical clustering methods are introduced. The formal definitions of both of them are given, by means of which the properties of these procedures can be investigated. An example of applying the procedure concerns the classification in paleobiology.

8.

Comparison of a nearest neighbor and other approaches to the detection of space-time clustering

Timothy L. McAuliffe A.A. Afifi 《Computational statistics & data analysis》1984,2(2):125-142

A new method based on nearest neighbor (NN) distances is proposed for testing whether space-time clustering exists in a series of occurrences of an event. Unlike the methods of Knox and Mantel, the NN procedure does not require specifying arbitrary constants for the calculations of the test statistic. Based on the results of a simulation experiment, the NN test procedure is recommended because of its control of the significance level and its power. If appropriate constants can be found for the other two methods, they would also do relatively well. The three methods are applied to a real data example from a public education program.

9.

A semi-supervised agglomerative hierarchical clustering algorithm based on pairwise constraints

Sheng Junjie Xie Licong 《Microcomputer & Its Applications (微型机与应用)》2012,31(24):67-69

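Semi-supervised clustering uses supervisory information about the samples to improve the performance of unsupervised learning. In semi-supervised clustering, pairwise constraints (must-link and cannot-link constraints) are widely used as prior knowledge about the samples. Agglomerative hierarchical clustering (AHC) is one kind of hierarchical clustering. This paper proposes a semi-supervised agglomerative hierarchical clustering algorithm based on pairwise constraints (PS-AHC), which uses the pairwise constraints to adjust the distances between clusters so that those distances better reflect the true structure. Experiments on UCI data sets show that PS-AHC effectively improves clustering accuracy and is a promising semi-supervised clustering algorithm.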

10.

Extraction of major object features using VQ clustering for content-based image retrieval

Hun-Woo Yoo She-Hwan Jung Dong-Sik Jang Yoon-Kyoon Na 《Pattern recognition》2002,35(5):1115-1126

An image representation method using vector quantization (VQ) on color and texture is proposed in this paper. The proposed method is also used to retrieve similar images from database systems. The basic idea is a transformation from the raw pixel data to a small set of image regions, which are coherent in color and texture space. A scheme is provided for object-based image retrieval. Features for image retrieval are the three color features (hue, saturation, and value) from the HSV color model and five textural features (ASM, contrast, correlation, variance, and entropy) from the gray-level co-occurrence matrices. Once the features are extracted from an image, eight-dimensional feature vectors represent each pixel in the image. The VQ algorithm is used to rapidly cluster those feature vectors into groups. A representative feature table based on the dominant groups is obtained and used to retrieve similar images according to the object within the image. This method can retrieve similar images even in cases where objects are translated, scaled, and rotated.

11.

Fuzzy clustering algorithms based on resolution and their application in image compression

Xiangwei Kong Renying Wang Guoping Li 《Pattern recognition》2002,35(11):2439-2444

This paper presents the idea of clustering resolution. On the basis of this idea, fuzzy clustering algorithms based on resolution are derived, which naturally comprise a set of clustering algorithms. The c-means algorithm and fuzzy c-means algorithms are thus special cases within this set. As an application to codebook design in image compression based on vector quantization, fuzzy clustering algorithms based on multiresolution are developed, which are superior to conventional algorithms in almost all aspects.

12.

An efficient prototype merging strategy for the condensed 1-NN rule through class-conditional hierarchical clustering

R. A. F. J. E. 《Pattern recognition》2002,35(12):2771-2782

A generalized prototype-based classification scheme founded on hierarchical clustering is proposed. The basic idea is to obtain a condensed 1-NN classification rule by merging the two same-class nearest clusters, provided that the set of cluster representatives correctly classifies all the original points. Apart from the quality of the obtained sets and its flexibility which comes from the fact that different intercluster measures and criteria can be used, the proposed scheme includes a very efficient four-stage procedure which conveniently exploits geometric cluster properties to decide about each possible merge. Empirical results demonstrate the merits of the proposed algorithm taking into account the size of the condensed sets of prototypes, the accuracy of the corresponding condensed 1-NN classification rule and the computing time.

13.

A novel encoding algorithm for vector quantization using transformed codebook

Jim Z.C. Lai 《Pattern recognition》2009,42(11):3065-3070

In this paper, a novel encoding algorithm for vector quantization is presented. Our method uses a set of transformed codewords and partial distortion rejection to determine the reproduction vector of an input vector. Experimental results show that our algorithm is superior to other methods in terms of the computing time and number of distance calculations. Compared with available approaches, our method can reduce the computing time and number of distance calculations significantly. Compared with the available best method of reducing the number of distance computations, our approach can reduce the number of distance calculations by 32.3-67.1%. Compared with the best encoding algorithm for vector quantization, our method can also further reduce the computing time by 19.7-23.9%. The performance of our method is better when a larger codebook is used and is weakly correlated to codebook size.

14.

A fast fuzzy C-means clustering algorithm based on spatial distance

Wang Junling Wang Shitong Bao Fang Zhou Jianlin 《Computer Engineering and Applications (计算机工程与应用)》2015,(1):177-183,188

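When used for image segmentation, the traditional fuzzy C-means clustering algorithm is highly sensitive to isolated points and noise, and its clustering time grows rapidly with image size. To address these shortcomings, the SFGFCM algorithm, a fuzzy C-means clustering algorithm based on the spatial distance of neighboring elements, uses a kernelized spatial distance formula to compute the similarity S_ij between each spatially neighboring pixel and the pixel under consideration. It then computes a neighborhood-constrained image as a weighted sum of the gray levels of neighboring pixels, and performs clustering on the gray-level statistics of that image. The algorithm takes into account both the gray level and the position of neighboring pixels and strikes a good balance between them. It is robust, preserves details of the original image such as edges, improves clustering accuracy, and greatly shortens the clustering time for large images. Extensive experiments on synthetic, medical, and natural images show that its clustering performance is clearly better than that of traditional algorithms and that it gives good segmentation results.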

15.

A fast K-means clustering algorithm based on EPDS

Chen Zuoping Ye Zhenglin Liu Ming 《Computer Engineering (计算机工程)》2006,32(12):191-192,195

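K-means clustering is a frequently used data clustering method, but it is slow on large data sets. The main reason is that, during clustering, each vector must repeatedly go through a nearest-neighbor search to find the closest cluster center. This paper therefore proposes using the Extended Partial Distortion Search (EPDS) to perform this nearest-neighbor search, which greatly reduces the number of multiplications needed to complete the clustering. Experiments show that, compared with the basic K-means algorithm, the method saves more than one third of the computation.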
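The idea behind a partial distortion search can be illustrated with a short sketch: while accumulating the squared distance to a candidate center, stop as soon as the running sum already reaches the best distance found so far. The code below shows only this basic early-exit rule; the specific extensions that make up EPDS are not described in the abstract and are not reproduced here. The name `pds_nearest` is hypothetical.

```python
def pds_nearest(x, centers):
    """Return (index, squared distance) of the center nearest to x,
    abandoning a candidate as soon as its partial squared distance
    reaches the current best."""
    best_idx, best_dist = -1, float("inf")
    for idx, c in enumerate(centers):
        partial = 0.0
        for a, b in zip(x, c):
            partial += (a - b) ** 2
            if partial >= best_dist:
                break  # this center cannot beat the current best
        else:
            # Loop finished without break: full distance is below the best.
            best_idx, best_dist = idx, partial
    return best_idx, best_dist
```

The result is identical to a full search; only the number of multiplications changes, and the saving tends to grow with the vector dimension.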

16.

Fuzzy clustering with supervision

Witold Pedrycz George Vukovich 《Pattern recognition》2004,37(7):1339-1349

This study is concerned with clustering carried out in the presence of labeled patterns. An objective of this optimization is to reconcile the structure residing in the data (and primarily discovered by the underlying clustering mechanism) with the labels of the patterns forming such structure. In this sense, one can consider supervised fuzzy clustering to be a framework for preliminary data analysis that provides thorough insight into the structure of the data and supports the ensuing design of detailed classifiers. The proposed method augments the standard fuzzy C-means algorithm by extending the original objective function with a supervision component (labeled patterns). Experimental results illustrate the approach and discuss the use of this type of clustering in vector quantization.

17.

Rectilinear steiner tree heuristics and minimum spanning tree algorithms using geographic nearest neighbors

Y. C. Wee S. Chaiken S. S. Ravi 《Algorithmica》1994,12(6):421-435

We study the application of the geographic nearest neighbor approach to two problems. The first problem is the construction of an approximately minimum length rectilinear Steiner tree for a set of n points in the plane. For this problem, we introduce a variation of a subgraph of size O(n) used by Yao [31] for constructing minimum spanning trees. Using this subgraph, we improve the running times of the heuristics discussed by Bern [6] from O(n² log n) to O(n log² n). The second problem is the construction of a rectilinear minimum spanning tree for a set of n noncrossing line segments in the plane. We present an optimal O(n log n) algorithm for this problem. The rectilinear minimum spanning tree for a set of points can thus be computed optimally without using the Voronoi diagram. This algorithm can also be extended to obtain a rectilinear minimum spanning tree for a set of nonintersecting simple polygons. The results in this paper are a part of Y. C. Wee's Ph.D. thesis done at SUNY at Albany. He was supported in part by NSF Grants IRI-8703430 and CCR-8805782. S. S. Ravi was supported in part by NSF Grants DCI-86-03318 and CCR-89-05296.
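As a point of reference for the rectilinear minimum spanning tree problem mentioned in this abstract, the sketch below builds such a tree for a set of points with a plain O(n²) version of Prim's algorithm under the L1 (Manhattan) metric. It is a simple baseline, not the geographic nearest neighbor method or the O(n log n) algorithm of the paper, and the function name `rectilinear_mst` is hypothetical.

```python
def rectilinear_mst(points):
    """Prim's algorithm under the L1 metric.

    Returns (total length, list of (parent, child) index pairs).
    """
    n = len(points)
    if n == 0:
        return 0, []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0
    total, edges = 0, []
    for _ in range(n):
        # Pick the point outside the tree with the cheapest connection.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Relax connections from the newly added point.
        for v in range(n):
            if not in_tree[v]:
                d = abs(points[u][0] - points[v][0]) + abs(points[u][1] - points[v][1])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return total, edges
```

The quadratic cost comes from treating every pair of points as a candidate edge; restricting candidates to a sparse subgraph of nearby points is what brings the running time down.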

18.

Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation

Weiling Cai Daoqiang Zhang 《Pattern recognition》2007,40(3):825-838

Fuzzy c-means (FCM) algorithms with spatial constraints (FCM_S) have been proven effective for image segmentation. However, they still have the following disadvantages: (1) although the introduction of local spatial information to the corresponding objective functions enhances their insensitiveness to noise to some extent, they still lack enough robustness to noise and outliers, especially in the absence of prior knowledge of the noise; (2) their objective functions contain a crucial parameter α, used to balance robustness to noise against preservation of image details, which is generally selected through experience; and (3) the time of segmenting an image is dependent on the image size, and hence the larger the image, the longer the segmentation time. In this paper, by incorporating local spatial and gray information together, a novel fast and robust FCM framework for image segmentation, i.e., fast generalized fuzzy c-means (FGFCM) clustering algorithms, is proposed. FGFCM can mitigate the disadvantages of FCM_S and at the same time enhances the clustering performance. Furthermore, FGFCM not only includes many existing algorithms, such as fast FCM and enhanced FCM, as its special cases, but also can derive other new algorithms such as FGFCM_S1 and FGFCM_S2 proposed in the rest of this paper. The major characteristics of FGFCM are: (1) it uses a new factor S_ij as a local (both spatial and gray) similarity measure, aiming to guarantee both noise immunity and detail preservation for the image while removing the empirically adjusted parameter α; (2) it clusters or segments an image quickly: the segmenting time depends only on the number of gray levels q rather than the size N (≫ q) of the image, and consequently its computational complexity is reduced from O(NcI₁) to O(qcI₂), where c is the number of clusters, and I₁ and I₂ are the numbers of iterations in the standard FCM and our proposed fast segmentation method, respectively. The experiments on the synthetic and real-world images show that the FGFCM algorithm is effective and efficient.
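The complexity reduction from image size N to gray-level count q can be illustrated by running fuzzy c-means on the gray-level histogram instead of on individual pixels. The sketch below assumes the image has already been reduced to a histogram and omits the S_ij local similarity filtering that the abstract describes; the name `fcm_on_histogram`, the evenly spaced initial centers, and the fixed iteration count are assumptions made for illustration.

```python
def fcm_on_histogram(hist, c, m=2.0, iters=50):
    """Fuzzy c-means over gray levels weighted by pixel counts.

    hist[g] is the number of pixels with gray level g. Each iteration
    costs O(q * c) for q occupied gray levels, independent of image size.
    Returns the list of c cluster centers.
    """
    levels = [g for g, count in enumerate(hist) if count > 0]
    lo, hi = levels[0], levels[-1]
    # Assumed initialization: centers spread evenly over the occupied range.
    centers = [lo + (hi - lo) * (k + 0.5) / c for k in range(c)]
    for _ in range(iters):
        num = [0.0] * c
        den = [0.0] * c
        for g in levels:
            # Squared distances; the small constant avoids division by zero.
            d = [(g - v) ** 2 + 1e-12 for v in centers]
            # Standard FCM membership of gray level g in cluster k.
            u = [1.0 / sum((d[k] / d[j]) ** (1.0 / (m - 1.0)) for j in range(c))
                 for k in range(c)]
            for k in range(c):
                w = hist[g] * u[k] ** m   # histogram count acts as a weight
                num[k] += w * g
                den[k] += w
        centers = [num[k] / den[k] for k in range(c)]
    return centers
```

For an 8-bit image the inner loop visits at most 256 gray levels per iteration, however many pixels the image contains.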

19.

Fast global k-means clustering using cluster membership and inequality

Jim Z.C. Lai 《Pattern recognition》2010,43(5):1954-1963

In this paper, we present a fast global k-means clustering algorithm by making use of the cluster membership and geometrical information of a data point. This algorithm is referred to as MFGKM. The algorithm uses a set of inequalities developed in this paper to determine a starting point for the jth cluster center of global k-means clustering. Adopting multiple cluster center selection (MCS) for MFGKM, we also develop another clustering algorithm called MFGKM+MCS. MCS determines more than one starting point for each step of cluster split; while the available fast and modified global k-means clustering algorithms select one starting point for each cluster split. Our proposed method MFGKM can obtain the least distortion; while MFGKM+MCS may give the least computing time. Compared to the modified global k-means clustering algorithm, our method MFGKM can reduce the computing time and number of distance calculations by a factor of 3.78-5.55 and 21.13-31.41, respectively, with the average distortion reduction of 5,487 for the Statlog data set. Compared to the fast global k-means clustering algorithm, our method MFGKM+MCS can reduce the computing time by a factor of 5.78-8.70 with the average reduction of distortion of 30,564 using the same data set. The performances of our proposed methods are more remarkable when a data set with higher dimension is divided into more clusters.

20.

Fast and versatile algorithm for nearest neighbor search based on a lower bound tree

Yong-Sheng Chen Yi-Ping Hung Ting-Fang Yen 《Pattern recognition》2007,40(2):360-375

In this paper, we present a fast and versatile algorithm which can rapidly perform a variety of nearest neighbor searches. Efficiency improvement is achieved by utilizing the distance lower bound to avoid the calculation of the distance itself if the lower bound is already larger than the global minimum distance. At the preprocessing stage, the proposed algorithm constructs a lower bound tree (LB-tree) by agglomeratively clustering all the sample points to be searched. Given a query point, the lower bound of its distance to each sample point can be calculated by using the internal node of the LB-tree. To reduce the amount of lower bounds actually calculated, the winner-update search strategy is used for traversing the tree. For further efficiency improvement, data transformation can be applied to the sample and the query points. In addition to finding the nearest neighbor, the proposed algorithm can also (i) provide the k-nearest neighbors progressively; (ii) find the nearest neighbors within a specified distance threshold; and (iii) identify neighbors whose distances to the query are sufficiently close to the minimum distance of the nearest neighbor. Our experiments have shown that the proposed algorithm can save substantial computation, particularly when the distance of the query point to its nearest neighbor is relatively small compared with its distance to most other samples (which is the case for many object recognition problems).
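The general principle of skipping a distance calculation when a cheap lower bound already rules the sample out can be shown with a much simpler bound than the LB-tree: the triangle inequality with respect to a single pivot point. The sketch below is only an illustration of that principle under this assumed bound; it does not build an LB-tree or use the winner-update strategy, and the names `build_index` and `nn_search` are hypothetical.

```python
import math


def build_index(samples, pivot):
    """Preprocessing: distance from every sample to a fixed pivot."""
    return [math.dist(s, pivot) for s in samples]


def nn_search(query, samples, pivot, pivot_dists):
    """Nearest neighbor search with lower-bound pruning.

    By the triangle inequality, |d(q, pivot) - d(s, pivot)| <= d(q, s),
    so a sample whose bound is not below the current best is skipped.
    Returns (index, distance, number of full distances computed).
    """
    dq = math.dist(query, pivot)
    best_idx, best = -1, float("inf")
    computed = 0
    for i, s in enumerate(samples):
        if abs(dq - pivot_dists[i]) >= best:
            continue  # lower bound already rules this sample out
        d = math.dist(query, s)
        computed += 1
        if d < best:
            best_idx, best = i, d
    return best_idx, best, computed
```

Pruning of this kind is most effective in the situation the abstract singles out, where the query lies much closer to its nearest neighbor than to most other samples.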