Similar Documents
20 similar documents found.
1.
Different extensions of fuzzy c‐means (FCM) clustering have been developed to approximate FCM clustering in very large (unloadable) image (eFFCM) and object vector (geFFCM) data. Both extensions share three phases: (1) progressive sampling of the VL data, terminated when a sample passes a statistical goodness of fit test; (2) clustering with (literal or exact) FCM; and (3) noniterative extension of the literal clusters to the remainder of the data set. This article presents a comparable method for the remaining case of interest, namely, clustering in VL relational data. We will propose and discuss each of the four phases of eNERF and our algorithm for this last case: (1) finding distinguished features that monitor progressive sampling, (2) progressively sampling a square N × N relation matrix R_N until an n × n sample relation R_n passes a statistical test, (3) clustering R_n with literal non‐Euclidean relational fuzzy c‐means, and (4) extending the clusters in R_n to the remainder of the relational data. The extension phase in this third case is not as straightforward as it was in the image and object data cases, but our numerical examples suggest that eNERF has the same approximation qualities that eFFCM and geFFCM do. © 2006 Wiley Periodicals, Inc. Int J Int Syst 21: 817–841, 2006.
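
As an illustration of the progressive-sampling idea in phase (2), and not the authors' eNERF code, the sketch below grows a random sample of a dissimilarity matrix until the sample's histogram of dissimilarities passes a chi-square goodness-of-fit test against the full matrix. The function name `progressive_sample` and the `step`, `bins`, and significance-level defaults are illustrative assumptions; eNERF itself tests distinguished features rather than a raw histogram.

```python
# Illustrative sketch only: sample until a chi-square goodness-of-fit test passes.
import numpy as np
from scipy.stats import chisquare

def progressive_sample(R, step=50, bins=10, alpha=0.05, seed=0):
    """R: square N x N dissimilarity matrix; returns indices of an accepted sample."""
    rng = np.random.default_rng(seed)
    N = R.shape[0]
    full_vals = R[np.triu_indices(N, 1)]
    edges = np.histogram_bin_edges(full_vals, bins=bins)
    full_hist, _ = np.histogram(full_vals, bins=edges)
    full_frac = (full_hist + 1) / (full_hist + 1).sum()   # light smoothing: no empty bins
    order = rng.permutation(N)
    for n in range(step, N, step):
        idx = order[:n]
        sub = R[np.ix_(idx, idx)][np.triu_indices(n, 1)]
        samp_hist, _ = np.histogram(sub, bins=edges)
        expected = full_frac * samp_hist.sum()
        # does the sample's dissimilarity histogram match the full matrix's?
        _, p = chisquare(samp_hist, f_exp=expected)
        if p > alpha:                     # goodness-of-fit test passed: accept the sample
            return idx
    return order                          # fall back to all objects
```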

2.
Since 1998, a graphical representation used in visual clustering called the reordered dissimilarity image or cluster heat map has appeared in more than 4000 biological or biomedical publications. These images are typically used to visually estimate the number of clusters in a data set, which is the most important input to most clustering algorithms, including the popularly chosen fuzzy c‐means and crisp k‐means. This paper presents a new formulation of a matrix reordering algorithm, coVAT, which is the only known method for providing visual clustering information on all four types of cluster structure in rectangular relational data. Finite rectangular relational data are an m × n array R of relational values between m row objects O_r and n column objects O_c. R presents four clustering problems: clusters in O_r, O_c, O_{r∪c}, and coclusters containing some objects from each of O_r and O_c. coVAT1 is a clustering tendency algorithm that provides visual estimates of the number of clusters to seek in each of these problems by displaying reordered dissimilarity images. We provide several examples where coVAT1 fails to do its job. These examples justify the introduction of coVAT2, a modification of coVAT1 based on a different reordering scheme. We offer several examples to illustrate that coVAT2 may detect coclusters in R when coVAT1 does not. Furthermore, coVAT2 is not limited to just relational data R. The R matrix can also take the form of feature data, such as gene microarray data where each data element is a real number: Positive values indicate upregulation, and negative values indicate downregulation. We show examples of coVAT2 on microarray data that indicate coVAT2 shows cluster tendency in these data. © 2012 Wiley Periodicals, Inc.
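
For orientation, here is a sketch of the classic square-matrix VAT reordering that coVAT builds on (coVAT1/coVAT2 themselves handle the rectangular case). The function name `vat_order` and the Prim-style ordering below are a simplified reading, not the published implementation.

```python
# Sketch of the square-matrix VAT reordering used to produce reordered dissimilarity images.
import numpy as np
import matplotlib.pyplot as plt

def vat_order(D):
    """Return a reordering of indices so that the reordered dissimilarity image
    tends to show dark blocks along the diagonal, one per apparent cluster."""
    n = D.shape[0]
    # start from an object involved in the largest dissimilarity
    i = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order, remaining = [i], set(range(n)) - {i}
    while remaining:
        rem = np.fromiter(remaining, dtype=int)
        sub = D[np.ix_(order, rem)]
        q = int(rem[np.argmin(sub.min(axis=0))])   # outside object closest to the selected set
        order.append(q)
        remaining.remove(q)
    return np.array(order)

# usage: display the reordered dissimilarity image
# o = vat_order(D)                                  # D: square dissimilarity matrix
# plt.imshow(D[np.ix_(o, o)], cmap="gray"); plt.show()
```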

3.
Approximating clusters in very large (VL=unloadable) data sets has been considered from many angles. The proposed approach has three basic steps: (i) progressive sampling of the VL data, terminated when a sample passes a statistical goodness of fit test; (ii) clustering the sample with a literal (or exact) algorithm; and (iii) non-iterative extension of the literal clusters to the remainder of the data set. Extension accelerates clustering on all (loadable) data sets. More importantly, extension provides feasibility—a way to find (approximate) clusters—for data sets that are too large to be loaded into the primary memory of a single computer. A good generalized sampling and extension scheme should be effective for acceleration and feasibility using any extensible clustering algorithm. A general method for progressive sampling in VL sets of feature vectors is developed, and examples are given that show how to extend the literal fuzzy (c-means) and probabilistic (expectation-maximization) clustering algorithms onto VL data. The fuzzy extension is called the generalized extensible fast fuzzy c-means (geFFCM) algorithm and is illustrated using several experiments with mixtures of five-dimensional normal distributions.
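
A minimal sketch of the noniterative extension step (iii), assuming literal FCM has already produced centroids from the sample: memberships of the remaining feature vectors follow directly from the standard FCM membership formula, so no further passes over the VL data are needed. The helper name `extend_fcm` and the fuzzifier default m = 2 are illustrative assumptions.

```python
# Extension step sketch: memberships of unsampled points from the sample's centroids.
import numpy as np

def extend_fcm(X_rest, centroids, m=2.0, eps=1e-12):
    """X_rest: (n, p) unsampled feature vectors; centroids: (c, p) from literal FCM on the sample."""
    d = np.linalg.norm(X_rest[:, None, :] - centroids[None, :, :], axis=2) + eps  # (n, c) distances
    inv = d ** (-2.0 / (m - 1.0))                    # standard FCM membership update
    return inv / inv.sum(axis=1, keepdims=True)      # fuzzy memberships, each row sums to 1
```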

4.
Content based image retrieval is an active area of research. Many approaches have been proposed to retrieve images based on matching of some features derived from the image content. Color is an important feature of image content. The problem with many traditional matching-based retrieval methods is that the search time for retrieving similar images for a given query image increases linearly with the size of the image database. We present an efficient color indexing scheme for similarity-based retrieval which has a search time that increases logarithmically with the database size. In our approach, the color features are extracted automatically using a color clustering algorithm. Then the cluster centroids are used as representatives of the images in 3-dimensional color space and are indexed using a spatial indexing method that uses the R-tree. The worst case search time complexity of this approach is O(n_q log(N * n_avg)), where N is the number of images in the database, and n_q and n_avg are the number of colors in the query image and the average number of colors per image in the database, respectively. We present the experimental results for the proposed approach on two databases consisting of 337 Trademark images and 200 Flag images.
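
A rough sketch of the indexing pipeline under stand-in components: scikit-learn's KMeans extracts the dominant color centroids, and a KD-tree (scipy's cKDTree) replaces the R-tree used in the paper while still giving logarithmic nearest-centroid queries. The names `dominant_colors`, `all_centroids`, `image_ids`, and `query_pixels` are hypothetical.

```python
# Color clustering plus a spatial index over centroids (KD-tree as a stand-in for the R-tree).
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import cKDTree

def dominant_colors(pixels_rgb, k=4):
    """pixels_rgb: (num_pixels, 3) array; returns k centroid colors in 3-D color space."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels_rgb).cluster_centers_

# build the index: one entry per (image, centroid) pair
# all_centroids: (M, 3) stacked centroids; image_ids: (M,) owning image of each centroid
# tree = cKDTree(all_centroids)
# dists, idx = tree.query(dominant_colors(query_pixels), k=5)   # candidate centroid matches
# candidate_images = set(image_ids[idx.ravel()])                # images to score in detail
```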

5.
Pairwise clustering methods have shown great promise for many real-world applications. However, the computational demands of these methods make them impractical for use with large data sets. The contribution of this paper is a simple but efficient method, called eSPEC, that makes clustering feasible for problems involving large data sets. Our solution adopts a “sampling, clustering plus extension” strategy. The methodology starts by selecting a small number of representative samples from the relational pairwise data using a selective sampling scheme; then the chosen samples are grouped using a pairwise clustering algorithm combined with local scaling; and finally, the label assignments of the remaining instances in the data are extended as a classification problem in a low-dimensional space, which is explicitly learned from the labeled samples using a cluster-preserving graph embedding technique. Extensive experimental results on several synthetic and real-world data sets demonstrate both the feasibility of our method for approximately clustering large data sets and its ability to accelerate clustering on loadable data sets.
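
The "local scaling" ingredient can be sketched as follows, assuming the Zelnik-Manor and Perona form of the affinity; the selective sampling and the cluster-preserving graph-embedding extension of eSPEC are not reproduced, and the neighbour index k = 7 is an illustrative default.

```python
# Locally scaled affinity for pairwise (distance) data, usable with a spectral clusterer.
import numpy as np
from sklearn.cluster import SpectralClustering

def local_scaling_affinity(D, k=7):
    """D: (n, n) pairwise distance matrix. A_ij = exp(-D_ij^2 / (sigma_i * sigma_j))."""
    sigma = np.sort(D, axis=1)[:, k]                      # distance to each point's k-th neighbour
    A = np.exp(-(D ** 2) / (np.outer(sigma, sigma) + 1e-12))
    np.fill_diagonal(A, 0.0)
    return A

# labels = SpectralClustering(n_clusters=3, affinity="precomputed").fit_predict(
#     local_scaling_affinity(D))
```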

6.
While spectral clustering can produce high-quality clusterings on small data sets, its computational cost makes it infeasible for large data sets. Affinity Propagation (AP) has the limitation that it is hard to determine the value of the parameter ‘preference’ that leads to an optimal clustering solution. These problems limit the scope of application of the two methods. In this paper, we develop a novel fast two-stage spectral clustering framework with local and global consistency. Under this framework, we propose a Fast density-Weighted low-rank Approximation Spectral Clustering (FWASC) algorithm to address the above issues. The proposed algorithm is a high-quality graph partitioning method that simultaneously considers both the local and global structure information contained in the data sets. Specifically, we first present a new Fast Two-Stage AP (FTSAP) algorithm to coarsen the input sparse graph and produce a small number of final representative exemplars, which is a simple and efficient sampling scheme. Then we present a density-weighted low-rank approximation spectral clustering algorithm that operates on those representative exemplars to capture the global underlying structure of the data manifold. Experimental results show that our algorithm outperforms the state-of-the-art spectral clustering and original AP algorithms in terms of speed, memory usage, and quality.
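
A toy rendering of the two-stage idea only, not the FWASC/FTSAP algorithms themselves: scikit-learn's AffinityPropagation coarsens the data into exemplars, spectral clustering runs on the exemplars alone, and every point inherits its exemplar's label. The function `two_stage_spectral` and its defaults are assumptions for illustration.

```python
# Two-stage sketch: coarsen with Affinity Propagation, then spectral-cluster the exemplars.
import numpy as np
from sklearn.cluster import AffinityPropagation, SpectralClustering

def two_stage_spectral(X, n_clusters=3):
    # stage 1: exemplar selection (assumes AP converges to at least n_clusters exemplars)
    ap = AffinityPropagation(random_state=0).fit(X)
    exemplars = X[ap.cluster_centers_indices_]
    # stage 2: spectral clustering on the small exemplar set only
    ex_labels = SpectralClustering(n_clusters=n_clusters, random_state=0).fit_predict(exemplars)
    return ex_labels[ap.labels_]            # every point takes its exemplar's spectral label
```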

7.
Here R and N denote respectively the real numbers and the nonnegative integers. Also 0 < n ∈ N, and s(x) = x_1 + ... + x_n when x = (x_1, ..., x_n) ∈ R^n. A diagonal function of dimension n is a map f on N^n (or any larger set) that takes N^n bijectively onto N and, for all x, y in N^n, has f(x) < f(y) whenever s(x) < s(y). We show that diagonal polynomials f of dimension n all have total degree n and have the same terms of that degree, so that the lower-degree terms characterize any such f. We call two polynomials equivalent if relabeling variables makes them identical. Then, up to equivalence, dimension two admits just one diagonal polynomial, and dimension three admits just two.
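
A worked example of the dimension-two case: the Cantor pairing polynomial has total degree 2, maps N^2 bijectively onto N, and respects the diagonal ordering, so it is (up to relabeling of variables) the diagonal polynomial of dimension two. The finite test range below is only a spot check, not a proof.

```python
# The Cantor pairing polynomial f(x1, x2) = (x1 + x2)(x1 + x2 + 1)/2 + x2 is diagonal.
def f(x1, x2):
    s = x1 + x2
    return s * (s + 1) // 2 + x2

pairs = [(x1, x2) for x1 in range(30) for x2 in range(30) if x1 + x2 < 30]
values = sorted(f(x1, x2) for x1, x2 in pairs)
assert values == list(range(len(pairs)))           # bijection onto an initial segment of N
assert all(f(*p) < f(*q) for p in pairs for q in pairs
           if sum(p) < sum(q))                     # f(x) < f(y) whenever s(x) < s(y)
print("Cantor pairing is diagonal on the tested range")
```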

8.
丁世飞, 贾洪杰, 史忠植. 《软件学报》 (Journal of Software), 2014, 25(9): 2037–2049
For data sets with complex structure, spectral clustering is a flexible and effective clustering method. Based on spectral graph theory, it maps the data points into a low-dimensional space spanned by eigenvectors, which reorganizes the structure of the data and yields satisfactory clustering results. However, the eigendecomposition step of spectral clustering typically has a computational complexity of O(n^3), which limits the application of spectral clustering algorithms to big data. The Nyström extension method uses a subset of sampled points to perform an approximate computation that approaches the true eigenspace; it can effectively reduce the computational complexity and offers a new way to build spectral clustering algorithms for big data. The choice of sampling strategy is crucial to the Nyström extension technique. This paper designs an adaptive Nyström sampling method in which the sampling probability of every data point is updated immediately after each round of sampling, and it proves theoretically that the sampling error decreases exponentially as the number of sampling rounds increases. Based on the adaptive Nyström sampling method, a spectral clustering algorithm suitable for big data is proposed, and its feasibility and effectiveness are verified experimentally.
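
A minimal Nyström sketch with plain uniform landmark sampling; the paper's contribution, adaptive per-point sampling probabilities with an exponentially decreasing error bound, is not reproduced here. The function `nystrom_spectral`, the RBF affinity, and the landmark count m are illustrative choices.

```python
# Nystrom-approximated spectral clustering with uniform landmark sampling.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def nystrom_spectral(X, n_clusters=3, m=200, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(m, len(X)), replace=False)   # uniform landmark sample
    C = rbf_kernel(X, X[idx], gamma=gamma)         # (n, m) affinities to the landmarks
    W = C[idx, :]                                  # (m, m) landmark-landmark block
    vals, vecs = np.linalg.eigh(W)
    top = np.argsort(vals)[-n_clusters:]           # leading eigenpairs of W
    U = C @ vecs[:, top] / vals[top]               # Nystrom extension of the eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12   # row-normalise the embedding
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(U)
```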

9.
Maximum‐margin clustering is an extension of the support vector machine (SVM) to clustering. It partitions a set of unlabeled data into multiple groups by finding hyperplanes with the largest margins. Although existing algorithms have shown promising results, there is no guarantee of convergence of these algorithms to global solutions due to the nonconvexity of the optimization problem. In this paper, we propose a simulated annealing‐based algorithm that is able to mitigate the issue of local minima in the maximum‐margin clustering problem. The novelty of our algorithm is twofold, i.e., (i) it comprises a comprehensive cluster modification scheme based on simulated annealing, and (ii) it introduces a new approach based on the combination of k‐means++ and SVM at each step of the annealing process. More precisely, k‐means++ is initially applied to extract subsets of the data points. Then, an unsupervised SVM is applied to improve the clustering results. Experimental results on various benchmark data sets (of up to over a million points) give evidence that the proposed algorithm is more effective at solving the clustering problem than a number of popular clustering algorithms.
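
A toy sketch of the k-means++-plus-SVM ingredient for two clusters, alternating a large-margin separator with relabeling; the comprehensive simulated-annealing modification scheme of the paper is not shown, and `km_svm_refine` with a linear SVC is a simplified stand-in for the unsupervised SVM step.

```python
# Alternate k-means++ initialisation with SVM-based relabeling (two-cluster toy version).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def km_svm_refine(X, iters=5, C=1.0):
    labels = KMeans(n_clusters=2, init="k-means++", n_init=10,
                    random_state=0).fit_predict(X)           # initial partition
    for _ in range(iters):
        svm = SVC(kernel="linear", C=C).fit(X, labels)        # large-margin separator
        new_labels = svm.predict(X)                           # relabel by the separator
        if len(np.unique(new_labels)) < 2 or np.array_equal(new_labels, labels):
            break                                             # degenerate or converged
        labels = new_labels
    return labels
```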

10.
The mountain clustering method and the subtractive clustering method are useful methods for finding cluster centers based on local density in object data. These methods have been extended to shell clustering. In this article, we propose a relational mountain clustering method (RMCM), which produces a set of (proto)typical objects as well as a crisp partition of the objects generating the relation, using a new concept that we call relational density. We exemplify RMCM by clustering several relational data sets that come from object data. Finally, RMCM is applied to web log analysis, where it produces useful user profiles from web log data. © 2005 Wiley Periodicals, Inc. Int J Int Syst 20: 375–392, 2005.
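
One plausible reading of "relational density" is sketched below: each object's density is a sum of Gaussian kernels of its dissimilarities, prototypes are picked mountain-style by repeatedly taking the densest object and suppressing its peak, and the crisp partition assigns every object to its nearest prototype. The exact RMCM definitions may differ; `relational_mountain`, `alpha`, and `beta` are illustrative.

```python
# Mountain-style prototype selection driven directly by a dissimilarity relation R.
import numpy as np

def relational_mountain(R, c=3, alpha=1.0, beta=1.0):
    """R: (n, n) dissimilarity relation. Returns indices of c prototype objects."""
    density = np.exp(-alpha * R ** 2).sum(axis=1)     # high when many objects are close
    protos = []
    for _ in range(c):
        p = int(np.argmax(density))
        protos.append(p)
        # suppress the peak around the chosen prototype before picking the next one
        density = density - density[p] * np.exp(-beta * R[p] ** 2)
    return protos

# crisp partition: assign every object to its nearest prototype under the relation
# protos = relational_mountain(R); labels = np.argmin(R[:, protos], axis=1)
```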

11.
Here R and N denote the real numbers and the nonnegative integers, respectively. Also s(x) = x_1 + ··· + x_n when x = (x_1, …, x_n) in R^n. A map f: R^n → R is called a diagonal function of dimension n if f|N^n is a bijection onto N and, for all x, y in N^n, f(x) < f(y) when s(x) < s(y). Morales and Lew [6] constructed 2^(n−2) inequivalent diagonal polynomial functions of dimension n for each n > 1. Here we use new combinatorial ideas to show that the number d_n of such functions is much greater than 2^(n−2) for n > 3. These combinatorial ideas also give an inductive procedure to construct d_(n+1) diagonal orderings of {1, …, n}.

12.
Cluster analysis deals with the problem of organization of a collection of objects into clusters based on a similarity measure, which can be defined using various distance functions. The use of different similarity measures allows one to find different cluster structures in a data set. In this article, an algorithm is developed to solve clustering problems where the similarity measure is defined using the L1‐norm. The algorithm is designed using the nonsmooth optimization approach to the clustering problem. Smoothing techniques are applied to smooth both the clustering function and the L1‐norm. The algorithm computes clusters sequentially and finds global or near global solutions to the clustering problem. Results of numerical experiments using 12 real‐world data sets are reported, and the proposed algorithm is compared with two other clustering algorithms.
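
The smoothing idea can be illustrated as follows, under the common hyperbolic smoothing sqrt(t^2 + tau^2) of |t| and a log-sum-exp soft-min over clusters; the paper's incremental, sequential-cluster algorithm is not reproduced, and `smoothed_l1_objective` with L-BFGS-B is only a differentiable-surrogate demonstration.

```python
# Smoothed L1 clustering objective: smooth both the absolute value and the min over clusters.
import numpy as np
from scipy.optimize import minimize

def smoothed_l1_objective(centers_flat, X, k, tau=1e-3):
    centers = centers_flat.reshape(k, X.shape[1])
    # smoothed L1 distance from every point to every center: sum_d sqrt((x_d - c_d)^2 + tau^2)
    d = np.sqrt((X[:, None, :] - centers[None, :, :]) ** 2 + tau ** 2).sum(axis=2)   # (n, k)
    # smooth minimum over the k centers (log-sum-exp soft-min), summed over all points
    return -tau * np.logaddexp.reduce(-d / tau, axis=1).sum()

# usage (hypothetical data X with k clusters):
# x0 = X[np.random.choice(len(X), k, replace=False)].ravel()
# res = minimize(smoothed_l1_objective, x0, args=(X, k), method="L-BFGS-B")
# centers = res.x.reshape(k, X.shape[1])
```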

13.
Existing fault-tolerant routing schemes for Benes networks either consider only the control line stuck-at faults, or handle the switch faults by some graceful degradation routing schemes that reconfigure the network into a smaller system with minimal loss. Now, even in the presence of a single switch fault in an N × N Benes network B(n) (n = log2 N), no N × N permutation can be realized in a single pass. In this paper, we attempt to characterize the switch fault sets in B(n) in the presence of which the network is always capable of realizing any arbitrary N × N permutation P in two passes, such that any source–destination path is set up in a single pass and no recirculation is needed, but the whole set of N source–destination paths of P is partitioned into two subsets that are realized in two successive passes. We propose an algorithm that detects whether the switch fault set present in a B(n) belongs to this class; if so, we present another algorithm that computes the fault-tolerant routing to realize any arbitrary permutation P in two passes. This scheme enables us to make B(n) fault-tolerant in the presence of a restricted class of multiple switch faults, without any recirculation through intermediate nodes or any reconfiguration of the system.

14.
The triangulation refinement problem, as formulated in the adaptive finite element setting (also useful in the rendering of complex scenes), is discussed. This can be formulated as follows: given a valid, non-degenerate triangulation of a polygonal region, construct a locally refined triangulation, with triangles of prescribed size in a refinement region R, and such that the smallest (or the largest) angle is bounded. To cope with this problem, longest-side refinement algorithms guarantee the construction of good quality irregular triangulations. This is due in part to their natural refinement propagation strategy farther than the (refinement) area of interest R. In this paper we prove that, asymptotically, the number N of points inserted in R to obtain triangles of prescribed size is optimal. Furthermore, in spite of the unavoidable propagation outside the refinement region R, the time cost of the algorithm is linear in N, independent of the size of the triangulation. Specifically, the number of points inserted outside R is of order O(n log^2 n), where N = O(n^2). We prove the latter result for circular and rectangular refinement regions, which allows us to conclude that this is true for general convex refinement regions. We also include empirical evidence, both in two and three dimensions, which is in complete agreement with the theory, even for small values of N.

15.
A remote sensing approach was applied to estimate near‐noon values of shortwave albedo (α), the fraction of solar radiation reflected by a surface, for alfalfa and tall fescue grass at Kimberly, Idaho. The approach was based on the (P/T) ratio, which is the ratio of the partial radiation (P) sensed by a multi‐band radiometer and the total incident radiation (T) in a given wavelength range. It was found that instead of being constant, as previously suggested, the upward component of the (P/T) ratio under clear‐sky conditions [(P/T)_u] followed a logistic growth function of solar altitude angle (Λ_z) for both crops (r^2 = 0.84). The downward component [(P/T)_d], on the other hand, linearly increased with Λ_z (r^2 = 0.83). By applying the (P/T) ratio methodology, using variable ratios, it was found that the diurnal pattern of clear‐sky α for both crops followed a decreasing function of Λ_z (r^2 = 0.80). Near‐noon α values for alfalfa estimated using remote sensing were linearly related to plant canopy height (h) (r^2 = 0.92), but not to Λ_z. For grass, on the other hand, the near‐noon α values obtained by remote sensing were not correlated with either h or Λ_z. The near‐noon α values for alfalfa obtained with remote sensing deviated considerably from those estimated using an empirical function of day of the year (DOY). For alfalfa, the near‐noon net radiation (R_n) values calculated using α values derived by remote sensing were better correlated to measured R_n values than those obtained using α estimated as a function of DOY. For grass, the α values derived from remote sensing did not significantly improve the accuracy of the calculated near‐noon R_n compared with using α values estimated as a function of Λ_z.

16.
This paper presents an improved data hiding scheme, CIE, which uses a codebook to improve the Exploiting Modification Direction (EMD) embedding scheme. In our scheme, one secret (2^(n+x) − 1)-ary digit is hidden in a group of pixels in an image as a modified secret digit. Our proposed scheme has an embedding rate $R=\log_2(2^{n+x}-1)/n$, which is greater than the rate of the EMD scheme, $R=\log_2(2n+1)/n$ for n ≥ 2. The embedding rate R is the number of secret bits embedded in each cover pixel. Our experimental results demonstrate that our scheme is able to embed 3 times as many secret bits in an image compared to the original EMD embedding scheme when n = 2 and x = 5. Our scheme has low time complexity and achieves this higher embedding performance while retaining reasonable perceptual quality for the image. An experiment verifies these features of our proposed data hiding scheme.
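
A quick arithmetic check of the quoted rates, taking the two formulas exactly as stated in the abstract:

```python
# Compare the stated embedding rates of CIE and the original EMD scheme.
from math import log2

n, x = 2, 5
r_cie = log2(2 ** (n + x) - 1) / n   # about 3.49 bits per cover pixel
r_emd = log2(2 * n + 1) / n          # about 1.16 bits per cover pixel
print(r_cie, r_emd, r_cie / r_emd)   # ratio is about 3, matching the "3 times" claim
```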

17.
Scalable Clustering Algorithms with Balancing Constraints
Clustering methods for data-mining problems must be extremely scalable. In addition, several data mining applications demand that the clusters obtained be balanced, i.e., of approximately the same size or importance. In this paper, we propose a general framework for scalable, balanced clustering. The data clustering process is broken down into three steps: sampling of a small representative subset of the points, clustering of the sampled data, and populating the initial clusters with the remaining data followed by refinements. First, we show that a simple uniform sampling from the original data is sufficient to get a representative subset with high probability. While the proposed framework allows a large class of algorithms to be used for clustering the sampled set, we focus on some popular parametric algorithms for ease of exposition. We then present algorithms to populate and refine the clusters. The algorithm for populating the clusters is based on a generalization of the stable marriage problem, whereas the refinement algorithm is a constrained iterative relocation scheme. The complexity of the overall method is O(kN log N) for obtaining k balanced clusters from N data points, which compares favorably with other existing techniques for balanced clustering. In addition to providing balancing guarantees, the clustering performance obtained using the proposed framework is comparable to and often better than the corresponding unconstrained solution. Experimental results on several datasets, including high-dimensional (>20,000) ones, are provided to demonstrate the efficacy of the proposed framework.
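
A compact sketch of the sample/cluster/populate pipeline, with a greedy capacity-limited assignment standing in for the stable-marriage-based populate step and without the refinement phase; `balanced_populate` and its defaults are illustrative, and it assumes k ≥ 2.

```python
# Sample, cluster the sample, then populate the clusters under a balance constraint.
import numpy as np
from sklearn.cluster import KMeans

def balanced_populate(X, k, sample_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    centers = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sample).cluster_centers_
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)      # (N, k) distances
    cap = int(np.ceil(len(X) / k))                                       # balance constraint
    labels = np.full(len(X), -1)
    counts = np.zeros(k, dtype=int)
    # assign the most "decided" points first (largest margin between best and second-best)
    margin = np.partition(d, 1, axis=1)[:, 1] - d.min(axis=1)
    for i in np.argsort(-margin):
        for c in np.argsort(d[i]):               # closest cluster that still has spare capacity
            if counts[c] < cap:
                labels[i], counts[c] = c, counts[c] + 1
                break
    return labels
```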

18.
Clustering of geometric objects is a very familiar and important problem in many different areas of application as well as in the theoretical foundation of some modern fields of computer science. This paper describes how design problems, especially the design of an assembly line, can be transformed into a clustering problem. In order to solve the problem for large sizes of input data we introduce a structure, called the Voronoi Tree, which, applied to our real-world data (assembly line design), not only reduced the time to get a feasible design of an assembly line dramatically, but also increased the value of the design by more than 30% (in comparison with standard design methods). In addition, we introduce a clustering method which is of interest for those applications which can be transformed to planar clustering problems. In this particular case it is possible to compute a (hierarchically) optimized clustering with respect to a large class of clustering measures in time O(n·n^(1/2)·log^3 n + U_F(n)·n·n^(1/2) + P_F(n)) [n: number of points; U_F(n), P_F(n) dependent on the chosen clustering measure].

19.
Feature selection is an essential data processing step to remove irrelevant and redundant attributes for shorter learning time, better accuracy, and better comprehensibility. A number of algorithms have been proposed in both data mining and machine learning areas. These algorithms are usually used in a single table environment, where data are stored in one relational table or one flat file. They are not suitable for a multi‐relational environment, where data are stored in multiple tables joined to one another by semantic relationships. To address this problem, in this article, we propose a novel approach called FARS to conduct both Feature And Relation Selection for efficient multi‐relational classification. Through this approach, we not only extend the traditional feature selection method to select relevant features from multi‐relations, but also develop a new method to reconstruct the multi‐relational database schema and eliminate irrelevant tables to improve classification performance further. The results of the experiments conducted on both real and synthetic databases show that FARS can effectively choose a small set of relevant features, thereby enhancing classification efficiency and prediction accuracy significantly.

20.
Turbidity is an important indicator of water environments and water-quality conditions. Ocean colour remote sensing has proved to be an efficient way of monitoring water turbidity because of its wide synoptic coverage and repeated regular sampling. However, operational tasks are still challenging in high-turbidity waters, especially in estuaries and the coastal regions of China. In these areas, the existing algorithms derived from remote-sensing reflectance (Rrs) are usually invalid because it is difficult to correctly estimate the reflectance Rrs from satellite data such as Moderate Resolution Imaging Spectroradiometer (MODIS) data. A new algorithm that uses Rayleigh-corrected reflectance (Rrc) instead of Rrs has been recently introduced and was used to estimate water turbidity in Zhejiang (ZJ) coastal areas from Geostationary Ocean Color Imager (GOCI) data. The Rrc algorithm has previously shown a capability to estimate water turbidity. However, its performance still requires careful evaluation. In this article, we compared the new Rrc algorithm with two other existing algorithms. Differences among the three algorithms were assessed by comparing the results from using Rrc data and Rrs reflectance data derived from both GOCI and MODIS imagery data. The capability of the new Rrc algorithm to estimate water turbidity in larger areas and extended seasons in the coastal seas of China was also assessed. The results showed that the new Rrc algorithm is suitable for the coastal waters of China, especially for highly turbid waters.
