Similar literature
20 similar articles found (search time: 0 ms)
1.
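A density-grid-based data stream clustering algorithm within a sliding window   Cited by: 1 (self-citations: 0, citations by others: 1)
Li Ziwen (李子文), Xing Changzheng (邢长征). Journal of Computer Applications (《计算机应用》), 2010, 30(4): 1093-1095
A data stream clustering algorithm based on density grids is proposed. By introducing a "membership degree" (“隶度”), it improves on the idea, used by traditional grid-density-based data stream clustering algorithms, of taking the number of data points in a grid cell as that cell's density. This resolves how to handle data points in a single cell that belong to two different clusters, as well as boundary points. The algorithm thus keeps the high efficiency of grid-based methods while substantially improving clustering accuracy.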
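The baseline notion of grid density that this abstract refers to, the number of data points falling in each grid cell, can be sketched as follows. This is a minimal illustration rather than the authors' algorithm: the function name `grid_density`, the `cell_size` parameter, and the sample points are assumptions, and the membership-degree refinement and the sliding window are not reproduced because the abstract does not define them.

```python
from collections import Counter
import math

def grid_density(points, cell_size):
    """Count points per grid cell (hypothetical helper, baseline density only)."""
    counts = Counter()
    for p in points:
        # Map each coordinate to the integer index of the cell that contains it.
        cell = tuple(math.floor(x / cell_size) for x in p)
        counts[cell] += 1
    return counts

# Toy data: two points fall in cell (0, 0) and three in cell (1, 0).
points = [(0.1, 0.2), (0.4, 0.9), (1.5, 0.3), (1.7, 0.8), (1.9, 0.1)]
density = grid_density(points, cell_size=1.0)
```

With a pure count, a point near a cell border contributes fully to one cell and nothing to its neighbor, which is the kind of boundary problem the abstract says the membership degree is meant to address.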

2.
Due to data sparseness and attribute redundancy in high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. To effectively address this issue, this paper presents a new optimization algorithm for clustering high-dimensional categorical data, which is an extension of the k-modes clustering algorithm. In the proposed algorithm, a novel weighting technique for categorical data is developed to calculate two weights for each attribute (or dimension) in each cluster and use the weight values to identify the subsets of important attributes that categorize different clusters. The convergence of the algorithm under an optimization framework is proved. The performance and scalability of the algorithm are evaluated experimentally on both synthetic and real data sets. The experimental studies show that the proposed algorithm is effective in clustering categorical data sets and is also scalable to large data sets owing to its linear time complexity with respect to the number of data objects, attributes, or clusters.
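For context, the k-modes algorithm that this abstract extends replaces Euclidean distance with a simple matching dissimilarity and replaces cluster means with per-attribute modes. A minimal sketch of those two building blocks follows. The function names and toy records are assumptions, and the paper's two per-cluster attribute weights are not reproduced because the abstract does not give their formula.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Per-attribute most frequent value (ties broken by first occurrence)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Toy cluster of three records with three categorical attributes each.
cluster = [("red", "small", "round"),
           ("red", "large", "round"),
           ("blue", "small", "round")]
mode = cluster_mode(cluster)
distance = matching_dissimilarity(("blue", "large", "round"), mode)
```

A weighted variant would multiply each mismatch by an attribute weight before summing, which is where a subspace method can down-weight attributes that are irrelevant to a given cluster.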

3.
A histogram clustering algorithm is proposed that builds a hierarchy of distributions with improved cluster separability. The algorithm optimizes the average cluster separability by choosing the quantization grid system of the data subdomains, which allows a significant decrease in the number of clusters. Its application to unsupervised classification of the Earth's surface from satellite spectral data is shown.

4.
Spectral clustering is an important family of clustering methods that relies heavily on the affinity matrix. However, conventional spectral clustering methods (1) treat every data point equally and are therefore easily affected by outliers; (2) are sensitive to initialization; and (3) require the number of clusters to be specified. To overcome these problems, we propose a novel spectral clustering algorithm that learns an intrinsic affinity matrix through affinity matrix learning, uses local PCA to resolve intersections, and further employs a robust clustering step that is insensitive to initialization to generate clusters automatically, without the number of clusters as input. Experimental results on both artificial and real high-dimensional datasets show that the proposed method outperforms the compared clustering methods in terms of four clustering metrics.

5.
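Exploiting the characteristics of the data distribution between cells, a bisecting-grid multi-density clustering algorithm, BGMC, is proposed. The algorithm judges the relationship between two adjacent cells from the ratio of the product of the sample counts in their adjacent regions to the product of the two cells' data counts. It uses this to find similar cells and boundary cells and to decide which cluster the data in boundary cells belong to. Experimental results show that the algorithm distinguishes clusters of different densities, shapes, and sizes well, that the clustering result is independent of the data input order and of the order in which starting cells are selected, and that the algorithm runs efficiently and scales well with respect to both space and dimensionality.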

6.
Deng Chao, Song Jinwei, Sun Ruizhi, Cai Saihua, Shi Yinxue. Multimedia Tools and Applications, 2018, 77(22): 29623-29637
Multimedia Tools and Applications - The notion of density has been widely used in many spatial-temporal (ST) clustering methods. This paper proposes the novel notion of an ST density-wave, which is...

7.
Clustering has been widely adopted in numerous applications, including pattern recognition, data analysis, image processing, and market research. In data mining, traditional clustering algorithms that use distance-based measurements to calculate the difference between data are unsuitable for non-numeric attributes such as nominal, Boolean, and categorical data. Applying an unsuitable similarity measurement in clustering may cause valuable information embedded in the data attributes to be lost, and hence low-quality clusters will be created. This paper proposes a novel hierarchical clustering algorithm, referred to as MPM, for the clustering of non-numeric data. The goals of MPM are to retain the data features of interest while effectively grouping data objects into clusters with high intra-similarity and low inter-similarity. MPM achieves these goals through two principal methods: (1) the adoption of a novel similarity measurement that can capture the “characterized properties” of information, and (2) the application of matrix permutation and matrix partitioning to the results of the similarity measurement (constructed in the form of a similarity matrix) in order to assign data to appropriate clusters. This study also proposes a heuristic-based algorithm, Heuristic_MPM, to reduce the processing time required for matrix permutation and matrix partitioning, which together constitute the bulk of the total MPM execution time. An erratum to this article is available at .

8.
The Publisher regrets that the original article incorrectly listed the authors’ location as the “People’s Republic of China”. This was a typesetting error. Their correct location is the “Republic of China”, which is also known as Taiwan. Their full affiliation is Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan. The online version of the original article can be found at .

9.
Microarrays are used for measuring expression levels of thousands of genes simultaneously. Clustering algorithms are used on gene expression data to find co-regulated genes. A commonly used clustering strategy is the Pearson correlation coefficient based hierarchical clustering algorithm presented in [Proc. Nat. Acad. Sci. 95 (25) (1998) 14863-14868], which takes O(N^3) time. We note that this run time can be reduced to O(N^2) by applying known hierarchical clustering algorithms [Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 619-628] to this problem. In this paper, we present an algorithm that runs in O(N log N) time using a geometrical reduction and show that it is optimal.
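The similarity measure behind the clustering strategy described in this abstract is the Pearson correlation coefficient between two expression profiles. A minimal sketch follows, assuming profiles are plain Python lists. The function name `pearson` and the sample values are illustrative, and the O(N log N) geometrical reduction itself is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises and falls together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # moves opposite to gene_a
```

A hierarchical clustering routine would typically turn this into a distance such as `1 - pearson(x, y)`, so that strongly co-varying profiles are merged first. Evaluating it for all pairs is already quadratic in the number of genes.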

10.
Advanced satellite tracking technologies provide biologists with long-term location sequence data for understanding the movement of wild birds and, in turn, for finding explicit correlations between the dynamics of migratory birds and the spread of avian influenza. In this paper, we propose a hierarchical clustering algorithm based on a recursive grid partition and kernel density estimation (KDE) to hierarchically identify wild bird habitats with different densities. We hierarchically cluster the GPS data by taking into account the following observations: (1) habitat variation across a variety of geospatial scales; and (2) spatial variation in the activity patterns of birds at different stages of the migration cycle. In addition, we measure the site fidelity of wild birds based on clustering. To assess effectiveness, we evaluated our system using a large-scale GPS dataset collected from 59 birds over three years. The results show that our approach identifies the hierarchical habitats and distribution of wild birds more efficiently than several commonly used algorithms such as DBSCAN and DENCLUE.

11.
Pattern Analysis and Applications - The curse of dimensionality in high-dimensional data is one of the major challenges in data clustering. Recently, a considerable amount of literature has been...

12.
In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters that might provide further insights into the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, for which PSC often provides state-of-the-art results.

13.
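A parallel hierarchical clustering algorithm based on data preprocessing*   Cited by: 3 (self-citations: 0, citations by others: 3)
Hierarchical clustering has very important applications in image processing, intrusion detection, bioinformatics, and other areas, and is one of the active research topics in data mining. To address the unsatisfactory performance of current SIMD-model-based parallel hierarchical clustering algorithms on massive data, an adaptive parallel hierarchical clustering algorithm based on data preprocessing is proposed. It clusters n input data points in O((λn)^2/p) time, where 1 ≤ p ≤ n/log n and 0.1 ≤ λ ≤ 0.3. A performance comparison between the proposed algorithm and results in the existing literature shows that it clearly improves on them.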
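To make the stated bound concrete, the two ends of the allowed range of p can be substituted into it. This is plain arithmetic on the expression given in the abstract, not a result taken from the paper:

```latex
T(n, p) = O\!\left(\frac{(\lambda n)^2}{p}\right), \qquad 1 \le p \le \frac{n}{\log n}, \quad 0.1 \le \lambda \le 0.3

p = 1: \quad T = O(\lambda^2 n^2)

p = \frac{n}{\log n}: \quad T = O\!\left(\lambda^2 n^2 \cdot \frac{\log n}{n}\right) = O(\lambda^2 n \log n)
```

Because 0.1 ≤ λ ≤ 0.3, the factor λ^2 lies between 0.01 and 0.09.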

14.
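A survey of clustering methods for high-dimensional data*   Cited by: 10 (self-citations: 2, citations by others: 10)
The survey summarizes the state of research on clustering algorithms for high-dimensional data, analyzes and compares the main differences in algorithm performance, and points out the future trend: incorporating ideas from other traditional clustering methods into the subspace clustering process to improve clustering performance.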

15.
Tick data are used in several applications that need to keep track of values changing over time, like prices on the stock market or meteorological measurements. Due to the possibly very frequent changes, the size of tick data tends to increase rapidly. Therefore, it becomes of paramount importance to reduce the storage space of tick data while, at the same time, allowing queries to be executed efficiently. In this paper, we propose an approach to decompose the original tick data matrix by clustering their attributes using a new clustering algorithm called Storage-Optimizing Hierarchical Agglomerative Clustering (SOHAC). We additionally propose a method for speeding up SOHAC based on a new lower bounding technique that allows SOHAC to be applied to high-dimensional tick data. Our experimental evaluation shows that the proposed approach compares favorably to several baselines in terms of compression. Additionally, it can lead to significant speedup in terms of running time.

16.
Subspace clustering algorithms have shown their advantage in handling high-dimensional data by optimizing a linear combination of clustering criteria. However, setting the coefficients of these criteria without prior knowledge leads to inaccurate and non-robust clustering results. To address this problem, we propose to optimize the multiple clustering criteria simultaneously, without any predefined coefficients, by means of a multi-objective evolutionary algorithm. Furthermore, to accelerate the convergence of the algorithm, we provide a novel local search method in which the multi-objective clustering problem is decomposed into many localized scalarizing sub-problems by reference vectors. Solutions are then locally searched around their associated sub-problems. Finally, we develop a knee-pruning fuzzy ensemble method for selecting the final solution. This method applies a clustering ensemble to solutions selected from knee regions to obtain robust results. Experiments on UCI benchmarks and gene expression datasets show that our proposed algorithm can efficiently handle high-dimensional clustering problems without any user-defined coefficients.

17.
KNN-kernel density-based clustering for high-dimensional multivariate data   Cited by: 1 (self-citations: 0, citations by others: 1)
Density-based clustering algorithms for multivariate data often have difficulties with high-dimensional data and clusters of very different densities. A new density-based clustering algorithm, called KNNCLUST, is presented in this paper that is able to tackle these situations. It is based on the combination of nonparametric k-nearest-neighbor (KNN) and kernel (KNN-kernel) density estimation. The KNN-kernel density estimation technique makes it possible to model clusters of different densities in high-dimensional data sets. Moreover, the number of clusters is identified automatically by the algorithm. KNNCLUST is tested using simulated data and applied to a multispectral compact airborne spectrographic imager (CASI) image of a floodplain in the Netherlands to illustrate the characteristics of the method.
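The k-nearest-neighbor half of the density estimate mentioned in this abstract can be illustrated with the classical KNN estimator, density = k / (n · V), where V is the volume of the ball that reaches the k-th nearest neighbor. This is a sketch of that textbook estimator only. The function name `knn_density` and the sample data are assumptions, and KNNCLUST's combined KNN-kernel estimator and its cluster assignment rule are not reproduced.

```python
import math

def knn_density(query, data, k):
    """Classical k-nearest-neighbor density estimate at `query` (illustrative)."""
    d = len(query)
    n = len(data)
    # Distance from the query to its k-th nearest data point.
    r_k = sorted(math.dist(query, p) for p in data)[k - 1]
    if r_k == 0.0:
        return float("inf")  # k or more data points coincide with the query
    # Volume of the d-dimensional ball of radius r_k.
    volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r_k ** d
    return k / (n * volume)

# Five evenly spaced 1-D points; the 2nd nearest point to 2.0 is at distance 1.
data = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
estimate = knn_density((2.0,), data, k=2)
```

Because the radius adapts to the local spacing of the data, the estimate stays usable in both sparse and dense regions, which is the property the abstract relies on for clusters of very different densities.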

18.
Krleža Dalibor, Vrdoljak Boris, Brčić Mario. Machine Learning, 2021, 110(1): 139-184
Machine Learning - Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse...

19.
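Clustering methods for categorical (character-type) data and mixed-type data are studied. First, building on classical rough set theory, the indiscernibility and compatibility conditions between objects are relaxed to obtain an extended rough set model based on a harmonious relation. Then, new concepts are defined, including the degree of indiscernibility between individuals, the degree of indiscernibility between clusters, and the comprehensive approximation accuracy of a clustering result, and a new hierarchical clustering algorithm for mixed data types is proposed. The algorithm can handle not only numerical data but also the categorical and mixed-type data that most clustering algorithms cannot handle. Experiments verify the feasibility of the algorithm.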

20.
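Using the density-sensitive distance as the similarity measure and combining the affinity propagation clustering algorithm with spectral clustering, a density-sensitive hierarchical clustering algorithm is proposed. With the density-sensitive distance as the similarity, the algorithm applies affinity propagation several times to select a set of "possible exemplars" from the data set. Spectral clustering then re-clusters these "possible exemplars" to obtain the "final exemplars". Finally, each data point obtains its own cluster label from the label of its exemplar. Experimental results show that the algorithm outperforms traditional affinity propagation and spectral clustering in processing time, memory usage, and clustering error rate.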
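The abstract does not define the density-sensitive distance. One definition used in the density-sensitive clustering literature, assumed here, sets the length of an edge between two points to ρ^d(x, y) − 1 with ρ > 1 and takes the shortest path over all such edges, so that a chain of short hops through a dense region costs less than one long jump. The sketch below computes that distance for a small point set. The function name, the default `rho`, and the sample points are assumptions, and the affinity propagation and spectral clustering stages are not shown.

```python
import math

def density_sensitive_distances(points, rho=2.0):
    """All-pairs shortest paths over edges of length rho**dist - 1 (assumed definition)."""
    n = len(points)
    # Edge length grows exponentially with Euclidean distance, so long jumps are penalized.
    d = [[rho ** math.dist(points[i], points[j]) - 1 for j in range(n)]
         for i in range(n)]
    # Floyd-Warshall: allow paths that pass through intermediate points.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Three collinear points: going 0 -> 1 -> 2 costs 1 + 1, the direct jump costs 3.
points = [(0.0,), (1.0,), (2.0,)]
dist = density_sensitive_distances(points)
```

The cubic Floyd-Warshall loop is chosen here only for brevity. On larger data sets a sparse neighborhood graph with a shortest-path routine would be the more practical choice.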
