20 similar documents found.
1.
A density-grid-based clustering algorithm for data streams over sliding windows  Cited by: 1 (self-citations: 0, citations by others: 1)
A density-grid-based clustering algorithm for data streams is proposed. By introducing a "membership degree", it improves on traditional grid-density data-stream clustering algorithms, which simply take the number of data points in a grid cell as that cell's density; this resolves the handling of cells whose points belong to two different clusters, as well as the treatment of boundary points. The algorithm thus retains the high efficiency of grid-based methods while substantially improving clustering accuracy.
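As a rough illustration of the grid-density idea this abstract builds on (a minimal sketch only, not the paper's algorithm: the "membership degree" refinement for boundary points and the sliding-window maintenance are omitted, and the function name and parameters are hypothetical):

```python
from collections import defaultdict

def grid_cluster(points, cell=1.0, min_density=2):
    """Assign 2-D points to grid cells, keep cells holding at least
    min_density points, and merge 4-connected dense cells into
    clusters. Points in non-dense cells are labelled -1 (noise)."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(int(x // cell), int(y // cell))].append(i)
    dense = {c for c, pts in cells.items() if len(pts) >= min_density}
    # Merge neighbouring dense cells with a flood fill.
    labels, cluster = {}, 0
    for c in dense:
        if c in labels:
            continue
        stack = [c]
        while stack:
            cur = stack.pop()
            if cur in labels:
                continue
            labels[cur] = cluster
            cx, cy = cur
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in labels:
                    stack.append(nb)
        cluster += 1
    # Map point index to cluster id; noise points stay at -1.
    point_labels = [-1] * len(points)
    for c, cid in labels.items():
        for i in cells[c]:
            point_labels[i] = cid
    return point_labels
```

Counting points per cell instead of computing pairwise distances is what gives grid methods their efficiency; the paper's contribution addresses exactly the accuracy this coarse counting loses at cell boundaries.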
2.
Liang Bai Jiye Liang Chuangyin Dang Fuyuan Cao 《Pattern recognition》2011,44(12):2843-2861
Due to data sparseness and attribute redundancy in high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. To effectively address this issue, this paper presents a new optimization algorithm for clustering high-dimensional categorical data, which is an extension of the k-modes clustering algorithm. In the proposed algorithm, a novel weighting technique for categorical data is developed to calculate two weights for each attribute (or dimension) in each cluster and use the weight values to identify the subsets of important attributes that categorize different clusters. The convergence of the algorithm under an optimization framework is proved. The performance and scalability of the algorithm are evaluated experimentally on both synthetic and real data sets. The experimental studies show that the proposed algorithm is effective in clustering categorical data sets and also scalable to large data sets owing to its linear time complexity with respect to the number of data objects, attributes, or clusters.
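The core mechanism described above, per-cluster attribute weights that emphasize the attributes characterizing each cluster, can be sketched as follows. This is a toy one-weight variant, not the paper's two-weight scheme: here an attribute's weight in a cluster is simply the relative frequency of that cluster's mode on the attribute, and the seeding is naive (first k rows):

```python
from collections import Counter

def weighted_kmodes(data, k, iters=10):
    """Toy attribute-weighted k-modes: assign rows to the cluster
    mode with the smallest weighted Hamming distance, then update
    each cluster's modes and attribute weights."""
    centers = [list(data[i]) for i in range(k)]  # naive seeding
    m = len(data[0])
    weights = [[1.0] * m for _ in range(k)]
    labels = [0] * len(data)
    for _ in range(iters):
        # Assignment step: weighted mismatch count to each mode.
        for i, row in enumerate(data):
            labels[i] = min(
                range(k),
                key=lambda c: sum(weights[c][j] * (row[j] != centers[c][j])
                                  for j in range(m)))
        # Update step: per-cluster mode and mode frequency per attribute.
        for c in range(k):
            members = [data[i] for i in range(len(data)) if labels[i] == c]
            if not members:
                continue
            for j in range(m):
                mode, freq = Counter(r[j] for r in members).most_common(1)[0]
                centers[c][j] = mode
                weights[c][j] = freq / len(members)
    return labels, centers, weights
```

An attribute on which a cluster is nearly uniform gets weight close to 1 and dominates the distance, which is the intuition behind using weights to pick out the subspace that defines each cluster.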
3.
V. S. Sidorova 《Pattern Recognition and Image Analysis》2011,21(2):328-331
A histogram clustering algorithm is suggested that builds a hierarchy of distributions with better cluster separability. The algorithm optimizes the average cluster separability by choosing the quantization grid of the data subdomains, and allows a significant decrease in the number of clusters. An application of the algorithm to unsupervised classification of the Earth's surface from satellite spectral data is shown.
4.
Spectral clustering is an important family of clustering methods that relies heavily on the affinity matrix. However, conventional spectral clustering methods (1) treat each data point equally and are thus easily affected by outliers, (2) are sensitive to initialization, and (3) need the number of clusters to be specified in advance. To overcome these problems, we propose a novel spectral clustering algorithm that learns an intrinsic affinity matrix, uses local PCA to resolve intersections, and further employs a robust clustering procedure that is insensitive to initialization and generates clusters automatically without requiring the number of clusters as input. Experimental results on both artificial and real high-dimensional datasets show that the proposed method outperforms the comparison clustering methods in terms of four clustering metrics.
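For context, the conventional affinity matrix that this abstract says spectral clustering "relies heavily on" is typically a Gaussian (RBF) kernel with a hand-picked bandwidth; the paper's point is to learn the matrix instead. A minimal sketch of the conventional construction (sigma is the fixed bandwidth the learned approach avoids):

```python
import math

def gaussian_affinity(points, sigma=1.0):
    """Build the symmetric Gaussian affinity matrix
    W[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)),
    with zeros on the diagonal (no self-affinity)."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2 * sigma ** 2))
    return W
```

Sensitivity to the choice of sigma, and to outliers that distort the resulting graph Laplacian, is exactly the weakness the learned affinity matrix targets.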
5.
李光兴 《计算机工程与设计》2012,33(5):1876-1880
Using the distribution characteristics of data between cells, a bisecting-grid multi-density clustering algorithm, BGMC, is proposed. The algorithm judges the relation between two adjacent cells from the ratio of the product of the sample counts in their adjacent regions to the product of the total data counts in the two cells, finds similar cells and boundary cells, and determines the assignment of the data in boundary cells. Experimental results show that the algorithm distinguishes clusters of different densities, shapes, and sizes well, that the clustering result is independent of the data input order and of the order in which starting cells are chosen, and that the algorithm is efficient and scales well in both space and dimensionality.
6.
Deng Chao Song Jinwei Sun Ruizhi Cai Saihua Shi Yinxue 《Multimedia Tools and Applications》2018,77(22):29623-29637
Multimedia Tools and Applications - The notion of density has been widely used in many spatial-temporal (ST) clustering methods. This paper proposes the novel notion of an ST density-wave, which is...
7.
Hewijin Christine Jiau Yi-Jen Su Yeou-Min Lin Shang-Rong Tsai 《Journal of Intelligent Information Systems》2006,26(2):185-207
Clustering has been widely adopted in numerous applications, including pattern recognition, data analysis, image processing, and market research. When performing data mining, traditional clustering algorithms which use distance-based measurements to calculate the difference between data are unsuitable for non-numeric attributes such as nominal, Boolean, and categorical data. Applying an unsuitable similarity measurement in clustering may cause some valuable information embedded in the data attributes to be lost, and hence low quality clusters will be created. This paper proposes a novel hierarchical clustering algorithm, referred to as MPM, for the clustering of non-numeric data. The goals of MPM are to retain the data features of interest while effectively grouping data objects into clusters with high intra-similarity and low inter-similarity. MPM achieves these goals through two principal methods: (1) the adoption of a novel similarity measurement which has the ability to capture the "characterized properties" of information, and (2) the application of matrix permutation and matrix participation partitioning to the results of the similarity measurement (constructed in the form of a similarity matrix) in order to assign data to appropriate clusters. This study also proposes a heuristic-based algorithm, the Heuristic_MPM, to reduce the processing times required for matrix permutation and matrix partitioning, which together constitute the bulk of the total MPM execution time.
An erratum to this article is available at .
8.
Hewijin Christine Jiau Yi-Jen Su Yeou-Min Lin Shang-Rong Tsai 《Journal of Intelligent Information Systems》2006,26(3):303-303
The Publisher regrets that the original article incorrectly listed the authors' location as the "People's Republic of China". This was a typesetting error. Their correct location is the "Republic of China", which is also known as Taiwan. Their full affiliation is Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan.
The online version of the original article can be found at .
9.
Sudip Seal 《Information Processing Letters》2005,93(3):143-147
Microarrays are used for measuring expression levels of thousands of genes simultaneously. Clustering algorithms are used on gene expression data to find co-regulated genes. An often used clustering strategy is the Pearson correlation coefficient based hierarchical clustering algorithm presented in [Proc. Nat. Acad. Sci. 95 (25) (1998) 14863-14868], which takes O(N^3) time. We note that this run time can be reduced to O(N^2) by applying known hierarchical clustering algorithms [Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 619-628] to this problem. In this paper, we present an algorithm which runs in O(N log N) time using a geometrical reduction and show that it is optimal.
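The dissimilarity this line of work clusters on is 1 minus the Pearson correlation between expression profiles, so perfectly co-regulated genes are at distance 0 and perfectly anti-correlated ones at distance 2. A minimal sketch of that measure (the clustering itself, naively re-scanning all pairs per merge, is what costs O(N^3)):

```python
import math

def pearson_dist(u, v):
    """1 - Pearson correlation between two expression profiles,
    in [0, 2]: 0 for perfect positive correlation, 2 for perfect
    negative correlation."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)
```

Because the correlation is invariant to shifting and scaling each profile, profiles can be standardized once up front; the paper's O(N log N) result exploits the geometry of such normalized profiles.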
10.
Yuwei Wang Yuanchun Zhou Ying Liu Ze Luo Danhuai Guo Jing Shao Fei Tan Liang Wu Jianhui Li Baoping Yan 《Frontiers of Computer Science》2013,7(4):475-485
Advanced satellite tracking technologies provide biologists with long-term location sequence data for understanding the movement of wild birds and for finding explicit correlations between the dynamics of migratory birds and the spread of avian influenza. In this paper, we propose a hierarchical clustering algorithm based on a recursive grid partition and kernel density estimation (KDE) to hierarchically identify wild bird habitats with different densities. We hierarchically cluster the GPS data by taking into account the following observations: 1) the habitat variation on a variety of geospatial scales; 2) the spatial variation of the activity patterns of birds in different stages of the migration cycle. In addition, we measure the site fidelity of wild birds based on the clustering. To assess effectiveness, we have evaluated our system using a large-scale GPS dataset collected from 59 birds over three years. As a result, our approach can identify the hierarchical habitats and distribution of wild birds more efficiently than several commonly used algorithms such as DBSCAN and DENCLUE.
11.
Pattern Analysis and Applications - The curse of dimensionality in high-dimensional data is one of the major challenges in data clustering. Recently, a considerable amount of literature has been...
12.
In several application domains, high-dimensional observations are collected and then analysed in search for naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, for which PSC often provides state-of-the-art results.
13.
14.
15.
《Expert systems with applications》2014,41(9):4148-4157
Tick data are used in several applications that need to keep track of values changing over time, like prices on the stock market or meteorological measurements. Due to the possibly very frequent changes, the size of tick data tends to increase rapidly. Therefore, it becomes of paramount importance to reduce the storage space of tick data while, at the same time, allowing queries to be executed efficiently. In this paper, we propose an approach to decompose the original tick data matrix by clustering their attributes using a new clustering algorithm called Storage-Optimizing Hierarchical Agglomerative Clustering (SOHAC). We additionally propose a method for speeding up SOHAC based on a new lower bounding technique that allows SOHAC to be applied to high-dimensional tick data. Our experimental evaluation shows that the proposed approach compares favorably to several baselines in terms of compression. Additionally, it can lead to significant speedup in terms of running time.
16.
Subspace clustering algorithms have shown their advantage in handling high-dimensional data by optimizing a linear combination of clustering criteria. However, setting the coefficients of these criteria without prior knowledge leads to inaccurate and poorly robust clustering results. To address this problem, in this paper, we propose to optimize the multiple clustering criteria simultaneously, without any predefined coefficients, by a multi-objective evolutionary algorithm. Furthermore, to accelerate the convergence of the algorithm, we provide a novel local search method, in which the multi-objective clustering problem is decomposed into many localized scalarizing sub-problems by reference vectors, and solutions are then locally searched around their associated sub-problems. Finally, we develop a knee-pruning fuzzy ensemble method for selecting the final solution. This method applies clustering ensemble to solutions selected from knee regions to get robust results. Experiments on UCI benchmarks and gene expression datasets show that our proposed algorithm can efficiently handle high-dimensional clustering problems without any user-defined coefficients.
17.
Thanh N. Tran 《Computational statistics & data analysis》2006,51(2):513-525
Density-based clustering algorithms for multivariate data often have difficulties with high-dimensional data and clusters of very different densities. A new density-based clustering algorithm, called KNNCLUST, is presented in this paper that is able to tackle these situations. It is based on the combination of nonparametric k-nearest-neighbor (KNN) and kernel (KNN-kernel) density estimation. The KNN-kernel density estimation technique makes it possible to model clusters of different densities in high-dimensional data sets. Moreover, the number of clusters is identified automatically by the algorithm. KNNCLUST is tested using simulated data and applied to a multispectral compact airborne spectrographic imager (CASI) image of a floodplain in the Netherlands to illustrate the characteristics of the method.
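The k-nearest-neighbor half of the KNN-kernel combination can be sketched as follows. This is a minimal illustration only, not KNNCLUST itself: the kernel smoothing and the classification rule are omitted, and the function name is hypothetical. The estimate simply takes density at a point to be inversely proportional to the distance to its k-th nearest neighbor, which adapts automatically to regions of different density:

```python
def knn_density(points, k=3):
    """Naive kNN density sketch: for each point, density is
    1 / (distance to its k-th nearest neighbor). O(n^2 log n)
    brute force; real implementations use spatial indexes."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    dens = []
    for p in points:
        ds = sorted(dist(p, q) for q in points if q is not p)
        dens.append(1.0 / ds[k - 1] if ds[k - 1] > 0 else float('inf'))
    return dens
```

In a dense cluster the k-th neighbor is close, so the estimate is high; for an outlier it is far, so the estimate is low, regardless of any global bandwidth, which is what makes the approach suitable for clusters of very different densities.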
18.
Machine Learning - Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse...
19.
20.
Taking a density-sensitive distance as the similarity measure and combining the affinity propagation (AP) clustering algorithm with spectral clustering, a density-sensitive hierarchical clustering algorithm is proposed. Using the density-sensitive distance as the similarity, the algorithm repeatedly applies affinity propagation to select a set of "candidate cluster exemplars" from the data set; spectral clustering then clusters these candidate exemplars to obtain the "final cluster exemplars"; finally, each data point takes the class label of its final exemplar. Experimental results show that the algorithm outperforms the traditional affinity propagation and spectral clustering algorithms in processing time, memory usage, and clustering error rate.
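A common formulation of the density-sensitive distance mentioned above (which may differ in detail from this paper's definition) stretches each Euclidean edge to rho**d - 1 and then takes shortest paths, so that a route hopping through many nearby points in a dense region is shorter than one long jump across a sparse gap. A minimal all-pairs sketch with hypothetical names:

```python
import heapq
import math

def density_sensitive_dist(points, rho=2.0):
    """All-pairs density-sensitive distances: edge length between
    two points is rho**d - 1 for Euclidean distance d, and the
    pairwise distance is the shortest path under these edge
    lengths (Dijkstra from every source over the complete graph)."""
    n = len(points)

    def edge(i, j):
        return rho ** math.dist(points[i], points[j]) - 1.0

    D = [[math.inf] * n for _ in range(n)]
    for s in range(n):
        D[s][s] = 0.0
        pq = [(0.0, s)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > D[s][u]:
                continue
            for v in range(n):
                nd = d + edge(u, v)
                if nd < D[s][v]:
                    D[s][v] = nd
                    heapq.heappush(pq, (nd, v))
    return D
```

Because rho**d - 1 is convex in d, splitting one long edge into several short hops always reduces the total length, which is how the measure encodes density into the metric that AP and spectral clustering then consume.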