20 similar documents found (search time: 15 ms)
1.
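A gravity-based clustering method  Cited by: 8 (self-citations: 1, other citations: 8)
Drawing on the idea of universal gravitation, this paper proposes a gravity-based clustering method, GCA (Gravity-based Clustering Approach), together with a simple and effective way to compute the clustering threshold. GCA has approximately linear time complexity in both the size of the database and the number of attributes, which gives it good scalability. Experimental results show that GCA produces high-quality clusterings.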
2.
Chien-I Lee Cheng-Jung Tsai Tong-Qin Wu Wei-Pang Yang 《Expert systems with applications》2008,34(4):3021-3032
The class imbalance problem is an important issue in classification in data mining. For example, in applications such as fraudulent telephone call detection, telecommunications management, and rare diagnoses, users are more interested in the minority class than in the majority class. Although many algorithms have been proposed to solve the imbalance problem, they cannot be applied directly to a multi-relational database. Nevertheless, much data today, such as financial transactions and medical anamneses, is stored in a multi-relational database rather than in a single data sheet. On the other hand, the widely used multi-relational classification approaches, such as TILDE, FOIL and CrossMine, do not handle imbalanced databases well. In this paper, we propose a multi-relational g-mean decision tree algorithm to solve the imbalance problem in a multi-relational database. As shown in our experiments, our approach can mine a multi-relational imbalanced database more accurately.
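The abstract does not define the g-mean it names. A minimal sketch, assuming the usual definition (the geometric mean of sensitivity and specificity) and a hypothetical helper name `g_mean`, shows why such a measure is preferred over plain accuracy on imbalanced data. It is not the paper's decision tree algorithm.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# One minority example among ten; a classifier that always predicts the
# majority class reaches 90% accuracy but a g-mean of zero.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
score = g_mean(y_true, always_majority)
```

Because the two rates are multiplied, ignoring the minority class entirely drives the score to zero regardless of how large the majority class is.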
3.
4.
In this paper, we explore the efficient clustering of market-basket data. Unlike traditional data, market-basket data are known to be high-dimensional and sparse. Because they do not explicitly consider the taxonomy, most prior efforts on clustering market-basket data can be viewed as dealing with items at the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is of great importance for marketing strategies as well as for representing the results of clustering techniques for market-basket data. In view of these features, we devise a novel measurement, called the category-based adherence, and use it to perform the clustering. With this measurement, we develop an efficient clustering algorithm for market-basket data, called algorithm k-todes, whose objective is to minimize the category-based adherence. The distance of an item to a given cluster is defined as the number of links between this item and its nearest tode. The category-based adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster. A validation model based on information gain is also devised to assess the quality of clustering for market-basket data. Our experimental results on both real and synthetic datasets show that, with the taxonomy information, algorithm k-todes significantly outperforms prior work in both execution efficiency and clustering quality as measured by information gain, indicating the usefulness of category-based adherence in market-basket data clustering.
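The two definitions in this abstract (item-to-cluster distance and category-based adherence) can be sketched as follows. This is only an illustration under stated assumptions: the taxonomy is given as a hypothetical child-to-parent mapping, a cluster is represented by a set of taxonomy nodes (its "todes"), and "number of links" is taken to be the path length in the taxonomy tree. The function names and the toy taxonomy are invented here, and the paper's actual k-todes procedure is not reproduced.

```python
from collections import deque


def build_adjacency(parent):
    """Turn a child -> parent mapping into an undirected adjacency map."""
    adjacency = {}
    for child, par in parent.items():
        adjacency.setdefault(child, set()).add(par)
        adjacency.setdefault(par, set()).add(child)
    return adjacency


def item_distance(item, todes, adjacency):
    """Number of taxonomy links between an item and its nearest tode (BFS)."""
    seen = {item}
    queue = deque([(item, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in todes:
            return dist
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return float("inf")  # no tode reachable from this item


def adherence(transaction, todes, adjacency):
    """Average distance of the transaction's items to the cluster's todes."""
    return sum(item_distance(i, todes, adjacency) for i in transaction) / len(transaction)


# Hypothetical toy taxonomy: food -> {drinks, bakery} -> leaf items.
parent = {
    "cola": "drinks", "juice": "drinks",
    "bread": "bakery",
    "drinks": "food", "bakery": "food",
}
adjacency = build_adjacency(parent)
basket = ["cola", "bread"]
narrow_cluster = adherence(basket, {"drinks"}, adjacency)
broad_cluster = adherence(basket, {"drinks", "bakery"}, adjacency)
```

Under these assumptions a transaction would be assigned to the cluster with the smallest adherence value, which here is the cluster whose todes cover both of the basket's categories.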
5.
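Ma Haiyun (马海云) 《自动化与仪器仪表》 (Automation & Instrumentation) 2010,(1):14-15,27
This paper summarizes the current state of research on clustering algorithms in data mining and compares their differences and limitations. It then proposes a new clustering method. A worked example shows that the method provides an effective platform for data mining.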
6.
Clustering is a classic problem in machine learning and pattern recognition; however, a few complications arise when we try to transfer proposed solutions to the data stream model. Recently, new algorithms have been proposed for the basic clustering problem on massive data sets. They produce an approximate solution while using memory efficiently, memory being the most critical resource for streaming computation. In this paper, based on these solutions, we present a new model for clustering clickstream data that applies three different phases in the data processing and is validated through a set of experiments.
7.
In this paper, we develop a novel framework, called Monitoring Vehicle Outliers based on a Clustering technique (MVOC), for monitoring vehicle outliers caused by complex vehicle states. Vehicle outlier monitoring is a method for continuously checking current vehicle conditions. Most previous monitoring methods perform simple operations that depend on uncomplicated analyses or on the expected lifetimes of vehicle components. However, many serious vehicle outliers, such as the engine turning off during a drive, result from complex vehicle states influenced by correlated components. The proposed method monitors current vehicle conditions based not on simple components, as previous methods do, but on more complex and varied vehicle states, using a clustering technique. We cluster the vehicle data and then analyze the generated clusters together with information on vehicle outliers caused by complex correlations among vehicle components. Thus, we can learn vehicle information in more detail. To facilitate MVOC, we also propose related techniques such as sampling cluster data with representative attributes and deciding cluster characteristics on the basis of relations between vehicle data and states. We then demonstrate the performance of our approach at monitoring vehicle outliers, using real complex correlations between outliers and vehicle data, through various experiments. Experimental results show that the proposed method can not only monitor complex outliers by predicting the possibility of their occurrence in advance but also outperform a standard technique. Moreover, we present the statistical significance of the results through significance tests.
8.
9.
《Expert systems with applications》2014,41(9):4148-4157
Tick data are used in several applications that need to keep track of values changing over time, like prices on the stock market or meteorological measurements. Due to the possibly very frequent changes, the size of tick data tends to increase rapidly. Therefore, it becomes of paramount importance to reduce the storage space of tick data while, at the same time, allowing queries to be executed efficiently. In this paper, we propose an approach to decompose the original tick data matrix by clustering their attributes using a new clustering algorithm called Storage-Optimizing Hierarchical Agglomerative Clustering (SOHAC). We additionally propose a method for speeding up SOHAC based on a new lower bounding technique that allows SOHAC to be applied to high-dimensional tick data. Our experimental evaluation shows that the proposed approach compares favorably to several baselines in terms of compression. Additionally, it can lead to significant speedup in terms of running time.
10.
Thanh N. Tran 《Computational statistics & data analysis》2006,51(2):513-525
Density-based clustering algorithms for multivariate data often have difficulties with high-dimensional data and clusters of very different densities. A new density-based clustering algorithm, called KNNCLUST, is presented in this paper that is able to tackle these situations. It is based on the combination of nonparametric k-nearest-neighbor (KNN) and kernel (KNN-kernel) density estimation. The KNN-kernel density estimation technique makes it possible to model clusters of different densities in high-dimensional data sets. Moreover, the number of clusters is identified automatically by the algorithm. KNNCLUST is tested using simulated data and applied to a multispectral compact airborne spectrographic imager (CASI) image of a floodplain in the Netherlands to illustrate the characteristics of the method.
11.
Almost all subspace clustering algorithms proposed so far are designed for numeric datasets. In this paper, we present a k-means type clustering algorithm that finds clusters in data subspaces in mixed numeric and categorical datasets. In this method, we compute the contribution of each attribute to the different clusters. We propose a new cost function for a k-means type algorithm. One advantage of this algorithm is that its complexity is linear in the number of data points. The algorithm is also useful for describing cluster formation in terms of attribute contributions to the different clusters. The algorithm is tested on various synthetic and real datasets to show its effectiveness. The clustering results are explained using the attribute weights in the clusters and are also compared with published results.
12.
Robust projected clustering  Cited by: 2 (self-citations: 2, other citations: 2)
Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a
subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many
overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed
for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical
data are that they rely on parameters whose appropriate values are difficult to set or that they are unable
to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that
can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need
the number of projected clusters as input, and can discover, under very general conditions, the true number of projected clusters.
P3C is effective in detecting very low-dimensional projected clusters embedded in high dimensional spaces. P3C positions itself
between projected and subspace clustering in that it can compute both disjoint and overlapping clusters. P3C is the first projected
clustering algorithm for both numerical and categorical data.
13.
Elaine Y. Chan Michael K. Ng Joshua Z. Huang 《Pattern recognition》2004,37(5):943-952
One of the main problems in cluster analysis is the weighting of attributes so as to discover structures that may be present. By using weighted dissimilarity measures for objects, a new approach is developed, which allows the use of the k-means-type paradigm to efficiently cluster large data sets. The optimization algorithm is presented and the effectiveness of the algorithm is demonstrated with both synthetic and real data sets.
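To make the idea of a weighted dissimilarity inside a k-means-type assignment step concrete, here is a minimal sketch. It assumes a weighted squared Euclidean form, d(x, z) = sum_j w_j (x_j - z_j)^2, with the weights supplied as fixed inputs. The abstract does not state the exact measure or how the weights are optimized, so the function names and that form are assumptions rather than the authors' method.

```python
def weighted_dissimilarity(x, center, weights):
    """Assumed form: sum over attributes of w_j * (x_j - z_j)^2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, center))


def assign(points, centers, weights):
    """k-means-type assignment step: index of the nearest center per point."""
    return [
        min(range(len(centers)),
            key=lambda k: weighted_dissimilarity(p, centers[k], weights))
        for p in points
    ]


centers = [(0.0, 0.0), (5.0, 5.0)]
point = (4.0, 0.0)

# With equal weights the second attribute pulls the point towards center 0.
equal = assign([point], centers, (1.0, 1.0))
# With the second attribute switched off, only the first one matters and
# the same point moves to center 1.
first_only = assign([point], centers, (1.0, 0.0))
```

The example shows how the attribute weights alone can change cluster membership, which is why choosing them well matters for the structures that are discovered.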
14.
Clustering consists of partitioning a set of objects into disjoint and homogeneous clusters. For many years, clustering methods have been applied in a wide variety of disciplines and scientific areas. Traditionally, clustering methods deal with numerical data, i.e. objects represented by a conjunction of numerical attribute values. However, commercial and scientific databases nowadays usually contain categorical data, i.e. objects represented by categorical attributes. In this paper we present a dissimilarity measure that is capable of dealing with tree-structured categorical data. Thus, it can be used to extend the various versions of the very popular k-means clustering algorithm to deal with such data. We discuss how such an extension can be achieved. Moreover, we show empirically that the proposed dissimilarity measure is accurate compared to other well-known (dis)similarity measures for categorical data.
15.
Darío García-García Emilio Parrado-Hernández 《Pattern recognition》2011,44(5):1014-1022
This paper proposes a novel similarity measure for clustering sequential data. We first construct a common state space by training a single probabilistic model with all the sequences in order to get a unified representation for the dataset. Then, distances are obtained attending to the transition matrices induced by each sequence in that state space. This approach solves some of the usual overfitting and scalability issues of the existing semi-parametric techniques that rely on training a model for each sequence. Empirical studies on both synthetic and real-world datasets illustrate the advantages of the proposed similarity measure for clustering sequences.
16.
B.B. Chaudhuri 《Pattern recognition letters》1985,3(3):179-183
An efficient divisive clustering technique based on hierarchical partitioning of space is proposed. It may be executed in O(N) time at each level of the hierarchy. The technique, along with some of its modifications, is implemented on typical data sets and the results are discussed.
17.
Clusters and grids of workstations provide available resources for data mining processes. To exploit these resources, new distributed algorithms are necessary, particularly concerning the way to distribute data and to use this partition. We present a clustering algorithm dubbed Progressive Clustering that provides an “intelligent” distribution of data on grids. The usefulness of this algorithm is shown for several distributed data mining tasks.
18.
Yanchang Zhao Chengqi Zhang 《Journal of Systems and Software》2011,84(9):1524-1539
We propose an enhanced grid-density based approach for clustering high dimensional data. Our technique takes objects (or points) as atomic units in which the size requirement to cells is waived without losing clustering accuracy. For efficiency, a new partitioning is developed to make the number of cells smoothly adjustable; a concept of the ith-order neighbors is defined for avoiding considering the exponential number of neighboring cells; and a novel density compensation is proposed for improving the clustering accuracy and quality. We experimentally evaluate our approach and demonstrate that our algorithm significantly improves the clustering accuracy and quality.
19.
20.
Muneaki Ohshima Ning Zhong YiYu Yao Chunnian Liu 《Data mining and knowledge discovery》2007,15(2):249-273
Peculiarity rules are a new type of useful knowledge that can be discovered by searching the relevance among peculiar data.
A main task in mining such knowledge is peculiarity identification. Previous methods for finding peculiar data focus on attribute
values. By extending to record-level peculiarity, this paper investigates relational peculiarity-oriented mining. Peculiarity
rules are mined, and more importantly explained, in a relational mining framework. Several experiments are carried out and
the results show that relational peculiarity-oriented mining is effective.