20 similar documents found.
1.
Building on the existing DBSCAN algorithm, a density-based clustering method for market basket transaction data, DCMBD (density-based clustering for market basket data), is proposed. A new transaction representation is used to address the high dimensionality and sparsity of market basket data, and the algorithm is adapted accordingly to speed up clustering. Experimental results show that the method is effective and feasible.
2.
An improved clustering algorithm for mixed-attribute data
k-prototypes is currently the main clustering algorithm for data with mixed numerical and categorical attributes, but its results depend strongly on the initial values. The initialization of k-prototypes is analyzed and a new, improved selection method is proposed. The method is more stable and more scalable, and reduces the randomness of the results to some extent. Simulations on real data sets show that the improved algorithm is correct and effective.
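The abstract does not detail the improved initialization, but the mixed dissimilarity that k-prototypes optimizes is standard. A minimal sketch of that measure follows; the attribute layout and the `gamma` weight are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def k_prototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Huang-style mixed dissimilarity: squared Euclidean distance on the
    numeric attributes plus gamma times the number of categorical mismatches."""
    numeric = float(np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2))
    categorical = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * categorical

# Example: compare one object with one prototype
d = k_prototypes_dissimilarity([1.0, 2.5], ["red", "small"],
                               [0.5, 3.0], ["red", "large"], gamma=0.8)
```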
3.
Based on the molecular kinetic theory, a molecular dynamics-like data clustering approach is proposed in this paper. Clusters are extracted after data points fuse in the iterating space, driven by a dynamical mechanism similar to the way molecules interact through molecular forces. The approach finds possible natural clusters without pre-specifying their number. Compared with three other clustering methods (trimmed k-means, the JP algorithm, and another method based on a gravitational model), it identified clusters better in the experiments.
4.
Clustering in high-dimensional spaces is a difficult problem which is recurrent in many domains, for example in image analysis. The difficulty is due to the fact that high-dimensional data usually exist in different low-dimensional subspaces hidden in the original space. A family of Gaussian mixture models designed for high-dimensional data, which combines the ideas of subspace clustering and parsimonious modeling, is presented. These models give rise to a clustering method based on the expectation-maximization algorithm, called high-dimensional data clustering (HDDC). In order to correctly fit the data, HDDC estimates the specific subspace and the intrinsic dimension of each group. Experiments on artificial and real data sets show that HDDC outperforms existing methods for clustering high-dimensional data.
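HDDC fits these parsimonious mixtures inside EM; the sketch below only illustrates the per-group subspace and intrinsic-dimension estimation idea, using a simple explained-variance cutoff rather than the model selection used in the paper. The function name and threshold are assumptions.

```python
import numpy as np

def cluster_subspace(points, var_threshold=0.90):
    """Estimate a group-specific subspace and intrinsic dimension via PCA:
    keep the leading eigenvectors that explain var_threshold of the variance."""
    X = np.asarray(points, dtype=float)
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]                     # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    d = int(np.searchsorted(ratio, var_threshold)) + 1    # intrinsic dimension
    return eigvecs[:, :d], d

# Usage sketch: basis, dim = cluster_subspace(np.random.rand(200, 20))
```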
5.
Bi-Ru Dai, Cheng-Ru Lin, Ming-Syan Chen. The VLDB Journal (The International Journal on Very Large Data Bases), 2007, 16(2): 201-217
In order to import domain knowledge or application-dependent parameters into data mining systems, constraint-based mining has attracted a lot of research attention recently. In this paper, the attributes employed to model the constraints are called constraint attributes, and the attributes involved in the objective function to be optimized are called optimization attributes. The constrained clustering considered here is conducted in such a way that the objective function over the optimization attributes is optimized subject to the condition that the imposed constraint is satisfied. Explicitly, we address the problem of constrained clustering with numerical constraints, in which the constraint attribute values of any two data items in the same cluster are required to be within the corresponding constraint range. This numerical constrained clustering problem cannot be dealt with by conventional clustering algorithms, so we devise several effective and efficient algorithms to solve it. Due to the intrinsic nature of numerical constrained clustering, there is an order dependency in the process of attaining the clustering, which in many cases degrades the clustering results. In view of this, we devise a progressive constraint relaxation technique to remedy this drawback and improve the overall clustering results. Explicitly, by using a smaller (tighter) constraint range in earlier iterations of merge, we have more room to relax the constraint and seek better solutions in subsequent iterations. It is empirically shown that the progressive constraint relaxation technique improves not only the execution efficiency but also the clustering quality.
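As a rough illustration only (the paper's algorithms are considerably more involved), the toy sketch below shows the two ideas the abstract describes: a merge is allowed only while the spread of the constraint attribute inside the merged cluster stays within the current range, and that range starts tight and is relaxed towards the final constraint over successive rounds. All names and parameters are illustrative assumptions.

```python
import numpy as np

def constrained_agglomerative(X, constraint_col, final_range, n_rounds=3, n_clusters=5):
    """Toy agglomerative clustering with a numerical constraint: merge the two
    closest clusters whose combined spread on the constraint attribute stays
    within the current limit; the limit is tightened early and relaxed towards
    final_range in later rounds (progressive constraint relaxation)."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]
    for r in range(1, n_rounds + 1):
        limit = final_range * r / n_rounds           # tighter range in early rounds
        merged = True
        while merged and len(clusters) > n_clusters:
            merged, best = False, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    vals = X[clusters[i] + clusters[j], constraint_col]
                    if vals.max() - vals.min() > limit:      # constraint violated
                        continue
                    d = np.linalg.norm(X[clusters[i]].mean(0) - X[clusters[j]].mean(0))
                    if best is None or d < best[0]:
                        best = (d, i, j)
            if best is not None:
                _, i, j = best
                clusters[i] += clusters.pop(j)
                merged = True
    return clusters

# Usage sketch: groups = constrained_agglomerative(np.random.rand(40, 3), constraint_col=2, final_range=0.3)
```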
6.
To cluster dynamic data effectively, handle the relationship between existing clusters and newly arrived data, use computing resources efficiently, and improve clustering efficiency, an incremental clustering algorithm based on diffusion and emergence is proposed. Building on the diffusion-emergence clustering algorithm, it uses affinity propagation to refine the splitting mechanism and achieves effective aggregation of old and new data. Experimental results show that the algorithm clusters dynamic data effectively and improves both the efficiency of aggregating dynamic data and the utilization of resources.
7.
Privacy preserving clustering on horizontally partitioned data
Data mining has been a popular research area for more than a decade due to its vast spectrum of applications. However, the popularity and wide availability of data mining tools also raised concerns about the privacy of individuals. The aim of privacy preserving data mining researchers is to develop data mining techniques that can be applied to databases without violating the privacy of individuals. Privacy preserving techniques have been proposed for various data mining models, initially for classification on centralized data and later for association rules in distributed environments. In this work, we propose methods for constructing the dissimilarity matrix of objects from different sites in a privacy preserving manner, which can be used for privacy preserving clustering as well as database joins, record linkage and other operations that require pair-wise comparison of individual private data objects horizontally distributed to multiple sites. We show the communication and computation complexity of our protocol by conducting experiments over synthetically generated and real datasets. Each experiment is also performed for a baseline protocol with no privacy guarantees, so that the overhead introduced by security and privacy can be assessed by comparing the two protocols.
8.
9.
A partition-based dynamic clustering algorithm
Cluster analysis is an important branch of data mining, and many clustering algorithms have been proposed; partitioning methods are one family among them. Their drawbacks are that the number of clusters must be given in advance and that they are sensitive to the initial partition and the input order. To overcome these defects, a partition-based dynamic clustering algorithm built on the partitioning approach is proposed. It selects well-separated initial values by descending density and by distance, filters noise data, and dynamically changes the number of clusters during the clustering process, improving clustering quality and producing more natural results.
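A minimal sketch of the kind of initialization the abstract describes: rank points by local density, drop low-density points as noise, and greedily keep dense points that are spread out by distance. The radius, noise threshold, and separation factor below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def pick_initial_centers(X, k, radius, min_density=3):
    """Rank points by local density (neighbours within `radius`), skip sparse
    points as noise, and greedily keep dense points that lie far apart."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    density = (dists < radius).sum(axis=1) - 1        # exclude the point itself
    centers = []
    for i in np.argsort(density)[::-1]:               # densest first
        if density[i] < min_density:                  # remaining points are noise
            break
        if all(dists[i, c] > 2 * radius for c in centers):
            centers.append(i)
        if len(centers) == k:
            break
    return X[centers]
```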
10.
Behind Deep Web pages lie massive data sources that can be accessed through structured query interfaces. Organizing these data sources by domain is a key step in Deep Web data integration. Existing partitioning methods are mainly based on query interface schemas or on the results returned by submitted queries, and they suffer from the difficulty of fully extracting query interface features and from the low efficiency of submitting database queries. A frequent-itemset-based clustering method that also uses the text of the page is proposed: data sources are clustered by domain according to the title, keywords, and hint text of the page containing the query interface, which effectively avoids the traditional methods' dependence on query interface features and the high dimensionality of text models. Experimental results show that the method is feasible and fairly efficient.
11.
Joonho Kwon. Data & Knowledge Engineering, 2012, 71(1): 69-91
Recently, there has been growing interest in developing web services composition search systems. Current solutions have the drawback of including redundant web services in the results. In this paper, we propose a non-redundant web services composition search system called NRC, which is based on a two-phase algorithm. In the NRC system, a Link Index is built over web services according to their connectivity. In the forward phase, candidate compositions are efficiently found by searching the Link Index. In the backward phase, the candidate compositions are decomposed into several non-redundant web services compositions by using the concept of tokens. Results of experiments involving data sets with different characteristics show the performance benefits of the NRC techniques in comparison to state-of-the-art composition approaches.
12.
For streaming data that arrive continuously such as multimedia data and financial transactions, clustering algorithms are typically allowed to scan the data set only once. Existing research in this domain mainly focuses on improving the accuracy of clustering. In this paper, a novel density-based hierarchical clustering scheme for streaming data is proposed in order to improve both accuracy and effectiveness; it is based on the agglomerative clustering framework. Traditionally, clustering algorithms for streaming data often use the cluster center to represent the whole cluster when conducting cluster merging, which may lead to unsatisfactory results. We argue that even if the data set is accessed only once, some parameters, such as the variance within cluster, the intra-cluster density and the inter-cluster distance, can be calculated accurately. This may bring measurable benefits to the process of cluster merging. Furthermore, we employ a general framework that can incorporate different criteria and, given the same criteria, will produce similar clustering results for both streaming and non-streaming data. In experimental studies, the proposed method demonstrates promising results with reduced time and space complexity.
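The claim that within-cluster variance and related quantities can be computed exactly in a single scan follows from keeping per-cluster sufficient statistics, in the spirit of BIRCH-style clustering features. The sketch below only illustrates that idea and is not the paper's data structure; the class and method names are assumptions.

```python
import numpy as np

class ClusterFeature:
    """One-pass summary of a cluster: from (count, linear sum, sum of squares)
    the mean and within-cluster variance are recovered exactly, and two
    summaries can be merged when clusters are agglomerated."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, np.zeros(dim), np.zeros(dim)

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def mean(self):
        return self.ls / self.n

    def variance(self):                  # per-dimension within-cluster variance
        return self.ss / self.n - (self.ls / self.n) ** 2
```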
13.
Time-focused clustering of trajectories of moving objects
Spatio-temporal, geo-referenced datasets are growing rapidly and will grow even more in the near future, due to both technological and social/commercial reasons. From the data mining viewpoint, spatio-temporal trajectory data introduce new dimensions and, correspondingly, novel issues in performing the analysis tasks. In this paper, we consider the clustering problem applied to the trajectory data domain. In particular, we propose an adaptation of a density-based clustering algorithm to trajectory data based on a simple notion of distance between trajectories. Then, a set of experiments on synthesized data is performed in order to test the algorithm and to compare it with other standard clustering approaches. Finally, a new approach to the trajectory clustering problem, called temporal focussing, is sketched, with the aim of exploiting the intrinsic semantics of the temporal dimension to improve the quality of trajectory clustering.
The authors are members of the Pisa KDD Laboratory, a joint research initiative of ISTI-CNR and the University of Pisa.
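As a hedged illustration of "a simple notion of distance between trajectories": the paper defines its distance over a time interval, while the sketch below assumes both trajectories are sampled at the same time points, which is a simplification.

```python
import numpy as np

def trajectory_distance(t1, t2):
    """Average Euclidean distance between two trajectories sampled at the same
    time points, each given as an (n_points, 2) array of positions."""
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    return float(np.linalg.norm(t1 - t2, axis=1).mean())

# With such a distance, a density-based algorithm (e.g. a DBSCAN variant) can
# treat whole trajectories as the objects to be clustered.
```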
14.
15.
Subspace clustering is an effective approach to clustering high-dimensional data; its principle is to cluster the data in subspaces that are as small as possible while retaining as much of the information in the original data as possible. Building on existing subspace clustering work, a new way of searching subspaces is introduced: it combines cluster size and information entropy to compute the weight of each subspace dimension, and then uses the feature vectors of the subspace to compute the similarity between clusters. The algorithm clusters data in the spirit of agglomerative hierarchical clustering, overcoming the shortcomings of using information entropy or traditional similarity alone. Tests on three typical categorical data sets (Zoo, Votes, Soybean) show that, compared with other algorithms, it improves clustering accuracy and is highly stable.
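The algorithm weights subspace dimensions by combining cluster size with information entropy; the sketch below shows only the per-dimension entropy computation on which such a weight could be based. How size and entropy are actually combined is not reproduced, and the function name is an assumption.

```python
import math
from collections import Counter

def dimension_entropy(values):
    """Shannon entropy of one categorical attribute restricted to a cluster;
    lower entropy means the attribute is more homogeneous within the cluster
    and can be weighted more heavily."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Example: the 'legs' attribute within one candidate cluster of Zoo records
h = dimension_entropy([4, 4, 4, 2, 4])
```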
16.
By combining the traditional Parzen window method with a more reasonable strategy for discarding historical data, the information entropy of the projection of the whole data set onto low-dimensional subspaces can be computed; based on this entropy, a subspace clustering algorithm for high-dimensional data streams (PStream) is implemented. Both theory and experiments show that, compared with traditional algorithms, it clusters a data stream with high accuracy in a single pass; although its running efficiency differs little from existing methods such as HPStream, it clearly improves the clustering quality.
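PStream's discard strategy and its entropy computation over low-dimensional projections are not reproduced here; the following is only the classical Parzen-window density estimate that the abstract starts from, with a Gaussian kernel assumed.

```python
import numpy as np

def parzen_density(x, samples, h=1.0):
    """Parzen-window (kernel) density estimate at point x from an
    (n_samples, dim) array of samples, using a Gaussian kernel of bandwidth h."""
    x = np.asarray(x, float)
    samples = np.asarray(samples, float)
    d = samples.shape[1]
    diff = (samples - x) / h
    kernel = np.exp(-0.5 * np.sum(diff ** 2, axis=1)) / ((2 * np.pi) ** (d / 2))
    return kernel.mean() / h ** d

# Usage sketch: density = parzen_density([0.0, 0.0], np.random.randn(500, 2), h=0.5)
```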
17.
Optimization and parallelization of the BIRCH clustering algorithm
To improve clustering quality and address the limited accuracy of the BIRCH algorithm, the idea of using different thresholds for different clusters in the clustering-feature tree is proposed, which alleviates BIRCH's difficulty in clustering clusters of very different sizes. How to cluster quickly on a cluster system is also studied in depth, and several improvements are suggested, including custom data types, a data-parallel design, and a non-uniform data partitioning strategy. Experimental results show that these improvements achieve satisfactory running time and speedup.
18.
To address the limitations of the interval distances used in current fuzzy clustering of interval data, the Wasserstein distance, which accounts for the distribution of values within an interval, is introduced, and single-index and double-index adaptive fuzzy clustering algorithms and iterative models based on it are proposed. Simulation experiments and the CR index confirm the advantages of these models. The algorithms are of practical significance for mining massive data.
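If each interval is read as a uniform distribution between its endpoints, the L2 Wasserstein distance between two intervals has the closed form sketched below, which depends on both the midpoints and the half-widths, so intervals with the same centre but different spreads are distinguished. The paper's single-index and double-index adaptive algorithms built on this distance are not reproduced, and the exact distance the authors use may differ.

```python
import math

def interval_wasserstein(a1, b1, a2, b2):
    """L2 Wasserstein distance between the intervals [a1, b1] and [a2, b2],
    each viewed as a uniform distribution."""
    m1, m2 = (a1 + b1) / 2, (a2 + b2) / 2          # midpoints
    r1, r2 = (b1 - a1) / 2, (b2 - a2) / 2          # half-widths
    return math.sqrt((m1 - m2) ** 2 + (r1 - r2) ** 2 / 3)
```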
19.
Yanchang Zhao, Chengqi Zhang. Journal of Systems and Software, 2011, 84(9): 1524-1539
We propose an enhanced grid-density based approach for clustering high dimensional data. Our technique takes objects (or points) as atomic units, so the size requirement on cells is waived without losing clustering accuracy. For efficiency, a new partitioning is developed to make the number of cells smoothly adjustable; a concept of ith-order neighbors is defined to avoid considering an exponential number of neighboring cells; and a novel density compensation is proposed to improve clustering accuracy and quality. We experimentally evaluate our approach and demonstrate that our algorithm significantly improves clustering accuracy and quality.
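The adjustable partitioning, ith-order neighbors, and density compensation of the enhanced method go beyond a short sketch; the fragment below shows only the basic grid-density step that such approaches build on, with all names and thresholds as assumptions.

```python
from collections import Counter

def dense_grid_cells(points, cell_size, min_points=4):
    """Assign points to hyper-grid cells and keep the cells whose point count
    reaches min_points; dense cells are the seeds that neighbouring-cell
    expansion would then grow into clusters."""
    cells = Counter(tuple(int(x // cell_size) for x in p) for p in points)
    return {cell for cell, count in cells.items() if count >= min_points}

# Usage sketch: dense = dense_grid_cells([(0.1, 0.2), (0.15, 0.22), (0.9, 0.8)], cell_size=0.25, min_points=2)
```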
20.
In recent years, there have been numerous attempts to extend the k-means clustering protocol for a single database to a distributed multiple-database setting while keeping the privacy of each data site. Current solutions for multiparty k-means clustering (whether two-party or more), built on one or more secure two-party computation algorithms, are not equally contributory; in other words, the parties do not contribute equally to the k-means clustering. This may enable a perfidious attack in which a party who learns the outcome before the other parties lies to them about it. In this paper, we present an equally contributory multiparty k-means clustering protocol for vertically partitioned data, in which each party contributes equally to the k-means clustering. Our protocol is built on ElGamal's encryption scheme, Jakobsson and Juels's plaintext equivalence test protocol, and mix networks, and it protects privacy in the sense that each iteration of k-means clustering can be performed without revealing the intermediate values.