首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 296 毫秒
1.
一种基于语料特性的聚类算法   总被引:3,自引:0,他引:3  
曾依灵  许洪波  吴高巍  白硕 《软件学报》2010,21(11):2802-2813
为寻求模型不匹配问题的一种恰当的解决途径,提出了基于语料分布特性的CADIC(clustering algorithm based on the distributions of intrinsic clusters)聚类算法。CADIC以重标度的形式隐式地将语料特性融入算法框架,从而使算法模型具备更灵活的适应能力。在聚类过程中,CADIC选择一组具有良好区分度的方向构建CADIC坐标系,在该坐标系下统计固有簇的分布特性,以构造各个坐标轴的重标度函数,并以重标度的形式对语料分布进行隐式的归一化,从而提高聚  相似文献   

2.
针对海量数据背景下K-means聚类结果不稳定和收敛速度较慢的问题,提出了基于MapReduce框架下的K-means改进算法。首先,为了能获得K-means聚类的初始簇数,利用凝聚层次聚类法对数据集进行聚类,并用轮廓系数对聚类结果进行初步评价,将获得数据集的簇数作为K-means算法的初始簇中心进行聚类;其次,为了能适应于海量数据的聚类挖掘,将改进的K-means算法部署在MapReduce框架上进行运算。实验结果表明,在单机性能上,该方法具有较高的准确率和召回率,同时也具有较强的聚类稳定性;在集群性能上,也具有较好的加速比和运行速度。  相似文献   

3.
BK-means:骨架初始解K-means   总被引:2,自引:0,他引:2       下载免费PDF全文
K-means是典型的启发式聚类算法,容易受到初始解的影响而无法获得高质量的聚类结果。骨架是近年来启发式算法设计的研究热点,它是指所有全局最优解中相同的部分,对于提高启发式算法性能具有重要意义。给出的骨架初始解K-means算法(BK-means)的基本思想是:首先利用K-means算法得到一组局部最优解(聚类结果),通过对局部最优解求交得到骨架簇。利用骨架簇构造骨架初始解及新的搜索空间。最后以骨架初始解引导K-means算法在新的搜索空间中搜索聚类结果。在15组仿真数据集和4组实际数据集上的实验结果表明,BK-means算法具有获得高内聚、高分离的聚类结果能力。  相似文献   

4.
K-means算法所使用的聚类准则函数是将数据集中各个簇的误差平方值直接相加而得到的,不能有效处理簇的密度不均且大小差异较大的数据集。为此,将K-means算法的聚类准则函数定义为加权的簇内标准差之和,权重为簇内数据对象数占总数目的比例。同时,调整了传统K-means算法将数据对象重新分配给簇的方法,采用一个数据对象到中心点的加权距离代替传统K-means算法中的距离,将数据对象分配给使加权距离最小的中心点所在的簇。实验结果表明,针对模拟数据集的聚类,改进K-means算法可以明显减少大而稀的簇中数据对象被错误地分配到相邻的小而密簇的可能性,改善了聚类的质量;针对UCI数据集的聚类,改进算法使得各个簇更为紧凑,从而验证了改进K-means算法的有效性。  相似文献   

5.
陈俊芬  张明  何强 《计算机科学》2018,45(Z11):474-479
基于图论理论的NJW谱聚类算法的核心思想是将数据点映射到特征空间后再利用K-means算法进行聚类,从而得到原始数据的聚类结果。NJW算法是K-means算法的推广,并且在任意形状的数据上都具有较好的聚类效果,从而有着广泛的应用。但是,类数C和高斯核函数中的尺度参数σ较大程度地影响着NJW的聚类性能;另外,K-means对随机初始值的敏感性也影响着NJW的聚类结果。为此,一种基于启发式确定类数的谱聚类算法(记为DP-NJW)被提出。该算法先根据数据的密度分布确定类中心点和类数,这些类中心点作为特征空间中K-means聚类的初始类中心,然后用NJW进行聚类。文中通过实验将DP-NJW算法和经典聚类算法在7个公共数据集上进行测试和对比,其中DP-NJW算法在5个数据集上的聚类精度高于NJW的平均聚类精度,在另2个数据集上二者持平。对比DPC算法,所提算法在5个数据集上也有不俗的聚类精度,而且DP-NJW的计算消耗较小,在较大的数据集aggregation上表现更为突出。实验结果表明,文中所提的DP-NJW算法更具优势。  相似文献   

6.
大多数集成聚类算法使用K-means算法生成基聚类,得到的基聚类效果不太理想.通常在使用共协矩阵对基聚类进行集成时,忽视了基聚类多样性的不同,平等地对待基聚类,且以样本为操作单元生成共协矩阵.当样本数目或集成规模较大时,计算负担显著增加.针对上述问题,提出超簇加权的集成聚类算法(ECWSC).该算法使用随机选点与K-means选点相结合来获取地标点,对地标点使用谱聚类算法得到其聚类结果,再将样本点映射到与之最近邻的地标点上生成基聚类.在此基础上,以信息熵为依据计算基聚类的不确定性,并对基聚类赋予相应权重,使用加权的方式得到加权超簇的共协矩阵,对共协矩阵使用层次聚类算法得到集成结果.选取7个真实数据集和4个人工数据集作为实验数据集,从准确度、鲁棒性和时间复杂度方面进行验证.对比实验结果表明,该算法能够有效提升集成聚类的性能.  相似文献   

7.
在推荐系统中应用K-means算法聚类可有效降维,然而聚类效果往往依赖于选定的初始中心,并且一旦选定目标簇后,推荐过程只针对目标簇进行,与其他簇无关。针对上述两个问题,提出一种基于满二叉树的二分K-means聚类并行推荐算法。该算法首先反复迭代二分K-means算法,迭代过程中使用簇内凝聚度作为分裂阈值,形成一颗满二叉树;然后通过层次遍历将用户归入到K个叶子节点(簇);最后针对K个簇,应用MapReduce框架进行并行推荐预测。MovieLens上的实验结果表明,该算法可大幅度提高推荐系统准确性,同时增强系统可扩展性。  相似文献   

8.
《传感器与微系统》2019,(1):152-154
针对传统聚类算法无法处理大规模数据的特点,结合增量算法和簇特征的思想,在初始聚类阶段,采用基于距离的K-means聚类算法获取相应簇的特征。根据簇特征,并结合K最近邻(KNN)的思想处理增量,提出了基于簇特征的增量聚类算法。提出的方法已经在加州大学尔湾分校(UCI)机器学习库中提供的真实数据集的帮助下得到验证。实验结果表明:提出的增量聚类方法的聚类精度较普通K-means算法和原始增量K-means算法有明显提高。  相似文献   

9.
肖升生  刘鹏 《计算机应用研究》2011,28(10):3665-3670
为了深入地探索聚类结果簇的形态特征,提出了一种基于维度映射的类圆簇识别算法。该算法将结果簇按维度进行映射,通过比较、分析簇在各个映射维度上的频数曲线及形态特征,自动将类圆簇从众多结构复杂的聚类结果簇中识别出来。算法经过大量实验验证,具有很好的识别能力和抗干扰能力,对于高维度数据集合也具有很强的扩展性。  相似文献   

10.
为提高金融业务数据集上的聚类质量和聚类效率,提出簇的直径、簇间的相似度这2个概念。利用距离尺度降维的中心距序降维法,将多维数据降至一维,在一维上利用自适应排序聚类算法ASC聚类。该算法和传统的Cobweb算法、K-means算法做对比,实验表明该方法能提高簇间相似度,最大提高200%。  相似文献   

11.
An Evolutionary Approach to Multiobjective Clustering   总被引:6,自引:0,他引:6  
The framework of multiobjective optimization is used to tackle the unsupervised learning problem, data clustering, following a formulation first proposed in the statistics literature. The conceptual advantages of the multiobjective formulation are discussed and an evolutionary approach to the problem is developed. The resulting algorithm, multiobjective clustering with automatic k-determination, is compared with a number of well-established single-objective clustering algorithms, a modern ensemble technique, and two methods of model selection. The experiments demonstrate that the conceptual advantages of multiobjective clustering translate into practical and scalable performance benefits  相似文献   

12.
13.
The problem is motivated by stochastic modeling of chemical reactions, but it is studied in a general framework. The fluctuations of a sequence of jump Markov processes around their deterministic limit are considered. It is shown that, if the deterministic limit has an equilibrium point which is at least k-asymptotically stable, the sequence of fluctuations (suitably rescaled) converges to a diffusion process. Moreover, for a particular choice of the rescaling factor, the limit diffusion admits a unique stationary distribution.  相似文献   

14.
联邦学习(federated learning)可以解决分布式机器学习中基于隐私保护的数据碎片化和数据隔离问题。在联邦学习系统中,各参与者节点合作训练模型,利用本地数据训练局部模型,并将训练好的局部模型上传到服务器节点进行聚合。在真实的应用环境中,各节点之间的数据分布往往具有很大差异,导致联邦学习模型精确度较低。为了解决非独立同分布数据对模型精确度的影响,利用不同节点之间数据分布的相似性,提出了一个聚类联邦学习框架。在Synthetic、CIFAR-10和FEMNIST标准数据集上进行了广泛实验。与其他联邦学习方法相比,基于数据分布的聚类联邦学习对模型的准确率有较大提升,且所需的计算量也更少。  相似文献   

15.
One of the central problems in cognitive science is determining the mental representations that underlie human inferences. Solutions to this problem often rely on the analysis of subjective similarity judgments, on the assumption that recognizing likenesses between people, objects, and events is crucial to everyday inference. One such solution is provided by the additive clustering model, which is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. Existing approaches for implementing additive clustering often lack a complete framework for statistical inference, particularly with respect to choosing the number of features. To address these problems, this article develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features and their importance.  相似文献   

16.
基于集成聚类的流量分类架构   总被引:1,自引:0,他引:1  
鲁刚  余翔湛  张宏莉  郭荣华 《软件学报》2016,27(11):2870-2883
流量分类是优化网络服务质量的基础与关键.机器学习算法利用数据流统计特征分类流量,对于识别加密私有协议流量具有重要意义.然而,特征偏置和类别不平衡是基于机器学习的流量分类研究所面临的两大挑战.特征偏置是指一些数据流统计特征在提高部分应用识别准确率的同时也降低了另外一部分应用识别的准确率.类别不平衡是指机器学习流量分类器对样本数较少的应用识别的准确率较低.为解决上述问题,提出了基于集成聚类的流量分类架构(traffic classification framework based on ensemble clustering,简称TCFEC).TCFEC由多个基于不同特征子空间聚类的基分类器和一个最优决策部件构成,能够提高流量分类的准确率.具体而言,与传统的机器学习流量分类器相比,TCFEC的平均流准确率最高提升5%,字节准确率最高提升6%.  相似文献   

17.
Mixtures of factor analyzers have been receiving wide interest in statistics as a tool for performing clustering and dimension reduction simultaneously. In this model it is assumed that, within each component, the data are generated according to a factor model. Therefore, the number of parameters on which the covariance matrices depend is reduced. Several estimation methods have been proposed for this model, both in the classical and in the Bayesian framework. However, so far, a direct maximum likelihood procedure has not been developed. This direct estimation problem, which simultaneously allows one to derive the information matrix for the mixtures of factor analyzers, is solved. The effectiveness of the proposed procedure is shown on a simulation study and on a toy example.  相似文献   

18.
We propose a graph model for mutual information based clustering problem. This problem was originally formulated as a constrained optimization problem with respect to the conditional probability distribution of clusters. Based on the stationary distribution induced from the problem setting, we propose a function which measures the relevance among data objects under the problem setting. This function is utilized to capture the relation among data objects, and the entire objects are represented as an edge-weighted graph where pairs of objects are connected with edges with their relevance. We show that, in hard assignment, the clustering problem can be approximated as a combinatorial problem over the proposed graph model when data is uniformly distributed. By representing the data objects as a graph based on our graph model, various graph based algorithms can be utilized to solve the clustering problem over the graph. The proposed approach is evaluated on the text clustering problem over 20 Newsgroup and TREC datasets. The results are encouraging and indicate the effectiveness of our approach.  相似文献   

19.
Clustering based on matrix approximation: a unifying view   总被引:8,自引:7,他引:1  
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. Recently, a number of methods have been proposed and demonstrated good performance based on matrix approximation. Despite significant research on these methods, few attempts have been made to establish the connections between them while highlighting their differences. In this paper, we present a unified view of these methods within a general clustering framework where the problem of clustering is formulated as matrix approximations and the clustering objective is minimizing the approximation error between the original data matrix and the reconstructed matrix based on the cluster structures. The general framework provides an elegant base to compare and understand various clustering methods. We provide characterizations of different clustering methods within the general framework including traditional one-side clustering, subspace clustering and two-side clustering. We also establish the connections between our general clustering framework with existing frameworks.
Tao LiEmail:
  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号