Similar Documents
20 similar documents found (search time: 46 ms)
1.
A dynamically weighted fuzzy kernel clustering algorithm   Cited by: 2 (self: 0, others: 2)
To overcome the influence of noisy feature vectors on clustering, and to fully account for the different contributions each feature vector makes to the result, the data to be clustered are mapped into a high-dimensional space with a Mercer kernel, and a new dynamically weighted fuzzy kernel clustering algorithm is proposed. Dynamic weighting automatically attenuates the role of noisy feature vectors in the partition. Without any prior information about the data, the algorithm not only partitions linearly separable data accurately but also achieves nonlinear partitions of non-globular data. Classification results on simulated and real data show that noise in the data has little effect on the outcome and that the algorithm is highly practical.
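As a rough illustration of the idea (not the paper's exact formulation, which additionally works in a Mercer-kernel feature space), the sketch below runs fuzzy c-means with per-feature weights that are re-estimated each iteration so that high-spread noise features lose influence; the seeding scheme and the weight-update rule are illustrative assumptions:

```python
import numpy as np

def weighted_fcm(X, c=2, m=2.0, iters=40):
    """Fuzzy c-means with dynamic per-feature weights (illustrative sketch)."""
    n, d = X.shape
    sqd = ((X[:, None] - X[None]) ** 2).sum(axis=2)
    # deterministic seeding: memberships from distances to mutually far points
    seeds = [0]
    for _ in range(c - 1):
        seeds.append(int(sqd[seeds].min(axis=0).argmax()))
    U = 1.0 / (sqd[seeds] + 1e-6)
    U /= U.sum(axis=0)
    w = np.ones(d) / d                                # feature weights
    for _ in range(iters):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # cluster centres
        # weighted squared distances to each centre
        D = np.array([((X - v) ** 2 * w).sum(axis=1) for v in V]) + 1e-12
        U = D ** (-1.0 / (m - 1))                     # standard FCM update
        U /= U.sum(axis=0)
        # dynamic re-weighting: features with large within-cluster spread
        # (e.g. pure noise dimensions) receive small weights
        S = sum((Um[j][:, None] * (X - V[j]) ** 2).sum(axis=0) for j in range(c))
        w = 1.0 / (S + 1e-12)
        w /= w.sum()
    return U.argmax(axis=0), w
```

On data with two informative dimensions and one noise dimension, the learned weight of the noise dimension shrinks toward zero while the partition stays correct.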

2.
The formation principle of the kernel region in support vector clustering is analyzed in depth, clarifying the distinctive role the kernel region plays when support vector clustering handles overlapping data. Since video data contain large amounts of overlapping content, a shot clustering algorithm based on support vectors is proposed. Using color and time as feature vectors, it computes the clustering kernel regions in feature space and then produces shot clusters, overcoming the drawbacks of traditional shot clustering algorithms, such as heavy computation and judging shot similarity by a time threshold alone.

3.
A dynamically weighted hybrid C-means fuzzy kernel clustering algorithm*   Cited by: 2 (self: 1, others: 1)
The PCM algorithm suffers from coincident clusters. The PFCM algorithm uses both memberships and typicality values to assign samples to classes, improving noise resistance, but its results on unevenly distributed samples are still unsatisfactory. To address this shortcoming, a Mercer kernel can map the original data space into a feature space, and each vector in the feature space is assigned a dynamic weight, yielding an objective function defined over the feature space. Theoretical analysis and experimental results show that, compared with other classical fuzzy clustering algorithms, the new algorithm is more robust and clusters better.

4.
Scattered point cloud simplification via adaptive K-means clustering   Cited by: 1 (self: 0, others: 1)
Objective: Point cloud simplification is an important prerequisite for point cloud processing such as surface reconstruction. Existing simplification algorithms for scattered point clouds suffer from large distortion, holes, and inapplicability to sheet-like (open) point clouds; an adaptive K-means clustering simplification algorithm is therefore proposed. Method: First, for each data point the k-neighborhood is used to compute the curvature, the mean angle between the point normal and its neighbors' normals, the distance to the neighborhood centroid, and the mean distance to neighboring points; a multi-criterion feature extraction scheme built on these quantities identifies and retains feature points, including sharp surface points and boundary points. Next, an adaptive octree is built over the point cloud, providing K-means with initial cluster centers and a K value that reflect the density distribution of the cloud. Finally, every cluster is traversed: feature points found in a cluster are removed and the cluster center updated; the maximum curvature difference within each updated cluster is computed, clusters whose maximum curvature difference exceeds a set threshold are subdivided, and the data point closest to each final cluster center is retained. Results: In clustering, on the bunny point cloud the adaptive K-means algorithm outperformed conventional K-means in iteration count, objective-function value, and runtime. In simplification, applied to both closed and sheet-like point clouds at a 1/5 simplification ratio, the simplification errors for the fandisk and saddle models were 0.29×10⁻³, -0.41×10⁻³ and 0.037, -0.094 respectively; for the sheet-like saddle model the boundary shrinkage error was 0.030805, in all cases smaller than the grid and curvature methods. Conclusion: The proposed simplification algorithm applies to both closed and sheet-like point clouds, produces uniformly distributed points without holes, and preserves the model's boundary points when simplifying sheet-like clouds.
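A minimal sketch of the screening-plus-clustering idea, assuming local surface variation (the smallest local-PCA eigenvalue share) as the curvature proxy; the paper's multi-criterion feature test, adaptive octree seeding, and curvature-driven cluster subdivision are omitted, and the threshold and cluster count here are illustrative:

```python
import numpy as np

def surface_variation(P, k=10):
    """Per-point curvature proxy: share of the smallest local-PCA eigenvalue."""
    D = np.linalg.norm(P[:, None] - P[None], axis=2)
    idx = np.argsort(D, axis=1)[:, :k]
    sv = np.empty(len(P))
    for i, nb in enumerate(idx):
        lam = np.sort(np.linalg.eigvalsh(np.cov(P[nb].T)))
        sv[i] = lam[0] / max(lam.sum(), 1e-12)
    return sv

def simplify(P, keep_thresh=0.05, n_clusters=8, iters=25, seed=0):
    rng = np.random.default_rng(seed)
    sv = surface_variation(P)
    feat = sv > keep_thresh                  # sharp/edge points kept outright
    Q = P[~feat]
    C = Q[rng.choice(len(Q), n_clusters, replace=False)]
    for _ in range(iters):                   # plain k-means on the rest
        lab = np.argmin(((Q[:, None] - C) ** 2).sum(axis=2), axis=1)
        for j in range(n_clusters):
            if np.any(lab == j):
                C[j] = Q[lab == j].mean(axis=0)
    # keep one representative per cluster: the point nearest its centre
    reps = [Q[lab == j][((Q[lab == j] - C[j]) ** 2).sum(axis=1).argmin()]
            for j in range(n_clusters) if np.any(lab == j)]
    return np.vstack([P[feat], np.array(reps)])
```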

5.
A novel kernel method for clustering   Cited by: 10 (self: 0, others: 10)
Kernel methods are algorithms that, by replacing the inner product with an appropriate positive definite function, implicitly perform a nonlinear mapping of the input data into a high-dimensional feature space. In this paper, we present a kernel method for clustering inspired by the classical k-means algorithm in which each cluster is iteratively refined using a one-class support vector machine. Our method, which can be easily implemented, compares favorably with respect to popular clustering algorithms, like k-means, neural gas, and self-organizing maps, on a synthetic data set and three UCI real data benchmarks (IRIS data, Wisconsin breast cancer database, Spam database).
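The implicit feature-space clustering the abstract describes can be illustrated with plain kernel k-means under an RBF kernel; note this substitutes simple feature-space mean prototypes for the paper's one-class-SVM refinement, and the farthest-point seeding is an assumption added for determinism:

```python
import numpy as np

def kernel_kmeans(X, k=2, gamma=1.0, iters=30):
    """Kernel k-means with an RBF kernel (illustrative sketch)."""
    sq = ((X[:, None] - X[None]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq)
    # deterministic farthest-point seeding, then assign to the nearest seed
    seeds = [0]
    for _ in range(k - 1):
        seeds.append(int(sq[seeds].min(axis=0).argmax()))
    lab = sq[seeds].argmin(axis=0)
    for _ in range(iters):
        D = np.empty((len(X), k))
        for j in range(k):
            idx = np.where(lab == j)[0]
            if len(idx) == 0:
                D[:, j] = np.inf
                continue
            # ||phi(x_i) - mean_j||^2, dropping the constant K_ii term
            D[:, j] = -2 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        new = D.argmin(axis=1)
        if np.array_equal(new, lab):
            break
        lab = new
    return lab
```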

6.
7.
In this paper, a novel clustering method in the kernel space is proposed. It effectively integrates several existing algorithms into an iterative clustering scheme, which can handle clusters with arbitrary shapes. In our proposed approach, a reasonable initial core for each of the clusters is estimated. This allows us to adopt a cluster growing technique, and the growing cores offer partial hints on the cluster association. Consequently, methods used for classification, such as support vector machines (SVMs), can be useful in our approach. To obtain initial clusters effectively, the notion of the incomplete Cholesky decomposition is adopted so that fuzzy c-means (FCM) can be used to partition the data in a kernel-defined feature space. Then one-class and multiclass soft-margin SVMs are adopted to detect the data within the main distributions (the cores) of the clusters and to repartition the data into new clusters iteratively. The structure of the data set is explored by pruning the data in the low-density regions of the clusters. Then data are gradually added back to the main distributions to assure exact cluster boundaries. Unlike the ordinary SVM algorithm, whose performance relies heavily on the kernel parameters given by the user, the parameters are estimated from the data set naturally in our approach. The experimental evaluations on two synthetic data sets and four University of California Irvine real data benchmarks indicate that the proposed algorithms outperform several popular clustering algorithms, such as FCM, support vector clustering (SVC), hierarchical clustering (HC), self-organizing maps (SOM), and non-Euclidean norm fuzzy c-means (NEFCM). © 2009 Wiley Periodicals, Inc.

8.
A multi-level document clustering algorithm based on association rules   Cited by: 3 (self: 0, others: 3)
A new multi-level document clustering algorithm based on association rules is proposed. It uses a new document feature extraction method to build topic and keyword feature vectors for each document. Documents are first clustered initially in the topic feature space using a fast frequent-itemset algorithm; the initial document clusters are then refined in a new feature space based on topic keywords, using inter-cluster distance and connectivity, to obtain the final clusters. The two-level clustering greatly improves both efficiency and accuracy, and the new feature extraction method also resolves the excessive dimensionality of document feature vectors caused by documents having too many keywords.

9.
The classical fuzzy C-means algorithm is sensitive to noisy data, ignores the imbalance among sample attributes, and clusters high-dimensional data poorly. Possibilistic clustering solves the noise-sensitivity and coincident-cluster problems, but assumes every sample contributes equally to the clustering. To address these issues, a sample- and feature-weighted possibilistic fuzzy kernel clustering algorithm is proposed. Possibilistic clustering is incorporated into fuzzy clustering to improve robustness against noise and outliers; meanwhile, importance weights of each attribute for each class and of each sample for the clustering are computed dynamically from the specific characteristics of each class, the kernel parameter is selected by optimization, and the kernel function is continually corrected so that data sets that are not linearly separable in the original space are mapped to separable sets in a high-dimensional space. Experimental results show that the sample-feature weighted fuzzy clustering algorithm reduces the influence of noisy data and outliers and achieves better clustering accuracy than traditional clustering algorithms.

10.
杜航原, 张晶, 王文剑. 《智能系统学报》, 2020, 15(6): 1113-1120
To address the design of the consensus function in clustering ensembles, a deep self-supervised clustering ensemble algorithm is proposed. It first computes a sample similarity matrix from the base clustering results using the weighted connected-triple algorithm; with the similarity matrix expressing adjacency relations, the base clusterings are transformed from a feature-space data representation into a graph representation, so that the consensus problem over the base clusterings becomes a graph clustering problem on that representation. On this basis, a self-supervised clustering-ensemble model is built with graph neural networks: on one hand, a graph autoencoder learns a low-dimensional embedding of the graph, and the target distribution of the clustering ensemble is estimated from the likelihood distribution of the embedding; on the other hand, the ensemble objective guides the embedding process, ensuring that the learned graph embedding and the ensemble result are jointly optimal. Simulation experiments on a large number of datasets show that the algorithm further improves the accuracy of clustering ensemble results compared with HGPA, CSPA, and MCLA.
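The first stage, turning base partitions into a pairwise similarity matrix, can be sketched with a plain co-association matrix; the paper's weighted connected-triple measure is more refined, so this is only a simplified stand-in:

```python
import numpy as np

def co_association(partitions):
    """Fraction of base clusterings that place each pair of samples together."""
    P = np.asarray(partitions)           # shape: (n_clusterings, n_samples)
    m, n = P.shape
    S = np.zeros((n, n))
    for lab in P:
        S += (lab[:, None] == lab[None, :])  # 1 where the pair share a cluster
    return S / m
```

The resulting matrix can then serve as the adjacency structure of the graph representation described above.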

11.
王一宾, 李田力, 程玉胜. 《智能系统学报》, 2019, 14(5): 966-973
Label distribution learning is a new learning paradigm. Most existing algorithms build parametric models directly from conditional probabilities without fully considering the correlations among samples, which increases computational complexity. Spectral clustering is therefore introduced: using similarity relations among samples, the clustering problem is converted into a globally optimal graph partitioning problem, leading to a label distribution learning algorithm combined with spectral clustering (label distribution learning with spectral clustering, SC-LDL). First, the sample similarity matrix is computed; then a Laplacian transform of the matrix constructs the eigenvector space; finally, the K-means algorithm clusters the data and a parametric model is built to predict the label distribution of unknown samples. Experiments on multiple datasets show the algorithm outperforms several comparison algorithms, and statistical hypothesis tests further confirm its effectiveness and superiority.
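The spectral pipeline the abstract outlines (similarity matrix → Laplacian → eigenvector space → K-means) can be sketched as follows; the RBF similarity, the symmetric normalisation, and the seeding are illustrative choices, and the label distribution model built on top of the clusters is omitted:

```python
import numpy as np

def spectral_clustering(X, k=2, gamma=1.0):
    """Similarity matrix -> normalised Laplacian -> eigenvectors -> k-means."""
    sq = ((X[:, None] - X[None]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(X)) - Dinv @ W @ Dinv       # symmetric normalised Laplacian
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                            # k smallest-eigenvalue vectors
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    # simple k-means in the embedded space with farthest-point seeding
    seeds = [0]
    for _ in range(k - 1):
        dist = ((U[:, None] - U[seeds]) ** 2).sum(axis=2).min(axis=1)
        seeds.append(int(dist.argmax()))
    C = U[seeds].copy()
    for _ in range(30):
        lab = ((U[:, None] - C) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(lab == j):
                C[j] = U[lab == j].mean(axis=0)
    return lab
```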

12.
This article studies and implements traffic flow prediction, a problem in the traffic data mining field, proposing algorithms for applying data mining techniques to feature selection on traffic flow data and to building the traffic flow prediction model. After the sampled data are cleaned, classification-and-regression decision trees serve as base learners and gradient-boosted decision trees perform the regression fit, yielding feature importance scores for the traffic data that serve as the basis for adaptive feature selection. Next, a clustering algorithm groups the selected feature data, reducing the sample size while making data within each cluster more alike. Finally, with real-time data matched to the corresponding cluster as the training set, a support vector machine whose parameters have been optimized by the artificial fish swarm algorithm performs the traffic flow prediction. The article closes with experiments that validate the proposed algorithms and model.
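A hedged sketch of the importance-driven feature selection step: here a ridge regression with permutation importance stands in for the gradient-boosted decision trees of the article, so the function and its scores are only illustrative of the mechanism (shuffle one feature, measure how much the fit degrades):

```python
import numpy as np

def permutation_importance(X, y, n_repeats=10, seed=0):
    """Feature importance for a ridge fit (a stand-in for GBDT importance)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = np.c_[X, np.ones(n)]                       # affine design matrix
    w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(d + 1), A.T @ y)
    base = ((A @ w - y) ** 2).mean()               # baseline MSE
    imp = np.zeros(d)
    for j in range(d):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # destroy feature j
            Ap = np.c_[Xp, np.ones(n)]
            imp[j] += ((Ap @ w - y) ** 2).mean() - base
    return imp / n_repeats
```

Features whose permutation barely changes the error would be dropped before the clustering stage.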

13.
To overcome the sensitivity of the k-means clustering algorithm to the spatial distribution of the data, a support vector clustering algorithm is proposed that combines a linear-programming one-class SVM with the K-means method: every iteration runs the one-class SVM under linear programming. The algorithm is simple to implement and, compared with SVM clustering under quadratic programming, greatly reduces computational complexity while preserving good clustering quality. Simulation comparisons against the K-means and self-organizing-map clustering algorithms on artificial and real data demonstrate its effectiveness and feasibility.

14.
An algorithm for optimizing data clustering in feature space is studied in this work. Using the graph Laplacian and an extreme learning machine (ELM) mapping technique, we develop an optimal weight matrix W for feature mapping. This work explicitly maps the original data into an optimal feature space for clustering, which further increases the separability of the original data while keeping points of the same cluster closely grouped. Our method, which can be easily implemented, obtains better clustering results than several popular clustering algorithms, such as k-means on the original data, kernel clustering, spectral clustering, and ELM k-means, on three UCI real data benchmarks (IRIS data, Wisconsin breast cancer database, and Wine database).
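The ELM-style mapping followed by k-means can be sketched as below; a random hidden layer replaces the optimal Laplacian-derived weight matrix W of the paper, so this shows only the general mechanics of clustering in the mapped space:

```python
import numpy as np

def elm_kmeans(X, k=2, n_hidden=50, iters=30, seed=0):
    """Random ELM-style feature map followed by k-means (illustrative)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))    # random input weights
    b = rng.normal(size=n_hidden)                  # random biases
    H = np.tanh(X @ W + b)                         # hidden-layer embedding
    # farthest-point seeding, then Lloyd iterations in the embedded space
    sq = ((H[:, None] - H[None]) ** 2).sum(axis=2)
    seeds = [0]
    for _ in range(k - 1):
        seeds.append(int(sq[seeds].min(axis=0).argmax()))
    C = H[seeds].copy()
    for _ in range(iters):
        lab = ((H[:, None] - C) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(lab == j):
                C[j] = H[lab == j].mean(axis=0)
    return lab
```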

15.
Kernel-based fuzzy K-means clustering of nonconvex data   Cited by: 4 (self: 4, others: 0)
The fuzzy K-means clustering algorithm is combined with a kernel function, so that clustering is carried out with a kernel-based fuzzy K-means algorithm. The kernel function implicitly defines a nonlinear transform that maps the data into a high-dimensional feature space to increase their separability. The algorithm thus resolves the inability of fuzzy K-means to cluster nonconvex data correctly.
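A minimal kernelised fuzzy K-means sketch, with feature-space distances to the fuzzy means obtained through the kernel trick; the RBF kernel and the deterministic seeding are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

def kernel_fcm(X, c=2, m=2.0, gamma=1.0, iters=40):
    """Fuzzy c-means carried out in an RBF-kernel feature space."""
    sq = ((X[:, None] - X[None]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq)
    # deterministic seeding: memberships from kernel similarity to far points
    seeds = [0]
    for _ in range(c - 1):
        seeds.append(int(sq[seeds].min(axis=0).argmax()))
    U = K[seeds] + 1e-6
    U /= U.sum(axis=0)
    for _ in range(iters):
        Um = U ** m
        D = np.empty_like(U)
        for j in range(c):
            a = Um[j] / Um[j].sum()
            # ||phi(x_i) - m_j||^2 via the kernel trick (K_ii = 1 for RBF)
            D[j] = 1.0 - 2.0 * K @ a + a @ K @ a
        D = np.maximum(D, 1e-12)
        U = D ** (-1.0 / (m - 1))
        U /= U.sum(axis=0)
    return U.argmax(axis=0)
```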

16.
The computer-assisted analysis of biomedical records has become an essential tool in clinical settings. However, current devices provide a growing amount of data that often exceeds the processing capacity of normal computers. As this amount of information rises, new demands for more efficient data extracting methods appear. This paper addresses the task of data mining in physiological records using a feature selection scheme. An unsupervised method based on relevance analysis is described. This scheme uses a least-squares optimization of the input feature matrix in a single iteration. The output of the algorithm is a feature weighting vector. The performance of the method was assessed using a heartbeat clustering test on real ECG records. The quantitative cluster validity measures yielded a correctly classified heartbeat rate of 98.69% (specificity), 85.88% (sensitivity) and 95.04% (general clustering performance), which is even higher than the performance achieved by other similar ECG clustering studies. The number of features was reduced on average from 100 to 18, and the temporal cost was 43% lower than in previous ECG clustering schemes.

17.
This paper is concerned with the identification of a class of piecewise affine systems called piecewise affine autoregressive exogenous (PWARX) models. The PWARX model is composed of ARX sub-models, each of which corresponds to a polyhedral region of the regression space. Under the temporary assumption that the number of sub-models is known a priori, the input-output data are collected into several clusters by using a statistical clustering algorithm. We utilize support vector classifiers to estimate the boundary hyperplane between two adjacent regions in the regression space. In each cluster, the parameter vector of the sub-model is obtained by the least squares method. It turns out that the present statistical clustering approach enables us to estimate the number of sub-models based on information criteria such as CAIC and MDL. The number of sub-models is estimated by applying the identification procedure several times to the same data set, after fixing the number of sub-models to different values. Finally, we verify the applicability of the present identification method through a numerical example of a Hammerstein model.
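The overall identification loop can be sketched as below, assuming plain k-means in the joint regressor-output space for the statistical clustering step and omitting both the SVM boundary estimation and the information-criterion selection of the sub-model count:

```python
import numpy as np

def pwarx_fit(x, y, k=2, iters=20):
    """Cluster the regression data, then fit one affine model per cluster
    by least squares (SVM region boundaries are omitted in this sketch)."""
    Z = np.c_[x, y]                            # joint regressor/output space
    sq = ((Z[:, None] - Z[None]) ** 2).sum(axis=2)
    seeds = [0]                                # deterministic far-point seeds
    for _ in range(k - 1):
        seeds.append(int(sq[seeds].min(axis=0).argmax()))
    C = Z[seeds].copy()
    for _ in range(iters):                     # plain k-means
        lab = ((Z[:, None] - C) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(lab == j):
                C[j] = Z[lab == j].mean(axis=0)
    models = []
    for j in range(k):                         # per-cluster least squares
        A = np.c_[x[lab == j], np.ones(int((lab == j).sum()))]
        theta, *_ = np.linalg.lstsq(A, y[lab == j], rcond=None)
        models.append(theta)                   # (slope, offset) per sub-model
    return lab, models
```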

18.
This paper studies supervised clustering in the context of label ranking data. The goal is to partition the feature space into K clusters, such that they are compact in both the feature and label ranking space. This type of clustering has many potential applications. For example, in target marketing we might want to come up with K different offers or marketing strategies for our target audience. Thus, we aim at clustering the customers’ feature space into K clusters by leveraging the revealed or stated, potentially incomplete customer preferences over products, such that the preferences of customers within one cluster are more similar to each other than to those of customers in other clusters. We establish several baseline algorithms and propose two principled algorithms for supervised clustering. In the first baseline, the clusters are created in an unsupervised manner, followed by assigning a representative label ranking to each cluster. In the second baseline, the label ranking space is clustered first, followed by partitioning the feature space based on the central rankings. In the third baseline, clustering is applied on a new feature space consisting of both features and label rankings, followed by mapping back to the original feature and ranking space. The RankTree principled approach is based on a Ranking Tree algorithm previously proposed for label ranking prediction. Our modification starts with K random label rankings and iteratively splits the feature space to minimize the ranking loss, followed by re-calculation of the K rankings based on cluster assignments. The MM-PL approach is a multi-prototype supervised clustering algorithm based on the Plackett-Luce (PL) probabilistic ranking model. It represents each cluster with a union of Voronoi cells that are defined by a set of prototypes, and assigns to each cluster a set of PL label scores that determine the cluster's central ranking.
Cluster membership and ranking prediction for a new instance are determined by the cluster membership of its nearest prototype. The unknown cluster PL parameters and prototype positions are learned by minimizing the ranking loss, based on two variants of the expectation-maximization algorithm. Evaluation of the proposed algorithms was conducted on synthetic and real-life label ranking data by considering several measures of cluster goodness: (1) cluster compactness in feature space, (2) cluster compactness in label ranking space and (3) label ranking prediction loss. Experimental results demonstrate that the proposed MM-PL and RankTree models are superior to the baseline models. Further, MM-PL has been shown to be much better than other algorithms at handling situations with a significant fraction of missing label preferences.

19.
Many effective methods have been proposed for clustering high-dimensional data, and the cascaded subspace clustering algorithm CSC is one effective solution. However, the clustering loss that CSC defines may distort the feature space, yielding non-representative, meaningless features and thereby harming clustering performance. To solve this problem, an improved algorithm is proposed that preserves the data structure with an autoencoder. Specifically, the clustering loss serves as a guide to disperse the data points in feature space, while an undercomplete autoencoder supplies a reconstruction loss that constrains the operation and maintains the local structure of the data-generating distribution. Combining the two jointly optimizes the assignment of cluster labels and learns locality-preserving features suited to clustering. The model parameters are tuned with two optimization methods, adaptive moment estimation (Adam) and mini-batch stochastic gradient descent (mini-batch SGD). On several datasets, three evaluation metrics, clustering accuracy (Acc), normalized mutual information (NMI), and adjusted Rand index (ARI), verify the effectiveness and superiority of the algorithm.

20.
In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in applying genetic algorithms (GAs) to document clustering is the feature space of thousands or even tens of thousands of dimensions that is typical for textual data, because the most straightforward and popular representation is the vector space model (VSM), in which each unique term in the vocabulary is one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval that explores the latent semantics implied by a query or a document by representing them in a dimension-reduced space. Meanwhile, LSI takes into account the effects of synonymy and polysemy, constructing a semantic structure in the textual data. GAs belong to the search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable-string-length genetic algorithm that automatically evolves the proper number of clusters as well as providing near-optimal clustering of the data set. The GA can be used in conjunction with the reduced latent semantic structure to improve clustering efficiency and accuracy. The superiority of the GAL approach over a conventional GA applied in the VSM model is demonstrated by good clustering results on Reuters documents.


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号