Similar literature
1.
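武森 (Wu Sen), 姜丹丹 (Jiang Dandan), 王蔷 (Wang Qiang). 《工程科学学报》 (Chinese Journal of Engineering), 2016, 38(7): 1017-1024
CABOSFV_C is an efficient clustering algorithm for high-dimensional data with categorical attributes. It measures distance with the set sparse dissimilarity and compresses the data with sparse feature vectors. Its clustering quality depends on the upper-limit parameter of the set sparse dissimilarity, yet there is no clear guidance for choosing that parameter. To address this, a heuristic hierarchical clustering algorithm of categorical data based on sparse feature dissimilarity (HABOS) is proposed. Following the idea of agglomerative hierarchical clustering, and constrained by an upper limit on the number of clusters, the method uses a new internal clustering validation index based on sparse feature dissimilarity (CVISFD) as a heuristic measure and thereby selects the clustering level automatically. Experiments on UCI benchmark data sets show that HABOS effectively improves clustering accuracy and stability.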

2.
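To address the "uniform effect" that the classical K-means algorithm produces when clustering imbalanced data, a clustering algorithm for imbalanced data based on nearest neighbor (CABON) is proposed. CABON first performs an initial clustering of the data objects, then uses a defined undetermined-class set to collect the objects whose cluster membership in the initial result needs further checking. A dynamic adjustment mechanism for this set is given: using the nearest-neighbor idea, the objects in the set are reassigned, in order from the edge of the set towards its center, to the cluster of their nearest neighbor. This yields the final clustering and avoids the influence of the "uniform effect". The algorithm is compared with K-means, the imbalanced K-means clustering method with multiple centers (MC_IK), and coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real data sets. The results show that CABON effectively reduces the "uniform effect" of K-means on imbalanced data and clusters clearly better than K-means, MC_IK and CVCN.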

3.
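武森 (Wu Sen), 冯小东 (Feng Xiaodong), 杨杰 (Yang Jie), 张晓楠 (Zhang Xiaonan). 《工程科学学报》 (Chinese Journal of Engineering), 2014, 36(10): 1411-1419
Building fast and effective clustering methods for large-scale text data is a current focus of data mining research and applications. To preserve clustering quality while improving efficiency, a text clustering algorithm based on searching for "mutually minimum-similarity text pairs" is proposed, together with a distributed parallel computing model. First, a text similarity measure is defined on the vector space model. Second, the two centers of each bisection are chosen by searching for a mutually minimum-similarity text pair, giving a bisecting K-means algorithm that optimizes the cluster centroids with a single partition. Finally, a parallel clustering model for large-scale text in cloud computing applications is designed on the MapReduce framework. Experiments with real text data on the Hadoop platform show that the proposed algorithm achieves clustering quality comparable to the original bisecting K-means with a clear efficiency advantage, and that the parallel model scales well across data sizes and numbers of compute nodes.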
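The seed-selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes cosine similarity on plain term-weight vectors, takes the globally least similar pair as the two seeds (such a pair is necessarily mutual), and assigns every document to the more similar seed in one pass. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def split_by_least_similar_pair(vectors):
    """Bisect a collection once, seeding with its least similar pair (illustrative)."""
    if len(vectors) < 2:
        raise ValueError("need at least two vectors to split")
    # The globally least similar pair: each member is the other's least similar item.
    seed_a, seed_b = min(
        combinations(range(len(vectors)), 2),
        key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
    )
    left, right = [], []
    for idx, vec in enumerate(vectors):
        # Single pass: each document joins the seed it resembles more.
        if cosine(vec, vectors[seed_a]) >= cosine(vec, vectors[seed_b]):
            left.append(idx)
        else:
            right.append(idx)
    return (seed_a, seed_b), left, right


docs = [(3, 0, 0), (2, 1, 0), (0, 0, 4), (0, 1, 3)]
seeds, left, right = split_by_least_similar_pair(docs)
```

The exhaustive pair search is quadratic in the number of documents, so this form is only suitable for small illustrations; the abstract's MapReduce model addresses scale, which the sketch does not attempt.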

4.
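Time series data are high-dimensional and dynamic, which makes them hard for traditional data mining techniques to handle effectively. A similarity dynamical clustering algorithm based on multidimensional shape features for time series (SDCTS) is therefore proposed. First, feature points of the multidimensional time series are extracted to reduce dimensionality. Then a new similarity measure for time series is defined from the shape features of slope, length and amplitude change, and on this basis a dynamic clustering algorithm for multidimensional time series is proposed that does not require the number of clusters to be specified in advance. Experiments show that, compared with other algorithms, this algorithm clusters time series well.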

5.
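To raise the proportion of valid parameter data in metal rolling prediction models and so keep the predicted values accurate, a data denoising method based on an improved fuzzy C-means clustering algorithm is proposed. The cluster centers are initialized by density partitioning, and an FCMD clustering algorithm is designed whose similarity index combines the Euclidean distance with the cosine of the included angle, so that the data needed for parameter prediction are clustered precisely. A dual-scale noise discrimination method is then designed to remove noise. Applied to a real strip shape prediction model, the FCMD-based denoising algorithm effectively raises the signal-to-noise ratio of the sample data, lowers the root-mean-square error, and improves the accuracy of rolling parameter prediction.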
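The abstract does not state how the Euclidean distance and the cosine are combined, so the sketch below is only one plausible form, not the FCMD index itself. It assumes a weighted sum in which the distance is first mapped to a similarity by 1 / (1 + d); the weight `alpha` and both function names are hypothetical.

```python
from math import dist, sqrt


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(u, v, alpha=0.5):
    """Blend a distance-based and an angle-based similarity (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Euclidean distance mapped into (0, 1]: identical points give 1.
    distance_term = 1.0 / (1.0 + dist(u, v))
    return alpha * distance_term + (1.0 - alpha) * cosine(u, v)


same = combined_similarity((1.0, 2.0), (1.0, 2.0))
orthogonal = combined_similarity((1.0, 0.0), (0.0, 1.0))
```

The distance term reacts to differences in magnitude and the cosine term to differences in direction, which is the usual motivation for pairing the two.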

6.
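A clustering algorithm based on set dissimilarity is proposed. Using a defined set dissimilarity and a reduced set representation, the algorithm computes the overall degree of difference among all objects in a set directly, without computing pairwise distances between objects. It compresses high-dimensional categorical data heavily without affecting computational precision, and obtains the clustering result in a single scan of the data. Its time complexity is close to linear. An example shows that the algorithm is effective.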

7.
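A normalized composite index of the closeness between data samples, the similarity-deviation degree (相似离度), is built from the similarity coefficient and the Euclidean distance coefficient, and the similarity-deviation degree of sets of gray-level data vectors is used to describe how well images match. SVM face detection, matching by the similarity-deviation degree of image gray values, and tracking-template updating are combined into a face tracking algorithm based on similarity-deviation matching. The algorithm takes into account both the spatial position and the value of the colors in the image, and is highly accurate. Experiments show that the matching algorithm can find the region that best matches the template data in a simulated image gray-value matrix, and that it resists environmental interference well and is robust in both static and dynamic face tracking.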

8.
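鲁杰 (Lu Jie), 闫炳基 (Yan Bingji), 赵伟 (Zhao Wei), 李鹏 (Li Peng), 陈栋 (Chen Dong), 国宏伟 (Guo Hongwei). 《工程科学学报》 (Chinese Journal of Engineering), 2022, 44(12): 2081-2089
The operating profile of a blast furnace is closely tied to furnace operation and to technical and economic indicators; a reasonable operating profile helps ensure high-quality, low-consumption, high-output and long-life production. Cluster analysis of cooling stave temperatures can characterize changes in the operating profile effectively and reasonably, which is of considerable value in guiding production. The data set is clustered with K-Means and with TwoStep. Based on the principles of the two algorithms, the clustering results are evaluated with the Davies-Bouldin index (DBI) and the Dunn index (DI), and the differences between the algorithms are analyzed. For the selected sample data and features, K-Means gives the better clustering result. The study offers a useful reference for choosing clustering algorithms in big-data analysis of blast furnace ironmaking.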
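For readers unfamiliar with the Davies-Bouldin index used here, the following is a minimal sketch of its standard definition with Euclidean distance and centroid-based scatter. It is written in plain Python for illustration and is not the evaluation code of the study; lower values indicate more compact, better separated clusters. It assumes no two cluster centroids coincide.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def davies_bouldin_index(points, labels):
    """Davies-Bouldin index of a labelled point set (lower is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")
    centers = {label: centroid(members) for label, members in clusters.items()}
    # Scatter: mean distance from a cluster's members to its centroid.
    scatter = {
        label: sum(dist(p, centers[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case ratio of combined scatter to centroid separation.
        total += max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = davies_bouldin_index(pts, [0, 0, 1, 1])
bad = davies_bouldin_index(pts, [0, 1, 0, 1])
```

In the toy data the first labelling groups points that are close together, so its index is much smaller than that of the second labelling.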

9.
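The operating profile of a blast furnace is closely related to furnace life, furnace operation, and technical and economic indicators; a reasonable operating profile helps ensure high-quality, low-consumption, high-output and long-life production. Cluster analysis of blast furnace cooling stave temperature data can characterize changes in the operating profile effectively and reasonably, which is of considerable value in guiding production. Using cooling stave temperature data from the 5 800 m³ blast furnace at Shagang, the data set is clustered with K-means and with a Gaussian mixture model (GMM). The results of the two algorithms are evaluated with the Davies-Bouldin index (DBI) and the silhouette coefficient (SC), and the smelting conditions of the production states corresponding to the resulting clusters are analyzed. For the sample data selected in the paper, K-means with three profile clusters gives the better result. The third profile type, with an average coke ratio, coal ratio, fuel ratio, gas utilization rate, hot metal temperature and output of 357.62 kg/t, 163.18 kg/t, 512.34 kg/t, 47.51%, 1 502.045 ℃ and 12 472.59 t/d respectively, is the most suitable for the furnace's daily production. This study can provide blast furnace ironmaking...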
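The silhouette coefficient mentioned in this abstract can be sketched as follows. This is the standard mean-silhouette definition with Euclidean distance, written in plain Python for illustration rather than taken from the study; values near 1 indicate well separated clusters, and a singleton cluster is scored 0 by convention. It assumes the points are not all identical.

```python
from math import dist


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (higher is better)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the coefficient needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = silhouette_coefficient(pts, [0, 0, 1, 1])
bad = silhouette_coefficient(pts, [0, 1, 0, 1])
```

Comparing the coefficient across candidate cluster counts is one common way to arrive at a choice such as the three profile clusters reported above.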

10.
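[Objective] To evaluate regional eco-environmental quality with the grey clustering method based on normalized index values. [Method] Taking the Chaohu Lake basin as an example, the grey clustering evaluation of regional eco-environmental quality based on normalized index values is introduced, and its results are compared with those of the unascertained measure method to further verify that the method is feasible for this kind of evaluation. [Result] The method groups indices whose normalized values for the same standard grade are close into one class, and indices in the same class need only one common whitening function, which greatly reduces the number of whitening functions to be designed. Applied to the Chaohu Lake basin, the method gives results that basically agree with those of the unascertained measure method: apart from a one-grade difference for the basin as a whole, the eco-environmental quality of Hefei, Chaohu and Lu'an falls in grade 3 (pass), grade 4 (fairly poor) and grade 5 (poor) respectively, which shows that the method is practical. [Conclusion] The study can provide a theoretical basis for formulating comprehensive regional eco-environmental management measures.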

11.
An amino acid index is a set of 20 numerical values representing any of the different physicochemical and biochemical properties of amino acids. As a follow-up to the previous study, we have increased the size of the database, which currently contains 402 published indices, and re-performed the single-linkage cluster analysis. The results basically confirmed the previous findings. Another important feature of amino acids that can be represented numerically is the similarity between them. Thus, a similarity matrix, also called a mutation matrix, is a set of 20 x 20 numerical values used for protein sequence alignments and similarity searches. We have collected 42 published matrices, performed hierarchical cluster analyses and identified several clusters corresponding to the nature of the data set and the method used for constructing the mutation matrix. Further, we have tried to reproduce each mutation matrix by the combination of amino acid indices in order to understand which properties of amino acids are reflected most. There was a relationship between the PAM units of Dayhoff's mutation matrix and the volume and hydrophobicity of amino acids. The database of 402 amino acid indices and 42 amino acid mutation matrices is made publicly available on the Internet.

12.
This article provides an investigation of cluster validation indices that relates 4 of the indices to the L. Hubert and P. Arabie (1985) adjusted Rand index, the cluster validation measure of choice (G. W. Milligan & M. C. Cooper, 1986). It is shown how these other indices can be "roughly" transformed into the same scale as the adjusted Rand index. Furthermore, in-depth explanations are given of why classification rates should not be used in cluster validation research. The article concludes by summarizing several properties of the adjusted Rand index across many conditions and provides a method for testing the significance of observed adjusted Rand indices. (PsycINFO Database Record (c) 2010 APA, all rights reserved)
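As a reminder of what the adjusted Rand index measures, here is a minimal sketch of the Hubert-Arabie formula computed from the contingency table of two partitions. It is an illustration in plain Python, not code from the article. The index is 1 for identical partitions (up to relabelling), has expected value 0 under chance agreement, and can be negative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index of two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of items placed together in both partitions, in A only, in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares a partition with a relabelled copy of itself; the second compares two partitions that never agree on which items belong together.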

13.
Using the cluster generation procedure proposed by D. Steinley and R. Henson (2005), the author investigated the performance of K-means clustering under the following scenarios: (a) different probabilities of cluster overlap; (b) different types of cluster overlap; (c) varying sample sizes, clusters, and dimensions; (d) different multivariate distributions of clusters; and (e) various multidimensional data structures. The results are evaluated in terms of the Hubert-Arabie adjusted Rand index, and several observations concerning the performance of K-means clustering are made. Finally, the article concludes with the proposal of a diagnostic technique indicating when the partitioning given by a K-means cluster analysis can be trusted. By combining the information from several observable characteristics of the data (number of clusters, number of variables, sample size, etc.) with the prevalence of unique local optima in several thousand implementations of the K-means algorithm, the author provides a method capable of guiding key data-analysis decisions. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

14.
15.
TROPIX is a practical application project initially designed to help improve health care delivery in rural and semi-urban clinics and public hospitals in Nigeria, where laboratory facilities, medical doctors, and expertise are limited. This paper is devoted to the use of the case-based reasoning (CBR) paradigm in concert with statistical association-based reasoning (ABR) for the disease diagnosis, validation and therapy selection components of the research. Essentially, a tentative disease diagnosis is arrived at by a classification method that uses similarity and dissimilarity aggregate functions, the matched vector functions (MVF), aided by evidence ratio factors (ERF) for tied matches; it is then passed to the CBR model for validation by reusing similar past cases. The design and organization of the case library, which applies singular value decomposition (SVD) to the disease-attribute decision matrix to generate primary/secondary storage key clusters, and the use of domain-specific case-object properties that help to build a good case base are described in some detail. The paper presents a disease case validation algorithm for appropriate data filtering and therapy selection enhancement from the new case base.

16.
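高兵 (Gao Bing), 张健沛 (Zhang Jianpei), 邹启杰 (Zou Qijie). 《工程科学学报》 (Chinese Journal of Engineering), 2014, 36(12): 1703-1711
Existing density-based data stream clustering algorithms have difficulty finding clusters of different densities and difficulty separating outliers from clusters that are bridged by a few data objects. This paper proposes an evolving data stream clustering algorithm based on shared nearest neighbor density. The shared nearest neighbor density is defined on the shared nearest neighbor graph and combines two contextual factors: the degree to which an object is surrounded by similar nearest neighbors, and the degree to which it is needed by the objects around it. This makes the clustering result insensitive to variations in density. The average distance of a data object and the cluster density are defined to identify outliers and bridges between clusters. A data stream update algorithm under the sliding window model is designed to maintain the clusters in the shared nearest neighbor graph. Theoretical analysis and experimental results confirm the clustering effectiveness and quality of the algorithm.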
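The shared nearest neighbor graph that this abstract builds on can be sketched with the basic similarity often attributed to Jarvis and Patrick: two points are linked only if each is in the other's k-nearest-neighbor list, and the link weight is the number of neighbors they share. This is only the common starting point, not the density definition or the stream update procedure of the paper; the function names and the brute-force neighbor search are illustrative assumptions.

```python
from math import dist


def k_nearest(points, k):
    """Brute-force k-nearest-neighbor sets, one per point (indices, self excluded)."""
    neighbors = []
    for i, point in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(point, points[j]),
        )
        neighbors.append(set(order[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of shared neighbors, counted only for mutual nearest neighbors."""
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
nn = k_nearest(pts, k=2)
within = snn_similarity(nn, 0, 1)
across = snn_similarity(nn, 0, 3)
```

Because the similarity depends on neighbor ranks rather than raw distances, it behaves the same in dense and sparse regions, which is the property the abstract relies on for clusters of different densities.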

17.
An on-line agglomerative clustering algorithm for nonstationary data is described. Three issues are addressed. The first regards the temporal aspects of the data. The clustering of stationary data by the proposed algorithm is comparable to the other popular algorithms tested (batch and on-line). The second issue addressed is the number of clusters required to represent the data. The algorithm provides an efficient framework to determine the natural number of clusters given the scale of the problem. Finally, the proposed algorithm implicitly minimizes the local distortion, a measure that takes into account clusters with relatively small mass. In contrast, most existing on-line clustering methods assume stationarity of the data. When used to cluster nonstationary data, these methods fail to generate a good representation. Moreover, most current algorithms are computationally intensive when determining the correct number of clusters. These algorithms tend to neglect clusters of small mass due to their minimization of the global distortion (Energy).
