
1.

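Wu Tao, Chen Lifei, Zhong Yunning, Kong Xiangzeng 《计算机应用研究》 (Application Research of Computers) 2023, 40(11): 3303-3308+3314

To address the difficulty of defining a subspace dissimilarity measure in traditional K-means-type soft subspace clustering, this paper proposes a subspace difference representation model based on probabilistic distance and, on that basis, an adaptive projective clustering algorithm. The method first derives, from subspace clustering theory, a dissimilarity formula for the soft subspaces associated with the individual clusters. It then combines this formula with soft subspace clustering to define a clustering objective function and gives the clustering procedure based on a local search strategy. A series of experiments on synthetic and real-world datasets shows that introducing subspace comparison lets the algorithm learn better soft subspaces for the clusters; compared with existing mainstream subspace clustering algorithms, the proposed algorithm substantially improves clustering accuracy and is suitable for clustering high-dimensional data.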

2.

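A Soft Subspace Clustering Algorithm Based on Data Streams

Zhu Lin, Lei Jingsheng, Bi Zhongqin, Yang Jie 《软件学报》 (Journal of Software) 2013, 24(11): 2610-2627

Research on clustering high-dimensional data shows that the samples in different clusters often correspond to specific subsets of the data features. Subspace clustering has therefore attracted growing attention. However, existing soft subspace clustering algorithms are all batch-processing algorithms and do not apply well to high-dimensional data streams or large-scale data. To address this, the paper combines a scalable fuzzy clustering framework with entropy-weighting soft subspace clustering and proposes an effective entropy-weighting soft subspace clustering algorithm for streaming data, EWSSC (entropy-weighting streaming subspace clustering). The algorithm retains the properties of traditional soft subspace clustering while using the scalable fuzzy clustering strategy to apply soft subspace clustering to streaming data. Experimental results show that on high-dimensional data streams EWSSC obtains results approximately consistent with those of batch soft subspace clustering methods.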
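The entropy-weighting idea named in this abstract can be illustrated with a small sketch. It is not the EWSSC algorithm itself: it assumes the exponential weight update commonly used in entropy-weighted k-means-style methods, where the weight of a feature within one cluster decays with that feature's within-cluster dispersion. The function name `entropy_feature_weights` and the parameter `gamma` are illustrative assumptions, not names from the paper.

```python
import math

def entropy_feature_weights(points, center, gamma=1.0):
    """Return one weight per feature for a single cluster.

    points: list of equal-length numeric lists belonging to the cluster
    center: the cluster center (same length as each point)
    gamma:  assumed entropy regularization strength (> 0); larger values
            push the weights toward a uniform distribution
    """
    dims = len(center)
    # Within-cluster dispersion of each feature around the center.
    dispersion = [sum((p[j] - center[j]) ** 2 for p in points) for j in range(dims)]
    # Shift by the minimum before exponentiating for numerical stability;
    # the shift cancels out in the normalization below.
    d_min = min(dispersion)
    raw = [math.exp(-(d - d_min) / gamma) for d in dispersion]
    total = sum(raw)
    return [r / total for r in raw]

# Feature 0 is tight around the center, feature 1 is widely spread.
cluster = [[1.0, 5.0], [1.1, 9.0], [0.9, 1.0]]
center = [1.0, 5.0]
weights = entropy_feature_weights(cluster, center, gamma=1.0)
```

In this toy cluster the compact feature receives almost all of the weight, which is the "soft subspace" effect: every feature keeps a nonzero weight, but the weights concentrate on the dimensions along which the cluster is tight.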

3.

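Kernel Subspace Clustering Algorithm for Categorical Data

Xu Kunpeng, Chen Lifei, Sun Haojun, Wang Beizhan 《软件学报》 (Journal of Software) 2020, 31(11): 3492-3505

Most existing subspace clustering methods for categorical data assume that features are mutually independent and ignore linear or nonlinear correlations among attributes. This paper proposes a kernel subspace clustering method for categorical data. First, kernel functions originally designed for continuous data are introduced to project categorical data into a kernel space, and a feature-weighted similarity measure for categorical data is defined in that space. Second, a kernel subspace clustering objective function for categorical data is derived from this measure, together with an efficient optimization method for solving it. Finally, a kernel subspace clustering algorithm for categorical data is defined. The algorithm considers relationships among attributes in a nonlinear space and, during clustering, assigns each attribute a feature weight that measures its relevance to each cluster, thereby performing embedded feature selection on categorical attributes. A cluster validity index is also defined to evaluate the quality of clustering results on categorical data. Experiments on synthetic and real-world datasets show that, compared with existing subspace clustering algorithms, the kernel subspace clustering algorithm can uncover nonlinear relationships among categorical attributes and effectively improves clustering quality.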

4.

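A Projective Clustering Algorithm with Adaptive Entropy (Total citations: 1; self-citations: 0; citations by others: 1)

Wu Tao, Chen Lifei 《计算机科学与探索》 (Journal of Frontiers of Computer Science and Technology) 2014, (8): 933-944

Because of the "dimensionality effect", many traditional clustering methods perform poorly on high-dimensional data. Projective clustering has received wide attention in recent years, and soft subspace clustering in particular has been widely studied and applied. However, most existing projective subspace clustering algorithms require users to preset several important parameters and do not consider optimizing the clusters' projected subspaces, which degrades clustering performance. To address this, a new optimization objective function is defined that optimizes the subspace of each cluster while minimizing within-cluster compactness. A new feature-weight computation is obtained by mathematical derivation, and an adaptive k-means-type projective clustering algorithm is proposed. During clustering, the algorithm computes each optimization parameter dynamically from information in the dataset itself and from the derived formulas. Experimental results show that the new algorithm improves clustering quality by optimizing the projected subspaces and clearly outperforms existing projective clustering algorithms.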

5.

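Application and Study of a Soft Subspace Clustering Algorithm for Imbalanced Data in Clinical Medicine

《软件》 (Software) 2019, (11): 106-110

Cluster analysis is an important research topic in data mining and is widely used in information filtering, bioinformatics, medicine, and other fields. This work focuses on top-down subspace clustering methods, mainly because the principal algorithms of this type are currently built on K-means or K-modes and, under the uniform effect, existing soft subspace algorithms cannot cluster imbalanced data effectively. A new partition-based soft subspace clustering algorithm for imbalanced data is therefore proposed. The proposed algorithm improves clustering accuracy on imbalanced data and has theoretical significance and practical value in fields such as bioinformatics and clinical medicine.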

6.

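A Fast Unsupervised Linear Dimensionality Reduction Method Based on Maximum Entropy

Wang Jikui, Yang Zhengguo, Liu Xuewen, Yi Jihai, Li Bing, Nie Feiping 《软件学报》 (Journal of Software) 2023, 34(4): 1779-1795

High-dimensional data are ubiquitous in the real world, but they often contain large amounts of redundant and noisy information, so many traditional clustering algorithms perform poorly on them. In practice, the cluster structure of high-dimensional data is often embedded in a lower-dimensional subspace, which makes dimensionality reduction a key technique for uncovering that structure. Among the many dimensionality reduction methods, graph-based methods are a research focus. However, most graph-based dimensionality reduction algorithms have two problems: (1) they need to compute or learn an adjacency graph, which is computationally expensive; (2) the reduction process does not take into account how the reduced data will be used. To address these two problems, a fast unsupervised dimensionality reduction algorithm based on maximum entropy, MEDR, is proposed. MEDR combines linear projection with a maximum-entropy clustering model and uses an effective iterative optimization algorithm to find the potentially optimal cluster structure of high-dimensional data embedded in a low-dimensional subspace. MEDR does not require an adjacency graph as input, and its time complexity is linear in the number of samples. Experiments on real-world datasets show that, compared with traditional dimensionality reduction methods, MEDR finds a better projection matrix from the high-dimensional data to a low-dimensional subspace, so that the projected data are better suited to clustering.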

7.

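A Soft Subspace Clustering Algorithm for Imbalanced Data

Cheng Lingfang, Yang Tianpeng, Chen Lifei 《计算机应用》 (Journal of Computer Applications) 2017, 37(10): 2952-2957

Under the uniform effect, current K-means-type soft subspace algorithms cannot cluster imbalanced data effectively. To address this problem, a new partition-based soft subspace clustering algorithm for imbalanced data is proposed. First, a dual-weighting method is proposed that assigns each attribute a feature weight and each cluster a cluster weight reflecting its importance. Second, a new distance measure for mixed-type data is proposed to balance the differences between attributes of different types and between categorical attributes with different numbers of symbols. Third, a subspace clustering objective function for imbalanced data is defined on the basis of the dual-weighting method, and expressions for optimizing the cluster weights and feature weights are given. A series of experiments on real-world datasets shows that the dual-weighting method lets the new algorithm learn more accurate soft subspaces for the clusters in imbalanced data; compared with existing K-means-type soft subspace algorithms, the proposed algorithm improves clustering accuracy on imbalanced data, with gains of nearly 50% on the bioinformatics data among those datasets.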

8.

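An Adaptive Soft Subspace Clustering Algorithm (Total citations: 6; self-citations: 0; citations by others: 6)

Chen Lifei, Guo Gongde, Jiang Qingshan 《软件学报》 (Journal of Software) 2010, 21(10): 2513-2523

Soft subspace clustering is an important tool for high-dimensional data analysis. Existing algorithms usually require users to set some global key parameters in advance and do not consider subspace optimization. A new optimization objective function for soft subspace clustering is proposed that minimizes the within-cluster compactness of the subspace clusters while maximizing the projected subspace in which each cluster resides. A new local feature-weighting scheme is derived, and on that basis an adaptive k-means-type soft subspace clustering algorithm is proposed. During clustering, the algorithm dynamically computes the optimal algorithm parameters from information about the dataset and its partition. Experimental results on real-world and synthetic datasets show that the algorithm substantially improves clustering accuracy and the stability of the clustering results.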

9.

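A Semi-Supervised Clustering Method Based on Discriminant Analysis (Total citations: 1; self-citations: 0; citations by others: 1)

Chen Xiaodong, Yin Xuesong, Lin Huanxiang 《计算机工程与应用》 (Computer Engineering and Applications) 2010, 46(6): 139-143

Compared with unsupervised clustering, semi-supervised clustering uses some prior information to better mine and understand the intrinsic structure of the data while closely following user preferences. Existing typical semi-supervised clustering algorithms are suited only to low-dimensional data; this paper proposes a novel semi-supervised clustering algorithm based on discriminant analysis for clustering high-dimensional data. The new algorithm first projects the high-dimensional data with principal component analysis and clusters the data in the projected space with a spherical K-means-based algorithm; it then uses the clustering result to reduce the dimensionality of the input-space data with linear discriminant analysis; finally, it clusters the data again in the projected space. Experiments on a set of real-world datasets show that the proposed algorithm not only handles high-dimensional data effectively but also improves clustering performance.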

10.

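A Random Projective Clustering Algorithm for High-Dimensional Symbolic Data (Total citations: 1; self-citations: 0; citations by others: 1)

Du Yi, Lu Detang, Huang Feng, Wang Lei 《小型微型计算机系统》 (Journal of Chinese Computer Systems) 2006, 27(9): 1605-1607

Real-world data are often distributed in high-dimensional spaces, and viewed across the whole vector space the relationships among such data are very sparse, so how to reduce dimensionality in order to cluster high-dimensional data has drawn wide attention from researchers. This paper introduces a random projective clustering algorithm for high-dimensional symbolic data. It selects the dimensions relevant to a cluster according to frequency, randomly generates cluster centers and their relevant dimensions and keeps the best ones according to the quality of the projected clustering, thereby extending projective clustering to symbolic data spaces. Experimental results confirm the practicality and effectiveness of the algorithm.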
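The frequency-based selection of relevant dimensions mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: a dimension is treated as relevant to a candidate cluster when its most frequent symbol covers at least a fraction `min_freq` of the objects. Both the threshold rule and the name `relevant_dimensions` are hypothetical and not taken from the paper.

```python
from collections import Counter

def relevant_dimensions(objects, min_freq=0.8):
    """Return (dimension index, dominant symbol) pairs for one candidate cluster.

    A dimension counts as relevant when its most frequent symbol covers at
    least `min_freq` of the objects. The threshold is an assumed parameter.
    """
    if not objects:
        return []
    n = len(objects)
    selected = []
    for j in range(len(objects[0])):
        # Most frequent symbol on dimension j and how often it occurs.
        symbol, count = Counter(obj[j] for obj in objects).most_common(1)[0]
        if count / n >= min_freq:
            selected.append((j, symbol))
    return selected

group = [
    ("red", "small", "x"),
    ("red", "large", "x"),
    ("red", "small", "y"),
    ("red", "medium", "x"),
    ("red", "large", "x"),
]
dims = relevant_dimensions(group, min_freq=0.8)
```

Here the first and third dimensions are kept because one symbol dominates them, while the second dimension is dropped because no symbol reaches the threshold.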

11.

Projective clustering by histograms (Total citations: 5; self-citations: 0; citations by others: 5)

《IEEE Transactions on Knowledge and Data Engineering》 2005, 17(3): 369-383

Recent research suggests that clustering for high-dimensional data should involve searching for "hidden" subspaces with lower dimensionalities, in which patterns can be observed when data objects are projected onto the subspaces. Discovering such interattribute correlations and location of the corresponding clusters is known as the projective clustering problem. We propose an efficient projective clustering technique by histogram construction (EPCH). The histograms help to generate "signatures", where a signature corresponds to some region in some subspace, and signatures with a large number of data objects are identified as the regions for subspace clusters. Hence, projected clusters and their corresponding subspaces can be uncovered. Compared to the best previous methods to our knowledge, this approach is more flexible in that less prior knowledge on the data set is required, and it is also much more efficient. Our experiments compare behaviors and performances of this approach and other projective clustering algorithms with different data characteristics. The results show that our technique is scalable to very large databases, and it is able to return accurate clustering results.
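A rough sketch of the signature idea, under simplifying assumptions: it builds one 1-D histogram per dimension, marks a bin as dense when it holds at least `min_count` points, and groups points by the tuple of dense bins they fall into. The published EPCH technique is more elaborate; the names `signature_clusters`, `bin_width`, and `min_count` are hypothetical.

```python
from collections import Counter, defaultdict

def signature_clusters(points, bin_width=1.0, min_count=3):
    """Group point indices by their signature of dense histogram bins."""
    if not points:
        return {}
    dims = len(points[0])
    # One 1-D histogram per dimension: bin id -> number of points.
    hists = [Counter(int(p[j] // bin_width) for p in points) for j in range(dims)]
    groups = defaultdict(list)
    for idx, p in enumerate(points):
        # Keep a bin id only where the point falls in a dense bin; None marks
        # a dimension that is treated as irrelevant for this point.
        sig = tuple(
            b if hists[j][b] >= min_count else None
            for j, b in enumerate(int(x // bin_width) for x in p)
        )
        # Points with no dense bin at all are left unassigned.
        if any(s is not None for s in sig):
            groups[sig].append(idx)
    return dict(groups)

data = [(2.1, 0.5), (2.4, 7.2), (2.8, 13.9), (9.5, 4.1)]
clusters = signature_clusters(data, bin_width=1.0, min_count=3)
```

The first three points share a dense bin on dimension 0 and are scattered on dimension 1, so they form one projected cluster whose subspace is dimension 0 only; the last point matches no dense bin and stays unassigned.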

12.

An extended EM algorithm for subspace clustering

Lifei CHEN, Qingshan JIANG 《Frontiers of Computer Science in China》 2008, 2(1): 81-86

Clustering high dimensional data has become a challenge in data mining due to the curse of dimensionality. To solve this problem, subspace clustering has been defined as an extension of traditional clustering that seeks to find clusters in subspaces spanned by different combinations of dimensions within a dataset. This paper presents a new subspace clustering algorithm that calculates the local feature weights automatically in an EM-based clustering process. In the algorithm, the features are locally weighted by using a new unsupervised weighting method, as a means to minimize a proposed clustering criterion that takes into account both the average intra-cluster compactness and the average inter-cluster separation for subspace clustering. To capture accurate subspace information, an additional outlier detection process is presented to identify the possible local outliers of subspace clusters, and is embedded between the E-step and M-step of the algorithm. The method has been evaluated in clustering real-world gene expression data and high dimensional artificial data with outliers, and the experimental results have shown its effectiveness.

13.

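New Advances in Research on Subspace Clustering Algorithms

Chen Huiping, Wang Yu, Wang Jiandong 《计算机仿真》 (Computer Simulation) 2007, 24(3): 6-10, 34

Clustering high-dimensional data is both a difficulty and a focus of clustering research. Subspace clustering is an effective way to cluster high-dimensional datasets; it extends traditional clustering algorithms in high-dimensional data spaces, and its idea is to localize the search to the relevant dimensions. This paper introduces the ideas behind subspace clustering algorithms from the perspective of two search strategies, top-down and bottom-up, and surveys the subspace clustering algorithms proposed in recent years. It compares typical subspace clustering algorithms in terms of the parameters they require, their sensitivity to those parameters, their scalability, and the shapes of the clusters they find, and discusses the challenges and future trends of subspace clustering.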

14.

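A High-Dimensional Clustering Algorithm Based on Ordered Subspaces and Its Visualization

Liu Kan, Zhou Xiaozheng, Zhou Dongru 《计算机研究与发展》 (Journal of Computer Research and Development) 2003, 40(10): 1509-1513

To discover clusters effectively, especially clusters of arbitrary shape, many density-based clustering algorithms have been proposed in recent years, such as DBSCAN, OPTICS, DENCLUE, and CLIQUE. This paper proposes a new density-based clustering algorithm, CODU (clustering by ordering dense unit). The basic idea is to sort the unit subspaces by density and, for each subspace, form a new cluster if its density is greater than that of its surrounding neighbors. Because the number of subspaces is far smaller than the number of data objects, the algorithm is efficient. A new data visualization method is also proposed that treats data objects as stimulus spectra and maps them into three-dimensional space, so that the clustering result is displayed clearly.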
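The ordering idea described in this abstract can be sketched on a regular grid. The abstract states only that units are sorted by density and that a unit denser than its neighbors starts a new cluster; the rule used below for the remaining units (join the cluster of the densest already-visited neighbor) is an assumption added to make the sketch complete. The names `codu_like_clusters` and `cell_size` are hypothetical.

```python
from collections import Counter
from itertools import product

def codu_like_clusters(points, cell_size=1.0):
    """Label grid cells by visiting them in order of decreasing density."""
    if not points:
        return {}
    # Map every point to the integer coordinates of its grid cell.
    density = Counter(tuple(int(x // cell_size) for x in p) for p in points)
    dims = len(points[0])
    # All neighbor offsets in {-1, 0, 1}^dims except the zero offset.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]

    labels = {}
    next_label = 0
    # Densest cells first; ties are broken by coordinates for determinism.
    for cell in sorted(density, key=lambda c: (-density[c], c)):
        neighbors = [tuple(a + b for a, b in zip(cell, off)) for off in offsets]
        visited = [n for n in neighbors if n in labels]
        if visited:
            # Assumption: join the cluster of the densest visited neighbor.
            best = max(visited, key=lambda n: (density[n], n))
            labels[cell] = labels[best]
        else:
            # No neighbor was visited earlier, so this cell is a density peak.
            labels[cell] = next_label
            next_label += 1
    return labels

pts = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1), (1.2, 0.3),
       (5.1, 5.2), (5.5, 5.9), (6.3, 5.5)]
cell_labels = codu_like_clusters(pts, cell_size=1.0)
```

Because the loop runs over occupied cells rather than over points, its cost depends on the number of non-empty units, which matches the efficiency argument made in the abstract.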

15.

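A Grid-Based Subspace Clustering Algorithm for High-Dimensional Data Streams (Total citations: 4; self-citations: 0; citations by others: 4)

Sun Yufen, Lu Yansheng 《计算机科学》 (Computer Science) 2007, 34(4): 199-203

Based on an analysis of grid clustering methods, a subspace clustering algorithm that can process high-dimensional data streams online is designed by combining bottom-up and top-down grid methods. By exploiting the data compression ability of the bottom-up grid method and the ability of the top-down grid method to handle high-dimensional data, the algorithm can quickly identify clusters lying in different subspaces with a single scan of the data stream. Theoretical analysis and experiments on multiple datasets show that the algorithm has high accuracy and efficiency.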

16.

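An Effective Clustering Analysis Algorithm Based on Grid and Density (Total citations: 12; self-citations: 0; citations by others: 12)

Hu Yang, Chen Gang 《计算机应用》 (Journal of Computer Applications) 2003, 23(12): 64-67

This paper discusses the concepts, techniques, and algorithms related to clustering in data mining. It proposes a grid- and density-based algorithm whose advantage is that it automatically discovers subspaces containing interesting knowledge and mines all the clusters present in them; it also handles high-dimensional data and the data tables of large datasets well. The algorithm expresses the final result in DNF form.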

17.

Automatic Subspace Clustering of High Dimensional Data (Total citations: 8; self-citations: 0; citations by others: 8)

Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan 《Data Mining and Knowledge Discovery》 2005, 11(1): 5-33

18.

Enhancing subspace clustering based on dynamic prediction

Ratha PECH, Dong HAO, Hong CHENG, Tao ZHOU 《Frontiers of Computer Science》 2019, 13(4): 802

In high dimensional data, many dimensions are irrelevant to each other and clusters are usually hidden under noise. As an important extension of traditional clustering, subspace clustering can be used to simultaneously cluster high dimensional data into several subspaces and associate the low-dimensional subspaces with the corresponding points. In subspace clustering, a crucial step is to construct an affinity matrix with block-diagonal form, in which the blocks correspond to different clusters. Distance-based methods and representation-based methods are the two major types of approaches for building an informative affinity matrix. In general, it is the difference between the density inside and outside the blocks that determines the efficiency and accuracy of the clustering. In this work, we introduce a well-known approach from statistical physics, namely link prediction, to enhance subspace clustering by reinforcing the affinity matrix. More importantly, we introduce the idea of combining complex network theory with machine learning. By revealing the hidden links inside each block, we maximize the density of each block along the diagonal, while keeping the remaining non-block entries of the affinity matrix as sparse as possible. Our method has been shown to achieve remarkably improved clustering accuracy compared with existing methods on well-known datasets.
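How link prediction can reinforce an affinity matrix may be easier to see in a small sketch. The abstract does not say which link predictor the authors use; the version below assumes the simple common-neighbors index, and the function name `reinforce_affinity` and the mixing parameter `alpha` are hypothetical.

```python
def reinforce_affinity(affinity, alpha=0.5):
    """Add a normalized common-neighbor score to a symmetric affinity matrix."""
    n = len(affinity)
    # Common-neighbor score: how strongly i and j are connected through a
    # third node k. The diagonal is left at zero.
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            scores[i][j] = sum(
                affinity[i][k] * affinity[k][j]
                for k in range(n)
                if k != i and k != j
            )
    # Normalize by the largest score so alpha is on the scale of the affinities.
    top = max((max(row) for row in scores), default=0.0) or 1.0
    return [
        [affinity[i][j] + alpha * scores[i][j] / top for j in range(n)]
        for i in range(n)
    ]

# Two blocks: nodes {0, 1, 2} (the link 0-2 is missing) and nodes {3, 4}.
A = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
A_new = reinforce_affinity(A, alpha=0.5)
```

In this toy matrix the missing within-block entry between nodes 0 and 2 gains weight because both connect to node 1, while entries between the two blocks stay at zero, so the block-diagonal structure becomes denser without adding cross-block links.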

19.

Validation indices for projective clustering

Lifei Chen, Shanjun He, Qingshan Jiang 《Frontiers of Computer Science in China》 2009, 3(4): 477-484

Cluster validation is a major issue in cluster analysis of data mining, which is the process of evaluating performance of clustering algorithms under varying input conditions. Many existing validity indices address clustering results of low-dimensional data. Within high-dimensional data, many of the dimensions are irrelevant, and the clusters usually only exist in some projected subspaces spanned by different combinations of dimensions. This paper presents a solution to the problem of cluster validation for projective clustering. We propose two new measurements for the intracluster compactness and intercluster separation of projected clusters. Based on these measurements and the conventional indices, three new cluster validity indices are presented. Combined with a fuzzy projective clustering algorithm, the new indices are used to determine the number of projected clusters in high-dimensional data. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real-world datasets.