1.
To address the problems of scarce prior knowledge and large clustering bias in semi-supervised clustering, an active semi-supervised clustering algorithm based on pairwise constraints is proposed. Active learning is introduced to enrich the information in the constraint set and thereby improve clustering; the constraint set is then used to build a projection matrix that maps the data to a low-dimensional space, which eases computation and further improves clustering quality. A closure-substitution idea is proposed to simplify the sample space and thus reduce clustering bias. Because the algorithm operates on low-dimensional data and the pairwise constraint set is information-rich, both the time efficiency and the performance of clustering are assured. Experimental results show that the semi-supervised clustering algorithm with active learning improves clustering markedly and is efficient and sound.
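The closure-substitution idea can be sketched as follows: take the transitive closure of the must-link constraints and replace each closure by a single representative point. This is a minimal illustration under my own assumptions; the union-find helper, variable names, and toy data are mine, not the paper's:

```python
import numpy as np

def closures(n, must_link):
    """Transitive closure of must-link constraints via union-find:
    samples linked by a chain of constraints end up in one group."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in must_link:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Toy data: two must-link pairs plus one unconstrained point.
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [9.0, 9.0]])
must_link = [(0, 1), (2, 3)]

groups = closures(len(X), must_link)
# Closure substitution: each closure is replaced by its centroid,
# shrinking the sample space before clustering proceeds.
reduced = np.array([X[g].mean(axis=0) for g in groups])
```

After substitution, any clustering algorithm runs on `reduced` (3 points instead of 5), and each original sample inherits its closure's label.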
2.
To address the computational bottleneck of spectral clustering, a fast ensemble algorithm called indirect spectral clustering is proposed. It first over-clusters the dataset with K-Means, then treats each over-cluster as a basic object, and finally applies standard spectral clustering at the level of the over-clusters to complete the overall clustering. When this idea is applied to clustering large text datasets, the similarity between over-cluster centers can be measured with the common cosine distance. Experiments on the 20-Newsgroups text data show that indirect spectral clustering is on average 14.72% more accurate than K-Means and only 0.88% less accurate than normalized-cut spectral clustering, while requiring on average less than 1/16 of the latter's computation time; moreover, as the dataset grows and normalized-cut spectral clustering hits its computational bottleneck, the proposed algorithm can still quickly deliver a good suboptimal solution.
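The three-stage pipeline might look like the sketch below with scikit-learn. The over-cluster count `n_over`, the toy blob data, and the clipping of the cosine affinity to non-negative values are my assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import cosine_similarity

def indirect_spectral(X, n_clusters, n_over=50, seed=0):
    # Stage 1: over-cluster the data with K-Means.
    km = KMeans(n_clusters=n_over, n_init=10, random_state=seed).fit(X)
    centers = km.cluster_centers_
    # Stage 2: treat each over-cluster as one object; build a cosine
    # affinity between over-cluster centers (clipped to be non-negative,
    # as spectral clustering expects non-negative edge weights).
    affinity = np.clip(cosine_similarity(centers), 0, None)
    # Stage 3: standard spectral clustering on the over-cluster level.
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=seed).fit(affinity)
    # Every point inherits the spectral label of its over-cluster.
    return sc.labels_[km.labels_]

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)
labels = indirect_spectral(X, n_clusters=4)
```

The speed gain comes from running the eigendecomposition on an `n_over × n_over` affinity rather than an `n × n` one.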
3.
A hierarchical clustering algorithm based on rough sets
Rough set theory is a mathematical tool for handling vague and uncertain knowledge. Applying rough set theory to cluster analysis in knowledge discovery, this paper defines the local indiscernibility relation, the local and total indiscernibility degrees between individuals, the indiscernibility degree between classes, and the overall approximation accuracy of a clustering result. On this basis, a rough-set-based hierarchical clustering algorithm is proposed that adjusts its parameters automatically to seek a better clustering. Experimental results confirm the algorithm's feasibility; it performs especially well when clustering on symbolic attributes.
4.
A fuzzy K-plane clustering algorithm constrained by dual knowledge representations, incremental feature learning and data style information (ISF-KPC), is proposed. For better generalization, the original input features are incrementally expanded with a Gaussian kernel before clustering. Since samples drawn from the same cluster share a common style, the style information is expressed as a matrix, and each cluster's style matrix is determined iteratively. Extensive experiments show that the dual-knowledge-constrained ISF-KPC achieves clustering performance competitive with the compared algorithms, and performs especially well on datasets with pronounced style.
5.
A multi-density clustering algorithm based on expansion and grids
The concept of grid-density reachability and a boundary-handling technique are proposed, and on this basis an expansion-based multi-density grid clustering algorithm is presented. The algorithm uses the grid to speed up clustering and the boundary-handling technique to improve its precision; each cluster grows outward step by step from the densest cell. Experimental results show that the algorithm clusters both multi-density and uniform-density datasets effectively and with high precision.
7.
The performance of an unsupervised clustering algorithm depends on the distance metric the user specifies on the input dataset; the metric directly governs how similarity between samples is computed, so different metrics can change the clustering result substantially. Addressing the choice of metric in spectral clustering, a spectral clustering algorithm based on side-information distance metric learning is proposed. It exploits side information inherent in the dataset, namely whether sampled pairs of data points are similar, to learn a distance metric; the learned metric is then applied in the similarity function of spectral clustering to build the affinity matrix. Experiments on UCI benchmark datasets show that the algorithm improves prediction accuracy markedly over standard spectral clustering.
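A loose sketch of the pipeline: learn a metric from sampled similar pairs, then feed a Gaussian affinity under that metric to spectral clustering. The diagonal metric below, which simply down-weights dimensions that vary a lot within similar pairs, is my simplified stand-in for full metric learning; the toy data and bandwidth choice are also assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, y = make_blobs(n_samples=300, centers=3, random_state=1)

# Side information: randomly sampled pairs known to be similar
# (here simulated by checking the ground-truth label).
idx = rng.integers(0, 300, size=(100, 2))
similar = [(i, j) for i, j in idx if y[i] == y[j] and i != j]

# Minimal diagonal metric: weight each dimension by the inverse of its
# variance over the similar-pair differences (a crude metric learner).
diffs = np.array([X[i] - X[j] for i, j in similar])
w = 1.0 / (diffs.var(axis=0) + 1e-8)
Xw = X * np.sqrt(w)

# Gaussian affinity under the learned metric feeds spectral clustering.
d2 = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(-1)
sigma2 = np.median(d2)  # median heuristic for the bandwidth
A = np.exp(-d2 / (2 * sigma2))
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit(A).labels_
```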
11.
APSCAN: A parameter free algorithm for clustering
DBSCAN is a density-based clustering algorithm whose effectiveness on spatial datasets has been demonstrated in the existing literature. However, DBSCAN has two distinct drawbacks: (i) its clustering performance depends on two user-specified parameters, the maximum radius of a neighborhood and the minimum number of data points that neighborhood must contain. Together these two parameters define a single density, yet without sufficient prior knowledge they are difficult to determine; (ii) because the two parameters encode a single density, DBSCAN does not perform well on datasets with varying densities. These two issues hamper applications. To address them in a systematic way, this paper proposes a novel parameter-free clustering algorithm named APSCAN. First, the Affinity Propagation (AP) algorithm detects local densities in the dataset and generates a normalized density list. Second, the first pair of density parameters is combined with every other pair in the normalized density list as input to the proposed DDBSCAN (Double-Density-Based SCAN), producing a set of clustering results under varying density parameters derived from the list. Third, an update rule is developed for the results of the DDBSCAN runs with different input parameters, and these clustering results are synthesized into a final result. APSCAN has two advantages: it does not require the two parameters that DBSCAN must predefine, and it can cluster datasets with varying densities while preserving their nonlinear data structure.
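The overall flow might be sketched as below, under heavy simplification: AP proposes exemplars, the k-NN distance around each exemplar yields a candidate density list, and DBSCAN is run per candidate. The paper's DDBSCAN double-density scan and its update rule for merging results are replaced here by a plain silhouette-based selection, so this is only an illustration of the density-list idea, not the authors' method:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=[0.4, 1.0, 2.0],
                  random_state=7)

# Step 1: Affinity Propagation proposes exemplars with no preset k.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
exemplars = X[ap.cluster_centers_indices_]

# Step 2: the 4-NN distance around each exemplar gives one candidate
# eps, together spanning the dataset's range of local densities.
nn = NearestNeighbors(n_neighbors=5).fit(X)
dists, _ = nn.kneighbors(exemplars)
eps_list = np.unique(np.round(dists[:, -1], 2))

# Step 3: run DBSCAN per candidate density; keep the best-scoring run
# (a stand-in for the paper's DDBSCAN + result-synthesis rule).
best = None
for eps in eps_list:
    cand = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    k = len(set(cand) - {-1})
    if k >= 2:
        score = silhouette_score(X, cand)
        if best is None or score > best[0]:
            best = (score, cand)
labels = best[1]
```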
13.
Selim Mimaroglu Ertunc Erdil 《Engineering Applications of Artificial Intelligence》2013,26(10):2525-2539
Clustering is the process of grouping similar objects, where similarity between objects is usually measured by a distance metric; the groups a clustering method forms are referred to as clusters. Clustering is widely used, with applications ranging from biology to economics. Each clustering technique has advantages and disadvantages, and some algorithms require input parameters that strongly affect the result. In most cases it is not possible to choose the best distance metric, the best clustering method, and the best input-argument values for a given dataset. Multiple clusterings can therefore be obtained from several distance metrics, several clustering methods, and several input-argument values, and these clusterings can be combined into a new, better-quality final clustering. We propose a family of algorithms for combining multiple clusterings that are memory-efficient, scalable, robust, and intuitive. By working at the cluster level, our new algorithms offer tremendous speed gains and low memory requirements while producing very high-quality final clusters. Extensive experimental evaluations on challenging artificially generated and real datasets from a diverse set of domains establish the usefulness of our methods.
14.
Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task, as attested by the hundreds of clustering algorithms in the literature. Each clustering technique makes assumptions about the underlying data set; if the assumptions hold, good clusterings can be expected, but it is hard, and in some cases impossible, to satisfy them all. It is therefore beneficial to apply different clustering methods to the same data set, or the same method with varying input parameters, or both, and then combine the clusterings obtained into a final clustering of better overall quality, a problem that has gained significant importance recently. Our contributions are a novel clique-based method for combining a collection of clusterings into a final clustering, and a novel output-sensitive clique-finding algorithm that works on large, dense graphs and produces output in a short amount of time. Extensive experimental studies on real and artificial data sets demonstrate the effectiveness of our contributions.
15.
T. Warren Liao 《Pattern recognition》2007,40(9):2550-2562
A two-step procedure is developed for the exploratory mining of real-valued vector (multivariate) time series using partition-based clustering methods. The procedure was tested with model-generated data, multiple-sensor process data, and simulation data. The test results indicate that, given a priori knowledge of the number of clusters, the procedure is quite effective at producing better clusterings than a hidden Markov model (HMM)-based clustering method. Two existing validity indices were tested and found ineffective at determining the actual number of clusters. Determining the appropriate number of clusters without prior knowledge remains an unresolved research issue, for the proposed procedure as much as for HMM-based clustering, and further development is necessary.
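The two-step shape of such a procedure can be sketched as follows: convert each multivariate series to a fixed-length feature vector, then apply a partition-based method. The per-variable mean/std features and the toy two-regime data are my assumptions; the paper's actual feature extraction may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy multivariate series: 40 series x 50 time steps x 3 variables,
# drawn from two regimes with clearly different means.
a = rng.normal(0.0, 1.0, size=(20, 50, 3))
b = rng.normal(3.0, 1.0, size=(20, 50, 3))
series = np.concatenate([a, b])

# Step 1: each series becomes a fixed-length feature vector
# (per-variable mean and standard deviation over time).
feats = np.concatenate([series.mean(axis=1), series.std(axis=1)], axis=1)

# Step 2: partition-based clustering on the feature vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
```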
16.
A good clustering algorithm should require few user-input parameters, be insensitive to noise, discover clusters of arbitrary shape, handle high-dimensional data, and be interpretable and scalable. Applying cluster analysis in a geographic information system (GIS) enables the generalization and synthesis of GIS data. This paper proposes a clustering algorithm based on distance-threshold adjacency: objects are added one by one to an existing cluster whenever they are threshold-reachable from it. The algorithm can find clusters of arbitrary shape and separates noise data well. In the experiments it is applied to data mining in a GIS, and the results show that it achieves satisfactory GIS clustering.
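Threshold-reachable growth can be sketched as a breadth-first expansion: two points land in the same cluster if a chain of points with consecutive gaps at most `t` links them. This is a minimal illustration with my own toy data; unreachable singletons simply become their own (noise) clusters:

```python
import numpy as np

def threshold_clustering(X, t):
    """Group points that are threshold-reachable: i and j share a
    cluster iff a chain of points with consecutive gaps <= t links them."""
    n = len(X)
    labels = np.full(n, -1)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    cur = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack = [i]          # grow a new cluster outward from point i
        labels[i] = cur
        while stack:
            p = stack.pop()
            for q in np.where((d[p] <= t) & (labels == -1))[0]:
                labels[q] = cur
                stack.append(q)
        cur += 1
    return labels

# Two chains of nearby points, far apart from each other.
X = np.array([[0, 0], [0.5, 0], [1.0, 0], [10, 10], [10.4, 10]])
labels = threshold_clustering(X, t=1.0)
```

Because reachability follows chains, arbitrary (e.g. elongated) shapes are recovered, which is the property the abstract emphasizes.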
17.
Roelof K. Brouwer 《Journal of Intelligent Information Systems》2009,32(3):213-235
The first stage of knowledge acquisition, and of reducing the complexity of a group of entities, is to partition the entities into groups or clusters based on their attributes or characteristics. Clustering is one of the most basic processes performed in simplifying data and expressing knowledge in a scientific endeavor; it is akin to defining classes. Since the output of clustering is a partition of the input data, the quality of that partition must be determined as a way of measuring the quality of the clustering process, and the problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. This paper looks at some commonly used clustering measures, including the Rand index (RI), the adjusted Rand index (ARI) and the Jaccard index (JI), which are defined for crisp clustering, and extends them to fuzzy clustering measures FRI, FARI and FJI. The new indices give the same values as the original indices in the special case of crisp clustering. The extension is made by first finding equivalent expressions for the parameters a, b, c, and d of these indices in the crisp case; a relationship called bonding, which describes the degree to which two cluster members are in the same cluster or class, is defined first. The effectiveness of the indices is demonstrated through use in both crisp and fuzzy clustering.
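In the crisp case, the parameters a, b, c, d count how the object pairs split between the two partitions, and the indices follow directly; the fuzzy extension via bonding is not shown here. A minimal sketch with my own toy labelings:

```python
from itertools import combinations

def pair_counts(u, v):
    """a: pairs together in both partitions; b: together in u only;
    c: together in v only; d: apart in both (crisp labelings u, v)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(u)), 2):
        su, sv = u[i] == u[j], v[i] == v[j]
        if su and sv:
            a += 1
        elif su:
            b += 1
        elif sv:
            c += 1
        else:
            d += 1
    return a, b, c, d

u = [0, 0, 1, 1, 2]
v = [0, 0, 1, 2, 2]
a, b, c, d = pair_counts(u, v)
ri = (a + d) / (a + b + c + d)   # Rand index
ji = a / (a + b + c)             # Jaccard index (pair d is ignored)
```

For these labelings the ten pairs split as a=1, b=1, c=1, d=7, giving RI = 0.8 and JI = 1/3.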
18.
《Expert systems with applications》2014,41(6):2939-2946
The well-known clustering algorithm DBSCAN is founded on a density notion of clustering. However, its use of a global density parameter, the ε-distance, makes DBSCAN unsuitable for datasets of varying density, and guessing a value for it is not straightforward. In this paper, we generalise the algorithm in two ways. First, the key input parameter ε-distance is determined adaptively, which makes DBSCAN independent of domain knowledge and satisfies the unsupervised notion of clustering. Second, deriving the ε-distance by checking the data distribution of each dimension makes the approach suitable for subspace clustering, which detects clusters enclosed in various subspaces of high-dimensional data. Experimental results illustrate that our approach can efficiently find clusters of varying sizes and shapes as well as varying densities.
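One common way to pick the ε-distance adaptively, shown as a sketch below, is to read it off the distribution of k-NN distances; here a high quantile of the 4-NN distance stands in for the knee of the sorted k-distance plot. This is a generic heuristic, not the paper's per-dimension derivation, and the quantile and toy data are my assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=2)

# Estimate eps from the data instead of asking the user: take a high
# quantile of each point's distance to its 4th non-self neighbor.
k = 4
d, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
eps = np.quantile(d[:, k], 0.90)  # d[:, 0] is the point itself

labels = DBSCAN(eps=eps, min_samples=k + 1).fit_predict(X)
n_clusters = len(set(labels) - {-1})  # -1 marks noise points
```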
20.
An effective method for K-Means clustering
Existing K-Means clustering algorithms operate directly on multidimensional datasets, so when the cardinality of the dataset and the number of clustering attributes are large, they become extremely inefficient. This paper therefore proposes an effective clustering method based on a regular grid structure (KMCRG). KMCRG performs K-Means clustering with grid cells as the objects of processing; in particular, it uses a cell-weighted iteration strategy to return the final K classes effectively. Experimental results show that KMCRG returns clustering results quickly without losing clustering accuracy.
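The cell-weighted idea can be sketched as follows: collapse each occupied grid cell to its centroid weighted by its point count, run weighted K-Means on the cells only, and let every point inherit its cell's label. The grid side length, toy data, and use of scikit-learn's `sample_weight` are my assumptions, not the paper's exact formulation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=4, random_state=3)

# Step 1: impose a regular grid and collapse each occupied cell to its
# centroid, weighted by how many points fall into it.
cell = np.floor(X / 0.5).astype(int)          # grid side length 0.5
keys, inv, counts = np.unique(cell, axis=0, return_inverse=True,
                              return_counts=True)
inv = inv.ravel()                             # guard inverse shape
centroids = np.zeros((len(keys), X.shape[1]))
np.add.at(centroids, inv, X)                  # sum points per cell
centroids /= counts[:, None]

# Step 2: count-weighted K-Means on the (far fewer) cell centroids.
km = KMeans(n_clusters=4, n_init=10, random_state=0)
km.fit(centroids, sample_weight=counts)

# Step 3: every point inherits the label of its grid cell.
labels = km.labels_[inv]
```

The iteration cost now scales with the number of occupied cells rather than the number of points, which is where the speedup comes from.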