Similar literature
 20 similar documents found (search time: 15 ms)
1.
Several algorithms for clustering data streams based on k-Means have been proposed in the literature. However, most of them assume that the number of clusters, k, is known a priori by the user and can be kept fixed throughout the data analysis process. Besides the difficulty of choosing k, data stream clustering poses several other challenges, such as handling non-stationary, unbounded data that arrive in an online fashion. In this paper, we propose a Fast Evolutionary Algorithm for Clustering data streams (FEAC-Stream) that estimates k automatically from data in an online fashion. FEAC-Stream uses the Page–Hinkley Test to detect possible degradation in the quality of the induced clusters, thereby triggering an evolutionary algorithm that re-estimates k accordingly. FEAC-Stream relies on the assumption that clusters of (partially unknown) data can provide useful information about the dynamics of the data stream. We illustrate the potential of FEAC-Stream in a set of experiments using both synthetic and real-world data streams, comparing it to four related algorithms, namely: CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk and StreamKM++-BkM. The obtained results show that FEAC-Stream provides good data partitions and that it can detect, and accordingly react to, data changes.
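
The change-detection step mentioned in this abstract can be illustrated with a minimal sketch of the Page-Hinkley test. This is not the FEAC-Stream implementation: the class name, the `delta` and `threshold` values, and the choice to monitor a score where larger means worse cluster quality are all assumptions made for illustration.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream.

    Hypothetical helper; delta and threshold are illustrative values,
    not parameters taken from the FEAC-Stream paper.
    """

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm threshold (often written lambda)
        self.n = 0                  # number of observations seen so far
        self.mean = 0.0             # running mean of the observations
        self.cum = 0.0              # cumulative deviation m_t
        self.cum_min = 0.0          # running minimum of m_t

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


if __name__ == "__main__":
    detector = PageHinkley()
    # Assumed stream: a quality-loss score that jumps from 0.0 to 1.0.
    stream = [0.0] * 50 + [1.0] * 50
    for t, value in enumerate(stream):
        if detector.update(value):
            print("change signalled at position", t)
            break
```

In a stream-clustering setting, a signalled change would be the point at which k is re-estimated; how the monitored score is defined is left open here.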

2.
R.  W.  T.  L.  G. Journal of Systems Architecture, 2003, 49(12-15): 521
In many applications in VLSI CAD, a given netlist has to be partitioned into smaller sub-designs that are easier to handle. In this paper we present a new recursive bi-partitioning algorithm that is especially applicable when a large number of final partitions, e.g., more than 1000, has to be computed. The algorithm consists of two steps. Based on recursive splits the problem is divided into several sub-problems, but with increasing recursion depth more run time is invested. In this way an initial solution is determined very quickly. The core of the method is a second step, where a very powerful greedy algorithm is applied to refine the partitions. Experimental results are given that compare the new approach to state-of-the-art tools. The experiments show that the new approach outperforms the standard techniques with respect to run time and quality. Furthermore, the memory usage is very low, lower than that of other methods by more than a factor of four.

3.
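To determine the optimal number of clusters in a dataset more effectively, a new algorithm is proposed. Borrowing the idea of hierarchical clustering, the algorithm generates all candidate partitions in a single pass, then selects the best partition according to a validity index, which yields the optimal number of clusters. Theoretical analysis and experimental results show that the algorithm performs well.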

4.
An important and yet unsolved problem in unsupervised data clustering is how to determine the number of clusters. The proposed slope statistic is a non-parametric and data-driven approach for estimating the number of clusters in a dataset. This technique uses the output of any clustering algorithm and identifies the maximum number of groups that breaks down the structure of the dataset. Intensive Monte Carlo simulation studies show that the slope statistic outperforms (for the considered examples) some popular methods that have been proposed in the literature. Applications to graph clustering and to the iris and breast cancer datasets are shown.

5.
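A new method for determining the optimal number of clusters for the K-means algorithm   Cited by: 8 (self-citations: 0, citations by others: 8)
The K-means clustering algorithm clusters a dataset given a fixed number of clusters k and randomly selected initial cluster centers. The number of clusters k usually cannot be determined in advance, and randomly selected initial centers tend to make the clustering result unstable. A new method for determining the optimal number of clusters for K-means is proposed. By setting the parameters of the AP algorithm, the number of clusters produced by AP is taken as the upper bound kmax of the search range for the number of clusters. The Silhouette index is chosen as the validity index, the initial cluster centers are set following the idea of the max-min distance algorithm, and the clustering quality is analyzed to determine the optimal number of clusters. Simulation experiments and analysis verify the feasibility of this scheme.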
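
The search described in this abstract can be sketched as follows, under stated assumptions. The upper bound `k_max` is passed in directly rather than obtained from the AP algorithm, the max-min initialization starts from the first point, and all function names are hypothetical. It is a plain-Python illustration of the idea, not the authors' implementation.

```python
import math


def maxmin_centers(points, k):
    """Max-min distance initialization: repeatedly add the point that is
    farthest from its nearest already-chosen center.
    Assumption: the first point is used as the first center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, centers, iters=20):
    """Plain Lloyd iterations; returns one cluster label per point."""
    labels = []
    for _ in range(iters):
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's center unchanged
        centers = updated
    return labels


def silhouette(points, labels):
    """Mean Silhouette index; singleton clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            continue
        a = sum(own) / len(own)                             # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())    # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, k_max):
    """Try k = 2..k_max and return the k with the highest Silhouette index."""
    scores = {
        k: silhouette(points, kmeans(points, maxmin_centers(points, k)))
        for k in range(2, k_max + 1)
    }
    return max(scores, key=scores.get)
```

Because the initial centers are chosen deterministically, repeated runs on the same data give the same result, which is the stability property the abstract is after.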

6.
The cumulative conformance count (CCC) control chart is often employed to monitor the fraction nonconforming of high-yield processes. The traditional CCC chart is used when the items from a process are inspected one at a time following the production order. In recent years, the CCC chart has been generalized to accommodate industrial practices in which items from a process are inspected sample by sample and not according to the production order. In order to increase the sensitivity of the generalized CCC (GCCC) chart to changes in fraction nonconforming, the variable sampling interval (VSI) scheme is used in this study. The output characteristic within each sample is assumed to be correlated. The statistical properties of the GCCC chart with the VSI scheme are derived using the Markov chain method. To evaluate the usefulness of the VSI feature, GCCC charts with VSI and fixed sampling interval (FSI) schemes are compared in terms of their statistical properties. The comparison shows that using the VSI scheme can improve the speed of the GCCC chart in detecting changes in fraction nonconforming. Finally, according to the comparison results, a design procedure is applied to an industrial example to validate its practicability.

7.
Selection of the number of clusters via the bootstrap method   Cited by: 1 (self-citations: 0, citations by others: 1)
Here the problem of selecting the number of clusters in cluster analysis is considered. Recently, the concept of clustering stability, which measures the robustness of any given clustering algorithm, has been utilized in Wang (2010) for selecting the number of clusters through cross validation. In this paper, an estimation scheme for clustering instability is developed based on the bootstrap, and then the number of clusters is selected so that the corresponding estimated clustering instability is minimized. The proposed selection criterion’s effectiveness is demonstrated on simulations and real examples.

8.
Clustering ensembles: models of consensus and weak partitions   Cited by: 4 (self-citations: 0, citations by others: 4)
Clustering ensembles have emerged as a powerful method for improving both the robustness and the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial, or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum-likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intraclass variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world data sets.

9.
Severe market competition has driven enterprises to produce a wider variety of products to meet consumers' needs. However, frequent changes in product specifications and growing product complexity make assembly sequence planning more and more complicated. As a result, assembly sequence planning for complex products has become a problem worthy of attention. In this study, a methodology for assembly sequence planning of complex components is presented, which consists of three phases: assembly-based modular design, assembly subsequence generation for each module, and assembly sequence merging. The nested partitions (NP) method is used to merge assembly subsequences. Assembly sequence merging makes full use of the subsequence information of the modules and simplifies assembly sequence planning for complex products. A desk lamp is used as an example to validate the feasibility of this research.

10.
A method of predicting the number of clusters using Rand's statistic   Cited by: 1 (self-citations: 0, citations by others: 1)
Distributional and asymptotic results on the moments of Rand's Ck statistic were derived by DuBien and Warde [1981. Some distributional results concerning a comparative statistic used in cluster analysis. ASA Proceedings of the Social Statistics Section, 309–313.]. Based on those results, a method to predict the number of clusters is suggested by applying various agglomerative clustering algorithms. In the procedure, methods using different indexes are examined and compared based on the concept of agreement (or disagreement) between clusterings generated by different clustering algorithms on the same set of data. Our method, which is general in practice, works better than the other methods and assigns statistical meaning to Ck values in determining the number of clusters from the comparison.
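
The notion of agreement between two clusterings that underlies Rand's statistic can be made concrete with a short sketch. It computes the fraction of point pairs on which two labelings agree (both place the pair together, or both place it apart). The function name is hypothetical, and the prediction procedure built on the distribution of Ck is not reproduced here.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both clusterings."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same length >= 2")
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # True counts as 1
        pairs += 1
    return agree / pairs
```

The value does not depend on how clusters are named, so `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0 even though the labels differ.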

11.
12.
13.
In this paper, we consider a multi-agent consensus problem with an active leader and variable interconnection topology. The state of the considered leader not only keeps changing but also may not be measured. To track such a leader, a neighbor-based local controller together with a neighbor-based state-estimation rule is given for each autonomous agent. Then we prove that, with the proposed control scheme, each agent can follow the leader if the (acceleration) input of the active leader is known, and the tracking error is estimated if the input of the leader is unknown.

14.
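To address the nonlinearity and strong coupling among control variables in the gasoline refining process, and the difficulty of measuring the octane number of product gasoline, an octane number prediction method based on adaptive variable weighting is proposed. First, a novel variable weighting module captures the correlations among variables to obtain variable weights; adaptive weighting raises the importance of the main variables and suppresses the influence of secondary ones. Then, because the gasoline desulfurization process affects the octane number, the weighted and activated variables are fed to the octane prediction module, and the model outputs predictions of both octane number and sulfur content. Finally, the model is validated on industrial data. The results show that, compared with a neural network prediction method without the variable weighting module, a random-forest-based neural network prediction method, and a prediction method based on a variable-weighted stacked autoencoder, the proposed method achieves higher prediction accuracy and can be used to optimize the operating conditions of the gasoline refining process.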

15.
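An algorithm for finding the optimal number of clusters using FCM   Cited by: 2 (self-citations: 0, citations by others: 2)
In the algorithm that uses FCM to find the optimal number of clusters, the cluster centers are re-initialized every time FCM is applied. Because FCM is sensitive to the initial cluster centers, this makes the algorithm very unstable. The algorithm is improved by introducing a merging function, so that the centers of the (c-1) clusters depend on the centers of the c clusters. Simulation experiments show that the new algorithm is stable and runs noticeably faster than the original.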

16.
This paper proposes a new method for estimating the true number of clusters and initial cluster centers in a dataset with many clusters. The observation points are assigned to the data space to observe the clusters through the distributions of the distances between the observation points and the objects in the dataset. A Gamma Mixture Model (GMM) is built from a distance distribution to partition the dataset into subsets, and a GMM tree is obtained by recursively partitioning the dataset. From the leaves of the GMM tree, a set of initial cluster centers is identified and the true number of clusters is estimated. This method is implemented in the new GMM-Tree algorithm. Two GMM forest algorithms are further proposed to ensemble multiple GMM trees to handle high dimensional data with many clusters. The GMM-P-Forest algorithm builds GMM trees in parallel, whereas the GMM-S-Forest algorithm uses a sequential process to build a GMM forest. Experiments were conducted on 32 synthetic datasets and 15 real datasets to evaluate the performance of the new algorithms. The results show that the proposed algorithms outperform the existing popular methods (Silhouette, Elbow and Gap Statistic) and the recent method I-nice in estimating the true number of clusters from high dimensional complex data.

17.
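The nested partitions method is a global optimization method proposed in recent years for solving large-scale optimization problems. The basic idea of the nested partitions method (NPM) is introduced and applied to the traveling salesman problem. The strategies for each operator of the method are analyzed and determined, and an improved nested partitions algorithm is proposed. The algorithm uses weighted sampling to obtain the initial most promising region, records the historical best solution of each region in a global array, and improves the quality of each region's solution with the 3-opt local search algorithm. Simulation results on some TSPLIB instances show that the improved nested partitions algorithm combined with 3-opt obtains high-quality solutions to the TSP.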

18.
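To remove the need in the traditional FLICM algorithm to specify the number of image clusters manually, split and merge operations that act on cluster centers are designed, exploiting the fact that the algorithm describes clusters by their centers, so as to achieve image segmentation with a variable number of classes. On this basis, an acceptance rate for the split and merge operations is defined, which not only keeps the algorithm from being trapped in local extrema and speeds up its convergence, but also helps the parameter thresholds adapt. The proposed algorithm and the traditional ISODATA algorithm are used to segment simulated images and gray-scale texture images; qualitative and quantitative analysis of the results verifies the effectiveness and generality of the proposed algorithm.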

19.
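The Slater voting rule is a tournament-based voting rule: it constructs acyclic tournaments, finds the one that differs least from the original tournament, and selects the winner from it. For the Slater voting algorithm, whose solution is NP-hard, a Picat method is proposed that optimizes solving the Slater problem based on sets of similar candidates. Compared with the non-optimized method, it shrinks the solution space of the Slater algorithm, effectively reduces the computation needed to find the Slater winner, and increases computation speed. Experimental results show that the optimized Picat method is faster than the non-optimized Picat method. When there are fewer than 20 candidates, the answer set programming (ASP) method for the Slater problem is better in computation speed and capability than the optimized Picat method, but when there are more than 30 candidates, the optimized Picat method (using a satisfiability solver) is better than the ASP method in computation speed and capability.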
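
To make the Slater rule itself concrete, here is a brute-force sketch in Python. It is neither the Picat method nor the ASP method from the abstract, and it uses no similar-candidate-set optimization: it simply enumerates every ranking, so it is only usable for a handful of candidates. The function name and the encoding of the tournament as a set of `(winner, loser)` pairs are assumptions for illustration.

```python
from itertools import permutations


def slater_winners(candidates, beats):
    """Return (winners, cost) under the Slater rule by exhaustive search.

    beats: set of (x, y) pairs meaning x beats y in the tournament.
    cost:  the minimum number of tournament edges that must be reversed
           to obtain an acyclic tournament (a strict ranking).
    """
    best_cost = None
    winners = set()
    for order in permutations(candidates):
        # Count pairs where the lower-ranked candidate beats the higher-ranked one.
        cost = sum(
            1
            for i, high in enumerate(order)
            for low in order[i + 1:]
            if (low, high) in beats
        )
        if best_cost is None or cost < best_cost:
            best_cost, winners = cost, {order[0]}
        elif cost == best_cost:
            winners.add(order[0])
    return winners, best_cost
```

For example, with a three-way cycle a > b > c > a in which all three also beat d, any ranking must reverse one edge of the cycle, so a, b and c all tie as Slater winners. The factorial number of rankings is what makes reducing the solution space, as the abstract describes, worthwhile.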

20.
Clustering is actively studied in such fields as statistics, pattern recognition, machine learning, and others. A new randomized algorithm for finding the number of clusters in a set of data is proposed and justified; its efficiency is demonstrated by examples of simulation modeling on synthetic data with thousands of clusters.
