Similar literature
20 similar documents found.
1.
Traditional clustering algorithms (e.g., the K-means algorithm and its variants) work only for a fixed number of clusters. However, in many clustering applications, the actual number of clusters is not known beforehand. The general solution to this type of clustering problem is to select or define a cluster validity index and run a traditional clustering algorithm for every possible number of clusters in sequence to find the clustering with the best cluster validity. This is tedious and time-consuming work. To determine the optimal number of clusters easily and effectively and, at the same time, construct clusters with good validity, we propose a framework of automatic clustering algorithms (called ETSAs) that does not require users to supply each possible value of the required parameters (including the number of clusters). ETSAs treat the number of clusters as a variable and evolve it toward an optimal number. In experiments on nine test data sets, we compared the ETSA with five traditional clustering algorithms. The results demonstrate the superiority of the ETSA in finding the correct number of clusters while constructing clusters with good validity.
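The exhaustive baseline that this abstract calls tedious can be sketched as follows. This is a minimal illustration, not the ETSA framework: it assumes plain Lloyd's K-means, a deterministic farthest-point initialization, and the mean silhouette as the validity index, none of which are specified in the abstract. All function names are hypothetical.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start at the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's K-means; returns (centers, labels)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels


def silhouette(points, labels):
    """Mean silhouette width; higher means tighter, better separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, k_values):
    """Run K-means once per candidate k and keep the k with the best index."""
    scores = {k: silhouette(points, kmeans(points, k)[1]) for k in k_values}
    return max(scores, key=scores.get), scores
```

Every candidate value of k costs a full clustering run, which is the overhead that treating the number of clusters as an evolving variable is meant to avoid.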

2.
Many existing clustering algorithms have been used to identify coexpressed genes in gene expression data. These algorithms mainly partition the data, in the sense that each gene is allowed to belong to only one cluster. Since proteins typically interact with different groups of proteins in order to serve different biological roles, the genes that produce these proteins are expected to coexpress with more than one group of genes. In other words, some genes are expected to belong to more than one cluster. This poses a challenge to gene expression data clustering, as overlapping clusters need to be discovered in a noisy environment. For this task, we propose in this paper an effective information-theoretic approach, which consists of an initial clustering phase and a second reclustering phase. The proposed approach has been tested with both simulated and real expression data. Experimental results show that it can improve the performance of existing clustering algorithms and can effectively uncover interesting patterns in noisy gene expression data so that, based on these patterns, overlapping clusters can be discovered.

3.
The rapid advancement of DNA microarray technology has revolutionized genetic research in bioscience. Because of the enormous amount of gene expression data generated by this technology, computer processing and analysis of such data have become indispensable. In this paper, we present a computational framework for the extraction, analysis and visualization of gene expression data from microarray experiments. A novel, fully automated spot segmentation algorithm for DNA microarray images, which makes use of adaptive thresholding, morphological processing and statistical intensity modeling, is proposed to: (i) segment the blocks of spots, (ii) generate the grid structure, and (iii) segment the spot within each subregion. For data analysis, we propose a binary hierarchical clustering (BHC) framework for the clustering of gene expression data. The BHC algorithm involves two major steps. First, the fuzzy C-means algorithm and the average linkage hierarchical clustering algorithm are used to split the data into two classes. Second, Fisher linear discriminant analysis is applied to the two classes to assess whether the split is acceptable. The BHC algorithm is applied to the sub-classes recursively and ends when no cluster can be split any further. BHC does not require the number of clusters to be known in advance, and it makes no assumption about the number of samples in each cluster or the class distribution. The hierarchical framework naturally leads to a tree-structure representation for effective visualization of gene expression.

4.
The estimation of the number of clusters (NC) is one of the crucial problems in the cluster analysis of gene expression data. Most available approaches give their answers without intuitive information about the degree of separation between clusters. However, this information is useful for understanding cluster structures. To provide it, we propose the system evolution (SE) method, which estimates NC based on the partitioning around medoids (PAM) clustering algorithm. SE analyzes the cluster structures of a dataset from the viewpoint of a pseudothermodynamic system. Through its partitioning and merging processes, the system moves to its stable equilibrium state, at which the optimal NC is found. The experimental results on simulated and real gene expression data demonstrate that SE works well both on data with well-separated clusters and on data with slightly overlapping clusters.

5.
An evolutionary approach for gene expression patterns (cited 1 time: 0 self-citations, 1 by others)
This study presents an evolutionary algorithm, called the heterogeneous selection genetic algorithm (HeSGA), for analyzing patterns of gene expression in microarray data. Microarray technologies have provided the means to monitor the expression levels of a large number of genes simultaneously. Gene clustering and gene ordering are important in analyzing a large body of microarray expression data. The proposed method solves the gene clustering and gene ordering problems simultaneously by integrating global and local search mechanisms. Clustering and ordering information is used to identify functionally related genes and to infer genetic networks from immense microarray expression data. HeSGA was tested on eight microarray datasets, ranging in size from 147 to 6221 genes. The experimental clustering and visual results indicate that HeSGA not only ordered genes smoothly but also grouped genes with similar expression. Visualized results and a new scoring function that references predefined functional categories were used to confirm the biological interpretations of the results produced by HeSGA and other methods. These results indicate that HeSGA has potential for analyzing gene expression patterns.

6.
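Lu Jing (卢晶), Duan Yong (段勇), Liu Haibo (刘海波). Acta Electronica Sinica (《电子学报》), 2018, 46(3): 730-738
The density peak clustering algorithm has attracted wide attention because it can discover clusters of arbitrary shape and does not require the number of clusters to be specified. However, the algorithm must compute the density of every point in the data set and the distance between every pair of points, so it is not suited to large-scale, high-dimensional data sets. To address this, this paper proposes DP-z, a distributed density peak clustering algorithm based on z-values. The method uses a space-filling z-curve to map the high-dimensional data set onto a one-dimensional space and groups the data according to the z-values of the data points. To obtain correct results, data must be exchanged between groups, after which the density and the repulsion value (斥群值) of each point are computed in parallel. DP-z applies a filtering strategy during the inter-group data exchange, which eliminates a large number of useless distance computations and much data-transfer overhead and effectively improves execution efficiency. Finally, DP-z is validated on a cloud computing platform; the experiments show that it effectively improves execution efficiency while producing the same clustering results as the original density peak clustering algorithm.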
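The mapping and grouping step described in this abstract can be illustrated with a Morton (z-order) code. This is a minimal sketch, assuming the coordinates have already been quantized to non-negative integers; the function names and the equal-size grouping rule are hypothetical, and the inter-group exchange and filtering strategy of DP-z are not reproduced.

```python
def z_value(coords, bits=8):
    """Interleave the low `bits` bits of each integer coordinate (Morton code)."""
    z = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            # Bit `bit` of dimension `dim` lands at position bit * ndims + dim.
            z |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return z


def group_by_z(points, n_groups, bits=8):
    """Sort points along the z-curve and cut the order into contiguous groups."""
    ordered = sorted(points, key=lambda p: z_value(p, bits))
    size = -(-len(ordered) // n_groups)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

Points that are close on the z-curve are often, though not always, close in the original space, which is why a correct distributed algorithm still needs to exchange data between neighboring groups.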

7.
8.
Circuit partitioning is a fundamental problem in very large-scale integration (VLSI) physical design automation. In this brief, we present a new connectivity-based clustering algorithm for VLSI circuit partitioning. The proposed clustering method focuses on capturing natural clusters in a circuit, i.e., groups of cells that are highly interconnected. The method can therefore reduce the size of large-scale partitioning problems without losing partitioning solution quality. The performance of the proposed clustering algorithm is evaluated on a standard set of partitioning benchmarks, the ISPD98 benchmark suite. The experimental results show that applying the proposed clustering algorithm further improves the best previously reported partitioning solutions from state-of-the-art partitioners.

9.
In this paper we propose a clustering-based routing protocol (the IGP-C protocol) that extends the lifetime of wireless sensor networks while optimizing other resources (memory and processor). First, a clustering algorithm and a load-balancing technique are used together in order to reap the benefits of both approaches. The proposed clustering algorithm with load balancing (the CALB algorithm) is a fully distributed algorithm that is performed by each sensor and requires communication only with its immediate neighbors. Second, an Improved Gossiping Protocol (IGP) is proposed to extend the CALB algorithm to data routing. The simulation results show that the IGP-C protocol performs better than the other protocols proposed in the literature. The IGP-C protocol gives a better distribution of the energy, memory and processing capabilities of cluster-heads and reduces both the number of clusters consisting of a single sensor and the number of iterations. This demonstrates the effectiveness of the cluster-head election process, which improves load balancing in the wireless sensor network in terms of cluster-head load and cluster size. Furthermore, the proposed routing strategy, built around the clustering algorithm, is effective because it reduces the data transmission delay and prolongs the network lifetime.

10.
Spectral methods are powerful tools for extracting the structure of data based on the eigenvectors of constructed affinity matrices. In this paper, we propose new measurement functions to evaluate how useful each eigenvector of the affinity matrix is for data clustering. In the proposed strategy, the elements of each eigenvector are clustered by the traditional fuzzy c-means algorithm, and informative eigenvectors are then selected by optimizing an objective function defined on three criteria. These criteria, namely the compactness of clusters, the distance between clusters, and the stability of the clustering, evaluate each eigenvector by considering the structure of the clusters formed on it. Finally, the method of Lagrange multipliers is used to minimize the proposed objective function and extract the most informative eigenvectors. To show the merits of our algorithm, we use data sets from the UCI Machine Learning Repository, COIL20, YALE-B and PicasaWeb as benchmarks. Our simulation results confirm the superior performance of the proposed strategy in spectral clustering compared with conventional clustering methods and recent eigenvector-selection-based algorithms.

11.
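Dynamic clustering algorithm based on artificial immune network (cited 14 times: 2 self-citations, 12 by others)
Zhong Jiang (钟将), Wu Zhongfu (吴中福), Wu Kaigui (吴开贵), Ou Ling (欧灵). Acta Electronica Sinica (《电子学报》), 2004, 32(8): 1268-1272
The two basic tasks of cluster analysis are to determine the number of clusters in a data set and the locations of those clusters. Most clustering methods address only the latter. To perform cluster analysis when the number of clusters is uncertain, this paper proposes DCBIG, a new dynamic clustering algorithm that combines an artificial immune network with a genetic algorithm. The new algorithm has two main stages: it first uses the artificial immune network algorithm to obtain feasible clustering solutions, and then uses the genetic algorithm to perform dynamic clustering based on those feasible solutions. The paper analyzes the conditions for, and the probability of, obtaining feasible clustering solutions. Simulation results show that, compared with existing methods, the new method has a higher convergence probability and converges faster.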

12.
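Yuan Hao (袁昊), Ma Jinwen (马尽文). Journal of Signal Processing (《信号处理》), 2023, 39(1): 176-190
In traditional cluster analysis, a correct or reasonable number of classes usually has to be selected for the given data; otherwise the algorithm cannot produce satisfactory results. Cluster analysis with the competitive learning (CL) algorithm faces the same problem. However, inferring and selecting the actual number of clusters (or of competitive units) in a general data set is very difficult. To solve this problem, the rival penalized competitive learning (RPCL) algorithm established an effective idea and method. By presetting a larger number of clusters and introducing a rival-penalizing mechanism into competitive learning, it automatically selects the correct cluster centers and their number, and drives the redundant cluster centers to infinity or to places far from the data. This distinctive idea and method opened a new path for cluster analysis. This paper analyzes in depth the theoretical development of the RPCL algorithm, including its origin and underlying idea, its theoretical basis, and its extensions and variants under different conditions, and summarizes its applications in various fields.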
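The rival-penalizing mechanism can be sketched as a single update step. This is a simplified illustration of RPCL as it is commonly described, not code from the surveyed paper: the winner moves toward the sample, the runner-up (the rival) is pushed slightly away, and a win-frequency weight discourages one center from winning every time. The function name and both learning rates are hypothetical.

```python
def rpcl_step(x, centers, wins, lr_win=0.05, lr_rival=0.002):
    """Apply one RPCL update for sample x; mutates centers and wins in place.

    centers: list of coordinate lists; wins: how often each center has won.
    Returns the indices (winner, rival).
    """
    total = sum(wins)
    # Squared distance weighted by relative winning frequency.
    cost = [wins[j] / total * sum((a - b) ** 2 for a, b in zip(x, c))
            for j, c in enumerate(centers)]
    order = sorted(range(len(centers)), key=cost.__getitem__)
    winner, rival = order[0], order[1]
    # Winner learns: move toward the sample.
    centers[winner] = [c + lr_win * (a - c) for a, c in zip(x, centers[winner])]
    # Rival is penalized: move away from the sample by a much smaller step.
    centers[rival] = [c - lr_rival * (a - c) for a, c in zip(x, centers[rival])]
    wins[winner] += 1
    return winner, rival
```

Repeating this step over many samples with more centers than true clusters is what gradually pushes the surplus centers away from the data; the penalty rate is usually kept much smaller than the learning rate so that genuine centers are not destabilized.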

13.
Generating high-quality gene clusters and identifying the biological mechanisms underlying them are important goals of gene expression cluster analysis. To obtain high-quality clustering results, most current approaches rely on choosing the best clustering algorithm, that is, one whose design biases and assumptions match the underlying distribution of the dataset. This approach has two problems: 1) the underlying data distribution of gene expression datasets is usually unknown, and 2) so many clustering algorithms are available that choosing the proper one is very challenging. To provide a textual summary of the gene clusters, the most explored approach is the extractive one, which essentially builds on techniques borrowed from information retrieval, where the objective is to provide terms for query expansion rather than a stand-alone summary of the entire document set. Another drawback is that clustering quality and cluster interpretation are treated as two isolated research problems and studied separately. In this paper, we design and develop a unified system, Gene Expression Miner, that addresses these issues in a principled and general manner by integrating cluster ensembles, text clustering and multidocument summarization, and that provides an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene clusters. In our text summarization module, given a gene cluster, our expectation-maximization-based algorithm automatically identifies subtopics and extracts the most probable terms for each topic. The top k topical terms extracted from each subtopic are then combined to form the biological explanation of each gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.

14.
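He Hong (何宏), Tan Yonghong (谭永红). Acta Electronica Sinica (《电子学报》), 2012, 40(2): 254-259
Determining the number of clusters has long been a difficult problem in cluster analysis. This paper therefore proposes a new clustering method based on a dynamic genetic algorithm. The method uses a maximum-attribute-range partitioning scheme to overcome the sensitivity of partitional clustering algorithms to initial values, and applies a two-stage dynamic selection and mutation strategy in which the selection probability and the mutation rate change with the consistency of the number of clusters in the population: it first searches different numbers of clusters in parallel and then obtains the optimal cluster centers. Clustering experiments on seven data sets show that the method achieves an automatic global search for the best partition of a data set, finding the best number of clusters and the best cluster centers at the same time.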

15.
To overcome two shortcomings of the traditional spectral clustering algorithm, namely that it cannot determine the number of clusters and that its computation is time-consuming, this paper studies and improves the algorithm. For complex community networks, a spectral clustering algorithm based on modularity optimization is chosen to find the number of communities. In addition, four types of user attribute information are integrated to construct a more reasonable user similarity model. The original non-parallel spectral clustering algorithm is also optimized so that the improved scheme is suitable for distributed computing, and several Hadoop optimization strategies are proposed for virtual community discovery in large-scale communities. Finally, the experimental results show that the efficiency of the parallelized spectral clustering algorithm is greatly improved and that it can be applied to virtual community discovery in large-scale social networks.

16.
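Fuzzy co-clustering algorithm for high-order heterogeneous data (cited 1 time: 0 self-citations, 1 by others)
To analyze more effectively the clustering of high-order heterogeneous data in regions where clusters overlap, a fuzzy co-clustering algorithm for high-order heterogeneous data (HFCC) is proposed. The algorithm minimizes the weighted distance between objects and cluster centers in each feature space. Iterative update formulas for the object memberships and the feature weights are derived, an iterative algorithm for the clustering process is designed, and the convergence of this iterative algorithm is proved theoretically. In addition, by generalizing the XB index, an index GXB suitable for evaluating the clustering quality of high-order heterogeneous data is proposed and used to determine the number of clusters. Experiments show that HFCC effectively detects the overlapping cluster structures hidden in the data and clearly outperforms five representative hard-partitioning algorithms, and that the GXB index effectively determines the number of clusters in high-order heterogeneous data.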
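For orientation, the XB index that this abstract generalizes is presumably the Xie-Beni index for fuzzy partitions; that reading is an assumption. The sketch below computes the classic index for a single feature space, not the proposed GXB, and the function names are hypothetical. Lower values indicate compact, well-separated clusters.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(points, centers, u, m=2.0):
    """Classic Xie-Beni index.

    u[i][j] is the membership of point j in cluster i, and m is the
    fuzzifier applied to the memberships. Requires at least two centers.
    """
    # Numerator: membership-weighted within-cluster scatter.
    compactness = sum(u[i][j] ** m * squared_distance(p, c)
                      for i, c in enumerate(centers)
                      for j, p in enumerate(points))
    # Denominator: number of points times the smallest center separation.
    separation = min(squared_distance(a, b)
                     for i, a in enumerate(centers)
                     for b in centers[i + 1:])
    return compactness / (len(points) * separation)
```

To choose the number of clusters with such an index, one would cluster the data for several candidate values and keep the value that minimizes it.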

17.
For many clustering algorithms, it is very important to determine an appropriate number of clusters; this is known as the cluster validity problem. In this paper, a new cluster validity index is proposed. It is based on a novel method that selects the margin point between two clusters so that inter-cluster similarity is measured more accurately, and it provides an improved scatter function for intra-cluster similarity. Simulation results show the effectiveness of the proposed index on the data sets under consideration, regardless of the choice of clustering algorithm.

18.
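Objective: To explore multi-scale association rules for gene time-series signals. Methods: The gene time-series signals are decomposed at multiple scales using wavelets, membership relations are established at each single scale, and the overall association network is then mined. Results: The new multi-scale association rules were applied to a set of gene-chip time-series signals from cerebellar tissue, and the results were close to the biological meaning. Conclusion: These multi-scale association rules are a new and effective method for building regulatory networks.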

19.
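Most traditional algorithms for clustering Web logs require the user to specify the number of clusters. A new adaptive clustering algorithm is proposed and applied to clustering user sessions in Web logs. The algorithm builds on the ideas of agglomerative and partitional clustering. It uses the dissimilarity between every two sessions in the initial data set as the distance measure, merges pairs of sessions whose distance is below a given threshold to produce initial clusters, and then dynamically merges the closest session clusters or sessions according to certain rules; the result is a natural clustering. Finally, the effectiveness of the algorithm is verified by comparing the intra-cluster and inter-cluster distances of the session clusters. The main advantage of this algorithm is that it produces clusters automatically, without requiring the user to specify the number of clusters in advance, and that it effectively identifies isolated points. Experiments show that it produces high-quality clusterings.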
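The first stage described above, merging sessions whose dissimilarity falls below a threshold, can be sketched as follows. This is only an illustration under assumptions: sessions are modeled as non-empty collections of page identifiers, Jaccard dissimilarity stands in for the unspecified dissimilarity measure, and the later rule-based merging stage is not reproduced. All names are hypothetical.

```python
def jaccard_dissimilarity(a, b):
    """1 minus the share of pages two non-empty sessions have in common."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


def initial_clusters(sessions, threshold):
    """Group session indices, linking any pair closer than `threshold`."""
    parent = list(range(len(sessions)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sessions)):
        for j in range(i + 1, len(sessions)):
            if jaccard_dissimilarity(sessions[i], sessions[j]) < threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(sessions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Sessions that end up alone in a group are candidates for the isolated points mentioned in the abstract; the number of groups follows from the threshold rather than from a user-supplied cluster count.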

20.
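This paper addresses the problem of blind classification of non-stationary signals, which arises in signal processing, industrial control, and other fields. The K-Means algorithm, widely used in clustering, and other center-based clustering algorithms share two drawbacks: the number of classes must be determined in advance, and random initialization of the centers makes performance unstable. The algorithm proposed here addresses both problems, improves stability, and achieves blind classification of non-stationary signals. The wavelet coefficients of the non-stationary signals are extracted as the sample space for clustering, the statistical deviation of the clustering results is analyzed to estimate the number of classes, and a harmonic-mean criterion is used for classification. Simulation results show that the proposed method clearly reduces the classification error rate compared with the traditional K-Means algorithm.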
