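Statistical Correlation and K-means Based Distinguishable Gene Subset Selection Algorithms
Cite this article: XIE Juan-Ying, GAO Hong-Chao. Statistical correlation and K-means based distinguishable gene subset selection algorithms [J]. Journal of Software, 2014, 25(9): 2050-2075.
Authors: XIE Juan-Ying  GAO Hong-Chao
Affiliation: School of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710062 (both authors)
Funding: National Natural Science Foundation of China (31372250); Fundamental Research Funds for the Central Universities (GK201102007); Shaanxi Province Key Science and Technology Program (2013K12-03-24)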
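Abstract: To address the difficulty of selecting an effective distinguishable gene subset from high-dimensional, small-sample cancer gene datasets, novel hybrid gene selection algorithms based on statistical correlation and K-means are proposed. The algorithms first use the Pearson correlation coefficient and the Wilcoxon rank-sum test to compute the correlation between each gene and the class label and, following the statistical correlation principle, select the genes most strongly correlated with the class label to form a pre-selected gene subset. K-means is then used to group highly correlated genes of the pre-selected subset into the same cluster, an SVM classification model is trained, and the weight of each gene is computed. From each cluster, either the gene with the largest weight or, using a roulette wheel strategy, the gene with the most votes is selected as the representative gene of that cluster; the representative genes of all clusters form the distinguishable gene subset. The algorithms are compared experimentally with Random, a gene selection algorithm that picks each cluster's representative gene at random; with SVM-RFE, the classic gene selection algorithm by Guyon; and with SVM-SFS, a gene selection algorithm that uses a sequential forward search strategy. Results averaged over 200 repeated experiments on several classic gene datasets show that the proposed hybrid gene selection algorithms select gene subsets with very good discriminative performance, and that classifiers built on these subsets achieve very good classification performance.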
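
The filter stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pearson`, `rank_sum_z`, `preselect`), the use of the absolute score for ranking, the normal approximation of the rank-sum statistic, and the toy data are all assumptions made for illustration. The 10% default for `keep_fraction` mirrors the "about 10 percent" figure in the English abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def rank_sum_z(xs, labels):
    """z statistic of the Wilcoxon rank-sum test (class 1 vs class 0).

    Uses average ranks for ties and the normal approximation without a
    tie correction of the variance (a simplification for this sketch).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n1 = sum(labels)
    n0 = len(labels) - n1
    w = sum(r for r, y in zip(ranks, labels) if y == 1)
    mean = n1 * (n1 + n0 + 1) / 2
    sd = math.sqrt(n1 * n0 * (n1 + n0 + 1) / 12)
    return (w - mean) / sd


def preselect(expr, labels, score, keep_fraction=0.1):
    """Return indices of the genes with the largest |score|.

    expr: one list of expression values (across samples) per gene.
    Ranking by absolute score is an assumption of this sketch.
    """
    scores = [abs(score(gene, labels)) for gene in expr]
    k = max(1, round(keep_fraction * len(expr)))
    ranked = sorted(range(len(expr)), key=lambda g: scores[g], reverse=True)
    return ranked[:k]


# Synthetic toy data: 20 genes, 10 samples (5 per class).
# Genes 3 and 7 separate the classes; the rest carry no class signal.
labels = [0] * 5 + [1] * 5
expr = [[((s + g) % 5) / 5 for s in range(10)] for g in range(20)]
expr[3] = [0.10, 0.20, 0.15, 0.05, 0.12, 0.90, 0.95, 1.00, 0.85, 0.92]
expr[7] = [1 - v for v in expr[3]]

print(sorted(preselect(expr, labels, pearson)))
print(sorted(preselect(expr, labels, rank_sum_z)))
```

Passing the scoring function as a parameter reflects that the abstract treats the Pearson and Wilcoxon criteria as two alternative filters. On this toy data both variants keep genes 3 and 7.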

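Keywords: distinguishable gene subset selection  Pearson correlation coefficient  Wilcoxon rank-sum test  K-means clustering  statistical correlation  Filter algorithms  Wrapper algorithms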
Received: 2014-04-08
Revised: 2014-05-14

Statistical Correlation and K-Means Based Distinguishable Gene Subset Selection Algorithms
XIE Juan-Ying and GAO Hong-Chao. Statistical Correlation and K-Means Based Distinguishable Gene Subset Selection Algorithms[J]. Journal of Software, 2014, 25(9): 2050-2075.
Authors: XIE Juan-Ying and GAO Hong-Chao
Abstract: Identifying the few distinguishable genes that separate cancer patients from healthy people is difficult when a dataset has only a small number of samples but tens of thousands of genes. To address this problem, this paper proposes novel hybrid gene selection algorithms based on statistical correlation and the K-means algorithm. The Pearson correlation coefficient and the Wilcoxon rank-sum test are each used to score the importance of every gene to the classification; the least important genes are filtered out, and about 10 percent of the genes, the most important ones, are kept as the pre-selected gene subset. Correlated genes in the pre-selected subset are then clustered with K-means, and the weight of each gene is calculated from the corresponding coefficient of a trained SVM classifier. The most important gene in each cluster, that is, the one with the largest weight or, when the roulette wheel strategy is used, the one with the most votes, is chosen as the cluster's representative, and the representatives together form the distinguishable gene subset. To verify the effectiveness of the proposed algorithms, a random strategy (named Random) is also used to select the representative genes from the clusters. The proposed algorithms are compared with Random, with the widely used SVM-RFE gene selection algorithm by Guyon, and with the previously studied SVM-SFS gene selection algorithm. Results averaged over 200 runs of these algorithms on several classic, widely used gene expression datasets show that the proposed algorithms find gene subsets with strong discriminative power, and that classifiers built on the selected subsets achieve very high classification accuracy.
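
The step that picks one representative gene per cluster can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the cluster assignment and the gene weights are taken as given inputs (in the paper they come from K-means and a trained SVM), the weights are assumed non-negative (for example, squared coefficients of a linear SVM, the ranking criterion used by SVM-RFE), and the roulette wheel variant is interpreted here as repeated weight-proportional draws followed by a majority vote. All function names and the toy values are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_cluster(cluster_of):
    """Map cluster id -> list of gene indices, from a per-gene label list."""
    groups = defaultdict(list)
    for gene, c in enumerate(cluster_of):
        groups[c].append(gene)
    return groups


def pick_max_weight(cluster_of, weights):
    """Representative = gene with the largest weight in each cluster."""
    groups = group_by_cluster(cluster_of)
    return sorted(max(genes, key=lambda g: weights[g])
                  for genes in groups.values())


def pick_roulette(cluster_of, weights, spins=1000, seed=0):
    """Representative = gene that wins the most weight-proportional draws.

    Assumed interpretation of the roulette wheel strategy: each spin draws
    one gene of the cluster with probability proportional to its weight,
    and the gene with the most votes becomes the representative.
    """
    rng = random.Random(seed)
    reps = []
    for genes in group_by_cluster(cluster_of).values():
        draws = rng.choices(genes, weights=[weights[g] for g in genes], k=spins)
        reps.append(Counter(draws).most_common(1)[0][0])
    return sorted(reps)


def pick_random(cluster_of, seed=0):
    """Baseline in the spirit of Random: one uniformly drawn gene per cluster."""
    rng = random.Random(seed)
    return sorted(rng.choice(genes)
                  for genes in group_by_cluster(cluster_of).values())


# Hypothetical toy input: 8 pre-selected genes in 3 clusters.
cluster_of = [0, 0, 0, 1, 1, 2, 2, 2]
weights = [0.02, 0.90, 0.01, 0.05, 0.80, 0.03, 0.02, 0.95]

print(pick_max_weight(cluster_of, weights))
print(pick_roulette(cluster_of, weights))
print(pick_random(cluster_of))
```

With weights this lopsided, the maximum-weight and roulette variants agree on genes 1, 4, and 7. The two can differ when several genes in a cluster have similar weights, since the roulette vote is then decided by chance.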
Keywords: distinguishable gene subset selection  Pearson correlation coefficient  Wilcoxon rank-sum test  K-means clustering  statistical correlation  Filter algorithms  Wrapper algorithms