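Random Forest Classification for Plant Resistance Gene Identification
Cite this article:Guo Yingjie, Liu Xiaoyan, Guo Maozu, Zou Quan. Random forest classification for plant resistance gene identification[J]. 计算机科学与探索, 2012, 6(1): 67-77.
Authors:Guo Yingjie (郭颖婕)  Liu Xiaoyan (刘晓燕)  Guo Maozu (郭茂祖)  Zou Quan (邹权)
Affiliation:1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
2. School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005
Funding:National Natural Science Foundation of China; Fundamental Research Funds for the Central Universities; Specialized Research Fund for the Doctoral Program of Higher Education
Abstract:Traditional resistance gene identification methods based on homologous sequence alignment suffer from a high false positive rate and cannot discover new resistance genes. To address these problems, this paper proposes a resistance gene identification algorithm that uses a random forest classifier together with K-Means clustering-based under-sampling. Because mining in current research is largely undirected, two improvements are made. First, a random forest classifier and 188-dimensional combined features are introduced for resistance gene identification; this approach, based on statistical learning from samples, can effectively capture the intrinsic characteristics of resistance genes. Second, to handle the severe class imbalance present during training, a clustering-based under-sampling method is used to obtain a more representative training set, which further reduces identification error. Experimental results show that the algorithm identifies resistance genes effectively, classifies existing experimentally verified data accurately, and also achieves high accuracy on the negative sample set.

Keywords:random forest  classifier  resistance gene  clustering  under-sampling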

Identification of Plant Resistance Gene with Random Forest
GUO Yingjie, LIU Xiaoyan, GUO Maozu, ZOU Quan. Identification of Plant Resistance Gene with Random Forest[J]. Journal of Frontier of Computer Science and Technology, 2012, 6(1): 67-77.
Authors:GUO Yingjie  LIU Xiaoyan  GUO Maozu  ZOU Quan
Affiliation:1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China 2. School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
Abstract:Traditional approaches based on homologous sequence alignment usually have a high false positive rate and cannot discover new resistance genes. This paper presents a resistance gene identification approach that combines a random forest classifier with K-Means under-sampling. To address the largely undirected nature of current gene-mining research, it makes two contributions. First, it introduces a random forest classifier and 188-dimensional combined features to identify resistance genes; this sample-based statistical learning approach can effectively capture the intrinsic characteristics of resistance genes. Second, to handle the severe class imbalance in the training data, it uses clustering-based under-sampling to select a more representative training subset, which further reduces identification error. The experimental results indicate that the approach identifies resistance genes effectively: it accurately classifies existing experimentally verified data and also achieves high accuracy on the negative sample set.
Keywords:random forest  classifier  resistance gene  cluster  under-sampling
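
The clustering-based under-sampling step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how representatives are drawn from the clusters, so the version below makes an assumption: the negative (majority) samples are grouped into as many K-Means clusters as there are positive samples, and the sample nearest each centroid is kept. The function names, the toy 2-D features (standing in for the 188-dimensional vectors) and all parameters are hypothetical, not taken from the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        assignment = [nearest_centroid(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Re-assign once more so the labels match the final centroids.
    assignment = [nearest_centroid(p, centroids) for p in points]
    return centroids, assignment


def undersample_majority(majority, k, seed=0):
    """Keep at most k majority samples: the one nearest each cluster centroid.

    Assumption (not stated in the abstract): one representative per cluster.
    """
    k = min(k, len(majority))
    centroids, assignment = kmeans(majority, k, seed=seed)
    kept = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            best = min(members,
                       key=lambda i: squared_distance(majority[i], centroid))
            kept.append(majority[best])
    return kept


# Toy data: 3 positives against 12 negatives.
positives = [[2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
negatives = [[x + dx, y + dy]
             for x, y in [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
             for dx in (0.0, 0.1) for dy in (0.0, 0.1)]

kept_negatives = undersample_majority(negatives, k=len(positives))
features = positives + kept_negatives
labels = [1] * len(positives) + [0] * len(kept_negatives)
print(len(positives), "positives,", len(kept_negatives), "negatives kept")
```

The balanced `features` and `labels` could then be passed to any random forest implementation; the choice of one representative per cluster is only one of several reasonable ways to draw a subset from the clusters.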
This article is indexed in CNKI, 万方数据 (Wanfang Data) and other databases.
