Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
Gene expression data are expected to be of significant help in the development of efficient cancer diagnosis and classification platforms. One problem arising from these data is how to select a small subset of genes from thousands of genes measured on only a few, inherently noisy samples. This research aims to select a small subset of informative genes from the gene expression data that maximizes classification accuracy. A model for gene selection and classification has been developed using a filter approach together with an improved hybrid of a genetic algorithm and a support vector machine classifier. We show that the proposed model achieves useful classification accuracy on one widely used gene expression benchmark data set.

2.
Cancer classification through gene expression data analysis has produced remarkable results and has indicated that gene expression assays could significantly aid the development of efficient cancer diagnosis and classification platforms. However, cancer classification based on DNA array data remains a difficult problem. The main challenge is the overwhelming number of genes relative to the number of training samples, which implies that a large number of irrelevant genes must be dealt with. Another challenge is the noise inherent in the data set, which makes accurate classification more difficult when the sample size is small. We apply genetic algorithms (GAs) with an initial solution provided by t statistics, called t-GA, to select a group of relevant genes from cancer microarray data. A decision-tree-based cancer classifier is then built on these selected genes. The performance of this approach is evaluated by comparing it with other gene selection methods on publicly available gene expression data sets. Experimental results indicate that t-GA has the best performance among the gene selection methods compared. The Z-score figure also shows that some genes are consistently and preferentially chosen by t-GA in each data set.
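Note: the t-statistic seeding step described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes Welch's form of the two-sample t statistic and two classes labeled 0 and 1, and the names `welch_t` and `rank_genes_by_t` and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes_by_t(expr, labels, top_k):
    """Return the indices of the top_k genes ordered by decreasing |t|.

    expr   -- list of samples, each a list of per-gene expression values
    labels -- class label (0 or 1) for each sample
    """
    scores = []
    for g in range(len(expr[0])):
        a = [row[g] for row, y in zip(expr, labels) if y == 0]
        b = [row[g] for row, y in zip(expr, labels) if y == 1]
        scores.append((abs(welch_t(a, b)), g))
    # Largest |t| first; ties are broken by gene index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:top_k]]


# Toy data (hypothetical): 6 samples x 4 genes; only gene 2 separates the classes.
expr = [
    [5.0, 1.0, 0.1, 3.0],
    [5.2, 1.3, 0.2, 2.8],
    [4.9, 0.9, 0.0, 3.1],
    [5.1, 1.1, 2.0, 3.0],
    [5.0, 1.2, 2.1, 2.9],
    [5.1, 1.0, 1.9, 3.2],
]
labels = [0, 0, 0, 1, 1, 1]

# Genes that could seed the initial GA population.
seed_genes = rank_genes_by_t(expr, labels, top_k=2)
```

In a t-GA-style pipeline, the returned indices would presumably be used to build the initial chromosome(s) before the genetic search starts; how the abstract's method does this exactly is not stated.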

3.
Microarray gene expression data now play a vital role in tumor classification. However, because only a limited number of tissue samples are available compared with the large number of genes in genomic data, various existing methods have failed to identify a small subset of discriminative genes. To overcome this limitation, we developed a new hybrid technique for gene selection, called the ensemble multipopulation adaptive genetic algorithm (EMPAGA), that can disregard irrelevant genes and classify cancer accurately. The proposed hybrid gene selection algorithm comprises two phases. In the first phase, an ensemble gene selection (EGS) method is used to filter noisy and redundant genes in high-dimensional datasets by combining multilayer and F-score approaches. Then, an adaptive genetic algorithm based on a multipopulation strategy, with support vector machine and naïve Bayes (NB) classifiers as the fitness function, is applied to select the most relevant genes from the reduced datasets. The performance of the proposed method is evaluated on 10 microarray datasets covering various tumors. The results and comparisons show that EGS has a marked impact on the efficacy of the adaptive genetic algorithm with multipopulation strategy and enhances the proposed approach in terms of convergence rate and solution quality. The experimental results demonstrate the superiority of the proposed method over other standard wrappers with respect to classification accuracy and the number of genes selected.

4.
Gene expression technology, namely microarrays, offers the ability to measure the expression levels of thousands of genes simultaneously in biological organisms. Microarray data are expected to be of significant help in the development of an efficient cancer diagnosis and classification platform. A major problem in these data is that the number of genes greatly exceeds the number of tissue samples. These data also contain noisy genes. Literature reviews have shown that selecting a small subset of informative genes can lead to improved classification accuracy. Therefore, this paper aims to select a small subset of informative genes that are most relevant for cancer classification. To achieve this aim, an approach using two hybrid methods has been proposed. This approach is assessed and evaluated on two well-known microarray data sets, showing competitive results. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008.

5.
In order to select a small subset of informative genes from gene expression data for cancer classification, many researchers have recently analyzed gene expression data using various computational intelligence methods. However, due to the small number of samples compared with the huge number of genes (high-dimension), irrelevant genes, and noisy genes, many of the computational methods face difficulties in selecting such a small subset. Therefore, we propose an enhancement of binary particle swarm optimization to select the small subset of informative genes that is relevant for classifying cancer samples more accurately. In this method, three approaches have been introduced to increase the probability of the bits in a particle’s position being zero. By performing experiments on two gene expression data sets, we have found that the performance of the proposed method is superior to previous related works, including the conventional version of binary particle swarm optimization (BPSO), in terms of classification accuracy and the number of selected genes. The proposed method also produces lower running times compared with BPSO.
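Note: the abstract above does not say what its three approaches are. The sketch below shows only the conventional BPSO bit update plus one illustrative way of raising the probability that a bit is zero. That extra device, the `zero_bias` parameter, is an assumption made here for illustration and is not claimed to be one of the paper's three approaches. The function names and default coefficients are likewise hypothetical.

```python
import math
import random


def sigmoid(v):
    """Logistic function mapping a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(x, v, pbest, gbest, rng,
                    w=0.7, c1=2.0, c2=2.0, v_max=4.0, zero_bias=0.0):
    """One BPSO step for a single particle.

    x, pbest, gbest -- bit lists (1 = gene selected, 0 = gene dropped)
    v               -- per-bit velocities
    zero_bias       -- ASSUMPTION: shifts the sigmoid so that bits are more
                       likely to be 0; zero_bias=0.0 gives conventional BPSO
    """
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        # Standard velocity update pulled toward personal and global bests.
        vi = w * vi + c1 * rng.random() * (pi - xi) + c2 * rng.random() * (gi - xi)
        vi = max(-v_max, min(v_max, vi))  # clamp the velocity
        # The bit becomes 1 with probability sigmoid(vi - zero_bias).
        p_one = sigmoid(vi - zero_bias)
        new_x.append(1 if rng.random() < p_one else 0)
        new_v.append(vi)
    return new_x, new_v


rng = random.Random(0)
n_genes = 20
x = [rng.randint(0, 1) for _ in range(n_genes)]
v = [0.0] * n_genes
new_x, new_v = update_particle(x, v, pbest=x, gbest=x, rng=rng, zero_bias=2.0)
```

With `zero_bias` greater than zero, each bit's probability of being 1 is never higher than in the unbiased update, so particles tend to encode smaller gene subsets.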

6.
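Because gene expression data are high-dimensional, noisy, and small in sample size, gene selection has long been a major challenge in tumor classification. To improve tumor classification accuracy while keeping gene selection efficient, an adaptive particle swarm optimization (APSO) algorithm combining Relief-F and the CART decision tree (R-C-APSO) is proposed. The method first uses Relief-F to quickly filter out a large number of irrelevant genes and noise, narrowing the range of gene selection; it then runs APSO, with the CART decision tree as the fitness function, to perform the final search over genes. Experiments on six data sets show that R-C-APSO achieves high classification accuracy and fast gene selection, with good stability.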

7.
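A clustering analysis method for gene expression data based on representative entropy   Total citations: 1, self-citations: 0, citations by others: 1
Gene expression data have few samples and high dimensionality, and prior knowledge for sample typing is often lacking. To address this, a two-way clustering algorithm based on representative entropy is proposed, drawing on the advantages of self-organizing feature maps. The algorithm first clusters genes with a self-organizing feature map (SOM) network and selects feature genes according to a fluctuation coefficient. It then judges the quality of the gene clustering by the representative entropy and determines the number of neurons in the network. Finally, the FCM (Fuzzy C Means) clustering algorithm is applied to the selected feature gene set to type the samples. When applied to two public gene expression data sets, the algorithm reduced the feature dimensionality while achieving high clustering accuracy.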

8.
Gene selection can help the analysis of microarray gene expression data. However, it is very difficult to obtain a satisfactory classification result with machine learning techniques because of both the curse-of-dimensionality problem and the over-fitting problem: the number of features is too large and the number of samples too small. In this study, we designed an approach that attempts to avoid these two problems and then used it to select a small set of significant biomarker genes for diagnosis. Finally, we attempted to use these markers for the classification of cancer. The approach was tested on a number of microarray datasets in order to demonstrate that it performs well and is both useful and reliable.

9.
Hong-Qiang  Hau-San  De-Shuang  Jun 《Pattern recognition》2007,40(12):3379-3392
In this paper, we address the problem of extracting gene regulation information from microarray data for cancer classification. From the biological viewpoint, a model of gene regulation probability is established in which three types of gene regulation states in a tissue sample are assumed, and two regulation events correlated with the class distinction are then defined. Unlike previous approaches, the proposed algorithm uses gene regulation probabilities as carriers of regulation information to select genes and construct classifiers. The proposed approach is successfully applied to two publicly available microarray data sets, the leukemia data and the prostate data. Experimental results suggest that gene selection based on regulation information can greatly improve cancer classification, and that the classifier based on regulation information is more efficient and more stable than several previous classification algorithms.

10.
There exist several methods for binary classification of gene expression data sets. However, in the majority of published methods, little effort has been made to minimize classifier complexity. In view of the small number of samples available in most gene expression data sets, there is a strong motivation for minimizing the number of free parameters that must be fitted to the data. In this paper, a method is introduced for evolving (using an evolutionary algorithm) simple classifiers involving a minimal subset of the available genes. The classifiers obtained by this method perform well, reaching 97% correct classification of clinical outcome on training samples from the breast cancer data set published by van't Veer, and up to 89% correct classification on validation samples from the same data set, easily outperforming previously published results.

11.
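Because traditional gene selection methods pick many redundant genes and therefore yield low sample prediction accuracy, a gene selection algorithm based on clustering and particle swarm optimization is proposed. First, a clustering algorithm partitions the genes into a fixed number of clusters. Next, an extreme learning machine is used as the classifier to evaluate the classification performance of the feature genes in each cluster, yielding a candidate gene pool. Finally, a wrapper method based on particle swarm optimization and the extreme learning machine selects, from the candidate pool, the gene subset with the highest classification rate and the fewest genes. The selected genes show good classification performance. Experiments on two public microarray data sets show that, compared with several classical methods, the new method achieves higher classification performance with fewer genes.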

12.
A two-stage gene selection scheme utilizing MRMR filter and GA wrapper   Total citations: 1, self-citations: 0, citations by others: 1
Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminates biological samples of different types. In this paper, we propose a two-stage selection algorithm for genomic data that combines MRMR (Minimum Redundancy–Maximum Relevance) and GA (Genetic Algorithm). In the first stage, MRMR is used to filter noisy and redundant genes in high-dimensional microarray data. In the second stage, the GA uses the classifier accuracy as a fitness function to select the highly discriminating genes. The proposed method is tested for tumor classification on five open datasets (NCI, Lymphoma, Lung, Leukemia and Colon) using Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers. The comparison of MRMR-GA with the MRMR filter and the GA wrapper shows that our method is able to find the smallest gene subset that gives the highest classification accuracy in leave-one-out cross-validation (LOOCV).
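Note: the second-stage idea in the abstract above, a GA whose fitness is classifier accuracy under LOOCV, can be illustrated with a small fitness function. This is a sketch under stated assumptions: a 1-nearest-neighbour classifier stands in for the SVM and NB classifiers used in the paper, the `size_penalty` term is a hypothetical device for favoring small subsets, and the function names and toy data are invented for illustration.

```python
def loocv_accuracy(expr, labels, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that only
    sees the genes g with mask[g] == 1 (a GA chromosome)."""
    genes = [g for g, bit in enumerate(mask) if bit]
    if not genes:
        return 0.0  # an empty gene subset cannot classify anything
    correct = 0
    for i, row in enumerate(expr):
        best_dist, best_label = None, None
        for j, other in enumerate(expr):
            if j == i:
                continue  # hold out sample i
            dist = sum((row[g] - other[g]) ** 2 for g in genes)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(expr)


def fitness(expr, labels, mask, size_penalty=0.01):
    """ASSUMPTION: LOOCV accuracy minus a small penalty per selected gene."""
    return loocv_accuracy(expr, labels, mask) - size_penalty * sum(mask)


# Toy data (hypothetical): gene 0 separates the classes, gene 1 is misleading.
expr = [
    [0.0, 5.0],
    [0.2, 1.0],
    [0.1, 9.0],
    [2.0, 5.1],
    [2.1, 1.1],
    [1.9, 9.1],
]
labels = [0, 0, 0, 1, 1, 1]

informative_only = fitness(expr, labels, [1, 0])
both_genes = fitness(expr, labels, [1, 1])
```

A GA wrapper would evaluate `fitness` for every chromosome in each generation, restricted to the genes that survived the filter stage.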

13.
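A gene selection method based on tolerance relations   Total citations: 1, self-citations: 0, citations by others: 1
Jiao Na (焦娜)  Miao Duoqian (苗夺谦) 《计算机科学》 (Computer Science) 2010,37(10):217-220
Effective gene selection is an important part of gene expression data analysis. Rough set theory, as a soft computing method, can reduce attributes while preserving the classification ability of the data set. Because gene expression data are continuous, and in order to avoid the information loss caused by the discretization that rough set methods require, tolerance rough sets are applied to gene feature selection, and a gene selection method based on tolerance relations is proposed. First, the gene expression data are ranked by a t-test and the top-scoring genes are selected; then these genes are further reduced using tolerance rough sets. Experiments on two standard gene expression data sets show that the method is feasible and effective.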

14.
With the arrival of gene expression microarrays, a new challenge has opened up for the identification or classification of cancer tissues. Because a large number of genes provide valuable information simultaneously while very few tissue samples are available, cancer staging or classification becomes very tricky. In this paper we introduce a hierarchical Bayesian probit model for two-class cancer classification. Instead of assuming a linear structure for the function that relates the gene expressions to the cancer types, we assume only that the relationship is explained by an unknown function that belongs to an abstract functional space such as a reproducing kernel Hilbert space. Our formulation automatically reduces the dimension of the problem from the large number of covariates or genes to the small sample size. We incorporate a Bayesian gene selection scheme with the automatic dimension reduction to adaptively select important genes and classify cancer types under a unified model. Our model is highly flexible in explaining the relationship between the cancer types and gene expression measurements and in picking up the differentially expressed genes. The proposed model is successfully tested on three simulated data sets and three publicly available real-life data sets for leukemia, colon cancer, and prostate cancer.

15.
Microarray data classification is a task involving high dimensionality and small sample sizes. A common criterion for deciding on the number of selected genes is maximizing the accuracy, which risks overfitting and usually selects more genes than actually needed. We propose relaxing the maximum accuracy criterion and selecting the combination of attribute selection and classification algorithm that uses fewer attributes while achieving an accuracy that is not statistically significantly worse than the best. We also give some advice on choosing a suitable combination of attribute selection and classification algorithms for good accuracy when using a low number of gene expressions. We used some well-known attribute selection methods (FCBF, ReliefF and SVM-RFE, plus a random selection used as a baseline technique) and classification techniques (Naive Bayes, 3 Nearest Neighbor and SVM with linear kernel), applied to 30 data sets involving different cancer types.

16.
In a DNA microarray dataset, gene expression data often have a huge number of features (referred to as genes) versus a small number of samples. With the development of DNA microarray technology, the number of dimensions increases even faster than before, which can lead to the curse of dimensionality. To obtain good classification performance, it is necessary to preprocess the gene expression data. Support vector machine recursive feature elimination (SVM-RFE) is a classical method for gene selection. However, SVM-RFE suffers from high computational complexity. To remedy this, this paper enhances SVM-RFE for gene selection by incorporating feature clustering, yielding feature clustering SVM-RFE (FCSVM-RFE). The proposed method first performs a rough gene selection and then ranks the selected genes. First, a clustering algorithm is used to cluster genes into gene groups, in each of which the genes have similar expression profiles. Then, a representative gene is found for each gene group, giving a representative gene set. Finally, SVM-RFE is applied to rank these representative genes. FCSVM-RFE can reduce both the computational complexity and the redundancy among genes. Experiments on seven public gene expression datasets show that FCSVM-RFE can achieve better classification performance and lower computational complexity than state-of-the-art methods such as SVM-RFE.

17.
The monitoring of the expression profiles of thousands of genes has proved to be particularly promising for biological classification. DNA microarray data have recently been used for the development of classification rules, particularly for cancer diagnosis. However, microarray data present major challenges due to their complex, multiclass nature and the overwhelming number of variables characterizing gene expression profiles. A regularized form of sliced inverse regression (REGSIR) is proposed. It allows the simultaneous development of classification rules and the selection of those genes that are most important in terms of classification accuracy. The method is illustrated on some publicly available microarray data sets. Furthermore, an extensive comparison with other classification methods is reported. The REGSIR performance is comparable with the best classification methods available, and when appropriate feature selection is made the performance can be considerably improved.

18.
Since most cancer treatments come with a certain degree of toxicity, it is essential to identify a cancer type correctly and then administer the relevant therapy. With the arrival of powerful tools such as gene expression microarrays, the basis of cancer classification is slowly changing from morphological properties to molecular signatures. Several recent studies have demonstrated a marked improvement in the prediction accuracy of tumor types based on gene expression microarray measurements over clinical markers. The main challenge in working with gene expression microarrays is that there is a huge number of genes to work with, of which only a small fraction is actually relevant for differentiating between types of cancer. A Bayesian nearest neighbor model equipped with an integrated variable selection technique is proposed to overcome this challenge. This classification and gene selection model is able to classify different cancer types accurately and simultaneously identify the relevant or important genes. The proposed model is completely automatic in the sense that it adaptively picks the neighborhood size and the important covariates. The method is successfully applied to three simulated data sets and four well-known real data sets. To demonstrate the competitiveness of the method, a comparative study is also done with several other popular “off the shelf” classification methods. For all the simulated and real-life data sets, the proposed method produced highly competitive if not better results. While the standard approach is two-step model building, with gene selection followed by tumor prediction, this adaptive gene selection technique selects the relevant genes and predicts the tumor class in a single step. The biological relevance of the selected genes is also discussed to validate the claim.

19.
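Feature selection for gene expression data based on genetic algorithms and clustering   Total citations: 1, self-citations: 0, citations by others: 1
Feature selection is one of the important problems in pattern recognition, data mining, and related fields. For high-dimensional data such as gene expression data, feature selection can improve the accuracy and efficiency of classification and clustering, and it can also identify information-rich feature subsets, for example important genes closely related to a disease. To address this problem, this paper proposes a new feature selection method for gene expression data: a genetic algorithm performs a randomized search over feature subsets, and each subset is evaluated with a clustering algorithm as the learning algorithm and the clustering error rate as the evaluation metric. Experimental results show that the algorithm can effectively find feature subsets with good separability, thereby reducing dimensionality and improving clustering and classification accuracy.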

20.
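Duan Xu (段旭) 《计算机工程与设计》 (Computer Engineering and Design) 2011,32(11):3836-3839
A microarray data set contains thousands of genes but relatively few samples, and only a small fraction of those genes contribute to tumor classification. The most important problem for tumor classification is therefore to identify and select the genes that contribute most. To select microarray genes effectively, a marginal distribution model (MDM) is proposed to describe microarray data. The model can distinguish not only whether a gene is differentially expressed between the two sample classes but also in which class the gene is expressed, so the selected genes are more biologically meaningful. Experiments on simulated data and real microarray data sets show that the method selects microarray genes effectively.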
