Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
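An important application of gene expression data is the classification of tissue samples. In gene expression data, the number of genes is usually large relative to the number of samples; that is, the data matrix has far more variables (genes) than samples. Such high dimensionality (in variables, or genes) poses a major challenge for classification. This paper combines a new feature extraction method, uncorrelated linear discriminant analysis (ULDA), with a support vector machine (SVM) classifier to classify colon cancer tissue samples. A comparison with other methods demonstrates the feasibility and effectiveness of the approach.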

2.
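Gene sets selected from DNA microarray data by commonly used ranking methods often contain highly correlated genes, and evaluating genes one at a time does not truly reflect the classification ability of the resulting feature set. In addition, the number of genes far exceeds the number of samples, which is a further challenge for disease diagnosis. To address this, a feature extraction method for DNA microarray data is proposed for tissue classification. The method clusters the genes with K-means to obtain the center of each cluster of DNA microarray data, uses a ranking method to remove clusters that are irrelevant to classification, applies ICA to extract features from the remaining clusters, and builds a classifier with SVMs to classify tissues. Experiments on real biological data show that, by extracting composite genes, the method can evaluate the classification ability of genes jointly, reduce the number of features, and improve classification accuracy.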

3.
In a DNA microarray dataset, gene expression data often have a huge number of features (referred to as genes) but only a small number of samples. With the development of DNA microarray technology, the number of dimensions increases even faster than before, which can lead to the curse of dimensionality. To get good classification performance, it is necessary to preprocess the gene expression data. Support vector machine recursive feature elimination (SVM-RFE) is a classical method for gene selection. However, SVM-RFE suffers from high computational complexity. To remedy this, this paper enhances SVM-RFE for gene selection by incorporating feature clustering, yielding feature clustering SVM-RFE (FCSVM-RFE). The proposed method first performs a rough gene selection and then ranks the selected genes. First, a clustering algorithm is used to cluster genes into gene groups, in each of which the genes have similar expression profiles. Then, a representative gene is found for each gene group, which yields a representative gene set. Finally, SVM-RFE is applied to rank these representative genes. FCSVM-RFE can reduce the computational complexity and the redundancy among genes. Experiments on seven public gene expression datasets show that FCSVM-RFE achieves better classification performance and lower computational complexity than state-of-the-art methods such as SVM-RFE.
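
The first stage described in this abstract (group genes with similar expression profiles, then keep one representative per group) can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the abstract names neither the clustering algorithm nor the rule for choosing a representative, so the greedy correlation threshold, the medoid-style choice, and the function names `cluster_genes` and `representatives` are all assumptions. The subsequent SVM-RFE ranking step is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def cluster_genes(genes, threshold=0.9):
    """Assumed clustering rule: a gene joins the first group whose seed gene
    it correlates with at r >= threshold; otherwise it starts a new group."""
    groups = []  # lists of gene indices; groups[k][0] is the seed of group k
    for i, profile in enumerate(genes):
        for group in groups:
            if pearson(profile, genes[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def mean_corr_to_group(genes, i, group):
    """Mean correlation of gene i with the other members of its group."""
    others = [j for j in group if j != i]
    if not others:
        return 1.0
    return sum(pearson(genes[i], genes[j]) for j in others) / len(others)


def representatives(genes, groups):
    """Assumed representative rule: the member most correlated with the rest."""
    return [max(g, key=lambda i: mean_corr_to_group(genes, i, g)) for g in groups]


# Toy data: rows are genes, columns are samples.
genes = [[1, 2, 3, 4], [2, 4, 6, 8], [5, 1, 4, 2], [10, 2, 8, 4]]
groups = cluster_genes(genes)
reps = representatives(genes, groups)
```

Only the genes in `reps` would then be passed to the ranking step, which is where the reduction in computational cost comes from.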

4.
DNA microarray technology, a high-throughput technology, evaluates the expression of thousands of genes simultaneously under different experimental conditions. Analysis of gene expression data reveals that only a few important genes, not all, are responsible for disease. However, DNA microarray data sets usually contain multiple missing values, so selecting important genes from incomplete data may be erroneous and lead to misclassification in disease prediction. In this paper we propose an integrated framework that first imputes the missing values and then, to maximize accuracy in classifying patients, designs a classifier that selects genes using the complete microarray data set. Here, functionally similar genes are used to estimate the missing values, unlike existing similarity measures based on distances between gene expression values. However, functionally similar genes may differ in their protein production capacity, so the degree of similarity varies from gene to gene. This problem is addressed by a novel method that imputes missing values using the concept of fuzzy similarity. After imputation, the continuous gene expression matrix is discretized using fuzzy sets to distinguish the activation levels of different genes. The proposed fuzzy importance factor (FIf) of each gene represents its activation level or protein production capacity in both the disease and normal classes. The importance of each gene is evaluated while optimizing the number of rules in the fuzzy classifier depending on the FIf. The proposed methodology is demonstrated on nine cancer data sets and compared with state-of-the-art methods. Analysis of the experimental results reveals that the proposed framework is able to classify diseased and normal patients with improved accuracy.

5.
Cancer classification is the critical basis for patient-tailored therapy. Conventional histological analysis tends to be unreliable because different tumors may have a similar appearance. Advances in microarray technology make individualized therapy possible. Various machine learning methods can be employed to classify cancer tissue samples based on microarray data. However, few of them lend themselves to generating rules that are accurate and reliable as well as biologically interpretable. In this paper, we introduce an approach for classifying cancers based on the principle of minimal rough fringe. To train rough hypercuboid classifiers from gene expression data sets, the method dynamically evaluates all available genes and selects the genes with the smallest implicit regions as the dimensions of implicit hypercuboids. An unseen object is predicted to belong to a given class if it falls within the corresponding class hypercuboid. Based on this method, ensemble rough hypercuboid classifiers are subsequently constructed. Experimental results on some open cancer gene expression data sets show that the proposed method is capable of generating accurate and interpretable rules compared with some other machine learning methods. Hence, it is a feasible way of classifying cancer tissues in biomedical applications.

6.
Recently, microarray technology has been widely used to study gene expression in cancer diagnosis. Its main distinguishing feature is that it can measure thousands of genes at the same time. In the past, researchers typically used parametric statistical methods to find significant genes. However, microarray data often violate some of the assumptions of parametric statistical methods, or the type I error may be inflated. Therefore, our aim is to establish a gene selection method free of such assumptions to reduce the dimension of the data set. In our study, an adaptive genetic algorithm/k-nearest neighbor (AGA/KNN) method was used to evolve gene subsets. We find that AGA/KNN can reduce the dimension of the data set, and all test samples can be classified correctly. In addition, the accuracy of AGA/KNN is higher than that of GA/KNN, and it takes only half the CPU time of GA/KNN. With the proposed method, biologists can efficiently identify the relevant genes from the selected gene subset and classify the test samples correctly.
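
In GA/KNN-style gene selection, a genetic algorithm proposes candidate gene subsets and a k-nearest-neighbor classifier scores each one. The sketch below shows only a plausible scoring (fitness) function, assuming leave-one-out accuracy with squared Euclidean distance; the abstract does not specify the fitness definition, the value of k, or the adaptive GA operators, so those choices and the names `knn_predict` and `subset_fitness` are hypothetical. The evolutionary search itself is not shown.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def subset_fitness(samples, labels, gene_subset, k=3):
    """Leave-one-out k-NN accuracy using only the genes in `gene_subset`.

    samples[s][g] is the expression of gene g in sample s.
    """
    reduced = [[row[g] for g in gene_subset] for row in samples]
    correct = 0
    for i, query in enumerate(reduced):
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, query, k) == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy data: gene 0 separates the classes, gene 1 does not.
samples = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.9, 1.2], [1.0, 4.8], [1.1, 0.9]]
labels = ["N", "N", "N", "T", "T", "T"]
informative = subset_fitness(samples, labels, [0], k=1)
noisy = subset_fitness(samples, labels, [1], k=1)
```

A search procedure would favor subsets such as `[0]` here, since they score higher than subsets built from uninformative genes.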

7.
刘青  周鹏 《计算机工程》2005,31(3):189-191
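DNA microarray technology makes it possible to observe the expression levels of thousands of genes simultaneously, and the analysis of the resulting data has become a focus of bioinformatics research. Given that microarray gene expression data are high-dimensional, small-sample, and nonlinear, a classification and recognition method for gene expression data was designed and implemented. Experiments on the colon data set show improved generalization performance.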

8.
J. Li  X. Tang  J. Liu  J. Huang  Y. Wang 《Pattern recognition》2008,41(6):1975-1984
Various microarray experiments are now done in many laboratories, resulting in the rapid accumulation of microarray data in public repositories. One of the major challenges in analyzing microarray data is how to extract and select efficient features from it for accurate cancer classification. Here we introduce a new feature extraction and selection method based on information gene pairs that change significantly between different tissue samples. Experimental results on five public microarray data sets demonstrate that the feature subset selected by the proposed method performs well and achieves higher classification accuracy on several classifiers. We perform an extensive experimental comparison of the features selected by the proposed method and those selected by other methods, using different evaluation methods and classifiers. The results confirm that the proposed method performs as well as other methods on the acute lymphoblastic-acute myeloid leukemia, adenocarcinoma, and breast cancer data sets while using fewer information genes, and leads to a significant improvement in classification accuracy on the colon and diffuse large B cell lymphoma cancer data sets.

9.
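A decomposition-based gene selection algorithm   Total citations: 1, self-citations: 0, citations by others: 1
Gene expression data consist of thousands of genes and only a few dozen samples, and effective gene selection algorithms are an important topic in gene expression data research. Rough sets are an effective tool for removing redundant features. However, for gene expression data with thousands of features and a few dozen samples, existing rough-set-based feature selection algorithms become very inefficient. To address this, decomposition is applied to feature selection, and a decomposition-based feature selection algorithm is proposed. The algorithm decomposes a complex table into a master table and sub-tables that are simpler and easier to process, and then joins their results to solve the problem for the original table. Experimental results show that the algorithm markedly improves computational efficiency while maintaining classification accuracy.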

10.
谢娟英  丁丽娟  王明钊 《软件学报》2020,31(4):1009-1024
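Gene expression data are high-dimensional with small sample sizes and contain a large number of genes unrelated to disease, so the first step in analyzing such data is feature selection. Common feature selection methods require labeled data, but sample labels are often difficult to obtain. For the feature selection problem on gene expression data, an unsupervised feature selection idea based on spectral clustering, FSSC (feature selection by spectral clustering), is proposed. FSSC applies spectral clustering to all features so that highly similar features are grouped into one cluster, defines the discernibility and the independence of a feature, measures feature importance by the product of the two, and selects representative features from each feature cluster to construct a feature subset. Depending on the spectral clustering algorithm used, three unsupervised feature selection algorithms are obtained: FSSC-SD (FSSC based on standard deviation), FSSC-MD (FSSC based on mean distance), and FSSC-ST (FSSC based on self-tuning). Experiments were carried out on 10 gene expression data sets with SVMs (support vector machines) and KNN (K-nearest neighbours) as classifiers. The results show that FSSC-SD, FSSC-MD, and FSSC-ST can all select feature subsets with strong classification ability.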

11.
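A support vector machine-based method for analyzing microarray gene expression data   Total citations: 5, self-citations: 0, citations by others: 5
DNA microarray technology makes it possible to observe the expression levels of thousands of genes simultaneously, and the analysis of the resulting data has become a focus of bioinformatics research. Given that microarray gene expression data are high-dimensional, small-sample, and nonlinear, a support vector machine-based classification and recognition method for gene expression data was designed. The method uses the signal-to-noise ratio for gene feature extraction and tests performance with different SVM kernel functions. Experiments on several typical data sets show good recognition performance.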
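
Signal-to-noise (S2N) gene scoring, which this abstract uses for feature extraction, can be illustrated with a short sketch. The abstract does not state its formula, so this assumes the commonly used two-class definition, S2N = (mean_a - mean_b) / (std_a + std_b), computed per gene; the function names `signal_to_noise` and `rank_genes` are hypothetical, and the SVM stage is not shown.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Assumed S2N definition: difference of class means over sum of class
    sample standard deviations. Raises ZeroDivisionError if both are constant."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_genes(expr, labels, positive):
    """Return gene indices sorted by |S2N|, most discriminative first.

    expr[g][s] is the expression of gene g in sample s; samples whose label
    equals `positive` form one class, all others form the second class.
    """
    scores = []
    for g, row in enumerate(expr):
        a = [v for v, y in zip(row, labels) if y == positive]
        b = [v for v, y in zip(row, labels) if y != positive]
        scores.append((abs(signal_to_noise(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)]


# Toy data: gene 0 differs strongly between classes, gene 1 barely at all.
expr = [
    [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
    [3.0, 1.0, 5.0, 3.1, 0.9, 5.2],
]
labels = ["T", "T", "T", "N", "N", "N"]
ranking = rank_genes(expr, labels, positive="T")
```

The top-ranked genes would then serve as the input features for the classifier.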

12.
A Bayesian optimal screening method (BOSc) is proposed to classify an individual into one of two groups, based on the observation of pairs of covariates, namely the expression levels of pairs of genes (previously selected by a specific method from among the thousands of genes present in the microarray) measured using DNA microarray technology. The method is general and can be applied to any correlated pair of screening variables that either has a bivariate normal distribution or can be transformed into a bivariate normal. Results on microarray data sets (Leukemia, Prostate and Breast) show that BOSc performance is competitive with, and in some cases significantly better than, quadratic and linear discriminant analyses and support vector machine classifiers. BOSc provides flexible parametric decision rules. Finally, the screening classifier allows the calculation of operating characteristics while incorporating information about the prevalence of the disease or type of disease, which is an advantage over other classification methods.

13.
Since most cancer treatments come with a certain degree of toxicity, it is essential to identify the cancer type correctly and then administer the relevant therapy. With the arrival of powerful tools such as gene expression microarrays, the basis of cancer classification is slowly changing from morphological properties to molecular signatures. Several recent studies have demonstrated a marked improvement in the prediction accuracy of tumor types based on gene expression microarray measurements over clinical markers. The main challenge in working with gene expression microarrays is that there is a huge number of genes to work with, of which only a small fraction is actually relevant for differentiating between types of cancer. A Bayesian nearest neighbor model equipped with an integrated variable selection technique is proposed to overcome this challenge. This classification and gene selection model is able to classify different cancer types accurately and simultaneously identify the relevant or important genes. The proposed model is completely automatic in the sense that it adaptively picks the neighborhood size and the important covariates. The method is successfully applied to three simulated data sets and four well-known real data sets. To demonstrate the competitiveness of the method, a comparative study is also done with several other popular "off the shelf" classification methods. For all the simulated and real data sets, the proposed method produced highly competitive if not better results. While the standard approach builds the model in two steps, gene selection and then tumor prediction, this novel adaptive gene selection technique automatically selects the relevant genes along with the tumor class prediction in one go. The biological relevance of the selected genes is also discussed to validate the claim.

14.
The DNA microarray has been recognized as an important tool for studying the expression of thousands of genes simultaneously. These experiments allow us to compare two different samples of cDNA obtained under different conditions. A novel method for the analysis of replicated microarray experiments, based on modelling the gene expression distribution as a mixture of α-stable distributions, is presented. Some features of the distribution of gene expression, such as Pareto tails and the fact that the variance of any given array increases with the number of genes studied, suggest the possibility of modelling the gene expression distribution on the basis of the α-stable density. The proposed methodology uses well-known properties of the α-stable distribution, such as the scale mixture of normals. A Bayesian log-posterior odds is calculated, which allows us to decide whether or not a gene is differentially expressed. The proposed methodology is illustrated using simulated and experimental data, and the results are compared with other existing statistical approaches. The proposed heavy-tail model improves on the performance of other distributions and is easily applicable to microarray gene data, especially if the dataset contains outliers or presents high variance between replicates.

15.
VizCluster and its Application on Classifying Gene Expression Data   Total citations: 1, self-citations: 0, citations by others: 1
Visualization enables us to find structures, features, patterns, and relationships in a dataset by presenting the data in various graphical forms with possible interactions. A visualization can provide a qualitative overview of large and complex datasets, can summarize data, and can assist in identifying regions of interest and appropriate parameters for focused quantitative analysis. Recently, DNA microarray technology has provided a broad snapshot of the state of the cell by measuring the expression levels of thousands of genes simultaneously. Such information can thus be used to analyze different samples by their gene expression profiles. It has already had a significant impact on the field of bioinformatics, requiring innovative techniques to efficiently and effectively extract, analyze, and visualize these fast-growing data. In this paper, we present a dynamic interactive visualization environment, VizCluster, and its application to classifying gene expression data. VizCluster takes advantage of graphical visualization methods to reveal underlying data patterns. It combines the merits of both the high-dimensional projection scatter-plot and the parallel coordinate plot. At its core lies a nonlinear projection which maps n-dimensional vectors onto two-dimensional points. To preserve the information at different scales and yet reduce the typical problem of parallel coordinate plots becoming messy because of overlapping lines, a zip zooming viewing method is proposed. Integrated with other features, VizCluster is developed to give a simple, fast, intuitive, and yet powerful view of the data set. Its primary applications are the classification of samples and the evaluation of gene clusters for microarray datasets. Three gene expression datasets are used to illustrate the approach. We demonstrate that the VizCluster approach is promising for analyzing and visualizing microarray data sets and that further development is worthwhile.

16.
高娟  王国胤  胡峰 《计算机科学》2012,39(10):193-197
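Searching for tumor-related genes and discovering tumor gene expression signatures from an informatics perspective is of great biological significance for the diagnosis and treatment of tumors, and classifying tumor versus normal tissue is one important application. Based on multi-class tumor gene expression profiles, an automatic feature selection method is proposed. First, combining a nonparametric approach with the filter idea, gene weights are measured by the randomness of the decision sequence and the genes are ranked; then, correlation information entropy is used to eliminate redundancy, automatically selecting a subset of feature genes with high discriminative power and low redundancy. Experimental results show that the proposed method can automatically select 30 feature genes with good classification ability from multi-class tumor gene expression profile data and achieves a high correct recognition rate.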

17.
In order to select a small subset of informative genes from gene expression data for cancer classification, many researchers have recently analyzed gene expression data using various computational intelligence methods. However, because of the small number of samples compared with the huge number of genes (high dimensionality), and because of irrelevant and noisy genes, many of the computational methods face difficulties in selecting such a small subset. Therefore, we propose an enhancement of binary particle swarm optimization to select a small subset of informative genes that is relevant for classifying cancer samples more accurately. In this method, three approaches are introduced to increase the probability of the bits in a particle's position being zero. Experiments on two gene expression data sets show that the performance of the proposed method is superior to previous related works, including the conventional version of binary particle swarm optimization (BPSO), in terms of classification accuracy and the number of selected genes. The proposed method also produces lower running times compared with BPSO.
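
For context, one update step of conventional BPSO (the baseline this abstract compares against) can be sketched as below. Each bit of a particle's position marks a gene as selected (1) or not (0), and the sigmoid of the velocity gives the probability that the bit is set to 1. The abstract does not describe its three enhancements, so they are not reproduced here; the parameter values `w`, `c1`, `c2`, and `v_max` and the function name `bpso_step` are illustrative assumptions.

```python
import math
import random


def sigmoid(v):
    """Map a velocity to the probability that the corresponding bit becomes 1."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso_step(position, velocity, personal_best, global_best, rng,
              w=0.9, c1=2.0, c2=2.0, v_max=4.0):
    """One conventional BPSO update for a single particle.

    position, personal_best, global_best: lists of 0/1 bits (1 = gene selected).
    velocity: list of floats. rng: a random.Random instance.
    The default parameter values are illustrative, not taken from the abstract.
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # Inertia plus attraction toward the personal and global best bits.
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        v = max(-v_max, min(v_max, v))  # clamp so the probability never saturates
        new_velocity.append(v)
        new_position.append(1 if rng.random() < sigmoid(v) else 0)
    return new_position, new_velocity


rng = random.Random(0)
pos, vel = bpso_step([1, 0, 1, 0], [0.0] * 4, [1, 0, 0, 0], [1, 0, 0, 1], rng)
```

Lower velocities make a bit more likely to be zero, which is the lever any scheme for shrinking the selected gene subset would act on.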

18.
The investigation of genes using data analysis and computer-based methods has gained widespread attention for solving the human cancer classification problem. DNA microarray gene expression datasets are readily utilized for this purpose. In this paper, we propose a feature selection method that uses an improved regularized linear discriminant analysis technique to select the important genes that are crucial for human cancer classification. Experiments are conducted on several DNA microarray gene expression datasets, and promising results are obtained in comparison with several existing feature selection methods.

19.
A reliable and precise classification of tumors is essential for the successful treatment of cancer. Gene selection is an important step for improved diagnostics. In this study, a modified SFFS (sequential forward floating selection) algorithm based on weighted Mahalanobis distance, called MSWM, is proposed to identify optimal informative gene subsets, taking into account the joint discriminatory power of genes for accurate discrimination. First, we use the one-dimensional weighted Mahalanobis distance to perform a preliminary selection of genes, and then use the modified SFFS method and the multidimensional weighted Mahalanobis distance to obtain the optimal informative gene subset for tumor classification. Finally, we use the k nearest neighbor and naive Bayes methods to classify tumors based on the optimal gene subset selected by the MSWM method. To validate its efficiency, the proposed MSWM method is applied to classify two different DNA microarray datasets. Our empirical study shows that the MSWM method classifies tumors more effectively than the BWR (the ratio of between-groups to within-groups sum of squares) and IVGA_I (independent variable group analysis I) methods. This suggests that the MSWM gene selection method is able to obtain correct informative gene subsets by taking into account the genes' joint discriminatory power for tumor classification.

20.
Bio-chip data consisting of high-dimensional attributes have more attributes than specimens. Thus, it is difficult to obtain a covariance matrix for tens of thousands of genes from a small number of samples. Feature selection and extraction are critical for removing noisy features and reducing the dimensionality in microarray analysis. This study aims to fill the gap by developing a data mining framework with a proposed algorithm for cluster analysis of gene expression data, in which the correlation coefficient is employed to arrange genes. Indeed, cluster analysis of microarray data can find coherent patterns of gene expression. The output is displayed as a table for convenient review. We use a breast cancer microarray dataset to demonstrate the practical viability of this approach.
