首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We introduce a nonparametric test intended for large-scale simultaneous inference in situations where the utility of distribution-free tests is limited because of their discrete nature. Such situations are frequently dealt with in microarray analysis where the number of tests is much larger than the sample size. The proposed test statistic is based on a certain distance between the distributions from which the samples under study are drawn. In a simulation study, the proposed permutation test is compared with permutation counterparts of the t-test and the Kolmogorov–Smirnov test. The usefulness of the proposed test is discussed in the context of microarray gene expression data and illustrated with an application to real datasets.  相似文献   

2.
Self-organising maps (SOM) have become a commonly-used cluster analysis technique in data mining. However, SOM are not able to process incomplete data. To build more capability of data mining for SOM, this study proposes an SOM-based fuzzy map model for data mining with incomplete data sets. Using this model, incomplete data are translated into fuzzy data, and are used to generate fuzzy observations. These fuzzy observations, along with observations without missing values, are then used to train the SOM to generate fuzzy maps. Compared with the standard SOM approach, fuzzy maps generated by the proposed method can provide more information for knowledge discovery.  相似文献   

3.
A biclustering algorithm, based on a greedy technique and enriched with a local search strategy to escape poor local minima, is proposed. The algorithm starts with an initial random solution and searches for a locally optimal solution by successive transformations that improve a gain function. The gain function combines the mean squared residue, the row variance, and the size of the bicluster. Different strategies to escape local minima are introduced and compared. Experimental results on several microarray data sets show that the method is able to find significant biclusters, also from a biological point of view.  相似文献   

4.
Identification of relevant genes from microarray data is an apparent need in many applications. For such identification different ranking techniques with different evaluation criterion are used, which usually assign different ranks to the same gene. As a result, different techniques identify different gene subsets, which may not be the set of significant genes. To overcome such problems, in this study pipelining the ranking techniques is suggested. In each stage of pipeline, few of the lower ranked features are eliminated and at the end a relatively good subset of feature is preserved. However, the order in which the ranking techniques are used in the pipeline is important to ensure that the significant genes are preserved in the final subset. For this experimental study, twenty four unique pipeline models are generated out of four gene ranking strategies. These pipelines are tested with seven different microarray databases to find the suitable pipeline for such task. Further the gene subset obtained is tested with four classifiers and four performance metrics are evaluated. No single pipeline dominates other pipelines in performance; therefore a grading system is applied to the results of these pipelines to find out a consistent model. The finding of grading system that a pipeline model is significant is also established by Nemenyi post-hoc hypothetical test. Performance of this pipeline model is compared with four ranking techniques, though its performance is not superior always but majority of time it yields better results and can be suggested as a consistent model. However it requires more computational time in comparison to single ranking techniques.  相似文献   

5.
Isometric mapping (Isomap) is a popular nonlinear dimensionality reduction technique which has shown high potential in visualization and classification. However, it appears sensitive to noise or scarcity of observations. This inadequacy may hinder its application for the classification of microarray data, in which the expression levels of thousands of genes in a few normal and tumor sample tissues are measured. In this paper we propose a double-bounded tree-connected variant of Isomap, aimed at being more robust to noise and outliers when used for classification and also computationally more efficient. It differs from the original Isomap in the way the neighborhood graph is generated: in the first stage we apply a double-bounding rule that confines the search to at most k nearest neighbors contained within an ε-radius hypersphere; the resulting subgraphs are then joined by computing a minimum spanning tree among the connected components. We therefore achieve a connected graph without unnaturally inflating the values of k and ε. The computational experiences show that the new method performs significantly better in terms of accuracy with respect to Isomap, k-edge-connected Isomap and the direct application of support vector machines to data in the input space, consistently across seven microarray datasets considered in our tests.  相似文献   

6.
Searching for an effective dimension reduction space is an important problem in regression, especially for high-dimensional data such as microarray data. A major characteristic of microarray data consists in the small number of observations n and a very large number of genes p. This “large p, small n” paradigm makes the discriminant analysis for classification difficult. In order to offset this dimensionality problem a solution consists in reducing the dimension. Supervised classification is understood as a regression problem with a small number of observations and a large number of covariates. A new approach for dimension reduction is proposed. This is based on a semi-parametric approach which uses local likelihood estimates for single-index generalized linear models. The asymptotic properties of this procedure are considered and its asymptotic performances are illustrated by simulations. Applications of this method when applied to binary and multiclass classification of the three real data sets Colon, Leukemia and SRBCT are presented.  相似文献   

7.
Gene expression technology, namely microarrays, offers the ability to measure the expression levels of thousands of genes simultaneously in biological organisms. Microarray data are expected to be of significant help in the development of an efficient cancer diagnosis and classification platform. A major problem in these data is that the number of genes greatly exceeds the number of tissue samples. These data also have noisy genes. It has been shown in literature reviews that selecting a small subset of informative genes can lead to improved classification accuracy. Therefore, this paper aims to select a small subset of informative genes that are most relevant for cancer classification. To achieve this aim, an approach using two hybrid methods has been proposed. This approach is assessed and evaluated on two well-known microarray data sets, showing competitive results. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008  相似文献   

8.
本文提出一个通用的方法,通过对Windows下BMP位图文件的格式分析,在Visual C .Net开发环境下,提取分析测试中所需要的图像数据,使其转化为许多专业数据分析处理软件都可以兼容的文本数据文件,为后期的实验数据分析处理提供了极大的便利。实验证明,这种方法是方便、准确的,已应用于科研。  相似文献   

9.
Recently, many methods have been proposed for constructing gene regulatory networks (GRNs). However, most of the existing methods ignored the time delay regulatory relation in the GRN predictions. In this paper, we propose a hybrid method, termed GA/PSO with DTW, to construct GRNs from microarray datasets. The proposed method uses test of correlation coefficient and the dynamic time warping (DTW) algorithm to determine the existence of a time delay relation between two genes. In addition, it uses the particle swarm optimization (PSO) to find thresholds for discretizing the microarray dataset. Based on the discretized microarray dataset and the predicted types of regulatory relations among genes, the proposed method uses a genetic algorithm to generate a set of candidate GRNs from which the predicted GRN is constructed. Three real-life sub-networks of yeast are used to verify the performance of the proposed method. The experimental results show that the GA/PSO with DTW is better than the other existing methods in terms of predicting sensitivity and specificity.  相似文献   

10.
The classical sign test is proper for hypotheses about a specified success probability, when based on independent trials. For such a hypothesis we introduce a new exact test that is appropriate with clustered binary data. It combines a permutation approach and an exact parametric bootstrap calculation. Simulation studies show it to be superior to a sign test based on aggregated cluster level data. The new test is more powerful than or comparable to a standard permutation test whenever (1) the number of clusters is small or (2) for larger cluster numbers under strong clustering. The results from a chemical repellency trial are used to illustrate three legitimate test methods.  相似文献   

11.
In this paper we investigate applying SOM (Self-Organizing Maps) for classification and rule extraction in data sets with missing values, in particular from real clinical data of bladder cancer patients. For this experiment, we used real data of bladder cancer patients provided by Kitasato University Hospital. When using input data with missing values for SOM, the missing value is either interpolated in the preprocessing stage, or the missing value is replaced with a specific value or property that marks it as a missing value. In either case, there is a possibility some rules can be extracted from data with missing values. On the other hand, these data can have a negative influence for the classification for data sets for which missing values should be neglected. In this research we propose a method where SOM is trained using an input vector in which the properties for the missing values are excluded. The influence of information on the missing values can be reduced by using the proposed method. Through computer simulation, we showed that the proposed method gave good results in classification and rule extraction from clinical data of bladder cancer patients. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008  相似文献   

12.
Independent component analysis (ICA) has been widely used to tackle the microarray dataset classification problem, but there still exists an unsolved problem that the independent component (IC) sets may not be reproducible after different ICA transformations. Inspired by the idea of ensemble feature selection, we design an ICA based ensemble learning system to fully utilize the difference among different IC sets. In this system, some IC sets are generated by different ICA transformations firstly. A multi-objective genetic algorithm (MOGA) is designed to select different biologically significant IC subsets from these IC sets, which are then applied to build base classifiers. Three schemes are used to fuse these base classifiers. The first fusion scheme is to combine all individuals in the final generation of the MOGA. In addition, in the evolution, we design a global-recording technique to record the best IC subsets of each IC set in a global-recording list. Then the IC subsets in the list are deployed to build base classifier so as to implement the second fusion scheme. Furthermore, by pruning about half of less accurate base classifiers obtained by the second scheme, a compact and more accurate ensemble system is built, which is regarded as the third fusion scheme. Three microarray datasets are used to test the ensemble systems, and the corresponding results demonstrate that these ensemble schemes can further improve the performance of the ICA based classification model, and the third fusion scheme leads to the most accurate ensemble system with the smallest ensemble size.  相似文献   

13.
本文通过对数据挖掘技术的研究,采用关联规则法对学生答题数据进行分析,并在关联规则使用中采用改进型的Apriori算法进行运算,构建高频集,并对于高分学生和低分学生的试卷进行了加权处理,使得高频集中的试题在知识点和难度上的关联更加突出,便于在自动组卷时更科学地评价试卷.  相似文献   

14.
Knowledge gained through classification of microarray gene expression data is increasingly important as they are useful for phenotype classification of diseases. Different from black box methods, fuzzy expert system can produce interpretable classifier with knowledge expressed in terms of if-then rules and membership function. This paper proposes a novel Genetic Swarm Algorithm (GSA) for obtaining near optimal rule set and membership function tuning. Advanced and problem specific genetic operators are proposed to improve the convergence of GSA and classification accuracy. The performance of the proposed approach is evaluated using six gene expression data sets. From the simulation study it is found that the proposed approach generated a compact fuzzy system with high classification accuracy for all the data sets when compared with other approaches.  相似文献   

15.
Automated test data generation plays an important part in reducing the cost and increasing the reliability of software testing. However, a challenging problem in path-oriented test data generation is the existence of infeasible program paths, where considerable effort may be wasted in trying to generate input data to traverse the paths. In this paper, we propose a heuristics-based approach to infeasible path detection for dynamic test data generation. Our approach is based on the observation that many infeasible program paths exhibit some common properties. Through realizing these properties in execution traces collected during the test data generation process, infeasible paths can be detected early with high accuracy. Our experiments show that the proposed approach efficiently detects most of the infeasible paths with an average precision of 96.02% and a recall of 100% of all the cases.  相似文献   

16.
The problem of identifying key genes is of fundamental importance in biology and medicine. The GeneRank model explores connectivity data to produce a prioritization of the genes in a microarray experiment that is less susceptible to variation caused by experimental noise than the one based on expression levels alone. The GeneRank algorithm amounts to solving an unsymmetric linear system. However, when the matrix in question is very large, the GeneRank algorithm is inefficient and even can be infeasible. On the other hand, the adjacency matrix is symmetric in the GeneRank model, while the original GeneRank algorithm fails to exploit the symmetric structure of the problem in question. In this paper, we discover that the GeneRank problem can be rewritten as a symmetric positive definite linear system, and propose a preconditioned conjugate gradient algorithm to solve it. Numerical experiments support our theoretical results, and show superiority of the novel algorithm.  相似文献   

17.
针对模拟退火算法,遗传算法应用于测试数据的自动生成的局限性,提出了一种基于GEMGA(基因表达散乱遗传算法)的结构化测试数据的自动生成的方法.讨论了路径的选择,提出了将控制流图与数据流图结合起来生成测试路径,通过TriType的分析结果说明了该方法的可行性.根据得到的测试路径将GEMGA应用到测试数据的自动生成,TdType的实验结果表明,GEMGA能生成更高质量的数据,并适用于较大规模的程序.  相似文献   

18.
面向试飞工程数据仓库系统的设计与实现   总被引:4,自引:2,他引:2  
飞行试验产生的数据复杂多变,因此要求工程数据管理系统具有较高的灵活性和可扩展性,提出了利用元对象实现各种不同模式工程数据统一管理的方法,描述了一个集数据管理与分析于一体的工程数据仓库系统的实现,给出了系统的组成结构,并刻画了该系统的主要功能和特点。  相似文献   

19.
Test data generation in program testing is the process of identifying a set of test data which satisfies a given testing criterion. Existing pathwise test data generators proceed by selecting program paths that satisfy the selected criterion and then generating program inputs for these paths. One of the problems with this approach is that unfeasible paths are often selected; as a result, significant computational effort can be wasted in analysing those paths. In this paper, an approach to test data generation, referred to as a dynamic approach for test data generation, is presented. In this approach, the path selection stage is eliminated. Test data are derived based on the actual execution of the program under test and function minimization methods. The approach starts by executing a program for an arbitrary program input. During program execution for each executed branch, a search procedure decides whether the execution should continue through the current branch or an alternative branch should be taken. If an undesirable execution flow is observed at the current branch, then a real-valued function is associated with this branch, and function minimization search algorithms are used to locate values of input variables automatically, which will change the flow of execution at this branch.  相似文献   

20.
一种用于测试数据生成的动态程序切片算法   总被引:3,自引:0,他引:3  
王雪莲  赵瑞莲  李立健 《计算机应用》2005,25(6):1445-1447,1450
介绍了程序切片技术的基本概念,提出了一种基于前向分析的动态程序切片算法,探讨了程序切片在软件测试数据生成中的应用,结果表明可以有效地提高基于路径的测试数据生成效率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号