Similar articles
1.
Cluster analysis for gene expression data: a survey   Total citations: 16, self-citations: 0, citations by others: 16
DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which are essential in the data mining process for revealing natural structures and identifying interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms have recently been proposed specifically for gene expression data. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in this field.

2.
Clustering sensor data discovers useful information hidden in sensor networks. In sensor networks, a sensor has two types of attributes: a geographic attribute (i.e., its spatial location) and non-geographic attributes (e.g., sensed readings). Sensor data are periodically collected and viewed as spatial data streams, where a spatial data stream consists of a sequence of data points exhibiting attributes in both the geographic and non-geographic domains. Previous studies have developed a dual clustering problem for spatial data by considering similarity-connected relationships in both geographic and non-geographic domains. However, the clustering processes in stream environments are time-sensitive because of frequently updated sensor data. For sensor data, the readings from one sensor remain similar for a period, a property referred to as temporal locality. Using the temporal locality features of the sensor data, this study proposes an incremental clustering (IC) algorithm to discover clusters efficiently. The IC algorithm comprises two phases: cluster prediction and cluster refinement. The first phase estimates the probability of two sensors belonging to a cluster from the previous clustering results. According to the estimation, a coarse clustering result is derived. The cluster refinement phase then refines the coarse result. This study evaluates the performance of the IC algorithm using synthetic and real datasets. Experimental results show that the IC algorithm outperforms existing approaches, confirming the scalability of the IC algorithm. In addition, the effect of temporal locality features on the IC algorithm is analyzed and thoroughly examined in the experiments.

3.
The dramatic increase in space-borne sensors over the past two decades is presenting unique opportunities for new and enhanced applications in various scientific disciplines. Using these data sets, hydrogeologists can now address and understand the partitioning of water systems on regional and global scales, yet such applications present mounting challenges in data retrieval, assimilation, and analysis for scientists attempting to process relevant large temporal remote sensing data sets (e.g., TRMM, SSM/I, AVHRR, MODIS, QuikSCAT, and AMSR-E). We describe solutions to these problems through the development of an Interactive Data Language (IDL)-based computer program, the remote sensing data extraction model (RESDEM), for integrated processing and analysis of a suite of remote sensing data sets. RESDEM imports, calibrates, and georeferences scenes, and subsets global data sets for the purpose of extracting and verifying precipitation over areas and time periods of interest. Verification of precipitation events is accomplished by integrating other long-term satellite-based data sets. Some RESDEM modules process data for cloud detection, and others detect changes in soil moisture, vegetative water capacity, and vegetation intensity following targeted precipitation events. Using the arid Sinai Peninsula (SP; area: 61,000 km²) and the Eastern Desert (ED; area: 220,000 km²) of Egypt as test sites, we demonstrate how RESDEM outputs (verified precipitation events) are now enabling regional-scale applications of continuous (1998–2006) rainfall-runoff and groundwater recharge computations.

4.
Hierarchical clustering is a common procedure for identifying structure in a dataset and is frequently used for organizing genomic data. Although more advanced clustering algorithms are available, the simplicity and visual appeal of hierarchical clustering have made it ubiquitous in gene expression data analysis. Hence, even minor improvements in this framework would have significant impact. There is currently no simple and systematic way of assessing and displaying the significance of various clusters in a resulting dendrogram without making certain distributional assumptions or ignoring gene-specific variances. In this work, we introduce a permutation test based on comparing the within-cluster structure of the observed data with those of sample datasets obtained by permuting the cluster membership. We carry out this test at each node of the dendrogram using a statistic derived from the singular value decomposition of variance matrices. The p-values thus obtained provide insight into the significance of each cluster division. Given these values, one can also modify the dendrogram by combining non-significant branches. By adjusting the cut-off level of significance for branches, one can produce dendrograms with a desired level of detail for ease of interpretation. We demonstrate the usefulness of this approach by applying it to illustrative datasets.
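
The permutation idea in this abstract can be illustrated with a minimal Python sketch for a single two-way split. It is an assumption-laden simplification: the statistic used here is the plain within-cluster sum of squares, swapped in for the SVD-derived statistic the abstract mentions, and the function names and toy data are hypothetical.

```python
import random

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total

def permutation_p_value(points, labels, n_perm=999, seed=0):
    """Estimate how often a random relabelling is at least as tight as the
    observed split (smaller within-cluster scatter means a tighter split)."""
    rng = random.Random(seed)
    observed = within_cluster_ss(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute cluster membership, keep cluster sizes
        if within_cluster_ss(points, shuffled) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: two well-separated groups of five 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.0, 0.4),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.3, 5.0), (5.1, 5.4)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p = permutation_p_value(points, labels)
```

A small `p` suggests the split is unlikely to arise from arbitrary membership; in a dendrogram setting the same test would be repeated at each internal node, and branches with large p-values could be merged.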

5.
In this paper, we present Microarray Medical Data explorer (Microarray-MD), a novel software system that assists in the exploratory analysis of gene expression microarray data. It implements a combination scheme of multiple Support Vector Machines, which integrates a variety of gene selection criteria and allows for the discrimination of multiple diseases or subtypes of a disease. The system can be trained, and can automatically tune its parameters, when supplied with pathologically characterized gene expression data as input. Given a set of new, uncharacterized patient data as input, it outputs a decision on the type or the subtype of a disease. A graphical user interface provides easy access to the system operations and direct adjustment of its parameters. It has been tested on various publicly available datasets. The overall accuracy it achieves was estimated to exceed 90%.

6.
In this paper, we present an approximate data gathering technique, called EDGES, for sensor networks that utilizes temporal and spatial correlations. The goal of EDGES is to efficiently obtain sensor readings within a certain error bound. To do this, EDGES uses the multiple model Kalman filter, which suits non-linear data distributions, as an approximation approach. The Kalman filter allows EDGES to predict the future value from a single previous sensor reading, in contrast to other statistical models such as linear regression and the multivariate Gaussian. To extend the lifetime of networks, EDGES exploits spatial correlation by grouping spatially close sensors into a cluster. Since a cluster header in a network acts as both a sensor and a router, it expends considerable energy sending its own readings and data coming from its children. Thus, we devise a redistribution method that distributes the energy consumption of a cluster header using the spatial correlation. In some previous works, a fixed routing topology is used, or the roles of nodes are decided at the base station and this information propagates through the whole network. In EDGES, by contrast, a change to a cluster is notified to only a small portion of the network. Our experimental results over randomly generated sensor networks with synthetic and real data sets demonstrate the efficiency of EDGES.
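
The error-bounded gathering idea can be sketched in Python as follows. This is a simplified illustration, not the EDGES method: it assumes a single scalar Kalman filter with a random-walk model rather than the multiple model filter, it ignores clustering and energy redistribution, and all names and parameter values (`ScalarKalman`, `gather`, `q`, `r`) are hypothetical. The sensor and the sink are assumed to run the same filter, so a reading is transmitted only when the shared prediction misses it by more than the bound.

```python
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model."""

    def __init__(self, x0, p0=1.0, q=0.01, r=0.1):
        self.x = x0  # state estimate
        self.p = p0  # estimate variance
        self.q = q   # process noise variance (assumed value)
        self.r = r   # measurement noise variance (assumed value)

    def predict(self):
        # Random walk: the predicted value is the last estimate,
        # and the uncertainty grows by the process noise.
        self.p += self.q
        return self.x

    def update(self, z):
        gain = self.p / (self.p + self.r)
        self.x += gain * (z - self.x)
        self.p *= 1.0 - gain


def gather(readings, error_bound):
    """Return the sink-side estimates and the number of transmissions."""
    kf = ScalarKalman(readings[0])
    estimates = [readings[0]]  # the first reading is always sent
    sent = 1
    for z in readings[1:]:
        predicted = kf.predict()
        if abs(z - predicted) > error_bound:
            kf.update(z)          # both sides correct the filter
            estimates.append(z)   # the sink receives the exact reading
            sent += 1
        else:
            estimates.append(predicted)  # the prediction is good enough
    return estimates, sent


# Hypothetical temperature trace with one abrupt change.
readings = [20.0, 20.1, 20.1, 20.2, 23.0, 23.1, 23.0]
estimates, sent = gather(readings, error_bound=0.5)
```

By construction every sink-side estimate stays within `error_bound` of the true reading, while slowly varying stretches cost no transmissions.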

7.
Existing data management tools have limitations such as restrictions to specific file systems or a lack of transparency to applications. In this paper, we present a new data management tool called AIP, which is implemented via the standard data management API and hence supports multiple file systems and makes data management operations transparent to applications. First, AIP provides centralized policy-based data management for controlling the placement of files in different storage tiers. Second, AIP uses differentiated collections of file states to improve the execution efficiency of data management policies, with the help of a caching mechanism for file states. Third, AIP provides a resource arbitration mechanism for controlling the rate of initiated data management operations. Results from representative experiments demonstrate that AIP provides high performance, introduces low management overhead, and scales well.

8.
9.
Multimod Data Manager: a tool for data fusion   Total citations: 2, self-citations: 0, citations by others: 2
Nowadays biomedical engineers regularly have to combine data from multiple medical imaging modalities, biomedical measurements, and computer simulations, and this can demand knowledge of many specialised software tools. Acquiring this knowledge to the depth necessary to perform the various tasks can require considerable time and thus divert the researcher from addressing the actual biomedical problems. The aim of the present study is to describe a new application called the Multimod Data Manager, distributed as freeware, which provides the end user with a fully integrated environment for the fusion and manipulation of all biomedical data. The Multimod Data Manager is generated using a software application framework, called the Multimod Application Framework, which is specifically designed to support the rapid development of computer aided medicine applications. To understand the general logic of the Data Manager, we first introduce the framework from which it is derived. We then illustrate its use by an example: the development of a complete subject-specific musculo-skeletal model of the lower limb from the Visible Human medical imaging data, to be used for predicting the stresses in the skeleton during gait. While the Data Manager is clearly still only at the prototype stage, we believe that it is already capable of being used to solve a large number of problems common to many biomedical engineering activities.

10.
11.
Gene expression data are expected to be of significant help in the development of efficient cancer diagnosis and classification platforms. One problem arising from these data is how to select a small subset of genes from thousands of genes and a few samples that are inherently noisy. This research aims to select a small subset of informative genes from the gene expression data which will maximize the classification accuracy. A model for gene selection and classification has been developed by using a filter approach, and an improved hybrid of the genetic algorithm and a support vector machine classifier. We show that the classification accuracy of the proposed model is useful for the cancer classification of one widely used gene expression benchmark data set.

12.
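Zhao Yuhai, Wang Guoren, Yin Ying. 《计算机应用》 (Journal of Computer Applications), 2005, 25(6): 1388-1391
A parameter-free clustering algorithm for gene expression data is proposed. The algorithm combines a fuzzy clustering method for multidimensional data with CTWC and introduces a norm-based method to further improve and validate the approach. Applied to a real colon cancer gene expression dataset, the algorithm identified a signature combination of 8 genes that not only achieved a recognition rate of about 90% on colon cancer samples but could also distinguish subtypes of colon cancer samples. The experimental results confirm the feasibility of the algorithm.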

13.
Clustering analysis of temporal gene expression data is widely used to study dynamic biological systems, for example to identify sets of genes that are regulated by the same mechanism. However, temporal gene expression data often contain noise, missing data points, and non-uniformly sampled time points, which makes it challenging for traditional clustering methods to extract meaningful information. In this paper, we introduce an improved clustering approach based on regularized spline regression and an energy-based similarity measure. The proposed approach models each gene expression profile as a B-spline expansion, for which the spline coefficients are estimated by a regularized least squares scheme on the observed data. To compensate for the inadequate information in noisy and short gene expression data, we use correlated genes as the test set to choose the optimal number of basis functions and the regularization parameter. We show that this treatment can help to avoid over-fitting. After fitting the continuous representations of gene expression profiles, we use an energy-based similarity measure for clustering. The energy-based measure can include the temporal information and relative changes of the time series using the first and second derivatives of the time series. We demonstrate that our method is robust to noise and can produce meaningful clustering results.

14.
The GMAP: a versatile tool for physical data independence   Total citations: 1, self-citations: 0, citations by others: 1
Physical data independence is touted as a central feature of modern database systems. It allows users to frame queries in terms of the logical structure of the data, letting a query processor automatically translate them into optimal plans that access physical storage structures. Both relational and object-oriented systems, however, force users to frame their queries in terms of a logical schema that is directly tied to physical structures. We present an approach that eliminates this dependence. All storage structures are defined in a declarative language based on relational algebra as functions of a logical schema. We present an algorithm, integrated with a conventional query optimizer, that translates queries over this logical schema into plans that access the storage structures. We also show how to compile update requests into plans that update all relevant storage structures consistently and optimally. Finally, we report on experiments with a prototype implementation of our approach that demonstrate how it allows storage structures to be tuned to the expected or observed workload to achieve significantly better performance than is possible with conventional techniques. Edited by Matthias Jarke, Jorge Bocca, Carlo Zaniolo. Received September 15, 1994 / Accepted September 1, 1995

15.
Han Fei, Zhu Shaojun, Ling Qinghua, Han Henry, Li Hailong, Guo Xinli, Cao Jiechuan. Neural Computing & Applications, 2022, 34(19): 16325-16339
Neural Computing and Applications - Traditional machine learning methods struggle to achieve good performance in the classification of gene expression data due to its characteristics of high...

16.
17.
We introduce a flexible, variable resolution tool for interactive resampling of computational fluid dynamics (CFD) simulation data on versatile grids. The tool and coupled algorithm afford users precise control of glyph placement during vector field visualization via six interactive degrees of freedom. The algorithm has several other important characteristics: it (1) resamples any unstructured grid onto any structured grid, (2) handles changes to the underlying topology and geometry, (3) handles unstructured grids with holes and discontinuities, (4) does not rely on any pre-processing of the data, and (5) processes large numbers of unstructured grid cells efficiently. We believe this tool to be a valuable asset in the engineer's pursuit of understanding and visualizing the underlying flow field in CFD simulation results.

18.
Low spatial resolution satellite sensors provide information over relatively large targets with typical pixel resolutions of hundreds of km². However, the spatial scales of ground measurements are usually much smaller. Such differences in spatial scales make the interpretation of comparisons between quantities derived from low resolution sensors and ground measurements particularly difficult. They also highlight the importance of developing appropriate sampling strategies when designing ground campaigns for validation studies of low resolution sensors.

We make use of statistical modelling of high resolution surface shortwave radiation budget (SSRB) data to look into this problem. A spatial model that describes the SSRB over a selected region is proposed, and the impact of different sampling schemes on the performance of the model is analysed. Both systematic and random sampling schemes can efficiently represent the full set of observations.
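
The difference between the two sampling schemes can be sketched in Python on a synthetic field. This only illustrates the sampling step, not the spatial model of the abstract; the gradient field, the grid size, and the function names are all made up for the example.

```python
import random

def systematic_sample(field, step):
    """Take every `step`-th pixel in both directions, centred in each block."""
    offset = step // 2
    return [field[i][j]
            for i in range(offset, len(field), step)
            for j in range(offset, len(field[0]), step)]

def random_sample(field, n, seed=0):
    """Draw n pixels uniformly at random without replacement."""
    rng = random.Random(seed)
    flat = [value for row in field for value in row]
    return rng.sample(flat, n)

def mean(values):
    return sum(values) / len(values)

# Hypothetical 20 x 20 high resolution field with a smooth gradient.
field = [[100.0 + 2.0 * i + 1.0 * j for j in range(20)] for i in range(20)]
true_mean = mean([value for row in field for value in row])

sys_values = systematic_sample(field, step=5)
rnd_values = random_sample(field, n=len(sys_values))

sys_error = abs(mean(sys_values) - true_mean)
rnd_error = abs(mean(rnd_values) - true_mean)
```

Comparing `sys_error` and `rnd_error` for different values of `step` gives a rough feel for how many ground sites each scheme needs to represent the area-averaged value seen by a coarse pixel.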

19.
The typical AI problem is that of making a plan of the actions to be performed by a controller so that it can reach a set of final situations, starting from a certain initial situation. The plans, and related winning strategies, happen to be finite in the case of a finite number of states and a finite number of instant actions. The situation becomes much more complex when we deal with planning under temporal uncertainty caused by actions with delayed effects. Here we introduce a tree-based formalism to express plans, or winning strategies, in finite state systems in which actions may have quantitatively delayed effects. Since the delays are non-deterministic and continuous, we need infinite branching to display all possible delays. Nevertheless, under reasonable assumptions, we show that infinite winning strategies which may arise in this context can be captured by finite plans. The above planning problem is specified in logical terms within a Horn fragment of affine logic. Among other things, the advantage of the linear logic approach is that we can easily capture ‘preemptive/anticipative’ plans (in which a new action β may be taken at some moment within the running time of an action α being carried out, in order to be prepared before completion of action α). In this paper we propose a comprehensive and adequate logical model of strong planning under temporal uncertainty which addresses infinity concerns. In particular, we establish a direct correspondence between linear logic proofs and plans, or winning strategies, for the actions with quantitative delayed effects.

20.
CLARANS: a method for clustering objects for spatial data mining   Total citations: 14, self-citations: 0, citations by others: 14
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, the paper investigates how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, the paper develops two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.
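
The abstract does not describe how CLARANS works. As it is commonly presented, it is a randomized local search over k-medoids solutions, and the Python sketch below follows that reading for point objects only. The parameter names `num_local` and `max_neighbor`, their default values, and the toy data are assumptions for illustration; polygon handling and the IR-approximation are not covered.

```python
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, num_local=5, max_neighbor=50, seed=0):
    """Randomized search over sets of k medoid indices (requires len(points) > k)."""
    rng = random.Random(seed)
    n = len(points)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_local):  # independent restarts
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current)
        tried = 0
        while tried < max_neighbor:
            # A neighbour differs from the current solution in exactly one medoid.
            slot = rng.randrange(k)
            candidate = rng.choice([i for i in range(n) if i not in current])
            neighbour = current[:slot] + [candidate] + current[slot + 1:]
            neighbour_cost = total_cost(points, neighbour)
            if neighbour_cost < current_cost:
                current, current_cost = neighbour, neighbour_cost
                tried = 0  # moved to a better solution, restart the count
            else:
                tried += 1
        if current_cost < best_cost:
            best_medoids, best_cost = current, current_cost
    return sorted(best_medoids), best_cost

# Hypothetical toy data: two compact groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = clarans(points, k=2)
```

Only a random subset of neighbours is examined at each step, which is what keeps the search cheap compared with evaluating every possible medoid swap.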
