1.
2.
Jie Li, Xianglong Tang, Wei Zhao, Jianhua Huang 《Pattern recognition》2007,40(11):3249-3262
Microarrays have been widely used to classify cancer samples and discover biological types, for example tumor versus normal phenotypes in cancer research. One of the challenging scientific tasks in the post-genomic era is to identify, from the thousands of genes in microarray data, a subset of differentially expressed genes that will enable us to understand the underlying molecular mechanisms of diseases, diagnose diseases accurately, and identify novel therapeutic targets. In this paper, we propose a new framework for identifying differentially expressed genes. In the proposed framework, genes are ranked according to their residuals. The performance of the framework is assessed by applying it to several public microarray datasets. Experimental results show that the proposed method gives a more robust and accurate ranking than other statistical tests, such as the t-test, the Wilcoxon rank sum test and the KS-test. Another novelty of the method is an algorithm for selecting a small subset of genes that show significant variation in expression (“outlier” genes). The number of genes in this subset can be controlled via an adjustable confidence-level window. In addition, the results of the proposed method can be visualized: by inspecting the residual plot, we can easily find genes that show significant variation between two groups of samples and learn their degree of differential expression. Through a comparison study, we found several “outlier” genes that had been verified in previous biological experiments but were either not identified by other methods or ranked lower by standard statistical tests.
3.
4.
In this paper, we propose a Bayesian framework for estimating the parameters of a mixture of autoregressive models for time series clustering. The proposed approach is based on variational principles and provides a tractable approximation to the true posterior density that minimizes the Kullback–Leibler (KL) divergence with respect to the prior distribution. This method simultaneously addresses the model complexity and parameter estimation problems. The proposed approach is applied to both simulated and real-world time series datasets. It is found to be useful in exploring and finding the true number of underlying clusters, starting from an arbitrarily large number of clusters.
5.
Lafabregue Baptiste, Weber Jonathan, Gançarski Pierre, Forestier Germain 《Data mining and knowledge discovery》2022,36(1):29-81
Data Mining and Knowledge Discovery - Time series are ubiquitous in data mining applications. Similar to other types of data, annotations can be challenging to acquire, thus preventing from...
6.
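A DTW-based clustering algorithm for symbolized time series is proposed, which clusters the unequal-length symbolic time series obtained after dimensionality reduction. The algorithm first reduces the dimensionality of each time series by extracting its key points and symbolizing them; it then computes similarity using the DTW method; finally, it performs cluster analysis using the Normal matrix and the FCM method. Experimental results show that applying DTW to the symbolized time series obtained after key-point extraction considerably improves clustering accuracy.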
7.
Performing data mining tasks on streaming data is considered a challenging research direction because the data evolve continuously. In this work, we focus on the problem of clustering streaming time series, based on the sliding window paradigm. More specifically, we use the concept of subspace α-clusters. A subspace α-cluster consists of a set of streams whose value difference is less than α over a number of consecutive time instances (dimensions). The clusters can be continuously and incrementally updated as the streaming time series evolve with time. The proposed technique is based on a careful examination of pair-wise stream similarities for a subset of dimensions and is then generalized to more streams per cluster. Additionally, we extend our technique to find maximal pClusters in consecutive dimensions, which have been used in previously proposed clustering methods. Performance evaluation results, based on real-life and synthetic data sets, show that the proposed method is more efficient than existing techniques. Moreover, it is shown that the proposed pruning criteria are very important for search space reduction, and that incremental cluster monitoring is more computationally efficient than the re-clustering process.
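The pair-wise step behind the α-cluster definition can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `alpha_runs`, the `min_len` parameter, and the sample streams are hypothetical, and the sketch only finds the maximal runs of consecutive time instances in which two streams differ by less than α.

```python
def alpha_runs(a, b, alpha, min_len=2):
    """Return inclusive (start, end) index pairs of the maximal runs of
    consecutive time instances where |a[t] - b[t]| < alpha.
    Runs shorter than min_len (a hypothetical parameter) are dropped."""
    runs = []
    start = None
    n = min(len(a), len(b))
    for t in range(n):
        if abs(a[t] - b[t]) < alpha:
            if start is None:
                start = t  # a new run begins here
        else:
            if start is not None and t - start >= min_len:
                runs.append((start, t - 1))
            start = None
    # Close a run that is still open at the end of the window.
    if start is not None and n - start >= min_len:
        runs.append((start, n - 1))
    return runs


# Two illustrative streams that stay close except at index 3.
s1 = [1.0, 1.1, 1.2, 5.0, 2.0, 2.1, 2.2]
s2 = [1.05, 1.0, 1.3, 1.0, 2.05, 2.0, 2.3]
runs = alpha_runs(s1, s2, alpha=0.2)
print(runs)
```

Generalizing from pairs to larger sets of streams, and updating the runs incrementally as the window slides, are the parts the abstract attributes to the proposed technique; they are not shown here.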
8.
Policker S., Geva A.B. 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2000,30(2):339-343
The object of this paper is to present a model and a set of algorithms for estimating the parameters of a nonstationary time series generated by a continuous change in regime. We apply fuzzy clustering methods to the task of estimating the continuous drift in the time series distribution and interpret the resulting temporal membership matrix as weights in a time-varying mixture probability distribution function (PDF). We analyze the stopping conditions of the algorithm to infer a novel cluster validity criterion for fuzzy clustering algorithms of temporal patterns. The algorithm's performance is demonstrated with three different types of signals.
9.
Data Mining and Knowledge Discovery - With the increasing power of data storage and advances in data generation and collection technologies, large volumes of time series data become available and...
10.
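Outliers in time series data directly affect the results of data mining algorithms and can even cause them to fail. The traditional density-based spatial clustering of applications with noise (DBSCAN) algorithm can be used to identify outliers, but it is sensitive to its parameters, has high time complexity, and offers limited accuracy. Targeting the characteristics of time series data, an outlier identification algorithm based on variance clustering is proposed that can automatically perform multiple identification passes. The method converts the traditional neighborhood density into a variance and a mean, and the density threshold into a variance threshold within a time window. Based on definitions of outlier data, outlier-cluster data, and abnormal-cluster data, it gives the decision rules for outlier identification. In addition, because a single identification pass cannot remove all outliers, the algorithm is extended to a multi-pass outlier identification algorithm by defining a termination condition for repeated identification. Application in an aerospace data mining project verifies that the algorithm is general, has low time complexity, and can run multiple passes to improve accuracy.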
11.
12.
T. Warren Liao 《Pattern recognition》2007,40(9):2550-2562
A two-step procedure is developed for the exploratory mining of real-valued vector (multivariate) time series using partition-based clustering methods. The proposed procedure was tested with model-generated data, multiple sensor-based process data, as well as simulation data. The test results indicate that the proposed procedure is quite effective in producing better clustering results than a hidden Markov model (HMM)-based clustering method if there is a priori knowledge about the number of clusters in the data. Two existing validity indices were tested and found ineffective in determining the actual number of clusters. Determining the appropriate number of clusters in the case that there is no a priori knowledge is a known unresolved research issue not only for our proposed procedure but also for the HMM-based clustering method and further development is necessary.
13.
Pattern Analysis and Applications - In this paper, a subsequence time-series clustering algorithm is proposed to identify the strongly coupled aftershock sequences and Poissonian background...
14.
The problem of clustering time series is studied for a general class of non-parametric autoregressive models. The dissimilarity between two time series is based on comparing their full forecast densities at a given horizon. In particular, two functional distances are considered: L1 and L2. As the forecast densities are unknown, they are approximated using a bootstrap procedure that mimics the underlying generating processes without assuming any parametric model for the true autoregressive structure of the series. The estimated forecast densities are then used to construct the dissimilarity matrix and hence to perform clustering. Asymptotic properties of the proposed method are provided and an extensive simulation study is carried out. The results show the good behavior of the procedure for a wide variety of nonlinear autoregressive models and its robustness to non-Gaussian innovations. Finally, the proposed methodology is applied to a real dataset involving economic time series.
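As a rough illustration of the two functional distances, the sketch below approximates them for densities evaluated on a shared uniform grid. The function names, the Riemann-sum approximation, and the choice to report the L2 distance as an integrated squared difference (without a square root) are assumptions for illustration; the bootstrap estimation of the forecast densities is not reproduced.

```python
def l1_distance(f, g, dx):
    """Approximate the integral of |f(x) - g(x)| with a Riemann sum.
    f and g are density values on the same uniform grid with spacing dx."""
    return sum(abs(p - q) for p, q in zip(f, g)) * dx


def l2_distance(f, g, dx):
    """Approximate the integral of (f(x) - g(x))**2 with a Riemann sum.
    Assumption: no square root is taken."""
    return sum((p - q) ** 2 for p, q in zip(f, g)) * dx


# Illustrative densities on [0, 1]: uniform versus mass concentrated on [0, 0.5).
n = 100
dx = 1.0 / n
f = [1.0] * n
g = [2.0] * (n // 2) + [0.0] * (n // 2)
d1 = l1_distance(f, g, dx)
d2 = l2_distance(f, g, dx)
print(d1, d2)
```

Computing one of these distances for every pair of series would fill the dissimilarity matrix that the abstract passes to the clustering step.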
15.
Dynamic Time Warping (DTW) is a popular and efficient distance measure used in classification and clustering algorithms applied to time series data. By computing the DTW distance not on raw data but on the time series of the (first, discrete) derivative of the data, we obtain the so-called Derivative Dynamic Time Warping (DDTW) distance measure. DDTW, used alone, is usually inefficient, but there exist datasets on which DDTW gives good results, sometimes much better than DTW. To improve the performance of the two distance measures, we can combine them into a new single (parametric) distance function. The literature contains examples of combining DTW and DDTW in algorithms for supervised classification of time series data. In this paper, we demonstrate that a combination of DTW and DDTW can also be applied in a method of time series clustering (unsupervised classification). In particular, we focus on hierarchical clustering (with average linkage) of univariate (one-dimensional) time series data. We construct a new parametric distance function, combining DTW and DDTW, where a single real-valued parameter controls the contribution of each of the two measures to the total value of the combined distance. The parameter is tuned in the initial phase of the clustering algorithm. Using this technique in clustering methods requires a different approach (to address certain specific problems) than for supervised methods. In the clustering process we use three internal cluster validation measures (measures which do not use labels) and three external cluster validation measures (measures which do use clustering data labels). Internal measures are used to select an optimal value of the parameter of the algorithm, whereas external measures give information about the overall performance of the new method and enable comparison with other distance functions.
Computational experiments are performed on a large real-world database (UCR Time Series Classification Archive: 84 datasets) from a very broad range of fields, including medicine, finance, multimedia and engineering. The experimental results demonstrate the effectiveness of the proposed approach for hierarchical clustering of time series data. The method with the new parametric distance function outperforms DTW (and DDTW) on the database used. The results are confirmed by graphical and statistical comparison.
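The idea of a parametric distance that blends DTW and DDTW can be sketched as follows. The abstract does not give the exact form of the combination, so the convex combination `(1 - alpha) * DTW + alpha * DDTW` is an assumption. The function names and the use of plain successive differences as the discrete derivative are also illustrative choices.

```python
def dtw(a, b):
    """Classic dynamic-programming DTW with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def diff(series):
    """First discrete derivative, taken here as successive differences."""
    return [y - x for x, y in zip(series, series[1:])]


def combined_distance(a, b, alpha):
    """Assumed convex combination: alpha = 0 gives DTW, alpha = 1 gives DDTW."""
    return (1.0 - alpha) * dtw(a, b) + alpha * dtw(diff(a), diff(b))


# Two series with the same shape but a constant offset of 10.
a = [0, 1, 2, 3]
b = [10, 11, 12, 13]
print(dtw(a, b), combined_distance(a, b, 0.5), combined_distance(a, b, 1.0))
```

In this example the raw DTW distance is dominated by the offset, while the derivative-based term is zero because both series have the same shape. Selecting `alpha` with internal validation measures, as the abstract describes, is not shown.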
16.
Traditional and fuzzy cluster analyses are applicable to variables whose values are uncorrelated. Hence, to cluster time series data, which are usually serially correlated, one needs to extract features from the time series whose values are uncorrelated. The periodogram, which is an estimator of the spectral density function of a time series, is a feature that can be used in the cluster analysis of time series because its ordinates are uncorrelated. Additionally, the normalized periodogram and the logarithm of the normalized periodogram are also features that can be used. In this paper, we consider a fuzzy clustering approach for time series based on the estimated cepstrum. The cepstrum is the spectrum of the logarithm of the spectral density function. Our simulation studies show that, for the typical generating processes considered, fuzzy clustering based on the cepstral coefficients performs very well compared with clustering based on other features.
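A minimal sketch of how cepstral features might be estimated from a series is given below. It takes the estimated cepstrum to be the inverse discrete Fourier transform of the log periodogram. The function names, the naive O(n²) transform, and the small `floor` guarding against `log(0)` are assumptions for illustration, and the fuzzy clustering step is not reproduced.

```python
import cmath
import math


def periodogram(x):
    """Periodogram ordinates |DFT(x)|**2 / n at the n Fourier frequencies."""
    n = len(x)
    out = []
    for freq in range(n):
        s = sum(x[t] * cmath.exp(-2j * math.pi * freq * t / n) for t in range(n))
        out.append(abs(s) ** 2 / n)
    return out


def cepstrum(x, floor=1e-12):
    """Estimated cepstral coefficients: inverse DFT of the log periodogram.
    `floor` is a hypothetical guard that avoids taking log(0)."""
    n = len(x)
    log_spec = [math.log(max(p, floor)) for p in periodogram(x)]
    coeffs = []
    for k in range(n):
        s = sum(log_spec[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n))
        coeffs.append((s / n).real)
    return coeffs


# Illustrative series; the first few coefficients could serve as features.
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.5, 0.5]
c = cepstrum(x)
print(c[:3])
```

Each series would then be represented by a short vector of leading cepstral coefficients, which is the kind of feature vector the abstract feeds into fuzzy clustering.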
17.
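Based on a study of current clustering algorithms, a method for effectively clustering multivariate time series is proposed. The discrete Hadamard transform is used to reduce the dimensionality of the multivariate data, and the eigenvalues of the correlation coefficient matrix of the variables are computed and used as weights. Using a weighted matrix similarity measure, an improved K-means algorithm is applied to cluster the multivariate time series. Experimental results show that the method clusters multivariate time series effectively, assigning multivariate time series objects with similar trends to the same class.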
18.
19.
Marcin Michalak 《Pattern Analysis & Applications》2011,14(3):283-293
This short article describes two kernel algorithms for regression function estimation. One of them is called HASKE and has its own heuristic for evaluating the h parameter. The second is a hybrid algorithm that connects the SVM and HASKE in such a way that the definition of the local neighborhood is based on the definition of the h-neighborhood from HASKE. Both of them are used as predictors for time series.
20.
The problem of adaptive segmentation of time series with abrupt changes in their spectral characteristics is addressed. Such time series are encountered in various fields of time series analysis, such as speech processing, biomedical signal processing, image analysis and failure detection. Mathematically, these time series can often be modeled by zero-mean, Gaussian-distributed autoregressive (AR) processes, where the parameters of the process, including the gain factor, remain constant for certain time intervals and then jump abruptly to new values. Identification of such processes requires adaptive segmentation: the times of the parameter jumps have to be estimated carefully to form the boundaries of “homogeneous” segments that can be described by stationary AR processes. In this paper, a new effective method for sequential adaptive segmentation is proposed, which is based on the parallel application of two sequential parameter estimation procedures. The detection of a parameter change, as well as the accurate estimation of the position of a segment boundary, is performed by a sequence of suitable generalized likelihood ratio (GLR) tests. Flow charts as well as a block diagram of the algorithm are presented. The adjustment of the three control parameters of the procedure (the AR model order, a threshold for the GLR test and the length of a “test window”) is discussed with respect to various performance features. Simulation results are presented that demonstrate the good detection properties of the algorithm and, in particular, an excellent ability to locate the segment boundaries even within a sequence of short segments. As an application to biomedical signals, the analysis of human electroencephalograms (EEG) is considered and an example is shown.