Similar Documents
Found 20 similar documents; search time 38 ms
1.
Similarity search and detection is a central problem in time series data processing and management. Most approaches to this problem have been developed around the notion of dynamic time warping, and several dimensionality reduction techniques have been proposed to improve the efficiency of similarity searches. Given the continuously increasing number of time series data sources and the critical nature of the real-world applications that use such data, there is a pressing demand for similarity detection in time series that is both accurate and fast. Our proposal is to define a concise yet feature-rich representation of time series, to which dynamic time warping can be applied for effective and efficient similarity detection. We present the Derivative time series Segment Approximation (DSA) representation model, which combines derivative estimation, segmentation, and segment approximation to provide both high sensitivity in capturing the main trends of time series and data compression. We extensively compare DSA with state-of-the-art similarity methods and dimensionality reduction techniques in clustering and classification frameworks. Experimental evidence from effectiveness and efficiency tests on various datasets shows that DSA is well suited to supporting both accurate and fast similarity detection.
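The DSA pipeline described above (derivative estimation, segment approximation, then DTW on the compressed series) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the first-difference derivative, the fixed segment length, and the function names are all assumptions.

```python
# Hedged sketch of the DSA idea: estimate derivatives, approximate fixed-length
# segments by their means, then compare the compressed series with classic DTW.

def derivative(series):
    """First-difference estimate of the derivative (an assumed estimator)."""
    return [series[i + 1] - series[i] for i in range(len(series) - 1)]

def segment_means(series, seg_len):
    """Approximate each fixed-length segment by its mean value."""
    return [sum(series[i:i + seg_len]) / len(series[i:i + seg_len])
            for i in range(0, len(series), seg_len)]

def dtw(a, b):
    """Classic dynamic time warping distance via O(len(a)*len(b)) DP."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

x = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3]
y = [0, 0, 1, 2, 3, 2, 1, 0, 1, 2]
xa = segment_means(derivative(x), 3)   # compressed, derivative-based view of x
ya = segment_means(derivative(y), 3)
print(dtw(xa, ya))
```

Because DTW runs on the short approximated sequences rather than the raw series, the quadratic DTW cost drops by roughly the square of the compression ratio.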

2.
Dynamic time warping (DTW) distance has been used effectively for mining time series data in a multitude of domains. In its original formulation, however, DTW is extremely inefficient at comparing long sparse time series, which contain mostly zeros and some unevenly spaced nonzero observations. The original DTW distance does not take advantage of this sparsity, leading to redundant calculations and a prohibitively large computational cost for long time series. We derive a new time warping similarity measure (AWarp) for sparse time series that works on the run-length encoded representation of sparse time series. The complexity of AWarp is quadratic in the number of observations rather than in the time range of the series; AWarp can therefore be several orders of magnitude faster than DTW on sparse time series. AWarp is exact for binary-valued time series and a close approximation of the original DTW distance for any-valued series. We discuss useful variants of AWarp: bounded (both upper and lower), constrained, and multidimensional. We show applications of AWarp to three data mining tasks (clustering, classification, and outlier detection) that are otherwise not feasible with classic DTW, while producing equivalent results. Potential areas of application include bot detection, human activity classification, search trend analysis, seismic analysis, and unusual review pattern mining.
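The representation AWarp operates on can be sketched as a simple run-length encoding of a sparse series. The token format below (zero runs as `('0', run_length)`, nonzero observations as `('v', value)`) is an illustrative assumption, not the paper's exact encoding.

```python
# Minimal run-length encoding of a sparse (mostly zero) time series.

def rle_sparse(series):
    encoded, zero_run = [], 0
    for v in series:
        if v == 0:
            zero_run += 1
        else:
            if zero_run:
                encoded.append(('0', zero_run))  # flush the pending zero run
                zero_run = 0
            encoded.append(('v', v))             # one token per nonzero value
    if zero_run:
        encoded.append(('0', zero_run))
    return encoded

print(rle_sparse([0, 0, 0, 5, 0, 0, 2, 0]))
```

A warping measure defined directly on these tokens pays a cost quadratic in the number of tokens, not in the raw series length, which is the source of AWarp's speedup on sparse data.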

3.
Given that traditional methods cannot directly and effectively cluster multivariate time series data, a clustering method for multivariate time series based on affinity propagation over component attributes is proposed. Dynamic time warping is used to measure the overall distance between multivariate time series. The affinity propagation clustering algorithm is then applied separately to the overall distance matrix and to the component approximate-distance matrices. By jointly considering the relationships between series under these two views, affinity propagation is applied to a composite relationship matrix reflecting the original multivariate time series, yielding higher-quality clusters. Numerical experiments show that, compared with traditional clustering methods, the proposed method not only effectively captures the relationships among overall data features, but also improves the clustering of the original time series through analysis of the relationships among important component attribute series.

4.
Clustering problems are central to many knowledge discovery and data mining tasks. However, most existing clustering methods can only work with fixed-dimensional representations of data patterns. In this paper, we study the clustering of data patterns that are represented as sequences or time series, possibly of different lengths. We propose a model-based approach to this problem using mixtures of autoregressive moving average (ARMA) models. We derive an expectation-maximization (EM) algorithm for learning the mixing coefficients as well as the parameters of the component models. To address the model selection problem, we use the Bayesian information criterion (BIC) to determine the number of clusters in the data. Experiments are conducted on a number of simulated and real datasets. The results show that our method compares favorably with methods previously proposed for similar time series clustering tasks.
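The BIC-based model selection step described above can be sketched as follows: for each candidate number of clusters k, fit the mixture and score it with BIC = -2·log-likelihood + (number of parameters)·log(n). The log-likelihoods and parameter counts below are placeholders standing in for fitted ARMA mixture models, not results from the paper.

```python
import math

def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_likelihood + num_params * math.log(num_samples)

# Hypothetical fits: (k, log-likelihood, parameter count) for n = 200 series.
fits = [(2, -510.0, 9), (3, -495.0, 14), (4, -493.0, 19)]
best_k = min(fits, key=lambda f: bic(f[1], f[2], 200))[0]
print(best_k)
```

The penalty term grows with the parameter count, so adding a fourth cluster here is rejected: the small gain in likelihood does not pay for the extra parameters.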

5.
6.
李海林  杨丽彬 《控制与决策》2013,28(11):1718-1722

Dimensionality reduction and feature representation are key techniques for addressing the curse of dimensionality in time series, and they play a fundamental role in time series data mining. This paper proposes a new method for time series dimensionality reduction and feature representation. An orthogonal polynomial regression model is used to extract features from the time series. Combining the fitting analysis of the time series with the length of the feature sequence, singular value decomposition is then applied to further reduce the dimensionality of the feature sequence, yielding a lower-dimensional feature sequence that preserves most of the information. Numerical experiments show that the new method achieves good clustering and classification results in a low-dimensional feature space.

7.
Traditional pattern recognition generally involves two tasks: unsupervised clustering and supervised classification. When class information is available, fusing the advantages of clustering learning and classification learning into a single framework is an important problem worthy of study. To date, most algorithms treat clustering learning and classification learning in a sequential or two-step manner: first execute clustering learning to explore structures in the data, then perform classification learning on top of the obtained structural information. However, such sequential algorithms cannot guarantee simultaneous optimality for both clustering and classification learning. In fact, the clustering learning in these algorithms merely aids the subsequent classification learning and does not benefit from the latter. To overcome this problem, a simultaneous learning framework for clustering and classification (SCC) is presented in this paper. SCC aims to achieve three goals: (1) acquiring robust classification and clustering simultaneously; (2) designing an effective and transparent classification mechanism; (3) revealing the underlying relationship between clusters and classes. To this end, using Bayesian theory and the cluster posterior probabilities of classes, we define a single objective function into which the clustering process is directly embedded. By optimizing this objective function, effective and robust clustering and classification results are achieved simultaneously. Experimental results on both synthetic and real-life datasets show that SCC achieves promising classification and clustering results at the same time.

8.
Feature selection is one of the most important machine learning procedures and has been successfully applied as a preprocessing step before classification and clustering. High-dimensional features often appear in big data, and their characteristics hinder data processing, so spectral feature selection algorithms have been receiving increasing attention from researchers. However, most feature selection methods treat these tasks as two separate steps: first learn a similarity matrix from the original feature space (which may include redundancy across all features), and then perform data clustering. Because of this limitation, they do not perform well on classification and clustering tasks in big-data applications. To address this problem, we propose an unsupervised feature selection method with a graph learning framework, which reduces the influence of redundant features and simultaneously imposes a low-rank constraint on the weight matrix. More importantly, we design a new objective function to handle this problem. We evaluate our approach on six benchmark datasets. All empirical classification results show that our new approach outperforms state-of-the-art feature selection approaches.

9.
Pairwise clustering methods have shown great promise for many real-world applications. However, the computational demands of these methods make them impractical for use with large data sets. The contribution of this paper is a simple but efficient method, called eSPEC, that makes clustering feasible for problems involving large data sets. Our solution adopts a "sampling, clustering plus extension" strategy. The methodology starts by selecting a small number of representative samples from the relational pairwise data using a selective sampling scheme; the chosen samples are then grouped using a pairwise clustering algorithm combined with local scaling; and finally, the label assignments of the remaining instances in the data are extended as a classification problem in a low-dimensional space, which is explicitly learned from the labeled samples using a cluster-preserving graph embedding technique. Extensive experimental results on several synthetic and real-world data sets demonstrate both the feasibility of our method for approximately clustering large data sets and its ability to accelerate clustering on data sets that fit in memory.

10.
The linear model is a general forecasting model, and the moving average technical index (MATI) is a useful forecasting method for predicting future stock prices in stock markets. Individual investors, stock fund managers, and financial analysts therefore attempt to predict price fluctuations in stock markets using either linear models or MATI. Three major drawbacks are found in many existing forecasting models in the literature. First, forecasting rules mined by some AI algorithms, such as neural networks, can be very difficult to understand. Second, generating time series forecasting models requires statistical assumptions about variables that are not easily understood by stock investors. Third, stock market investors usually make short-term decisions based on recent price fluctuations, i.e., the last one or two periods, but most time series models use only the last period of the stock price. To overcome these drawbacks, this study proposes a hybrid forecasting model using a linear model and MATI to predict stock price trends in four steps: (1) test the lag period of the Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX) and calculate the last n-period moving average; (2) use subtractive clustering to partition technical indicator values into linguistic values via an objective data discretization method; (3) employ a fuzzy inference system (FIS) to build linguistic rules from the linguistic technical indicator dataset, and optimize the FIS parameters with an adaptive network; and (4) refine the proposed model with adaptive expectation models. The proposed model is verified by root mean squared error (RMSE), with a ten-year period of the TAIEX selected as the experimental dataset. The results show that the proposed model is superior to the other forecasting models, namely Chen's model and Yu's model, in terms of RMSE.
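The moving average index used in step (1) can be sketched in a few lines; the window length and the toy price series are illustrative assumptions.

```python
# Minimal sketch of the last n-period moving average underlying the MATI.

def moving_average(prices, n):
    """Simple n-period moving average; one value per full window."""
    return [sum(prices[i - n + 1:i + 1]) / n
            for i in range(n - 1, len(prices))]

closes = [100.0, 101.0, 103.0, 102.0, 104.0, 105.0]  # hypothetical closes
print(moving_average(closes, 3))
```

The last element of this list is the "last n-period moving average" that the hybrid model feeds into its technical indicator discretization.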

11.
A fully automatic pattern classification method combining SOM and GRNN
The classification accuracy of unsupervised learning algorithms is usually unsatisfactory, while supervised learning algorithms require manually selected training samples, which are sometimes hard to obtain, and their classification accuracy depends directly on the chosen samples. To address these shortcomings, a new fully automatic pattern classification method combining an unsupervised self-organizing map neural network (SOMNN) and a supervised general regression neural network (GRNN) is proposed. The new method first clusters the raw data automatically with the SOMNN, then trains the GRNN with the resulting cluster centers and the data points near those centers. The cluster centers are then recomputed from the GRNN's classification results, and the GRNN is retrained with the new centers and their neighboring points. This process is repeated until the centers stabilize. Experimental results on the Iris and Wine datasets verify the feasibility of the new method.
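The GRNN prediction step in the loop above is essentially Nadaraya-Watson kernel regression over the training samples (here, the cluster centers). The sketch below is a hedged one-dimensional illustration; the Gaussian width sigma, the numeric class targets, and the center values are all assumptions.

```python
import math

def grnn_predict(x, centers, targets, sigma=1.0):
    """GRNN output: kernel-weighted average of the training targets."""
    weights = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers]
    return sum(w * t for w, t in zip(weights, targets)) / sum(weights)

# Hypothetical cluster centers found by the SOM, with numeric class targets.
centers, targets = [0.0, 5.0], [0.0, 1.0]
print(round(grnn_predict(0.2, centers, targets), 3))
```

Points near a center inherit that center's class almost entirely, which is what lets the GRNN's outputs be used to recompute the centers on the next iteration.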

12.
Directly applying single-label classification methods to multi-label learning problems substantially limits both performance and speed due to the imbalance, dependence, and high dimensionality of the given label matrix. Existing methods either ignore these three problems or reduce one at the price of aggravating another. In this paper, we propose a {0,1} label matrix compression and recovery method termed "compressed labeling (CL)" to simultaneously solve, or at least reduce, these three problems. CL first compresses the original label matrix to improve balance and independence by preserving the signs of its Gaussian random projections. Afterward, we directly apply popular binary classification methods (e.g., support vector machines) to each new label. A fast recovery algorithm is developed to recover the original labels from the predicted new labels. In the recovery algorithm, a "labelset distilling method" is designed to extract distilled labelsets (DLs), i.e., the frequently appearing label subsets, from the original labels via recursive clustering and subtraction. Given a distilled and an original label vector, we discover that the signs of their random projections have an explicit joint distribution that can be quickly computed from a geometric inference. Based on this observation, the original label vector is exactly determined after performing a series of Kullback-Leibler divergence based hypothesis tests on the distribution of the new labels. CL significantly improves the balance of the training samples and reduces the dependence between different labels. Moreover, it accelerates the learning process by training fewer binary classifiers for the compressed labels, and makes use of label dependence via DL-based tests. Theoretically, we prove recovery bounds for CL, which verify the effectiveness of CL for label compression and the improvement in multi-label classification performance brought by the label correlations preserved in DLs. We show the effectiveness, efficiency, and robustness of CL via five groups of experiments on 21 datasets from text classification, image annotation, scene classification, music categorization, genomics, and web page classification.
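The compression step the abstract describes (preserving the signs of Gaussian random projections of a {0,1} label vector) can be sketched as follows. The dimensions, the seed, and the mapping of signs to {0,1} are illustrative assumptions.

```python
import random

def compress_labels(label_vec, k, seed=0):
    """Compress a {0,1} label vector to k sign bits of random projections."""
    rng = random.Random(seed)
    d = len(label_vec)
    # One Gaussian random projection vector per compressed label.
    projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    signs = []
    for p in projections:
        dot = sum(pi * yi for pi, yi in zip(p, label_vec))
        signs.append(1 if dot >= 0 else 0)   # keep only the sign
    return signs

y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # original sparse label vector (d = 10)
print(compress_labels(y, k=4))        # 4 compressed binary labels
```

One binary classifier is then trained per compressed label (k classifiers instead of d), and recovery inverts the sign pattern statistically as outlined in the abstract.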

13.
A survey of time series data mining
Based on a comprehensive analysis of recent literature on time series data mining, this survey discusses the latest progress in the field, compares and categorizes the various academic viewpoints, and forecasts development trends. It covers time series transformation, similarity search, prediction, classification, clustering, segmentation, and visualization, providing researchers with a reference on the latest research activity, new techniques, and development trends in time series data mining.

14.
To overcome the high time complexity of existing text similarity measures based on semantic knowledge rules, a text similarity measure based on a classification dictionary is proposed. The Chinese lexical analysis system ICTCLAS is used to segment the text, TF×IDF is applied to extract keywords, and the classification dictionary is traversed to obtain keyword codes; the similarity between the original texts is then measured by computing the similarity of their keyword codes. Similarity measures based on semantic knowledge rules and on statistics are chosen as baselines, and the measures are evaluated with traditional clustering and KNN classification. Numerical experiments show that the new method achieves good results in both the clustering and classification experiments, and offers better time efficiency than other semantics-based similarity measures.
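The TF×IDF keyword extraction step can be sketched on a toy tokenized corpus. The corpus, tokenization, and scoring variant below are illustrative assumptions (real use would first segment Chinese text with ICTCLAS).

```python
import math

def tfidf(term, doc, corpus):
    """TF x IDF score of a term within one tokenized document."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)      # document frequency
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["time", "series", "mining"],
          ["text", "similarity", "mining"],
          ["text", "clustering"]]
doc = corpus[1]
scores = {t: tfidf(t, doc, corpus) for t in set(doc)}
top = max(scores, key=scores.get)   # the highest-scoring keyword
print(top)
```

Terms that appear in many documents (like "mining" here) get a low IDF and are suppressed, so the extracted keywords are the ones that discriminate the document.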

15.
A dynamic clustering method based on Markov chain models
Clustering univariate time series is a special class of clustering problem with broad applications. Because of its special nature, existing clustering methods cannot be applied directly, so a new dynamic clustering method based on Markov chain models is proposed. The method first builds, for each time series, a Markov chain model describing its dynamic characteristics, thereby converting the problem of clustering time series into one of clustering Markov chain models. A "distance" between Markov chains is then defined, and a dynamic clustering algorithm is used to cluster the models. Clustering experiments with this method on both real and simulated data yield good results.
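The modeling step can be sketched as follows: fit a first-order transition matrix to each discretized series, then compare series by comparing their matrices. The discretization into integer states and the elementwise L1 matrix distance are assumptions standing in for the paper's chain "distance".

```python
def transition_matrix(states, n_states):
    """Row-normalized first-order transition matrix of a state sequence."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    return [[c / max(1, sum(row)) for c in row] for row in counts]

def chain_distance(p, q):
    """Elementwise L1 distance between two transition matrices."""
    return sum(abs(pij - qij)
               for pr, qr in zip(p, q) for pij, qij in zip(pr, qr))

s1 = [0, 1, 0, 1, 0, 1]        # alternating series
s2 = [0, 0, 0, 1, 1, 1]        # two long runs
p1 = transition_matrix(s1, 2)
p2 = transition_matrix(s2, 2)
print(chain_distance(p1, p1), chain_distance(p1, p2) > 0)
```

Series with similar dynamics get similar transition matrices, so any standard clustering algorithm can then run on the matrix of pairwise chain distances.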

16.
By using a kernel function, data that are not easily separable in the original space can be clustered into homogeneous groups in the implicitly transformed high-dimensional feature space. Kernel k-means algorithms have recently been shown to perform better than conventional k-means algorithms in unsupervised classification. However, few reports have examined the benefits of using a kernel function or the relative merits of the various kernel clustering algorithms with regard to the data distribution. In this study, we reformulated four representative clustering algorithms based on a kernel function and evaluated their performance on various data sets. The results indicate that each kernel clustering algorithm gives markedly better performance than its conventional counterpart for almost all data sets. Of the kernel clustering algorithms studied in the present work, the kernel average linkage algorithm gives the most accurate clustering results.
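The kernel trick shared by these reformulated algorithms can be illustrated with an RBF (Gaussian) kernel matrix, which supplies inner products in the implicit feature space so that distance-based clustering steps can be rewritten in terms of K. The kernel choice, gamma, and toy points are assumptions.

```python
import math

def rbf_kernel_matrix(points, gamma=1.0):
    """Gram matrix of the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(a, b):
        sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-gamma * sq)
    return [[k(a, b) for b in points] for a in points]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
K = rbf_kernel_matrix(pts)
# Nearby points have kernel values near 1; distant points near 0.
print(round(K[0][1], 2), round(K[0][2], 6))
```

Squared feature-space distances follow directly as K[i][i] + K[j][j] - 2·K[i][j], which is all a kernelized k-means or linkage algorithm needs.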

17.
Characteristic-Based Clustering for Time Series Data
With the growing importance of time series clustering research, particularly for similarity searches amongst long time series such as those arising in medicine or finance, it is critical to resolve the outstanding problems that make most clustering methods impractical under certain circumstances. When the time series are very long, some clustering algorithms may fail because the very notion of similarity is dubious in high-dimensional space, and many methods cannot handle missing data when clustering is based on a distance metric. This paper proposes a method for clustering time series based on their structural characteristics. Unlike other alternatives, this method does not cluster point values using a distance metric; rather, it clusters based on global features extracted from the time series. The feature measures are obtained from each individual series and can be fed into arbitrary clustering algorithms, including an unsupervised neural network algorithm, self-organizing maps, or hierarchical clustering algorithms. Global measures describing the time series are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity. Since the method clusters using extracted global measures, it reduces the dimensionality of the time series and is much less sensitive to missing or noisy data. We further provide a search mechanism to find the best selection from the feature set to use as the clustering inputs. The proposed technique has been tested using benchmark time series datasets previously reported for time series clustering and a set of time series datasets with known characteristics. The empirical results show that our approach is able to yield meaningful clusters. The resulting clusters are similar to those produced by other methods, but with some promising and interesting variations that can be intuitively explained with knowledge of the global characteristics of the time series.
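The feature-extraction idea can be sketched by computing a few global statistics per series. The subset below (mean, linear-trend slope, skewness, excess kurtosis) is a small illustrative selection from the measures listed in the abstract, with assumed estimators.

```python
import math

def global_features(x):
    """Describe a series by a handful of global statistics."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    std = math.sqrt(var) or 1.0              # guard constant series
    t_mean = (n - 1) / 2.0
    # Least-squares slope of value against time index (linear trend).
    slope = (sum((i - t_mean) * (v - mean) for i, v in enumerate(x))
             / sum((i - t_mean) ** 2 for i in range(n)))
    skew = sum(((v - mean) / std) ** 3 for v in x) / n
    kurt = sum(((v - mean) / std) ** 4 for v in x) / n - 3.0
    return {"mean": mean, "trend": slope, "skew": skew, "kurtosis": kurt}

print(global_features([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Each series, whatever its length, maps to the same short feature vector, which is what makes arbitrary fixed-dimension clustering algorithms applicable downstream.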

18.
Dynamic Time Warping (DTW) is a popular and efficient distance measure used in classification and clustering algorithms applied to time series data. By computing the DTW distance not on the raw data but on the time series of the (first, discrete) derivative of the data, we obtain the so-called Derivative Dynamic Time Warping (DDTW) distance measure. DDTW used alone is usually ineffective, but there exist datasets on which DDTW gives good results, sometimes much better than DTW. To improve the performance of the two distance measures, we can combine them into a new single (parametric) distance function. The literature contains examples of combining DTW and DDTW in algorithms for supervised classification of time series data. In this paper, we demonstrate that a combination of DTW and DDTW can also be applied in a method for time series clustering (unsupervised classification). In particular, we focus on hierarchical clustering (with average linkage) of univariate (one-dimensional) time series data. We construct a new parametric distance function combining DTW and DDTW, in which a single real-valued parameter controls the contribution of each of the two measures to the total value of the combined distance. The parameter is tuned in the initial phase of the clustering algorithm. Using this technique in clustering methods requires a different approach (to address certain specific problems) than in supervised methods. In the clustering process we use three internal cluster validation measures (which do not use labels) and three external cluster validation measures (which do use data labels). The internal measures are used to select an optimal value of the algorithm's parameter, while the external measures provide information about the overall performance of the new method and enable comparison with other distance functions. Computational experiments are performed on a large real-world database (the UCR Time Series Classification Archive: 84 datasets) covering a very broad range of fields, including medicine, finance, multimedia, and engineering. The experimental results demonstrate the effectiveness of the proposed approach for hierarchical clustering of time series data. The method with the new parametric distance function outperforms DTW (and DDTW) on the database used. The results are confirmed by graphical and statistical comparison.
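The combined parametric distance can be sketched as a convex combination of DTW on raw values and DTW on first derivatives (DDTW), with a single alpha in [0, 1] controlling the mix. The first-difference derivative and the convex-combination form are assumptions consistent with the abstract, not the paper's exact estimator.

```python
def dtw(a, b):
    """Classic dynamic time warping distance via dynamic programming."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def combined_distance(a, b, alpha):
    """(1 - alpha) * DTW on values + alpha * DTW on first derivatives."""
    da = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    db = [b[i + 1] - b[i] for i in range(len(b) - 1)]
    return (1 - alpha) * dtw(a, b) + alpha * dtw(da, db)

# Two series with identical shape but a constant vertical offset:
x, y = [0, 1, 2, 1, 0], [1, 2, 3, 2, 1]
print(combined_distance(x, y, 0.0), combined_distance(x, y, 1.0))
```

With alpha = 1 the offset is invisible (identical derivatives give distance 0), while alpha = 0 reduces to plain DTW; the clustering algorithm tunes alpha between these extremes using the internal validation measures.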

19.
《Computer Networks》2008,52(14):2677-2689
This work explores the use of statistical techniques, namely stratified sampling and cluster analysis, as powerful tools for deriving traffic properties at the flow level. Our results show that adequate sample selection leads to significant improvements, enabling further important statistical analyses. Although stratified sampling is a well-known technique, the way we classify the data prior to sampling is innovative and deserves special attention. We evaluate two partitioning clustering methods, namely clustering large applications (CLARA) and K-means, and validate their outcomes by using them as thresholds for stratified sampling. We show that by using flow sizes to divide the population we can obtain accurate estimates of both flow sizes and flow durations. The presented sampling and clustering classification techniques achieve data reduction levels higher than those of existing methods, on the order of 0.1%, while maintaining good accuracy for estimates of the sum, mean, and variance of both flow durations and sizes.
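The size-stratified sampling step can be sketched as follows: flows are partitioned into strata by size thresholds and each stratum is sampled at a common rate. The fixed thresholds here are illustrative assumptions (the paper derives them with CLARA / K-means), as are the sampling rate and toy flow sizes.

```python
import random

def stratified_sample(flows, thresholds, rate, seed=0):
    """Sample each size-stratum at the given rate (at least one per stratum)."""
    rng = random.Random(seed)
    strata = [[] for _ in range(len(thresholds) + 1)]
    for size in flows:
        idx = sum(size > t for t in thresholds)   # which stratum?
        strata[idx].append(size)
    sample = []
    for s in strata:
        k = max(1, round(rate * len(s))) if s else 0
        sample.extend(rng.sample(s, k))
    return sample

flows = [10, 12, 15, 900, 1100, 50_000]           # hypothetical flow sizes
print(stratified_sample(flows, thresholds=[100, 10_000], rate=0.5))
```

Because rare large flows occupy their own stratum, they are guaranteed representation in the sample, which is what keeps the sum and variance estimates accurate under heavy-tailed flow-size distributions.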

20.
Spectral clustering is one of the most popular and important clustering methods in pattern recognition, machine learning, and data mining. However, its high computational complexity limits its use in applications involving truly large-scale datasets. For a clustering problem with n samples, it needs to compute the eigenvectors of the graph Laplacian with O(n³) time complexity. To address this problem, we propose a novel method called anchor-based spectral clustering (ASC) that employs anchor points of the data. Specifically, m (m ≪ n) anchor points are selected from the dataset, which can largely maintain the intrinsic (manifold) structure of the original data. Then a mapping matrix between the original data and the anchors is constructed. More importantly, it is proved that this data-anchor mapping matrix essentially preserves the clustering structure of the data. Based on this mapping matrix, it is easy to approximate the spectral embedding of the original data. The proposed method scales linearly with the size of the data, with low degradation of the clustering performance. ASC is compared to classical spectral clustering and two state-of-the-art accelerating methods, i.e., power iteration clustering and landmark-based spectral clustering, on 10 real-world applications under three evaluation metrics. Experimental results show that ASC is consistently faster than classical spectral clustering with comparable clustering performance, and is at least comparable with or better than the state-of-the-art methods in both effectiveness and efficiency.
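The data-anchor mapping can be sketched as expressing each of the n points by Gaussian similarities to the m anchors (m ≪ n), with rows normalized to sum to 1. The anchor selection, kernel choice, and gamma below are illustrative assumptions, not the paper's construction.

```python
import math

def anchor_mapping(points, anchors, gamma=1.0):
    """n x m row-stochastic similarity matrix between points and anchors."""
    Z = []
    for x in points:
        row = [math.exp(-gamma * sum((xi - ai) ** 2
                                     for xi, ai in zip(x, a)))
               for a in anchors]
        s = sum(row)
        Z.append([v / s for v in row])   # normalize each row to sum to 1
    return Z

points = [(0.0,), (0.1,), (4.0,), (4.2,)]
anchors = [(0.0,), (4.0,)]               # m = 2 anchors for n = 4 points
Z = anchor_mapping(points, anchors)
# Each point loads almost entirely on its nearest anchor.
print([max(range(2), key=row.__getitem__) for row in Z])
```

Spectral analysis then runs on the small m-dimensional side of Z instead of the full n x n affinity graph, which is the source of ASC's linear scaling in n.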


Copyright©北京勤云科技发展有限公司  京ICP备09084417号