Found 20 similar documents; search took 15 ms
1.
Time series data mining (TSDM) techniques permit exploring large amounts of time series data in search of consistent patterns and/or interesting relationships between variables. TSDM is becoming increasingly important as a knowledge management tool where it is expected to reveal knowledge structures that can guide decision making in conditions of limited certainty. Human decision making in problems related to the analysis of time series databases is usually based on perceptions like “end of the day”, “high temperature”, “quickly increasing”, “possible”, etc. Though many effective TSDM algorithms have been developed, the integration of TSDM algorithms with human decision making procedures is still an open problem. In this paper, we consider the architecture of a perception-based decision making system for time series database domains that integrates perception-based TSDM, computing with words and perceptions, and expert knowledge. The new tasks that perception-based TSDM methods must solve to enable their integration in such systems are discussed. These tasks include: precisiation of perceptions, shape pattern identification, and pattern retranslation. We show how different methods developed so far in TSDM for manipulation of perception-based information can be used to develop a fuzzy perception-based TSDM approach. This approach is grounded in computing with words and perceptions, which permits formalizing human perception-based inference mechanisms. The discussion is illustrated by examples from economics, finance, meteorology, medicine, etc.
2.
Current research in indexing and mining time series data has produced many interesting algorithms and representations. However,
the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets
encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation
which can be used to index datasets which are several orders of magnitude larger than anything else considered in the literature.
To demonstrate the utility of this representation, we constructed a simple tree-based index structure which facilitates fast
exact search and orders of magnitude faster, approximate search. For example, with a database of one-hundred million time
series, the approximate search can retrieve high quality nearest neighbors in slightly over a second, whereas a sequential
scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance
to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate
search effectiveness, and lower bound comparisons between time series representations in a bit constrained environment. We
further show how to exploit the combination of both exact and approximate search as sub-routines in data mining algorithms,
allowing for the exact mining of truly massive real world datasets, containing tens of millions of time series.
3.
T. Warren Liao 《Pattern recognition》2005,38(11):1857-1874
Time series clustering has been shown effective in providing useful information in various domains. There seems to be an increased interest in time series clustering as part of the effort in temporal data mining research. To provide an overview, this paper surveys and summarizes previous works that investigated the clustering of time series data in various application domains. The basics of time series clustering are presented, including general-purpose clustering algorithms commonly used in time series clustering studies, the criteria for evaluating the performance of the clustering results, and the measures to determine the similarity/dissimilarity between two time series being compared, either in the forms of raw data, extracted features, or some model parameters. Past research is organized into three groups depending upon whether it works directly with the raw data either in the time or frequency domain, indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The uniqueness and limitations of previous research are discussed and several possible topics for future research are identified. Moreover, the areas that time series clustering has been applied to are also summarized, including the sources of data used. It is hoped that this review will serve as a stepping stone for those interested in advancing this area of research.
4.
Experiencing SAX: a novel symbolic representation of time series (total citations: 15; self-citations: 3; citations by others: 15)
Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi 《Data mining and knowledge discovery》2007,15(2):107-144
Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets,
eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series,
noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms
from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced
over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the
same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance
measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures
defined on the original time series.
In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity
reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance
measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it
allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical
results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation
on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
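The idea of a symbolic representation with dimensionality reduction, as described in the SAX abstract above, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sax`, the use of equiprobable Gaussian breakpoints, and the requirement that the series length be divisible by the number of segments are all assumptions made for this example.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet_size):
    """Convert a numeric series into a short symbolic word (SAX-style sketch).

    Assumptions for this sketch: len(series) is divisible by n_segments and
    alphabet_size is between 2 and 26.
    """
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26")

    # Step 1: z-normalize so the values are roughly standard normal.
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # Step 2: piecewise aggregate approximation (mean of each equal-width segment).
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # Step 3: breakpoints that cut the standard normal into equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]

    # Step 4: map each segment mean to the letter of the region it falls in.
    word = []
    for value in paa:
        region = sum(value > b for b in breakpoints)
        word.append(chr(ord("a") + region))
    return "".join(word)
```

For example, `sax([1, 2, 3, 4, 5, 6, 7, 8], 4, 4)` reduces eight values to the four-letter word `"abcd"`.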
5.
Eamonn Keogh Jessica Lin Sang-Hee Lee Helga Van Herle 《Knowledge and Information Systems》2007,11(1):1-27
In this work we introduce the new problem of finding time series discords. Time series discords are subsequences of longer time series that are maximally different to all the rest of the time series subsequences. They thus capture the sense of the most unusual subsequence within a time series. While discords have many uses for data mining, they are particularly attractive as anomaly detectors because they only require one intuitive parameter (the length of the subsequence) unlike most anomaly detection algorithms that typically require many parameters. While the brute force algorithm to discover time series discords is quadratic in the length of the time series, we show a simple algorithm that is three to four orders of magnitude faster than brute force, while guaranteed to produce identical results. We evaluate our work with a comprehensive set of experiments on diverse data sources including electrocardiograms, space telemetry, respiration physiology, anthropological and video datasets.
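The quadratic brute-force search mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: "maximally different" is interpreted here as having the largest distance to the nearest non-overlapping subsequence, distances are plain Euclidean on raw values (no normalization), and the function name `brute_force_discord` is hypothetical. The faster algorithm the abstract refers to is not reproduced here.

```python
import math


def brute_force_discord(series, m):
    """Return (start_index, distance) of the length-m subsequence whose nearest
    non-overlapping neighbor is farthest away (quadratic brute-force sketch)."""
    n = len(series) - m + 1
    best_pos, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            # Skip trivial matches: subsequences that overlap the candidate.
            if abs(i - j) < m:
                continue
            d = math.dist(series[i:i + m], series[j:j + m])
            nearest = min(nearest, d)
        # Keep the candidate whose nearest neighbor is the farthest so far.
        if nearest != math.inf and nearest > best_dist:
            best_pos, best_dist = i, nearest
    return best_pos, best_dist
```

On a periodic series with one injected anomaly, such as `[0, 1] * 6 + [9, 9] + [0, 1] * 6` with `m = 2`, the sketch reports the subsequence starting at index 12, which is the injected pair.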
Eamonn Keogh is an Assistant Professor of computer science at the University of California, Riverside. His research interests include data mining, machine learning and information retrieval. Several of his papers have won best paper awards, including papers at SIGKDD and SIGMOD. Dr. Keogh is the recipient of a 5-year NSF Career Award for “Efficient discovery of previously unknown patterns and relationships in massive time series databases.”
Jessica Lin is an Assistant Professor of information and software engineering at George Mason University. She received her Ph.D. from the University of California, Riverside. Her research interests include data mining and information retrieval.
Sang-Hee Lee is a paleoanthropologist at the University of California, Riverside. Her research interests include the evolution of human morphological variation and how different mechanisms (such as taxonomy, sex, age, and time) explain what is observed in fossil data. Dr. Lee obtained her Ph.D. in anthropology from the University of Michigan in 1999.
Helga Van Herle is an Assistant Clinical Professor of medicine at the Division of Cardiology of the Geffen School of Medicine at UCLA. She received her M.D. from UCLA in 1993; completed her residency in internal medicine at the New York Hospital (Cornell University, 1993–1996) and her cardiology fellowship at UCLA (1997–2001). Dr. Van Herle holds an M.Sc. in bioengineering from Columbia University (1987) and a B.Sc. in Chemical Engineering from UCLA (1985).
6.
Instance selection aims at filtering out noisy data (or outliers) from a given training set, which not only reduces the need for storage space, but can also ensure that the classifier trained by the reduced set provides similar or better performance than the baseline classifier trained by the original set. However, since there are numerous instance selection algorithms, there is no concrete winner that is the best for various problem domain datasets. In other words, the instance selection performance is algorithm and dataset dependent. One main reason for this is that it is very hard to define what the outliers are over different datasets. It should be noted that, using a specific instance selection algorithm, over-selection may occur by filtering out too many ‘good’ data samples, which leads to the classifier providing worse performance than the baseline. In this paper, we introduce a dual classification (DuC) approach, which aims to deal with the potential drawback of over-selection. Specifically, after performing instance selection over a given training set, two classifiers are trained using the ‘good’ and ‘noisy’ sets, respectively, identified by the instance selection algorithm. Then, a test sample is used to compare the similarities between the data in the good and noisy sets. This comparison guides the input of the test sample to one of the two classifiers. The experiments are conducted using 50 small scale and 4 large scale datasets and the results demonstrate the superior performance of the proposed DuC approach over the baseline instance selection approach.
7.
《Expert systems with applications》2014,41(14):6524-6535
The traffic density situation in a traffic network, especially traffic congestion, exhibits characteristics similar to thermodynamic heat conduction, e.g., the traffic congestion in one section can be conducted to other adjacent sections of the traffic network sequentially. Analyzing this conduction facilitates the forecasting of future traffic situations, so that a navigation system can reduce traffic congestion and improve transportation mobility. This study describes a methodology for traffic conduction analysis modeling based on extracting important time-related conduction rules using a type of evolutionary algorithm named Genetic Network Programming (GNP). The extracted rules construct a useful model for forecasting future traffic situations and analyzing traffic conduction. The proposed methodology was implemented and experimentally evaluated using a large scale real-time traffic simulator, SOUND/4U.
8.
Nicu Sebe Ira Cohen Fabio G. Cozman Theo Gevers Thomas S. Huang 《Multimedia Systems》2005,10(6):484-498
Human–computer interaction (HCI) lies at the crossroads of many scientific areas including artificial intelligence, computer
vision, face recognition, motion tracking, etc. It is argued that to truly achieve effective human–computer intelligent interaction,
the computer should be able to interact naturally with the user, similar to the way human–human interaction takes place. In this paper, we discuss
training probabilistic classifiers with labeled and unlabeled data for HCI applications. We provide an analysis that shows
under what conditions unlabeled data can be used in learning to improve classification performance, and we investigate the
implications of this analysis to a specific type of probabilistic classifiers, Bayesian networks. Finally, we show how the
resulting algorithms are successfully employed in facial expression recognition, face detection, and skin detection.
9.
Roland Fried 《Computational statistics & data analysis》2007,52(2):1063-1074
Abrupt shifts in the level of a time series represent important information and should be preserved in statistical signal extraction. Various rules for detecting level shifts that are resistant to outliers and which work with only a short time delay are investigated. The properties of robustified versions of the t-test for two independent samples and its non-parametric alternatives are elaborated under different types of noise. Trimmed t-tests, median comparisons, robustified rank and ANOVA tests based on robust scale estimators are compared.
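A median comparison of the kind listed in the abstract above can be illustrated with a short sketch. The statistic below is an assumption chosen for illustration and is not taken from the paper: it divides the difference of the two window medians by a pooled median absolute deviation, and the function name `median_shift_statistic` is hypothetical.

```python
import math
from statistics import median


def median_shift_statistic(left, right):
    """Robust two-window statistic: |median(right) - median(left)| divided by
    the pooled median absolute deviation of both windows (illustrative sketch)."""
    m_left, m_right = median(left), median(right)
    # Absolute residuals around each window's own median, pooled together.
    residuals = ([abs(x - m_left) for x in left]
                 + [abs(x - m_right) for x in right])
    scale = median(residuals)
    if scale == 0:
        # Degenerate case: no spread at all in either window.
        return 0.0 if m_left == m_right else math.inf
    return abs(m_right - m_left) / scale
```

With `left = [1, 2, 3, 2, 100]` and `right = [11, 12, 13, 12, 11]` the statistic is 10.0: the outlier 100 barely affects either the medians or the scale estimate, which is the resistance property the abstract emphasizes. A detection rule would compare the statistic with a threshold, and the choice of threshold is left open here.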
10.
An active research topic in data mining is the discovery of sequential patterns, which finds all frequent subsequences in a sequence database. The generalized sequential pattern (GSP) algorithm was proposed to solve the mining of sequential patterns with time constraints, such as time gaps and sliding time windows. Recent studies indicate that the pattern-growth methodology could speed up sequence mining. However, the capabilities to mine sequential patterns with time constraints were previously available only within the Apriori framework. Therefore, we propose the DELISP (delimited sequential pattern) approach to provide the capabilities within the pattern-growth methodology. DELISP reduces the size of projected databases through bounded and windowed projection techniques. Bounded projection keeps only time-gap valid subsequences and windowed projection saves nonredundant subsequences satisfying the sliding time-window constraint. Furthermore, the delimited growth technique directly generates constraint-satisfying patterns and speeds up the pattern growing process. The comprehensive experiments conducted show that DELISP has good scalability and outperforms the well-known GSP algorithm in the discovery of sequential patterns with time constraints.
11.
12.
《Expert systems with applications》2014,41(7):3402-3408
Currently, there is an increased interest in time series clustering research, particularly for finding useful similar time series in various applied areas such as speech recognition, environmental research, finance and medical imaging. Clustering and classification of time series has the potential to analyze large volumes of data. Most of the traditional time series clustering and classification algorithms deal only with univariate time series data. In this paper, we develop an unsupervised learning algorithm for bivariate time series. The initial clusters are found using the K-means algorithm and the model parameters are estimated using the EM algorithm. The learning algorithm is developed by utilizing component maximum likelihood and the Bayesian Information Criterion (BIC). The performance of the developed algorithm is evaluated using real time data collected from a pollution centre. A comparative study of the proposed algorithm is made with an existing data mining algorithm that uses a univariate autoregressive process of order 1 (AR(1)) model. It is observed that the proposed algorithm outperforms the existing algorithms.
13.
Zhen He X. Sean Wang Byung Suk Lee Alan C. H. Ling 《Knowledge and Information Systems》2008,15(1):31-54
Recently, periodic pattern mining from time series data has been studied extensively. However, an interesting type of periodic
pattern, called partial periodic (PP) correlation in this paper, has not been investigated. An example of PP correlation is
that power consumption is high either on Monday or Tuesday but not on both days. In general, a PP correlation is a set of
offsets within a particular period such that the data at these offsets are correlated with a certain user-desired strength.
In the above example, the period is a week (7 days), and each day of the week is an offset of the period. PP correlations
can provide insightful knowledge about the time series and can be used for predicting future values. This paper introduces
an algorithm to mine time series for PP correlations based on the principal component analysis (PCA) method. Specifically,
given a period, the algorithm maps the time series data to data points in a multidimensional space, where the dimensions correspond
to the offsets within the period. A PP correlation is then equivalent to correlation of data when projected to a subset of
the dimensions. The algorithm discovers, with one sequential scan of data, all those PP correlations (called minimum PP correlations)
that are not unions of some other PP correlations. Experiments using both real and synthetic data sets show that the PCA-based
algorithm is highly efficient and effective in finding the minimum PP correlations.
Zhen He is a lecturer in the Department of Computer Science at La Trobe University. His main research areas are database systems
optimization, time series mining, wireless sensor networks, and XML information retrieval. Prior to joining La Trobe University,
he worked as a postdoctoral research associate in the University of Vermont. He holds Bachelors, Honors and Ph.D degrees in
Computer Science from the Australian National University.
X. Sean Wang received his Ph.D degree in Computer Science from the University of Southern California in 1992. He is currently the Dorothean
Chair Professor in Computer Science at the University of Vermont. He has published widely in the general area of databases
and information security, and was a recipient of the US National Science Foundation Research Initiation and CAREER awards.
His research interests include database systems, information security, data mining, and sensor data processing.
Byung Suk Lee is associate professor of Computer Science at the University of Vermont. His main research areas are database systems, data
modeling, and information retrieval. He held positions in industry and academia: Gold Star Electric, Bell Communications Research,
Datacom Global Communications, University of St. Thomas, and currently University of Vermont. He was also a visiting professor
at Dartmouth College and a participating guest at Lawrence Livermore National Laboratory. He served on international conferences
as a program committee member, a publicity chair, and a special session organizer, and also on US federal funding proposal
review panel. He holds a BS degree from Seoul National University, MS from Korea Advanced Institute of Science and Technology,
and Ph.D from Stanford University.
Alan C. H. Ling is an assistant professor at Department of Computer Science in University of Vermont. His research interests include combinatorial
design theory, coding theory, sequence designs, and applications of design theory.
14.
A review on time series data mining (total citations: 5; self-citations: 0; citations by others: 5)
Tak-chung Fu 《Engineering Applications of Artificial Intelligence》2011,24(1):164-181
Time series is an important class of temporal data objects and it can be easily obtained from scientific and financial applications. A time series is a collection of observations made chronologically. The nature of time series data includes large data size, high dimensionality, and the need for continuous updating. Moreover, time series data, which is characterized by its numerical and continuous nature, is always considered as a whole instead of as individual numerical fields. The increasing use of time series data has initiated a great deal of research and development attempts in the field of data mining. The abundant research on time series data mining in the last decade could hamper the entry of interested researchers, due to its complexity. In this paper, a comprehensive review of the existing time series data mining research is given. The work is generally categorized into representation and indexing, similarity measure, segmentation, visualization and mining. Moreover, state-of-the-art research issues are also highlighted. The primary objective of this paper is to serve as a glossary for interested researchers to have an overall picture of the current time series data mining development and identify potential research directions for further investigation.
15.
Lucia Sacchi Cristiana Larizza Carlo Combi Riccardo Bellazzi 《Data mining and knowledge discovery》2007,15(2):217-247
A large volume of research in temporal data mining is focusing on discovering temporal rules from time-stamped data. The majority
of the methods proposed so far have been mainly devoted to the mining of temporal rules which describe relationships between
data sequences or instantaneous events and do not consider the presence of complex temporal patterns in the dataset. Such
complex patterns, such as trends or up and down behaviors, are often very interesting for the users. In this paper we propose
a new kind of temporal association rule and the related extraction algorithm; the learned rules involve complex temporal patterns
in both their antecedent and consequent. Within our proposed approach, the user defines a set of complex patterns of interest
that constitute the basis for the construction of the temporal rule; such complex patterns are represented and retrieved in
the data through the formalism of knowledge-based Temporal Abstractions. An Apriori-like algorithm then looks for meaningful
temporal relationships (in particular, precedence temporal relationships) among the complex patterns of interest. The paper
presents the results obtained by the rule extraction algorithm on a simulated dataset and on two different datasets related
to biomedical applications: the first one concerns the analysis of time series coming from the monitoring of different clinical
variables during hemodialysis sessions, while the other one deals with the biological problem of inferring relationships between
genes from DNA microarray data.
16.
Young-Seon Jeong Olufemi A. Omitaomu 《Pattern recognition》2011,44(9):2231-2240
Dynamic time warping (DTW), which finds the minimum path by providing non-linear alignments between two time series, has been widely used as a distance measure for time series classification and clustering. However, DTW does not account for the relative importance regarding the phase difference between a reference point and a testing point. This may lead to misclassification especially in applications where the shape similarity between two sequences is a major consideration for an accurate recognition. Therefore, we propose a novel distance measure, called a weighted DTW (WDTW), which is a penalty-based DTW. Our approach penalizes points with higher phase difference between a reference point and a testing point in order to prevent minimum distance distortion caused by outliers. The rationale underlying the proposed distance measure is demonstrated with some illustrative examples. A new weight function, called the modified logistic weight function (MLWF), is also proposed to systematically assign weights as a function of the phase difference between a reference point and a testing point. By applying different weights to adjacent points, the proposed algorithm can enhance the detection of similarity between two time series. We show that some popular distance measures such as DTW and Euclidean distance are special cases of our proposed WDTW measure. We extend the proposed idea to other variants of DTW such as derivative dynamic time warping (DDTW) and propose the weighted version of DDTW. We have compared the performances of our proposed procedures with other popular approaches using public data sets available through the UCR Time Series Data Mining Archive for both time series classification and clustering problems. The experimental results indicate that the proposed approaches can achieve improved accuracy for time series classification and clustering problems.
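The penalty idea in the abstract above can be sketched as a standard DTW recursion whose local cost is scaled by a weight that grows with the phase difference |i - j|. The abstract does not give the weight formula, so the logistic form used below, the parameters `g` and `w_max`, the squared local cost, and the function name `wdtw` are all assumptions made for illustration.

```python
import math


def wdtw(a, b, g=0.05, w_max=1.0):
    """Weighted DTW sketch: squared local cost scaled by a logistic weight
    that increases with the phase difference |i - j| (assumed weight form)."""
    n, m = len(a), len(b)
    longest = max(n, m)
    mid = longest / 2.0
    # One weight per possible phase difference 0 .. longest - 1.
    weights = [w_max / (1.0 + math.exp(-g * (k - mid))) for k in range(longest)]

    # Standard DTW dynamic-programming table with an extra border row/column.
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = weights[abs(i - j)] * (a[i - 1] - b[j - 1]) ** 2
            table[i][j] = cost + min(table[i - 1][j],      # step in a only
                                     table[i][j - 1],      # step in b only
                                     table[i - 1][j - 1])  # step in both
    return table[n][m]
```

With `g = 0` every weight equals `w_max / 2`, so the result is a constant multiple of ordinary DTW with squared cost, which mirrors the abstract's remark that DTW is a special case. Larger values of `g` penalize alignments between points that are far apart in time more heavily.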
17.
18.
19.
Ujjwal Maulik 《Pattern recognition》2011,44(3):615-623
In this article, we present a semisupervised support vector machine that uses a self-training approach. We then construct an ensemble of semisupervised SVM classifiers to address the problem of pixel classification of remote sensing images. Semisupervised support vector machines (S3VMs) are based on applying the margin maximization principle to both labeled and unlabeled samples. The ensemble of SVM classifiers recognizes the conceptual similarity between component classifiers from the same data source. The effectiveness of the proposed technique is first demonstrated for two numeric remote sensing datasets described in terms of feature vectors and then for identifying different land cover regions in remote sensing imagery. Experimental results on these datasets show that employing this learning scheme can increase the accuracy level. The performance of the ensemble is compared with one of its component classifiers and a conventional SVM in terms of accuracy and quantitative cluster validity indices.
20.
《Expert systems with applications》2014,41(14):6098-6105
Streaming time series segmentation is one of the major problems in streaming time series mining. It creates a high-level representation of streaming time series, and thus provides important support for many time series mining tasks, such as indexing, clustering, classification, and discord discovery. However, the data elements in streaming time series, which usually arrive online, are fast-changing and unbounded in size, which places a higher requirement on the computing efficiency of time series segmentation. Thus, segmenting streaming time series accurately under the constraint of computing efficiency is a challenging task. In this paper, we propose an exponential smoothing prediction-based segmentation algorithm (ESPSA). The proposed algorithm is developed based on a sliding window model, and uses the typical exponential smoothing method to calculate the smoothing value of the arrived data element of the streaming time series as the prediction value of the future data. In addition, to determine whether a data element is a segmenting key point, we study the statistical characteristics of the prediction error and then deduce the relationship between the prediction error and the compression rate. Extensive experiments on both synthetic and real datasets demonstrate that the proposed algorithm can segment streaming time series effectively and efficiently. More importantly, compared with candidate algorithms, the proposed algorithm can reduce the computing time by orders of magnitude.
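The general idea of prediction-based segmentation described in the abstract above can be sketched as follows. This is not ESPSA itself: the fixed error `threshold`, the smoothing factor `alpha`, the restart of the smoother at each key point, and the function name `segment_stream` are assumptions for illustration, whereas the paper derives its criterion from the statistics of the prediction error.

```python
def segment_stream(stream, alpha=0.5, threshold=3.0):
    """Return the indices where the one-step exponential-smoothing prediction
    misses the arriving value by more than `threshold` (illustrative sketch)."""
    key_points = []
    smoothed = None
    for t, x in enumerate(stream):
        if smoothed is None:
            # The first element initializes the smoother; nothing to predict yet.
            smoothed = x
            continue
        # The current smoothed value serves as the prediction for x.
        if abs(x - smoothed) > threshold:
            key_points.append(t)
            smoothed = x  # restart smoothing at the start of the new segment
        else:
            smoothed = alpha * x + (1 - alpha) * smoothed
    return key_points
```

For the stream `[1, 1, 1, 1, 10, 10, 10, 10, 1, 1]` with the default parameters, the sketch marks indices 4 and 8 as segment boundaries. Because it processes one element at a time and keeps only a single smoothed value, it works on unbounded iterators as well as lists.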