1.
Little work has been reported in the literature to support k-nearest neighbor (k-NN) searches/queries in hybrid data spaces (HDS). An HDS is composed of a combination of continuous and non-ordered discrete dimensions. This combination presents new challenges in data organization and search ordering. In this paper, we present an algorithm for k-NN searches using a multidimensional index structure in hybrid data spaces. We examine the concept of search stages and use the properties of an HDS to derive a new search heuristic that greatly reduces the number of disk accesses in the initial stage of searching. Further, we present a performance model for our algorithm that estimates the cost of performing such searches. Our experimental results demonstrate the effectiveness of our algorithm and the accuracy of our performance estimation model.
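To make the notion of a hybrid data space concrete, here is a minimal brute-force k-NN sketch in Python. It is not the index-based algorithm of the paper: the abstract does not state the distance measure, so the function `hybrid_distance` below assumes (purely for illustration) squared Euclidean distance on the continuous dimensions plus a Hamming-style mismatch count on the non-ordered discrete dimensions. The names `hybrid_distance`, `knn`, and `n_continuous` are hypothetical.

```python
import heapq
import math


def hybrid_distance(a, b, n_continuous):
    """Distance between two hybrid vectors (illustrative assumption).

    The first `n_continuous` components are treated as continuous and
    compared by squared difference; the remaining components are treated
    as non-ordered discrete values and contribute 1 per mismatch.
    """
    continuous = sum((x - y) ** 2 for x, y in zip(a[:n_continuous], b[:n_continuous]))
    discrete = sum(1 for x, y in zip(a[n_continuous:], b[n_continuous:]) if x != y)
    return math.sqrt(continuous + discrete)


def knn(query, points, k, n_continuous):
    """Return the k points closest to `query` by exhaustive scan.

    An index-based search would aim to return the same answer while
    touching far fewer points (and disk pages) than this scan does.
    """
    return heapq.nsmallest(
        k, points, key=lambda p: hybrid_distance(query, p, n_continuous)
    )
```

A scan like this is mainly useful as a correctness baseline against which an index-based search can be checked.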
2.
A review on time series data mining (total citations: 5, self-citations: 0, citations by others: 5)
Tak-chung Fu 《Engineering Applications of Artificial Intelligence》2011,24(1):164-181
Time series are an important class of temporal data objects and can easily be obtained from scientific and financial applications. A time series is a collection of observations made chronologically. Time series data are typically large in size, high in dimensionality, and in need of continuous updating. Moreover, time series data, which are numerical and continuous in nature, are always considered as a whole rather than as individual numerical fields. The increasing use of time series data has prompted a great deal of research and development in the field of data mining. The abundance of research on time series data mining in the last decade could hamper the entry of interested researchers because of its complexity. In this paper, a comprehensive review of existing time series data mining research is given. The work is generally categorized into representation and indexing, similarity measures, segmentation, visualization, and mining. State-of-the-art research issues are also highlighted. The primary objective of this paper is to serve as a glossary that gives interested researchers an overall picture of current time series data mining development and helps them identify potential research directions for further investigation.
3.
We introduce a new representation for time series, the Multiresolution Vector Quantized (MVQ) approximation, along with a distance function. Similar to Discrete Wavelet Transform, MVQ keeps both local and global information about the data. However, instead of keeping low-level time series values, it maintains high-level feature information (key subsequences), facilitating the introduction of more meaningful similarity measures. The method is fast and scales linearly with the database size and dimensionality. Contrary to previous methods, the vast majority of which use the Euclidean distance, MVQ uses a multiresolution/hierarchical distance function. In our experiments, the proposed technique consistently outperforms the other major methods.
4.
A compact multi-resolution index for variable length queries in time series databases (total citations: 1, self-citations: 1, citations by others: 0)
We study the problem of searching for similar patterns in time series data for variable length queries. Recently, a multi-resolution indexing technique (MRI) was proposed in (Kahveci and Singh, in proceedings of the international conference on data engineering, pp. 273–282, 2001; Kahveci and Singh, IEEE Trans Knowl Data Eng 16(4):418–433, 2004) to address this problem, which uses compression as an additional step to reduce the index size. In this paper, we propose an alternative technique, called compact MRI (CMRI), which uses the adaptive piecewise constant approximation (APCA) representation as a dimensionality reduction technique and occupies much less space without requiring compression. We implemented both MRI and CMRI and conducted extensive experiments to evaluate and compare their performance on real stock data as well as synthetic data. Our results indicate that CMRI provides much better precision, ranging from 0.75 to 0.89 on real data and from 0.80 to 0.95 on synthetic data, while for MRI these ranges are from 0.16 to 0.34 and from 0.03 to 0.65, respectively. Compared to sequential scan, we found that CMRI is 4–30 times faster and that the number of disk I/Os it requires is close to minimal. In terms of storage utilization, CMRI occupies 1% of the memory occupied by MRI. These results and analysis show CMRI to be an efficient and scalable indexing technique for large time series databases.
5.
Francesco Gullo Andrea Tagarelli Sergio Greco 《Pattern recognition》2009,42(11):2998-3014
Similarity search and detection is a central problem in time series data processing and management. Most approaches to this problem have been developed around the notion of dynamic time warping, whereas several dimensionality reduction techniques have been proposed to improve the efficiency of similarity searches. Given the continuous growth in sources of time series data and the critical nature of the real-world applications that use such data, we believe there is a pressing demand for similarity detection in time series that is both accurate and fast. Our proposal is to define a concise yet feature-rich representation of time series, on which dynamic time warping can be applied for effective and efficient similarity detection. We present the Derivative time series Segment Approximation (DSA) representation model, which combines derivative estimation, segmentation, and segment approximation to provide both high sensitivity in capturing the main trends of time series and data compression. We extensively compare DSA with state-of-the-art similarity methods and dimensionality reduction techniques in clustering and classification frameworks. Experimental evidence from effectiveness and efficiency tests on various datasets shows that DSA is well suited to support both accurate and fast similarity detection.
6.
Dynamic time warping (DTW) is a powerful technique for time-series similarity search. However, its performance on large-scale data is unsatisfactory because of its high computational cost and the fact that it cannot be indexed directly. The lower bound technique for DTW is an effective solution to this problem. In this paper, we explain the existing lower-bound functions from a unified perspective and show that they are only special cases under our framework. We then propose a group of lower-bound functions for DTW and compare their performances through extensive experiments. The experimental results show that the new methods are better than the existing ones in most cases, and a theoretical explanation of the results is also given. We further implement an index structure based on the new lower-bound function. Experimental results demonstrate a similar conclusion.
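As a rough illustration of why lower bounds help, the Python sketch below pairs a band-constrained DTW with the well-known LB_Keogh envelope bound. It does not reproduce the new lower-bound functions proposed in the paper, which the abstract does not specify. The function names, the squared-difference cost, and the Sakoe-Chiba band parameter `window` are assumptions made for this example; the bound is only claimed for equal-length series and the same `window` in both functions.

```python
import math


def dtw(a, b, window):
    """DTW distance with a Sakoe-Chiba band of half-width `window`.

    Uses squared differences as the local cost and returns the square
    root of the accumulated cost along the best warping path.
    """
    n, m = len(a), len(b)
    w = max(window, abs(n - m))  # the band must be wide enough to reach (n, m)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def lb_keogh(query, candidate, window):
    """LB_Keogh lower bound on dtw(query, candidate, window).

    Builds an upper/lower envelope around `query` and sums the squared
    amounts by which `candidate` falls outside that envelope.
    """
    if len(query) != len(candidate):
        raise ValueError("LB_Keogh as sketched here needs equal-length series")
    total = 0.0
    for i, c in enumerate(candidate):
        segment = query[max(0, i - window): i + window + 1]
        upper, lower = max(segment), min(segment)
        if c > upper:
            total += (c - upper) ** 2
        elif c < lower:
            total += (c - lower) ** 2
    return math.sqrt(total)
```

In a similarity search, a candidate whose `lb_keogh` value already exceeds the best DTW distance found so far can be discarded without running the quadratic-time `dtw` computation.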
7.
Optimal algorithms for the online time series search problem (total citations: 1, self-citations: 0, citations by others: 1)
In the problem of online time series search introduced by El-Yaniv et al. (2001) [1], a player observes prices one by one over time and must select exactly one of the prices on its arrival without knowledge of future prices, aiming to maximize the selected price. In this paper, we extend the problem by introducing a profit function. Considering two cases where the search duration is either known or unknown beforehand, we propose two optimal deterministic algorithms respectively. The models and results in this paper generalize those of El-Yaniv et al. (2001) [1].
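For orientation, the baseline problem without a profit function has a classical deterministic solution, the reservation price policy: if prices are known to lie in [m, M], accept the first price that is at least sqrt(m·M), which gives a competitive ratio of sqrt(M/m). The Python sketch below shows only this baseline, not the two algorithms proposed in the paper; the function names and the convention that the last price must be accepted when nothing reaches the threshold are assumptions of this example.

```python
import math


def reservation_price(m, M):
    """Threshold of the classical reservation price policy for prices in [m, M]."""
    return math.sqrt(m * M)


def online_search(prices, m, M):
    """Select one price online, seeing prices one at a time.

    Accepts the first price at or above the reservation price; if none
    qualifies, the player is forced to take the final price.
    """
    threshold = reservation_price(m, M)
    for price in prices:
        if price >= threshold:
            return price
    return prices[-1]
```

With m = 10 and M = 40 the threshold is 20, so the sequence 12, 15, 25, 40 yields 25; the best price in hindsight is 40, and the ratio 40/25 stays within sqrt(M/m) = 2.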
8.
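Time series patterns exist in many fields, and representing, storing, and querying them is one of the important tasks of time series data mining. We analyze the characteristics of earthquake precursor time series patterns and represent them in the semi-structured language XML using piecewise linear representation. On this basis, we propose methods for storing and querying earthquake precursor time series patterns in three environments: Java, PL/SQL, and the command line. These methods maintain the efficiency of pattern storage and querying while supporting pattern processing on different platforms, and thereby further serve earthquake prediction.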
9.
Time series data mining (TSDM) techniques permit exploring large amounts of time series data in search of consistent patterns and/or interesting relationships between variables. TSDM is becoming increasingly important as a knowledge management tool, where it is expected to reveal knowledge structures that can guide decision making in conditions of limited certainty. Human decision making in problems related to the analysis of time series databases is usually based on perceptions like “end of the day”, “high temperature”, “quickly increasing”, “possible”, etc. Though many effective TSDM algorithms have been developed, the integration of TSDM algorithms with human decision making procedures is still an open problem. In this paper, we consider the architecture of a perception-based decision making system for time series database domains that integrates perception-based TSDM, computing with words and perceptions, and expert knowledge. The new tasks that perception-based TSDM methods must solve to enable their integration in such systems are discussed. These tasks include precisiation of perceptions, shape pattern identification, and pattern retranslation. We show how different methods developed so far in TSDM for manipulating perception-based information can be used to develop a fuzzy perception-based TSDM approach. This approach is grounded in computing with words and perceptions, which permits formalizing human perception-based inference mechanisms. The discussion is illustrated by examples from economics, finance, meteorology, medicine, etc.
10.
A sieve bootstrap procedure for constructing interpolation intervals for a general class of linear processes is proposed. This sieve bootstrap provides consistent estimators of the conditional distribution of the missing values, given the observed data. A Monte Carlo experiment is used to show the finite sample properties of the sieve bootstrap and finally, the performance of the proposed method is illustrated with a real data example.
11.
Towards the evaluation of time series protection methods (total citations: 1, self-citations: 0, citations by others: 1)
The goal of statistical disclosure control (SDC) is to modify statistical data so that it can be published without releasing confidential information that may be linked to specific respondents. The challenge for SDC is to achieve this modification with minimum loss of the detail and accuracy sought by final users. There are many approaches to evaluating the quality of a protection method. However, all these measures are only applicable to numerical or categorical attributes. In this paper, we present some recent results about time series protection and re-identification. We propose a complete framework to evaluate time series protection methods. We also present some empirical results to show how our framework works.
12.
Spatial reasoning and similarity retrieval are two important functions of any image information system. Good spatial knowledge representation for images is necessary to adequately support these two functions. In this paper, we propose a new spatial knowledge representation, called the SK-set based on morphological skeleton theories. Spatial reasoning algorithms which achieve more accurate results by directly analysing skeletons are described. SK-set facilitates browsing and progressive visualization. We also define four new types of similarity measures and propose a similarity retrieval algorithm for performing image retrieval. Moreover, using SK-set as a spatial knowledge representation will reduce the storage space required by an image database significantly.
13.
14.
We describe a new multi-phase, color-based image retrieval system (FOCUS) which is capable of identifying multi-colored query objects in an image in the presence of significant, interfering backgrounds. The query object may occur in arbitrary sizes, orientations, and locations in the database images. Scale and rotation invariant color features have been developed to describe an image, such that the matching process is fast even in the case of complex images. The first phase of processing matches the query object color with the color content of an image computed as the peaks in the color histogram of the image. The second phase matches the spatial relationships between color regions in the image with the query using a spatial proximity graph (SPG) structure designed for the purpose. Processing at coarse granularity is preferred over pixel-level processing to produce simpler graphs, which significantly reduces computation time during matching. The speed of the system and the small storage overhead make it suitable for use in large databases with online user interfaces. Test results with multi-colored query objects from man-made and natural domains show that FOCUS is quite effective in handling interfering backgrounds and large variations in scale. The experimental results on a database of diverse images highlight the capabilities of the system.
15.
Efficient query filtering for streaming time series with applications to semisupervised learning of time series classifiers (total citations: 1, self-citations: 1, citations by others: 1)
Li Wei Eamonn Keogh Helga Van Herle Agenor Mafra-Neto Russell J. Abbott 《Knowledge and Information Systems》2007,11(3):313-344
In this paper, we define time series query filtering, the problem of monitoring the streaming time series for a set of predefined patterns. This problem is of great practical importance given the massive volume of streaming time series available through sensors, medical patient records, financial indices and space telemetry. Since the data may arrive at a high rate and the number of predefined patterns can be relatively large, it may be impossible for the comparison algorithm to keep up. We propose a novel technique that exploits the commonality among the predefined patterns to allow monitoring at higher bandwidths, while maintaining a guarantee of no false dismissals. Our approach is based on the widely used envelope-based lower-bounding technique. As we will demonstrate on extensive experiments in diverse domains, our approach achieves tremendous improvements in performance in the offline case, and significant improvements in the fastest possible arrival rate of the data stream that can be processed with guaranteed no false dismissals. As a further demonstration of the utility of our approach, we demonstrate that it can make semisupervised learning of time series classifiers tractable.
Li Wei is a Ph.D. candidate in the Department of Computer Science & Engineering at the University of California, Riverside. She received her B.S. and M.S. degrees from Fudan University, China. Her research interests include data mining and information retrieval.
Eamonn Keogh is an Assistant Professor of computer science at the University of California, Riverside. His research interests include data mining, machine learning and information retrieval. Several of his papers have won best paper awards, including papers at SIGKDD and SIGMOD. Dr. Keogh is the recipient of a 5-year NSF Career Award for “Efficient Discovery of Previously Unknown Patterns and Relationships in Massive Time Series Databases”.
Helga Van Herle is an Assistant Clinical Professor of medicine at the Division of Cardiology of the Geffen School of Medicine at UCLA. She received her M.D. from UCLA in 1993; completed her residency in internal medicine at the New York Hospital (Cornell University; 1993–1996) and her cardiology fellowship at UCLA (1997–2001). Dr. Van Herle holds an M.Sc. in bioengineering from Columbia University (1987) and a B.Sc. in chemical engineering from UCLA (1985).
Agenor Mafra-Neto, Ph.D., is the CEO of ISCA Technologies, Inc., in California and the founder of ISCA Technologies, LTDA, in Brazil. His research interests include the analysis of insect behavior and communication systems, the manipulation of insect behavior, and the automation of pest monitoring and pest control. Dr. Mafra-Neto is currently coordinating the deployment of area-wide smart sensor and effector networks to micromanage agricultural and public health pests in the field in an automatic fashion.
Russell J. Abbott is a Professor of computer science at California State University, Los Angeles, and a member of the staff at the Aerospace Corporation, El Segundo, CA. His primary interests are in the field of complex systems. He is currently organizing a workshop to bring together people working in the fields of complex systems and systems engineering.
16.
Po-Whei Huang Lipin Hsu Yan-Wei Su Phen-Lan Lin 《Journal of Visual Languages and Computing》2008,19(6):637-651
In this paper, we present a novel image representation method to capture information about spatial relationships between objects in a picture. Our method is more powerful than all previous methods in terms of accuracy, flexibility, and capability of discriminating pictures. In addition, our method provides different degrees of granularity for reasoning about directional relations in both 8- and 16-direction reference frames. In similarity retrieval, our system provides twelve types of similarity measures to support flexible matching between the query picture and the database pictures. Using a database containing 3600 pictures, we demonstrate the effectiveness of our image retrieval system. Experimental results show that a 97.8% precision rate can be achieved while maintaining a 62.5% recall rate, and that a 97.9% recall rate can be achieved while maintaining a 51.7% precision rate. On average, an 86.1% precision rate and an 81.2% recall rate can be achieved simultaneously if the threshold is set to 0.5 or 0.6. This performance is considered very good for an information retrieval system.
17.
In this paper, we propose a rotation-invariant spatial knowledge representation called RS-string. Then we present the string generation algorithm to automatically generate RS-strings for segmented pictures. We also propose the spatial reasoning and similarity retrieval algorithms based on RS-strings. The similarity retrieval algorithm is much more flexible than all previous 2D string representations because our approach can consider every possible view of a query picture. Thus the system does not require the user to provide a query picture which must have the same orientation as that of a database picture. Finally, we provide several examples to demonstrate the capabilities of spatial reasoning and similarity retrieval based on the RS-string representation.
18.
Estimating spatio-temporal patterns of agricultural productivity in fragmented landscapes using AVHRR NDVI time series (total citations: 5, self-citations: 0, citations by others: 5)
The characteristics of Normalized Difference Vegetation Index (NDVI) time series can be disaggregated into a set of quantitative metrics that may be used to derive information about vegetation phenology and land cover. In this paper, we examine the patterns observed in metrics calculated for a time series of 8 years over the southwest of Western Australia, an important crop and animal production area of Australia. Four analytical approaches were used: (1) calculation of temporal mean and standard deviation layers for selected metrics showing significant spatial variability; (2) classification based on temporal and spatial patterns of key NDVI metrics; (3) analysis of metrics for eight areas typical of climatic and production systems across the agricultural zone; and (4) development of relationships between total production and productivity, measured by dry sheep equivalents, and time integrated NDVI (TINDVI). Two metrics showed clear spatial patterns: the season duration based on the smooth curve produced seven zones based on increasing length of growing season, and TINDVI provided a set of classes characterized by differences in overall magnitude of response and differences in response in particular years. Frequency histograms of TINDVI could be grouped on the basis of a simple shape classification: tall and narrow with high, medium or low mean, indicating that most land is responsive agricultural cover with uniform seasonal conditions; or broad and short, indicating that land is of mixed cover type or that seasonal conditions are not spatially uniform. TINDVI showed a relationship to agricultural productivity that is dependent on the extent to which crop or total agricultural production was directly reduced by rainfall deficiency. TINDVI proved most sensitive to crop productivity for Statistical Local Areas (SLAs) having rainfall less than 600 mm, and in years when rainfall and crop production were highly correlated.
It is concluded that metrics from standardized NDVI time series could be routinely and transparently used for retrospective assessment of seasonal conditions and changes in vegetation responses and cover.
19.
In this paper, a computational method of forecasting based on fuzzy time series is developed to provide improved forecasting results in situations of higher uncertainty, where large fluctuations occur between values of consecutive years in the time series data and no trend or periodicity is visible. The proposed model is of order three and uses a time variant difference parameter on the current state to forecast the next state. The developed model has been tested on the historical student enrollments of the University of Alabama for comparison with the existing methods, and has been applied to forecasting the production of the lahi crop, a crop production system with higher uncertainty. The suitability of the developed model has been examined in comparison with the other models to show its superiority.
20.
The forecasting of real-world time series must deal, in particular, with unexpected values, commonly known as outliers. Outliers in time series can lead to unreliable modeling and poor forecasts. Therefore, identifying the future occurrence of outliers is an essential task in time series analysis to reduce the average forecasting error. The main goal of this work is to predict the occurrence of outliers in time series, based on the discovery of motifs. In this sense, motifs are those pattern sequences preceding certain data marked as anomalous by the proposed metaheuristic in a training set. Once the motifs are discovered, if data to be predicted are preceded by any of them, such data are identified as outliers and treated separately from the rest of the regular data. The forecasting of outlier occurrence has been added as an additional step in an existing time series forecasting algorithm (PSF), which was based on pattern sequence similarities. Robust statistical methods have been used to evaluate the accuracy of the proposed approach regarding the forecasting of both the occurrence of outliers and their corresponding values. Finally, the methodology has been tested on six electricity-related time series, in which most of the outliers were properly found and forecasted.