Similar Literature
Found 20 similar documents (search time: 15 ms)
1.
Many kinds of information are hidden in email data, such as the information being exchanged, the time of exchange, and the user IDs participating in the exchange. Analyzing email data can reveal valuable information about the social networks of a single user or multiple users, the topics being discussed, and so on. In this paper, we describe a novel approach for temporally analyzing the communication patterns embedded in email data based on time series segmentation. The approach computes egocentric communication patterns of a single user, as well as sociocentric communication patterns involving multiple users. Time series segmentation is used to uncover patterns that may span multiple time points and to study how these patterns change over time. To find egocentric patterns, the email communication of a user is represented as an item-set time series. An optimal segmentation of the item-set time series is constructed, from which patterns are extracted. To find sociocentric patterns, the email data is represented as an item-set group time series. Patterns involving multiple users are then extracted from an optimal segmentation of the item-set group time series. The proposed approach was applied to the Enron email data set and produced very promising results.
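The egocentric item-set representation in entry 1 can be illustrated with a toy sketch (not the paper's actual segmentation algorithm): each time point carries the set of contacts a user emailed that day, and adjacent points are greedily merged into segments while the sets stay similar. The Jaccard measure and the threshold value are illustrative assumptions.

```python
# Hypothetical sketch: segment an item-set time series by greedily merging
# adjacent time points whose contact sets remain similar. Not the paper's
# optimal-segmentation method; threshold and similarity are assumptions.

def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def segment_itemset_series(series, threshold=0.5):
    """Group consecutive item-sets whose similarity to the running union
    of the current segment stays above `threshold`."""
    segments = []
    current = [0]
    union = set(series[0])
    for i in range(1, len(series)):
        if jaccard(union, series[i]) >= threshold:
            current.append(i)
            union |= series[i]
        else:
            segments.append(current)
            current, union = [i], set(series[i])
    segments.append(current)
    return segments

# Daily contact sets for one user (egocentric view); names are made up.
series = [{"alice", "bob"}, {"alice", "bob", "carol"}, {"dave"}, {"dave", "erin"}]
print(segment_itemset_series(series))  # → [[0, 1], [2, 3]]
```

A segment boundary appears exactly where the contact set changes abruptly, which is the intuition behind mining communication patterns from the segmentation.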

2.
王玲  李泽中 《控制与决策》2024,39(2):568-576
In existing multivariate time series segmentation algorithms, the selection of breakpoints and the determination of the number of segments usually have to be carried out independently, which greatly increases computational complexity. To address this, an adaptive greedy Gaussian segmentation algorithm for multivariate time series is proposed. The algorithm interprets the data in each segment as independent samples drawn from distinct multivariate Gaussian distributions, thereby recasting segmentation as a covariance-regularized maximum likelihood estimation problem. To improve learning efficiency, a greedy search maximizes the likelihood of each segment to approximately locate the optimal breakpoints, and during the search an information gain criterion adaptively determines the optimal number of segments, so that breakpoint selection and segment-count determination no longer proceed independently, reducing computational complexity. Experiments on real data sets from several domains show that the proposed method outperforms traditional methods in both segmentation accuracy and running efficiency, and can effectively perform anomaly detection on multivariate time series.
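The Gaussian likelihood objective behind entry 2 can be sketched in one dimension: a single greedy split is placed where the summed per-segment Gaussian log-likelihood is maximal. The univariate simplification and the small variance floor (standing in for the covariance regularizer) are assumptions for illustration.

```python
import math

def gaussian_loglik(xs):
    """Log-likelihood of xs under a Gaussian fit by maximum likelihood;
    the small variance floor loosely mimics covariance regularization."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n + 1e-6
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_single_split(xs, min_len=2):
    """One greedy step: the breakpoint maximizing total segment likelihood."""
    return max(range(min_len, len(xs) - min_len + 1),
               key=lambda b: gaussian_loglik(xs[:b]) + gaussian_loglik(xs[b:]))

data = [0.1, -0.2, 0.0, 0.1, 5.0, 5.2, 4.9, 5.1]
print(best_single_split(data))  # → 4: the mean shift between regimes is found
```

The full greedy algorithm would repeat this step, adding breakpoints while an information-gain-style criterion still improves, which is how the number of segments is chosen adaptively.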

3.
Time series data, due to their numerical and continuous nature, are difficult to process, analyze, and mine. However, these tasks become easier when the data can be transformed into meaningful symbols. Most recent works on time series only address how to identify a given pattern from a time series and do not consider the problem of identifying a suitable set of time points for segmenting the time series in accordance with a given set of pattern templates (e.g., a set of technical patterns for stock analysis). Fixed-length segmentation is an oversimplified approach to this problem; a dynamic approach (with high controllability) is preferable, so that the time series can be segmented flexibly and effectively according to the needs of the users and the applications. Since this segmentation problem is an optimization problem and evolutionary computation is an appropriate tool to solve it, we propose an evolutionary time series segmentation algorithm. This approach allows a sizeable set of pattern templates to be generated for mining or query. In addition, defining similarity between time series (or time series segments) is of fundamental importance in fitness computation. By identifying the perceptually important points directly from the time domain, time series segments and templates of different lengths can be compared and intuitive pattern matching can be carried out in an effective and efficient manner. Encouraging experimental results are reported from tests that segment both artificial time series generated from combinations of pattern templates and the time series of selected Hong Kong stocks.
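Entry 3's fitness computation rests on perceptually important points (PIPs). A minimal sketch follows, using the common vertical-distance PIP variant; the paper may use a different distance definition.

```python
def find_pips(ys, k):
    """Perceptually important points: start with the two endpoints and
    repeatedly add the point with the greatest vertical distance to the
    chord through its neighbouring PIPs (a common PIP distance variant)."""
    pips = [0, len(ys) - 1]
    while len(pips) < k:
        best_i, best_d = None, -1.0
        for a, b in zip(pips, pips[1:]):
            for i in range(a + 1, b):
                # vertical distance from (i, ys[i]) to the chord a-b
                interp = ys[a] + (ys[b] - ys[a]) * (i - a) / (b - a)
                d = abs(ys[i] - interp)
                if d > best_d:
                    best_i, best_d = i, d
        pips.append(best_i)
        pips.sort()
    return pips

series = [0, 1, 4, 1, 0, -1, -4, -1, 0]
print(find_pips(series, 4))  # → [0, 2, 6, 8]: the peak and the trough
```

Because PIPs are extracted directly in the time domain, segments and templates of different lengths can be resampled to the same number of PIPs and compared directly, which is what makes the fitness computation cheap.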

4.
Time series data are ordered primarily by collection time, and adjacent data points in a time series are correlated; when users read time series data, they typically read a continuous span rather than a single record. Exploiting this locality property of time series, a time series index based on dynamic segmentation, DSI, is proposed. It segments the time series dynamically by setting difference values and difference levels, uses an interval tree to quickly look up segment blocks of different lengths, and applies a hierarchical clustering algorithm to refine the query result set. Experimental results show that the query efficiency of the DSI index is better than that of existing time series query indexes.

5.
Linear segmentation algorithms for time series based on important points achieve good fitting accuracy while preserving the global features of the series. However, traditional important-point-based segmentation algorithms require the user to specify parameters such as an error threshold; these parameters depend on the original data and are inconvenient to set, and both efficiency and fitting quality leave room for improvement. To address this, a segmentation algorithm based on time series important points, PLR_TSIP, is proposed. The method first considers both the overall fitting error and the series length; it then pre-segments the highest-priority segments in search of an optimal segmentation; finally, by considering whether the maximum and minimum points within a segment move in the same or opposite directions, it can split on multiple important points in a single pass. Comparative experiments on several data sets show that, compared with traditional segmentation algorithms, the method reduces fitting error and achieves better fits; compared with other important-point algorithms, it improves fitting quality while greatly improving segmentation efficiency.

6.
The computation of a piecewise smooth function that approximates a finite set of data points may be decomposed into two decoupled tasks: 1) the computation of the locally smooth models, and hence, the segmentation of the data into classes that consist of the sets of points best approximated by each model; 2) the computation of the normalized discriminant functions for each induced class (which may be interpreted as relative probabilities). The approximating function may then be computed as the optimal estimator with respect to this measure field. For the first step, we propose a scheme that involves both robust regression and spatial localization using Gaussian windows. The discriminant functions are obtained by fitting Gaussian mixture models to the data distribution inside each class. We give an efficient procedure for both computations and for determining the optimal number of components. Examples of the application of this scheme to image filtering, surface reconstruction, and time series prediction are presented.

7.
In this paper, we associate each time series of a stock price (TS-P) in a stock market with a time series of hash codes (TS-HC) that indicate a price increase or decrease for each element of the TS-P. Here, hash codes are integer numbers, and their sequence makes it possible to identify recurring (typical) groups of TS-P elements in the stock price dynamics. We describe the procedures for transforming an initial time series and calculating the hash codes, and establish the main properties of a sequence of hash codes. Finally, we suggest an analysis and prediction method for a stock price trajectory using segmentation and hashing.
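A hedged sketch of the hash-code idea in entry 7: up/down moves are encoded as bits and each sliding window of moves becomes an integer code, so equal codes flag recurring price patterns. The binary coding and window length are illustrative; the paper's exact scheme is not reproduced here.

```python
def hash_codes(prices, window=3):
    """Encode each sliding window of up/down price moves as an integer.
    Plain binary encoding of the sign pattern, for illustration only;
    the paper's actual coding scheme may differ."""
    moves = [1 if b > a else 0 for a, b in zip(prices, prices[1:])]
    codes = []
    for i in range(len(moves) - window + 1):
        code = 0
        for bit in moves[i:i + window]:
            code = (code << 1) | bit  # shift in the next up/down bit
        codes.append(code)
    return codes

prices = [10, 11, 12, 11, 12, 13, 14]
print(hash_codes(prices))  # → [6, 5, 3, 7]; equal codes mark repeated patterns
```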

8.
刘苗苗  周从华  张婷 《计算机工程》2021,47(8):62-68,77
Dynamic time warping (DTW) incurs high time complexity when measuring similarity directly on raw multivariate time series, and in pursuing the minimum warping distance DTW may over-stretch or over-compress the series. A DTW similarity measure for multivariate time series based on segment features and adaptive weighting is proposed. The original series is uniformly segmented along each variable dimension, and the slope of the fitted line segment, the maximum and minimum values within the segment, and the time span are taken as each segment's features, greatly reducing the dimensionality of the original series and improving computational efficiency. When DTW computes the optimal warping path, an adaptive cost weight is set for each point and the number of times a point may be reused on the path is limited, alleviating the loss of measurement accuracy caused by excessive stretching or compression and yielding an optimal path. Experimental results show that the method measures similarity between multivariate time series well and achieves good results on multiple data sets.
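Entry 8 builds on dynamic time warping. Below is plain DTW for reference; the paper's contributions (per-segment feature vectors, adaptive path weights, reuse limits) are intentionally omitted from this baseline sketch.

```python
def dtw(a, b):
    """Classic DTW distance between two numeric sequences via dynamic
    programming over the cumulative-cost matrix D."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # → 0.0: the repeated 2 is absorbed by warping
```

The unlimited repetition visible here (the 2 aligns to both 2s at no cost) is exactly the over-stretching the paper's reuse limit is designed to curb.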

9.
Knowledge Discovery from Series of Interval Events
Knowledge discovery from data sets can be extensively automated by using data mining software tools. Techniques for mining series of interval events, however, have not been considered. Such time series are common in many applications. In this paper, we propose mining techniques to discover temporal containment relationships in such series. Specifically, an item A is said to contain an item B if an event of type B occurs during the time span of an event of type A, and this is a frequent relationship in the data set. Mining such relationships provides insight about temporal relationships among various items. We implement the technique and analyze trace data collected from a real database application. Experimental results indicate that the proposed mining technique can discover interesting results. We also introduce a quantization technique as a preprocessing step to generalize the method to all time series.
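The temporal containment relationship of entry 9 ("A contains B if an event of type B occurs during the time span of an event of type A") can be counted directly in a small sketch; the frequency threshold and the quantization preprocessing are left out.

```python
def contains(a, b):
    """Interval a = (start, end) contains interval b if b lies within a's span."""
    return a[0] <= b[0] and b[1] <= a[1]

def containment_counts(events):
    """Count how often an event of type X contains an event of type Y.
    `events` is a list of (type, start, end) tuples; frequent (X, Y)
    pairs are candidate temporal containment relationships."""
    counts = {}
    for tx, sx, ex in events:
        for ty, sy, ey in events:
            if (tx, sx, ex) != (ty, sy, ey) and contains((sx, ex), (sy, ey)):
                counts[(tx, ty)] = counts.get((tx, ty), 0) + 1
    return counts

events = [("A", 0, 10), ("B", 2, 5), ("B", 6, 9), ("C", 3, 4)]
print(containment_counts(events))  # A contains both Bs and C; one B contains C
```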

10.
Clustering analysis of temporal gene expression data is widely used to study dynamic biological systems, such as identifying sets of genes that are regulated by the same mechanism. However, temporal gene expression data often contain noise, missing data points, and non-uniformly sampled time points, which poses challenges for traditional clustering methods to extract meaningful information. In this paper, we introduce an improved clustering approach based on regularized spline regression and an energy-based similarity measure. The proposed approach models each gene expression profile as a B-spline expansion, whose coefficients are estimated by a regularized least squares scheme on the observed data. To compensate for the inadequate information in noisy and short gene expression data, we use a gene's correlated genes as the test set to choose the optimal number of basis functions and the regularization parameter. We show that this treatment helps avoid over-fitting. After fitting the continuous representations of gene expression profiles, we use an energy-based similarity measure for clustering. This measure incorporates the temporal information and relative changes of the time series through its first and second derivatives. We demonstrate that our method is robust to noise and can produce meaningful clustering results.

11.
To address the problem of "pattern dependency" between data streams, a pattern dependency mining algorithm is presented. The algorithm covers: time series segmentation and pattern representation prior to mining; the creation and maintenance of conditional rule tuples; the computation of confidence and support for pattern dependencies; and the design of synopsis structures for two or N data streams. Experiments on stock data and a deployed system show that the method can effectively discover pattern dependencies between data streams and can be used for prediction.

12.
Association Analysis of Multiple Time Series Based on Segment Patterns
This paper studies association analysis of multiple time series based on segment patterns and proposes an analysis method: first, clustering is used to find segment patterns that occur frequently in the time series; the discovered patterns are then used as templates to perform cross-transaction association analysis on the time series. We tested the proposed algorithm on data from the Chinese securities market for 1997-2001, and the results show that the algorithm is effective.

13.
In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
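For reference, entry 13's baseline, Lloyd's algorithm, alternates an assignment step and an update step; the filtering algorithm accelerates exactly these two steps with a kd-tree, which this plain-Python sketch omits.

```python
import random

def lloyd_kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on lists of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        # assignment step: nearest center for each point
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((pi - ci) ** 2
                                      for pi, ci in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: move each center to its cluster mean
        for j, cl in enumerate(clusters):
            if cl:  # guard against empty clusters
                centers[j] = tuple(sum(d) / len(cl) for d in zip(*cl))
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
print(sorted(lloyd_kmeans(pts, 2)))  # one center near each of the two groups
```

The data-sensitive running-time result in the abstract matches the intuition visible even here: the wider the gap between the two groups, the fewer iterations the assignments take to stabilize.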

14.
Generalized principal component analysis (GPCA)
This paper presents an algebro-geometric solution to the problem of segmenting an unknown number of subspaces of unknown and varying dimensions from sample data points. We represent the subspaces with a set of homogeneous polynomials whose degree is the number of subspaces and whose derivatives at a data point give normal vectors to the subspace passing through the point. When the number of subspaces is known, we show that these polynomials can be estimated linearly from data; hence, subspace segmentation is reduced to classifying one point per subspace. We select these points optimally from the data set by minimizing a certain distance function, thus dealing automatically with moderate noise in the data. A basis for the complement of each subspace is then recovered by applying standard PCA to the collection of derivatives (normal vectors). Extensions of GPCA that deal with data in a high-dimensional space and with an unknown number of subspaces are also presented. Our experiments on low-dimensional data show that GPCA outperforms existing algebraic algorithms based on polynomial factorization and provides a good initialization to iterative techniques such as k-subspaces and expectation maximization. We also present applications of GPCA to computer vision problems such as face clustering, temporal video segmentation, and 3D motion segmentation from point correspondences in multiple affine views.

15.
Rich side information concerning users and items is valuable for collaborative filtering (CF) algorithms for recommendation. For example, a rating score is often associated with a piece of review text, which can provide valuable information revealing the reasons why a user gives a certain rating. Moreover, the underlying community and group relationships buried in users and items are potentially useful for CF. In this paper, we develop a new model to tackle the CF problem which predicts a user's ratings on previously unrated items by effectively exploiting interactions among review texts as well as the hidden user community and item group information. We call this model CMR (co-clustering collaborative filtering model with review text). Specifically, we employ the co-clustering technique to model the user community and item group, and each community–group pair corresponds to a co-cluster, which is characterized by a rating distribution in the exponential family and a topic distribution. We have conducted extensive experiments on 22 real-world datasets, and our proposed model CMR outperforms the state-of-the-art latent factor models. Furthermore, both user preferences and item profiles drift over time. Dynamically modeling these temporal changes is desirable for improving a recommendation system. We extend CMR and propose an enhanced model called TCMR to consider time information and exploit the temporal interactions among review texts and co-clusters of user communities and item groups. In this TCMR model, each community–group co-cluster is characterized by an additional beta distribution for time modeling. To evaluate our TCMR model, we have conducted another set of experiments on 22 larger datasets with a wider time span. Our proposed model TCMR performs better than CMR and the standard time-aware recommendation model on rating score prediction tasks. We also investigate the temporal effect on the user–item co-clusters.

16.
We propose a transductive shape segmentation algorithm that can transfer prior segmentation results in a database to new shapes without explicit specification of prior category information. Our method first partitions an input shape into a set of candidate segmentations as a preparation step, and then a linear integer programming algorithm selects segments from them to form the final optimal segmentation. The key idea is to maximize the similarity between the segments in the input shape and the segments in the database, where segment similarity is computed through sparse reconstruction error. The segment-level similarity makes it possible to handle a large number of shapes with significant topology or shape variations using only a small set of segmented example shapes. Experimental results show that our algorithm generates high quality segmentation and semantic labeling results on the Princeton segmentation benchmark.

17.
Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find similar clouds (i.e., distributions) to discover patterns, rules and outlier clouds. For example, consider the numerical case of sales of items, where, for each item sold, we record the unit price and quantity; then, each customer is represented as a distribution of 2-d points (one for each item he/she bought). We want to find similar users, e.g., for market segmentation or anomaly/fraud detection. We propose to address this problem and present D-Search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) approximate KL divergence, which can speed up cloud-similarity computations, (2) multistep sequential scan, which efficiently prunes a significant number of search candidates and leads to a direct reduction in the search cost. We also introduce an extended version of D-Search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multidimensional datasets show that our solution achieves a wall clock time up to 2,300 times faster than the naive implementation without sacrificing accuracy.

18.
Customer segmentation based on temporal variation of subscriber preferences is useful for communication service providers (CSPs) in applications such as targeted campaign design, churn prediction, and fraud detection. Traditional clustering algorithms are inadequate in this context, as a multidimensional feature vector represents a subscriber profile at an instant of time, and grouping of subscribers needs to consider variation of subscriber preferences across time. Clustering in this case usually requires complex models based on multivariate time series analysis. Because conventional time series clustering models have limitations around scalability and the ability to accurately represent temporal behaviour sequences (TBS) of users, which may be short, noisy, and non-stationary, we propose a latent Dirichlet allocation (LDA) based model to represent the temporal behaviour of mobile subscribers as compact and interpretable profiles. Our model makes use of the structural regularity within the observable data corresponding to a large number of user profiles and relaxes the strict temporal ordering of user preferences in TBS clustering. We use mean-shift clustering to segment subscribers based on their discovered profiles. Further, we mine segment-specific association rules from the discovered TBS clusters, to aid marketers in designing intelligent campaigns that match segment preferences. Our experiments on real world data collected from a popular Asian communication service provider gave encouraging results.

19.
Towards Temporal Dynamic Segmentation
In recent years, there have been many research studies focusing on linear data modeling as well as on temporal GIS-T (GIS for transportation) implementations. However, what was fundamentally missing from the research community was a methodology for processing and representing linearly referenced features in the temporal context, or temporal dynamic segmentation. This paper dissects the functional specifications of temporal dynamic segmentation. The authors start by exploring the definition and characteristics of dynamic segmentation. The scope of dynamic segmentation is extended to include two functional categories and three essential functions. The paper then defines the spatiotemporal segment and a spatiotemporal join operation, which are the building blocks and the key mechanism behind temporal dynamic segmentation. A set of metric criteria for identifying spatiotemporal segment topologies is proposed as an effective alternative to the more general, but more costly, frameworks for the identification of topological relationships. The authors finally present functional specifications of the three essential functions of dynamic segmentation.

20.
We present SpeedSeg, a technique for segmenting pen strokes into lines and arcs. The technique uses pen speed information to help infer the segmentation intended by the drawer. To begin, an initial set of candidate segment points is identified. This set includes speed minima below a threshold, and curvature maxima at which the pen speed is also below a threshold. The ink between each pair of consecutive segment points is then classified as either a line or an arc, depending on which fits best. Next, a feedback process is employed, and segments are judiciously merged and split as necessary to improve the quality of the segmentation. In user studies, SpeedSeg performed accurately for new users. The studies also demonstrated that SpeedSeg's accuracy is surprisingly insensitive to the values of many of the empirical parameters used by the technique. However, it is still possible to quickly tune the system to optimize performance for a given user. Finally, SpeedSeg outperformed a state-of-the-art segmentation algorithm.
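SpeedSeg's first candidate criterion, speed minima below a threshold, is easy to sketch; the curvature-maxima criterion and the merge/split feedback loop are omitted, and the threshold value here is purely illustrative.

```python
def candidate_segment_points(speeds, speed_thresh):
    """Local pen-speed minima below a threshold: the first of SpeedSeg's
    two candidate criteria (the curvature-maxima test is omitted)."""
    cands = []
    for i in range(1, len(speeds) - 1):
        is_local_min = speeds[i] <= speeds[i - 1] and speeds[i] <= speeds[i + 1]
        if is_local_min and speeds[i] < speed_thresh:
            cands.append(i)
    return cands

# pen speed dips where the drawer slows down at an intended corner
speeds = [2.0, 1.8, 0.3, 1.7, 2.1, 1.9, 0.2, 1.8, 2.0]
print(candidate_segment_points(speeds, 0.5))  # → [2, 6]
```

The two dips are exactly where a drawer would pause at corners, which is the behavioural cue the technique exploits.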
