Found 20 similar documents (search time: 0 ms)
1.
《Intelligent Data Analysis》1999,3(5):377-398
One important focus of data mining research is the development of algorithms for extracting valuable information from large databases in order to facilitate business decisions. This study explores a new technique for data mining: latent semantic indexing (LSI). LSI is an efficient information retrieval method for textual documents. By determining the singular value decomposition (SVD) of a large sparse term-by-document matrix, LSI constructs an approximate vector space model which represents important associative relationships between terms and documents that are not evident in individual documents. This paper explores the applicability of the LSI model to numerical databases, namely consumer product data. By properly choosing attributes of data records as terms or documents, a term-by-document frequency matrix is built from which a distribution-based indexing scheme is employed to construct a correlated distribution matrix (CDM). An LSI-like vector space model is then used to detect useful or hidden patterns in the numerical data. The extracted information can then be validated using statistical hypothesis testing or resampling. LSI is an automatic yet intelligent indexing method. Its application to numerical data introduces a promising way to discover knowledge in important commercial application areas such as retail and consumer banking.
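For reference, the rank-k approximation that LSI relies on can be written as below. The symbols A (the m-by-n term-by-document matrix), U_k, Σ_k, V_k and the rank k are notation assumed here for illustration; the abstract does not give them.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k},\quad
\sigma_1 \ge \dots \ge \sigma_k > 0
```

Rows of U_k Σ_k then serve as k-dimensional term vectors and rows of V_k Σ_k as document vectors, which is where associations not visible in any single document can show up.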
2.
Mining neighbor-based patterns in data streams (total citations: 1; self-citations: 0; citations by others: 1)
Discovery of complex patterns such as clusters, outliers, and associations from huge volumes of streaming data has been recognized as critical for many application domains. However, little research effort has been made toward detecting patterns within sliding window semantics as required by real-time monitoring tasks, ranging from real-time traffic monitoring to stock trend analysis. Applying static pattern detection algorithms from scratch to every window is impractical due to their high algorithmic complexity and the real-time responsiveness required by streaming applications. In this work, we develop methods for the incremental detection of neighbor-based patterns, in particular, density-based clusters and distance-based outliers over sliding stream windows. Incremental computation for pattern detection queries is challenging, because purging to-be-expired data from previously formed patterns may cause birth, shrinkage, splitting or termination of these complex patterns. To overcome this, we exploit the “predictability” property of sliding windows to elegantly discount the effect of expired objects with little maintenance cost. Our solution achieves guaranteed minimal CPU consumption, while keeping the memory utilization linear in the number of objects in the window. To thoroughly analyze the performance of our proposed methods, we develop a cost model characterizing the performance of our proposed neighbor-based pattern mining strategies. We conduct an analysis study to not only identify the key performance factors for each strategy but also show under which conditions each of them is most efficient. Our comprehensive experimental study, using both synthetic and real data from domains of moving object monitoring and stock trades, demonstrates the superiority of our proposed strategies over alternate methods in both CPU processing resources and memory utilization.
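To make the problem setting concrete, the sketch below shows the naive baseline that the abstract calls impractical: recomputing distance-based outliers from scratch for every window. It is not the paper's incremental method. The function name, the count-based window, and the parameters `radius` and `min_neighbors` are assumptions chosen for illustration.

```python
from collections import deque
import math


def window_outliers(stream, window_size, radius, min_neighbors):
    """Yield (position, outliers) for each full count-based sliding window.

    A point is reported as an outlier when fewer than min_neighbors other
    points in the current window lie within distance radius of it.
    """
    window = deque(maxlen=window_size)
    for position, point in enumerate(stream):
        window.append(point)  # the deque drops the oldest point automatically
        if len(window) < window_size:
            continue
        outliers = []
        for i, p in enumerate(window):
            # Quadratic scan per window: this is the cost that incremental
            # methods try to avoid.
            neighbors = sum(
                1
                for j, q in enumerate(window)
                if i != j and math.dist(p, q) <= radius
            )
            if neighbors < min_neighbors:
                outliers.append(p)
        yield position, outliers
```

Every window costs a quadratic number of distance computations here, even though consecutive windows differ by a single point, which illustrates why reusing work across windows matters.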
3.
Traditional classification methods assume that the training and the test data arise from the same underlying distribution.
However, in several adversarial settings, the test set is deliberately constructed in order to increase the error rates of
the classifier. A prominent example is spam email where words are transformed to get around word based features embedded in
a spam filter.
4.
Mining asynchronous periodic patterns in time series data (total citations: 4; self-citations: 0; citations by others: 4)
Jiong Yang Wei Wang Yu P.S. 《Knowledge and Data Engineering, IEEE Transactions on》2003,15(3):613-628
Periodicity detection in time series data is a challenging problem of great importance in many applications. Most previous work focused on mining synchronous periodic patterns and did not recognize the misaligned presence of a pattern due to the intervention of random noise. In this paper, we propose a more flexible model of asynchronous periodic pattern that may be present only within a subsequence and whose occurrences may be shifted due to disturbance. Two parameters, min_rep and max_dis, are employed to specify the minimum number of repetitions that is required within each segment of nondisrupted pattern occurrences and the maximum allowed disturbance between any two successive valid segments. Upon satisfying these two requirements, the longest valid subsequence of a pattern is returned. A two-phase algorithm is devised to first generate potential periods by distance-based pruning followed by an iterative procedure to derive and validate candidate patterns and locate the longest valid subsequence. We also show that this algorithm can not only provide linear time complexity with respect to the length of the sequence but also achieve space efficiency.
5.
Mining sequential patterns from multidimensional sequence data (total citations: 1; self-citations: 0; citations by others: 1)
Chung-Ching Yu Yen-Liang Chen 《Knowledge and Data Engineering, IEEE Transactions on》2005,17(1):136-140
The problem addressed in this work is to discover frequently occurring sequential patterns from databases. Although much work has been devoted to this subject, to the best of our knowledge, no previous research was able to find sequential patterns from d-dimensional sequence data, where d>2. Without such a capability, much practical data would be impossible to mine. For example, an online stock-trading site may have a customer database, where each customer may visit a Web site in a series of days; each day takes a series of sessions and each session visits a series of Web pages. Then, the data for each customer forms a 3-dimensional list, where the first dimension is days, the second is sessions, and the third is visited pages. To mine sequential patterns from this kind of sequence data, two efficient algorithms have been developed in this work.
6.
Mining frequent patterns from univariate uncertain data (total citations: 1; self-citations: 0; citations by others: 1)
Ying-Ho Liu 《Data & Knowledge Engineering》2012,71(1):47-68
In this paper, we propose a new algorithm called U2P-Miner for mining frequent U2 patterns from univariate uncertain data, where each attribute in a transaction is associated with a quantitative interval and a probability density function. The algorithm is implemented in two phases. First, we construct a U2P-tree that compresses the information in the target database. Then, we use the U2P-tree to discover frequent U2 patterns. Potential frequent U2 patterns are derived by combining base intervals and verified by traversing the U2P-tree. We also develop two techniques to speed up the mining process. Since the proposed method is based on a tree-traversing strategy, it is both efficient and scalable. Our experimental results demonstrate that the U2P-Miner algorithm outperforms three widely used algorithms, namely, the modified Apriori, modified H-mine, and modified depth-first backtracking algorithms.
7.
A data stream is a potentially uninterrupted flow of data. Mining this flow makes it necessary to cope with uncertainty, as only a part of the stream can be stored. In this paper, we evaluate a statistical technique which biases the estimation of the support of patterns, so as to maximize either the precision or the recall, as chosen by the user, and limit the degradation of the other criterion. Theoretical results show that the technique is not far from the optimum, from the statistical standpoint. Experiments performed tend to demonstrate its potential, as it remains robust even under significant distribution drifts.
8.
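Mining the set of frequent closed patterns instead of the full set of frequent patterns is an important strategy proposed in recent years. Based on the characteristics of data streams, a new sliding-window method for mining frequent closed patterns, DSFC_Mine, is proposed. The algorithm uses the basic windows within the sliding window as its unit of update. It applies an improved CHARM algorithm to compute the potential frequent closed itemsets of each basic window and stores them in a new data structure, from which all frequent closed itemsets in the sliding window can be mined quickly. Experiments verify the feasibility and effectiveness of the algorithm in both time and space.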
9.
Yuan J Wu Y 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2012,42(2):334-346
Traditional text data mining techniques are not directly applicable to image data which contain spatial information and are characterized by high-dimensional visual features. It is not a trivial task to discover meaningful visual patterns from images because the content variations and spatial dependence in visual data greatly challenge most existing data mining methods. This paper presents a novel approach to coping with these difficulties for mining visual collocation patterns. Specifically, the novelty of this work lies in the following new contributions: 1) a principled solution to the discovery of visual collocation patterns based on frequent itemset mining and 2) a self-supervised subspace learning method to refine the visual codebook by feeding back discovered patterns via subspace learning. The experimental results show that our method can discover semantically meaningful patterns efficiently and effectively.
10.
Online social networks allow users to tag their posts with geographical coordinates collected through the GPS interface of smart phones. The time- and geo-coordinates associated with a sequence of posts/tweets manifest the spatial–temporal movements of people in real life. This paper aims to analyze such movements to discover people and community behavior. To this end, we defined and implemented a novel methodology to mine popular travel routes from geo-tagged posts. Our approach infers interesting locations and frequent travel sequences among these locations in a given geo-spatial region, as shown from the detailed analysis of the collected geo-tagged data.
11.
Gemma C. Garriga Roni Khardon Luc De Raedt 《Annals of Mathematics and Artificial Intelligence》2013,69(4):315-342
Recent theoretical insights have led to the introduction of efficient algorithms for mining closed item-sets. This paper investigates potential generalizations of this paradigm to mine closed patterns in relational, graph and network databases. Several semantics and associated definitions for closed patterns in relational data have been introduced in previous work, but the differences among these and the implications of the choice of semantics were not clear. The paper investigates these implications in the context of generalizing the LCM algorithm, an algorithm for enumerating closed item-sets. LCM is attractive since its run time is linear in the number of closed patterns and since it does not need to store the patterns output in order to avoid duplicates, further reducing memory signature and run time. Our investigation shows that the choice of semantics has a dramatic effect on the properties of closed patterns and as a result, in some settings a generalization of the LCM algorithm is not possible. On the other hand, we provide a full generalization of LCM for the semantic setting that has been previously used by the Claudien system.
12.
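As an important research problem in data stream mining, frequent pattern mining over data streams under sliding windows has been widely applied and studied in recent years. Most existing algorithms process all of the data in the stream, whereas in practice users often care only about certain aspects of the data. Drawing on the MFI-TransSW algorithm, this paper proposes BSW-Filter (Bit Sliding Window with Filter), an algorithm based on a transaction-sensitive sliding window. The algorithm implements sliding-window operations with bit sequences and, because it adds filtering of frequent items, reduces the number of data items that must be stored, thereby reducing memory usage and improving processing speed. The space complexity of the algorithm is independent of the sliding window size and of the value range of the data stream, which makes it particularly suitable for mining data with long periods and wide value ranges. Analysis and experiments verify the feasibility and effectiveness of the algorithm.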
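The bit-sequence idea mentioned above can be sketched roughly as follows. This is a minimal illustration of a transaction-sensitive sliding window with one bit sequence per item plus an item filter, not the published BSW-Filter or MFI-TransSW algorithm. The class name `BitSlidingWindow`, the `tracked_items` filter, and the method names are all hypothetical.

```python
from collections import defaultdict


class BitSlidingWindow:
    """Sliding window over transactions, storing one bit sequence per item.

    Bit 0 of an item's sequence marks the newest transaction; a set bit
    means the item occurred in that transaction.
    """

    def __init__(self, window_size, tracked_items=None):
        self.window_size = window_size
        # Hypothetical filter: only these items are stored (None keeps all).
        self.tracked_items = set(tracked_items) if tracked_items is not None else None
        self.mask = (1 << window_size) - 1  # keeps only the newest window_size bits
        self.bits = defaultdict(int)        # item -> bit sequence as a Python int

    def add(self, transaction):
        """Slide by one transaction: shift every sequence, then set the newest bit."""
        for item in list(self.bits):
            shifted = (self.bits[item] << 1) & self.mask
            if shifted:
                self.bits[item] = shifted
            else:
                del self.bits[item]  # item no longer occurs anywhere in the window
        for item in set(transaction):
            if self.tracked_items is None or item in self.tracked_items:
                self.bits[item] |= 1

    def support(self, itemset):
        """Count window transactions containing every item (bitwise AND, popcount)."""
        combined = self.mask
        for item in itemset:
            combined &= self.bits.get(item, 0)
        return bin(combined).count("1")
```

In this sketch the left shift plays the role of expiring the oldest transaction, and the support of an itemset comes from ANDing the per-item sequences, so no transaction needs to be stored explicitly.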
13.
Hui Chen 《Journal of Intelligent Information Systems》2014,42(1):111-131
Frequent pattern mining in data streams is an important research topic in the data mining community. In previous studies, a minimum support threshold was assumed to be available for mining frequent patterns. However, setting such a threshold is typically difficult. Hence, it is more reasonable to ask users to set a bound on the result size. The present study considers mining top-k frequent patterns from data streams using a sliding window technique. A single-pass algorithm, called MSWTP, is developed for the generation of top-k frequent patterns without a threshold. In the method, the content of the transactions in the sliding window is incrementally maintained in a summary data structure, named SWTP-tree, by scanning the stream only once. To make the mining operation efficient, insignificant patterns are distinguished from others by applying the Chernoff bound. Two kinds of obsolete pattern and one kind of insignificant pattern are periodically pruned from the pattern tree. Whenever necessary, the k most frequent patterns can be selected from SWTP-tree in order of their descending frequency. The performance of the proposed technique is evaluated via simulation experiments. The results show that the proposed method is both efficient and scalable, and that it outperforms comparable algorithms.
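The idea of asking for a result size k instead of a support threshold can be illustrated with a deliberately simple sketch. It counts single items only and keeps exact counts, so it is neither MSWTP nor the SWTP-tree, and it omits the Chernoff-bound pruning described above. The class name `SlidingTopK` and its methods are hypothetical.

```python
from collections import Counter, deque


class SlidingTopK:
    """Exact top-k single-item counts over the last window_size transactions."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()    # transactions currently inside the window
        self.counts = Counter()  # item -> number of window transactions containing it

    def add(self, transaction):
        """Insert one transaction and expire the oldest one if the window is full."""
        items = frozenset(transaction)
        self.window.append(items)
        self.counts.update(items)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts.subtract(expired)
            for item in expired:
                if self.counts[item] == 0:
                    del self.counts[item]  # drop items that left the window entirely

    def top_k(self, k):
        """Return up to k (item, count) pairs, most frequent first, ties by item."""
        return sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

The stream is still scanned only once, but memory grows with the number of distinct items in the window; approximate summaries and pruning exist to bound exactly that growth.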
14.
With the help of various positioning tools, individuals’ mobility behaviors are being continuously captured from mobile phones, wireless networking devices and GPS appliances. These mobility data serve as an important foundation for understanding individuals’ mobility behaviors. For instance, recent studies show that, despite the dissimilarity in the mobility areas covered by individuals, there is high regularity in human mobility behaviors, suggesting that most individuals follow a simple and reproducible pattern. This survey paper reviews relevant results on uncovering mobility patterns from GPS datasets. Specifically, it covers the results about inferring locations of significance for prediction of future moves, detecting modes of transport, mining trajectory patterns and recognizing location-based activities. The survey provides a general perspective for studies on the issues of individuals’ mobility by reviewing the methods and algorithms in detail and comparing the existing results on the same issues. Several new and emergent issues concerning individuals’ mobility are proposed for further research.
15.
In recent years, emerging applications introduced new constraints for data mining methods. These constraints are typical of
a new kind of data: the data streams. In data stream processing, memory usage is restricted, new elements are generated continuously and have to be considered
in linear time, no blocking operator can be performed and the data can be examined only once. At this time, only a few methods
have been proposed for mining sequential patterns in data streams. We argue that the main reason is the combinatory phenomenon
related to sequential pattern mining. In this paper, we propose an algorithm based on sequence alignment for mining approximate
sequential patterns in Web usage data streams. To meet the constraint of one scan, a greedy clustering algorithm associated
with an alignment method is proposed. We will show that our proposal is able to extract relevant sequences with very low thresholds.
16.
17.
General patterns of execution that have been frequently scheduled by a workflow management system provide the administrator with previously unknown and potentially useful information, e.g., about the existence of unexpected causalities between subprocesses of a given workflow. This paper investigates the problem of mining unconnected patterns on the basis of some execution traces, i.e., of detecting sets of activities exhibiting no explicit dependency relationships that are frequently executed together. The paper addresses the problem by proposing and analyzing two algorithms. One algorithm takes into account information about the structure of the control-flow graph only, while the other is a smart refinement where the knowledge of the frequencies of edges and activities in the traces at hand is also accounted for, by means of a sophisticated graphical analysis. Both algorithms have been implemented and integrated into a system prototype, which may profitably support the enactment phase of the workflow. The correctness of the two algorithms is formally proven, and several experiments are reported to show the ability of the graphical analysis to significantly improve performance by dramatically pruning the search space of candidate patterns.
18.
The main task of mining sequential patterns is to analyze the transaction database of a company in order to find out the priorities of items that most customers take when consuming. In this article, we propose a new method, the ISP Algorithm. With this method, we can not only find the order of consumer items of each customer but also provide the periodic interval of consumer items of each customer. Compared with previous periodic association rules, the difference is that the period the algorithm provides is not the repeated purchases in a regular time, but the possible repurchases within a certain time frame. The algorithm utilizes the transaction time interval of individual customers and that of all the customers to find out when and who will buy goods, and what items of goods they will buy. © 2005 Wiley Periodicals, Inc. Int J Int Syst 20: 359–373, 2005.
19.
In this paper we aim at extending the non-derivable condensed representation in frequent itemset mining to sequential pattern mining. We start by showing a negative example: in the context of frequent sequences, the notion of non-derivability is meaningless. Therefore, we extend our focus to the mining of conjunctions of sequences. Besides being of practical importance, this class of patterns has some nice theoretical properties. Based on a new unexploited theoretical definition of equivalence classes for sequential patterns, we are able to extend the notion of a non-derivable itemset to the sequence domain. We present a new depth-first approach to mine non-derivable conjunctive sequential patterns and show its use in mining association rules for sequences. This approach is based on a well-known combinatorial theorem: the Möbius inversion. A performance study using both synthetic and real datasets illustrates the efficiency of our mining algorithm. These newly introduced patterns have a high potential for real-life applications, especially for network monitoring and biomedical fields, with the ability to get sequential association rules with all the classical statistical metrics such as confidence, conviction, lift, etc.
20.
Junguo Liu Jimmy R. Williams Xiuying Wang Hong Yang 《Environmental Modelling & Software》2009,24(5):655-664
Although the EPIC model has been widely used in agricultural and environmental studies, applications of this model may be limited in the regions where daily weather data are not available. In this paper, a stand-alone MODAWEC model was developed to generate daily precipitation and maximum and minimum temperature from monthly precipitation, maximum and minimum temperature, and wet days. A case study shows that the crop yields and evapotranspiration (ET) simulated with the generated daily weather data compare very well with those simulated with the measured daily weather data with low normalized mean square errors (0.008–0.017 for crop yields and 0.003–0.004 for ET). The MODAWEC model can extend the application of the EPIC model to the regions where daily data are not available or not complete. In addition, the generated daily weather data can possibly be used by other environmental models. Associated with MODAWEC, the EPIC model can play a greater role in assessing the impacts of global climate change on future food production and water use.