Similar literature
Found 10 similar documents (search time: 187 ms)
1.
Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p-value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that there exists no efficient algorithm with a finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data, and to formulate data mining problems in terms of statistical significance.

2.
Mining spatial colocation patterns: a different framework (cited 2 times: 0 self-citations, 2 by others)
Recently, there has been considerable interest in mining spatial colocation patterns from large spatial datasets. Spatial colocation patterns represent the subsets of spatial events whose instances are often located in close geographic proximity. Most studies of spatial colocation mining require the specification of two parameter constraints to find interesting colocation patterns. One is a minimum prevalence threshold for colocations, and the other is a distance threshold that defines the spatial neighborhood. However, it is difficult for users to decide on appropriate threshold values without prior knowledge of their task-specific spatial data. In this paper, we propose a different framework for spatial colocation pattern mining. To remove the first constraint, we propose the problem of finding the N-most prevalent colocated event sets, where N is the desired number of colocated event sets with the highest interest measure values for each pattern size. We developed two alternative algorithms for mining the N-most patterns. They reduce candidate events effectively and use a filter-and-refine strategy for efficiently finding colocation instances in a spatial dataset. We prove that the algorithms are correct and complete in finding the N-most prevalent colocation patterns. For the second constraint, a distance threshold for spatial neighborhood determination, we present various methods to estimate appropriate distance bounds from user input data. The result can help a user to set a distance for a conceptualization of spatial neighborhood. Our experimental results with real and synthetic datasets show that our algorithmic design is computationally effective in finding the N-most prevalent colocation patterns. The discovered patterns differed depending on the distance threshold, which shows that it is important to select appropriate neighbor distances.
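
The "N-most prevalent" selection described in this abstract can be illustrated with a minimal sketch. It only shows the final ranking step (keep the N highest-scoring event sets for each pattern size) and is not the paper's candidate-reduction or filter-and-refine algorithm. The function name `n_most_prevalent`, the input layout, and the score values are hypothetical and chosen for illustration.

```python
import heapq
from collections import defaultdict


def n_most_prevalent(scored_patterns, n):
    """Return the n highest-scoring patterns for each pattern size.

    scored_patterns: iterable of (event_set, score) pairs, where event_set is
    a frozenset of event types and score is an assumed interest measure value.
    """
    by_size = defaultdict(list)
    for events, score in scored_patterns:
        by_size[len(events)].append((events, score))
    # Rank within each size group instead of applying one global threshold.
    return {
        size: heapq.nlargest(n, group, key=lambda pair: pair[1])
        for size, group in by_size.items()
    }


# Hypothetical candidates with made-up interest measure values.
candidates = [
    (frozenset({"A", "B"}), 0.8),
    (frozenset({"A", "C"}), 0.5),
    (frozenset({"B", "C"}), 0.6),
    (frozenset({"A", "B", "C"}), 0.4),
]
top = n_most_prevalent(candidates, 2)
```

With this formulation the user supplies only N; no minimum prevalence threshold has to be guessed in advance.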

3.
Discovering colored Petri nets from event logs (cited 1 time: 0 self-citations, 1 by others)
Process-aware information systems typically log events (e.g., in transaction logs or audit trails) related to the actual execution of business processes. Analysis of these execution logs may reveal important knowledge that can help organizations to improve the quality of their services. Starting from a process model, which can be discovered by conventional process mining algorithms, we use decision mining, also referred to as decision point analysis, to analyze how data attributes influence the choices made in the process based on past process executions. In this paper we describe how the resulting model (including the discovered data dependencies) can be represented as a Colored Petri Net (CPN), and how further perspectives, such as the performance and organizational perspectives, can be incorporated. We also present a CPN Tools Export plug-in implemented within the ProM framework. Using this plug-in, simulation models in ProM obtained via a combination of various process mining techniques can be exported to CPN Tools. We believe that the combination of automatic discovery of process models using ProM and the simulation capabilities of CPN Tools offers an innovative way to improve business processes. The discovered process model describes reality better than most hand-crafted simulation models. Moreover, the simulation models are constructed in such a way that it is easy to explore various redesigns. A. Rozinat’s research was supported by the IOP program of the Dutch Ministry of Economic Affairs. M. Song’s research was supported by the Technology Foundation STW.

4.
In this article we show that there is a strong connection between decision tree learning and local pattern mining. This connection allows us to solve the computationally hard problem of finding optimal decision trees in a wide range of applications by post-processing a set of patterns: we use local patterns to construct a global model. We exploit the connection between constraints in pattern mining and constraints in decision tree induction to develop a framework for categorizing decision tree mining constraints. This framework allows us to determine which model constraints can be pushed deeply into the pattern mining process, and allows us to improve the state of the art in optimal decision tree induction.

5.
Numerous interestingness measures have been proposed in statistics and data mining to assess object relationships. This is especially important in recent studies of association or correlation pattern mining. However, it is still not clear whether there is any intrinsic relationship among the many proposed measures, or which one is truly effective at gauging object relationships in large data sets. Recent studies have identified a critical property, null-(transaction) invariance, for measuring associations among events in large data sets, but many measures do not have this property. In this study, we re-examine a set of null-invariant interestingness measures and find that they can be expressed as the generalized mathematical mean, leading to a total ordering of them. Such a unified framework provides insights into the underlying philosophy of the measures and helps us understand and select the proper measure for different applications. Moreover, we propose a new measure called Imbalance Ratio to gauge the degree of skewness of a data set. We also discuss the efficient computation of interesting patterns of different null-invariant interestingness measures by proposing an algorithm, GAMiner, which complements previous studies. Experimental evaluation verifies the effectiveness of the unified framework and shows that GAMiner speeds up the state-of-the-art algorithm by an order of magnitude.
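
A minimal sketch of the generalized-mean view mentioned in this abstract. The abstract does not list the measures it covers, so the four used here (all-confidence, cosine, Kulczynski, max-confidence) are an assumption, written with their standard textbook definitions as means of the two conditional probabilities P(A|B) and P(B|A). The support counts are made up, and the Imbalance Ratio and GAMiner are not reproduced.

```python
import math


def generalized_mean(a, b, k):
    """Generalized (power) mean of two positive numbers with exponent k."""
    if k == -math.inf:
        return min(a, b)
    if k == math.inf:
        return max(a, b)
    if k == 0:
        return math.sqrt(a * b)  # limiting case: geometric mean
    return ((a ** k + b ** k) / 2) ** (1 / k)


def null_invariant_measures(sup_a, sup_b, sup_ab):
    """Compute four measures from support counts only.

    The total number of transactions never appears, which is why adding
    transactions containing neither A nor B leaves the values unchanged.
    """
    p_b_given_a = sup_ab / sup_a
    p_a_given_b = sup_ab / sup_b
    return {
        "all_confidence": generalized_mean(p_a_given_b, p_b_given_a, -math.inf),
        "cosine": generalized_mean(p_a_given_b, p_b_given_a, 0),
        "kulczynski": generalized_mean(p_a_given_b, p_b_given_a, 1),
        "max_confidence": generalized_mean(p_a_given_b, p_b_given_a, math.inf),
    }


# Hypothetical supports: A occurs 1000 times, B 100 times, both together 100 times.
measures = null_invariant_measures(sup_a=1000, sup_b=100, sup_ab=100)
values = list(measures.values())
```

Because the power mean is non-decreasing in its exponent, the values in `measures` come out in non-decreasing order, which is one way to read the total ordering the abstract refers to.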

6.
Mining regional co-location patterns with kNNG (cited 2 times: 0 self-citations, 2 by others)
Spatial co-location pattern mining discovers subsets of features whose events are frequently located together in geographic space. Current research on this topic adopts a distance threshold, which has limitations in spatial data sets with various magnitudes of neighborhood distances, especially for mining regional co-location patterns. In this paper, we propose a hierarchical co-location mining framework that accounts for both the variety of neighborhood distances and spatial heterogeneity. By adopting a k-nearest neighbor graph (kNNG) instead of a distance threshold, we propose the “distance variation coefficient” as a new measure to drive the mining operations and determine an individual neighborhood relationship graph for each region. The proposed mining algorithm outputs a set of regions, each with its own set of regional co-location patterns. The experimental results on both synthetic and real-world data sets show that our framework is effective at discovering these regional co-location patterns.
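
To illustrate why a k-nearest neighbor graph can be preferable to a single distance threshold, here is a minimal brute-force sketch. It is not the paper's hierarchical framework and does not compute the distance variation coefficient, which the abstract does not define. The function names and the sample points are hypothetical.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points."""
    graph = {}
    for i, p in enumerate(points):
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        graph[i] = [j for _, j in sorted(others)[:k]]
    return graph


def threshold_graph(points, radius):
    """Map each point index to all other points within a fixed radius."""
    return {
        i: [j for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius]
        for i, p in enumerate(points)
    }


# A dense cluster near the origin and a sparse cluster far away.
points = [(0, 0), (1, 0), (0, 1), (100, 100), (110, 100), (100, 110)]
by_knn = knn_graph(points, k=2)
by_radius = threshold_graph(points, radius=2)
```

With a radius tuned to the dense cluster, the sparse cluster's points have no neighbors at all, whereas the kNN graph adapts to the local spacing in both regions.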

7.
Hierarchical visual event pattern mining and its applications (cited 1 time: 0 self-citations, 1 by others)
In this paper, we propose a hierarchical visual event pattern mining approach and use the patterns to address key problems in the field of video mining and understanding. We classify events into primitive events (PEs) and compound events (CEs), where PEs are the units of CEs, and CEs serve as smooth priors and rules for PEs. We first propose a tensor-based video representation and Joint Matrix Factorization (JMF) for unsupervised primitive event categorization. Then we apply frequent pattern mining techniques to discover compound event pattern structures. After that, we use the two kinds of event patterns for event recognition and anomaly detection. First we extend the Sequential Monte Carlo (SMC) method to the recognition of live, sequential visual events. To accomplish this task we present a scheme that alternately recognizes primitive and compound events in one framework. Then, we categorize the anomalies into abnormal events (never-seen events) and abnormal contexts (rule breakers), and the two kinds of anomalies are detected simultaneously by embedding a deviation criterion into the SMC framework. Extensive experiments demonstrate that the proposed approach is effective compared with other major approaches.

8.
We propose an efficient automata-based approach to extract behavioral units and rules from continuous sequential data of animal behavior. By introducing novel extensions, we integrate two elemental methods, the N-gram model and Angluin’s machine learning algorithm, into an ethological data mining framework. This allows us to obtain the minimized automaton representation of behavioral rules that accept (or generate) the smallest set of possible behavioral patterns from sequential data of animal behavior. We demonstrate how this ethological data mining works on real birdsong data, using the Bengalese finch song, and we evaluate the method experimentally on artificial birdsong data generated by a computer program. These results suggest that our ethological data mining works effectively even for noisy behavioral data, provided that the parameters we introduce are set appropriately. In addition, we present a case study using the Bengalese finch song, showing that our method successfully captures the core structure of the singing behavior, such as loops and branches. Yasuki Kakishita and Kazutoshi Sasahara have contributed equally to this work.

9.
Data stream mining is an emerging research topic in the data mining field. Finding frequent itemsets is one of the most important tasks in data stream mining, with wide applications such as online e-business and web click-stream analysis. However, two main problems exist in relevant studies: (1) The utilities (e.g., importance or profits) of items are not considered, so the actual utilities of patterns cannot be reflected in frequent itemsets. (2) Existing utility mining methods produce too many patterns, which makes it difficult for users to filter useful patterns out of the huge set of patterns. In view of this, in this paper we propose a novel framework, named GUIDE (Generation of maximal high Utility Itemsets from Data strEams), to find maximal high utility itemsets from data streams with different models, i.e., the landmark, sliding window and time fading models. The proposed structure, named MUI-Tree (Maximal high Utility Itemset Tree), maintains essential information for the mining processes, and the proposed strategies further improve the performance of GUIDE. The main contributions of this paper are as follows: (1) To the best of our knowledge, this is the first work on mining the compact form of high utility patterns from data streams; (2) GUIDE is an effective one-pass framework that meets the requirements of data stream mining; (3) GUIDE generates novel patterns that are not only high utility but also maximal, which provide compact and insightful hidden information in the data streams. Experimental results show that our approach outperforms the state-of-the-art algorithms under various conditions in data stream environments on different models.
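
The notion of a maximal high utility itemset can be made concrete with a brute-force sketch on a tiny static dataset. This is deliberately not GUIDE or the MUI-Tree: it enumerates every itemset, makes no attempt at one-pass stream processing, and assumes a simple utility model in which each transaction maps an item to that item's utility in the transaction. All names and numbers are hypothetical.

```python
from itertools import combinations


def itemset_utility(itemset, transactions):
    """Sum the utility of `itemset` over the transactions containing all of it."""
    total = 0
    for tx in transactions:  # tx maps item -> utility of that item in this transaction
        if all(item in tx for item in itemset):
            total += sum(tx[item] for item in itemset)
    return total


def maximal_high_utility_itemsets(transactions, min_utility):
    """Return high utility itemsets that have no high utility proper superset."""
    items = sorted({item for tx in transactions for item in tx})
    high = [
        frozenset(candidate)
        for size in range(1, len(items) + 1)
        for candidate in combinations(items, size)
        if itemset_utility(candidate, transactions) >= min_utility
    ]
    # `s < other` is the proper-subset test on frozensets.
    return [s for s in high if not any(s < other for other in high)]


# Hypothetical transactions: item -> utility contributed in that transaction.
transactions = [
    {"a": 5, "b": 2},
    {"a": 5, "c": 1},
    {"a": 5, "b": 2, "c": 1},
]
result = maximal_high_utility_itemsets(transactions, min_utility=10)
```

Here {a}, {a, b} and {a, c} all reach the utility threshold, but only {a, b} and {a, c} are reported, because {a} is contained in a larger high utility itemset. This is the sense in which the maximal form is more compact.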

10.
Because of the damage it does to Internet security, malware (e.g., viruses, worms, trojans) and its detection have held the attention of both the anti-malware industry and researchers for decades. To protect legitimate users from attacks, the most significant line of defense against malware is anti-malware software products, which mainly use signature-based methods for detection. However, these methods fail to recognize new, unseen malicious executables. To solve this problem, in this paper we propose an effective sequence mining algorithm that discovers malicious sequential patterns from the instruction sequences extracted from the file sample set; an All-Nearest-Neighbor (ANN) classifier is then constructed for malware detection based on the discovered patterns. The developed data mining framework, composed of the proposed sequential pattern mining method and the ANN classifier, can characterize the malicious patterns in the collected file sample set well and thereby effectively detect new, unseen malware samples. A comprehensive experimental study on a real data collection is performed to evaluate our detection framework. Promising experimental results show that our framework outperforms other alternative data mining based detection methods in identifying new malicious executables.
