首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
Frequent pattern mining in data streams is an important research topic in the data mining community. In previous studies, a minimum support threshold was assumed to be available for mining frequent patterns. However, setting such a threshold is typically difficult. Hence, it is more reasonable to ask users to set a bound on the result size. The present study considers mining top-k frequent patterns from data streams using a sliding window technique. A single-pass algorithm, called MSWTP, is developed for the generation of top-k frequent patterns without a threshold. In the method, the content of the transactions in the sliding window is incrementally maintained in a summary data structure, named SWTP-tree, by scanning the stream only once. To make the mining operation efficient, insignificant patterns are distinguished from others by applying the Chernoff bound. Two kinds of obsolete pattern and one kind of insignificant pattern are periodically pruned from the pattern tree. Whenever necessary, the k most frequent patterns can be selected from SWTP-tree in order of their descending frequency. The performance of the proposed technique is evaluated via simulation experiments. The results show that the proposed method is both efficient and scalable, and that it outperforms comparable algorithms.  相似文献   

2.
In recent years, data stream mining has become an important research topic. With the emergence of new applications, the data we process are not again static, but the continuous dynamic data stream. Examples include network traffic analysis, Web click stream mining, network intrusion detection, and on-line transaction analysis. In this paper, we propose a new framework for data stream mining, called the weighted sliding window model. The proposed model allows the user to specify the number of windows for mining, the size of a window, and the weight for each window. Thus users can specify a higher weight to a more significant data section, which will make the mining result closer to user’s requirements. Based on the weighted sliding window model, we propose a single pass algorithm, called WSW, to efficiently discover all the frequent itemsets from data streams. By analyzing data characteristics, an improved algorithm, called WSW-Imp, is developed to further reduce the time of deciding whether a candidate itemset is frequent or not. Empirical results show that WSW-Imp outperforms WSW under the weighted sliding window model.  相似文献   

3.
Continuous processing of top-k queries over data streams is a promising technique for alleviating the information overload problem as it distinguishes relevant from irrelevant data stream objects with respect to a given scoring function over time. Thus it enables filtering of irrelevant data objects and delivery of top-k objects relevant to user interests in real-time. We propose a solution for distributed continuous top-k processing based on the publish/subscribe communication paradigm—top-k publish/subscribe over sliding windows (top-k/w publish/subscribe). It identifies k best-ranked objects with respect to a given scoring function over a sliding window of size w, and extends the publish/subscribe communication paradigm by continuous top-k processing algorithms coming from the field of data stream processing.In this paper, we introduce, analyze and evaluate the essential building blocks of distributed top-k/w publish/subscribe systems: first, we present a formal top-k/w publish/subscribe model and compare it to the prevailing Boolean publish/subscribe model. Next, we outline the top-k/w processing tasks performed by publish/subscribe nodes and investigate the properties of supported scoring functions. Furthermore, we explore potential routing strategies for distributed top-k/w publish/subscribe systems. Finally, we experimentally evaluate model properties and provide a comparative study investigating traffic requirements of potential routing strategies.  相似文献   

4.
How can we discover interesting patterns from time-evolving high-speed data streams? How to analyze the data streams quickly and accurately, with little space overhead? How to guarantee the found patterns to be self-consistent? High-speed data stream has been receiving increasing attention due to its wide applications such as sensors, network traffic, social networks, etc. The most fundamental task on the data stream is frequent pattern mining; especially, focusing on recentness is important in real applications. In this paper, we develop two algorithms for discovering recently frequent patterns in data streams. First, we propose TwMinSwap to find top-k recently frequent items in data streams, which is a deterministic version of our motivating algorithm TwSample providing theoretical guarantees based on item sampling. TwMinSwap improves TwSample in terms of speed, accuracy, and memory usage. Both require only O(k) memory spaces and do not require any prior knowledge on the stream such as its length and the number of distinct items in the stream. Second, we propose TwMinSwap-Is to find top-k recently frequent itemsets in data streams. We especially focus on keeping self-consistency of the discovered itemsets, which is the most important property for reliable results, while using O(k) memory space with the assumption of a constant itemset size. Through extensive experiments, we demonstrate that TwMinSwap outperforms all competitors in terms of accuracy and memory usage, with fast running time. We also show that TwMinSwap-Is is more accurate than the competitor and discovers recently frequent itemsets with reasonably large sizes (at most 5–7) depending on datasets. Thanks to TwMinSwap and TwMinSwap-Is, we report interesting discoveries in real world data streams, including the difference of trends between the winner and the loser of U.S. presidential candidates, and temporal human contact patterns.  相似文献   

5.
Frequent sequential pattern mining has become one of the most important tasks in data mining. It has many applications, such as sequential analysis, classification, and prediction. How to generate candidates and how to control the combinatorically explosive number of intermediate subsequences are the most difficult problems. Intelligent systems such as recommender systems, expert systems, and business intelligence systems use only a few patterns, namely those that satisfy a number of defined conditions. Challenges include the mining of top-k patterns, top-rank-k patterns, closed patterns, and maximal patterns. In many cases, end users need to find itemsets that occur with a sequential pattern. Therefore, this paper proposes approaches for mining top-k co-occurrence items usually found with a sequential pattern. The Naive Approach Mining (NAM) algorithm discovers top-k co-occurrence items by directly scanning the sequence database to determine the frequency of items. The Vertical Approach Mining (VAM) algorithm is based on vertical database scanning. The Vertical with Index Approach Mining (VIAM) algorithm is based on a vertical database with index scanning. VAM and VIAM use pruning strategies to reduce the search space, thus improving performance. VAM and VIAM are especially effective in mining the co-occurrence items of a long input pattern. The three algorithms were evaluated using real-world databases. The experimental results show that these algorithms perform well, especially VAM and VIAM.  相似文献   

6.
挖掘滑动窗口中的数据流频繁模式   总被引:2,自引:0,他引:2  
随着数据流应用的不断增多,数据流环境下的数据挖掘技术受到了越来越多的关注.文章结合数据流的特点,提出一种新的基于滑动窗口的频繁模式挖掘算法:DSFPM.算法分块挖掘数据流,在内存中维持一个用于保存所有潜在的频繁模式信息的存储结构DSFPM-Tree,并在各个基本窗口进入滑动窗口后动态更新该存储结构.算法仅处理和保存各个基本窗口的临界频繁闭合项集,极大地提高了时间和空间效率.实验结果表明,该算法具有良好的性能.  相似文献   

7.
Frequent pattern mining discovers sets of items that frequently appear together in a transactional database; these can serve valuable economic and research purposes. However, if the database contains sensitive data (e.g., user behavior records, electronic health records), directly releasing the discovered frequent patterns with support counts will carry significant risk to the privacy of individuals. In this paper, we study the problem of how to accurately find the top-k frequent patterns with noisy support counts on transactional databases while satisfying differential privacy. We propose an algorithm, called differentially private frequent pattern (DFP-Growth), that integrates a Laplace mechanism and an exponential mechanism to avoid privacy leakage. We theoretically prove that the proposed method is (λ, δ)-useful and differentially private. To boost the accuracy of the returned noisy support counts, we take consistency constraints into account to conduct constrained inference in the post-processing step. Extensive experiments, using several real datasets, confirm that our algorithm generates highly accurate noisy support counts and top-k frequent patterns.  相似文献   

8.
With the fast increase in Web activities, Web data mining has recently become an important research topic and is receiving a significant amount of interest from both academic and industrial environments. While existing methods are efficient for the mining of frequent path traversal patterns from the access information contained in a log file, these approaches are likely to over evaluate associations. Explicitly, most previous studies of mining path traversal patterns are based on the model of a uniform support threshold, where a single support threshold is used to determine frequent traversal patterns without taking into consideration such important factors as the length of a pattern, the positions of Web pages, and the importance of a particular pattern, etc. As a result, a low support threshold will lead to lots of uninteresting patterns derived whereas a high support threshold may cause some interesting patterns with lower supports to be ignored. In view of this, this paper broadens the horizon of frequent path traversal pattern mining by introducing a flexible model of mining Web traversal patterns with dynamic thresholds. Specifically, we study and apply the Markov chain model to provide the determination of support threshold of Web documents; and further, by properly employing some effective techniques devised for joining reference sequences, the proposed algorithm dynamic threshold miner (DTM) not only possesses the capability of mining with dynamic thresholds, but also significantly improves the execution efficiency as well as contributes to the incremental mining of Web traversal patterns. Performance of algorithm DTM and the extension of existing methods is comparatively analyzed with synthetic and real Web logs. It is shown that the option of algorithm DTM is very advantageous in reducing the number of unnecessary rules produced and leads to prominent performance improvement.  相似文献   

9.
杨皓  段磊  胡斌  邓松  王文韬  秦攀 《软件学报》2015,26(11):2994-3009
对比序列模式能够表达序列数据集合间的差异,在商品推荐、用户行为分析和电力供应预测等领域有广泛的应用.已有的对比序列模式挖掘算法需要用户设定正例支持度阈值和负例支持度阈值.在不具备足够先验知识的情况下,用户难以设定恰当的支持度阈值,从而可能错失一些对比显著的模式.为此,提出了带间隔约束的top-k对比序列模式挖掘算法kDSP-Miner(top-k distinguishing sequential patterns with gap constraint miner).kDSP-Miner中用户只需设置期望发现的对比最显著的模式个数,从而避免了直接设置对比支持度阈值.相应地,挖掘算法更容易使用,并且结果更易于解释.同时,为了提高算法执行效率,设计了若干剪枝策略和启发策略.进一步设计了kDSP-Miner的多线程版本,以提高其对高维序列元素情况的处理能力.通过在真实世界数据集上的详实实验,验证了算法的有效性和执行效率.  相似文献   

10.
High on-shelf utility itemset (HOU) mining is an emerging data mining task which consists of discovering sets of items generating a high profit in transaction databases. The task of HOU mining is more difficult than traditional high utility itemset (HUI) mining, because it also considers the shelf time of items, and items having negative unit profits. HOU mining can be used to discover more useful and interesting patterns in real-life applications than traditional HUI mining. Several algorithms have been proposed for this task. However, a major drawback of these algorithms is that it is difficult for users to find a suitable value for the minimum utility threshold parameter. If the threshold is set too high, not enough patterns are found. And if the threshold is set too low, too many patterns will be found and the algorithm may use an excessive amount of time and memory. To address this issue, we propose to address the problem of top-k on-shelf high utility itemset mining, where the user directly specifies k, the desired number of patterns to be output instead of specifying a minimum utility threshold value. An efficient algorithm named KOSHU (fast top-K on-shelf high utility itemset miner) is proposed to mine the top-k HOUs efficiently, while considering on-shelf time periods of items, and items having positive and/or negative unit profits. KOSHU introduces three novel strategies, named efficient estimated co-occurrence maximum period rate pruning, period utility pruning and concurrence existing of a pair 2-itemset pruning to reduce the search space. KOSHU also incorporates several novel optimizations and a faster method for constructing utility-lists. An extensive performance study on real-life and synthetic datasets shows that the proposed algorithm is efficient both in terms of runtime and memory consumption and has excellent scalability.  相似文献   

11.
Effective and efficient mining of music structure patterns from music query data is one of the most interesting issues of multimedia data mining. In this paper, we introduce a new kind of pattern, called emerging melody structure (EMS), for knowledge discovery from music melody streams. EMSs are defined as music data items with melody strings whose support increase significantly from one sliding window to another window from streaming melody sequences. The discovered EMS can be used to predict the future trend of online music style recommendation, to personalize the Web service of music downloading priority, for music composers to compose new music or for service provider to collect more similar music. Therefore, an efficient data mining approach, called MEMSA (Mining Emerging Melody Structure Algorithm), is proposed to discover all EMSs from streaming music query data over sliding windows. In the framework of MEMSA, a prefix tree-based data structure, called EMS-tree (Emerging Melody Structure tree), is constructed for maintaining temporal EMSs effectively. Experimental results show that the proposed method MEMSA is an efficient algorithm for mining all EMSs from streaming melody sequences efficiently.  相似文献   

12.
Data stream mining is an emerging research topic in the data mining field. Finding frequent itemsets is one of the most important tasks in data stream mining with wide applications like online e-business and web click-stream analysis. However, two main problems existed in relevant studies: (1) The utilities (e.g., importance or profits) of items are not considered. Actual utilities of patterns cannot be reflected in frequent itemsets. (2) Existing utility mining methods produce too many patterns and this makes it difficult for the users to filter useful patterns among the huge set of patterns. In view of this, in this paper we propose a novel framework, named GUIDE (Generation of maximal high Utility Itemsets from Data strEams), to find maximal high utility itemsets from data streams with different models, i.e., landmark, sliding window and time fading models. The proposed structure, named MUI-Tree (Maximal high Utility Itemset Tree), maintains essential information for the mining processes and the proposed strategies further facilitates the performance of GUIDE. Main contributions of this paper are as follows: (1) To the best of our knowledge, this is the first work on mining the compact form of high utility patterns from data streams; (2) GUIDE is an effective one-pass framework which meets the requirements of data stream mining; (3) GUIDE generates novel patterns which are not only high utility but also maximal, which provide compact and insightful hidden information in the data streams. Experimental results show that our approach outperforms the state-of-the-art algorithms under various conditions in data stream environments on different models.  相似文献   

13.
Sliding window is a widely used model for data stream mining due to its emphasis on recent data and its bounded memory requirement. The main idea behind a transactional sliding window is to keep a fixed size window over a data stream. The window size is kept constant by removing old transactions from the window, when new transactions arrive. Older transactions of window are removed irrespective to whether a significant change has occurred or not. Another challenge of sliding window model is determining window size. The classic approach for determining the window size is to obtain it from the user. In order to determine the precise size of the window, the user must have prior knowledge about the time and scale of changes within the data stream. However, due to the unpredictable changing nature of data streams, this prior knowledge cannot be easily determined. Moreover, by using a fixed window size during a data stream mining, the performance of this model is degraded in terms of reflecting recent changes. Based on these observations, this study relaxes the notion of window size and proposes a new algorithm named VSW (Variable Size sliding Window frequent itemset mining) which is suitable for observing recent changes in the set of frequent itemsets over data streams. The window size is determined dynamically based on amounts of concept change that occurs within the arriving data stream. The window expands as the concept becomes stable and shrinks when a concept change occurs. In this study, it is shown that if stale transactions are removed from the window after a concept change, updated frequent itemsets always belong to the most recent concept. Experimental evaluations on both synthetic and real data show that our algorithm effectively detects the concept change, adjust the window size, and adapts itself to the new concepts along the data stream.  相似文献   

14.
High utility pattern (HUP) mining over data streams has become a challenging research issue in data mining. When a data stream flows through, the old information may not be interesting in the current time period. Therefore, incremental HUP mining is necessary over data streams. Even though some methods have been proposed to discover recent HUPs by using a sliding window, they suffer from the level-wise candidate generation-and-test problem. Hence, they need a large amount of execution time and memory. Moreover, their data structures are not suitable for interactive mining. To solve these problems of the existing algorithms, in this paper, we propose a novel tree structure, called HUS-tree (high utility stream tree) and a new algorithm, called HUPMS (high utility pattern mining over stream data) for incremental and interactive HUP mining over data streams with a sliding window. By capturing the important information of stream data into an HUS-tree, our HUPMS algorithm can mine all the HUPs in the current window with a pattern growth approach. Furthermore, HUS-tree is very efficient for interactive mining. Extensive performance analyses show that our algorithm is very efficient for incremental and interactive HUP mining over data streams and significantly outperforms the existing sliding window-based HUP mining algorithms.  相似文献   

15.
Sequential pattern mining has been studied extensively in the data mining community. Most previous studies require the specification of a min_support threshold for mining a complete set of sequential patterns satisfying the threshold. However, in practice, it is difficult for users to provide an appropriate min_support threshold. To overcome this difficulty, we propose an alternative mining task: mining top-k frequent closed sequential patterns of length no less than min_, where k is the desired number of closed sequential patterns to be mined and min_ is the minimal length of each pattern. We mine the set of closed patterns because it is a compact representation of the complete set of frequent patterns. An efficient algorithm, called TSP, is developed for mining such patterns without min_support. Starting at (absolute) min_support=1, the algorithm makes use of the length constraint and the properties of top-k closed sequential patterns to perform dynamic support raising and projected database pruning. Our extensive performance study shows that TSP has high performance. In most cases, it outperforms the efficient closed sequential pattern-mining algorithm, CloSpan, even when the latter is running with the best tuned min_support threshold. Thus, we conclude that, for sequential pattern mining, mining top-k frequent closed sequential patterns without min_support is more preferable than the traditional min_support-based mining.  相似文献   

16.
In this paper, we introduce item-centric mining, a new semantics for mining long-tailed datasets. Our algorithm, TopPI, finds for each item its top-k most frequent closed itemsets. While most mining algorithms focus on the globally most frequent itemsets, TopPI guarantees that each item is represented in the results, regardless of its frequency in the database.TopPI allows users to efficiently explore Web data, answering questions such as “what are the k most common sets of songs downloaded together with the ones of my favorite artist?”. When processing retail data consisting of 55 million supermarket receipts, TopPI finds the itemset “milk, puff pastry” that appears 10,315 times, but also “frangipane, puff pastry” and “nori seaweed, wasabi, sushi rice” that occur only 1120 and 163 times, respectively. Our experiments with analysts from the marketing department of our retail partner demonstrate that item-centric mining discover valuable itemsets. We also show that TopPI can serve as a building-block to approximate complex itemset ranking measures such as the p-value.Thanks to efficient enumeration and pruning strategies, TopPI avoids the search space explosion induced by mining low support itemsets. We show how TopPI can be parallelized on multi-cores and distributed on Hadoop clusters. Our experiments on datasets with different characteristics show the superiority of TopPI when compared to standard top-k solutions, and to Parallel FP-Growth, its closest competitor.  相似文献   

17.
18.
Data mining for Web intelligence   总被引:2,自引:0,他引:2  
Searching, comprehending, and using the semistructured HTML, XML, and database-service-engine information stored on the Web poses a significant challenge. This data is more sophisticated and dynamic than the information commercial database systems store. To supplement keyword-based indexing, researchers have applied data mining to Web-page ranking. In this context, data mining helps Web search engines find high-quality Web pages and enhances Web click stream analysis. For the Web to reach its full potential, however, we must improve its services, make it more comprehensible, and increase its usability. As researchers continue to develop data mining techniques, the authors believe this technology will play an increasingly important role in meeting the challenges of developing the intelligent Web. Ultimately, data mining for Web intelligence will make the Web a richer, friendlier, and more intelligent resource that we can all share and explore. The paper considers how data mining holds the key to uncovering and cataloging the authoritative links, traversal patterns, and semantic structures that will bring intelligence and direction to our Web interactions.  相似文献   

19.
High utility sequential pattern (HUSP) mining has emerged as an important topic in data mining. A number of studies have been conducted on mining HUSPs, but they are mainly intended for non-streaming data and thus do not take data stream characteristics into consideration. Streaming data are fast changing, continuously generated unbounded in quantity. Such data can easily exhaust computer resources (e.g., memory) unless a proper resource-aware mining is performed. In this study, we explore the fundamental problem of how limited memory can be best utilized to produce high quality HUSPs over a data stream. We design an approximation algorithm, called MAHUSP, that employs memory adaptive mechanisms to use a bounded portion of memory, in order to efficiently discover HUSPs over data streams. An efficient tree structure, called MAS-Tree, is proposed to store potential HUSPs over a data stream. MAHUSP guarantees that all HUSPs are discovered in certain circumstances. Our experimental study shows that our algorithm can not only discover HUSPs over data streams efficiently, but also adapt to memory allocation with limited sacrifices in the quality of discovered HUSPs. Furthermore, in order to show the effectiveness and efficiency of MAHUSP in real-life applications, we apply our proposed algorithm to a web clickstream dataset obtained from a Canadian news portal to showcase users’ reading behavior, and to a real biosequence database to identify disease-related gene regulation sequential patterns. The results show that MAHUSP effectively discovers useful and meaningful patterns in both cases.  相似文献   

20.
Frequent itemset mining over data streams becomes a hot topic in data mining and knowledge discovery in recent years, and has been applied to different areas. However, the setting of a minimum support threshold needs some domain knowledge. It will bring a lot of difficulties or much burden to users if the support threshold is not set reasonably. It is interesting for users to find top-K frequent itemsets over data streams. In this paper, a dynamical incremental approximate algorithm TOPSIL-Miner is presented to mine top-K significant itemsets in landmark windows. A new data structure, TOPSIL-Tree, is designed to store the potential significant itemsets and other data structures of maximum support list, ordered item list, TOPSET and minimum support list are devised to maintain information about mining results. Moreover, three optimal strategies are exploited to reduce time and space cost of the algorithm: (1) pruning trivial nodes in the current data stream, (2) promoting mining support threshold during mining process adaptively and heuristically, and (3) promoting pruning threshold dynamically. The accuracy of the algorithm is also analyzed. Extensive experiments are performed to evaluate the good effectiveness and the high efficiency and precision of the algorithm.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号