首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Mining frequent itemsets over data streams has attracted much research attention in recent years. In the past, we had developed a hash-based approach for mining frequent itemsets over a single data stream. In this paper, we extend that approach to mine global frequent itemsets from a collection of data streams distributed at distinct remote sites. To speed up the mining process, we make the first attempt to address a new problem on continuously maintaining a global synopsis for the union of all the distributed streams. The mining results therefore can be yielded on demand by directly processing the maintained global synopsis. Instead of collecting and processing all the data in a central server, which may waste the computation resources of remote sites, distributed computations over the data streams are performed. A distributed computation framework is proposed in this paper, including two communication strategies and one merging operation. These communication strategies are designed according to an accuracy guarantee of the mining results, determining when and what the remote sites should transmit to the central server (named coordinator). On the other hand, the merging operation is exploited to merge the information received from the remote sites into the global synopsis maintained at the coordinator. By the strategies and operation, the goal of continuously maintaining the global synopsis can be achieved. Rooted in the continuously maintained global synopsis, we propose a mining algorithm for finding global frequent itemsets. Moreover, the correctness guarantees of the communication strategies and merging operation, and the accuracy guarantee analysis of the mining algorithm are provided. Finally, a series of experiments on synthetic datasets and a real dataset are performed to show the effectiveness and efficiency of the distributed computation framework.  相似文献   

2.
Mining frequent itemsets from transactional data streams is challenging due to the nature of the exponential explosion of itemsets and the limit memory space required for mining frequent itemsets. Given a domain of I unique items, the possible number of itemsets can be up to 2I − 1. When the length of data streams approaches to a very large number N, the possibility of an itemset to be frequent becomes larger and difficult to track with limited memory. The existing studies on finding frequent items from high speed data streams are false-positive oriented. That is, they control memory consumption in the counting processes by an error parameter ?, and allow items with support below the specified minimum support s but above s − ? counted as frequent ones. However, such false-positive oriented approaches cannot be effectively applied to frequent itemsets mining for two reasons. First, false-positive items found increase the number of false-positive frequent itemsets exponentially. Second, minimization of the number of false-positive items found, by using a small ?, will make memory consumption large. Therefore, such approaches may make the problem computationally intractable with bounded memory consumption. In this paper, we developed algorithms that can effectively mine frequent item(set)s from high speed transactional data streams with a bound of memory consumption. Our algorithms are based on Chernoff bound in which we use a running error parameter to prune item(set)s and use a reliability parameter to control memory. While our algorithms are false-negative oriented, that is, certain frequent itemsets may not appear in the results, the number of false-negative itemsets can be controlled by a predefined parameter so that desired recall rate of frequent itemsets can be guaranteed. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms.  相似文献   

3.
A data stream is a massive, open-ended sequence of data elements continuously generated at a rapid rate. Mining data streams is more difficult than mining static databases because the huge, high-speed and continuous characteristics of streaming data. In this paper, we propose a new one-pass algorithm called DSM-MFI (stands for Data Stream Mining for Maximal Frequent Itemsets), which mines the set of all maximal frequent itemsets in landmark windows over data streams. A new summary data structure called summary frequent itemset forest (abbreviated as SFI-forest) is developed for incremental maintaining the essential information about maximal frequent itemsets embedded in the stream so far. Theoretical analysis and experimental studies show that the proposed algorithm is efficient and scalable for mining the set of all maximal frequent itemsets over the entire history of the data streams.  相似文献   

4.
王鑫  刘方爱 《计算机应用》2016,36(7):1988-1992
针对已有的多数据流协同频繁项集挖掘算法存在内存占用率高以及发现频繁项集效率低的问题,提出了改进的多数据流协同频繁项集挖掘(MCMD-Stream)算法。首先,该算法利用单遍扫描数据库的字节序列滑动窗口挖掘算法发现数据流中的潜在频繁项集和频繁项集;其次,构建类似频繁模式树(FP-Tree)的压缩频繁模式树(CP-Tree)存储已发现的潜在频繁项集和频繁项集,同时更新CP-Tree树中每个节点生成的对数倾斜时间表中的频繁项计数;最后,通过汇总分析得出在多条数据流中多次出现的且有价值的频繁项集,即协同频繁项集。相比A-Stream和H-Stream算法,MCMD-Stream算法不仅能够提高多数据流中协同频繁项集挖掘的效率,并且还降低了内存空间的使用率。实验结果表明MCMD-Stream算法能够有效地应用于多数据流的协同频繁项集挖掘。  相似文献   

5.
A data stream is a massive and unbounded sequence of data elements that are continuously generated at a fast speed. Compared with traditional approaches, data mining in data streams is more challenging since several extra requirements need to be satisfied. In this paper, we propose a mining algorithm for finding frequent itemsets over the transactional data stream. Unlike most of existing algorithms, our method works based on the theory of Approximate Inclusion–Exclusion. Without incrementally maintaining the overall synopsis of the stream, we can approximate the itemsets’ counts according to certain kept information and the counts bounding technique. Some additional techniques are designed and integrated into the algorithm for performance improvement. Besides, the performance of the proposed algorithm is tested and analyzed through a series of experiments.  相似文献   

6.
Towards a new approach for mining frequent itemsets on data stream   总被引:1,自引:0,他引:1  
Mining frequent patterns on streaming data is a new challenging problem for the data mining community since data arrives sequentially in the form of continuous rapid streams. In this paper we propose a new approach for mining itemsets. Our approach has the following advantages: an efficient representation of items and a novel data structure to maintain frequent patterns coupled with a fast pruning strategy. At any time, users can issue requests for frequent itemsets over an arbitrary time interval. Furthermore our approach produces an approximate answer with an assurance that it will not bypass user-defined frequency and temporal thresholds. Finally the proposed method is analyzed by a series of experiments on different datasets.  相似文献   

7.
Efficient algorithms for mining closed itemsets and their lattice structure   总被引:7,自引:0,他引:7  
The set of frequent closed itemsets uniquely determines the exact frequency of all itemsets, yet it can be orders of magnitude smaller than the set of all frequent itemsets. In this paper, we present CHARM, an efficient algorithm for mining all frequent closed itemsets. It enumerates closed sets using a dual itemset-tidset search tree, using an efficient hybrid search that skips many levels. It also uses a technique called diffsets to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any "nonclosed" sets found during computation. We also present CHARM-L, an algorithm that outputs the closed itemset lattice, which is very useful for rule generation and visualization. An extensive experimental evaluation on a number of real and synthetic databases shows that CHARM is a state-of-the-art algorithm that outperforms previous methods. Further, CHARM-L explicitly generates the frequent closed itemset lattice.  相似文献   

8.
DSM-FI: an efficient algorithm for mining frequent itemsets in data streams   总被引:4,自引:4,他引:0  
Online mining of data streams is an important data mining problem with broad applications. However, it is also a difficult problem since the streaming data possess some inherent characteristics. In this paper, we propose a new single-pass algorithm, called DSM-FI (data stream mining for frequent itemsets), for online incremental mining of frequent itemsets over a continuous stream of online transactions. According to the proposed algorithm, each transaction of the stream is projected into a set of sub-transactions, and these sub-transactions are inserted into a new in-memory summary data structure, called SFI-forest (summary frequent itemset forest) for maintaining the set of all frequent itemsets embedded in the transaction data stream generated so far. Finally, the set of all frequent itemsets is determined from the current SFI-forest. Theoretical analysis and experimental studies show that the proposed DSM-FI algorithm uses stable memory, makes only one pass over an online transactional data stream, and outperforms the existing algorithms of one-pass mining of frequent itemsets.
Suh-Yin LeeEmail:
  相似文献   

9.
As data have been accumulated more quickly in recent years, corresponding databases have also become huger, and thus, general frequent pattern mining methods have been faced with limitations that do not appropriately respond to the massive data. To overcome this problem, data mining researchers have studied methods which can conduct more efficient and immediate mining tasks by scanning databases only once. Thereafter, the sliding window model, which can perform mining operations focusing on recently accumulated parts over data streams, was proposed, and a variety of mining approaches related to this have been suggested. However, it is hard to mine all of the frequent patterns in the data stream environment since generated patterns are remarkably increased as data streams are continuously extended. Thus, methods for efficiently compressing generated patterns are needed in order to solve that problem. In addition, since not only support conditions but also weight constraints expressing items’ importance are one of the important factors in the pattern mining, we need to consider them in mining process. Motivated by these issues, we propose a novel algorithm, weighted maximal frequent pattern mining over data streams based on sliding window model (WMFP-SW) to obtain weighted maximal frequent patterns reflecting recent information over data streams. Performance experiments report that MWFP-SW outperforms previous algorithms in terms of runtime, memory usage, and scalability.  相似文献   

10.
A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This issue becomes more serious in finding frequent itemsets or frequency counting over an online transactional data stream since there can be a large number of itemsets to be monitored. We have proposed a method called the estDec method for finding frequent itemsets over an online data stream. In order to reduce the number of monitored itemsets in this method, monitoring the count of an itemset is delayed until its support is large enough to become a frequent itemset in the near future. For this purpose, the count of an itemset should be estimated. Consequently, how to estimate the count of an itemset is a critical issue in minimizing memory usage as well as processing time. In this paper, the effects of various count estimation methods for finding frequent itemsets are analyzed in terms of mining accuracy, memory usage and processing time.  相似文献   

11.
Utility of an itemset is considered as the value of this itemset, and utility mining aims at identifying the itemsets with high utilities. The temporal high utility itemsets are the itemsets whose support is larger than a pre-specified threshold in current time window of the data stream. Discovery of temporal high utility itemsets is an important process for mining interesting patterns like association rules from data streams. In this paper, we propose a novel method, namely THUI (Temporal High Utility Itemsets)-Mine, for mining temporal high utility itemsets from data streams efficiently and effectively. To the best of our knowledge, this is the first work on mining temporal high utility itemsets from data streams. The novel contribution of THUI-Mine is that it can effectively identify the temporal high utility itemsets by generating fewer candidate itemsets such that the execution time can be reduced substantially in mining all high utility itemsets in data streams. In this way, the process of discovering all temporal high utility itemsets under all time windows of data streams can be achieved effectively with less memory space and execution time. This meets the critical requirements on time and space efficiency for mining data streams. Through experimental evaluation, THUI-Mine is shown to significantly outperform other existing methods like Two-Phase algorithm under various experimental conditions.  相似文献   

12.
增量式频繁项集挖掘是当前研究的热点,基于FP-Growth的Pre-FUFP算法有效处理了频繁模式的更新,但需递归遍历FP-tree,导致效率较低。提出Pre-FIUT算法,引入频繁超度量树结构,提高了获得频繁项集挖掘效率;基于FIUT的Pre-FIUT可通过查看频繁超度量树叶子结点的支持度确定频繁项集,并与次频繁项集概念相结合进行增量式频繁项集挖掘。实验表明,Pre-FIUT算法能快速扫描和更新数据,合理利用内存,精确获得频繁项集。  相似文献   

13.
将动态规划算法应用于最大频繁项目集的挖掘,可以克服Apriori算法需要多次扫描数据库确定新的候选项集的缺点;通过对数据进行初始化构建矩阵,结合动态规划的思想通过在矩阵中找到最大无向完全图来获得所有的最大伪频繁项集,最后利用一个非频繁项集的子集有可能是频繁项目集的性质对所有的最大伪频繁项集消减获取最大频繁项集。实验结果表明,它能够快速挖掘频繁项集,且适用于海量、高维数据。  相似文献   

14.
李海峰  章宁 《计算机工程》2012,38(21):45-48
最大频繁项集适用于内存空间有限的数据流挖掘。为此,提出一种基于界碑模型的最大频繁项集挖掘方法,采用最大频繁项集树的数据结构,增量式地维护最大频繁项集与部分附属信息,实现项集的快速搜索和裁剪。在MUSHROOM和BMS-POS数据集上的实验结果表明,该方法具有较高的挖掘效率。  相似文献   

15.
对于大型数据,频繁项集挖掘显得庞大而冗余,挖掘最大频繁项集可以减少挖出的频繁项集的个数。可是对于不确定性数据流,传统判断项集是否频繁的方法已不能准确表达项集的频繁性,而且目前还没有在不确定数据流上挖掘最大频繁项集的相关研究。因此,针对上述不足,提出了一种基于衰减模型的不确定性数据流最大频繁项集挖掘算法TUFSMax。该算法采用标记树结点的方法,使得算法不需要超集检测就可挖掘出所有的最大频繁项集,节约了超集检测时间。实验证明了提出的算法在时间和空间上具有高效性。  相似文献   

16.
数据流具有流动性、连续性以及项分布不均衡性等特点,挖掘数据流中频繁项集是一项意义重大且具有挑战性的工作。提出一种均衡时空挖掘数据流中频繁项集算法—Bala_ Tree, Bala_ Tree实现一遍扫描数据流、快速簇更新、周期树结构重构以及基于经典算法挖掘频繁项集。实验表明,此算法能快速扫描和更新数据,合理利用内存以及精确获得频繁项集,Ba1a_Tree算法优于其他同类算法。  相似文献   

17.
Mining utility itemsets from data steams is one of the most interesting research issues in data mining and knowledge discovery. In this paper, two efficient sliding window-based algorithms, MHUI-BIT (Mining High-Utility Itemsets based on BITvector) and MHUI-TID (Mining High-Utility Itemsets based on TIDlist), are proposed for mining high-utility itemsets from data streams. Based on the sliding window-based framework of the proposed approaches, two effective representations of item information, Bitvector and TIDlist, and a lexicographical tree-based summary data structure, LexTree-2HTU, are developed to improve the efficiency of discovering high-utility itemsets with positive profits from data streams. Experimental results show that the proposed algorithms outperform than the existing approaches for discovering high-utility itemsets from data streams over sliding windows. Beside, we also propose the adapted approaches of algorithms MHUI-BIT and MHUI-TID in order to handle the case when we are interested in mining utility itemsets with negative item profits. Experiments show that the variants of algorithms MHUI-BIT and MHUI-TID are efficient approaches for mining high-utility itemsets with negative item profits over stream transaction-sensitive sliding windows.  相似文献   

18.
A survey on algorithms for mining frequent itemsets over data streams   总被引:9,自引:8,他引:1  
The increasing prominence of data streams arising in a wide range of advanced applications such as fraud detection and trend learning has led to the study of online mining of frequent itemsets (FIs). Unlike mining static databases, mining data streams poses many new challenges. In addition to the one-scan nature, the unbounded memory requirement and the high data arrival rate of data streams, the combinatorial explosion of itemsets exacerbates the mining task. The high complexity of the FI mining problem hinders the application of the stream mining techniques. We recognize that a critical review of existing techniques is needed in order to design and develop efficient mining algorithms and data structures that are able to match the processing rate of the mining with the high arrival rate of data streams. Within a unifying set of notations and terminologies, we describe in this paper the efforts and main techniques for mining data streams and present a comprehensive survey of a number of the state-of-the-art algorithms on mining frequent itemsets over data streams. We classify the stream-mining techniques into two categories based on the window model that they adopt in order to provide insights into how and why the techniques are useful. Then, we further analyze the algorithms according to whether they are exact or approximate and, for approximate approaches, whether they are false-positive or false-negative. We also discuss various interesting issues, including the merits and limitations in existing research and substantive areas for future research.  相似文献   

19.
窗口模式下在线数据流中频繁项集的挖掘*   总被引:1,自引:1,他引:0  
拟采用一种基于滑动窗模式的单遍挖掘算法,专注于处理近期数据;为了减少处理时间和占用的内存,设计了一种新的事务表示方法。通过处理这个事务的表达式,频繁项集可以被高效输出,并解决了使用基于Apriori理论的算法时,由候选频繁1-项集生成频繁2-项集时数据项顺序判断不准确问题。该算法称为MRFI-SW算法。  相似文献   

20.
Sliding window is a widely used model for data stream mining due to its emphasis on recent data and its bounded memory requirement. The main idea behind a transactional sliding window is to keep a fixed size window over a data stream. The window size is kept constant by removing old transactions from the window, when new transactions arrive. Older transactions of window are removed irrespective to whether a significant change has occurred or not. Another challenge of sliding window model is determining window size. The classic approach for determining the window size is to obtain it from the user. In order to determine the precise size of the window, the user must have prior knowledge about the time and scale of changes within the data stream. However, due to the unpredictable changing nature of data streams, this prior knowledge cannot be easily determined. Moreover, by using a fixed window size during a data stream mining, the performance of this model is degraded in terms of reflecting recent changes. Based on these observations, this study relaxes the notion of window size and proposes a new algorithm named VSW (Variable Size sliding Window frequent itemset mining) which is suitable for observing recent changes in the set of frequent itemsets over data streams. The window size is determined dynamically based on amounts of concept change that occurs within the arriving data stream. The window expands as the concept becomes stable and shrinks when a concept change occurs. In this study, it is shown that if stale transactions are removed from the window after a concept change, updated frequent itemsets always belong to the most recent concept. Experimental evaluations on both synthetic and real data show that our algorithm effectively detects the concept change, adjust the window size, and adapts itself to the new concepts along the data stream.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号