1.
In recent years, data stream mining has become an important research topic. With the emergence of new applications, the data we process are no longer static but continuous, dynamic data streams. Examples include network traffic analysis, Web click stream mining, network intrusion detection, and on-line transaction analysis. In this paper, we propose a new framework for data stream mining, called the weighted sliding window model. The proposed model allows the user to specify the number of windows for mining, the size of a window, and the weight for each window. Users can thus assign a higher weight to a more significant data section, which makes the mining result closer to the user's requirements. Based on the weighted sliding window model, we propose a single-pass algorithm, called WSW, to efficiently discover all the frequent itemsets from data streams. By analyzing data characteristics, an improved algorithm, called WSW-Imp, is developed to further reduce the time needed to decide whether a candidate itemset is frequent. Empirical results show that WSW-Imp outperforms WSW under the weighted sliding window model.
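To make the weighted sliding window idea concrete, the following is a minimal sketch (not the WSW or WSW-Imp algorithm) of how a weighted support could be computed over user-specified windows; the window contents and weights are hypothetical.

```python
def weighted_support(itemset, windows, weights):
    """Weighted support of an itemset: per-window relative support scaled by
    the user-assigned weight of that window (illustrative only)."""
    itemset = frozenset(itemset)
    score = 0.0
    for window, weight in zip(windows, weights):
        count = sum(1 for txn in window if itemset <= txn)
        score += weight * count / len(window)
    return score

# Hypothetical stream split into three windows, the newest window weighted highest.
windows = [
    [frozenset("ab"), frozenset("abc"), frozenset("bc")],
    [frozenset("ab"), frozenset("ac"), frozenset("abc")],
    [frozenset("a"), frozenset("ab"), frozenset("b")],
]
weights = [0.2, 0.3, 0.5]
print(weighted_support({"a", "b"}, windows, weights))   # ≈ 0.5
```

Under such a model, an itemset would be reported as frequent when its weighted support exceeds a user-given threshold.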
2.
Identifying the most frequent elements in a data stream is a well-known and difficult problem. Identifying the most frequent elements for each individual, especially in very large populations, is even harder. Fast algorithms with a small memory footprint are paramount when the number of individuals is very large. In many situations such analysis needs to be performed and kept up to date in near real time. Fortunately, approximate answers are usually adequate for this problem. This paper presents a new algorithm that addresses the problem by merging the commonly used counter-based and sketch-based techniques for top-k identification. The algorithm provides the top-k list of elements, their frequencies, and an error estimate for each frequency value. It also provides strong guarantees on the error estimate, the order of elements, and the inclusion of elements in the list depending on their real frequency. Additionally, the algorithm provides stochastic bounds on the error and expected error estimates. Telecommunications customer behavior and voice call data are used to present concrete results obtained with this algorithm and to illustrate improvements over previously existing algorithms.
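The abstract does not spell out the algorithm, but the counter-based half of such hybrids is commonly the Space-Saving technique; a minimal sketch of that technique (assumed here purely for illustration) follows.

```python
def space_saving(stream, k):
    """Minimal counter-based top-k (Space-Saving): keep at most k counters;
    on overflow, evict the minimum counter and inherit its count as error."""
    counters = {}        # element -> (count, overestimation error)
    for x in stream:
        if x in counters:
            c, e = counters[x]
            counters[x] = (c + 1, e)
        elif len(counters) < k:
            counters[x] = (1, 0)
        else:
            victim = min(counters, key=lambda y: counters[y][0])
            c_min, _ = counters.pop(victim)
            counters[x] = (c_min + 1, c_min)   # count may overestimate by c_min
    return sorted(counters.items(), key=lambda kv: kv[1][0], reverse=True)

print(space_saving("aababcacdaab", k=3))
```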
3.
Many recent applications involve processing and analyzing uncertain data. In this paper, we combine the feature of top-k objects with that of the skyline to model the problem of top-k skyline objects on uncertain data. Efficiently computing top-k skyline objects on large uncertain datasets is challenging in both the discrete and continuous cases. We first develop an efficient exact algorithm for computing the top-k skyline objects in the discrete case. To address applications where each object may have a massive set of instances or a continuous probability density function, we also develop an efficient randomized algorithm with an ε-approximation guarantee. Moreover, our algorithms can be immediately extended to efficiently compute the p-skyline, that is, to retrieve the uncertain objects with skyline probabilities above a given threshold. Our extensive experiments on synthetic and real data demonstrate the efficiency of both algorithms and show that the randomized algorithm is highly accurate. They also show that our techniques significantly outperform the existing techniques for computing the p-skyline.
4.
Discovering association rules is an important aspect of data mining, and generating frequent itemsets is a key step in it. This paper proposes an algorithm for quickly mining frequent itemsets based on an orthogonal (cross) linked list. The algorithm scans the database only once, makes full use of the information already obtained to generate frequent itemsets, and does not need to store candidate itemsets. Comparison with several other algorithms shows that the proposed algorithm has better performance.
5.
6.
《Expert Systems with Applications》2014,41(10):4505-4512
Node-list and N-list, two novel data structures proposed in recent years, have been proven to be very efficient for mining frequent itemsets. The main problem with these structures is that they both need to encode each node of a PPC-tree with both pre-order and post-order codes, which makes them memory-consuming and inconvenient for mining frequent itemsets. In this paper, we propose the Nodeset, a more efficient data structure for mining frequent itemsets. Nodesets require only the pre-order (or post-order) code of each node, which saves half of the memory compared with N-lists and Node-lists. Based on Nodesets, we present an efficient algorithm called FIN to mine frequent itemsets. To evaluate the performance of FIN, we have conducted experiments comparing it with PrePost and FP-growth, two state-of-the-art algorithms, on a variety of real and synthetic datasets. The experimental results show that FIN performs well in terms of both running time and memory usage.
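The key point of the abstract, keeping a single pre-order code per tree node, can be illustrated with a small sketch; the tree and item names below are hypothetical and this is not the FIN implementation.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, children=None):
        self.item, self.children = item, children or []
        self.pre = None          # pre-order code, assigned by the traversal

def assign_preorder(root):
    """Assign a single pre-order code to every node (the Nodeset idea keeps
    only this code instead of a pre-/post-order pair)."""
    counter, stack = 0, [root]
    while stack:
        node = stack.pop()
        node.pre = counter
        counter += 1
        stack.extend(reversed(node.children))
    return root

def nodesets_by_item(root):
    """Group the pre-order codes of all nodes carrying the same item."""
    table, stack = defaultdict(list), [root]
    while stack:
        node = stack.pop()
        table[node.item].append(node.pre)
        stack.extend(reversed(node.children))
    return dict(table)

# Hypothetical prefix tree: root -> a -> b, and root -> b
tree = assign_preorder(Node("root", [Node("a", [Node("b")]), Node("b")]))
print(nodesets_by_item(tree))   # {'root': [0], 'a': [1], 'b': [2, 3]}
```

Each item's list of pre-order codes roughly plays the role of that item's Nodeset in the paper's terminology.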
7.
《Expert Systems with Applications》2014,41(6):2914-2938
Multilevel knowledge in transactional databases plays a significant role in real-life market basket analysis. Many researchers have mined hierarchical association rules and proposed various approaches. However, some of the existing approaches produce many multilevel and cross-level association rules that fail to convey quality information, and from such a large number of redundant association rules it is extremely difficult to extract any meaningful information. There also exist approaches that mine minimal association rules, but these have many shortcomings due to their naïve approaches. In this paper, we focus on the need for generating hierarchical minimal rules that provide maximal information. An algorithm is proposed to derive minimal multilevel association rules and cross-level association rules. Our work makes a significant contribution to mining minimal cross-level association rules, which express the mixed relationship between the generalized and specialized views of the transaction itemsets. We are the first to design an efficient algorithm using a closed-itemset-lattice-based approach that can mine the most relevant minimal cross-level association rules. The parent–child relationship of the lattices is exploited while mining cross-level closed itemset lattices. We have extensively evaluated our proposed algorithm's efficiency using a variety of real-life datasets and a large number of experiments. The proposed algorithm significantly outperforms existing related work in extensive performance comparisons.
8.
Hai Thanh Mai, Yu Won Lee 《Journal of Systems and Software》2011,84(2):314-327
Top-k monitoring queries are useful in many wireless sensor network applications. A query of this type continuously returns a list of the k nodes with the highest (or lowest) sensor readings. To process these queries, a well-known approach is to install a filter at each sensor node to avoid unnecessary transmissions of sensor readings. In this paper, we propose a new top-k monitoring method, named Distributed Adaptive Filter-based Monitoring. In this method, we first propose a new query reevaluation algorithm that works in a distributed fashion in the network to reduce the communication cost of sending probe messages. Then, we present an adaptive filter-updating algorithm, based on predicted benefits, to lower the transmission cost of sending updated filters to the sensor nodes. Experimental results on real data traces show that our proposed method performs much better than existing methods in terms of both network lifetime and average energy consumption.
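As a rough illustration of the filter idea described above (not the proposed Distributed Adaptive Filter-based Monitoring method), a node suppresses its reading while it stays inside an installed filter range; the node ids, ranges, and readings below are hypothetical.

```python
class SensorNode:
    """Each node keeps a filter [low, high]; it transmits its reading only
    when the reading falls outside the filter (illustrative sketch)."""
    def __init__(self, node_id, low, high):
        self.node_id, self.low, self.high = node_id, low, high

    def maybe_report(self, reading):
        if self.low <= reading <= self.high:
            return None                      # suppressed: no transmission
        return (self.node_id, reading)       # filter violated: report to sink

nodes = [SensorNode("n1", 10, 20), SensorNode("n2", 15, 25)]
readings = {"n1": 18, "n2": 30}              # hypothetical current readings

reports = []
for node in nodes:
    report = node.maybe_report(readings[node.node_id])
    if report is not None:
        reports.append(report)
print(reports)                               # only n2 violates its filter and transmits
```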
9.
Ramkishore Bhattacharyya 《Pattern Recognition Letters》2011,32(13):1554-1563
The classical notion of clustering is to induce an equivalence-class partition on a set of points; each class, being a homogeneous group, is called a cluster. Since it is an equivalence-class partition, a point must belong to one and exactly one cluster. However, in many applications, data distributions are such that only a subset of the points tends to flock into distinct clusters while the others are scattered randomly. The present paper introduces an algorithm to find an optimal subset of points (ideally filtering out the random ones) with sufficient grouping tendency. It builds the neighborhood population around every point and picks the top k dense regions, with possible reshuffling of points in post-processing. The performance of the algorithm is evaluated on real and simulated data. Comparative analysis on different quality indices against other state-of-the-art algorithms establishes the effectiveness of the approach.
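A minimal sketch of the general idea of ranking points by neighborhood population and keeping the top k dense ones (not the paper's algorithm; the radius and sample points are hypothetical):

```python
import numpy as np

def top_k_dense_points(points, radius, k):
    """Count the neighborhood population of every point within `radius`
    and return the k points with the densest neighborhoods (illustrative)."""
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    population = (dist <= radius).sum(axis=1) - 1      # exclude the point itself
    order = np.argsort(-population)[:k]
    return [(tuple(pts[i]), int(population[i])) for i in order]

points = [(0, 0), (0.1, 0.1), (0.2, 0), (5, 5), (5.1, 5), (9, 1)]
print(top_k_dense_points(points, radius=0.5, k=2))
```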
10.
Research on and Improvement of the Apriori Algorithm for Association Rule Mining
Association rule mining is an important task in the data mining research field, aiming to discover interesting associations in transactional databases. The Apriori algorithm is the classic algorithm for association rule mining. However, Apriori suffers from drawbacks such as the low efficiency of candidate itemset generation and repeated scans of the data. This paper analyzes the principle and efficiency of the Apriori algorithm, points out some of its shortcomings, and proposes an improved algorithm, Apriori_LB. The algorithm is based on a new data structure and improves the join method used to generate candidate itemsets. After describing Apriori_LB in detail, the paper analyzes and compares Apriori and Apriori_LB; experimental results show that the improved Apriori_LB outperforms Apriori, and the effect is particularly pronounced when mining transactional databases with a small minimum support or a small number of items.
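For reference, a textbook-style sketch of the baseline Apriori level-wise procedure that such improvements target (this is not Apriori_LB; the transactions and threshold are hypothetical):

```python
def apriori(transactions, min_support):
    """Textbook Apriori: level-wise candidate generation with support counting.
    Returns {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    level = {c: s for c, s in counts.items() if s >= min_support}
    result, k = dict(level), 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into k-candidates, then count.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: s for c, s in counts.items() if s >= min_support}
        result.update(level)
        k += 1
    return result

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(txns, min_support=3))
```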
11.
This paper analyzes the execution behavior of the "No Random Accesses" (NRA) algorithm and determines the depths to which each sorted file is scanned in the growing phase and the shrinking phase of NRA, respectively. The analysis shows that NRA needs to maintain a large number of candidate tuples in the growing phase on massive data. Based on this analysis, the paper proposes a novel top-k algorithm, Top-K with Early Pruning (TKEP), which performs early pruning in the growing phase. A general rule and a mathematical analysis for early pruning are presented. The theoretical analysis shows that early pruning can prune most of the candidate tuples. Although TKEP is an approximate method for obtaining the top-k result, the probability of correctness is extremely high. Extensive experiments show that TKEP has a significant advantage over NRA.
12.
Implementation and Improvement of Association Rule Algorithms
As a data mining tool, association rules can reveal interesting associations among sets of data items. Among association rule algorithms, Apriori is one of the key algorithms. When facing large and complex datasets, how to choose data structures and how to optimize the processing procedure are very important for the algorithm's performance. This paper first introduces the principles of association rules and the implementation of the Apriori algorithm, and then proposes several improvements to it, such as storing and accessing frequent itemsets with a tree structure and applying three caching optimizations. These optimizations improve the overall efficiency of the algorithm. For large data, experiments show that these improvements implement the Apriori algorithm correctly, effectively, and quickly.
13.
A new concise representation of frequent itemsets using generators and a positive border
A complete set of frequent itemsets can get undesirably large due to redundancy when the minimum support threshold is low or when the database is dense. Several concise representations have been proposed previously to eliminate the redundancy. Generator-based representations rely on a negative border to make the representation lossless. However, the number of itemsets on a negative border sometimes even exceeds the total number of frequent itemsets. In this paper, we propose to use a positive border together with frequent generators to form a lossless representation. A positive border is usually orders of magnitude smaller than its corresponding negative border. A set of frequent generators plus its positive border is always no larger than the corresponding complete set of frequent itemsets, so it is a true concise representation. The generalized form of this representation is also proposed. We develop an efficient algorithm, called GrGrowth, to mine generators and positive borders as well as their generalizations. The GrGrowth algorithm uses a depth-first-search strategy to explore the search space, which is much more efficient than the breadth-first-search strategy adopted by most existing generator mining algorithms. Our experimental results show that the GrGrowth algorithm is significantly faster than level-wise algorithms for mining generator-based representations, and is comparable to the state-of-the-art algorithms for mining frequent closed itemsets.
14.
A top-k query returns the k tuples with the highest (or lowest) scores from a relation. The score is computed by combining the values of one or more attributes. We focus on top-k queries with monotone linear score functions. Layer-based methods are well-known techniques for top-k query processing. These methods organize a database as a single list of layers, where the i-th layer contains the tuples that can be the top-i tuple; thus, these methods answer top-k queries by reading at most k layers. Query performance, however, is poor when the number of tuples in each layer (the layer size) is large. In this paper, we propose a new layer-ordering method, called the Partitioned-Layer Index (PL Index), that significantly improves query performance by reducing the layer size. The PL Index uses the notion of partitioning, which organizes a database as multiple sublayer lists instead of a single layer list, thereby reducing the layer size. The PL Index also uses the convex skyline, a subset of the skyline, to construct each sublayer and further reduce the layer size. The PL Index has the following desirable properties: its query performance is quite insensitive to the attribute weights (the preference vector) of the score function and is approximately linear in the value of k, and it can tune query performance for the most frequently used value of k by controlling the number of sublayer lists. Experimental results using synthetic and real data sets show that the query performance of the PL Index significantly outperforms existing methods except for small values of k (say, k ≤ 9).
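A rough sketch of the layer-based idea, peeling skyline layers from a small point set (assuming larger attribute values are better); note that this uses plain skyline layers rather than the convex-skyline sublayers of the PL Index, and the points are hypothetical:

```python
def dominates(p, q):
    """p dominates q if p is at least as good on every attribute and strictly
    better on at least one (higher values assumed better)."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def skyline_layers(tuples):
    """Peel skyline layers: each layer holds the tuples not dominated by any
    remaining tuple, so a top-k query with a monotone score reads few layers."""
    remaining, layers = list(tuples), []
    while remaining:
        layer = [p for p in remaining if not any(dominates(q, p) for q in remaining)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers

points = [(9, 1), (7, 6), (2, 8), (5, 5), (3, 3), (1, 2)]
for i, layer in enumerate(skyline_layers(points), start=1):
    print(f"layer {i}: {layer}")
```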
15.
裴古英 《自动化与仪器仪表》2009,(5):16-18
Discovering association rules is an important problem in data mining, and its core is frequent pattern mining. The commonly used Apriori algorithm has to scan the database multiple times and generates a large number of candidate itemsets, which is very costly. This paper adopts an association mining algorithm based on a Boolean matrix, which scans the database only once and does not need to generate candidate itemsets through joins, thereby improving the algorithm's efficiency. An example illustrates that it is an effective association rule mining method.
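A minimal numpy sketch of the Boolean matrix idea (hypothetical data, not the paper's algorithm): the database is scanned once into a bit matrix, and the support of any itemset is then obtained by AND-ing its columns.

```python
import numpy as np

# Hypothetical transaction database scanned once into a Boolean matrix:
# rows = transactions, columns = items a, b, c, d.
items = ["a", "b", "c", "d"]
M = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
], dtype=bool)

def support(itemset):
    """Support of an itemset = number of rows where all its columns are 1."""
    cols = [items.index(i) for i in itemset]
    return int(np.all(M[:, cols], axis=1).sum())

print(support({"a", "b"}))       # 2
print(support({"b", "c"}))       # 2
```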
16.
17.
A data stream is a massive and unbounded sequence of data elements that are continuously generated at high speed. Compared with traditional approaches, data mining over data streams is more challenging since several extra requirements must be satisfied. In this paper, we propose a mining algorithm for finding frequent itemsets over a transactional data stream. Unlike most existing algorithms, our method works based on the theory of Approximate Inclusion–Exclusion. Without incrementally maintaining an overall synopsis of the stream, we can approximate the itemsets' counts from certain kept information and a count-bounding technique. Additional techniques are designed and integrated into the algorithm to improve its performance, and the performance of the proposed algorithm is tested and analyzed through a series of experiments.
18.
Mining frequent itemsets is an essential problem in data mining and plays an important role in many data mining applications. In recent years, some itemset representations based on node sets have been proposed and have been shown to be very efficient for mining frequent itemsets. In this paper, we propose DiffNodeset, a novel and more efficient itemset representation for mining frequent itemsets. Based on the DiffNodeset structure, we present an efficient algorithm, named dFIN, to mine frequent itemsets. To achieve high efficiency, dFIN finds frequent itemsets using a set-enumeration tree with a hybrid search strategy and, in some cases, directly enumerates frequent itemsets without candidate generation. To evaluate the performance of dFIN, we have conducted extensive experiments comparing it against existing leading algorithms on a variety of real and synthetic datasets. The experimental results show that dFIN is significantly faster than these leading algorithms.
19.
A core issue in extracting association rules in the data mining field is finding the frequent patterns in a database of operational transactions. If these patterns are discovered, decision making and strategy determination in organizations can be accomplished with greater precision. A frequent pattern is a pattern seen in a significant number of transactions. Because such data are produced continuously and at high speed, they cannot be stored in memory, and it is therefore necessary to develop techniques that process them online and find repetitive patterns. Several mining methods have been proposed in the literature that attempt to efficiently extract a complete or a closed set of different types of frequent patterns from a dataset. In this paper, a method based on Cellular Learning Automata (CLA) is presented for mining frequent itemsets. The proposed method is compared with the Apriori, FP-Growth, and BitTable methods, and it is concluded that frequent itemset mining can be achieved in less running time. The experiments are conducted on several datasets with different values of minsup for all the algorithms as well as the presented method. The results confirm the effectiveness of the proposed method.
20.
Frequent itemset mining is an important problem in the data mining area with a wide range of applications. Many decision support systems need to support online interactive frequent itemset mining, which is a challenging task because frequent itemset mining is a computation-intensive, repetitive process. One solution is to precompute frequent itemsets. In this paper, we propose a compact disk-based data structure, the CFP-tree, to store precomputed frequent itemsets on disk to support online mining requests. The CFP-tree structure effectively exploits the redundancy in frequent itemsets to save space; its compression ratio can be as high as several thousand or even higher. Efficient algorithms for retrieving frequent itemsets from a CFP-tree, as well as efficient algorithms to construct and maintain a CFP-tree, are developed. Our performance study demonstrates that with a CFP-tree, frequent itemset mining requests can be answered promptly.