首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Mining association rules is most commonly seen among the techniques for knowledge discovery from databases (KDD). It is used to discover relationships among items or itemsets. Furthermore, temporal data mining is concerned with the analysis of temporal data and the discovery of temporal patterns and regularities. In this paper, a new concept of up-to-date patterns is proposed, which is a hybrid of the association rules and temporal mining. An itemset may not be frequent (large) for an entire database but may be large up-to-date since the items seldom occurring early may often occur lately. An up-to-date pattern is thus composed of an itemset and its up-to-date lifetime, in which the user-defined minimum-support threshold must be satisfied. The proposed approach can mine more useful large itemsets than the conventional ones which discover large itemsets valid only for the entire database. Experimental results show that the proposed algorithm is more effective than the traditional ones in discovering such up-to-date temporal patterns especially when the minimum-support threshold is high.  相似文献   

2.
FP-growth算法的实现方法研究   总被引:8,自引:0,他引:8  
事务数据库中频繁模式的挖掘研究作为关联规则等许多数据挖掘问题的核心工作,已经研究了许多年。早期算法大都是Apriori型算法,即首先产生候选集,然后在候选集的基础上找出频繁模式,候选集的产生往往是耗时的,特别是挖掘富模式或长模式时。JianweiHan等人提出了一种新颖的数据结构FP-tree及基于其上的FP-growth算法,用于有效的富模式与长模式挖掘。由于不同的实现方法可能会导致不同的挖掘效率,该文在讨论FP-growth算法的基础上,采用了几种不同的方法来实现它,并用几个数据库对它们的性能进行了比较。  相似文献   

3.
Processing changeable data streams in real time is one of the most important issues in the data mining field due to its broad applications such as retail market analysis, wireless sensor networks, and stock market prediction. In addition, it is an interesting and challenging problem to deal with the stream data since not only the data have unbounded, continuous, and high speed characteristics but also their environments have limited resources. High utility pattern mining, meanwhile, is one of the essential research topics in pattern mining to overcome major drawbacks of the traditional framework for frequent pattern mining that takes only binary databases and identical item importance into consideration. This approach conducts mining processes by reflecting characteristics of real world databases, non-binary quantities and relative importance of items. Although relevant algorithms were proposed for finding high utility patterns in stream environments, they suffer from a level-wise candidate generation-and-test and a large number of candidates by their overestimation techniques. As a result, they consume a huge amount of execution time, which is a significant performance issue since a rapid process is necessary in stream data analysis. In this paper, we propose an algorithm for mining high utility patterns from resource-limited environments through efficient processing of data streams in order to solve the problems of the overestimation-based methods. To improve mining performance with fewer candidates and search space than the previous ones, we develop two techniques for reducing overestimated utilities. Moreover, we suggest a tree-based data structure to maintain information of stream data and high utility patterns. The proposed tree is restructured by our updating method with decreased overestimation utilities to keep up-to-date stream information whenever the current window slides. Our approach also has an important effect on expert and intelligent systems in that it can provide users with more meaningful information than traditional analysis methods by reflecting the characteristics of real world non-binary databases in stream environments and emphasizing on recent data. Comprehensive experimental results show that our algorithm outperforms the existing sliding window-based one in terms of runtime efficiency and scalability.  相似文献   

4.
因树型结构的良好表达能力,在互联网中传输的信息流越来越多以树型结构形式存储。但由于流式数据的时效性,隐含在数据流中的知识会随着时间的推移发生改变。针对数据流场景下挖掘最近时间段内的频繁子树模式的问题,提出了一种滑动窗口模型下挖掘频繁子树模式算法——SWMiner算法,用于挖掘数据流下任意时刻窗口下所有的频繁子树模式。SWMiner算法使用基于前缀树的结构来压缩存储生成的树模式,并且使用trie merging机制有效地更新子树模式的支持度。实验结果表明,SWMiner算法在滑动窗口模型中的性能优于目前现有的常用算法,能有效地挖掘最近时间段内的频繁树模式。  相似文献   

5.
Inter-sequence pattern mining can find associations across several sequences in a sequence database, which can discover both a sequential pattern within a transaction and sequential patterns across several different transactions. However, inter-sequence pattern mining algorithms usually generate a large number of recurrent frequent patterns. We have observed mining closed inter-sequence patterns instead of frequent ones can lead to a more compact yet complete result set. Therefore, in this paper, we propose a model of closed inter-sequence pattern mining and an efficient algorithm called CISP-Miner for mining such patterns, which enumerates closed inter-sequence patterns recursively along a search tree in a depth-first search manner. In addition, several effective pruning strategies and closure checking schemes are designed to reduce the search space and thus accelerate the algorithm. Our experiment results demonstrate that the proposed CISP-Miner algorithm is very efficient and outperforms a compared EISP-Miner algorithm in most cases.  相似文献   

6.
基于FP-Tree的最大频繁项目集挖掘及更新算法   总被引:105,自引:2,他引:105       下载免费PDF全文
宋余庆  朱玉全  孙志挥  陈耿 《软件学报》2003,14(9):1586-1592
挖掘最大频繁项目集是多种数据挖掘应用中的关键问题,之前的很多研究都是采用Apriori类的候选项目集生成-检验方法.然而,候选项目集产生的代价是很高的,尤其是在存在大量强模式和/或长模式的时候.提出了一种快速的基于频繁模式树(FP-tree)的最大频繁项目集挖掘DMFIA(discover maximum frequent itemsets algorithm)及其更新算法UMFIA(update maximum frequent itemsets algorithm).算法UMFIA将充分利用以前的挖掘结果来减少在更新的数据库中发现新的最大频繁项目集的费用.  相似文献   

7.
本文以标记有序树作为半结构化数据的数据模型 ,研究了半结构化数据的树状最大频繁模式挖掘问题 .已有挖掘算法通常挖掘所有频繁模式 ,其中很多模式为其它模式的子模式 ,针对该问题 ,设计实现了一种最大模式挖掘算法 .该算法采用最右扩展枚举方法无重复枚举所有候选模式 ,利用频繁模式扩展森林实现高效剪枝扩展和挖掘频繁叶模式 ,通过计算频繁叶模式间的包含关系挖掘树状最大频繁模式 .试验结果表明该算法具有良好性能  相似文献   

8.
提出一种结合倾斜时间窗的TWCT树结构,可以保存不同时间粒度下频繁模式的完全集,并设计了其顺序更新和删除算法,使其能够存储在外存,从而有效地降低算法的内存空间需求。结合TWCT树结构特点,提出了数据流上的频繁模式挖掘算法TWCT-Stream,其模式生长的TWCT-Growth算法按字典顺序生成频繁模式,以配合TWCT结构的顺序更新。实验证实算法的内存需求低于FP-Stream等同类算法。  相似文献   

9.
为解决传统频繁模式挖掘算法效率不高的问题,提出了一种改进的基于FP-tree (Frequent pattern tree)的Apriori频繁模式挖掘算法.首先,在Apriori算法的连接步加入连接预处理过程;其次,对CP-tree (Compact Pattern tree)进行扩展,构造了一个新的树结构ECP-tree (Extension of Compact Pattern tree),新的树结构只需对数据库进行一次扫描就能构造出一棵紧凑的前缀树,且支持交互式挖掘与增量挖掘;然后,将改进点与APFT算法结合,用于挖掘频繁模式;最后,使用UCI数据库中两个数据集进行实验.实验结果表明:改进算法具有较高的挖掘效率,频繁模式挖掘速度显著提升.  相似文献   

10.
Mining frequent tree patterns has many applications in different areas such as XML data, bioinformatics and World Wide Web. The crucial step in frequent pattern mining is frequency counting, which involves a matching operator to find occurrences (instances) of a tree pattern in a given collection of trees. A widely used matching operator for tree-structured data is subtree homeomorphism, where an edge in the tree pattern is mapped onto an ancestor-descendant relationship in the given tree. Tree patterns that are frequent under subtree homeomorphism are usually called embedded patterns. In this paper, we present an efficient algorithm for subtree homeomorphism with application to frequent pattern mining. We propose a compact data-structure, called occ, which stores only information about the rightmost paths of occurrences and hence can encode and represent several occurrences of a tree pattern. We then define efficient join operations on the occ data-structure, which help us count occurrences of tree patterns according to occurrences of their proper subtrees. Based on the proposed subtree homeomorphism method, we develop an effective pattern mining algorithm, called TPMiner. We evaluate the efficiency of TPMiner on several real-world and synthetic datasets. Our extensive experiments confirm that TPMiner always outperforms well-known existing algorithms, and in several cases the improvement with respect to existing algorithms is significant.  相似文献   

11.
Mining sequential patterns by pattern-growth: the PrefixSpan approach   总被引:12,自引:0,他引:12  
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [R. Agrawal et al. (1994)] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of the pattern growth-based sequential pattern mining, FreeSpan [J. Han et al. (2000)], we propose a more efficient method, called PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a priori-based algorithm GSP, FreeSpan, and SPADE [M. Zaki, (2001)] (a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.  相似文献   

12.
A new mining approach for uncertain databases using CUFP trees   总被引:1,自引:0,他引:1  
In the past, many algorithms have been proposed to mine frequent itemsets from transactional databases, in which the presence or absence of items in transactions was certainly known. In some applications, items may also be uncertain in transactions with their existential probabilities ranging from 0 to 1 in the uncertain dataset. Apparently, the processing in uncertain datasets is quite different from those in certain datasets. The UF-tree algorithm was proposed to construct the UF-tree structure from an uncertain dataset and mine frequent itemsets from the tree. In the UF-tree construction process, however, only the same items with the same existential probabilities in transactions were merged together in the tree, thus causing many redundant nodes in the tree. In this paper, a new tree structure called the compressed uncertain frequent-pattern tree (CUFP tree) is designed to efficiently keep the related information in the mining process. In the CUFP tree, the same items will be merged in a branch of the tree even when the existential probabilities in transactions are not the same. A mining algorithm called the CUFP-mine algorithm is then proposed based on the tree structure to find uncertain frequent patterns. Experimental results show that the proposed approach has a better performance than UF-tree algorithm both in the execution time and in the number of tree nodes.  相似文献   

13.
In the past decade, XML has emerged as the standard language for information exchanging over the Internet. Due to its tree-structure paradigm, XML is superior for its capability of storing, querying, and manipulating complex data. Therefore, discovering frequent tree patterns over tree-structured data has become an interesting topic for XML data management. In this paper, we propose a tree mining algorithm, named BUXMiner, for finding a special class of frequent trees, called rooted unordered trees, from a tree-structured database. BUXMiner employs an efficient bottom-up approach to enumerate all candidate trees over a compact global tree guide and computes the frequent trees based on the tree guide. In addition to BUXMiner, we also propose a mining approach called BUMXMiner to discover the maximal frequent rooted unordered trees. We compare BUXMiner with previous tree-structure mining algorithms, namely XQPMinerTID and FastXMiner, which were also proposed to discover rooted unordered trees. The experimental results show that our algorithm outperforms XQPMinerTID and FastXMiner in terms of efficiency. The performance results from real-world applications also indicate the usefulness of our proposed tree mining algorithms in a variety of web applications, such as analysis of web page access patterns and mining frequent XML query patterns for caching.  相似文献   

14.
王树怡  董东 《计算机科学》2017,44(Z6):486-490
在软件开发过程中,开发人员经常需要遵循特定的API用法模式,而这些用法模式几乎没有相关文档作为参考。为了挖掘API用法模式,提出基于聚类和频繁闭合偏序序列的API用法模式挖掘途径。通过抽象语法树对源代码进行解析,对提取API方法调用序列进行层次聚类,最后使用频繁闭合偏序挖掘算法DFP进行API用法模式的挖掘。实验结果表明,在相同的数据集上,与SPADE算法和BIDE算法相比,所得候选API用法模式集更加精简。  相似文献   

15.
Traditional frequent pattern mining methods consider an equal profit/weight for all items and only binary occurrences (0/1) of the items in transactions. High utility pattern mining becomes a very important research issue in data mining by considering the non-binary frequency values of items in transactions and different profit values for each item. However, most of the existing high utility pattern mining algorithms suffer in the level-wise candidate generation-and-test problem and generate too many candidate patterns. Moreover, they need several database scans which are directly dependent on the maximum candidate length. In this paper, we present a novel tree-based candidate pruning technique, called HUC-Prune (High Utility Candidates Prune), to solve these problems. Our technique uses a novel tree structure, called HUC-tree (High Utility Candidates tree), to capture important utility information of the candidate patterns. HUC-Prune avoids the level-wise candidate generation process by adopting a pattern growth approach. In contrast to the existing algorithms, its number of database scans is completely independent of the maximum candidate length. Extensive experimental results show that our algorithm is very efficient for high utility pattern mining and it outperforms the existing algorithms.  相似文献   

16.
While frequent pattern mining is fundamental for many data mining tasks, mining maximal frequent patterns efficiently is important in both theory and applications of frequent pattern mining. The fundamental challenge is how to search a large space of item combinations. Most of the existing methods search an enumeration tree of item combinations in a depth-first manner. In this paper, we develop a new technique for more efficient max-pattern mining. Our method is pattern-aware: it uses the patterns already found to schedule its future search so that many search subspaces can be pruned. We present efficient techniques to implement the new approach. As indicated by a systematic empirical study using the benchmark data sets, our new approach outperforms the currently fastest max-pattern mining algorithms FPMax* and LCM2 clearly. The source code and the executable code (on both Windows and Linux platforms) are publicly available at .  相似文献   

17.
In this paper, we proposed an efficient algorithm, called PCP-Miner (Pointset Closed Pattern Miner), for mining frequent closed patterns from a pointset database, where a pointset contains a set of points. Our proposed algorithm consists of two phases. First, we find all frequent patterns of length two in the database. Second, for each pattern found in the first phase, we recursively generate frequent closed patterns by a frequent pattern tree in a depth-first search manner. Since the PCP-Miner does not generate unnecessary candidates, it is more efficient and scalable than the modified Apriori, SASMiner and MaxGeo. The experimental results show that the PCP-Miner algorithm outperforms the comparing algorithms by more than one order of magnitude.  相似文献   

18.
频繁模式挖掘是最基本的数据挖掘问题,由于内在复杂性,提高挖掘算法性能一直是个难题.耶是通过数据库混合投影来挖掘频繁模式完全集的全新算法.HP混合投影思想是:任意数据集都不能简单地归入某个单一特性类别,挖掘过程应根据局部数据子集的特性变化动态地调整频繁模式树构造策略、事务子集表示形式、投影方法.HP提出基于树表示的虚拟投影与基于数组表示的非过滤投影,较好地解决了提高时间效率与节省内存空间的矛盾.实验表明,HP时间效率比Apriori,FP—Growth和H-Mine高出1~3个数量级,并且空间可伸缩性也大大优于这些算法.  相似文献   

19.
在频繁模式挖掘过程中能够动态改变约束的算法比较少.提出了一种基于约束的频繁模式挖掘算法MCFP.MCFP首先按照约束的性质来建立频繁模式树,并且只需扫描一遍数据库,然后建立每个项的条件树,挖掘以该项为前缀的最大频繁模式,并用最大模式树来存储,最后根据最大模式来找出所有支持度明确的频繁模式.MCFP算法允许用户在挖掘频繁模式过程中动态地改变约束.实验表明,该算法与iCFP算法相比是很有效的.  相似文献   

20.
Mining top?k frequent patterns without minimum support threshold   总被引:1,自引:1,他引:0  
Finding frequent patterns play an important role in mining association rules, sequences, episodes, Web log mining and many other interesting relationships among data. Frequent pattern mining methods often produce a huge number of frequent itemsets that is not feasible for effective usage. The number of highly correlated patterns is usually very small and may even be one. Most of the existing frequent pattern mining techniques often require the setting of many input parameters and may involve multiple passes over the database. Minimum support is the widely used parameter in frequent pattern mining to discover statistically significant patterns. Specifying appropriate minimum support is a challenging task for a data analyst as the choice of minimum support value is somewhat arbitrary. Generally, it is required to repeatedly execute an algorithm, heuristically tuning the value of minimum support over a wide range, until the desired result is obtained, certainly, a very time-consuming process. Setting up an inappropriate minimum support may also cause an algorithm to fail in finding the true patterns. We present a novel method to efficiently retrieve top few maximal frequent patterns in order of significance without use of the minimum support parameter. Instead, we are only required to specify a more human understandable parameter, namely the desired number itemsets k. Our technique requires only a single pass over the database and generation of length two itemsets. The association ratio graph is proposed as a compact structure containing concise information, which is created in time quadratic to the size of the database. Algorithms are described for using this graph structure to discover top-most and top-k maximal frequent itemsets without minimum support threshold. To effectively achieve this, the method employs construction of an all path source-to-destination tree to discover all maximal cycles in the graph. The results can be ranked in decreasing order of significance. Results are presented demonstrating the performance advantages to be gained from the use of this approach.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号