首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 305 毫秒
1.
Efficient Incremental Maintenance of Frequent Patterns with FP-Tree   总被引:3,自引:0,他引:3       下载免费PDF全文
Mining frequent patterns has been studied popularly in data mining area. However, little work has been done on mining patterns when the database has an influx of fresh data constantly. In these dynamic scenarios, efficient maintenance of the discovered patterns is crucial. Most existing methods need to scan the entire database repeatedly, which is an obvious disadvantage. In this paper, an efficient incremental mining algorithm, Incremental-Mining (IM), is proposed for maintenance of the frequent patterns when new incremental data come. Based on the frequent pattern tree (FP-tree) structure, IM gives a way to make the most of the things from the previous mining process, and requires scanning the original data once at most. Furthermore, IM can identify directly the differential set of frequent patterns, which may be more informative to users. Moreover, IM can deal with changing thresholds as well as changing data, thus provide a full maintenance scheme. IM has been implemented and the performance study shows it outperforms three other incremental algorithms: FUP, DB-tree and re-running frequent pattern growth (FP-growth).  相似文献   

2.
Mining frequent itemsets has emerged as a fundamental problem in data mining and plays an essential role in many important data mining tasks.In this paper,we propose a novel vertical data representation called N-list,which originates from an FP-tree-like coding prefix tree called PPC-tree that stores crucial information about frequent itemsets.Based on the N-list data structure,we develop an efficient mining algorithm,PrePost,for mining all frequent itemsets.Efficiency of PrePost is achieved by the following three reasons.First,N-list is compact since transactions with common prefixes share the same nodes of the PPC-tree.Second,the counting of itemsets’ supports is transformed into the intersection of N-lists and the complexity of intersecting two N-lists can be reduced to O(m + n) by an efficient strategy,where m and n are the cardinalities of the two N-lists respectively.Third,PrePost can directly find frequent itemsets without generating candidate itemsets in some cases by making use of the single path property of N-list.We have experimentally evaluated PrePost against four state-of-the-art algorithms for mining frequent itemsets on a variety of real and synthetic datasets.The experimental results show that the PrePost algorithm is the fastest in most cases.Even though the algorithm consumes more memory when the datasets are sparse,it is still the fastest one.  相似文献   

3.
Previous research works have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent but only the closed ones because the latter leads to not only more compact yet complete result set but also better efficiency. Upon discovery of frequent closed XML query patterns, indexing and caching can be effectively adopted for query performance enhancement. Most of the previous algorithms for finding frequent patterns basically introduced a straightforward generate-and-test strategy. In this paper, we present SOLARIA*, an efficient algorithm for mining frequent closed XML query patterns without candidate maintenance and costly tree-containment checking. Efficient algorithm of sequence mining is involved in discovering frequent tree-structured patterns, which aims at replacing expensive containment testing with cheap parent-child checking in sequences. SOLARIA* deeply prunes unrelated search space for frequent pattern enumeration by parent-child relationship constraint. By a thorough experimental study on various real-life data, we demonstrate the efficiency and scalability of SOLARIA* over the previous known alternative. SOLARIA* is also linearly scalable in terms of XML queries' size.  相似文献   

4.
Finding correlated sequential patterns in large sequence databases is one of the essential tasks in data mining since a huge number of sequential patterns are usually mined, but it is hard to find sequential patterns with the correlation. According to the requirement of real applications, the needed data analysis should be different. In previous mining approaches, after mining the sequential patterns, sequential patterns with the weak affinity are found even with a high minimum support. In this paper, a new framework is suggested for mining weighted support affinity patterns in which an objective measure, sequential ws-confidence is developed to detect correlated sequential patterns with weighted support affinity patterns. To efficiently prune the weak affinity patterns, it is proved that ws-confidence measure satisfies the anti-monotone and cross weighted support properties which can be applied to eliminate sequential patterns with dissimilar weighted support levels. Based on the framework, a weighted support affinity pattern mining algorithm (WSMiner) is suggested. The performance study shows that WSMiner is efficient and scalable for mining weighted support affinity patterns.  相似文献   

5.
Mining frequent patterns from datasets is one of the key success of data mining research. Currently,most of the studies focus on the data sets in which the elements are independent, such as the items in the marketing basket. However, the objects in the real world often have close relationship with each other. How to extract frequent patterns from these relations is the objective of this paper. The authors use graphs to model the relations, and select a simple type for analysis. Combining the graph theory and algorithms to generate frequent patterns, a new algorithm called Topology, which can mine these graphs efficiently, has been proposed.The performance of the algorithm is evaluated by doing experiments with synthetic datasets and real data. The experimental results show that Topology can do the job well. At the end of this paper, the potential improvement is mentioned.  相似文献   

6.
Approaches for scaling DBSCAN algorithm to large spatial databases   总被引:7,自引:0,他引:7       下载免费PDF全文
The huge amount of information stored in datablases owned by coporations(e.g.retail,financial,telecom) has spurred a tremendous interest in the area of knowledge discovery and data mining.Clustering.in data mining,is a useful technique for discovering intersting data distributions and patterns in the underlying data,and has many application fields,such as statistical data analysis,pattern recognition,image processsing,and other business application,s Although researchers have been working on clustering algorithms for decades,and a lot of algorithms for clustering have been developed,there is still no efficient algorithm for clustering very large databases and high dimensional data,As an outstanding representative of clustering algorithms,DBSCAN algorithm shows good performance in spatial data clustering.However,for large spatial databases,DBSCAN requires large volume of memory supprot and could incur substatial I/O costs because it operates directly on the entrie database,In this paper,several approaches are proposed to scale DBSCAN algorithm to large spatial databases.To begin with,a fast DBSCAN algorithm is developed.which considerably speeeds up the original DBSCAN algorithm,Then a sampling based DBSCAN algorithm,a partitioning-based DBSCAN algorithm,and a parallel DBSCAN algorithm are introduced consecutively.Following that ,based on the above-proposed algorithms,a synthetic algorithm is also given,Finally,some experimental results are given to demonstrate the effectiveness and efficiency of these algorithms.  相似文献   

7.
The search for patterns or motifs in data represents a problem area of key interest to finance and economic researchers. In this paper, we introduce the motif tracking algorithm (MTA), a novel immune inspired (IS) pattern identification tool that is able to identify unknown motifs of a non specified length which repeat within time series data. The power of the algorithm comes from the fact that it uses a small number of parameters with minimal assumptions regarding the data being examined or the underlying motifs. Our interest lies in applying the algorithm to financial time series data to identify unknown patterns that exist. The algorithm is tested using three separate data sets. Particular suitability to financial data is shown by applying it to oil price data. In all cases, the algorithm identifies the presence of a motif population in a fast and efficient manner due to the utilization of an intuitive symbolic representation. The resulting population of motifs is shown to have considerable potential value for other applications such as forecasting and algorithm seeding.  相似文献   

8.
Mining frequent patterns from people’s trajectory has become a hot topic in big data research. Previously, these data mostly come from GPS. Compared with GPS data which is more densely sampled, base station data is extremely sparse in both time and space. Trajectory discovery from base station data becomes much more challenging. In this paper, we propose a new method to effectively solve this problem. In our method, we assume that the locations of objects are sampled over a long time period. First, sequential pattern mining algorithm is employed to find frequent passing areas of a person’s route every day. Second, frequent paths are pieced together by points of records which pass through frequent passing area. Finally, to ensure credibility and efficiency, we depend on the location information provided by scattered points which piece together frequent paths to mine frequent road paths.  相似文献   

9.
This paper presents some new algorithms to efficiently mine max frequent generalized itemsets (g-itemsets) and essential generalized association rules (g-rules). These are compact and general representations for all frequent patterns and all strong association rules in the generalized environment. Our results fill an important gap among algorithms for frequent patterns and association rules by combining two concepts. First, generalized itemsets employ a taxonomy of items, rather than a flat list of items. This produces more natural frequent itemsets and associations such as (meat, milk) instead of (beef, milk), (chicken, milk), etc. Second, compact representations of frequent itemsets and strong rules, whose result size is exponentially smaller, can solve a standard dilemma in mining patterns: with small threshold values for support and confidence, the user is overwhelmed by the extraordinary number of identified patterns and associations; but with large threshold values, some interesting patterns and associations fail to be identified. Our algorithms can also expand those max frequent g-itemsets and essential g-rules into the much larger set of ordinary frequent g-itemsets and strong g-rules. While that expansion is not recommended in most practical cases, we do so in order to present a comparison with existing algorithms that only handle ordinary frequent g-itemsets. In this case, the new algorithm is shown to be thousands, and in some cases millions, of the time faster than previous algorithms. Further, the new algorithm succeeds in analyzing deeper taxonomies, with the depths of seven or more. Experimental results for previous algorithms limited themselves to taxonomies with depth at most three or four. In each of the two problems, a straightforward lattice-based approach is briefly discussed and then a classificationbased algorithm is developed. In particular, the two classification-based algorithms are MFGI_class for mining max frequent g-itemsets and EGR_class for mining essential g-rules. The classification-based algorithms are featured with conceptual classification trees and dynamic generation and pruning algorithms.  相似文献   

10.
With the development of Internet, frequent pattern mining has been extended to more complexpatterns like tree mining and graph mining. Such applications arise in complex domains like bioinformatics, webmining, etc. In this papert we present a novel algorithm, named Chopper, to discover frequent subtrees fromordered labeled trees. An extensive performance study shows that the newly developed algorithm outperformsTreeMinerV one of the fastest methods proposed previously, in mining large databases. At the end of this paper,the potential improvement of Chopper is mentioned.  相似文献   

11.
一种最大频繁模式的快速挖掘算法   总被引:2,自引:1,他引:1  
挖掘最大频繁模式是多种数据挖掘应用中的关键问题。提出一种挖掘最大频繁模式的快速算法,该算法利用前缀树压缩存放数据,并通过调整前缀树中节点信息和节点链直接在前缀树上采用深度优先的策略进行挖掘,而不需要创建条件模式树,从而大大提高了挖掘效率。  相似文献   

12.
挖掘和更新最大频繁模式是多种数据挖掘应用中的关键问题。之前的许多研究都是采用Apriori类的候选生成-检验方法或基于FP-Tree的方法,而产生大量候选和动态创建大量FP-Tree的代价太高,特别是在支持度阈值较小或存在长模式时。因此,文章提出了一种最大频繁模式的快速挖掘算法DMFP及更新算法IUMFP。DMFP算法利用前缀树压缩存放数据,并通过调整前缀树中节点信息和节点链直接在前缀树上采用深度优先的策略进行挖掘,而不需要创建条件模式树,从而大大提高了挖掘效率。算法IUMFP充分利用以前的挖掘结果减少发现更新数据中新的最大频繁模式的代价。  相似文献   

13.
快速挖掘全局频繁项目集   总被引:32,自引:1,他引:32  
分布式环境中,全局频繁项目集的挖掘是数据挖掘中最重要的研究课题之一.传统的全局频繁项目集挖掘算法采用Apriori算法框架,须多遍扫描数据库并产生大量的候选项目集,且通过传送局部频繁项目集求全局频繁项目集的网络通信代价高.为此,提出了一种分布数据库的全局频繁项目集快速挖掘算法——FMAGF.FMAGF算法采用传送条件频繁模式树或条件模式基来挖掘全局频繁项目集,可有效地减小网络通信量,提高全局频繁项目集挖掘效率.理论分析和实验结果表明提出的算法是有效可行的.  相似文献   

14.
一种改进的FP-Growth算法及其在业务关联中的应用   总被引:2,自引:0,他引:2  
基于FP-树的FP-Growth算法在挖掘频繁模式过程中需要递归地产生大量的条件FP-树,效率不高,并且不太适合应用在移动通信业务交叉销售等具有业务约束的关联规则挖掘中。因此,提出了基于项目约束的频繁模式树ICFP-树和直接在此树上进行挖掘的新算法——ICFP-Mine。理论分析和实验结果表明,ICFP-Mine算法在内存占用和时间开销等方面比FP-Growth算法更优越,在移动通信业务交叉销售领域的应用中取得了较好的效果。  相似文献   

15.
频繁模式挖掘算法FP-growth算法需递归地生成大量的条件FP-树,且耗费大量存储空间和时间。为此,采用矩阵技术统计约束子树中的频繁项集和频繁项集的支持度,以进行数据挖掘。实验结果表明,该频繁模式挖掘算法是有效的,具有较高的时间效率及空间 效率。  相似文献   

16.
针对FP-Growth算法中频繁模式树的遍历低效问题,提出了一种无项头表的频繁模式增长算法。该算法利用递归回溯的方式遍历频繁模式树以求取条件模式基,解决了对同一树路径多次重复遍历的问题。从理论分析和实际挖掘能力两方面,将新算法与FP-Growth算法进行了对比。结果表明,新算法有效减少了条件模式基的搜索开销,使频繁模式挖掘的效率提高了2~5倍,在时间和空间性能上均优于FP-Growth算法。将该算法应用于通信告警关联规则挖掘,较快地挖掘出了关联规则结果,且正确规则的覆盖率达到了83.3%。  相似文献   

17.
Apriori和FP-Growth算法是频繁模式挖掘中的经典算法,由于Apriori存在更多缺陷,因此FP-Growth是单机计算环境下比较高效的算法。然而,对于非并行计算在大数据时代遇到的瓶颈,提出一种基于事务中项间联通权重矩阵的负载平衡并行频繁模式增长算法CWBPFP。算法在Spark框架上实现并行计算,数据分组时利用负载均衡策略,存入分组的数据是相应频繁项的编码。每个工作节点将分组数据中每一个事物中项的联通信息存入一个下三角联通权重矩阵中,使用被约束子树来加快每个工作节点挖掘频繁模式时创建条件FP-tree的速度,再用联通权重矩阵避免每次挖掘分组中频繁模式时对条件模式基的第一次扫描。由于联通权重矩阵和被约束子树的结合应用于每一个工作节点的FP-tree挖掘过程,因此提升了并行挖掘FP-tree性能。通过实验表明,所提出的并行算法对大的数据有较高性能和可扩展性。  相似文献   

18.
基于FP-T ree的FP-M ax算法在挖掘最大频繁集时需多次递归建立条件模式树耗费大量存储空间,这大大降低了算法的挖掘效率。提出了一种基于改进FP-T ree的最大频繁集快速挖掘算法-FP-EM ax算法。该算法无需建立条件模式库大大减少了存储空间开销,采用预剪枝策略减少条件模式树的构造次数及子集检测次数,从而算法的挖掘效率大大提高。最后通过实验证明FP-EM ax算法在支持度较小的情况下较之于FP-M ax及同类算法具有更好的性能。  相似文献   

19.
序列模式发现是最重要的数据挖掘任务之一,并有着广阔的应用前景。针对静态数据库,序列模式挖掘已经被深入地研究,但针对基于数据流的序列模式挖掘的研究还不是十分深入。数据流有着无限性的特性,因此往往不能保存数据流中全部的数据,同时很多时候只对最近的时间段的序列模式感兴趣,提出一个有效的结合滑动窗口技术的挖掘序列模式的算法FPM-SW,算法利用到3个数据结构(PatternTable,CountTable和Ta-tree)来处理基于数据流的序列模式挖掘的复杂性问题。算法通过CountTable结构来保存以往的潜在频繁序列,考虑到在某些情况下CountTable占用内存过多,算法还结合了一种压缩CountTable技术来减少内存占用。FPM-SW的优点是可以最大限度地降低负正例的产生,实验表明FPM-SW具有较高的准确率。  相似文献   

20.
最大目标频繁模式挖掘算法研究   总被引:2,自引:0,他引:2  
传统的频繁模式挖掘算法往往会得到成百上千的结果模式,面对繁多的频繁模式用户通常要经过“二次挖掘”才能得到有用的目标模式。怎样根据用户需求直接挖掘用户感兴趣的目标模式是该文的研究目标。文章在FP-树的基础上设计了紧缩的、非冗余的TFP-树,它能有效过滤与目标模式无关的项和事务,而仅保留与目标模式相关的信息,缩小TFP-树的大小规模。同时根据TFP-树的规律和特点,笔者设计了最大目标频繁模式挖掘算法,算法的结果模式具有以下两个特点:(1)满足用户需求的目标模式;(2)最大模式。该实验结果验证了TFP-树算法是有效的,而且显著改善了FP-树算法的性能。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号