1.

Efficient mining of maximal frequent itemsets from databases on a cluster of workstations  Total citations: 2, self-citations: 2, citations by others: 0

Soon M. Chung, Congnan Luo 《Knowledge and Information Systems》2008,16(3):359-391

In this paper, we propose two parallel algorithms for mining maximal frequent itemsets from databases. A frequent itemset is maximal if none of its supersets is frequent. One parallel algorithm is named distributed max-miner (DMM), and it requires very low communication and synchronization overhead in distributed computing systems. DMM has a local mining phase and a global mining phase. During the local mining phase, each node mines its local database to discover the local maximal frequent itemsets, which then form a set of maximal candidate itemsets for the top-down search in the subsequent global mining phase. A new prefix tree data structure is developed to facilitate the storage and counting of the global candidate itemsets of different sizes. This global mining phase using the prefix tree can work with any local mining algorithm. Another parallel algorithm, named parallel max-miner (PMM), is a parallel version of the sequential max-miner algorithm (Proc of ACM SIGMOD Int Conf on Management of Data, 1998, pp 85–93). Most existing mining algorithms discover the frequent k-itemsets on the kth pass over the databases, and then generate the candidate (k + 1)-itemsets for the next pass. Compared to those level-wise algorithms, PMM looks ahead at each pass and prunes more candidate itemsets by checking the frequencies of their supersets. Both DMM and PMM were implemented on a cluster of workstations, and their performance was evaluated for various cases. They demonstrate very good performance and scalability even when there are large maximal frequent itemsets (i.e., long patterns) in databases.
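The two-phase idea described above (local maximal frequent itemsets become candidates for a top-down global search) can be illustrated with a small single-process sketch. This is not the DMM implementation: the function names are hypothetical, local mining is brute force, and a plain set stands in for the paper's prefix tree. It also assumes the support threshold is a fraction applied to each partition and to the whole database, so that any globally frequent itemset is locally frequent on at least one node.

```python
from itertools import combinations

def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, min_ratio):
    """Brute-force local mining; only suitable for toy data."""
    items = sorted(set().union(*db))
    min_count = min_ratio * len(db)
    found = []
    for k in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, k)
                 if support(frozenset(c), db) >= min_count]
        if not level:          # no frequent k-itemset, so none of size k + 1
            break
        found.extend(level)
    return found

def maximal(itemsets):
    """Keep only the itemsets that have no proper superset in the collection."""
    sets = set(itemsets)
    return {s for s in sets if not any(s < t for t in sets)}

def global_maximal(partitions, min_ratio):
    min_count = min_ratio * sum(len(p) for p in partitions)
    # Local phase: each partition contributes its local maximal frequent itemsets.
    candidates = set()
    for p in partitions:
        candidates |= maximal(frequent_itemsets(p, min_ratio))
    candidates = maximal(candidates)
    # Global phase: top-down search; an infrequent candidate is replaced
    # by its subsets that are one item smaller.
    frequent, seen = set(), set()
    while candidates:
        next_round = set()
        for c in candidates:
            if c in seen:
                continue
            seen.add(c)
            if sum(support(c, p) for p in partitions) >= min_count:
                frequent.add(c)
            elif len(c) > 1:
                next_round |= {c - {i} for i in c}
        candidates = next_round
    return maximal(frequent)

parts = [
    [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("d")],
    [frozenset("ab"), frozenset("ad"), frozenset("ad"), frozenset("bc")],
]
print(global_maximal(parts, 0.5))
```

In this toy run the local phase proposes {a, b, c} and {a, d}; neither is globally frequent, and the top-down search settles on {a, b}.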

2.

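Fast mining of global frequent itemsets  Total citations: 32, self-citations: 1, citations by others: 32

Yang Ming (杨明), Sun Zhihui (孙志挥), Ji Genlin (吉根林) 《计算机研究与发展》 (Journal of Computer Research and Development) 2003,40(4):620-626

In distributed environments, mining global frequent itemsets is one of the most important research topics in data mining. Traditional algorithms for mining global frequent itemsets adopt the Apriori framework: they must scan the database multiple times and generate a large number of candidate itemsets, and computing global frequent itemsets by transmitting local frequent itemsets incurs a high network communication cost. To address this, a fast algorithm for mining global frequent itemsets in distributed databases, FMAGF, is proposed. FMAGF mines global frequent itemsets by transmitting conditional frequent pattern trees or conditional pattern bases, which effectively reduces network traffic and improves mining efficiency. Theoretical analysis and experimental results show that the proposed algorithm is effective and feasible.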

3.

Parallel Algorithms for Discovery of Association Rules  Total citations: 2, self-citations: 0, citations by others: 2

Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li 《Data mining and knowledge discovery》1997,1(4):343-373

Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice: once in the set-up phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and do not have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster inter-connected by the Memory Channel network. We present results on the performance of our algorithms on various databases, and compare it against a well known parallel algorithm. The best new algorithm outperforms it by an order of magnitude.
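The vertical layout and the "simple intersection operations" mentioned above can be sketched as follows. This is a sequential toy version, assuming prefix-based equivalence classes and a bottom-up search. It omits the clustering by hypergraph cliques, the database replication, and all parallel machinery, and every function name is hypothetical.

```python
def vertical_layout(transactions):
    """Map each item to the set of transaction ids (tid-list) that contain it."""
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def mine(prefix, members, min_count, out):
    """Bottom-up search inside one equivalence class (itemsets sharing a prefix)."""
    for i, (item, tids) in enumerate(members):
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        extensions = []
        for other, other_tids in members[i + 1:]:
            common = tids & other_tids      # support via tid-list intersection
            if len(common) >= min_count:
                extensions.append((other, common))
        if extensions:
            mine(itemset, extensions, min_count, out)

def frequent_by_intersection(transactions, min_count):
    tidlists = vertical_layout(transactions)
    roots = [(item, tids)
             for item, tids in sorted(tidlists.items(), key=lambda kv: kv[0])
             if len(tids) >= min_count]
    out = {}
    mine((), roots, min_count, out)
    return out

tx = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(frequent_by_intersection(tx, 3))
```

Once the tid-lists are built, the original transactions are never read again: each support count is the size of an intersection, so no hash structure over candidates is needed.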

4.

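Fast mining of global maximal frequent itemsets  Total citations: 18, self-citations: 1, citations by others: 18

Lu Jieping (陆介平), Yang Ming (杨明), Sun Zhihui (孙志挥), Ju Shiguang (鞠时光) 《软件学报》 (Journal of Software) 2005,16(4):553-560

Mining maximal frequent itemsets is a key problem in many data mining applications. Most available algorithms for mining maximal frequent itemsets target a single-machine environment, and work on mining global maximal frequent itemsets in distributed environments is still rare. Applying a single-machine maximal frequent itemset algorithm in a distributed environment, or using a distributed global frequent itemset algorithm to mine global maximal frequent itemsets, generates a large number of candidate frequent itemsets and incurs a high network communication cost. To address this, the algorithm FMGMFI (fast mining global maximum frequent itemsets) is proposed. It uses the FP-tree storage structure, so the frequency of an itemset can be obtained conveniently from the relevant paths of each local FP-tree, and it adopts a bidirectional search strategy that combines top-down and bottom-up search, which effectively reduces the network communication cost. Experimental results show that FMGMFI is effective and feasible.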
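How an itemset's frequency can be read off the paths of an FP-tree is easiest to see in code. The sketch below builds a basic FP-tree for one local database and counts an itemset by walking up from the nodes of its lowest-ranked item. It is a generic illustration, not FMGMFI itself: the class and function names are hypothetical, and the distributed part (combining counts from several local trees) is left out.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_count):
    """Insert each transaction with its frequent items sorted by descending frequency."""
    freq = Counter(i for t in transactions for i in set(t))
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    order = {item: rank for rank, (item, n) in enumerate(ranked) if n >= min_count}
    root = Node(None, None)
    header = {item: [] for item in order}   # item -> all tree nodes carrying it
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order

def support_from_tree(itemset, header, order):
    """Sum the counts of nodes whose path to the root contains the whole itemset."""
    if any(i not in order for i in itemset):
        return 0                            # contains an infrequent item
    items = sorted(itemset, key=order.get)
    last, rest = items[-1], set(items[:-1])
    total = 0
    for node in header[last]:
        on_path, p = set(), node.parent
        while p is not None and p.item is not None:
            on_path.add(p.item)
            p = p.parent
        if rest <= on_path:
            total += node.count
    return total

tx = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
root, header, order = build_fp_tree(tx, 2)
print(support_from_tree({"a", "c"}, header, order))
```

In a distributed setting one would presumably sum such per-tree counts across sites, but how FMGMFI exchanges them is not specified in the abstract.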

5.

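A parallel frequent pattern mining algorithm for distributed environments

Ruan Youlin (阮幼林), Li Qinghua (李庆华), Liu Gan (刘干) 《计算机工程与应用》 (Computer Engineering and Applications) 2005,41(25):1-3,22

Parallel algorithms for mining frequent patterns are an important research topic in data mining. Most parallel algorithms proposed so far are based on Apriori or on the FP-tree. Because of the inherent limitations of both, and because they require multiple synchronizations during computation, their performance is low. This paper proposes a parallel mining algorithm based on distributed databases. The algorithm lets each processor mine as independently as possible: each processor mines its set of local frequent patterns with a depth-first search over a prefix tree, and uses related properties to keep the number of candidate global frequent patterns as small as possible, which reduces network traffic and the number of synchronizations and thereby improves mining efficiency.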

6.

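Fast mining of global maximal frequent itemsets in distributed databases  Total citations: 1, self-citations: 0, citations by others: 1

He Bo (何波) 《控制与决策》 (Control and Decision) 2011,26(8):1214-1218

A fast algorithm for mining global maximal frequent itemsets in distributed databases (FMMFI) is proposed. FMMFI first designates a center node and builds a local FP-tree at each node, then uses the maximal frequent itemset mining algorithm DMHA to mine the local maximal frequent itemsets quickly. The nodes then interact with the center node to aggregate the data, and the global maximal frequent itemsets are finally obtained. FMMFI uses a top-down pruning strategy, which greatly reduces the number of candidate itemsets and lowers communication traffic. Theoretical analysis and experimental results show that FMMFI is effective.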

7.

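A fast parallel algorithm for mining frequent itemsets  Total citations: 3, self-citations: 0, citations by others: 3

He Bo (何波), Wang Huaqiu (王华秋), Liu Zhen (刘贞), Wang Yue (王越) 《计算机应用》 (Journal of Computer Applications) 2006,26(2):391-392

Traditional parallel algorithms for mining frequent itemsets suffer from data skew, heavy communication traffic, frequent synchronization, and many database scans. To address these problems, a fast parallel algorithm for mining frequent itemsets (FPMFI) is proposed. FPMFI lets each computer node compute its local frequent itemsets independently, then interact with the center node to aggregate the data, and finally obtains the global frequent itemsets. Theoretical analysis and experimental results show that FPMFI is effective.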

8.

Parallel mining of maximal sequential patterns using multiple samples

Congnan Luo, Soon M. Chung 《The Journal of supercomputing》2012,59(2):852-881

In this paper, we propose a new parallel algorithm, named PMSPX, which mines maximal frequent sequences by using multiple samples to exclude infrequent candidates effectively. A frequent sequence is maximal if none of its supersequences is frequent. Unlike the traditional single-sample methods developed for mining frequent itemsets, PMSPX uses multiple samples. Thus, it can avoid or alleviate some problems inherent in the single-sample methods. We theoretically analyzed how to increase the minimum support level to prevent misestimating infrequent candidates as frequent in the mining of samples. PMSPX is a parallel version of our sequential MSPX algorithm, and it is developed on a cluster of workstations. In PMSPX, each processing node uses MSPX to find a candidate set of local maximal frequent sequences first, independently from other processing nodes. Then, a top-down search is performed, starting with all the candidates, in a synchronous manner to identify real maximal frequent sequences. This asynchronous local mining followed by synchronous global mining approach minimizes the synchronization and communication among the processing nodes. Three database partitioning methods are proposed to distribute the database across the processing nodes, so that their workloads are balanced and the data skewness of the whole database is preserved in the data partition of each node. A comprehensive analysis was performed on PMSPX and existing parallel sequence mining algorithms, and extensive experiments were conducted on PMSPX. PMSPX demonstrates very good speedup and scaleup properties. It also requires less communication and synchronization than other parallel algorithms.
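The definition of a maximal frequent sequence given above can be made concrete with a short sketch. It assumes, for simplicity, that every sequence element is a single item (the abstract does not say how elements are represented) and it does not reproduce the sampling or the parallel search of PMSPX. The function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items appear in sequence in the same order, gaps allowed."""
    remaining = iter(sequence)
    # `x in remaining` consumes the iterator up to the first match,
    # so later items of the pattern can only match later positions.
    return all(x in remaining for x in pattern)

def support(pattern, database):
    """Number of database sequences that contain pattern as a subsequence."""
    return sum(1 for s in database if is_subsequence(pattern, s))

def maximal_sequences(frequent):
    """Drop every frequent sequence that has a frequent proper supersequence."""
    return [p for p in frequent
            if not any(p != q and is_subsequence(p, q) for q in frequent)]

db = [tuple("abcd"), tuple("acd"), tuple("abd"), tuple("bcd")]
print(support(("a", "d"), db))
print(maximal_sequences([("a",), ("d",), ("a", "d"), ("c", "d")]))
```

Here ("a", "d") occurs in three of the four sequences, and of the four listed patterns only ("a", "d") and ("c", "d") survive the maximality filter.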

9.

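Updating frequent itemsets in distributed database systems based on DDMINER  Total citations: 13, self-citations: 0, citations by others: 13

Ji Genlin (吉根林), Yang Ming (杨明), Zhao Bin (赵斌), Sun Zhihui (孙志挥) 《计算机学报》 (Chinese Journal of Computers) 2003,26(10):1387-1392

An architecture for a distributed data mining system, DDMINER, is presented, and the problem of updating frequent itemsets in distributed database systems is discussed, covering both the case where transactions are added to the database and the case where transactions are deleted. Based on DDMINER, an algorithm ULF for updating local frequent itemsets and an algorithm UGF for updating global frequent itemsets are proposed. The algorithms generate a small number of candidate frequent itemsets, and when computing the global frequent itemsets, the communication cost of transmitting the support counts of candidate local frequent itemsets is O(n). The proposed algorithms were implemented in Java and their performance was studied. Experimental results show that the algorithms are correct, feasible, and efficient.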
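The bookkeeping behind such an update is simple for itemsets whose old counts are already known: only the added and deleted transactions have to be scanned. The sketch below shows that step alone. It is not ULF or UGF, the names are hypothetical, and it deliberately ignores itemsets that were not frequent before the update, which a complete algorithm would have to check against the original data as well.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def update_counts(old_counts, old_size, added, deleted, min_ratio):
    """Re-evaluate previously frequent itemsets after inserts and deletes.

    old_counts maps itemsets to their support counts in the old database;
    deleted transactions are assumed to come from that old database.
    """
    new_size = old_size + len(added) - len(deleted)
    min_count = min_ratio * new_size
    updated = {}
    for itemset, old in old_counts.items():
        new = old + count(itemset, added) - count(itemset, deleted)
        if new >= min_count:
            updated[itemset] = new
    return updated, new_size

old = {frozenset("a"): 4, frozenset("ab"): 3}
added = [frozenset("ab"), frozenset("a")]
deleted = [frozenset("ab")]
print(update_counts(old, 5, added, deleted, 0.6))
```

With five old transactions, two insertions, and one deletion, the database grows to six transactions; {a} stays frequent with count 5, while {a, b} falls below the 60% threshold.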

10.

Parallel and distributed methods for incremental frequent itemset mining  Total citations: 3, self-citations: 0, citations by others: 3

Otey M.E., Parthasarathy S., Chao Wang, Veloso A., Meira W. Jr. 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2004,34(6):2439-2450

Traditional methods for data mining typically make the assumption that the data is centralized, memory-resident, and static. This assumption is no longer tenable. Such methods waste computational and input/output (I/O) resources when data is dynamic, and they impose excessive communication overhead when data is distributed. Efficient implementation of incremental data mining methods is, thus, becoming crucial for ensuring system scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper, we address this issue in the context of the important task of frequent itemset mining. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates without examining the entire dataset. We then show how to parallelize this incremental algorithm. We also propose a distributed asynchronous algorithm, which imposes minimal communication overhead for mining distributed dynamic datasets. Our distributed approach is capable of generating local models (in which each site has a summary of its own database) as well as the global model of frequent itemsets (in which all sites have a summary of the entire database). This ability permits our approach not only to generate frequent itemsets, but also to generate high-contrast frequent itemsets, which allows one to examine how the data is skewed over different sites.

11.

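Mining frequent closed itemsets based on granular computing

Fang Gang (方刚), Wang Jiale (王佳乐), Ying Hong (应宏), Tang Xiaobin (汤小斌) 《计算机工程与应用》 (Computer Engineering and Applications) 2014,50(20):130-134

To address the shortcomings of existing frequent closed itemset mining algorithms, an algorithm for mining frequent closed itemsets based on granular computing is proposed. Candidate itemsets are generated by varying a mixed-radix number, which avoids complex data structures and reduces memory and CPU overhead. The divide-and-conquer idea of granular computing is used to compute the support of frequent closed itemsets, which avoids repeated database scans and reduces computational complexity and I/O overhead. Experimental results show that the algorithm is faster and more effective than classical frequent closed itemset mining algorithms.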

12.

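An FP-tree-based algorithm for mining global maximal frequent itemsets  Total citations: 12, self-citations: 1, citations by others: 12

Wang Liming (王黎明), Zhao Hui (赵辉) 《计算机研究与发展》 (Journal of Computer Research and Development) 2007,44(3):445-451

Mining maximal frequent itemsets arises in many data mining applications [...] to update the set of maximal frequent candidate itemsets, the entire database must be scanned repeatedly; moreover, most algorithms are single-machine algorithms, and algorithms for mining global maximal frequent itemsets are rare. To address this, the MGMF algorithm is proposed. It uses the FP-tree structure and, in a manner similar to FP-tree mining, can mine all maximal frequent itemsets in a single pass, with very simple and fast superset checking. In addition, MGMF adopts the message-broadcasting idea of the distributed PDDM algorithm and has good scalability and parallelism. Experiments show that MGMF is effective and feasible.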

13.

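A frequent itemset mining algorithm for distributed data streams in sensor networks

Hong Yuehua (洪月华) 《计算机科学》 (Computer Science) 2013,40(2):58-60,94

This paper studies frequent itemset mining over data streams in wireless sensor networks. Because centralized frequent itemset mining methods for static data streams cannot be used directly in sensor networks, FIMVS, a frequent itemset mining algorithm for distributed data streams in sensor networks, is proposed. Based on the FP-tree, the algorithm quickly mines the local frequent itemsets of the single data stream at each sensor node, uploads and merges them level by level through routing in the wireless sensor network, and, once they have converged at the Sink node, mines the global frequent itemsets with an efficient top-down pruning strategy. Experimental results show that the algorithm greatly reduces the number of candidate itemsets, lowers communication traffic in the wireless sensor network, and achieves high time and space efficiency.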

14.

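Research on algorithms for mining global frequent closed itemsets

Chen Jianmei (陈健美), Zhu Yuquan (朱玉全), Song Shunlin (宋顺林), Gui Changqing (桂长青), Song Yuqing (宋余庆) 《计算机科学》 (Computer Science) 2008,35(1):193-195

Mining frequent closed itemsets is an important research topic in data mining. Existing frequent closed itemset mining algorithms mainly target single-machine environments, and research on mining global frequent closed itemsets in distributed environments is still rare. To address this, this paper proposes a fast algorithm for mining global frequent closed itemsets and studies the associated update problem. A corresponding incremental update algorithm for frequent closed itemsets is proposed, which makes full use of previous mining results to save the time needed to discover new global frequent closed itemsets. Experimental results show that the algorithms are effective.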

15.

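An efficient algorithm for updating global frequent itemsets in distributed databases

Song Baoli (宋宝莉), Qin Zheng (覃征) 《计算机工程与应用》 (Computer Engineering and Applications) 2006,42(31):157-160

A fast algorithm for updating global frequent itemsets, IUAGFI (Incremental Updating Algorithm for Global Frequent Itemsets), is proposed. The algorithm mainly considers how global frequent itemsets are updated when database records change. In the worst case it needs to scan each local database only once, and by using the improved local frequent pattern trees already built and the results already mined, it avoids transmitting the constrained subtrees that correspond to some of the original global frequent items, thereby reducing the network communication cost. Experimental results show that the algorithm is effective and feasible.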

16.

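A distributed association rule mining algorithm based on the frequent pattern tree  Total citations: 1, self-citations: 0, citations by others: 1

He Bo (何波) 《控制与决策》 (Control and Decision) 2012,27(4):618-622

A distributed association rule mining algorithm based on the frequent pattern tree (DMARF) is proposed. DMARF designates a center node, uses local frequent pattern trees to let each computer node obtain its local frequent itemsets quickly, then interacts with the center node to aggregate the data, and finally obtains the global frequent itemsets. DMARF uses top and bottom strategies, which greatly reduce the number of candidate itemsets and lower communication traffic. Both theoretical analysis and experimental results show that DMARF is fast and effective.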

17.

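DFMFI-Miner: an efficient algorithm for mining maximal frequent itemsets

Chen Huiping (陈慧萍), Wang Jiandong (王建东), Wang Yu (王煜) 《计算机仿真》 (Computer Simulation) 2006,23(7):79-83

The relationship between maximal frequent itemsets and the complete set of frequent itemsets is analyzed, and an efficient algorithm for mining maximal frequent itemsets, DFMFI-Miner (The Miner Based on Depth-First Searching for Mining Maximal Frequent Itemsets), is proposed. It searches the itemset space depth-first, represents the transaction database as vertical bitmaps with a compression method and reduces it, and applies several effective pruning and optimization strategies to improve efficiency. Experiments on multiple datasets show that the algorithm is particularly suitable for mining datasets with long frequent itemsets.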

18.

An improved algorithm for mining maximal frequent itemsets

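Pan Yiting (潘益婷), Zhang Hongjuan (张红娟), Yan Jianjun (严建军) 《计算机工程与科学》 (Computer Engineering & Science) 2009,31(8)

This paper proposes a parallel algorithm for mining maximal frequent itemsets based on the Boolean matrix FP-array. The algorithm uses prefix-based partitioning to divide the transaction dataset into smaller subspaces and assigns itemsets with a complete containment relationship to the same processor. Each processor site Si then mines its local maximal frequent itemsets and sends the results to the master site S, which finally obtains the global maximal frequent itemsets.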

19.

A lattice-based approach for I/O efficient association rule mining

《Information Systems》2002,27(1):41-74

Most algorithms for association rule mining are variants of the basic Apriori algorithm (Agrawal and Srikant, Fast algorithms for mining association rules in databases, in: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), Santiago, Chile, 1994, pp. 487–499). One characteristic of these Apriori-based algorithms is that candidate itemsets are generated in rounds, with the size of the itemsets incremented by one per round. The number of database scans required by Apriori-based algorithms thus depends on the size of the biggest frequent itemsets. In this paper, we devise a more general candidate set generation algorithm, LGen, which generates candidate itemsets of multiple sizes during each database scan. We present an algorithm FindLarge which uses LGen to find frequent itemsets. We show that, given a reasonable set of suggested frequent itemsets, FindLarge can significantly reduce the number of I/O passes required. In the best cases, only two passes are sufficient to discover all the frequent itemsets irrespective of the size of the biggest ones. Two I/O-saving algorithms, namely DIC and Pincer-Search, are compared with FindLarge in a series of experiments. We discuss the conditions under which FindLarge significantly outperforms the others in terms of I/O efficiency.
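The round-by-round behaviour of Apriori-style algorithms, and why the number of scans grows with the size of the largest frequent itemset, can be seen in a compact sketch. This is the textbook level-wise scheme with a pass counter added for illustration. It is not LGen or FindLarge, and the function names are hypothetical.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Join frequent k-itemsets that share a (k-1)-prefix, then prune by subsets."""
    prev = sorted(frequent_k)
    if not prev:
        return set()
    k = len(prev[0])
    candidates = set()
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] != b[:-1]:
                break                  # sorted order: no later tuple shares a's prefix
            cand = a + (b[-1],)
            # Apriori property: every k-subset of a candidate must be frequent.
            if all(sub in frequent_k for sub in combinations(cand, k)):
                candidates.add(cand)
    return candidates

def apriori(transactions, min_count):
    """Return (frequent itemsets with counts, number of database passes)."""
    counts = {}
    for t in transactions:             # pass 1: count single items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}
    frequent, passes = {}, 1
    while level:
        frequent.update(level)
        candidates = apriori_gen(set(level))
        if not candidates:
            break
        passes += 1                    # one more scan per itemset size
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
    return frequent, passes

tx = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
frequent, passes = apriori(tx, 2)
print(passes, frequent)
```

On this toy database the largest frequent itemset has three items and the sketch needs three passes, one per itemset size, which is the dependency the abstract sets out to break.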

20.

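An improved graph-based association rule algorithm  Total citations: 1, self-citations: 0, citations by others: 1

Huang Hongxing (黄红星) 《计算机与数字工程》 (Computer & Digital Engineering) 2009,37(12):38-41,162

Association rule mining is one of the most important topics in data mining research. The graph-based association rule mining algorithm DLG builds an association graph in a single database scan and then traverses the graph to generate frequent itemsets, which effectively improves mining performance. Based on an analysis of the algorithm's basic principles, an improved algorithm, DLG#, is proposed. The improved algorithm constructs an itemset association matrix while building the association graph, and during candidate generation it combines the association graph with the Apriori property to prune redundant itemsets, which reduces the number of candidates and simplifies candidate verification. Comparative experiments show that the improved algorithm finds frequent itemsets faster across different datasets and support thresholds, with a marked improvement when the average length of frequent itemsets is large.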
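A rough sketch of the graph-based idea follows: one scan builds a bit vector per item, an edge joins two frequent items whose combined vector still meets the threshold, and longer itemsets are grown by following edges out of an itemset's last item. The details here are assumptions about how a DLG-style algorithm is commonly organized, not a transcription of the paper. The DLG# improvements (the association matrix and the extra Apriori pruning) are not included, and all names are hypothetical.

```python
def bit_vectors(transactions):
    """One scan: bit `tid` of an item's vector is set if transaction tid contains it."""
    vectors = {}
    for tid, t in enumerate(transactions):
        for item in t:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors

def popcount(x):
    return bin(x).count("1")

def graph_based_frequent(transactions, min_count):
    vectors = {i: v for i, v in bit_vectors(transactions).items()
               if popcount(v) >= min_count}
    items = sorted(vectors)
    # Association graph: edge a -> b (a before b) when {a, b} is frequent.
    edges = {i: [] for i in items}
    level = []
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            both = vectors[a] & vectors[b]
            if popcount(both) >= min_count:
                edges[a].append(b)
                level.append(((a, b), both))
    frequent = {(i,): popcount(vectors[i]) for i in items}
    # Grow itemsets by following edges that leave the last item.
    while level:
        next_level = []
        for itemset, bits in level:
            frequent[itemset] = popcount(bits)
            for u in edges[itemset[-1]]:
                extended = bits & vectors[u]
                if popcount(extended) >= min_count:
                    next_level.append((itemset + (u,), extended))
        level = next_level
    return frequent

tx = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
print(graph_based_frequent(tx, 2))
```

After the initial scan the transactions are never read again, because every support count is the number of set bits in an AND of item vectors.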