首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Association-rule mining, which is based on frequency values of items, is the most common topic in data mining. In real-world applications, customers may, however, buy many copies of products and each product may have different factors, such as profits and prices. Only mining frequent itemsets in binary databases is thus not suitable for some applications. Utility mining is thus presented to consider additional measures, such as profits or costs according to user preference. In the past, a two-phase mining algorithm was designed for fast discovering high utility itemsets from databases. When data come intermittently, the approach needs to process all the transactions in a batch way. In this paper, an incremental mining algorithm for efficiently mining high utility itemsets is proposed to handle the above situation. It is based on the concept of the fast-update (FUP) approach, which was originally designed for association mining. The proposed approach first partitions itemsets into four parts according to whether they are high transaction-weighted utilization itemsets in the original database and in the newly inserted transactions. Each part is then executed by its own procedure. Experimental results also show that the proposed algorithm executes faster than the two-phase batch mining algorithm in the intermittent data environment  相似文献   

2.
Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules, strong rules, episodes, and minimal keys. Typical algorithms for solving this problem operate in a bottom-up, breadth-first search direction. The computation starts from frequent 1-itemsets (the minimum length frequent itemsets) and continues until all maximal (length) frequent itemsets are found. During the execution, every frequent itemset is explicitly considered. Such algorithms perform well when all maximal frequent itemsets are short. However, performance drastically deteriorates when some of the maximal frequent itemsets are long. We present a new algorithm which combines both the bottom-up and the top-down searches. The primary search direction is still bottom-up, but a restricted search is also conducted in the top-down direction. This search is used only for maintaining and updating a new data structure, the maximum frequent candidate set. It is used to prune early candidates that would be normally encountered in the bottom-up search. A very important characteristic of the algorithm is that it does not require explicit examination of every frequent itemset. We evaluate the performance of the algorithm using well-known synthetic benchmark databases, real-life census, and stock market databases  相似文献   

3.
Efficient algorithms for mining closed itemsets and their lattice structure   总被引:7,自引:0,他引:7  
The set of frequent closed itemsets uniquely determines the exact frequency of all itemsets, yet it can be orders of magnitude smaller than the set of all frequent itemsets. In this paper, we present CHARM, an efficient algorithm for mining all frequent closed itemsets. It enumerates closed sets using a dual itemset-tidset search tree, using an efficient hybrid search that skips many levels. It also uses a technique called diffsets to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any "nonclosed" sets found during computation. We also present CHARM-L, an algorithm that outputs the closed itemset lattice, which is very useful for rule generation and visualization. An extensive experimental evaluation on a number of real and synthetic databases shows that CHARM is a state-of-the-art algorithm that outperforms previous methods. Further, CHARM-L explicitly generates the frequent closed itemset lattice.  相似文献   

4.
对现有关联规则更新算法中的增量式更新算法进行分析,发现在决策者优先关注最大频繁项目集的情况下,该算法不能以较少的数据库遍历次数快速获取最大频繁项集。针对该算法的不足,提出一种基于逆向搜索的方式进行关联规则更新的算法。该算法生成新增项集的所有频繁项集,通过将其中最大频繁项集跟原项集中最大频繁项集进行拼接、修剪,从中获得更新后的最大频繁项集。实例结果表明,该算法既降低了关联规则更新过程中对数据库的遍历次数,又实现了优先获取最大频繁项目集。  相似文献   

5.
Discovering shared conceptualizations in folksonomies   总被引:2,自引:0,他引:2  
Social bookmarking tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. Unlike ontologies, shared conceptualizations are not formalized, but rather implicit. We present a new data mining task, the mining of all frequent tri-concepts, together with an efficient algorithm, for discovering these implicit shared conceptualizations. Our approach extends the data mining task of discovering all closed itemsets to three-dimensional data structures to allow for mining folksonomies. We provide a formal definition of the problem, and present an efficient algorithm for its solution. Finally, we show the applicability of our approach on three large real-world examples.  相似文献   

6.
李海峰 《计算机工程》2011,37(14):59-61
提出一种采用图形处理器挖掘闭合频繁项集的方法,用二进制数据表示项集,利用单指令多数据的体系结构实现并行计算,结合项集索引树,可以提高项集支持度计算和项集查找的速度。在2种数据集上的实验结果表明,该方法能够用更少的空间保存频繁项集的全部信息,并减少挖掘时间。  相似文献   

7.
基于遗传算法的顾客购买行为特征提取   总被引:2,自引:0,他引:2  
提出一种基于遗传算法的顾客行为特征提取算法。首先,采用Tanimoto 相似度来度量顾客间购买行为,并设计遗传聚类算法对顾客群体进行划分,把具有相似购买行为顾客聚集为一类。然后,针对不同顾客群体的购买行为特征,设计一种基于遗传算法的多种群特征提取方法,从各个子群体中发现顾客的购买行为的知识。为了增强种群内部协同进化能力和规则质量,我们采用最近邻替代遗传策略和局部搜索策略。使用实际零售数据集对整个算法进行验证,并与经典的Apriori算法进行比较。实验结果表明该算法在不需要产生频繁项集的情况下,可较高效生成精简规则集,在规则形式方面也更加灵活。最后,对实验结果进行详细分析。  相似文献   

8.
一种挖掘最大频繁项集的深度优先算法   总被引:7,自引:0,他引:7  
最大频繁项集挖掘是许多数据挖掘应用中的重要问题.提出一种新的深度优先搜索最大频繁项集的算法.该算法采用位图数据格式,结合了流行的各种有效剪枝技术,并使用局部最大频繁项集来进行高效的超集存在判断,明显地加速了最大频繁项集的生成,从而降低了CPU时间.  相似文献   

9.
基于属性分组的高效挖掘关联规则算法   总被引:6,自引:0,他引:6  
挖掘频繁项集在数据挖掘中有着重要的作用。目前,关于频繁项集的挖掘问题已经提出了一些算法,虽然实现了一次扫描数据库即可以发现所有的频繁项集,但是当属性数目很多时,算法的执行效率下降很快。论文首次提出了利用属性分组作为挖掘关联规则的工具,给出了基于属性分组的频繁项集挖掘算法,用矩阵来存储数据库属性间的信息并提取频繁项集,而且不产生候选项集。经实验验证该算法是快速有效的。  相似文献   

10.
刘慧婷  沈盛霞  赵鹏  姚晟 《计算机应用》2015,35(10):2911-2914
由于不确定数据的向下封闭属性,挖掘全部频繁项集的方法会得到一个指数级的结果。为获得一个较小的合适的结果集,研究了在不确定数据上挖掘频繁闭项集,并提出了一种新的频繁闭项集挖掘算法——NA-PFCIM。该算法将项集挖掘过程看作一个概率分布函数,考虑到基于正态分布模型的方法提取的频繁项集精确度较高,而且支持大型数据库,采用了正态分布模型提取频繁项集。同时,为了减少搜索空间以及避免冗余计算,利用基于深度优先搜索的策略来获得所有的概率频繁闭项集。该算法还设计了两个剪枝策略:超集修剪和子集修剪。最后,在常用的数据集(T10I4D100K、Accidents、Mushroom、Chess)上,将提出的NA-PFCIM算法和基于泊松分布的A-PFCIM算法进行比较。实验结果表明,NA-PFCIM算法能够减少所要扩展的项集,同时减少项集频繁概率的计算,其性能优于对比算法。  相似文献   

11.
频繁项集挖掘的研究与进展   总被引:6,自引:0,他引:6  
挖掘频繁项集是许多数据挖掘任务中的关键问题,也是关联规则挖掘算法的核心,所以提高频繁项集的生成效率一直是近几年数据挖掘领域研究的热点之一,研究人员从不同的角度对算法进行改进以提高算法的效率。该文从频繁项集生成过程中解空间的类型、搜索方法和剪枝策略、数据库的表示方法、数据压缩技术等几个方面对频繁项集挖掘的基本策略进行了研究,对完全频繁项集挖掘、频繁闭项集挖掘和最大频繁项集挖掘的典型算法特别是最新算法进行了介绍和评述,并分析了各种算法的性能特点,指出其适于哪种类型的数据集。最后,对频繁项集挖掘算法的发展方向进行了初步的探讨。  相似文献   

12.
改进的数据流频繁闭项集挖掘算法   总被引:2,自引:1,他引:1       下载免费PDF全文
为提高数据流频繁闭项集的查找效率,提出一种改进的NewMoment频繁闭项集挖掘算法,通过在LevelCET数据结构中加入层次结点,并利用层次检测策略与最佳频繁闭项集检测策略快速挖掘数据流滑动窗口中所有的频繁闭项集。实验结果证明,与NewMoment算法相比,改进的算法性能更优。  相似文献   

13.
卢铮松  赵洁 《计算机工程》2009,35(20):81-82
对某供热公司累积的大量供热用户收费数据进行分析,通过构建数据仓库和利用数据概化方法建立供热用户数据挖掘模型,使用频繁项集方法产生关联规则,利用决策树算法得出交费时间特征,从而得出不同区域和类型用户的习惯交费时间段。对该数据挖掘模型进行评价,提出的4项收费决策建议在实际应用中取得良好效果。  相似文献   

14.
In recent times, data are generated as a form of continuous data streams in many applications. Since handling data streams is necessary and discovering knowledge behind data streams can often yield substantial benefits, mining over data streams has become one of the most important issues. Many approaches for mining frequent itemsets over data streams have been proposed. These approaches often consist of two procedures including continuously maintaining synopses for data streams and finding frequent itemsets from the synopses. However, most of the approaches assume that the synopses of data streams can be saved in memory and ignore the fact that the information of the non-frequent itemsets kept in the synopses may cause memory utilization to be significantly degraded. In this paper, we consider compressing the information of all the itemsets into a structure with a fixed size using a hash-based technique. This hash-based approach skillfully summarizes the information of the whole data stream by using a hash table, provides a novel technique to estimate the support counts of the non-frequent itemsets, and keeps only the frequent itemsets for speeding up the mining process. Therefore, the goal of optimizing memory space utilization can be achieved. The correctness guarantee, error analysis, and parameter setting of this approach are presented and a series of experiments is performed to show the effectiveness and the efficiency of this approach.  相似文献   

15.
最大频繁项目集的快速更新   总被引:29,自引:0,他引:29  
挖掘最大频繁项目集是多种数据挖掘应用中的关键问题.为克服基于Apriori的最大频繁项目集挖掘算法存在的不足,DMFIA采用FP-tree存储结构及自顶向下的搜索策略,有效地提高了最大频繁项目集的挖掘效率.但对于频繁项目多而最大频繁项目集维数相对较小的情况,DMFIA要经过多层搜索且在每一层产生大量的候选项目集,因而影响算法的执行效率.为此,该文提出了DMFIA的改进算法IDMFIA(the Improved algorithm of DMFIA).IDMFIA采用自顶向下和自底向上双向搜索策略,可尽早修剪掉较短最大频繁项目集的超集和较长最大频繁项目集的子集.另外,该文还提出最大频繁项目集更新算法FUMFIA(Fast Updating Maximum Frequent Itemsets Algorithm),该算法充分利用已建立的FP-tree和已挖掘的最大频繁项目集,可对已挖掘的最大频繁项目集进行高效维护.实验结果表明,IDMFIA和FUMFIA可有效提高最大频繁项目集的挖掘和更新效率.  相似文献   

16.
In this paper we present the dual support Apriori for temporal data (DSAT) algorithm. This is a novel technique for discovering jumping and emerging patterns (JEPs) from time series data using a sliding window technique. Our approach is particularly effective when performing trend analysis in order to explore the itemset variations over time. Our proposed framework is different from the previous work on JEP in that we do not rely on itemsets borders with a constrained search space. DSAT exploits previously mined time stamped data by using a sliding window concept, thus requiring less memory, minimum computational cost and very low dataset accesses. DSAT discovers all JEPs, as in “naïve” approaches, but utilises less memory and scales linearly with large datasets sets as demonstrated in the experimental section.  相似文献   

17.
李海峰  章宁 《计算机工程》2012,38(21):45-48
最大频繁项集适用于内存空间有限的数据流挖掘。为此,提出一种基于界碑模型的最大频繁项集挖掘方法,采用最大频繁项集树的数据结构,增量式地维护最大频繁项集与部分附属信息,实现项集的快速搜索和裁剪。在MUSHROOM和BMS-POS数据集上的实验结果表明,该方法具有较高的挖掘效率。  相似文献   

18.
一种隐私保护关联规则挖掘的混合算法*   总被引:3,自引:2,他引:1  
针对现有的隐私保护关联规则挖掘算法无法满足效率与精度之间较好的折中问题,提出了一种基于安全多方计算与随机干扰相结合的混合算法。算法基于半诚实模型,首先使用项集随机干扰矩阵对各个分布站点的数据进行变换和隐藏,然后提出一种方法恢复项集的全局支持数。由于采用的是对项集进行干扰,克服了传统方法由于独立地干扰每个项而破坏项之间相关性,导致恢复精度下降的缺陷。将小于阈值的项集进行剪枝,再使用安全多方计算在剪枝后的空间中精确找出全局频繁项集,进而生成全局关联规则。实验表明,该算法在保持隐私度的情况下,能够获得精度和效率之间较好的折中。  相似文献   

19.
Identifying Approximate Itemsets of Interest in Large Databases   总被引:2,自引:0,他引:2  
This paper presents a method for discovering approximate frequent itemsets of interest in large scale databases. This method uses the central limit theorem to increase efficiency, enabling us to reduce the sample size by about half compared to previous approximations. Further efficiency is gained by pruning from the search space uninteresting frequent itemsets. In addition to improving efficiency, this measure also reduces the number of itemsets that the user need consider. The model and algorithm have been implemented and evaluated using both synthetic and real-world databases. Our experimental results demonstrate the efficiency of the approach.  相似文献   

20.
Incremental mining has attracted the attention of many researchers due to its usefulness in online applications. Many algorithms have thus been proposed for incrementally mining frequent itemsets. Maintaining a frequent-itemset lattice (FIL) is difficult for databases with large numbers of frequent itemsets, especially huge databases, due to the storage of links of nodes in the lattice. However, generating association rules from a FIL has been shown to be more effective than traditional methods such as directly generating rules from frequent itemsets or frequent closed itemsets. Therefore, when the number of frequent itemsets is not huge (i.e., they can be stored in the lattice without excessive memory overhead), the lattice-based approach outperforms approaches which mine association rules from frequent itemsets/frequent closed itemsets. However, incremental algorithms for building FILs have not yet been proposed. This paper proposes an effective approach for the maintenance of a FIL based on the pre-large concept in incremental mining. The building process of a FIL is first improved using two proposed theorems regarding the paternity relation between two nodes in the lattice. An effective approach for maintaining a FIL with dynamically inserted data is then proposed based on the pre-large and the diffset concepts. The experimental results show that the proposed approach outperforms the batch approach for building a FIL in terms of execution time.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号