Found 20 similar documents (search time: 31 ms)
1.
Association-rule mining, which is based on the frequency of items, is the most common topic in data mining. In real-world applications, however, customers may buy multiple copies of a product, and products differ in factors such as profit and price. Mining only frequent itemsets in binary databases is thus unsuitable for some applications. Utility mining was therefore introduced to take additional measures, such as profits or costs, into account according to user preference. Previously, a two-phase mining algorithm was designed to quickly discover high utility itemsets in databases. When data arrive intermittently, that approach has to reprocess all the transactions in batch mode. In this paper, an incremental algorithm for efficiently mining high utility itemsets is proposed to handle this situation. It is based on the fast-update (FUP) approach, which was originally designed for association mining. The proposed approach first partitions itemsets into four parts according to whether they are high transaction-weighted utilization itemsets in the original database and in the newly inserted transactions. Each part is then handled by its own procedure. Experimental results show that the proposed algorithm runs faster than the two-phase batch mining algorithm in an intermittent-data environment.
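The transaction-weighted utilization (TWU) measure named in this abstract can be illustrated with a minimal sketch. The toy transactions, the profit table, and the function names below are assumptions made for illustration and do not come from the paper; the sketch shows only the standard TWU definition (the sum of the utilities of all transactions that contain the itemset), not the incremental algorithm itself.

```python
# Hypothetical toy data: each transaction maps an item to its purchased quantity.
transactions = [
    {"A": 1, "B": 2},
    {"B": 1, "C": 3},
    {"A": 2, "C": 1},
]
# Hypothetical per-unit profit of each item.
profit = {"A": 5, "B": 2, "C": 1}


def transaction_utility(tx, profit):
    """Total utility of one transaction: sum of quantity * unit profit."""
    return sum(qty * profit[item] for item, qty in tx.items())


def twu(itemset, transactions, profit):
    """Transaction-weighted utilization of an itemset: the sum of the
    utilities of every transaction that contains all items of the itemset."""
    items = set(itemset)
    return sum(
        transaction_utility(tx, profit)
        for tx in transactions
        if items.issubset(tx)  # iterating a dict yields its keys (the items)
    )


if __name__ == "__main__":
    for candidate in ({"A"}, {"B"}, {"A", "B"}):
        print(sorted(candidate), twu(candidate, transactions, profit))
```

Because TWU never underestimates the utility of an itemset's supersets, it can be used as a downward-closed pruning bound, which is why two-phase approaches filter on it first.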
2.
Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules, strong rules, episodes, and minimal keys. Typical algorithms for solving this problem operate in a bottom-up, breadth-first search direction. The computation starts from frequent 1-itemsets (the minimum length frequent itemsets) and continues until all maximal (length) frequent itemsets are found. During the execution, every frequent itemset is explicitly considered. Such algorithms perform well when all maximal frequent itemsets are short. However, performance deteriorates drastically when some of the maximal frequent itemsets are long. We present a new algorithm that combines the bottom-up and top-down searches. The primary search direction is still bottom-up, but a restricted search is also conducted in the top-down direction. This search is used only for maintaining and updating a new data structure, the maximum frequent candidate set, which is used to prune early the candidates that would normally be encountered in the bottom-up search. A very important characteristic of the algorithm is that it does not require explicit examination of every frequent itemset. We evaluate the performance of the algorithm using well-known synthetic benchmark databases as well as real-life census and stock market databases.
3.
The set of frequent closed itemsets uniquely determines the exact frequency of all itemsets, yet it can be orders of magnitude smaller than the set of all frequent itemsets. In this paper, we present CHARM, an efficient algorithm for mining all frequent closed itemsets. It enumerates closed sets using a dual itemset-tidset search tree and an efficient hybrid search that skips many levels. It also uses a technique called diffsets to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any "nonclosed" sets found during computation. We also present CHARM-L, an algorithm that outputs the closed itemset lattice, which is very useful for rule generation and visualization. An extensive experimental evaluation on a number of real and synthetic databases shows that CHARM is a state-of-the-art algorithm that outperforms previous methods. Further, CHARM-L explicitly generates the frequent closed itemset lattice.
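The diffset bookkeeping mentioned in this abstract can be sketched as follows. This is only an illustration of the general idea (store the transaction IDs that an itemset loses relative to its prefix rather than the full tidset); the toy vertical database and the helper name `extend` are assumptions, and the sketch is not CHARM itself.

```python
# Hypothetical vertical database: item -> set of transaction IDs (tidset).
tidsets = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 3, 5},
    "C": {2, 3, 4},
}

# All transaction IDs (assumes every transaction contains at least one item).
all_tids = set().union(*tidsets.values())

# Level 1, empty prefix: the diffset of an item is the set of tids it is missing from.
diffsets = {item: all_tids - tids for item, tids in tidsets.items()}
support = {item: len(all_tids) - len(diffsets[item]) for item in tidsets}


def extend(x, y, diffsets, support):
    """Combine two siblings PX and PY that share the same prefix P.

    d(PXY) = d(PY) - d(PX), and sup(PXY) = sup(PX) - |d(PXY)|.
    Returns the diffset and the support of PXY.
    """
    d_xy = diffsets[y] - diffsets[x]
    return d_xy, support[x] - len(d_xy)


if __name__ == "__main__":
    d_bc, sup_bc = extend("B", "C", diffsets, support)
    # Cross-check against a direct tidset intersection.
    assert sup_bc == len(tidsets["B"] & tidsets["C"])
    print("diffset(BC) =", sorted(d_bc), "support(BC) =", sup_bc)
```

On dense data the diffsets tend to be much smaller than the tidsets they replace, which is the source of the memory saving the abstract refers to.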
4.
5.
Discovering shared conceptualizations in folksonomies  Cited by: 2 (self-citations: 0, citations by others: 2)
Robert Jäschke, Andreas Hotho, Christoph Schmitz, Bernhard Ganter, Gerd Stumme. Journal of Web Semantics, 2008, 6(1): 38-53
Social bookmarking tools are rapidly emerging on the Web. In such systems, users set up lightweight conceptual structures called folksonomies. Unlike ontologies, shared conceptualizations are not formalized, but rather implicit. We present a new data mining task, the mining of all frequent tri-concepts, together with an efficient algorithm, for discovering these implicit shared conceptualizations. Our approach extends the data mining task of discovering all closed itemsets to three-dimensional data structures to allow for mining folksonomies. We provide a formal definition of the problem and present an efficient algorithm for its solution. Finally, we show the applicability of our approach on three large real-world examples.
6.
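This paper proposes a method for mining closed frequent itemsets on a graphics processing unit (GPU). Itemsets are represented as binary data, parallel computation is carried out on the single-instruction, multiple-data (SIMD) architecture, and an itemset index tree is used to speed up itemset support counting and itemset lookup. Experimental results on two datasets show that the method stores the complete information of the frequent itemsets in less space and reduces mining time.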
7.
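Extracting customer purchase behavior features based on a genetic algorithm  Cited by: 2 (self-citations: 0, citations by others: 2)
This paper proposes a genetic-algorithm-based method for extracting customer behavior features. First, Tanimoto similarity is used to measure the similarity of purchase behavior between customers, and a genetic clustering algorithm is designed to partition the customer population, grouping customers with similar purchase behavior into the same cluster. Then, targeting the purchase behavior characteristics of the different customer groups, a multi-population feature extraction method based on a genetic algorithm is designed to discover knowledge about customers' purchase behavior in each subgroup. To strengthen co-evolution within each population and to improve rule quality, a nearest-neighbor replacement strategy and a local search strategy are adopted. The whole algorithm is validated on a real retail dataset and compared with the classic Apriori algorithm. The experimental results show that the algorithm generates a compact rule set fairly efficiently without generating frequent itemsets, and that it is also more flexible in the form of its rules. Finally, the experimental results are analyzed in detail.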
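As a small illustration of the Tanimoto similarity used in this abstract to compare customers, here is a minimal sketch. It assumes the set (binary) form of the coefficient, in which each customer is represented by the set of products bought; the abstract does not say which variant the authors use, and the function name and sample baskets are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two purchase sets:
    |A intersect B| / |A union B|, defined here as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical baskets for three customers.
    alice = {"milk", "bread", "eggs"}
    bob = {"milk", "bread"}
    carol = {"wine"}
    print(tanimoto(alice, bob))    # two shared items out of three distinct items
    print(tanimoto(alice, carol))  # nothing in common
```

A clustering step such as the one described in the abstract could then use `1 - tanimoto(a, b)` as a distance between customers.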
8.
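A depth-first algorithm for mining maximal frequent itemsets  Cited by: 7 (self-citations: 0, citations by others: 7)
Mining maximal frequent itemsets is an important problem in many data mining applications. This paper proposes a new depth-first search algorithm for mining maximal frequent itemsets. The algorithm uses a bitmap data format, combines a variety of popular and effective pruning techniques, and uses local maximal frequent itemsets for efficient superset-existence checking. This markedly speeds up the generation of maximal frequent itemsets and thereby reduces CPU time.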
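The bitmap data format mentioned in this abstract can be sketched as follows: each item gets one bit per transaction, and the support of an itemset is the number of set bits in the bitwise AND of its items' bitmaps. The toy transactions and function names are assumptions for illustration; the sketch covers only support counting, not the depth-first search or the pruning techniques of the paper.

```python
# Hypothetical toy database: one set of items per transaction.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]


def build_bitmaps(transactions):
    """Map each item to an integer whose bit `tid` is 1 when the item
    occurs in transaction number `tid`."""
    bitmaps = {}
    for tid, tx in enumerate(transactions):
        for item in tx:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(itemset, bitmaps, n_transactions):
    """Support of an itemset: AND the bitmaps of its items, then count the 1 bits."""
    mask = (1 << n_transactions) - 1  # start with every transaction selected
    for item in itemset:
        mask &= bitmaps.get(item, 0)  # an unknown item clears all bits
    return bin(mask).count("1")


if __name__ == "__main__":
    bitmaps = build_bitmaps(transactions)
    for candidate in ({"A", "B"}, {"A", "C"}, {"C"}):
        print(sorted(candidate), support(candidate, bitmaps, len(transactions)))
```

Extending an itemset by one item costs a single AND on the parent's bitmap, which suits a depth-first traversal in which each node reuses the bitmap of its parent.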
9.
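An efficient association rule mining algorithm based on attribute grouping  Cited by: 6 (self-citations: 0, citations by others: 6)
Mining frequent itemsets plays an important role in data mining. Several algorithms have been proposed for this problem; although they can discover all frequent itemsets in a single database scan, their efficiency drops quickly when the number of attributes is large. This paper is the first to propose attribute grouping as a tool for mining association rules. It presents a frequent itemset mining algorithm based on attribute grouping that uses a matrix to store information about the relationships among database attributes and to extract frequent itemsets, without generating candidate itemsets. Experiments verify that the algorithm is fast and effective.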
10.
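Because of the downward closure property of uncertain data, methods that mine all frequent itemsets produce an exponentially large result. To obtain a smaller, more suitable result set, this paper studies the mining of frequent closed itemsets over uncertain data and proposes a new frequent closed itemset mining algorithm, NA-PFCIM. The algorithm treats the itemset mining process as a probability distribution function. Since methods based on the normal distribution model extract frequent itemsets with high accuracy and support large databases, the algorithm adopts the normal distribution model to extract frequent itemsets. In addition, to reduce the search space and avoid redundant computation, a depth-first search strategy is used to obtain all probabilistic frequent closed itemsets. The algorithm also includes two pruning strategies: superset pruning and subset pruning. Finally, on commonly used datasets (T10I4D100K, Accidents, Mushroom, Chess), the proposed NA-PFCIM algorithm is compared with the Poisson-distribution-based A-PFCIM algorithm. The experimental results show that NA-PFCIM reduces the number of itemsets to be extended as well as the number of frequentness probability computations, and that it outperforms the algorithm it is compared with.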
11.
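Research and advances in frequent itemset mining  Cited by: 6 (self-citations: 0, citations by others: 6)
Mining frequent itemsets is a key problem in many data mining tasks and the core of association rule mining algorithms. Improving the efficiency of frequent itemset generation has therefore been one of the hot research topics in data mining in recent years, and researchers have improved the algorithms from different angles. This paper studies the basic strategies of frequent itemset mining from several aspects: the type of solution space in the generation process, search methods and pruning strategies, database representation, and data compression techniques. It introduces and reviews typical algorithms, especially the most recent ones, for mining all frequent itemsets, frequent closed itemsets, and maximal frequent itemsets, analyzes the performance characteristics of each algorithm, and indicates which types of datasets each is suited to. Finally, it offers a preliminary discussion of future directions for frequent itemset mining algorithms.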
12.
13.
14.
A novel hash-based approach for mining frequent itemsets over data streams requiring less memory space  Cited by: 2 (self-citations: 1, citations by others: 1)
In recent times, data are generated in the form of continuous data streams in many applications. Since handling data streams
is necessary and discovering knowledge behind data streams can often yield substantial benefits, mining over data streams
has become one of the most important issues. Many approaches for mining frequent itemsets over data streams have been proposed.
These approaches often consist of two procedures including continuously maintaining synopses for data streams and finding
frequent itemsets from the synopses. However, most of the approaches assume that the synopses of data streams can be saved
in memory and ignore the fact that the information of the non-frequent itemsets kept in the synopses may cause memory utilization
to be significantly degraded. In this paper, we consider compressing the information of all the itemsets into a structure
with a fixed size using a hash-based technique. This hash-based approach skillfully summarizes the information of the whole
data stream by using a hash table, provides a novel technique to estimate the support counts of the non-frequent itemsets,
and keeps only the frequent itemsets for speeding up the mining process. Therefore, the goal of optimizing memory space utilization
can be achieved. The correctness guarantee, error analysis, and parameter setting of this approach are presented and a series
of experiments is performed to show the effectiveness and the efficiency of this approach.
15.
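Fast updating of maximum frequent itemsets  Cited by: 29 (self-citations: 0, citations by others: 29)
Mining maximum frequent itemsets is a key problem in many data mining applications. To overcome the shortcomings of Apriori-based algorithms for mining maximum frequent itemsets, DMFIA uses the FP-tree storage structure and a top-down search strategy, which effectively improves mining efficiency. However, when there are many frequent items but the dimensionality of the maximum frequent itemsets is relatively small, DMFIA has to search through many levels and generates a large number of candidate itemsets at each level, which hurts its efficiency. This paper therefore proposes IDMFIA (the Improved algorithm of DMFIA). IDMFIA uses a bidirectional search strategy, both top-down and bottom-up, so that supersets of shorter maximum frequent itemsets and subsets of longer maximum frequent itemsets can be pruned as early as possible. In addition, the paper proposes FUMFIA (Fast Updating Maximum Frequent Itemsets Algorithm), an update algorithm that makes full use of the FP-tree already built and the maximum frequent itemsets already mined, so that the mined maximum frequent itemsets can be maintained efficiently. Experimental results show that IDMFIA and FUMFIA effectively improve the efficiency of mining and updating maximum frequent itemsets.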
16.
In this paper we present the dual support Apriori for temporal data (DSAT) algorithm, a novel technique for discovering jumping and emerging patterns (JEPs) from time series data using a sliding window. Our approach is particularly effective for trend analysis that explores how itemsets vary over time. Our proposed framework differs from previous work on JEPs in that we do not rely on itemset borders with a constrained search space. DSAT exploits previously mined time-stamped data by using a sliding window, and thus requires less memory, minimal computational cost, and very few dataset accesses. DSAT discovers all JEPs, as "naïve" approaches do, but uses less memory and scales linearly with large datasets, as demonstrated in the experimental section.
17.
18.
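A hybrid algorithm for privacy-preserving association rule mining  Cited by: 3 (self-citations: 2, citations by others: 1)
To address the failure of existing privacy-preserving association rule mining algorithms to achieve a good trade-off between efficiency and accuracy, this paper proposes a hybrid algorithm that combines secure multi-party computation with random perturbation. The algorithm is based on the semi-honest model. It first uses an itemset random perturbation matrix to transform and hide the data at each distributed site, and then proposes a method for recovering the global support counts of itemsets. Because the perturbation is applied to itemsets, the algorithm overcomes a drawback of traditional methods, which perturb each item independently and thereby destroy the correlations between items, lowering recovery accuracy. Itemsets below the threshold are pruned, and secure multi-party computation is then used to find the global frequent itemsets exactly within the pruned space, from which the global association rules are generated. Experiments show that the algorithm achieves a good trade-off between accuracy and efficiency while preserving the level of privacy.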
19.
Identifying Approximate Itemsets of Interest in Large Databases  Cited by: 2 (self-citations: 0, citations by others: 2)
This paper presents a method for discovering approximate frequent itemsets of interest in large-scale databases. The method uses the central limit theorem to increase efficiency, enabling us to reduce the sample size by about half compared to previous approximations. Further efficiency is gained by pruning uninteresting frequent itemsets from the search space. In addition to improving efficiency, this measure also reduces the number of itemsets that the user needs to consider. The model and algorithm have been implemented and evaluated using both synthetic and real-world databases. Our experimental results demonstrate the efficiency of the approach.
20.
Incremental mining has attracted the attention of many researchers due to its usefulness in online applications, and many algorithms have been proposed for incrementally mining frequent itemsets. Maintaining a frequent-itemset lattice (FIL) is difficult for databases with large numbers of frequent itemsets, especially huge databases, because the links between nodes in the lattice must be stored. However, generating association rules from a FIL has been shown to be more effective than traditional methods such as generating rules directly from frequent itemsets or frequent closed itemsets. Therefore, when the number of frequent itemsets is not huge (i.e., they can be stored in the lattice without excessive memory overhead), the lattice-based approach outperforms approaches that mine association rules from frequent itemsets or frequent closed itemsets. However, incremental algorithms for building FILs have not yet been proposed. This paper proposes an effective approach for maintaining a FIL based on the pre-large concept in incremental mining. The process of building a FIL is first improved using two proposed theorems regarding the paternity relation between two nodes in the lattice. An effective approach for maintaining a FIL with dynamically inserted data is then proposed based on the pre-large and diffset concepts. The experimental results show that the proposed approach outperforms the batch approach for building a FIL in terms of execution time.