Similar literature
1.
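Mining the collection of frequent itemsets from a large data set is an important and highly challenging task in data mining, and many efficient algorithms have already been proposed. This paper proposes mining a subset of the frequent itemsets, which we call the frequent disjunction-free sets, instead of the complete collection of frequent itemsets. We prove that the full collection of frequent itemsets and their supports can be recovered from this subset without reading the database. We also present HOPE-Ⅱ, an algorithm for mining disjunction-free sets, and experimental results demonstrate its efficiency. We compare this representation with another condensed representation, the frequent closed sets; nearly all experimental results show that mining frequent itemsets through disjunction-free sets is more effective than mining them through closed sets.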

2.
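A basic task in data mining is to mine frequent itemsets from databases holding massive amounts of data. This paper proposes a method that, instead of mining the complete set of frequent itemsets, mines a condensed representation of it called the frequent rule-free sets. From the frequent rule-free sets we can recover the complete collection of frequent itemsets and their exact supports without reading the database. Mining the frequent rule-free sets is shown to be efficient. We give an algorithm, HOPE-Ⅲ, for mining the frequent rule-free sets and compare it with the A-Close algorithm. Experimental results show that HOPE-Ⅲ outperforms A-Close in every case tested.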

3.
Given a large collection of transactions containing items, a basic common data mining problem is to extract the so-called frequent itemsets (i.e., sets of items appearing in at least a given number of transactions). In this paper, we propose a structure called free-sets, from which we can approximate any itemset support (i.e., the number of transactions containing the itemset), and we formalize this notion in the framework of ε-adequate representations (H. Mannila and H. Toivonen, 1996. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 189–194). We show that frequent free-sets can be efficiently extracted using pruning strategies developed for frequent itemset discovery, and that they can be used to approximate the support of any frequent itemset. Experiments on real dense data sets show a significant reduction of the size of the output when compared with standard frequent itemset extraction. Furthermore, the experiments show that the extraction of frequent free-sets is still possible when the extraction of frequent itemsets becomes intractable, and that the supports of the frequent free-sets can be used to approximate very closely the supports of the frequent itemsets. Finally, we consider the effect of this approximation on association rules (a popular kind of pattern that can be derived from frequent itemsets) and show that the corresponding errors remain very low in practice.
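
The two definitions in this abstract (support as the number of transactions containing an itemset, and frequent itemsets as those whose support reaches a given threshold) can be illustrated with a minimal Python sketch. It is a brute-force enumeration on toy data, not the free-set extraction described in the paper; the names `support`, `frequent_itemsets` and the sample transactions are assumptions made for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all frequent itemsets (toy data only)."""
    universe = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


# Hypothetical toy database of five transactions.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
)]
freq = frequent_itemsets(transactions, min_support=3)
```

With a threshold of 3, every single item and every pair is frequent here, while `{a, b, c}` (support 2) is not. The enumeration is exponential in the number of items, which is the cost that condensed representations such as free-sets aim to avoid.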

4.
Mining frequent sequences in sequential databases is highly valuable for many real-life applications. However, in several cases, especially when databases are huge and when low minimum support thresholds are used, the cardinality of the result set can be enormous. Consequently, algorithms for discovering frequent sequences exhibit poor performance, with a marked increase in execution time, memory consumption and storage space usage. To address this issue, researchers have studied the tasks of mining frequent closed and generator sequences, as they provide several benefits when compared to the set of frequent sequences. One of the most important benefits is that the cardinalities of frequent closed and generator sequences are generally much smaller than the cardinality of frequent sequences. Hence, humans find it more convenient to analyze the information provided by closed and generator sequences. Moreover, it was shown that frequent closed sequences have the advantage of being lossless, and they thus preserve information about the frequency of all frequent subsequences, while generator sequences can provide higher accuracy for sequence classification tasks since they are the smallest patterns that characterize groups of sequences. Besides, frequent closed sequences can be combined with generators to produce non-redundant sequential rules and recover the complete set of frequent sequences and their frequencies. This paper proposes two novel algorithms named FCloSM and FGenSM to mine frequent closed and generator sequences efficiently. These algorithms are based on new pruning conditions called extended early elimination (3E) and early pruning techniques named EPCLO and EPGEN, designed to identify non-closed and non-generator patterns early.
Based on these techniques, two local pruning strategies called LPCLO and LPGEN are proposed to eliminate non-closed and non-generator patterns more efficiently at two successive levels of the prefix search tree without performing subsequence relation checking. These theoretical results, which are the basis of FCloSM and FGenSM, are mathematically proved and are shown to be more general than those presented in previous work. Extensive experiments show that FCloSM and FGenSM are one to two orders of magnitude faster than the state-of-the-art algorithms for discovering frequent closed sequences (CloSpan, BIDE, ClaSP and CM-ClaSP) and for mining frequent generators (FEAT, FSGP and VGEN), and that FCloSM and FGenSM consume much less memory.

5.
6.
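Mining closed frequent patterns in data streams over sliding windows   Cited by: 11 (self-citations: 1; citations by others: 11)
The set of frequent closed patterns uniquely determines the complete set of frequent patterns and is much smaller; however, mining the frequent closed patterns within a sliding window remains a major challenge. Based on the characteristics of data streams, this paper proposes DS_CFI, a new method for discovering frequent closed patterns in a sliding window. DS_CFI partitions the sliding window into several basic windows and uses the basic window as the unit of update. It applies an existing frequent closed pattern mining algorithm to compute the potential frequent closed itemsets of each basic window and stores them, together with their subsets, in a new data structure, DSCFI_tree. DSCFI_tree can be updated incrementally, and all frequent closed patterns in the sliding window can be mined from it quickly. Finally, experiments verify the effectiveness of the method.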
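
The idea of splitting a sliding window into basic windows and updating one basic window at a time can be sketched as follows. This is a simplified illustration, not DS_CFI: it keeps plain itemset counts (up to an assumed `max_size`) instead of closed itemsets in a DSCFI_tree, and the class name `SlidingWindowCounter` and its methods are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


class SlidingWindowCounter:
    """Itemset counts over the most recent `n_basic_windows` basic windows."""

    def __init__(self, n_basic_windows, max_size=2):
        self.n_basic_windows = n_basic_windows
        self.max_size = max_size      # assumption: only count small itemsets
        self.windows = deque()        # one Counter per basic window
        self.totals = Counter()       # counts over the whole sliding window

    def _count(self, transactions):
        counts = Counter()
        for t in transactions:
            items = sorted(t)
            for k in range(1, min(self.max_size, len(items)) + 1):
                for combo in combinations(items, k):
                    counts[frozenset(combo)] += 1
        return counts

    def push(self, basic_window):
        """Add the newest basic window and expire the oldest if needed."""
        counts = self._count(basic_window)
        self.windows.append(counts)
        self.totals.update(counts)
        if len(self.windows) > self.n_basic_windows:
            self.totals.subtract(self.windows.popleft())
            self.totals = +self.totals  # drop entries that fell to zero

    def frequent(self, min_support):
        return {s: c for s, c in self.totals.items() if c >= min_support}


w = SlidingWindowCounter(n_basic_windows=2)
w.push([{"a", "b"}, {"a"}])
w.push([{"a", "b"}])
before = w.frequent(2)
w.push([{"b"}])  # the first basic window now expires
after = w.frequent(2)
```

Because each basic window keeps its own counts, expiring the oldest one is a subtraction rather than a rescan of the stream, which is the property the incremental update relies on.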

7.
A core issue in association rule extraction in data mining is finding the frequent patterns in a database of operational transactions. If these patterns are discovered, decision making and strategy setting in organizations can be carried out with greater precision. A frequent pattern is a pattern seen in a significant number of transactions. Because such data are produced at high speed and without bound, they cannot be stored in memory, so techniques are needed that process them online and find repetitive patterns. Several mining methods have been proposed in the literature that attempt to efficiently extract a complete or a closed set of different types of frequent patterns from a dataset. In this paper, a method based on Cellular Learning Automata (CLA) is presented for mining frequent itemsets. The proposed method is compared with the Apriori, FP-Growth and BitTable methods, and it is concluded that frequent itemset mining can be achieved in less running time. The experiments are conducted on several data sets with different values of minsup, both for all the algorithms and for the presented method individually. The results point to the effectiveness of the proposed method.

8.
Fast and memory efficient mining of frequent closed itemsets   Cited by: 12 (self-citations: 0; citations by others: 12)
This paper presents a new scalable algorithm for discovering closed frequent itemsets, a lossless and condensed representation of all the frequent itemsets that can be mined from a transactional database. Our algorithm exploits a divide-and-conquer approach and a bitwise vertical representation of the database and adopts a particular visit and partitioning strategy of the search space based on an original theoretical framework, which formalizes the problem of closed itemsets mining in detail. The algorithm adopts several optimizations aimed at saving both space and time in computing itemset closures and their supports. In particular, since one of the main problems in this type of algorithm is the multiple generation of the same closed itemset, we propose a new effective and memory-efficient pruning technique, which, unlike other previous proposals, does not require the whole set of closed patterns mined so far to be kept in the main memory. This technique also permits each visited partition of the search space to be mined independently in any order and, thus, also in parallel. The tests conducted on many publicly available data sets show that our algorithm is scalable and outperforms other state-of-the-art algorithms like CLOSET+ and FP-CLOSE, in some cases by more than one order of magnitude. More importantly, the performance improvements become more and more significant as the support threshold is decreased.
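
A bitwise vertical representation and the closure of an itemset can be sketched in a few lines of Python. This is only an illustration of the two notions on toy data, not the algorithm of the paper; the function names and the use of Python integers as bitmaps are assumptions. Each item maps to an integer whose bit `tid` is set when transaction `tid` contains the item.

```python
def vertical_bitmaps(transactions):
    """Map each item to an int whose bit `tid` is set if transaction tid has it."""
    bitmaps = {}
    for tid, t in enumerate(transactions):
        for item in t:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def cover(itemset, bitmaps, n_transactions):
    """Bitmap of the transactions containing every item of `itemset`."""
    mask = (1 << n_transactions) - 1
    for item in itemset:
        mask &= bitmaps[item]
    return mask


def support(itemset, bitmaps, n_transactions):
    """Support is the number of set bits in the cover."""
    return bin(cover(itemset, bitmaps, n_transactions)).count("1")


def closure(itemset, bitmaps, n_transactions):
    """All items present in every transaction that contains `itemset`."""
    mask = cover(itemset, bitmaps, n_transactions)
    return frozenset(i for i, b in bitmaps.items() if mask & b == mask)


# Hypothetical toy database.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
bitmaps = vertical_bitmaps(transactions)
n = len(transactions)
```

Here `{b}` occurs in three transactions, all of which also contain `a`, so its closure is `{a, b}` and `{b}` itself is not closed. An itemset is closed exactly when it equals its own closure.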

9.
Generating a Condensed Representation for Association Rules   Cited by: 1 (self-citations: 0; citations by others: 1)
Association rule extraction from operational datasets often produces several tens of thousands, and even millions, of association rules. Moreover, many of these rules are redundant and thus useless. Using a semantics based on the closure of the Galois connection, we define a condensed representation for association rules. This representation is characterized by frequent closed itemsets and their generators. It contains the non-redundant association rules having minimal antecedent and maximal consequent, called min-max association rules. We think that these rules are the most relevant since they are the most general non-redundant association rules. Furthermore, this representation is a basis, i.e., a generating set for all association rules, their supports and their confidences, and all of them can be retrieved without accessing the data. We introduce algorithms for extracting this basis and for reconstructing all association rules. Results of experiments carried out on real datasets show the usefulness of this approach. In order to generate this basis when an algorithm for extracting frequent itemsets (such as Apriori) is used, we also present an algorithm for deriving frequent closed itemsets and their generators from frequent itemsets without using the dataset.

10.
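An algorithm for mining rule generating sets based on concept lattices   Cited by: 27 (self-citations: 0; citations by others: 27)
The rule sets produced by traditional rule extraction algorithms are very large and contain many redundant rules. Closed itemsets can reduce the number of rules, and the generalization and specialization relations between concept lattice nodes are well suited to rule extraction. Based on concept lattice theory and the notion of closed itemsets, this paper proposes a new lattice structure that is better suited to rule extraction, and gives a corresponding incremental construction algorithm based on closed labels together with a rule extraction algorithm. What is finally presented to the user is an intuitive, easily understood subset of rules, from which the user can selectively derive other rules. Experiments show that the method mines rule generating sets efficiently.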

11.
In recent years, mining high utility itemsets (HUIs) from transactional databases has become one of the most active research topics in data mining, owing to its wide range of applications in online e-commerce data analysis, in identifying interesting patterns in biomedical data, and in cross-marketing solutions for retail business. It aims to discover itemsets with high utilities efficiently by considering the item quantities in a transaction and the profit value of each item. However, it produces a tremendous number of HUIs, which places a further burden on the analysis of the extracted patterns and also degrades the performance of mining methods. Mining the set of closed + high utility itemsets (CHUIs) solves this issue, as it is a lossless and condensed representation of all HUIs. In this paper, we present a new algorithm for finding CHUIs from a transactional database, called CHUM (Closed + High Utility itemset Miner), which is scalable and efficient. The proposed mining algorithm adopts a specially designed vertical representation of the database in order to speed up the generation of itemset closures and to compute their utility information without accessing the database. The proposed method makes use of an item co-occurrence strategy to further reduce the number of intersections that need to be performed. Several experiments are conducted on various sparse and dense datasets, and the simulation results clearly show the scalability and superior performance of our algorithm compared with the existing state-of-the-art CHUD (Closed + High Utility itemset Discovery) algorithm.
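
How item quantities and per-item profit values combine into the utility of an itemset can be shown with a small Python sketch. It assumes the usual definition (quantity times unit profit, summed over the transactions that contain the whole itemset) and is not the CHUM algorithm; `itemset_utility`, `unit_profit` and the sample data are hypothetical.

```python
def itemset_utility(itemset, transactions, unit_profit):
    """Sum of quantity * unit profit over transactions containing `itemset`.

    Each transaction is a dict mapping item -> purchased quantity.
    """
    items = frozenset(itemset)
    total = 0
    for quantities in transactions:
        if items.issubset(quantities):  # all items occur in this transaction
            total += sum(quantities[i] * unit_profit[i] for i in items)
    return total


# Hypothetical profit table and quantity database.
unit_profit = {"a": 5, "b": 2, "c": 1}
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 3},
    {"a": 1, "b": 1, "c": 1},
]
utility_ab = itemset_utility({"a", "b"}, transactions, unit_profit)
```

In this toy data `{a, b}` has utility 9 + 7 = 16 while the single item `{a}` has utility 20, so an itemset counts as high-utility when this total reaches a chosen threshold, whatever its frequency.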

12.
13.
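The association rule sets produced by traditional association rule generation algorithms in data mining are very large, and many of the rules can be derived from other rules. Closed itemsets can reduce the number of rules, and the generalization and specialization relations between concept lattice nodes are well suited to rule extraction. Several existing concept-lattice-based rule extraction algorithms are limited to producing non-redundant rules with exact support and confidence. This paper proposes an algorithm that mines, on a concept lattice, a rule generating set from which all rules satisfying the minimum support and confidence can be derived; it is called the group rule generating set algorithm, and it reduces the size of the rule set. On this basis, the paper further gives a storage data structure for the group rule generating set and an algorithm that uses it to derive the general rule generating set.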

14.
15.
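A closed frequent itemset mining algorithm based on apposition integration of iceberg concept lattices   Cited by: 2 (self-citations: 0; citations by others: 2)
Because concept lattices are complete, the time and space complexity of constructing the lattice has always been the main factor limiting concept-lattice-based data mining. Combining the semilattice property of iceberg concept lattices with the idea of lattice integration, this paper first shows theoretically that the iceberg concept lattice obtained by apposition integration is isomorphic to the iceberg lattice obtained by pruning the complete concept lattice. It then analyzes the mapping relations that arise during the integration of iceberg concept lattices. Finally, it proposes a novel closed frequent itemset mining algorithm based on the apposition of iceberg concept lattices (Icegalamera). The algorithm avoids computing the complete concept lattice and applies integration and pruning strategies during construction, which significantly improves mining efficiency. Experiments confirm the completeness of the closed frequent itemsets it produces. Performance tests on dense and sparse data sets in single-site mode show a clear performance advantage on sparse data sets.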

16.
A graph-based approach to document classification is described in this paper. The graph representation offers the advantage that it allows for a much more expressive document encoding than the more standard bag of words/phrases approach, and consequently gives improved classification accuracy. Document sets are represented as graph sets to which a weighted graph mining algorithm is applied to extract frequent subgraphs, which are then further processed to produce feature vectors (one per document) for classification. Weighted subgraph mining is used to ensure classification effectiveness and computational efficiency; only the most significant subgraphs are extracted. The approach is validated and evaluated using several popular classification algorithms together with a real-world textual data set. The results demonstrate that the approach can outperform existing text classification algorithms on some datasets. As the size of the dataset increases, further processing of the extracted frequent features becomes essential.

17.
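The association rule sets produced by traditional rule generation algorithms are very large, and many of the rules can be derived from other rules. Closed itemsets can reduce the number of rules, and the generalization and specialization relations between concept lattice nodes are well suited to rule extraction. Several existing concept-lattice-based rule extraction algorithms are limited to producing non-redundant rules with exact support and confidence. This paper proposes an algorithm that mines, on a concept lattice, a rule generating set from which all rules satisfying the minimum support and confidence can be derived; it is called the group rule generating set algorithm, and it reduces the size of the rule set and improves mining efficiency. The paper further gives a storage data structure for the group rule generating set and an algorithm that uses it to derive single-consequent rules as the application requires.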

18.
Web usage mining: extracting unexpected periods from web logs   Cited by: 3 (self-citations: 0; citations by others: 3)
Existing Web usage mining techniques are currently based on an arbitrary division of the data (e.g. “one log per month”) or guided by presumed results (e.g. “what is the customers’ behaviour for the period of Christmas purchases?”). These approaches have two main drawbacks. First, they depend on the above-mentioned arbitrary organization of data. Second, they cannot automatically extract “seasonal peaks” from among the stored data. In this paper, we propose a specific data mining process (in particular, to extract frequent behaviour patterns) in order to reveal the densest periods automatically. From the whole set of possible combinations, our method extracts the frequent sequential patterns related to the extracted periods. A period is considered to be dense if it contains at least one frequent sequential pattern for the set of users connected to the website in that period. Our experiments show that the extracted periods are relevant and our approach is able to extract both frequent sequential patterns and the associated dense periods.

19.
Efficient algorithms for mining closed itemsets and their lattice structure   Cited by: 7 (self-citations: 0; citations by others: 7)
The set of frequent closed itemsets uniquely determines the exact frequency of all itemsets, yet it can be orders of magnitude smaller than the set of all frequent itemsets. In this paper, we present CHARM, an efficient algorithm for mining all frequent closed itemsets. It enumerates closed sets using a dual itemset-tidset search tree, using an efficient hybrid search that skips many levels. It also uses a technique called diffsets to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any "nonclosed" sets found during computation. We also present CHARM-L, an algorithm that outputs the closed itemset lattice, which is very useful for rule generation and visualization. An extensive experimental evaluation on a number of real and synthetic databases shows that CHARM is a state-of-the-art algorithm that outperforms previous methods. Further, CHARM-L explicitly generates the frequent closed itemset lattice.
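
The opening claim, that the frequent closed itemsets determine the frequency of every frequent itemset, can be made concrete with a short Python sketch. It relies on the standard property that the support of an itemset equals the largest support among the closed itemsets containing it. This is not CHARM; the function name and the hand-written table of closed itemsets (for the toy database abc, ab, ac, abc) are assumptions for illustration.

```python
def support_from_closed(itemset, closed_supports):
    """Recover the support of `itemset` from closed itemsets alone.

    Returns None when no closed itemset in the table contains `itemset`,
    i.e. the itemset is not frequent at the threshold used to build the table.
    """
    items = frozenset(itemset)
    candidates = [s for c, s in closed_supports.items() if items <= c]
    return max(candidates) if candidates else None


# Closed itemsets and supports of the hypothetical database
# [{a,b,c}, {a,b}, {a,c}, {a,b,c}], written out by hand.
closed_supports = {
    frozenset({"a"}): 4,
    frozenset({"a", "b"}): 3,
    frozenset({"a", "c"}): 3,
    frozenset({"a", "b", "c"}): 2,
}
support_b = support_from_closed({"b"}, closed_supports)
```

The itemset `{b}` is not closed, yet its support of 3 is read off its closure `{a, b}` without touching the transactions, which is why four closed itemsets suffice to describe all seven frequent itemsets of this example.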

20.
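Sequential pattern mining has been one of the research hotspots in recent years. Much current research focuses on mining closed frequent itemsets and closed sequential patterns, and seldom addresses combined sequential patterns, which are more complex and have important application value. For frequent combined sequential patterns of arbitrary length and arbitrary number of combinations, this paper proposes CloCSP, an algorithm that mines all closed combined sequences. To overcome the difficulty of closure checking over an exponential number of candidate sequences, it proposes a hybrid extension strategy that generates frequent combined sequences, prunes effectively, and completes closure checking at the same time, without maintaining a candidate set. Experiments show that CloCSP can effectively mine the closed combined sequential patterns hidden in sequence data, especially in dense data sets, and helps reveal more complex sequential patterns.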
