Found 20 similar documents; search took 15 ms
《Information Systems》2001,26(1):1-14
In this paper, we examine the two issues of mining association rules and mining sequential patterns in a large database of sales transactions. The problems of mining association rules and mining sequential patterns focus on discovering large itemsets and large sequences, respectively. We present PSI and PSI_seq for efficient generation of large itemsets and large sequences, respectively. The main idea of these two algorithms is to use prestored information to minimize the number of candidate itemsets and candidate sequences counted in each database scan. The prestored information for PSI and PSI_seq consists of the itemsets and the sequences, respectively, found in the last mining run, along with their support counts. Typically a user may need to tune the value of the minimum support many times before a set of useful association rules can be obtained from the transaction database. Using prestored information, the total computation time is reduced effectively. Empirical results show that our approaches outperform previous methods by an order of magnitude, using little storage space for the prestored information.
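The abstract does not spell out how PSI uses the stored counts, so the following is only a minimal sketch of the general idea, assuming the database has not changed since the last mining run. The function name `split_candidates` and the dictionary layout are hypothetical and are not taken from the paper.

```python
def split_candidates(candidates, stored_counts, min_count):
    """Separate candidates answerable from stored counts from those needing a scan.

    stored_counts maps frozenset(itemset) -> support count from the last run
    (hypothetical layout, assumed to be exact for the unchanged database).
    """
    known_frequent, to_count = [], []
    for cand in candidates:
        key = frozenset(cand)
        if key in stored_counts:
            # The stored count is reused; no database access is needed.
            if stored_counts[key] >= min_count:
                known_frequent.append(key)
        else:
            # Only itemsets with no stored count must be counted in the scan.
            to_count.append(key)
    return known_frequent, to_count


stored = {frozenset("ab"): 5, frozenset("ac"): 2}
known, pending = split_candidates([("a", "b"), ("a", "c"), ("b", "c")], stored, 3)
print(known, pending)
```

In this toy run only the itemset {b, c} is left for the database scan, which illustrates why retuning the minimum support can become cheaper when earlier counts are kept.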

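Optimized Processing of Algorithms for Mining Association Rules   Total citations: 9, self-citations: 0, citations by others: 9
When mining association rules, generating the large itemsets in the early iterations is an important step. This paper proposes a hash-table-based algorithm that optimizes candidate itemset generation and is especially effective for candidate 2-itemsets. Because the set of candidate itemsets generated in the early iterations is smaller, the database can be pruned more effectively, which reduces the computational cost of later iterations and also the number of I/O requests.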
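One common way to use a hash table for candidate 2-itemsets is to hash every pair seen during the first pass into a bucket and to keep a pair as a candidate only if its bucket count reaches the minimum support. The sketch below assumes integer item identifiers; the hash function `bucket_of` and the bucket count are hypothetical choices, and the abstract does not confirm that the paper uses this exact scheme.

```python
from collections import Counter
from itertools import combinations


def bucket_of(pair, n_buckets):
    """Hypothetical hash function for a pair of integer item ids."""
    i, j = pair
    return (i * 10 + j) % n_buckets


def hashed_candidate_pairs(transactions, min_count, n_buckets=7):
    """Return candidate 2-itemsets that survive the bucket-count filter."""
    item_counts = Counter()
    buckets = [0] * n_buckets
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        # Hash every pair in the transaction during the first pass.
        for pair in combinations(items, 2):
            buckets[bucket_of(pair, n_buckets)] += 1
    frequent = sorted(i for i, c in item_counts.items() if c >= min_count)
    # A pair can only be frequent if its bucket count reaches min_count.
    return [p for p in combinations(frequent, 2)
            if buckets[bucket_of(p, n_buckets)] >= min_count]


data = [[1, 2, 3], [1, 2], [2, 3], [1, 4]]
print(hashed_candidate_pairs(data, min_count=2))
```

Here items 1, 2 and 3 are all frequent, so a plain join would propose three pairs; the bucket filter drops (1, 3) before any counting pass over the pairs.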

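Research on a Multi-Segment Support Data Mining Algorithm   Total citations: 17, self-citations: 0, citations by others: 17
Among data mining algorithms based on association rules, Apriori and its relatives are the best known. Apriori has two main steps: (1) find the frequent itemsets through multiple scans of the database; (2) generate rules from the frequent itemsets. Many later algorithms adopt Apriori's idea that every subset of a frequent itemset must itself be frequent, and apply a JOIN operation to the frequent itemsets Lk-1 to form the candidate k-itemsets Ck. Because the database and Ck are both large, generating the frequent itemsets requires considerable computation. The AprioriTid algorithm adds a unique identifier, Tid, to each transaction; it scans the database only once, and every later pass (for example, the k-th) runs over the corresponding data set C̄k. Because the data size changes little, the efficiency differences among these algorithms are not significant. This paper proposes computing support in segments: the support of an itemset is split into segments, each recording how often the itemset occurs in transactions of the corresponding size, which yields a support vector. With the multi-segment support of an itemset, one can infer whether the itemset can be contained in a larger frequent itemset. The approach extracts more information from each database scan and promptly eliminates itemsets whose supersets are not frequent, further shrinking the set of candidate itemsets. During scanning, the data set is adjusted according to Theorem 1 of the paper, which improves the efficiency of frequent itemset generation.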
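The join on Lk-1 and the subset property described above can be sketched as follows. This is a minimal illustration of the standard Apriori candidate generation step, not the multi-segment support method of the paper; the function name `apriori_gen` and the tuple representation of itemsets are choices made for this example.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Build candidate k-itemsets Ck from the frequent (k-1)-itemsets Lk-1."""
    prev = {tuple(sorted(s)) for s in prev_frequent}
    candidates = set()
    for a in prev:
        for b in prev:
            # Join step: merge two itemsets that agree on their first k-2 items.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step: every (k-1)-subset of a candidate must be frequent.
                if all(sub in prev for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return sorted(candidates)


l2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(apriori_gen(l2))
```

In this example the join produces (a, b, c) and (b, c, d), and the prune step discards (b, c, d) because its subset (c, d) is not frequent.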

Mining associations with the collective strength approach   Total citations: 1, self-citations: 0, citations by others: 1
The large itemset model has been proposed in the literature for finding associations in a large database of sales transactions. A different method for evaluating and finding itemsets, referred to as strongly collective itemsets, is proposed. We propose a criterion stressing the importance of the actual correlation of the items with one another rather than their absolute level of presence. Previous techniques for finding correlated itemsets are not necessarily applicable to very large databases. We present an algorithm that offers very good computational efficiency while maintaining statistical robustness. The fact that this algorithm relies on relative measures rather than absolute measures such as support also implies that the method can be applied to find association rules in data sets in which items may appear in a sizeable percentage of the transactions (dense data sets), data sets in which the items have varying density, or even negative association rules.

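Research on and Improvement of the Apriori Algorithm in Association Rule Mining   Total citations: 5, self-citations: 0, citations by others: 5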
崔贯勋  李梁  王柯柯  苟光磊  邹航 《计算机应用》2010,30(11):2952-2955
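The classic Apriori algorithm for generating frequent itemsets has several drawbacks: it scans the database many times, may produce a large number of candidates, and repeatedly pattern-matches candidate itemsets against transactions, all of which lower its efficiency. To address this, Apriori is improved in three respects: the join and pruning strategies used to generate candidate (k+1)-itemsets from frequent k-itemsets are improved; the handling of transactions is changed to reduce the time spent on pattern matching; and the first pass over the database is reworked so that the whole algorithm scans the database only once. An improved algorithm is proposed on this basis. Experimental results show that the improved algorithm performs markedly better.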

Mining association rules is among the most commonly seen techniques for knowledge discovery from databases (KDD). It is used to discover relationships among items or itemsets. Furthermore, temporal data mining is concerned with the analysis of temporal data and the discovery of temporal patterns and regularities. In this paper, a new concept of up-to-date patterns is proposed, which is a hybrid of association rule mining and temporal mining. An itemset may not be frequent (large) for an entire database but may be large up-to-date, since items that seldom occurred early may occur often recently. An up-to-date pattern is thus composed of an itemset and its up-to-date lifetime, in which the user-defined minimum-support threshold must be satisfied. The proposed approach can mine more useful large itemsets than conventional approaches, which discover only large itemsets valid for the entire database. Experimental results show that the proposed algorithm is more effective than traditional ones in discovering such up-to-date temporal patterns, especially when the minimum-support threshold is high.

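Traditional data mining algorithms produce large numbers of redundant itemsets when mining frequent itemsets, which hurts mining efficiency. To address this, a matrix-based algorithm for mining Top-k frequent itemsets over data streams is proposed. Two 0-1 matrices are introduced: a transaction matrix and a 2-itemset matrix. The transaction matrix represents the list of transactions in the sliding-window model, and the 2-itemset matrix is obtained by computing the support of each row. Candidate itemsets are derived from the 2-itemset matrix, and the support of each candidate is computed by applying a logical AND to the corresponding rows of the transaction matrix, which yields the Top-k frequent itemsets. The mining results are stored in a data dictionary so that, when a user issues a query, the Top-k frequent itemsets can be output in descending order of support. Experimental results show that the algorithm avoids generating redundant itemsets during mining and achieves high time efficiency while preserving accuracy.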
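Computing support with a logical AND over rows of a 0-1 transaction matrix can be sketched as below. The sketch assumes one row per item and one column per transaction in the current window, and stores each row as a Python integer used as a bit mask; the helper names `row_bits` and `support` are hypothetical, and the sliding-window update and Top-k selection of the paper are not reproduced.

```python
def row_bits(transactions, item):
    """Encode one item's row of the 0-1 matrix as an integer bit mask."""
    bits = 0
    for col, t in enumerate(transactions):
        if item in t:
            bits |= 1 << col  # bit `col` is set when the transaction holds the item
    return bits


def support(rows, itemset):
    """Support count of an itemset: AND the item rows and count the 1 bits."""
    it = iter(itemset)
    acc = rows[next(it)]
    for item in it:
        acc &= rows[item]
    return bin(acc).count("1")


window = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
rows = {item: row_bits(window, item) for item in "abc"}
print(support(rows, ("a", "c")), support(rows, ("a", "b", "c")))
```

Because each support is obtained from bitwise operations on rows already in memory, no further pass over the window's transactions is needed once the matrix is built.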

An Array-Based Algorithm for Mining Association Rules   Total citations: 12, self-citations: 0, citations by others: 12
孟祥萍  钱进  刘大有 《计算机工程》2003,29(15):98-99,109
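Improving the efficiency of frequent itemset mining is a key area of association rule mining research. This paper proposes an array-based association rule mining algorithm that scans the database only once, continually reduces the number of transactions in the database, and counts candidate 2-itemsets with a one-dimensional array to improve mining efficiency. Experiments show that the proposed algorithm is 2 to 3 times faster than the classic Apriori algorithm.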
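Counting candidate 2-itemsets in a one-dimensional array is often done by mapping each unordered pair of item positions to a slot of a flattened upper-triangular layout. The sketch below assumes that layout; the abstract does not state which indexing scheme the paper uses, and the names `pair_index` and `count_pairs` are hypothetical.

```python
from itertools import combinations


def pair_index(i, j, n):
    """Map item positions i != j (0-based, n items) to a slot in a flat array."""
    if i > j:
        i, j = j, i
    # Row i of the upper triangle starts after the i earlier, shorter rows.
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


def count_pairs(transactions, items):
    """Count every 2-itemset over `items` using one flat array of counters."""
    pos = {item: k for k, item in enumerate(items)}
    n = len(items)
    counts = [0] * (n * (n - 1) // 2)
    for t in transactions:
        present = sorted(pos[x] for x in set(t) if x in pos)
        for i, j in combinations(present, 2):
            counts[pair_index(i, j, n)] += 1
    return counts


items = ["a", "b", "c", "d"]
data = [{"a", "b"}, {"a", "b", "c"}, {"b", "d"}]
print(count_pairs(data, items))
```

The slots follow the order (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), so a pair's count is found by arithmetic on its positions rather than by a lookup structure.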

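A New Algorithm for Mining Frequent Itemsets Without Candidate Generation   Total citations: 4, self-citations: 2, citations by others: 4
Apriori is the most widely used association rule mining algorithm. Its main purpose is to mine interesting frequent itemsets from large volumes of transaction data by way of candidate itemsets, and thereby give users meaningful associations. As databases grow, however, Apriori can run into two thorny problems: generating large numbers of candidate itemsets wastes an enormous amount of computation, and it is unclear how to set the threshold used to prune useless candidates. Both problems are challenging for most ordinary users. The code AND operation proposed in this paper is an algorithm that mines frequent itemsets without candidates, so users need not agonize over setting a threshold. In addition, a transaction compression algorithm greatly reduces the amount of computation.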

Genetic-Fuzzy Data Mining With Divide-and-Conquer Strategy   Total citations: 1, self-citations: 0, citations by others: 1
Data mining is most commonly used in attempts to induce association rules from transaction data. Most previous studies focused on binary-valued transaction data. Transaction data in real-world applications, however, usually consist of quantitative values. This paper thus proposes a fuzzy data-mining algorithm for extracting both association rules and membership functions from quantitative transactions. A genetic algorithm (GA)-based framework for finding membership functions suitable for mining problems is proposed. The fitness of each set of membership functions is evaluated by the fuzzy supports of the linguistic terms in the large 1-itemsets and by the suitability of the derived membership functions. Evaluation by the fuzzy supports of large 1-itemsets is much faster than evaluation that considers all itemsets or interesting association rules. It also allows the derivation of the membership functions for different items to be handled by divide-and-conquer. The proposed GA framework thus maintains multiple populations, each for one item's membership functions. The final best sets of membership functions in all the populations are then gathered together to be used for mining fuzzy association rules. Experiments are conducted to analyze different fitness functions and different settings of support and confidence. Experiments are also conducted to compare the proposed algorithm, the one with uniform fuzzy partition, and the existing one without divide-and-conquer, with results validating the performance of the proposed algorithm.

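A novel frequent pattern mining algorithm is proposed that has clear advantages over existing mining algorithms. First, it does not need to generate candidate itemsets. Second, it needs fewer database scans: for small and medium-sized databases it mines association rules with a single scan of the transaction database, and for large transaction databases it needs at most two scans. The algorithm is therefore more efficient than existing frequent pattern mining algorithms.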

We explore a new problem of mining general temporal association rules in publication databases. In essence, a publication database is a set of transactions where each transaction T is a set of items, each of which has an individual exhibition period. The current model of association rule mining is not able to handle the publication database due to two fundamental problems: 1) lack of consideration of the exhibition period of each individual item and 2) lack of an equitable support counting basis for each item. To remedy this, we propose an innovative algorithm, progressive-partition-miner (abbreviated as PPM), to discover general temporal association rules in a publication database. The basic idea of PPM is to first partition the publication database in light of the exhibition periods of items and then progressively accumulate the occurrence count of each candidate 2-itemset based on the intrinsic partitioning characteristics. Algorithm PPM is also designed to employ a filtering threshold in each partition to prune cumulatively infrequent 2-itemsets early. The feature that the number of candidate 2-itemsets generated by PPM is very close to the number of frequent 2-itemsets allows us to employ the scan reduction technique to effectively reduce the number of database scans. Explicitly, the execution time of PPM is orders of magnitude smaller than that required by other competitive schemes that are directly extended from existing methods. The correctness of PPM is proven and some of its theoretical properties are derived. Sensitivity analysis of various parameters is conducted to provide many insights into Algorithm PPM.

A new approach to online generation of association rules   Total citations: 6, self-citations: 0, citations by others: 6
We discuss the problem of online mining of association rules in a large database of sales transactions. The online mining is performed by preprocessing the data effectively in order to make it suitable for repeated online queries. We store the preprocessed data in such a way that online processing may be done by applying a graph theoretic search algorithm whose complexity is proportional to the size of the output. The result is an online algorithm which is independent of the size of the transactional data and the size of the preprocessed data. The algorithm is almost instantaneous in the size of the output. The algorithm also supports techniques for quickly discovering association rules from large itemsets. The algorithm is capable of finding rules with specific items in the antecedent or consequent. These association rules are presented in a compact form, eliminating redundancy. The use of nonredundant association rules helps significantly in the reduction of irrelevant noise in the data mining process.

Conventional data mining methods for finding frequent itemsets require considerable computing time to produce their results from a large data set. For this reason, it is almost impossible to apply them to an analysis task in an online data stream where new transactions are continuously generated at a rapid rate. An algorithm for finding frequent itemsets over an online data stream should support a flexible trade-off between processing time and mining accuracy. Furthermore, the most up-to-date resulting set of frequent itemsets should be available quickly at any moment. To satisfy these requirements, this paper proposes a data mining method for finding frequent itemsets over an online data stream. The proposed method examines each transaction one-by-one without any candidate generation process. The count of an itemset that appears in each transaction is monitored by a lexicographic tree residing in main memory. The current set of monitored itemsets in an online data stream is minimized by two major operations: delayed-insertion and pruning. The former delays the insertion of a new itemset in recent transactions until the itemset becomes significant enough to be monitored. The latter prunes a monitored itemset when the itemset turns out to be insignificant. The number of monitored itemsets can be flexibly controlled by the thresholds of these two operations. As the number of monitored itemsets is decreased, frequent itemsets in the online data stream are traced more rapidly but less accurately. The performance of the proposed method is analyzed through a series of experiments in order to identify its various characteristics.

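In association rule mining, the main problem is how to generate frequent itemsets efficiently. A study of several recent Apriori algorithms based on orthogonal linked lists finds considerable room for improvement in how they generate candidate frequent itemsets. An improved algorithm based on orthogonal linked lists is proposed that optimizes candidate frequent itemset generation and reduces scans of the transaction database, greatly improving mining efficiency.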

An Effective Method for Updating Mined Association Rules   Total citations: 1, self-citations: 0, citations by others: 1
王新 《计算机应用》2005,25(6):1360-1361,1372
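When mining association rules, users often need to adjust (raise or lower) the minimum support several times before obtaining useful association rules. This paper presents the PSI algorithm, which uses prestored information to generate new candidate itemsets efficiently. Results show that it effectively reduces the number of candidate itemsets in each database scan.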

A fast algorithm for mining association rules   Total citations: 9, self-citations: 0, citations by others: 9
In this paper, the problem of discovering association rules between items in a large database of sales transactions is discussed, and a novel algorithm, BitMatrix, is proposed. The proposed algorithm is fundamentally different from the known algorithms Apriori and AprioriTid. Empirical evaluation shows that the algorithm outperforms the known ones for large databases. Scale-up experiments show that the algorithm scales linearly with the number of transactions.

In this paper, a new mining capability, called mining of substitution rules, is explored. A substitution refers to the choice made by a customer to replace the purchase of some items with that of others. The mining of substitution rules in a transaction database, like that of association rules, will lead to very valuable knowledge in various aspects, including market prediction, user behaviour analysis and decision support. The process of mining substitution rules can be decomposed into two procedures. The first procedure is to identify concrete itemsets among a large number of frequent itemsets, where a concrete itemset is a frequent itemset whose items are statistically dependent. The second procedure is substitution rule generation. In this paper, we first derive theoretical properties for the model of substitution rule mining and devise a technique on the induction of positive itemset supports to improve the efficiency of support counting for negative itemsets. Then, in light of these properties, the SRM (substitution rule mining) algorithm is designed and implemented to discover the substitution rules efficiently while attaining good statistical significance. Empirical studies are performed to evaluate the performance of the proposed SRM algorithm. It is shown that the SRM algorithm not only has very good execution efficiency but also produces substitution rules of very high quality.

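An Effective Algorithm for Mining Constrained Association Rules   Total citations: 5, self-citations: 0, citations by others: 5
This paper studies the problem of mining association rules with constraints in large transaction databases. It defines the constrained frequent pattern tree and proposes CFPTA, a constrained association rule mining algorithm based on that tree. A comparison with other corresponding algorithms shows experimentally that CFPTA is effective.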

Conventional algorithms for mining association rules work by combining smaller large itemsets. This paper presents a new efficient method, named cluster-decomposition association rule (CDAR), which combines the cluster concept with the decomposition of larger candidate itemsets and proceeds from the maximal large itemsets down to the large 1-itemsets. First, the CDAR method creates clusters by reading the database only once, assigning each transaction record of length k to the kth cluster. Then, the large k-itemsets are generated by comparison with the kth cluster only, unlike the combination approach, which compares against the entire database. Experiments with real-life databases show that CDAR outperforms Apriori, a well-known and widely used association rule mining algorithm.
