Similar Documents
1.
The discovery of diversity patterns from binary data is an important data mining task. In this paper, we propose the problem of mining highly diverse patterns, called non-redundant diversity patterns (NDPs). In this framework, entropy is adopted to measure the diversity of itemsets. In addition, an algorithm called NDP miner is proposed to exploit both the monotone properties of the entropy diversity measure and pruning power for the efficient discovery of non-redundant diversity patterns. Finally, experimental results show that the NDP miner can efficiently identify non-redundant diversity patterns.
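
Purely to illustrate the flavor of an entropy-based diversity measure (this is not the paper's NDP miner, and the exact definition of itemset diversity used there may differ), the following Python sketch scores an itemset by the Shannon entropy of the value combinations its attributes take across binary transactions; the toy data are invented.

    from collections import Counter
    from math import log2

    def diversity(itemset, transactions):
        # Shannon entropy of the 0/1 value combinations taken by the itemset's attributes.
        combos = Counter(tuple(t.get(i, 0) for i in itemset) for t in transactions)
        n = len(transactions)
        return -sum((c / n) * log2(c / n) for c in combos.values())

    # Toy binary data: each transaction maps attribute -> 1 (absent attributes count as 0).
    data = [{"a": 1, "b": 1}, {"a": 1}, {"b": 1}, {}]
    print(diversity(("a", "b"), data))  # 2.0 here: all four value combinations are equally likely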

2.
We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, and other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
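
As a concrete, simplified illustration of the itemset maximum entropy idea (not the paper's online implementation), the sketch below fits a maximum-entropy joint distribution over a few binary query attributes by iterative proportional fitting, treating each itemset frequency as a constraint; the attribute names and frequencies are invented.

    from itertools import product

    def maxent_joint(attrs, constraints, iters=200):
        # Iterative proportional fitting: start uniform, rescale to match each itemset frequency.
        states = list(product([0, 1], repeat=len(attrs)))
        p = {s: 1.0 / len(states) for s in states}
        idx = {a: i for i, a in enumerate(attrs)}
        for _ in range(iters):
            for itemset, freq in constraints:
                match = {s for s in states if all(s[idx[a]] == 1 for a in itemset)}
                cur = sum(p[s] for s in match)
                if 0 < cur < 1:
                    for s in states:
                        p[s] *= freq / cur if s in match else (1 - freq) / (1 - cur)
        return p

    model = maxent_joint(("a", "b", "c"), [(("a",), 0.6), (("b",), 0.5), (("a", "b"), 0.4)])
    # Estimated P(a=1, b=1) should come out close to the 0.4 constraint.
    print(sum(v for s, v in model.items() if s[0] == 1 and s[1] == 1))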

3.
High-Utility Itemset Mining (HUIM) has been a major research topic in recent decades, since it reveals profit information that industry can use for decision-making. Most existing works focus on mining high-utility itemsets from databases and return large numbers of patterns, which makes it difficult to draw precise decisions from the discovered knowledge. Closed high-utility itemset mining (CHUIM) provides a concise set of high-utility itemsets that is more effective for making correct decisions. However, none of the existing works handles large-scale databases or integrates knowledge discovered from several distributed databases. In this paper, we first present a large-scale information fusion architecture that integrates closed high-utility patterns discovered from several distributed databases. A generic composite model is used to cluster transactions by their relevant correlation, which ensures the correctness and completeness of the fusion model. The well-known MapReduce framework is then deployed in the developed DFM-Miner algorithm to handle big datasets for information fusion and integration. Experiments compare the approach with the state-of-the-art CHUI-Miner and CLS-Miner algorithms for mining closed high-utility patterns, and the results indicate that the designed model handles large-scale databases with less memory usage. Moreover, the designed MapReduce framework speeds up the mining of closed high-utility patterns in the developed fusion system.
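
For readers new to the utility model behind HUIM and CHUIM, here is a minimal sketch (it is not the DFM-Miner algorithm and ignores closedness, MapReduce, and distribution): the utility of an itemset is the sum, over transactions containing all of its items, of quantity times unit profit, and itemsets reaching a user-chosen minimum utility are reported; the profit table, database, and threshold are assumptions.

    from itertools import combinations

    profit = {"a": 5, "b": 2, "c": 1}        # external utility (unit profit) per item
    db = [                                    # each transaction: item -> purchased quantity
        {"a": 2, "b": 1},
        {"a": 1, "c": 4},
        {"b": 3, "c": 2},
    ]

    def utility(itemset, db):
        # Sum quantity * unit profit over transactions that contain the whole itemset.
        return sum(sum(t[i] * profit[i] for i in itemset)
                   for t in db if all(i in t for i in itemset))

    min_util = 10
    items = sorted(profit)
    huis = [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r) if utility(c, db) >= min_util]
    print(huis)                               # high-utility itemsets in the toy database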

4.
Many application domains, such as intelligence analysis and cybersecurity, require tools for the unsupervised identification of suspicious entities in multi-relational/network data. In particular, there is a need for automated or semi-automated approaches to ‘uncover the plot’, i.e., to detect non-obvious coalitions of entities bridging many types of relations. We cast the problem of detecting such suspicious coalitions and their connections as one of mining surprisingly dense and well-connected chains of biclusters over multi-relational data. With this as our goal, we model data by the maximum entropy principle, such that in a statistically well-founded way we can gauge the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally organize discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop knowledge discovery.

5.
The utility of an itemset is considered as the value of that itemset, and utility mining aims to identify itemsets with high utilities. Temporal high utility itemsets are itemsets whose utility exceeds a pre-specified threshold in the current time window of the data stream. Discovering temporal high utility itemsets is an important step in mining interesting patterns, such as association rules, from data streams. In this paper, we propose a novel method, namely THUI (Temporal High Utility Itemsets)-Mine, for mining temporal high utility itemsets from data streams efficiently and effectively. To the best of our knowledge, this is the first work on mining temporal high utility itemsets from data streams. The novel contribution of THUI-Mine is that it effectively identifies temporal high utility itemsets by generating fewer candidate itemsets, so that the execution time for mining all high utility itemsets in data streams can be reduced substantially. In this way, all temporal high utility itemsets under all time windows of a data stream can be discovered with less memory and execution time, which meets the critical time and space requirements of data stream mining. Experimental evaluation shows that THUI-Mine significantly outperforms existing methods such as the Two-Phase algorithm under various experimental conditions.
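
The sketch below illustrates only the sliding-window setting the abstract refers to, not THUI-Mine itself: utilities are evaluated over the batches currently in the window, and the oldest batch expires as the stream advances; the window size, profit table, threshold, and stream are all invented.

    from collections import deque
    from itertools import combinations

    profit = {"a": 4, "b": 3, "c": 1}
    window = deque(maxlen=3)                  # current time window: the 3 most recent batches

    def window_utility(itemset, window):
        return sum(sum(t[i] * profit[i] for i in itemset)
                   for batch in window for t in batch if all(i in t for i in itemset))

    def temporal_high_utility(window, min_util):
        items = sorted(profit)
        return [set(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r) if window_utility(c, window) >= min_util]

    stream = [[{"a": 1, "b": 2}], [{"b": 1, "c": 3}], [{"a": 2}], [{"a": 1, "c": 1}]]
    for batch in stream:                      # appending evicts the oldest batch automatically
        window.append(batch)
        print(temporal_high_utility(window, min_util=8))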

6.
For large datasets, the set of all frequent itemsets is huge and redundant, and mining maximal frequent itemsets reduces the number of patterns reported. For uncertain data streams, however, the traditional way of deciding whether an itemset is frequent no longer expresses its frequentness accurately, and there has been no prior work on mining maximal frequent itemsets over uncertain data streams. To address this gap, this paper proposes TUFSMax, a decay-model-based algorithm for mining maximal frequent itemsets over uncertain data streams. By marking tree nodes, the algorithm can mine all maximal frequent itemsets without superset checking, saving the time such checks would require. Experiments show that the proposed algorithm is efficient in both time and space.
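
To make the two ingredients of the abstract concrete, expected support over uncertain data and a time-decay model, here is a small sketch (it is not TUFSMax and does not build the marked tree the algorithm uses): each item carries an existence probability, itemset probabilities assume item independence, and older transactions are discounted by a decay factor; the decay factor and data are invented.

    def decayed_expected_support(itemset, stream, decay=0.9):
        # stream[t]: dict mapping item -> existence probability; index 0 is the oldest transaction.
        total = 0.0
        n = len(stream)
        for t, trans in enumerate(stream):
            prob = 1.0
            for item in itemset:              # probability the whole itemset occurs here
                prob *= trans.get(item, 0.0)
            total += prob * decay ** (n - 1 - t)   # the newest transaction gets weight 1
        return total

    stream = [{"a": 0.8, "b": 0.5}, {"a": 0.9}, {"a": 0.7, "b": 0.6}]
    print(decayed_expected_support(("a", "b"), stream))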

7.
Dataless Transitions Between Concise Representations of Frequent Patterns
Many data mining problems require the discovery of frequent patterns. Frequent itemsets are useful, e.g., in the discovery of association and episode rules, sequential patterns, and clusters. Nevertheless, the number of frequent itemsets is usually huge, so a number of lossless representations of frequent itemsets have been proposed. Two such representations, the closed itemsets and the generators representation, are of particular interest, as they can be applied efficiently to discover the most interesting non-redundant association and episode rules. On the other hand, it has been shown experimentally that other representations of frequent patterns can be more concise and more quickly extractable than these two, even by several orders of magnitude. Hence, such concise representations are an interesting alternative for materializing and reusing the knowledge of frequent patterns. The problem then arises of how to transform the intermediate representations into the desired ones efficiently, preferably without accessing the database. This article tackles this problem: by investigating the properties of representations of frequent patterns, we offer a set of efficient algorithms for dataless transitions between them.
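
A minimal sketch of the two representations mentioned above (it does not perform the article's dataless transitions, and it ignores the empty itemset): given frequent itemsets with their supports, an itemset is closed if no proper superset has the same support, and it is a generator if no proper subset has the same support; the support table is invented.

    freq = {                                  # frequent itemset -> support (toy values)
        frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 2,
        frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 1, frozenset("abc"): 1,
    }

    def closed(x):
        # Closed: no proper superset has the same support.
        return not any(x < y and s == freq[x] for y, s in freq.items())

    def generator(x):
        # Generator: no proper subset has the same support.
        return not any(y < x and s == freq[x] for y, s in freq.items())

    print("closed:    ", [set(x) for x in freq if closed(x)])
    print("generators:", [set(x) for x in freq if generator(x)])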

8.
Fast Updating of Maximal Frequent Itemsets
Mining maximal frequent itemsets is a key problem in many data mining applications. To overcome the shortcomings of Apriori-based maximal frequent itemset mining algorithms, DMFIA adopts the FP-tree storage structure and a top-down search strategy, which effectively improves the efficiency of mining maximal frequent itemsets. However, when there are many frequent items but the maximal frequent itemsets are of relatively low dimensionality, DMFIA has to search through many levels and generates a large number of candidate itemsets at each level, which hurts its performance. This paper therefore proposes IDMFIA (the Improved algorithm of DMFIA), which searches both top-down and bottom-up so that supersets of short maximal frequent itemsets and subsets of long maximal frequent itemsets can be pruned as early as possible. The paper also proposes FUMFIA (Fast Updating Maximum Frequent Itemsets Algorithm), which makes full use of the already-built FP-tree and the previously mined maximal frequent itemsets to maintain them efficiently. Experimental results show that IDMFIA and FUMFIA effectively improve the efficiency of mining and updating maximal frequent itemsets.
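
As background for this and the neighboring abstracts (this is not DMFIA, IDMFIA, or FUMFIA), the following sketch shows what "maximal" means: among already-mined frequent itemsets, only those with no frequent proper superset are kept; the toy itemsets are invented.

    frequent = [frozenset(x) for x in ("a", "b", "c", "d", "ab", "ac", "bc", "ad", "abc")]

    def maximal(frequent):
        # Keep only frequent itemsets that have no frequent proper superset.
        return [x for x in frequent if not any(x < y for y in frequent)]

    print([set(x) for x in maximal(frequent)])   # the maximal ones here are {a, d} and {a, b, c}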

9.
In this paper, we identify and explore the power-law relationship and the self-similar phenomenon that appear in the itemset support distribution, i.e., the distribution of the number of itemsets versus their supports. Exploring the characteristics of these natural phenomena is useful to many applications, such as guiding the performance tuning of frequent-itemset mining. However, due to the explosive number of itemsets, it is prohibitively expensive to retrieve large numbers of itemsets before identifying the characteristics of the itemset support distribution in the targeted data. We therefore propose a valid and cost-effective algorithm, called PPL, to extract the characteristics of the itemset support distribution. Furthermore, to fully explore the advantages of our discovery, we propose novel mechanisms, with the help of PPL, to solve two important problems: (1) determining a subtle parameter for mining approximate frequent itemsets over data streams; and (2) determining the sufficient sample size for mining frequent patterns. As validated in our experimental results, PPL can efficiently and precisely identify the characteristics of the itemset support distribution in various real data sets. In addition, empirical studies demonstrate that our mechanisms for those two challenging problems are orders of magnitude better than previous works, showing the prominent advantage of PPL as an important pre-processing step for mining applications.
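
To show what an itemset support distribution looks like and how a power law could be fitted to it (this is not the PPL algorithm), the sketch below counts itemsets per support value in a toy database and estimates the power-law exponent by least squares on the log-log points; the data are invented.

    from collections import Counter
    from itertools import combinations
    from math import log

    db = [set("abc"), set("ab"), set("a"), set("ac"), set("bc"), set("ab"), set("a")]

    # Supports of all itemsets up to size 2 (enough for a toy example).
    items = sorted(set().union(*db))
    support = {}
    for r in (1, 2):
        for c in combinations(items, r):
            s = sum(set(c) <= t for t in db)
            if s:
                support[c] = s

    # Itemset support distribution: support value -> number of itemsets with that support.
    dist = Counter(support.values())

    # Fit log(count) = log(C) - alpha * log(support) by least squares; alpha is the exponent.
    xs = [log(s) for s in dist]
    ys = [log(c) for c in dist.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    print("estimated power-law exponent alpha ~", round(-slope, 2))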

10.
Frequent itemset mining over uncertain data has been studied extensively. When uncertain data record users' sensitive information, an attacker with background knowledge can analyze the frequent itemsets mined from the data to infer that sensitive information. To mine the top-K most frequent itemsets by expected support from an uncertain dataset while guaranteeing that the mining results satisfy differential privacy, this paper proposes FIMUDDP (Frequent Itemsets Mining for Uncertain Data based on Differential Privacy). FIMUDDP uses the exponential mechanism and the Laplace mechanism of differential privacy to ensure that both the top-K most frequent itemsets by expected support mined from the uncertain data and their expected supports satisfy differential privacy. Theoretical analysis and experimental evaluation of FIMUDDP verify its effectiveness.
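
The following sketch illustrates only the simplest piece of such an approach, the Laplace mechanism applied to expected supports (FIMUDDP additionally uses the exponential mechanism to select itemsets, which is omitted here): noise with scale sensitivity/epsilon is added before the top-K itemsets are reported; epsilon, the sensitivity, and the toy expected supports are assumptions.

    import math
    import random

    def laplace(scale):
        # Inverse-transform sampling of Laplace(0, scale) noise.
        u = random.uniform(-0.5, 0.5)
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    expected_support = {("a",): 12.4, ("b",): 9.1, ("a", "b"): 7.8, ("c",): 3.2}
    epsilon, sensitivity, k = 1.0, 1.0, 2

    # Perturb each expected support, then report the noisy top-K itemsets.
    noisy = {i: s + laplace(sensitivity / epsilon) for i, s in expected_support.items()}
    print(sorted(noisy, key=noisy.get, reverse=True)[:k])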

11.
Mining closed strict episodes
Discovering patterns in a sequence is an important aspect of data mining. One popular choice of such patterns is episodes: patterns in sequential data describing events that often occur in the vicinity of each other. Episodes also specify the order in which the events are allowed to occur. In this work we introduce a technique for discovering closed episodes. Adapting existing approaches for discovering traditional patterns, such as closed itemsets, to episodes is not straightforward. First of all, we cannot define a unique closure based on frequency, because an episode may have several closed superepisodes. Moreover, to define a closedness concept for episodes we need a subset relationship between episodes, which is not trivial to define. We approach these problems by introducing strict episodes. We argue that this class is general enough, and at the same time we are able to define a natural subset relationship within it and use it efficiently. In order to mine closed episodes we define an auxiliary closure operator. We show that this closure satisfies the properties needed so that we can use the existing framework for mining closed patterns; discovering the true closed episodes can then be done as a post-processing step. We combine these observations into an efficient mining algorithm and demonstrate its performance empirically in practice.

12.
Maximal frequent itemset mining is an important aspect of many data mining applications, and fast algorithms for mining maximal frequent itemsets are a topic of active research. Traditional maximal frequent itemset mining algorithms scan the database many times and generate large numbers of candidate itemsets. This paper proposes FMMFIBFM, a fast algorithm for mining maximal frequent itemsets based on an F-matrix. FMMFIBFM adopts the FP-tree storage structure, scans the database only twice, and generates no candidate frequent itemsets, which effectively improves the efficiency of frequent itemset mining. Experimental results show that FMMFIBFM is effective and feasible.

13.
Discovery of fuzzy temporal association rules
We propose a data mining system for discovering interesting temporal patterns from large databases. The mined patterns are expressed as fuzzy temporal association rules that satisfy temporal requirements specified by the user. Temporal requirements specified by human beings tend to be ill-defined or uncertain. To deal with this kind of uncertainty, a fuzzy calendar algebra is developed to allow users to describe the desired temporal requirements in fuzzy calendars easily and naturally. Fuzzy operations are provided, and users can define complicated fuzzy calendars to discover knowledge in the time intervals of interest to them. A border-based mining algorithm is proposed to find association rules incrementally. By keeping useful information about the database in a border, candidate itemsets can be computed efficiently. Updating the discovered knowledge when transactions are added or deleted can also be done efficiently: the kept information helps save counting work, and unnecessary scans over the updated database are avoided. Simulation results show the effectiveness of the proposed system, and a performance comparison with other systems is also given.
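
As a tiny illustration of what a fuzzy calendar can look like (this is only a plausible example, not the paper's fuzzy calendar algebra or its border-based algorithm), the sketch below defines a triangular membership function for "around evening" and uses it to compute a fuzzy, time-weighted support for an itemset; the membership shape and the transactions are invented.

    def triangular(x, a, b, c):
        # Triangular fuzzy membership: 0 outside [a, c], rising to 1 at b.
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def around_evening(hour):
        return triangular(hour, 16, 19, 22)   # fuzzy calendar "around 7 pm"

    # Transactions tagged with the hour at which they occurred.
    transactions = [(17, {"bread", "milk"}), (19, {"bread"}), (23, {"milk"})]

    def fuzzy_support(itemset, transactions, calendar):
        # Each matching transaction contributes its degree of membership in the fuzzy calendar.
        return sum(calendar(hour) for hour, items in transactions if itemset <= items)

    print(fuzzy_support({"bread"}, transactions, around_evening))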

14.
Mining top-k frequent patterns without minimum support threshold
Frequent patterns play an important role in mining association rules, sequences, episodes, Web logs, and many other interesting relationships among data. Frequent pattern mining methods often produce a huge number of frequent itemsets, which is not feasible for effective use, whereas the number of highly correlated patterns is usually very small and may even be one. Most existing frequent pattern mining techniques require the setting of many input parameters and may involve multiple passes over the database. Minimum support is the most widely used parameter for discovering statistically significant patterns, but specifying an appropriate minimum support is a challenging task for a data analyst, as the choice of its value is somewhat arbitrary. Generally, an algorithm must be executed repeatedly, heuristically tuning the minimum support over a wide range until the desired result is obtained, which is a very time-consuming process; an inappropriate minimum support may also cause an algorithm to fail to find the true patterns. We present a novel method to efficiently retrieve the top few maximal frequent patterns, in order of significance, without using the minimum support parameter. Instead, only a more human-understandable parameter needs to be specified, namely the desired number of itemsets k. Our technique requires only a single pass over the database and the generation of length-two itemsets. The association ratio graph is proposed as a compact structure containing concise information, created in time quadratic in the size of the database. Algorithms are described for using this graph structure to discover the top-most and top-k maximal frequent itemsets without a minimum support threshold. To achieve this, the method constructs an all-path source-to-destination tree to discover all maximal cycles in the graph, and the results can be ranked in decreasing order of significance. Results are presented demonstrating the performance advantages gained from this approach.
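
To give a feel for the kind of statistic such a graph can carry (the scoring below is a generic lift-style ratio chosen for illustration; the paper's exact association ratio may be defined differently), the sketch makes a single pass over the database, counts items and item pairs, and ranks each pair by how much more frequently it occurs than independence would predict; the data are invented.

    from collections import Counter
    from itertools import combinations

    db = [set("abc"), set("ab"), set("ac"), set("bc"), set("ab"), set("a")]

    single, pair = Counter(), Counter()
    for t in db:                               # a single pass over the database
        single.update(t)
        pair.update(frozenset(p) for p in combinations(sorted(t), 2))

    n = len(db)

    def association_ratio(p):
        a, b = tuple(p)
        return (pair[p] / n) / ((single[a] / n) * (single[b] / n))

    # Pairs ranked by how strongly they co-occur relative to independence.
    for p in sorted(pair, key=association_ratio, reverse=True):
        print(set(p), round(association_ratio(p), 2))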

15.
姜玉泉 《计算机工程与应用》2003,39(24):187-188,201
Discovering maximal frequent itemsets is a key problem in many data mining applications. Many algorithms have been proposed for discovering maximal frequent itemsets, but there has been little work on maintaining them, so efficient algorithms are urgently needed to update, maintain, and manage the maximal frequent itemsets that have already been mined. This paper proposes IUAFI, a fast incremental algorithm for updating maximal frequent itemsets, and illustrates its execution with an example.

16.
An FP-tree-Based Algorithm for Mining Maximal Frequent Itemsets
刘乃丽  李玉忱  马磊 《计算机应用》2005,25(5):998-1000
Mining association rules is an important research topic in data mining, and mining maximal frequent itemsets is one of the key problems in association rule mining. Many previous algorithms for mining maximal frequent itemsets first generate candidates and then test them, but generating candidate itemsets is very expensive, especially when there are many long patterns. This paper improves the FP-tree structure and proposes DMFIA-1, a fast FP-tree-based algorithm for mining maximal frequent itemsets. The algorithm does not generate maximal frequent candidate itemsets and mines maximal frequent itemsets more efficiently than DMFIA. The improved FP-tree is unidirectional: each node keeps only a pointer to its parent node, which saves roughly one third of the tree space.

17.
FP-Tree-Based Algorithms for Mining and Updating Maximal Frequent Itemsets
宋余庆  朱玉全  孙志挥  陈耿 《软件学报》2003,14(9):1586-1592
Mining maximal frequent itemsets is a key problem in many data mining applications. Much previous research has used Apriori-like candidate generate-and-test approaches. However, candidate generation is very expensive, especially when many strong and/or long patterns exist. This paper proposes DMFIA (discover maximum frequent itemsets algorithm), a fast algorithm for mining maximal frequent itemsets based on the frequent pattern tree (FP-tree), together with its update algorithm UMFIA (update maximum frequent itemsets algorithm). UMFIA makes full use of previous mining results to reduce the cost of discovering new maximal frequent itemsets in the updated database.

18.
An Efficient Algorithm for Mining and Updating Association Rules Based on an Itemset Knowledge Base
Through an in-depth analysis of existing algorithms for mining and updating association rules, this paper points out their common problems and shortcomings and proposes a method for mining and updating association rules based on an itemset knowledge base. The method handles both the case where the data in database D remain unchanged while the user-specified minimum support and minimum confidence thresholds change, and the case where the data in transaction database D change. When the data in D remain unchanged, the database needs to be scanned only once to build the itemset knowledge base KBD, after which the minimum support and minimum confidence can be adjusted repeatedly to mine and update association rules. When the data in D change, only the inserted data set d+ and the deleted data set d- need to be scanned once each, and the frequent itemsets and association rules are updated by updating the itemset knowledge base KBD.

19.
Discovering the most interesting patterns is the key problem in the field of pattern mining. While ranking or selecting patterns is well-studied for itemsets, it is surprisingly under-researched for other, more complex pattern types. In this paper we propose a new quality measure for episodes. An episode is essentially a set of events with possible restrictions on the order of events. We say that an episode is significant if its occurrence is abnormally compact, that is, only a few gap events occur between the actual episode events, compared to the expected length under the independence model. We can apply this measure as a post-pruning step by first discovering frequent episodes and then ranking them according to the measure. In order to compute the score we need the mean and the variance according to the independence model. As our main technical contribution we introduce a technique that allows us to compute these values; this task is surprisingly complex, and to solve it we develop intricate finite state machines that allow us to compute the needed statistics. We also show that asymptotically our score can be interpreted as a P value. Our experiments demonstrate that despite its intricacy our ranking is fast: we can rank tens of thousands of episodes in seconds. Our experiments with text data demonstrate that our measure ranks interpretable episodes highly.
