Similar literature
Found 20 similar documents (search time: 15 ms)
1.
Unil Yun 《Information Sciences》2007,177(17):3477-3499
Most algorithms for frequent pattern mining use a support constraint to prune the combinatorial search space, but support-based pruning alone is not enough. After mining datasets to obtain frequent patterns, the resulting patterns can have weak affinity. Although the minimum support can be increased, doing so is not effective for finding correlated patterns with increased weight and/or support affinity. Interestingness measures have been proposed to detect correlated patterns, but no existing approach considers both support and weight. In this paper, we present a new strategy, weighted interesting pattern mining (WIP), in which a new measure, weight-confidence, is proposed to mine correlated patterns with weight affinity. A weight range is used to decide weight boundaries, and h-confidence serves to identify patterns with support affinity. In WIP, the original h-confidence is used instead of its upper bound, at no additional computation cost, to improve performance. WIP not only balances the two measures of weight and support, but also considers weight affinity and/or support affinity between items within patterns, so more correlated patterns can be detected. To our knowledge, ours is the first work to specifically consider weight affinity between the items of a pattern. A comprehensive performance study shows that WIP is efficient and scalable for finding affinity patterns. Moreover, it generates fewer but more valuable correlated patterns. To decrease the number of thresholds, w-confidence, h-confidence and weighted support can be used selectively according to application requirements.
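
The two affinity measures named in this abstract can be sketched as follows. This is a minimal illustration, not the WIP algorithm itself. It assumes the usual definition of h-confidence (pattern support divided by the largest single-item support in the pattern) and assumes that weight-confidence is the ratio of the smallest to the largest item weight in the pattern. The toy transactions, weights and function names are hypothetical.

```python
from typing import Dict, FrozenSet, List


def support(pattern: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t) / len(transactions)


def h_confidence(pattern: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Pattern support divided by the largest single-item support (assumed definition)."""
    max_item_support = max(support(frozenset([i]), transactions) for i in pattern)
    return support(pattern, transactions) / max_item_support


def w_confidence(pattern: FrozenSet[str], weights: Dict[str, float]) -> float:
    """Smallest item weight divided by largest item weight (assumed definition)."""
    item_weights = [weights[i] for i in pattern]
    return min(item_weights) / max(item_weights)


# Hypothetical toy data.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
weights = {"a": 0.8, "b": 0.4, "c": 0.6}
pattern = frozenset("ab")

print(h_confidence(pattern, transactions))  # support affinity of {a, b}
print(w_confidence(pattern, weights))       # weight affinity of {a, b}
```

Both values lie in (0, 1]; a value close to 1 means the items of the pattern have similar support levels or similar weights.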

2.
High utility sequential pattern mining has recently become a popular topic because it considers the quantities, profits and time order of items. In the existing approach, the utilities of subsequences in sequences are difficult to calculate because three kinds of utility calculations are involved. To simplify the calculation, this work presents a maximum utility measure, derived from the principle of traditional sequential pattern mining that a subsequence is counted only once per sequence. The maximum measure is thus used to simplify the utility calculation for subsequences during mining. In addition, an effective upper-bound model is designed to avoid information loss during mining, and an effective projection-based pruning strategy is designed to obtain more accurate sequence-utility upper bounds for subsequences. An indexing strategy is also developed to quickly find the relevant sequences for prefixes during mining, which reduces unnecessary search time. Finally, experimental results on several datasets show that the proposed approach performs well in both pruning effectiveness and execution efficiency.

3.
Sequential pattern mining is essential in many applications, including computational biology, consumer behavior analysis and web log analysis. Although sequential patterns can tell us which items are frequently purchased together and in what order, they cannot provide information about the time span between items for decision support. Previous studies dealing with this problem either set time constraints to restrict the patterns discovered or define time intervals between two successive items to provide time information. The first approach falls short in providing clear time-interval information, while the second cannot discover time-interval information between two non-successive items in a sequential pattern. To provide more time-related knowledge, we define a new variant of time-interval sequential patterns, called multi-time-interval sequential patterns, which can reveal the time intervals between all pairs of items in a pattern. We develop two efficient algorithms, MI-Apriori and MI-PrefixSpan, to solve this problem. The experimental results show that MI-PrefixSpan is faster than MI-Apriori, but MI-Apriori scales better on long sequence data.
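
The idea of recording time intervals between all pairs of items, not only successive ones, can be illustrated with a small sketch. It is not MI-Apriori or MI-PrefixSpan. It only computes the pairwise gaps for one timestamped sequence, and it assumes each item occurs at most once in that sequence. The function name and data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_intervals(sequence: List[Tuple[str, int]]) -> Dict[Tuple[str, str], int]:
    """Map every ordered item pair (earlier, later) to the time gap between them.

    `sequence` is a list of (item, timestamp) pairs sorted by timestamp.
    Assumes distinct items, so each pair appears once.
    """
    return {(a, b): tb - ta for (a, ta), (b, tb) in combinations(sequence, 2)}


# Hypothetical purchase sequence: item a at time 1, b at time 3, c at time 10.
purchases = [("a", 1), ("b", 3), ("c", 10)]
intervals = pairwise_intervals(purchases)
print(intervals)  # includes the non-successive pair (a, c)
```

A pattern that stores only successive gaps would keep (a, b) and (b, c); the multi-time-interval view also keeps (a, c).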

4.
Fuzzy utility mining has become an emerging research topic because of its simplicity and comprehensibility. Unlike traditional fuzzy data mining, fuzzy utility mining considers not only the quantities of items in transactions but also their profits when deriving high fuzzy utility itemsets. In this paper, we introduce a new fuzzy utility measure that uses the fuzzy minimum operator to evaluate the fuzzy utilities of itemsets. In addition, an effective fuzzy utility upper-bound model based on the proposed measure is designed to provide the downward-closure property in fuzzy sets, thus reducing the search space for high fuzzy utility itemsets. A two-phase fuzzy utility mining algorithm, named TPFU, is also proposed for solving the problem of fuzzy utility mining. Finally, experimental results on both synthetic and real datasets show that the proposed algorithm performs well.

5.
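High utility pattern mining is a fundamental research direction in data mining, and a growing number of algorithms target top-k high utility patterns, where k is the number of high utility patterns the user wants to mine. These algorithms fall into two classes: two-phase top-k algorithms and one-phase top-k algorithms. The main difference is that the former generate a large number of candidate patterns during mining, which is the main factor limiting their performance, whereas the latter generate no candidate patterns. To mine the k patterns with the highest utility more efficiently, the one-phase algorithm TKHUP is proposed. It relies mainly on four effective strategies to reduce time and space consumption during mining. Extensive experiments show that TKHUP outperforms other top-k high utility pattern mining algorithms in running time.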
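
A common building block of top-k mining is a bounded min-heap whose smallest entry acts as a rising utility threshold. The sketch below shows only that generic mechanism. It is not TKHUP, whose four strategies the abstract does not detail, and the candidate list and function name are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

Pattern = Tuple[str, ...]


def top_k_patterns(scored: Iterable[Tuple[Pattern, int]], k: int) -> List[Tuple[int, Pattern]]:
    """Keep the k highest-utility patterns, raising the threshold as the heap fills."""
    heap: List[Tuple[int, Pattern]] = []  # min-heap ordered by utility
    min_util = 0                          # current internal threshold
    for pattern, utility in scored:
        if utility < min_util:
            continue                      # cannot enter the top-k, skip it
        if len(heap) < k:
            heapq.heappush(heap, (utility, pattern))
        elif utility > heap[0][0]:
            heapq.heapreplace(heap, (utility, pattern))
        if len(heap) == k:
            min_util = heap[0][0]         # threshold is the k-th best utility so far
    return sorted(heap, reverse=True)


# Hypothetical patterns with precomputed utilities.
candidates = [(("a",), 10), (("b",), 25), (("a", "b"), 40), (("c",), 5), (("b", "c"), 30)]
best = top_k_patterns(candidates, k=2)
print(best)
```

In a real miner the rising threshold would also be used to prune the search space, not only to filter finished patterns.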

6.
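A key problem of algorithms that mine the complete set of high utility patterns is that they produce redundant high utility itemsets. This makes it hard for users to find useful information among a large number of high utility itemsets and seriously degrades the performance of high utility pattern mining algorithms. To address this problem, concise high utility pattern mining algorithms have been developed, mainly covering maximal high utility patterns, closed high utility patterns, top-k high utility patterns, and combinations of the three. This survey first introduces the problem definitions for concise high utility patterns. It then classifies and summarizes concise high utility pattern mining methods from several perspectives: whether candidate itemsets are generated, one-phase versus two-phase mining, data structure type, and pruning strategy. Finally, it outlines further research directions, including concise high utility patterns with negative items, time-aware concise high utility patterns, and the handling of dynamic and complex data.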

7.
On-shelf utility mining has recently received interest in the data mining field because of its practical relevance. It considers not only the profits and quantities of items in transactions but also their on-shelf time periods in stores. Traditional on-shelf utility mining assumes that the profit values of items are positive. In real-world applications, however, items may be associated with negative profit values. This paper proposes a three-scan mining approach to efficiently find high on-shelf utility itemsets with negative profit values in temporal databases. In particular, an effective itemset generation method is developed to avoid generating a large number of redundant candidates and to reduce the number of data scans during mining. Experimental results on several synthetic and real datasets show that the proposed approach performs well in pruning effectiveness and execution efficiency.

8.
Mining association rules and mining sequential patterns both aim to discover customer purchasing behavior from a transaction database, so that the quality of business decisions can be improved. However, the transaction database can be very large. Finding all the association rules and sequential patterns in a large database is very time consuming, and users may be interested in only some of the information. Moreover, the criteria that discovered association rules and sequential patterns must meet may differ from one user requirement to another. Much information that is uninteresting to the user can be generated when traditional mining methods are applied. Hence, a data mining language is needed so that users can query only the knowledge that interests them from a large database of customer transactions. In this paper, such a data mining language is presented. With it, users can specify the items of interest and the criteria for the association rules or sequential patterns to be discovered. Efficient data mining techniques are also proposed to extract the association rules and sequential patterns according to the user requirements.


9.
Association-rule mining, which is based on the frequency values of items, is the most common topic in data mining. In real-world applications, however, customers may buy many copies of a product, and each product may have different factors, such as profit and price. Mining only frequent itemsets in binary databases is thus not suitable for some applications. Utility mining was therefore introduced to consider additional measures, such as profits or costs, according to user preference. In the past, a two-phase mining algorithm was designed for quickly discovering high utility itemsets from databases. When data arrive intermittently, that approach needs to process all the transactions in batch mode. In this paper, an incremental mining algorithm for efficiently mining high utility itemsets is proposed to handle this situation. It is based on the concept of the fast-update (FUP) approach, which was originally designed for association mining. The proposed approach first partitions itemsets into four parts according to whether they are high transaction-weighted utilization itemsets in the original database and in the newly inserted transactions. Each part is then handled by its own procedure. Experimental results show that the proposed algorithm runs faster than the two-phase batch mining algorithm in the intermittent data environment.
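
The quantities this abstract relies on can be sketched in a few lines. The code assumes the standard definitions: the utility of an itemset is the sum of quantity times unit profit over the transactions that contain it, and its transaction-weighted utilization (TWU) is the sum of the total utilities of those transactions. The `partition_case` helper only labels the four combinations the abstract mentions; the label names and the toy data are hypothetical, and the per-case procedures of the paper are not reproduced.

```python
from typing import Dict, List, Set

Transaction = Dict[str, int]  # item -> purchased quantity


def transaction_utility(t: Transaction, profit: Dict[str, int]) -> int:
    """Total utility of one transaction."""
    return sum(qty * profit[item] for item, qty in t.items())


def utility(itemset: Set[str], db: List[Transaction], profit: Dict[str, int]) -> int:
    """Utility of the itemset, summed over transactions containing all its items."""
    return sum(
        sum(t[item] * profit[item] for item in itemset)
        for t in db
        if itemset <= t.keys()
    )


def twu(itemset: Set[str], db: List[Transaction], profit: Dict[str, int]) -> int:
    """Transaction-weighted utilization: an upper bound on the itemset's utility."""
    return sum(transaction_utility(t, profit) for t in db if itemset <= t.keys())


def partition_case(high_in_original: bool, high_in_new: bool) -> str:
    """Label one of the four parts (hypothetical label names)."""
    return f"{'high' if high_in_original else 'low'}/{'high' if high_in_new else 'low'}"


# Hypothetical data: unit profits and a three-transaction original database.
profit = {"a": 5, "b": 2, "c": 1}
original_db = [{"a": 1, "b": 2}, {"b": 4, "c": 3}, {"a": 2, "c": 1}]

print(utility({"a"}, original_db, profit), twu({"a"}, original_db, profit))
print(partition_case(high_in_original=True, high_in_new=False))
```

Because TWU never underestimates utility, an itemset whose TWU is below the threshold can be discarded together with all its supersets, which is what makes the two-phase approach and its incremental variant workable.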

10.
Improving the quality of image data through noise filtering has long attracted attention. To date, many studies have been devoted to filtering the noise inside an image, while few focus on filtering instance-level noise among normal images. In this paper, aiming to provide a noise filter for bag-of-features images, (1) we first propose using the cosine interesting pattern to construct the noise filter; (2) we then prove that filtering noise requires mining only the shortest cosine interesting patterns, which dramatically simplifies the mining process; and (3) we present an in-breadth pruning technique to further speed up the mining process. Experimental results on two real-life image datasets demonstrate the effectiveness and efficiency of our noise filtering method.
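
As a rough illustration of the measure behind cosine interesting patterns, the sketch below assumes the common generalized cosine definition: the support of the pattern divided by the geometric mean of the supports of its items. The abstract does not give the formula, so this definition, the function names and the toy data are assumptions.

```python
import math
from typing import FrozenSet, List


def count(pattern: FrozenSet[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions containing every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def cosine(pattern: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Pattern support over the geometric mean of its item supports (assumed definition)."""
    item_counts = [count(frozenset([i]), transactions) for i in pattern]
    geometric_mean = math.prod(item_counts) ** (1.0 / len(item_counts))
    return count(pattern, transactions) / geometric_mean


# Hypothetical bags of visual words.
bags = [frozenset("xy"), frozenset("xyz"), frozenset("x"), frozenset("yz")]
print(cosine(frozenset("xy"), bags))
```

For a two-item pattern this reduces to the familiar cosine similarity between the two items' occurrence vectors.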

11.
张妮  韩萌  王乐  李小娟  程浩东 《计算机应用》2022,42(4):999-1010
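High utility pattern mining (HUPM) is one of the emerging topics in data science. It considers the unit profit and quantity of items in a transaction database in order to extract more useful information. Traditional HUPM methods assume that all items have positive utility, but in practice some items may have negative utility (for example, a product sold at a loss has a negative profit), and mining patterns that contain negative items is as important as mining patterns with only positive items. First, the concepts of HUPM are explained, with examples of positive and negative utility. Then HUPM methods are divided into positive and negative utility methods. Positive utility methods are further divided from the novel perspective of dynamic versus static databases, while negative utility methods cover key techniques based on Apriori, trees, utility lists and arrays. These methods are discussed and summarized from different aspects. Finally, the shortcomings of existing HUPM methods and directions for future research are given.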

12.
Frequent pattern mining is an essential theme in data mining. Existing algorithms usually use a bottom-up search strategy. However, for very high dimensional data, this strategy cannot fully utilize the minimum support constraint to prune the rowset search space. In this paper, we propose a new method called top-down mining, together with a novel row enumeration tree, to make full use of the pruning power of the minimum support constraint. Furthermore, to efficiently check whether a rowset is closed, we develop the trace-based method. Based on these methods, an algorithm called TD-Close is designed for mining a complete set of frequent closed patterns. To enhance its performance further, we improve it with new pruning strategies and new data structures, which leads to a new algorithm, TTD-Close. Our performance study shows that the top-down strategy is effective in cutting down the search space and saving memory, while the trace-based method facilitates closedness checking. As a result, TTD-Close outperforms bottom-up search algorithms such as Carpenter and FPclose in most cases. It also runs faster than TD-Close.

13.
Recommendation systems for mobile users have attracted a lot of attention in recent years. However, most existing recommendation techniques are based only on the geographic features of mobile users’ trajectories. In this paper, we propose a novel approach for recommending items to mobile users based on both the geographic and the semantic features of users’ trajectories. The core of our recommendation system is a novel cluster-based location prediction strategy, namely TrajUtiRec, that improves the item recommendation model. The strategy predicts the next location of a mobile user from the frequent behaviors of similar users in the same cluster, where clusters are determined by analyzing users’ common behaviors in semantic trajectories. For each location, a high utility itemset mining algorithm is run to discover high utility itemsets. We can then recommend the high utility itemset related to the location the user might visit. A comprehensive experimental evaluation shows that our proposal delivers excellent performance.

14.
High utility itemset mining considers the importance of items, such as profit, and the quantities of items in transactions. Recently, mining high utility itemsets has emerged as one of the most significant research issues because of its wide range of real-world applications, such as retail market data analysis and stock market prediction. Although many relevant algorithms have been proposed in recent years, they generate a large number of candidate itemsets, which degrades mining performance. In this paper, we propose an algorithm named MU-Growth (Maximum Utility Growth) with two techniques for pruning candidates effectively during mining. Moreover, we propose a tree structure, named MIQ-Tree (Maximum Item Quantity Tree), which captures database information in a single pass. The proposed data structure is restructured to reduce overestimated utilities. Performance evaluation shows that MU-Growth not only decreases the number of candidates but also outperforms state-of-the-art tree-based algorithms that use overestimation methods in terms of runtime, with similar memory usage.

15.
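By analyzing state-of-the-art methods for high utility pattern mining (HUPM), this survey gives a comprehensive and structured overview of the field. First, it introduces the concepts and formulas of HUPM and gives application examples to build a deeper understanding. It then classifies the most common and most advanced key techniques for mining different types of high utility patterns, including Apriori-based, tree-based, list-based, mapping-based, vertical/horizontal data format-based and index-based methods. The uses, strengths and weaknesses of the existing key techniques are reviewed comprehensively. Because static data can hardly meet practical needs, HUPM methods for data streams are also summarized, mainly incremental methods and methods based on the sliding window model, the time-decay model and the landmark model. Finally, the shortcomings of existing techniques and directions for improvement are given, and new research methods are proposed accordingly.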

16.
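Mining high utility patterns requires setting a suitable threshold, which is not easy for users: too small a threshold produces a large number of low utility patterns, while too large a threshold may produce no high utility patterns at all. Top-k high utility pattern mining was therefore proposed, where k refers to the k patterns with the highest utility. Moreover, most research on high utility mining targets static databases only, whereas in practice new transactions are frequently added. To address these problems, an incremental top-k high utility mining algorithm, TOPK-HUP-INS, is proposed. Using four effective strategies, the algorithm efficiently mines the number of high utility patterns the user requires from incremental data. Comparative experiments on different datasets show that TOPK-HUP-INS performs well in both time and space.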

17.
The utility of an itemset is considered to be the value of that itemset, and utility mining aims at identifying the itemsets with high utilities. Temporal high utility itemsets are the itemsets whose support is larger than a pre-specified threshold in the current time window of the data stream. Discovering temporal high utility itemsets is an important step in mining interesting patterns, such as association rules, from data streams. In this paper, we propose a novel method, namely THUI (Temporal High Utility Itemsets)-Mine, for mining temporal high utility itemsets from data streams efficiently and effectively. To the best of our knowledge, this is the first work on mining temporal high utility itemsets from data streams. The novel contribution of THUI-Mine is that it can effectively identify temporal high utility itemsets by generating fewer candidate itemsets, so that the execution time of mining all high utility itemsets in data streams is reduced substantially. In this way, all temporal high utility itemsets under all time windows of a data stream can be discovered with less memory space and execution time. This meets the critical time and space efficiency requirements of mining data streams. Experimental evaluation shows that THUI-Mine significantly outperforms existing methods such as the Two-Phase algorithm under various experimental conditions.

18.
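High utility pattern mining over data streams builds on binary frequent pattern mining and introduces the internal and external utilities of items, so that the importance of items is considered during mining and more valuable patterns are found. This survey analyzes data stream high utility pattern mining methods in terms of key window techniques, common methods and representation forms, and summarizes the related algorithms in order to examine their characteristics, strengths, weaknesses and key problems. Specifically, it explains the concepts commonly used for data stream high utility patterns; analyzes the key window techniques for processing them, covering the sliding, damped, landmark and tilted window models; studies one-phase and two-phase mining methods; analyzes the representation forms of high utility patterns, namely complete and compressed high utility patterns; and introduces other data stream high utility patterns, including sequential high utility patterns, hybrid high utility patterns and high average utility patterns. Finally, it looks ahead to further research directions for high utility pattern mining over data streams.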

19.
This paper presents an evolutionary algorithm for discriminative pattern (DP) mining that focuses on high-dimensional data sets. DP mining aims to identify the sets of characteristics that best differentiate a target group from the others (e.g. successful vs. unsuccessful medical treatments). As the volume of data stored worldwide grows (30 GB/s on the Internet alone), extracting information from high-dimensional data sets becomes increasingly natural. There are several evolutionary approaches for DP mining, but none focuses on high-dimensional data. We propose an evolutionary approach with features that reduce the cost of memory and processing in the context of high-dimensional data. The new algorithm seeks the best (top-k) patterns and hides from the user many parameters that are common in other evolutionary heuristics, such as population size, mutation and crossover rates, and the number of evaluations. We carried out experiments with real-world high-dimensional data and with traditional low-dimensional data. The results show that the proposed algorithm is superior to other approaches from the literature on high-dimensional data sets and competitive on traditional data sets.

20.
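Because they reflect user preferences and compensate for the weakness of traditional frequent itemset mining, which measures the importance of an itemset by support alone, high utility itemsets are becoming a hot topic in data mining research. To adapt high utility itemset mining to ever-growing data volumes, a parallel mining algorithm for high utility itemsets, PHUI-Mine, is proposed. A DHUI-tree structure that records the information needed for mining high utility itemsets is introduced, its construction method is described, and a dynamic pruning strategy for the DHUI-tree is justified. On this basis, the parallel algorithm for high utility itemset mining is described. Experimental results show that PHUI-Mine achieves high mining efficiency and low storage overhead.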
