首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Sequential pattern mining has been studied extensively in the data mining community. Most previous studies require the specification of a min_support threshold for mining a complete set of sequential patterns satisfying the threshold. However, in practice, it is difficult for users to provide an appropriate min_support threshold. To overcome this difficulty, we propose an alternative mining task: mining top-k frequent closed sequential patterns of length no less than min_, where k is the desired number of closed sequential patterns to be mined and min_ is the minimal length of each pattern. We mine the set of closed patterns because it is a compact representation of the complete set of frequent patterns. An efficient algorithm, called TSP, is developed for mining such patterns without min_support. Starting at (absolute) min_support=1, the algorithm makes use of the length constraint and the properties of top-k closed sequential patterns to perform dynamic support raising and projected database pruning. Our extensive performance study shows that TSP has high performance. In most cases, it outperforms the efficient closed sequential pattern-mining algorithm, CloSpan, even when the latter is running with the best tuned min_support threshold. Thus, we conclude that, for sequential pattern mining, mining top-k frequent closed sequential patterns without min_support is more preferable than the traditional min_support-based mining.  相似文献   

2.
Online mining of path traversal patterns from continuous Web click streams is one of the challenging research problems of Web usage mining. Most of previous works focus on mining path traversal patterns over the entire history of Web click streams. Mining the recent changes of Web click streams can provide valuable information for the analysis of the Web click streams. In this paper, we propose a new, online mining algorithm, called Top-DSW (top-k path traversal patterns of stream Damped Sliding Window), to discover the set of top-k path traversal patterns from streaming maximal forward references, where k is the desired number of path traversal patterns to be mined. An effective summary data structure, called TKP-DSW-list (a list of top-k path traversal patterns of stream Damped Sliding Windows) is developed to maintain the essential information about the top-k path traversal patterns from the maximal forward references within a stream damped sliding window. An effective space pruning mechanism, called TKR-list-maintain, is developed to control the memory requirement of the TKP-DSW-list. Experimental studies show that the proposed Top-DSW algorithm is an efficient, single-pass algorithm for online mining of the set of top-k path traversal patterns over stream damped sliding windows.  相似文献   

3.
Numerous interestingness measures have been proposed in statistics and data mining to assess object relationships. This is especially important in recent studies of association or correlation pattern mining. However, it is still not clear whether there is any intrinsic relationship among many proposed measures, and which one is truly effective at gauging object relationships in large data sets. Recent studies have identified a critical property, null-(transaction) invariance, for measuring associations among events in large data sets, but many measures do not have this property. In this study, we re-examine a set of null-invariant interestingness measures and find that they can be expressed as the generalized mathematical mean, leading to a total ordering of them. Such a unified framework provides insights into the underlying philosophy of the measures and helps us understand and select the proper measure for different applications. Moreover, we propose a new measure called Imbalance Ratio to gauge the degree of skewness of a data set. We also discuss the efficient computation of interesting patterns of different null-invariant interestingness measures by proposing an algorithm, GAMiner, which complements previous studies. Experimental evaluation verifies the effectiveness of the unified framework and shows that GAMiner speeds up the state-of-the-art algorithm by an order of magnitude.  相似文献   

4.
杨皓  段磊  胡斌  邓松  王文韬  秦攀 《软件学报》2015,26(11):2994-3009
对比序列模式能够表达序列数据集合间的差异,在商品推荐、用户行为分析和电力供应预测等领域有广泛的应用.已有的对比序列模式挖掘算法需要用户设定正例支持度阈值和负例支持度阈值.在不具备足够先验知识的情况下,用户难以设定恰当的支持度阈值,从而可能错失一些对比显著的模式.为此,提出了带间隔约束的top-k对比序列模式挖掘算法kDSP-Miner(top-k distinguishing sequential patterns with gap constraint miner).kDSP-Miner中用户只需设置期望发现的对比最显著的模式个数,从而避免了直接设置对比支持度阈值.相应地,挖掘算法更容易使用,并且结果更易于解释.同时,为了提高算法执行效率,设计了若干剪枝策略和启发策略.进一步设计了kDSP-Miner的多线程版本,以提高其对高维序列元素情况的处理能力.通过在真实世界数据集上的详实实验,验证了算法的有效性和执行效率.  相似文献   

5.
Mining minimal distinguishing subsequence patterns with gap constraints   总被引:1,自引:4,他引:1  
Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the gap constraint. We present an efficient algorithm called ConSGapMiner (Contrast Sequences with Gap Miner), to mine all MDSs satisfying a minimum and maximum gap constraint, plus a maximum length constraint. It employs highly efficient bitset and boolean operations, for powerful gap-based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets, demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports.  相似文献   

6.
In item promotion applications, there is a strong need for tools that can help to unlock the hidden profit within each individual customer’s transaction history. Discovering association patterns based on the data mining technique is helpful for this purpose. However, the conventional association mining approach, while generating “strong” association rules, cannot detect potential profit-building opportunities that can be exposed by “soft” association rules, which recommend items with looser but significant enough associations. This paper proposes a novel mining method that automatically detects hidden profit-building opportunities through discovering soft associations among items from historical transactions. Specifically, this paper proposes a relaxation method of association mining with a new support measurement, called soft support, that can be used for mining soft association patterns expressed with the “most” fuzzy quantifier. In addition, a novel measure for validating the soft-associated rules is proposed based on the estimated possibility of a conditioned quantified fuzzy event. The new measure is shown to be effective by comparison with several existing measures. A new association mining algorithm based on modification of the FT-Tree algorithm is proposed to accommodate this new support measure. Finally, the mining algorithm is applied to several data sets to investigate its effectiveness in finding soft patterns and content recommendation.  相似文献   

7.
Mining erasable itemsets are one of new emerging data mining tasks. In this paper, we present a new data representation called a PID_list, which keeps track of the id_nums (identification number) of products that include an itemset. On the basis of the PID_list, we propose a new algorithm called VM for mining top‐rank‐k erasable itemsets efficiently. The VM algorithm can avoid the time‐consuming process of calculating the gain of the candidate itemsets and lots of scans of the databases. Therefore, it can accelerate the task of mining greatly. For evaluating the VM algorithm, we have conducted experiments on six synthetic product databases. Our performance study shows that the VM algorithm is efficient and much faster than the MIKE algorithm, which is the first algorithm for dealing with the problem of mining top‐rank‐k erasable itemsets.  相似文献   

8.
Frequent pattern mining in data streams is an important research topic in the data mining community. In previous studies, a minimum support threshold was assumed to be available for mining frequent patterns. However, setting such a threshold is typically difficult. Hence, it is more reasonable to ask users to set a bound on the result size. The present study considers mining top-k frequent patterns from data streams using a sliding window technique. A single-pass algorithm, called MSWTP, is developed for the generation of top-k frequent patterns without a threshold. In the method, the content of the transactions in the sliding window is incrementally maintained in a summary data structure, named SWTP-tree, by scanning the stream only once. To make the mining operation efficient, insignificant patterns are distinguished from others by applying the Chernoff bound. Two kinds of obsolete pattern and one kind of insignificant pattern are periodically pruned from the pattern tree. Whenever necessary, the k most frequent patterns can be selected from SWTP-tree in order of their descending frequency. The performance of the proposed technique is evaluated via simulation experiments. The results show that the proposed method is both efficient and scalable, and that it outperforms comparable algorithms.  相似文献   

9.
Sequential Pattern Mining in Multi-Databases via Multiple Alignment   总被引:2,自引:0,他引:2  
To efficiently find global patterns from a multi-database, information in each local database must first be mined and summarized at the local level. Then only the summarized information is forwarded to the global mining process. However, conventional sequential pattern mining methods based on support cannot summarize the local information and is ineffective for global pattern mining from multiple data sources. In this paper, we present an alternative local mining approach for finding sequential patterns in the local databases of a multi-database. We propose the theme of approximate sequential pattern mining roughly defined as identifying patterns approximately shared by many sequences. Approximate sequential patterns can effectively summerize and represent the local databases by identifying the underlying trends in the data. We present a novel algorithm, ApproxMAP, to mine approximate sequential patterns, called consensus patterns, from large sequence databases in two steps. First, sequences are clustered by similarity. Then, consensus patterns are mined directly from each cluster through multiple alignment. We conduct an extensive and systematic performance study over synthetic and real data. The results demonstrate that ApproxMAP is effective and scalable in mining large sequences databases with long patterns. Hence, ApproxMAP can efficiently summarize a local database and reduce the cost for global mining. Furthremore, we present an elegant and uniform model to identify both high vote sequential patterns and exceptional sequential patterns from the collection of these consensus patterns from each local databases.  相似文献   

10.
Discovering shared conceptualizations in folksonomies   总被引:2,自引:0,他引:2  
Social bookmarking tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. Unlike ontologies, shared conceptualizations are not formalized, but rather implicit. We present a new data mining task, the mining of all frequent tri-concepts, together with an efficient algorithm, for discovering these implicit shared conceptualizations. Our approach extends the data mining task of discovering all closed itemsets to three-dimensional data structures to allow for mining folksonomies. We provide a formal definition of the problem, and present an efficient algorithm for its solution. Finally, we show the applicability of our approach on three large real-world examples.  相似文献   

11.
Mining frequent tree patterns has many applications in different areas such as XML data, bioinformatics and World Wide Web. The crucial step in frequent pattern mining is frequency counting, which involves a matching operator to find occurrences (instances) of a tree pattern in a given collection of trees. A widely used matching operator for tree-structured data is subtree homeomorphism, where an edge in the tree pattern is mapped onto an ancestor-descendant relationship in the given tree. Tree patterns that are frequent under subtree homeomorphism are usually called embedded patterns. In this paper, we present an efficient algorithm for subtree homeomorphism with application to frequent pattern mining. We propose a compact data-structure, called occ, which stores only information about the rightmost paths of occurrences and hence can encode and represent several occurrences of a tree pattern. We then define efficient join operations on the occ data-structure, which help us count occurrences of tree patterns according to occurrences of their proper subtrees. Based on the proposed subtree homeomorphism method, we develop an effective pattern mining algorithm, called TPMiner. We evaluate the efficiency of TPMiner on several real-world and synthetic datasets. Our extensive experiments confirm that TPMiner always outperforms well-known existing algorithms, and in several cases the improvement with respect to existing algorithms is significant.  相似文献   

12.
Inter-sequence pattern mining can find associations across several sequences in a sequence database, which can discover both a sequential pattern within a transaction and sequential patterns across several different transactions. However, inter-sequence pattern mining algorithms usually generate a large number of recurrent frequent patterns. We have observed mining closed inter-sequence patterns instead of frequent ones can lead to a more compact yet complete result set. Therefore, in this paper, we propose a model of closed inter-sequence pattern mining and an efficient algorithm called CISP-Miner for mining such patterns, which enumerates closed inter-sequence patterns recursively along a search tree in a depth-first search manner. In addition, several effective pruning strategies and closure checking schemes are designed to reduce the search space and thus accelerate the algorithm. Our experiment results demonstrate that the proposed CISP-Miner algorithm is very efficient and outperforms a compared EISP-Miner algorithm in most cases.  相似文献   

13.
14.
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns.In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.  相似文献   

15.
Effective and efficient mining of music structure patterns from music query data is one of the most interesting issues of multimedia data mining. In this paper, we introduce a new kind of pattern, called emerging melody structure (EMS), for knowledge discovery from music melody streams. EMSs are defined as music data items with melody strings whose support increase significantly from one sliding window to another window from streaming melody sequences. The discovered EMS can be used to predict the future trend of online music style recommendation, to personalize the Web service of music downloading priority, for music composers to compose new music or for service provider to collect more similar music. Therefore, an efficient data mining approach, called MEMSA (Mining Emerging Melody Structure Algorithm), is proposed to discover all EMSs from streaming music query data over sliding windows. In the framework of MEMSA, a prefix tree-based data structure, called EMS-tree (Emerging Melody Structure tree), is constructed for maintaining temporal EMSs effectively. Experimental results show that the proposed method MEMSA is an efficient algorithm for mining all EMSs from streaming melody sequences efficiently.  相似文献   

16.
Mining closed frequent itemsets from data streams is of interest recently. However, it is not easy for users to determine a proper minimum support threshold. Hence, it is more reasonable to ask users to set a bound on the result size. Therefore, an interactive single-pass algorithm, called TKC-DS (top-K frequent closed itemsets of data streams), is proposed for mining top-K closed itemsets from data streams efficiently. A novel data structure, called CIL (closed itemset lattice), is developed for maintaining the essential information of closed itemsets generated so far. Experimental results show that the proposed TKC-DS algorithm is an efficient method for mining top-K frequent itemsets from data streams.  相似文献   

17.
GenMax: An Efficient Algorithm for Mining Maximal Frequent Itemsets   总被引:4,自引:0,他引:4  
We present GenMax, a backtrack search based algorithm for mining maximal frequent itemsets. GenMax uses a number of optimizations to prune the search space. It uses a novel technique called progressive focusing to perform maximality checking, and diffset propagation to perform fast frequency computation. Systematic experimental comparison with previous work indicates that different methods have varying strengths and weaknesses based on dataset characteristics. We found GenMax to be a highly efficient method to mine the exact set of maximal patterns.  相似文献   

18.
In this paper we consider the problem of discovering sequential patterns by handling time constraints as defined in the Gsp algorithm. While sequential patterns could be seen as temporal relationships between facts embedded in the database where considered facts are merely characteristics of individuals or observations of individual behavior, generalized sequential patterns aim to provide the end user with a more flexible handling of the transactions embedded in the database. We thus propose a new efficient algorithm, called Gtc (Graph for Time Constraints) for mining such patterns in very large databases. It is based on the idea that handling time constraints in the earlier stage of the data mining process can be highly beneficial. One of the most significant new feature of our approach is that handling of time constraints can be easily taken into account in traditional levelwise approaches since it is carried out prior to and separately from the counting step of a data sequence. Our test shows that the proposed algorithm performs significantly faster than a state-of-the-art sequence mining algorithm.  相似文献   

19.
In the past decade, XML has emerged as the standard language for information exchanging over the Internet. Due to its tree-structure paradigm, XML is superior for its capability of storing, querying, and manipulating complex data. Therefore, discovering frequent tree patterns over tree-structured data has become an interesting topic for XML data management. In this paper, we propose a tree mining algorithm, named BUXMiner, for finding a special class of frequent trees, called rooted unordered trees, from a tree-structured database. BUXMiner employs an efficient bottom-up approach to enumerate all candidate trees over a compact global tree guide and computes the frequent trees based on the tree guide. In addition to BUXMiner, we also propose a mining approach called BUMXMiner to discover the maximal frequent rooted unordered trees. We compare BUXMiner with previous tree-structure mining algorithms, namely XQPMinerTID and FastXMiner, which were also proposed to discover rooted unordered trees. The experimental results show that our algorithm outperforms XQPMinerTID and FastXMiner in terms of efficiency. The performance results from real-world applications also indicate the usefulness of our proposed tree mining algorithms in a variety of web applications, such as analysis of web page access patterns and mining frequent XML query patterns for caching.  相似文献   

20.
吴信东  谢飞  黄咏明  胡学钢  高隽 《软件学报》2013,24(8):1804-1815
很多应用领域产生大量的序列数据。如何从这些序列数据中挖掘具有重要价值的模式,已成为序列模式挖掘研究的主要任务。研究这样一个问题:给定序列S、支持度阈值和间隔约束,从序列S中挖掘所有出现次数不小于给定支持度阈值的频繁序列模式,并且要求模式中任意两个相邻元素在序列中的出现位置满足用户定义的间隔约束。设计了一种有效的带有通配符的模式挖掘算法One-Off Mining,模式在序列中的出现满足One-Off条件,即模式的任意两次出现都不共享序列中同一位置的字符。在生物DNA序列上的实验结果表明,One-Off Mining比相关的序列模式挖掘算法具有更好的时间性能和完备性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号