首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this paper, we proposed an efficient algorithm, called PCP-Miner (Pointset Closed Pattern Miner), for mining frequent closed patterns from a pointset database, where a pointset contains a set of points. Our proposed algorithm consists of two phases. First, we find all frequent patterns of length two in the database. Second, for each pattern found in the first phase, we recursively generate frequent closed patterns by a frequent pattern tree in a depth-first search manner. Since the PCP-Miner does not generate unnecessary candidates, it is more efficient and scalable than the modified Apriori, SASMiner and MaxGeo. The experimental results show that the PCP-Miner algorithm outperforms the comparing algorithms by more than one order of magnitude.  相似文献   

2.
In this paper, we propose an efficient algorithm, called CMP-Miner, to mine closed patterns in a time-series database where each record in the database, also called a transaction, contains multiple time-series sequences. Our proposed algorithm consists of three phases. First, we transform each time-series sequence in a transaction into a symbolic sequence. Second, we scan the transformed database to find frequent patterns of length one. Third, for each frequent pattern found in the second phase, we recursively enumerate frequent patterns by a frequent pattern tree in a depth-first search manner. During the process of enumeration, we apply several efficient pruning strategies to remove frequent but non-closed patterns. Thus, the CMP-Miner algorithm can efficiently mine the closed patterns from a time-series database. The experimental results show that our proposed algorithm outperforms the modified Apriori and BIDE algorithms.  相似文献   

3.
杨皓  段磊  胡斌  邓松  王文韬  秦攀 《软件学报》2015,26(11):2994-3009
对比序列模式能够表达序列数据集合间的差异,在商品推荐、用户行为分析和电力供应预测等领域有广泛的应用.已有的对比序列模式挖掘算法需要用户设定正例支持度阈值和负例支持度阈值.在不具备足够先验知识的情况下,用户难以设定恰当的支持度阈值,从而可能错失一些对比显著的模式.为此,提出了带间隔约束的top-k对比序列模式挖掘算法kDSP-Miner(top-k distinguishing sequential patterns with gap constraint miner).kDSP-Miner中用户只需设置期望发现的对比最显著的模式个数,从而避免了直接设置对比支持度阈值.相应地,挖掘算法更容易使用,并且结果更易于解释.同时,为了提高算法执行效率,设计了若干剪枝策略和启发策略.进一步设计了kDSP-Miner的多线程版本,以提高其对高维序列元素情况的处理能力.通过在真实世界数据集上的详实实验,验证了算法的有效性和执行效率.  相似文献   

4.
Multi-dimensional top-k dominating queries   总被引:1,自引:0,他引:1  
The top-k dominating query returns k data objects which dominate the highest number of objects in a dataset. This query is an important tool for decision support since it provides data analysts an intuitive way for finding significant objects. In addition, it combines the advantages of top-k and skyline queries without sharing their disadvantages: (i) the output size can be controlled, (ii) no ranking functions need to be specified by users, and (iii) the result is independent of the scales at different dimensions. Despite their importance, top-k dominating queries have not received adequate attention from the research community. This paper is an extensive study on the evaluation of top-k dominating queries. First, we propose a set of algorithms that apply on indexed multi-dimensional data. Second, we investigate query evaluation on data that are not indexed. Finally, we study a relaxed variant of the query which considers dominance in dimensional subspaces. Experiments using synthetic and real datasets demonstrate that our algorithms significantly outperform a previous skyline-based approach. We also illustrate the applicability of this multi-dimensional analysis query by studying the meaningfulness of its results on real data.  相似文献   

5.
李淼  谷峪  陈默  于戈 《软件学报》2017,28(2):310-325
随着地理位置定位技术的蓬勃发展,基于在线位置服务技术的应用也越来越多.提出一种查询类型——反向空间偏好top-k查询.类似于传统的反向空间top-k查询,对于给定的空间查询对象,该查询返回使该对象满足top-k属性得分的那些用户.但不同的是,该对象的属性不是自身具有的特性,而是通过计算该对象与其他偏好对象之间的空间关系(如距离)而确定.这种查询在市场分析等许多重要领域具有需求,例如,根据查询结果,分析出某个地区中某个设施受欢迎的程度.但是,由于大量空间对象的存在导致对象之间空间关系的计算代价非常高,如何实时地计算出对象的空间属性得分,给查询处理带来很大的挑战.针对该问题提出优化的查询处理算法包括:数据集剪枝、数据集批量处理、基于权重的用户分组等策略.通过理论分析和充分的实验验证,证明了所提出方法的有效性.与普通方法相比,这些方法能够大幅度提高查询处理的执行时间和I/O效率.  相似文献   

6.
Mining minimal distinguishing subsequence patterns with gap constraints   总被引:1,自引:4,他引:1  
Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the gap constraint. We present an efficient algorithm called ConSGapMiner (Contrast Sequences with Gap Miner), to mine all MDSs satisfying a minimum and maximum gap constraint, plus a maximum length constraint. It employs highly efficient bitset and boolean operations, for powerful gap-based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets, demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports.  相似文献   

7.
Mining frequent trajectory patterns in spatial-temporal databases   总被引:1,自引:0,他引:1  
In this paper, we propose an efficient graph-based mining (GBM) algorithm for mining the frequent trajectory patterns in a spatial-temporal database. The proposed method comprises two phases. First, we scan the database once to generate a mapping graph and trajectory information lists (TI-lists). Then, we traverse the mapping graph in a depth-first search manner to mine all frequent trajectory patterns in the database. By using the mapping graph and TI-lists, the GBM algorithm can localize support counting and pattern extension in a small number of TI-lists. Moreover, it utilizes the adjacency property to reduce the search space. Therefore, our proposed method can efficiently mine the frequent trajectory patterns in the database. The experimental results show that it outperforms the Apriori-based and PrefixSpan-based methods by more than one order of magnitude.  相似文献   

8.
周新  张孝  安润功  薛忠斌  王珊 《软件学报》2014,25(S2):157-168
基于位置的服务可以指引用户找到在特定位置或区域内能够提供所需要服务的对象(比如找某个高校附近(经纬度标识)的咖啡店).向这类服务提交一个查询位置和多个关键词,该类服务返回k个最相关的对象,对象和查询的相关性同时考虑空间相近性和文本相似性.为了支持高效的top-k空间关键词查询,出现了多种混合索引,然而现有的这些索引为了提供实时响应均耗费大量存储空间.提出一种基于压缩技术的索引CSTI,该索引显著减少了存储开销(至少减少80%甚至到两个数据量级),同时保持高效的查询性能.大量基于真实和仿真数据集的实验结果表明,CSTI在空间开销和响应时间上均优于已有方法.  相似文献   

9.
传统的数据挖掘方法会生成大量的模式和规则,且难以理解,而实际上用户感兴趣的只是其中的一小部分.针对该问题,在挖掘序列模式的PrefixSpan算法基础上提出一种带数据项约束的序列模式挖掘方法,通过数据项约束,减少了搜索空间.实验结果表明,该方法可以有效地挖掘出满足数据项约束的序列模式.  相似文献   

10.
Top-k queries on large multi-attribute data sets are fundamental operations in information retrieval and ranking applications. In this article, we initiate research on the anytime behavior of top-k algorithms on exact and fuzzy data. In particular, given specific top-k algorithms (TA and TA-Sorted) we are interested in studying their progress toward identification of the correct result at any point during the algorithms’ execution. We adopt a probabilistic approach where we seek to report at any point of operation of the algorithm the confidence that the top-k result has been identified. Such a functionality can be a valuable asset when one is interested in reducing the runtime cost of top-k computations. We present a thorough experimental evaluation to validate our techniques using both synthetic and real data sets.  相似文献   

11.
In this paper, given a set of sequence databases across multiple domains, we aim at mining multi-domain sequential patterns, where a multi-domain sequential pattern is a sequence of events whose occurrence time is within a pre-defined time window. We first propose algorithm Naive in which multiple sequence databases are joined as one sequence database for utilizing traditional sequential pattern mining algorithms (e.g., PrefixSpan). Due to the nature of join operations, algorithm Naive is costly and is developed for comparison purposes. Thus, we propose two algorithms without any join operations for mining multi-domain sequential patterns. Explicitly, algorithm IndividualMine derives sequential patterns in each domain and then iteratively combines sequential patterns among sequence databases of multiple domains to derive candidate multi-domain sequential patterns. However, not all sequential patterns mined in the sequence database of each domain are able to form multi-domain sequential patterns. To avoid the mining cost incurred in algorithm IndividualMine, algorithm PropagatedMine is developed. Algorithm PropagatedMine first performs one sequential pattern mining from one sequence database. In light of sequential patterns mined, algorithm PropagatedMine propagates sequential patterns mined to other sequence databases. Furthermore, sequential patterns mined are represented as a lattice structure for further reducing the number of sequential patterns to be propagated. In addition, we develop some mechanisms to allow some empty sets in multi-domain sequential patterns. Performance of the proposed algorithms is comparatively analyzed and sensitivity analysis is conducted. Experimental results show that by exploring propagation and lattice structures, algorithm PropagatedMine outperforms algorithm IndividualMine in terms of efficiency (i.e., the execution time).  相似文献   

12.
Many researchers in database and machine learning fields are primarily interested in data mining because it offers opportunities to discover useful information and important relevant patterns in large databases. Most previous studies have shown how binary valued transaction data may be handled. Transaction data in real-world applications usually consist of quantitative values, so designing a sophisticated data-mining algorithm able to deal with various types of data presents a challenge to workers in this research field. In the past, we proposed a fuzzy data-mining algorithm to find association rules. Since sequential patterns are also very important for real-world applications, this paper thus focuses on finding fuzzy sequential patterns from quantitative data. A new mining algorithm is proposed, which integrates the fuzzy-set concepts and the AprioriAll algorithm. It first transforms quantitative values in transactions into linguistic terms, then filters them to find sequential patterns by modifying the AprioriAll mining algorithm. Each quantitative item uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of the original items. The patterns mined out thus exhibit the sequential quantitative regularity in databases and can be used to provide some suggestions to appropriate supervisors.  相似文献   

13.
Recently, uncertain data processing has become more and more important. Although a significant amount of previous research explores various continuous queries on data streams, continuous queries on uncertain data streams have seldom been investigated. In this paper, we formulate a novel and challenging problem of continuously monitoring top-k uncertain data streams, and propose a probabilistic threshold method. We develop four algorithms systematically: a deterministic exact algorithm, a randomized method, and their space-efficient versions using quantile summaries. An extensive empirical study using real data sets and synthetic data sets is reported to verify the effectiveness and the efficiency of our methods. We are grateful to the anonymous reviewers and Dr. Ihab F. Ilyas, the guest editor, for their constructive comments which help to improve the quality of the paper substantially. The research of Ming Hua and Jian Pei is supported in part by an NSERC Discovery grant and an NSERC Discovery Accelerator Supplements grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.  相似文献   

14.
In this paper, we study the incremental update of Frequent Closed Itemsets (FCIs) over a sliding window in a high-speed data stream. We propose the notion of semi-FCIs, which is to progressively increase the minimum support threshold for an itemset as it is retained longer in the window, thereby drastically reducing the number of itemsets that need to be maintained and processed. We explore the properties of semi-FCIs and observe that a majority of the subsets of a semi-FCI are not semi-FCIs and need not be updated. This finding allows us to devise an efficient algorithm, IncMine, that incrementally updates the set of semi-FCIs over a sliding window. We also develop an inverted index to facilitate the update process. Our empirical results show that IncMine achieves significantly higher throughput and consumes less memory than the state-of-the-art streaming algorithms for mining FCIs and FIs. IncMine also attains high accuracy of 100% precision and over 93% recall.  相似文献   

15.
Mining sequential patterns to find ordered events or subsequence patterns is essential in many applications, such as analysis of consumer shopping data, web clickstreams, and biological sequences. Traditional patterns reveal which items are frequently purchased together and in what order. However, information about the time intervals between purchases is missing. Therefore, Yang proposed using multi-time-interval sequential patterns to consider the time intervals between each pair of items in a pattern. For example, 〈Bread, ti1, Milk, (ti2ti1), Jam〉 means that Bread is bought before Milk within an interval of ti1, and Jam is bought after Bread and Milk within intervals of ti2 and ti1, respectively, where ti1 and ti2 are predefined time intervals. Although this new type of pattern considers the intervals between all pairs of items, it contains a sharp boundary problem; that is, when the time interval between two purchases is near the boundary of two predetermined time ranges, we either ignore or overemphasize it. In this study, we applied the concept of fuzzy sets to solve the sharp boundary problem. The discovered patterns, called fuzzy multi-time-interval sequential patterns, describe time intervals in linguistic terms for better understanding. Two algorithms, FuzzMI-Apriori and FuzzMI-PrefixSpan, were developed for mining fuzzy multi-time-interval patterns. Experiments using synthetic and real datasets showed the algorithms’ computational efficiency, scalability, and effectiveness.  相似文献   

16.
Sequential rule mining is an important data mining task used in a wide range of applications. However, current algorithms for discovering sequential rules common to several sequences use very restrictive definitions of sequential rules, which make them unable to recognize that similar rules can describe a same phenomenon. This can have many undesirable effects such as (1) similar rules that are rated differently, (2) rules that are not found because they are considered uninteresting when taken individually, (3) and rules that are too specific, which makes them less likely to be used for making predictions. In this paper, we address these problems by proposing a more general form of sequential rules such that items in the antecedent and in the consequent of each rule are unordered. We propose an algorithm named CMRules for mining this form of rules. The algorithm proceeds by first finding association rules to prune the search space for items that occur jointly in many sequences. Then it eliminates association rules that do not meet the minimum confidence and support thresholds according to the sequential ordering. We evaluate the performance of CMRules in three different ways. First, we provide an analysis of its time complexity. Second, we compare its performance (in terms of execution time, memory usage and scalability) with an adaptation of an algorithm from the literature that we name CMDeo. For this comparison, we use three real-life public datasets, which have different characteristics and represent three kinds of data. In many cases, results show that CMRules is faster and has a better scalability for low support thresholds than CMDeo. Lastly, we report a successful application of the algorithm in a tutoring agent.  相似文献   

17.
Multilevel knowledge in transactional databases plays a significant role in our real-life market basket analysis. Many researchers have mined the hierarchical association rules and thus proposed various approaches. However, some of the existing approaches produce many multilevel and cross-level association rules that fail to convey quality information. From these large number of redundant association rules, it is extremely difficult to extract any meaningful information. There also exist some approaches that mine minimal association rules, but these have many shortcomings due to their naïve-based approaches. In this paper, we have focused on the need for generating hierarchical minimal rules that provide maximal information. An algorithm has been proposed to derive minimal multilevel association rules and cross-level association rules. Our work has made significant contributions in mining the minimal cross-level association rules, which express the mixed relationship between the generalized and specialized view of the transaction itemsets. We are the first to design an efficient algorithm using a closed itemset lattice-based approach, which can mine the most relevant minimal cross-level association rules. The parent–child relationship of the lattices has been exploited while mining cross-level closed itemset lattices. We have extensively evaluated our proposed algorithm’s efficiency using a variety of real-life datasets and performing a large number of experiments. The proposed algorithm has outperformed the existing related work significantly during the pervasive performance comparison.  相似文献   

18.
Mining frequent patterns from univariate uncertain data   总被引:1,自引:0,他引:1  
In this paper, we propose a new algorithm called U2P-Miner for mining frequent U2 patterns from univariate uncertain data, where each attribute in a transaction is associated with a quantitative interval and a probability density function. The algorithm is implemented in two phases. First, we construct a U2P-tree that compresses the information in the target database. Then, we use the U2P-tree to discover frequent U2 patterns. Potential frequent U2 patterns are derived by combining base intervals and verified by traversing the U2P-tree. We also develop two techniques to speed up the mining process. Since the proposed method is based on a tree-traversing strategy, it is both efficient and scalable. Our experimental results demonstrate that the U2P-Miner algorithm outperforms three widely used algorithms, namely, the modified Apriori, modified H-mine, and modified depth-first backtracking algorithms.  相似文献   

19.
周明  李宏 《计算机工程》2007,33(2):74-76
传统频繁项集挖掘算法在处理稠密或长数据集(如基因表达数据集)时效率低且产生大量冗余模式,为解决这些问题一些学者提出了闭合模式的概念和挖掘闭合模式的算法,研究证明挖掘闭合模式可以显著减少项集数量并消除大量冗余模式。该文针对生物数据特点提出了一个新颖的挖掘频繁闭合模式的算法REMFOR,该算法在闭合模式概念和行枚举思想的基础上,采用垂直数据结构和fp-tree技术,对行集建立行fp-tree来挖掘频繁闭合模式。通过实例和实验证明该算法是正确有效的。  相似文献   

20.
General patterns of execution that have been frequently scheduled by a workflow management system provide the administrator with previously unknown, and potentially useful information, e.g., about the existence of unexpected causalities between subprocesses of a given workflow. This paper investigates the problem of mining unconnected patterns on the basis of some execution traces, i.e., of detecting sets of activities exhibiting no explicit dependency relationships that are frequently executed together. The problem is faced in the paper by proposing and analyzing two algorithms. One algorithm takes into account information about the structure of the control-flow graph only, while the other is a smart refinement where the knowledge of the frequencies of edges and activities in the traces at hand is also accounted for, by means of a sophisticated graphical analysis. Both algorithms have been implemented and integrated into a system prototype, which may profitably support the enactment phase of the workflow. The correctness of the two algorithms is formally proven, and several experiments are reported to evidence the ability of the graphical analysis to significantly improve the performances, by dramatically pruning the search space of candidate patterns.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号