Similar Documents
20 similar documents found (search time: 15 ms)
1.
The TreeGrowth (TG) algorithm, an embedded subtree mining algorithm based on the pattern-growth principle, suffers from mining overly large subtrees and high memory consumption. To address this, a new algorithm, PTG (Partition Tree Growth), is proposed based on the idea of partitioned mining. PTG divides the database into several partitions and first runs TG on each partition to obtain its locally frequent subtrees; these are then filtered by global support count to obtain the globally frequent subtrees, which effectively reduces the number of subtrees mined and lowers memory overhead. Simulation results show that PTG resolves the problem of running out of memory when mining large datasets, confirming its effectiveness and robustness.
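The local-mine / global-filter flow behind PTG can be illustrated with a small sketch. The code below is not the TG/PTG tree-mining code itself: it uses flat itemsets as a stand-in for subtrees and invents a naive `mine_local` helper purely to show how locally frequent candidates are collected per partition and then re-checked against the global support threshold.

```python
from itertools import combinations
from collections import Counter

def mine_local(partition, min_sup):
    """Naive local miner: itemsets frequent inside one partition.
    Stands in for running TG on a single partition of the tree database."""
    counts = Counter()
    for transaction in partition:
        for size in range(1, len(transaction) + 1):
            for itemset in combinations(sorted(transaction), size):
                counts[itemset] += 1
    return {p for p, c in counts.items() if c >= min_sup}

def ptg_style_mine(database, num_partitions, min_sup):
    """Partition the database, mine each partition locally, then keep only
    candidates whose *global* support meets the threshold."""
    size = max(1, len(database) // num_partitions)
    partitions = [database[i:i + size] for i in range(0, len(database), size)]
    # Phase 1: local mining; a local threshold proportional to the partition
    # size guarantees that no globally frequent pattern is missed.
    candidates = set()
    for part in partitions:
        local_min = max(1, min_sup * len(part) // len(database))
        candidates |= mine_local(part, local_min)
    # Phase 2: one more pass that counts global support of the candidates only.
    frequent = {}
    for cand in candidates:
        support = sum(1 for t in database if set(cand) <= set(t))
        if support >= min_sup:
            frequent[cand] = support
    return frequent

if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(ptg_style_mine(db, num_partitions=2, min_sup=3))
```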

2.
Welke, Pascal; Horváth, Tamás; Wrobel, Stefan. Machine Learning, 2019, 108(7): 1137–1164.
Motivated by the impressive predictive power of simple patterns, we consider the problem of mining frequent subtrees in arbitrary graphs. Although the restriction of the pattern...

3.
Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, and computer networks. In this paper, we first present two canonical forms for labelled rooted unordered trees: the breadth-first canonical form (BFCF) and the depth-first canonical form (DFCF). The canonical forms are then applied to the frequent subtree mining problem. Based on the BFCF, we develop a vertical mining algorithm, RootedTreeMiner, to discover all frequently occurring subtrees in a database of labelled rooted unordered trees. The RootedTreeMiner algorithm uses an enumeration tree to enumerate all (frequent) labelled rooted unordered subtrees. Next, we extend the definition of the DFCF to labelled free trees and present an Apriori-like algorithm, FreeTreeMiner, to discover all frequently occurring subtrees in a database of labelled free trees. Finally, we study the performance and scalability of our algorithms through extensive experiments on both synthetic data and datasets from real applications.
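The role of a canonical form such as the DFCF is to assign every labelled rooted unordered tree a single string representative, so that two subtrees are isomorphic exactly when their canonical strings match. Below is a minimal sketch of one such depth-first canonical string (recursively sorting the children's encodings); it follows the general idea only and is not the paper's exact BFCF/DFCF definition.

```python
def canonical_string(label, children):
    """Depth-first canonical string of a labelled rooted unordered tree.
    `children` is a list of (label, children) pairs.  Because the children's
    encodings are sorted before concatenation, any two isomorphic unordered
    trees produce the same string."""
    child_codes = sorted(canonical_string(lbl, ch) for lbl, ch in children)
    return "(" + label + "".join(child_codes) + ")"

if __name__ == "__main__":
    #      A              A
    #     / \            / \
    #    B   C    vs.   C   B    -- same unordered tree, different child order
    t1 = ("A", [("B", []), ("C", [])])
    t2 = ("A", [("C", []), ("B", [])])
    print(canonical_string(*t1) == canonical_string(*t2))  # True
```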

4.
An efficient algorithm for mining probabilistic frequent itemsets from uncertain data
To address the limitations of the frequentness-probability computation in the PFIM algorithm, as well as its need for multiple database scans and its generation of a large number of candidate sets, the EPFIM (efficient probabilistic frequent itemset mining) algorithm is proposed. The new frequentness-probability computation adapts to cases where itemset probabilities change, as in data streams. Mining efficiency is improved by storing the uncertain database in a probability matrix, exploiting the ordering of itemsets, and progressively removing useless transactions. Theoretical analysis and experimental results show that EPFIM performs better.
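The central quantity in this setting is the frequentness probability P(sup(X) ≥ minsup). Under the usual assumption that transactions contain the itemset independently, it can be computed by a standard dynamic program over the per-transaction probabilities, as sketched below; this illustrates the baseline computation rather than the specific improvement proposed by EPFIM.

```python
def frequentness_probability(tx_probs, min_sup):
    """P(support >= min_sup) when transaction i contains the itemset
    independently with probability tx_probs[i] (a Poisson-binomial tail).
    dp[k] holds P(support == k) after processing a prefix of transactions."""
    dp = [1.0] + [0.0] * len(tx_probs)
    for p in tx_probs:
        for k in range(len(dp) - 1, 0, -1):
            dp[k] = dp[k] * (1.0 - p) + dp[k - 1] * p
        dp[0] *= (1.0 - p)
    return sum(dp[min_sup:])

if __name__ == "__main__":
    # Probability that itemset X appears in each of 5 uncertain transactions.
    probs = [0.9, 0.8, 0.5, 0.4, 0.7]
    print(frequentness_probability(probs, min_sup=3))  # ~0.80
```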

5.
Frequent itemset mining (FIM) is a fundamental research topic, which consists of discovering useful and meaningful relationships between items in transaction databases. However, FIM suffers from two important limitations. First, it assumes that all items have the same importance. Second, it ignores the fact that data collected in a real-life environment is often inaccurate, imprecise, or incomplete. To address these issues and mine more useful and meaningful knowledge, the problems of weighted itemset mining and uncertain itemset mining have been proposed: in the former, a user may assign weights to items to specify their relative importance; in the latter, existential probabilities represent the uncertainty of items in transactions. However, no work has addressed both of these issues at the same time. In this paper, we address this research problem by designing a new type of pattern named high expected weighted itemset (HEWI) and the HEWI-Uapriori algorithm to efficiently discover HEWIs. HEWI-Uapriori finds HEWIs using an Apriori-like two-phase approach and introduces a property named high upper-bound expected weighted downward closure (HUBEWDC) to prune the search space and unpromising itemsets early. Substantial experiments on real-life and synthetic datasets evaluate the performance of the proposed algorithm in terms of runtime, memory consumption, and number of patterns found. The results show that it has excellent performance and scalability compared with traditional methods for weighted itemset mining and uncertain itemset mining.
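The measure at the heart of this problem combines the expected support of an itemset in an uncertain database with a weight derived from its items. The sketch below computes one common formulation (expected support under item independence multiplied by the average item weight); it is an illustration of the idea, not the exact HEWI / HEWI-Uapriori definitions, and the variable names are chosen for the example only.

```python
def expected_weighted_support(itemset, database, weights):
    """Expected weighted support of `itemset`.

    `database` is a list of transactions; each transaction maps an item to
    its existential probability.  Assuming items occur independently, a
    transaction contributes the product of its probabilities for the
    itemset's items, and the itemset weight is the average item weight."""
    itemset = list(itemset)
    expected_support = 0.0
    for tx in database:
        prob = 1.0
        for item in itemset:
            prob *= tx.get(item, 0.0)   # 0 if the item is absent
        expected_support += prob
    itemset_weight = sum(weights[i] for i in itemset) / len(itemset)
    return expected_support * itemset_weight

if __name__ == "__main__":
    db = [{"a": 0.9, "b": 0.6}, {"a": 0.5, "c": 0.8}, {"b": 0.7, "c": 0.4}]
    w = {"a": 1.2, "b": 0.5, "c": 0.9}
    print(expected_weighted_support({"a", "b"}, db, w))  # 0.54 * 0.85 = 0.459
```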

6.
7.
Graph structures describe complex data simply and conveniently, and graph data obtained in practice are often uncertain, so algorithms for mining frequent subgraphs from uncertain graphs have been studied extensively. A representative algorithm is MUSE, but MUSE suffers from the high cost of computing expected support and from limited time efficiency. To address this, EDFS, an uncertain subgraph mining algorithm based on a partition-based hybrid search strategy, is proposed: it preprocesses the uncertain graph data with an improved GSpan algorithm, prunes the uncertain subgraph data by trimming the search space of subgraph patterns, and mines frequent subgraphs with a partition-based hybrid strategy. Experiments on subgraph isomorphism and edge existence probabilities show that EDFS mines frequent subgraphs from uncertain data more efficiently.

8.
Frequent itemset mining allows us to find hidden, important information in large databases. Processing incremental databases has also become essential in this area, because huge amounts of data are accumulated continually in a variety of application fields and users want to obtain mining results from such incremental data more efficiently. One major problem in incremental itemset mining is that the mining results can become very large depending on threshold settings and data volumes, which makes it hard to analyze all of them and find meaningful information; moreover, not all of the mining results are actually important. In this paper, to solve these problems, we propose an algorithm for mining weighted maximal frequent itemsets from incremental databases. By scanning a given incremental database only once, the proposed algorithm not only performs mining operations suited to the incremental environment but also extracts a smaller number of important itemsets than previous approaches. The proposed method also benefits expert and intelligent systems, since it automatically provides more meaningful pattern results reflecting the characteristics of the given incremental database and threshold settings, which helps users analyze the data more easily. Our comprehensive experimental results show that the proposed algorithm is more efficient and scalable than previous state-of-the-art algorithms.
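Of the notions used here, maximality is the easiest to pin down in code: an itemset is kept only if none of its proper supersets is also frequent. The small filter below reduces a set of (weighted) frequent itemsets to its maximal members; it is a generic post-processing illustration, not the paper's single-scan incremental algorithm.

```python
def maximal_only(frequent_itemsets):
    """Keep only the maximal itemsets: those with no frequent proper superset.
    `frequent_itemsets` maps a frozenset of items to its (weighted) support."""
    itemsets = list(frequent_itemsets)
    maximal = {}
    for x in itemsets:
        if any(x < y for y in itemsets):   # x has a frequent proper superset
            continue
        maximal[x] = frequent_itemsets[x]
    return maximal

if __name__ == "__main__":
    freq = {
        frozenset({"a"}): 5.0,
        frozenset({"b"}): 4.0,
        frozenset({"a", "b"}): 3.5,
        frozenset({"c"}): 2.0,
    }
    # {a} and {b} are absorbed by {a, b}; {c} has no frequent superset.
    print(maximal_only(freq))
```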

9.
Data mining and machine learning must confront the problem of pattern maintenance, because data update is a fundamental operation in data management. Most existing data-mining algorithms assume that the database is static, so a database update requires rediscovering all the patterns by scanning the entire old and new data. While there are many efficient mining techniques for handling additions to databases, in this paper we propose a decremental algorithm for pattern discovery when data is deleted from databases. Extensive experiments evaluating this approach illustrate that the proposed algorithm can effectively model and capture useful interactions within the data as the data shrinks.
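The bookkeeping behind a decremental update is simple to state: when transactions are removed, their contributions are subtracted from the stored pattern counts, and the smaller database changes which patterns still clear a relative support threshold. The sketch below shows only that bookkeeping for patterns that are already tracked (the hard part, handling patterns that become frequent only after the deletion, is what the paper's algorithm addresses and is not reproduced here).

```python
def decremental_update(pattern_counts, db_size, deleted, min_sup_ratio):
    """Update pattern support counts after `deleted` transactions are removed.

    `pattern_counts` maps a frozenset pattern to its absolute count in the
    old database of `db_size` transactions.  Returns the updated counts, the
    new database size, and the tracked patterns still frequent at the
    relative threshold `min_sup_ratio`."""
    for tx in deleted:
        tx = set(tx)
        for pattern in pattern_counts:
            if pattern <= tx:              # the deleted transaction supported it
                pattern_counts[pattern] -= 1
    new_size = db_size - len(deleted)
    still_frequent = {p: c for p, c in pattern_counts.items()
                      if new_size > 0 and c / new_size >= min_sup_ratio}
    return pattern_counts, new_size, still_frequent

if __name__ == "__main__":
    counts = {frozenset({"a"}): 4, frozenset({"a", "b"}): 3}
    counts, n, frequent = decremental_update(
        counts, db_size=6, deleted=[{"a", "b"}, {"c"}], min_sup_ratio=0.5)
    print(n, frequent)   # 4 {frozenset({'a'}): 3, frozenset({'a', 'b'}): 2}
```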

10.
In this paper, we propose two parallel algorithms for mining maximal frequent itemsets from databases. A frequent itemset is maximal if none of its supersets is frequent. One parallel algorithm, named distributed max-miner (DMM), requires very low communication and synchronization overhead in distributed computing systems. DMM has a local mining phase and a global mining phase. During the local mining phase, each node mines its local database to discover the local maximal frequent itemsets, which together form a set of maximal candidate itemsets for the top-down search in the subsequent global mining phase. A new prefix tree data structure is developed to facilitate the storage and counting of the global candidate itemsets of different sizes. This global mining phase using the prefix tree can work with any local mining algorithm. The other parallel algorithm, named parallel max-miner (PMM), is a parallel version of the sequential max-miner algorithm (Proc of ACM SIGMOD Int Conf on Management of Data, 1998, pp 85–93). Most existing mining algorithms discover the frequent k-itemsets on the kth pass over the databases and then generate the candidate (k + 1)-itemsets for the next pass. Compared to those level-wise algorithms, PMM looks ahead at each pass and prunes more candidate itemsets by checking the frequencies of their supersets. Both DMM and PMM were implemented on a cluster of workstations, and their performance was evaluated for various cases. They demonstrate very good performance and scalability even when databases contain large maximal frequent itemsets (i.e., long patterns).
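A prefix tree over lexicographically sorted itemsets is what lets a global mining phase store and count candidates of different sizes compactly: candidates sharing a prefix share a path. The sketch below shows such a trie with insertion and a counting pass over transactions; it illustrates the data structure in general and is not DMM's or PMM's implementation.

```python
class PrefixTreeNode:
    """One node of a prefix tree (trie) over sorted itemsets."""
    def __init__(self):
        self.children = {}       # item -> PrefixTreeNode
        self.count = 0           # support count if a candidate ends here
        self.is_candidate = False

def insert(root, itemset):
    """Insert a candidate itemset (of any size) along its sorted item path."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, PrefixTreeNode())
    node.is_candidate = True

def count_transaction(node, items, start=0):
    """Recursively add 1 to every candidate contained in the transaction."""
    if node.is_candidate:
        node.count += 1
    for i in range(start, len(items)):
        child = node.children.get(items[i])
        if child is not None:
            count_transaction(child, items, i + 1)

def supports(root, prefix=()):
    """Yield (itemset, count) for every candidate stored in the tree."""
    if root.is_candidate:
        yield prefix, root.count
    for item, child in root.children.items():
        yield from supports(child, prefix + (item,))

if __name__ == "__main__":
    root = PrefixTreeNode()
    for cand in [{"a"}, {"a", "b"}, {"a", "b", "c"}, {"b", "c"}]:
        insert(root, cand)
    for tx in [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]:
        count_transaction(root, sorted(tx))
    print(dict(supports(root)))
```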

11.
The frequent string mining problem is to find all substrings of a collection of string databases that satisfy database-specific minimum and maximum frequency constraints. Our contribution improves the existing linear-time algorithm for this problem so that the peak memory consumption stays within a constant factor of the size of the largest database of strings. We show how the results for each database can be stored implicitly in space proportional to the size of the database, making it possible to traverse the results in lexicographical order. Furthermore, we present a linear-time algorithm which calculates the intersection of the results of different databases. This algorithm is based on an algorithm for merging two suffix arrays, and our modification also calculates the LCP table of the resulting suffix array during the merge.
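Stated naively, the problem is: report every substring whose document frequency in each database lies within that database's [min, max] bounds. The brute-force baseline below makes the problem statement concrete; the paper's contribution is solving it in linear time and bounded memory with suffix arrays, which this sketch does not attempt.

```python
from collections import Counter

def substring_frequencies(strings):
    """Number of strings in which each distinct substring occurs
    (document frequency, counted at most once per string)."""
    freq = Counter()
    for s in strings:
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        freq.update(subs)
    return freq

def frequent_strings(databases, bounds):
    """Substrings whose frequency in database d lies in bounds[d] = (lo, hi)
    for every database d.  Quadratic per string -- a baseline only."""
    freqs = [substring_frequencies(db) for db in databases]
    all_subs = set().union(*[set(f) for f in freqs])
    result = []
    for sub in all_subs:
        if all(lo <= f[sub] <= hi for f, (lo, hi) in zip(freqs, bounds)):
            result.append(sub)
    return sorted(result)

if __name__ == "__main__":
    db1 = ["abab", "abc", "cab"]
    db2 = ["bbb", "abba"]
    # Substrings occurring in at least 2 strings of db1 but at most 1 of db2.
    print(frequent_strings([db1, db2], bounds=[(2, 3), (0, 1)]))  # ['a', 'ab', 'c']
```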

12.
Exploiting the characteristics of frequent induced subtrees, an encoding-based algorithm for mining frequent induced subtrees is presented. The algorithm represents the original database with a breadth-first encoding so that each individual projection is as small as possible, and encodes each projection to reduce the size of the whole projected database, thereby effectively improving the efficiency of frequent induced subtree mining. Experimental results confirm that the algorithm achieves high mining efficiency.

13.
夏英, 李洪旭. 《计算机应用》, 2017, 37(9): 2439–2442.
Unordered trees are widely used to model semi-structured data, and mining frequent subtrees from them helps uncover hidden knowledge. Traditional frequent subtree mining methods often output a large number of frequent subtrees containing redundant information, which lowers the efficiency of subsequent processing. To overcome this shortcoming, an algorithm for mining covering patterns (MCRP) is proposed. First, trees are encoded with a breadth-first child-count encoding; then all candidate subtrees are generated by edge extension based on the maximal prefix coding sequence; finally, a set of covering patterns is output based on the set of frequent subtrees and the notion of δ'-cover. Compared with traditional algorithms for mining frequent closed tree patterns and maximal frequent tree patterns, MCRP outputs fewer frequent subtrees while preserving the information of all frequent subtrees, and improves processing efficiency by 15% to 25%. Experimental results show that the proposed algorithm effectively reduces the number of output frequent subtrees and the amount of redundant information, and is highly practical.

14.
A GML document structure clustering algorithm based on frequent subtree patterns, GCFS (GML Clustering based on Frequent Subtree patterns), is proposed. Unlike related algorithms, GCFS first mines the maximal and closed frequent induced subtrees of the GML document collection and uses them as clustering features, assigning each frequent subtree a weight according to its size. Similarity is defined by the cosine function, and the K-Means algorithm is used to cluster the features. Experimental results show that GCFS is effective, achieves high clustering efficiency, and outperforms comparable algorithms.
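The feature-construction step described above (frequent subtrees as features, weighted by size, compared with cosine similarity) can be sketched independently of the subtree miner. The code below assumes the maximal and closed frequent subtrees have already been mined and are represented as opaque identifiers with a size; it builds the weighted vectors and the cosine similarity that K-Means would then cluster, and is not the GCFS implementation itself.

```python
import math

def feature_vector(doc_subtrees, all_subtrees, sizes):
    """Weighted feature vector of one document: weight = subtree size if the
    document contains that frequent subtree, 0 otherwise."""
    return [sizes[t] if t in doc_subtrees else 0.0 for t in all_subtrees]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

if __name__ == "__main__":
    # Hypothetical frequent subtrees t1..t3 with their sizes (node counts).
    sizes = {"t1": 4, "t2": 2, "t3": 3}
    order = sorted(sizes)
    docA = feature_vector({"t1", "t2"}, order, sizes)
    docB = feature_vector({"t1", "t3"}, order, sizes)
    print(cosine(docA, docB))   # documents sharing large subtrees score higher
```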

15.
Trends in databases: reasoning and mining
We propose a temporal dependency, called trend dependency (TD), which captures a significant family of data evolution regularities. An example of such a regularity is "Salaries of employees generally do not decrease." TDs compare attributes over time using operators from {<, =, >, ⩽, ⩾, ≠}. We define a satisfiability problem that is the dual of the logical implication problem for TDs, and we investigate the computational complexity of both problems. As TDs allow expressing meaningful trends, "mining" them from existing databases is interesting. For the purpose of TD mining, TD satisfaction is characterized by support and confidence measures. We study the problem TDMINE: given a temporal database, mine the TDs that conform to a given template and whose support and confidence exceed certain threshold values. The complexity of TDMINE is studied, as well as algorithms to solve the problem.
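A trend dependency such as "salaries of employees generally do not decrease" can be evaluated with support and confidence over pairs of tuples from consecutive snapshots. The sketch below checks one TD of the form (emp, =) → (salary, ⩽) with deliberately simplified definitions of the measures; it is a toy evaluation, not the TDMINE algorithm.

```python
import operator

def td_measures(snapshot_t, snapshot_t1, lhs, rhs):
    """Support and confidence of a trend dependency lhs -> rhs evaluated over
    all tuple pairs (x, y) with x from time t and y from time t+1.

    lhs / rhs are (attribute, comparison) pairs, e.g. ("emp", operator.eq)
    and ("salary", operator.le).  Support is the fraction of all pairs
    satisfying both sides; confidence is the fraction of lhs-pairs that also
    satisfy rhs (definitions simplified for illustration)."""
    l_attr, l_op = lhs
    r_attr, r_op = rhs
    pairs = [(x, y) for x in snapshot_t for y in snapshot_t1]
    lhs_pairs = [(x, y) for x, y in pairs if l_op(x[l_attr], y[l_attr])]
    both = [(x, y) for x, y in lhs_pairs if r_op(x[r_attr], y[r_attr])]
    support = len(both) / len(pairs) if pairs else 0.0
    confidence = len(both) / len(lhs_pairs) if lhs_pairs else 0.0
    return support, confidence

if __name__ == "__main__":
    t0 = [{"emp": 1, "salary": 100}, {"emp": 2, "salary": 200}]
    t1 = [{"emp": 1, "salary": 120}, {"emp": 2, "salary": 190}]
    # "If it is the same employee, the salary does not decrease."
    print(td_measures(t0, t1, ("emp", operator.eq), ("salary", operator.le)))
```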

16.
Mining frequent trajectory patterns in spatial-temporal databases
In this paper, we propose an efficient graph-based mining (GBM) algorithm for mining the frequent trajectory patterns in a spatial-temporal database. The proposed method comprises two phases. First, we scan the database once to generate a mapping graph and trajectory information lists (TI-lists). Then, we traverse the mapping graph in a depth-first search manner to mine all frequent trajectory patterns in the database. By using the mapping graph and TI-lists, the GBM algorithm can localize support counting and pattern extension in a small number of TI-lists. Moreover, it utilizes the adjacency property to reduce the search space. Therefore, our proposed method can efficiently mine the frequent trajectory patterns in the database. The experimental results show that it outperforms the Apriori-based and PrefixSpan-based methods by more than one order of magnitude.
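A frequent trajectory pattern here is, roughly, a sequence of locations appearing (respecting adjacency) in enough trajectories. The naive counter below makes that notion concrete by treating patterns as contiguous subsequences, which is a simplification; GBM's mapping graph and TI-lists, which this sketch does not reproduce, are what make the counting efficient.

```python
from collections import Counter

def frequent_trajectory_patterns(trajectories, min_sup):
    """Contiguous location subsequences (length >= 2) occurring in at least
    `min_sup` trajectories, counted once per trajectory."""
    counts = Counter()
    for traj in trajectories:
        patterns = {tuple(traj[i:j]) for i in range(len(traj))
                    for j in range(i + 2, len(traj) + 1)}
        counts.update(patterns)
    return {p: c for p, c in counts.items() if c >= min_sup}

if __name__ == "__main__":
    trajs = [["A", "B", "C"], ["A", "B", "D"], ["B", "C", "A", "B"]]
    print(frequent_trajectory_patterns(trajs, min_sup=2))
    # {('A', 'B'): 3, ('B', 'C'): 2}
```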

17.
In this paper, we propose an efficient algorithm, called PCP-Miner (Pointset Closed Pattern Miner), for mining frequent closed patterns from a pointset database, where a pointset contains a set of points. The proposed algorithm consists of two phases. First, we find all frequent patterns of length two in the database. Second, for each pattern found in the first phase, we recursively generate frequent closed patterns using a frequent pattern tree in a depth-first search manner. Since PCP-Miner does not generate unnecessary candidates, it is more efficient and scalable than the modified Apriori, SASMiner, and MaxGeo. The experimental results show that the PCP-Miner algorithm outperforms the compared algorithms by more than one order of magnitude.
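The closedness condition used here is easy to state on its own: a frequent pattern is closed if no proper superset has the same support. The filter below shows that check for generic patterns represented as frozensets; it is a post-processing illustration, not the depth-first PCP-Miner itself, which avoids generating non-closed candidates in the first place.

```python
def closed_patterns(pattern_supports):
    """Return the closed patterns: those with no proper superset of equal
    support.  `pattern_supports` maps a frozenset pattern to its support."""
    closed = {}
    for p, sup in pattern_supports.items():
        has_equal_superset = any(
            p < q and pattern_supports[q] == sup for q in pattern_supports
        )
        if not has_equal_superset:
            closed[p] = sup
    return closed

if __name__ == "__main__":
    sup = {
        frozenset({"a"}): 4,
        frozenset({"b"}): 3,
        frozenset({"a", "b"}): 3,   # absorbs {b}, but not {a} (higher support)
    }
    print(closed_patterns(sup))     # {frozenset({'a'}): 4, frozenset({'a', 'b'}): 3}
```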

18.
Traditional association rule mining studies the associations among items that transactions contain, whereas negative association rule mining must also consider the items that transactions do not contain. This paper defines complete negative association rules and proposes a tree-based algorithm, Free-PNP, which mines negative frequent patterns from the database and then derives the complete negative association rules from them. Experiments verify the effectiveness of the algorithm.
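What distinguishes negative association rules is that a pattern may require some items to be absent from a transaction. The sketch below computes support for such mixed positive/negative patterns and the confidence of a rule with a negated consequent; it only illustrates the measures and is not the Free-PNP tree algorithm.

```python
def support(database, positives, negatives):
    """Fraction of transactions containing every item in `positives`
    and none of the items in `negatives`."""
    hits = sum(1 for tx in database
               if positives <= tx and not (negatives & tx))
    return hits / len(database)

def negative_rule_confidence(database, antecedent, consequent_neg):
    """Confidence of the rule  antecedent -> NOT consequent_neg :
    among transactions containing the antecedent, the fraction that
    contain none of the negated items."""
    body = support(database, antecedent, set())
    both = support(database, antecedent, consequent_neg)
    return both / body if body else 0.0

if __name__ == "__main__":
    db = [{"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
    print(support(db, {"a"}, {"b"}))                   # 0.5
    print(negative_rule_confidence(db, {"a"}, {"b"}))  # 2/3
```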

19.
孔鹏程, 张继福. 《计算机应用》, 2009, 29(4): 1120–1123.
For frequent embedded subtree mining, a mining algorithm based on discrete intervals is presented, in which discrete intervals are used to construct the projected database. By eliminating redundant projections through discrete intervals, the algorithm effectively compresses the projected database, improves the efficiency of counting subtree nodes, and reduces its time and space complexity. Experimental results show that the algorithm achieves high mining efficiency.

20.

This paper presents a new means of selecting quality data for mining multiple data sources. Traditional data-mining strategies obtain necessary data from internal and external data sources and pool all the data into a huge homogeneous dataset for discovery. In contrast, our data-mining strategy identifies quality data from (internal and external) data sources for a mining task. A framework is advocated for generating quality data. Experimental results demonstrate that application of this new data collecting technique can not only identify quality data, but can also efficiently reduce the amount of data that must be considered during mining.
