Frequent itemset mining aims to discover patterns whose support exceeds a given threshold. In many applications, including the network event management systems that motivated this work, patterns are composed of items, each described by a subset of the attributes of a relational table. Because the mining space is exponential, efficiently implementing user preferences and mining constraints becomes the first priority for a mining algorithm. User preferences and mining constraints are often expressed in terms of pattern attribute structures. Unlike traditional methods that mine all frequent patterns indiscriminately, we regard frequent itemset mining as a two-step process: mining the pattern structures, and mining the patterns within each pattern structure. In this paper, we present a novel architecture that uses pattern structures to organize the mining space. Compared with previous techniques, our approach has two advantages: (i) by exploiting the interrelationships among pattern structures, execution times for mining can be reduced significantly; and (ii) more importantly, it lets us incorporate simple, high-level user preferences and mining constraints into the mining process efficiently. These advantages are demonstrated by our experiments on both synthetic and real-life datasets.
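
As an illustration of the terms used in this abstract, the following minimal Python sketch counts itemset supports by brute force and then groups the frequent itemsets by their attribute structure. It is not the architecture proposed in the paper; the `attribute=value` item encoding and the names `frequent_itemsets` and `group_by_structure` are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support count} for itemsets of up to max_size items
    whose support count is at least min_support (brute-force enumeration)."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def group_by_structure(itemsets):
    """Group itemsets by the tuple of attribute names they use
    (assumes every item is written as 'attribute=value')."""
    groups = {}
    for itemset, support in itemsets.items():
        structure = tuple(item.split("=")[0] for item in itemset)
        groups.setdefault(structure, {})[itemset] = support
    return groups


# Hypothetical network events described by two attributes.
transactions = [
    {"src=10.0.0.1", "port=80"},
    {"src=10.0.0.1", "port=80"},
    {"src=10.0.0.2", "port=80"},
]
frequent = frequent_itemsets(transactions, min_support=2)
by_structure = group_by_structure(frequent)
```

Here `by_structure` has one entry per attribute structure, for example `("port", "src")`, so a constraint such as "only patterns over port and src" can be applied by selecting a group instead of filtering individual patterns.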


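Community detection aims to uncover the community structure embedded in complex networks and is one of the key tasks of complex network analysis. However, most existing community detection methods target single-layer network data, and comparatively little work addresses the multilayer network data that is widespread in the real world. To address community detection in multilayer networks, a community detection algorithm based on a two-stage ensemble is proposed to improve the accuracy and interpretability of the results. First, a base community partition is obtained on each layer separately. Second, a local ensemble is performed for each layer, relying mainly on the structural information of that layer's partition and combining it with the best community partition information among the base partitions obtained from the other layers. Third, the stability of each community in each layer's local partition is measured using information entropy, and the accuracy of each local partition is evaluated against the partition results of the other layers. Finally, a global weighted ensemble based on the importance of each community and each partition yields the final community partition. The algorithm is compared with existing multilayer network community detection algorithms on synthetic and real multilayer networks. The experimental results show that the proposed algorithm outperforms existing algorithms on evaluation metrics such as multilayer modularity and normalized mutual information.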


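Given the high dimensionality of network security data, this work targets details of intrusion behavior in network data that traditional outlier detection cannot find effectively. A frequent-pattern-based algorithm is proposed: by detecting the frequent patterns and association rules of data items, it strips noise and anomalous points from data streams or security log data, computes a weighted frequent outlier factor for the security data, pinpoints the outliers, and finally selects the anomalous attributes from them automatically. Experiments show that, with reasonable space and time complexity, the method effectively finds anomalous attributes in high-dimensional security data.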

The FP-growth algorithm using the FP-tree has been widely studied for frequent pattern mining because it can dramatically improve performance compared to the candidate generation-and-test paradigm of Apriori. However, it still requires two database scans, which are not consistent with efficient data stream processing. In this paper, we present a novel tree structure, called CP-tree (compact pattern tree), that captures database information with one scan (insertion phase) and provides the same mining performance as the FP-growth method (restructuring phase). The CP-tree introduces the concept of dynamic tree restructuring to produce a highly compact frequency-descending tree structure at runtime. An efficient tree restructuring method, called the branch sorting method, that restructures a prefix-tree branch-by-branch, is also proposed in this paper. Moreover, the CP-tree provides full functionality for interactive and incremental mining. Extensive experimental results show that the CP-tree is efficient for frequent pattern mining, interactive, and incremental mining with a single database scan.
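
The two phases named in this abstract can be illustrated with a small Python sketch: a prefix tree is filled in one pass using a fixed item order, then rebuilt in frequency-descending order. The rebuild below is a naive re-insertion of every path, not the branch sorting method of the paper, and the names `Node`, `insert`, `paths`, `restructure`, and `size` are assumptions made for the example.

```python
from collections import Counter


class Node:
    """Prefix-tree node holding an item, a count, and child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def insert(root, items, count=1):
    """Insert one ordered transaction, adding `count` along its path."""
    node = root
    for item in items:
        node = node.children.setdefault(item, Node(item))
        node.count += count


def paths(node, prefix=()):
    """Yield (path, count) for the transactions that end at each node."""
    for child in node.children.values():
        path = prefix + (child.item,)
        ending_here = child.count - sum(c.count for c in child.children.values())
        if ending_here > 0:
            yield path, ending_here
        yield from paths(child, path)


def restructure(root, freq):
    """Rebuild the tree with every path sorted by descending item frequency
    (naive re-insertion, standing in for a branch-by-branch sort)."""
    new_root = Node()
    for path, count in paths(root):
        ordered = sorted(path, key=lambda item: (-freq[item], item))
        insert(new_root, ordered, count)
    return new_root


def size(node):
    """Number of nodes below `node`."""
    return sum(1 + size(child) for child in node.children.values())


# Insertion phase: one scan, items kept in lexicographic order.
transactions = [("a", "c"), ("b", "c"), ("c",)]
root, freq = Node(), Counter()
for transaction in transactions:
    insert(root, sorted(transaction))
    freq.update(transaction)

# Restructuring phase: frequency-descending order shares the common item "c".
compact = restructure(root, freq)
```

In this toy input the lexicographic tree needs five nodes, while the restructured tree needs three because the most frequent item `c` becomes a shared prefix.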


Traditional association-rule mining (ARM) considers only the frequency of items in a binary database, which provides insufficient knowledge for making efficient decisions and strategies. The mining of useful information from quantitative databases is not a trivial task compared to conventional algorithms in ARM. Fuzzy-set theory was invented to represent a more valuable form of knowledge for human reasoning, which can also be applied and utilized for quantitative databases. Many approaches have adopted fuzzy-set theory to transform the quantitative value into linguistic terms with its corresponding degree based on defined membership functions for the discovery of FFIs, also known as fuzzy frequent itemsets. Only linguistic terms with maximal scalar cardinality are considered in traditional fuzzy frequent itemset mining, but the uncertainty factor is not involved in past approaches. In this paper, an efficient fuzzy mining (EFM) algorithm is presented to quickly discover multiple FFIs from quantitative databases under type-2 fuzzy-set theory. A compressed fuzzy-list (CFL)-structure is developed to maintain complete information for rule generation. Two pruning techniques are developed for reducing the search space and speeding up the mining process. Several experiments are carried out to verify the efficiency and effectiveness of the designed approach in terms of runtime, the number of examined nodes, memory usage, and scalability under different minimum support thresholds and different linguistic terms used in the membership functions.
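
The step of turning quantities into linguistic terms and keeping the term with maximal scalar cardinality can be sketched as follows. This is a type-1 illustration only, whereas the abstract works with type-2 fuzzy sets; the triangular membership functions, the `TERMS` parameters, and the sample quantities are assumptions made for the example.

```python
def triangular(x, a, b, c):
    """Triangular membership function rising from a to a peak at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical linguistic terms for the purchased quantity of one item.
TERMS = {"low": (0, 1, 6), "middle": (1, 6, 11), "high": (6, 11, 16)}


def fuzzify(quantity):
    """Map a quantity to its membership degree in every linguistic term."""
    return {name: triangular(quantity, *abc) for name, abc in TERMS.items()}


def scalar_cardinality(quantities):
    """Sum the membership degrees of each term over all transactions."""
    totals = {name: 0.0 for name in TERMS}
    for quantity in quantities:
        for name, degree in fuzzify(quantity).items():
            totals[name] += degree
    return totals


# Quantities of the same item in three transactions.
totals = scalar_cardinality([2, 5, 8])
best_term = max(totals, key=totals.get)
```

With these assumed parameters the term `middle` has the largest scalar cardinality, so a traditional approach would keep only that term for the item, while a multiple-FFI approach would retain the others as well.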


Many well-known online social networks, e.g., Facebook and Twitter, have achieved great success in the last several years. Users in these online social networks can establish various connections via both social links and shared attribute information. Discovering groups of users who are strongly connected internally is defined as the community detection problem. The community detection problem is very important for online social networks and has extensive applications in various social services. Meanwhile, besides these popular social networks, a large number of new social networks offering specific services have also sprung up in recent years. Community detection can be even more important for new networks, as high-quality community detection results enable new networks to provide better services, which can help attract more users effectively. In this paper, we study the community detection problem for new networks, which is formally defined as the “New Network Community Detection” problem. The new network community detection problem is very challenging because information in new networks can be too sparse to calculate effective similarity scores among users, which are crucial in community detection. However, we notice that users nowadays usually join multiple social networks simultaneously, and those who are involved in a new network may have been using other well-developed social networks for a long time. Taking network difference issues fully into account, we propose to propagate useful information from other well-established networks to the new network with efficient information propagation models to overcome the information shortage problem. An effective and efficient method, Cat (Cold stArT community detector), is proposed in this paper to detect communities for new networks using information from multiple heterogeneous social networks simultaneously. Extensive experiments conducted on real-world heterogeneous online social networks demonstrate that Cat can address the new network community detection problem effectively.

In dynamic networks, periodically occurring interactions carry especially significant meaning. However, such patterns may also occur infrequently, which makes them difficult to detect in massive data. To identify such periodic patterns in dynamic networks, we propose SPPMiner, a single-pass, supergraph-based periodic pattern mining technique that is polynomial, unlike most graph mining problems. The proposed technique stores every entity in the dynamic network only once and computes common sub-patterns once at each timestamp, which makes it faster. The performance study shows that SPPMiner is time- and memory-efficient compared to other methods. In fact, the memory efficiency of our approach does not depend on the dynamic network’s lifetime. By studying the growth of periodic patterns in social networks, the proposed research has potential implications for behavior prediction of intellectual communities.


Trillions of bytes of data are generated every day in different forms, and extracting useful information from that massive amount of data is the study of data mining. Sequential pattern mining is a major branch of data mining that deals with mining frequent sequential patterns from sequence databases. Because items have different importance in real-life scenarios, they cannot be treated uniformly. With today’s datasets, the use of weights in sequential pattern mining is much more feasible. In most cases, as in real-life datasets, pushing weights into the mining process gives a better understanding of the dataset, as it also measures the importance of an item inside a pattern rather than treating all items equally. Many techniques have been introduced to mine weighted sequential patterns, but typically these algorithms generate a massive number of candidate patterns and take a long time to execute. This work introduces a new pruning technique and a complete framework that takes much less time and generates a small number of candidate sequences without compromising completeness. Performance evaluation on real-life datasets shows that our proposed approach can mine weighted patterns substantially faster than other existing approaches.


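Traditional data mining algorithms require physical joins when handling multiple tables, which makes them inefficient. To solve this problem, a multi-relational frequent pattern mining algorithm is proposed. The algorithm uses the idea of tuple ID propagation so that frequent patterns can be mined directly across multiple tables without physical joins. Experiments show that the algorithm is efficient.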
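
The general idea of tuple ID propagation can be sketched in Python as follows: each attribute value in a related table is annotated with the set of target-table tuple IDs it reaches, and the support of a cross-table pattern is the size of the intersection of those ID sets, so no joined table is materialized. The abstract does not give the algorithm's details; the two tables, their column names, and the functions `propagate_ids` and `support` are assumptions made for the example.

```python
def propagate_ids(rows, key, attr):
    """Map each (attr, value) item to the set of target tuple IDs carrying it."""
    ids = {}
    for row in rows:
        ids.setdefault((attr, row[attr]), set()).add(row[key])
    return ids


def support(items, id_sets):
    """Support of a pattern: number of target tuples shared by all its items."""
    common = None
    for item in items:
        tuple_ids = id_sets.get(item, set())
        common = set(tuple_ids) if common is None else common & tuple_ids
    return len(common) if common is not None else 0


# Hypothetical target table (customers) and related table (orders).
customers = [
    {"cid": 1, "city": "Paris"},
    {"cid": 2, "city": "Paris"},
    {"cid": 3, "city": "Rome"},
]
orders = [
    {"cid": 1, "product": "tea"},
    {"cid": 1, "product": "tea"},
    {"cid": 2, "product": "tea"},
    {"cid": 3, "product": "tea"},
]

# Propagate customer IDs to the items of both tables, then combine them.
id_sets = {}
id_sets.update(propagate_ids(customers, "cid", "city"))
id_sets.update(propagate_ids(orders, "cid", "product"))

paris_tea = support([("city", "Paris"), ("product", "tea")], id_sets)
```

Because support is counted over target tuple IDs, the two orders placed by customer 1 contribute only once, which a naive join followed by row counting would get wrong.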

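To address the limitations in real networks of pre-specified input parameters, as well as label redundancy, in label-propagation-based overlapping community detection algorithms for complex networks, a label-propagation-based community detection model for large-scale academic social networks is proposed. The model finds mutually disjoint maximum maximal cliques (UMCs) in the network and assigns a unique label to the nodes of each UMC, which reduces redundant labels and improves the efficiency and stability of community detection. During label updating, UMCs serve as the core units: the labels and weights of nodes adjacent to a UMC are updated outward from the center according to closeness, and the weights of nodes not adjacent to any UMC are updated using the maximum weight. In the post-processing stage, an adaptive threshold removes noise from the node labels, which effectively overcomes the limitation of having to specify the number of overlapping communities in advance for real networks. Experiments on a dataset from the academic social networking platform 学者网 show that the model assigns nodes with common characteristics to the same community and supports further precise personalized services on academic social networking platforms, such as friend recommendation and paper sharing.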

Visualizing communication logs, like NetFlow records, is extremely useful for numerous tasks that need to analyze network traffic traces, like network planning, performance monitoring, and troubleshooting. Communication logs, however, can be massive, which necessitates designing effective visualization techniques for large data sets. To address this problem, we introduce a novel network traffic visualization scheme based on the key ideas of (1) exploiting frequent itemset mining (FIM) to visualize a succinct set of interesting traffic patterns extracted from large traces of communication logs; and (2) visualizing extracted patterns as hypergraphs that clearly display multi-attribute associations. We demonstrate case studies that support the utility of our visualization scheme and show that it enables the visualization of substantially larger data sets than existing network traffic visualization schemes based on parallel-coordinate plots or graphs. For example, we show that our scheme can easily visualize the patterns of more than 41 million NetFlow records. Previous research has explored using parallel-coordinate plots for visualizing network traffic flows. However, such plots do not scale to data sets with thousands or even millions of flows.

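Detecting community structure in complex networks is important for understanding the topology and function of networks. This paper applies dictionary learning to the community structure detection problem, presents a new dictionary learning method, and systematically compares it with several other popular models and algorithms. Experiments on three types of synthetic data and on real data from different domains show that the proposed algorithm is very effective for community structure detection and is simple, fast to converge, and highly accurate.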

Discovering community structures is a fundamental problem in understanding the topology and functions of complex networks. In this paper, we show how to apply dictionary learning to community structure detection. We present a new dictionary learning algorithm and systematically compare it with other state-of-the-art models and algorithms. The results show that the proposed algorithm is highly effective at finding community structures both in synthetic datasets, which include three types of data structures, and in real-world networks from different areas.

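To address the shortcomings of most privacy-preserving frequent pattern mining algorithms, which require multiple database scans and many comparisons during counting, an incremental bitmap-based randomized response with partial hiding (IBRRPH) algorithm is proposed. First, bitmaps are introduced to represent the transactions in the database, and bitwise AND operations speed up support computation. Second, by analyzing the incremental access relationships, an incremental update model is introduced so that frequent pattern mining makes maximal use of previous mining results when the data is updated incrementally. Comparative experiments against the algorithm of Gu Cheng et al. (顾铖, 朱保平, 张金康. 一种改进的隐私保护关联规则挖掘算法. 南京航空航天大学学报, 2015, 47(1): 119-124) were carried out for increments ranging from 1000 to 40000. The results show that IBRRPH improves efficiency by more than 21% compared with the algorithm of Gu Cheng et al.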
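
The bitmap representation and the bitwise AND support computation mentioned above can be sketched in a few lines of Python. The sketch covers only that counting step and leaves out the randomized response, partial hiding, and incremental update parts of IBRRPH; the names `build_bitmaps` and `support` and the sample transactions are assumptions made for the example.

```python
def build_bitmaps(transactions):
    """Give each item an integer bitmap whose bit i is set
    when transaction i contains the item."""
    bitmaps = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(itemset, bitmaps):
    """Support count of an itemset: AND the item bitmaps, then count set bits."""
    items = list(itemset)
    if not items:
        raise ValueError("itemset must contain at least one item")
    combined = bitmaps.get(items[0], 0)
    for item in items[1:]:
        combined &= bitmaps.get(item, 0)
    return bin(combined).count("1")


transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
bitmaps = build_bitmaps(transactions)
ab_support = support({"a", "b"}, bitmaps)
```

Each support query then costs one AND per item plus a bit count, rather than a comparison against every transaction.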

In this paper, we introduce polygene-based evolution, a novel framework for evolutionary algorithms (EAs) that features distinctive operations in the evolutionary process. In traditional EAs, the primitive evolution unit is a gene, wherein genes are independent components during evolution. In polygene-based evolutionary algorithms (PGEAs), the evolution unit is a polygene, i.e., a set of co-regulated genes. Discovering and maintaining quality polygenes can play an effective role in evolving quality individuals. Polygenes generalize genes, and PGEAs generalize EAs. Implementing the PGEA framework involves three phases: (I) polygene discovery, (II) polygene planting, and (III) polygene-compatible evolution. For Phase I, we adopt an associative classification-based approach to discover quality polygenes. For Phase II, we perform probabilistic planting to maintain the diversity of individuals. For Phase III, we incorporate polygene-compatible crossover and mutation in producing the next generation of individuals. Extensive experiments on function optimization benchmarks in comparison with the conventional and state-of-the-art EAs demonstrate the potential of the approach in terms of accuracy and efficiency improvement.

Mining frequent patterns in a single network (graph) poses a number of challenges. Even matching a single path pattern to a network under subgraph isomorphism is NP-complete. Classical matching algorithms become intractable even for reasonably small patterns on networks that are large or have a high average degree. Based on recent advances in parameterized complexity theory, we propose a novel miner for rooted trees in networks. For a fixed parameter $k$ (the maximal pattern size), the miner can mine all rooted trees with a delay that is linear in the size of the network and only mildly exponential in $k$. This allows us to tractably mine rooted trees in large networks such as the WWW or social networks. We establish the practical applicability of our miner by presenting an experimental evaluation on both synthetic and real-world data.

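Advanced applications such as fraud detection and trend learning have driven the development of frequent pattern mining over data streams. Unlike static data, data stream mining faces problems such as time and space constraints and the combinatorial explosion of itemsets. This paper surveys existing frequent pattern mining algorithms for data streams and analyzes classic and recent algorithms. Classified by the completeness of the pattern set, frequent patterns in data streams are divided into full-set patterns and compressed patterns. Compressed patterns mainly include closed patterns, maximal patterns, top-k patterns, and combinations of the three. The difference is that closed patterns are a lossless compression, whereas the others are lossy. To obtain interesting frequent patterns, patterns can be mined under user constraints. To handle recent transactions in a data stream, algorithms are divided into window-model-based and decay-model-based methods. Common pattern mining over data streams also includes sequential patterns and high-utility patterns, for which classic and recent algorithms are introduced. Finally, directions for future work on data stream pattern mining are given.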
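
The difference between lossless closed patterns and lossy maximal patterns can be made concrete with a short Python sketch using the standard definitions: a frequent itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent. The function names and the three-itemset example are assumptions made for illustration.

```python
def closed_itemsets(freq):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        s: c
        for s, c in freq.items()
        if not any(s < t and c == freq[t] for t in freq)
    }


def maximal_itemsets(freq):
    """Keep itemsets that have no frequent proper superset."""
    return {s for s in freq if not any(s < t for t in freq)}


def support_from_closed(itemset, closed):
    """Recover the support of any frequent itemset from the closed set:
    it equals the largest support among its closed supersets."""
    return max((c for s, c in closed.items() if itemset <= s), default=0)


# All frequent itemsets of a toy dataset with their support counts.
freq = {
    frozenset("a"): 3,
    frozenset("b"): 2,
    frozenset("ab"): 2,
}
closed = closed_itemsets(freq)
maximal = maximal_itemsets(freq)
recovered_b = support_from_closed(frozenset("b"), closed)
```

The closed set drops `{b}` but its support can still be recovered exactly, whereas the maximal set keeps only `{a, b}` and no longer records that `{a}` occurs three times.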

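Few algorithms allow constraints to be changed dynamically during frequent pattern mining. A constraint-based frequent pattern mining algorithm, MCFP, is proposed. MCFP first builds a frequent pattern tree according to the properties of the constraints, scanning the database only once. It then builds a conditional tree for each item, mines the maximal frequent patterns prefixed by that item, and stores them in a maximal pattern tree. Finally, it derives all frequent patterns with exact support from the maximal patterns. MCFP allows users to change the constraints dynamically while frequent patterns are being mined. Experiments show that the algorithm is effective compared with the iCFP algorithm.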

Sliding window-based frequent pattern mining over data streams
Finding frequent patterns in a continuous stream of transactions is critical for many applications such as retail market data analysis, network monitoring, web usage mining, and stock market prediction. Even though numerous frequent pattern mining algorithms have been developed over the past decade, new solutions for handling stream data are still required due to the continuous, unbounded, and ordered sequence of data elements generated at a rapid rate in a data stream. Therefore, extracting frequent patterns from more recent data can enhance the analysis of stream data. In this paper, we propose an efficient technique to discover the complete set of recent frequent patterns from a high-speed data stream over a sliding window. We develop a Compact Pattern Stream tree (CPS-tree) to capture the recent stream data content and efficiently remove the obsolete, old stream data content. We also introduce the concept of dynamic tree restructuring in our CPS-tree to produce a highly compact frequency-descending tree structure at runtime. The complete set of recent frequent patterns is obtained from the CPS-tree of the current window using an FP-growth mining technique. Extensive experimental analyses show that our CPS-tree is highly efficient in terms of memory and time complexity when finding recent frequent patterns from a high-speed data stream.
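
The sliding-window behavior described here, where new transactions are added and obsolete ones are removed, can be illustrated with a minimal Python sketch. It keeps plain itemset counts instead of a CPS-tree and enumerates itemsets by brute force, so it shows only the window bookkeeping; the class name `SlidingWindowMiner` and its parameters are assumptions made for the example.

```python
from collections import Counter, deque
from itertools import combinations


class SlidingWindowMiner:
    """Maintain itemset counts over the most recent `window_size` transactions."""

    def __init__(self, window_size, max_size=2):
        self.window_size = window_size
        self.max_size = max_size
        self.window = deque()
        self.counts = Counter()

    def _itemsets(self, transaction):
        """Enumerate all itemsets of up to max_size items in a transaction."""
        items = sorted(transaction)
        for k in range(1, self.max_size + 1):
            yield from combinations(items, k)

    def add(self, transaction):
        """Add the newest transaction and evict the oldest if the window is full."""
        transaction = frozenset(transaction)
        self.window.append(transaction)
        for itemset in self._itemsets(transaction):
            self.counts[itemset] += 1
        if len(self.window) > self.window_size:
            obsolete = self.window.popleft()
            for itemset in self._itemsets(obsolete):
                self.counts[itemset] -= 1
                if self.counts[itemset] == 0:
                    del self.counts[itemset]

    def frequent(self, min_support):
        """Return the itemsets whose count in the current window meets min_support."""
        return {s: c for s, c in self.counts.items() if c >= min_support}


miner = SlidingWindowMiner(window_size=2)
for transaction in [{"a", "b"}, {"a"}, {"b"}]:
    miner.add(transaction)
recent = miner.frequent(min_support=1)
```

After the third transaction arrives, the first one has left the window, so the pair `("a", "b")` no longer appears among the recent patterns.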

Pattern Analysis and Applications - Frequent pattern (itemset) mining is one of the established approaches for knowledge discovery. Minimizing the number of database scans (I/O overhead) is a...

