首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Detecting Group Differences: Mining Contrast Sets   总被引:3,自引:0,他引:3  
A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that are surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.  相似文献   

2.
《Information Systems》1999,24(1):25-46
Discovering association rules is one of the most important task in data mining. Many efficient algorithms have been proposed in the literature. The most noticeable are Apriori, Mannila's algorithm, Partition, Sampling and DIC, that are all based on the Apriori mining method: pruning the subset lattice (itemset lattice). In this paper we propose an efficient algorithm, called Close, based on a new mining method: pruning the closed set lattice (closed itemset lattice). This lattice, which is a sub-order of the subset lattice, is closely related to Wille's concept lattice in formal concept analysis. Experiments comparing Close to an optimized version of Apriori showed that Close is very efficient for mining dense and/or correlated data such as census style data, and performs reasonably well for market basket style data.  相似文献   

3.
High-utility itemset mining (HUIM) is a popular data mining task with applications in numerous domains. However, traditional HUIM algorithms often produce a very large set of high-utility itemsets (HUIs). As a result, analyzing HUIs can be very time consuming for users. Moreover, a large set of HUIs also makes HUIM algorithms less efficient in terms of execution time and memory consumption. To address this problem, closed high-utility itemsets (CHUIs), concise and lossless representations of all HUIs, were proposed recently. Although mining CHUIs is useful and desirable, it remains a computationally expensive task. This is because current algorithms often generate a huge number of candidate itemsets and are unable to prune the search space effectively. In this paper, we address these issues by proposing a novel algorithm called CLS-Miner. The proposed algorithm utilizes the utility-list structure to directly compute the utilities of itemsets without producing candidates. It also introduces three novel strategies to reduce the search space, namely chain-estimated utility co-occurrence pruning, lower branch pruning, and pruning by coverage. Moreover, an effective method for checking whether an itemset is a subset of another itemset is introduced to further reduce the time required for discovering CHUIs. To evaluate the performance of the proposed algorithm and its novel strategies, extensive experiments have been conducted on six benchmark datasets having various characteristics. Results show that the proposed strategies are highly efficient and effective, that the proposed CLS-Miner algorithmoutperforms the current state-ofthe- art CHUD and CHUI-Miner algorithms, and that CLSMiner scales linearly.  相似文献   

4.
刘慧婷  沈盛霞  赵鹏  姚晟 《计算机应用》2015,35(10):2911-2914
由于不确定数据的向下封闭属性,挖掘全部频繁项集的方法会得到一个指数级的结果。为获得一个较小的合适的结果集,研究了在不确定数据上挖掘频繁闭项集,并提出了一种新的频繁闭项集挖掘算法——NA-PFCIM。该算法将项集挖掘过程看作一个概率分布函数,考虑到基于正态分布模型的方法提取的频繁项集精确度较高,而且支持大型数据库,采用了正态分布模型提取频繁项集。同时,为了减少搜索空间以及避免冗余计算,利用基于深度优先搜索的策略来获得所有的概率频繁闭项集。该算法还设计了两个剪枝策略:超集修剪和子集修剪。最后,在常用的数据集(T10I4D100K、Accidents、Mushroom、Chess)上,将提出的NA-PFCIM算法和基于泊松分布的A-PFCIM算法进行比较。实验结果表明,NA-PFCIM算法能够减少所要扩展的项集,同时减少项集频繁概率的计算,其性能优于对比算法。  相似文献   

5.
针对相关算法在挖掘频繁闭项集时所存在的问题, 提出了一种基于位运算的频繁闭项集挖掘算法。该算法首先将数据集转换成布尔矩阵, 只需扫描数据集一次; 通过位运算计算支持度, 利用矩阵和数组存储辅助信息, 减少时间和空间消耗; 深度优先搜索产生频繁闭项集时利用剪枝策略进一步减少挖掘时间; 利用同生项集性质进行闭合性检测, 无须检查超集或子集。理论分析和实验结果验证了该算法的有效性。  相似文献   

6.
7.
基于FP-Tree 的快速选择性集成算法   总被引:3,自引:1,他引:2  
赵强利  蒋艳凰  徐明 《软件学报》2011,22(4):709-721
选择性集成通过选择部分基分类器参与集成,从而提高集成分类器的泛化能力,降低预测开销.但已有的选择性集成算法普遍耗时较长,将数据挖掘的技术应用于选择性集成,提出一种基于FP-Tree(frequent pattern tree)的快速选择性集成算法:CPM-EP(coverage based pattern mining for ensemble pruning).该算法将基分类器对校验样本集的分类结果组织成一个事务数据库,从而使选择性集成问题可转化为对事务数据集的处理问题.针对所有可能的集成分类器大小,CPM-EP算法首先得到一个精简的事务数据库,并创建一棵FP-Tree树保存其内容;然后,基于该FP-Tree获得相应大小的集成分类器.在获得的所有集成分类器中,对校验样本集预测精度最高的集成分类器即为算法的输出.实验结果表明,CPM-EP算法以很低的计算开销获得优越的泛化能力,其分类器选择时间约为GASEN的1/19以及Forward-Selection的1/8,其泛化能力显著优于参与比较的其他方法,而且产生的集成分类器具有较少的基分类器.  相似文献   

8.
Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.  相似文献   

9.
翟俊海    刘博  张素芳 《智能系统学报》2017,12(3):397-404
特征选择是指从初始特征全集中,依据既定规则筛选出特征子集的过程,是数据挖掘的重要预处理步骤。通过剔除冗余属性,以达到降低算法复杂度和提高算法性能的目的。针对离散值特征选择问题,提出了一种将粗糙集相对分类信息熵和粒子群算法相结合的特征选择方法,依托粒子群算法,以相对分类信息熵作为适应度函数,并与其他基于进化算法的特征选择方法进行了实验比较,实验结果表明本文提出的方法具有一定的优势。  相似文献   

10.
In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.  相似文献   

11.
相比于集成学习,集成剪枝方法是在多个分类器中搜索最优子集从而改善分类器的泛化性能,简化集成过程。帕累托集成剪枝方法同时考虑了分类器的精准度及集成规模两个方面,并将二者均作为优化的目标。然而帕累托集成剪枝算法只考虑了基分类器的精准度与集成规模,忽视了分类器之间的差异性,从而导致了分类器之间的相似度比较大。本文提出了融入差异性的帕累托集成剪枝算法,该算法将分类器的差异性与精准度综合为第1个优化目标,将集成规模作为第2个优化目标,从而实现多目标优化。实验表明,当该改进的集成剪枝算法与帕累托集成剪枝算法在集成规模相当的前提下,由于差异性的融入该改进算法能够获得较好的性能。  相似文献   

12.
In this paper, we propose an efficient rule discovery algorithm, called FD_Mine, for mining functional dependencies from data. By exploiting Armstrong’s Axioms for functional dependencies, we identify equivalences among attributes, which can be used to reduce both the size of the dataset and the number of functional dependencies to be checked. We first describe four effective pruning rules that reduce the size of the search space. In particular, the number of functional dependencies to be checked is reduced by skipping the search for FDs that are logically implied by already discovered FDs. Then, we present the FD_Mine algorithm, which incorporates the four pruning rules into the mining process. We prove the correctness of FD_Mine, that is, we show that the pruning does not lead to the loss of useful information. We report the results of a series of experiments. These experiments show that the proposed algorithm is effective on 15 UCI datasets and synthetic data.  相似文献   

13.
14.
提出了KDD中数据预处理的一种基本算法.针对数据库中的属性,利用非监督学习算法,在获取了面向任务的目标数据子集的基础上,利用混合优化算法进行特征子集的选取.分析了遗传算法和混合遗传算法用于特征子集选择的基本算法,仿真实验说明了混合优化算法的有效性和可行性.  相似文献   

15.
关联规则挖掘是数据挖掘的重要领域之一,利用粗糙集理论来挖掘关联规则的方法已经得到广泛关注.针对不完备信息系统,提出了基于粗糙集理论的快速ORD关联规则挖掘算法.该算法首先采用基于粗糙集理论的属性约简算法进行属性约简,然后采用快速、高效的冗余项集和冗余规则修剪算法--ORD算法获取关联规则.将该算法与其它同类流行的算法在4个UCI数据集上进行实验比较,结果表明该算法性能良好.  相似文献   

16.
Ensemble pruning deals with the reduction of base classifiers prior to combination in order to improve generalization and prediction efficiency. Existing ensemble pruning algorithms require much pruning time. This paper presents a fast pruning approach: pattern mining based ensemble pruning (PMEP). In this algorithm, the prediction results of all base classifiers are organized as a transaction database, and FP-Tree structure is used to compact the prediction results. Then a greedy pattern mining method is explored to find the ensemble of size k. After obtaining the ensembles of all possible sizes, the one with the best accuracy is outputted. Compared with Bagging, GASEN, and Forward Selection, experimental results show that PMEP achieves the best prediction accuracy and keeps the size of the final ensemble small, more importantly, its pruning time is much less than other ensemble pruning algorithms.  相似文献   

17.
挖掘闭合模式的高性能算法   总被引:16,自引:1,他引:16  
频繁闭合模式集惟一确定频繁模式完全集并且尺寸小得多,然而挖掘频繁闭合模式仍然是时间与存储开销很大的任务.提出一种高性能算法来解决这一难题.采用复合型频繁模式树来组织频繁模式集,存储开销较小.通过集成深度与宽度优先策略,伺机选择基于数组或基于树的模式支持子集表示形式,启发式运用非过滤虚拟投影或过滤型投影,实现复合型频繁模式树的快速生成.局部和全局剪裁方法有效地缩小了搜索空间.通过树生成与剪裁代价的平衡实现时间效率与可伸缩性最大化.实验表明,该算法时间效率比其他算法高5倍到3个数量级,空间可伸缩性最佳.它可以进一步应用到无冗余关联规则发现、序列分析等许多数据挖掘问题.  相似文献   

18.
一种快速有效的分布式开采多层关联规则的算法   总被引:6,自引:0,他引:6  
关联规则(association rules)是数据开采的重要研究内容,建立项目的层次关系可以发现更加有意义的规则,主要研究分布式环境下开采多层关联规则的问题,提出了一种快速有效的MLFDM算法,采用的技术包括分布式编码交易表的有效修剪,侯选集的产生及修剪技术,侯选项集的全局支持数的计算方法等,论述了它的原理,具体实现方法及其几个改进算法,实验结果表明,算法MLFDM是有效的,并对MLFDM算法的几个变种进行了讨论。  相似文献   

19.
gMLC: a multi-label feature selection framework for graph classification   总被引:1,自引:1,他引:0  
Graph classification has been showing critical importance in a wide variety of applications, e.g. drug activity predictions and toxicology analysis. Current research on graph classification focuses on single-label settings. However, in many applications, each graph data can be assigned with a set of multiple labels simultaneously. Extracting good features using multiple labels of the graphs becomes an important step before graph classification. In this paper, we study the problem of multi-label feature selection for graph classification and propose a novel solution, called gMLC, to efficiently search for optimal subgraph features for graph objects with multiple labels. Different from existing feature selection methods in vector spaces that assume the feature set is given, we perform multi-label feature selection for graph data in a progressive way together with the subgraph feature mining process. We derive an evaluation criterion to estimate the dependence between subgraph features and multiple labels of graphs. Then, a branch-and-bound algorithm is proposed to efficiently search for optimal subgraph features by judiciously pruning the subgraph search space using multiple labels. Empirical studies demonstrate that our feature selection approach can effectively boost multi-label graph classification performances and is more efficient by pruning the subgraph search space using multiple labels.  相似文献   

20.
陈文 《计算机工程》2012,38(6):63-65
提出一种不产生候选项目集的加权频繁模式挖掘算法。对每个项目集权重进行归一化操作,避免加权支持率大于1,证明该算法满足加权向下封闭性。在此基础上,构建基于加权Fp树的剪枝策略。实例分析和实验结果表明,该算法能减少加权频繁项目集生成过程中的计算量,提高加权频繁项目集的生成效率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号