Mining non-derivable hypercliques
Authors:Anna Koufakou
Affiliation:1. U.A. Whitaker College of Engineering, Florida Gulf Coast University, Fort Myers, FL, USA
Abstract:A hyperclique (Xiong et al. in Proceedings of the IEEE international conference on data mining, pp 387–394, 2003) is an itemset whose items are strongly correlated with each other, as determined by a user-specified threshold. Hypercliques (HCs) have been successfully used in a number of applications, for example, clustering (Xiong et al. in Proceedings of the 4th SIAM international conference on data mining, pp 279–290, 2004) and noise removal (Xiong et al. in IEEE Trans Knowl Data Eng 18(3):304–319, 2006). Even though HC mining has been shown to cope well with datasets that have a skewed support distribution and with low support thresholds, the HC collection may still grow very large for dense datasets and low h-confidence thresholds. In this paper, we propose a new pruning method that combines HCs and non-derivable itemsets (NDIs) (Calders and Goethals in Proceedings of the PKDD international conference on principles of data mining and knowledge discovery, pp 74–85, 2002) in order to substantially reduce the number of generated HCs. Specifically, we propose a new collection of HCs, called non-derivable hypercliques (NDHCs). The NDHC collection is a lossless representation of HCs: given the itemsets in NDHCs, we can generate the complete HC collection and the support of each of its itemsets without additional scans of the dataset. We present an efficient algorithm, NDHCMiner, to mine all NDHC sets, and an algorithm, NDHCDeriveAll, to derive all HC sets and their supports from NDHCs. We experimentally compare the NDHC collection with the HC collection with respect to runtime performance and total number of generated sets, using real and artificial data. We also show comparisons with another condensed representation of HCs, maximal hyperclique patterns (MHPs). Our experiments show that the NDHC collection offers substantial advantages over HCs, and even MHPs, especially for dense datasets and lower h-confidence values.
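To make the hyperclique definition concrete, the following is a minimal Python sketch that assumes the standard h-confidence measure from the hyperclique literature: the support of an itemset divided by the largest support of any single item in it. The function names (`support`, `h_confidence`, `hypercliques`) and the toy transactions are hypothetical and chosen only for illustration. The enumeration is brute force; it is not NDHCMiner and does not perform any non-derivability pruning.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def h_confidence(itemset, transactions):
    """Support of the itemset divided by the largest single-item support in it."""
    max_item_support = max(support([item], transactions) for item in itemset)
    if max_item_support == 0:
        return 0.0
    return support(itemset, transactions) / max_item_support


def hypercliques(transactions, min_hconf, min_support=0.0):
    """Brute-force enumeration of itemsets (size >= 2) meeting both thresholds.

    Illustration only: exponential in the number of items, no pruning.
    """
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s > 0 and s >= min_support and h_confidence(candidate, transactions) >= min_hconf:
                found[candidate] = s
    return found


# Hypothetical toy dataset: five transactions over items a, b, c.
transactions = [
    frozenset(t)
    for t in ({"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}, {"a"})
]

result = hypercliques(transactions, min_hconf=0.7)
```

In this toy dataset, only the pair (a, b) reaches an h-confidence of 0.7, because its support (3/5) is close to the support of its most frequent item (4/5). Lowering `min_hconf` admits more itemsets, which is the growth in the HC collection that the abstract describes for low h-confidence thresholds.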
This article is indexed in SpringerLink and other databases.