Mining non-derivable hypercliques

Authors:	Anna Koufakou

Affiliation:	U.A. Whitaker College of Engineering, Florida Gulf Coast University, Fort Myers, FL, USA

Abstract:	A hyperclique (Xiong et al. in Proceedings of the IEEE international conference on data mining, pp 387–394, 2003) is an itemset whose items are strongly correlated with each other, as judged against a user-specified threshold. Hypercliques (HCs) have been successfully used in a number of applications, for example, clustering (Xiong et al. in Proceedings of the 4th SIAM international conference on data mining, pp 279–290, 2004) and noise removal (Xiong et al. in IEEE Trans Knowl Data Eng 18(3):304–319, 2006). Even though the HC collection has been shown to cope well with datasets that have a skewed support distribution and with low support thresholds, it may still grow very large for dense datasets and lower h-confidence thresholds. In this paper, we propose a new pruning method that combines HCs with non-derivable itemsets (NDIs) (Calders and Goethals in Proceedings of the PKDD international conference on principles of data mining and knowledge discovery, pp 74–85, 2002) in order to substantially reduce the number of generated HCs. Specifically, we propose a new collection of HCs, called non-derivable hypercliques (NDHCs). The NDHC collection is a lossless representation of HCs: given the itemsets in NDHCs, we can generate the complete HC collection and the support of each HC without additional scans of the dataset. We present an efficient algorithm to mine all NDHC sets, NDHCMiner, and an algorithm to derive all HC sets and their support from NDHCs, NDHCDeriveAll. We experimentally compare our collection, NDHC, with HC with respect to runtime performance and the total number of generated sets, using real and artificial data. We also show comparisons with another condensed representation of HCs, maximal hyperclique patterns (MHPs). Our experiments show that the NDHC collection offers substantial advantages over HCs, and even over MHPs, especially for dense datasets and lower h-confidence values.
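As a rough illustration of the h-confidence threshold mentioned in the abstract, the sketch below uses the usual definition from the hyperclique literature: the support of an itemset divided by the largest support of any single item in it. The abstract does not spell this formula out, so treat it as an assumption here. The function names (`support_count`, `h_confidence`, `is_hyperclique`) and the toy transactions are hypothetical, and this is not the paper's NDHCMiner algorithm, only a brute-force check on one itemset.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def h_confidence(itemset, transactions):
    """Assumed definition: supp(X) / max over items i in X of supp({i}).

    Counts are used instead of relative supports; the ratio is the same.
    """
    max_item_count = max(support_count({i}, transactions) for i in itemset)
    if max_item_count == 0:
        return 0.0  # no item of the set occurs in the data
    return support_count(itemset, transactions) / max_item_count


def is_hyperclique(itemset, transactions, min_hconf, min_count=1):
    """True if the itemset meets both the support and h-confidence thresholds."""
    return (support_count(itemset, transactions) >= min_count
            and h_confidence(itemset, transactions) >= min_hconf)


# Toy data (hypothetical): five market-basket style transactions.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c", "d"},
    {"a", "c"},
)]

print(h_confidence({"a", "b"}, transactions))          # 3 / 4
print(is_hyperclique({"a", "b"}, transactions, 0.7))
print(is_hyperclique({"a", "c"}, transactions, 0.7))
```

In this toy data, {a, b} occurs in 3 transactions and its most frequent item occurs in 4, so it passes a threshold of 0.7, while {a, c} (2 out of 4) does not. Lowering the threshold admits more itemsets, which is the growth the abstract describes for dense data.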

This article is indexed in SpringerLink and other databases.