Effects of resampling method and adaptation on clustering ensemble efficacy

Authors:	Behrouz Minaei-Bidgoli, Hamid Parvin, Hamid Alinejad-Rokny, Hosein Alizadeh, William F. Punch

Affiliation:	1. Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran; 2. Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran; 3. Department of Computer Science and Engineering, Michigan State University, 3115 Engineering Building, East Lansing, MI, 48824, USA; 4. 7 Tir Street, Tirkhatir Street, Kafshgarkola Street, Imam Square, Ghaemshahr, Mazandaran, 4761764467, Iran

Abstract:	Clustering ensembles combine multiple partitions of data into a single clustering solution of better quality. Inspired by the success of supervised bagging and boosting algorithms, we propose non-adaptive and adaptive resampling schemes for the integration of multiple independent and dependent clusterings. We investigate the effectiveness of bagging techniques, comparing the efficacy of sampling with and without replacement, in conjunction with several consensus algorithms. In our adaptive approach, individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given dataset. The sampling probability for each data point dynamically depends on the consistency of its previous assignments in the ensemble. New subsamples are then drawn to increasingly focus on the problematic regions of the input feature space. A measure of data point clustering consistency is therefore defined to guide this adaptation. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A meaningful consensus partition for an entire set of data points emerges from multiple clusterings of bootstraps and subsamples. Subsamples of small size can reduce computational cost and measurement complexity for many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of adaptive and non-adaptive clustering ensembles using different consensus functions on a number of datasets. By focusing attention on the data points with the least consistent clustering assignments, one can better approximate the inter-cluster boundaries, or at least create diversity in the boundaries, which improves clustering accuracy and convergence speed as a function of the number of partitions in the ensemble.
The comparison of adaptive and non-adaptive approaches is a new avenue for research, and this study helps to pave the way for the useful application of distributed data mining methods.
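
The adaptive scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact consistency measure, base clusterer, or consensus function, so everything specific below is an assumption made for illustration: a plain k-means base clusterer, subsampling without replacement, a co-association matrix as the ensemble summary, and a hypothetical label-free consistency score (how far a point's co-association values are from 0.5). The names `kmeans_labels`, `adaptive_ensemble`, `frac`, and the 0.1 probability floor are likewise hypothetical and not taken from the paper.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=20):
    """Plain Lloyd's k-means (assumed base clusterer); returns cluster labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels

def adaptive_ensemble(X, k, n_partitions=20, frac=0.5, seed=0):
    """Sequentially cluster subsamples, favouring inconsistently clustered points."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = int(frac * n)                 # subsample size
    together = np.zeros((n, n))       # times a pair shared a cluster
    seen = np.zeros((n, n))           # times a pair was drawn together
    p = np.full(n, 1.0 / n)           # start from uniform sampling
    for _ in range(n_partitions):
        idx = rng.choice(n, size=m, replace=False, p=p)
        labels = kmeans_labels(X[idx], k, rng)
        same = labels[:, None] == labels[None, :]
        together[np.ix_(idx, idx)] += same
        seen[np.ix_(idx, idx)] += 1
        # Co-association: fraction of joint draws in which a pair co-clustered.
        # Pairs never drawn together default to 0.5 (maximally uncertain).
        coassoc = np.divide(together, seen,
                            out=np.full((n, n), 0.5), where=seen > 0)
        # Assumed consistency score in [0, 1]: 1 when a point's co-association
        # values are all exactly 0 or 1, 0 when they all sit at 0.5.
        consistency = np.abs(2.0 * coassoc - 1.0).mean(axis=1)
        # Less consistent points get a higher sampling probability; the floor
        # keeps every point reachable.
        weights = (1.0 - consistency) + 0.1
        p = weights / weights.sum()
    return coassoc, p

# Usage on two synthetic, well-separated blobs (illustrative data only).
demo_rng = np.random.default_rng(1)
X_demo = np.vstack([demo_rng.normal(0.0, 0.5, (30, 2)),
                    demo_rng.normal(10.0, 0.5, (30, 2))])
coassoc, p = adaptive_ensemble(X_demo, k=2)
```

A consensus partition would then be obtained by applying a consensus function to the ensemble, for example hierarchical clustering on `1 - coassoc`. Setting `p` to uniform on every round, or drawing with `replace=True` and `size=n`, would give non-adaptive subsampling and bootstrap variants of the same loop.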

Keywords:
This article is indexed in SpringerLink and other databases.
