Clustering High Dimensional Massive Scientific Datasets期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Clustering High Dimensional Massive Scientific Datasets

Authors:	Ekow J. Otoo Arie Shoshani Seung-Won Hwang

Affiliation:	(1) Lawrence Berkeley National Laboratory, University of California, 1 Cyclotron Road, Berkeley, CA 94720, USA;(2) Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, IL 61801, USA

Abstract:	Many scientific applications can benefit from an efficient clustering algorithm of massively large high dimensional datasets. However most of the developed algorithms are impractical to use when the amount of data is very large. Given N objects each defined by an M-dimensional feature vector, any clustering technique for handling very large datasets in high dimensional space should run in time O(MN) at best, and O(MN log N) in the worst case, using no more than O(MN) storage, for it to be practical. We introduce a hybrid algorithm, called HyCeltyc, for clustering massively large high dimensional datasets in O(MN) time which is linear in the size of the data. HyCeltyc, which stands for HybridCell Density Clustering method, combines a cell-density based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, selection of significant features on which to cluster the data and a grid-based clustering algorihm that is linear in the data size.

Keywords:	clustering high dimensional scientific datasets
本文献已被 SpringerLink 等数据库收录！

设为首页 | 免责声明 | 关于勤云 | 加入收藏