Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets

Authors:

Antonio Ciampi Yves Lechevallier Manuel Castejón Limas Ana González Marcos

Affiliation:

(1) Department of Epidemiology and Biostatistics, McGill University, Montreal, P.Q., Canada; (2) INRIA-Rocquencourt, 87153 Le Chesnay Cedex, France; (3) Department of Mechanical, Informatical and Aerospace Engineering, Universidad de León, 24007 León, Spain

Abstract:

The problem of clustering subpopulations on the basis of samples is considered within a statistical framework: a distribution for the variables is assumed for each subpopulation, and the dissimilarity between any two subpopulations is defined as the likelihood ratio statistic that compares the hypothesis that the two subpopulations differ in the parameters of their distributions with the hypothesis that they do not. A general algorithm for constructing a hierarchical classification is described; it has the important property of producing no inversions in the dendrogram. The essential elements of the algorithm are specified for several well-known distributions (normal, multinomial and Poisson), and the general parametric case is outlined. Several applications are discussed, the main one being a novel two-step approach to massive data. The data are first clustered into a reasonable number of ‘bins’ by a fast algorithm such as k-means, and a version of our algorithm is then applied to the resulting bins. Multivariate normality is assumed for the means calculated on each bin; this is justified by the central limit theorem together with the assumption that each bin contains a large number of units, which generally holds for truly massive data. No assumption is made, however, about the data-generating distribution.
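As a rough illustration of the dissimilarity described in the abstract, the sketch below works through the Poisson case, where each sample is summarized by its total count and its size. The likelihood ratio statistic compares separate rates for the two samples against a single pooled rate. All function names (`poisson_lr`, `agglomerate`) and the toy data are hypothetical, and the greedy merging loop, which records the dissimilarity at the moment of merging as the height, is an assumption of this sketch and not necessarily the algorithm or the height index used by the authors.

```python
import math
from itertools import combinations


def _xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def poisson_lr(a, b):
    """Likelihood ratio statistic for two Poisson samples.

    Each sample is a pair (sum of counts, number of observations).
    The statistic is twice the gain in maximized log-likelihood from
    fitting two separate rates instead of one pooled rate.
    """
    (s1, n1), (s2, n2) = a, b
    s, n = s1 + s2, n1 + n2
    return 2.0 * (_xlogy(s1, s1 / n1) + _xlogy(s2, s2 / n2) - _xlogy(s, s / n))


def agglomerate(samples):
    """Greedily merge the closest pair until one cluster remains.

    Returns a list of (label_i, label_j, height) tuples. Merged clusters
    get new integer labels; their sufficient statistics are added.
    (Illustrative loop only; not taken from the paper.)
    """
    clusters = dict(enumerate(samples))
    merges = []
    next_label = len(samples)
    while len(clusters) > 1:
        i, j = min(
            combinations(clusters, 2),
            key=lambda p: poisson_lr(clusters[p[0]], clusters[p[1]]),
        )
        height = poisson_lr(clusters[i], clusters[j])
        (s1, n1), (s2, n2) = clusters.pop(i), clusters.pop(j)
        clusters[next_label] = (s1 + s2, n1 + n2)
        merges.append((i, j, height))
        next_label += 1
    return merges


# Toy data (made up): four subpopulations, two with low rates, two with high.
samples = [(20, 10), (22, 10), (50, 10), (55, 10)]
merges = agglomerate(samples)
for i, j, height in merges:
    print(f"merge {i} and {j} at height {height:.3f}")
```

Because the Poisson model has additive sufficient statistics, merging two clusters only requires summing their counts and sizes, so the raw data never need to be revisited. The same pattern would apply to the normal and multinomial cases with the corresponding statistics substituted.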


Antonio Ciampi received his M.Sc. and Ph.D. degrees from Queen's University, Kingston, Ontario, Canada in 1973. He taught at the University of Zambia from 1973 to 1977. Returning to Canada, he worked as a statistician in the Treasury of the Ontario Government. From 1978 to 1985, he was Senior Scientist at the Ontario Cancer Institute, Toronto, and taught at the University of Toronto. In 1985 he moved to Montreal, where he is Associate Professor in the Department of Epidemiology, Biostatistics and Occupational Health, McGill University. He has also been Senior Scientist at the Montreal Children's Hospital Research Institute, the Montreal Heart Institute and the St. Mary's Hospital Community Health Research Unit. His research interests include Statistical Learning, Data Mining and Statistical Modeling.


Yves Lechevallier joined the INRIA in 1976, where he was engaged in the project of Clustering and Pattern Recognition. Since 1988 he has been teaching Clustering, Neural Networks and Data Mining at the University of PARIS-IX, CNAM and ENSAE. He specializes in Mathematical Statistics, Applied Statistics, Data Analysis and Classification. His current research interests are: (1) clustering algorithms (Dynamic Clustering Method, Kohonen Maps, Divisive Clustering Method); (2) discrimination problems and decision tree methods; (3) building efficient neural networks by means of classification trees.


Manuel Castejón Limas received his engineering degree from the Universidad de Oviedo in 1999 and his Ph.D. degree from the Universidad de La Rioja in 2004. Since 2002 he has taught project management at the Universidad de León. His research is oriented towards the development of data analysis procedures that may aid project managers in their decision-making processes.


Ana González Marcos received her M.Sc. and Ph.D. degrees from the University of La Rioja, Spain. In 2003, she joined the University of León, Spain, where she works as a Lecturer in the Department of Mechanical, Informatic and Aerospace Engineering. Her research interests include the application of multivariate analysis and artificial intelligence techniques to improve the quality of industrial processes.


Keywords:

Cluster analysis; Binned data; Dissimilarity; Likelihood ratio statistic; Dendrogram; Large data sets

This article is indexed in SpringerLink and other databases.
