

Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets
Authors: Antonio Ciampi, Yves Lechevallier, Manuel Castejón Limas, Ana González Marcos
Affiliation:(1) Department of Epidemiology and Biostatistics, McGill University, Montreal, P.Q., Canada;(2) INRIA—Rocquencourt, 87153 Le Chesnay Cedex, France;(3) Department of Mechanical, Informatical and Aerospace Engineering, Universidad de León, 24007 León, Spain
Abstract:The problem of clustering subpopulations on the basis of samples is considered within a statistical framework: a distribution for the variables is assumed for each subpopulation and the dissimilarity between any two populations is defined as the likelihood ratio statistic which compares the hypothesis that the two subpopulations differ in the parameter of their distributions to the hypothesis that they do not. A general algorithm for the construction of a hierarchical classification is described which has the important property of not having inversions in the dendrogram. The essential elements of the algorithm are specified for the case of well-known distributions (normal, multinomial and Poisson) and an outline of the general parametric case is also discussed. Several applications are discussed, the main one being a novel approach to dealing with massive data in the context of a two-step approach. After clustering the data in a reasonable number of ‘bins’ by a fast algorithm such as k-Means, we apply a version of our algorithm to the resulting bins. Multivariate normality for the means calculated on each bin is assumed: this is justified by the central limit theorem and the assumption that each bin contains a large number of units, an assumption generally justified when dealing with truly massive data such as currently found in modern data analysis. However, no assumption is made about the data generating distribution.
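The second step of the two-step approach can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each bin is summarised by a hypothetical `(size, mean)` pair produced by a first pass such as k-Means, and that the data are spherical normal with a known common variance `sigma2`. Under those assumptions the likelihood ratio statistic for "the two means differ" against "the two means are equal" reduces to a size-weighted squared distance between the means, which coincides with Ward's merging criterion. The function names are illustrative only.

```python
from itertools import combinations


def lr_dissimilarity(a, b, sigma2=1.0):
    """2 * log likelihood ratio for H1: mu_a != mu_b versus H0: mu_a == mu_b.

    Assumption (not from the abstract): spherical normal data with a known
    common variance sigma2. Each argument is a (size, mean) pair.
    """
    n_a, mean_a = a
    n_b, mean_b = b
    sq_dist = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    return (n_a * n_b) / (n_a + n_b) * sq_dist / sigma2


def merge(a, b):
    """Pool two bins: sizes add, the mean is the size-weighted average."""
    n_a, mean_a = a
    n_b, mean_b = b
    n = n_a + n_b
    mean = tuple((n_a * x + n_b * y) / n for x, y in zip(mean_a, mean_b))
    return (n, mean)


def agglomerate(bins, sigma2=1.0):
    """Naive agglomerative clustering of bins.

    Repeatedly merges the pair with the smallest dissimilarity and returns
    a list of (left_id, right_id, new_id, height) tuples, one per merge.
    """
    clusters = dict(enumerate(bins))
    next_id = len(bins)
    history = []
    while len(clusters) > 1:
        i, j = min(
            combinations(clusters, 2),
            key=lambda p: lr_dissimilarity(clusters[p[0]], clusters[p[1]], sigma2),
        )
        height = lr_dissimilarity(clusters[i], clusters[j], sigma2)
        clusters[next_id] = merge(clusters.pop(i), clusters.pop(j))
        history.append((i, j, next_id, height))
        next_id += 1
    return history
```

For example, `agglomerate([(100, (0.0, 0.0)), (100, (1.0, 0.0)), (100, (10.0, 10.0)), (100, (12.0, 10.0))])` merges the two nearby pairs first and joins the two resulting groups last, with merge heights that increase at every step. The pairwise search here is quadratic in the number of bins at every merge, which is acceptable only because the first step has already reduced the data to a modest number of bins.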
Contact Information: Antonio Ciampi

Antonio Ciampi received his M.Sc. and Ph.D. degrees from Queen's University, Kingston, Ontario, Canada in 1973. He taught at the University of Zambia from 1973 to 1977. Returning to Canada, he worked as a statistician in the Treasury of the Ontario Government. From 1978 to 1985, he was Senior Scientist at the Ontario Cancer Institute, Toronto, and taught at the University of Toronto. In 1985 he moved to Montreal, where he is Associate Professor in the Department of Epidemiology, Biostatistics and Occupational Health, McGill University. He has also been Senior Scientist at the Montreal Children's Hospital Research Institute, the Montreal Heart Institute and the St. Mary's Hospital Community Health Research Unit. His research interests include Statistical Learning, Data Mining and Statistical Modeling.

Yves Lechevallier joined INRIA in 1976, where he was engaged in the project on Clustering and Pattern Recognition. Since 1988 he has been teaching Clustering, Neural Networks and Data Mining at the University of PARIS-IX, CNAM and ENSAE. He specializes in Mathematical Statistics, Applied Statistics, Data Analysis and Classification. His current research interests are: (1) clustering algorithms (Dynamic Clustering Method, Kohonen Maps, Divisive Clustering Method); (2) discrimination problems and decision tree methods; building an efficient neural network by classification tree.

Manuel Castejón Limas received his engineering degree from the Universidad de Oviedo in 1999 and his Ph.D. degree from the Universidad de La Rioja in 2004. Since 2002 he has taught project management at the Universidad de León. His research is oriented towards the development of data analysis procedures that may aid project managers in their decision-making processes.

Ana González Marcos received her M.Sc. and Ph.D. degrees from the University of La Rioja, Spain. In 2003, she joined the University of León, Spain, where she works as a Lecturer in the Department of Mechanical, Informatic and Aerospace Engineering. Her research interests include the application of multivariate analysis and artificial intelligence techniques to improve the quality of industrial processes.
Keywords:Cluster analysis  Binned data  Dissimilarity  Likelihood ratio statistic  Dendrogram  Large data sets
This article is indexed in SpringerLink and other databases.