On efficiently summarizing categorical databases |
| |
Authors: | Jianyong Wang George Karypis |
| |
Affiliation: | (1) Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China;(2) Department of Computer Science, Digital Technology Center and Army HPC Research Center, University of Minnesota, Minneapolis, MN 55455, USA |
| |
Abstract: | Frequent itemset mining was initially proposed and has been studied extensively in the context of association rule mining.
In recent years, several studies have also extended its application to transaction or document clustering. However, most of
the frequent itemset based clustering algorithms need to first mine a large intermediate set of frequent itemsets in order
to identify a subset of the most promising ones that can be used for clustering. In this paper, we study how to directly find
a subset of high quality frequent itemsets that can be used as a concise summary of the transaction database and to cluster
the categorical data. By exploring key properties of the subset of itemsets that we are interested in, we proposed several
search space pruning methods and designed an efficient algorithm called SUMMARY. Our empirical results show that SUMMARY runs
very fast even when the minimum support is extremely low and scales very well with respect to the database size, and surprisingly,
as a pure frequent itemset mining algorithm it is very effective in clustering the categorical data and summarizing the dense
transaction databases.
Jianyong Wang received the Ph.D. degree in computer science in 1999 from the Institute of Computing Technology, the Chinese Academy of
Sciences. Since then, he ever worked as an assistant professor in the Department of Computer Science and Technology, Peking
(Beijing) University in the areas of distributed systems and Web search engines, and visited the School of Computing Science
at Simon Fraser University, the Department of Computer Science at the University of Illinois at Urbana-Champaign, and the
Digital Technology Center and the Department of Computer Science at the University of Minnesota, mainly working in the area
of data mining. He is currently an associate professor of the Department of Computer Science and Technology at Tsinghua University,
P.R. China.
George Karypis received his Ph.D. degree in computer science at the University of Minnesota and he is currently an associate professor at
the Department of Computer Science and Engineering at the University of Minnesota. His research interests spans the areas
of parallel algorithm design, data mining, bioinformatics, information retrieval, applications of parallel processing in scientific
computing and optimization, sparse matrix computations, parallel preconditioners, and parallel programming languages and libraries.
His research has resulted in the development of software libraries for serial and parallel graph partitioning (METIS and ParMETIS),
hypergraph partitioning (hMETIS), for parallel Cholesky factorization (PSPASES), for collaborative filtering-based recommendation
algorithms (SUGGEST), clustering high dimensional datasets (CLUTO), and finding frequent patterns in diverse datasets (PAFI).
He has coauthored over ninety journal and conference papers on these topics and a book title “Introduction to Parallel Computing”
(Publ. Addison Wesley, 2003, 2nd edition). In addition, he is serving on the program committees of many conferences and workshops
on these topics and is an associate editor of the IEEE Transactions on Parallel and Distributed Systems. |
| |
Keywords: | Data mining Frequent itemset Categorical database Clustering |
本文献已被 SpringerLink 等数据库收录! |
|