A scalable algorithm for mining maximal frequent sequences using a sample |
| |
Authors: | Congnan Luo Soon M. Chung |
| |
Affiliation: | (1) Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA |
| |
Abstract: | In this paper, we propose an efficient scalable algorithm for mining Maximal Sequential Patterns using Sampling (MSPS). The MSPS algorithm reduces much more search space than other algorithms because both the subsequence infrequency-based pruning and the supersequence frequency-based pruning are applied. In MSPS, a sampling technique is used to identify long frequent sequences earlier, instead of enumerating all their subsequences. We propose how to adjust the user-specified minimum support level for mining a sample of the database to achieve better overall performance. This method makes sampling more efficient when the minimum support is small. A signature-based method and a hash-based method are developed for the subsequence infrequency-based pruning when the seed set of frequent sequences for the candidate generation is too big to be loaded into memory. A prefix tree structure is developed to count the candidate sequences of different sizes during the database scanning, and it also facilitates the customer sequence trimming. Our experiments showed MSPS has very good performance and better scalability than other algorithms. Congnan Luo received the B.E. degree in Computer Science from Tsinghua University, Beijing, P.R. China, in 1997, the M.S. degree in Computer Science from the Institute of Software, Chinese Academy of Sciences, Beijing, P.R. China, in 2000, and the Ph.D. degree in Computer Science and Engineering from Wright State University, Dayton, OH, in 2006. Currently he is a technical staff at the Teradata division of NCR in San Diego, CA, and his research interests include data mining, machine learning, and databases. Soon M. Chung received the B.S. degree in Electronic Engineering from Seoul National University, Korea, in 1979, the M.S. degree in Electrical Engineering from Korea Advanced Institute of Science and Technology, Korea, in 1981, and the Ph.D. degree in Computer Engineering from Syracuse University, Syracuse, New York, in 1990. He is currently a Professor in the Department of Computer Science and Engineering at Wright State University, Dayton, OH. His research interests include database, data mining, Grid computing, text mining, XML, and parallel and distributed processing. |
| |
Keywords: | Data mining Maximal frequent sequences Sampling Signatures Prefix tree Performance analysis |
本文献已被 SpringerLink 等数据库收录! |
|