首页 | 本学科首页   官方微博 | 高级检索  
     

Efficient Computation of κ-Medians over Data Streams Under Memory Constraints
引用本文:崇志宏,于旭,张振杰,林学民,王伟,周傲英.Efficient Computation of κ-Medians over Data Streams Under Memory Constraints[J].计算机科学技术学报,2006(2).
作者姓名:崇志宏  于旭  张振杰  林学民  王伟  周傲英
作者单位:Department of Computer Science and Engineering Fudan University,Shanghai 200433,P.R. China School of Computer Science and Engineering,University of New South Wales,Sydney 2052,Australia Department of Computer Science and Engineering,Southeast University,Nanjing 210096,P.R. China,Department of Systems Engineering and Engineering Management,The Chinese University of Hong Kong,Shatin N.T. Hong Kong Special Administration Region,P.R. China,School of Computing,National University of Singapore,117574 Singapore,School of Computer Science and Engineering,University of New South Wales,Sydney 2052,Australia,School of Computer Science and Engineering,University of New South Wales,Sydney 2052,Australia,Department of Computer Science and Engineering,Fudan University,Shanghai 200433,P.R. China
基金项目:This work is supported by the National High Technology Development 863 Program of China under Grant No. 2002AA413310 ARC Discovery Grant, Australia under Grant Nos. DP0346004 and DP0345710
摘    要:In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1-∈)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.

关 键 词:data  streams  κ-medians    cluster    data  mining
本文献已被 CNKI 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号