首页 | 本学科首页   官方微博 | 高级检索  
     


Compression-based data mining of sequential data
Authors:Eamonn Keogh  Stefano Lonardi  Chotirat Ann Ratanamahatana  Li Wei  Sang-Hee Lee  John Handley
Affiliation:(1) Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA;(2) Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand;(3) Department of Anthropology, University of California, Riverside, CA 92521, USA;(4) Xerox Innovation Group, Xerox Corporation, 800 Phillips Road, MS 128-25E, Webster, New York, 14580-9701, USA
Abstract:The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible. A parameter-light algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data-mining paradigm. The results are strongly connected to Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will show that this approach is competitive or superior to many of the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/XML/video datasets. As a further evidence of the advantages of our method, we will demonstrate its effectiveness to solve a real world classification problem in recommending printing services and products. Responsible editor: Johannes Gehrke
Keywords:Kolmogorov complexity  Parameter-free data mining  Anomaly detection  Clustering
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号