Capabilities of outlier detection schemes in large datasets, framework and methodologies
Authors: Jian Tang, Zhixiang Chen, Ada Waichee Fu, David W Cheung
Affiliation: (1) Department of Computer Science, Memorial University of Newfoundland, St. John's, Newfoundland, Canada; (2) Department of Computer Science, University of Texas-Pan American, Edinburg, Texas, USA; (3) Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, Hong Kong; (4) Department of Computer Science and Information Systems, University of Hong Kong, Pokfulam, Hong Kong
Abstract: Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principles and practical implementation lay a foundation for important applications such as credit card fraud detection, discovery of criminal behavior in e-commerce, and computer intrusion detection. In this paper, we first present a unified model for several existing outlier detection schemes and propose a compatibility theory, which establishes a framework for describing the capabilities of various outlier formulation schemes in terms of matching users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns are of comparable densities with the outliers. We then introduce a connectivity-based scheme that improves the effectiveness of the density-based scheme when a pattern itself is of similar density to an outlier. We compare the density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features where each of them is more effective than the other. Finally, the connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power and implementation-free metrics.

Jian Tang received an MS degree from the University of Iowa in 1983 and a PhD from the Pennsylvania State University in 1988, both from the Department of Computer Science. He joined the Department of Computer Science, Memorial University of Newfoundland, Canada, in 1988, where he is currently a professor. He has visited a number of research institutions to conduct research on a variety of topics relating to the theory and practice of database management and systems. His current research interests include data mining, e-commerce, XML and bioinformatics.

Zhixiang Chen is an associate professor in the Computer Science Department, University of Texas-Pan American. He received his PhD in computer science from Boston University in January 1996, and BS and MS degrees in software engineering from Huazhong University of Science and Technology. He also studied at the University of Illinois at Chicago. He taught at Southwest State University from Fall 1995 to September 1997, and at Huazhong University of Science and Technology from 1982 to 1990. His research interests include computational learning theory, algorithms and complexity, intelligent Web search, information retrieval, and data mining.

Ada Waichee Fu received her BSc degree in computer science from the Chinese University of Hong Kong in 1983, and her MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1986 and 1990, respectively. She worked at Bell Northern Research in Ottawa, Canada, from 1989 to 1993 on a wide-area distributed database project, and joined the Chinese University of Hong Kong in 1993. Her research interests are XML data, time series databases, data mining, content-based retrieval in multimedia databases, and parallel and distributed systems.

David Wai-lok Cheung received the MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1985 and 1989, respectively. He also received the BSc degree in mathematics from the Chinese University of Hong Kong. From 1989 to 1993, he was a member of the Scientific Staff at Bell Northern Research, Canada. Since 1994, he has been a faculty member of the Department of Computer Science at the University of Hong Kong. He is also the Director of the Center for E-Commerce Infrastructure Development. His research interests include data mining, data warehousing, XML technology for e-commerce, and bioinformatics. Dr. Cheung was the Program Committee Chairman of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001) and Program Co-Chair of the Ninth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005). He is a member of the ACM and the IEEE Computer Society.
Keywords: Outlier detection; Scheme capability; Distance-based outliers; Density-based outliers; Connectivity-based outliers; Performance metrics
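
The abstract names recall, precision and rank power as evaluation metrics but does not define them. The following is a minimal Python sketch of how such metrics might be computed for the top m objects returned by a detector. The function name `evaluate_top_m` and its parameters are hypothetical, and the rank power formula n(n+1) / (2 × sum of the ranks of the true outliers found) is an assumption about the definition, not something stated on this page.

```python
def evaluate_top_m(ranked_ids, true_outliers, m):
    """Score the top-m objects of a ranked outlier list (illustrative only).

    ranked_ids    -- object ids ordered from most to least outlying
    true_outliers -- set of ids that are known to be outliers
    m             -- number of top-ranked objects to evaluate
    """
    top = ranked_ids[:m]
    # 1-based rank positions of the true outliers that appear in the top m
    hit_ranks = [pos for pos, obj in enumerate(top, start=1)
                 if obj in true_outliers]
    n = len(hit_ranks)

    precision = n / m if m else 0.0
    recall = n / len(true_outliers) if true_outliers else 0.0
    # Assumed definition: equals 1.0 only when all n hits occupy ranks 1..n
    rank_power = n * (n + 1) / (2 * sum(hit_ranks)) if n else 0.0
    return precision, recall, rank_power


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e"]
    print(evaluate_top_m(ranked, {"a", "c", "x"}, m=4))
```

Under this assumed definition, two detectors with the same precision and recall can still differ in rank power, because it rewards placing the true outliers nearer the top of the list.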
This article is indexed in SpringerLink and other databases.