

Evaluating data mining procedures: techniques for generating artificial data sets
Abstract: In this article, we discuss the need to evaluate the performance of data mining procedures and argue that tests on real data sets cannot provide all the information needed for a thorough assessment of their performance characteristics. Artificial data sets are therefore essential. After discussing the desirable characteristics of such artificial data, we describe two pseudo-random generators. The first is based on the multivariate normal distribution and gives the investigator full control over the degree of correlation between the variables in the artificial data sets. The second is inspired by fractal techniques for synthesizing artificial landscapes and can produce data whose classification complexity is controlled by a single parameter. We conclude with a discussion of the additional work needed to reach the ultimate goal: a method for matching data sets to the most appropriate data mining technique.
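The two kinds of generator named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the equicorrelation covariance matrix, the 1-D midpoint-displacement scheme, and the `roughness` parameter (a stand-in for the single complexity parameter) are all assumptions made here for illustration, and the abstract does not specify any of them.

```python
import numpy as np


def correlated_normal(n_samples, n_vars, rho, seed=0):
    """Sample a zero-mean multivariate normal in which every pair of
    variables has correlation `rho` (an assumed, simple covariance choice).
    Requires rho > -1 / (n_vars - 1) so the matrix is positive definite."""
    cov = np.full((n_vars, n_vars), float(rho))
    np.fill_diagonal(cov, 1.0)  # unit variances, rho off the diagonal
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(n_vars), cov, size=n_samples)


def midpoint_displacement(levels, roughness, seed=0):
    """1-D midpoint displacement, a common fractal landscape technique.
    Returns 2**levels + 1 heights. `roughness` (hypothetical name) scales
    the random displacement at each finer level: values near 1 give a
    jagged profile, values near 0 a smooth one."""
    rng = np.random.default_rng(seed)
    heights = rng.normal(size=2)  # random endpoints
    scale = 1.0
    for _ in range(levels):
        # Midpoint of each segment, displaced by noise at the current scale.
        mids = 0.5 * (heights[:-1] + heights[1:])
        mids += rng.normal(scale=scale, size=mids.size)
        # Interleave old points and new midpoints.
        merged = np.empty(heights.size + mids.size)
        merged[0::2] = heights
        merged[1::2] = mids
        heights = merged
        scale *= roughness
    return heights


def label_by_height(heights):
    """Assign class 1 where the profile is above zero, class 0 otherwise."""
    return (heights > 0).astype(int)


def boundary_count(labels):
    """Number of class changes along the profile, a crude complexity proxy."""
    return int(np.count_nonzero(np.diff(labels)))


if __name__ == "__main__":
    X = correlated_normal(n_samples=5000, n_vars=3, rho=0.8)
    print(np.corrcoef(X, rowvar=False).round(2))
    for r in (0.3, 0.9):
        labels = label_by_height(midpoint_displacement(levels=8, roughness=r))
        print(r, boundary_count(labels))
```

In this sketch, `rho` plays the role of the investigator-controlled degree of correlation, and the number of class boundaries along the profile serves as a rough indicator of how hard the resulting classification problem is.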
This article is indexed in ScienceDirect and other databases.