Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond |
| |
Authors: | Kun Zhang, Wei Fan |
| |
Affiliation: | (1) Department of Computer Science, Xavier University, New Orleans, LA, USA; (2) IBM T.J. Watson Research, Hawthorne, NY, USA |
| |
Abstract: | Work on skewed, stochastic, high-dimensional, and biased datasets usually addresses each of these problems separately, if only
implicitly. Recently, we were approached by the Texas Commission on Environmental Quality (TCEQ) to help build highly accurate
ozone level alarm forecasting models for the Houston area, where these technical difficulties come together in a single problem.
The characteristics that make this problem challenging and interesting are that the dataset (1) is sparse (72 features, and 2
or 5% positives depending on the criteria for “ozone days”), (2) evolves over time from year to year, (3) is limited in size
(7 years, or around 2,500 data entries), (4) contains a large number of irrelevant features, and (5) is biased in the sense
of “sample selection bias”, and that (6) the true model is stochastic as a function of measurable factors. Besides posing a difficult
application problem, this dataset offers a unique opportunity to explore new and existing data mining techniques, and to provide
experience, guidance, and solutions for similar problems. Our main technical focus is on how to estimate reliable probabilities
given both sample selection bias and a large number of irrelevant features, and how to choose the most reliable decision threshold
for predicting an unknown future whose distribution differs from that of the training data. On the application side, our chosen
approach (bagging probabilistic decision trees and random decision trees) is 20% higher in recall (it correctly detects 1–3 more
ozone days, depending on the year) and 10% higher in precision (15–30 fewer false alarm days per year) than the state-of-the-art
methods used by air quality control scientists, and these results are significant for TCEQ. On the technical side of data
mining, extensive empirical results demonstrate that, at least for this problem, and probably for other problems with similar
characteristics, these two straightforward non-parametric methods can provide significantly more accurate and reliable solutions
than a number of sophisticated and well-known algorithms, such as SVM and AdaBoost, among many others. |
| |
Keywords: | Sample selection bias; Probability estimation; Skewed distribution; Streaming; Random decision tree |
This article is indexed in SpringerLink and other databases. |
|
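The abstract names two ingredients: averaging probability estimates over bagged probabilistic decision trees, and choosing a decision threshold for raising an alarm. The following is a minimal Python sketch of that general idea, not the authors' implementation. It makes several assumptions that do not come from the abstract: depth-one trees (stumps) stand in for full trees, leaf probabilities are Laplace-smoothed, the threshold rule (highest threshold that still reaches a target recall on held-out data) is only one plausible choice, and all function names and the toy data are hypothetical.

```python
import random

def fit_stump(X, y):
    """Fit a depth-one probabilistic tree (an assumed stand-in for a full tree).

    Tries every midpoint split on every feature, keeps the split with the
    lowest squared error, and stores a Laplace-smoothed P(y=1) in each leaf.
    Returns (feature_index, split_value, p_left, p_right).
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            p_left = (sum(left) + 1) / (len(left) + 2)
            p_right = (sum(right) + 1) / (len(right) + 2)
            err = (sum((yi - p_left) ** 2 for yi in left)
                   + sum((yi - p_right) ** 2 for yi in right))
            if best is None or err < best[0]:
                best = (err, j, t, p_left, p_right)
    if best is None:
        # No split is possible (all rows identical): fall back to one leaf.
        p = (sum(y) + 1) / (len(y) + 2)
        return (0, float("inf"), p, p)
    return best[1:]

def predict_stump(model, row):
    j, t, p_left, p_right = model
    return p_left if row[j] <= t else p_right

def fit_bagged(X, y, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_proba(models, row):
    """Average the per-model probability estimates."""
    return sum(predict_stump(m, row) for m in models) / len(models)

def choose_threshold(probs, labels, min_recall=0.8):
    """Highest threshold whose recall on the given data is >= min_recall.

    This rule is an assumption for illustration only.
    """
    positives = sum(labels)
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, l in zip(probs, labels) if p >= t and l == 1)
        if positives and tp / positives >= min_recall:
            return t
    return 0.0

# Toy data (hypothetical): one feature, positives are the rare high values.
X = [[i / 40] for i in range(40)]
y = [1 if i >= 34 else 0 for i in range(40)]
models = fit_bagged(X, y)
probs = [predict_proba(models, row) for row in X]
threshold = choose_threshold(probs, y, min_recall=0.8)
alarms = [p >= threshold for p in probs]
```

For brevity the sketch picks the threshold on the same toy data it was trained on. With a distribution that shifts from year to year, as the abstract describes, the threshold would more sensibly be chosen on data held out from training, for example an earlier year.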