COVID-19 Outbreak Prediction
Leakage-free pipeline on 188 daily records (Jan–Jul 2020). Target = daily new cases (stationary). Walk-forward TimeSeriesSplit CV. SEIR model + ARIMA + XGBoost + LSTM + Transformer. Fixes cumulative-count leakage from v1.
188 days of global COVID-19 data (Jan–Jul 2020)
Stationary daily-delta + walk-forward CV + SEIR + ARIMA + LSTM + Transformer
COVID-19 daily case forecasting that fixes the data leakage found in most published solutions.
4 Critical Fixes vs v1
| Issue | Fix |
|---|---|
| ML trained on cumulative counts (future trend leakage) | Target = daily new cases (stationary) |
| SEIR incubation ~1 day (biologically impossible) | Constrained: incubation 5–14 days |
| Random train/test split (future data in training) | Walk-forward TimeSeriesSplit CV |
| Transformer undertrained (2 epochs) | More epochs + cosine LR schedule |
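The first fix in the table, switching the target from cumulative counts to daily new cases, can be sketched as a simple first difference. This is a minimal illustration, not the project's actual preprocessing: the function name `daily_new_cases` and the toy numbers are assumptions made up for the example.

```python
import numpy as np

def daily_new_cases(cumulative):
    """Turn a cumulative count series into day-over-day increments."""
    cumulative = np.asarray(cumulative, dtype=float)
    # The first day has no predecessor, so the result is one element shorter.
    return np.diff(cumulative)

# Toy cumulative counts (illustrative only, not taken from the dataset)
cumulative = [10, 15, 27, 27, 40]
new_cases = daily_new_cases(cumulative)
print(new_cases)
```

Real reporting data can contain downward corrections, in which case the differenced series would include negative values that need separate handling.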
Dataset
- 188 daily records: 2020-01-22 → 2020-07-27
- Features: confirmed, deaths, recovered, active (global aggregate)
Models Compared
| Model | Type |
|---|---|
| SEIR | Compartmental epidemiological model (scipy optimization) |
| ARIMA | Auto (p,d,q), AIC criterion |
| Gradient Boosting | Walk-forward, lag features |
| XGBoost | Optuna-tuned |
| LSTM | Stateful, sequence-to-one |
| Transformer | Encoder-only, positional encoding |
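The lag features mentioned for the tree-based models could be built roughly as follows. This is a sketch under assumptions: the helper name `make_lag_features` and the choice of `n_lags` are hypothetical, and the source does not say which lags were actually used. Each row contains only values strictly before the target day, so no future information enters the features.

```python
import numpy as np

def make_lag_features(series, n_lags=7):
    """Build (X, y) where row t of X holds series[t-1], ..., series[t-n_lags]."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    # Column k-1 holds the value observed k days before the target day.
    X = np.column_stack(
        [series[n_lags - k : n - k] for k in range(1, n_lags + 1)]
    )
    y = series[n_lags:]
    return X, y

# Toy series 0..9 so the alignment is easy to check by eye
X, y = make_lag_features(np.arange(10), n_lags=3)
print(X[0], y[0])
```

The first `n_lags` days are dropped because they do not have a full history.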
SEIR Model
4 compartments: S→E→I→R. Parameters are fitted with scipy.optimize under bounds: β∈[0.1,1.0], σ∈[1/14,1/5] (incubation 5–14 days), γ∈[1/14,1/7] (infectious period 7–14 days). Initial conditions from early outbreak data.
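A bounded SEIR fit of this kind might look like the sketch below. Several pieces are assumptions rather than the project's code: the population size `N`, the initial state `y0`, the mean-squared-error loss, the use of `solve_ivp` for integration, and the choice of L-BFGS-B as the bounded optimizer. The "observed" series is synthetic, generated from the model itself, so that the example is self-contained.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

N = 1_000_000  # hypothetical population size, not from the source

# Bounds quoted in the text: beta, sigma (1/incubation), gamma (1/infectious period)
BOUNDS = [(0.1, 1.0), (1 / 14, 1 / 5), (1 / 14, 1 / 7)]

def seir_rhs(t, y, beta, sigma, gamma):
    """Right-hand side of the SEIR equations for state y = (S, E, I, R)."""
    S, E, I, R = y
    new_exposed = beta * S * I / N
    return [-new_exposed,
            new_exposed - sigma * E,
            sigma * E - gamma * I,
            gamma * I]

def simulate_daily_cases(params, y0, n_days):
    """Integrate the model and return the flow E -> I (new cases per day)."""
    beta, sigma, gamma = params
    sol = solve_ivp(seir_rhs, (0, n_days - 1), y0,
                    t_eval=np.arange(n_days),
                    args=(beta, sigma, gamma), rtol=1e-8, atol=1e-8)
    return sigma * sol.y[1]

def loss(params, observed, y0):
    """Mean squared error between simulated and observed daily cases."""
    predicted = simulate_daily_cases(params, y0, len(observed))
    return float(np.mean((predicted - observed) ** 2))

# Synthetic "observations" from known parameters (illustrative only)
y0 = [N - 10, 0.0, 10.0, 0.0]
observed = simulate_daily_cases((0.4, 1 / 6, 1 / 10), y0, 60)

start = [0.6, 1 / 10, 1 / 8]  # arbitrary starting guess inside the bounds
result = minimize(loss, start, args=(observed, y0),
                  method="L-BFGS-B", bounds=BOUNDS)
beta_hat, sigma_hat, gamma_hat = result.x
print(f"incubation ~ {1 / sigma_hat:.1f} days")
```

Because σ is bounded to [1/14, 1/5], the fitted incubation period 1/σ cannot fall below 5 days, which is what rules out the ~1 day value reported for v1.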
Walk-Forward CV
Expanding window: each fold trains on all prior data and predicts one step ahead. Evaluation has to respect time order here, because a random split would leak future case counts into training.
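An expanding-window evaluation with scikit-learn's `TimeSeriesSplit` could be sketched as follows. The data here is a synthetic stand-in for the 188-day series, and the lag count, number of splits, and model settings are assumptions chosen for illustration. Note that with the default settings each test fold is a block of consecutive days rather than a single day. Every prediction inside the block is still one step ahead, since its lag features are observed values, but the model is refit only once per fold.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the daily new-case series (not the real data)
rng = np.random.default_rng(0)
series = rng.poisson(lam=np.linspace(50, 500, 188)).astype(float)

# Lag features: each row uses only the 7 days before the target day
n_lags = 7
X = np.column_stack(
    [series[n_lags - k : len(series) - k] for k in range(1, n_lags + 1)]
)
y = series[n_lags:]

fold_mae = []
fold_bounds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Training indices always precede test indices, so nothing leaks backwards.
    fold_bounds.append((train_idx.max(), test_idx.min()))
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print([round(m, 1) for m in fold_mae])
```

Reporting the per-fold errors, rather than a single pooled score, also shows whether accuracy degrades as the outbreak moves into later phases.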