Time Series

COVID-19 Outbreak Prediction

Leakage-free pipeline on 188 daily records (Jan–Jul 2020). Target = daily new cases (stationary). Walk-forward TimeSeriesSplit CV. SEIR model + ARIMA + XGBoost + LSTM + Transformer. Fixes cumulative-count leakage from v1.

Dataset

188 days COVID-19 global data (Jan–Jul 2020)

Approach

Stationary daily-delta + walk-forward CV + SEIR + ARIMA + LSTM + Transformer

Tech Stack

Python, Scikit-learn, TensorFlow, scipy (SEIR), statsmodels

Keywords

SEIR, LSTM, Transformer, Epidemiology, Walk-forward, ARIMA, TimeSeriesSplit

Deep Dive

Daily COVID-19 case forecasting, with fixes for the data leakage found in many published solutions.

4 Critical Fixes vs v1

Issue	Fix
ML trained on cumulative counts (future trend leakage)	Target = daily new cases (stationary)
SEIR incubation ~1 day (biologically impossible)	Constrained: incubation 5–14 days
Random train/test split (future data in training)	Walk-forward TimeSeriesSplit CV
Transformers undertrained (2 epochs)	More epochs + cosine LR
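The first fix in the table, switching the target from cumulative counts to daily new cases, can be sketched as a first difference. This is a minimal illustration rather than the project's actual code: the function name `daily_new_cases` is hypothetical, and clipping negative deltas (which can appear when reported totals are revised downward) is an assumption, not something the write-up states.

```python
import numpy as np


def daily_new_cases(cumulative):
    """Convert a cumulative case series into daily new cases.

    The result has one element fewer than the input, because the first
    day has no previous day to subtract from.
    """
    cumulative = np.asarray(cumulative, dtype=float)
    daily = np.diff(cumulative)
    # Assumption: downward revisions of the cumulative total are treated
    # as zero new cases instead of negative counts.
    return np.clip(daily, 0.0, None)


# Toy cumulative series (not real data).
toy_cumulative = [10, 15, 15, 30]
toy_daily = daily_new_cases(toy_cumulative)
```

A model trained on the cumulative series can score well simply by extrapolating a monotone trend, which is why the differenced series is the harder and more honest target.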

Dataset

▸188 daily records: 2020-01-22 → 2020-07-27
▸Features: confirmed, deaths, recovered, active (global aggregate)

Models Compared

Model	Type
SEIR	Compartmental epidemiological model (scipy optimization)
ARIMA	Automatic (p,d,q) order selection by AIC
Gradient Boosting	Walk-forward, lag features
XGBoost	Optuna-tuned
LSTM	Stateful, sequence-to-one
Transformer	Encoder-only, positional encoding
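The lag features used by the boosting models can be built so that every input row only contains values from days before its target. The sketch below is one possible construction, with assumptions: the helper name `make_lag_features` and the default of 7 lags are illustrative and are not taken from the project.

```python
import numpy as np


def make_lag_features(series, n_lags=7):
    """Build a lag matrix X and target vector y from a 1-D series.

    Row j of X holds the n_lags values immediately before y[j],
    most recent first, so no row sees its own target or anything later.
    """
    series = np.asarray(series, dtype=float)
    if len(series) <= n_lags:
        raise ValueError("series must be longer than n_lags")
    # Column k holds the value (k + 1) days before the target day.
    columns = [
        series[n_lags - k - 1 : len(series) - k - 1] for k in range(n_lags)
    ]
    X = np.column_stack(columns)
    y = series[n_lags:]
    return X, y


# Toy series 0..9 with three lags.
X_toy, y_toy = make_lag_features(np.arange(10), n_lags=3)
```

The first `n_lags` days are dropped because they do not have a full history, which slightly shortens an already small series of 188 records.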

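SEIR Model

4 compartments: S→E→I→R. Parameters are fitted with scipy.optimize under bounds: β∈[0.1,1.0], σ∈[1/14,1/5], γ∈[1/14,1/7] (all per day). The σ bounds correspond to an incubation period of 5–14 days and the γ bounds to an infectious period of 7–14 days. Initial conditions from early outbreak data.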
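A bounded SEIR fit along these lines might look like the sketch below. Several choices here are assumptions rather than details from the project: the ODE is integrated with `scipy.integrate.solve_ivp`, the fit uses `scipy.optimize.least_squares`, daily new cases are approximated by the E→I flow σ·E, and the data is synthetic, generated from known parameters purely to exercise the code.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

# Bounds from the text, in order (beta, sigma, gamma), all per day.
LOWER = np.array([0.1, 1 / 14, 1 / 14])
UPPER = np.array([1.0, 1 / 5, 1 / 7])


def seir_rhs(t, y, beta, sigma, gamma, population):
    """Right-hand side of the SEIR equations."""
    s, e, i, r = y
    infection = beta * s * i / population
    return [-infection, infection - sigma * e, sigma * e - gamma * i, gamma * i]


def simulate(params, y0, population, days):
    """Integrate the model and return states with shape (4, days)."""
    beta, sigma, gamma = params
    sol = solve_ivp(
        seir_rhs, (0, days - 1), y0, t_eval=np.arange(days),
        args=(beta, sigma, gamma, population), rtol=1e-8, atol=1e-8,
    )
    return sol.y


def predicted_new_cases(params, y0, population, days):
    """Assumption: daily incidence is approximated by sigma * E."""
    return params[1] * simulate(params, y0, population, days)[1]


def fit_seir(observed, y0, population):
    """Least-squares fit of (beta, sigma, gamma) inside the bounds."""
    observed = np.asarray(observed, dtype=float)

    def residuals(params):
        return predicted_new_cases(params, y0, population, len(observed)) - observed

    start = (LOWER + UPPER) / 2  # midpoint of the box as a neutral start
    return least_squares(residuals, start, bounds=(LOWER, UPPER)).x


# Synthetic example: hypothetical population and initial conditions.
population = 1_000_000
y0 = [population - 15, 10, 5, 0]
true_params = np.array([0.4, 1 / 6, 1 / 10])
synthetic = predicted_new_cases(true_params, y0, population, days=60)
fitted = fit_seir(synthetic, y0, population)
```

Because the optimizer works inside the box, the fitted incubation period 1/σ cannot collapse to about one day, which is the failure described in the fixes table.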

Walk-Forward CV

Expanding window: each fold trains on all prior data and predicts one step ahead. This mirrors how a forecast is used in practice, where only the past is available. A random split would leak future case counts into training and inflate the scores.
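The expanding-window scheme can be sketched with scikit-learn's `TimeSeriesSplit`. Setting `test_size=1` (available in scikit-learn 0.24 and later) gives one-step-ahead folds. The model (`LinearRegression`), the helper name `walk_forward_mae`, the two lag columns, and the random-walk data are all stand-ins chosen for illustration, not the project's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit


def walk_forward_mae(X, y, n_splits=5):
    """One-step-ahead expanding-window evaluation.

    Returns the mean absolute error over the folds and the list of
    (train_idx, test_idx) pairs so the split can be inspected.
    """
    splitter = TimeSeriesSplit(n_splits=n_splits, test_size=1)
    errors, folds = [], []
    for train_idx, test_idx in splitter.split(X):
        # Each fold refits on everything strictly before the test day.
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        prediction = model.predict(X[test_idx])[0]
        errors.append(abs(prediction - y[test_idx][0]))
        folds.append((train_idx, test_idx))
    return float(np.mean(errors)), folds


# Synthetic random-walk series standing in for daily new cases.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=60)) + 100.0
X = np.column_stack([series[1:-1], series[:-2]])  # lag-1 and lag-2 values
y = series[2:]
mae, folds = walk_forward_mae(X, y, n_splits=5)
```

With the default `test_size`, `TimeSeriesSplit` instead evaluates on a block of consecutive days per fold, which is cheaper but is no longer strictly one step ahead.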
