Time Series

COVID-19 Outbreak Prediction

Leakage-free pipeline on 188 daily records (Jan–Jul 2020). Target = daily new cases (first difference of the cumulative count, much closer to stationary). Walk-forward TimeSeriesSplit CV. Models: SEIR, ARIMA, XGBoost, LSTM, Transformer. Fixes cumulative-count leakage from v1.

Dataset

188 days COVID-19 global data (Jan–Jul 2020)

Approach

Stationary daily-delta + walk-forward CV + SEIR + ARIMA + LSTM + Transformer

Tech Stack
Python, Scikit-learn, TensorFlow, scipy (SEIR), statsmodels
Keywords
SEIR, LSTM, Transformer, Epidemiology, Walk-forward, ARIMA, TimeSeriesSplit
Visualizations: 6 Charts
Deep Dive

COVID-19 daily case forecasting, with fixes for the data leakage found in many published solutions.

4 Critical Fixes vs v1

Issue | Fix
ML trained on cumulative counts (future trend leakage) | Target = daily new cases (stationary)
SEIR incubation ~1 day (biologically impossible) | Constrained: incubation 5–14 days
Random train/test split (future data in training) | Walk-forward TimeSeriesSplit CV
Transformers undertrained (2 epochs) | More epochs + cosine LR

Dataset

  • 188 daily records: 2020-01-22 → 2020-07-27
  • Features: confirmed, deaths, recovered, active (global aggregate)
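
A minimal sketch of how the daily-new-cases target could be derived from the cumulative confirmed column. The function name, the toy values, and the choice to clip negative differences (which can arise from reporting corrections) are assumptions for illustration, not taken from the project code.

```python
import numpy as np


def daily_new_cases(cumulative):
    """Turn a cumulative count series into daily increments.

    Assumptions: the first day's increment equals its cumulative value, and
    negative differences (downward reporting corrections) are clipped to zero.
    """
    cumulative = np.asarray(cumulative, dtype=float)
    daily = np.diff(cumulative, prepend=0.0)
    return np.clip(daily, 0.0, None)


# Toy cumulative counts, not real data.
confirmed = [10, 15, 15, 27, 40]
new_cases = daily_new_cases(confirmed)
print(new_cases)
```

Differencing removes the monotone upward trend of the cumulative series, so a model can no longer score well simply by extrapolating that trend.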

Models Compared

Model | Type
SEIR | Compartmental epidemiological model (scipy optimization)
ARIMA | Auto (p,d,q), AIC criterion
Gradient Boosting | Walk-forward, lag features
XGBoost | Optuna-tuned
LSTM | Stateful, sequence-to-one
Transformer | Encoder-only, positional encoding

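SEIR Model

4 compartments: S→E→I→R. Parameters are fit with scipy.optimize under bounds (all rates per day): β∈[0.1,1.0], σ∈[1/14,1/5] (incubation period 5–14 days), γ∈[1/14,1/7] (infectious period 7–14 days). Initial conditions are taken from early outbreak data.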
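
The bounded fit described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: it assumes the model is fit to daily new cases via the E→I flow (σE), uses `scipy.optimize.least_squares` as one of several bounded optimizers in scipy, and runs on a synthetic curve with a hypothetical population and initial state.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

# Bounds from the text, in units of 1/day: (beta, sigma, gamma).
LOWER = np.array([0.1, 1 / 14, 1 / 14])
UPPER = np.array([1.0, 1 / 5, 1 / 7])


def seir_rhs(t, y, beta, sigma, gamma, population):
    """Right-hand side of the SEIR equations for state y = (S, E, I, R)."""
    s, e, i, r = y
    infection = beta * s * i / population
    return [-infection, infection - sigma * e, sigma * e - gamma * i, gamma * i]


def simulate_incidence(params, y0, n_days, population):
    """Daily E->I flow (sigma * E), used here as the model's new-case curve."""
    beta, sigma, gamma = params
    sol = solve_ivp(
        seir_rhs, (0, n_days - 1), y0, t_eval=np.arange(n_days),
        args=(beta, sigma, gamma, population), rtol=1e-8, atol=1e-8,
    )
    return sigma * sol.y[1]


def fit_seir(observed, y0, population, start=(0.4, 1 / 7, 1 / 10)):
    """Bounded least-squares fit of (beta, sigma, gamma) to daily new cases."""
    def residuals(params):
        return simulate_incidence(params, y0, len(observed), population) - observed

    return least_squares(residuals, start, bounds=(LOWER, UPPER))


# Synthetic demo: hypothetical population and initial state, not real data.
population = 1_000_000.0
y0 = [population - 15.0, 10.0, 5.0, 0.0]
true_params = (0.5, 1 / 6, 1 / 9)
observed = simulate_incidence(true_params, y0, 60, population)
result = fit_seir(observed, y0, population)
print(result.x)  # fitted (beta, sigma, gamma), guaranteed to lie within the bounds
```

Passing the bounds to the optimizer, rather than checking the fitted values afterwards, is what keeps the incubation period 1/σ inside the 5–14 day range regardless of what the data alone would prefer.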

Walk-Forward CV

Expanding window: each fold trains on all prior data and predicts one step ahead. Evaluation has to respect temporal order, because a random split would leak future case counts into training.
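
A minimal sketch of an expanding-window, one-step-ahead evaluation using scikit-learn's `TimeSeriesSplit` with `test_size=1`. The helper names, the lag count, the number of folds, the synthetic series, and the use of `LinearRegression` as a stand-in for the gradient-boosting models are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit


def make_lag_features(series, n_lags):
    """Build rows [y(t-n_lags), ..., y(t-1)] with target y(t)."""
    series = np.asarray(series, dtype=float)
    columns = [series[i: len(series) - n_lags + i] for i in range(n_lags)]
    return np.column_stack(columns), series[n_lags:]


def walk_forward_mae(series, n_lags=7, n_splits=20):
    """Mean absolute error of one-step-ahead forecasts over expanding folds."""
    X, y = make_lag_features(series, n_lags)
    splitter = TimeSeriesSplit(n_splits=n_splits, test_size=1)
    errors = []
    for train_idx, test_idx in splitter.split(X):
        # Every training index precedes the single test index, so no
        # future observation is visible when the model is fit.
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        prediction = model.predict(X[test_idx])[0]
        errors.append(abs(prediction - y[test_idx][0]))
    return float(np.mean(errors))


# Synthetic series with the same length as the dataset (188 days), not real data.
rng = np.random.default_rng(0)
days = np.arange(188)
synthetic_cases = 1000.0 + 50.0 * days + rng.normal(0.0, 200.0, size=days.size)
print(walk_forward_mae(synthetic_cases))
```

Because the lag features for day t only contain values from days before t, the features themselves do not leak the target; the splitter then ensures the model is never trained on rows later than the one it predicts.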