NLP

Fake News Detection

A 13-model NLP pipeline on 44,898 news articles. The Soft Voting and Stacking ensembles both reach 99.86% accuracy (AUC = 1.0), with only 2 errors on the full test set. DistilBERT, fine-tuned on a 6K-article subset, matches them at 99.87%.


99.86%

Voting/Stacking Acc

1.0000

Linear SVC AUC

99.87%

DistilBERT Accuracy

2

Total test errors

Dataset

44,898 articles (21K real + 23K fake), 70/15/15 split

Approach

Combined TF-IDF (word + char n-grams) → 13-model benchmark → transformer fine-tuning

Tech Stack

Python, Scikit-learn, XGBoost, LightGBM, HuggingFace DistilBERT, NLTK

Keywords

LinearSVC, TF-IDF, XGBoost, LightGBM, DistilBERT, Voting Ensemble, Stacking

Visualizations: 6 Charts

Deep Dive

A comprehensive fake news detection benchmark comparing classical ML models and transformers on a near-balanced 44,898-article dataset.

Dataset

▸ 21,417 real + 23,481 fake news articles
▸ 70/15/15 stratified train/val/test split
▸ Features: TF-IDF word n-grams (1–2, 50K features) combined with char n-grams (3–5, 30K features)
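
The split and feature setup listed above could be sketched with scikit-learn roughly as follows. This is a minimal illustration on a toy corpus, not the project's actual code: the `texts` and `labels` variables, the random seed, and the choice of `analyzer="char"` (rather than `"char_wb"`) are assumptions, while the split ratios, n-gram ranges, and feature caps come from the bullets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion

# Toy stand-in corpus (hypothetical); the real dataset has 44,898 articles.
real = [f"reuters reports official statement number {i}" for i in range(20)]
fake = [f"SHOCKING!!! you won't believe secret number {i}" for i in range(20)]
texts = real + fake
labels = [0] * 20 + [1] * 20  # assumed encoding: 0 = real, 1 = fake

# 70/15/15 stratified split: carve off 30%, then halve it into val and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Word 1-2 grams (50K cap) and character 3-5 grams (30K cap), concatenated
# column-wise into a single sparse feature matrix.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                             max_features=50_000)),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                             max_features=30_000)),
])

# Fit the vocabularies on the training split only, then reuse them.
Xtr = features.fit_transform(X_train)
Xva = features.transform(X_val)
Xte = features.transform(X_test)
print(Xtr.shape, Xva.shape, Xte.shape)
```

Fitting the vectorizers on the training split alone keeps validation and test vocabulary out of the features, which matters when reported accuracy is this close to 100%.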

Full 13-Model Benchmark

Model	Accuracy	AUC
Complement NB	96.52%	0.9936
Logistic Regression	99.65%	0.9999
Linear SVC	99.81%	1.0000
SGD Classifier	99.72%	1.0000
Decision Tree	99.63%	0.9950
Random Forest	99.70%	0.9998
Extra Trees	99.37%	0.9997
XGBoost	99.83%	0.9997
LightGBM	99.81%	0.9996
Soft Voting	99.86%	1.0000
Stacking	99.86%	1.0000
BiLSTM	98.5%	—
DistilBERT	99.87%	0.9999
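
The two ensemble rows can be illustrated with scikit-learn's `VotingClassifier` and `StackingClassifier`. The sketch below is hedged: the page does not say which base models fed the ensembles, so the three used here, the toy corpus, and the fold counts are all assumptions chosen to keep the example small.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical), standing in for the real TF-IDF matrix.
real = [f"reuters officials said on tuesday report {i}" for i in range(30)]
fake = [f"SHOCKING secret they hide from you story {i}" for i in range(30)]
y = [0] * 30 + [1] * 30  # assumed encoding: 0 = real, 1 = fake
X = TfidfVectorizer().fit_transform(real + fake)

def base_models():
    # LinearSVC has no predict_proba, so it is wrapped in a calibrator
    # to make it usable for soft (probability-averaging) voting.
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", CalibratedClassifierCV(LinearSVC(), cv=3)),
        ("cnb", ComplementNB()),
    ]

# Soft voting: average the predicted class probabilities of the base models.
voting = VotingClassifier(estimators=base_models(), voting="soft").fit(X, y)

# Stacking: a meta-learner is trained on out-of-fold base-model predictions.
stacking = StackingClassifier(
    estimators=base_models(),
    final_estimator=LogisticRegression(),
    cv=3,
).fit(X, y)

# Training-set accuracy on toy data, shown only to confirm the models fit.
print(voting.score(X, y), stacking.score(X, y))
```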

Error Analysis

On the full test set: 1 false positive + 1 false negative. The dataset carries strong source signals (Reuters/AP wire-service language vs. conspiracy-style language) that the combined TF-IDF features capture almost perfectly.
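
A false-positive/false-negative breakdown like the one above can be derived from predictions in a few lines. The sketch below uses made-up labels, and treating "fake" as the positive class (label 1) is an assumption, since the page does not state the encoding.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the 'fake' class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # real article flagged as fake
        else:
            if truth == positive:
                fn += 1  # fake article that slipped through
            else:
                tn += 1
    return tp, fp, fn, tn

# Hypothetical labels, only to exercise the function.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)

# Indices of misclassified items, useful for reading the errors by hand.
errors = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
print(tp, fp, fn, tn, accuracy, errors)
```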

Why Combined TF-IDF Beats Standalone

Word n-grams capture semantic content; character n-grams capture writing-style artifacts (punctuation abuse, ALL-CAPS, unusual spacing). Combining both puts every classical model except Complement NB above 99.3%, and the strongest ones (Linear SVC, XGBoost, LightGBM, and both ensembles) above 99.8%.
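
To make the style argument concrete, here is a small pure-Python sketch of character n-gram counting. It is an illustration of the idea, not the project's feature code; the function name and both example sentences are invented.

```python
from collections import Counter

def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams of length n_min..n_max, case preserved."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

# Two invented sentences in contrasting styles.
wire = char_ngrams("Officials said on Tuesday.")
shout = char_ngrams("BREAKING!!! You WON'T believe this!!!")

# Punctuation runs and upper-case fragments appear only in the second style.
print(shout["!!!"], wire["!!!"])
print(shout["ING"], wire["ING"])
```

Note that scikit-learn's `TfidfVectorizer` lowercases text by default (`lowercase=True`), so an ALL-CAPS signal survives in the features only when that option is turned off. The page does not say which setting was used.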

DistilBERT Finding

Fine-tuned on only a 6K-article subset, DistilBERT reaches 99.87% accuracy. This suggests that a fine-tuned transformer can match classical models trained on the full dataset while using far less labeled data.
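
The page does not say how the 6K subset was drawn. One plausible approach, sketched below under that assumption, is a stratified sample that preserves the real/fake ratio of the full dataset; the function name, seed, and label encoding are hypothetical.

```python
import random
from collections import defaultdict

def stratified_subset(labels, n_total, seed=0):
    """Pick about n_total indices while keeping the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    chosen = []
    for label, idxs in sorted(by_class.items()):
        # Each class contributes in proportion to its share of the data.
        k = round(n_total * len(idxs) / len(labels))
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)

# Class sizes from the dataset description: 21,417 real and 23,481 fake.
labels = [0] * 21_417 + [1] * 23_481  # assumed encoding: 0 = real, 1 = fake
subset = stratified_subset(labels, 6_000)

n_real = sum(1 for i in subset if labels[i] == 0)
print(len(subset), n_real, len(subset) - n_real)
```

The resulting indices would then select the texts passed to the tokenizer and fine-tuning loop.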
