NLP

Synthetic Speech Commands Classification

A 30-class audio CNN reaches 100% accuracy on the 4,186-clip test split of a 41,849-sample synthetic dataset. Input: 64-bin mel-spectrograms with SpecAugment. Model: 1.25M-parameter 4-block CNN trained with label smoothing 0.1. Validation accuracy reaches 100% at epoch 8.


100.00%

Test accuracy

1.00

All-class F1

Epoch 8

100% val achieved

1,246,142

Model params

Dataset

41,849 synthetic .wav files, 30 command classes

Approach

Mel-spectrogram + SpecAugment → 4-block CNN + label smoothing + cosine LR

Tech Stack

Python, PyTorch, librosa, 4-block CNN, Mel-Spectrogram, SpecAugment

Keywords

Audio CNN, Mel-Spectrogram, SpecAugment, Speech Recognition, 30-class, librosa


Deep Dive

30-class audio command classification achieving perfect test accuracy on synthetic data.

Dataset

▸41,849 .wav files: 30 command classes (bed, bird, cat ... yes, zero)
▸Clean + noisy variants, 16kHz, 1.0s fixed duration
▸Train: 31,386 / Val: 6,277 / Test: 4,186 (stratified)
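A stratified split like the one above could be sketched as follows. The 75/15/10 fractions are inferred from the reported counts, and the function name, seed, and toy labels are hypothetical; the original split code is not shown in the write-up.

```python
import numpy as np

def stratified_split(labels, fractions=(0.75, 0.15, 0.10), seed=0):
    """Return train/val/test index arrays that keep per-class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut them into three parts.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to test
    return np.array(train), np.array(val), np.array(test)

# Toy labels: 30 classes with 100 clips each (not the real dataset).
toy_labels = np.repeat(np.arange(30), 100)
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```

Splitting class by class guarantees that every command appears in all three partitions in roughly the same ratio.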

Feature Extraction

1. Load .wav → normalize to 1.0s (pad/trim)
2. Mel-spectrogram: 64 bins, n_fft=512, hop=160
3. Normalize per-sample to [0, 1]
4. SpecAugment: FreqMask(k=15) + TimeMask(k=35)

CNN Architecture (1,246,142 params)

4 ConvBlocks [32→64→128→256 channels] → GAP → Dense(512) → Dropout(0.3) → Dense(30)
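A PyTorch sketch of this layout is shown below. The internals of each ConvBlock (one 3×3 convolution, batch norm, ReLU, 2×2 max-pool) are an assumption, since the write-up lists only the channel widths, so the parameter count of this sketch is not expected to match the reported 1,246,142. The class names and the (1, 64, 101) input shape are also illustrative.

```python
import torch
from torch import nn

class ConvBlock(nn.Module):
    """Assumed block layout: Conv3x3 -> BatchNorm -> ReLU -> MaxPool(2)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.body(x)

class SpeechCommandCNN(nn.Module):
    """4 ConvBlocks -> global average pool -> Dense(512) -> Dropout(0.3) -> Dense(30)."""
    def __init__(self, n_classes=30):
        super().__init__()
        widths = [1, 32, 64, 128, 256]
        self.features = nn.Sequential(
            *[ConvBlock(i, o) for i, o in zip(widths[:-1], widths[1:])]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # GAP over frequency and time
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, n_classes),
        )

    def forward(self, x):
        return self.head(self.pool(self.features(x)))

# Shape check on a dummy batch of two spectrograms (batch, channel, mel, frames).
model = SpeechCommandCNN()
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 64, 101))
n_params = sum(p.numel() for p in model.parameters())
```

Global average pooling makes the classifier head independent of the exact number of time frames, which is convenient if the hop length or clip duration changes.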

Training

30 epochs, Adam(lr=1e-3), CrossEntropy + label_smoothing=0.1, CosineAnnealingLR, patience=10
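The training configuration could be wired together roughly as follows. This is a sketch under assumptions: patience=10 is read here as early stopping on validation loss, T_max is set to the epoch count, and the `train` function, the linear stand-in model, and the random toy batches are hypothetical.

```python
import torch
from torch import nn

def train(model, train_batches, val_batches, epochs=30, patience=10):
    """Train with Adam, label smoothing, and cosine LR; return per-epoch val loss."""
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    best_loss, stale, history = float("inf"), 0, []
    for _ in range(epochs):
        model.train()
        for x, y in train_batches:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()  # one cosine step per epoch

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_batches)
            val_loss /= len(val_batches)
        history.append(val_loss)

        # Assumed early stopping: quit after `patience` epochs without improvement.
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return history

# Toy run on random data with a linear stand-in model (not the real CNN).
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 101, 30))
toy_batches = [
    (torch.randn(8, 1, 64, 101), torch.randint(0, 30, (8,))) for _ in range(2)
]
history = train(toy_model, toy_batches, toy_batches, epochs=3)
```

With label smoothing of 0.1 the loss cannot reach zero even for perfect predictions, so validation loss can keep changing after validation accuracy has saturated.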

Results

Epoch	Val Accuracy
5	98.5%
8	100.0%
30	100.0%

Test accuracy: 100.00% | Per-class F1: 1.00 for all 30 commands
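For reference, per-class F1 can be computed from predicted and true labels as in the sketch below. The function name and the toy labels are hypothetical; the reported scores come from the project's own evaluation, not from this code.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Toy labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
perfect = per_class_f1(y_true, y_true, 3)             # every class scores 1.0
one_error = per_class_f1(y_true, [0, 1, 1, 1, 2, 2], 3)
```

A single misclassification lowers the F1 of two classes at once: the true class loses recall and the predicted class loses precision.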

Why Perfect Accuracy?

Synthetic speech (text-to-speech) has highly consistent acoustic properties. Unlike real speech, which varies in accent, prosody, and background noise, synthetic commands cluster tightly in mel-spectrogram space. This is a favorable but unrealistic evaluation condition.
