Synthetic Speech Commands Classification

A 30-class audio CNN reaches 100% accuracy on the 4,186-sample test split of a 41,849-sample synthetic dataset. Inputs are 64-bin mel-spectrograms with SpecAugment; the model is a 4-block CNN with 1.25M parameters, trained with label smoothing of 0.1. Validation accuracy reaches 100% at epoch 8.

Test accuracy: 100.00%
All-class F1: 1.00
100% val accuracy reached: epoch 8
Model params: 1,246,142

Dataset

41,849 synthetic .wav files, 30 command classes

Approach

Mel-spectrogram + SpecAugment → 4-block CNN + label smoothing + cosine LR

Tech Stack
Python, PyTorch, librosa, 4-block CNN, Mel-Spectrogram, SpecAugment
Keywords
Audio CNN, Mel-Spectrogram, SpecAugment, Speech Recognition, 30-class, librosa
Visualizations: 5 charts
Deep Dive

30-class audio command classification achieving perfect test accuracy on synthetic data.

Dataset

  • 41,849 .wav files: 30 command classes (bed, bird, cat ... yes, zero)
  • Clean + noisy variants, 16 kHz, 1.0 s fixed duration
  • Train: 31,386 / Val: 6,277 / Test: 4,186 (stratified)
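
The reported counts work out to roughly a 75/15/10 split of the 41,849 files. The project's own splitting code is not shown, so the following is only a minimal sketch of a stratified split using the standard library; the function name `stratified_split`, the fractions, and the toy labels are all assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.75, 0.15, 0.10), seed=0):
    """Return (train, val, test) index lists that keep per-class proportions.

    `fractions` is an assumption inferred from the reported split sizes.
    """
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy example: 30 hypothetical classes with 100 samples each.
labels = [f"class_{i % 30}" for i in range(3000)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 2250 450 300
```

Splitting each class separately is what keeps rare and common commands equally represented in all three subsets.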

Feature Extraction

1. Load .wav → normalize to 1.0s (pad/trim)
2. Mel-spectrogram: 64 bins, n_fft=512, hop=160
3. Normalize per-sample to [0, 1]
4. SpecAugment: FreqMask(k=15) + TimeMask(k=35)

CNN Architecture (1,246,142 params)

4 ConvBlocks [32→64→128→256 channels] → GAP → Dense(512) → Dropout(0.3) → Dense(30)
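
The layer sequence above could look roughly like this in PyTorch. The internals of each ConvBlock are not given in the source, so this sketch assumes one 3x3 convolution, batch norm, ReLU, and 2x2 max pooling per block; the class names `ConvBlock` and `CommandCNN` are hypothetical. Because of those assumptions, it does not reproduce the reported 1,246,142 parameter count.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Assumed block layout: Conv3x3 -> BatchNorm -> ReLU -> MaxPool2x2."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.net(x)


class CommandCNN(nn.Module):
    """4 ConvBlocks -> global average pooling -> Dense(512) -> Dropout -> Dense(30)."""

    def __init__(self, n_classes=30):
        super().__init__()
        channels = [1, 32, 64, 128, 256]
        self.features = nn.Sequential(
            *[ConvBlock(i, o) for i, o in zip(channels[:-1], channels[1:])]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # GAP over frequency and time
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, n_classes),
        )

    def forward(self, x):
        return self.head(self.pool(self.features(x)))


model = CommandCNN()
# Batch of 8 single-channel spectrograms: 64 mel bins x 101 frames.
logits = model(torch.randn(8, 1, 64, 101))
print(logits.shape)  # torch.Size([8, 30])
n_params = sum(p.numel() for p in model.parameters())
```

Global average pooling makes the classifier head independent of the exact number of time frames, which is convenient if the hop length or clip duration changes.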

Training

30 epochs, Adam(lr=1e-3), CrossEntropy + label_smoothing=0.1, CosineAnnealingLR, patience=10
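
A minimal sketch of how these settings fit together in PyTorch is shown below. A small linear layer and random tensors stand in for the CNN and the spectrogram batches, and `T_max=30` (one cosine half-period over the full run) is an assumption, since the scheduler arguments are not given. The `patience=10` setting is presumably an early-stopping criterion and is not shown.

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(16, 30)            # stand-in for the 30-class CNN
x = torch.randn(64, 16)              # stand-in features
y = torch.randint(0, 30, (64,))      # stand-in labels

# Label smoothing moves 0.1 of the target mass off the true class.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Assumption: the learning rate is annealed over all 30 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

lrs, losses = [], []
for epoch in range(30):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])  # rate used this epoch
    losses.append(loss.item())
    scheduler.step()                              # once per epoch
```

The learning rate starts at 1e-3 and follows a cosine curve toward zero by the final epoch, so late epochs make only small weight updates.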

Results

Epoch   Val Accuracy
5       98.5%
8       100.0%
30      100.0%

Test accuracy: 100.00% | Per-class F1: 1.00 for all 30 commands
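
Per-class F1 can be computed directly from true and predicted labels, as in the sketch below. The helper `per_class_f1` and the toy labels are hypothetical and only illustrate the metric; they are not the project's evaluation code or data.

```python
def per_class_f1(y_true, y_pred, n_classes):
    """F1 per class: 2*TP / (2*TP + FP + FN), or 0.0 if the class never occurs."""
    scores = []
    for c in range(n_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy example with one mistake: a class-1 sample predicted as class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
scores = per_class_f1(y_true, y_pred, 3)
print([round(s, 3) for s in scores])  # [1.0, 0.667, 0.8]

# With no mistakes, every class scores exactly 1.0.
perfect = per_class_f1(y_true, y_true, 3)
```

A single misclassification lowers F1 for two classes at once (the true class loses recall, the predicted class loses precision), so an F1 of 1.00 on all 30 commands is equivalent to zero test errors.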

Why Perfect Accuracy?

Synthetic (text-to-speech) speech has highly consistent acoustic properties. Unlike real speech, which varies in accent, prosody, and background noise, synthetic commands cluster tightly in mel-spectrogram space. This is a favorable but unrealistic evaluation condition.