Synthetic Speech Commands Classification
A 30-class audio CNN reaches 100% accuracy on the 4,186-sample held-out test split of a synthetic speech command dataset (41,849 .wav files in total). Inputs are 64-bin mel-spectrograms with SpecAugment. The model is a 4-block CNN with about 1.25M parameters, trained with label smoothing 0.1 and a cosine learning-rate schedule. Validation accuracy reaches 100% at epoch 8.
Dataset
- 41,849 .wav files across 30 command classes (bed, bird, cat ... yes, zero)
- Clean + noisy variants, 16 kHz, 1.0 s fixed duration
- Train: 31,386 / Val: 6,277 / Test: 4,186 (stratified)
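The split sizes above correspond to roughly 75% / 15% / 10% of the 41,849 files. The following is a minimal sketch of a stratified split in plain Python, assuming those fractions are applied per class. The function name `stratified_split`, the seed, and the toy labels are hypothetical; the actual splitting code is not shown in this write-up.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.75, 0.15, 0.10), seed=0):
    """Return (train, val, test) index lists that keep per-class proportions.

    The fractions are an assumption inferred from the reported split sizes.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder of the class goes to test
    return train, val, test


# Toy example: three hypothetical classes with 100 files each.
labels = ["bed"] * 100 + ["bird"] * 100 + ["cat"] * 100
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting within each class keeps every command represented in all three subsets, which matters when per-class F1 is reported.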
Feature Extraction
1. Load .wav → fix length to 1.0 s (16,000 samples at 16 kHz) by padding or trimming
2. Mel-spectrogram: 64 bins, n_fft=512, hop=160
3. Normalize per-sample to [0, 1]
4. SpecAugment: FreqMask(k=15) + TimeMask(k=35)
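Steps 1, 3, and 4 above can be sketched with plain PyTorch as follows. This is an illustration under stated assumptions, not the project's actual code: the helper names (`pad_or_trim`, `minmax_normalize`, `spec_augment`) are hypothetical, padding is assumed to be zero-padding on the right, and one mask of each kind is assumed per sample. Step 2 is replaced by a random stand-in tensor; in practice it could come from a library transform such as torchaudio's `MelSpectrogram(sample_rate=16000, n_fft=512, hop_length=160, n_mels=64)`, which with centered framing yields 101 frames for a 16,000-sample clip.

```python
import torch
import torch.nn.functional as F

SAMPLE_RATE = 16_000
NUM_SAMPLES = SAMPLE_RATE  # 1.0 s at 16 kHz


def pad_or_trim(waveform, length=NUM_SAMPLES):
    """Zero-pad on the right or truncate so the last dim has `length` samples."""
    n = waveform.shape[-1]
    if n < length:
        return F.pad(waveform, (0, length - n))
    return waveform[..., :length]


def minmax_normalize(spec, eps=1e-8):
    """Scale a single spectrogram to [0, 1] using its own min and max."""
    lo, hi = spec.min(), spec.max()
    return (spec - lo) / (hi - lo + eps)


def spec_augment(spec, freq_mask=15, time_mask=35, generator=None):
    """Zero one random frequency band and one random time band.

    spec has shape (n_mels, n_frames). Mask widths are drawn uniformly
    from 0..freq_mask and 0..time_mask (an assumption of this sketch).
    """
    spec = spec.clone()
    n_mels, n_frames = spec.shape

    f = int(torch.randint(0, freq_mask + 1, (1,), generator=generator))
    f0 = int(torch.randint(0, n_mels - f + 1, (1,), generator=generator))
    spec[f0:f0 + f, :] = 0.0

    t = int(torch.randint(0, time_mask + 1, (1,), generator=generator))
    t0 = int(torch.randint(0, n_frames - t + 1, (1,), generator=generator))
    spec[:, t0:t0 + t] = 0.0
    return spec


gen = torch.Generator().manual_seed(0)
wave = pad_or_trim(torch.randn(12_000, generator=gen))  # hypothetical 0.75 s clip
mel = torch.rand(64, 101, generator=gen)                # stand-in for step 2
norm = minmax_normalize(mel)
augmented = spec_augment(norm, generator=gen)
print(wave.shape, augmented.shape)
```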
CNN Architecture (1,246,142 params)
4 ConvBlocks [32→64→128→256 channels] → GAP → Dense(512) → Dropout(0.3) → Dense(30)
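One possible PyTorch reading of this layout is sketched below. The internals of each ConvBlock are not given here, so the single 3x3 convolution, batch norm, ReLU, and 2x2 max-pool per block are assumptions, as are the class names `ConvBlock` and `CommandCNN` and the ReLU after `Dense(512)`. Because of those guesses, this sketch is not expected to reproduce the reported parameter count exactly.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Conv3x3 -> BatchNorm -> ReLU -> 2x2 max-pool (layout is an assumption)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.net(x)


class CommandCNN(nn.Module):
    """4 ConvBlocks -> global average pooling -> Dense(512) -> Dropout -> Dense(30)."""

    def __init__(self, n_classes=30):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(1, 32),
            ConvBlock(32, 64),
            ConvBlock(64, 128),
            ConvBlock(128, 256),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # GAP over frequency and time
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, n_classes),
        )

    def forward(self, x):
        return self.head(self.pool(self.features(x)))


model = CommandCNN()
batch = torch.randn(8, 1, 64, 101)  # (batch, channel, mel bins, frames)
logits = model(batch)
n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)
```

Global average pooling makes the classifier head independent of the exact number of time frames, so the same model accepts slightly different spectrogram widths.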
Training
30 epochs, Adam(lr=1e-3), CrossEntropy + label_smoothing=0.1, CosineAnnealingLR, patience=10
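The optimizer, loss, and schedule listed above map onto standard PyTorch components. The sketch below wires them together on a placeholder linear model with random data, purely to show the configuration. `T_max=EPOCHS` and one scheduler step per epoch are assumptions, and the `patience=10` setting is left out because the quantity it monitors is not stated.

```python
import torch
from torch import nn

EPOCHS = 30
torch.manual_seed(0)

# Placeholder model and data, standing in for the CNN and the spectrogram batches.
model = nn.Linear(16, 30)
x = torch.randn(64, 16)
y = torch.randint(0, 30, (64,))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Assumption: the cosine schedule spans the full run and steps once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

lrs = []
for epoch in range(EPOCHS):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used this epoch
    scheduler.step()

print(f"first lr={lrs[0]:.2e}, last lr={lrs[-1]:.2e}")
```

With label smoothing, the target distribution puts 0.9 plus 0.1/30 on the true class and 0.1/30 on each other class, so the loss does not reach zero even when every prediction is correct.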
Results
| Epoch | Val Accuracy |
|---|---|
| 5 | 98.5% |
| 8 | 100.0% |
| 30 | 100.0% |
Test accuracy: 100.00% | Per-class F1: 1.00 for all 30 commands
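For reference, per-class F1 can be computed from true and predicted labels as in the following plain-Python sketch. The function name `per_class_f1` and the toy labels are hypothetical; the project may well have used a library routine instead.

```python
def per_class_f1(y_true, y_pred, n_classes):
    """Return a list with the F1 score of each class, F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy example with 3 classes and one mistake (a class-1 sample predicted as class 2).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
f1 = per_class_f1(y_true, y_pred, n_classes=3)
print([round(s, 3) for s in f1])  # [1.0, 0.667, 0.8]
```

An F1 of 1.00 for every class, as reported here, requires zero false positives and zero false negatives for each of the 30 commands on the test split.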
Why Perfect Accuracy?
Synthetic (text-to-speech) speech has highly consistent acoustic properties. Unlike real speech, which varies in accent, prosody, and background noise, synthetic commands cluster tightly in mel-spectrogram space. This is a favorable but unrealistic evaluation condition.