Audio & Speech ML
“From raw waveforms to MFCC features — how machines listen and understand speech”
From raw waveforms to MFCC features — STFT spectrograms, Mel filterbanks, audio classification CNNs, CTC loss for speech recognition, SpecAugment, and OpenAI Whisper for production-grade ASR.
Key Formulas
Short-Time Fourier Transform
Compute frequency content within a sliding window — produces the spectrogram
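One common discrete form of the STFT is written out below as a reference. The symbols are assumptions introduced here, not taken from the original page: x[n] is the sampled signal, w[n] a window of length N, H the hop size in samples, m the frame index, and k the frequency bin. S[m, k] is the power spectrogram.

```latex
X[m, k] = \sum_{n=0}^{N-1} x[n + mH]\, w[n]\, e^{-j 2\pi k n / N},
\qquad
S[m, k] = \bigl\lvert X[m, k] \bigr\rvert^{2}
```

At a 16 kHz sample rate, N = 400 and H = 160 correspond to a 25 ms window and a 10 ms hop.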
Mel Scale
Maps linear frequency to a perceptual pitch scale: roughly linear below about 1 kHz and logarithmic above, so frequency resolution is finer at low frequencies than at high ones
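A minimal sketch of the Hz-to-Mel mapping, assuming the widely used HTK-style formula m = 2595 · log10(1 + f / 700). Other variants exist (librosa's default is a Slaney-style scale unless `htk=True`), and the helper names below are illustrative only.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK-style Mel scale: close to linear below ~1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_mels: int) -> list[float]:
    """Return n_mels + 2 edge frequencies (Hz) equally spaced on the Mel axis.

    Consecutive triples of edges define the triangular filters of a filterbank.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges(0.0, 8000.0, 80)
print(f"first band width: {edges[1] - edges[0]:.1f} Hz")
print(f"last band width:  {edges[-1] - edges[-2]:.1f} Hz")
```

Because the edges are equally spaced in Mel, the bands are narrow at low frequencies and wide at high frequencies, which is the perceptual weighting described above.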
MFCC
Discrete Cosine Transform of log Mel filterbank energies — compact, decorrelated audio features
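For reference, the MFCCs of one frame can be written as a type-II DCT of the log filterbank energies. The symbols are assumptions introduced here: E_m is the energy of the m-th Mel filter, M the number of filters, and L the number of coefficients kept (typically 13). Scaling conventions differ between libraries.

```latex
c_n = \sum_{m=1}^{M} \log(E_m)\,
      \cos\!\left[ \frac{\pi n}{M} \left( m - \tfrac{1}{2} \right) \right],
\qquad n = 0, 1, \dots, L - 1
```

Keeping only the first L coefficients retains the smooth spectral envelope and drops fine detail, which is what makes the features compact.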
CTC Loss
Connectionist Temporal Classification — allows alignment-free training between input frames and output tokens
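CTC defines the loss as the negative log of the total probability of every frame-level alignment that collapses to the target (merge repeated symbols, then drop blanks). The sketch below is a plain-Python illustration under stated assumptions: blank has index 0, `probs[t][k]` holds already-normalized per-frame probabilities, and the target is short enough to fit in the available frames. It computes the sum with the standard forward recursion and checks it against brute-force enumeration. Production code would work in log space (for example via `torch.nn.CTCLoss`).

```python
import itertools
import math

BLANK = 0  # assumed index of the CTC blank symbol


def collapse(path):
    """CTC mapping: merge consecutive repeats, then remove blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_neg_log_likelihood(probs, target):
    """Forward algorithm over the blank-extended target sequence."""
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][BLANK]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]                  # stay on the same state
            if s >= 1:
                total += alpha[t - 1][s - 1]         # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * probs[t][ext[s]]
    likelihood = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return -math.log(likelihood)


def brute_force_likelihood(probs, target):
    """Sum the probability of every path that collapses to target (tiny inputs only)."""
    n_symbols = len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_symbols), repeat=len(probs)):
        if collapse(path) == list(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total


# Toy example: 4 frames, 3 symbols (blank, 1, 2).
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.1, 0.4]]
loss = ctc_neg_log_likelihood(probs, [1, 2])
print(f"CTC loss for target [1, 2]: {loss:.4f}")
```

The recursion runs in O(T · S) time, whereas the brute-force sum grows exponentially with the number of frames. That gap is why CTC training is tractable without frame-level alignments.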
Why Audio ML Is Harder Than Image ML
Audio presents unique challenges that images don't: (1) Temporal structure — meaning depends on order and timing, not just content (speech is a sequence). (2) Variable length — an utterance can be 0.1s or 60s; images can be padded/resized cleanly, audio padding changes perceived silence. (3) Non-stationarity — statistical properties change over time (pitch, speed, accent). (4) Irrelevant variation — same content spoken faster, louder, with different accent, different microphone, background noise — all should produce the same output. (5) No direct spatial structure — unlike images where CNNs exploit local pixel correlations, raw audio samples are 1D time series at 16,000–44,100 Hz. The spectrogram transformation converts audio into a 2D image-like representation that CNNs can process.
Most production speech systems, Whisper included, convert speech to log-mel spectrograms (or learned filterbanks) before applying neural networks. Raw-waveform models such as wav2vec 2.0 exist, but a spectrogram front end remains the common default.
The Audio Processing Pipeline
**Raw waveform:** x(t), a 1D time series of air pressure values sampled at 16kHz (16,000 samples/second for speech).
**Spectrogram:** Apply the Short-Time Fourier Transform (STFT) with a sliding window (~25ms, stride ~10ms) to get a matrix of frequency content over time. Y-axis = frequency (0–8kHz), X-axis = time. Bright regions = frequencies present at that time.
**Mel spectrogram:** Apply a triangular filterbank (Mel scale) to collapse the frequency axis to 80–128 Mel bins, matching human auditory perception.
**MFCC:** Apply log + Discrete Cosine Transform (DCT) to decorrelate the filterbank energies into 13–40 compact coefficients per frame. MFCCs were the gold standard for decades; modern deep learning often skips them and uses log-mel spectrograms directly as CNN input.
A 1-second clip at 16kHz = 16,000 raw samples. After an STFT with 25ms windows and a 10ms stride, that becomes ~100 frames × 80 Mel bins = 8,000 values: half as many numbers, while keeping most of the perceptually relevant content (phase and fine frequency detail are discarded).
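The "~100 frames" figure can be checked with a few lines of arithmetic. This is a sketch under assumed framing conventions: without padding, only fully covered windows are counted; with centered padding (half a window added on each side, as many STFT implementations do by default), every hop gets a frame. The function name is illustrative.

```python
def n_frames(n_samples: int, win_length: int = 400, hop_length: int = 160,
             center: bool = False) -> int:
    """Number of STFT frames for a signal of n_samples samples."""
    if center:
        # Half a window of padding on each side (even win_length assumed).
        return 1 + n_samples // hop_length
    return 1 + (n_samples - win_length) // hop_length


sample_rate = 16_000
n_samples = sample_rate * 1          # 1-second clip
n_mels = 80

unpadded = n_frames(n_samples)               # 98 frames
centered = n_frames(n_samples, center=True)  # 101 frames
print(unpadded * n_mels, centered * n_mels)  # 7840 8080
```

Either way the log-mel representation holds roughly 8,000 values per second of audio, compared with 16,000 raw samples.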
Speech Recognition Pipeline (Whisper-style)
Preprocess: resample to 16kHz, normalize amplitude, pad/trim to a fixed length (Whisper uses 30-second windows).
Log-Mel spectrogram: apply STFT (window=25ms, hop=10ms, n_fft=400), apply 80 Mel filterbanks, take log.
Encoder: a small convolutional stem (in Whisper, two 1D convolutions over time, the second with stride 2) → Transformer encoder with sinusoidal positional embeddings. Encodes audio context.
Decoder: autoregressive Transformer decoder conditioned on encoder output. Trained with teacher forcing on transcripts.
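Loss: cross-entropy between predicted and ground-truth tokens for encoder-decoder models such as Whisper; CTC is the usual alternative for encoder-only models that emit one prediction per frame.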
Inference: beam search (width 5) decodes the most likely token sequence. Optional language model rescoring.
Post-processing: apply punctuation restoration and inverse text normalization (convert 'three dollars' → '$3').
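The preprocessing step and the resulting sequence lengths can be sketched in NumPy. The helper names are illustrative, and the constants are assumptions based on Whisper's published setup: 30-second windows, a 10 ms hop, and a stride-2 convolution in the encoder stem. Peak normalization is just one simple choice of amplitude normalization.

```python
import numpy as np

SAMPLE_RATE = 16_000
CLIP_SECONDS = 30          # assumed fixed window length
HOP_LENGTH = 160           # 10 ms at 16 kHz


def pad_or_trim(wave: np.ndarray, length: int = SAMPLE_RATE * CLIP_SECONDS) -> np.ndarray:
    """Zero-pad at the end or truncate so every clip has the same length."""
    if len(wave) >= length:
        return wave[:length]
    return np.pad(wave, (0, length - len(wave)))


def peak_normalize(wave: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale so the largest absolute sample is (almost exactly) 1."""
    return wave / (np.max(np.abs(wave)) + eps)


def conv_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1


clip = 0.5 * np.sin(2 * np.pi * 440 * np.arange(5 * SAMPLE_RATE) / SAMPLE_RATE)  # 5 s tone
fixed = peak_normalize(pad_or_trim(clip))

mel_frames = len(fixed) // HOP_LENGTH            # 3000 frames for 30 s
encoder_positions = conv_out_len(mel_frames)     # 1500 after the stride-2 conv
print(fixed.shape, mel_frames, encoder_positions)
```

Fixing the input length keeps batch shapes constant, at the cost of spending compute on padded silence for short clips.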
Audio Features with librosa + OpenAI Whisper
```python
import librosa
import numpy as np

# ── 1. Load audio ─────────────────────────────────────────────────────────────
y, sr = librosa.load("speech.wav", sr=16000)  # resample to 16kHz
print(f"Duration: {len(y)/sr:.2f}s, Sample rate: {sr}Hz")

# ── 2. Waveform to spectrogram ────────────────────────────────────────────────
D = librosa.stft(y, n_fft=400, hop_length=160, win_length=400)
spectrogram = np.abs(D)**2                           # power spectrogram (magnitude²)
S_db = librosa.power_to_db(spectrogram, ref=np.max)  # decibel scale

# ── 3. Mel spectrogram ────────────────────────────────────────────────────────
S_mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400,
    hop_length=160,      # 10ms stride at 16kHz
    win_length=400,      # 25ms window at 16kHz
    n_mels=80,           # 80 Mel bins (used by most Whisper models)
    fmin=50, fmax=8000,  # filter between 50Hz and 8kHz
)
log_mel = librosa.power_to_db(S_mel, ref=np.max)
print(f"Log-mel shape: {log_mel.shape}")  # (80, T)

# ── 4. MFCC ───────────────────────────────────────────────────────────────────
# n_fft / hop_length are forwarded to melspectrogram so the framing matches step 3
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=80,
                            n_fft=400, hop_length=160)
mfcc_delta = librosa.feature.delta(mfcc)             # velocity: change over time
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)   # acceleration
features = np.vstack([mfcc, mfcc_delta, mfcc_delta2])  # 39-dim feature vector
print(f"MFCC + deltas shape: {features.shape}")  # (39, T)

# ── 5. Audio classification with CNN ──────────────────────────────────────────
import torch
import torch.nn as nn


class AudioCNN(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # adaptive average pool to a fixed 4×4 grid
        )
        self.fc = nn.Sequential(
            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        # x: (B, 1, n_mels, T), log-mel as a single-channel "image"
        return self.fc(self.conv(x).flatten(1))


model = AudioCNN(n_classes=10)     # e.g., UrbanSound8K: 10 sound classes
x = torch.randn(8, 1, 80, 128)     # batch of 8 audio clips, 80 mels, 128 frames
print(model(x).shape)              # (8, 10)

# ── 6. OpenAI Whisper (speech-to-text) ────────────────────────────────────────
# pip install openai-whisper
import whisper

model_w = whisper.load_model("base")  # 74M params
result = model_w.transcribe("speech.wav")
print(result["text"])      # full transcript
print(result["language"])  # detected language

# Segment-level timestamps; with word_timestamps=True each segment also
# carries per-word timings in seg["words"]
result_ts = model_w.transcribe("speech.wav", word_timestamps=True)
for seg in result_ts["segments"]:
    print(f"[{seg['start']:.2f}s → {seg['end']:.2f}s] {seg['text']}")
```
Data Augmentation Is Critical for Audio
Audio models overfit easily because the same speaker and content can sound very different across recording conditions. Key augmentations: (1) **SpecAugment** (Google, 2019): randomly mask frequency bands (frequency masking) and time steps (time masking) in the log-mel spectrogram, optionally with time warping. Simple yet very effective; it was added as regularization when training Whisper large-v2. (2) **Time stretching**: change tempo without changing pitch (librosa.effects.time_stretch). (3) **Pitch shifting**: change pitch without changing tempo (librosa.effects.pitch_shift). (4) **Background noise mixing**: add babble noise, music, or traffic at various SNR levels. (5) **Room impulse response (RIR) convolution**: simulate different acoustic environments. Without augmentation, a model trained on studio-quality speech typically degrades sharply on phone calls or outdoor recordings.
SpecAugment alone improved the WER of a LAS (Listen, Attend and Spell) model by 13.9% relative on LibriSpeech, which makes it arguably the most effective single augmentation technique in ASR.