ML Learning Hub
Audio · intermediate

Audio & Speech ML

From raw waveforms to MFCC features — how machines listen and understand speech

From raw waveforms to MFCC features — STFT spectrograms, Mel filterbanks, audio classification CNNs, CTC loss for speech recognition, SpecAugment, and OpenAI Whisper for production-grade ASR.

40 min
8 diagrams
8 Concepts Covered

Prerequisites

CNN Architectures
NLP Text Classification

Concepts Covered

STFT, Mel Spectrogram, MFCC, Audio CNN, CTC Loss, SpecAugment, Whisper, ASR

Key Formulas

Short-Time Fourier Transform

Compute frequency content within a sliding window — produces the spectrogram

Mel Scale

Maps linear frequency to a perceptual pitch scale: roughly linear below about 1 kHz and logarithmic above, matching how human pitch resolution coarsens at high frequencies

MFCC

Discrete Cosine Transform of log Mel filterbank energies — compact, decorrelated audio features

CTC Loss

Connectionist Temporal Classification — allows alignment-free training between input frames and output tokens
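
The four formulas named above can be written out as follows. This is a sketch in standard textbook notation; the symbols (frame index t, hop H, window w, filterbank energies E_b, collapsing map B) are assumptions, since the page itself only names the formulas. The Mel mapping shown is the common HTK-style variant; librosa defaults to a slightly different (Slaney) definition.

```latex
% Short-Time Fourier Transform: frame t, frequency bin k, window w of length N, hop H
X[t, k] = \sum_{n=0}^{N-1} x[n + tH] \, w[n] \, e^{-j 2\pi k n / N},
\qquad \text{power spectrogram: } |X[t, k]|^2

% Mel scale (HTK-style variant), f in Hz
\mathrm{mel}(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)

% MFCC: DCT-II of the log Mel filterbank energies E_b[t], b = 1, \dots, B
c_i[t] = \sum_{b=1}^{B} \log\!\big(E_b[t]\big) \,
         \cos\!\left[\frac{\pi i}{B}\left(b - \tfrac{1}{2}\right)\right]

% CTC loss: sum over all frame-level paths \pi that collapse to the target y
\mathcal{L}_{\mathrm{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x)
```

Here the map B removes repeated labels and then blanks, so its inverse image is the set of all frame-level alignments of the transcript y.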

🎯

Why Audio ML Is Harder Than Image ML

motivation

Audio presents unique challenges that images don't: (1) Temporal structure — meaning depends on order and timing, not just content (speech is a sequence). (2) Variable length — an utterance can be 0.1s or 60s; images can be padded/resized cleanly, audio padding changes perceived silence. (3) Non-stationarity — statistical properties change over time (pitch, speed, accent). (4) Irrelevant variation — same content spoken faster, louder, with different accent, different microphone, background noise — all should produce the same output. (5) No direct spatial structure — unlike images where CNNs exploit local pixel correlations, raw audio samples are 1D time series at 16,000–44,100 Hz. The spectrogram transformation converts audio into a 2D image-like representation that CNNs can process.

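Voice assistants such as Siri, Google Assistant, and Alexa, and ASR models such as Whisper, typically convert speech to spectrograms (or learned mel filterbanks) before applying neural networks. Models that consume raw waveforms directly, such as wav2vec 2.0, are the exception rather than the rule.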

💡

The Audio Processing Pipeline

intuition

**Raw waveform:** x(t) — a 1D time series of air pressure values sampled at 16kHz (16,000 samples/second for speech). **Spectrogram:** Apply Short-Time Fourier Transform (STFT) with a sliding window (~25ms, stride ~10ms) → matrix of frequency content over time. Y-axis = frequency (0–8kHz), X-axis = time. Bright regions = frequencies present at that time. **Mel spectrogram:** Apply triangular filterbank (Mel scale) to collapse frequency axis to 80–128 Mel bins — matches human auditory perception. **MFCC:** Apply log + Discrete Cosine Transform (DCT) to decorrelate filterbank energies → 13–40 compact coefficients per frame. MFCCs were the gold standard for decades; modern deep learning often skips them and uses log-mel spectrograms directly as CNN input.

A 1-second clip at 16 kHz = 16,000 raw samples. After STFT with 25 ms windows and a 10 ms stride, that becomes ~100 frames × 80 Mel bins = ~8,000 values: roughly a 2× reduction that keeps most of the perceptually relevant content while discarding phase and fine spectral detail.
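
The frame arithmetic above can be checked with a few lines of Python. This is a minimal sketch; `num_frames` is a hypothetical helper, and the centered case assumes the convention (used by librosa's default `center=True`) of padding the signal so that one frame is centered on every hop.

```python
def num_frames(n_samples, hop_length, win_length=None, center=True):
    """Number of STFT frames for a signal of n_samples samples."""
    if center:
        # Signal is padded on both sides, so one frame starts at every hop.
        return 1 + n_samples // hop_length
    # No padding: only windows that fit entirely inside the signal count.
    return 1 + (n_samples - win_length) // hop_length

sr = 16_000
hop = int(0.010 * sr)   # 10 ms stride -> 160 samples
win = int(0.025 * sr)   # 25 ms window -> 400 samples
n_mels = 80

frames = num_frames(sr, hop)                        # 1 second of audio
print(frames)                                       # 101
print(num_frames(sr, hop, win, center=False))       # 98
print(frames * n_mels)                              # 8080 values per second
```

Either convention gives about 100 frames per second, which is where the "~100 frames × 80 Mel bins" figure comes from.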

⚙️

Speech Recognition Pipeline (Whisper-style)

algorithm

1. Preprocess: resample to 16 kHz, normalize amplitude, pad or trim to a fixed length (Whisper uses 30-second windows).

2. Log-Mel spectrogram: apply STFT (window = 25 ms, hop = 10 ms, n_fft = 400), apply an 80-filter Mel filterbank, take the log.

3. Encoder: two 1D convolutions over time (the second with stride 2) → Transformer encoder with sinusoidal positional embeddings. Encodes audio context.

4. Decoder: autoregressive Transformer decoder that cross-attends to the encoder output. Trained with teacher forcing on transcripts.

5. Loss: cross-entropy between predicted and ground-truth tokens for encoder-decoder models such as Whisper. CTC is the alternative for encoder-only models that emit one label distribution per frame.

6. Inference: beam search (width 5) decodes the most likely token sequence. Optional language model rescoring.

7. Post-processing: punctuation restoration and inverse text normalization (for example, 'three dollars' → '$3') where the model does not already emit written-form text.
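
Step 5 mentions CTC without showing what it computes. The sketch below illustrates the two core ideas in plain Python: the collapsing rule (merge repeats, then drop blanks) and the forward algorithm that sums the probability of every frame-level alignment. The function names and the toy probabilities are assumptions for illustration. It works with raw probabilities, so it will underflow on long inputs; practical implementations such as `torch.nn.CTCLoss` work in log space.

```python
import math

def ctc_collapse(path, blank=0):
    """Map a frame-level label path to a transcript: merge repeats, drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.

    probs:  T rows, each a list of per-class probabilities for that frame.
    target: list of label ids (no blanks).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood)

print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))      # [1, 1, 2]

# Two frames, classes {0: blank, 1: 'a'}, uniform predictions.
# Paths that collapse to ['a']: (a,a), (blank,a), (a,blank) -> 3 * 0.25 = 0.75
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(round(ctc_neg_log_likelihood(uniform, [1]), 4))   # 0.2877
```

The toy case shows why CTC is alignment-free: the loss never asks which frame carries the label, only that some valid alignment does.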

</>

Audio Features with librosa + OpenAI Whisper

code
python
import librosa
import numpy as np
import matplotlib.pyplot as plt

# ── 1. Load audio ─────────────────────────────────────────────────────────────
y, sr = librosa.load("speech.wav", sr=16000)   # resample to 16kHz
print(f"Duration: {len(y)/sr:.2f}s, Sample rate: {sr}Hz")

# ── 2. Waveform to spectrogram ────────────────────────────────────────────────
D = librosa.stft(y, n_fft=400, hop_length=160, win_length=400)
spectrogram = np.abs(D)**2               # power spectrogram (magnitude²)
S_db = librosa.power_to_db(spectrogram, ref=np.max)  # decibel scale

# ── 3. Mel spectrogram ────────────────────────────────────────────────────────
S_mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400,
    hop_length=160,        # 10ms stride at 16kHz
    win_length=400,        # 25ms window at 16kHz
    n_mels=80,             # 80 Mel bins (Whisper standard)
    fmin=50, fmax=8000,    # filter between 50Hz and 8kHz
)
log_mel = librosa.power_to_db(S_mel, ref=np.max)
print(f"Log-mel shape: {log_mel.shape}")   # (80, T)

# ── 4. MFCC ──────────────────────────────────────────────────────────────────
# Pass the same framing as above so MFCC frames line up with the log-mel frames
# (librosa would otherwise default to n_fft=2048, hop_length=512).
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=400, hop_length=160, win_length=400, n_mels=80,
)
mfcc_delta  = librosa.feature.delta(mfcc)    # velocity: change over time
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)  # acceleration

features = np.vstack([mfcc, mfcc_delta, mfcc_delta2])  # 39-dim feature vector
print(f"MFCC + deltas shape: {features.shape}")  # (39, T)

# ── 5. Audio classification with CNN ─────────────────────────────────────────
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),    # adaptive average pool to a fixed 4×4 grid
        )
        self.fc = nn.Sequential(
            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(256, n_classes),
        )
    def forward(self, x):
        # x: (B, 1, n_mels, T), the log-mel treated as a single-channel "image"
        return self.fc(self.conv(x).flatten(1))

model = AudioCNN(n_classes=10)   # e.g., UrbanSound8K: 10 sound classes
x = torch.randn(8, 1, 80, 128)  # batch of 8 audio clips, 80 mels, 128 frames
print(model(x).shape)            # (8, 10)

# ── 6. OpenAI Whisper (speech-to-text) ───────────────────────────────────────
# pip install openai-whisper
import whisper
model_w = whisper.load_model("base")          # 74M params
result  = model_w.transcribe("speech.wav")
print(result["text"])                          # full transcript
print(result["language"])                      # detected language

# Word-level timestamps (each segment gains a "words" list)
result_ts = model_w.transcribe("speech.wav", word_timestamps=True)
for seg in result_ts["segments"]:
    for w in seg["words"]:
        print(f"[{w['start']:.2f}s → {w['end']:.2f}s] {w['word']}")
⚠️

Data Augmentation Is Critical for Audio

pitfall

Audio models overfit easily because the same speaker can sound very different across recording conditions. Key augmentations: (1) **SpecAugment** (Google, 2019): randomly mask frequency bands (frequency masking) and time steps (time masking) in the log-mel spectrogram. Simple yet highly effective; Whisper's large-v2 model added it as a regularizer. (2) **Time stretching**: change tempo without changing pitch (librosa.effects.time_stretch). (3) **Pitch shifting**: change pitch without changing tempo (librosa.effects.pitch_shift). (4) **Background noise mixing**: add babble noise, music, or traffic at various SNR levels. (5) **Room impulse response (RIR) convolution**: simulate different acoustic environments. Without augmentation, a model trained on studio-quality speech typically degrades sharply on phone calls or outdoor recordings.

SpecAugment alone improved LAS (Listen, Attend and Spell) model WER by 13.9% relative on LibriSpeech, which makes it one of the most effective single augmentation techniques reported for ASR.
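
The masking part of SpecAugment is short enough to sketch in NumPy. This is a simplified illustration, not the paper's exact recipe: it omits time warping, the mask counts and widths are placeholder values, and filling masked cells with the clip mean is an assumption (`spec_augment` and its parameters are hypothetical names).

```python
import numpy as np

def spec_augment(log_mel, rng, n_freq_masks=2, max_freq_width=10,
                 n_time_masks=2, max_time_width=20):
    """Return a masked copy of a (n_mels, n_frames) log-mel spectrogram."""
    out = log_mel.copy()
    n_mels, n_frames = out.shape
    fill = log_mel.mean()   # assumption: masked cells take the clip mean

    # Frequency masking: blank out random horizontal bands of Mel bins.
    for _ in range(n_freq_masks):
        width = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
        start = int(rng.integers(0, n_mels - width + 1))
        out[start:start + width, :] = fill

    # Time masking: blank out random vertical strips of frames.
    for _ in range(n_time_masks):
        width = int(rng.integers(0, min(max_time_width, n_frames) + 1))
        start = int(rng.integers(0, n_frames - width + 1))
        out[:, start:start + width] = fill

    return out

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(80, 128))   # stand-in for a real log-mel spectrogram
augmented = spec_augment(log_mel, rng)
print(augmented.shape)                 # (80, 128)
```

Applying a fresh random mask each time a clip is drawn during training means the model rarely sees the same spectrogram twice, which is what discourages it from relying on any single frequency band or time span.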
