ML Learning Hub
Vision · Intermediate

Object Detection: YOLO & Faster-RCNN

From image classification to locating and labelling every object in the scene

From sliding windows to single-shot detectors — IoU, anchor boxes, NMS, mAP, and the two-stage vs one-stage architecture trade-off. How YOLO detects 80 object categories in real-time at 30 FPS.

45 min
11 diagrams
8 Concepts Covered

Prerequisites

CNN Architectures

Concepts Covered

IoU, Anchor Boxes, NMS, mAP, YOLO, Faster-RCNN, FPN, Two-Stage vs One-Stage

Key Formulas

IoU

Intersection over Union — measure of bounding box quality; IoU > 0.5 is conventionally a correct detection
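
Written out, for a predicted box A and a ground-truth box B, with |·| denoting area (the symbols A and B are chosen here for illustration):

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
```

The second form is the one usually computed in code, since the union area follows from the two box areas and the intersection.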

mAP

Mean Average Precision — area under Precision-Recall curve, averaged over all classes
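
As a sketch of the definition, with p_c(r) the precision of class c at recall r and C the number of classes (notation assumed here, not taken from a specific benchmark):

```latex
\mathrm{AP}_c = \int_0^1 p_c(r)\, dr, \qquad \mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c
```

Benchmarks typically replace the integral with an interpolated sum over the measured precision-recall points, so exact values depend on the evaluation protocol.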

YOLO Loss

Weighted sum: box regression + objectness confidence + class probabilities
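
In generic form, the weighted sum can be written as below. The weights λ are placeholders for illustration; the exact terms and weights differ between YOLO versions.

```latex
\mathcal{L} = \lambda_{\text{box}}\, \mathcal{L}_{\text{box}} + \lambda_{\text{obj}}\, \mathcal{L}_{\text{obj}} + \lambda_{\text{cls}}\, \mathcal{L}_{\text{cls}}
```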

NMS

Keep only the most confident box when multiple boxes heavily overlap the same object

🎯

Beyond Classification: Where and What?

motivation

Image classification answers 'is there a cat?' Detection answers 'where are the cats, and are there also dogs?' This shift from a single label to a variable number of (class, bounding-box) outputs is what makes object detection a core task in autonomous driving, medical imaging, retail checkout, and surveillance. Camera-based driving systems, for example, typically need a detector that keeps up with the video stream at roughly 30 frames per second or more. Detection has also been one of the fastest-moving areas in computer vision: sliding-window classifiers (DPM, 2010) → two-stage detectors (RCNN, 2014; Faster-RCNN, 2015) → single-stage detectors (YOLO v1, 2016 → v8, 2023).

Tesla's Autopilot reportedly runs 8 cameras through a custom detection network at 36 FPS on a custom chip rated at about 72 TOPS. The entire model must fit within a tight latency budget while detecting objects up to roughly 200 m away.

💡

Two-Stage vs One-Stage: The Fundamental Trade-off

intuition

**Two-stage detectors (Faster-RCNN):** Stage 1: a Region Proposal Network (RPN) suggests ~300 candidate regions that might contain objects. Stage 2: a classification + regression head refines each proposal. Pro: high accuracy (it is easier to classify a cropped region). Con: slower, because the second stage runs on every proposal after the first stage finishes.

**One-stage detectors (YOLO, SSD):** Divide the image into a grid. Each cell directly predicts bounding box offsets, an objectness score, and class probabilities in a single forward pass. Pro: fast (real-time capable). Con: harder to train, and historically weaker on small or heavily overlapping objects.

**Anchor-based vs anchor-free:** YOLO v1 predicted boxes directly from each grid cell; anchor boxes (predefined box shapes) were introduced in YOLO v2 and kept in v3. YOLO v8, FCOS, and CenterNet are anchor-free: they predict box geometry directly at each location (a center plus width/height, or distances to the four box edges), which is simpler and often more accurate.

YOLO = 'You Only Look Once.' The insight: instead of running a classifier at thousands of sliding window positions, predict all boxes simultaneously in one pass of the network.
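
To make the idea of anchor boxes concrete, here is a minimal NumPy sketch that tiles a set of predefined box shapes over a feature grid. The function name `make_anchors`, the scales, and the aspect ratios are illustrative assumptions, not the settings of any particular detector.

```python
import numpy as np

def make_anchors(grid_size, stride, scales, ratios):
    """Tile anchors over a grid; returns (S*S*A, 4) boxes as xyxy pixels.

    Hypothetical helper: each anchor has area scale**2 and
    aspect ratio (width / height) equal to `ratio`.
    """
    shapes = np.array([(s * np.sqrt(r), s / np.sqrt(r))
                       for s in scales for r in ratios])       # (A, 2) as (w, h)

    # Pixel coordinates of every cell center
    centers = (np.arange(grid_size) + 0.5) * stride
    cx, cy = np.meshgrid(centers, centers)                     # each (S, S)
    cxcy = np.stack([cx.ravel(), cy.ravel()], axis=1)          # (S*S, 2)

    # Broadcast every center against every anchor shape
    half = shapes / 2.0
    top_left = cxcy[:, None, :] - half[None, :, :]
    bottom_right = cxcy[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)

anchors = make_anchors(grid_size=13, stride=32,
                       scales=[64, 128], ratios=[0.5, 1.0, 2.0])
print(anchors.shape)   # (1014, 4): 13*13 cells x 6 anchors per cell
```

An anchor-based head then predicts small corrections to each of these boxes, whereas an anchor-free head skips this tiling and regresses the box from each location directly.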

⚙️

YOLO Inference Pipeline

algorithm
1

Divide input image into an S×S grid (e.g., 13×13 for 416px input in YOLO v2).

2

For each cell: predict B bounding boxes (each: x, y, w, h, + objectness score) and C class probabilities. In YOLO v1 the class probabilities are shared by the whole cell; from v2 onward each box carries its own.

3

Box coordinates: x, y are offsets from the cell's top-left corner, squashed into (0, 1) by a sigmoid; in the anchor-based versions, w and h are log-scale offsets from the anchor sizes.

4

Objectness × class probability = class-specific confidence score for each box.

5

Apply Non-Maximum Suppression (NMS): for each class, sort boxes by confidence, keep highest-confidence box, suppress boxes with IoU > 0.5 with the kept box, repeat.

6

Final output: variable-length list of (class, confidence, x1, y1, x2, y2) tuples.
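
Steps 2 to 4 can be sketched for a single grid cell as follows. This assumes a YOLO v2-style parameterisation (sigmoid offsets from the cell corner, exponential scaling of an anchor, softmax class scores); the function and variable names are hypothetical and the raw values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def decode_cell(t, cell_xy, anchor_wh, grid_size, img_size):
    """Decode raw outputs t = (tx, ty, tw, th, to) for one anchor in one cell.

    cell_xy:   (column, row) index of the grid cell
    anchor_wh: anchor width/height in grid-cell units
    Returns the box as xyxy pixels and the objectness score.
    """
    tx, ty, tw, th, to = t
    cx, cy = cell_xy
    pw, ph = anchor_wh
    scale = img_size / grid_size                 # pixels per grid cell
    bx = (sigmoid(tx) + cx) * scale              # box center, offset from cell corner
    by = (sigmoid(ty) + cy) * scale
    bw = pw * np.exp(tw) * scale                 # log-scale offset from anchor size
    bh = ph * np.exp(th) * scale
    box = np.array([bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2])
    return box, sigmoid(to)

raw = np.array([0.0, 0.0, 0.0, 0.0, 2.0])        # made-up network outputs
box, objectness = decode_cell(raw, cell_xy=(6, 6), anchor_wh=(3.0, 2.0),
                              grid_size=13, img_size=416)
class_probs = softmax(np.array([2.0, 0.5, 0.1])) # made-up class logits
scores = objectness * class_probs                # class-specific confidence (step 4)
print(box)       # [160. 176. 256. 240.]
print(scores.argmax(), round(float(scores.max()), 3))
```

The resulting boxes and per-class scores are what the NMS step then filters.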

</>

Object Detection with YOLOv8 (Ultralytics)

code
python
# pip install ultralytics
from ultralytics import YOLO
import numpy as np

# ── 1. Load pretrained YOLO v8 ────────────────────────────────────────────────
model = YOLO("yolov8n.pt")   # nano model (3.2M params, fastest)
# Other sizes: yolov8s.pt, yolov8m.pt, yolov8l.pt, yolov8x.pt

# ── 2. Inference on a single image ────────────────────────────────────────────
# conf = minimum confidence to keep a box, iou = NMS overlap threshold
results = model("path/to/image.jpg", conf=0.25, iou=0.5)

for r in results:
    boxes = r.boxes                  # Boxes object
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # absolute pixel coords
        conf  = box.conf[0].item()              # confidence score
        cls   = int(box.cls[0].item())          # class index
        label = model.names[cls]
        print(f"{label}: {conf:.2f} at ({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")

# ── 3. Fine-tuning on custom dataset ─────────────────────────────────────────
# Dataset format: YOLO txt format
# data.yaml:
#   train: /path/to/train/images
#   val:   /path/to/val/images
#   nc: 3                       # number of classes
#   names: ['cat', 'dog', 'car']

model = YOLO("yolov8s.pt")     # start from COCO-pretrained weights
model.train(
    data="data.yaml",
    epochs=50,
    imgsz=640,
    batch=16,
    lr0=0.01,                   # initial learning rate
    lrf=0.01,                   # final lr fraction
    mosaic=1.0,                 # mosaic augmentation probability
    fliplr=0.5,                 # horizontal flip probability
    device=0,                   # GPU 0 (use device="cpu" without a GPU)
)
metrics = model.val()           # evaluate on the val split from data.yaml
print(f"mAP50: {metrics.box.map50:.4f}")

# ── 4. IoU calculation from scratch ──────────────────────────────────────────
def iou(box1, box2):
    """box = [x1, y1, x2, y2]"""
    # Corners of the intersection rectangle
    x1 = max(box1[0], box2[0]); y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2]); y2 = min(box1[3], box2[3])
    inter = max(0, x2-x1) * max(0, y2-y1)   # zero if the boxes do not overlap
    area1 = (box1[2]-box1[0]) * (box1[3]-box1[1])
    area2 = (box2[2]-box2[0]) * (box2[3]-box2[1])
    return inter / (area1 + area2 - inter + 1e-6)

gt   = [100, 50, 250, 200]
pred = [110, 60, 260, 210]
print(f"\nIoU = {iou(gt, pred):.4f}")

# ── 5. Manual NMS ─────────────────────────────────────────────────────────────
def nms(boxes, scores, iou_threshold=0.5):
    """Boxes: (N,4) xyxy, Scores: (N,)"""
    order = np.argsort(scores)[::-1]         # indices, highest score first
    keep  = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))                  # plain int so the list prints cleanly
        ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]])
        order = order[1:][ious < iou_threshold]   # drop boxes overlapping the kept one
    return keep

boxes  = np.array([[100,50,250,200],[105,55,255,205],[200,100,350,250]])
scores = np.array([0.95, 0.87, 0.72])
kept   = nms(boxes, scores)
print(f"Kept boxes: {kept}")  # [0, 2]: box 1 suppressed (overlaps with 0)
⚠️

mAP and the IoU Threshold Trap

pitfall

mAP@0.5 (IoU threshold 0.5) and mAP@0.5:0.95 (the average over IoU thresholds from 0.5 to 0.95 in 0.05 steps) tell very different stories. A model with great mAP@0.5 but poor mAP@0.5:0.95 localises objects loosely: fine for coarse tasks, bad for robotic grasping. Also, mAP is a single average over classes, so a strong overall number can hide poor performance on a few rare classes. For imbalanced datasets (e.g., rare traffic signs), report per-class AP separately.

Common pitfalls: (1) Forgetting to normalize bounding box coordinates to image size when the label format expects values in [0, 1]. (2) Setting the confidence threshold too low at inference, which floods NMS and the final output with weak boxes; conf_threshold ≈ 0.25 is a reasonable starting point. (3) Overfitting on small datasets; strong augmentation (mosaic, random crop, color jitter) usually helps.
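
The effect of the IoU threshold on AP can be seen with a small from-scratch sketch. It uses all-point interpolation for a single class; the function name and the toy detections are assumptions for illustration, and benchmark toolkits differ in interpolation details.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class at one IoU threshold.

    scores: confidence of each detection
    is_tp:  1 if the detection matched a ground-truth box at this IoU threshold
    num_gt: number of ground-truth boxes for the class
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)

    # Sentinels at both ends, then make precision non-increasing in recall
    r = np.concatenate([[0.0], recall, [1.0]])
    p = np.concatenate([[0.0], precision, [0.0]])
    p = np.maximum.accumulate(p[::-1])[::-1]

    # Area under the stepwise curve: sum over points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

scores = [0.9, 0.8, 0.7, 0.6]
# Same four detections, judged at a loose and a strict IoU threshold
ap_loose  = average_precision(scores, [1, 0, 1, 1], num_gt=4)   # e.g. IoU 0.5
ap_strict = average_precision(scores, [1, 0, 0, 1], num_gt=4)   # e.g. IoU 0.75
print(ap_loose, ap_strict)   # 0.625 0.375
```

One loosely localised box that stops counting as a match at the stricter threshold is enough to pull AP down noticeably, which is the gap that mAP@0.5:0.95 exposes.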

A 1-point mAP improvement on the COCO benchmark (80 classes, about 330k images) can represent months of research, so context matters when comparing models in your domain.
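
For the coordinate-normalization pitfall, here is a minimal sketch of converting a pixel-space box to a YOLO txt label line (class index followed by normalized center x, center y, width, height). The helper names and the image size are hypothetical.

```python
def xyxy_to_yolo(box, img_w, img_h):
    """Convert [x1, y1, x2, y2] in pixels to normalized (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h

def yolo_label_line(class_id, box, img_w, img_h):
    """Format one annotation as a line of a YOLO txt label file."""
    cx, cy, w, h = xyxy_to_yolo(box, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

line = yolo_label_line(1, [100, 50, 250, 200], img_w=500, img_h=400)
print(line)   # 1 0.350000 0.312500 0.300000 0.375000
```

Every value after the class index should fall in [0, 1]; values in the hundreds are a sign that pixel coordinates were written by mistake.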
