Vision · intermediate

Object Detection: YOLO & Faster-RCNN

“From image classification to locating and labelling every object in the scene”

From sliding windows to single-shot detectors — IoU, anchor boxes, NMS, mAP, and the two-stage vs one-stage architecture trade-off. How YOLO detects 80 object categories in real-time at 30 FPS.

45 min

11 diagrams

8 Concepts Covered

Prerequisites

→CNN Architectures

Concepts Covered

IoU · Anchor Boxes · NMS · mAP · YOLO · Faster-RCNN · FPN · Two-Stage vs One-Stage


∑Key Formulas

IoU

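Intersection over Union: the overlap area of the predicted and ground-truth boxes divided by the area of their union; a measure of bounding box quality. IoU ≥ 0.5 is conventionally counted as a correct detection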

mAP

Mean Average Precision — area under Precision-Recall curve, averaged over all classes

YOLO Loss

Weighted sum: box regression + objectness confidence + class probabilities

NMS

Keep only the most confident box when multiple boxes heavily overlap the same object
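The formulas described above can be written out as follows. This is a sketch in conventional notation; the symbols $A$, $B$ (predicted and ground-truth boxes), $p_c(r)$ (precision of class $c$ as a function of recall), $C$ (number of classes), and the loss weights $\lambda_{\text{box}}, \lambda_{\text{obj}}, \lambda_{\text{cls}}$ are naming assumptions, not taken from this page, and the exact loss terms differ between YOLO versions.

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}

\mathrm{AP}_c = \int_0^1 p_c(r)\,\mathrm{d}r,
\qquad
\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c

\mathcal{L}_{\text{YOLO}} =
    \lambda_{\text{box}}\,\mathcal{L}_{\text{box}}
  + \lambda_{\text{obj}}\,\mathcal{L}_{\text{obj}}
  + \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}}
```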

🎯

Beyond Classification: Where and What?

motivation

Image classification answers 'is there a cat?' Detection answers 'where are the cats, and are there also dogs?' This shift from a single label to a variable number of (class, bounding-box) outputs is what makes object detection a core task in autonomous driving, medical imaging, retail checkout, and surveillance. Camera-based driving systems typically run a real-time detector at around 30 frames per second or more. The progression from sliding-window classifiers (DPM, 2010) → two-stage detectors (RCNN, 2014; Faster-RCNN, 2015) → single-stage detectors (YOLO v1, 2016 → v8, 2023) has made detection one of the fastest-moving areas in computer vision.

Tesla's Autopilot runs 8 cameras through a custom detection network at 36 FPS on a 72 TOPS custom chip. The entire model must fit in a tight latency budget while detecting objects 200m away.

💡

Two-Stage vs One-Stage: The Fundamental Trade-off

intuition

**Two-stage detectors (Faster-RCNN):** Stage 1: a Region Proposal Network (RPN) suggests roughly 300 candidate regions that might contain objects. Stage 2: a classification + regression head refines each proposal. Pro: high accuracy, because classifying a cropped region is an easier problem. Con: slower, because the second stage runs on every proposal after the first stage has finished.

**One-stage detectors (YOLO, SSD):** Divide the image into a grid. Each cell directly predicts bounding box offsets, an objectness score, and class probabilities in a single forward pass. Pro: fast (real-time capable). Con: historically less accurate on small or heavily overlapping objects, and training has to cope with far more background locations than object locations.

**Anchor-based vs anchor-free:** YOLO v2 through v5 use anchor boxes (predefined box widths and heights); YOLO v1 had no anchors and regressed box sizes directly. YOLO v8, FCOS, and CenterNet are anchor-free: each location predicts the box geometry directly (a center plus width/height, or distances to the four box edges), which is simpler and often at least as accurate.
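To make "anchor boxes" concrete, here is a minimal sketch of how an anchor-based detector might lay out its reference boxes. The scales (128, 256, 512), aspect ratios (1:2, 1:1, 2:1), and stride of 16 follow the commonly cited Faster-RCNN RPN configuration; the function names `make_anchors` and `tile_anchors` and the tiny 4×3 grid are illustrative assumptions.

```python
import math

def make_anchors(scales, ratios):
    """Return anchor shapes as (w, h); ratio = h / w, area = scale ** 2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((w, h))
    return anchors

def tile_anchors(anchors, grid_w, grid_h, stride):
    """Centre every anchor shape on every grid cell; returns xyxy boxes in pixels."""
    boxes = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            cx = (gx + 0.5) * stride   # cell centre in image pixels
            cy = (gy + 0.5) * stride
            for w, h in anchors:
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# 3 scales x 3 ratios = 9 anchor shapes per grid location
anchors = make_anchors(scales=[128, 256, 512], ratios=[0.5, 1.0, 2.0])
boxes = tile_anchors(anchors, grid_w=4, grid_h=3, stride=16)
print(len(anchors), len(boxes))  # 9 108
```

An anchor-based head then predicts, for each of these boxes, a small correction plus an objectness score; an anchor-free head skips the tiling and regresses the box from each location directly.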

YOLO = 'You Only Look Once.' The insight: instead of running a classifier at thousands of sliding window positions, predict all boxes simultaneously in one pass of the network.

⚙️

YOLO Inference Pipeline

algorithm

Divide input image into an S×S grid (e.g., 13×13 for 416px input in YOLO v2).

For each cell: predict B bounding boxes (each: x, y, w, h relative to cell, + objectness score) and C class probabilities.

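Box coordinates: x, y are offsets from the cell's top-left corner, squashed to 0–1 by a sigmoid so the box center stays inside its cell; w and h are log-scale offsets from the anchor sizes (width = anchor width × exp(t_w), and likewise for height).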

Objectness × class probability = class-specific confidence score for each box.

Apply Non-Maximum Suppression (NMS): for each class, sort boxes by confidence, keep highest-confidence box, suppress boxes with IoU > 0.5 with the kept box, repeat.

Final output: variable-length list of (class, confidence, x1, y1, x2, y2) tuples.
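The decoding steps above (before NMS) can be sketched for a single anchor in a single cell. This follows the YOLO v2-style parameterisation and assumes anchor sizes are given in grid-cell units and the input is square; `decode_box` and `class_scores` are hypothetical helper names, not part of any library.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(t, cell_x, cell_y, anchor_w, anchor_h, grid_size, img_size):
    """Decode raw outputs t = (tx, ty, tw, th) into an xyxy box in pixels."""
    tx, ty, tw, th = t
    # Centre: sigmoid keeps the offset in (0, 1), measured from the cell's top-left corner
    bx = (cell_x + sigmoid(tx)) / grid_size
    by = (cell_y + sigmoid(ty)) / grid_size
    # Size: exponential rescaling of the anchor (a log-scale offset)
    bw = anchor_w * math.exp(tw) / grid_size
    bh = anchor_h * math.exp(th) / grid_size
    # Normalised centre/size -> absolute corner coordinates
    return ((bx - bw / 2) * img_size, (by - bh / 2) * img_size,
            (bx + bw / 2) * img_size, (by + bh / 2) * img_size)

def class_scores(objectness_logit, class_probs):
    """Class-specific confidence = objectness x class probability."""
    obj = sigmoid(objectness_logit)
    return [obj * p for p in class_probs]

# Centre cell of a 13x13 grid on a 416px input, anchor of 2x3 cells, all-zero raw outputs
box = decode_box((0.0, 0.0, 0.0, 0.0), cell_x=6, cell_y=6,
                 anchor_w=2.0, anchor_h=3.0, grid_size=13, img_size=416)
print([round(v) for v in box])       # [176, 160, 240, 256]
print(class_scores(0.0, [0.9, 0.1]))
```

With all-zero outputs the box sits at the cell center and has exactly the anchor's size (2×3 cells of 32px each), which is why well-chosen anchors give the network a good starting point.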

</>

Object Detection with YOLOv8 (Ultralytics)

code

python

# pip install ultralytics
from ultralytics import YOLO
import numpy as np

# ── 1. Load pretrained YOLO v8 ────────────────────────────────────────────────
model = YOLO("yolov8n.pt")   # nano model (3.2M params, fastest)
# Other sizes: yolov8s.pt, yolov8m.pt, yolov8l.pt, yolov8x.pt

# ── 2. Inference on a single image ────────────────────────────────────────────
results = model("path/to/image.jpg", conf=0.25, iou=0.5)

for r in results:
    boxes = r.boxes                  # Boxes object
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # absolute pixel coords
        conf  = box.conf[0].item()              # confidence score
        cls   = int(box.cls[0].item())          # class index
        label = model.names[cls]
        print(f"{label}: {conf:.2f} at ({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")

# ── 3. Fine-tuning on custom dataset ─────────────────────────────────────────
# Dataset format: YOLO txt format
# data.yaml:
#   train: /path/to/train/images
#   val:   /path/to/val/images
#   nc: 3                       # number of classes
#   names: ['cat', 'dog', 'car']

model = YOLO("yolov8s.pt")     # start from COCO-pretrained detection weights
results = model.train(
    data="data.yaml",
    epochs=50,
    imgsz=640,
    batch=16,
    lr0=0.01,                   # initial learning rate
    lrf=0.01,                   # final lr fraction
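    # Training augmentation is on by default; `augment` is a predict-time flag.
    mosaic=1.0,                 # mosaic augmentation probability (default)
    fliplr=0.5,                 # horizontal flip probability (default)
    scale=0.5,                  # random scale gain (default)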
    device=0,                   # GPU 0
)
metrics = model.val()           # evaluate on the val split from data.yaml
print(f"mAP50: {metrics.box.map50:.4f}")

# ── 4. IoU calculation from scratch ──────────────────────────────────────────
def iou(box1, box2):
    """box = [x1, y1, x2, y2]"""
    x1 = max(box1[0], box2[0]); y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2]); y2 = min(box1[3], box2[3])
    inter = max(0, x2-x1) * max(0, y2-y1)
    area1 = (box1[2]-box1[0]) * (box1[3]-box1[1])
    area2 = (box2[2]-box2[0]) * (box2[3]-box2[1])
    return inter / (area1 + area2 - inter + 1e-6)

gt   = [100, 50, 250, 200]
pred = [110, 60, 260, 210]
print(f"\nIoU = {iou(gt, pred):.4f}")

# ── 5. Manual NMS ─────────────────────────────────────────────────────────────
def nms(boxes, scores, iou_threshold=0.5):
    """Boxes: (N,4) xyxy, Scores: (N,)"""
    order = np.argsort(scores)[::-1]
    keep  = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))     # plain int so the list prints as [0, 2]
        ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]])
        order = order[1:][ious < iou_threshold]
    return keep

boxes  = np.array([[100,50,250,200],[105,55,255,205],[200,100,350,250]])
scores = np.array([0.95, 0.87, 0.72])
kept   = nms(boxes, scores)
print(f"Kept boxes: {kept}")  # [0, 2] — box 1 suppressed (overlaps with 0)

⚠️

mAP and the IoU Threshold Trap

pitfall

mAP@0.5 (IoU threshold 0.5) and mAP@0.5:0.95 (the average over IoU thresholds from 0.5 to 0.95 in 0.05 steps) tell very different stories. A model with great mAP@0.5 but poor mAP@0.5:0.95 localises objects loosely: fine for coarse tasks, bad for robotic grasping.

Also: mAP is an unweighted average over classes, so a single number can hide a poor AP on a rare class. For imbalanced datasets (e.g., rare traffic signs), report per-class AP separately.

Common training pitfalls: (1) Forgetting to normalize bounding box coordinates to image size. (2) Setting the confidence threshold too low at inference, which floods NMS and the final output with false positives; conf_threshold ≈ 0.25 is a reasonable starting point for deployment (mAP evaluation deliberately uses a much lower threshold so that the whole precision-recall curve is covered). (3) Overfitting on small datasets: use strong augmentation (mosaic, random crop, color jitter).

A 1% mAP improvement on COCO benchmark (an 80-class, 330k image dataset) represents months of research — context matters when comparing models in your domain.
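To see why the IoU threshold matters, here is a minimal sketch of average precision for one class, using all-point interpolation (the PASCAL VOC 2010+ style; COCO samples the curve at 101 recall points instead). The function name `average_precision` and the toy detections are illustrative assumptions. The `is_tp` flags stand for "this detection matched a ground-truth box at the chosen IoU threshold".

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class."""
    # Walk through detections from most to least confident
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Interpolate: precision at recall r becomes the max precision at any recall >= r
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

scores = [0.9, 0.8, 0.7, 0.6]
# Same four detections, 4 ground-truth boxes; the third box is only loosely localised
ap_loose  = average_precision(scores, [True, False, True, True],  num_gt=4)  # matches at IoU 0.5
ap_strict = average_precision(scores, [True, False, False, True], num_gt=4)  # matches at IoU 0.75
print(ap_loose, ap_strict)  # 0.625 0.375
```

The detections and scores are identical in both calls; only the matching threshold changed. mAP is the mean of this value over classes, and mAP@0.5:0.95 additionally averages it over the ten IoU thresholds.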
