Object Detection: YOLO & Faster-RCNN
“From image classification to locating and labelling every object in the scene”
From sliding windows to single-shot detectors — IoU, anchor boxes, NMS, mAP, and the two-stage vs one-stage architecture trade-off. How YOLO detects 80 object categories in real-time at 30 FPS.
Key Formulas
IoU
Intersection over Union — measure of bounding box quality; IoU > 0.5 is conventionally a correct detection
mAP
Mean Average Precision — area under Precision-Recall curve, averaged over all classes
YOLO Loss
Weighted sum: box regression + objectness confidence + class probabilities
NMS
Keep only the most confident box when multiple boxes heavily overlap the same object
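The four entries above can be written compactly as follows. The notation is an assumption of this sketch rather than something fixed by the text: B_p and B_gt are the predicted and ground-truth boxes, p_c(r) is precision as a function of recall for class c, C is the number of classes, and the λ terms are loss weights whose values differ between YOLO versions. NMS is a procedure rather than a formula, so it is left out here.

```latex
\mathrm{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}

\mathrm{AP}_c = \int_0^1 p_c(r)\,dr,
\qquad
\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c

\mathcal{L}_{\text{YOLO}} =
    \lambda_{\text{box}}\,\mathcal{L}_{\text{box}}
  + \lambda_{\text{obj}}\,\mathcal{L}_{\text{obj}}
  + \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}}
```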
Beyond Classification: Where and What?
Image classification answers 'is there a cat?' Detection answers 'where are the cats, and are there also dogs?' This shift from a single label to a variable number of (class, bounding-box) outputs is what makes object detection a core task in autonomous driving, medical imaging, retail checkout, and surveillance. Camera-based driving stacks typically run a real-time detector at around 30 frames per second or more. Detection is also one of the fastest-moving areas in computer vision: the field went from sliding-window classifiers (DPM, around 2010) to two-stage detectors (RCNN, 2014; Faster-RCNN, 2015) to single-stage detectors (YOLO v1, 2016, through v8, 2023) in little more than a decade.
Tesla's Autopilot runs 8 cameras through a custom detection network at 36 FPS on a 72 TOPS custom chip. The entire model must fit in a tight latency budget while detecting objects 200m away.
Two-Stage vs One-Stage: The Fundamental Trade-off
**Two-stage detectors (Faster-RCNN):** In stage 1, a Region Proposal Network (RPN) suggests ~300 candidate regions that might contain objects. In stage 2, a classification + regression head refines each proposal. Pro: high accuracy, since classifying a cropped region is easier than classifying a whole scene. Con: slower, because the second stage has to wait for the first and runs once per proposal.
**One-stage detectors (YOLO, SSD):** Divide the image into a grid. Each cell directly predicts bounding box offsets, an objectness score, and class probabilities in a single forward pass. Pro: fast (real-time capable). Con: harder to train, and more likely to miss small or heavily overlapping objects.
**Anchor-based vs anchor-free:** YOLO v1 predicted box sizes directly; YOLO v2 through v5 used anchor boxes (predefined box shapes). YOLO v8, FCOS, and CenterNet are anchor-free: they predict the box center and its extent directly, which is simpler and often works as well or better.
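To make "anchor boxes" concrete, here is a minimal sketch that builds the anchor shapes for one feature-map location. The defaults follow the 3 scales × 3 aspect ratios described for Faster-RCNN's RPN; the function name `make_anchors` and the convention that a ratio means height / width are assumptions of this example.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return an array of (width, height) anchor shapes, one per scale/ratio pair.

    ratio is taken to mean height / width (an assumption of this sketch).
    Each anchor keeps an area of scale**2 regardless of its ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)  # wider when r < 1
            h = s * np.sqrt(r)  # taller when r > 1
            anchors.append((w, h))
    return np.array(anchors)

anchors = make_anchors()
print(anchors.shape)  # (9, 2): nine anchor shapes per location
print(anchors.round(1))
```

An anchor-based detector then predicts small corrections to these shapes at every location instead of regressing box sizes from scratch; an anchor-free detector skips this table entirely.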
YOLO = 'You Only Look Once.' The insight: instead of running a classifier at thousands of sliding window positions, predict all boxes simultaneously in one pass of the network.
YOLO Inference Pipeline
Divide input image into an S×S grid (e.g., 13×13 for 416px input in YOLO v2).
For each cell: predict B bounding boxes (each: x, y, w, h relative to cell, + objectness score) and C class probabilities.
Box coordinates: x, y are offsets from the cell's top-left corner, squashed to 0–1 by a sigmoid; w and h are log-scale offsets from the anchor sizes (in the anchor-based versions, YOLO v2 onward).
Objectness × class probability = class-specific confidence score for each box.
Apply Non-Maximum Suppression (NMS): for each class, sort boxes by confidence, keep highest-confidence box, suppress boxes with IoU > 0.5 with the kept box, repeat.
Final output: variable-length list of (class, confidence, x1, y1, x2, y2) tuples.
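The decoding steps above can be sketched for a single prediction. This follows the YOLO v2-style parameterisation (sigmoid offsets for the center, exponential scaling of the anchor for the size). The names `decode_box` and `class_confidence`, the stride of 32, and the choice to express anchors in grid-cell units are assumptions made for illustration; other versions differ in the details.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, col, row, anchor_w, anchor_h, stride=32):
    """Turn raw network outputs for one box into pixel-space (x1, y1, x2, y2).

    (col, row) is the grid cell; anchor_w / anchor_h are in grid-cell units
    (assumption of this sketch); stride is pixels per grid cell.
    """
    cx = (col + sigmoid(t_x)) * stride        # center x: cell corner + offset in 0-1
    cy = (row + sigmoid(t_y)) * stride        # center y
    w = anchor_w * math.exp(t_w) * stride     # log-scale offset from anchor width
    h = anchor_h * math.exp(t_h) * stride     # log-scale offset from anchor height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def class_confidence(objectness, class_probs):
    """Class-specific confidence = objectness x class probability."""
    return [objectness * p for p in class_probs]

# Cell (6, 6) of a 13x13 grid on a 416px image, with all raw outputs at zero
box = decode_box(0.0, 0.0, 0.0, 0.0, col=6, row=6, anchor_w=2.0, anchor_h=3.0)
print(box)  # (176.0, 160.0, 240.0, 256.0)
print(class_confidence(0.9, [0.8, 0.1]))
```

With all raw outputs at zero, the box sits at the center of its cell and has exactly the anchor's size, which is why anchors act as a sensible starting guess during training.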
Object Detection with YOLOv8 (Ultralytics)
    # pip install ultralytics
    from ultralytics import YOLO
    import numpy as np

    # ── 1. Load pretrained YOLO v8 ──────────────────────────────────────────
    model = YOLO("yolov8n.pt")  # nano model (3.2M params, fastest)
    # Other sizes: yolov8s.pt, yolov8m.pt, yolov8l.pt, yolov8x.pt

    # ── 2. Inference on a single image ──────────────────────────────────────
    # conf: minimum confidence to keep a box; iou: NMS overlap threshold
    results = model("path/to/image.jpg", conf=0.25, iou=0.5)
    for r in results:
        boxes = r.boxes  # Boxes object
        for box in boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()  # absolute pixel coords
            conf = box.conf[0].item()              # confidence score
            cls = int(box.cls[0].item())           # class index
            label = model.names[cls]
            print(f"{label}: {conf:.2f} at ({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")

    # ── 3. Fine-tuning on custom dataset ────────────────────────────────────
    # Dataset format: YOLO txt format
    # data.yaml:
    #   train: /path/to/train/images
    #   val: /path/to/val/images
    #   nc: 3  # number of classes
    #   names: ['cat', 'dog', 'car']
    model = YOLO("yolov8s.pt")  # start from COCO-pretrained detection weights
    model.train(
        data="data.yaml",
        epochs=50,
        imgsz=640,
        batch=16,
        lr0=0.01,  # initial learning rate
        lrf=0.01,  # final lr fraction
        device=0,  # GPU 0
    )  # mosaic, flip, and scale augmentation are enabled by default
    metrics = model.val()  # evaluate on the val split from data.yaml
    print(f"mAP50: {metrics.box.map50:.4f}")

    # ── 4. IoU calculation from scratch ─────────────────────────────────────
    def iou(box1, box2):
        """box = [x1, y1, x2, y2]"""
        # Corners of the intersection rectangle
        x1 = max(box1[0], box2[0])
        y1 = max(box1[1], box2[1])
        x2 = min(box1[2], box2[2])
        y2 = min(box1[3], box2[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)  # 0 if the boxes do not overlap
        area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
        area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
        return inter / (area1 + area2 - inter + 1e-6)  # epsilon avoids 0/0

    gt = [100, 50, 250, 200]
    pred = [110, 60, 260, 210]
    print(f"\nIoU = {iou(gt, pred):.4f}")  # 19600 / 25400 ≈ 0.7717

    # ── 5. Manual NMS (single class) ────────────────────────────────────────
    def nms(boxes, scores, iou_threshold=0.5):
        """Boxes: (N,4) xyxy, Scores: (N,). Returns indices of kept boxes."""
        order = np.argsort(scores)[::-1]  # highest score first
        keep = []
        while len(order) > 0:
            i = order[0]
            keep.append(int(i))  # plain int so the printed list stays readable
            ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]])
            order = order[1:][ious < iou_threshold]  # drop heavy overlaps with box i
        return keep

    boxes = np.array([[100, 50, 250, 200], [105, 55, 255, 205], [200, 100, 350, 250]])
    scores = np.array([0.95, 0.87, 0.72])
    kept = nms(boxes, scores)
    print(f"Kept boxes: {kept}")  # [0, 2]: box 1 is suppressed (overlaps box 0)
mAP and the IoU Threshold Trap
mAP@0.5 (IoU threshold 0.5) and mAP@0.5:0.95 (the average over IoU thresholds from 0.5 to 0.95 in 0.05 steps) tell very different stories. A model with great mAP@0.5 but poor mAP@0.5:0.95 localises objects loosely, which is fine for coarse tasks but bad for robotic grasping. Also, mAP weights all classes equally, so a single averaged number can hide poor performance on rare classes. For imbalanced datasets (e.g., rare traffic signs), report per-class AP separately.
Common pitfalls: (1) Forgetting to normalize bounding box coordinates by the image width and height; YOLO label files expect values between 0 and 1. (2) Setting the confidence threshold too low at inference, which floods NMS with weak boxes; conf_threshold ≈ 0.25 is a reasonable starting point. (3) Overfitting on small datasets; strong augmentation (mosaic, random crop, color jitter) usually helps.
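Pitfall (1) is easy to avoid with a small conversion helper. The sketch below turns a pixel-space corner box into the normalized "class cx cy w h" line used by YOLO txt labels; the function names `xyxy_to_yolo` and `yolo_label_line` are made up for this example.

```python
def xyxy_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corners to normalized (cx, cy, w, h), each in the range 0-1."""
    cx = (x1 + x2) / 2 / img_w   # box center x as a fraction of image width
    cy = (y1 + y2) / 2 / img_h   # box center y as a fraction of image height
    w = (x2 - x1) / img_w        # box width as a fraction of image width
    h = (y2 - y1) / img_h        # box height as a fraction of image height
    return cx, cy, w, h

def yolo_label_line(class_id, box_xyxy, img_w, img_h):
    """Format one object as a line of a YOLO txt label file."""
    cx, cy, w, h = xyxy_to_yolo(*box_xyxy, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

line = yolo_label_line(0, (100, 50, 250, 200), img_w=640, img_h=480)
print(line)
```

If any of the four numbers in a label line is greater than 1, the coordinates were almost certainly left in pixels.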
A 1% mAP improvement on the COCO benchmark (80 classes, roughly 330k images) can represent months of research, so context matters when comparing models in your domain.
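The difference between the two mAP variants in this section can be seen on a toy example for a single class. This is a simplified sketch, not the official COCO evaluation: it assumes each detection has already been matched to at most one ground-truth box (IoU 0.0 means no match), and it uses all-point interpolation of the precision-recall curve. The function name `average_precision` and the sample numbers are invented for illustration.

```python
import numpy as np

def average_precision(scores, matched_iou, num_gt, iou_thr):
    """AP for one class at one IoU threshold (area under the PR envelope)."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # most confident first
    tp = (np.asarray(matched_iou, dtype=float)[order] >= iou_thr).astype(float)
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad the curve, then make precision monotonically non-increasing
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Sum rectangle areas wherever recall changes
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

scores = [0.9, 0.8, 0.7, 0.6]           # detection confidences
matched_iou = [0.92, 0.57, 0.0, 0.81]   # IoU with the matched ground truth
num_gt = 3                              # ground-truth objects of this class

ap50 = average_precision(scores, matched_iou, num_gt, 0.5)
thresholds = np.linspace(0.5, 0.95, 10)  # 0.50, 0.55, ..., 0.95
ap50_95 = float(np.mean([average_precision(scores, matched_iou, num_gt, t)
                         for t in thresholds]))
print(f"AP@0.5      = {ap50:.4f}")
print(f"AP@0.5:0.95 = {ap50_95:.4f}")
```

The loosely localised detection (IoU 0.57) counts as correct at the 0.5 threshold but as a miss at stricter thresholds, which is what pulls the averaged score down. Averaging this per-class AP over all classes gives mAP.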