Python ML Stack: NumPy, Pandas & Matplotlib
“Your data science toolkit — NumPy, Pandas, Matplotlib and the Jupyter workflow”
Master the tools every ML engineer uses daily — NumPy vectorized operations, Pandas DataFrames for real-world data, and Matplotlib/Seaborn for exploratory visualization. The foundation everything else builds on.
Key Formulas
Vectorized Mean
np.mean(X) — NumPy computes this in C, orders of magnitude faster than a Python loop
Broadcasting
NumPy stretches the smaller array along the missing dimension — avoids explicit loops
Pearson Correlation
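np.corrcoef(X, Y) returns the 2×2 correlation matrix; its off-diagonal entry [0, 1] is Pearson's r, which ranges from -1 to 1 and measures linear dependence between two features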
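As a rough check of the Pearson entry, the sketch below compares np.corrcoef with the textbook definition (covariance divided by the product of the standard deviations). The synthetic data, the seed, and the variable names are illustrative assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)                   # fixed seed so the run is repeatable
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)    # y depends linearly on x, plus noise

# np.corrcoef returns the full 2x2 correlation matrix; r is the off-diagonal entry
r_numpy = np.corrcoef(x, y)[0, 1]

# Same quantity from the definition, written with vectorized operations
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(f"corrcoef: {r_numpy:.4f}  manual: {r_manual:.4f}")
```

The two values should agree to floating-point precision, which is a quick way to convince yourself what np.corrcoef actually computes.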
Why This Stack Before Anything Else
Every major ML framework (scikit-learn, PyTorch, TensorFlow, JAX) either builds on NumPy arrays or mirrors their API and converts to and from them. Understanding how arrays work in memory (contiguous C-order layout, dtype, strides) is the difference between writing Python loops that pay interpreter overhead on every element and vectorized NumPy operations that run the same loop at C speed. The asymptotic cost is usually identical; what shrinks is the constant factor, often by two orders of magnitude. Pandas gives you labeled DataFrames for messy real-world data, and Matplotlib/Seaborn let you see what's happening before you model it. The ML ecosystem speaks NumPy, so mastering it means mastering the lingua franca.
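To make the memory-layout terms concrete, here is a minimal sketch that inspects dtype, strides, and contiguity on a small array. The array `a` and its shape are arbitrary choices for illustration.

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)   # C-order (row-major) by default

print(a.dtype, a.itemsize)          # float64 8  (8 bytes per element)
print(a.strides)                    # (32, 8): 32 bytes to the next row, 8 to the next column
print(a.flags["C_CONTIGUOUS"])      # True

# Transposing swaps the strides instead of moving data, so it returns a view
t = a.T
print(t.strides)                    # (8, 32)
print(t.flags["C_CONTIGUOUS"])      # False
print(np.shares_memory(a, t))       # True
```

Because the transpose only rewrites the strides, it costs almost nothing, but iterating over `t` row by row now jumps through memory rather than reading it sequentially.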
A pure-Python for-loop over 10M numbers takes roughly 4 seconds, while np.sum() takes roughly 8 ms: about 500× faster, though exact timings vary with hardware and Python version. This matters when you're computing gradients over a neural network.
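You can reproduce the comparison yourself with a sketch like the one below. It uses 1M elements rather than 10M so the loop finishes quickly, and the timings it prints will differ from machine to machine; the variable names are illustrative.

```python
import time
import numpy as np

n = 1_000_000                       # smaller than 10M so the loop finishes quickly
data = np.random.default_rng(0).random(n)

# Pure-Python loop: one interpreter step per element
start = time.perf_counter()
total_loop = 0.0
for value in data:
    total_loop += value
loop_seconds = time.perf_counter() - start

# Vectorized: a single call that loops in compiled code
start = time.perf_counter()
total_numpy = data.sum()
numpy_seconds = time.perf_counter() - start

print(f"loop:   {loop_seconds:.4f} s")
print(f"np.sum: {numpy_seconds:.6f} s")
print(f"speedup: about {loop_seconds / numpy_seconds:.0f}x")
```

In a notebook, %timeit gives more stable numbers than a single perf_counter measurement because it repeats the run many times.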
NumPy Essentials — What You Actually Need
Array creation: np.array(), np.zeros(), np.ones(), np.linspace(), np.arange(), np.random.randn()
Shape manipulation: .reshape(), .T (transpose), np.concatenate(), np.stack(), np.squeeze()
Vectorized math: +, -, *, / broadcast element-wise; np.dot() / @ for matrix multiplication
Indexing: arr[2:5], arr[arr > 0] (boolean mask), arr[:, 0] (column slice)
Aggregations: .sum(), .mean(), .std(), .max(), .argmax(); all accept an axis= parameter
Broadcasting rule: align shapes from the right, dimensions must match or be 1
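The axis= parameter and the broadcasting rule from the list above can be sketched together as follows. The small array `X` is a made-up example; keepdims=True is a standard NumPy argument that keeps the reduced axis with size 1.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)   # 4 samples, 3 features

# axis=0 collapses the rows and leaves one value per column
col_means = X.mean(axis=0)                     # shape (3,)
# keepdims=True keeps the reduced axis as size 1 so it can broadcast back
row_means = X.mean(axis=1, keepdims=True)      # shape (4, 1)

# (4, 3) - (3,): trailing dimensions 3 and 3 match, so the means are reused per row
centered_cols = X - col_means
# (4, 3) - (4, 1): the size-1 dimension is stretched across the 3 columns
centered_rows = X - row_means

print(centered_cols.mean(axis=0))              # all zeros
print(centered_rows.mean(axis=1))              # all zeros

# (4, 3) with (4,) fails: aligned from the right, 3 and 4 neither match nor equal 1
try:
    X - X.mean(axis=1)
except ValueError as err:
    print("broadcast error:", err)
```

The failing case at the end is the one that most often surprises people: per-row statistics need keepdims=True (or a reshape) before they can be subtracted from the original array.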
NumPy, Pandas & Matplotlib — Full Workflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ── NumPy: arrays, broadcasting, vectorized ops ───────────────────────────────
X = np.random.randn(1000, 5)                       # 1000 samples, 5 features
y = 2*X[:,0] - X[:,1] + 0.5*np.random.randn(1000)

print(X.shape, X.dtype)                            # (1000, 5) float64
print(X.mean(axis=0).round(3))                     # per-feature means ≈ 0
print(X.std(axis=0).round(3))                      # per-feature stds ≈ 1

# Broadcasting: subtract mean and divide by std (manual StandardScaler)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Matrix multiply: X @ W where W is 5×2
W = np.random.randn(5, 2)
Z = X_scaled @ W                                   # shape (1000, 2)

# Boolean indexing
high_income = X[X[:,0] > 1.0]                      # rows where feature 0 > 1σ
print(f"High income rows: {len(high_income)}")

# ── Pandas: DataFrames, EDA ───────────────────────────────────────────────────
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])
df["target"] = y

# Quick EDA
print(df.describe().round(2))                      # count, mean, std, quartiles
print(df.isnull().sum())                           # check for missing values
print(df.dtypes)

# Groupby example
df["group"] = np.where(df["feat_0"] > 0, "high", "low")
print(df.groupby("group")["target"].agg(["mean","std"]).round(3))

# Correlations
corr = df.drop(columns="group").corr()
print(corr["target"].sort_values(ascending=False).round(3))

# ── Matplotlib / Seaborn: visualization ──────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Distribution plot
axes[0].hist(df["target"], bins=50, color="#6c63ff", alpha=0.8, edgecolor="white")
axes[0].set_title("Target distribution")
axes[0].set_xlabel("y")

# 2. Scatter + regression line
axes[1].scatter(df["feat_0"], df["target"], alpha=0.3, s=10, color="#06b6d4")
m, b = np.polyfit(df["feat_0"], df["target"], 1)
x_line = np.linspace(-3, 3, 100)
axes[1].plot(x_line, m*x_line + b, color="#ff6b6b", lw=2, label=f"slope={m:.2f}")
axes[1].set_title("Feature 0 vs Target")
axes[1].legend()

# 3. Correlation heatmap
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0, ax=axes[2], cbar=False)
axes[2].set_title("Correlation matrix")

plt.tight_layout()
plt.show()

# ── Jupyter tips ──────────────────────────────────────────────────────────────
# %timeit np.dot(X, W)    # benchmark any cell
# %matplotlib inline      # show plots in notebook
# df.head()               # preview first 5 rows
# df.info()               # dtypes + non-null counts
# pd.set_option('display.max_columns', None)   # show all columns
The Most Common NumPy Bugs
1) Shape mismatch: (100,) ≠ (100,1). Always check .shape before matrix ops, and use .reshape(-1, 1) to add a dimension.
2) Integer dtypes: np.array([3]) / 2 gives 1.5 because / is always true division in Python 3, but // floors, and writing a float into a dtype=int array silently truncates it.
3) Copies vs. views: a basic slice such as arr[0:5] returns a VIEW, so modifying it modifies the original. Use .copy() to be safe. Boolean-mask and integer-array indexing return copies.
4) In-place vs. out-of-place: X *= 2 modifies X in place; Y = X * 2 creates a new array.
5) NaN propagation: np.mean([1, 2, np.nan]) returns NaN. Use np.nanmean() for NaN-safe aggregations.
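Bugs 1, 2, and 5 from the list above are easy to reproduce in a few lines. The sketch below uses made-up arrays (`y_true`, `y_pred`, `ints`, `values`) purely for illustration.

```python
import numpy as np

# 1) Shape mismatch: (3,) minus (3, 1) silently broadcasts to (3, 3)
y_true = np.array([1.0, 2.0, 3.0])             # shape (3,)
y_pred = np.array([[1.0], [2.0], [3.0]])       # shape (3, 1)
print((y_true - y_pred).shape)                 # (3, 3), not (3,)
print((y_true.reshape(-1, 1) - y_pred).shape)  # (3, 1) after reshaping

# 2) Integer dtypes: / returns floats, // floors, assigning a float truncates
ints = np.array([3, 4, 5])
print(ints / 2)                                # [1.5 2.  2.5]
print(ints // 2)                               # [1 2 2]
ints[0] = 2.9                                  # stored as 2 because the dtype is int
print(ints)                                    # [2 4 5]

# 5) NaN propagation
values = np.array([1.0, 2.0, np.nan])
print(np.mean(values))                         # nan
print(np.nanmean(values))                      # 1.5
```

The first case is the most dangerous of the three because it raises no error: a loss computed from the (3, 3) result would simply be wrong.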
np.shares_memory(a, b) tells you if two arrays share underlying data — crucial to know when you're 'copying' slices.
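A short sketch of np.shares_memory in action, contrasting a basic slice, an explicit copy, and a boolean-mask selection. The array and the values written into it are arbitrary.

```python
import numpy as np

arr = np.arange(10)

view = arr[0:5]            # basic slice: a view onto the same buffer
copy = arr[0:5].copy()     # explicit copy: independent data
mask = arr[arr > 6]        # boolean indexing returns a copy as well

print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, copy))   # False
print(np.shares_memory(arr, mask))   # False

view[0] = 99               # writes through to arr
copy[1] = -1               # does not touch arr
print(arr[:3])             # [99  1  2]
```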