
Gradients and small steps

A tiny note on optimization—why the boring update rule still runs the world.

By Shane · 1 min read

math · ml · intuition

Most of what we call “learning” in machine learning is still, at heart, a loop: look at a signal, nudge parameters, repeat. The update everyone learns first is gradient descent:

$$w_{t+1} = w_t - \eta \nabla L(w_t)$$

The step size $\eta$ is where intuition lives. Too large and you overshoot; too small and you burn compute without going anywhere interesting.
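A minimal sketch of that loop, with an illustrative quadratic loss and a learning rate I picked arbitrarily (neither comes from anything specific above):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Plain gradient descent: repeat w <- w - eta * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy example: minimize L(w) = ||w||^2, whose gradient is 2w.
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0], eta=0.1, steps=100)
print(w_star)  # ends up close to [0, 0]
```

Try cranking `eta` up toward 1.0 on that same toy loss and the iterates start to oscillate or diverge, which is the overshooting failure described above.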

Why this still matters

Even fancy optimizers (momentum, Adam, Lion, …) are playing the same game: turn messy, high-dimensional curvature into something you can walk through without falling over. The human part is choosing the loss $L$ so that “downhill” actually means “better for people.”
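For contrast, here is a hedged sketch of the simplest of those variants, heavy-ball momentum: it accumulates a velocity instead of stepping along the raw gradient. The decay factor of 0.9 is a common convention, not something claimed above.

```python
import numpy as np

def momentum_descent(grad, w0, eta=0.1, beta=0.9, steps=100):
    """Heavy-ball momentum: keep a running velocity, then step along it."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - eta * grad(w)  # blend old direction with new gradient
        w = w + v
    return w
```

Structurally it is still the same loop: look at a signal, nudge parameters, repeat.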

If you remember one thing: the gradient is a local promise, not a global guarantee. Pair it with good problem framing, evaluation, and skepticism.

Thanks for reading.

Shane
