Gradient Descent

Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the core update rule behind training nearly all deep learning models.

Update Rule

Given a loss $\mathcal{L}(\theta)$ and learning rate $\eta$, parameters are updated as:

$$ \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t) $$

The gradient $\nabla_\theta \mathcal{L}$ points in the direction of steepest ascent, so stepping against it moves toward lower loss (for a sufficiently small $\eta$).
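As a concrete sketch of the update rule, consider the one-dimensional loss $\mathcal{L}(\theta) = (\theta - 3)^2$; the loss, learning rate, and step count here are illustrative choices, not from the text:

```python
# Minimal sketch: gradient descent on L(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). Learning rate eta = 0.1.
def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (theta - 3.0)   # direction of steepest ascent
        theta = theta - lr * grad  # step against the gradient
    return theta

theta_star = gradient_descent(theta=0.0)
# theta_star converges toward the minimiser theta = 3
```

Each step shrinks the error $\theta_t - 3$ by a constant factor $1 - 2\eta$, so convergence is geometric for this quadratic.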

Variants

| Variant | Batch size | Characteristic |
| --- | --- | --- |
| Batch GD | Full dataset | Stable, but slow per step |
| Stochastic GD | 1 sample | Noisy, fast; noise can help escape shallow local minima |
| Mini-batch GD | $m$ samples | Practical balance; the standard in practice |
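To make the mini-batch variant concrete, here is a sketch of one training epoch on a toy least-squares problem; the data, the model $y \approx w x$, and the batch size $m = 4$ are illustrative assumptions:

```python
import random

# Sketch of one epoch of mini-batch GD on a toy 1-D least-squares
# problem: data pairs (x, y), model y ~ w * x (assumptions for
# illustration). Each step averages the gradient over m samples.
def minibatch_epoch(w, data, lr=0.01, m=4):
    random.shuffle(data)  # fresh sample order each epoch
    for i in range(0, len(data), m):
        batch = data[i:i + m]
        # average gradient of (w*x - y)^2 over the mini-batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w

data = [(float(x), 2.0 * x) for x in range(1, 9)]  # true slope w = 2
w = 0.0
for _ in range(200):
    w = minibatch_epoch(w, data, lr=0.01, m=4)
# w approaches the true slope 2
```

Averaging over the batch keeps the gradient estimate's scale independent of $m$, which is why the same learning rate works across batch sizes here.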

With Momentum

Momentum accumulates a velocity vector in the gradient direction, damping oscillations:

$$ v_{t+1} = \beta v_t + (1-\beta) \, \nabla_\theta \mathcal{L}(\theta_t) \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1} $$

Implementation

def sgd(params, grads, lr=0.01, momentum=0.9, velocity=None):
    """One SGD-with-momentum step; returns (new_param, new_velocity) pairs."""
    if velocity is None:
        velocity = [0.0] * len(params)  # start from zero velocity
    updated = []
    for p, g, v in zip(params, grads, velocity):
        # v_{t+1} = beta * v_t + (1 - beta) * grad  (EMA of gradients)
        v_new = momentum * v + (1 - momentum) * g
        # theta_{t+1} = theta_t - eta * v_{t+1}
        updated.append((p - lr * v_new, v_new))
    return updated
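A hedged usage sketch of `sgd`: the definition above is repeated here so the snippet runs on its own, and the quadratic loss $\mathcal{L}(p) = (p - 5)^2$ with its target value 5 is an illustrative assumption:

```python
# The sgd step from above, repeated so this snippet is self-contained.
def sgd(params, grads, lr=0.01, momentum=0.9, velocity=None):
    if velocity is None:
        velocity = [0.0] * len(params)
    updated = []
    for p, g, v in zip(params, grads, velocity):
        v_new = momentum * v + (1 - momentum) * g
        updated.append((p - lr * v_new, v_new))
    return updated

# Minimise L(p) = (p - 5)^2, whose gradient is 2 * (p - 5).
params, velocity = [0.0], None
for _ in range(500):
    grads = [2 * (p - 5.0) for p in params]
    pairs = sgd(params, grads, lr=0.1, momentum=0.9, velocity=velocity)
    params = [p for p, _ in pairs]    # unpack the new parameters...
    velocity = [v for _, v in pairs]  # ...and carry velocity forward
# params[0] approaches the minimiser 5
```

Note that the caller must thread `velocity` back into the next call; a stateful optimiser object would avoid this bookkeeping, at the cost of the pure-function style shown here.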