Gradient Descent

Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the core update rule behind training nearly all deep learning models.

Update Rule

Given a loss $\mathcal{L}(\theta)$ and learning rate $\eta$, parameters are updated as:

$$ \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t) $$

The gradient $\nabla_\theta \mathcal{L}$ points in the direction of steepest ascent, so stepping against it moves toward lower loss (for a sufficiently small $\eta$).
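As a concrete sketch of the update rule, consider the one-dimensional loss $\mathcal{L}(\theta) = (\theta - 3)^2$; the loss, learning rate, and step count here are illustrative choices, not from the text:

```python
# Minimal sketch: gradient descent on L(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). Learning rate eta = 0.1.
def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (theta - 3.0)   # direction of steepest ascent
        theta = theta - lr * grad  # step against the gradient
    return theta

theta_star = gradient_descent(theta=0.0)
# theta_star converges toward the minimiser theta = 3
```

Each step shrinks the error $\theta_t - 3$ by a constant factor $1 - 2\eta$, so convergence is geometric for this quadratic.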

Variants

| Variant | Batch size | Characteristic |
| --- | --- | --- |
| Batch GD | Full dataset | Stable, but slow per step |
| Stochastic GD | 1 sample | Noisy, fast; noise can help escape shallow local minima |
| Mini-batch GD | $m$ samples | Practical balance; the standard in practice |
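To make the mini-batch variant concrete, here is a sketch of one training epoch on a toy least-squares problem; the data, the model $y \approx w x$, and the batch size $m = 4$ are illustrative assumptions:

```python
import random

# Sketch of one epoch of mini-batch GD on a toy 1-D least-squares
# problem: data pairs (x, y), model y ~ w * x (assumptions for
# illustration). Each step averages the gradient over m samples.
def minibatch_epoch(w, data, lr=0.01, m=4):
    random.shuffle(data)  # fresh sample order each epoch
    for i in range(0, len(data), m):
        batch = data[i:i + m]
        # average gradient of (w*x - y)^2 over the mini-batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w

data = [(float(x), 2.0 * x) for x in range(1, 9)]  # true slope w = 2
w = 0.0
for _ in range(200):
    w = minibatch_epoch(w, data, lr=0.01, m=4)
# w approaches the true slope 2
```

Averaging over the batch keeps the gradient estimate's scale independent of $m$, which is why the same learning rate works across batch sizes here.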

With Momentum

Momentum accumulates a velocity vector in the gradient direction, damping oscillations:

$$ v_{t+1} = \beta v_t + (1-\beta) \, \nabla_\theta \mathcal{L}(\theta_t) \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1} $$

Implementation

def sgd(params, grads, lr=0.01, momentum=0.9, velocity=None):
    """One SGD-with-momentum step; returns (new_param, new_velocity) pairs."""
    if velocity is None:
        velocity = [0.0] * len(params)  # start from zero velocity
    updated = []
    for p, g, v in zip(params, grads, velocity):
        # v_{t+1} = beta * v_t + (1 - beta) * grad  (EMA of gradients)
        v_new = momentum * v + (1 - momentum) * g
        # theta_{t+1} = theta_t - eta * v_{t+1}
        updated.append((p - lr * v_new, v_new))
    return updated
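A hedged usage sketch of `sgd`: the definition above is repeated here so the snippet runs on its own, and the quadratic loss $\mathcal{L}(p) = (p - 5)^2$ with its target value 5 is an illustrative assumption:

```python
# The sgd step from above, repeated so this snippet is self-contained.
def sgd(params, grads, lr=0.01, momentum=0.9, velocity=None):
    if velocity is None:
        velocity = [0.0] * len(params)
    updated = []
    for p, g, v in zip(params, grads, velocity):
        v_new = momentum * v + (1 - momentum) * g
        updated.append((p - lr * v_new, v_new))
    return updated

# Minimise L(p) = (p - 5)^2, whose gradient is 2 * (p - 5).
params, velocity = [0.0], None
for _ in range(500):
    grads = [2 * (p - 5.0) for p in params]
    pairs = sgd(params, grads, lr=0.1, momentum=0.9, velocity=velocity)
    params = [p for p, _ in pairs]    # unpack the new parameters...
    velocity = [v for _, v in pairs]  # ...and carry velocity forward
# params[0] approaches the minimiser 5
```

Note that the caller must thread `velocity` back into the next call; a stateful optimiser object would avoid this bookkeeping, at the cost of the pure-function style shown here.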