L2 Regularisation:
We add a penalty term to the loss function that discourages large weights.
New Loss: $E_{\text{total}} = E_{\text{training}} + \lambda \cdot E_{\text{weights}}$
L2 Penalty: $E_W = \frac{1}{2}\sum_i w_i^2$ (half the sum of the squared weights).
The optimiser now tries to minimise both the training error and the size of the weights, so the weights are nudged towards smaller (smoother) values at every update step.
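A minimal sketch of the idea, assuming a plain NumPy setup where `w`, `data_loss`, `data_grad`, and `lam` are hypothetical names for the weight vector, the training error, its gradient, and λ:

```python
import numpy as np

def l2_regularised_loss_and_grad(w, data_loss, data_grad, lam):
    """Add an L2 penalty to an existing loss and gradient.

    w         : weight vector (np.ndarray)
    data_loss : scalar training error E_training
    data_grad : gradient of E_training w.r.t. w
    lam       : regularisation strength lambda
    """
    penalty = 0.5 * lam * np.sum(w ** 2)   # lambda * E_weights
    total_loss = data_loss + penalty       # E_total = E_training + lambda * E_weights
    total_grad = data_grad + lam * w       # derivative of the penalty is lambda * w
    return total_loss, total_grad

# Hypothetical usage inside a gradient-descent step:
# loss, grad = l2_regularised_loss_and_grad(w, E_train, dE_train_dw, lam=1e-3)
# w -= learning_rate * grad   # the extra lambda*w term shrinks the weights a little each step
```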
L1 Regularisation:
Similar to L2, but the penalty is different.
L1 Penalty: $E_W = \sum_i |w_i|$ (the sum of the absolute values of the weights).
Key Effect: L1 encourages sparsity. It has a strong tendency to push many weights to be exactly zero. This effectively “turns off” unhelpful features, making it useful for feature selection.
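The same NumPy sketch as above, with the L1 penalty swapped in (function and variable names are again illustrative assumptions):

```python
import numpy as np

def l1_regularised_loss_and_grad(w, data_loss, data_grad, lam):
    """Add an L1 penalty (lambda * sum(|w|)) to an existing loss and gradient."""
    penalty = lam * np.sum(np.abs(w))
    total_loss = data_loss + penalty
    # The subgradient of |w| is sign(w): a constant-size push towards zero
    # regardless of how small a weight already is, which is what drives
    # many weights to exactly zero (sparsity).
    total_grad = data_grad + lam * np.sign(w)
    return total_loss, total_grad
```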
Dropout:
In Training: For each mini-batch, randomly “drop” a fraction of the nodes (set their outputs to zero). The forward and backward passes are run on this thinned network.
In Testing: Use the full network, scaling the activations by the keep probability to account for the extra contributions.
Intuition: It’s like training an ensemble of many thinned networks and averaging their opinions at test time. Because no node can rely on specific other nodes always being present, each weight has to pull its own weight and become a useful feature detector in its own right.
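A minimal sketch of the classic formulation described above (mask at training time, scale at test time), using NumPy; `dropout_forward` and its arguments are illustrative names, not a library API:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_prob, training):
    """Apply dropout to one layer's activations.

    Training: zero out each node independently with probability drop_prob,
              giving a thinned network for this mini-batch.
    Testing : keep every node but scale by the keep probability, so the
              expected input to the next layer matches what it saw in training.
    """
    keep_prob = 1.0 - drop_prob
    if training:
        mask = rng.random(activations.shape) < keep_prob
        return activations * mask
    return activations * keep_prob

# Hypothetical usage for a hidden layer h:
# h = np.maximum(0, x @ W1 + b1)
# h = dropout_forward(h, drop_prob=0.5, training=True)
```

Note that most modern libraries use the equivalent “inverted dropout” variant, which divides by the keep probability at training time so that no scaling is needed at test time.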
Data Augmentation:
Literally just make synthetic data by applying realistic transformations (flips, crops, small rotations, noise) to your existing examples; the label stays the same, so the model sees more variety for free.
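A small illustrative sketch, assuming images stored as H × W × C NumPy arrays (`augment_image` is a made-up helper, not a library function):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_image(img):
    """Return a randomly transformed copy of an image (H x W x C array).

    The transformations are chosen to be 'realistic' for natural images,
    so the original label still applies to the augmented copy.
    """
    out = img.copy()
    if rng.random() < 0.5:              # random horizontal flip
        out = out[:, ::-1, :]
    shift = rng.integers(-2, 3)         # small horizontal shift
    out = np.roll(out, shift, axis=1)   # wrap-around shift; a crude stand-in for pad-and-crop
    return out

# Hypothetical usage: grow the training set with transformed copies.
# augmented = np.stack([augment_image(x) for x in train_images])
```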