Early Stopping:

  • Monitor performance on a held-out validation set during training and stop once the validation error stops improving, keeping the weights from the best epoch so far (see the sketch below).
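
A minimal sketch of the idea, using plain NumPy and an illustrative linear-regression model; the data, learning rate, and patience value are all assumptions, not anything prescribed by these notes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(5)
best_w, best_val, patience, bad_epochs = w.copy(), np.inf, 10, 0
for epoch in range(1000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)        # gradient of the training MSE
    w -= 0.05 * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)             # monitor held-out error
    if val_loss < best_val:
        best_val, best_w, bad_epochs = val_loss, w.copy(), 0 # new best: remember these weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                           # no improvement for `patience` epochs
            break                                            # ... so stop training early
w = best_w                                                   # restore the best weights seen
```
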
L2 Regularisation:

  • We add a penalty term to the loss function that discourages large weights.
    • New Loss: $L_{\text{new}} = L_{\text{original}} + \lambda \sum_i w_i^2$
    • L2 Penalty: $\sum_i w_i^2$ (the sum of all squared weights), weighted by a regularisation strength $\lambda$.
  • The optimiser now tries to minimise both the training error and the size of the weights, forcing the model towards smaller, smoother weights at every single step (see the sketch below).
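
A minimal sketch of L2-regularised training on an illustrative NumPy linear-regression model; the regularisation strength lam, learning rate, and data are assumed values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

lam, w = 0.1, np.zeros(5)
for _ in range(500):
    data_grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the training error (MSE)
    penalty_grad = 2 * lam * w                   # gradient of the L2 penalty lam * sum(w**2)
    w -= 0.05 * (data_grad + penalty_grad)       # every step also shrinks the weights

new_loss = np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)   # training error + L2 penalty
print(np.round(w, 3), round(float(new_loss), 4))
```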

L1 Regularisation:

  • Similar to L2, but the penalty is different.
    • L1 Penalty: $\sum_i |w_i|$ (the sum of the absolute values of all weights), weighted by a regularisation strength $\lambda$.
    • Key Effect: L1 encourages sparsity. It has a strong tendency to push many weights to exactly zero, which effectively “turns off” unhelpful features and makes L1 useful for feature selection (see the sketch below).
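
A minimal sketch showing how an L1 penalty produces sparse weights. Here the penalty is handled with a proximal (soft-thresholding) step rather than plain gradient descent, which is one common way to get exact zeros; the data and lam are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, 3.0]) + 0.1 * rng.normal(size=100)   # only 2 features matter

lam, lr, w = 0.1, 0.01, np.zeros(5)
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y)                    # gradient of the training error
    w -= lr * grad
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # soft-thresholding step for lam * sum(|w|)

print(np.round(w, 3))   # weights on the unhelpful features tend to end up exactly zero
```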

Dropout:

  • In Training: For each mini-batch, randomly “drop” a fraction of the nodes (set their activations to zero); the forward and backward passes run on this thinned network.
  • In Testing: Use the full network, scaling the activations (e.g., by the keep probability) to account for the larger number of active contributions.
  • Intuition: It’s like training an ensemble of many thinned networks whose predictions are averaged at test time. Because no node can rely on a particular partner being present, each weight has to pull its own weight and become a useful feature detector (see the sketch below).
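
A minimal sketch of dropout on a single hidden layer; p_keep, the layer sizes, and the ReLU activation are illustrative choices. Test-time activations are scaled by p_keep, matching the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.8                                   # keep 80% of nodes, drop 20%
W, b = rng.normal(scale=0.1, size=(10, 20)), np.zeros(20)

def hidden_layer(x, train):
    h = np.maximum(0.0, x @ W + b)             # ReLU activations of one hidden layer
    if train:
        mask = rng.random(h.shape) < p_keep    # randomly "drop" a fraction of the nodes
        return h * mask                        # forward (and backward) pass uses this thinned layer
    return h * p_keep                          # test time: full network, activations scaled by p_keep

x = rng.normal(size=(4, 10))                   # a mini-batch of 4 inputs
h_train = hidden_layer(x, train=True)          # different nodes are dropped on each call
h_test = hidden_layer(x, train=False)          # all nodes contribute, scaled down
```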

Data Augmentation:

  • Literally just make synthetic data by applying realistic transformations to your existing data (e.g., flips, small shifts, or mild noise for images); see the sketch below.
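
A minimal sketch of image-style augmentation; the flip, shift, and noise transforms are just illustrative examples of “realistic transformations”:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random(size=(8, 32, 32))          # stand-in for a small greyscale image dataset

def augment(img):
    if rng.random() < 0.5:
        img = np.fliplr(img)                   # random horizontal flip
    shift = int(rng.integers(-2, 3))
    img = np.roll(img, shift, axis=1)          # small horizontal shift (circular, for simplicity)
    img = img + rng.normal(0.0, 0.02, size=img.shape)   # mild noise / brightness jitter
    return np.clip(img, 0.0, 1.0)

augmented = np.stack([augment(img) for img in images])  # synthetic copies of the existing data
```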