In ML, you’ll often face class imbalance. For our example, let’s consider images (classifying dogs vs. cars) and CNNs. You may choose to augment the minority class by artificially cropping, rotating, blurring, etc., so the classes become more balanced.
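Here’s a minimal sketch of such a pipeline using torchvision transforms (assuming a PyTorch setup; the specific transforms and parameters are illustrative, not prescriptive):

```python
from PIL import Image
import torchvision.transforms as T

# Random crop, rotation, and blur, mirroring the distortions above.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop + resize
    T.RandomRotation(degrees=15),                # small random rotation
    T.GaussianBlur(kernel_size=3),               # mild blur
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# Dummy stand-in for a minority-class image (a real loader would
# read e.g. a dog photo from disk instead).
img = Image.new("RGB", (256, 256))

# Each call samples fresh random parameters, so one source image
# yields many distinct training examples.
samples = [augment(img) for _ in range(5)]
print(samples[0].shape)  # torch.Size([3, 224, 224])
```

In practice you’d attach this as the `transform` of the minority-class dataset, so augmentation happens on the fly each epoch rather than by materialising extra image files.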
Why does the model generalise beyond the distortions, though? Two reasons:
- CNNs learn features hierarchically: early layers pick up lines, curves, and edges, and deeper layers compose them into parts like tails and wheels. Given enough images, distorted or not, the network still learns those basic edges.
- It discourages the model from overfitting, either to the individual category choice or to specific features of the minority class, since it rarely sees the exact same image twice. (Does it actually?)
Bias-Variance Tradeoff:
- A high-bias model underfits the data because it is too simple to capture the underlying pattern (e.g. a linear regression trying to fit a curve).
- Solution: Increase model complexity.
- A high-variance model is too sensitive to the specific training data and overfits: it learns the noise and fails to generalise.
- Solution: Get more training data, or constrain the model (e.g. regularisation). Both failure modes show up in the sketch below.
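To make the tradeoff concrete, here’s a quick sketch (using scikit-learn and NumPy, an assumption about tooling; the degrees and noise level are arbitrary) that fits polynomials of increasing degree to noisy samples of a sine curve:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)  # noisy training samples
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                      # the true underlying curve

X, X_test = x.reshape(-1, 1), x_test.reshape(-1, 1)

for degree in (1, 4, 15):  # underfit / reasonable / overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Degree 1 has high error on both sets (bias: a straight line can’t bend). Degree 15 pushes training error toward zero while test error rises (variance: it fits the noise). Degree 4 sits in between, which is the tradeoff in a nutshell.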