What:

The original transformer was trained with supervised learning on translation pairs. BERT introduced MLM (masked language modeling): you mask ~15% of the words in a sentence and train the model to predict them from the full context, i.e. from both the left and the right. This forces it to learn deep relationships between words.
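
To make that concrete, here's a minimal sketch of the masking step in plain Python. It's a toy: a whitespace "tokenizer" and a made-up `mask_tokens` helper, not BERT's real preprocessing (which works on WordPiece tokens and, of the selected 15%, only swaps 80% for [MASK], replaces 10% with random words, and leaves 10% unchanged):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Mask ~15% of the tokens (at least one); the originals become the targets."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]   # the model must predict this word...
        masked[i] = MASK_TOKEN   # ...seeing only [MASK] plus the rest of the sentence
    return masked, targets

tokens = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(tokens)
print(masked)   # e.g. ['the', 'cat', '[MASK]', 'on', 'the', 'mat', ...]
print(targets)  # e.g. {2: 'sat'} -- recoverable only by reading left AND right context
```

Notice the key trick: the "label" for each training example is just the word you hid, so the raw text labels itself.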

Why this is fucking amazing.

It may not seem like a big deal, but think about what changed. Before, you had a supervised learning problem: you needed humans to provide paired input and output text (e.g. translations). Now you can train on raw text from basically the entire internet with no labeling at all, because the "labels" are just the masked words themselves. Same kind of model, vastly more text. Genuinely genius.
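
You can see the payoff with a pretrained BERT. A quick illustration using Hugging Face's fill-mask pipeline (assumes `pip install transformers torch`; downloads `bert-base-uncased` on first run). The model fills the blank using context from both sides, and it learned to do this from raw text alone:

```python
from transformers import pipeline

# Pretrained BERT: it learned to fill blanks purely from unlabeled text.
fill = pipeline("fill-mask", model="bert-base-uncased")

# Prints the top candidate words for the blank with their scores.
for pred in fill("The doctor wrote a [MASK] for the patient's infection."):
    print(f"{pred['token_str']:>15}  {pred['score']:.3f}")
```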