What:

It’s a kind of Transformer introduced by Google in 2018. It uses exclusively the Encoder stack of the original Transformer. The architecture is largely the same; it differs mainly in how it’s trained and applied, most notably its use of Masked Language Modelling (MLM).

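A minimal sketch of what MLM looks like in practice, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (both are illustrative choices, not part of the original note); the model fills in the token hidden behind `[MASK]`:

```python
# Sketch: masked language modelling with a pretrained BERT.
# Assumes `torch` and `transformers` are installed and the checkpoint can be downloaded.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. "paris"
```

Because the encoder attends over the whole sentence, the prediction for the masked position can use context from both sides of it.
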
Implementation Differences:

| Feature | Original Transformer | BERT |
| --- | --- | --- |
| Context | Left-to-right | Bidirectional (looks left and right of the word) |
| Training method | Task-specific, e.g. for translation we get pairs of text | The base model is pretrained using Masked Language Modelling (MLM) and NSP¹ |
| Flexibility | A single task requires full training, as it’s designed for Seq2Seq | Widely fine-tunable, so highly flexible (see the sketch below) |
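
To illustrate the flexibility row, here is a hedged sketch of fine-tuning the same pretrained encoder for a downstream task by just attaching a classification head; the model name, `num_labels=2`, the toy batch, and the learning rate are all illustrative assumptions:

```python
# Sketch: reusing the pretrained encoder for sentence classification.
# Assumes `torch` and `transformers`; the two-example "dataset" is a placeholder.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One gradient step: the pretrained encoder plus a small task head are updated;
# no task-specific architecture or training-from-scratch is needed.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```

Swapping `BertForSequenceClassification` for another head (token classification, question answering, etc.) is the usual way a single pretrained BERT serves many tasks.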

Footnotes

  1. Next Sentence Prediction (NSP): BERT is also trained to predict, given two sentences, whether the second one actually follows the first in the original text. Ironically, later work (e.g. RoBERTa) found that dropping NSP matches or slightly improves downstream performance, so modern BERT-style models usually skip it.
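
     For context, a hedged sketch of how NSP training pairs are typically built from a corpus (50% true next sentence, 50% random sentence); the helper name and structure are illustrative, not from the BERT code:

     ```python
     import random

     def make_nsp_example(sentences, i):
         """Illustrative helper: build one NSP pair from a list of consecutive sentences.

         Assumes i + 1 is a valid index. Label 1 = "IsNext", label 0 = "NotNext".
         """
         first = sentences[i]
         if random.random() < 0.5:
             return first, sentences[i + 1], 1          # true next sentence
         return first, random.choice(sentences), 0      # random (unrelated) sentence
     ```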