What:
It’s an encoder-only Transformer introduced by Google in 2018. Architecturally it’s largely the same as the original Transformer encoder; the differences are in how it’s trained and applied, most notably its use of Masked Language Modelling (MLM).
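As a rough sketch of what MLM looks like in practice (assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, neither of which is specified above), BERT fills in a masked token using context from both sides:

```python
# Minimal MLM sketch — assumes `transformers` (and a backend such as PyTorch) is installed.
from transformers import pipeline

# Load a pretrained BERT checkpoint for the fill-mask task.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees context on BOTH sides of [MASK] and predicts the missing token.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```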
Implementation Differences:
| Feature | Original Transformer | BERT |
| --- | --- | --- |
| Context | Left-to-right (causal attention in the decoder). | Bidirectional (attends to context on both the left and the right of each word). |
| Training Method | Task-specific, e.g. trained on source/target text pairs for translation. | The base model is pretrained with Masked Language Modelling (MLM) and NSP¹, then fine-tuned. |
| Flexibility | Each task requires full training, as it’s designed for Seq2Seq. | Highly flexible: one pretrained model can be fine-tuned for many tasks (see the sketch after this table). |
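The flexibility row is the key practical difference: the same pretrained encoder can be reused for many downstream tasks by attaching a small task head. A minimal fine-tuning-setup sketch, again assuming Hugging Face `transformers` plus PyTorch and an illustrative binary-classification task (none of which are part of the notes above):

```python
# Fine-tuning sketch: reuse the pretrained encoder, attach a task-specific head.
# Assumes `transformers` and `torch` are installed; the label count is illustrative.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # pretrained encoder weights
    num_labels=2,         # e.g. positive / negative sentiment
)

# The classification head is randomly initialised, so these logits are
# meaningless until the model is fine-tuned on labelled data.
inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```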
Footnotes
1. Next Sentence Prediction (NSP): BERT is also pretrained to predict whether, given two sentences, the second one actually follows the first in the original text. Ironically, later work (e.g. RoBERTa’s ablations) found that NSP adds little benefit, and it is commonly dropped.