What:

It’s a kind of Transformer introduced by Google in 2018. It uses exclusively the Encoder stack of the original Transformer. The architecture is largely the same; it differs mainly in how it’s trained and applied, most notably its use of Masked Language Modelling (MLM).

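A minimal sketch of what MLM looks like in practice, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (both are illustrative choices, not part of the original note); the model fills in the token hidden behind `[MASK]`:

```python
# Sketch: masked language modelling with a pretrained BERT.
# Assumes `torch` and `transformers` are installed and the checkpoint can be downloaded.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. "paris"
```

Because the encoder attends over the whole sentence, the prediction for the masked position can use context from both sides of it.
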
Implementation Differences:

| Feature | Original Transformer | BERT |
| --- | --- | --- |
| Context | Left-to-right | Bidirectional (looks left and right of the word) |
| Training method | Task-specific, e.g. for translation we get pairs of text | The base model is pretrained using Masked Language Modelling (MLM) and NSP¹ |
| Flexibility | A single task requires full training, as it’s designed for Seq2Seq | Widely fine-tunable, so highly flexible (see the sketch below) |
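
To illustrate the flexibility row, here is a hedged sketch of fine-tuning the same pretrained encoder for a downstream task by just attaching a classification head; the model name, `num_labels=2`, the toy batch, and the learning rate are all illustrative assumptions:

```python
# Sketch: reusing the pretrained encoder for sentence classification.
# Assumes `torch` and `transformers`; the two-example "dataset" is a placeholder.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One gradient step: the pretrained encoder plus a small task head are updated;
# no task-specific architecture or training-from-scratch is needed.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```

Swapping `BertForSequenceClassification` for another head (token classification, question answering, etc.) is the usual way a single pretrained BERT serves many tasks.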

Footnotes

  1. Next Sentence Prediction (NSP): BERT is also trained to predict, given two sentences, whether the second one actually follows the first in the original text. Ironically, later work (e.g. RoBERTa) found that dropping NSP matches or slightly improves downstream performance, so modern BERT-style models usually skip it.
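
     For context, a hedged sketch of how NSP training pairs are typically built from a corpus (50% true next sentence, 50% random sentence); the helper name and structure are illustrative, not from the BERT code:

     ```python
     import random

     def make_nsp_example(sentences, i):
         """Illustrative helper: build one NSP pair from a list of consecutive sentences.

         Assumes i + 1 is a valid index. Label 1 = "IsNext", label 0 = "NotNext".
         """
         first = sentences[i]
         if random.random() < 0.5:
             return first, sentences[i + 1], 1          # true next sentence
         return first, random.choice(sentences), 0      # random (unrelated) sentence
     ```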