Interestingly, the relative positions of words matter more than their absolute positions.

What:

Transformers are order-agnostic. On their own they can’t tell the difference between “The cat sat on the mat” and “The mat sat on the cat”. This is obviously bad. Positional encodings (PEs) are ways of adding that location information back into the token embeddings. Below is a quick demo of the problem, and then the two main approaches.
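A minimal NumPy sketch (a toy example of my own, not from any library): plain self-attention with no positional information is permutation-equivariant, so shuffling the input tokens just shuffles the output rows, and the two sentences above look identical to the model.

```python
import numpy as np

def self_attention(X):
    # Bare scaled dot-product self-attention: no learned projections, no positions.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))      # 6 tokens, embedding size 16
perm = rng.permutation(6)         # reorder the sentence

out = self_attention(X)
out_perm = self_attention(X[perm])

# Shuffling the inputs only shuffles the outputs: word order is invisible.
print(np.allclose(out[perm], out_perm))   # True
```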

Learnable Positional Encodings:

Similar to how we have a (learned) lookup table for word embeddings, we could also define one for positions. It learns the best way to encode each position during training for that task (a minimal sketch follows the list below).

Advantages / Drawbacks
  • It fails to generalise to sequences longer than those seen in training, since unseen positions simply have no learned entry.
  • Good for fixed, known sequence lengths.
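A minimal PyTorch sketch of the idea; the module name and the max_len / d_model values are illustrative assumptions, not a specific library’s API:

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    # Positions get their own embedding table, just like tokens do,
    # and the position vector is simply added to the token embedding.
    def __init__(self, max_len=512, d_model=16):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)  # broadcasts over batch

pe = LearnedPositionalEncoding(max_len=512, d_model=16)
x = torch.randn(2, 10, 16)    # batch of 2, 10 tokens, embedding size 16
print(pe(x).shape)            # torch.Size([2, 10, 16])
```

Anything longer than max_len has no row in the table, which is exactly the generalisation problem noted above.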

Sinusoidal Encodings:

We’re basically encoding the positions using the formula below. It needs no training and works for sequences of any length:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:

  • pos is the position in the sequence (0, 1, 2, …).
  • i is the dimension index (each i gives one sin/cos pair).
  • d_model is the total embedding size (e.g., 16 in our example).
  • 10000 is a scaling factor that controls how quickly the frequencies change.
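A minimal NumPy sketch of the table this formula produces; the function name and defaults are my own, using d_model = 16 and the 10000 base from above:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model=16, base=10000):
    # Even dimensions get sin, odd dimensions get cos; each dimension
    # pair i oscillates at its own frequency 1 / base**(2i / d_model).
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2) dimension pairs
    angles = pos / base ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(seq_len=6, d_model=16)
print(pe.shape)    # (6, 16)
print(pe[0])       # position 0: all sin terms are 0, all cos terms are 1
```

Because each dimension pair runs at a different frequency, every position gets a distinct fingerprint, and the same function works for any sequence length without a table to outgrow.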