Interestingly, the relative positions of words matter more than their absolute positions.

What:

Transformers are order-agnostic. On their own they can’t tell the difference between “The cat sat on the mat” and “The mat sat on the cat”. This is obviously bad. Positional encodings (PEs) are ways of adding that location information back into the token embeddings. Below is a quick demo of the problem, and then the two main approaches.
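A minimal NumPy sketch (a toy example of my own, not from any library): plain self-attention with no positional information is permutation-equivariant, so shuffling the input tokens just shuffles the output rows, and the two sentences above look identical to the model.

```python
import numpy as np

def self_attention(X):
    # Bare scaled dot-product self-attention: no learned projections, no positions.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))      # 6 tokens, embedding size 16
perm = rng.permutation(6)         # reorder the sentence

out = self_attention(X)
out_perm = self_attention(X[perm])

# Shuffling the inputs only shuffles the outputs: word order is invisible.
print(np.allclose(out[perm], out_perm))   # True
```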

Learnable Positional Encodings:

Similar to how we have a (learned) lookup table for word embeddings, we could also define one for positions. It learns the best way to encode each position during training for that task (a minimal sketch follows the list below).

Advantages / Drawbacks
  • It fails to generalise to sequences longer than those seen in training, since unseen positions simply have no learned entry.
  • Good for fixed, known sequence lengths.
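A minimal PyTorch sketch of the idea; the module name and the max_len / d_model values are illustrative assumptions, not a specific library’s API:

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    # Positions get their own embedding table, just like tokens do,
    # and the position vector is simply added to the token embedding.
    def __init__(self, max_len=512, d_model=16):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)  # broadcasts over batch

pe = LearnedPositionalEncoding(max_len=512, d_model=16)
x = torch.randn(2, 10, 16)    # batch of 2, 10 tokens, embedding size 16
print(pe(x).shape)            # torch.Size([2, 10, 16])
```

Anything longer than max_len has no row in the table, which is exactly the generalisation problem noted above.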

Sinusoidal Encodings:

We’re basically encoding the positions using the formula below. It needs no training and works for sequences of any length:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:

  • pos is the position in the sequence (0, 1, 2, …).
  • i is the dimension index (each i gives one sin/cos pair).
  • d_model is the total embedding size (e.g., 16 in our example).
  • 10000 is a scaling factor that controls how quickly the frequencies change.
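A minimal NumPy sketch of the table this formula produces; the function name and defaults are my own, using d_model = 16 and the 10000 base from above:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model=16, base=10000):
    # Even dimensions get sin, odd dimensions get cos; each dimension
    # pair i oscillates at its own frequency 1 / base**(2i / d_model).
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2) dimension pairs
    angles = pos / base ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(seq_len=6, d_model=16)
print(pe.shape)    # (6, 16)
print(pe[0])       # position 0: all sin terms are 0, all cos terms are 1
```

Because each dimension pair runs at a different frequency, every position gets a distinct fingerprint, and the same function works for any sequence length without a table to outgrow.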