What:

The best sequence of tokens under a language model is not necessarily found by greedily picking the best token at each step. Beam search lets us look for it by keeping several candidate sequences alive at once.

  • Here, we’ve got a beam width of 2 (we keep the 2 best partial sequences at each step).
  • The wider the beam, the more branches we search, and the more computationally expensive it gets.
  • The narrower the beam, the worse the search: there’s no guarantee of getting close to the global optimum.
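A minimal sketch of the idea above: at every step we expand each surviving sequence with every next token, score by cumulative log-probability, and keep only the top `beam_width` candidates. The `toy_model` distribution here is a made-up example (not from the notes) chosen so that greedy decoding picks the wrong first token.

```python
import math

def beam_search(step_probs, vocab, beam_width=2, length=2):
    # beams: list of (token_sequence, cumulative log-probability)
    beams = [([], 0.0)]
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            probs = step_probs(seq)  # next-token distribution given this prefix
            for tok, p in zip(vocab, probs):
                candidates.append((seq + [tok], score + math.log(p)))
        # keep only the top `beam_width` partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    # Hypothetical distribution: greedy picks "A" first (0.6 > 0.4),
    # but the best full sequence starts with "B" (0.4 * 0.9 = 0.36 > 0.6 * 0.5 = 0.30).
    if not prefix:
        return [0.6, 0.4]          # P(A), P(B)
    if prefix[-1] == "B":
        return [0.1, 0.9]
    return [0.5, 0.5]

best = beam_search(toy_model, ["A", "B"], beam_width=2, length=2)
print(best[0][0])  # → ['B', 'B']
```

With beam width 2, both first tokens survive the first step, so the search recovers the higher-probability sequence "B B" that greedy decoding would have missed.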

We tend to use this (as opposed to sampling) for tasks with a fairly well-defined correct output, e.g. language translation; sampling suits more open-ended generation.