A Machine Learning paradigm.

In Reinforcement Learning (RL), an agent learns by interacting continuously with an environment. Imagine we’re training an RL bot to play Flappy Bird. At each step, the agent goes through the following cycle:

  • Agent: Our bird! In RL, it’s normally the AI we’re training.
  • Observation (State): This is everything that the agent sees. How to set up the observation is not necessarily obvious. For example:
    • We could just feed in the raw pixels of the game.
    • We could feed in the relative locations of the pipes.
    • Or we could feed in a Lidar scan of everything the bird “sees”.
    • Our agent will be able to see just the relative locations of the upcoming pipes.
  • Action (Space): The action space is the set of all possible actions the agent can take. For our bird, it’s quite simple: flap or don’t flap. But what about an agent learning to play Go? Its action space would be MUCH larger.
  • Reward: After performing an action, the agent receives a numerical reward. This reward signals whether the action was good or bad, guiding the agent to improve its behavior over time. The rewards for our bird are (see the sketch after this list):
    • +0.1 - every frame it stays alive
    • +1.0 - successfully passing a pipe
    • -1.0 - dying
    • -0.5 - touching the top of the screen
  • Next Observation: The action gets fed back into the environment, the environment changes, we make a new observation, and the cycle repeats.
  • Policy: These are the rules the bird learns for how to play successfully.
  • Goal: The goal is to update the policy to get as much reward as possible.
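
To make that cycle concrete, here’s a minimal sketch of one episode of the loop using the Gymnasium API. The "FlappyBird-v0" environment id is an assumption (a stand-in, not a specific library), and the agent just acts randomly for now:

```python
import gymnasium as gym

# Hypothetical Flappy Bird environment; the env id is an assumption.
env = gym.make("FlappyBird-v0")
obs, info = env.reset()   # first observation: relative locations of the upcoming pipes

done = False
total_reward = 0.0
while not done:
    # Action space: 0 = don't flap, 1 = flap. Random policy for now.
    action = env.action_space.sample()
    # The environment applies the action and returns the next observation,
    # the reward (+0.1 alive, +1.0 pipe passed, -1.0 died, -0.5 hit the ceiling),
    # and whether the episode is over.
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("Episode reward:", total_reward)
```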

Exploitation Vs. Exploration:

When training a model, there are dials and knobs we can turn to influence the learning process.

Weirdly contrived scenario 👶:

Imagine a weirdly dexterous baby was plopped into a room with a table and some Jenga blocks. It’s literally never seen Jenga blocks before.

We give the baby sugar when it stacks blocks higher than it has previously. (I.e., the reward function is the height of the tower.)

First, it kinda has to learn how to stack blocks. It’s gonna try random actions, drop the blocks, and see what happens. But it quickly discovers that if it stacks them sideways, it gets more sugar, fast. In its exploration phase, it learned to stack sideways.

It’s able to stack them pretty quickly, pretty high. It learns, “Hey. This is a tried and tested method. Stack them on their side, get sugar”. This is the “exploitation phase”.

But after around 3 sideways blocks, the tower becomes too unstable and just falls. And because it refuses to try any other method, it’s stuck on the 3-block tower.

If it explored more, it likely would have learned to stack them flat, or in a pyramid. But maybe that would’ve taken too long to learn.

How that contrived example relates to RL:

Our agent starts by trying random actions (a random policy). It’s exploring. The percentage of the time it does something random is called the exploration rate. We start with a high exploration rate, and then decrease it as it learns more. There’s also a minimum exploration rate, so it never stops exploring completely.
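
Here’s a rough sketch of what that decaying exploration rate could look like in code (epsilon-greedy style; all the specific numbers are assumptions, not values from the bird bot or any particular library):

```python
import random

exploration_rate = 1.0        # start out acting randomly 100% of the time
min_exploration_rate = 0.05   # never stop exploring completely
decay = 0.995                 # shrink the rate a little after every episode

def choose_action(best_known_action, actions=(0, 1)):
    """Explore with probability exploration_rate, otherwise exploit."""
    if random.random() < exploration_rate:
        return random.choice(actions)   # explore: try something random
    return best_known_action            # exploit: do what's worked so far

for episode in range(1000):
    # ... play one episode, picking actions with choose_action(...) ...
    exploration_rate = max(min_exploration_rate, exploration_rate * decay)
```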