What:

Imagine taking RLHF, but this time the reward model is based on an off-the-shelf LLM that follows a set of rules (a constitution) instead of on human preference labels.
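
To make the idea concrete, here is a minimal sketch of how such a setup might produce preference labels. Everything in it is an assumption for illustration: the two rules in `CONSTITUTION`, the `toy_judge` function (a keyword heuristic standing in for a real LLM call), and the majority vote in `label_preference` are hypothetical, not a description of any particular implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical constitution: plain-language rules the judge is asked to apply.
CONSTITUTION: List[str] = [
    "Choose the response that is less harmful.",
    "Choose the response that is more polite.",
]

# A judge takes (rule, prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str, str], str]


def toy_judge(rule: str, prompt: str, response_a: str, response_b: str) -> str:
    """Placeholder for an off-the-shelf LLM call (assumption, not a real API).

    It ignores the rule text and simply penalizes one flagged word, so the
    example runs without any model. A real judge would be prompted with the rule.
    """
    a_bad = "stupid" in response_a.lower()
    b_bad = "stupid" in response_b.lower()
    if a_bad and not b_bad:
        return "B"
    return "A"


def label_preference(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Judge = toy_judge,
) -> Tuple[str, str]:
    """Return (chosen, rejected) by majority vote over the constitution's rules."""
    votes_for_a = sum(
        judge(rule, prompt, response_a, response_b) == "A" for rule in CONSTITUTION
    )
    # Ties go to response A in this sketch.
    if votes_for_a * 2 >= len(CONSTITUTION):
        return response_a, response_b
    return response_b, response_a


chosen, rejected = label_preference(
    "How do I fix this bug?",
    "That is a stupid question.",
    "Check the stack trace first.",
)
print("chosen:", chosen)
print("rejected:", rejected)
```

The resulting (chosen, rejected) pairs play the role that human comparisons play in RLHF: they are what a reward model would be trained on.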