Reinforcement Learning (part 3): RLHF and PPO

This session consists of two parts and aims to clarify the connection between reinforcement learning and language model fine-tuning, especially as it has been employed in recent systems like ChatGPT.

In particular, the first part of the session introduces the Reinforcement Learning from Human Feedback (RLHF) approach as it was used for training ChatGPT and InstructGPT. Core components of the approach are discussed, with a particular focus on the reward modeling aspect. The second part provides an overview of developments in the domain of policy-gradient methods (e.g., advantage estimation, gradient variance reduction techniques, etc.). The session focuses on Proximal Policy Optimization (PPO), one of the most popular deep RL algorithms in the current literature.
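To make these two components more concrete, below is a minimal sketch (in PyTorch, not taken from the session materials) of the pairwise loss typically used to fit a reward model on human preference comparisons, and of the PPO clipped surrogate objective. Function names, tensor shapes, and the toy data are illustrative assumptions.

```python
# Minimal sketches of two RLHF building blocks (illustrative only).
import torch
import torch.nn.functional as F


def reward_model_pairwise_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry style) reward-model loss: the response
    preferred by the human annotator should receive a higher scalar
    reward than the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (Schulman et al., 2017).

    logp_new:   log-probabilities under the current policy
    logp_old:   log-probabilities under the policy that collected the data
    advantages: advantage estimates (e.g., from GAE)
    """
    ratio = torch.exp(logp_new - logp_old)  # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic minimum of the two surrogates, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()


# Toy usage with random tensors
r_chosen, r_rejected = torch.randn(4), torch.randn(4)
print(reward_model_pairwise_loss(r_chosen, r_rejected))

logp_old = torch.randn(8)
logp_new = logp_old + 0.1 * torch.randn(8)
adv = torch.randn(8)
print(ppo_clip_loss(logp_new, logp_old, adv))
```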

The goal of the session is to connect algorithms from RL more closely with the approaches used to fine-tune LLMs, and to outline core concepts behind SOTA policy-gradient algorithms.

Slides for the session can be found here.

Further materials (optional)

Below are further materials on RL and the algorithms discussed in the session, as well as materials introducing and reviewing reward modeling from human feedback.