Session 3#
This session will provide an introduction to Reinforcement Learning, the field of study and set of methods for computationally formalizing the learning of goal-directed behavior.
Slides from the third lecture can be found here.
The discussion of the day will be on the paper by Anthony et al. (2017). Thinking Fast and Slow with Deep Learning and Tree Search.
For those who are curious, a general textbook on Reinforcement Learning can be found here.
Exercises#
Exercise 3.1.: Reinforcement Learning & Agency
What types of information would be difficult / inefficient to learn from trial-and-error learning?
Learning from experts constitutes an important aspect of human learning. What are some situations where humans learn from experts (implicitly / explicitly)? What makes learning from experts efficient?
The notion of world / environment models is prominent in, e.g., Bayesian approaches to cognitive science. We have seen an example of a Bayesian model in the context of category prediction, which built on an intuitive model of how categories are constructed. This model was used to infer the likely category of an observation via Bayesian inference. What, if any, are the differences between this “model of the world” and how it is used to perform a task, and how the model of the world (or, environment) is used in model-based RL (e.g., ExIt)?
Try to come up with a reward formalizing the goal of successfully navigating from the classroom to, e.g., your home. Which aspects are more difficult to formalize, which are easier? How does this relate to considerations of AI safety?
What is the difference between reward misspecification and reward hacking? Provide a conceivable example of reward hacking in the context of LMs.
What steps are involved in a typical RLHF pipeline?
Click below to see possible solutions.
What types of information would be difficult / inefficient to learn from trial-and-error learning?
points mentioned in the discussion: rule-based knowledge, information in settings with a very large action space and sparse rewards, common sense, open-ended tasks like Minecraft (tasks without a clear ‘done’ state), self-driving cars if only a real physical car is available (i.e., settings with a very high cost of error)
Learning from experts constitutes an important aspect of human learning. What are some situations where humans learn from experts (implicitly / explicitly)? What makes learning from experts efficient?
efficiency: e.g., learning things that would take too long or be too costly for an individual agent to discover on its own (e.g., mathematical theory); experts can guide the learner through the most efficient paths
examples: explicit: school, second-language learning; implicit: kids learning from parents or peers, native language acquisition
The notion of world / environment models is prominent in, e.g., Bayesian approaches to cognitive science. We have seen an example of a Bayesian model in the context of category prediction, which built on an intuitive model of how categories are constructed. This model was used to infer the likely category of an observation via Bayesian inference. What, if any, are the differences between this “model of the world” and how it is used to perform a task, and how the model of the world (or, environment) is used in model-based RL (e.g., ExIt)?
Bayesian model: the generative model of (the relevant aspect of) the world is used to perform Bayesian inference, i.e., inverted
RL world model: the model is used for forward simulation
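To make the contrast concrete, here is a minimal sketch (the two-category setup, probabilities, and reward values are invented purely for illustration): the same kind of generative knowledge is inverted via Bayes’ rule to infer a latent category, whereas a world model in model-based RL is rolled forward to simulate what would happen under different actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model: two categories, each with a likelihood over two observable features.
prior = np.array([0.5, 0.5])            # p(category)
likelihood = np.array([[0.8, 0.2],      # p(feature | category = 0)
                       [0.3, 0.7]])     # p(feature | category = 1)

# Bayesian use of the model: INVERT it to infer the latent category of an observation.
observed_feature = 0
posterior = prior * likelihood[:, observed_feature]
posterior /= posterior.sum()            # p(category | feature) via Bayes' rule
print("posterior over categories:", posterior)

# Model-based RL use of a world model: roll it FORWARD to simulate outcomes of candidate
# actions before committing to one (the core of planning / tree-search methods like ExIt).
transition_model = np.array([[0.9, 0.1],   # p(next state | action = 0)
                             [0.2, 0.8]])  # p(next state | action = 1)
reward_per_state = np.array([0.0, 1.0])
for action in (0, 1):
    imagined_states = rng.choice(2, size=1000, p=transition_model[action])
    print(f"action {action}: simulated mean reward = {reward_per_state[imagined_states].mean():.2f}")
```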
Try to come up with a reward formalizing the goal of successfully navigating from the classroom to, e.g., your home. Which aspects are more difficult to formalize, which are easier? How does this relate to considerations of AI safety?
examples: easy: rewarding the quick minimization of distance towards the goal (i.e., navigating in the right direction); difficult: correctly generalizing to unforeseen situations
these considerations relate to the difficulty of correctly specifying rewards for agents that are supposed to be deployed in the real world, thereby highlighting the difficulty of building AI systems that are safe across the board
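As a toy sketch of the easy part (the goal coordinates and arrival bonus below are arbitrary assumptions made for illustration), one could reward per-step progress towards the goal; note that such a distance-based term says nothing about the hard part, i.e., behaving sensibly in unforeseen situations.

```python
import math

HOME = (52.52, 13.40)  # hypothetical goal coordinates (e.g., latitude / longitude)

def distance_to_home(position):
    """Straight-line distance to the goal; a real system would use travel distance."""
    return math.dist(position, HOME)

def step_reward(previous_position, new_position, arrived: bool) -> float:
    """Reward progress towards the goal plus a bonus for arriving.

    This captures the easy-to-formalize part (move in the right direction), but not
    safety constraints, detours, or unforeseen situations.
    """
    progress = distance_to_home(previous_position) - distance_to_home(new_position)
    return progress + (10.0 if arrived else 0.0)
```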
What is the difference between reward misspecification and reward hacking? Provide a conceivable example of reward hacking in the context of LMs.
Reward misspecification occurs when the specified reward function does not accurately capture the true objective or desired behavior. Reward hacking refers to the behavior of RL agents exploiting gaps or loopholes in the specified reward function to achieve high rewards without actually fulfilling the intended objectives. As was mentioned in the discussion, one can view reward hacking as a possible (specific) result of reward misspecification. A useful (optional) resource for learning more is, e.g., this blogpost.
What steps are involved in a typical RLHF pipeline?
supervised fine-tuning -> reward model training -> RL-based fine-tuning of the supervised fine-tuned LLM with the help of the reward model
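Schematically, and with every function below being a hypothetical placeholder rather than a real library API, the three stages fit together roughly as follows:

```python
# Schematic sketch of an RLHF pipeline; all functions are placeholder stubs whose
# names and signatures are illustrative, not an actual training library.

def supervised_fine_tune(pretrained_lm, demonstrations):
    # Stage 1: fine-tune the pretrained LM on human-written demonstrations (SFT).
    return pretrained_lm  # placeholder: a real implementation returns the SFT model

def train_reward_model(sft_lm, preference_pairs):
    # Stage 2: train a reward model on human preference comparisons
    # (pairs of completions ranked by annotators).
    return lambda prompt, completion: 0.0  # placeholder reward model

def rl_fine_tune(sft_lm, reward_model, prompts):
    # Stage 3: optimize the SFT model with RL (e.g., PPO), scoring sampled completions
    # with the reward model, typically with a KL penalty towards the SFT model.
    return sft_lm  # placeholder: a real implementation returns the RLHF-tuned model

def rlhf_pipeline(pretrained_lm, demonstrations, preference_pairs, prompts):
    sft_lm = supervised_fine_tune(pretrained_lm, demonstrations)
    reward_model = train_reward_model(sft_lm, preference_pairs)
    return rl_fine_tune(sft_lm, reward_model, prompts)
```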
In practical terms, the Gymnasium environment / package is one of the most practical tools for training RL models. Using the following tutorial, you will learn the basics of the package with a basic RL task, the multi-armed bandit task. In this task, the agent has to learn the rewards associated with a finite set of choice options (or, arms) through making repeated choices. As discussed in the lecture, you can think of repeatedly exploring restaurant options in a new town as an example of this task. Please read the following primer in order to solve the exercise below.
Exercise 3.2.: Multi-armed bandit tasks in the Gym
Please head to this webbook and navigate to the exercise Google Colab (which can be found right at the top of the page). Please work through the first section (“Multi-Armed Bandit”) of the exercise sheet, specifically, implementing the `RandomAgent` and the `RewardAveraging` agent. The conceptual idea of reward averaging is described in detail here, Sections 2.1–2.7, and in a brief format as relevant for the task here.
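As a preview of the core idea, here is a minimal, self-contained sketch of a reward-averaging agent on a Gaussian bandit. This is not the Colab's scaffolding: the class name follows the exercise, but the interface and the stand-in environment are assumptions made for illustration, and a plain NumPy loop is used instead of a Gymnasium environment. The key ingredient is the incremental mean update Q_{n+1} = Q_n + (1/n) (R_n - Q_n), which avoids storing the full reward history.

```python
import numpy as np

class RewardAveraging:
    """Epsilon-greedy agent that estimates each arm's value by a running average.

    Uses the incremental update Q_{n+1} = Q_n + (1/n) * (R_n - Q_n), so no
    reward history needs to be stored. Interface is illustrative only.
    """

    def __init__(self, num_arms: int, epsilon: float = 0.1, seed: int = 0):
        self.q_estimates = np.zeros(num_arms)   # current value estimate per arm
        self.pull_counts = np.zeros(num_arms)   # how often each arm was chosen
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)

    def get_action(self) -> int:
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.q_estimates)))
        return int(np.argmax(self.q_estimates))

    def observe(self, action: int, reward: float) -> None:
        # Incremental mean: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n).
        self.pull_counts[action] += 1
        step_size = 1.0 / self.pull_counts[action]
        self.q_estimates[action] += step_size * (reward - self.q_estimates[action])


# Stand-in environment: five arms with unknown Gaussian reward means.
rng = np.random.default_rng(42)
true_means = rng.normal(size=5)
agent = RewardAveraging(num_arms=5)

for _ in range(1000):
    action = agent.get_action()
    reward = rng.normal(loc=true_means[action])
    agent.observe(action, reward)

print("true means:     ", np.round(true_means, 2))
print("estimated means:", np.round(agent.q_estimates, 2))
```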