Homework 2#
The learning goal of this second hands-on sheet is to gain practical experience with the concepts we discussed throughout the recent sessions in class. In particular, the exercises will focus on:
the human feedback which goes into RLHF, with practical questions about a public RLHF dataset which is commonly used for fine-tuning LLMs
actually trying to fine-tune a small language model (GPT-2) with reinforcement learning
evaluating the fine-tuned model on common benchmark tasks.
Again, the homework is intended to showcase important practical aspects, deepen your conceptual understanding of the topics we discuss in class, and provide practical tools and exercises for your own future work. It is not meant to discourage you. Therefore, even if you don’t have a lot of ML / programming / technical background, you are warmly encouraged to take on the tasks, ask questions and discuss any concerns you have (with fellow students or me). There are also some hints and links to resources throughout the tasks which may help you find the information needed to solve them.
Some of the linked resources include, e.g., libraries or links to functions from libraries which may already implement some of the tasks included in this homework. However, the provided starter code intentionally spells out many steps “by hand” rather than using such convenience functions. This is meant to help you become familiar with critical computation steps which might otherwise be hidden behind library calls.
Homework logistics#
You will have two weeks to complete the assignment (until Saturday, December 23rd, 6pm German time).
Please do and submit your homework by yourself!
However, you are warmly encouraged to ask questions and help each other, without posting full solutions, via active discussions in the dedicated Forum space on Moodle (“Homework 2”). The most active participants of the Forum discussions will earn some extra points towards their grade!
Please submit your solutions via Moodle. You will find a quiz called “Homework 2” with questions and answer fields corresponding to the respective exercise numbers listed below.
If you have questions or difficulties with the homework, please try to solve them with the help of your fellow students via the Forum. However, don’t hesitate to reach out to me via email if you have any questions, struggle, or feel overwhelmed.
Preliminaries#
The exercises below will require you to execute Python code. You can do so either on your own machine, or by using Google Colab (free, only requires a Google account). You can easily do the latter by pressing the Colab icon at the top of the webbook page. You are encouraged to use the Colab option to avoid complications with local package installations etc. To speed up the execution of the code on Colab (especially Exercises 2 and 3), you can use the available GPU. For that, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.
However, if you do want to run the code locally on your machine, I strongly encourage you to create an environment (e.g., with Conda) before you install any dependencies, and please keep in mind that pretrained language model weights might take up quite a bit of space on your hard drive or might require high RAM for prediction.
Note that the class uses PyTorch. For those of you who wish to complete final projects which include programming, you are also free to use TensorFlow for that (but I may be able to provide less support with that).
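To quickly verify that PyTorch can actually see a GPU (on Colab after switching the runtime type, or on your local machine), you can run, for example:
# check whether a GPU is visible to PyTorch
import torch
print(torch.cuda.is_available())  # True if a GPU runtime / device is available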
Exercise 2: Note on saving your model trained on Colab
Please note that you may want to save your fine-tuned model in Exercise 2 in order to be able to re-use it later. Importantly, sessions on Colab only persist information (including files saved to the session drive) as long as your runtime is connected. Therefore, please download your model to your local machine or mount Colab to your Google Drive (see instructions in Exercise 2).
Exercise 1 (10 points)#
In this exercise, we will look at the aspect of human feedback in the RLHF pipeline which we discussed from a more theoretical perspective in course sessions.
Your job for this exercise is to inspect an open-source human-feedback dataset provided by researchers from Anthropic.
TASK:
Load the following dataset from Huggingface:
Anthropic/hh-rlhf
(you can choose how to access the dataset as you wish)
Understand the structure of the dataset
Answer questions about the dataset and some samples from it on Moodle
Hints and helpful materials:
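For example, a minimal sketch for loading and inspecting the dataset with the HuggingFace datasets library (just one possible way of accessing it):
# one possible way to load and inspect the dataset
from datasets import load_dataset

hh_rlhf = load_dataset("Anthropic/hh-rlhf")
print(hh_rlhf)              # shows the available splits and their sizes
print(hh_rlhf["train"][0])  # inspect a single sample to understand its structure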
Exercise 2 (25 points)#
In this task, we will fine-tune our very own LM with reinforcement learning!
We will use reinforcement learning to fine-tune a pretrained language model for the task of positive review generation. Specifically, we will fine-tune an LM for generating a positive movie review continuation based on a partial review provided as input. The continuation should be positive even if the input was negative. For this, we will use the IMDB dataset of movie reviews (we will only use half of the train split to speed up training). The task is inspired by the IMDB task from this paper. For example, we want the fine-tuned model to do the following:
Example input: “I would put this at the top of my list of films “
Example model prediction we want: “which I would recommend to all of my friends. Great movie!”
For this exercise, your task is to implement a prominent policy-gradient algorithm – REINFORCE (Williams, 1992). This was one of the first algorithms introduced in the policy gradient literature, and it preceded more advanced methods like PPO which we have seen in the lecture. Versions of REINFORCE are still used today; e.g., the Sparrow model which was introduced in class was trained with REINFORCE with a baseline. Like other policy gradient methods, it allows us to directly learn a parameterized policy that maximizes expected returns, without learning value functions.
The REINFORCE weight update rule provides a mechanism for updating parameters of the policy in order to maximize expected returns in the following way:

$$\theta_{t+1} = \theta_{t} + \alpha R \nabla_{\theta} \log \pi(a \mid s, \theta_{t})$$
where \(\theta_{t}\) are the current policy parameters, \(\alpha\) is a learning rate, \(R\) is the reward for the current episode, \(\pi\) the current policy and \(a\) is the action taken in the state \(s\) (for this exercise we assume a bandit environment). Sometimes, for variance reduction purposes, a reward baseline \(b\) is used and \((R-b)\) is used instead of \(R\). Note that REINFORCE also allows us to learn the policy given rollouts under its current parameterization (i.e., we use \(\log P(a \mid s)\) under the current policy). In other words, we approximate the true gradient of the expected return with respect to the policy parameters via sampling. Since we focus on episodic tasks (i.e., sequential tasks which end when a goal state is reached; in our bandit-environment case, we only have one state, so this observation is trivial) and use returns for complete rollouts, REINFORCE is also categorized as a Monte-Carlo algorithm (no need to worry about this if you are not familiar with these).
We will use REINFORCE to fine-tune GPT-2 which has already undergone supervised fine-tuning for predicting reviews on the IMDB dataset (available on HuggingFace): lvwerra/gpt2-imdb
As a reward function, we will use a pretrained sentiment classifier based on the DistilBERT architecture, also trained on the IMDB dataset (available on HuggingFace): lvwerra/distilbert-imdb.
We assume that the classifier provides “ground truth” labels of the sentiment of IMDB reviews by scoring each review as positive (1) or negative (0). For each sample, the classifier provides scores for both labels, which can then be transformed into probabilities of each label being the true one for the given sample. You can find example outputs of the classifier below.
Since we want our policy to predict positive reviews, we can use the score of the positive label as the reward, where higher scores mean more positive reviews, i.e., better performance.
For your convenience, some boilerplate code is already provided below.
YOUR TASK:
familiarize yourself with the dataset, the models and the provided code
implement the REINFORCE update rule by completing the code
implement the reward computation with the classifier by completing the code
train the model for 1 epoch (e.g., on Colab)
save the trained model (instructions below)
submit your code and example test outputs on Moodle
answer the additional questions on Moodle
Hints and additional materials:
please note that the REINFORCE update rule provides a way to update parameters so as to maximize the reward function (which is the objective function in the case of reinforcement learning). However, standard PyTorch optimizers which we use for training minimize the objective function. Please take this into account in your implementation of REINFORCE (a generic sketch of this sign convention is shown right after these hints).
you can find an example with a PyTorch implementation of using REINFORCE for a grid world navigation task here
a practical course on deep RL by HuggingFace, specifically focusing on REINFORCE here
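As a generic illustration of the sign convention mentioned above (a minimal sketch on a toy objective, not the solution to this exercise): to maximize an objective with a PyTorch optimizer, one typically minimizes its negative.
# toy illustration: maximize an objective by minimizing its negative
import torch

theta = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.Adam([theta], lr=0.1)

for _ in range(200):
    objective = -(theta - 3.0) ** 2   # toy objective, maximized at theta = 3
    loss = -objective                 # minimizing the negative maximizes the objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(theta.item())  # should approach 3.0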
# note: if you are running the code on Colab, you may need to install the HuggingFace 'transformers' and 'datasets' libraries
# for that, uncomment and run the following lines:
# !pip install transformers
# !pip install datasets
# import libraries
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
AutoModelForSequenceClassification,
LogitsProcessorList,
MinLengthLogitsProcessor,
TemperatureLogitsWarper,
StoppingCriteriaList,
MaxLengthCriteria,
)
import torch
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# load the IMDB dataset
imdb_ds = load_dataset("imdb")
Below, you can see the structure of the dataset:
# inspect a sample from the train split of the dataset
imdb_ds['train'][0]
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.',
'label': 0}
Below, we load the pretrained models and respective tokenizers that will be used to initialize the policy and the reward model.
# Load policy model
policy_tokenizer = AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
policy = AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb")
# Load reward model
reward_tokenizer = AutoTokenizer.from_pretrained("lvwerra/distilbert-imdb")
reward_model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")
Before incorporating these models into an RL pipeline, you can check below what they output and how the response is formatted (especially for the reward model):
# Run an example input through the policy model just to see how it works
test_txt = "This movie is "
input_ids = policy_tokenizer(test_txt, return_tensors='pt')
out = policy.generate(
**input_ids,
do_sample=True,
temperature=0.9,
max_length=20,
return_dict_in_generate=True,
output_scores=True,
renormalize_logits=True
)
print("Example prediction of the pretrained policy model: ", policy_tokenizer.decode(out.sequences[0]))
Example prediction of the pretrained policy model: This movie is icky. Everything else is just stupid. I find the acting to be laughable and
# Run an example from the IMDB train split to see how the reward model works
input_reward = reward_tokenizer(imdb_ds['train'][0]['text'], return_tensors='pt')
out_reward = reward_model(**input_reward)
print("Raw output format of the reward model: ", out_reward)
# transform logits to probabilities
reward = torch.softmax(out_reward.logits, dim=1)
print(reward) # reward at index 1 is the probability of being positive; i.e., this can be used as the training reward
Raw output format of the reward model: SequenceClassifierOutput(loss=None, logits=tensor([[ 0.4397, -0.7132]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
tensor([[0.7600, 0.2400]], grad_fn=<SoftmaxBackward0>)
Below, a dataset class is defined for convenient preprocessing and loading of IMDB texts. In particular, since we want to train a system to predict review continuations given partial reviews as inputs, we do not need the full reviews supplied in the IMDB dataset. The dataset below only uses the first 64 tokens of each review and returns these as input for our training.
class ImdbDataset(torch.utils.data.Dataset):
"""
Wrapper for the IMDB dataset which returns the tokenized text
and truncates / pads to a maximum length of 64 tokens.
This is done following the paper referenced above where the input review
snippets were maximally 64 tokens and then the review had to be completed
with a positive sentiment.
"""
def __init__(self, dataset, policy_tokenizer):
self.dataset = dataset
self.tokenizer = policy_tokenizer
# following the paper referenced above, input texts are <= 64 tokens
self.max_len = 64
self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
def __getitem__(self, idx):
# get the text from the dataset
text = self.dataset[idx]['text']
# tokenize the text
# and manually prepend the BOS token (the GPT-2 tokenizer does not add it automatically)
tokens = self.tokenizer(
"<|endoftext|>" + text,
truncation=True,
max_length=self.max_len,
padding='max_length',
return_tensors='pt'
)
# return the tokens and the attention mask
return {
'input_ids': tokens.input_ids.squeeze().to(self.device),
'attention_mask': tokens.attention_mask.squeeze().to(self.device)
}
def __len__(self):
return len(self.dataset)
Below, we define a helper function wrapping around our reward model which will be used during RL training in order to score the generations of the policy.
# reward modeling function
def compute_reward(
reward_model,
reward_tokenizer,
sample,
device=torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
):
"""
Computes the reward, formalized as the probability of a sample being positive.
Parameters
----------
reward_model: AutoModelForSequenceClassification
The pretrained sentiment classifier to use for computing the reward.
reward_tokenizer: AutoTokenizer
The tokenizer to use for the reward model.
sample: list[str]
List of reviews generated by the policy of length batch_size.
Returns
-------
reward: torch.Tensor
Tensor of rewards of shape (batch_size,)
"""
# tokenize the sample
input_ids = reward_tokenizer(
sample,
truncation=True,
max_length=128,
padding='max_length',
return_tensors='pt'
)
input_ids = input_ids.to(device)
# get the reward model prediction
### YOUR CODE HERE
# transform logits to probabilities and use these as the reward
### YOUR CODE HERE
# return the reward
return reward
Below, we define the main training loop. In addition to defining the hyperparameters and the iteration over the training data, the REINFORCE update which is used as the training signal should be implemented here.
Hint: When you first implement and test your REINFORCE implementation, you do not need to run on the entire training dataset. Test your code on a small number of training steps, print intermediate results, etc., in order to sanity-check the implementation; a minimal sketch of such a capped run is shown below.
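For illustration, a capped debugging run could look like the following sketch (the cap value and the stand-in iterable are hypothetical; in your code, iterate over the actual dataloader defined below):
# illustrative sketch of a capped sanity-check run
max_debug_steps = 3                        # hypothetical small cap for debugging
for step, batch in enumerate(range(100)):  # stand-in for the actual dataloader
    # ... generate, compute rewards, apply the REINFORCE update, print shapes ...
    if step + 1 == max_debug_steps:
        break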
# training setup
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# dataset and dataloader
policy_tokenizer.pad_token = policy_tokenizer.eos_token
policy_tokenizer.padding_side = "left"
reward_tokenizer.padding_side = "left"
policy.config.pad_token_id = policy_tokenizer.eos_token_id
policy.generation_config.pad_token_id = policy_tokenizer.eos_token_id
policy = policy.to(dtype=torch.bfloat16).to(device)
reward_model = reward_model.to(dtype=torch.bfloat16).to(device)
##### Hyperparameters #####
num_epochs = 1
batch_size = 4
learning_rate = ### YOUR CODE HERE ####
###########################
# instantiate the dataloader
dataset = ImdbDataset(imdb_ds['train'], policy_tokenizer)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
# number of training steps (only half of the train split will be used to speed up training)
num_steps = (len(dataset) // batch_size) // 2
print("Number of training steps: ", num_steps)
# optimizer
optimizer = torch.optim.Adam(policy.parameters(), lr=learning_rate)
# processors for the probability distribution over next tokens for generation (i.e., sampling next action)
logits_processor = LogitsProcessorList(
[
MinLengthLogitsProcessor(1, eos_token_id=policy_tokenizer.eos_token_id),
]
)
# instantiate logits processors
logits_warper = LogitsProcessorList(
[
TemperatureLogitsWarper(0.9),
]
)
# instantiate stopping criterion
stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=112)])
# training loop setup: track losses and rewards
losses = []
rewards_list = []
# training loop
for epoch in range(num_epochs):
for step, batch in enumerate(dataloader):
out = policy.sample(
batch["input_ids"],
logits_processor=logits_processor,
logits_warper=logits_warper,
stopping_criteria=stopping_criteria,
output_scores=True,
return_dict_in_generate=True,
)
# decode predictions for the reward model
out_decoded = policy_tokenizer.batch_decode(
out.sequences,
skip_special_tokens=True
)
# print the current sequence every 10 steps
if step % 10 == 0:
print("current sequence: ", out_decoded)
# below, the log probs of the generated sequences are retrieved
out_scores = torch.stack(out.scores).squeeze()
log_probs = torch.nn.functional.log_softmax(out_scores, dim=-1)
# reshape the tensor to shape (batch_size, sequence_length, vocab_size)
log_probs = log_probs.permute(1, 0, 2)
# get log probs for generated tokens only
sequence_ids = out.sequences[:, batch['input_ids'][0].shape[0]:]
log_probs_continuations_tokens = log_probs.gather(
dim=-1,
index=sequence_ids.unsqueeze(-1)
).squeeze()
# compute log probability of the sequence based on the token log probs
#### YOUR CODE HERE #####
log_probs_continuations_sentences =
# compute the reward with the helper function defined above
#### YOUR CODE HERE #####
rewards =
rewards_list.append(rewards.detach().cpu())
# compute the loss
#### REINFORCE implementation (i.e., implementation of the relevant parts of the formula above here) ####
#### YOUR CODE HERE ######
loss =
losses.append(loss.detach().cpu())
# compute the gradients
loss.backward()
# update the parameters
optimizer.step()
# zero the gradients
optimizer.zero_grad()
# print the loss
print(f'Epoch: {epoch}, Step: {step}, Loss: {loss.item()}')
Saving the trained model#
We will use the trained model in the next exercise; therefore, it should be saved if you want to re-use it for Exercise 3 at a later point. You can save the model to your Google Drive, if you are working on Colab, or locally.
When you execute the following cell, you will be prompted to authorize Colab to access your Drive (this is a prerequisite for using this functionality, unfortunately). Please follow the displayed instructions and then execute the following code cells. Once executed, please double-check that your Drive now indeed contains your model, so as not to lose your work.
Alternatively, if you do not wish to have Colab access your Drive, you can just manually download your model. To do so, please skip the next two cells, and just execute the saving cell after. Then, navigate to the directory symbol on the left panel of Colab, right-click on the new model directory and download it. If you work on a local machine, also just execute this local saving code cell.
# FOR GOOGLE DRIVE & COLAB USE ONLY
# mount Colab to Drive
from google.colab import drive
# FOR GOOGLE DRIVE & COLAB USE ONLY
# do not execute this if you don't want to save to Drive
drive.mount('/content/drive')
policy.save_pretrained('/content/drive/My Drive/gpt2_imdb_policy')
# FOR LOCAL SAVING (TO COLAB SESSION OR YOUR MACHINE)
policy.save_pretrained('gpt2_imdb_policy')
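The saved model can later be re-loaded from the same directory (this is also what the code in Exercise 3 below assumes); for example:
# re-load the fine-tuned policy from the saved directory
from transformers import AutoModelForCausalLM
policy_reloaded = AutoModelForCausalLM.from_pretrained('gpt2_imdb_policy')
# if you saved to Google Drive, use '/content/drive/My Drive/gpt2_imdb_policy' instead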
Below, we inspect the training dynamics. Please answer the respective questions about the plots on Moodle.
# Plot the fine-tuning loss
### YOUR CODE HERE #####
# compute average batch rewards (i.e., average reward per training step)
##### YOUR CODE HERE #####
Exercise 3 (15 points)#
Finally, we will get our hands dirty with evaluating LMs which have already been trained. In this task, we will use a few tasks from one of the most widely used LM benchmarks, the SuperGLUE benchmark:
a natural language inference (NLI) task “rte”,
a task wherein the model has to predict whether a second sentence is entailed by the first one (i.e., predict the label ‘entailment’ or ‘no entailment’)
a question answering task “boolq”,
a task wherein the model has to predict an answer (yes/no) to a question, given context
and a sentence continuation task “copa”.
a task wherein the model has to select one of two sentences as the more plausible continuation given an input sentence.
We will be using (a subset of) the validation splits of the tasks for our evaluation.
With the introduction of the first pretrained language models like BERT, a common approach to using benchmarks like SuperGLUE was to fine-tune the pretrained model on the train split of the benchmark datasets, and then use the test splits for evaluation. With SOTA LLMs, it is more common to do zero- or few-shot evaluation, where the model has to, e.g., predict labels or select answer options without task-specific fine-tuning, just given instructions.
We are also not going to fine-tune our model on these specific tasks. Instead, as introduced in class, we are going to compare the log probabilities of different answer options (e.g., log probabilities of “entailment” vs. “no entailment” following a pair of sentences from the RTE task). With this method, the assumption is that a model’s output prediction for a particular trial is correct iff:

$$\log P_{LM}(\text{<correct label>} \mid \text{context}) > \log P_{LM}(\text{<incorrect label>} \mid \text{context})$$
For tasks like “copa” where there is no single label but instead a sentence continuation, we are going to compute the average token log probability as a single-number representation of the continuation. Here, the model’s prediction will count as correct iff the average log probability of the correct continuation sentence, given the input, is higher than that of the incorrect continuation. We will not be using task instructions in our evaluation since the model wasn’t fine-tuned on instruction-following. A short usage example of the log-probability helper function is provided after its definition below.
Your job is to complete the code below, evaluate the model which you have fine-tuned above and summarize the results you find in a few words (see below for more detailed step-by-step instructions). If you have issues with the previous task and cannot use your own fine-tuned model, please use the initial IMDB fine-tuned GPT-2 with which we initialized the policy in exercise 2. Please indicate which model you are testing on Moodle in the respective exercise responses.
TASK:
Download the data for the three tasks by uncommenting and executing the first code cell below, or by navigating to the repository and downloading the file and unzipping it.
Familiarize yourself with the code and briefly with the selected tasks of the benchmark (more detailed information about the tasks can be found in the paper or on the internet, e.g., here).
Complete the code which tests the model on the benchmarks (a helper for retrieving the log probability of labels is provided).
Submit your completion of the code on Moodle.
Submit your results on Moodle.
Answer some questions about the task on Moodle.
Hints and useful materials:
Note that for some of the tasks we actually pass two sentences as the context, which is then followed by the task labels. Make sure to pass both sentences, where required.
An example evaluation on SuperGLUE can be viewed here.
An example paper comparing (negative) log probabilities of sequences under a language model (i.e., using a similar metric) can be found here.
# !wget https://github.com/polina-tsvilodub/RL4-language-model-training/blob/main/RL4-language-model-training/data/homework2.zip
# !unzip homework2.zip
If you experience issues with the download and/or unzipping when using wget, please download the zip file manually, upload it to Colab and then execute the unzipping step.
tokenizer = AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
model = AutoModelForCausalLM.from_pretrained('gpt2_imdb_policy')
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.eval()
model.to(device)
def get_log_prob_of_completion(
model,
tokenizer,
prompt,
completion,
):
"""
Convenience function for computing the log probability of a completion
given a prompt. This is used to compute the log probability of the
correct and incorrect labels given different trials for the different
SuperGLUE tasks.
"""
# tokenize the prompt and the completion
# truncate so as to fit into the maximal context window of GPT-2
# which is 1024 tokens
input_ids = tokenizer(
prompt + completion,
return_tensors='pt',
truncation=True,
max_length=1024,
)['input_ids'].to(device)
# separately tokenize prompt
# so as to access the logits for the completion only
# when scoring the completion
input_ids_prompt = tokenizer(
prompt,
return_tensors='pt',
truncation=True,
max_length=1024
)['input_ids'].to(device)
# create attention mask and position ids
attention_mask = (input_ids != tokenizer.eos_token_id).to(dtype=torch.int64)
position_ids = attention_mask.cumsum(-1)-1
# get the logits for the completion
with torch.no_grad():
out = model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids
)
# get the logits of the completion
# for that, make a tensor of the logits
# for the completion only
# in particular, we shift the indices by one to the left to access logits of the
# actual sequence tokens
logits_completion = out.logits[:, :-1]
logits_completion = logits_completion.squeeze()
# get the log probabilities for the completion
log_probs = torch.nn.functional.log_softmax(
logits_completion,
dim=-1
)
# retrieve the logit corresponding to the actual completion tokens
try:
log_completion_tokens = log_probs.gather(
dim=-1,
index=input_ids[:, 1:].squeeze().unsqueeze(-1)
)
except:
log_completion_tokens = log_probs.gather(
dim=-1,
index=input_ids[:, 1:].unsqueeze(-1)
)
continuationConditionalLogProbs = log_completion_tokens[
(input_ids_prompt.shape[-1]-1):
]
completion_log_prob = torch.mean(
continuationConditionalLogProbs
).cpu()
return completion_log_prob
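To illustrate how the helper can be used, here is a minimal sketch with a made-up RTE-style trial (the sentences below are purely hypothetical; the actual prompts and labels come from the provided CSV files):
# illustrative usage of the helper on a hypothetical RTE-style trial
example_prompt = "The cat is sleeping on the sofa. The sofa is occupied. "
log_prob_entailment = get_log_prob_of_completion(
    model, tokenizer, example_prompt, "entailment"
)
log_prob_no_entailment = get_log_prob_of_completion(
    model, tokenizer, example_prompt, "no entailment"
)
# the prediction counts as correct if the log probability of the
# correct label is higher than that of the incorrect label
print(log_prob_entailment, log_prob_no_entailment)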
# iterate over the tasks
tasks = ["copa", "rte", "boolq"]
results = []
for t in tasks:
print(f"--- evaluating on task {t} ---")
path = f"homework2/super_glue_formatted_{t}.csv"
# read the task data
df = pd.read_csv(path)
# iterate over the trials
# note that for the BoolQ and RTE tasks, the input
# prompt actually consists of two sentences, and the continuation
# is each of the labels
# therefore, we need to pass both sentences as the input prompt
# to the evaluation
#### YOUR CODE HERE #####
prompt =
# compute the log probabilities for the correct and incorrect answers
# for each trial in each task
##### YOUR CODE HERE #####
get_log_prob_of_completion(
#### YOUR CODE HERE #####
)
# evaluate resulting log probabilities
# i.e., compute whether the log probability of the correct answer
# is higher than the log probability of the incorrect answer
#### YOUR CODE HERE #####
# track results so that you can compute average test accuracy
results.append(#### YOUR CODE HERE #####)
# compute the average accuracy by task
# #### YOUR CODE HERE #####