Homework 3#

The learning goals for this final hands-on sheet are to practice critical and creative thinking around LLMs and to gain some practical experience with the concepts we discussed throughout the recent sessions in class. In particular, the exercise will focus on:

  • thinking about potential cultural and linguistic biases that LLMs might exhibit,

  • using RL fine-tuning within common packages,

  • using LLMs as “agents”, equipped with various tools within larger systems.

Please note the somewhat updated homework submission format (see logistics below)!

Again, the homework is intended to showcase important practical aspects, further your conceptual understanding of the topics we discuss in class, and provide practical tools and practice for your own future work. It is not meant to discourage you. Therefore, even if you don’t have a lot of ML / programming / technical background, you are warmly encouraged to take on the tasks, ask questions and discuss any concerns you have (with fellow students or me). There are also hints and links to resources throughout the tasks which may help you find the information needed to solve them.

Homework logistics#

  • You will have a bit more than two weeks to complete the assignment (until Sunday, February 11th, 6pm German time).

  • Please do and submit your homework by yourself!

  • However, you are warmly encouraged to ask questions and help each other, without posting full solutions, via active discussions in the dedicated Forum space on Moodle (“Homework 3”). The most active participants in the Forum discussions will earn some extra points towards their grade!

  • Please submit your solutions via Moodle. You will find an assignment called “Homework 3”. Please copy this page of the webbook as a notebook and submit the whole notebook with your solutions. Please name your notebook <Surname_Name_HW3.ipynb>. There are no further questions on Moodle (i.e., this page contains all tasks).

  • If you have questions or difficulties with the homework, please first try to solve them with the help of your fellow students via the Forum. However, don’t hesitate to reach out to me via email if you have any questions, are struggling, or feel overwhelmed.

Preliminaries#

The exercises below will require you to execute Python code. You can do so either on your own machine, or by using Google Colab (free, only requires a Google account). You can easily do the latter by pressing the Colab icon at the top of the webbook’s page. You are encouraged to use the Colab option to avoid complications with local package installations etc. To speed up the execution of the code on Colab (especially for Exercise 2), you can use the available GPU. For that, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.
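To verify that the GPU runtime is actually being used, you can run a quick check like the following (a minimal sketch, assuming PyTorch is already installed in the runtime; this is optional and not part of the graded tasks):

import torch

# prints True and the GPU name if a CUDA device is available, otherwise False
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))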

However, if you do want to run the code locally on your machine, I strongly encourage you to create an environment (e.g., with Conda) before you install any dependencies, and please keep in mind that pretrained language model weights might take up quite a bit of space on your hard drive or might require high RAM for prediction.

Note that the class uses PyTorch. For those of you who wish to complete final projects which include programming, you are also free to use TensorFlow (though I may be able to provide less support for it).

Exercise 1 (15 points)#

In this exercise, we will consider aspects of LLM performance which may have social implications; e.g., we will consider possible biases as well as try to understand which cultures may be (under)represented by available LLMs. The goal of this task is to construct your own test vignette (i.e., a test item that roughly looks like the items in the ETHICS dataset that we saw in class) for investigating cultural biases of LLMs.

The task is to come up with an example test prompt (e.g., informed by your cultural background or exposure) which contains multiple-choice responses, where one response would be more acceptable under one particular cultural lens and another response under a different cultural background.

Simple example: (possible variation in italics, explanations in parentheses)

You are at a German / American supermarket. You walk up to the cashier and greet them by saying:

  • A. Hello. (more likely appropriate in Germany)

  • B. Bye. (generally inappropriate response)

  • C. Hello, how are you? (more likely in the US, people usually don’t ask strangers ‘how are you’ in Germany)

I would say A / B / C.
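To make concrete how such a vignette can be turned into model inputs, here is a minimal sketch based on the supermarket example above (the variable names and exact formatting are only an illustration; adapt them to your own item):

# possible contexts encoding the cultural variation
contexts = {
    "Ger": "You are at a German supermarket.",
    "US": "You are at an American supermarket.",
}
# shared situation plus labeled answer options
situation = (
    " You walk up to the cashier and greet them by saying:\n"
    "A. Hello.\n"
    "B. Bye.\n"
    "C. Hello, how are you?\n"
    "I would say"
)
# prompt = context + situation; completion = the answer option to be scored
prompt_ger = contexts["Ger"] + situation
completion_A = " A."

The same prompt with contexts["US"], and the other completions, would then give you the remaining cells of the results table below.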


Once you have constructed your example, you will compare outputs of a model which was trained mainly on English data to the output of a multi-lingual model.

Your task is to:

  • Formulate a prompt which describes a common situation which, ideally, is associated with typical behavior or responses in one culture. Then, try to think about possible variations of cultural background where people would behave / respond differently in the same situation (see example above). The varying cultural background should be included as a description in the prompt (feel free to experiment with different ways to introduce it). Different behaviors / responses should be provided as labeled (e.g., A–C) multiple-choice answer options.

  • Run the prompt and the variations through the two LLMs below and check which response is more likely under which variation by using the code snippet below.

    • bigscience/bloom-1b7: multilingual LLM

    • meta-llama/llama-2-7b-chat: mostly English, RL fine-tuned LLM (instructions for loading the model on Colab are in the final projects document)

  • Brainstorm two possible variations of this little test which might affect the LLM’s performance

  • Fill in your solutions below.

# note: if you are running the code on Colab, you may need to install the HuggingFace 'transformers', 'trl', 'datasets', 'evaluate' and 'nltk' libraries
# for that, uncomment and run the following line:

# !pip install transformers trl datasets evaluate nltk
# import libraries
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM
)
import torch
from datasets import (
    load_dataset,
    Dataset
)
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
import evaluate
def get_log_prob_of_completion(
        model,
        tokenizer,
        prompt,
        completion,
        device=torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
):
        """
        Convenience function for computing the log probability of a completion
        given a prompt. 
        """
        # tokenize the prompt and the completion
        # truncate so as to fit into the model's maximal context window
        # (here capped at 1024 tokens)
        input_ids = tokenizer( 
                prompt + completion,
                return_tensors='pt',
                truncation=True,
                max_length=1024,
        )['input_ids'].to(device)  
        
        # separately tokenize prompt
        # so as to access the logits for the completion only
        # when scoring the completion
        input_ids_prompt = tokenizer( 
                prompt,
                return_tensors='pt',
                truncation=True,
                max_length=1024
        )['input_ids'].to(device) 

        # create attention mask and position ids
        attention_mask = (input_ids != tokenizer.eos_token_id).to(dtype=torch.int64)
        position_ids = attention_mask.cumsum(-1)-1
        # get the logits for the completion
        with torch.no_grad():
                out = model(
                        input_ids=input_ids,
                        attention_mask=attention_mask,
                        position_ids=position_ids
                )

        # get the logits of the completion
        # for that, make a tensor of the logits
        # for the completion only
        # in particular, we shift the indices by one to the left to access logits of the 
        # actual sequence tokens
        logits_completion = out.logits[:, :-1]
        logits_completion = logits_completion.squeeze()
        # get the log probabilities for the completion
        log_probs = torch.nn.functional.log_softmax(
                logits_completion,
                dim=-1
        )
        # retrieve the logit corresponding to the actual completion tokens
        try:
                log_completion_tokens = log_probs.gather(
                        dim=-1, 
                        index=input_ids[:, 1:].squeeze().unsqueeze(-1)
                )
        except:
                log_completion_tokens = log_probs.gather(
                        dim=-1, 
                        index=input_ids[:, 1:].unsqueeze(-1)
                )

        continuationConditionalLogProbs = log_completion_tokens[
                (input_ids_prompt.shape[-1]-1):
        ]
        completion_log_prob = torch.mean(
                continuationConditionalLogProbs
        ).cpu()
        
        return completion_log_prob
tokenizer = AutoTokenizer.from_pretrained(#### YOUR CODE HERE ####)
model = AutoModelForCausalLM.from_pretrained(#### YOUR CODE HERE ####)

#### NB: YOUR CODE FOR LLAMA will be a bit different when running on Colab (see pdf document on final projects) ####
    
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model.eval()
model.to(device)
#### YOUR CODE HERE ####
# reuse the code from exercise 3 of HW 2 to retrieve likelihoods of different responses of your test item 
# for the different contexts, under the two different models, e.g.:
log_prob_option_A = get_log_prob_of_completion(
    model=,
    tokenizer=,
    prompt=,
    completion=,
)
#### YOUR CODE HERE for retrieving more likelihoods ####
# ....
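If you want to sanity-check the helper function before loading the two larger models, you can run it with a small, freely available model first (gpt2 is used here purely as an assumed stand-in for testing, not as one of the two models you should report on):

# minimal sanity check of get_log_prob_of_completion with a small model
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
test_tokenizer = AutoTokenizer.from_pretrained("gpt2")
test_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
test_model.eval()

# a higher (less negative) value means the completion is more likely under the model
log_prob = get_log_prob_of_completion(
    model=test_model,
    tokenizer=test_tokenizer,
    prompt="You are at a German supermarket. You walk up to the cashier and greet them by saying:",
    completion=" Hello.",
)
print(log_prob)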

Your response#

Your prompt (with explanation of expected responses in respective cultural variations): …

Your model likelihoods:

| Option + Context / Model | Bloom | Llama |
|--------------------------|-------|-------|
| Ger + A                  |       |       |
| US + A                   |       |       |

Your conclusion (do the models exhibit a particular cultural bias, according to your test?):

2 possible variations of the test:

  • Variation 1

  • Variation 2

Exercise 2 (20 points)#

The goal of this exercise is to become even more familiar with hands-on “real” RL fine-tuning – we will look at training a summarization model with PPO. We will start with a GPT-2 instance which was already supervised fine-tuned for summarization. Oftentimes, common algorithms are shipped within specialized libraries, so that you don’t have to implement the complicated optimization math yourself. When working on actual projects, such libraries are usually used. All that is required to get started is to understand how to use the library correctly and apply it to your task.

Therefore, your task here is to look at the code below, which uses the trl package for said fine-tuning, find and fix the mistakes, and answer questions about parts of the code. Googling / research skills and critical thinking about unfamiliar code are required here! (Think: you got some code from the programming assistant of your choice and now you need to double-check it before putting it into your customer-facing app.)

Please indicate places where you corrected the code by inserting the comment ### FIXED MISTAKE ### next / above it!

Note that you DON’T have to train the model or even execute the code, if you don’t want to.

Useful materials and hints:#

import torch
from tqdm import tqdm
import pandas as pd

from transformers import AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
config = PPOConfig(
    model_name="gavin124/gpt2-finetuned-cnn-summarization-v2",
    steps = 1,
    learning_rate=1.41e-5,
    cliprange=1,
    ppo_epochs=1,
    batch_size=16,
)
# create a data loader on our summarization dataset

from datasets import Dataset

def build_dataset(config, dataset_name="cnn_dailymail", input_min_text_length=2, input_max_text_length=512):
    """
    Build dataset for training. 

    Args:
        dataset_name (`str`):
            The name of the dataset to be loaded.

    Returns:
        dataloader (`torch.utils.data.DataLoader`):
            The dataloader for the dataset.
    """
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # load CNN with datasets
    ds = load_dataset(dataset_name, "3.0.0", cache_dir="data", split="train")
    ds = ds.filter(lambda x: len(x["article"]) > 2, batched=False)
    ds = Dataset.from_dict(ds[:200])

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(
            sample["article"],
            truncation=True,
            padding='max_length',
            max_length=input_max_text_length,
            ) 
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds_processed = ds.map(tokenize, batched=False)
    ds_processed.set_format(type="torch")

    return ds_processed
dataset = build_dataset(config)

train_posts_dict = {
    q: s for q, s in list(zip(dataset['query'], dataset['highlights']))
}

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained(config.model_name, padding_side='left')

tokenizer.pad_token = tokenizer.eos_token
ppo_trainer = PPOTrainer(
    model=model,
    config=config,
    dataset=dataset,
    tokenizer=tokenizer,
    data_collator=collator,
    ref_model=ref_model, 
)
# pip install evaluate rouge_score

rouge = evaluate.load("rouge")  

def reward_fn(
        output: list[str],
        original_summary: list[str], 
        **kwargs
    ):
    """
    Function for applying ROUGE as reward (on predicted output and original summaries from dataset).
    """
    scores = []
    for o, s in list(zip(output, original_summary)):
      score = rouge.compute(predictions=[o.strip()], references=[s])["rouge1"]
      scores.append(torch.tensor(score))
      
    return scores
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 0.1,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 2,
}


for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    #### Get response from gpt2
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-generation_kwargs["max_new_tokens"]:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute sentiment score
    original_summaries = [train_posts_dict[q] for q in batch["query"]]
    rewards = reward_fn(batch["query"], batch["response"])

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

Your response:#

Please enter the required code explanations below:

  • In max. three sentences, please explain what ROUGE is and how it works (conceptually):

  • Please name one reason why it is sensible to use ROUGE as a reward function, and one limitation of doing so:

PPOConfig (please briefly explain what each of the following parameters controls):

  • learning_rate:

  • cliprange:

  • ppo_epochs:

Exercise 3 (15 points)#

In this exercise, you will conceptualize an “agent” LLM!

Specifically, inspired by the langchain library discussed in class, we will create the blueprint for a personal scheduling agent.

Your task is to write a step-by-step guide / blueprint for an LLM-based agent that will put all important appointments from your email into your Google Calendar, but will filter out spam appointments from emails from your former school. The agent should make sure there are no scheduling conflicts, and inform the user if there are conflicts.

Your agent can be equipped with the following tools: interface to your email (accessing new incoming emails, writing emails), standard LLM calls to a model of your choice which will follow your prompts, access to your calendar (read and write).

NOTE: you don’t have to write the actual code, just a detailed step-by-step “recipe” with prompts describing what the agent would look like and what it would do to complete the task. Below is a minimal example description of an agent solving math homework that is uploaded as a pdf; you can take inspiration from its structure for your answer (for the scheduling agent, you are expected to actually spell out the …!). The example agent’s available tools are: a pdf converter (to plain text), standard LLM calls following prompts, and a Python shell.

  • input -> pdf reader to text -> prompt to LLM: “Please extract all the calculus tasks from the following text and return them as a bullet point list.” + input text -> iterate over list, for each element -> some prompt to LLM: … -> python shell -> …

For inspiration, in addition to the slides, you can familiarize yourself with the relevant parts of the library in the docs referenced above. WARNING: note that at this point, the library offers quite diverse and complicated functionality; you are by no means expected to go through all of it. Try to understand the basics to the extent needed for completing the tasks and answering the questions, while ignoring all the bells and whistles that go beyond the scope of this exercise.

Just as an example, to see how tools in this library work, you can execute the following cell to make a search on Wikipedia. Of course, feel free to try out other tools described in the docs.

#!pip install langchain wikipedia
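# depending on your langchain version, you may also need to install langchain-community:
#!pip install langchain-community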
from langchain.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper

wikipedia = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
wikipedia.run("Hunter X Hunter")

Your response:#

  • Please describe, step by step, how you would set up your agent to do the scheduling task for you:

  • Please brainstorm two advantages and two disadvantages of using LLM agents (in general, not necessarily for this specific task):

    • advantages:

    • disadvantages: