Homework 1#

The learning goals of this first hands-on sheet are:

  • to make sure that you can execute code on your machines or on Google Colab in order to experiment with LMs and RL yourself!

  • to familiarize yourself with the HuggingFace library which provides many pretrained LMs and handy tools for working with them,

  • to develop basic intuitions about core RL concepts,

  • and to train your first RL agent!

Most importantly, the homework is intended to showcase important practical aspects, provide space for learning how to find solutions to practical problems, and further your conceptual understanding of the topics we discuss in class. It is not meant to discourage you. Therefore, even if you don’t have a lot of ML / programming / technical background, you are warmly encouraged to take on the tasks, ask questions and discuss any concerns you have (with fellow students or me). There are also hints and links to resources throughout the tasks which may help you find the information needed to solve them.

Homework logistics#

  • You will have two weeks to complete the assignment (until Wed, November 8th, 12:30pm).

  • Please do and submit your homework by yourself!

  • However, you are warmly encouraged to ask questions and help each other, without posting full solutions, via active discussions in the dedicated Forum space on Moodle (“Homework 1”). The most active participants in the Forum discussions will earn some extra points for their grade!

  • Please submit your solutions via Moodle. You will find a quiz called “Homework 1” with questions and answer fields corresponding to the respective exercise numbers listed below.

  • If you have questions or difficulties with the homework, please try to solve them with the help of your fellow students via the Forum. However, I will also offer a consultation session on Tuesday, October 31st, 2pm-4pm, on Zoom, under the usual class link. Also don’t hesitate to reach out to me via email if you have any questions, are struggling or feel overwhelmed.

Preliminaries#

The exercises below will require you to execute Python code. You can do so either on your own machine, or by using Google Colab (free, only requires a Google account). You can easily do the latter by pressing the Colab icon at the top of the webbook’s page. You are encouraged to use the Colab option to avoid complications with local package installations etc. To speed up the execution of the code on Colab (especially Exercise 1), you can use the available GPU. For that, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.

However, if you do want to run the code locally on your machine, I strongly encourage you to create an environment (e.g., with Conda) before you install any dependencies, and please keep in mind that pretrained language model weights might take up quite a bit of space on your hard drive or might require a lot of RAM for inference. In particular, the model used in these exercises requires about 6 GB of disk space and around 8 GB of RAM for stable training.

Note that the class uses PyTorch. For those of you who wish to complete final projects which include programming, you are also free to use TensorFlow for that (but I may be able to provide less support with that).

Exercise 1 (20 points)#

In this exercise, we will load a pretrained LM from HuggingFace and explore how to work with it, using the tools provided by the library.

Exercise 1.1 (5 points)#

Your task is to use the pretrained model “GPT-Neo” (1.3B parameters) to run inference. In particular, complete the code below in order to produce a continuation for the sentence “Reinforcement learning is ” using beam search with k=5. (Hint: beam search is a particular decoding scheme used on top of trained language models. If you are not familiar with it, please do some research to get an overall idea of it as part of this task.)

You can find information for completing the code, e.g., here.
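
If it is helpful, here is a minimal, self-contained sketch of beam-search decoding with the HuggingFace generate API. The model name (distilgpt2), the prompt, and the generation length are placeholders chosen for illustration only, not the model and settings required by this exercise:

# minimal beam-search decoding sketch (model name, prompt and generation
# length are placeholders, not the settings required by this exercise)
from transformers import AutoModelForCausalLM, AutoTokenizer

toy_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
toy_model = AutoModelForCausalLM.from_pretrained("distilgpt2")

toy_input_ids = toy_tokenizer.encode("The weather today is", return_tensors="pt")
# num_beams sets the number of beams kept at each decoding step
toy_output = toy_model.generate(toy_input_ids, num_beams=5, max_new_tokens=20)
print(toy_tokenizer.decode(toy_output[0], skip_special_tokens=True))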

TASK: Please submit your result (i.e., produced text) on Moodle and answer questions about the code.

# note: if you are running the code on Colab, you may need to install the HuggingFace 'transformers' library
# for that, uncomment and run the following line:

# !pip install transformers
# import the pipeline utility from the transformers library
from transformers import pipeline
generator = pipeline(
    'text-generation', 
    model='EleutherAI/gpt-neo-1.3B'
)

### YOUR CODE HERE ###

Exercise 1.2 (15 points)#

Your task is to complete the code below in order to fine-tune the model for question answering on the “TruthfulQA” dataset. The goal of this exercise is to understand, from first-hand experience, the basic components that go into fine-tuning an LM. Therefore, you can run the fine-tuning for just a couple of training steps.

For convenience, the data loading process is already implemented for you. You can find relevant information for completing the task here.

TASK: Please post the code from the cells you completed on Moodle, and answer the questions about the other parts of the code there.
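
If you have not written a PyTorch training loop before, the following self-contained sketch illustrates the generic pattern (zero gradients, forward pass, loss, backward pass, parameter update) on a toy linear model with made-up data. It is only an illustration of the pattern, not the solution code for this exercise:

# generic PyTorch training-loop sketch on a toy linear model
# (made-up data; only illustrates the pattern used in the exercise below)
import torch

toy_x = torch.randn(64, 1)                      # toy inputs
toy_y = 2 * toy_x + 0.1 * torch.randn(64, 1)    # toy targets (y is roughly 2x)

toy_model = torch.nn.Linear(1, 1)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)

for step in range(100):
    toy_optimizer.zero_grad()                                   # reset gradients
    toy_predictions = toy_model(toy_x)                          # forward pass
    toy_loss = torch.nn.functional.mse_loss(toy_predictions, toy_y)  # compute the loss
    toy_loss.backward()                                         # backward pass: compute gradients
    toy_optimizer.step()                                        # update the parameters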

# first, we import the necessary libraries
# again, use !pip install ... if libraries are missing on Colab
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM
# load the dataset
dataset = load_dataset("truthful_qa", "generation")
# inspect a sample from the dataset to get an idea of the formatting
print(dataset['validation'][0])
# the dataset only has a 'validation' split, so we use that. 
# for simplicity, we are not further splitting the data into train/val/test
# but just using everything for training
dataset_val = dataset['validation']
# load pretrained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
# add padding token to tokenizer
tokenizer.pad_token = tokenizer.eos_token
# create a pytorch dataset wrapper around the huggingface dataset
# which will allow for easy preprocessing and formatting
class TruthfulQADataset(Dataset):
    """
    Helper class to create a pytorch dataset.
    Each sample is formatted with 'Question: {question} Answer:' prefixes.
    Also pads and truncates the strings to a given maximum length,
    so that they can be batched.
    The implemented methods are required by pytorch.

    Parameters
    ----------
    dataset : huggingface dataset
        The dataset to wrap around.
    tokenizer : huggingface tokenizer
        The tokenizer to use for tokenization.
    max_length : int
        The maximum length of the input and output sequences.
    """
    def __init__(self, dataset, tokenizer, max_length=128):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        """
        Returns a single preprocessed sample from the dataset,
        at given index idx.
        """
        # NOTE: this is an updated version, compared to the initial homework
        # source code.
        # In particular, it includes correct attention masking and input formatting.
        item = self.dataset[idx]
        question = item['question']
        answer = item['best_answer']
        # format input
        input = f"Question: {question} Answer: {answer}"

        # tokenize input
        # with padding='max_length' and truncation=True, the tokenizer pads and
        # truncates the sequence to max_length (the EOS token serves as the PAD token here)
        inputs = self.tokenizer(
            input, 
            return_tensors='pt', 
            max_length=self.max_length, 
            padding='max_length', 
            truncation=True
        )
        
        return inputs
# instantiate dataset
train_dataset = TruthfulQADataset(dataset_val, tokenizer)
# create a DataLoader for the dataset
# the data loader will automatically batch the data
# and iteratively return training examples (question answer pairs) in batches
dataloader = DataLoader(
    train_dataset, 
    batch_size=8, 
    shuffle=True
)
# training configurations
# feel free to play around with these
epochs = 1
train_steps = 100
# using the GPU if available
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print("Using device:", device)
# put the model in training mode
model.train()
# move the model to the device (e.g. GPU)
model = model.to(device)

# define optimizer and learning rate
optimizer = ### YOUR CODE HERE ###

# define some variables to accumulate the losses
losses = []

# iterate over epochs
for e in range(epochs):
    # iterate over training steps
    for i in range(train_steps):
        # get a batch of data
        x = next(iter(dataloader))
        # move the data to the device (GPU)
        x = x.to(device)

        # forward pass through the model
        ### YOUR CODE HERE ###
        # get the loss
        loss = ### YOUR CODE HERE ###
        # backward pass
        loss.backward()
        losses.append(loss.item())
        # update the parameters of the model
        ### YOUR CODE HERE ###

NOTE: The purpose of this exercise is to just get the training running correctly. The quality of the predicted answer after the fine-tuning does not matter for the grading. That is, you don’t need to worry in case the predicted answer seems not great to you.

# Test the model

# set it to evaluation mode
model.eval()
model = model.to("cpu")
# generate some text for one of the questions from the dataset
question = dataset_val[-1]['question']
print("Question: ", question)
# tokenize the question and generate an answer
input = f"Question: {question} Answer:"
input_ids = tokenizer.encode(input, return_tensors='pt').to('cpu')
prediction = ### YOUR CODE HERE ###
# decode the prediction
answer = tokenizer.decode(prediction[0])
print("Predicted answer after fine-tuning: ", answer)
# Plot the fine-tuning loss

plt.plot(losses)
plt.xlabel("Training steps")
plt.ylabel("Loss")

Exercise 2 (10 points)#

The goal of this exercise is to apply basic concepts of reinforcement learning to one of the “holy grail” tasks in machine learning and AI – chess.

Your task is to map concepts like “agent”, “action”, and “state” that we have discussed in class onto their “real-world” counterparts in the game of chess (e.g., played by a computer program).

TASK: Please fill in your responses on Moodle.

Exercise 3 (20 points)#

In this exercise, you will train your very first RL agent!

Imagine that your agent just moved to a new town and is exploring the local restaurants. There are 10 restaurants, named 0, 1, …, 9, in this town. The agent does not know anything about the restaurants in the beginning (and also mysteriously cannot find any reviews to look at). Therefore, she needs to try the restaurants herself and figure out which one will make her the happiest during her time in this town (i.e., will give her the highest expected reward).

This problem of choosing which action (i.e., going to which restaurant) is reward-maximizing in a given situation, given several action options, is a (simplified) instance of the so-called k-armed bandit problem (where k is the number of available actions, here: 10). This problem is very well studied in RL.

For this exercise, we assume a number of simplifications. We assume that the quality of the restaurants is deterministic (i.e., doesn’t change over the times the agent goes there), and the agent’s preferences don’t change over time, either. (Hint: what does this mean for the value of actions and the rewards?)

Based on these assumptions, in this exercise we apply a simple algorithm for estimating the values \(Q_t(a)\) of the available actions \(a \in A\) at time \(t\) (think: the subjective value to the agent of going to a given restaurant, e.g., her degree of happiness upon eating there).

Specifically, we will apply a simple sample-average method for estimating action values at time \(t\), wherein each action value is estimated as the average of the rewards received in the past for choosing that action:

\[ Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}} \]

Time \(t\) here refers to the \(t\)-th time the agent decides which restaurant to go to in this new town. Based on the estimated action values, we will derive two different ‘strategies of behavior’ (i.e., policies): the greedy and the \(\epsilon\)-greedy policy.
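
As a quick worked example of the sample-average estimate (with a made-up action/reward history, not the arrays used in the exercise code): if action 3 was chosen twice and yielded rewards 1.0 and 0.5, then its estimated value is (1.0 + 0.5) / 2 = 0.75. A corresponding numpy sketch:

# worked example of the sample-average estimate (made-up history, for illustration only)
import numpy as np

past_actions = np.array([3, 1, 3, 0])             # actions chosen so far
past_rewards = np.array([1.0, -0.5, 0.5, 0.2])    # rewards received for them

a = 3
chosen = past_actions == a
# average of the rewards obtained when action a was chosen: (1.0 + 0.5) / 2 = 0.75
q_estimate = past_rewards[chosen].mean() if chosen.any() else 0.0
print(q_estimate)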

TASK: Your task is to complete the code below to train the agent to explore the restaurants and answer some questions about the results on Moodle.

# import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

For this \(k\)-armed restaurant bandit environment, we assume that there is a ground truth value of each of the restaurants for the agent. For instance, we know that the agent’s favorite food is Thai curry, and e.g. restaurant 4 has the best Thai curry in town – therefore, restaurant 4 would have the highest true value. Formally, the true value of an action is:

\[ q_*(a) = \mathbb{E}[R_t \mid A_t = a] \]

On the other hand, the values of the other restaurants might be lower, or even negative (e.g., the agent gets food poisoning when going there). For our simulation, these true values are defined below. These ground truth values are initially unknown to our agent, and her task is to estimate them from experience.

# define possible actions (10 restaurants in the new town)
actions = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# sample the true values of the actions for the agent
true_restaurant_rewards = np.random.normal(0, 1, 10)
true_restaurant_rewards

In our toy world, the agent tries the different restaurants over multiple days and receives a reward (e.g., writes down her subjective happiness value) every time she goes to a restaurant. A single trial is generated by the environment below:

def town_environment(action, true_restaurant_rewards):
    """
    The town environment returns 'an experience' of our agent,
    i.e., the reward associated with a given action.
    """
    reward = true_restaurant_rewards[action]
    return reward

Your task is to complete the code below so as to implement the estimation of the action values from past experience. In particular, implement an algorithm that estimates the values of actions based on accumulating experience, and track the expected reward (i.e., the mean reward) that the agent would receive if she behaved according to her estimates given that amount of experience.

Specifically, the function below should implement the sample-average estimation (defined above) and return an action according to the current estimate. For action selection, please implement two policies:

  • a greedy policy (returning the action with the highest value according to the current estimate)

  • an \(\epsilon\)-greedy policy (returning the action with the highest value according to the current estimate in a proportion \(1-\epsilon\) of the decisions, and returning a randomly chosen action in a proportion \(\epsilon\) of cases). You are free to choose your own value of \(\epsilon\). (A minimal selection sketch follows after this list.)
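
Below is a minimal sketch of \(\epsilon\)-greedy selection on a made-up array of value estimates. The values and \(\epsilon\) here are placeholders for illustration, and the skeleton in the exercise additionally asks for random tie-breaking among equally valued actions:

# epsilon-greedy selection sketch (made-up value estimates, for illustration only)
import numpy as np

toy_values = np.array([0.1, 0.75, 0.3])    # current estimates for 3 actions
toy_epsilon = 0.1

if np.random.uniform(0, 1) < toy_epsilon:
    chosen_action = np.random.choice(len(toy_values))    # explore: random action
else:
    chosen_action = int(np.argmax(toy_values))           # exploit: greedy action
print(chosen_action)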

def sample_average_estimator(old_actions, old_rewards, actions, epsilon=0):
    """
    Implement the sample-average estimator of the action values.

    Parameters
    ----------
    old_actions : numpy array
        The actions taken by the agent before the current step.
    old_rewards : numpy array
        The rewards received by the agent before the current step
        for the taken actions.
    actions : list
        The list of all possible actions.
    epsilon : float
        The probability of taking a random action.

    Returns
    -------
    values : numpy array
        The estimated values of the actions.
    best_action : int
        The best action according to the estimated values and the 
        current policy.
    """
    # initially, the values of all actions are 0
    values = np.array(###YOUR CODE HERE###)
    # compute averages over previously observed rewards 
    # for each action
    for action in actions:
        old_indices = np.where(old_actions == action)[0]
        if len(old_indices) == 0:
            value = 0
        else:
            ###YOUR CODE HERE###
        values[action] = ###YOUR CODE HERE###
    # return a random action with probability epsilon
    if np.random.uniform(0, 1) < epsilon:
        ###YOUR CODE HERE###
    # return the action with the highest value with random tie-breaking with probability 1 - epsilon
    else:
        if np.sum(values == np.max(values)) > 1:
            best_action = np.random.choice(np.where(values == np.max(values))[0], 1)[0]
        else:
            ###YOUR CODE HERE###
            
    # return the actions' updated values
    # and best action
    return values, best_action

The following cell embeds the function into a loop where the agent gathers experiences over 90 days (i.e., over 90 action-reward pairs) and we can observe how her average reward as well as her action choices develop with accumulated experience.

# initialize the algorithm
old_actions = np.array([])
old_rewards = np.array([])
# initialize some variables for logging
actions_log = []
rewards_list = []
average_rewards_list = []
# identify the ground truth optimal action so as to check
# how often the agent would choose it
optimal_action = actions[np.argmax(true_restaurant_rewards)]

# iterate over 90 "experience steps"
for i in range(90):
    # run the algorithm with a GREEDY policy
    
    # return selected action according to current estimates
    values, best_action = ### YOUR CODE HERE ###
    # observe the reward for the currently estimated best action
    reward = town_environment(best_action,true_restaurant_rewards)

    # create experience arrays
    old_actions = np.append(old_actions, best_action)
    old_rewards = np.append(old_rewards, reward)

    # log the results
    # check if the best action is the optimal action
    actions_log.append(best_action == optimal_action)
    rewards_list.append(reward)
    average_rewards_list.append(sum(rewards_list) / len(rewards_list))
    
# plot results

plt.plot(np.cumsum(actions_log) / np.arange(1, len(actions_log) + 1))
plt.xlabel("Experience steps")
plt.ylabel("Optimal action rate")
plt.show()

plt.plot(average_rewards_list)
plt.xlabel("Experience steps")
plt.ylabel("Average reward")
plt.show()
# NOW RUN THE SAME ALGORITHM WITH EPSILON-GREEDY POLICY

### YOUR CODE HERE ###