Homework 1#
The learning goals of this first hands-on sheet are:
to make sure that you can execute code on your own machine or on Google Colab in order to experiment with LMs and RL yourself,
to familiarize yourself with the HuggingFace library, which provides many pretrained LMs and handy tools for working with them,
to develop basic intuitions about core RL concepts,
and to train your first RL agent!
Most importantly, the homework is intended to showcase important practical aspects, provide space for learning how to find solutions to practical problems, and deepen your conceptual understanding of the topics we discuss in class. It is not meant to discourage you. Therefore, even if you don’t have a lot of ML / programming / technical background, you are warmly encouraged to take on the tasks, ask questions, and discuss any concerns you have (with fellow students or me). There are also hints and links to resources throughout the tasks which may help you find the information you need to solve them.
Homework logistics#
You will have two weeks to complete the assignment (until Wed, November 8th, 12:30pm).
Please do and submit your homework by yourself!
However, you are warmly encouraged to ask questions and help each other, without posting full solutions, via active discussions in the dedicated Forum space on Moodle (“Homework 1”). The most active participants in the Forum discussions will earn some extra points towards their grade!
Please submit your solutions via Moodle. You will find a quiz called “Homework 1” with questions and answer fields corresponding to the respective exercise numbers listed below.
If you have questions or difficulties with the homework, please try to solve them with the help of your fellow students via the Forum. However, I will also offer a consultation session on Tuesday, October 31st, 2pm-4pm, on Zoom, under the usual class link. Also, don’t hesitate to reach out to me via email if you have any questions, are struggling, or feel overwhelmed.
Preliminaries#
The exercises below will require you to execute Python code. You can do so either on your own machine, or by using Google Colab (free, only requires a Google account). You can easily do the latter by pressing the Colab icon at the top of the webbook’s page. You are encouraged to use the Colab option to avoid complications with local package installations etc. To speed up the execution of the code on Colab (especially for Exercise 1), you can use the available GPU. For that, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.
However, if you do want to run the code locally on your machine, I strongly encourage you to create an environment (e.g., with Conda) before you install any dependencies, and please keep in mind that pretrained language model weights might take up quite a bit of space on your hard drive or might require a lot of RAM for prediction. In particular, the model used in these exercises requires 6 GB of disk space and around 8 GB of RAM for stable training.
Note that the class uses PyTorch. For those of you who wish to complete final projects that include programming, you are also free to use TensorFlow instead (but I may be able to provide less support for it).
Exercise 1 (20 points)#
In this exercise, we will load a pretrained LM from HuggingFace and explore how to work with it, using the tools provided by the library.
Exercise 1.1 (5 points)#
Your task is to use the pretrained model “GPT-NEO” (1.3B parameters) to run inference. In particular, complete the code below in order to produce a continuation for the sentence “Reinforcement learning is ” using beam search with k=5. (Hint: beam search is a particular decoding scheme used on top of trained language models. If you are not familiar with it, please do some research to get an overall idea of it as part of this task.)
You can find information for completing the code, e.g., here.
TASK: Please submit your result (i.e., produced text) on Moodle and answer questions about the code.
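If you have never used a HuggingFace text-generation pipeline before, the sketch below illustrates what a beam-search generation call can look like. It deliberately uses a small placeholder model (distilgpt2) and a placeholder prompt rather than the model and sentence from the exercise; the generation arguments (num_beams, do_sample, max_new_tokens) are simply forwarded to the model’s generate method.

```python
# illustrative sketch only: beam-search generation with a small placeholder model
from transformers import pipeline

toy_generator = pipeline('text-generation', model='distilgpt2')

# beam search keeps the num_beams most probable partial continuations at each step;
# do_sample=False turns off sampling, so the decoding is deterministic
toy_output = toy_generator(
    "Machine learning is ",   # placeholder prompt, not the one from the exercise
    num_beams=5,
    do_sample=False,
    max_new_tokens=30,
)
print(toy_output[0]['generated_text'])
```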
# note: if you are running the code on Colab, you may need to install the HuggingFace 'transformers' library
# for that, uncomment and run the following line:
# !pip install transformers
# import the text-generation pipeline helper from the HuggingFace transformers library
from transformers import pipeline
generator = pipeline(
'text-generation',
model='EleutherAI/gpt-neo-1.3B'
)
### YOUR CODE HERE ###
Exercise 1.2 (15 points)#
Your task is to complete the code below in order to fine-tune the model for question answering on the “Truthful QA” dataset. The goal of this exercise is to understand, from first-hand experience, the basic components that go into fine-tuning an LM. Therefore, you can run the fine-tuning for just a couple of training steps.
For convenience, the data loading process is already implemented for you. You can find relevant information for completing the task here.
TASK: Please post the code from the cell(s) where you completed something on Moodle, and answer the questions about the other parts of the code there as well.
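If the bare-bones PyTorch training loop is new to you, here is a minimal, self-contained sketch of the generic training-step pattern (optimizer setup, forward pass, loss, backward pass, parameter update) on a toy linear model with random data. Both the model and the data are made up purely for illustration; the fine-tuning loop below follows the same pattern, just with the language model and its loss.

```python
# illustrative sketch only: one generic PyTorch training step on a toy model
import torch

# toy data and toy model, made up purely for illustration
toy_inputs = torch.randn(16, 4)
toy_targets = torch.randn(16, 1)
toy_model = torch.nn.Linear(4, 1)

# 1. define an optimizer over the model's parameters (with a learning rate)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3)

# 2. forward pass: compute predictions and a scalar loss
predictions = toy_model(toy_inputs)
loss = torch.nn.functional.mse_loss(predictions, toy_targets)

# 3. backward pass: compute gradients of the loss w.r.t. the model parameters
loss.backward()

# 4. update the parameters, then reset the gradients for the next step
toy_optimizer.step()
toy_optimizer.zero_grad()
```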
# first, we import the necessary libraries
# again, use !pip install ... if libraries are missing on Colab
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM
# load the dataset
dataset = load_dataset("truthful_qa", "generation")
# inspect a sample from the dataset to get an idea of the formatting
print(dataset['validation'][0])
# the dataset only has a 'validation' split, so we use that.
# for simplicity, we are not further splitting the data into train/val/test
# but just using everything for training
dataset_val = dataset['validation']
# load pretrained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
# add padding token to tokenizer
tokenizer.pad_token = tokenizer.eos_token
# create a pytorch dataset wrapper around the huggingface dataset
# which will allow for easy preprocessing and formatting
class TruthfulQADataset(Dataset):
"""
Helper class to create a pytorch dataset.
Each sample is formatted with 'Question: {question} Answer:' prefixes.
Also pads and truncates the strings to a given maximum length,
so that they can be batched.
The implemented methods are required by pytorch.
Parameters
----------
dataset : huggingface dataset
The dataset to wrap around.
tokenizer : huggingface tokenizer
The tokenizer to use for tokenization.
max_length : int
The maximum length of the input and output sequences.
"""
def __init__(self, dataset, tokenizer, max_length=128):
self.dataset = dataset
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.dataset)
def __getitem__(self, idx):
"""
Returns a single preprocessed sample from the dataset,
at given index idx.
"""
# NOTE: this is an updated version, compared to the initial homework
# source code.
# In particular, it includes correct attention masking and input formatting.
item = self.dataset[idx]
question = item['question']
answer = item['best_answer']
# format input
input = f"Question: {question} Answer: {answer}"
# tokenize input
# note that the tokenizer handles padding (with the PAD token we set above) and truncation to max_length
inputs = self.tokenizer(
input,
return_tensors='pt',
max_length=self.max_length,
padding='max_length',
truncation=True
)
return inputs
# instantiate dataset
train_dataset = TruthfulQADataset(dataset_val, tokenizer)
# create a DataLoader for the dataset
# the data loader will automatically batch the data
# and iteratively return training examples (question answer pairs) in batches
dataloader = DataLoader(
train_dataset,
batch_size=8,
shuffle=True
)
# training configuration
# feel free to play around with these
epochs = 1
train_steps = 100
# using the GPU if available
if torch.cuda.is_available():
device = "cuda"
elif torch.backends.mps.is_available():
device = "mps"
else:
device = "cpu"
print("Using device:", device)
# put the model in training mode
model.train()
# move the model to the device (e.g. GPU)
model = model.to(device)
# define optimizer and learning rate
optimizer = ### YOUR CODE HERE ###
# define some variables to accumulate the losses
losses = []
# iterate over epochs
for e in range(epochs):
# iterate over training steps
for i in range(train_steps):
# get a batch of data (next(iter(dataloader)) re-creates the iterator, so with shuffle=True this simply draws a fresh random batch each step)
x = next(iter(dataloader))
# move the data to the device (GPU)
x = x.to(device)
# forward pass through the model
### YOUR CODE HERE ###
# get the loss
loss = ### YOUR CODE HERE ###
# backward pass
loss.backward()
losses.append(loss.item())
# update the parameters of the model
### YOUR CODE HERE ###
NOTE: The purpose of this exercise is to just get the training running correctly. The quality of the predicted answer after the fine-tuning does not matter for the grading. That is, you don’t need to worry in case the predicted answer seems not great to you.
# Test the model
# set it to evaluation mode
model.eval()
model = model.to("cpu")
# generate some text for one of the questions from the dataset
question = dataset_val[-1]['question']
print("Question: ", question)
# tokenize the question and generate an answer
input = f"Question: {question} Answer:"
input_ids = tokenizer.encode(input, return_tensors='pt').to('cpu')
prediction = ### YOUR CODE HERE ###
# decode the prediction
answer = tokenizer.decode(prediction[0])
print("Predicted answer after fine-tuning: ", answer)
# Plot the fine-tuning loss
plt.plot(losses)
plt.xlabel("Training steps")
plt.ylabel("Loss")
Exercise 2 (10 points)#
The goal of this exercise is to apply basic concepts of reinforcement learning to one of the “holy grail” tasks in machine learning and AI – chess.
Your task is to map concepts like “agent”, “action”, and “state” that we have discussed in class onto their “real-world” counterparts in the game of chess (e.g., played by a computer program).
TASK: Please fill in your responses on Moodle.
Exercise 3 (20 points)#
In this exercise, you will train your very first RL agent!
Imagine that your agent just moved to a new town and is exploring the local restaurants. There are 10 restaurants, named 0, 1, …, 9, in this town. The agent does not know anything about the restaurants in the beginning (and also mysteriously cannot find any reviews to look at). Therefore, she needs to try the restaurants herself and figure out which one will make her the happiest during her time in this town (i.e., will give her the highest expected reward).
This problem of trying to choose which action (i.e., going to which restaurant) is reward-maximizing in one situation, given several action options, is a (simplified) instance of the so-called k-armed bandit problem (where k is the number of available actions, here: 10). This problem is very well-studied in RL.
For this exercise, we assume a number of simplifications. We assume that the quality of the restaurants is deterministic (i.e., doesn’t change over the times the agent goes there), and the agent’s preferences don’t change over time, either. (Hint: what does this mean for the value of actions and the rewards?)
Based on these assumptions, in this exercise we apply a simple sample-average method for estimating the values of the available actions \(a \in A\) at time \(t\) (think: the subjective value of going to a restaurant for the agent, e.g., her degree of feeling happy upon eating there). Under this method, the action-values are estimated as averages of the rewards that were received in the past for choosing the respective actions:

\[
Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a]},
\]

where \(A_i\) and \(R_i\) denote the action taken and the reward received at step \(i\), and \(\mathbb{1}[\cdot]\) is the indicator function (for actions that have not been tried yet, the estimate defaults to 0, as in the code skeleton below).
Time \(t\) here refers to the t-th time the agent is deciding which restaurant to go to in this new town. Based on the estimated action values, we will derive two different ‘strategies of behavior’ (i.e., policies): the greedy and the \(\epsilon\)-greedy policy.
TASK: Your task is to complete the code below to train the agent to explore the restaurants and answer some questions about the results on Moodle.
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plots at the end of this exercise
np.random.seed(0)
For this \(k\)-armed restaurant bandit environment, we assume that there is a ground truth value of each of the restaurants for the agent. For instance, we know that the agent’s favorite food is Thai curry, and, e.g., restaurant 4 has the best Thai curry in town – therefore, restaurant 4 would have the highest true value. Formally, the true value of an action is its expected reward:

\[
q_*(a) = \mathbb{E}[R_t \mid A_t = a].
\]

On the other hand, the values of the other restaurants might be lower, or even negative (e.g., the agent gets food poisoning when going there). For our simulation, these true values are defined below. These ground truth values are initially unknown to our agent, and her task is to estimate them from experience.
# define possible actions (10 restaurants in the new town)
actions = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# sample the true values of the actions for the agent
true_restaurant_rewards = np.random.normal(0, 1, 10)
true_restaurant_rewards
In our toy world, the agent tries the different restaurants over multiple days and receives a reward (e.g., writes down her subjective happiness value) every time she goes to a restaurant. A single trial is generated by the environment below:
def town_environment(action, true_restaurant_rewards):
"""
The town environment returns 'an experience' of our agent,
i.e., the reward associated with a given action.
"""
reward = true_restaurant_rewards[action]
return reward
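As a quick sanity check (not required for the exercise), you can call the environment directly. Because the rewards are deterministic, the same action always yields the same reward; the action 4 used here is arbitrary:

```python
# the toy environment is deterministic: repeating an action gives the same reward
print(town_environment(4, true_restaurant_rewards))
print(town_environment(4, true_restaurant_rewards))
```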
Your task is to complete the code below so as to implement the estimation of the action values based on past experiences. In particular, implement an algorithm that estimates the values of actions based on accumulating experience, and track the expected reward (i.e., mean reward) that the agent would receive if she behaved according to her estimates given the particular amount of experience.
Specifically, the function below should implement the sample-average estimation (defined above) and return an action according to the current estimate. For action selection, please implement two policies:
a greedy policy (returning the action with the highest value according to the current estimate)
an \(\epsilon\)-greedy policy (returning the action with the highest value according to the current estimate in a \(1-\epsilon\) proportion of the decisions, and returning a randomly chosen action in an \(\epsilon\) proportion of cases; see the formal summary below). You are free to choose your own value of \(\epsilon\).
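For reference, the two selection rules can be summarized formally as follows (writing \(Q_t(a)\) for the current value estimates):

\[
A_t =
\begin{cases}
\arg\max_{a \in A} Q_t(a) & \text{with probability } 1-\epsilon, \\
\text{a uniformly random } a \in A & \text{with probability } \epsilon.
\end{cases}
\]

The greedy policy is simply the special case \(\epsilon = 0\).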
def sample_average_estimator(old_actions, old_rewards, actions, epsilon=0):
"""
Implement the sample-average estimator of the action values.
Parameters
----------
old_actions : numpy array
The actions taken by the agent before the current step.
old_rewards : numpy array
The rewards received by the agent before the current step
for the taken actions.
actions : list
The full list of available actions.
epsilon : float
The probability of taking a random action.
Returns
-------
values : numpy array
The estimated values of the actions.
best_action : int
The best action according to the estimated values and the
current policy.
"""
# initially, the values of all actions are 0
values = np.array(###YOUR CODE HERE###)
# compute averages over previously observed rewards
# for each action
for action in actions:
old_indices = np.where(old_actions == action)[0]
if len(old_indices) == 0:
value = 0
else:
###YOUR CODE HERE###
values[action] = ###YOUR CODE HERE###
# return a random action with probability epsilon
if np.random.uniform(0, 1) < epsilon:
###YOUR CODE HERE###
# return the action with the highest value with random tie-breaking with probability 1 - epsilon
else:
if np.sum(values == np.max(values)) > 1:
best_action = np.random.choice(np.where(values == np.max(values))[0], 1)[0]
else:
###YOUR CODE HERE###
# return the actions' updated values
# and best action
return values, best_action
The following cell embeds the function into a loop where the agent gathers experiences over 90 days (i.e., over 90 action-reward pairs) and we can observe how her average reward as well as her action choices develop with accumulated experience.
# initialize the algorithm
old_actions = np.array([])
old_rewards = np.array([])
# initialize some variables for logging
actions_log = []
rewards_list = []
average_rewards_list = []
# identify the ground truth optimal action so as to check
# how often the agent would choose it
optimal_action = actions[np.argmax(true_restaurant_rewards)]
# iterate over 90 "experience steps"
for i in range(90):
# run the algorithm with a GREEDY policy
# return selected action according to current estimates
values, best_action = ### YOUR CODE HERE ###
# observe the reward for the currently estimated best action
reward = town_environment(best_action,true_restaurant_rewards)
# create experience arrays
old_actions = np.append(old_actions, best_action)
old_rewards = np.append(old_rewards, reward)
# log the results
# check if the best action is the optimal action
actions_log.append(best_action == optimal_action)
rewards_list.append(reward)
average_rewards_list.append(sum(rewards_list) / len(rewards_list))
# plot results
plt.plot(np.cumsum(actions_log) / np.arange(1, len(actions_log) + 1))
plt.xlabel("Experience steps")
plt.ylabel("Optimal action rate")
plt.plot(average_rewards_list)
plt.xlabel("Experience steps")
plt.ylabel("Average reward")
# NOW RUN THE SAME ALGORITHM WITH EPSILON-GREEDY POLICY
### YOUR CODE HERE ###