Homework 11

In this homework, we will use gradient-free reinforcement learning to improve the agent we trained in homework 10. You will reuse your homework 10 model and policy setup and fine-tune the imitation agent with a gradient-free method.

Starter code

This homework is very open-ended. You can do anything you want short of hand-coding a policy. The only requirement is that the policy is learned.

Possible methods include gradient-free optimizers such as random search, hill climbing, the cross-entropy method, or evolution strategies (e.g., CMA-ES).
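As one illustration (not a requirement), here is a minimal hill-climbing sketch that perturbs the policy weights with Gaussian noise and keeps a perturbation only if it improves the average reward. The evaluate callable and the noise scale sigma are placeholders you would supply yourself:

import copy

import torch

def hill_climb(model, evaluate, steps=100, sigma=0.01):
    # evaluate(model) -> average episode reward (your own rollout code)
    best_reward = evaluate(model)
    for _ in range(steps):
        candidate = copy.deepcopy(model)
        # Perturb every parameter with Gaussian noise; no gradients needed
        with torch.no_grad():
            for p in candidate.parameters():
                p.add_(sigma * torch.randn_like(p))
        reward = evaluate(candidate)
        if reward > best_reward:
            best_reward, model = reward, candidate
    return model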

Input example

Observation image (a frame from the game)

Output example

Logits of the predicted actions:

-5.1 -1.0 0.6 0.2 -0.1 0.1
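To turn such logits into an action, you can either pick the argmax (greedy) or sample from the softmax distribution. A small sketch, assuming the six logits above correspond to six discrete actions:

import torch

logits = torch.tensor([-5.1, -1.0, 0.6, 0.2, -0.1, 0.1])

# Greedy: pick the highest-scoring action (index 2 here)
greedy_action = logits.argmax().item()

# Stochastic: sample from the softmax distribution over actions
probs = torch.softmax(logits, dim=0)
sampled_action = torch.multinomial(probs, 1).item()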

Getting Started

We provide you with starter code that loads the dataset and splits it into a training and a validation set. We also provide an optional TensorBoard interface.

  1. Define your model in models.py and modify the training code in train.py (a minimal model skeleton is sketched after this list).
  2. Train your model.
     python3 -m homework.train
    
  3. Test your model by measuring its performance:
     python3 -m homework.test
    
  4. To evaluate your code against the grader, execute:
     python3 -m grader homework
    

    Note that the grader can take a long time: it contains two parts, and it will train your agent for the first part. Make sure your training code works before running the grader.

  5. Create the submission file
     python3 -m homework.bundle
    
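As a starting point for step 1, here is a minimal, illustrative skeleton for models.py. The class name Policy, the convolutional layout, and the six-way action head are assumptions, not requirements:

import torch

class Policy(torch.nn.Module):
    """Maps an observation image to logits over discrete actions."""

    def __init__(self, n_actions=6):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, 5, stride=2, padding=2),
            torch.nn.ReLU(),
            torch.nn.Conv2d(16, 32, 5, stride=2, padding=2),
            torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Flatten(),
            torch.nn.Linear(32, n_actions),
        )

    def forward(self, x):
        # x: (batch, 3, H, W) image tensor -> (batch, n_actions) logits
        return self.net(x)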

Parallel data collection

We provide you with a parallel data collection interface in policy_eval.py. To use the interface to collect data in parallel:

import ray

from homework.policy_eval import PolicyEvaluator  # adjust the import path to your layout

ray.init()

# One remote evaluator per worker, each playing the given level
evaluators = [PolicyEvaluator.remote(level, iterations) for _ in range(n_workers)]

# Evaluate one model per worker; ray.get blocks until all rollouts finish
rewards = ray.get([
    evaluator.eval.remote(m, H) for m, evaluator in zip(models, evaluators)
])
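Each PolicyEvaluator here is a Ray actor, so every evaluator runs the game in its own process; the eval.remote calls return futures immediately, and the rollouts proceed concurrently rather than one after another.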

Installing Ray

To use the parallel data collection interface, install the Ray library:

pip3 install ray

Hint: Run about N/2 evaluators in parallel, where N is the number of CPU cores on your machine; see the sketch below.
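A small sketch of that heuristic (the exact worker count is up to you):

import multiprocessing

import ray

# Use roughly half of the available cores for evaluators
n_workers = max(1, multiprocessing.cpu_count() // 2)
ray.init(num_cpus=n_workers)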

Setting up Supertux

  1. This homework requires you to set up Pytux for performing the online evaluation by playing the actual game. Instructions to set up Supertux can be found here.
  2. Once you have either downloaded the binary or compiled the Supertux source, create symlinks to the pytux and data folders using the following commands:
    cd path/to/homework_11
    ln -s path/to/pytux pytux
    ln -s path/to/data data
    
  3. Make sure the folder structure looks like this:
    • homework_11
        • grader
        • homework
        • pytux
        • data

Pro-tip: Fine-tune from Imitation Learning agent

To speed up training for this assignment, you can initialize your policy with the weights of your homework 10 imitation agent and fine-tune from there.
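A minimal sketch of that initialization; the Policy class and the imitation.th checkpoint name are hypothetical and should match whatever you saved in homework 10:

import torch

from homework.models import Policy  # hypothetical module/class names

# Start gradient-free fine-tuning from the imitation-learning weights
model = Policy()
model.load_state_dict(torch.load('imitation.th'))  # hypothetical checkpoint path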

Grading

The grading will depend on your gradient-free optimization implementation and on your final policy's performance.

We will manually check the implementation of each submission; outputting constant actions or hardcoding part of the predictions will result in zero points.
