Homework 11

In this homework, we will use gradient-free reinforcement learning to improve the agent we trained in homework 10. You will reuse your homework 10 model and policy setup and fine-tune the imitation agent with a gradient-free method.

Starter code

This homework is very open-ended. You can do anything you want short of hand-coding a policy. The only requirement is that the policy is learned.

Possible methods include gradient-free optimizers such as random search, hill climbing, the cross-entropy method, or evolution strategies (e.g., CMA-ES).
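As one illustration (not a requirement), here is a minimal hill-climbing sketch that perturbs the policy weights with Gaussian noise and keeps a perturbation only if it improves the average reward. The evaluate callable and the noise scale sigma are placeholders you would supply yourself:

import copy

import torch

def hill_climb(model, evaluate, steps=100, sigma=0.01):
    # evaluate(model) -> average episode reward (your own rollout code)
    best_reward = evaluate(model)
    for _ in range(steps):
        candidate = copy.deepcopy(model)
        # Perturb every parameter with Gaussian noise; no gradients needed
        with torch.no_grad():
            for p in candidate.parameters():
                p.add_(sigma * torch.randn_like(p))
        reward = evaluate(candidate)
        if reward > best_reward:
            best_reward, model = reward, candidate
    return model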

Input example

Observation image (a frame from the game)

Output example

Logits of the predicted actions:

-5.1 -1.0 0.6 0.2 -0.1 0.1
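To turn such logits into an action, you can either pick the argmax (greedy) or sample from the softmax distribution. A small sketch, assuming the six logits above correspond to six discrete actions:

import torch

logits = torch.tensor([-5.1, -1.0, 0.6, 0.2, -0.1, 0.1])

# Greedy: pick the highest-scoring action (index 2 here)
greedy_action = logits.argmax().item()

# Stochastic: sample from the softmax distribution over actions
probs = torch.softmax(logits, dim=0)
sampled_action = torch.multinomial(probs, 1).item()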

Getting Started

We provide you with starter code that loads the dataset and splits it into a training and a validation set. We also provide an optional TensorBoard interface.

  1. Define your model in models.py and modify the training code in train.py (a minimal model skeleton is sketched after this list).
  2. Train your model.
     python3 -m homework.train
    
  3. Test your model by measuring its performance:
     python3 -m homework.test
    
  4. To evaluate your code against the grader, execute:
     python3 -m grader homework
    

    Note that the grader can take a long time: it contains two parts, and it will train your agent for the first part. Make sure your training code works before running the grader.

  5. Create the submission file
     python3 -m homework.bundle
    
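As a starting point for step 1, here is a minimal, illustrative skeleton for models.py. The class name Policy, the convolutional layout, and the six-way action head are assumptions, not requirements:

import torch

class Policy(torch.nn.Module):
    """Maps an observation image to logits over discrete actions."""

    def __init__(self, n_actions=6):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, 5, stride=2, padding=2),
            torch.nn.ReLU(),
            torch.nn.Conv2d(16, 32, 5, stride=2, padding=2),
            torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Flatten(),
            torch.nn.Linear(32, n_actions),
        )

    def forward(self, x):
        # x: (batch, 3, H, W) image tensor -> (batch, n_actions) logits
        return self.net(x)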

Parallel data collection

We provide you with a parallel data collection interface in policy_eval.py. To use the interface to collect data in parallel:

import ray

from homework.policy_eval import PolicyEvaluator  # adjust the import path to your layout

ray.init()

# One remote evaluator per worker, each playing the given level
evaluators = [PolicyEvaluator.remote(level, iterations) for _ in range(n_workers)]

# Evaluate one model per worker; ray.get blocks until all rollouts finish
rewards = ray.get([
    evaluator.eval.remote(m, H) for m, evaluator in zip(models, evaluators)
])
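Each PolicyEvaluator here is a Ray actor, so every evaluator runs the game in its own process; the eval.remote calls return futures immediately, and the rollouts proceed concurrently rather than one after another.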

Installing Ray

To use the parallel data collection interface, install the Ray library:

pip3 install ray

Hint: Run about N/2 evaluators in parallel, where N is the number of CPU cores on your machine; see the sketch below.
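A small sketch of that heuristic (the exact worker count is up to you):

import multiprocessing

import ray

# Use roughly half of the available cores for evaluators
n_workers = max(1, multiprocessing.cpu_count() // 2)
ray.init(num_cpus=n_workers)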

Setting up Supertux

  1. This homework requires you to set up Pytux for performing the online evaluation by playing the actual game. Instructions to set up Supertux can be found here.
  2. Once you have either downloaded the binary or compiled the Supertux source, create symlinks to the pytux and data folders using the following commands:
    cd path/to/homework_11
    ln -s path/to/pytux pytux
    ln -s path/to/data data
    
  3. Make sure the folder structure looks like this:
    • homework_11
        • grader
        • homework
        • pytux
        • data

Pro-tip: Fine-tune from Imitation Learning agent

To speed up training for this assignment, you can initialize your policy with the weights of your homework 10 imitation agent and fine-tune from there.
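A minimal sketch of that initialization; the Policy class and the imitation.th checkpoint name are hypothetical and should match whatever you saved in homework 10:

import torch

from homework.models import Policy  # hypothetical module/class names

# Start gradient-free fine-tuning from the imitation-learning weights
model = Policy()
model.load_state_dict(torch.load('imitation.th'))  # hypothetical checkpoint path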

Grading

The grading will depend on your gradient-free optimization implementation and on your final policy's performance.

We will manually check the implementation of each submission; outputting constant actions or hardcoding part of the predictions will result in zero points.
