In this homework, we will use gradient-free reinforcement learning to improve the agent we trained in homework 10. You will reuse your homework 10 model and policy, and fine-tune the imitation agent with a gradient-free method.
This homework is very open-ended. You can do anything you want short of hand-coding a policy; the only requirement is that the policy is learned. Some options:
- Random search
- Hill climbing
- Augmented Random Search (ARS) or SPSA
- Cross Entropy Method
- Any other evolutionary algorithm
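As an illustration, hill climbing can be sketched in a few lines. The noise scale, iteration count, and toy reward below are placeholder assumptions; in the homework, the reward would come from running your policy in the game:

```python
import random

def hill_climb(theta, reward_fn, sigma=0.1, iters=200):
    """Simple hill climbing: keep a perturbed parameter vector only if it improves reward."""
    best_r = reward_fn(theta)
    for _ in range(iters):
        # Perturb every parameter with Gaussian noise (sigma is a tunable guess).
        candidate = [t + random.gauss(0.0, sigma) for t in theta]
        r = reward_fn(candidate)
        if r > best_r:  # accept only improvements
            theta, best_r = candidate, r
    return theta, best_r

# Toy reward with a peak at theta = (1, -2); in the homework, reward_fn would
# evaluate the policy in SuperTux and return the average level progress.
reward = lambda th: -((th[0] - 1) ** 2 + (th[1] + 2) ** 2)
theta, best = hill_climb([0.0, 0.0], reward)
```

The same loop structure extends to random search (always resample around a fixed center) or to population-based methods.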
Example logits predicted for the actions:
-5.1 -1 0.6 0.2 -0.1 0.1
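If your policy outputs logits like these over a discrete action set, the greedy action is the argmax, and a softmax turns the logits into a probability distribution for a stochastic policy. A minimal sketch:

```python
import math

logits = [-5.1, -1.0, 0.6, 0.2, -0.1, 0.1]

# Greedy action: index of the largest logit.
greedy = max(range(len(logits)), key=lambda i: logits[i])

# Softmax turns logits into a probability distribution,
# useful if you want to sample actions during exploration.
exps = [math.exp(l) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]
```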
We provide you with starter code that loads the dataset from a training and validation set. We also provide an optional tensorboard interface.
- Define your model in models.py and modify the training code in train.py.
- Train your model.
python3 -m homework.train
- Test your model by measuring its performance:
python3 -m homework.test
- To evaluate your code against the grader, execute:
python3 -m grader homework
Note that the grader can take a long time: it contains two parts and will train your agent for the first part. Make sure your training code works before running the grader.
- Create the submission file
python3 -m homework.bundle
Parallel data collection
We provide you with a parallel data collection interface in policy_eval.py. To use the interface to collect data in parallel:
evaluators = [PolicyEvaluator.remote(level, iterations) for _ in range(n_workers)]
rewards = ray.get([
    evaluator.eval.remote(m, H)
    for m, evaluator in zip(models, evaluators)
])
To use the parallel data collection, you need to install the ray library:
pip3 install ray
We recommend running at most N/2 evaluators at the same time, where N is the number of CPU cores on your machine.
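For example, a conservative worker count can be derived from os.cpu_count(); the fallback of 2 below is an assumption for environments where the core count is unavailable:

```python
import os

# Use at most half the CPU cores for evaluators, as recommended above.
n_workers = max(1, (os.cpu_count() or 2) // 2)
```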
Setting up Supertux
- This homework requires you to set up pytux to perform online evaluation by playing the actual game. Instructions to set up SuperTux can be found here.
- Once you have either downloaded the binary or compiled the SuperTux source, create the symlinks for the pytux and data folders using the following commands:
cd path/to/homework_11
ln -s path/to/pytux pytux
ln -s path/to/data data
- Make sure the folder structure looks like this:
Pro-tip: Fine-tune from Imitation Learning agent
To speed up the training for this assignment, you can fine-tune your architecture from homework 10.
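One way to exploit this is to center a gradient-free search on the imitation weights instead of a random initialization. A minimal Cross-Entropy Method sketch; the flat weight list and the toy reward are hypothetical stand-ins for your homework 10 parameters and the in-game score:

```python
import random

def cem_step(mean, sigma, reward_fn, pop=16, elite=4):
    """One Cross-Entropy Method update: sample around mean, refit mean to the elites."""
    samples = [[m + random.gauss(0.0, sigma) for m in mean] for _ in range(pop)]
    best = sorted(samples, key=reward_fn, reverse=True)[:elite]
    return [sum(s[i] for s in best) / elite for i in range(len(mean))]

# Start the search from (hypothetical) imitation-learning weights rather than zeros.
mean = [0.5, -1.2, 0.3]
reward = lambda th: -sum(t * t for t in th)  # toy reward, stands in for game score
for _ in range(20):
    mean = cem_step(mean, 0.1, reward)
```

Starting near a working policy lets the search refine behavior rather than rediscover it.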
The grading will depend on your gradient-free optimization implementation and your final policy performance. The grading scheme is as follows:
- Average performance across 5 levels, graded linearly from 0.2 to 0.35, training from scratch: 100 points
We will manually check the implementation of each submission; outputting constant actions or hardcoding part of the predictions will result in zero points.