{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Coding Week 12\n", "\n", "This week we will train a deep network to balance a pole on a cart (CartPole-v1) using vanilla policy gradient (REINFORCE).\n", "\n", "![pole](https://gym.openai.com/videos/2019-10-21--mqt8Qj1mwo/CartPole-v1/poster.jpg)\n", "\n", "A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.\n", "\n", "Inputs: pole angle, pole angular velocity, cart position, cart velocity.\n", "\n", "Output: push left (force -1, gym action 0) or push right (force +1, gym action 1)\n", "\n", "## Setup\n", "We will use OpenAI gym in this assignment." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import torch\n", "import numpy as np\n", "import gym\n", "\n", "from torch.distributions import Categorical\n", "\n", "import IPython.display as ipd\n", "from IPython.display import Video\n", "\n", "def sample_categorical(logits):\n", "    return Categorical(logits=logits).sample().item()\n", "\n", "def rollout(env, policy, sample=sample_categorical, max_steps=500):\n", "    \"\"\"\n", "    Roll out a policy.\n", "    env: Gym environment to roll out\n", "    policy: A function that maps states s to distributions of actions (logits)\n", "    sample: A sampling function that converts logits to samples\n", "    \"\"\"\n", "    s_l_a_r = list()\n", "    s = env.reset()\n", "\n", "    for _ in range(max_steps):\n", "        # Ask our policy what to do\n", "        logit = policy(s)\n", "        # Sample an action according to the policy\n", "        a = sample(logit)\n", "        # Take a step in the environment\n", "        sp, r, done, info = env.step(a)\n", "        # Store the 
result\n", "        s_l_a_r.append((s, logit, a, r))\n", "        # and move to the new state\n", "        s = sp\n", "\n", "        if done: break\n", "\n", "    return s_l_a_r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our network (policy) is a two layer network. I know it's underwhelming :(" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "class Policy(torch.nn.Module):\n", "    def __init__(self, env, k=16):\n", "        super().__init__()\n", "\n", "        self.net = torch.nn.Sequential(\n", "            # observation_space.shape is a tuple, Linear needs the integer size\n", "            torch.nn.Linear(env.observation_space.shape[0], k),\n", "#             torch.nn.ReLU(),\n", "            torch.nn.Linear(k, env.action_space.n))\n", "\n", "    def forward(self, s):\n", "        return self.net(torch.FloatTensor(s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall what we're trying to do in RL: maximize the expected return of a policy $\\pi$ (or, equivalently, minimize a loss $L$)\n", "$$\n", "-L = E_{\\tau \\sim P_\\pi}[R(\\tau)],\n", "$$\n", "where $\\tau = \\{s_0, a_0, s_1, a_1, \\ldots\\}$ is a trajectory of states and actions.\n", "The return of a trajectory is then defined as the sum of individual rewards $R(\\tau) = \\sum_k r(s_k)$ (we won't discount in this assignment).\n", "\n", "Policy gradient computes the gradient of the loss $L$ using the log-derivative trick\n", "$$\n", "\\nabla_\\pi L = -E_{\\tau \\sim P_\\pi}[\\sum_k r(s_k) \\nabla_\\pi \\sum_i \\log \\pi(a_i | s_i)].\n", "$$\n", "Since the reward $r(s_k)$ only depends on the actions $a_i$ taken up to that point ($i \\le k$), we can further simplify the above equation:\n", "$$\n", "\\nabla_\\pi L = -E_{\\tau \\sim P_\\pi}\\left[\\sum_i \\left(\\nabla_\\pi \\log \\pi(a_i | s_i)\\right)\\left(\\sum_{k=i}^{|\\tau|} r(s_k) \\right)\\right].\n", "$$\n", "We will implement an estimator for this gradient below. 
There are a few steps that we need to follow:\n", "\n", " * The expectation $E_{\\tau \\sim P_\\pi}$ is estimated from rollouts of our policy\n", " * The log probability $\\log \\pi(a_i | s_i)$ is computed with `Categorical.log_prob`\n", " * Gradient computation uses the `.backward()` function\n", " * The gradient $\\nabla_\\pi L$ is then used in a standard optimizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: 40.26 (66.00)\n", "1: 51.65 (35.00)\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "2: 54.24 (500.00)\n", "3: 276.22 (500.00)\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "4: 439.64 (500.00)\n", "5: 500.00 (500.00)\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Let's set up the gym environment, ignore visual_env\n", "env = gym.make('CartPole-v1')\n", "visual_env = gym.wrappers.Monitor(env, './gym-results', force=True, video_callable=lambda i: i%2==0)\n", "\n", "policy = Policy(env)\n", "optim = torch.optim.RMSprop(policy.parameters(), lr=1e-2)\n", "\n", "# Training starts here\n", "for epoch in range(20):\n", "    play_time = []\n", "\n", "    for it in range(100):\n", "        # Rollout\n", "        trajectory = rollout(env, policy)\n", "        play_time.append(len(trajectory))\n", "        # Compute (future) return and gradient\n", "        R = 0\n", "        optim.zero_grad()\n", "        for s, logit, a, r in reversed(trajectory):\n", "            R += r\n", "            p = Categorical(logits=logit)\n", "            l = p.log_prob(torch.tensor(a))\n", "            (-l*R).backward()\n", "        optim.step()\n", "    # Display how well we're doing (acting greedily on the larger logit) and show a nice video\n", "    trajectory = rollout(visual_env, policy, sample=lambda x: int(x[1] > x[0]))\n", 
"    print('%d: %.2f (%.2f)' % (epoch, np.mean(play_time), len(trajectory)))\n", "    if epoch % 2 == 1:\n", "        ipd.display(Video(str(list(sorted(Path('./gym-results/').glob('*.mp4')))[-1])))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method log_prob in module torch.distributions.categorical:\n", "\n", "log_prob(value) method of torch.distributions.categorical.Categorical instance\n", " Returns the log of the probability density/mass function evaluated at\n", " value.\n", " \n", " Args:\n", " value (Tensor):\n", "\n" ] } ], "source": [ "help(p.log_prob)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }