Designed for: Credit assignment, on- versus off-policy.
Description: In the Roadrunner environment, an agent moves along a 1-D plane, starting from position 0 toward a variable target position T = W-1. The goal is to arrive exactly at the target position: moving beyond it means the agent falls off a cliff and loses the game. The agent can only control its speed, which starts at zero, by increasing it, decreasing it, or keeping it unchanged, and must learn to regulate that speed appropriately. As with overshooting the target, slowing down too much (below a speed of 0) also causes the agent to lose the game.
Variable parameters:
Environment size (W): How many cells the environment consists of.
Negative Reward Size (R): The reward given when the agent 1) falls off the cliff or 2) slows down below a speed of 0.
Max speed (MAX_SPEED): The maximum speed of the agent; the speed must be bounded to keep tabular learning feasible.
State space: MultiDiscrete([W, MAX_SPEED]). Each state is a unique pair (x, dx) of the agent's location (x) and speed (dx).
Action space: Discrete(3). The agent can 1) do nothing, maintaining its current speed, 2) speed up, or 3) slow down. The last two actions change the agent's speed (dx); actions never move the agent directly.
Dynamics: At every step, the agent's location is updated by its speed, x ← x + dx. If the agent moves beyond W, its location is clipped to W. If the agent's speed drops below 0, the episode terminates.
Reward function: -1 at every step. Reaching the target position T gives +1. Falling off the cliff at W gives R (default -100), as does slowing down below a speed of 0.
Initial state: x = 0, dx = 0.
Termination: The agent reaches the target T, falls off the cliff (moves beyond W), its speed drops below 0 (dx < 0), or the episode times out.
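
To make the specification above concrete, here is a minimal sketch of how the environment could be implemented against the Gymnasium API. The action-index mapping (0 = keep speed, 1 = speed up, 2 = slow down), the speed change of 1 per action, the +1 replacing the step cost on success, and the handling of terminal observations are illustrative assumptions and may differ from the notebook's actual implementation.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class Roadrunner(gym.Env):
    """Sketch of the Roadrunner environment described above (assumptions noted inline)."""

    def __init__(self, W=10, R=-100, MAX_SPEED=4):
        self.W, self.R, self.MAX_SPEED = W, R, MAX_SPEED
        self.T = W - 1                                        # target position
        self.observation_space = spaces.MultiDiscrete([W, MAX_SPEED])
        self.action_space = spaces.Discrete(3)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.x, self.dx = 0, 0                                # start at position 0, speed 0
        return np.array([self.x, self.dx]), {}

    def step(self, action):
        # Actions only change the speed (assumed mapping: 0 = keep, 1 = up, 2 = down).
        if action == 1:
            self.dx = min(self.dx + 1, self.MAX_SPEED - 1)
        elif action == 2:
            self.dx -= 1

        if self.dx < 0:
            # Slowing down below speed 0 loses the game; report speed 0 to stay in the obs space.
            return np.array([self.x, 0]), float(self.R), True, False, {}

        self.x = min(self.x + self.dx, self.W)                # positions beyond W are clipped to W
        obs = np.array([min(self.x, self.W - 1), self.dx])    # keep obs inside MultiDiscrete bounds
        if self.x == self.T:
            return obs, 1.0, True, False, {}                  # reached the target exactly (assumed: +1 replaces -1)
        if self.x == self.W:
            return obs, float(self.R), True, False, {}        # fell off the cliff
        return obs, -1.0, False, False, {}                    # step cost otherwise
```

Timeouts can be handled outside the environment, e.g. by wrapping it in gymnasium's TimeLimit wrapper.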
Compared algorithms (notebook): Q-Learning and SARSA.
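
Since the notebook contrasts an off-policy method (Q-Learning) with an on-policy one (SARSA), a compact sketch of the two tabular updates may help. The epsilon-greedy helper and the hyperparameter defaults (alpha, gamma, epsilon) are illustrative assumptions, not the notebook's settings.

```python
import numpy as np


def epsilon_greedy(Q, s, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in s_next,
    # regardless of which action the behaviour policy takes next.
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action the agent will actually take next.
    target = r + (0.0 if done else gamma * Q[s_next][a_next])
    Q[s][a] += alpha * (target - Q[s][a])
```

With states given by the (x, dx) pair, Q can be stored as a NumPy array of shape (W, MAX_SPEED, 3) and indexed as Q[(x, dx)].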