Designed for: Model-based reinforcement learning.
Description: The agent (yellow square) enters a supermarket, where at each step it can move up, down, left or right. It has three items on its shopping list, indicated by the red, blue and green squares. The agent can always exit the supermarket at the bottom-right. To mimic the cost of actions in the real world, each action takes some time to finish (see variable parameters). However, the agent also has access to a model(s,a) of the environment, which executes without any extra time penalty.
Variable parameters:
Step timeout (step_timeout): the interval for which the environment blocks between subsequent calls to the step() method.
Model noise (noise): standard deviation of zero-centered Gaussian noise added to the reward model.
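To make the time trade-off concrete, here is a minimal usage sketch. The class name `SupermarketEnv`, its constructor keywords, and the return signature of `model(s, a)` are assumptions based on the description above, not the confirmed API.

```python
# Hypothetical usage sketch; SupermarketEnv, its keyword arguments, and the
# model(s, a) return signature are assumptions, not the confirmed API.
env = SupermarketEnv(step_timeout=0.5, noise=1.0)

s = env.reset()
done = False
while not done:
    a = env.action_space.sample()        # 0..3: up, down, left, right
    # Real step: blocks for the step_timeout interval before returning.
    s_next, r, done, info = env.step(a)
    # Model query: no time penalty; its reward carries Gaussian noise (sd = noise).
    s_model, r_model = env.model(s, a)
    s = s_next
```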
State space: Discrete(800). Each unique state is identified by the agent's (x, y) location combined with the collection status of the three items.
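The state count is consistent with 100 grid cells times 2^3 = 8 item-collection combinations. Below is a plausible encoding sketch under the assumption of a 10 x 10 grid; the actual grid dimensions and index ordering used by the environment may differ.

```python
# Sketch of a state index in [0, 800), assuming a 10x10 grid; the actual
# grid dimensions and index ordering of the environment may differ.
def encode_state(x: int, y: int, collected: tuple[bool, bool, bool],
                 width: int = 10) -> int:
    item_bits = sum(int(c) << i for i, c in enumerate(collected))  # 0..7
    return (y * width + x) * 8 + item_bits

assert encode_state(0, 0, (False, False, False)) == 0
assert encode_state(9, 9, (True, True, True)) == 799
```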
Action space: Discrete(4). The agent can move up, down, left or right.
Dynamics: The agent moves in the specified direction, unless it hits a wall. It automatically collects items when it steps on them.
Reward function: A default penalty of -1 on each step, plus +25 for each item collected and +50 for leaving the supermarket at the bottom-right.
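As a sanity check on the reward scale, a run that collects all three items earns 3 x 25 + 50 = 125, minus one point per step taken. Below is a minimal sketch of the per-step reward; it assumes the item and exit flags are produced by the environment's dynamics.

```python
# Sketch of the per-step reward described above; the flags are assumed to be
# produced by the environment's dynamics, not by this function.
def step_reward(items_collected_this_step: int, exited: bool) -> float:
    r = -1.0                                # default step penalty
    r += 25.0 * items_collected_this_step   # +25 per item picked up this step
    if exited:                              # +50 for leaving at bottom-right
        r += 50.0
    return r

assert step_reward(0, False) == -1.0
assert step_reward(1, False) == 24.0
assert step_reward(0, True) == 49.0
```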
Initial state: The top-left corner, shown in the figure by the yellow circle.
Termination: Reaching the bottom-right exit, shown in the figure as the bottom-right door.
Compared algorithms (notebook): Dynamic Programming, Dyna, Prioritized Sweeping.
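Because model queries are free while real step() calls pay the step timeout, algorithms such as Dyna spend most of their budget on model-based planning between real steps. The sketch below is a generic tabular Dyna-Q planning sweep, not the notebook's implementation; it assumes a model(s, a) callable returning (next_state, reward), matching the description above.

```python
import numpy as np

def dyna_q_planning(Q: np.ndarray, model, visited: list,
                    n_planning: int = 50, alpha: float = 0.1,
                    gamma: float = 0.99, rng=None) -> None:
    """Generic Dyna-Q planning sweep (sketch, not the notebook's code).

    Q        : (n_states, n_actions) value table, updated in place.
    model    : callable (s, a) -> (next_state, reward); assumed free to query.
    visited  : list of (state, action) pairs observed in the real environment.
    """
    rng = rng or np.random.default_rng()
    for _ in range(n_planning):
        s, a = visited[rng.integers(len(visited))]   # replay an observed pair
        s_next, r = model(s, a)                      # free model query
        # Standard Q-learning backup on the simulated transition.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```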