Designed for: Deep exploration (sparse-reward exploration).
Description: An agent (visualised as a green smiley face) is bouldering, aiming to reach the top of the wall. To reach the top grip, the agent must perform the correct sequence of actions without a single mistake: any wrong action sends it back to the initial state. Correct grip points are visualised as grey solid stones, while black cracked stones indicate wrong actions.
Variable parameters:
Height (H): the length of the action sequence the agent must perform correctly to reach the top. Increasing H makes the exploration problem exponentially harder: a uniform random policy needs H consecutive correct grips, which succeeds with probability (1/N)^H per attempt.
Number of grip points (N): larger N enlarges the action space and lowers the chance of guessing the correct grip at each height.
State space: Discrete(H). The current position (height) of the agent on the wall.
Action space: Discrete(N). The index of the grip point the agent reaches for.
Dynamics: If the chosen action matches the correct grip at the current height, the agent moves one step upward; otherwise it falls back to the initial state (see the environment sketch below).
Reward function: 1 when the agent successfully reaches the top, 0 otherwise.
Initial state: The bottom of the wall.
Termination: When the agent successfully reaches the top or the maximum number of steps is reached.
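
To make the dynamics concrete, here is a minimal sketch of the environment using the gymnasium API. It is illustrative, not the notebook's actual implementation: the class name BoulderingEnv, the max_steps parameter, and the assumption that the wall layout (the correct grip at each height) is fixed across episodes are all choices made for this sketch.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class BoulderingEnv(gym.Env):
    """Chain-style bouldering wall: one correct grip per height, H heights, N grips."""

    def __init__(self, height=10, num_grips=4, max_steps=500, seed=None):
        self.H, self.N, self.max_steps = height, num_grips, max_steps
        rng = np.random.default_rng(seed)
        # One correct grip index per height, fixed for the lifetime of the env
        # (assumption: the wall layout does not change between episodes).
        self.correct = rng.integers(self.N, size=self.H)
        self.observation_space = spaces.Discrete(self.H)  # heights 0 .. H-1
        self.action_space = spaces.Discrete(self.N)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.pos, self.steps = 0, 0   # start at the bottom of the wall
        return self.pos, {}

    def step(self, action):
        self.steps += 1
        terminated, reward = False, 0.0
        if action == self.correct[self.pos]:
            if self.pos == self.H - 1:
                terminated, reward = True, 1.0  # top grip reached
            else:
                self.pos += 1                   # correct grip: move one height upward
        else:
            self.pos = 0                        # wrong grip: fall back to the bottom
        truncated = self.steps >= self.max_steps
        return self.pos, reward, terminated, truncated, {}
```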
Compared algorithms (notebook): Q-learning with ε-greedy exploration, Q-learning with a count-based intrinsic reward, and Go-Explore.
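
As an illustration of the second baseline, below is a minimal sketch of tabular Q-learning with a count-based intrinsic bonus, run against the BoulderingEnv sketch above. The bonus form beta / sqrt(n(s, a)) and all hyperparameter defaults are assumptions made for this sketch, not necessarily the notebook's settings.

```python
import numpy as np

def q_learning_count_bonus(env, episodes=2000, alpha=0.1, gamma=0.99,
                           epsilon=0.1, beta=0.5, seed=0):
    """Tabular Q-learning with an intrinsic bonus beta / sqrt(n(s, a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    counts = np.zeros_like(Q)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection over the current Q estimates
            if rng.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            counts[s, a] += 1
            # intrinsic bonus decays as (s, a) is visited more often
            r_total = r + beta / np.sqrt(counts[s, a])
            target = r_total + (0.0 if terminated else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s, done = s2, terminated or truncated
    return Q
```

For example, `q_learning_count_bonus(BoulderingEnv(height=8, num_grips=3))` returns the learned Q-table. The 1/sqrt(n) decay keeps the bonus large for rarely visited state-action pairs, which encourages the agent to keep retrying long grip sequences where a plain ε-greedy learner tends to linger in the well-explored bottom states.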