Designed for: Partial observability (memory)
Description: The agent has to navigate a sequence (corridor) of doors. At each step of the sequence, the agent is shown a set of doors. Only one of the observed doors advances the agent to the next step; opening any other door terminates the episode. Each time the sequence is completed, it restarts with its length increased by one. Only on the final set of doors in the sequence is the agent shown which door is the correct one. In all other states, the agent observes no useful information and has to open the correct door based on information obtained in prior states. The memory challenge of the task therefore scales automatically, since the sequence grows longer as the agent does well.
Variable parameters:
Number of doors (num_doors): The number of doors shown at each step of the sequence. This increases the size of the state space, but not the memorization difficulty for the agent; that difficulty is scaled by the environment itself each time a sequence is completed successfully.
State space: Discrete(num_doors + 1). One state for each door being the correct door, plus one state in which no correct door is visible.
Action space: Discrete(num_doors). Each action corresponds to opening a door (open door 0, open door 1, etc.).
Dynamics: Every action opens one of the available doors. If it is the correct door, the agent advances to the next step. If it is the final door of the sequence, the agent has to replay the entire sequence from the start, with one additional door appended at the end.
Reward function: Opening the correct door yields a reward of 1; otherwise no reward is given.
Initial state: The correct door is observed in the first state, since it is the final door of the first sequence, which has length 1.
Termination: Termination occurs when the agent opens an incorrect door.
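Below is a minimal Gymnasium-style sketch of the dynamics described above. The class name MemoryCorridorEnv and all internals are illustrative assumptions rather than the actual implementation; only the spaces, reward, and termination rule are taken from the specification.

```python
import gymnasium as gym
from gymnasium import spaces


class MemoryCorridorEnv(gym.Env):
    """Illustrative sketch of the corridor-of-doors memory task (not the reference implementation)."""

    def __init__(self, num_doors: int = 3):
        self.num_doors = num_doors
        # One observation per "door i is the correct door", plus one for "no correct door visible".
        self.observation_space = spaces.Discrete(num_doors + 1)
        self.action_space = spaces.Discrete(num_doors)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.sequence = [int(self.np_random.integers(self.num_doors))]  # first sequence has length 1
        self.step_idx = 0
        return self._obs(), {}

    def _obs(self):
        # The correct door is revealed only on the final step of the sequence.
        if self.step_idx == len(self.sequence) - 1:
            return self.sequence[self.step_idx]
        return self.num_doors  # "no correct door visible"

    def step(self, action):
        correct = self.sequence[self.step_idx]
        if action != correct:
            # Opening any incorrect door terminates the episode with no reward.
            return self._obs(), 0.0, True, False, {}
        reward = 1.0
        self.step_idx += 1
        if self.step_idx == len(self.sequence):
            # Sequence completed: append one more door and replay from the start.
            self.sequence.append(int(self.np_random.integers(self.num_doors)))
            self.step_idx = 0
        return self._obs(), reward, False, False, {}
```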
Compared algorithms (notebook): Tabular Q-learning with frame stacking, Deep Q-Learning (DQN) with frame stacking (Stable Baselines 3), and Recurrent PPO (Stable Baselines 3 Contrib)
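As a usage sketch, the recurrent baseline could be trained roughly as follows. The environment construction reuses the hypothetical MemoryCorridorEnv sketch above, and the timestep budget is an arbitrary assumption; only RecurrentPPO and its "MlpLstmPolicy" come from the Stable Baselines 3 Contrib package.

```python
from sb3_contrib import RecurrentPPO

# Assumes the MemoryCorridorEnv sketch above (or the actual environment) is importable.
env = MemoryCorridorEnv(num_doors=3)

# RecurrentPPO carries information across steps through its LSTM policy,
# so no frame stacking is needed for this baseline.
model = RecurrentPPO("MlpLstmPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```

The frame-stacking variants in the notebook instead concatenate the last few observations so that a memoryless learner (tabular Q-learning or DQN) can condition on recent history; that wrapper setup is not shown here.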