Designed for: Credit assignment, depth.
Description: This environment simulates a student's study efforts before an exam. Each day, the student decides to either Study for the exam, Sleep to regain energy, Go Out at the cost of energy, or take any Other action with no particular effect on the final exam.
For this specific course, studying for the exam is only effective on a day with a lecture. The total number of days before the exam can be set as a hyperparameter. In order to pass the exam, the student must study on enough lecture days and have enough energy to actually take the exam.
Variable parameters:
Number of actions: This directly controls the variance of trajectories. Smaller values make n-step methods work better with larger n, and vice versa.
Action reward noise means: This sets the range of the mean rewards for all actions. At initialization, a reward mean between the negative of this value and 0 is randomly drawn for each action.
Action reward noise sigma: This parameter introduces variance on all action rewards and can be used to change the variance of the returns.
Total days: The total number of days until the exam. This is effectively the episode length and also a means of adapting the variance.
Lecture days: The number of lectures that will take place before the exam.
Lectures needed: The minimum number of lectures that are needed to have enough knowledge for the exam.
Energy needed: The minimum energy level needed to pass the exam.
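The parameters above might be collected into a configuration like the following sketch; the keyword names and example values are illustrative assumptions for this write-up, not the environment's actual API.

```python
# Illustrative configuration for one instance of the environment; the parameter
# names and values are assumptions made for this sketch, not the actual keywords.
config = dict(
    num_actions=8,            # smaller values reduce trajectory variance
    reward_noise_mean=0.1,    # per-action means drawn from [-0.1, 0] at init
    reward_noise_sigma=0.05,  # std. deviation added to every action reward
    total_days=20,            # horizon H: days until the exam (episode length)
    lecture_days=5,           # number of lectures scheduled before the exam
    lectures_needed=3,        # lectures required for sufficient knowledge
    energy_needed=2,          # minimum energy level required to pass
)
```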
State space: MultiDiscrete([5, 5, H]), where the first entry is the knowledge level, the second entry the energy level, and the third entry the current time step, with H the horizon.
Action space: Discrete(N), where N is the number of actions.
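Assuming gymnasium-style spaces, the two declarations could look like this sketch (H and N are placeholder values):

```python
from gymnasium import spaces

H = 20  # total days (horizon), chosen only for illustration
N = 8   # number of actions, chosen only for illustration

# Observation: (knowledge level 0-4, energy level 0-4, current day 0..H-1)
observation_space = spaces.MultiDiscrete([5, 5, H])
# One discrete choice per day: study, sleep, go out, or one of the "other" actions
action_space = spaces.Discrete(N)
```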
Dynamics: Studying on a lecture day increases the knowledge level. Sleeping increases the energy level (up to a maximum of 4). Going out decreases the energy level (down to a minimum of 0). Any other action has no effect.
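A minimal sketch of this transition logic, assuming actions 0-2 map to study, sleep, and go out (the action indices and function name are assumptions):

```python
STUDY, SLEEP, GO_OUT = 0, 1, 2  # assumed action indices; remaining actions are "other"

def transition(knowledge, energy, day, action, lecture_days):
    """Apply one day's action; lecture_days is the set of days with a lecture."""
    if action == STUDY and day in lecture_days:
        knowledge = min(knowledge + 1, 4)   # studying only helps on lecture days
    elif action == SLEEP:
        energy = min(energy + 1, 4)         # capped at the maximum level 4
    elif action == GO_OUT:
        energy = max(energy - 1, 0)         # floored at the minimum level 0
    # any other action leaves knowledge and energy unchanged
    return knowledge, energy, day + 1
```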
Reward function: The reward is 10 if the student studies on the exam day and their knowledge and energy levels exceed their pre-defined thresholds. In addition, every action has a reward sampled from N(r(i), s), where r(i) is the reward noise mean of action i and s is the reward noise sigma.
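A sketch of how this reward could be computed; the function signature, the exam-day index, and whether the threshold comparisons are strict or inclusive are assumptions:

```python
import numpy as np

STUDY = 0  # assumed index of the study action, as in the dynamics sketch above

def step_reward(knowledge, energy, day, action, action_means, sigma,
                horizon, knowledge_needed, energy_needed, rng):
    """Noisy per-action reward plus the exam bonus on the final day (assumed day H-1)."""
    r = rng.normal(action_means[action], sigma)  # sampled from N(r(i), s)
    if day == horizon - 1 and action == STUDY \
            and knowledge >= knowledge_needed and energy >= energy_needed:
        r += 10.0  # passing the exam
    return r

# Example usage with illustrative values
rng = np.random.default_rng(0)
print(step_reward(3, 2, 19, STUDY, np.full(8, -0.05), 0.05, 20, 3, 2, rng))
```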
Initial state: No knowledge, no energy and time step 0.
Termination: The episode terminates after H steps.
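Initial state and termination can be sketched as follows (function names are illustrative):

```python
def reset():
    """Initial state: no knowledge, no energy, time step 0."""
    return (0, 0, 0)

def terminated(day, horizon):
    """The episode ends once H steps (days) have elapsed."""
    return day >= horizon
```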
Compared algorithms (notebook): 1-step SARSA, n-step SARSA, full Monte-Carlo update.
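For orientation, here is a minimal tabular sketch of the n-step SARSA update the notebook compares; the data layout and function signature are assumptions, and 1-step SARSA and the full Monte-Carlo update fall out as the special cases n = 1 and n >= episode length.

```python
from collections import defaultdict

def n_step_sarsa_update(Q, episode, t, n, alpha, gamma=1.0):
    """Tabular n-step SARSA update of Q[(s_t, a_t)] towards the n-step return.

    episode is a list of (state, action, reward) tuples, where the reward is the
    one received after taking that action; n=1 recovers 1-step SARSA, and any
    n >= len(episode) recovers the full Monte-Carlo update.
    """
    T = len(episode)
    end = min(t + n, T)
    # Discounted sum of the next n rewards (or all remaining ones)
    G = sum(gamma ** (k - t) * episode[k][2] for k in range(t, end))
    # Bootstrap from Q(s_{t+n}, a_{t+n}) if the episode continues past the window
    if end < T:
        s_next, a_next, _ = episode[end]
        G += gamma ** (end - t) * Q[(s_next, a_next)]
    s, a, _ = episode[t]
    Q[(s, a)] += alpha * (G - Q[(s, a)])

# Example: Q as a defaultdict so unseen state-action pairs start at zero
Q = defaultdict(float)
```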