Designed for: State signal & Model Noise
Description: The agent needs to take care of the Tamagotchi by keeping it healthy (hp) and happy (happiness). Happiness is based on the internal variables (joy, energy, food, hygiene), where low values have a bigger impact on the Tamagotchi's happiness. The hp is part of the state space and decreases or increases as a function of the happiness variable. On its own, the hp is only partially informative of the best action the agent can take, but in combination with the utterances made by the Tamagotchi it provides sufficient information. These utterances can be of different lengths and have some relation to the preferred next action. The quality of the state signal (i.e., the utterances) is determined by the temperature parameter (see Variable parameters).
Variable parameters:
Message noise (tau): temperature parameter that influences the informativeness of the utterances. As tau --> 0 the message becomes a perfect description of the internal variables; as tau --> ∞ it becomes pure noise.
Message length (max_msg_length): length of the utterances.
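The effect of tau can be illustrated with a softmax-with-temperature sampler. This is a minimal sketch, not the environment's actual message generator: the function names, the use of one score per candidate token, and the idea that each token is sampled independently are all assumptions made for illustration.

```python
import math
import random


def softmax_with_temperature(scores, tau):
    """Return probabilities proportional to exp(score / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [s / tau for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_utterance(scores, tau, max_msg_length, rng):
    """Sample max_msg_length token indices (hypothetical message model)."""
    probs = softmax_with_temperature(scores, tau)
    tokens = list(range(len(scores)))
    return rng.choices(tokens, weights=probs, k=max_msg_length)


# Hypothetical importance scores, one per candidate token.
scores = [0.1, 0.9, 0.3, 0.2]
rng = random.Random(0)
print(sample_utterance(scores, tau=0.01, max_msg_length=3, rng=rng))
print(sample_utterance(scores, tau=100.0, max_msg_length=3, rng=rng))
```

With a small tau nearly all probability mass sits on the highest-scoring token, so the utterance is informative; with a large tau the distribution approaches uniform and the utterance carries little information.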
State space: MultiDiscrete([100, H, ...]), where the first variable indicates the hp level and the second and any further variables encode the communicated message, each drawn from a vocabulary of size H.
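The layout of such an observation can be sketched in plain Python. This sketch assumes that the "..." expands to one entry of size H per message token (so max_msg_length entries in total) and follows the usual MultiDiscrete convention that a component of size n takes values 0 to n-1. The helper names are hypothetical.

```python
def observation_nvec(vocab_size, max_msg_length, hp_levels=100):
    """Component sizes: one hp entry, then one entry per message token (assumed)."""
    return [hp_levels] + [vocab_size] * max_msg_length


def is_valid_observation(obs, nvec):
    """Check that each component lies in 0..n-1, as MultiDiscrete requires."""
    return len(obs) == len(nvec) and all(0 <= v < n for v, n in zip(obs, nvec))


nvec = observation_nvec(vocab_size=4, max_msg_length=2)
print(nvec)
print(is_valid_observation([99, 3, 0], nvec))
```

Under that convention a first component of size 100 covers 0 to 99; this description does not say how an hp of 100 is encoded.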
Action space: Discrete(4). The agent can perform four actions: play, feed, clean, and sleep.
Dynamics: The actions of the agent influence the internal variables of the Tamagotchi. Every action results in a positive update (+30) of the corresponding internal variable and a small decrease (-5) of the other variables. When the action is not the ideal action according to the Tamagotchi, all variables (including the one that has just been updated) decrease by 10. After the variables have been updated, the Tamagotchi decides which action should be taken next (by weighing the importance of each internal variable) and generates an utterance of the specified length accordingly. As described above, these utterances can be very informative or noisy.
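The update rule can be sketched as follows. Several details are assumptions: the action-to-variable mapping (play to joy, feed to food, clean to hygiene, sleep to energy), the reading that the 10-point penalty is applied on top of the +30/-5 update, and the absence of any clamping of the variables. How the ideal action is chosen is not modelled here; it is passed in as an argument.

```python
# Assumed mapping from actions to the internal variable they improve.
ACTION_TO_VARIABLE = {
    "play": "joy",
    "feed": "food",
    "clean": "hygiene",
    "sleep": "energy",
}


def update_internal_variables(variables, action, ideal_action):
    """Return a new dict of internal variables after one action."""
    target = ACTION_TO_VARIABLE[action]
    updated = {}
    for name, value in variables.items():
        # +30 for the variable matching the action, -5 for the others.
        delta = 30 if name == target else -5
        # Assumed: a non-ideal action costs every variable another 10.
        if action != ideal_action:
            delta -= 10
        updated[name] = value + delta
    return updated


state = {"joy": 50, "energy": 50, "food": 50, "hygiene": 50}
print(update_internal_variables(state, "feed", ideal_action="feed"))
print(update_internal_variables(state, "feed", ideal_action="play"))
```

Under these assumptions, feeding when feeding is ideal raises food by 30 and lowers the rest by 5, while feeding when another action is ideal yields a net +20 for food and -15 for the rest.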
Reward function: Ranges between -200 and 75. The reward corresponds to the happiness of the Tamagotchi, which is calculated from its internal variables; variables with low values contribute more to the overall happiness than variables with high values.
Initial state: hp: 100 and U, ..., where U is an utterance drawn from the vocabulary of size H.
Termination: hp == 0, or when steps_per_episode == 100.
Compared algorithms (notebook): Q-learning with variations of noise.
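For orientation, a tabular Q-learning update of the kind such a comparison might use is sketched below. This is the standard textbook rule, not the notebook's code: the hyperparameter values (alpha, gamma, epsilon), the helper names, and the choice to key the table on the observation tuple (hp, message tokens) are assumptions.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # play, feed, clean, sleep


def make_q_table():
    """Q-table mapping an observation tuple to one value per action."""
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy action selection."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = q_table[state]
    return max(range(N_ACTIONS), key=values.__getitem__)


def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s, a) toward the TD target."""
    bootstrap = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * bootstrap
    q_table[state][action] += alpha * (td_target - q_table[state][action])


q = make_q_table()
state, next_state = (100, 2), (95, 1)  # hypothetical (hp, token) observations
action = choose_action(q, state, epsilon=0.1, rng=random.Random(0))
q_update(q, state, action, reward=10.0, next_state=next_state, done=False)
print(q[state])
```

Because the table is keyed on the full observation, noisier utterances spread experience over more distinct states, which is one reason the noise level matters for a tabular learner.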