Hi Soheil,

To answer question 1:

max_episode_length and len(episode_buffer) == 30 correspond to two different things. The first is a time-out that ends an episode that has gone on for too long. The second checks whether the rollout buffer for the current episode is full. The two conditions are triggered at different times, and we want to capture both.
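To make the difference concrete, here is a minimal sketch of a rollout loop that handles the two checks separately. It is not the code from the article: run_episode, step_fn, and buffer_size are hypothetical names for illustration, and the training update is reduced to recording the buffer length.

```python
def run_episode(step_fn, max_episode_length, buffer_size=30):
    """Run one episode and report when each of the two checks fires.

    step_fn(step_index) is a hypothetical stand-in for the environment:
    it returns (transition, done).
    Returns (steps_taken, flush_sizes, leftover_in_buffer).
    """
    episode_buffer = []
    flush_sizes = []  # buffer length at each mid-episode update
    steps = 0
    done = False
    while not done:
        transition, done = step_fn(steps)
        episode_buffer.append(transition)
        steps += 1

        # Check 1: the rollout buffer is full but the episode is still running.
        # A real agent would train here, bootstrapping from its value estimate.
        if len(episode_buffer) == buffer_size and not done and steps < max_episode_length:
            flush_sizes.append(len(episode_buffer))
            episode_buffer = []

        # Check 2: time-out, the episode has gone on for too long.
        if steps >= max_episode_length:
            break

    # Whatever remains in the buffer would be used for the end-of-episode update.
    return steps, flush_sizes, len(episode_buffer)


if __name__ == "__main__":
    # An environment that never terminates on its own: only the time-out ends it.
    never_done = lambda i: (i, False)
    print(run_episode(never_done, max_episode_length=100))
```

In this sketch an episode that never terminates is cut off by the time-out, while the buffer check fires several times along the way, which is why both conditions are needed.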

And to answer question 2:

The specific discounting used in the code comes from Generalized Advantage Estimation; I took the implementation from OpenAI's universe-starter-agent. If you want to try it, it is simple enough to replace this with the more basic advantage estimation that I describe in the article.
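For comparison, here is a minimal NumPy sketch of the two estimators. The function names and the lam parameter are assumptions for illustration and are not taken from the article's code; bootstrap_value stands for the value estimate of the state that follows the last step in the rollout.

```python
import numpy as np


def discount(x, gamma):
    """Discounted cumulative sum: out[t] = x[t] + gamma * x[t+1] + gamma^2 * x[t+2] + ..."""
    out = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + gamma * running
        out[t] = running
    return out


def basic_advantages(rewards, values, bootstrap_value, gamma):
    """Basic estimate: discounted return (with bootstrap) minus the value estimate."""
    rewards_plus = np.append(rewards, bootstrap_value)
    returns = discount(rewards_plus, gamma)[:-1]
    return returns - np.asarray(values, dtype=np.float64)


def gae_advantages(rewards, values, bootstrap_value, gamma, lam=1.0):
    """Generalized Advantage Estimation: discounted sum of one-step TD errors.

    lam is the GAE lambda (an assumed parameter here); lam=1.0 recovers
    the basic estimate above, smaller values trade variance for bias.
    """
    values_plus = np.append(values, bootstrap_value)
    deltas = np.asarray(rewards, dtype=np.float64) + gamma * values_plus[1:] - values_plus[:-1]
    return discount(deltas, gamma * lam)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    values = [0.5, 0.4, 0.3]
    print(basic_advantages(rewards, values, 0.2, 0.9))
    print(gae_advantages(rewards, values, 0.2, 0.9, lam=0.95))
```

With lam set to 1.0 the sum of TD errors telescopes, so both functions return the same numbers; swapping one for the other only changes results once lam is below 1.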

Hope that helps.

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.

