Hi Henrik,

The piece of the puzzle you might be missing is that when we use single time-steps to determine actions, we collect the hidden state every time and feed it back into the network at the next step in the environment. In this way we maintain the temporal dependencies even though we only pass a single example through at a time.
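Here is a minimal sketch of that idea, using a toy NumPy LSTM cell (the cell, its sizes, and the variable names are all illustrative, not from the article). The point is simply that carrying (h, c) forward between single-step calls preserves the trajectory's history, while resetting it every step does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy LSTM cell for illustration only (a real agent would use a
# framework implementation such as torch.nn.LSTMCell).
class TinyLSTMCell:
    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(scale=0.5, size=(4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.n_hidden = n_hidden

    def step(self, x, state):
        h, c = state
        gates = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, (h, c)

    def zero_state(self):
        return (np.zeros(self.n_hidden), np.zeros(self.n_hidden))

cell = TinyLSTMCell(n_in=3, n_hidden=8)
observations = rng.normal(size=(30, 3))  # stand-in for a 30-step trajectory

# Acting: one observation per environment step, carrying (h, c) forward.
state = cell.zero_state()
for x in observations:
    h_carried, state = cell.step(x, state)

# Without carrying: resetting the state every step discards the history.
for x in observations:
    h_reset, _ = cell.step(x, cell.zero_state())

# The carried version has seen the whole trajectory; the reset one has not,
# so the two final hidden states differ.
print(np.allclose(h_carried, h_reset))
```

Each single-step call is cheap, but because the state is threaded through, the network's output at step t still depends on everything it saw at steps 1..t-1.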

Then, when training, we use sequences of 30 steps so that the LSTM learns to detect and exploit the temporal dependencies in the state information.
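As a sketch of what that looks like on the data side (the buffer layout here is an assumption, not the article's exact code): a recorded episode is sliced into fixed-length windows of 30 transitions, and each window is fed to the LSTM as one training sequence, starting from a fresh state at the window's first step.

```python
# Hypothetical trajectory buffer: a list of recorded transitions.
trace_length = 30
episode = list(range(95))  # stand-in for 95 recorded transitions

# Non-overlapping 30-step windows; a partial tail window is dropped.
windows = [episode[i:i + trace_length]
           for i in range(0, len(episode) - trace_length + 1, trace_length)]

print([len(w) for w in windows])  # three full windows of 30 steps each
```

Some implementations instead sample windows from random start points within an episode, but the idea is the same: the gradient sees 30 consecutive steps, which is what lets the LSTM learn the temporal structure it then exploits one step at a time when acting.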

I hope that has made it a little clearer.
