Arthur Juliani
1 min read · Apr 10, 2017


Hi Henrik,

The piece of the puzzle you might be missing is that when we use single time-steps to determine actions, we collect the hidden state at every step and feed it back into the network at the next step in the environment. In this way we maintain the temporal dependencies even though we only pass a single example through at a time.

Then, when training, we use sequences of 30 steps so that the LSTM can detect and exploit the temporal dependencies in the state information.
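The pattern described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the tutorial's actual TensorFlow code): a toy recurrent step in numpy, where acting feeds the hidden state back in one step at a time, and training later replays the same 30 steps as one sequence starting from the same initial state.

```python
import numpy as np

def step_rnn(x, h, W_x, W_h):
    """One toy recurrent step: new hidden state from input and previous hidden state."""
    return np.tanh(x @ W_x + h @ W_h)

rng = np.random.default_rng(0)
obs_dim, hid_dim = 4, 8
W_x = rng.normal(size=(obs_dim, hid_dim)) * 0.1   # stand-in input weights
W_h = rng.normal(size=(hid_dim, hid_dim)) * 0.1   # stand-in recurrent weights

# --- Acting: one time-step at a time, carrying the hidden state forward ---
h = np.zeros(hid_dim)                  # reset at the start of an episode
episode_obs = []
for t in range(30):
    obs = rng.normal(size=obs_dim)     # stand-in for an environment observation
    h = step_rnn(obs, h, W_x, W_h)     # feed the previous hidden state back in
    episode_obs.append(obs)
    # ... an action would be chosen from h here ...

# --- Training: replay the same 30 steps as one sequence ---
h_train = np.zeros(hid_dim)            # start from the same initial hidden state
for obs in episode_obs:
    h_train = step_rnn(obs, h_train, W_x, W_h)

# Because the hidden state was carried forward during acting, replaying the
# 30-step sequence reproduces the same final hidden state.
print(np.allclose(h, h_train))  # True
```

The point of the check at the end is that single-step acting with a carried hidden state and sequence-based training see the same recurrent dynamics, which is why the temporal dependencies survive even though actions are computed one step at a time.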

I hope that makes it a little clearer.
