Arthur Juliani
1 min read · Apr 10, 2017


Hi Henrik,

The piece of the puzzle you might be missing is that when we use single time-steps to determine actions, we collect the hidden state at every step and feed it back into the network at the next step in the environment. In this way we maintain the temporal dependencies even though we only pass a single example through at a time.

Then, when training, we use sequences of 30 steps so that the LSTM can detect and exploit the temporal dependencies in the state information.
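The pattern described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the tutorial's actual TensorFlow code): a toy recurrent step in numpy, where acting feeds the hidden state back in one step at a time, and training later replays the same 30 steps as one sequence starting from the same initial state.

```python
import numpy as np

def step_rnn(x, h, W_x, W_h):
    """One toy recurrent step: new hidden state from input and previous hidden state."""
    return np.tanh(x @ W_x + h @ W_h)

rng = np.random.default_rng(0)
obs_dim, hid_dim = 4, 8
W_x = rng.normal(size=(obs_dim, hid_dim)) * 0.1   # stand-in input weights
W_h = rng.normal(size=(hid_dim, hid_dim)) * 0.1   # stand-in recurrent weights

# --- Acting: one time-step at a time, carrying the hidden state forward ---
h = np.zeros(hid_dim)                  # reset at the start of an episode
episode_obs = []
for t in range(30):
    obs = rng.normal(size=obs_dim)     # stand-in for an environment observation
    h = step_rnn(obs, h, W_x, W_h)     # feed the previous hidden state back in
    episode_obs.append(obs)
    # ... an action would be chosen from h here ...

# --- Training: replay the same 30 steps as one sequence ---
h_train = np.zeros(hid_dim)            # start from the same initial hidden state
for obs in episode_obs:
    h_train = step_rnn(obs, h_train, W_x, W_h)

# Because the hidden state was carried forward during acting, replaying the
# 30-step sequence reproduces the same final hidden state.
print(np.allclose(h, h_train))  # True
```

The point of the check at the end is that single-step acting with a carried hidden state and sequence-based training see the same recurrent dynamics, which is why the temporal dependencies survive even though actions are computed one step at a time.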

I hope that makes it a little clearer.
