If I understand your question, you are asking why, when an update takes place in the middle of an episode (say on steps 70 through 100), we don’t feed in the hidden state the network actually had at the start of that sequence (hidden state #70) rather than re-initializing the hidden state. I think that is an interesting idea, and potentially worth pursuing. To do it, you would have to keep around the hidden state that is expected at the start of each batch (or the one from right after the last update was applied). The reason it doesn’t help much in practice is that the LSTM can’t really hold information from more than a couple dozen time steps in the past, and on top of that, solving this task only requires knowing about a dozen time steps of history at most. By re-initializing for training we may lose a little, but the hidden state quickly becomes properly populated as the LSTM processes our batch of experiences.
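To make the two options concrete, here is a minimal sketch of what "keeping the hidden state around" could look like. It is not the code from the tutorial: `EpisodeBuffer`, `sample_trace`, `HIDDEN_SIZE`, and the zero re-initialization are all names and choices I am assuming for illustration, and the `tanh` line only stands in for a real LSTM step.

```python
import numpy as np

HIDDEN_SIZE = 4  # hypothetical LSTM state size, chosen only for illustration


class EpisodeBuffer:
    """Holds one episode's transitions plus the hidden state fed in at each step."""

    def __init__(self):
        self.transitions = []    # one (observation, action, reward) tuple per step
        self.hidden_states = []  # hidden state the network received at that step

    def add(self, transition, hidden_in):
        self.transitions.append(transition)
        # Copy so later in-place changes to the live state don't alter the record.
        self.hidden_states.append(np.array(hidden_in, copy=True))

    def sample_trace(self, start, length, use_stored_state):
        """Return `length` consecutive steps and the initial state to train with."""
        trace = self.transitions[start:start + length]
        if use_stored_state:
            h0 = self.hidden_states[start]  # e.g. hidden state #70
        else:
            h0 = np.zeros(HIDDEN_SIZE)      # re-initialize (assumed here to be zeros)
        return trace, h0


# Fill a 100-step episode with placeholder data.
buffer = EpisodeBuffer()
hidden = np.zeros(HIDDEN_SIZE)
for step in range(100):
    buffer.add((step, 0, 0.0), hidden)
    hidden = np.tanh(hidden + 0.1)  # stand-in for the real LSTM update

# Train on steps 70..99 with either choice of initial state.
trace, h0_stored = buffer.sample_trace(70, 30, use_stored_state=True)
_, h0_zero = buffer.sample_trace(70, 30, use_stored_state=False)
```

The only difference between the two calls is the state handed to the LSTM before the first step of the trace. One thing to keep in mind if you try the stored version: those states were produced by whatever weights the network had when the experience was collected, so they drift out of date as training goes on.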