Hi Andy,

I think the point of confusion is that I am actually using the update procedure from this paper: https://arxiv.org/pdf/1509.02971.pdf, not the one from the original paper by Mnih et al. In my implementation, the target network is moved slowly toward the values of the primary network at every update step of the primary network, instead of being updated infrequently but all at once, as in the original paper. The authors of the newer paper found that this improved training stability, and I found the same in my own experiments.
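
To make the difference concrete, here is a minimal sketch of the two update styles. It assumes the network weights can be treated as a flat list of floats; the function names, the example weights, and the value of tau are illustrative assumptions, not taken from the actual implementation.

```python
def soft_update(target_weights, primary_weights, tau=0.001):
    """Move each target weight a fraction tau of the way toward the primary weight.

    tau is a hypothetical mixing rate; small values give a slowly moving target.
    """
    return [tau * p + (1.0 - tau) * t
            for t, p in zip(target_weights, primary_weights)]


def hard_update(primary_weights):
    """Copy the primary weights wholesale (the infrequent, all-at-once style)."""
    return list(primary_weights)


# Illustrative weights only; a real network would have many more parameters.
primary = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

# With soft updates, the target is nudged at every training step.
for step in range(1000):
    target = soft_update(target, primary, tau=0.001)
```

With a small tau the target weights lag behind the primary weights and only approach them gradually, which is what keeps the learning targets from shifting abruptly.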

I hope that clears things up.


