Hi Eric,

The separate target network is used to produce the "target" Q-values that we regress our current Q-values toward. The target consists of the immediate reward plus the discounted Q-value estimate of the best action in the next state. This is used in all variants of DQN. The difference is in who picks that best action: in vanilla DQN the target network both selects the next action and evaluates its value, whereas in Double DQN the target network still evaluates the value, but the action itself is selected by the online network. Decoupling selection from evaluation is what reduces the overestimation bias. There's a small sketch of the difference below. Hopefully that makes things a little clearer.
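If it helps, here is a minimal PyTorch sketch of the two target computations. All the names (online_net, target_net, rewards, next_states, dones, gamma) are placeholders for illustration, not anyone's actual implementation:

```python
import torch

# Assumed shapes: rewards and dones are (B,), next_states is a batch of
# states, and both networks map states -> Q-values of shape (B, num_actions).

with torch.no_grad():
    next_q_target = target_net(next_states)  # Q_target(s', .)

    # Vanilla DQN: the target network both selects and evaluates
    # the next action.
    dqn_target = rewards + gamma * (1 - dones) * next_q_target.max(dim=1).values

    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
    ddqn_target = rewards + gamma * (1 - dones) * next_q_target.gather(
        1, best_actions
    ).squeeze(1)
```

Either target is then used in the same TD loss against the online network's Q-value for the action actually taken.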
