The separate target network is used to produce the “target” Q-values that we regress our current Q-values toward. The target consists of the immediate reward plus the discounted estimated future reward from taking the most valuable action in the next state. A target network is used in all variants of DQN. In Double DQN the target network still evaluates the target value, but the action that value comes from is selected by the online network, unlike traditional DQN, where the target network both selects and evaluates the action. Hopefully that makes things a little more clear.
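A minimal sketch of the difference, assuming batched NumPy arrays of next-state Q-values from each network (the function and variable names here are just illustrative):

```python
import numpy as np

def dqn_target(reward, q_next_target, done, gamma=0.99):
    # Vanilla DQN: the target network both SELECTS the next action
    # (the max) and EVALUATES it, using its own Q-values.
    return reward + gamma * (1.0 - done) * q_next_target.max(axis=1)

def double_dqn_target(reward, q_next_online, q_next_target, done, gamma=0.99):
    # Double DQN: the online network SELECTS the action (argmax),
    # but the target network EVALUATES it.
    best_actions = q_next_online.argmax(axis=1)
    chosen_q = q_next_target[np.arange(len(reward)), best_actions]
    return reward + gamma * (1.0 - done) * chosen_q

reward = np.array([1.0])
done = np.array([0.0])
q_next_target = np.array([[0.5, 2.0]])   # target network's estimates
q_next_online = np.array([[3.0, 0.1]])   # online network's estimates

print(dqn_target(reward, q_next_target, done))                        # uses max of target net: 1 + 0.99 * 2.0
print(double_dqn_target(reward, q_next_online, q_next_target, done))  # online picks action 0, target evaluates it: 1 + 0.99 * 0.5
```

When the target network overestimates an action's value (as with action 1 above), vanilla DQN's max picks up that overestimate, while Double DQN's decoupled selection tends to dampen it.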