Hi Farah,

The first equation intuitively says that the value of a given state–action pair is the immediate reward at that time-step plus the discounted maximum expected return at the next time-step. It doesn't need the Q-values of every possible next action, only the maximum one, because the equation assumes the agent acts greedily (optimally) with respect to its expected return from that point onwards.
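Written out in the usual notation (which may differ slightly from the article's), that relationship is the Bellman optimality equation:

    Q(s, a) = E[ r + γ · max_a' Q(s', a') ]

where s' is the state reached after taking action a in state s, and γ is the discount factor.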

The loss function measures how far the current Q-estimate is from a target that incorporates one future time-step of observed reward. Since the agent can only take a single action per step, the Q-target is computed from the action actually taken (which gives us the observed reward and next state) together with the maximum-valued action at that next state. We also only update the Q-value for the action we took, since that is the only action for which we know what state actually followed; a rough sketch of this is below.
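To make that concrete, here is a minimal NumPy illustration I'm improvising (not the article's implementation; the array shapes, gamma, and the toy numbers are just placeholder assumptions):

    import numpy as np

    gamma = 0.99  # discount factor (assumed value)

    # Current Q-estimates for a batch of transitions, shape (batch, n_actions)
    q_values = np.array([[1.0, 2.0, 0.5],
                         [0.3, 0.1, 0.8]])
    # Q-estimates for the corresponding next states, shape (batch, n_actions)
    next_q_values = np.array([[0.4, 1.5, 0.2],
                              [0.9, 0.7, 1.1]])

    actions = np.array([1, 2])      # actions actually taken
    rewards = np.array([0.0, 1.0])  # rewards observed for those actions

    # Q-target: observed reward + discounted max Q over next actions
    q_target = rewards + gamma * next_q_values.max(axis=1)

    # Only the Q-value of the action actually taken enters the loss
    q_taken = q_values[np.arange(len(actions)), actions]

    loss = np.mean((q_target - q_taken) ** 2)
    print(loss)

In a real DQN the q_values would come from the network, and selecting only the taken-action entries is exactly what restricts the update to those actions.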

I hope this helped clear things up.

