The advantage is fed in separately because we don’t want gradients flowing through the advantage to change the value estimate; we only want them to affect the policy probabilities. This implementation also uses Generalized Advantage Estimation (GAE), in which the advantages are exponentially weighted sums of TD residuals, so they can’t be taken directly from the network’s value output.
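To make this concrete, here is a minimal NumPy sketch of GAE (the function name and the rollout layout are my own assumptions, not from the original code): each advantage is a discounted sum of TD residuals, computed by a backward pass over the rollout, which is why it can’t be read straight off the value head.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards: r_0 .. r_{T-1} for the rollout.
    values:  V(s_0) .. V(s_T), one extra bootstrap value for the final state.

    Each advantage is an exponentially weighted sum of TD residuals,
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    accumulated backward through time with decay gamma * lam.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # The result is treated as a constant in the policy loss (e.g. via
    # .detach() in PyTorch), so its gradient never reaches the value head.
    return advantages
```

Feeding the resulting array in as a plain input (or detaching it) is what keeps the advantage from pushing gradients back into the value estimate.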

I hope that makes things a little clearer.

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.
