Hi Yuanpu,

I apologize for the late reply. If you look at the paper “Bridging the Gap Between Value and Policy Based Reinforcement Learning,” the authors explain how the A3C gradient is actually not 0 for an optimal policy:

“However, one difference is that the actor-critic definition of advantage Aθ,φ(s1:t) is a measure of the advantage of the trajectory s1:t compared to the average trajectory chosen by πθ starting from s1 in terms of reward. By contrast, Cθ,φ(s1:t) can be seen as measuring the advantage of the rewards along the trajectory compared to the log-probability of the policy πθ. At the optimal policy, when the log-probability of the policy is proportional to rewards, this measure of advantage will be 0 on every trajectory, which is not the case for Aθ,φ(s1:t).”

https://arxiv.org/abs/1702.08892
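To make the distinction concrete, here is a minimal sketch of my own (not code from the paper or from the post above): a one-step, bandit-style setting with an entropy-regularized soft-optimal policy, where the path consistency quantity C is 0 for every action, while the ordinary A3C-style advantage A = r − V is only 0 in expectation. The reward values and temperature are arbitrary numbers I picked for illustration.

```
import numpy as np

# Illustrative sketch: one-step "bandit" with a soft-optimal policy.
rewards = np.array([1.0, 2.0, 0.5])   # r(a) for three actions (arbitrary)
tau = 0.1                             # entropy-regularization temperature (assumed)

# Soft-optimal value and policy: V* = tau * log sum_a exp(r(a)/tau),
# pi*(a) proportional to exp(r(a)/tau).
V_star = tau * np.log(np.sum(np.exp(rewards / tau)))
pi_star = np.exp((rewards - V_star) / tau)

# Path-consistency-style quantity for one step with no successor state:
# C(a) = r(a) - tau * log pi*(a) - V*   -> exactly 0 for every action at the optimum.
C = rewards - tau * np.log(pi_star) - V_star

# A3C-style advantage: A(a) = r(a) - E_pi*[r]  -> 0 only on average, not per action.
V_pi = np.dot(pi_star, rewards)
A = rewards - V_pi

print("C per action:", C)   # ~[0, 0, 0]
print("A per action:", A)   # nonzero per action, zero only in expectation
```

That per-trajectory (here, per-action) zero is the point of the quoted passage: the PCL-style measure vanishes everywhere at the optimum, whereas the A3C advantage does not, so its gradient estimate keeps nonzero variance there.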

