Hi Anders,

I would suggest taking a look at the entropy value at the point where the network begins to only provide a single output. It has likely either dropped to 0 or become infinite; both situations could lead to the network always choosing the first action. I would recommend trying different learning rates, entropy regularizer strengths, and larger batch sizes. You could also have the network output the actual action probabilities, and look at how those change over time.
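If it helps, here is a minimal sketch of the kind of logging I mean, assuming a PyTorch policy network whose forward pass returns action logits (your setup may differ):

```python
import torch
import torch.nn.functional as F

def log_policy_stats(logits: torch.Tensor, step: int) -> None:
    """Print the policy entropy and mean action probabilities.

    logits: tensor of shape (batch, n_actions) from the policy network.
    """
    probs = F.softmax(logits, dim=-1)
    # log_softmax is numerically safer than log(softmax(...)) when a
    # probability approaches zero.
    log_probs = F.log_softmax(logits, dim=-1)
    # Entropy per state, averaged over the batch.
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    mean_probs = probs.mean(dim=0)
    print(f"step {step}: entropy={entropy.item():.4f}, "
          f"mean action probs={mean_probs.tolist()}")
```

If the logged entropy slides toward 0 (or turns into NaN/inf) right before the single-action behavior appears, that points at the causes above.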

Generally, policy collapse refers to any time the policy had been progressing toward behavior that increases cumulative reward, and then suddenly stops producing action patterns that lead to rewards. Only producing a single action regardless of the state is definitely a kind of policy collapse. Another common collapse scenario is the policy becoming completely random.
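Both of those failure modes show up at opposite ends of the entropy scale, so a crude check is to compare the measured entropy against the uniform-policy maximum. A hedged sketch, with thresholds that are arbitrary and purely illustrative:

```python
import math

def diagnose_policy(entropy: float, n_actions: int,
                    low_frac: float = 0.05, high_frac: float = 0.95) -> str:
    """Crude collapse check for a discrete policy.

    entropy: average policy entropy in nats.
    n_actions: size of the discrete action space.
    """
    max_entropy = math.log(n_actions)  # entropy of a uniform policy
    if entropy < low_frac * max_entropy:
        return "collapsed: nearly deterministic (single action)"
    if entropy > high_frac * max_entropy:
        return "collapsed: nearly uniform (random actions)"
    return "policy still has usable structure"
```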

Hope that helps.

