Hi Mohamed,

In your situation the positive reward signal is very sparse, and as such vanilla policy gradient is not a robust enough method to capture the signal from this sparse reward. I would suggest looking at other more advanced methods that I discuss in this tutorial series, such as DQN or A3C.

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store