Hi Bharat,

I haven’t worked with that particular environment before, but it is likely more complex than CartPole, so it may need some parameter tuning, or a longer training time with a smaller learning rate, to achieve a good result. In your code I see that you are using a learning rate of 1e-2, which is probably too large and may be producing divergent policy updates.
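
To make the intuition concrete, here is a toy sketch (not your policy-gradient code, and not that environment): plain gradient descent on a one-dimensional quadratic. The function name, the curvature value of 300, and the step count are all assumptions I picked for illustration. The point is only that the same update rule can blow up at 1e-2 and settle down at 1e-3 when the loss surface is steep.

```python
def run_gradient_descent(learning_rate, curvature=300.0, steps=20, x0=1.0):
    """Minimize f(x) = 0.5 * curvature * x**2 with plain gradient descent.

    Returns the distance from the optimum (x = 0) after `steps` updates.
    The curvature value is an arbitrary choice made for illustration.
    """
    x = x0
    for _ in range(steps):
        grad = curvature * x           # derivative of 0.5 * curvature * x**2
        x = x - learning_rate * grad   # standard gradient descent update
    return abs(x)


# Each update multiplies x by (1 - learning_rate * curvature).
# With 1e-2 that factor has magnitude 2, so the iterate grows every step.
large = run_gradient_descent(1e-2)
# With 1e-3 the factor is 0.7, so the iterate shrinks toward the optimum.
small = run_gradient_descent(1e-3)

print("distance from optimum with lr=1e-2:", large)
print("distance from optimum with lr=1e-3:", small)
```

A real policy-gradient loss is noisy and not quadratic, so the exact threshold will be different, but trying 1e-3 or smaller is a cheap first experiment.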

I hope this helps. Good luck!

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.
