Hi Bharat,

I haven’t worked with that environment in particular before, but it is likely more complex than CartPole, and so it may need some parameter tuning, or a longer training time with a smaller learning rate, to achieve a good result. In your code I see that you are using a learning rate of 1e-2, which is likely too large and may be producing divergent policy updates. I would try lowering it (to 1e-3 or 1e-4, for example) and training for more episodes.
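To illustrate why the step size matters, here is a minimal toy sketch. It is not your policy-gradient code and not that environment: it runs plain gradient descent on a steep quadratic, and the function name `run_gradient_descent`, the curvature value, and the step count are all assumptions I picked for illustration. The same update rule that diverges at 1e-2 settles down at smaller learning rates.

```python
def run_gradient_descent(learning_rate, curvature=300.0, steps=50, x0=1.0):
    """Minimise f(x) = 0.5 * curvature * x**2 and return the final |x|.

    Toy stand-in for a training loop; curvature and steps are
    illustrative assumptions, not values from any real environment.
    """
    x = x0
    for _ in range(steps):
        grad = curvature * x          # derivative of f at x
        x -= learning_rate * grad     # plain gradient descent update
    return abs(x)


if __name__ == "__main__":
    # Distance from the optimum (x = 0) after training, per learning rate.
    for lr in (1e-2, 1e-3, 1e-4):
        print(f"lr={lr:g}: final |x| = {run_gradient_descent(lr):.3e}")
```

With these assumed settings, each step multiplies x by (1 - learning_rate * curvature), so 1e-2 overshoots further on every update, while 1e-3 converges quickly and 1e-4 converges slowly. That is also why a smaller learning rate usually goes hand in hand with a longer training time.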

I hope this helps. Good luck!

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.

