Hi Sung,

Using tf.gradients followed by apply_gradients is actually equivalent to using minimize. The difference is that when you call minimize on the loss you are applying the gradients after every timestep, which is effectively the same as setting the batch_size variable to 1.
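For reference, here is a minimal TF1-style sketch of the two forms side by side; the tiny quadratic loss and variable names below are placeholders rather than the notebook's actual network:

```python
import tensorflow as tf

# Stand-in graph: a small quadratic loss in place of the policy-gradient loss.
x = tf.placeholder(tf.float32, [None, 4])
W = tf.Variable(tf.random_normal([4, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, W)))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-2)
tvars = tf.trainable_variables()

# One-step form: minimize() computes and applies the gradients in a single op.
train_step = optimizer.minimize(loss)

# Two-step form: identical result when the gradients are applied immediately,
# i.e. with an effective batch_size of 1.
grads = tf.gradients(loss, tvars)
apply_step = optimizer.apply_gradients(zip(grads, tvars))
```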

In the case of CartPole we don't actually need to wait until 50 episodes are complete in order to get a robust policy gradient, so waiting that long just slows the training process down. Other tasks with shorter episodes, or tasks where each episode carries less information, may require accumulating more episodes before updating. I have adjusted the notebook to only update every 5 episodes instead of every 50 (see the sketch below).
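Here is a rough sketch of that accumulate-then-apply schedule, using the same kind of toy graph and fake episode data so it runs standalone; grad_buffer, gradient_holders, and update_frequency are illustrative names, not necessarily those used in the notebook:

```python
import numpy as np
import tensorflow as tf

# Stand-in graph, as before.
x = tf.placeholder(tf.float32, [None, 4])
W = tf.Variable(tf.random_normal([4, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, W)))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-2)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

# Placeholders let gradients accumulated in numpy be fed back in for the update.
gradient_holders = [tf.placeholder(tf.float32, shape=v.get_shape())
                    for v in tvars]
update_batch = optimizer.apply_gradients(zip(gradient_holders, tvars))

update_frequency = 5   # update every 5 episodes instead of every 50
total_episodes = 20

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Zeroed buffer with the same shapes as the trainable variables.
    grad_buffer = [g * 0 for g in sess.run(tvars)]

    for episode in range(1, total_episodes + 1):
        # Stand-in for running an episode and building the feed for the loss.
        episode_states = np.random.randn(10, 4).astype(np.float32)
        episode_grads = sess.run(grads, feed_dict={x: episode_states})

        # Accumulate this episode's gradients into the buffer.
        for ix, g in enumerate(episode_grads):
            grad_buffer[ix] += g

        # Apply the accumulated gradients every `update_frequency` episodes.
        if episode % update_frequency == 0:
            sess.run(update_batch,
                     feed_dict=dict(zip(gradient_holders, grad_buffer)))
            grad_buffer = [g * 0 for g in grad_buffer]
```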

