Using tf.gradients followed by apply_gradients is actually equivalent to calling minimize. The difference is that when you use minimize on the loss, you apply the gradients after every timestep, which is equivalent to setting the batch_size variable to 1.
In the case of CartPole we actually don’t need to wait until 50 episodes of gradients have accumulated to get a robust policy gradient, so waiting that long apparently just slows the training process down. Other tasks with shorter episodes, or tasks with noisier returns, may require accumulating more episodes before each update. I have adjusted the notebook to update every 5 episodes instead of every 50.
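The accumulate-then-apply pattern can be sketched without TensorFlow at all; below is a minimal NumPy mock-up of the bookkeeping, where names like update_frequency and grad_buffer are my assumptions, and the per-episode gradient is a toy stand-in for the real policy gradient:

```python
import numpy as np

np.random.seed(0)

w = np.zeros(4)                  # toy policy weights
lr = 0.05
update_frequency = 5             # apply accumulated gradients every 5 episodes
grad_buffer = np.zeros_like(w)   # plays the role of the tf.gradients buffer

def episode_gradient(w):
    # Stand-in for one episode's policy gradient: the gradient of a noisy
    # quadratic loss. Only the accumulation bookkeeping matters here.
    x = np.random.randn(4)
    return 2.0 * (w - x)

for episode in range(1, 101):
    grad_buffer += episode_gradient(w)
    if episode % update_frequency == 0:
        # Equivalent to feeding the buffer into apply_gradients.
        w -= lr * grad_buffer / update_frequency
        grad_buffer[:] = 0.0
```

Setting update_frequency = 1 makes this collapse into the minimize-style behaviour described above, where an update happens after every episode.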