Hi Sung,
Using tf.gradients and then apply_gradients, and using minimize, are actually equivalent. The difference is that you are applying the gradients after every timestep when using minimize on the loss. This is effectively the same as setting the batch_size variable to 1.
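For reference, here is a minimal TF1-style sketch of the two forms side by side; the tiny two-action network, the placeholder names, and the learning rate are just illustrative stand-ins, not the notebook's actual model:

```python
import tensorflow as tf

# Illustrative two-action policy network (stand-in, not the notebook's exact model).
state_in = tf.placeholder(tf.float32, [None, 4])
action = tf.placeholder(tf.int32, [None])
advantage = tf.placeholder(tf.float32, [None])

logits = tf.layers.dense(state_in, 2)
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=action, logits=logits)
loss = tf.reduce_mean(neg_log_prob * advantage)

optimizer = tf.train.AdamOptimizer(learning_rate=1e-2)

# Form 1: minimize() computes and applies the gradients in a single op,
# so every session.run(train_op) performs an update (a batch size of 1).
train_op = optimizer.minimize(loss)

# Form 2: compute the gradients explicitly and apply them in a separate op.
# The math is identical; you only gain control over *when* the update runs.
# (In practice you would pick one form or the other.)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)
apply_op = optimizer.apply_gradients(zip(grads, tvars))
```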
In the case of CartPole we actually don’t need to wait until 50 episodes are complete in order to get a robust policy gradient, and as such doing so apparently just slows the training process down. Other tasks with shorter episodes, or tasks where each episode provides less signal, may require collecting more episodes before updating. I have adjusted the notebook to only update every 5 episodes instead of every 50.
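If it helps, here is a rough sketch of how that batching could look, continuing from the snippet above; update_frequency, the gradient placeholders, and the random stand-in episode data are all assumptions for illustration, not the notebook's exact code:

```python
import numpy as np

update_frequency = 5  # apply the accumulated gradients every 5 episodes

# Feed the summed gradients back into the graph through placeholders.
grad_holders = [tf.placeholder(tf.float32, v.get_shape()) for v in tvars]
update_batch = optimizer.apply_gradients(zip(grad_holders, tvars))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    grad_buffer = [np.zeros(v.get_shape().as_list(), dtype=np.float32)
                   for v in tvars]
    for episode in range(1, 1001):
        # ...run one episode here; random stand-in data keeps the sketch runnable.
        ep_states = np.random.randn(10, 4).astype(np.float32)
        ep_actions = np.random.randint(0, 2, size=10).astype(np.int32)
        ep_adv = np.random.randn(10).astype(np.float32)

        # Compute this episode's gradients and add them to the buffer.
        ep_grads = sess.run(grads, feed_dict={state_in: ep_states,
                                              action: ep_actions,
                                              advantage: ep_adv})
        for buf, g in zip(grad_buffer, ep_grads):
            buf += g

        # Every update_frequency episodes, apply the batch and reset the buffer.
        if episode % update_frequency == 0:
            sess.run(update_batch,
                     feed_dict=dict(zip(grad_holders, grad_buffer)))
            for buf in grad_buffer:
                buf[:] = 0
```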