Hi Larry,

  1. input_y - probability determines the direction in which the gradient moves, depending on the action that was taken. Multiplying by advantage then scales the loss according to whether the reward was positive or negative.
  2. This line collects the gradients of the loss with respect to all the trainable variables. In this case there are only two (W1 and W2). We keep adding these gradients to a buffer, and once we have accumulated enough traces we send them back into the network to update the weights.
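
In case it helps, here is a rough NumPy sketch of both points. It is not the tutorial's TensorFlow code, only an illustration under my own assumptions: a two-layer network with a ReLU hidden layer and a sigmoid output, y = 1 meaning "action 1 was taken" (the tutorial's labelling may differ), random stand-in data instead of a real environment, and hypothetical names such as policy_gradients, grad_buffer, batch_size and lr.

```python
import numpy as np

def forward(x, W1, W2):
    h = np.maximum(0.0, x @ W1)            # hidden layer with ReLU
    p = 1.0 / (1.0 + np.exp(-(h @ W2)))    # probability of taking action 1
    return h, p

def loss(x, y, advantage, W1, W2):
    _, p = forward(x, W1, W2)
    loglik = y * np.log(p) + (1.0 - y) * np.log(1.0 - p)
    return -np.mean(loglik * advantage)    # advantage-weighted log-likelihood

def policy_gradients(x, y, advantage, W1, W2):
    """Gradients of the loss with respect to W1 and W2 (point 2)."""
    n = x.shape[0]
    h, p = forward(x, W1, W2)
    # Point 1: (y - p) gives the direction, advantage scales and signs it.
    dscore = -(y - p) * advantage / n
    gW2 = h.T @ dscore
    dh = dscore @ W2.T
    dh[h <= 0] = 0.0                       # backprop through ReLU
    gW1 = x.T @ dh
    return [gW1, gW2]

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 8))
W2 = rng.normal(scale=0.1, size=(8, 1))
grad_buffer = [np.zeros_like(W1), np.zeros_like(W2)]
batch_size, lr = 5, 0.01                   # assumed values, for illustration

for episode in range(1, 11):
    # Stand-in data for one trace: observations, actions, advantages.
    x = rng.normal(size=(20, 4))
    y = rng.integers(0, 2, size=(20, 1)).astype(float)
    advantage = rng.normal(size=(20, 1))

    # Collect the gradients and add them to the buffer.
    for buf, g in zip(grad_buffer, policy_gradients(x, y, advantage, W1, W2)):
        buf += g

    # Once enough traces are accumulated, update the weights and reset.
    if episode % batch_size == 0:
        W1 -= lr * grad_buffer[0]
        W2 -= lr * grad_buffer[1]
        for buf in grad_buffer:
            buf[...] = 0.0
```

With a positive advantage the update nudges the network toward the action it took, and with a negative advantage it nudges it away. That sign flip is all the multiplication by advantage is doing.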

Hope that makes sense!

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.
