Hi Larry,

This code ensures that a gradient can be computed for either of the two possible actions. Since the network produces a single output value (between 0 and 1) that encodes both actions, we need to mask the gradient according to which action was actually taken for a given output. This line makes that possible:

loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))

If input_y is 0, the first term vanishes and the expression reduces to tf.log(probability).

If input_y is 1 instead, the second term vanishes and it reduces to tf.log(1 - probability). Either way, the term inside the log is never negative, and the gradient can make full use of the probability associated with whichever action was taken.
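To make the masking concrete, here is a minimal sketch in plain Python that substitutes both possible values of input_y into the expression (the value of probability is made up for this illustration; in the real code it is the network's output):

probability = 0.7  # hypothetical network output for this illustration

for input_y in (0.0, 1.0):
    # the masking expression from the line above
    masked = input_y * (input_y - probability) + (1 - input_y) * (input_y + probability)
    print(input_y, masked)

# input_y = 0.0 -> masked = 0.7 (i.e. probability)
# input_y = 1.0 -> masked = 0.3 (i.e. 1 - probability)

Whichever value input_y takes, exactly one term survives, so the log-likelihood (and therefore the gradient) is tied to the action that was actually sampled.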

I hope this clears it up for you.
