Hi Larry,

You are right that this produces nan values. The code was essentially disregarding any actions taken which received the 0 label. Since our network's actions are determined by a single output value, a change in the parameters of the network that decreases the likelihood of one action increases the likelihood of the other, which is why it still worked regardless.

This code:

loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))

loss = -tf.reduce_mean(loglik * advantages)

is actually what should be done to properly mask the actions and take advantage of all the information available to the agent. I will update the IPython notebook with this new version.
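
For context, here is a minimal sketch of how those two lines fit into the surrounding TensorFlow 1.x graph, assuming that probability is the network's single sigmoid output, input_y is the label recorded for the action taken at each step, and advantages holds the discounted rewards. The layer sizes, placeholder shapes, and optimizer below are purely illustrative, not the exact code from the notebook:

import tensorflow as tf

# Tiny single-output policy network (sizes are illustrative)
observations = tf.placeholder(tf.float32, [None, 4], name="observations")
hidden = tf.layers.dense(observations, 10, activation=tf.nn.relu)
probability = tf.nn.sigmoid(tf.layers.dense(hidden, 1))

# Recorded action labels and discounted rewards for a batch of steps
input_y = tf.placeholder(tf.float32, [None, 1], name="input_y")
advantages = tf.placeholder(tf.float32, [None, 1], name="advantages")

# When input_y is 1 this reduces to log(1 - probability); when input_y is 0
# it reduces to log(probability), so the log-likelihood of whichever action
# was actually taken is the one that contributes to the gradient.
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))

# Weight each step by its advantage and negate, so that minimizing the loss
# increases the likelihood of actions that led to better-than-average returns.
loss = -tf.reduce_mean(loglik * advantages)
train_step = tf.train.AdamOptimizer(1e-2).minimize(loss)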

Thanks for bringing this to my attention!

