You are right that this produces NaN values. The code was essentially disregarding any actions that received the 0 label. Since our network's actions are determined by a single output value, a change in the network's parameters that decreases the likelihood of one action increases the likelihood of the other. That is why it still worked regardless.
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
loss = -tf.reduce_mean(loglik * advantages)
is actually what should be done to properly mask the actions and take advantage of all the information available to the agent. I will update the IPython notebook with this new version.
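For anyone following along, here is a quick NumPy sketch of why this expression works (variable names follow the notebook; treating probability as the network's output P(action = 1) and input_y as the fake label, 1 when action 0 was taken and 0 when action 1 was taken, is my reading of the code). It just checks that the masked expression reduces to the log-probability of whichever action was actually taken:

import numpy as np

probability = np.array([0.7, 0.7])  # network output: P(action = 1)
input_y = np.array([1.0, 0.0])      # fake label: 1 if action 0 was taken, 0 if action 1

# input_y = 1  ->  log(1 - probability) = log P(action 0)
# input_y = 0  ->  log(probability)     = log P(action 1)
loglik = np.log(input_y * (input_y - probability)
                + (1 - input_y) * (input_y + probability))

print(loglik)              # [log(0.3), log(0.7)]
print(np.log([0.3, 0.7]))  # matches: log-prob of the chosen action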
Thanks for bringing this to my attention!