This code ensures that a gradient can be computed for either of the two actions the network can produce. Since we are using a single output value (0–1) to represent both actions, we need to be able to mask the gradient based on which action was actually taken for a given output. This line makes that possible:
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
If input_y is 0, then the first term is eliminated, and it becomes tf.log(probability).
If input_y is 1 instead, then the second term is eliminated, and it becomes tf.log(1-probability). In both cases the term inside the log is never negative (it stays between 0 and 1 whenever probability does), and the probabilities for both actions get used.
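To check the two cases numerically, here is a minimal sketch in plain Python rather than TensorFlow. It assumes scalar inputs, uses math.log as a stand-in for tf.log, and the function name loglik and the value 0.8 are only illustrative choices, not part of the original code.

```python
import math

def loglik(input_y, probability):
    """Scalar version of the expression above, with math.log standing in for tf.log."""
    return math.log(input_y * (input_y - probability)
                    + (1 - input_y) * (input_y + probability))

p = 0.8  # hypothetical network output, strictly between 0 and 1

# input_y = 0: the first term is multiplied by 0, leaving log(p)
ll_y0 = loglik(0, p)

# input_y = 1: the second term is multiplied by 0, leaving log(1 - p)
ll_y1 = loglik(1, p)

print(ll_y0, math.log(p))
print(ll_y1, math.log(1 - p))
```

Each printed pair should match, which is the masking described above: only the term for the chosen label survives inside the log.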
I hope this clears it up for you.